Please find answers below -<br><br><div class="gmail_quote">On Mon, Mar 18, 2013 at 12:03 AM, nlxswig <span dir="ltr">&lt;<a href="mailto:nlxswig@126.com" target="_blank">nlxswig@126.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div style="line-height:1.7;font-size:14px;font-family:arial">Good questions, <div>Why are there no reply?</div><div><br><pre>At 2011-08-16 04:53:50,&quot;Patrick J. LoPresti&quot; &lt;<a href="mailto:lopresti@gmail.com" target="_blank">lopresti@gmail.com</a>&gt; wrote:

&gt;(FUSE developers:  Although my questions are specifically about

&gt;Gluster, I suspect most of the answers have more to do with FUSE, so I

&gt;figure this is on-topic for your list.  If I figured wrong, I

&gt;apologize.)

&gt;

&gt;I have done quite a bit of searching looking for answers to these

&gt;questions, and I just cannot find them...

&gt;

&gt;I think I understand how the Linux page cache works for an ordinary

&gt;local (non-FUSE) partition.  Specifically:

&gt;

&gt;1) When my application calls read(), it reads from the page cache.  If

&gt;the page(s) are not resident, the kernel puts my application to sleep

&gt;and gets busy reading them from disk.

&gt;

&gt;2) When my application calls write(), it writes to the page cache.

&gt;The kernel will -- eventually, when it feels like it -- flush those

&gt;dirty pages to disk.

&gt;

&gt;3) When my application calls mmap(), page cache pages are mapped into

&gt;my process&#39;s address space, allowing me to create a dirty page or read

&gt;a page by accessing memory.

&gt;

&gt;4) When the kernel reads a page, it might decide to read some other

&gt;pages, depending on the underlying block device&#39;s read-ahead

&gt;parameters.  I can control these via &quot;blockdev&quot;.  On the write side, I

&gt;can exercise some control with various VM parameters (dirty_ratio

&gt;etc).  I can also use calls like fsync() and posix_fadvise() to exert

&gt;some control over page cache management at the application level.

&gt;

&gt;

&gt;My question is pretty simple.  If you had to re-write the above four

&gt;points for a Gluster file system, what would they look like?  If it

&gt;matters, I am specifically interested in Gluster 3.2.2 on Suse Linux

&gt;Enterprise Server 11 SP1 (Linux 2.6.32.43 + whatever Suse does to

&gt;their kernels).

&gt;

&gt;Does Gluster use the page cache on read()?  On write()?  If so, how

&gt;does it ensure coherency between clients?  If not, how does mmap()

&gt;work (or does it not work)?</pre></div></div></blockquote><div>Neither Gluster nor any other FUSE filesystem uses the page cache directly. The filesystem serves read/write requests by reading from or writing to /dev/fuse, and the read/write implementations of the /dev/fuse &quot;device&quot; perform the copy. Where that copy goes to or comes from depends on whether the file is open with O_DIRECT and/or whether &quot;direct_io&quot; was enabled on the open file. For &quot;normal&quot; IO, the copy happens to/from the page cache. For O_DIRECT or &quot;direct_io&quot;, the page cache is bypassed completely, but care is taken, as a best-effort attempt, to flush any copy of the data already in the page cache. This gives a consistent &quot;view&quot; of the file between two applications (on the SAME mount point ONLY) that open the file with different modes (O_DIRECT and otherwise).</div>

<div><br></div><div>As long as all the mounts are using the &quot;direct_io&quot; mount option, coherency between mounts is really in the hands of the filesystem (like gluster), as FUSE is acting as a pure pass-through. On the other hand, if &quot;normal&quot; IO is happening, utilizing the page cache, then re-reads can always be served directly from the page cache without the filesystem (like gluster) even knowing that a process issued a read() request. The filesystem could, however, use the reverse invalidation calls to invalidate the pages in all mounts if a write is happening from elsewhere (the co-ordination needs to happen in the filesystem; FUSE only provides the invalidation primitives). Gluster does NOT do this yet.</div>

<div><br></div><div>There is also a flag in the open() FUSE operation to indicate whether or not to keep the page cache of the file. By default gluster asks FUSE to purge the page cache in open(). This gives you close-to-open consistency: if an open() from a process is performed strictly after a close() from any other process, even on a different machine, then you are guaranteed to see all the content written by that other process. This is very similar to the consistency offered by the NFS (v3) client in Linux.</div>
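<div><br></div><div>To illustrate the close-to-open rule, here is a minimal Python sketch. The helper name read_fresh is made up for this example, and it assumes the default behaviour described above, where open() purges the page cache for the file. A reader that wants another client&#39;s finished writes re-opens the file on every check instead of keeping a long-lived descriptor:</div>

```python
import os


def read_fresh(path):
    """Open, read the whole file, close. Every call performs a new open()."""
    fd = os.open(path, os.O_RDONLY)
    try:
        chunks = []
        while True:
            chunk = os.read(fd, 65536)
            if not chunk:  # empty bytes object means end of file
                break
            chunks.append(chunk)
        return b"".join(chunks)
    finally:
        os.close(fd)
```

<div>This only helps if the writer has already called close(); data written by a process that still has the file open is not covered by close-to-open consistency.</div>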

<div><br></div><div>In summary, by default you get close-to-open consistency with gluster, but if you require strict consistency between two applications on different clients which have opened the file at the same time, then you need BOTH a and b:</div>

<div><br></div><div>a. Either the app opens with O_DIRECT, or glusterfs is mounted with --enable-direct-io, to keep the page cache out of the way of consistency.</div><div><br></div><div>b. Either the app opens with O_DSYNC (or O_SYNC), or write-behind is disabled in the gluster volume configuration.</div>
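<div><br></div><div>For the application-side half of a and b, a rough Python sketch follows. The function name write_strict and the 4096-byte block size are assumptions for illustration; the alignment O_DIRECT actually requires depends on the filesystem. O_DSYNC is always set (point b), and O_DIRECT is added on request (point a), using an anonymous mmap as a page-aligned buffer:</div>

```python
import mmap
import os

BLOCK = 4096  # assumed alignment; the real O_DIRECT requirement is filesystem-specific


def write_strict(path, payload, direct=False):
    """Write payload with O_DSYNC, optionally also with O_DIRECT.

    Returns the number of bytes passed to write(), which includes padding.
    """
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_DSYNC
    if direct:
        flags |= os.O_DIRECT  # bypass the client page cache
    # Round the length up to whole blocks, as O_DIRECT expects.
    blocks = max(1, (len(payload) + BLOCK - 1) // BLOCK)
    # Anonymous mappings are page-aligned and zero-filled.
    with mmap.mmap(-1, blocks * BLOCK) as buf:
        buf[:len(payload)] = payload
        fd = os.open(path, flags, 0o644)
        try:
            written = os.write(fd, buf)
            os.ftruncate(fd, len(payload))  # drop the zero padding
            return written
        finally:
            os.close(fd)
```

<div>Note that direct=True can fail with EINVAL on filesystems that do not support O_DIRECT, so the default here only sets O_DSYNC.</div>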

<div><br></div><div>W.R.T. mmap(), getting strict consistency between the &quot;shared&quot; mapped regions of two applications on different machines is pretty much impossible. The filesystem/kernel learns of a write only the first time an app attempts to &quot;write&quot; to the mapped region, via a page fault; once that first write has marked the page dirty, nobody is notified when the app modifies other parts of that page. There are four combinations: private vs shared, and mmap on a &quot;direct_io&quot; file vs a &quot;normal&quot; file.</div>

<div><br></div><div>shared and direct_io - not even supported (fails with ENODEV)</div><div>shared and normal - unless you do msync(), data is not flushed to the server (i.e., other client mounts cannot get it when they ask for that region&#39;s data).</div>

<div>private (either direct_io or normal) - works, but in gluster you are not guaranteed to see modifications by another client in a region that is already mapped and has been accessed once (this can be sort of made to work if the proper reverse invalidation wiring is done in the distributed filesystem)</div>
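<div><br></div><div>For the &quot;shared and normal&quot; case, a small Python sketch of the msync() step is below. The helper name update_shared is hypothetical; Python&#39;s mmap.flush() is used because it calls msync() on the mapping. On a &quot;direct_io&quot; file the mmap() call itself would be expected to fail with ENODEV, as noted above:</div>

```python
import mmap
import os


def update_shared(path, offset, data):
    """Modify an existing, non-empty file through a MAP_SHARED mapping,
    then msync() so the dirty pages are pushed out to the filesystem."""
    fd = os.open(path, os.O_RDWR)
    try:
        size = os.fstat(fd).st_size
        with mmap.mmap(fd, size, flags=mmap.MAP_SHARED,
                       prot=mmap.PROT_READ | mmap.PROT_WRITE) as m:
            # Slice assignment must stay inside the existing mapping.
            m[offset:offset + len(data)] = data
            m.flush()  # wraps msync(); without it the data may sit in dirty pages
    finally:
        os.close(fd)
```

<div>Even with the flush, another client that has already mapped and touched the same region is not guaranteed to see the change, for the reasons given above.</div>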

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style="line-height:1.7;font-size:14px;font-family:arial"><div><pre>&gt;What read-ahead will the kernel use?  Does posix_fadvise(...,

&gt;POSIX_FADV_WILLNEED) have any effect on a Gluster file system?</pre></div></div></blockquote><div>Read-ahead (and posix_fadvise) kicks in only if reads are going through the page cache. So you should not be mounting with direct_io enabled (--enable-direct-io, as in point a above) or opening with O_DIRECT.</div>
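<div><br></div><div>As a sketch of what that looks like from the application, here is a short Python example (read_with_hint is a made-up name). It opens the file normally, without O_DIRECT, issues POSIX_FADV_WILLNEED for the whole file, and then reads it. Whether the hint actually speeds anything up on a given mount is not something this sketch can show:</div>

```python
import os


def read_with_hint(path):
    """Read a whole file after asking the kernel to prefetch it."""
    fd = os.open(path, os.O_RDONLY)  # no O_DIRECT: reads go through the page cache
    try:
        size = os.fstat(fd).st_size
        # Advisory only; the kernel is free to ignore it.
        os.posix_fadvise(fd, 0, size, os.POSIX_FADV_WILLNEED)
        chunks = []
        while True:
            chunk = os.read(fd, 1 << 16)
            if not chunk:  # end of file
                break
            chunks.append(chunk)
        return b"".join(chunks)
    finally:
        os.close(fd)
```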

<div><br></div><div>Thanks,</div><div>Avati</div></div>