<div dir="ltr"><br><br><div class="gmail_quote">---------- Forwarded message ----------<br>From: <b class="gmail_sendername">Anand Babu Periasamy</b> <span dir="ltr"><<a href="mailto:ab@gluster.com">ab@gluster.com</a>></span><br>
Date: 2009/3/9<br>Subject: Re: [Gluster-users] How caches are working on AFR?<br>To: Stas Oskin &lt;<a href="mailto:stas.oskin@gmail.com">stas.oskin@gmail.com</a>&gt;<br>Cc: Gluster General Discussion List &lt;<a href="mailto:gluster-users@gluster.org">gluster-users@gluster.org</a>&gt;<br>
<br><br>Stas Oskin wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
Hi.<br>
<br>
2009/3/8 Anand Babu Periasamy <<a href="mailto:ab@gluster.com" target="_blank">ab@gluster.com</a> <mailto:<a href="mailto:ab@gluster.com" target="_blank">ab@gluster.com</a>>><div class="im"><br>
<br>
Replicate in 2.0 performs atomic writes by default. This means,<br>
writes will return control<br>
back to application only after both the volumes (or more) are<br>
successfully written.<br>
<br>
<br>
OK, so without the write-behind cache, the application continues only once the data has been physically written to all AFR disks?<br>
</div></blockquote>
<br>
Yes. Precisely speaking, control returns when the data has been handed over to the underlying<br>
disk filesystem, not necessarily when it has been physically written; it may be written directly or journaled.<br>
<br>
Every parallel write operation is a transaction that has to complete<br>
atomically on all volumes. If a volume is down, incomplete files<br>
are marked as pending, and the write does not block on that volume.<div class="im"><br>
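The pending-marking behavior described above can be sketched in plain C. This is a conceptual illustration only, not GlusterFS source; the function and array names are invented for this example:<br>

```c
#include <fcntl.h>
#include <unistd.h>

/* Conceptual sketch only (not GlusterFS code): apply one write to
 * every replica. A replica whose write fails is marked pending so it
 * can be healed later; the call never blocks on a failed replica. */
int replicate_write(int fds[], int n, const void *buf, size_t len,
                    int pending[])
{
    int ok = 0;
    for (int i = 0; i < n; i++) {
        if (write(fds[i], buf, len) == (ssize_t)len) {
            ok++;               /* replica written successfully */
        } else {
            pending[i] = 1;     /* mark replica for later self-heal */
        }
    }
    return ok;                  /* number of replicas written */
}
```

The write "succeeds" as long as at least one replica is written; the pending flags stand in for the extended-attribute changelog GlusterFS keeps for self-heal.<br>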
<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
To mask the performance penalty of atomic writes, you should load<br>
write-behind on top of<br>
it. Write-behind returns control as soon as it receives the write<br>
call from the<br>
application, but it continues to write in background. Write-behind<br>
also performs<br>
block-aggregation. Smaller writes are aggregated into fewer large<br>
writes.<br>
<br>
POSIX says an application should verify the return status of the close<br>
system call to ensure all<br>
writes were successfully written. If there are any pending writes, the<br>
close call will block to<br>
ensure all the data is completely written. There is an option in<br>
write-behind to perform even the<br>
close in the background. It is unsafe and turned off by default.<br>
<br>
<br>
So I need to call close() on each file (which should be done anyway for correct operation) in order to ensure everything was written to disk?<br>
<br>
And if the close() fails - this means some of the data was lost?<br>
<br>
</blockquote></div>
Yes, correct. This behavior is expected even for regular disk file systems.<br>
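A minimal C sketch of this advice (the helper name write_all_and_close is invented here): the loop handles short writes, and the return value of close() is checked because errors such as ENOSPC or EIO on NFS-like filesystems may only surface there:<br>

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: write a buffer and report any error that the
 * kernel (or a network filesystem such as GlusterFS or NFS) deferred
 * until close(). Returns 0 on success, -1 on any failure. */
int write_all_and_close(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {            /* immediate write error */
            close(fd);
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }

    /* ENOSPC / EIO / EDQUOT from buffered writes may only show up here. */
    if (close(fd) < 0)
        return -1;
    return 0;
}
```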
<br>
If you want every write to be physically written to disk, you should<br>
either open with O_DIRECT, flush explicitly, or use the appropriate file<br>
system APIs for synchronous writes. GlusterFS respects all these flags/APIs and turns off<br>
write-behind and any such optimizations appropriately.<div class="im"><br>
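A sketch of the flush option, assuming a plain POSIX environment (the helper name write_durably is illustrative): write normally, then call fsync() so the data reaches stable storage before the function returns. Opening with O_SYNC instead would make every write() synchronous with no explicit flush needed:<br>

```c
#include <fcntl.h>
#include <unistd.h>

/* Illustrative helper (not a GlusterFS API): write a buffer and force
 * it to stable storage with fsync() before returning success. */
int write_durably(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    ssize_t n = write(fd, buf, len);
    /* fsync() blocks until the data is on disk (or the error is known). */
    if (n < 0 || (size_t)n != len || fsync(fd) < 0) {
        close(fd);
        return -1;
    }
    return close(fd);
}
```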
<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
Applications that expect every write to succeed issue synchronous<br>
writes.<br>
<br>
<br>
By this, do you mean that write-behind should not be used, only the default atomic-write behavior?<br>
</blockquote>
<br></div>
No, write-behind is good. Even NFS and regular disk file systems behave<br>
exactly like this; see the excerpt from the GNU glibc reference manual below.<br>
<br>
In GlusterFS, all functionality, including the basic performance<br>
features, is implemented as modules. You will get awful performance<br>
without these modules loaded; the bare core is only guaranteed to<br>
be functionally correct.<br>
<br>
--------[ FROM GLIBC DOC ]--------------------------------<br>
for write (..)<br>
Once `write' returns, the data is enqueued to be written and can be<br>
read back right away, but it is not necessarily written out to<br>
permanent storage immediately. You can use `fsync' when you need<br>
to be sure your data has been permanently stored before<br>
continuing. (It is more efficient for the system to batch up<br>
consecutive writes and do them all at once when convenient.<br>
Normally they will always be written to disk within a minute or<br>
less.) Modern systems provide another function `fdatasync' which<br>
guarantees integrity only for the file data and is therefore<br>
faster. You can use the `O_FSYNC' open mode to make `write' always<br>
store the data to disk before returning;<br>
<br>
for close (..)<br>
`ENOSPC'<br>
`EIO'<br>
`EDQUOT'<br>
When the file is accessed by NFS, these errors from `write'<br>
can sometimes not be detected until `close'. *Note I/O<br>
Primitives::, for details on their meaning.<br>
----------------------------------------------------------<br><font color="#888888">
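To illustrate the fsync/fdatasync distinction from the excerpt, here is a small assumed helper (the name append_record is invented for this sketch): fdatasync() flushes only the file data, not metadata such as timestamps, so it can be cheaper than fsync() while still making the record durable:<br>

```c
#include <fcntl.h>
#include <unistd.h>

/* Illustrative helper: append a record and flush only the file data.
 * fdatasync() skips non-essential metadata updates, so it is often
 * faster than fsync() with the same data-integrity guarantee. */
int append_record(int fd, const void *rec, size_t len)
{
    ssize_t n = write(fd, rec, len);
    if (n < 0 || (size_t)n != len)
        return -1;
    return fdatasync(fd);
}
```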
<br>
-- <br></font><div><div></div><div class="h5">
Anand Babu Periasamy<br>
GPG Key ID: 0x62E15A31<br>
Blog [<a href="http://ab.multics.org" target="_blank">http://ab.multics.org</a>]<br>
GlusterFS [<a href="http://www.gluster.org" target="_blank">http://www.gluster.org</a>]<br>
The GNU Operating System [<a href="http://www.gnu.org" target="_blank">http://www.gnu.org</a>]<br>
</div></div></div></div>