<div dir="ltr"><br><br><div class="gmail_quote">---------- Forwarded message ----------<br>From: <b class="gmail_sendername">Anand Babu Periasamy</b> <span dir="ltr"><<a href="mailto:ab@gluster.com">ab@gluster.com</a>></span><br>
Date: 2009/3/9<br>Subject: Re: [Gluster-users] How caches are working on AFR?<br>To: Stas Oskin &lt;<a href="mailto:stas.oskin@gmail.com">stas.oskin@gmail.com</a>&gt;<br>Cc: Gluster General Discussion List &lt;<a href="mailto:gluster-users@gluster.org">gluster-users@gluster.org</a>&gt;<br>
<br><br>Stas Oskin wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
Hi.<br>
<br>
2009/3/8 Anand Babu Periasamy <<a href="mailto:ab@gluster.com" target="_blank">ab@gluster.com</a> <mailto:<a href="mailto:ab@gluster.com" target="_blank">ab@gluster.com</a>>><div class="im"><br>
<br>
Replicate in 2.0 performs atomic writes by default. This means,<br>
writes will return control<br>
back to application only after both the volumes (or more) are<br>
successfully written.<br>
<br>
<br>
OK, so without the write-behind cache, the application continues only once the data has been physically written to all AFR disks?<br>
</div></blockquote>
<br>
Yes. Precisely speaking, control returns when the data has been handed over to the underlying<br>
disk filesystem, not necessarily when it has been physically written; it may be written directly or journaled.<br>
<br>
Every parallel write operation is a transaction that has to complete<br>
atomically on all volumes. If a volume is down, incomplete files<br>
are marked as pending, and the write does not block on that volume.<div class="im"><br>
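The pending-marking behavior described above can be sketched in plain C. This is a conceptual illustration only, not GlusterFS source; the function and array names are invented for this example:<br>

```c
#include <fcntl.h>
#include <unistd.h>

/* Conceptual sketch only (not GlusterFS code): apply one write to
 * every replica. A replica whose write fails is marked pending so it
 * can be healed later; the call never blocks on a failed replica. */
int replicate_write(int fds[], int n, const void *buf, size_t len,
                    int pending[])
{
    int ok = 0;
    for (int i = 0; i < n; i++) {
        if (write(fds[i], buf, len) == (ssize_t)len) {
            ok++;               /* replica written successfully */
        } else {
            pending[i] = 1;     /* mark replica for later self-heal */
        }
    }
    return ok;                  /* number of replicas written */
}
```

The write "succeeds" as long as at least one replica is written; the pending flags stand in for the extended-attribute changelog GlusterFS keeps for self-heal.<br>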
<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
To mask the performance penalty of atomic writes, you should load<br>
write-behind on top of<br>
it. Write-behind returns control as soon as it receives the write<br>
call from the<br>
application, but it continues to write in background. Write-behind<br>
also performs<br>
block-aggregation. Smaller writes are aggregated into fewer large<br>
writes.<br>
<br>
POSIX says an application should verify the return status of the close<br>
system call to ensure all<br>
writes were successfully written. If there are any pending writes, the<br>
close call will block to<br>
ensure all the data is completely written. There is an option in<br>
write-behind to perform even the<br>
close in the background. It is unsafe and turned off by default.<br>
<br>
<br>
So I need to call close() on each file (which should be done anyway for correct operation) in order to ensure everything was written to disk?<br>
<br>
And if the close() fails - this means some of the data was lost?<br>
<br>
</blockquote></div>
Yes, correct. This behavior is expected even for regular disk file systems.<br>
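A minimal C sketch of this advice (the helper name write_all_and_close is invented here): the loop handles short writes, and the return value of close() is checked because errors such as ENOSPC or EIO on NFS-like filesystems may only surface there:<br>

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: write a buffer and report any error that the
 * kernel (or a network filesystem such as GlusterFS or NFS) deferred
 * until close(). Returns 0 on success, -1 on any failure. */
int write_all_and_close(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {            /* immediate write error */
            close(fd);
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }

    /* ENOSPC / EIO / EDQUOT from buffered writes may only show up here. */
    if (close(fd) < 0)
        return -1;
    return 0;
}
```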
<br>
If you want every write to be physically written to disk, you should<br>
either open with O_DIRECT, flush explicitly, or use the appropriate file<br>
system APIs for synchronous writes. GlusterFS respects all these flags/APIs and turns off<br>
write-behind and any such optimizations appropriately.<div class="im"><br>
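A sketch of the flush option, assuming a plain POSIX environment (the helper name write_durably is illustrative): write normally, then call fsync() so the data reaches stable storage before the function returns. Opening with O_SYNC instead would make every write() synchronous with no explicit flush needed:<br>

```c
#include <fcntl.h>
#include <unistd.h>

/* Illustrative helper (not a GlusterFS API): write a buffer and force
 * it to stable storage with fsync() before returning success. */
int write_durably(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    ssize_t n = write(fd, buf, len);
    /* fsync() blocks until the data is on disk (or the error is known). */
    if (n < 0 || (size_t)n != len || fsync(fd) < 0) {
        close(fd);
        return -1;
    }
    return close(fd);
}
```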
<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
Applications that expect every write to succeed issue synchronous<br>
writes.<br>
<br>
<br>
By this, do you mean that write-behind should not be used, only the default atomic-write behavior?<br>
</blockquote>
<br></div>
No, write-behind is good. Even NFS and regular disk file systems behave<br>
exactly like this; see the excerpt from the GNU glibc reference manual below.<br>
<br>
In GlusterFS, all functionality, including the basic performance<br>
features, is implemented as modules. You will get awful performance<br>
without these modules loaded; the bare core is only guaranteed to<br>
be functionally correct.<br>
<br>
--------[ FROM GLIBC DOC ]--------------------------------<br>
for write (..)<br>
Once `write' returns, the data is enqueued to be written and can be<br>
read back right away, but it is not necessarily written out to<br>
permanent storage immediately. You can use `fsync' when you need<br>
to be sure your data has been permanently stored before<br>
continuing. (It is more efficient for the system to batch up<br>
consecutive writes and do them all at once when convenient.<br>
Normally they will always be written to disk within a minute or<br>
less.) Modern systems provide another function `fdatasync' which<br>
guarantees integrity only for the file data and is therefore<br>
faster. You can use the `O_FSYNC' open mode to make `write' always<br>
store the data to disk before returning;<br>
<br>
for close (..)<br>
`ENOSPC'<br>
`EIO'<br>
`EDQUOT'<br>
When the file is accessed by NFS, these errors from `write'<br>
can sometimes not be detected until `close'. *Note I/O<br>
Primitives::, for details on their meaning.<br>
----------------------------------------------------------<br><font color="#888888">
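To illustrate the fsync/fdatasync distinction from the excerpt, here is a small assumed helper (the name append_record is invented for this sketch): fdatasync() flushes only the file data, not metadata such as timestamps, so it can be cheaper than fsync() while still making the record durable:<br>

```c
#include <fcntl.h>
#include <unistd.h>

/* Illustrative helper: append a record and flush only the file data.
 * fdatasync() skips non-essential metadata updates, so it is often
 * faster than fsync() with the same data-integrity guarantee. */
int append_record(int fd, const void *rec, size_t len)
{
    ssize_t n = write(fd, rec, len);
    if (n < 0 || (size_t)n != len)
        return -1;
    return fdatasync(fd);
}
```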
<br>
-- <br></font><div><div></div><div class="h5">
Anand Babu Periasamy<br>
GPG Key ID: 0x62E15A31<br>
Blog [<a href="http://ab.multics.org" target="_blank">http://ab.multics.org</a>]<br>
GlusterFS [<a href="http://www.gluster.org" target="_blank">http://www.gluster.org</a>]<br>
The GNU Operating System [<a href="http://www.gnu.org" target="_blank">http://www.gnu.org</a>]<br>
</div></div></div></div>