<html><body><div style="font-family: lucida console,sans-serif; font-size: 12pt; color: #000000"><div><br></div><div><br></div><hr id="zwchr"><blockquote style="border-left:2px solid #1010FF;margin-left:5px;padding-left:5px;color:#000;font-weight:normal;font-style:normal;text-decoration:none;font-family:Helvetica,Arial,sans-serif;font-size:12pt;"><b>From: </b>"shishir gowda" <gowda.shishir@gmail.com><br><b>To: </b>"Paul Cuzner" <pcuzner@redhat.com><br><b>Cc: </b>gluster-devel@nongnu.org<br><b>Sent: </b>Tuesday, 28 January, 2014 6:13:13 PM<br><b>Subject: </b>Re: [Gluster-devel] bit rot support for glusterfs design draft v0.1<br><div><br></div>On 28 January 2014 03:48, Paul Cuzner <pcuzner@redhat.com> wrote:<br>><br>><br>> ________________________________<br>><br>> From: "shishir gowda" <gowda.shishir@gmail.com><br>> To: gluster-devel@nongnu.org<br>> Sent: Monday, 27 January, 2014 6:30:13 PM<br>> Subject: [Gluster-devel] bit rot support for glusterfs design draft v0.1<br>><br>><br>> Hi All,<br>><br>> Please find the updated bit-rot design for glusterfs volumes.<br>><br>> Thanks to Vijay Bellur for his valuable inputs on the design.<br>><br>> Phase 1: File-level bit rot detection<br>><br>> The initial approach is to achieve bit rot detection at the file level,<br>> where a checksum is computed for the complete file, and checked during<br>> access.<br>><br>> A single daemon (say BitD) per node will be responsible for all the<br>> bricks of the node. This daemon will be registered with the gluster<br>> management daemon, and any graph changes<br>> (add-brick/remove-brick/replace-brick/stop bit-rot) will be handled<br>> accordingly. 
This BitD will register with the changelog xlator of all the<br>> bricks for the node, and process changes from them.<br>><br>><br>> Doesn't having a single daemon for all bricks, instead of a per-brick 'bitd',<br>> introduce the potential of a performance bottleneck?<br>><br>><br>Most of the current gluster-related daemons work in this mode<br>(nfs/selfheal/quota). Additionally, if we introduce a 1:1 mapping<br>between a brick and bitd, then managing these daemons would bring in<br>their own overheads.</blockquote><div>OK - but bitd is in the "I/O path", isn't it, and it's compute-intensive - which is why I'm concerned about scale and the potential impact on latency.</div><div><br></div><div>If NFS access is going to be problematic - why not exclude it from interactive checksums and resort to batch mode, so the admin can keep performance up and schedule a scrub at a time when applications/users are not affected?<br></div><div><br></div><blockquote style="border-left:2px solid #1010FF;margin-left:5px;padding-left:5px;color:#000;font-weight:normal;font-style:normal;text-decoration:none;font-family:Helvetica,Arial,sans-serif;font-size:12pt;"><div><br></div>> The changelog xlator would give the list of files (in terms of gfid)<br>> which have changed during a defined interval. Checksums would have to<br>> be computed for these based on either the fd close() call for non-NFS<br>> access, or every write for anonymous-fd access (NFS). The computed<br>> checksum, in addition to the timestamp of the computation, would be<br>> saved as an extended attribute (xattr) of the file. By using the change-log<br>> xlator, we would avoid periodic scans of the bricks to identify<br>> the files whose checksums need to be updated.<br>></blockquote><div>With the checksum update being based on close() - what happens in environments like ovirt or openstack? 
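To make the phase-1 flow concrete, here is a minimal sketch of what BitD might do for a changed file: compute a whole-file checksum and store it, with a timestamp, the way an xattr would hold it. The xattr key names and function names are hypothetical (the draft does not fix them), and a dict stands in for real extended attributes:

```python
import hashlib
import time

# Simulated xattr store. On a real brick these values would be written to
# the file's extended attributes (e.g. via os.setxattr on Linux); a dict
# keeps this sketch portable. The key names below are hypothetical.
xattrs = {}

def compute_checksum(data: bytes) -> str:
    # Phase 1: a single checksum over the complete file contents.
    return hashlib.sha256(data).hexdigest()

def on_file_changed(path: str, data: bytes) -> None:
    # What BitD might do when the changelog reports a modified file:
    # recompute the whole-file checksum and store it with a timestamp.
    xattrs[path] = {
        "trusted.bit-rot.checksum": compute_checksum(data),
        "trusted.bit-rot.timestamp": time.time(),
    }

on_file_changed("/brick/vol/file1", b"some file contents")
stored = xattrs["/brick/vol/file1"]["trusted.bit-rot.checksum"]
print(stored == compute_checksum(b"some file contents"))  # True
```

The stored timestamp is what makes the "recently modified, checksum not yet computed" grace window discussed later in the thread checkable at access time.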
<br></div><div><br></div><div>It would be great to understand which use cases for gluster the bit-rot plan addresses, and by consequence identify which use cases, if any, would be problematic/impractical.<br></div><div><br></div><blockquote style="border-left:2px solid #1010FF;margin-left:5px;padding-left:5px;color:#000;font-weight:normal;font-style:normal;text-decoration:none;font-family:Helvetica,Arial,sans-serif;font-size:12pt;">><br>> Using the changelog is a great idea, but I'd also see a requirement for an<br>> admin-initiated full scan, at least when bringing existing volumes under bitd<br>> control.<br>><br><div><br></div>Sorry, failed to mention it. Once bit-rot is turned on, a full scan of<br>each brick is started.<br><div><br></div>> Also, what's the flow if the xattr is unreadable due to bit rot? In btrfs,<br>> metadata is typically mirrored.<br>><br><div><br></div>Currently, if the xattr is unreadable, we would treat it as a failure from<br>the brick end. If the volume<br>is replicated, then another brick might be able to serve the file.<br><div><br></div>><br>> Upon access (open for non-anonymous-fd calls, every read for<br>> anonymous-fd calls) from any client, the bit rot detection xlator<br>> loaded on top of the bricks would recompute the checksum of the file,<br>> and allow the calls to proceed if they match, or fail them if they<br>> mismatch. This introduces extra workload for NFS workloads, and for<br>> large files, which require a read of the complete file to recompute the<br>> checksum (we try to solve this in phase 2).<br>><br>> every read..? That sounds like such an overhead, admins would just turn it<br>> off.<br>><br><div><br></div>NFS does not send open calls, and sends read calls directly on<br>anonymous fds. On such occasions, for anonymous-fd reads, we will have<br>to compute the checksum for every read. 
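The access-time check being discussed - recompute the whole-file checksum on open (FUSE) or on every read (NFS anonymous fds), compare with the stored value, and fail the call on mismatch - can be sketched like this. The function name is illustrative, not actual xlator code:

```python
import hashlib

def verify_on_access(data: bytes, stored_checksum: str) -> bool:
    # The bit-rot xlator would recompute the whole-file checksum and
    # compare it against the one stored at checksum-update time; a
    # mismatch means the call is failed (and, in a replicated volume,
    # another replica can be tried).
    return hashlib.sha256(data).hexdigest() == stored_checksum

contents = b"original contents"
stored = hashlib.sha256(contents).hexdigest()
print(verify_on_access(contents, stored))              # True: call proceeds
print(verify_on_access(b"origXnal contents", stored))  # False: fail the call
```

For NFS this whole-file recompute happens per read, which is exactly the overhead being objected to in this thread.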
This per-read cost is one of the reasons why, in phase 2,<br>we want block-level checksums, to prevent a read of the complete file for<br>any read.<br><div><br></div>> I assume failing a read due to checksum inconsistency in a replicated volume<br>> would trigger one of the other replicas to be used, so the issue is<br>> transparent to the end user/application.<br>><br>><br>That is the expected behaviour.<br><div><br></div>><br>> Since a data write happens first, followed by a delayed checksum<br>> compute, there is a time frame where we might have data updated, but<br>> checksums yet to be computed. We should allow access to such files<br>> if the file timestamp (mtime) has changed and is within a defined<br>> range of the current time.<br>><br>> Additionally, we could/should have the ability to switch off checksum<br>> computation from the glusterfs perspective, if the underlying FS<br>> exposes/implements bit-rot detection (btrfs).<br>><br>> +1 Why re-invent the wheel!<br>><br>><br>> Phase 2: Block-level (user-space/defined) bit rot detection and correction.<br>><br>> The eventual aim is to be able to heal/correct bit rot in files. To<br>> achieve this, checksums are computed at a finer-grained level, such as a<br>> block (size limited by the bit rot algorithm), so that we not only<br>> detect bit rot, but also have the ability to restore it.<br>> Additionally, for large files, checking the checksums at block level<br>> is more efficient than recomputing the checksum of the whole<br>> file for an access.<br>><br>><br>> In this phase, we could move the checksum computation to the<br>> xlator loaded on top of the posix translator at each brick. With<br>> every write, we could compute the checksum, and store the checksum and<br>> continue with the write, or vice versa. Every access would also be able<br>> to read/compute the checksum of the requested block, check it against the<br>> saved checksum of the block, and act accordingly. 
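As a toy sketch of the phase-2 idea - per-block checksums to localise detection, plus some redundancy to repair a damaged block - here is single-block XOR parity over tiny blocks. The block size and the XOR-parity scheme are illustrative stand-ins only; the actual ECC/FEC algorithm and block size are not decided in the draft:

```python
import hashlib

BLOCK_SIZE = 4  # tiny for illustration; real blocks would be far larger

def split_blocks(data: bytes):
    return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]

def block_checksums(data: bytes):
    # One checksum per block: a read touching one block verifies only
    # that block instead of recomputing the whole file.
    return [hashlib.sha256(b).hexdigest() for b in split_blocks(data)]

def find_corrupt_blocks(data: bytes, sums):
    return [i for i, b in enumerate(split_blocks(data))
            if hashlib.sha256(b).hexdigest() != sums[i]]

def xor_blocks(blocks):
    # XOR parity over equal-sized blocks: the simplest possible
    # stand-in for a real ECC/FEC code.
    out = bytearray(BLOCK_SIZE)
    for b in blocks:
        for i, byte in enumerate(b):
            out[i] ^= byte
    return bytes(out)

original = b"abcdefghijkl"
sums = block_checksums(original)
parity = xor_blocks(split_blocks(original))

corrupted = b"abcdXYghijkl"                 # block 1 damaged
bad = find_corrupt_blocks(corrupted, sums)  # checksums pinpoint the block
# Rebuild the damaged block from parity plus the surviving blocks.
survivors = [b for i, b in enumerate(split_blocks(corrupted)) if i != bad[0]]
repaired = xor_blocks([parity] + survivors)
print(bad, repaired == b"efgh")  # [1] True
```

This also shows why the checksum store becomes the scaling problem: one checksum (plus parity) per block, per file, quickly outgrows what xattrs can hold.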
Computing checksums inline this way would take away<br>> the dependency on the external BitD and the changelog xlator.<br>><br>> Additionally, using an error-correcting code (ECC) or<br>> forward-error-correction (FEC) algorithm would enable us to correct a<br>> few bits in a block which have gone corrupt. And recomputation of the<br>> complete file's checksum is eliminated, as we are dealing with blocks<br>> of defined size.<br>><br>> We require the ability to store these fine-grained checksums<br>> efficiently, and extended attributes would not scale for this<br>> implementation. Either a custom backing store or a DB would be<br>> preferable in this instance.<br>><br>> So if there is a per-'block' checksum, won't our capacity overheads increase<br>> to store the extra metadata, on top of our existing replication/raid<br>> overhead?<br>><br><div><br></div>That is true. But we need to address bit rot at the brick level.<br><div><br></div>> Where does Xavi's disperse volume fit into this? Would an erasure-coded<br>> volume lend itself more easily to those use cases (cold data) where bit rot is<br>> a key consideration?<br>><br>> If so, would a simpler bit rot strategy for gluster be<br>> 1) disperse volume<br>> 2) btrfs checksums + plumbing to trigger heal when scrub detects a problem<br>><br>> I like simple :)<br>><br>><br>We haven't explored the impact of disperse or any other cluster xlator. The<br>idea here is that, irrespective of the clustering mechanism, bit rot is<br>handled at the brick level, independently. So, if the volume<br>type changes in the future, bit rot detection can still exist.</blockquote><div>Not sure what you mean by clustering mechanisms? 
My comment about disperse is that this is a volume type that already uses erasure coding and includes healing, so by design it lends itself to the cold-data use cases that are more open to silent corruption.<br></div><div><br></div><blockquote style="border-left:2px solid #1010FF;margin-left:5px;padding-left:5px;color:#000;font-weight:normal;font-style:normal;text-decoration:none;font-family:Helvetica,Arial,sans-serif;font-size:12pt;"><div><br></div>A btrfs bypass will be provided, but it won't be the only backend. So,<br>gluster has to do its own bit rot detection and correction when<br>possible.</blockquote><div>Agree.<br></div><div><br></div><div><br></div><blockquote style="border-left:2px solid #1010FF;margin-left:5px;padding-left:5px;color:#000;font-weight:normal;font-style:normal;text-decoration:none;font-family:Helvetica,Arial,sans-serif;font-size:12pt;"><div><br></div>><br>> Please feel free to comment/critique.<br>><br>> With regards,<br>> Shishir<br>><br>> _______________________________________________<br>> Gluster-devel mailing list<br>> Gluster-devel@nongnu.org<br>> https://lists.nongnu.org/mailman/listinfo/gluster-devel<br>><br>><br></blockquote><div><br></div></div></body></html>