<html><body><div style="font-family: lucida console,sans-serif; font-size: 12pt; color: #000000"><div><br></div><div><br></div><hr id="zwchr"><blockquote style="border-left:2px solid #1010FF;margin-left:5px;padding-left:5px;color:#000;font-weight:normal;font-style:normal;text-decoration:none;font-family:Helvetica,Arial,sans-serif;font-size:12pt;"><b>From: </b>"shishir gowda" <gowda.shishir@gmail.com><br><b>To: </b>"Paul Cuzner" <pcuzner@redhat.com><br><b>Cc: </b>gluster-devel@nongnu.org<br><b>Sent: </b>Tuesday, 28 January, 2014 6:13:13 PM<br><b>Subject: </b>Re: [Gluster-devel] bit rot support for glusterfs design draft v0.1<br><div><br></div>On 28 January 2014 03:48, Paul Cuzner <pcuzner@redhat.com> wrote:<br>><br>><br>> ________________________________<br>><br>> From: "shishir gowda" <gowda.shishir@gmail.com><br>> To: gluster-devel@nongnu.org<br>> Sent: Monday, 27 January, 2014 6:30:13 PM<br>> Subject: [Gluster-devel] bit rot support for glusterfs design draft v0.1<br>><br>><br>> Hi All,<br>><br>> Please find the updated bit-rot design for glusterfs volumes.<br>><br>> Thanks to Vijay Bellur for his valuable inputs on the design.<br>><br>> Phase 1: File-level bit rot detection<br>><br>> The initial approach is to achieve bit rot detection at the file level,<br>> where a checksum is computed for the complete file, and checked during<br>> access.<br>><br>> A single daemon (say BitD) per node will be responsible for all the<br>> bricks of the node. This daemon will be registered with the gluster<br>> management daemon, and any graph changes<br>> (add-brick/remove-brick/replace-brick/stop bit-rot) will be handled<br>> accordingly. 
This BitD will register with the changelog xlator of all the<br>> bricks for the node, and process changes from them.<br>><br>><br>> Doesn't having a single daemon for all bricks, instead of a per-brick 'bitd',<br>> introduce the potential of a performance bottleneck?<br>><br>><br>Most of the current gluster-related daemons work in this mode<br>(nfs/selfheal/quota). Additionally, if we introduce a 1:1 mapping<br>between a brick and bitd, then managing these daemons would bring in<br>their own overheads.</blockquote><div>OK - but bitd is in the "I/O path", isn't it, and it's compute-intensive - which is why I'm concerned about scale and the potential impact on latency.</div><div><br></div><div>If NFS access is going to be problematic - why not exclude it from interactive checksums and resort to batch mode, so the admin can keep performance up and schedule a scrub at a time when applications/users are not affected?<br></div><div><br></div><blockquote style="border-left:2px solid #1010FF;margin-left:5px;padding-left:5px;color:#000;font-weight:normal;font-style:normal;text-decoration:none;font-family:Helvetica,Arial,sans-serif;font-size:12pt;"><div><br></div>> The changelog xlator would give the list of files (in terms of gfid)<br>> which have changed during a defined interval. Checksums would have to<br>> be computed for these based on either the fd close() call for non-NFS<br>> access, or every write for anonymous-fd access (NFS). The computed<br>> checksum, in addition to the timestamp of the computation, would be<br>> saved as an extended attribute (xattr) of the file. By using the change-log<br>> xlator, we would avoid periodic scans of the bricks to identify<br>> the files whose checksums need to be updated.<br>></blockquote><div>With the checksum update being based on close() - what happens in environments like ovirt or openstack? 
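To make the phase-1 flow concrete, here is a minimal sketch of what BitD might do for a changed file: compute a whole-file checksum and store it, with a timestamp, the way an xattr would hold it. The xattr key names and function names are hypothetical (the draft does not fix them), and a dict stands in for real extended attributes:

```python
import hashlib
import time

# Simulated xattr store. On a real brick these values would be written to
# the file's extended attributes (e.g. via os.setxattr on Linux); a dict
# keeps this sketch portable. The key names below are hypothetical.
xattrs = {}

def compute_checksum(data: bytes) -> str:
    # Phase 1: a single checksum over the complete file contents.
    return hashlib.sha256(data).hexdigest()

def on_file_changed(path: str, data: bytes) -> None:
    # What BitD might do when the changelog reports a modified file:
    # recompute the whole-file checksum and store it with a timestamp.
    xattrs[path] = {
        "trusted.bit-rot.checksum": compute_checksum(data),
        "trusted.bit-rot.timestamp": time.time(),
    }

on_file_changed("/brick/vol/file1", b"some file contents")
stored = xattrs["/brick/vol/file1"]["trusted.bit-rot.checksum"]
print(stored == compute_checksum(b"some file contents"))  # True
```

The stored timestamp is what makes the "recently modified, checksum not yet computed" grace window discussed later in the thread checkable at access time.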
<br></div><div><br></div><div>It would be great to understand which use cases for gluster the bit-rot plan addresses, and by consequence identify which use cases, if any, would be problematic/impractical.<br></div><div><br></div><blockquote style="border-left:2px solid #1010FF;margin-left:5px;padding-left:5px;color:#000;font-weight:normal;font-style:normal;text-decoration:none;font-family:Helvetica,Arial,sans-serif;font-size:12pt;">><br>> Using the changelog is a great idea, but I'd also see a requirement for an<br>> admin-initiated full scan, at least when bringing existing volumes under bitd<br>> control.<br>><br><div><br></div>Sorry, failed to mention it. Once bit-rot is turned on, a full scan of<br>each brick is started.<br><div><br></div>> Also, what's the flow if the xattr is unreadable due to bit rot? In btrfs,<br>> metadata is typically mirrored.<br>><br><div><br></div>Currently, if the xattr is unreadable, we would treat it as a failure from<br>the brick end. If the volume<br>is replicated, then another brick might be able to serve the file.<br><div><br></div>><br>> Upon access (open for non-anonymous-fd calls, every read for<br>> anonymous-fd calls) from any client, the bit rot detection xlator<br>> loaded on top of the bricks would recompute the checksum of the file,<br>> and allow the calls to proceed if they match, or fail them if they<br>> mismatch. This introduces extra workload for NFS workloads, and for<br>> large files, which require a read of the complete file to recompute the<br>> checksum (we try to solve this in phase 2).<br>><br>> every read..? That sounds like such an overhead, admins would just turn it<br>> off.<br>><br><div><br></div>NFS does not send open calls, and sends read calls directly on<br>anonymous fds. On such occasions, for anonymous-fd reads, we will have<br>to compute the checksum for every read. 
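The access-time check being discussed - recompute the whole-file checksum on open (FUSE) or on every read (NFS anonymous fds), compare with the stored value, and fail the call on mismatch - can be sketched like this. The function name is illustrative, not actual xlator code:

```python
import hashlib

def verify_on_access(data: bytes, stored_checksum: str) -> bool:
    # The bit-rot xlator would recompute the whole-file checksum and
    # compare it against the one stored at checksum-update time; a
    # mismatch means the call is failed (and, in a replicated volume,
    # another replica can be tried).
    return hashlib.sha256(data).hexdigest() == stored_checksum

contents = b"original contents"
stored = hashlib.sha256(contents).hexdigest()
print(verify_on_access(contents, stored))              # True: call proceeds
print(verify_on_access(b"origXnal contents", stored))  # False: fail the call
```

For NFS this whole-file recompute happens per read, which is exactly the overhead being objected to in this thread.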
This per-read cost is one of the reasons why, in phase 2,<br>we want block-level checksums, to prevent a read of the complete file for<br>any read.<br><div><br></div>> I assume failing a read due to checksum inconsistency in a replicated volume<br>> would trigger one of the other replicas to be used, so the issue is<br>> transparent to the end user/application.<br>><br>><br>That is the expected behaviour.<br><div><br></div>><br>> Since a data write happens first, followed by a delayed checksum<br>> compute, there is a time frame where we might have data updated, but<br>> checksums yet to be computed. We should allow access to such files<br>> if the file timestamp (mtime) has changed and is within a defined<br>> range of the current time.<br>><br>> Additionally, we could/should have the ability to switch off checksum<br>> computation from the glusterfs perspective, if the underlying FS<br>> exposes/implements bit-rot detection (btrfs).<br>><br>> +1 Why re-invent the wheel!<br>><br>><br>> Phase 2: Block-level (user-space/defined) bit rot detection and correction.<br>><br>> The eventual aim is to be able to heal/correct bit rot in files. To<br>> achieve this, checksums are computed at a finer-grained level, such as a<br>> block (size limited by the bit rot algorithm), so that we not only<br>> detect bit rot, but also have the ability to restore it.<br>> Additionally, for large files, checking the checksums at block level<br>> is more efficient than recomputing the checksum of the whole<br>> file for an access.<br>><br>><br>> In this phase, we could move the checksum computation to the<br>> xlator loaded on top of the posix translator at each brick. With<br>> every write, we could compute the checksum, and store the checksum and<br>> continue with the write, or vice versa. Every access would also be able<br>> to read/compute the checksum of the requested block, check it against the<br>> saved checksum of the block, and act accordingly. 
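As a toy sketch of the phase-2 idea - per-block checksums to localise detection, plus some redundancy to repair a damaged block - here is single-block XOR parity over tiny blocks. The block size and the XOR-parity scheme are illustrative stand-ins only; the actual ECC/FEC algorithm and block size are not decided in the draft:

```python
import hashlib

BLOCK_SIZE = 4  # tiny for illustration; real blocks would be far larger

def split_blocks(data: bytes):
    return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]

def block_checksums(data: bytes):
    # One checksum per block: a read touching one block verifies only
    # that block instead of recomputing the whole file.
    return [hashlib.sha256(b).hexdigest() for b in split_blocks(data)]

def find_corrupt_blocks(data: bytes, sums):
    return [i for i, b in enumerate(split_blocks(data))
            if hashlib.sha256(b).hexdigest() != sums[i]]

def xor_blocks(blocks):
    # XOR parity over equal-sized blocks: the simplest possible
    # stand-in for a real ECC/FEC code.
    out = bytearray(BLOCK_SIZE)
    for b in blocks:
        for i, byte in enumerate(b):
            out[i] ^= byte
    return bytes(out)

original = b"abcdefghijkl"
sums = block_checksums(original)
parity = xor_blocks(split_blocks(original))

corrupted = b"abcdXYghijkl"                 # block 1 damaged
bad = find_corrupt_blocks(corrupted, sums)  # checksums pinpoint the block
# Rebuild the damaged block from parity plus the surviving blocks.
survivors = [b for i, b in enumerate(split_blocks(corrupted)) if i != bad[0]]
repaired = xor_blocks([parity] + survivors)
print(bad, repaired == b"efgh")  # [1] True
```

This also shows why the checksum store becomes the scaling problem: one checksum (plus parity) per block, per file, quickly outgrows what xattrs can hold.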
Computing checksums inline this way would take away<br>> the dependency on the external BitD and the changelog xlator.<br>><br>> Additionally, using an error-correcting code (ECC) or<br>> forward-error-correction (FEC) algorithm would enable us to correct a<br>> few bits in a block which have gone corrupt. And recomputation of the<br>> complete file's checksum is eliminated, as we are dealing with blocks<br>> of defined size.<br>><br>> We require the ability to store these fine-grained checksums<br>> efficiently, and extended attributes would not scale for this<br>> implementation. Either a custom backing store or a DB would be<br>> preferable in this instance.<br>><br>> So if there is a per-'block' checksum, won't our capacity overheads increase<br>> to store the extra metadata, on top of our existing replication/raid<br>> overhead?<br>><br><div><br></div>That is true. But we need to address bit rot at the brick level.<br><div><br></div>> Where does Xavi's disperse volume fit into this? Would an erasure-coded<br>> volume lend itself more easily to those use cases (cold data) where bit rot is<br>> a key consideration?<br>><br>> If so, would a simpler bit rot strategy for gluster be<br>> 1) disperse volume<br>> 2) btrfs checksums + plumbing to trigger heal when scrub detects a problem<br>><br>> I like simple :)<br>><br>><br>We haven't explored the impact of disperse or any other cluster xlator. The<br>idea here is that, irrespective of the clustering mechanism, bit rot is<br>handled at the brick level, independently. So, if the volume<br>type changes in the future, bit rot detection can still exist.</blockquote><div>Not sure what you mean by clustering mechanisms? 
My comment about disperse is that this is a volume type that already uses erasure coding and includes healing, so by design it lends itself to the cold-data use cases that are more open to silent corruption.<br></div><div><br></div><blockquote style="border-left:2px solid #1010FF;margin-left:5px;padding-left:5px;color:#000;font-weight:normal;font-style:normal;text-decoration:none;font-family:Helvetica,Arial,sans-serif;font-size:12pt;"><div><br></div>A btrfs bypass will be provided, but it won't be the only backend. So,<br>gluster has to do its own bit rot detection and correction when<br>possible.</blockquote><div>Agree.<br></div><div><br></div><div><br></div><blockquote style="border-left:2px solid #1010FF;margin-left:5px;padding-left:5px;color:#000;font-weight:normal;font-style:normal;text-decoration:none;font-family:Helvetica,Arial,sans-serif;font-size:12pt;"><div><br></div>><br>> Please feel free to comment/critique.<br>><br>> With regards,<br>> Shishir<br>><br>> _______________________________________________<br>> Gluster-devel mailing list<br>> Gluster-devel@nongnu.org<br>> https://lists.nongnu.org/mailman/listinfo/gluster-devel<br>><br>><br></blockquote><div><br></div></div></body></html>