<div dir="ltr"><div><div><div><div><div><div><div><div>We can use bitrot to provide a &#39;health&#39; status for gluster volumes.  <br>Hence I would like to propose (from an upstream/community perspective) the notion of a &#39;health&#39; status (as part of gluster volume info) that can derive its value from:<br></div><br>1) Bitrot<br></div>    Whether any files are corrupted that bitrot has yet to repair, and/or whether the admin needs to do some manual operation to repair the corrupted files (for cases where we only detect, not correct)<br><br></div>2) Brick status<br></div><div>    Whether bricks are online or offline<br></div><div><br></div>3) AFR status<br></div>    Whether all copies are in sync or not<br><br></div>This, I believe, is along the lines of what Ceph does today (health status: OK, WARN, ERROR)<br></div><div>The health status derivation can be pluggable, so that more components can be added in the future and queried for the composite health status of the gluster volume.<br></div><div><br></div>In all of the above cases, as long as the gluster volume can reliably serve data, its volume status will be Started/Available, but its health status can be &#39;degraded&#39; or &#39;warn&#39;<br></div><div><br>This has many uses:<br><br></div><div>1) It indicates to the admin that something is amiss, so the admin can check:<br></div><div>bitrot / scrub status<br></div><div>brick status<br></div><div>AFR status<br><br></div><div>and take the necessary action<br><br></div><div>2) It helps management applications (OpenStack, for example) make an intelligent decision based on the health status (whether or not to pick this gluster volume for a given create-volume operation), so it acts as a coarse-level filter<br><br></div><div>3) In general, it gives the user an idea of the health of the volume, which is different from its availability status (whether or not the volume can serve data)<br></div><div>For example: if we have a pure DHT volume and bitrot detects silent file corruption 
(and we are not auto-correcting), reporting the Gluster volume status as available/started isn&#39;t entirely correct!<br></div><div><br></div><div>thanx,<br>deepak<br><br></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Dec 5, 2014 at 11:31 PM, Venky Shankar <span dir="ltr">&lt;<a href="mailto:yknev.shankar@gmail.com" target="_blank">yknev.shankar@gmail.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">On Fri, Nov 28, 2014 at 10:00 PM, Vijay Bellur &lt;<a href="mailto:vbellur@redhat.com">vbellur@redhat.com</a>&gt; wrote:<br>

&gt; On 11/28/2014 08:30 AM, Venky Shankar wrote:<br>

&gt;&gt;<br>

&gt;&gt; [snip]<br>

&gt;&gt;&gt;<br>

&gt;&gt;&gt;<br>

&gt;&gt;&gt; 1. Can the bitd be one per node like self-heal-daemon and other &quot;global&quot;<br>

&gt;&gt;&gt; services? I worry about creating 2 * N processes for N bricks in a node.<br>

&gt;&gt;&gt; Maybe we can consider having one thread per volume/brick etc. in a single<br>

&gt;&gt;&gt; bitd process to make it perform better.<br>

&gt;&gt;<br>

&gt;&gt;<br>

&gt;&gt; Absolutely.<br>

&gt;&gt; There would be one bitrot daemon per node, per volume.<br>

&gt;&gt;<br>

&gt;<br>

&gt; Do you foresee any problems in having one daemon per node for all volumes?<br>

<br>

</span>Not technically :). Probably that&#39;s a nice thing to do.<br>
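A minimal sketch of that layout, assuming a single per-node daemon that runs one worker thread (each with its own work queue) per brick instead of one process per brick. The class name BitrotDaemon, the brick names and the queue-based hand-off are hypothetical illustrations, not existing Gluster code, and the per-object work is only a placeholder.<br>

```python
import queue
import threading


class BitrotDaemon:
    """Hypothetical per-node bitrot daemon: one process, one worker thread per brick."""

    def __init__(self, bricks):
        self.bricks = list(bricks)
        self.queues = {brick: queue.Queue() for brick in self.bricks}
        # Each list is appended to only by its own brick's worker thread.
        self.processed = {brick: [] for brick in self.bricks}
        self.threads = []

    def _worker(self, brick):
        q = self.queues[brick]
        while True:
            gfid = q.get()
            if gfid is None:  # sentinel: stop this brick's worker
                break
            # Placeholder for the real per-object work (signing or scrubbing).
            self.processed[brick].append(gfid)

    def start(self):
        for brick in self.bricks:
            t = threading.Thread(target=self._worker, args=(brick,), daemon=True)
            t.start()
            self.threads.append(t)

    def submit(self, brick, gfid):
        self.queues[brick].put(gfid)

    def stop(self):
        for brick in self.bricks:
            self.queues[brick].put(None)
        for t in self.threads:
            t.join()


# Usage: two bricks from different volumes served by the same daemon process.
daemon = BitrotDaemon(["vol1:/bricks/b1", "vol2:/bricks/b2"])
daemon.start()
daemon.submit("vol1:/bricks/b1", "gfid-1")
daemon.submit("vol1:/bricks/b1", "gfid-2")
daemon.submit("vol2:/bricks/b2", "gfid-3")
daemon.stop()
```

With this shape, a node with N bricks runs one daemon process and N threads, rather than one extra process per brick.<br>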

<div><div class="h5"><br>

&gt;<br>

&gt;&gt;<br>

&gt;&gt;&gt;<br>

&gt;&gt;&gt; 3. I think the algorithm for checksum computation can vary within the<br>

&gt;&gt;&gt; volume. I see a reference to &quot;Hashtype is persisted along side the<br>

&gt;&gt;&gt; checksum<br>

&gt;&gt;&gt; and can be tuned per file type.&quot; Is this correct? If so:<br>

&gt;&gt;&gt;<br>

&gt;&gt;&gt; a) How will the policy be exposed to the user?<br>

&gt;&gt;<br>

&gt;&gt;<br>

&gt;&gt; The bitrot daemon would have a configuration file that can be managed<br>

&gt;&gt; via the Gluster CLI. Tuning hash types could be based on file types or<br>

&gt;&gt; file name patterns (regexes) [which is a bit tricky as bitrot would<br>

&gt;&gt; work on GFIDs rather than filenames, but this can be solved by a level<br>

&gt;&gt; of indirection].<br>
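A rough sketch of how such a policy could be evaluated, assuming ordered (regex, hashtype) rules where the first match wins, a configurable default hashtype, and a GFID-to-path lookup as the level of indirection. HashPolicy, Signature, sign() and the gfid_to_path mapping are hypothetical names; hashtypes are any names accepted by Python&#39;s hashlib.new().<br>

```python
import hashlib
import re
from dataclasses import dataclass


@dataclass
class Signature:
    hashtype: str  # persisted alongside the checksum
    checksum: str


class HashPolicy:
    """Hypothetical policy: ordered (regex, hashtype) rules; first match wins."""

    def __init__(self, rules, default="sha256"):
        self.rules = [(re.compile(pattern), hashtype) for pattern, hashtype in rules]
        self.default = default  # configurable default hash algorithm

    def hashtype_for(self, path):
        for pattern, hashtype in self.rules:
            if pattern.fullmatch(path):
                return hashtype
        return self.default


def sign(gfid, data, policy, gfid_to_path):
    # Bitrot works on GFIDs, so resolve a path first (the level of indirection).
    path = gfid_to_path[gfid]
    hashtype = policy.hashtype_for(path)
    return Signature(hashtype, hashlib.new(hashtype, data).hexdigest())


policy = HashPolicy([(r".*\.iso", "sha512"), (r".*\.log", "sha1")])
gfid_to_path = {"gfid-a": "/images/disk.iso", "gfid-b": "/docs/readme.txt"}
sig_a = sign("gfid-a", b"abc", policy, gfid_to_path)  # matches the .iso rule
sig_b = sign("gfid-b", b"abc", policy, gfid_to_path)  # falls back to the default
```

Storing the hashtype in each signature is what lets different files in the same volume use different algorithms.<br>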

&gt;&gt;<br>

&gt;&gt;&gt;<br>

&gt;&gt;&gt; b) It would be nice to have the algorithm for computing checksums be<br>

&gt;&gt;&gt; pluggable. Are there any thoughts on pluggability?<br>

&gt;&gt;<br>

&gt;&gt;<br>

&gt;&gt; Do you mean that the default hash algorithm should be configurable? If yes, then<br>

&gt;&gt; that&#39;s planned.<br>

&gt;<br>

&gt;<br>

&gt; Sounds good.<br>

&gt;<br>

&gt;&gt;<br>

&gt;&gt;&gt;<br>

&gt;&gt;&gt; c) What are the steps involved in changing the hashtype/algorithm for a<br>

&gt;&gt;&gt; file?<br>

&gt;&gt;<br>

&gt;&gt;<br>

&gt;&gt; Policy changes for file {types, patterns} are lazy, i.e., they take<br>

&gt;&gt; effect during the next recompute. For objects that are never modified<br>

&gt;&gt; (after initial checksum compute), scrubbing can recompute the checksum<br>

&gt;&gt; using the new hash _after_ verifying the integrity of a file with the<br>

&gt;&gt; old hash.<br>
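A minimal sketch of the lazy policy change during scrubbing, assuming each signature records the hashtype it was computed with: the object is verified with the stored (old) hashtype first, and only if it is intact is it re-signed with the hashtype the current policy asks for. Signature, digest(), scrub() and CorruptionError are hypothetical names.<br>

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Signature:
    hashtype: str  # hashtype the checksum was computed with
    checksum: str


class CorruptionError(Exception):
    pass


def digest(hashtype, data):
    return hashlib.new(hashtype, data).hexdigest()


def scrub(data, stored, policy_hashtype):
    """Verify with the stored (old) hashtype, then lazily re-sign if the policy changed."""
    if digest(stored.hashtype, data) != stored.checksum:
        raise CorruptionError("checksum mismatch: possible bitrot")
    if stored.hashtype != policy_hashtype:
        # Only reached once integrity under the old hash has been confirmed.
        return Signature(policy_hashtype, digest(policy_hashtype, data))
    return stored


old = Signature("sha1", digest("sha1", b"payload"))
new = scrub(b"payload", old, "sha256")  # policy moved from sha1 to sha256

detected = False
try:
    scrub(b"pAyload", old, "sha256")  # rotted data must not be re-signed
except CorruptionError:
    detected = True
```

Verifying before re-signing matters: re-signing first would silently bless data that had already rotted.<br>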

&gt;<br>

&gt;<br>

&gt;&gt;<br>

&gt;&gt;&gt;<br>

&gt;&gt;&gt; 4. Is the fop on which change detection gets triggered configurable?<br>

&gt;&gt;<br>

&gt;&gt;<br>

&gt;&gt; As of now all data modification fops trigger checksum calculation.<br>

&gt;&gt;<br>

&gt;<br>

&gt; I wish I had been clearer on this in my OP. Is the fop on which checksum<br>

&gt; verification/bitrot detection happens configurable? The feature page talks<br>

&gt; about &quot;open&quot; being a trigger point for this. Users might want to trigger<br>

&gt; detection on a &quot;read&quot; operation and not on open. It would be good to provide<br>

&gt; this flexibility.<br>

<br>

</div></div>Ah, OK. As of now it&#39;s mostly open() and read(). Inline verification<br>

would be &quot;off&quot; by default, for obvious reasons (the performance cost of verifying on access).<br>
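A minimal sketch of configurable verification trigger points, assuming per-volume flags for open() and read() that both default to off. VerifyConfig, StoredObject, do_open(), do_read() and BitrotError are hypothetical names, and SHA-256 stands in for whatever hashtype the object was signed with.<br>

```python
import hashlib
from dataclasses import dataclass


@dataclass
class VerifyConfig:
    """Hypothetical per-volume trigger points; inline verification is off by default."""
    verify_on_open: bool = False
    verify_on_read: bool = False


class BitrotError(Exception):
    pass


class StoredObject:
    def __init__(self, data):
        self.data = data
        self.checksum = hashlib.sha256(data).hexdigest()  # signed at write time


def _verify(obj):
    if hashlib.sha256(obj.data).hexdigest() != obj.checksum:
        raise BitrotError("inline verification failed")


def do_open(obj, cfg):
    if cfg.verify_on_open:
        _verify(obj)
    return obj


def do_read(obj, cfg):
    if cfg.verify_on_read:
        _verify(obj)
    return obj.data


obj = StoredObject(b"original")
obj.data = b"c0rrupted"  # simulate silent corruption after signing

# With the defaults, nothing is checked inline, so the corruption goes unnoticed.
data_default = do_read(do_open(obj, VerifyConfig()), VerifyConfig())

caught = False
try:
    do_read(obj, VerifyConfig(verify_on_read=True))
except BitrotError:
    caught = True
```

Keeping open and read as separate flags gives users the choice Vijay asks for: verify on read without paying the cost on every open, or the other way round.<br>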

<div><div class="h5"><br>

&gt;<br>

&gt;&gt;<br>

&gt;&gt;&gt;<br>

&gt;&gt;&gt; 6. Any thoughts on integrating the bitrot repair framework with<br>

&gt;&gt;&gt; self-heal?<br>

&gt;&gt;<br>

&gt;&gt;<br>

&gt;&gt; There are some thoughts on integration with self-heal daemon and EC.<br>

&gt;&gt; I&#39;m coming up with a doc which covers those [reason for delay in<br>

&gt;&gt; replying to your questions ;)]. Expect the doc on gluster-devel@<br>

&gt;&gt; soon.<br>

&gt;<br>

&gt;<br>

&gt; Will look forward to this.<br>

&gt;<br>

&gt;&gt;<br>

&gt;&gt;&gt;<br>

&gt;&gt;&gt; 7. How does detection figure out that a lazy update is still pending and<br>

&gt;&gt;&gt; not raise a false positive?<br>

&gt;&gt;<br>

&gt;&gt;<br>

&gt;&gt; That&#39;s one of the things that Rachana and I discussed yesterday.<br>

&gt;&gt; Should scrubbing *wait* while checksum updating is still in progress, or<br>

&gt;&gt; is it expected that scrubbing happens when there are no active I/O<br>

&gt;&gt; operations on the volume (both of which imply that the bitrot daemon needs<br>

&gt;&gt; to know when it&#39;s done its job).<br>

&gt;&gt;<br>

&gt;&gt; If scrubbing and checksum updating run in parallel, then there needs<br>

&gt;&gt; to be a way to synchronize those operations. Maybe compute the checksum<br>

&gt;&gt; with a priority provided by the scrub process as a hint (that still leaves<br>

&gt;&gt; a small window for rot, though)?<br>

&gt;&gt;<br>

&gt;&gt; Any thoughts?<br>

&gt;<br>

&gt;<br>

&gt; Waiting for no active I/O in the volume might be a difficult condition to<br>

&gt; reach in some deployments.<br>

&gt;<br>

&gt; Some form of waiting is necessary to prevent false positives. One<br>

&gt; possibility might be to mark an object as dirty until the checksum update is<br>

&gt; complete. Verification/scrub can then be skipped for dirty objects.<br>

<br>

</div></div>Makes sense. Thanks!<br>
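A minimal sketch of the dirty-marking suggestion, assuming an in-memory dirty flag (a real implementation would presumably need to persist it, e.g. alongside the checksum): data-modifying fops set the flag, the deferred signer clears it after recomputing the checksum, and the scrubber skips dirty objects instead of reporting a false positive. StoredObject, write(), sign() and scrub() are hypothetical names.<br>

```python
import hashlib


class StoredObject:
    def __init__(self, data):
        self.data = bytearray(data)
        self.checksum = hashlib.sha256(self.data).hexdigest()
        self.dirty = False  # True while a checksum update is pending


def write(obj, offset, buf):
    """Data-modifying fop: mark the object dirty, then modify it."""
    obj.dirty = True
    obj.data[offset:offset + len(buf)] = buf


def sign(obj):
    """Deferred checksum update (e.g. by the bitrot daemon); clears the dirty mark."""
    obj.checksum = hashlib.sha256(obj.data).hexdigest()
    obj.dirty = False


def scrub(objects):
    """Return (corrupted, skipped); dirty objects are skipped, not flagged."""
    corrupted, skipped = [], []
    for name, obj in objects.items():
        if obj.dirty:
            skipped.append(name)
            continue
        if hashlib.sha256(obj.data).hexdigest() != obj.checksum:
            corrupted.append(name)
    return corrupted, skipped


objects = {"a": StoredObject(b"hello"), "b": StoredObject(b"world")}
write(objects["a"], 0, b"J")      # legitimate update; signing still pending
objects["b"].data[0:1] = b"W"     # silent corruption, bypassing write()
result_before = scrub(objects)    # "a" skipped, "b" flagged
sign(objects["a"])
result_after = scrub(objects)     # "a" now verified clean, "b" still flagged
```

The trade-off is the one noted above: an object that rots while it is dirty is not caught until after it has been re-signed.<br>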

<div class="HOEnZb"><div class="h5"><br>

&gt;<br>

&gt; -Vijay<br>

&gt;<br>

_______________________________________________<br>

Gluster-devel mailing list<br>

<a href="mailto:Gluster-devel@gluster.org">Gluster-devel@gluster.org</a><br>

<a href="http://supercolony.gluster.org/mailman/listinfo/gluster-devel" target="_blank">http://supercolony.gluster.org/mailman/listinfo/gluster-devel</a><br>

</div></div></blockquote></div><br></div>
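Going back to the &#39;health&#39; status proposal at the top of this message, here is a minimal sketch of a pluggable composite health check, assuming each provider (bitrot, brick status, AFR) reports OK, WARN or ERROR and the volume&#39;s health is the worst of them. The provider classes, the input keys (corrupted_files, bricks_offline, bricks_total, pending_heals) and the thresholds (for instance, ERROR only when every brick is offline) are hypothetical and only illustrate the plug-in structure.<br>

```python
from enum import IntEnum


class Health(IntEnum):
    OK = 0
    WARN = 1
    ERROR = 2


class HealthProvider:
    name = "base"

    def check(self, vol):
        raise NotImplementedError


class BitrotProvider(HealthProvider):
    name = "bitrot"

    def check(self, vol):
        # Detected-but-unrepaired corruption degrades health.
        return Health.WARN if vol.get("corrupted_files", 0) else Health.OK


class BrickProvider(HealthProvider):
    name = "bricks"

    def check(self, vol):
        offline = vol.get("bricks_offline", 0)
        if offline == 0:
            return Health.OK
        # Hypothetical threshold: ERROR only if no brick is left online.
        return Health.ERROR if offline >= vol.get("bricks_total", 1) else Health.WARN


class AfrProvider(HealthProvider):
    name = "afr"

    def check(self, vol):
        # Copies not in sync (pending heals) degrades health.
        return Health.WARN if vol.get("pending_heals", 0) else Health.OK


PROVIDERS = [BitrotProvider(), BrickProvider(), AfrProvider()]


def volume_health(vol, providers=PROVIDERS):
    """Composite health is the worst status reported by any registered provider."""
    details = {p.name: p.check(vol) for p in providers}
    overall = max(details.values(), default=Health.OK)
    return overall, details


vol = {"bricks_total": 4, "bricks_offline": 1, "corrupted_files": 0, "pending_heals": 3}
overall, details = volume_health(vol)
```

New components can be plugged in by appending another provider to the list passed to volume_health(), without touching the existing ones. The volume&#39;s Started/Available status stays separate from this value.<br>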