<div dir="ltr"><div>Thanks Venky. I also wanted to put forward how this can help in an OpenStack/cloud environment,<br>where we have 2 distinct admin roles (virt/OpenStack admin and storage admin):<br><br><tt><br>

1) Gluster volume &#39;health&#39; should display the health status (OK, WARN, FATAL/ERROR, etc.)<br>

2) Based on that, the admin can query &#39;health status&#39; to learn which component (AFR, quorum, geo-rep, etc.) is causing the health status to be other than OK<br>

3) Based on that component, run the right gluster command (scrub status, AFR status, split-brain status? etc.) to dig deeper into where the problem lies<br>

  <br>
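The three steps above can be sketched roughly as follows. This is only a minimal illustration, not an existing gluster interface: the names Health, Report, CHECKERS, DRILL_DOWN, volume_health, health_status and next_commands are all hypothetical, the checkers are stubs with made-up results, and the follow-up labels are placeholders rather than real CLI syntax.<br>

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Callable, Dict, List


class Health(IntEnum):
    """Ordered so that max() picks the worst status."""
    OK = 0
    WARN = 1
    ERROR = 2


@dataclass
class Report:
    component: str  # e.g. "bitrot", "brick", "afr"
    level: Health
    detail: str


# Hypothetical plug-in registry: component name -> checker for a volume.
CHECKERS: Dict[str, Callable[[str], Report]] = {}

# Placeholder follow-up labels taken from the steps above, not real CLI syntax.
DRILL_DOWN = {"bitrot": "scrub status", "afr": "afr status"}


def volume_health(volume: str) -> Health:
    """Step 1: composite health is the worst level any component reports."""
    return max((check(volume).level for check in CHECKERS.values()), default=Health.OK)


def health_status(volume: str) -> List[Report]:
    """Step 2: the components whose status is other than OK."""
    reports = [check(volume) for check in CHECKERS.values()]
    return [r for r in reports if r.level != Health.OK]


def next_commands(volume: str) -> List[str]:
    """Step 3: which follow-up query the storage admin should run."""
    return [DRILL_DOWN.get(r.component, r.component + " status") for r in health_status(volume)]


# Stub checkers with made-up results, standing in for real queries.
CHECKERS["brick"] = lambda vol: Report("brick", Health.OK, "all bricks online")
CHECKERS["afr"] = lambda vol: Report("afr", Health.OK, "all copies in sync")
CHECKERS["bitrot"] = lambda vol: Report("bitrot", Health.WARN, "1 file flagged by scrubber")
```

In this sketch the virt admin would only need volume_health() and health_status(), while next_commands() hints at what the storage admin looks at next. Keeping the checkers in a registry is one way the derivation could stay pluggable.<br>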

1 &amp; 2 can be done by the virt admin, who then alerts the storage admin, who in turn does 3 to find the root cause and take the necessary action.<br><br></tt></div><tt>thanx,<br>deepak<br><br>

</tt><div><br></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Dec 9, 2014 at 2:52 PM, Venky Shankar <span dir="ltr">&lt;<a href="mailto:yknev.shankar@gmail.com" target="_blank">yknev.shankar@gmail.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">On Tue, Dec 9, 2014 at 1:41 PM, Deepak Shetty &lt;<a href="mailto:dpkshetty@gmail.com">dpkshetty@gmail.com</a>&gt; wrote:<br>

&gt; We can use bitrot to provide a &#39;health&#39; status for gluster volumes.<br>

&gt; Hence I would like to propose (from an upstream/community perspective) the<br>

&gt; notion of &#39;health&#39; status (as part of gluster volume info) which can derive<br>

&gt; its value from:<br>

&gt;<br>

&gt; 1) Bitrot<br>

&gt;     If any files are corrupted and bitrot is yet to repair them and/or it&#39;s a<br>

&gt; signal to admin to do some manual operation to repair the corrupted files<br>

&gt; (for cases where we only detect, not correct)<br>

&gt;<br>

&gt; 2) brick status<br>

&gt;     Depending on brick offline/online<br>

&gt;<br>

&gt; 3) AFR status<br>

&gt;     Whether we have all copies in sync or not<br>

<br>

</span>This makes sense. Having a notion of &quot;volume health&quot; helps choosing<br>

intelligently from a list of volumes.<br>

<span class=""><br>

&gt;<br>

&gt; This, I believe, is along similar lines to what Ceph does today (health status:<br>

&gt; OK, WARN, ERROR)<br>

<br>

</span>Yes, Ceph derives those notions from PGs. Gluster can do it for<br>

replicas and/or files marked by bitrot scrubber.<br>

<span class=""><br>

&gt; The health status derivation can be pluggable, so that in future more<br>

&gt; components can be added to query for the composite health status of the<br>

&gt; gluster volume.<br>

&gt;<br>

&gt; In all of the above cases, as long as data can be served by the gluster<br>

&gt; volume reliably gluster volume status will be Started/Available, but Health<br>

&gt; status can be &#39;degraded&#39; or &#39;warn&#39;<br>

<br>

</span>WARN may be too strict, but something lenient enough yet descriptive<br>

should be chosen. Ceph does it pretty well:<br>

<a href="http://ceph.com/docs/master/rados/operations/monitoring-osd-pg/" target="_blank">http://ceph.com/docs/master/rados/operations/monitoring-osd-pg/</a><br>

<span class=""><br>

&gt;<br>

&gt; This has many uses:<br>

&gt;<br>

&gt; 1) It helps provide indication to the admin that something is amiss and he<br>

&gt; can check based on:<br>

&gt; bitrot / scrub status<br>

&gt; brick status<br>

&gt; AFR status<br>

&gt;<br>

&gt; and take necessary action<br>

&gt;<br>

&gt; 2) It helps management applications (OpenStack, for example) make an intelligent decision<br>

&gt; based on the health status (whether or not to pick this gluster volume for<br>

&gt; this create-volume operation), so it acts as a coarse-level filter<br>

&gt;<br>

&gt; 3) In general it gives user an idea of the health of the volume (which is<br>

&gt; different than the availability status (whether or not volume can serve<br>

&gt; data))<br>

&gt; For eg: If we have a pure DHT volume, and bitrot detects silent file<br>

&gt; corruption (and we are not auto correcting) having Gluster volume status as<br>

&gt; available/started isn&#39;t entirely correct !<br>

<br>

</span>+1<br>

<div class="HOEnZb"><div class="h5"><br>

&gt;<br>

&gt; thanx,<br>

&gt; deepak<br>

&gt;<br>

&gt;<br>

&gt; On Fri, Dec 5, 2014 at 11:31 PM, Venky Shankar &lt;<a href="mailto:yknev.shankar@gmail.com">yknev.shankar@gmail.com</a>&gt;<br>

&gt; wrote:<br>

&gt;&gt;<br>

&gt;&gt; On Fri, Nov 28, 2014 at 10:00 PM, Vijay Bellur &lt;<a href="mailto:vbellur@redhat.com">vbellur@redhat.com</a>&gt; wrote:<br>

&gt;&gt; &gt; On 11/28/2014 08:30 AM, Venky Shankar wrote:<br>

&gt;&gt; &gt;&gt;<br>

&gt;&gt; &gt;&gt; [snip]<br>

&gt;&gt; &gt;&gt;&gt;<br>

&gt;&gt; &gt;&gt;&gt;<br>

&gt;&gt; &gt;&gt;&gt; 1. Can the bitd be one per node like self-heal-daemon and other<br>

&gt;&gt; &gt;&gt;&gt; &quot;global&quot;<br>

&gt;&gt; &gt;&gt;&gt; services? I worry about creating 2 * N processes for N bricks in a<br>

&gt;&gt; &gt;&gt;&gt; node.<br>

&gt;&gt; &gt;&gt;&gt; Maybe we can consider having one thread per volume/brick etc. in a<br>

&gt;&gt; &gt;&gt;&gt; single<br>

&gt;&gt; &gt;&gt;&gt; bitd process to make it perform better.<br>

&gt;&gt; &gt;&gt;<br>

&gt;&gt; &gt;&gt;<br>

&gt;&gt; &gt;&gt; Absolutely.<br>

&gt;&gt; &gt;&gt; There would be one bitrot daemon per node, per volume.<br>

&gt;&gt; &gt;&gt;<br>

&gt;&gt; &gt;<br>

&gt;&gt; &gt; Do you foresee any problems in having one daemon per node for all<br>

&gt;&gt; &gt; volumes?<br>

&gt;&gt;<br>

&gt;&gt; Not technically :). Probably that&#39;s a nice thing to do.<br>

&gt;&gt;<br>

&gt;&gt; &gt;<br>

&gt;&gt; &gt;&gt;<br>

&gt;&gt; &gt;&gt;&gt;<br>

&gt;&gt; &gt;&gt;&gt; 3. I think the algorithm for checksum computation can vary within the<br>

&gt;&gt; &gt;&gt;&gt; volume. I see a reference to &quot;Hashtype is persisted along side the<br>

&gt;&gt; &gt;&gt;&gt; checksum<br>

&gt;&gt; &gt;&gt;&gt; and can be tuned per file type.&quot; Is this correct? If so:<br>

&gt;&gt; &gt;&gt;&gt;<br>

&gt;&gt; &gt;&gt;&gt; a) How will the policy be exposed to the user?<br>

&gt;&gt; &gt;&gt;<br>

&gt;&gt; &gt;&gt;<br>

&gt;&gt; &gt;&gt; Bitrot daemon would have a configuration file that can be configured<br>

&gt;&gt; &gt;&gt; via Gluster CLI. Tuning hash types could be based on file types or<br>

&gt;&gt; &gt;&gt; file name patterns (regexes) [which is a bit tricky as bitrot would<br>

&gt;&gt; &gt;&gt; work on GFIDs rather than filenames, but this can be solved by a level<br>

&gt;&gt; &gt;&gt; of indirection].<br>

&gt;&gt; &gt;&gt;<br>

&gt;&gt; &gt;&gt;&gt;<br>

&gt;&gt; &gt;&gt;&gt; b) It would be nice to have the algorithm for computing checksums be<br>

&gt;&gt; &gt;&gt;&gt; pluggable. Are there any thoughts on pluggability?<br>

&gt;&gt; &gt;&gt;<br>

&gt;&gt; &gt;&gt;<br>

&gt;&gt; &gt;&gt; Do you mean the default hash algorithm be configurable? If yes, then<br>

&gt;&gt; &gt;&gt; that&#39;s planned.<br>

&gt;&gt; &gt;<br>

&gt;&gt; &gt;<br>

&gt;&gt; &gt; Sounds good.<br>

&gt;&gt; &gt;<br>

&gt;&gt; &gt;&gt;<br>

&gt;&gt; &gt;&gt;&gt;<br>

&gt;&gt; &gt;&gt;&gt; c) What are the steps involved in changing the hashtype/algorithm for<br>

&gt;&gt; &gt;&gt;&gt; a<br>

&gt;&gt; &gt;&gt;&gt; file?<br>

&gt;&gt; &gt;&gt;<br>

&gt;&gt; &gt;&gt;<br>

&gt;&gt; &gt;&gt; Policy changes for file {types, patterns} are lazy, i.e., taken into<br>

&gt;&gt; &gt;&gt; effect during the next recompute. For objects that are never modified<br>

&gt;&gt; &gt;&gt; (after initial checksum compute), scrubbing can recompute the checksum<br>

&gt;&gt; &gt;&gt; using the new hash _after_ verifying the integrity of a file with the<br>

&gt;&gt; &gt;&gt; old hash.<br>

&gt;&gt; &gt;<br>

&gt;&gt; &gt;<br>

&gt;&gt; &gt;&gt;<br>

&gt;&gt; &gt;&gt;&gt;<br>

&gt;&gt; &gt;&gt;&gt; 4. Is the fop on which change detection gets triggered configurable?<br>

&gt;&gt; &gt;&gt;<br>

&gt;&gt; &gt;&gt;<br>

&gt;&gt; &gt;&gt; As of now all data modification fops trigger checksum calculation.<br>

&gt;&gt; &gt;&gt;<br>

&gt;&gt; &gt;<br>

&gt;&gt; &gt; I wish I had been clearer on this in my OP. Is the fop on which checksum<br>

&gt;&gt; &gt; verification/bitrot detection happens configurable? The feature page<br>

&gt;&gt; &gt; talks<br>

&gt;&gt; &gt; about &quot;open&quot; being a trigger point for this. Users might want to trigger<br>

&gt;&gt; &gt; detection on a &quot;read&quot; operation and not on open. It would be good to<br>

&gt;&gt; &gt; provide<br>

&gt;&gt; &gt; this flexibility.<br>

&gt;&gt;<br>

&gt;&gt; Ah! ok. As of now it&#39;s mostly open() and read(). Inline verification<br>

&gt;&gt; would be &quot;off&quot; by default due to obvious reasons.<br>

&gt;&gt;<br>

&gt;&gt; &gt;<br>

&gt;&gt; &gt;&gt;<br>

&gt;&gt; &gt;&gt;&gt;<br>

&gt;&gt; &gt;&gt;&gt; 6. Any thoughts on integrating the bitrot repair framework with<br>

&gt;&gt; &gt;&gt;&gt; self-heal?<br>

&gt;&gt; &gt;&gt;<br>

&gt;&gt; &gt;&gt;<br>

&gt;&gt; &gt;&gt; There are some thoughts on integration with self-heal daemon and EC.<br>

&gt;&gt; &gt;&gt; I&#39;m coming up with a doc which covers those [reason for delay in<br>

&gt;&gt; &gt;&gt; replying to your questions ;)]. Expect the doc in gluster-devel@<br>

&gt;&gt; &gt;&gt; soon.<br>

&gt;&gt; &gt;<br>

&gt;&gt; &gt;<br>

&gt;&gt; &gt; Will look forward to this.<br>

&gt;&gt; &gt;<br>

&gt;&gt; &gt;&gt;<br>

&gt;&gt; &gt;&gt;&gt;<br>

&gt;&gt; &gt;&gt;&gt; 7. How does detection figure out that lazy updation is still pending<br>

&gt;&gt; &gt;&gt;&gt; and<br>

&gt;&gt; &gt;&gt;&gt; not<br>

&gt;&gt; &gt;&gt;&gt; raise a false positive?<br>

&gt;&gt; &gt;&gt;<br>

&gt;&gt; &gt;&gt;<br>

&gt;&gt; &gt;&gt; That&#39;s one of the things that Rachana and I discussed yesterday.<br>

&gt;&gt; &gt;&gt; Should scrubbing *wait* while checksum updating is still in progress, or<br>

&gt;&gt; &gt;&gt; is it expected that scrubbing happens when there are no active I/O<br>

&gt;&gt; &gt;&gt; operations on the volume (both of which imply that bitrot daemon needs<br>

&gt;&gt; &gt;&gt; to know when it&#39;s done its job)?<br>

&gt;&gt; &gt;&gt;<br>

&gt;&gt; &gt;&gt; If both scrub and checksum updating go in parallel, then there needs<br>

&gt;&gt; &gt;&gt; to be a way to synchronize those operations. Maybe, compute checksum on<br>

&gt;&gt; &gt;&gt; priority which is provided by the scrub process as a hint (that leaves<br>

&gt;&gt; &gt;&gt; little window for rot though) ?<br>

&gt;&gt; &gt;&gt;<br>

&gt;&gt; &gt;&gt; Any thoughts?<br>

&gt;&gt; &gt;<br>

&gt;&gt; &gt;<br>

&gt;&gt; &gt; Waiting for no active I/O in the volume might be a difficult condition<br>

&gt;&gt; &gt; to<br>

&gt;&gt; &gt; reach in some deployments.<br>

&gt;&gt; &gt;<br>

&gt;&gt; &gt; Some form of waiting is necessary to prevent false positives. One<br>

&gt;&gt; &gt; possibility might be to mark an object as dirty till checksum updation<br>

&gt;&gt; &gt; is<br>

&gt;&gt; &gt; complete. Verification/scrub can then be skipped for dirty objects.<br>

&gt;&gt;<br>

&gt;&gt; Makes sense. Thanks!<br>

&gt;&gt;<br>

&gt;&gt; &gt;<br>

&gt;&gt; &gt; -Vijay<br>

&gt;&gt; &gt;<br>


&gt;<br>

&gt;<br>

</div></div></blockquote></div><br></div>