<div dir="ltr"><div>Thanks Venky. I also wanted to put forward how this can help in an OpenStack/cloud environment,<br>where we have two distinct admin roles (virt/OpenStack admin and storage admin).<br><br><tt><br>
1) Gluster volume 'health' should display the health status (OK, WARN, fatal/error, etc.)<br>
2) Based on that, the admin can query the health status to learn due to which component (AFR, quorum, geo-rep, etc.) the health status is other than OK<br>
3) Based on that component, run the right gluster command (scrub status, AFR status, split-brain status?, etc.) to go deeper into where the problem lies<br>
<br>
Steps 1 and 2 can be done by the virt admin, who then alerts the storage admin, who performs step 3 to figure out the root cause and take the necessary action.<br><br></tt></div><tt>thanx,<br>deepak<br><br>
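The workflow above could be sketched roughly as follows. This is purely illustrative Python, not an existing Gluster API; `volume_health` and the component names are hypothetical:

```python
# Hypothetical sketch of a composite volume-health query, not an
# existing Gluster API.  Each component reports its own status and
# the volume health is the worst of them.

# Severity ordering: OK < WARN < ERROR
SEVERITY = {"OK": 0, "WARN": 1, "ERROR": 2}

def volume_health(component_statuses):
    """component_statuses: dict like {'bitrot': 'OK', 'afr': 'WARN', ...}.
    Returns (overall, offenders), where offenders lists the components
    responsible for a non-OK overall status."""
    overall = max(component_statuses.values(), key=lambda s: SEVERITY[s])
    offenders = [c for c, s in component_statuses.items()
                 if s == overall and s != "OK"]
    return overall, offenders

# Step 1: the virt admin sees the overall health.
# Step 2: querying the offenders tells the storage admin where to look (step 3).
status, offenders = volume_health({"bitrot": "OK", "afr": "WARN", "brick": "OK"})
# status == "WARN", offenders == ["afr"]
```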
</tt><div><br></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Dec 9, 2014 at 2:52 PM, Venky Shankar <span dir="ltr"><<a href="mailto:yknev.shankar@gmail.com" target="_blank">yknev.shankar@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">On Tue, Dec 9, 2014 at 1:41 PM, Deepak Shetty <<a href="mailto:dpkshetty@gmail.com">dpkshetty@gmail.com</a>> wrote:<br>
> We can use bitrot to provide a 'health' status for gluster volumes.<br>
> Hence I would like to propose (from an upstream/community perspective) the<br>
> notion of 'health' status (as part of gluster volume info) which can derive<br>
> its value from:<br>
><br>
> 1) Bitrot<br>
> If any files are corrupted and bitrot is yet to repair them, and/or it's a<br>
> signal to admin to do some manual operation to repair the corrupted files<br>
> (for cases where we only detect, not correct)<br>
><br>
> 2) brick status<br>
> Depending on brick offline/online<br>
><br>
> 3) AFR status<br>
> Whether we have all copies in sync or not<br>
<br>
</span>This makes sense. Having a notion of "volume health" helps in choosing<br>
intelligently from a list of volumes.<br>
<span class=""><br>
><br>
> This, I believe, is along similar lines to what Ceph does today (health status:<br>
> OK, WARN, ERROR)<br>
<br>
</span>Yes, Ceph derives those notions from PGs. Gluster can do it for<br>
replicas and/or files marked by bitrot scrubber.<br>
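As a purely illustrative sketch (not Gluster code), deriving a per-component status from those two sources might look like this; the thresholds are assumptions, not settled semantics:

```python
# Hypothetical illustration of deriving health from the two sources
# mentioned: replica sync state and files marked bad by the bitrot
# scrubber.  Not Gluster code; thresholds are illustrative only.

def replica_health(replicas_in_sync, replica_count):
    # All copies in sync -> OK; some copies out of sync but a majority
    # still in sync -> WARN; majority lost -> ERROR.
    if replicas_in_sync == replica_count:
        return "OK"
    if replicas_in_sync > replica_count // 2:
        return "WARN"
    return "ERROR"

def bitrot_health(corrupted_file_count):
    # Any file flagged by the scrubber degrades health until repaired.
    return "WARN" if corrupted_file_count else "OK"
```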
<span class=""><br>
> The health status derivation can be pluggable, so that in future more<br>
> components can be added to query for the composite health status of the<br>
> gluster volume.<br>
><br>
> In all of the above cases, as long as data can be served by the gluster<br>
> volume reliably gluster volume status will be Started/Available, but Health<br>
> status can be 'degraded' or 'warn'<br>
<br>
</span>WARN may be too strict, but something lenient yet descriptive<br>
should be chosen. Ceph does it pretty well:<br>
<a href="http://ceph.com/docs/master/rados/operations/monitoring-osd-pg/" target="_blank">http://ceph.com/docs/master/rados/operations/monitoring-osd-pg/</a><br>
<span class=""><br>
><br>
> This has many uses:<br>
><br>
> 1) It helps provide an indication to the admin that something is amiss, and he<br>
> can check based on:<br>
> bitrot / scrub status<br>
> brick status<br>
> AFR status<br>
><br>
> and take necessary action<br>
><br>
> 2) It helps mgmt applications (OpenStack, for example) make an intelligent decision<br>
> based on the health status (whether or not to pick this gluster volume for<br>
> this create volume operation), so it helps act as a coarse-level filter<br>
><br>
> 3) In general it gives the user an idea of the health of the volume (which is<br>
> different from the availability status, i.e., whether or not the volume can serve<br>
> data).<br>
> For example: if we have a pure DHT volume and bitrot detects silent file<br>
> corruption (and we are not auto-correcting), having the Gluster volume status as<br>
> available/started isn't entirely correct!<br>
<br>
</span>+1<br>
<div class="HOEnZb"><div class="h5"><br>
><br>
> thanx,<br>
> deepak<br>
><br>
><br>
> On Fri, Dec 5, 2014 at 11:31 PM, Venky Shankar <<a href="mailto:yknev.shankar@gmail.com">yknev.shankar@gmail.com</a>><br>
> wrote:<br>
>><br>
>> On Fri, Nov 28, 2014 at 10:00 PM, Vijay Bellur <<a href="mailto:vbellur@redhat.com">vbellur@redhat.com</a>> wrote:<br>
>> > On 11/28/2014 08:30 AM, Venky Shankar wrote:<br>
>> >><br>
>> >> [snip]<br>
>> >>><br>
>> >>><br>
>> >>> 1. Can the bitd be one per node like self-heal-daemon and other<br>
>> >>> "global"<br>
>> >>> services? I worry about creating 2 * N processes for N bricks in a<br>
>> >>> node.<br>
>> >>> Maybe we can consider having one thread per volume/brick etc. in a<br>
>> >>> single<br>
>> >>> bitd process to make it perform better.<br>
>> >><br>
>> >><br>
>> >> Absolutely.<br>
>> >> There would be one bitrot daemon per node, per volume.<br>
>> >><br>
>> ><br>
>> > Do you foresee any problems in having one daemon per node for all<br>
>> > volumes?<br>
>><br>
>> Not technically :). Probably that's a nice thing to do.<br>
>><br>
>> ><br>
>> >><br>
>> >>><br>
>> >>> 3. I think the algorithm for checksum computation can vary within the<br>
>> >>> volume. I see a reference to "Hashtype is persisted along side the<br>
>> >>> checksum<br>
>> >>> and can be tuned per file type." Is this correct? If so:<br>
>> >>><br>
>> >>> a) How will the policy be exposed to the user?<br>
>> >><br>
>> >><br>
>> >> Bitrot daemon would have a configuration file that can be configured<br>
>> >> via Gluster CLI. Tuning hash types could be based on file types or<br>
>> >> file name patterns (regexes) [which is a bit tricky as bitrot would<br>
>> >> work on GFIDs rather than filenames, but this can be solved by a level<br>
>> >> of indirection].<br>
>> >><br>
>> >>><br>
>> >>> b) It would be nice to have the algorithm for computing checksums be<br>
>> >>> pluggable. Are there any thoughts on pluggability?<br>
>> >><br>
>> >><br>
>> >> Do you mean the default hash algorithm be configurable? If yes, then<br>
>> >> that's planned.<br>
>> ><br>
>> ><br>
>> > Sounds good.<br>
>> ><br>
>> >><br>
>> >>><br>
>> >>> c) What are the steps involved in changing the hashtype/algorithm for<br>
>> >>> a<br>
>> >>> file?<br>
>> >><br>
>> >><br>
>> >> Policy changes for file {types, patterns} are lazy, i.e., taken into<br>
>> >> effect during the next recompute. For objects that are never modified<br>
>> >> (after initial checksum compute), scrubbing can recompute the checksum<br>
>> >> using the new hash _after_ verifying the integrity of a file with the<br>
>> >> old hash.<br>
>> ><br>
>> ><br>
>> >><br>
>> >>><br>
>> >>> 4. Is the fop on which change detection gets triggered configurable?<br>
>> >><br>
>> >><br>
>> >> As of now all data modification fops trigger checksum calculation.<br>
>> >><br>
>> ><br>
>> > Wish I was more clear on this in my OP. Is the fop on which checksum<br>
>> > verification/bitrot detection happens configurable? The feature page<br>
>> > talks<br>
>> > about "open" being a trigger point for this. Users might want to trigger<br>
>> > detection on a "read" operation and not on open. It would be good to<br>
>> > provide<br>
>> > this flexibility.<br>
>><br>
>> Ah! OK. As of now it's mostly open() and read(). Inline verification<br>
>> would be "off" by default for obvious reasons.<br>
>><br>
>> ><br>
>> >><br>
>> >>><br>
>> >>> 6. Any thoughts on integrating the bitrot repair framework with<br>
>> >>> self-heal?<br>
>> >><br>
>> >><br>
>> >> There are some thoughts on integration with the self-heal daemon and EC.<br>
>> >> I'm coming up with a doc which covers those [reason for the delay in<br>
>> >> replying to your questions ;)]. Expect the doc on gluster-devel@<br>
>> >> soon.<br>
>> ><br>
>> ><br>
>> > Will look forward to this.<br>
>> ><br>
>> >><br>
>> >>><br>
>> >>> 7. How does detection figure out that lazy updation is still pending<br>
>> >>> and<br>
>> >>> not<br>
>> >>> raise a false positive?<br>
>> >><br>
>> >><br>
>> >> That's one of the things that Rachana and I discussed yesterday.<br>
>> >> Should scrubbing *wait* while checksum updating is in progress, or<br>
>> >> is it expected that scrubbing happens when there are no active I/O<br>
>> >> operations on the volume (both of which imply that the bitrot daemon needs<br>
>> >> to know when it's done its job)?<br>
>> >><br>
>> >> If both scrub and checksum updating go in parallel, then there needs<br>
>> >> to be a way to synchronize those operations. Maybe compute the checksum on<br>
>> >> priority, with a hint provided by the scrub process (though that leaves a<br>
>> >> little window for rot)?<br>
>> >><br>
>> >> Any thoughts?<br>
>> ><br>
>> ><br>
>> > Waiting for no active I/O in the volume might be a difficult condition<br>
>> > to<br>
>> > reach in some deployments.<br>
>> ><br>
>> > Some form of waiting is necessary to prevent false positives. One<br>
>> > possibility might be to mark an object as dirty till checksum updation<br>
>> > is<br>
>> > complete. Verification/scrub can then be skipped for dirty objects.<br>
>><br>
>> Makes sense. Thanks!<br>
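A minimal sketch of that dirty-marking idea, in illustrative Python only (a real implementation would persist the flag on disk alongside the checksum, presumably as an xattr):

```python
# Illustrative sketch: an object is marked dirty while its checksum is
# being (re)computed, and the scrubber skips dirty objects, so lazy
# checksum updates never raise false positives.

import hashlib

class BitrotObject:
    def __init__(self, data):
        self.data = data
        self.dirty = True          # checksum not yet computed
        self.checksum = None

def sign(obj):
    """Lazy checksum computation; clears the dirty flag when done."""
    obj.checksum = hashlib.sha256(obj.data).hexdigest()
    obj.dirty = False              # now safe to scrub

def scrub(obj):
    """Returns 'skipped', 'ok', or 'rotten'."""
    if obj.dirty:
        return "skipped"           # update pending: avoid a false positive
    actual = hashlib.sha256(obj.data).hexdigest()
    return "ok" if actual == obj.checksum else "rotten"
```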
>><br>
>> ><br>
>> > -Vijay<br>
>> ><br>
>> _______________________________________________<br>
>> Gluster-devel mailing list<br>
>> <a href="mailto:Gluster-devel@gluster.org">Gluster-devel@gluster.org</a><br>
>> <a href="http://supercolony.gluster.org/mailman/listinfo/gluster-devel" target="_blank">http://supercolony.gluster.org/mailman/listinfo/gluster-devel</a><br>
><br>
><br>
</div></div></blockquote></div><br></div>