<div dir="ltr">+1 for 2b.<div><br></div><div>I am in de planning stages for an RHS 2.0 deployement and I too have suggested a "cookbook" style guide for step-by-step procedures to my RedHat Solution Architect.</div>
<div><br></div><div>What can I do to have this upped in the prio-list?</div><div><br></div><div style>Cheers,</div><div style>Fred</div><div style><br></div><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Jan 2, 2013 at 12:49 PM, Brian Candler <span dir="ltr"><<a href="mailto:B.Candler@pobox.com" target="_blank">B.Candler@pobox.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="im">On Thu, Dec 27, 2012 at 06:53:46PM -0500, John Mark Walker wrote:<br>
> I invite all sorts of disagreeable comments, and I'm all for public<br>
> discussion of things - as can be seen in this list's archives. But, for<br>
> better or worse, we've chosen the approach that we have. Anyone who would<br>
> like to challenge that approach is welcome to take up that discussion with<br>
> our developers on gluster-devel. This list is for those who need help<br>
> using glusterfs.<br>
><br>
> I am sorry that you haven't been able to deploy glusterfs in production.<br>
> Discussing how and why glusterfs works - or doesn't work - for particular<br>
> use cases is welcome on this list. Starting off a discussion about how<br>
> the entire approach is unworkable is kind of counter-productive and not<br>
> exactly helpful to those of us who just want to use the thing.<br>
<br>
For me, the biggest problems with glusterfs are not in its design, feature
set or performance; they are around what happens when something goes wrong.
As I perceive them, the issues are:

1. An almost total lack of error reporting, beyond incomprehensible entries
in log files on a completely different machine, made very difficult to find
because they are mixed in with all sorts of other incomprehensible log
entries.
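
To make that concrete, this is roughly how I end up hunting for them today
(the log locations are the 3.3 defaults as I understand them; adjust for
your own setup):

    # on each brick server: pull out error-level lines from the brick and
    # self-heal daemon logs
    grep -h ' E ' /var/log/glusterfs/bricks/*.log /var/log/glusterfs/glustershd.log | tail -50

    # on the client: the mount log is named after the mount point,
    # e.g. /mnt/gluster -> mnt-gluster.log
    grep ' E ' /var/log/glusterfs/mnt-gluster.log | tail -50

Even then, I am left guessing which of those lines has anything to do with
the failure I'm actually looking at.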

2. Incomplete documentation. This breaks down further as:

2a. A total lack of architecture and implementation documentation - such as
what the translators are and how they work internally, what a GFID is, what
xattrs are stored where and what they mean, and all the on-disk states you
can expect to see during replication and healing. Without this level of
documentation, it's impossible to interpret the log messages from (1) short
of reverse-engineering the source code (which is also very minimalist when
it comes to comments); and hence it's impossible to reason about what has
happened when the system is misbehaving, and what would be the correct and
safe intervention to make.
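
For instance, the only way I know of to see any of that on-disk state is to
poke at the xattrs on the bricks by hand - something like the following,
from my replica-2 test volume (the volume is called "testvol" there, so the
attribute names and brick path will differ on other setups):

    getfattr -d -m . -e hex /export/brick1/somefile
    # trusted.gfid=0x...                     <- the file's GFID
    # trusted.afr.testvol-client-0=0x...     <- AFR changelog counters
    # trusted.afr.testvol-client-1=0x...

Knowing how to dump those attributes is easy enough; knowing what non-zero
values in the changelog counters mean, and what it is safe to do about them,
is exactly the part that isn't written down anywhere.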

glusterfs 2.x actually had fairly comprehensive internals documentation, but
this has all been stripped out in 3.x to turn it into a "black box".
Conversely, development on 3.x has diverged enough from 2.x to make the 2.x
documentation unusable.

2b. An almost total lack of procedural documentation, such as "to replace a
failed server with another one, follow these steps" (which in that case
involves manually copying peer UUIDs from one server to another), or "if
volume rebalance gets stuck, do this". When you come across any of these
issues you end up asking the list, and to be fair the list generally
responds promptly and helpfully - but that approach doesn't scale, and
doesn't necessarily help if you have a storage problem at 3am.
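
For what it's worth, the failed-server recipe I have pieced together from
list posts and experiments on a 3.3 test cluster looks roughly like the
following - please treat it as a sketch rather than an authoritative
procedure ("server2" is the node being rebuilt with the same hostname,
"server1" a surviving peer, and <old-uuid>/<volname> are placeholders):

    # 1. on a surviving node, find the UUID the cluster already has for the
    #    dead server2 - the files under peers/ are named after each peer's UUID
    grep -l server2 /var/lib/glusterd/peers/*

    # 2. on the rebuilt server2, stop glusterd and re-use that old UUID
    service glusterd stop
    sed -i 's/^UUID=.*/UUID=<old-uuid>/' /var/lib/glusterd/glusterd.info

    # 3. repopulate /var/lib/glusterd/peers/ on server2 with one file per
    #    *other* node, copied from the surviving nodes (and make sure there
    #    is no file for server2's own UUID)

    # 4. restart, resync the volume definitions and kick off a full self-heal
    service glusterd start
    gluster volume sync server1 all
    gluster volume heal <volname> full

Exactly this sort of thing is what I would expect to find, step by step, in
the admin guide.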

For these reasons, I am holding back from deploying any of the more
interesting features of glusterfs, such as replicated volumes and
distributed volumes which might grow and need rebalancing. And without
those, I may as well go back to standard NFS and rsync.

And yes, I have raised a number of bug reports for specific issues, but
reporting a bug whenever you come across a problem in testing or production
is not the right answer. It seems to me that all these edge and error cases
and recovery procedures should already have been developed and tested *as a
matter of course*, for a service as critical as storage.

I'm not saying there is no error handling in glusterfs, because that's
clearly not true. What I'm saying is that any complex system is bound to
have states where processes cannot proceed without external assistance, and
these cases all need to be tested, and you need to have good error reporting
and good documentation.

I know I'm not the only person to have been affected, because there is a
steady stream of people on this list who are asking for help with how to
cope with replication and rebalancing failures.

Please don't consider the above as non-constructive. I count myself amongst
"those of us who just want to use the thing". But right now, I cannot
wholeheartedly recommend it to my colleagues, because I cannot confidently
say that I or they would be able to handle the failure scenarios I have
already experienced, or other ones which may occur in the future.

Regards,

Brian.
<div class="HOEnZb"><div class="h5">_______________________________________________<br>
Gluster-users mailing list<br>
<a href="mailto:Gluster-users@gluster.org">Gluster-users@gluster.org</a><br>
<a href="http://supercolony.gluster.org/mailman/listinfo/gluster-users" target="_blank">http://supercolony.gluster.org/mailman/listinfo/gluster-users</a><br>
</div></div></blockquote></div><br></div></div>