<div dir="ltr">+1 for 2b.<div><br></div><div>I am in de planning stages for an RHS 2.0 deployement and I too have suggested a "cookbook" style guide for step-by-step procedures to my RedHat Solution Architect.</div>
<div><br></div><div>What can I do to have this upped in the prio-list?</div><div><br></div><div style>Cheers,</div><div style>Fred</div><div style><br></div><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Jan 2, 2013 at 12:49 PM, Brian Candler <span dir="ltr"><<a href="mailto:B.Candler@pobox.com" target="_blank">B.Candler@pobox.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="im">On Thu, Dec 27, 2012 at 06:53:46PM -0500, John Mark Walker wrote:<br>
> I invite all sorts of disagreeable comments, and I'm all for public<br>
> discussion of things - as can be seen in this list's archives. But, for<br>
> better or worse, we've chosen the approach that we have. Anyone who would<br>
> like to challenge that approach is welcome to take up that discussion with<br>
> our developers on gluster-devel. This list is for those who need help<br>
> using glusterfs.<br>
><br>
> I am sorry that you haven't been able to deploy glusterfs in production.<br>
> Discussing how and why glusterfs works - or doesn't work - for particular<br>
> use cases is welcome on this list. Starting off a discussion about how<br>
> the entire approach is unworkable is kind of counter-productive and not<br>
> exactly helpful to those of us who just want to use the thing.<br>
<br>
For me, the biggest problems with glusterfs are not in its design, feature
set or performance; they are around what happens when something goes wrong.
As I perceive them, the issues are:

1. An almost total lack of error reporting, beyond incomprehensible entries
in log files on a completely different machine, made very difficult to find
because they are mixed in with all sorts of other incomprehensible log
entries.
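
To make that concrete, this is roughly how I end up hunting for them today
(the log locations are the 3.3 defaults as I understand them; adjust for
your own setup):

    # on each brick server: pull out error-level lines from the brick and
    # self-heal daemon logs
    grep -h ' E ' /var/log/glusterfs/bricks/*.log /var/log/glusterfs/glustershd.log | tail -50

    # on the client: the mount log is named after the mount point,
    # e.g. /mnt/gluster -> mnt-gluster.log
    grep ' E ' /var/log/glusterfs/mnt-gluster.log | tail -50

Even then, I am left guessing which of those lines has anything to do with
the failure I'm actually looking at.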

2. Incomplete documentation. This breaks down further as:

2a. A total lack of architecture and implementation documentation - such as
what the translators are and how they work internally, what a GFID is, what
xattrs are stored where and what they mean, and all the on-disk states you
can expect to see during replication and healing. Without this level of
documentation, it's impossible to interpret the log messages from (1) short
of reverse-engineering the source code (which is also very minimalist when
it comes to comments); and hence it's impossible to reason about what has
happened when the system is misbehaving, and what would be the correct and
safe intervention to make.
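
For instance, the only way I know of to see any of that on-disk state is to
poke at the xattrs on the bricks by hand - something like the following,
from my replica-2 test volume (the volume is called "testvol" there, so the
attribute names and brick path will differ on other setups):

    getfattr -d -m . -e hex /export/brick1/somefile
    # trusted.gfid=0x...                     <- the file's GFID
    # trusted.afr.testvol-client-0=0x...     <- AFR changelog counters
    # trusted.afr.testvol-client-1=0x...

Knowing how to dump those attributes is easy enough; knowing what non-zero
values in the changelog counters mean, and what it is safe to do about them,
is exactly the part that isn't written down anywhere.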

glusterfs 2.x actually had fairly comprehensive internals documentation, but
this has all been stripped out in 3.x to turn it into a "black box".
Conversely, development on 3.x has diverged enough from 2.x to make the 2.x
documentation unusable.

2b. An almost total lack of procedural documentation, such as "to replace a
failed server with another one, follow these steps" (which in that case
involves manually copying peer UUIDs from one server to another), or "if
volume rebalance gets stuck, do this". When you come across any of these
issues you end up asking the list, and to be fair the list generally
responds promptly and helpfully - but that approach doesn't scale, and
doesn't necessarily help if you have a storage problem at 3am.
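
For what it's worth, the failed-server recipe I have pieced together from
list posts and experiments on a 3.3 test cluster looks roughly like the
following - please treat it as a sketch rather than an authoritative
procedure ("server2" is the node being rebuilt with the same hostname,
"server1" a surviving peer, and <old-uuid>/<volname> are placeholders):

    # 1. on a surviving node, find the UUID the cluster already has for the
    #    dead server2 - the files under peers/ are named after each peer's UUID
    grep -l server2 /var/lib/glusterd/peers/*

    # 2. on the rebuilt server2, stop glusterd and re-use that old UUID
    service glusterd stop
    sed -i 's/^UUID=.*/UUID=<old-uuid>/' /var/lib/glusterd/glusterd.info

    # 3. repopulate /var/lib/glusterd/peers/ on server2 with one file per
    #    *other* node, copied from the surviving nodes (and make sure there
    #    is no file for server2's own UUID)

    # 4. restart, resync the volume definitions and kick off a full self-heal
    service glusterd start
    gluster volume sync server1 all
    gluster volume heal <volname> full

Exactly this sort of thing is what I would expect to find, step by step, in
the admin guide.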

For these reasons, I am holding back from deploying any of the more
interesting features of glusterfs, such as replicated volumes and
distributed volumes which might grow and need rebalancing. And without
those, I may as well go back to standard NFS and rsync.

And yes, I have raised a number of bug reports for specific issues, but
reporting a bug whenever you come across a problem in testing or production
is not the right answer. It seems to me that all these edge and error cases
and recovery procedures should already have been developed and tested *as a
matter of course*, for a service as critical as storage.

I'm not saying there is no error handling in glusterfs, because that's
clearly not true. What I'm saying is that any complex system is bound to
have states where processes cannot proceed without external assistance, and
these cases all need to be tested, and you need to have good error reporting
and good documentation.

I know I'm not the only person to have been affected, because there is a
steady stream of people on this list who are asking for help with how to
cope with replication and rebalancing failures.

Please don't consider the above as non-constructive. I count myself amongst
"those of us who just want to use the thing". But right now, I cannot
wholeheartedly recommend it to my colleagues, because I cannot confidently
say that I or they would be able to handle the failure scenarios I have
already experienced, or other ones which may occur in the future.

Regards,

Brian.
<div class="HOEnZb"><div class="h5">_______________________________________________<br>
Gluster-users mailing list<br>
<a href="mailto:Gluster-users@gluster.org">Gluster-users@gluster.org</a><br>
<a href="http://supercolony.gluster.org/mailman/listinfo/gluster-users" target="_blank">http://supercolony.gluster.org/mailman/listinfo/gluster-users</a><br>
</div></div></blockquote></div><br></div></div>