<br><br><div class="gmail_quote">On Fri, Mar 22, 2013 at 7:09 AM, Jeff Darcy <span dir="ltr">&lt;<a href="mailto:jdarcy@redhat.com" target="_blank">jdarcy@redhat.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

The need for some change here is keenly felt<br>

right now as we struggle to fix all of the race conditions that have<br>

resulted from the hasty addition of synctasks to make up for poor<br>

performance elsewhere in that 44K lines of C. </blockquote><div><br></div><div>synctasks were not added for performance at all. glusterd, being single-threaded, could not serve a volfile in response to a GETSPEC command or assign a port in response to a PORTMAP query when the request came from the very process it had just spawned (glusterfs/glusterfsd). The spawned process would ask glusterd and wait for the result before &quot;finishing daemonizing&quot; (so that a proper exit status could be returned), while glusterd would wait for glusterfsd to return before getting back to epoll() and picking up the portmap/getspec request, resulting in a deadlock.</div>

<div><br></div><div>Making it multi-threaded was inevitable if we wanted to get even &quot;basic&quot; behavior right, i.e. have &quot;gluster volume start&quot; return success only if glusterfsd successfully started and fail if it could not start (before that, we would _always_ return success).</div>
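<div><br></div><div>To make the deadlock concrete, here is a minimal Python sketch of the pattern described above. It is not glusterd code: threads and queues stand in for processes and RPC, the function names and the port value are made up for illustration, and a timeout replaces the real hang so that the example terminates.</div>

```python
import queue
import threading


def child(requests, status, timeout):
    """Stand-in for glusterfsd: needs a reply from the parent before it can report startup."""
    reply = queue.Queue()
    requests.put(reply)                  # e.g. a PORTMAP or GETSPEC request
    try:
        reply.get(timeout=timeout)       # blocks until the parent answers
        status.append("started")
    except queue.Empty:
        status.append("timed out")       # a real child would simply hang here


def serve_one(requests, wait=1.0):
    """One turn of the parent's event loop: answer a single pending request."""
    try:
        requests.get(timeout=wait).put("port 49152")   # illustrative value only
    except queue.Empty:
        pass


def single_threaded_start(timeout=0.2):
    requests, status = queue.Queue(), []
    proc = threading.Thread(target=child, args=(requests, status, timeout))
    proc.start()
    proc.join()                          # parent waits for the child's status first...
    serve_one(requests)                  # ...and only then gets back to its event loop
    return status[0]


def multi_threaded_start(timeout=0.2):
    requests, status = queue.Queue(), []
    proc = threading.Thread(target=child, args=(requests, status, timeout))

    def spawn_and_wait():
        proc.start()
        proc.join()

    waiter = threading.Thread(target=spawn_and_wait)
    waiter.start()                       # waiting for the child moves off the event loop
    serve_one(requests)                  # the event loop stays free to answer the child
    waiter.join()
    return status[0]
```

<div><br></div><div>In this sketch single_threaded_start() reports that the child timed out, because the parent only reaches serve_one() after the child has given up, while multi_threaded_start() reports a successful start.</div>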

<div><br></div><div>But this is yet another example of how retrofitting threads onto a single-threaded program can cause problems. It&#39;s not unusual to see races. Most of them are fixable with &quot;general scheme of locking&quot; practices applied in a few places.</div>

<div><br></div><div>That being said, I&#39;m open to exploring other projects which are a &quot;good fit&quot; with the rest of glusterfs. It would certainly be nice to make it &quot;someone else&#39;s problem&quot;.</div>

<div><br></div><div>Avati</div><div><br></div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> Delegating as much as<br>

possible of this functionality to mature code that is mostly maintained<br>

elsewhere would be very beneficial.  I&#39;ve done some research since those<br>

meetings, and here are some results.<br>

<br>

The most basic idea here is to use an existing coordination service to<br>

store cluster configuration and state.  That service would then take<br>

responsibility for maintaining availability and consistency of the data<br>

under its care.  The best known example of such a coordination service<br>

is Apache&#39;s ZooKeeper[1], but there are others that don&#39;t have the<br>

noxious Java dependency - e.g. doozer[2] written in Go, Arakoon[3]<br>

written in OCaml, ConCoord[4] written in Python.  These all provide a<br>

tightly consistent generally-hierarchical namespace for relatively small<br>

amounts of data.  In addition, there are two other features that might<br>

be useful.<br>

<br>

* Watches: register for notification of changes to an object (or<br>

directory/container), without having to poll.<br>

<br>

* Ephemerals: certain objects go away when the client that created them<br>

drops its connection to the server(s).<br>
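<br>

As a rough illustration of these two features, the toy in-memory class below<br>

mimics watches and ephemerals.  It is only a sketch: the class, its method<br>

names and the "/members/" path are invented for this example and are not the<br>

API of ZooKeeper, doozer or any other real service.<br>

```python
class ToyCoordinator:
    """In-memory stand-in for a coordination service (not a real client API)."""

    def __init__(self):
        self.data = {}       # path -> value
        self.owners = {}     # path -> client that created it as an ephemeral
        self.watches = {}    # path prefix -> list of callbacks

    def watch(self, prefix, callback):
        """Register a callback for changes under a path prefix (no polling)."""
        self.watches.setdefault(prefix, []).append(callback)

    def _notify(self, path, event):
        for prefix, callbacks in self.watches.items():
            if path.startswith(prefix):
                for callback in callbacks:
                    callback(event, path)

    def put(self, path, value, ephemeral_owner=None):
        """Store a value; tie its lifetime to a client if ephemeral_owner is set."""
        self.data[path] = value
        if ephemeral_owner is not None:
            self.owners[path] = ephemeral_owner
        self._notify(path, "set")

    def disconnect(self, client):
        """Drop every ephemeral object created by the disconnecting client."""
        for path in [p for p, owner in self.owners.items() if owner == client]:
            del self.data[path], self.owners[path]
            self._notify(path, "deleted")


events = []
svc = ToyCoordinator()
svc.watch("/members/", lambda event, path: events.append((event, path)))
svc.put("/members/server1", "up", ephemeral_owner="server1")
svc.disconnect("server1")    # the watcher is told that server1 went away
```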

<br>

Here&#39;s a rough sketch of how we&#39;d use such a service.<br>

<br>

* Membership: a certain small set of servers (three or more) would be<br>

manually set up as coordination-service masters (e.g. via &quot;peer probe<br>

xxx as master&quot;).  Other servers would connect to these masters, which<br>

would use ephemerals to update a &quot;cluster map&quot; object.  Both clients and<br>

servers could set up watches on the cluster map object to be notified of<br>

servers joining and leaving.<br>

<br>

* Configuration: the information we currently store in each volume&#39;s<br>

&quot;info&quot; file as the basis for generating volfiles (and perhaps the<br>

volfiles themselves) would be stored in the configuration service.<br>

Again, servers and clients could set watches on these objects to be<br>

notified of changes and do the appropriate graph switches, reconfigures,<br>

quorum actions, etc.<br>

<br>

* Maintenance operations: these would still run in glusterd (which isn&#39;t<br>

going away).  They would use the coordination service for leader election to<br>

make sure the same activity isn&#39;t started twice, and to keep status<br>

updated in a way that allows other nodes to watch for changes.<br>

<br>

* Status queries: these would be handled entirely by querying objects<br>

within the coordination service.<br>

<br>

Of the alternatives available to us, only ZooKeeper directly supports<br>

all of the functionality we&#39;d want.  However, the Java dependency is<br>

decidedly unpleasant for us and would be totally unacceptable to some of<br>

our users.  Doozer seems the closest of the remainder; it supports<br>

watches but not ephemerals, so we&#39;d either have to synthesize those on<br>

top of doozer itself or find another way to handle membership (the only<br>

place where we use that functionality) based on the features it does<br>

have.  The project also seems reasonably mature and active, though we&#39;d<br>

probably still have to devote some time to developing our own local<br>

doozer expertise.<br>

<br>

In a similar vein, another possibility would be to use *ourselves* as<br>

the coordination service, via a hand-configured AFR volume.  This is<br>

actually an approach Kaleb and I were seriously considering for HekaFS<br>

at the time of the acquisition, and it&#39;s not without its benefits.<br>

Using libgfapi we can prevent this special volume from having to be<br>

mounted, and we already know how to secure the communications paths for<br>

it (something that would require additional work with the other<br>

solutions).  On the other hand, it would probably require additional<br>

translators to provide both ephemerals and watches, and might require<br>

its own non-glusterd solution to issues like failure detection and<br>

self-heal, so it doesn&#39;t exactly meet the &quot;make it somebody else&#39;s<br>

problem&quot; criterion.<br>

<br>

In conclusion, I think our best (long term) way forward would be to<br>

prototype a doozer-based version of glusterd.  I could possibly be<br>

persuaded to try a &quot;gluster on gluster&quot; approach instead, but at this<br>

moment it wouldn&#39;t be my first choice.  Are there any other suggestions<br>

or objections before I forge ahead?<br>

<br>

[1] <a href="http://zookeeper.apache.org/" target="_blank">http://zookeeper.apache.org/</a><br>

[2] <a href="https://github.com/ha/doozerd" target="_blank">https://github.com/ha/doozerd</a><br>

[3] <a href="http://arakoon.org/" target="_blank">http://arakoon.org/</a><br>

[4] <a href="http://openreplica.org/doc/" target="_blank">http://openreplica.org/doc/</a><br>

<br>


</blockquote></div><br>