<br><br><div class="gmail_quote">On Fri, Mar 22, 2013 at 7:09 AM, Jeff Darcy <span dir="ltr"><<a href="mailto:jdarcy@redhat.com" target="_blank">jdarcy@redhat.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
The need for some change here is keenly felt<br>
right now as we struggle to fix all of the race conditions that have<br>
resulted from the hasty addition of synctasks to make up for poor<br>
performance elsewhere in that 44K lines of C. </blockquote><div><br></div><div>synctasks were not added for performance at all. glusterd, being single threaded, was incapable of serving a volfile for a GETSPEC command or assigning a port for a PORTMAP query while the very process it spawned (glusterfs/glusterfsd) would ask glusterd and wait for the result before "finishing daemonizing" (so that a proper exit status could be returned); glusterd in turn would wait for glusterfsd to return before it got back to epoll() and picked up the portmap/getspec request -- resulting in a deadlock.</div>
<div><br></div><div>Making it multi-threaded was inevitable if we wanted to get even "basic" behavior right - i.e. "gluster volume start" returns success only if glusterfsd actually started, and fails if it could not (previously we would _always_ return success).</div>
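The deadlock described above can be modeled in a few lines. This is a toy simulation, not glusterd code: the queues stand in for the RPC channel, and the names are illustrative. A single-threaded glusterd would block at the final wait before ever reading the PORTMAP query; with a separate thread servicing that queue, the cycle breaks.

```python
import threading
import queue

portmap_q = queue.Queue()   # daemon -> glusterd: PORTMAP queries
answer_q = queue.Queue()    # glusterd -> daemon: assigned ports
started = threading.Event()

def daemon():
    # simulated glusterfsd: needs a port before "finishing daemonizing"
    portmap_q.put("PORTMAP?")
    answer_q.get()
    started.set()

def portmap_thread():
    # a dedicated thread can answer the query even while the main
    # path is blocked waiting for the daemon to come up
    portmap_q.get()
    answer_q.put(49152)

threading.Thread(target=daemon, daemon=True).start()
threading.Thread(target=portmap_thread, daemon=True).start()

# A single-threaded glusterd would block here *before* ever reading
# portmap_q -- neither side could make progress. With the extra
# thread, the wait completes and "volume start" can report the truth.
ok = started.wait(timeout=5)
print("volume start:", "success" if ok else "deadlock")
```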
<div><br></div><div>But this is yet another example of how retrofitting threads onto a single-threaded program can cause problems. It's not unusual to see races. Most of them are fixable by applying a general locking scheme in a few places.</div>
<div><br></div><div>That being said, I'm open to exploring other projects that are a good fit with the rest of glusterfs. It would certainly be nice to make it "someone else's problem".</div>
<div><br></div><div>Avati</div><div><br></div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> Delegating as much as<br>
possible of this functionality to mature code that is mostly maintained<br>
elsewhere would be very beneficial. I've done some research since those<br>
meetings, and here are some results.<br>
<br>
The most basic idea here is to use an existing coordination service to<br>
store cluster configuration and state. That service would then take<br>
responsibility for maintaining availability and consistency of the data<br>
under its care. The best known example of such a coordination service<br>
is Apache's ZooKeeper[1], but there are others that don't have the<br>
noxious Java dependency - e.g. doozer[2] written in Go, Arakoon[3]<br>
written in OCaml, ConCoord[4] written in Python. These all provide a<br>
tightly consistent generally-hierarchical namespace for relatively small<br>
amounts of data. In addition, there are two other features that might<br>
be useful.<br>
<br>
* Watches: register for notification of changes to an object (or<br>
directory/container), without having to poll.<br>
<br>
* Ephemerals: certain objects go away when the client that created them<br>
drops its connection to the server(s).<br>
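To make the two features above concrete, here is a minimal in-memory sketch (it mimics the semantics, not any real ZooKeeper or doozer API): watches fire on a change to a path, and ephemerals vanish when the session that created them closes.

```python
class CoordService:
    """Toy coordination service with one-shot watches and ephemerals."""

    def __init__(self):
        self.data = {}      # path -> value
        self.owner = {}     # path -> session id, for ephemeral nodes
        self.watches = {}   # path -> list of callbacks

    def set(self, path, value, session=None):
        self.data[path] = value
        if session is not None:
            self.owner[path] = session          # ephemeral node
        for cb in self.watches.pop(path, []):   # watches are one-shot
            cb(path, value)

    def watch(self, path, cb):
        self.watches.setdefault(path, []).append(cb)

    def close_session(self, session):
        # ephemeral semantics: everything this session created goes away,
        # and watchers are told about the deletion
        for path in [p for p, s in self.owner.items() if s == session]:
            del self.data[path]
            del self.owner[path]
            for cb in self.watches.pop(path, []):
                cb(path, None)

svc = CoordService()
events = []
svc.watch("/cluster/peers/server1", lambda p, v: events.append((p, v)))
svc.set("/cluster/peers/server1", "alive", session="s1")   # fires watch
svc.watch("/cluster/peers/server1", lambda p, v: events.append((p, v)))
svc.close_session("s1")   # server1's connection drops; watch sees None
print(events)
```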
<br>
Here's a rough sketch of how we'd use such a service.<br>
<br>
* Membership: a certain small set of servers (three or more) would be<br>
manually set up as coordination-service masters (e.g. via "peer probe<br>
xxx as master"). Other servers would connect to these masters, which<br>
would use ephemerals to update a "cluster map" object. Both clients and<br>
servers could set up watches on the cluster map object to be notified of<br>
servers joining and leaving.<br>
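A sketch of that membership flow, with an invented API (only the cluster-map object and the join/leave semantics come from the proposal above): each server's presence is backed by an ephemeral, and watchers are re-notified with the full map on every change.

```python
class ClusterMap:
    """Toy cluster-map object; joins are backed by per-server ephemerals."""

    def __init__(self):
        self.members = set()
        self.watchers = []

    def join(self, server):
        # in the real design, an ephemeral created by the server's session
        self.members.add(server)
        self._notify()

    def leave(self, server):
        # the ephemeral expiring when the server's connection drops
        self.members.discard(server)
        self._notify()

    def watch(self, cb):
        self.watchers.append(cb)

    def _notify(self):
        for cb in self.watchers:
            cb(sorted(self.members))

cmap = ClusterMap()
seen = []
cmap.watch(lambda members: seen.append(members))  # client or server watch
cmap.join("server2")
cmap.join("server3")
cmap.leave("server2")     # e.g. server2's session drops
print(seen[-1])
```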
<br>
* Configuration: the information we currently store in each volume's<br>
"info" file as the basis for generating volfiles (and perhaps the<br>
volfiles themselves) would be stored in the configuration service.<br>
Again, servers and clients could set watches on these objects to be<br>
notified of changes and do the appropriate graph switches, reconfigures,<br>
quorum actions, etc.<br>
<br>
* Maintenance operations: these would still run in glusterd (which isn't<br>
going away). They would use the coordination service for leader election to<br>
make sure the same activity isn't started twice, and to keep status<br>
updated in a way that allows other nodes to watch for changes.<br>
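The leader-election part can follow the classic "lowest sequence number wins" recipe (the approach popularized by ZooKeeper's sequential ephemerals; the class below is a stand-in, not a client library): each glusterd enters the election with an ephemeral sequential node, the lowest holder runs the maintenance operation, and if it dies its ephemeral vanishes and the next lowest takes over.

```python
import itertools

class Election:
    """Toy model of sequential-ephemeral leader election."""

    _seq = itertools.count()

    def __init__(self):
        self.candidates = {}   # seq -> node name (ephemeral sequential)

    def enter(self, node):
        seq = next(self._seq)
        self.candidates[seq] = node
        return seq

    def leader(self):
        # lowest sequence number leads, so at most one node starts
        # a given maintenance operation at a time
        return self.candidates[min(self.candidates)]

    def drop(self, seq):
        # session loss removes the ephemeral; leadership fails over
        # without a fresh election round
        del self.candidates[seq]

e = Election()
s1 = e.enter("glusterd-a")
s2 = e.enter("glusterd-b")
first = e.leader()     # glusterd-a runs the operation alone
e.drop(s1)             # leader dies mid-operation
second = e.leader()    # glusterd-b takes over automatically
print(first, "->", second)
```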
<br>
* Status queries: these would be handled entirely by querying objects<br>
within the coordination service.<br>
<br>
Of the alternatives available to us, only ZooKeeper directly supports<br>
all of the functionality we'd want. However, the Java dependency is<br>
decidedly unpleasant for us and would be totally unacceptable to some of<br>
our users. Doozer seems the closest of the remainder; it supports<br>
watches but not ephemerals, so we'd either have to synthesize those on<br>
top of doozer itself or find another way to handle membership (the only<br>
place where we use that functionality) based on the features it does<br>
have. The project also seems reasonably mature and active, though we'd<br>
probably still have to devote some time to developing our own local<br>
doozer expertise.<br>
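One plausible way to synthesize ephemerals on top of a store that lacks them (an assumption on my part, not an established doozer recipe) is heartbeat keys with a TTL: each member keeps refreshing a timestamped key, and anyone computing membership simply ignores keys whose heartbeat is stale.

```python
HEARTBEAT = 1.0          # refresh interval (seconds); illustrative values
TTL = 3 * HEARTBEAT      # a member is dead after three missed beats

store = {}               # stand-in for the doozer keyspace

def heartbeat(member, now):
    # each member periodically rewrites its own timestamp key
    store["/members/" + member] = now

def live_members(now):
    # membership = keys whose heartbeat is fresher than the TTL;
    # stale keys are treated as if the ephemeral had expired
    return sorted(path.rsplit("/", 1)[1]
                  for path, ts in store.items() if now - ts <= TTL)

heartbeat("server1", now=100.0)
heartbeat("server2", now=102.5)
print(live_members(now=103.0))   # both heartbeats still fresh
print(live_members(now=105.0))   # server1 missed its beats
```

A reaper could also delete the stale keys outright so that watchers (if synthesized separately) see an explicit deletion, at the cost of one more moving part.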
<br>
In a similar vein, another possibility would be to use *ourselves* as<br>
the coordination service, via a hand-configured AFR volume. This is<br>
actually an approach Kaleb and I were seriously considering for HekaFS<br>
at the time of the acquisition, and it's not without its benefits.<br>
Using libgfapi we can prevent this special volume from having to be<br>
mounted, and we already know how to secure the communications paths for<br>
it (something that would require additional work with the other<br>
solutions). On the other hand, it would probably require additional<br>
translators to provide both ephemerals and watches, and might require<br>
its own non-glusterd solution to issues like failure detection and<br>
self-heal, so it doesn't exactly meet the "make it somebody else's<br>
problem" criterion.<br>
<br>
In conclusion, I think our best (long term) way forward would be to<br>
prototype a doozer-based version of glusterd. I could possibly be<br>
persuaded to try a "gluster on gluster" approach instead, but at this<br>
moment it wouldn't be my first choice. Are there any other suggestions<br>
or objections before I forge ahead?<br>
<br>
[1] <a href="http://zookeeper.apache.org/" target="_blank">http://zookeeper.apache.org/</a><br>
[2] <a href="https://github.com/ha/doozerd" target="_blank">https://github.com/ha/doozerd</a><br>
[3] <a href="http://arakoon.org/" target="_blank">http://arakoon.org/</a><br>
[4] <a href="http://openreplica.org/doc/" target="_blank">http://openreplica.org/doc/</a><br>
<br>
_______________________________________________<br>
Gluster-devel mailing list<br>
<a href="mailto:Gluster-devel@nongnu.org">Gluster-devel@nongnu.org</a><br>
<a href="https://lists.nongnu.org/mailman/listinfo/gluster-devel" target="_blank">https://lists.nongnu.org/mailman/listinfo/gluster-devel</a><br>
</blockquote></div><br>