<br><br><div class="gmail_quote">On Tue, May 8, 2012 at 4:08 PM, Ian Latter <span dir="ltr">&lt;<a href="mailto:ian.latter@midnightcode.org" target="_blank">ian.latter@midnightcode.org</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div class="HOEnZb"><div class="h5">&gt; On 05/08/2012 12:27 AM, Ian Latter wrote:<br>

&gt; &gt; The equivalent configuration in a glusterd world (from<br>

&gt; &gt; my experiments) pushed all of the distribute knowledge<br>

&gt; &gt; out to the client and I haven&#39;t had a response as to how<br>

&gt; &gt; to add a replicate on distributed volumes in this model,<br>

&gt; &gt; so I&#39;ve lost replicate.<br>

&gt;<br>

&gt; This doesn&#39;t seem to be a problem with replicate-first vs.<br>

distribute-first,<br>

&gt; but with client-side vs. server-side deployment of those<br>

translators.  You<br>

&gt; *can* construct your own volfiles that do these things on<br>

the servers.  It will<br>

&gt; work, but you won&#39;t get a lot of support for it.  The<br>

issue here is that we<br>

&gt; have only a finite number of developers, and a<br>

near-infinite number of<br>

&gt; configurations.  We can&#39;t properly qualify everything.<br>

One way we&#39;ve tried to<br>

&gt; limit that space is by preferring distribute over<br>

replicate, because replicate<br>

&gt; does a better job of shielding distribute from brick<br>

failures than vice versa.<br>

&gt; Another is to deploy both on the clients, following the<br>

scalability rule of<br>

&gt; pushing effort to the most numerous components.  The code<br>

can support other<br>

&gt; arrangements, but the people might not.<br>

<br>

</div></div>Sure, I have my own vol files that do (did) what I wanted<br>

and I was supporting myself (and users); the question<br>

(and the point) is what is the GlusterFS *intent*?</blockquote><div><br>The &quot;intent&quot; (more or less; I hate to use the word, as it can imply a commitment to what I am about to say, and there isn&#39;t one) is to keep the bricks (the server processes) dumb and have the intelligence on the client side. This is a &quot;rough goal&quot;. There are cases where replication on the server side is inevitable (as with NFS access), but we keep the software architecture undisturbed by running a client process on the server machine to achieve it.<br>

<br>We do plan to support &quot;replication on the server&quot; in the future while retaining the existing software architecture as much as possible. This is particularly useful in a Hadoop environment, where jobs expect the write performance of a single copy and expect the extra copies to be made in the background. We now have the proactive self-heal daemon running on the server machines (again a client process that happens to be physically placed on the server), which gives us many interesting possibilities. For example, with simple changes we could fool the client-side replicate translator, at the time of transaction initiation, into believing that only the closest server is up, so that it writes to that server alone, and have the proactive self-heal daemon perform the extra copies in the background. This would remain consistent for other readers, as they get directed to the &quot;right&quot; version of the file by inspecting the changelogs while the background replication is in progress.<br>
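<br>As a rough illustration of that idea, here is a toy Python model of "write one copy now, heal in the background". It is only a sketch under stated assumptions: the class and method names (Replica, ToyReplicate, write, read, self_heal) are hypothetical and are not GlusterFS APIs, and the "changelog" is reduced to a per-file set of replicas that still miss the latest write.<br>
<pre>
```python
# Toy model only: every name here is hypothetical, not a GlusterFS API.
class Replica:
    def __init__(self, name):
        self.name = name
        self.data = {}      # path to file contents
        self.pending = {}   # path to names of replicas that still miss this write


class ToyReplicate:
    def __init__(self, replicas):
        self.replicas = replicas

    def write(self, path, contents, closest):
        # Behave as if only the closest replica were up: write a single copy
        # and record in its changelog which other replicas are now stale.
        closest.data[path] = contents
        closest.pending[path] = {r.name for r in self.replicas if r is not closest}

    def read(self, path):
        # Inspect the changelogs: a replica that another replica holds a
        # pending entry against is stale for this path, so skip it.
        stale = set()
        for r in self.replicas:
            stale |= r.pending.get(path, set())
        for r in self.replicas:
            if r.name not in stale and path in r.data:
                return r.data[path]
        raise FileNotFoundError(path)

    def self_heal(self):
        # Background pass: copy from the good replica to the stale ones,
        # then clear the changelog entry.
        by_name = {r.name: r for r in self.replicas}
        for source in self.replicas:
            for path, targets in list(source.pending.items()):
                for name in targets:
                    by_name[name].data[path] = source.data[path]
                del source.pending[path]
```
</pre>
In this sketch a reader that would otherwise hit the stale replica first is steered to the fresh copy until self_heal() has run, which is the consistency property described above.<br>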

<br>The intention of the above example is to give a general sense of how we want to evolve the architecture (i.e., the &quot;intention&quot; you were referring to): keep the clients intelligent and the servers dumb. If some intelligence needs to be built on the physical server, tackle it by loading a client process there. (There are also internal techniques, of the &quot;pathinfo xattr&quot; kind, for figuring out the locality of clients in a generic way without bringing &quot;server sidedness&quot; into them in a harsh way.)<br>

<br><br></div><blockquote class="gmail_quote" style="margin:0pt 0pt 0pt 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">  I&#39;ll<br>

write an rsyncd wrapper myself, to run on top of Gluster,<br>

if the intent is not allow the configuration I&#39;m after<br>

(arbitrary number of disks in one multi-host environment<br>

replicated to an arbitrary number of disks in another<br>

multi-host environment, where ideally each environment<br>

need not sum to the same data capacity, presented in a<br>

single contiguous consumable storage layer to an<br>

arbitrary number of unintelligent clients, that is as fault<br>

tolerant as I choose it to be including the ability to add<br>

and offline/online and remove storage as I so choose) ..<br>

or switch out the whole solution if Gluster is heading<br>

away from my  needs.  I just need to know what the<br>

direction is .. I may even be able to help get you there if<br>

you tell me :)<br>

<div class="im"><br></div></blockquote><div><br>There are good and bad points in both styles (distribute on top vs. replicate on top). Replicate on top gives you much better flexibility of configuration. Distribute on top is easier for us developers. As a user I would like replicate on top as well. But the problem today is that replicate (and self-heal) does not understand &quot;partial failure&quot; of its subvolumes. If one of the subvolumes of replicate is a distribute, then today&#39;s replicate either sees a complete failure of the distribute set or assumes everything is completely fine. An example is self-healing of directory entries. If a file is &quot;missing&quot; in one subvolume because a distribute node is temporarily down, replicate has no clue why it is missing (or that it should keep away from attempting to self-heal). Along the same lines, it does not know that once a server is removed from its distribute subvolume for good, it needs to start recreating the missing files.<br>
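<br>The directory-entry case can be shown with a small toy model in Python. This is a sketch under stated assumptions, not how replicate or distribute are implemented: ToyDistribute and naive_entry_heal_plan are hypothetical names, files are just names in a set, and the "heal plan" only lists what a replicate layer unaware of partial failure would consider missing.<br>
<pre>
```python
# Toy model only: every name here is hypothetical, not a GlusterFS API.
class ToyDistribute:
    def __init__(self, nodes):
        self.nodes = nodes   # node name to set of file names stored there
        self.down = set()    # names of nodes that are currently unreachable

    def listing(self):
        # Only reachable nodes contribute entries, so files that live on a
        # down node silently vanish from the merged directory listing.
        names = set()
        for node, files in self.nodes.items():
            if node not in self.down:
                names |= files
        return names


def naive_entry_heal_plan(left, right):
    # What a replicate layer unaware of partial failure would conclude:
    # any name present on one side only is "missing" on the other side.
    a, b = left.listing(), right.listing()
    return {"copy_to_right": a - b, "copy_to_left": b - a}
```
</pre>
With a left set of two nodes holding "a" and "b" and a right set holding both, the plan is empty while everything is up. Mark the node holding "b" as down and the plan asks to recreate "b" on the left, even though the file is intact on the unreachable node.<br>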

<br>The effort to fix this seems to be big enough to disturb the inertia of the status quo. If this is fixed, we can definitely adopt a replicate-on-top mode in glusterd.<br><br>Avati<br></div></div>