<div dir="ltr"><br><div class="gmail_extra"><br><br><div class="gmail_quote">On Thu, Mar 6, 2014 at 11:19 AM, Krishnan Parthasarathi <span dir="ltr">&lt;<a href="mailto:kparthas@redhat.com" target="_blank">kparthas@redhat.com</a>&gt;</span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class=""><br>

<br>

----- Original Message -----<br>

&gt; On Thu, Mar 6, 2014 at 12:21 AM, Vijay Bellur &lt;<a href="mailto:vbellur@redhat.com">vbellur@redhat.com</a>&gt; wrote:<br>

&gt;<br>

&gt; &gt; Adding gluster-devel.<br>

&gt; &gt;<br>

&gt; &gt;<br>

&gt; &gt; On 03/06/2014 01:15 PM, Krishnan Parthasarathi wrote:<br>

&gt; &gt;<br>

&gt; &gt;&gt; All,<br>

&gt; &gt;&gt;<br>

&gt; &gt;&gt; In recent discussions around design (and implementation) of the barrier<br>

&gt; &gt;&gt; feature, couple of things came to light.<br>

&gt; &gt;&gt;<br>

&gt; &gt;&gt; 1) changelog xlator needs barrier xlator to block unlink and rename FOPs<br>

&gt; &gt;&gt;     in the call path. This is apart from the current list of FOPs that are blocked<br>

&gt; &gt;&gt;     in their call back path.<br>

&gt; &gt;&gt;     This is to make sure that the changelog has a bounded queue of unlink and rename FOPs,<br>

&gt; &gt;&gt;     from the time barriering is enabled, to be drained, committed to changelog file and published.<br>

&gt; &gt;&gt;<br>

&gt; &gt;<br>

&gt; Why is this necessary?<br>

<br>

</div>The only consumer of changelog today, georeplication, can&#39;t tolerate missing unlink/rename<br>

entries from changelog, even with the initial xsync based crawl, until changelog entries<br>

are available for the master volume.<br>

So, changelog xlator needs to ensure that the last rotated<br>

(publishable) changelog should have entries for all the unlink(s)/rename(s) that made<br>

it to the snapshot. For this, changelog needs barrier xlator to block unlink/rename<br>

FOPs in the call path too. Hope that helps.<br></blockquote><div><br></div><div>This sounds like a very changelog-specific requirement, and it is best addressed in the changelog translator itself. If unlink/rmdir/rename should not be &quot;in progress&quot; during a snapshot, then we need to hold off new ops in the call path and trigger a log rotation; the rotation should wait for ongoing fops to complete anyway.</div>
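<div><br></div><div>To make that sequence concrete, here is a minimal Python sketch of the idea, not GlusterFS code. The names EntryFopGate, run_fop and rotate_for_snapshot are made up for illustration, and plain lists stand in for the changelog files.</div>
<pre>
```python
import threading


class EntryFopGate:
    """Hypothetical gate for unlink/rmdir/rename; not GlusterFS code."""

    def __init__(self):
        self._cond = threading.Condition()
        self._holding = False   # True while new fops are held in the call path
        self._in_flight = 0     # fops that started but have not completed yet
        self.log = []           # stand-in for the current changelog
        self.published = []     # stand-in for rotated (publishable) changelogs

    def run_fop(self, name, apply_fn):
        with self._cond:
            # Call path: hold off new fops while a rotation is in progress.
            while self._holding:
                self._cond.wait()
            self._in_flight += 1
        try:
            apply_fn()
            with self._cond:
                # Record the fop only once it has succeeded.
                self.log.append(name)
        finally:
            with self._cond:
                self._in_flight -= 1
                self._cond.notify_all()

    def rotate_for_snapshot(self, take_snapshot):
        with self._cond:
            self._holding = True
            # The rotation waits for ongoing fops to complete.
            while self._in_flight != 0:
                self._cond.wait()
            # Everything that reached the disk is now in the rotated log.
            self.published.append(list(self.log))
            self.log.clear()
        try:
            take_snapshot()
        finally:
            with self._cond:
                # Let the held fops proceed once the snapshot is taken.
                self._holding = False
                self._cond.notify_all()
```
</pre>
<div>With this shape, a fop that arrives while the snapshot is being taken stays parked in run_fop and only lands in the next changelog, which is the property the rotated changelog needs.</div>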

<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div><div class="h5"><br>

&gt;<br>

&gt;<br>

&gt; &gt;&gt; 2) It is possible in a pure distribute volume that the following sequence of FOPs could result<br>

&gt; &gt;&gt;     in snapshots of bricks disagreeing on inode type for a file or directory.<br>

&gt; &gt;&gt;<br>

&gt; &gt;&gt;     t1: snap b1<br>

&gt; &gt;&gt;     t2: unlink /a<br>

&gt; &gt;&gt;     t3: mkdir /a<br>

&gt; &gt;&gt;     t4: snap b2<br>

&gt; &gt;&gt;<br>

&gt; &gt;&gt; where, b1 and b2 are bricks of a pure distribute volume V.<br>

&gt; &gt;&gt;<br>

&gt; &gt;&gt; The above sequence can happen with the current barrier xlator design, since we allow unlink FOPs<br>

&gt; &gt;&gt; to go through to the disk and only block their acknowledgement to the application. This implies<br>

&gt; &gt;&gt; a concurrent mkdir on the same name could succeed, since DHT doesn&#39;t serialize unlink and mkdir FOPs,<br>

&gt; &gt;&gt; unlike AFR.<br>

&gt; &gt;&gt;<br>

&gt; &gt;&gt; Avati,<br>

&gt; &gt;&gt;<br>

&gt; &gt;&gt; I hear that you have a solution for problem 2). Could you please start the discussion on this thread?<br>

&gt; &gt;&gt; It would help us to decide how to go about with the barrier xlator implementation.<br>

&gt; &gt;&gt;<br>

&gt; &gt;<br>

&gt;<br>

&gt; The solution is really a long pending implementation of dentry<br>

&gt; serialization in the resolver of protocol server. Today we allow multiple<br>

&gt; FOPs to happen in parallel which modify the same dentry. This results in<br>

&gt; hairy races (including non atomicity of rename) and has been kept open for<br>

&gt; a while now. Implementing the dentry serialization in the resolver will<br>

&gt; &quot;solve&quot; 2 as a side effect. Hence that is a better approach than making<br>

&gt; changes in the barrier translator.<br>

&gt;<br>

<br>

</div></div>I am not sure I understood how this works from the brief introduction above.<br>

Could you explain a bit?<br></blockquote><div><br></div><div>By dentry serialization, I mean that only one operation should be modifying a given &lt;pargfid&gt;/bname at any time. This needs changes in the resolver of the protocol server and possibly some changes in the inode table. It really addresses rare races, and I think it is something we need to work on independently of the snapshot requirements.</div>
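<div><br></div><div>As a rough illustration of what I mean, here is a minimal Python sketch of a per-dentry lock table keyed by (pargfid, bname). It is not the resolver code; DentrySerializer and modify are hypothetical names, and the real change would live in the protocol server resolver and the inode table.</div>
<pre>
```python
import threading


class DentrySerializer:
    """Hypothetical per-dentry lock table; not GlusterFS code."""

    def __init__(self):
        self._table_lock = threading.Lock()
        self._locks = {}   # (pargfid, bname) mapped to [lock, user count]

    def _acquire(self, key):
        with self._table_lock:
            entry = self._locks.setdefault(key, [threading.Lock(), 0])
            entry[1] += 1          # count holders and waiters alike
        entry[0].acquire()         # block outside the table lock

    def _release(self, key):
        with self._table_lock:
            entry = self._locks[key]
            entry[0].release()
            entry[1] -= 1
            if entry[1] == 0:
                # Nobody holds or waits on this dentry any more.
                del self._locks[key]

    def modify(self, pargfid, bname, fop):
        """Run fop() while no other fop modifies the same dentry."""
        key = (pargfid, bname)
        self._acquire(key)
        try:
            return fop()
        finally:
            self._release(key)
```
</pre>
<div>With something like this in place, an unlink and a mkdir on the same name under the same parent would run one after the other rather than concurrently. A rename touches two dentries, so it would presumably need both keys taken in a fixed order to avoid deadlock; that part is left out of the sketch.</div>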

<div><br></div><div>Avati</div><div><br></div></div></div></div>