<div dir="ltr"><br><div class="gmail_extra"><br><br><div class="gmail_quote">On Thu, Mar 6, 2014 at 11:19 AM, Krishnan Parthasarathi <span dir="ltr"><<a href="mailto:kparthas@redhat.com" target="_blank">kparthas@redhat.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class=""><br>
<br>
----- Original Message -----<br>
> On Thu, Mar 6, 2014 at 12:21 AM, Vijay Bellur <<a href="mailto:vbellur@redhat.com">vbellur@redhat.com</a>> wrote:<br>
><br>
> > Adding gluster-devel.<br>
> ><br>
> ><br>
> > On 03/06/2014 01:15 PM, Krishnan Parthasarathi wrote:<br>
> ><br>
> >> All,<br>
> >><br>
> >> In recent discussions around design (and implementation) of the barrier<br>
> >> feature, a couple of things came to light.<br>
> >><br>
> >> 1) The changelog xlator needs the barrier xlator to block unlink and rename<br>
> >> FOPs in the call path. This is in addition to the current list of FOPs that<br>
> >> are blocked in their callback path. This ensures that the changelog has a<br>
> >> bounded queue of unlink and rename FOPs, from the time barriering is enabled,<br>
> >> to be drained, committed to the changelog file and published.<br>
> >><br>
> ><br>
> Why is this necessary?<br>
<br>
</div>The only consumer of changelog today, geo-replication, can't tolerate missing unlink/rename<br>
entries from changelog, even with the initial xsync-based crawl, until changelog entries<br>
are available for the master volume.<br>
So, the changelog xlator needs to ensure that the last rotated<br>
(publishable) changelog has entries for all the unlink(s)/rename(s) that made<br>
it to the snapshot. For this, changelog needs barrier xlator to block unlink/rename<br>
FOPs in the call path too. Hope that helps.<br></blockquote><div><br></div><div>This sounds like a very changelog-specific requirement, and it is best addressed in the changelog translator itself. If unlink/rmdir/rename operations should not be "in progress" during a snapshot, then we need to hold off new ops in the call path and trigger a log rotation; the rotation should wait for completion of ongoing fops anyway.</div>
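<div><br></div><div>To make the sequence concrete, here is a rough single-threaded sketch in Python (not the actual changelog translator code, which is C). The class and method names are hypothetical and exist only for illustration; the point is just the ordering: hold new ops, let in-flight ops finish, and only then rotate.</div>
<pre>
```python
class ChangelogGate:
    """Toy model: hold off new entry fops, drain in-flight ones, then rotate.

    All names here are hypothetical; nothing maps to real GlusterFS symbols.
    """

    def __init__(self):
        self.holding = False         # True while new fops are held in the call path
        self.rotate_pending = False  # True once a rotation has been requested
        self.in_flight = set()       # fops wound down but not yet completed
        self.held = []               # fops parked in the call path
        self.current = []            # entries in the active changelog
        self.published = []          # rotated (publishable) changelogs

    def submit(self, fop):
        """A new unlink/rmdir/rename arrives in the call path."""
        if self.holding:
            self.held.append(fop)
            return "held"
        self.in_flight.add(fop)
        return "wound"

    def complete(self, fop):
        """An in-flight fop finishes and is recorded in the active changelog."""
        self.in_flight.remove(fop)
        self.current.append(fop)
        self._maybe_rotate()

    def request_rotation(self):
        """Snapshot is about to happen: stop admitting fops and ask for a rotation."""
        self.holding = True
        self.rotate_pending = True
        self._maybe_rotate()

    def _maybe_rotate(self):
        # The rotation only goes ahead once every ongoing fop has completed.
        if self.rotate_pending and not self.in_flight:
            self.published.append(self.current)
            self.current = []
            self.rotate_pending = False

    def release(self):
        """Snapshot is done: resume the fops that were held in the call path."""
        self.holding = False
        parked, self.held = self.held, []
        for fop in parked:
            self.submit(fop)
```
</pre>
<div>In this model, an unlink that was already in flight when the rotation was requested ends up in the published changelog, while a rename submitted afterwards is held and only wound after release.</div>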
<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div><div class="h5"><br>
><br>
><br>
> >> 2) It is possible in a pure distribute volume that the following sequence<br>
> >> of FOPs could result in snapshots of bricks disagreeing on the inode type<br>
> >> for a file or directory.<br>
> >><br>
> >> t1: snap b1<br>
> >> t2: unlink /a<br>
> >> t3: mkdir /a<br>
> >> t4: snap b2<br>
> >><br>
> >> where, b1 and b2 are bricks of a pure distribute volume V.<br>
> >><br>
> >> The above sequence can happen with the current barrier xlator design, since<br>
> >> we allow unlink FOPs to go through to the disk and only block their<br>
> >> acknowledgement to the application. This implies that a concurrent mkdir on<br>
> >> the same name could succeed, since DHT, unlike AFR, doesn't serialize unlink<br>
> >> and mkdir FOPs.<br>
> >><br>
> >> Avati,<br>
> >><br>
> >> I hear that you have a solution for problem 2). Could you please start<br>
> >> the discussion on this thread?<br>
> >> It would help us to decide how to go about with the barrier xlator<br>
> >> implementation.<br>
> >><br>
> ><br>
><br>
> The solution is really a long-pending implementation of dentry<br>
> serialization in the resolver of the protocol server. Today we allow multiple<br>
> FOPs that modify the same dentry to happen in parallel. This results in<br>
> hairy races (including non-atomicity of rename) and has been kept open for<br>
> a while now. Implementing the dentry serialization in the resolver will<br>
> "solve" 2 as a side effect. Hence that is a better approach than making<br>
> changes in the barrier translator.<br>
><br>
<br>
</div></div>I am not sure I understood how this works from the brief introduction above.<br>
Could you explain a bit?<br></blockquote><div><br></div><div>By dentry serialization, I mean we should have only one operation modifying a given &lt;pargfid&gt;/bname at any time. This needs changes in the resolver of the protocol server and possibly some changes in the inode table. This is really for solving rare races, and I think it is something we need to work on independently of the snapshot requirements.</div>
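<div><br></div><div>As a very rough illustration of the idea (Python for brevity; the real change would be C in the resolver and inode table, and every name below is hypothetical), the resolver would keep at most one active modifying operation per (pargfid, bname) pair and queue the rest:</div>
<pre>
```python
from collections import deque


class DentrySerializer:
    """Toy model: at most one modifying op per (pargfid, bname) at a time.

    Hypothetical sketch only; not the protocol server's resolver.
    """

    def __init__(self):
        # Key present means an op currently owns that dentry;
        # the deque holds the ops waiting for it, in arrival order.
        self.busy = {}

    def begin(self, pargfid, bname, op):
        """Return True if op may proceed now, False if it has been queued."""
        key = (pargfid, bname)
        if key in self.busy:
            self.busy[key].append(op)
            return False
        self.busy[key] = deque()
        return True

    def end(self, pargfid, bname):
        """The owning op finished; return the next queued op, or None."""
        key = (pargfid, bname)
        waiters = self.busy[key]
        if waiters:
            # Ownership of the dentry passes directly to the next waiter.
            return waiters.popleft()
        del self.busy[key]
        return None
```
</pre>
<div>With something like this in place, the mkdir of /a in the t1..t4 sequence would be queued behind the unlink of /a instead of racing with it, while operations on other names under the same parent are not affected.</div>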
<div><br></div><div>Avati</div><div><br></div></div></div></div>