<br><br><div class="gmail_quote">On Tue, Feb 19, 2013 at 6:11 PM, Pranith Kumar K <span dir="ltr">&lt;<a href="mailto:pkarampu@redhat.com" target="_blank">pkarampu@redhat.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

  <div bgcolor="#FFFFFF" text="#000000"><div><div class="h5">

    <div>On 02/20/2013 07:03 AM, Anand Avati

      wrote:<br>

    </div>

    <blockquote type="cite"><br>

      <br>

      <div class="gmail_quote">On Tue, Feb 19, 2013 at 5:12 PM, Anand

        Avati <span dir="ltr">&lt;<a href="mailto:anand.avati@gmail.com" target="_blank">anand.avati@gmail.com</a>&gt;</span>

        wrote:<br>

        <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

          <br>

          <br>

          <div class="gmail_quote">

            <div>

              <div>On Tue, Feb 19, 2013 at 3:59 AM, Pranith

                Kumar K <span dir="ltr">&lt;<a href="mailto:pkarampu@redhat.com" target="_blank">pkarampu@redhat.com</a>&gt;</span>

                wrote:<br>

                <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

                  <div bgcolor="#FFFFFF" text="#000000">

                    <div>

                      <div>

                        <div>On 02/19/2013 11:26 AM, Anand Avati wrote:<br>

                        </div>

                        <blockquote type="cite">

                          <p>Thinking over this, looks like there is a

                            problem!</p>

                          <p>Write-behind guarantees: That a second

                            write request arriving after the

                            acknowledgement of a first overlapping

                            request (whether written-behind or

                            otherwise) will be guaranteed to be

                            fulfilled in the backend in the same order

                            (i.e., the second overlapping request will be

                            &quot;serialized&quot; behind the first one in the

                            fulfillment process)</p>

                          <p>Eager-lock requirement: That write-behind

                            will send no two write requests on an

                            overlapping region at the same time.</p>

                          <p>The requirement-set and guarantee-set have

                            a big overlap, but the requirement-set is

                            not a subset.</p>

                          <p>This is because of O_SYNC writes.

                            write-behind performs write-serialization at

                            fulfillment only for written behind requests

                            (which get covered under the conflict

                            detection code during liability

                            fulfillment). However, if two threads (or

                            apps) issue overlapping O_SYNC writes to the

                            same region at approximately the same time, then

                            write-behind will let both of them go by

                            without any kind of serialization, into

                            eager lock, violating the assumptions!</p>

                          <p>I&#39;m wondering if it is a safer idea to

                            implement overlap checks within eager-lock

                            code itself rather than depend on

                            write-behind :|</p>
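                          <p>A minimal sketch of what such an overlap check
                            could look like, written in Python for brevity
                            (the real xlators are C). The names WriteReq,
                            overlaps and OverlapGate are hypothetical and
                            are not taken from the AFR or write-behind
                            sources. The sketch only models the idea of
                            holding back a write while an overlapping one is
                            still in flight.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WriteReq:
    """Hypothetical write request: a named byte range [offset, offset + length)."""
    name: str
    offset: int
    length: int

    @property
    def end(self):
        return self.offset + self.length  # exclusive end of the range


def overlaps(a, b):
    """True if the two half-open byte ranges share at least one byte."""
    return b.end > a.offset and a.end > b.offset


class OverlapGate:
    """Lets a write through only if no earlier overlapping write is pending."""

    def __init__(self):
        self.in_flight = []  # requests already wound down
        self.waiting = []    # requests held back, in arrival order

    def submit(self, req):
        """Return True if req may be wound now, False if it has to wait."""
        if any(overlaps(req, other) for other in self.in_flight + self.waiting):
            self.waiting.append(req)
            return False
        self.in_flight.append(req)
        return True

    def complete(self, req):
        """Mark req as done and return the waiting requests released by it."""
        self.in_flight.remove(req)
        released, still_waiting = [], []
        for w in self.waiting:
            # A waiter stays blocked by anything in flight and by any
            # earlier waiter that is itself still blocked.
            if any(overlaps(w, o) for o in self.in_flight + still_waiting):
                still_waiting.append(w)
            else:
                self.in_flight.append(w)
                released.append(w)
        self.waiting = still_waiting
        return released
```

                          <p>With this model, a second write to the same
                            region is queued by submit() and only returned
                            from complete() once the first one has finished,
                            regardless of whether the writes are O_SYNC.</p>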

                          <p>Avati</p>

                          <br>

                          <div class="gmail_quote">On Mon, Feb 11, 2013

                            at 10:07 PM, Anand Avati <span dir="ltr">&lt;<a href="mailto:anand.avati@gmail.com" target="_blank">anand.avati@gmail.com</a>&gt;</span>

                            wrote:<br>

                            <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><br>

                              <br>

                              <div class="gmail_quote">

                                <div>On Mon, Feb 11, 2013 at 9:32 PM,

                                  Pranith Kumar K <span dir="ltr">&lt;<a href="mailto:pkarampu@redhat.com" target="_blank">pkarampu@redhat.com</a>&gt;</span>

                                  wrote:<br>

                                  <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

                                    <div bgcolor="#FFFFFF" text="#000000"> hi,<br>

                                      Please note that this is a case in

                                      theory and I did not run into such

                                      situation, but I feel it is

                                      important to address this. <br>

                                      Configuration with &quot;eager-lock on&quot;

                                      and &quot;write-behind off&quot; should not

                                      be allowed as it leads to lock

                                      synchronization problems which

                                      lead to data inconsistency among

                                      replicas in NFS.<br>

                                      Let&#39;s say bricks b1, b2 are in

                                      replication.<br>

                                      Gluster NFS server uses 1

                                      anonymous fd to perform all

                                      write-fops. If eager-lock is

                                      enabled in afr, the fd&#39;s address

                                      is used as the lock-owner, which will be

                                      the same for all write-fops, so there

                                      will never be any inodelk

                                      contention. If write-behind is

                                      disabled, there can be writes that

                                      overlap. (Does NFS make sure that

                                      the ranges don&#39;t overlap?)<br>

                                      <br>

                                      Now imagine the following

                                      scenario:<br>

                                      Let&#39;s say w1, w2 are 2 write fops

                                      on the same offset and length, w1 with

                                      all &#39;0&#39;s and w2 with all &#39;1&#39;s. If

                                      these 2 write fops are executed in

                                      2 different threads, the order of

                                      arrival of write fops on b1 can be

                                      w1, w2 whereas on b2 it is w2, w1,

                                      leading to data inconsistency

                                      between the two replicas. The lock

                                      contention will not happen as both

                                      lk-owner and transport are the same for

                                      these 2 fops.<br>

                                    </div>

                                  </blockquote>
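                                  <div>The scenario above can be modelled in
                                    a few lines of Python. This is only an
                                    illustration, not Gluster code: each
                                    brick is a plain byte buffer, and
                                    apply_writes is a hypothetical helper
                                    that applies writes in the order given.</div>

```python
def apply_writes(size, writes):
    """Apply (offset, data) writes in order to a zero-filled buffer of `size` bytes."""
    brick = bytearray(size)
    for offset, data in writes:
        brick[offset:offset + len(data)] = data
    return bytes(brick)


# w1 and w2 target the same offset and length with different payloads.
w1 = (0, b"0" * 8)
w2 = (0, b"1" * 8)

# Without serialization, each brick may see the two writes in a different order.
b1 = apply_writes(8, [w1, w2])  # b1 ends up holding w2's data
b2 = apply_writes(8, [w2, w1])  # b2 ends up holding w1's data

print(b1, b2, b1 == b2)
```

                                  <div>The last write wins on each brick, so
                                    the two replicas end up with different
                                    contents even though both writes
                                    succeeded on both bricks.</div>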

                                  <div><br>

                                  </div>

                                </div>

                                <div>Write-behind has two functions - a)

                                  performing operations in the

                                  background and b) serializing

                                  overlapping operations.</div>

                                <div><br>

                                </div>

                                <div>While the problem does exist, the

                                  specifics are different from what you

                                  describe. Since all writes coming in

                                  from NFS will always use the same

                                  anonymous FD, two

                                  near-in-time/overlapping writes will

                                  never contend with inodelk() but

                                  instead the second write will inherit

                                  the lock and changelog from the first.

                                  In either case, it is a problem.</div>

                                <div>

                                  <div>&nbsp;</div>

                                  <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

                                    <div bgcolor="#FFFFFF" text="#000000"> We can add a check

                                      in glusterd for volume set to

                                      disallow such configuration, BUT

                                      by default write-behind is off in

                                      nfs graph and by default

                                      eager-lock is on. So we should

                                      either turn on write-behind for

                                      nfs or turn off eager-lock by

                                      default.<br>

                                      <br>

                                      Could you please suggest how to

                                      proceed with this if you agree

                                      that I did not miss any important

                                      detail that makes this theory

                                      invalid.</div>

                                  </blockquote>

                                  <div><br>

                                  </div>

                                </div>

                                <div>Loading the write-behind

                                  xlator in the NFS graph looks like the

                                  simpler solution. Eager-locking is

                                  crucial for replicated NFS write

                                  performance.</div>

                                <span><font color="#888888">

                                    <div><br>

                                    </div>

                                    <div>Avati</div>

                                  </font></span></div>

                            </blockquote>

                          </div>

                          <br>

                        </blockquote>

                      </div>

                    </div>

                    Shall we disable eager-lock for files opened with

                    O_SYNC, for now?</div>

                </blockquote>

                <div><br>

                </div>

              </div>

            </div>

            <div>Bad news: the problem is slightly worse than just this.

              Even with non-O_SYNC writes, there is a window in

              write-behind: if a second overlapping write request

              arrives so close to the first that its wb_enqueue()

              happens after the wb_enqueue() of the first

              write, but before any unwind() following the first

              wb_enqueue() (i.e. wb_inode-&gt;gen is not bumped), then

              the two write requests can be wound down together to eager

              lock.</div>
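            <div>A rough model of that window, in Python rather than the
              xlator&#39;s C. It assumes, as a simplification not confirmed
              in this thread, that each request is stamped with the
              current generation at enqueue time and that an earlier
              request only counts as a conflict if its generation is
              strictly lower. WbInode, conflicts and windable_together
              are hypothetical names for this sketch.</div>

```python
class WbInode:
    """Toy stand-in for the per-inode write-behind state."""

    def __init__(self):
        self.gen = 0
        self.queue = []  # (gen, offset, length) tuples in arrival order

    def enqueue(self, offset, length):
        """Stamp the request with the current generation and queue it."""
        req = (self.gen, offset, length)
        self.queue.append(req)
        return req

    def unwind(self):
        """Acknowledging a request bumps the generation."""
        self.gen += 1


def conflicts(earlier, later):
    """Assumed rule: only an overlapping request of a lower generation conflicts."""
    e_gen, e_off, e_len = earlier
    l_gen, l_off, l_len = later
    if e_gen >= l_gen:
        return False  # same generation: not seen as a conflict
    return l_off + l_len > e_off and e_off + e_len > l_off


def windable_together(inode):
    """Requests that have no conflicting predecessor and so can be wound now."""
    batch = []
    for i, req in enumerate(inode.queue):
        if not any(conflicts(prev, req) for prev in inode.queue[:i]):
            batch.append(req)
    return batch


# Second enqueue lands before any unwind: both overlapping writes go out together.
racy = WbInode()
first = racy.enqueue(0, 4096)
second = racy.enqueue(0, 4096)
print(windable_together(racy) == [first, second])

# With an unwind in between, the second write is held back behind the first.
safe = WbInode()
first = safe.enqueue(0, 4096)
safe.unwind()
second = safe.enqueue(0, 4096)
print(windable_together(safe) == [first])
```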

            <span><font color="#888888">

                <div><br>

                </div>

              </font></span></div>

        </blockquote>

        <div><br>

        </div>

        <div>But this has a simple fix - <a href="http://review.gluster.org/4550" target="_blank">http://review.gluster.org/4550</a>.

          Disabling eager-locking for O_SYNC files is a bad idea. We

          absolutely want eager-locking for O_SYNC files. Thinking

          more..</div>

        <div><br>

        </div>

        <div>Avati</div>

      </div>

    </blockquote></div></div>

    Why is disabling eager-lock for O_SYNC files a bad idea? It is

    acceptable to sacrifice a bit of performance for O_SYNC, isn&#39;t it?</div></blockquote><div><br></div><div>s/bit/quite a bit/. For O_SYNC writes, eager locking is the only saving grace in performance as write-behind stays out of the way completely. We would need overlap checks either in AFR or write-behind for O_SYNC writes.</div>

<div><br></div><div>Avati</div></div>