<html>

  <head>

    <meta content="text/html; charset=UTF-8" http-equiv="Content-Type">

  </head>

  <body bgcolor="#FFFFFF" text="#000000">

    On 05/05/2012 08:02 AM, Anand Avati wrote:

    <blockquote

cite="mid:CAFboF2yGMV=46SFAJe7qg5bZgX3sYeFPRcFaH_4cZBLvZ8X8qw@mail.gmail.com"

      type="cite"><br>

      <br>

      <div class="gmail_quote">On Wed, May 2, 2012 at 3:55 AM, Xavier

        Hernandez <span dir="ltr">&lt;<a moz-do-not-send="true"

            href="mailto:xhernandez@datalab.es" target="_blank">xhernandez@datalab.es</a>&gt;</span>

        wrote:<br>

        <blockquote class="gmail_quote" style="margin:0 0 0

          .8ex;border-left:1px #ccc solid;padding-left:1ex">

          Hello,<br>

          <br>

          I'm wondering if there are any requisites that translators

          must satisfy to work correctly inside glusterfs.<br>

          <br>

          In particular I need to know two things:<br>

          <br>

          1. Are translators required to respect the order in which they

          receive the requests?<br>

          <br>

          This is especially important in translators such as

          performance/io-threads or caching ones. It seems that these

          translators can reorder requests. If this is the case, is

          there any way to force some order between requests? Can

          inodelk/entrylk be used to force the order?<br>

          <br>

        </blockquote>

        <div><br>

        </div>

        <div>Translators are not expected to maintain ordering of

          requests. The only translator which takes care of ordering

          calls is write-behind. After acknowledging back write requests

          it has to make sure future requests see the true "effect" as

          though the previous write actually completed. To that end, it

          queues future "dependent" requests till the write

          acknowledgement is received from the server.</div>

        <div><br>

        </div>

        <div>inodelk/entrylk calls help achieve synchronization among

          clients (by getting into a critical section) - just like a

          mutex. It is an arbitrator. It does not help for ordering of

          two calls. If one call must strictly complete after another

          call from your translator's point of view (i.e., if it has such

          a requirement), then the latter call's STACK_WIND must happen

          in the callback of the former's STACK_UNWIND path. There are

          no guarantees maintained by the system to ensure that a second

          STACK_WIND issued right after a first STACK_WIND will complete

          and callback in the same order. Write-behind does all its

          ordering gimmicks only because it STACK_UNWINDs a write call

          prematurely and therefore must maintain the causal effects by

          means of queueing new requests behind the downcall towards the

          server.</div>

      </div>

    </blockquote>

    Good to know<br>

    <br>
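    If I understand the chaining correctly, it can be sketched with a toy
    Python model (not GlusterFS code: <code>wind</code>,
    <code>ReorderingChild</code> and the callback names are hypothetical
    stand-ins for STACK_WIND, a subvolume and the fop callbacks). The
    child deliberately completes requests in reverse order, so only the
    chained variant preserves the order:<br>
    <pre>
```python
class ReorderingChild:
    """Toy subvolume that completes queued requests in reverse order."""

    def __init__(self):
        self.pending = []

    def submit(self, name, callback):
        # Nothing completes until drain() runs, like an asynchronous fop.
        self.pending.append((name, callback))

    def drain(self):
        # Complete the most recently queued request first (worst-case reordering).
        while self.pending:
            name, callback = self.pending.pop()
            callback(name)


def wind(child, name, callback):
    """Hypothetical stand-in for STACK_WIND: hand a request to the child."""
    child.submit(name, callback)


def issue_back_to_back(child, log):
    # Two winds in a row: the completion order is up to the child.
    wind(child, "first", log.append)
    wind(child, "second", log.append)


def issue_chained(child, log):
    # Wind the second request only from the first request's callback.
    def first_cbk(name):
        log.append(name)
        wind(child, "second", log.append)

    wind(child, "first", first_cbk)


def run(issue):
    """Issue requests with the given strategy and return the completion order."""
    child, log = ReorderingChild(), []
    issue(child, log)
    child.drain()
    return log
```
    </pre>
    In this model <code>run(issue_back_to_back)</code> sees the two
    completions swapped, while <code>run(issue_chained)</code> sees them
    in the order they were issued.<br>
    <br>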

    <blockquote

cite="mid:CAFboF2yGMV=46SFAJe7qg5bZgX3sYeFPRcFaH_4cZBLvZ8X8qw@mail.gmail.com"

      type="cite">

      <div class="gmail_quote">

        <div>&nbsp;</div>

        <blockquote class="gmail_quote" style="margin:0 0 0

          .8ex;border-left:1px #ccc solid;padding-left:1ex">

          2. Are translators required to propagate callback arguments

          even if the result of the operation is an error? And what if an

          internal translator error occurs?<br>

          <br>

        </blockquote>

        <div><br>

        </div>

        <div>Usually no. If op_ret is -1, only op_errno is expected to

          be a usable value. The rest of the callback parameters are junk.</div>

        <div>&nbsp;</div>

        <blockquote class="gmail_quote" style="margin:0 0 0

          .8ex;border-left:1px #ccc solid;padding-left:1ex">

          When a translator has multiple subvolumes, I've seen that some

          arguments, such as xdata, are replaced with NULL. This can be

          understood, but are regular translators (those that only have

          one subvolume) allowed to do that or must they preserve the

          value of xdata, even in the case of an internal error?<br>

          <br>

        </blockquote>

        <div><br>

        </div>

        <div>It is best to preserve the arguments unless you know

          specifically what you are doing. In case of error, all the

          non-op_{ret,errno} arguments are typically junk, including

          xdata.</div>

        <div><br>

        </div>

        <div>&nbsp;</div>

        <blockquote class="gmail_quote" style="margin:0 0 0

          .8ex;border-left:1px #ccc solid;padding-left:1ex">

          If this is not a requisite, xdata loses its function of

          delivering back extra information.<br>

          <br>

        </blockquote>

        <div><br>

        </div>

        <div>Can you explain? Are you seeing a use case for having a

          valid xdata in the callback even with op_ret == -1?</div>

        <div><br>

        </div>

      </div>

    </blockquote>

    As a part of a translator that I'm developing that works with

    multiple subvolumes, I need to implement some healing support to

    maintain data coherency (similar to AFR). After some thought, I

    decided that it could be advantageous to use a dedicated healing

    translator located near the bottom of the translators stack on the

    servers. This translator won't work by itself; it only adds support

    used by a higher-level translator, which has to manage the

    healing logic and decide when a node needs to be healed.<br>

    <br>

    To do this, sometimes I need to return an error because an operation

    cannot be completed due to some condition related to healing

    itself (not with the underlying storage). However I need to send

    some specific healing information to let the upper translator know

    how it has to handle the detected condition.<br>

    <br>
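    As a rough illustration of the problem, here is a toy Python model
    (not GlusterFS code: the function names and the
    <code>"heal-hint"</code> key are hypothetical). The upper translator
    only learns about the healing condition if every intermediate
    translator preserves xdata when op_ret is -1:<br>
    <pre>
```python
import errno


def healing_cbk(op_ret, op_errno, xdata):
    """Upper translator: reads healing hints even when the call failed."""
    if op_ret == -1 and xdata and "heal-hint" in xdata:
        return ("heal", xdata["heal-hint"])
    if op_ret == -1:
        return ("fail", op_errno)
    return ("ok", None)


def passthrough(cbk, op_ret, op_errno, xdata):
    # Intermediate translator that preserves every callback argument.
    return cbk(op_ret, op_errno, xdata)


def dropping(cbk, op_ret, op_errno, xdata):
    # Intermediate translator that treats xdata as junk on error.
    if op_ret == -1:
        xdata = None
    return cbk(op_ret, op_errno, xdata)


# An error reply from the healing translator carrying extra information.
reply = (-1, errno.EIO, {"heal-hint": "needs-heal"})
```
    </pre>
    With <code>passthrough</code> the hint reaches the upper translator;
    with <code>dropping</code> it only sees a plain EIO.<br>
    <br>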

    I cannot send a success answer because intermediate translators

    could take the fake data as valid and they could begin to operate

    incorrectly or even create inconsistencies. The other alternative is

    to use op_errno to encode the extra data, but this will also be

    difficult, even impossible in some cases, because of the amount of data

    and the complexity of combining it with an error code without misleading

    intermediate translators with strange or invalid error codes.<br>

    <br>

    I talked with John Mark about this translator and he suggested that I

    discuss it on the list. Therefore I'll start another thread to

    explain in more detail how it works, and I would very much appreciate

    your opinion, and that of the other developers, about it. I'm especially

    interested in whether it can really be faster/safer than other solutions,

    and in any problems you find or suggestions to improve it. I think

    it could also be used by AFR and any future translator that may need

    some healing capabilities.<br>

    <br>

    Thank you very much,<br>

    <br>

    Xavi<br>

  </body>

</html>