<html>

  <head>

    <meta content="text/html; charset=UTF-8" http-equiv="Content-Type">

  </head>

  <body bgcolor="#FFFFFF" text="#000000">

    On 05/22/2012 02:11 AM, Anand Avati wrote:

    <blockquote

cite="mid:CAFboF2yivBAK7rtJvn1CTQ+xhL5-oVq3fniaDV6TfTat5UNwhA@mail.gmail.com"

      type="cite"><br>

      <br>

      <div class="gmail_quote">On Tue, May 8, 2012 at 2:34 AM, Xavier

        Hernandez <span dir="ltr">&lt;<a moz-do-not-send="true"

            href="mailto:xhernandez@datalab.es" target="_blank">xhernandez@datalab.es</a>&gt;</span>

        wrote:<br>

        <blockquote class="gmail_quote" style="margin:0 0 0

          .8ex;border-left:1px #ccc solid;padding-left:1ex">

          <div bgcolor="#FFFFFF" text="#000000"> Hello developers,<br>

            <br>

            I would like to present some ideas we are working on to

            create a new kind of translator that should be able to unify

            and simplify to some extent the healing procedures of

            complex translators.<br>

            <br>

            Currently, the only translator with complex healing

            capabilities that we are aware of is AFR. We are developing

            another translator that will also need healing capabilities,

            so we thought that it would be interesting to create a new

            translator able to handle the common part of the healing

            process and hence to simplify and avoid duplicated code in

            other translators.<br>

            <br>

            The basic idea of the new translator is to handle healing

            tasks nearer the storage translator on the server nodes

            instead of controlling everything from a translator on the

            client nodes. Of course, the heal translator cannot

            handle healing entirely by itself; it needs a client

            translator which will coordinate all tasks. The heal

            translator is intended to be used by translators that work

            with multiple subvolumes.<br>

            <br>

            I will try to explain how it works without going into too

            much detail.<br>

            <br>

            There is an important requirement for all client translators

            that use healing: they must have exactly the same list of

            subvolumes, in the same order. Currently, I think this is

            not a problem.<br>

            <br>

            The heal translator treats each file as an independent

            entity, and each one can be in 3 modes:<br>

            <br>

            1. Normal mode<br>

            <blockquote>This is the normal mode for a copy or fragment

              of a file when it is synchronized and consistent with the

              same file on other nodes (for example, with other replicas).

              It is the client translator that decides whether it is

              synchronized.<br>

            </blockquote>

            2. Healing mode<br>

            <blockquote>This is the mode used when a client detects an

              inconsistency in the copy or fragment of the file stored

              on this node and initiates the healing procedures.<br>

            </blockquote>

            3. Provider mode (I don't like this name very much, though)<br>

            <blockquote>This is the mode used by client translators when

              an inconsistency is detected in this file, but the copy or

              fragment stored in this node is considered good and it

              will be used as a source to repair the contents of this

              file on other nodes.<br>

            </blockquote>

            Initially, when a file is created, it is set in normal mode.

            Client translators that make changes must guarantee that

            they send the modification requests in the same order to all

            the servers. This should be done using inodelk/entrylk.<br>

            <br>

            When a change is sent to a server, the client must include a

            bitmap mask of the clients to which the request is being

            sent. Normally this is a bitmap containing all the clients;

            however, when a server fails for some reason, some bits will

            be cleared. The heal translator uses this bitmap to detect

            failures on other nodes early, from the point of view of

            each client. When this condition is detected, the request is

            aborted with an error and the client is notified with the

            remaining list of valid nodes. If the client considers the

            request can be successfully served with the remaining list

            of nodes, it can resend the request with the updated bitmap.<br>
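            <br>
            To make the bitmap handling more concrete, here is a minimal
            Python sketch. It is only an illustration, not the translator's
            code: the names <code>FileState</code>, <code>check_request</code>
            and <code>ALL_NODES</code> are hypothetical, and it assumes that
            the heal translator keeps one mask of valid nodes per file, which
            is narrowed whenever a client sends a smaller bitmap.<br>
            <pre>
```python
from dataclasses import dataclass

ALL_NODES = 0b111  # hypothetical volume with three subvolumes


@dataclass
class FileState:
    """Per-file view kept by the (hypothetical) heal translator."""
    valid_mask: int = ALL_NODES


def check_request(state: FileState, request_mask: int) -> tuple[bool, int]:
    """Return (accepted, mask of nodes still considered valid)."""
    if request_mask & ~state.valid_mask:
        # The client still targets a node that another client has
        # already reported as failed: abort and report what remains.
        return False, state.valid_mask
    # The client may have dropped nodes it could not reach; remember that.
    state.valid_mask = request_mask
    return True, state.valid_mask
```
            </pre>
            A rejected client would intersect its own bitmap with the
            returned mask and, if it considers the remaining nodes
            sufficient, resend the request.<br>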

            <br>

            The heal translator also updates two file attributes for

            each change request to maintain the "version" of the data and

            metadata contents of the file. A similar task is currently

            done by AFR using xattrop. This would no longer be needed,

            which would speed up write requests.<br>

            <br>

            The version of data and metadata is returned to the client

            for each read request, allowing it to detect inconsistent

            data.<br>
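            <br>
            As a small illustration of how a client could use the returned
            versions, the Python sketch below separates nodes into good and
            bad ones. It assumes that the highest version seen is the good
            one, which is only one possible policy (the decision belongs to
            the client translator); the function name
            <code>split_by_version</code> and the node names are
            hypothetical. It would be called once with the data versions and
            once with the metadata versions.<br>
            <pre>
```python
def split_by_version(versions: dict[str, int]) -> tuple[list[str], list[str]]:
    """Return (good, bad) node names, treating the highest version as good."""
    newest = max(versions.values())
    good = sorted(node for node, v in versions.items() if v == newest)
    bad = sorted(node for node, v in versions.items() if v != newest)
    return good, bad


# Versions reported by three nodes for the same file (made-up values).
good, bad = split_by_version({"node-a": 7, "node-b": 7, "node-c": 5})
```
            </pre>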

            <br>

            When a client detects an inconsistency, it initiates

            healing. First of all, it must lock the entry and inode

            (when necessary). Then, from the data collected from each

            node, it must decide which nodes have good data and which

            ones have bad data and hence need to be healed. There are

            two possible cases:<br>

            <br>

            1. File is not a regular file<br>

            <blockquote>In this case the reconstruction is very fast and

              requires few requests, so it is done while the file is

              locked. In this case, the heal translator does nothing

              relevant.<br>

            </blockquote>

            2. File is a regular file<br>

            <blockquote>For regular files, the first step is to

              synchronize the metadata to the bad nodes, including the

              version information. Once this is done, the file is set in

              healing mode on bad nodes, and provider mode on good

              nodes. Then the entry and inode are unlocked.<br>

            </blockquote>

            When a file is in provider mode, it works as in normal mode,

            but refuses to start another healing. Only one client can be

            healing a file.<br>
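            <br>
            The mode changes described above can be sketched in Python as
            follows. This is an assumption-laden illustration: in the real
            design each server keeps the mode of its own copy, while here a
            single dictionary stands in for all nodes, and the names
            <code>Mode</code>, <code>HealInProgress</code> and
            <code>start_heal</code> are hypothetical.<br>
            <pre>
```python
from enum import Enum


class Mode(Enum):
    NORMAL = "normal"
    HEALING = "healing"
    PROVIDER = "provider"


class HealInProgress(Exception):
    """Raised when a heal is requested for a file already being healed."""


def start_heal(modes: dict[str, Mode], good: list[str], bad: list[str]) -> None:
    """Switch one file's per-node modes; the caller holds the locks."""
    busy = [n for n in good + bad if modes[n] is not Mode.NORMAL]
    if busy:
        # Provider and healing copies refuse to start another healing.
        raise HealInProgress(f"already healing on: {busy}")
    for node in good:
        modes[node] = Mode.PROVIDER
    for node in bad:
        modes[node] = Mode.HEALING
```
            </pre>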

            <br>

            When a file is in healing mode, each normal write request

            from any client is handled as if the file were in normal

            mode, updating the version information and detecting

            possible inconsistencies with the bitmap. Additionally, the

            healing translator marks the written region of the file as

            "good".<br>

            <br>

            Each write request from the healing client intended to

            repair the file must be marked with a special flag. In this

            case, the area to be written is filtered by the

            list of "good" ranges (if there is any intersection with a

            good range, it is removed from the request). The resulting

            set of ranges is propagated to the lower translator and

            added to the list of "good" ranges but the version

            information is not updated.<br>

            <br>

            Read requests are only served if the range requested is

            entirely contained in the "good" regions list.<br>
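            <br>
            The handling of "good" ranges for a file in healing mode can be
            sketched in Python like this. It is a simplified model, not the
            translator's implementation: ranges are half-open byte intervals
            kept in a sorted list, and the helper names
            (<code>subtract</code>, <code>add_range</code>,
            <code>heal_write</code>, <code>can_read</code>) are
            hypothetical.<br>
            <pre>
```python
Range = tuple[int, int]  # half-open byte range [start, end)


def subtract(start: int, end: int, good: list[Range]) -> list[Range]:
    """Parts of [start, end) not covered by `good` (sorted, disjoint)."""
    missing = []
    pos = start
    for g_start, g_end in good:
        if g_end > pos and end > g_start:  # this good range overlaps
            if g_start > pos:
                missing.append((pos, g_start))
            pos = g_end
    if end > pos:
        missing.append((pos, end))
    return missing


def add_range(good: list[Range], start: int, end: int) -> list[Range]:
    """Return `good` plus [start, end), merging touching ranges."""
    merged: list[Range] = []
    for g_start, g_end in sorted(good + [(start, end)]):
        if merged and merged[-1][1] >= g_start:
            merged[-1] = (merged[-1][0], max(merged[-1][1], g_end))
        else:
            merged.append((g_start, g_end))
    return merged


def heal_write(good: list[Range], start: int, end: int):
    """Flagged write from the healing client: skip already-good data."""
    to_write = subtract(start, end, good)
    for s, e in to_write:
        good = add_range(good, s, e)
    return to_write, good


def can_read(good: list[Range], start: int, end: int) -> bool:
    """A read is served only if it lies entirely inside good ranges."""
    return not subtract(start, end, good)
```
            </pre>
            For example, if a normal write has already marked bytes 100 to
            200 as good, a healing write covering bytes 0 to 300 is reduced
            to the two ranges 0 to 100 and 200 to 300, after which the whole
            region 0 to 300 is good and readable.<br>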

            <br>

            There are some additional details, but I think this is

            enough to have a general idea of its purpose and how it

            works.<br>

            <br>

            The main advantages of this translator are:<br>

            <br>

            1. Avoid duplicated code in client translators<br>

            2. Simplify and unify healing methods in client translators<br>

            3. xattrop is not needed anymore in client translators to

            keep track of changes<br>

            4. Full file contents are repaired without locking the file<br>

            5. Better detection and prevention of some split brain

            situations as soon as possible<br>

            <br>

            I think it would be very useful. It seems to me that it

            works correctly in all situations; however, I don't have all

            the experience that other developers have with the healing

            functions of AFR, so I will be happy to answer any questions

            and to hear suggestions for solving problems it may have or for improving

            it.<br>

            <br>

            What do you think about it?<br>

            <br>

          </div>

        </blockquote>

        <div><br>

          The goals you state above are all valid. What would really

          help (adoption) is if you can implement this as a modification

          of AFR by utilizing all the work already done, and you get

          brownie points if it is backward compatible with existing AFR.

          If you already have any code in a publishable state, please

          share it with us (github link?).<br>

          <br>

          Avati<br>

        </div>

      </div>

    </blockquote>

    I've tried to understand how AFR works and, in some way, some of the

    ideas have been taken from it. However, it is very complex, and a lot

    of changes have been made in the master branch over the

    last few months. It's hard for me to follow them while actively

    working on my translator. Nevertheless, the main reason for taking a

    separate path was that AFR is strongly bound to replication (at

    least from what I saw when I analyzed it more deeply; maybe things

    have changed since, but I haven't had time to review them).<br>

    <br>

    The requirements for my translator didn't fit very well with AFR,

    and the effort needed to understand and adapt it was

    too high. It also seems that there isn't any detailed developer documentation

    about the internals of AFR that would have made me more confident

    about modifying it (at least I haven't found any).<br>

    <br>

    I'm currently working on it, but it's not ready yet. As soon as it is

    in a minimally stable state, we will publish it, probably on GitHub.

    I'll post the URL to this list.<br>

    <br>

    Thank you<br>

  </body>

</html>