<html>
<head>
<meta content="text/html; charset=UTF-8" http-equiv="Content-Type">
</head>
<body text="#000000" bgcolor="#FFFFFF">
These are a few ideas I had about how to implement a MESI-like
protocol on gluster. It's more a bunch of ideas than a structured
proposal, but I hope it's clear enough to show the basic concepts as
I see them.<br>
<br>
Each inode will have two separate access levels: one for the
metadata and one for the data; the two can differ for the same inode.
Additionally, many security checks, posix-compliance checks and other
validations will need to be performed on the client side, since many
requests could be satisfied directly without accessing the bricks.<br>
<br>
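To make this concrete, the client-side context attached to each inode
could look something like the sketch below (the names and the exact
set of states are mine; nothing like this exists today):<br>
<pre>
#include &lt;stdint.h&gt;

/* MESI-like access levels, tracked independently for metadata and data. */
typedef enum {
        ACCESS_INVALID,   /* nothing cached: every request goes to the bricks */
        ACCESS_SHARED,    /* reads can be answered locally */
        ACCESS_EXCLUSIVE, /* reads and writes can be answered locally */
        ACCESS_MODIFIED   /* exclusive, with local changes not yet flushed */
} access_level_t;

typedef struct {
        access_level_t md_access;   /* access level for the metadata */
        access_level_t data_access; /* access level for the data */
        uint64_t       reserved;    /* space reserved on the bricks for
                                       background writes */
} inode_access_ctx_t;
</pre>
<br>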
First the easy part: normal operation, without node failures or
other errors.
<br>
<br>
The main idea is that each client, before processing a request, will
check whether it already has sufficient access to the related inode
to process the request locally (i.e. at least shared access for read
requests, or exclusive access for writes). If it does, the request
will be processed immediately and the answer returned to the upper
translators; for write requests, the operation will be continued in
the background.
<br>
<br>
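A minimal sketch of that fast path, building on the context above
(the function and field names are only illustrative):<br>
<pre>
#include &lt;stdbool.h&gt;

/* Can this fop be served locally? Writes are answered at once and
   continued in the background, so they need exclusive access. */
static bool
can_serve_locally (inode_access_ctx_t *ctx, bool is_write)
{
        if (is_write)
                return ctx-&gt;data_access &gt;= ACCESS_EXCLUSIVE;

        return ctx-&gt;data_access &gt;= ACCESS_SHARED;
}
</pre>
<br>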
If the client doesn't have enough access to the inode, it can attach
information to the request telling the bricks which kind of access it
wants. By default an operation needs a specific access level (i.e.
shared access for reads and exclusive access for writes), but the
client can request a less strict level if it won't need the full
access in the near future (for example, a write request needs
exclusive access and the bricks will execute it with exclusive
access, but the client can ask for shared access only, if it foresees
that the following operations will just be reads). Additionally, for
exclusive requests, an estimate of the required space must also be
attached to the request. The bricks will use this value to reserve
that amount of space for the client. This is needed to control
available space on writes and to allow them to be executed locally on
the client side (when the available space on a brick gets too low, it
can deny any exclusive access to keep better control of the space
left).<br>
<br>
A request can specify access levels for more than one inode (useful
for operations like rename that involve several inodes). This
information will be sent as new entries inside the xdata argument (a
possible encoding is sketched below, after the list). The request
will then be sent to the bricks and the client will wait for the
answer. Bricks can answer in three ways:
<br>
<br>
1. The operation could not be processed because the desired access
to the inode(s) could not be obtained. This shouldn't happen, but it
must be taken into account.
<br>
2. The operation has been processed successfully (even if the result
of the operation is an error) but the desired level of access has
not been granted
<br>
3. The operation has been processed successfully and the desired
level of access has been granted
<br>
<br>
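A possible encoding of the request and of the three answers (the
"mesi.*" keys and the enum are hypothetical; dict_new() and
dict_set_*() are gluster's existing dict API):<br>
<pre>
#include "dict.h" /* libglusterfs dict API */

static dict_t *
build_access_request (void)
{
        dict_t *xdata = dict_new ();

        /* ask for exclusive access to the inode... */
        dict_set_int32 (xdata, "mesi.want-access", ACCESS_EXCLUSIVE);
        /* ...and reserve 1MiB of brick space for background writes */
        dict_set_uint64 (xdata, "mesi.space-estimate", 1048576);

        return xdata;
}

/* Possible brick answers, matching the three cases above. */
typedef enum {
        GRANT_FAILED,  /* 1. the desired access could not be obtained */
        GRANT_DENIED,  /* 2. fop processed, access not granted */
        GRANT_GRANTED  /* 3. fop processed, access granted */
} grant_result_t;
</pre>
<br>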
When the operation succeeds and the request involved more than one
inode, it might happen that the bricks grant access to one of them
but not to the others. It's also possible that one brick grants
access to an inode while another brick does not (for example if a
brick is very low on space). In either case the client will consider
that access has been denied, as sketched below.
<br>
<br>
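In code, the all-or-nothing rule could look like this (reply_t is
made up; it stands for the answers collected from all bricks):<br>
<pre>
typedef struct {
        grant_result_t grant[2]; /* one entry per inode in the request */
} reply_t;

/* The access is kept only if every brick granted it for every inode. */
static bool
grant_is_complete (reply_t *replies, int brick_count, int inode_count)
{
        for (int i = 0; i &lt; brick_count; i++)
                for (int j = 0; j &lt; inode_count; j++)
                        if (replies[i].grant[j] != GRANT_GRANTED)
                                return false;

        return true;
}
</pre>
<br>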
When the access has been denied but the request has succeeded, it
means that any future request involving the same inode will need to
be sent to the bricks with the extra access information again.
<br>
<br>
This also gives the bricks enough control to not grant exclusive
access to an inode if they detect that multiple clients are accessing
it concurrently.
<br>
<br>
All requests containing inode access information will need to be
strictly ordered, to guarantee that all bricks process them in the
same order. Requests executed in the background, because the client
already had exclusive access, can be executed in any order (the
exclusive access is enough to avoid corruption).
<br>
<br>
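I haven't settled on a concrete ordering mechanism. One lock-less
possibility, similar in spirit to the DFC translator, is to tag each
access-carrying fop with a client-generated sequence number so that
every brick applies one client's requests in the same order (agreeing
on the interleaving between different clients would still need a
brick-side step; the key name is hypothetical):<br>
<pre>
/* Per-client, monotonically increasing sequence number attached to
   every fop that carries access information (a real client would
   need an atomic increment here). */
static uint64_t fop_seq = 0;

static void
tag_with_sequence (dict_t *xdata)
{
        dict_set_uint64 (xdata, "mesi.seq", ++fop_seq);
}
</pre>
<br>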
Specific details about some fops:
<br>
<br>
* open(), opendir(). The open flags can be used to determine the
desired access: an O_RDONLY open will request 'shared' access, while
an O_RDWR or O_WRONLY open will request 'exclusive' access. An
O_WRONLY open could also disable read caching, since the cached data
would never be read (see the sketch after this list).
<br>
* When the last fd of an inode is released, the current ownership
can be released (i.e. set the cache entry to 'invalid').
<br>
* Synchronization fops, like flush(), fsync() and fsyncdir(), will
always be sent synchronously even if the client has exclusive access
to the inode.
<br>
<br>
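A sketch of the flag-to-access mapping mentioned in the open() item
(the helper name is made up):<br>
<pre>
#include &lt;fcntl.h&gt;

/* Map open() flags to the access level requested from the bricks. */
static access_level_t
access_for_open_flags (int flags)
{
        if ((flags &amp; O_ACCMODE) == O_RDONLY)
                return ACCESS_SHARED;

        /* O_WRONLY or O_RDWR: writes need exclusive access. For
           O_WRONLY, read caching could additionally be disabled
           because the cached data would never be read back. */
        return ACCESS_EXCLUSIVE;
}
</pre>
<br>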
The not-so-easy part: what happens when something fails.
<br>
<br>
The big problem is what to do when a client dies or loses its
connection while holding exclusive access to some inodes, or when a
brick has a problem. There are a lot of cases and I haven't analyzed
all of them deeply; this is only a first approach.
<br>
<br>
When a brick dies:
<br>
<br>
In this case all clients will stop receiving answers from it. This
would need to be handled as it's currently done, depending on the
volume type (for replicate, the other bricks will keep the volume
working; for disperse, part of the volume could be lost). When the
brick comes back online and reconnects, the access levels held by
each client will need to be requested again (this is similar to the
current procedure for reopening fd's). If any of the requests to
restore ownership fails, the client will consider that it has lost
access to the inode and will need to ask for it again in future
requests, as sketched below.
<br>
<br>
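A sketch of that recovery step on the client (reacquire_access() is a
made-up call standing for the re-request of the previously held
level):<br>
<pre>
int reacquire_access (inode_access_ctx_t *ctx); /* made up */

/* On reconnection, try to restore every access level held before the
   brick went down; whatever cannot be restored is invalidated and
   must be requested again in future fops. */
static void
restore_access (inode_access_ctx_t **owned, int count)
{
        for (int i = 0; i &lt; count; i++) {
                if (reacquire_access (owned[i]) != 0) {
                        owned[i]-&gt;md_access   = ACCESS_INVALID;
                        owned[i]-&gt;data_access = ACCESS_INVALID;
                }
        }
}
</pre>
<br>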
When a client dies:
<br>
<br>
If it doesn't have ownership of any inode, nothing special happens.
Otherwise, if it has 'exclusive' access to one or more inodes, all
bricks will try to notify this client whenever another client
requests 'shared' or 'exclusive' access. This notification will have
a timeout: if the client doesn't answer in the specified time, it
will lose the ownership, and all later requests coming from it
without access information attached to the xdata will be denied. This
can lead to some data loss; however, since the caching will be
write-through and flush(), fsync() and fsyncdir() will have been
executed synchronously, the likelihood of data loss is small and the
semantics of posix allow it (I'm not a posix expert, but I think that
posix doesn't guarantee data to be recoverable until flush() or
fsync() has completed successfully).
<br>
<br>
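A brick-side sketch of that revoke-with-timeout (all names are made
up; the countdown itself could use gluster's timer framework):<br>
<pre>
#include &lt;stdbool.h&gt;

typedef struct {
        access_level_t level;
        bool           stale; /* fops without access info are denied */
} inode_owner_t;

/* Called when the notified owner hasn't answered within the timeout:
   the ownership is revoked and the owner is marked stale, so any
   later fop it sends without access information is denied. */
static void
revoke_timeout (void *data)
{
        inode_owner_t *owner = data;

        owner-&gt;level = ACCESS_INVALID;
        owner-&gt;stale = true;
}
</pre>
<br>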
When the client reconnects, it will continue to execute normally. It
might receive an invalidation notification for an inode that it no
longer owns; in this case it will simply acknowledge the
notification.
<br>
<br>
When a client disconnects but does not die:
<br>
<br>
It's basically the same as the previous case, but when the client
reconnects it will try to recover its previous ownerships. If nothing
has changed, it will recover them; otherwise some of the inodes will
be invalidated. Any pending operations on the invalidated inodes will
be lost (as if the client had died).<br>
<br>
Xavi<br>
<br>
<div class="moz-cite-prefix">El 06/02/14 00:24, Anand Avati ha
escrit:<br>
</div>
<blockquote
cite="mid:CAFboF2ybM7+UvKv4Di2jpN0fGKXoiFsYFaUuLe-KThYTHtr4Pw@mail.gmail.com"
type="cite">
<div dir="ltr">Xavi,
<div>Getting such a caching mechanism has several aspects. First
of all we need the framework pieces implemented (particularly
server-originated messages to the client for invalidations and
revokes) in a well designed way - particularly how we address a
specific translator in a message originating from the server.
Some of the recent changes to client_t allow server-side
translators to get a handle (the client_t object) on which
messages can be submitted back to the client.</div>
<div><br>
</div>
<div>Such a framework (of having server-originated messages) is
also necessary for implementing oplocks (and possibly leases) -
particularly interesting for the Samba integration.</div>
<div><br>
</div>
<div>As Jeff already mentioned, this is an area where gluster
has not focussed, given the targeted use case. However,
extending this to internal use cases can benefit many modules
(avoiding per-operation inodelks would help encryption/crypt,
afr, etc.). It seems possible to have a common framework for
delegating locks to clients, and to build cache coherency
protocols / oplocks / inodelk avoidance on top of it.</div>
<div><br>
</div>
<div>Feel free to share a more detailed proposal if you have a
plan - I'm sure the Samba folks (Ira copied) would be interested
too.</div>
<div><br>
</div>
<div>Thanks!</div>
<div>Avati<br>
<div class="gmail_extra">
<br>
<br>
<div class="gmail_quote">On Wed, Feb 5, 2014 at 11:27 AM,
Xavier Hernandez <span dir="ltr"><<a
moz-do-not-send="true"
href="mailto:xhernandez@datalab.es" target="_blank">xhernandez@datalab.es</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div class="HOEnZb">
<div class="h5">On 04.02.2014 17:18, Jeff Darcy wrote:<br>
<br>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<blockquote class="gmail_quote" style="margin:0 0
0 .8ex;border-left:1px #ccc
solid;padding-left:1ex">
The only synchronization point needed is to make
sure that all bricks<br>
agree on the inode state and which client owns
it. This can be achieved<br>
without locking using a method similar to what I
implemented in the DFC<br>
translator. Besides the lock-less architecture,
the main advantage is<br>
that much more aggressive caching strategies can
be implemented very<br>
near to the final user, increasing considerably
the throughput of the<br>
file system. Special care has to be taken with
things that can fail on<br>
background writes (basically brick space and
user access rights). Those<br>
should be handled appropriately on the client
side to guarantee future<br>
success of writes. Of course this is only a high
level overview. A<br>
deeper analysis should be done to see what to do
on each special case.<br>
What do you think ?<br>
</blockquote>
<br>
I think this is a great idea for where we can go -
and need to go - in the<br>
long term. However, it's important to recognize
that it *is* the long<br>
term. We had to solve almost exactly the same
problems in MPFS long ago.<br>
Whether the synchronization uses locks or not
*locally* is meaningless,<br>
because all of the difficult problems have to do
with recovering the<br>
*distributed* state. What happens when a brick
fails while holding an<br>
inode in any state but I? How do we recognize it,
what do we do about it,<br>
how do we handle the case where it comes back and
needs to re-acquire its<br>
previous state? How do we make sure that a brick
can successfully flush<br>
everything it needs to before it yields a
lock/lease/whatever? That's<br>
going to require some kind of flow control, which
is itself a pretty big<br>
project. It's not impossible, but it took multiple
people some years for<br>
MPFS, and ditto for every other project (e.g. Ceph
or XtreemFS) which<br>
adopted similar approaches. GlusterFS's historical
avoidance of this<br>
complexity certainly has some drawbacks, but it
has also been key to us<br>
making far more progress in other areas.<br>
<br>
</blockquote>
</div>
</div>
Well, it's true that there will be a lot of tricky cases
that will need<br>
to be handled to be sure that data integrity and system
responsiveness are<br>
guaranteed, however I think that they are not more
difficult than what<br>
can happen currently if a client dies or loses
communication while it<br>
holds a lock on a file.<br>
<br>
Anyway I think there is a great potential with this
mechanism because it<br>
can allow the implementation of powerful caches, even
based on SSD that<br>
could improve the performance a lot.<br>
<br>
Of course there is a lot of work solving all potential
failures and<br>
designing the right thing. An important consideration is
that all<br>
these methods try to solve a problem that is seldom
found (i.e. having<br>
more than one client modifying the same file at the same
time). So a<br>
solution that has almost 0 overhead for the normal case
and allows the<br>
implementation of aggressive caching mechanisms seems a
big win.
<div class="im"><br>
<br>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
To move forward on this, I think we need a *much*
more detailed idea of<br>
how we're going to handle the nasty cases. Would
some sort of online<br>
collaboration - e.g. Hangouts - make more sense than
continuing via<br>
email?<br>
<br>
</blockquote>
</div>
Of course, we can talk on IRC or somewhere else if you
prefer.<br>
<br>
Xavi
<div class="HOEnZb">
<div class="h5"><br>
<br>
_______________________________________________<br>
Gluster-devel mailing list<br>
<a moz-do-not-send="true"
href="mailto:Gluster-devel@nongnu.org"
target="_blank">Gluster-devel@nongnu.org</a><br>
<a moz-do-not-send="true"
href="https://lists.nongnu.org/mailman/listinfo/gluster-devel"
target="_blank">https://lists.nongnu.org/mailman/listinfo/gluster-devel</a><br>
</div>
</div>
</blockquote>
</div>
<br>
</div>
</div>
</div>
</blockquote>
<br>
</body>
</html>