<html>

  <head>

    <meta content="text/html; charset=windows-1252"

      http-equiv="Content-Type">

  </head>

  <body text="#000000" bgcolor="#FFFFFF">

    <div class="moz-cite-prefix">On 11/05/2014 06:54 PM, Andreas Hollaus

      wrote:<br>

    </div>

    <blockquote cite="mid:545A2523.3010906@ericsson.com" type="cite">

      <pre wrap="">On 11/05/14 12:23, Ravishankar N wrote:

</pre>

      <blockquote type="cite">

        <pre wrap="">On 11/05/2014 03:18 PM, Andreas Hollaus wrote:

</pre>

        <blockquote type="cite">

          <pre wrap="">Hi,

I'm curious about this 5 phase transaction scheme that is described in the document

(lock, pre-op, op, post-op, unlock).

Are these stage switches all triggered from the client or can the server do it

without notifying the client, for instance switching from 'op' to 'post-op'?

</pre>

        </blockquote>

        <pre wrap="">

All stages are performed by the AFR translator in the client graph, where it is

loaded, in the sequence you listed.

</pre>

      </blockquote>

      <pre wrap="">So the counters are stored on the servers (as extended attributes on the bricks), but

increased and decreased by the client after fetching them from the servers? If so, I

guess that the messages between those are just synchronous file system operations

like read extended attributes, write file etc.</pre>

    </blockquote>

    You got it right. The sequence is: lock the file on the bricks, set

    the xattrs on the bricks, write to the bricks, clear the xattrs on the

    bricks (in the success case), and unlock the file on the bricks.<br>
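    <br>
    A rough sketch of that sequence, purely as an illustration: the
    Brick class, afr_write() and the pending dict below are made-up
    names standing in for the real changelog xattrs, not GlusterFS code
    or APIs.<br>
    <pre wrap="">
```python
# Toy model of the five phases as driven by the client-side AFR translator.
# Brick, afr_write and the `pending` dict are hypothetical names for
# illustration; they are not GlusterFS APIs or the real xattr layout.

class Brick:
    def __init__(self, name, up=True):
        self.name = name
        self.up = up
        self.locked = False
        self.data = b""
        # pending[n]: operations this brick has recorded as not yet
        # confirmed on brick n (stand-in for the changelog xattrs).
        self.pending = {}


def afr_write(bricks, payload):
    """Run lock, pre-op, op, post-op, unlock against the reachable bricks."""
    names = [b.name for b in bricks]
    live = [b for b in bricks if b.up]

    for b in live:                      # 1. lock
        b.locked = True
    for b in live:                      # 2. pre-op: mark the write as pending
        for n in names:
            b.pending[n] = b.pending.get(n, 0) + 1
    succeeded = []
    for b in live:                      # 3. op: the actual write
        b.data = payload
        succeeded.append(b.name)
    for b in live:                      # 4. post-op: clear counters for successes
        for n in succeeded:
            b.pending[n] -= 1
    for b in live:                      # 5. unlock
        b.locked = False
    return succeeded


b0, b1 = Brick("client-0"), Brick("client-1", up=False)
afr_write([b0, b1], b"hello")
print(b0.pending)   # client-0 still holds a pending count against client-1
```
</pre>
    In this model the counters are non-zero only between pre-op and
    post-op, while the lock is held, unless a brick was unreachable
    during the write.<br>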

    <blockquote cite="mid:545A2523.3010906@ericsson.com" type="cite">

      <pre wrap="">

Is the client created whenever a GlusterFS volume is mounted?</pre>

    </blockquote>

    Correct. You give the hostname and volume name to the mount process,

    which uses them to fetch the volfile graph from the server, reads it, and

    loads the appropriate xlators.<br>
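    <br>
    To give a feel for what "reads the volfile graph" means, here is a
    small sketch that parses a stanza in the same format as the
    replicate stanza quoted later in this mail. parse_volfile() is a
    hypothetical helper written for this example; the real client uses
    its own parser.<br>
    <pre wrap="">
```python
# Hypothetical helper for illustration only; not part of GlusterFS.

def parse_volfile(text):
    """Return {volume name: {"type", "options", "subvolumes"}}."""
    graph = {}
    current = None
    for raw in text.splitlines():
        words = raw.split()
        if not words or words[0].startswith("#"):
            continue                    # skip blank lines and comments
        key = words[0]
        if key == "volume":
            current = {"type": None, "options": {}, "subvolumes": []}
            graph[words[1]] = current
        elif key == "end-volume":
            current = None
        elif current is None:
            continue                    # ignore anything outside a stanza
        elif key == "type":
            current["type"] = words[1]
        elif key == "option":
            current["options"][words[1]] = " ".join(words[2:])
        elif key == "subvolumes":
            current["subvolumes"] = words[1:]
    return graph


SAMPLE = """
volume testvol-replicate-0
    type cluster/replicate
    option favorite-child testvol-client-1
    subvolumes testvol-client-0 testvol-client-1
end-volume
"""

graph = parse_volfile(SAMPLE)
print(graph["testvol-replicate-0"]["subvolumes"])
```
</pre>
    Each stanza names one xlator, its type, its options and the
    subvolumes beneath it, which is how the client knows that the
    replicate xlator sits on top of the two client xlators.<br>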

    <blockquote cite="mid:545A2523.3010906@ericsson.com" type="cite">

      <pre wrap=""> As I'm running both

server and client on the same board it's a bit hard to distinguish them from each other.

</pre>

      <blockquote type="cite">

        <blockquote type="cite">

          <pre wrap="">Decreasing the counter for the local pending operations could be done without talking

to the client, even though I realize a message has to be sent to the other server(s),

possibly through the client.

The reason I ask is that I'm trying to estimate the risk of ending up in a split

brain situation, or at least understand if our servers will 'accuse' each other

temporarily during this 5 phase transaction under normal circumstances. If I

understand who sends messages to whom and in what order, I'll have a better chance to

see if we require any solution to split brain situations. As I've experienced

problems setting up the 'favorite-child' option, I want to know if it's required or

not. In our use case, quorum is not a solution, but losing some data is acceptable as

long as the bricks are in sync.

</pre>

        </blockquote>

        <pre wrap="">If a file is split-brained, AFR does not allow modifications by clients on it

until the split-brain is resolved. The afr xattrs and heal mechanisms ensure that

the bricks are in sync, so no worries on that front.

</pre>

      </blockquote>

      <pre wrap="">I know about the input/output error in case of a split brain and that is something we

must avoid at any cost. That's the reason why 'favorite-child' seems like a good idea

for us, but my filter script is not executed even though I tried a couple of probable

locations to store it at. It's a bit hard to be absolutely sure what that filter path

macro contained at the time the GlusterFS package was built. It would have been

easier if the path existed, even though it was empty if no filters were used.

According to the source code, there are some return statements due to errors that

could also be the reason for not running the filter script. Are there any ways to set

verbose level to get some more clues to what's going on?

</pre>

    </blockquote>

    Not sure I follow you on what a filter script is (hook scripts?),

    but yes, you can use the favorite-child option to pick the source

    for split-brained files. I don't think it's a supported/tested

    feature though. It can't be set using the gluster CLI. You will have to

    edit the volfile manually and add this option before starting the

    volume like so:<br>

    <br>

    # cat /var/lib/glusterd/vols/testvol/trusted-testvol-fuse.vol<br>

    <br>

    &lt;snip&gt;<br>

    volume testvol-replicate-0<br>

        type cluster/replicate<br>

        <b>option favorite-child testvol-client-1</b><br>

        subvolumes testvol-client-0 testvol-client-1<br>

    end-volume<br>

    &lt;/snip&gt;<br>
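    <br>
    Roughly, and only as a simplified model: a brick is a usable heal
    source if no other brick holds a pending count against it. When
    every brick is accused by another one, the file is in split-brain
    and a configured favorite child is taken as the source instead.
    pick_source() and the nested pending dict are hypothetical names
    for this sketch, not the actual AFR code.<br>
    <pre wrap="">
```python
# Simplified model for illustration; names are hypothetical, not AFR code.

def pick_source(pending, favorite_child=None):
    """pending[a][b] is the pending count brick a holds against brick b.

    Returns the name of a brick to heal from, or None when the file is
    in split-brain and no favorite child is configured.
    """
    names = sorted(pending)
    accused = set()
    for a in names:
        for b in names:
            if a != b and pending[a].get(b, 0) > 0:
                accused.add(b)          # brick a accuses brick b
    innocent = [n for n in names if n not in accused]
    if innocent:
        return innocent[0]              # an un-accused brick is a valid source
    return favorite_child               # split-brain: fall back to the option


# client-1 was down during a write, so client-0 accuses it.
one_sided = {"client-0": {"client-1": 1}, "client-1": {"client-0": 0}}
# Each brick saw writes the other missed: both accuse each other.
split_brain = {"client-0": {"client-1": 1}, "client-1": {"client-0": 2}}

print(pick_source(one_sided))
print(pick_source(split_brain))
print(pick_source(split_brain, favorite_child="client-1"))
```
</pre>
    Without the option, the split-brain case has no source in this
    model, which corresponds to the input/output error you mentioned.<br>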

    <br>

    -Ravi<br>

    <blockquote cite="mid:545A2523.3010906@ericsson.com" type="cite">

      <pre wrap="">

Regards

Andreas

</pre>

      <blockquote type="cite">

        <pre wrap="">Thanks,

Ravi

</pre>

        <blockquote type="cite">

          <pre wrap="">

Regards

Andreas

On 10/31/14 15:37, Ravishankar N wrote:

</pre>

          <blockquote type="cite">

            <pre wrap="">On 10/30/2014 07:23 PM, Andreas Hollaus wrote:

</pre>

            <blockquote type="cite">

              <pre wrap="">Hi,

Thanks! Seems like an interesting document. Although I've read blogs about how

extended attributes are used as a change log, this seems like a more comprehensive

document.

I won't write directly to any brick. That's the reason I first have to create a

volume which consists of only one brick, until the other server is available, and

then add that second brick. I don't want to delay the file system clients until the

second server is available, hence the reason for add-brick.

I guess that this procedure is only needed the first time the volume is configured,

right? If any of these bricks would fail later on, the change log would keep

track of

all changes to the file system even though only one of the bricks is available(?).

</pre>

            </blockquote>

            <pre wrap="">Yes, if one brick of a replica pair goes down, the other one keeps track of

file modifications by the client, and would sync it back to the first one when it

comes back up.

</pre>

            <blockquote type="cite">

              <pre wrap="">After a restart, volume settings stored in the configuration file would be accepted

even though not all servers were up and running yet at that time, wouldn't they?

</pre>

            </blockquote>

            <pre wrap="">glusterd running on all nodes ensures that the volume configurations stored on each

node are in sync.

</pre>

            <blockquote type="cite">

              <pre wrap="">Speaking about configuration files. When are these copied to each server?

If I create a volume which consists of two bricks, I guess that those servers will

create the configuration files, independently of each other, from the information

sent from the client (gluster volume create...).

</pre>

            </blockquote>

            <pre wrap="">All volume config/management commands must be run from any of the servers that make

up the volume and not the client (unless both happen to be in the same machine). As

mentioned above, when any of the volume commands are run on any one server,

glusterd orchestrates the necessary action on all servers and keeps them in sync.

</pre>

            <blockquote type="cite">

              <pre wrap="">   In case I later on add a brick, I guess that the settings have to be copied

to the

new brick after they have been modified on the first one, right (or will they be

recreated on all servers from the information specified by the client, like in the

previous case)?

Will configuration files be copied in other situations as well, for instance in

case

one of the servers which is part of the volume for some reason would be missing

those

files? In my case, the root file system is recreated from an image at each

reboot, so

everything created in /etc will be lost. Will GlusterFS settings be restored

from the

other server automatically

</pre>

            </blockquote>

            <pre wrap="">No, it is expected that servers have persistent file-systems.  There are ways to

restore such bricks; see

<a class="moz-txt-link-freetext" href="http://gluster.org/community/documentation/index.php/Gluster_3.4:_Brick_Restoration_-_Replace_Crashed_Server">http://gluster.org/community/documentation/index.php/Gluster_3.4:_Brick_Restoration_-_Replace_Crashed_Server</a>

-Ravi

</pre>

            <blockquote type="cite">

              <pre wrap="">or do I need to back up and restore those myself? Even

though the brick doesn't know that it is part of a volume in case it loses the

configuration files, both the other server(s) and the client(s) will probably

recognize it as being part of the volume. I therefore believe that such a

self-healing would actually be possible, even though it may not be implemented.

Regards

Andreas

  On 10/30/14 05:21, Ravishankar N wrote:

</pre>

              <blockquote type="cite">

                <pre wrap="">On 10/28/2014 03:58 PM, Andreas Hollaus wrote:

</pre>

                <blockquote type="cite">

                  <pre wrap="">Hi,

I'm curious about how GlusterFS manages to sync the bricks in the initial phase,

when

the volume is created or

extended.

I first create a volume consisting of only one brick, which clients will start to

read and write.

After a while I add a second brick to the volume to create a replicated volume.

If this new brick is empty, I guess that files will be copied from the first

brick to

get the bricks in sync, right?

However, if the second brick is not empty but rather contains a subset of the

files

on the first brick I don't see

how GlusterFS will solve the problem of syncing the bricks.

I guess that all files which lack extended attributes could be removed in this

scenario, because they were created

when the disk was not part of a GlusterFS volume. However, in case the brick was

used

in the volume previously,

for instance before that server restarted, there will be extended attributes for

the

files on the second brick which

weren't updated during the downtime (when the volume consisted of only one

brick).

There could be multiple

changes to the files during this time. In this case I don't understand how the

extended attributes could be used to

determine which of the bricks contains the most recent file.

Can anyone explain how this works? Is it only allowed to add empty bricks to a

volume?

</pre>

                </blockquote>

                <pre wrap="">It is allowed to add only empty bricks to the volume. Writing directly to

bricks is

not supported. One needs to access the volume only from a mount point or using

libgfapi.

After adding a brick to increase the distribute count, you need to run the volume

rebalance command so that some of the existing files are hashed (moved) to

this

newly added brick.

After adding a brick to increase the replica count, you need to run the volume

heal

full command to sync the files from the other replica into the newly added brick.

<a class="moz-txt-link-freetext" href="https://github.com/gluster/glusterfs/blob/master/doc/features/afr-v1.md">https://github.com/gluster/glusterfs/blob/master/doc/features/afr-v1.md</a> will give

you an idea of how the replicate translator uses xattrs to keep files in sync.

HTH,

Ravi

</pre>

              </blockquote>

            </blockquote>

          </blockquote>

          <pre wrap="">

</pre>

        </blockquote>

        <pre wrap="">

</pre>

      </blockquote>

      <pre wrap="">

</pre>

    </blockquote>

    <br>

  </body>

</html>