<html>
  <head>
    <meta content="text/html; charset=ISO-8859-1"
      http-equiv="Content-Type">
  </head>
  <body text="#000000" bgcolor="#FFFFFF">
    <div class="moz-cite-prefix">Al 06/09/13 20:43, En/na Anand Avati ha
      escrit:<br>
    </div>
    <blockquote
cite="mid:CAFboF2w2ZJzZBN6eHY_Bmi7XHF0xhcSq7pcCp3K+QedRKabJsg@mail.gmail.com"
      type="cite">
      <div dir="ltr"><br>
        <div class="gmail_extra">
          <div class="gmail_quote">On Fri, Sep 6, 2013 at 1:46 AM,
            Xavier Hernandez <span dir="ltr">&lt;<a
                moz-do-not-send="true"
                href="mailto:xhernandez@datalab.es" target="_blank">xhernandez@datalab.es</a>&gt;</span>
            wrote:<br>
            <blockquote class="gmail_quote" style="margin:0 0 0
              .8ex;border-left:1px #ccc solid;padding-left:1ex">
              <div text="#000000" bgcolor="#FFFFFF">
<div>On 04/09/13 18:10, Anand Avati wrote:<br>
                </div>
                <div>
                  <div class="h5">
                    <blockquote type="cite">
                      <div dir="ltr">On Wed, Sep 4, 2013 at 6:37 AM,
                        Xavier Hernandez <span dir="ltr">&lt;<a
                            moz-do-not-send="true"
                            href="mailto:xhernandez@datalab.es"
                            target="_blank">xhernandez@datalab.es</a>&gt;</span>
                        wrote:<br>
                        <div class="gmail_extra">
                          <div class="gmail_quote">
                            <blockquote class="gmail_quote"
                              style="margin:0 0 0 .8ex;border-left:1px
                              #ccc solid;padding-left:1ex">Al 04/09/13
                              14:05, En/na Jeff Darcy ha escrit:
                              <div>
                                <div><br>
                                  <blockquote class="gmail_quote"
                                    style="margin:0 0 0
                                    .8ex;border-left:1px #ccc
                                    solid;padding-left:1ex"> On
                                    09/04/2013 04:27 AM, Xavier
                                    Hernandez wrote:<br>
                                    <blockquote class="gmail_quote"
                                      style="margin:0 0 0
                                      .8ex;border-left:1px #ccc
                                      solid;padding-left:1ex"> I would
                                      also like to note that each node
                                      can store multiple elements.<br>
                                      The current implementation creates a
                                      node for each byte in the key. In my<br>
                                      implementation I only create a node
                                      if there is a prefix shared by two
                                      or more keys. This reduces the
                                      number of nodes and the number of<br>
                                      indirections.<br>
                                    </blockquote>
                                    <br>
                                    Whatever we do, we should try to
                                    make sure that the changes are
                                    profiled<br>
                                    against real usage. &nbsp;When I was
                                    making my own dict optimizations
                                    back in March<br>
                                    of last year, I started by looking
                                    at how they're actually used. At
                                    that time,<br>
                                    a significant majority of
                                    dictionaries contained just one
                                    item. That's why I<br>
                                    only implemented a simple mechanism
                                    to pre-allocate the first data_pair
                                    instead<br>
                                    of doing something more ambitious.
                                    &nbsp;Even then, the difference in actual<br>
                                    performance or CPU usage was barely
                                    measurable. &nbsp;Dict usage has
                                    certainly<br>
                                    changed since then, but I think
                                    you'd still be hard pressed to find
                                    a case<br>
                                    where a single dict contains more
                                    than a handful of entries, and
                                    approaches<br>
                                    that are optimized for dozens to
                                    hundreds might well perform worse
                                    than simple<br>
                                    ones (e.g. because of cache aliasing
                                    or branch misprediction).<br>
                                    <br>
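                                    The shape of that optimization is
                                    roughly the following (a minimal
                                    sketch, not the actual GlusterFS dict
                                    code; the field and function names
                                    here are only illustrative):<br>
<pre>
/* Embed the first key/value pair in the dict itself so the common
 * single-entry dictionary needs no extra allocation. */
#include &lt;stdlib.h&gt;
#include &lt;string.h&gt;

typedef struct data_pair {
        char             *key;
        char             *value;
        struct data_pair *next;
} data_pair_t;

typedef struct dict {
        data_pair_t  first_pair;    /* pre-allocated, used by the first set() */
        int          first_in_use;
        data_pair_t *members;       /* linked list of all pairs */
        int          count;
} dict_t;

static int dict_set (dict_t *d, const char *key, const char *value)
{
        data_pair_t *pair;

        if (!d->first_in_use) {
                pair = &d->first_pair;          /* no malloc for entry #1 */
                d->first_in_use = 1;
        } else {
                pair = calloc (1, sizeof (*pair));
                if (!pair)
                        return -1;
        }

        pair->key   = strdup (key);
        pair->value = strdup (value);
        pair->next  = d->members;
        d->members  = pair;
        d->count++;

        return 0;
}
</pre>
                                    <br>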
                                    If you're looking for other
                                    optimization opportunities that
                                    might provide even<br>
                                    bigger "bang for the buck" then I
                                    suggest that stack-frame or
                                    frame-&gt;local<br>
                                    allocations are a good place to
                                    start. &nbsp;Or string copying in places
                                    like<br>
                                    loc_copy. &nbsp;Or the entire
                                    fd_ctx/inode_ctx subsystem. &nbsp;Let me
                                    know and I'll come<br>
                                    up with a few more. &nbsp;To put a bit of
                                    a positive spin on things, the
                                    GlusterFS<br>
                                    code offers many opportunities for
                                    improvement in terms of CPU and
                                    memory<br>
                                    efficiency (though it's surprisingly
                                    still way better than Ceph in that
                                    regard).<br>
                                    <br>
                                  </blockquote>
                                </div>
                              </div>
                              Yes. The optimizations to the dictionary
                              structures are not a big improvement in
                              the overall performance of GlusterFS. I
                              tried them in a real scenario and the
                              benefit was only marginal. However, I
                              didn't test new features like an atomic
                              "lookup and remove if found" (because I
                              would have had to review all the code). I
                              think this kind of functionality could
                              further improve the results I obtained.<br>
                              <br>
                              However, this is not the only reason to
                              make these changes. While writing code
                              I've found that some things are tedious
                              to do simply because dict_t lacks the
                              functions for them. Some actions require
                              multiple calls and multiple error checks,
                              which adds complexity and limits the
                              readability of the code. Many of these
                              situations could be solved with functions
                              similar to the ones I proposed.<br>
                              <br>
                              On the other hand, if dict_t must truly be
                              considered a concurrent structure, there
                              are a lot of race conditions that can
                              appear in some operations. It would take a
                              great effort to handle all these
                              possibilities everywhere. It would be
                              better to pack most of these cases into
                              functions inside dict_t itself, where it
                              is easier to combine several operations.<br>
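                              For example, a combined "lookup and remove
                              if found" could take the dict lock once and
                              do both steps inside it, avoiding the race
                              where another thread deletes or replaces
                              the key between a get and a del. A minimal
                              sketch (this is not an existing dict_t API;
                              the names and the simplified structure are
                              only illustrative):<br>
<pre>
#include &lt;pthread.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;string.h&gt;

typedef struct pair {
        char        *key;
        char        *value;
        struct pair *next;
} pair_t;

typedef struct {
        pthread_mutex_t lock;
        pair_t         *members;
} dict_t;

/* Look the key up and unlink it in one critical section.  Returns the
 * value (ownership passes to the caller) or NULL if the key was absent. */
static char *dict_get_and_remove (dict_t *d, const char *key)
{
        char    *value = NULL;
        pair_t **pp;

        pthread_mutex_lock (&d->lock);
        for (pp = &d->members; *pp; pp = &(*pp)->next) {
                if (strcmp ((*pp)->key, key) == 0) {
                        pair_t *found = *pp;
                        *pp   = found->next;     /* unlink */
                        value = found->value;
                        free (found->key);
                        free (found);
                        break;
                }
        }
        pthread_mutex_unlock (&d->lock);

        return value;
}
</pre>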
                              <br>
                              By the way, I've run some tests with
                              multiple bricks and it seems that there is
                              a clear speed loss on directory listings
                              as the number of bricks increases. Since
                              bricks should be independent and able to
                              work in parallel, I didn't expect such a
                              big performance degradation.</blockquote>
                            <div><br>
                            </div>
                            <div>The likely reason is that, even though
                              bricks are parallel for IO, readdir is
                              essentially a sequential operation and DHT
                              has a limitation that a readdir reply
                              batch does not cross server boundaries. So
                              if you have 10 files and 1 server, all 10
                              entries are returned in one call to the
                              app/libc. If you have 10 files and 10
                              servers evenly distributed, the app/libc
                              has to perform 10 calls and keeps getting
                              one file at a time. This problem goes away
                              when each server has enough files to fill
                              up a readdir batch. It's only when you
                              have too few files and too many servers
                              that this "dilution" problem shows up.
                               However, this is just a theory and your
                               problem may be something else entirely.</div>
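                            <div>A back-of-the-envelope model of this
                              "dilution" effect (this is not the actual
                              DHT code, just illustrative arithmetic
                              assuming the files are spread evenly and
                              each reply batch holds up to a fixed number
                              of entries):</div>
<pre>
#include &lt;stdio.h&gt;

static int readdir_round_trips (int files, int servers, int batch)
{
        int per_server       = (files + servers - 1) / servers;
        int calls_per_server = (per_server + batch - 1) / batch;
        return servers * calls_per_server;
}

int main (void)
{
        /* 10 files, a reply batch of up to 10 entries */
        printf ("1 server  : %d call(s)\n", readdir_round_trips (10, 1, 10));  /* 1  */
        printf ("10 servers: %d call(s)\n", readdir_round_trips (10, 10, 10)); /* 10 */
        return 0;
}
</pre>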
                            <div><br>
                            </div>
                          </div>
                        </div>
                      </div>
                    </blockquote>
                  </div>
                </div>
                I didn't know that DHT was doing a sequential brick scan
                on readdir(p) (my fault). Why is that? Why can't it
                return entries crossing a server boundary? Is it due to
                a technical reason or only to the current
                implementation?<br>
                <br>
                I've made a test using only directories (50 directories
                with 50 subdirectories each). I started with one brick
                and measured the time to do a recursive 'ls'. Then I
                added one brick at a time, up to 6 (all of them
                physically independent), and repeated the ls. The time
                increases linearly as the number of bricks grows. As
                more bricks were added, the rebalancing time also grew
                linearly.<br>
                <br>
                I think this is a big problem for scalability. It can be
                partially hidden by using some caching or preloading
                mechanisms, but it will be there and it will hit sooner
                or later.
                <div class="im"><br>
                  <br>
                  <blockquote type="cite">
                    <div dir="ltr">
                      <div class="gmail_extra">
                        <div class="gmail_quote">
                          <div>Note that Brian Foster's readdir-ahead
                            patch should address this problem to a large
                            extent. When loaded on top of DHT, the
                            prefiller effectively collapses the smaller
                            chunks returned by DHT into a larger chunk
                            requested by the app/libc.</div>
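                          <div>Roughly, the idea (a self-contained toy,
                            not Brian Foster's actual patch; the lower
                            layer is simulated here) is to keep pulling
                            the small per-server chunks and hand the
                            application one large chunk:</div>
<pre>
#include &lt;stdbool.h&gt;
#include &lt;stdio.h&gt;

/* Simulated lower layer (standing in for DHT): a directory of 60
 * entries returned in chunks of at most 3 per call. */
static int lower_readdir (int *cursor, int max, bool *eod)
{
        int total = 60, chunk = 3;
        int n = chunk &lt; max ? chunk : max;

        if (*cursor + n > total)
                n = total - *cursor;
        *cursor += n;
        *eod = (*cursor >= total);
        return n;
}

/* The prefiller: keep pulling small chunks until the application's
 * larger request is satisfied or the directory is exhausted. */
static int prefill_readdir (int *cursor, int wanted)
{
        int  filled = 0;
        bool eod    = false;

        while (filled &lt; wanted && !eod)
                filled += lower_readdir (cursor, wanted - filled, &eod);

        return filled;
}

int main (void)
{
        int cursor = 0;

        /* the app/libc asks for up to 40 entries in one readdir call */
        printf ("returned %d entries in one call\n",
                prefill_readdir (&cursor, 40));
        return 0;
}
</pre>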
                          <div><br>
                          </div>
                        </div>
                      </div>
                    </div>
                  </blockquote>
                </div>
                I've seen it; however, I think it will only partially
                mitigate and hide an existing problem. Imagine you have
                several hundred or a thousand bricks. I doubt
                readdir-ahead or anything else can hide the enormous
                latency that the sequential DHT scan will generate in
                that case.<br>
                <br>
                The main problem I see is that the full directory
                structure is read many times sequentially. I think it
                would be better to do the readdir(p) calls in parallel
                and combine them (possibly in the background). This way
                the time to scan the directory structure would be almost
                constant, independent of the number of bricks.<br>
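                Just to illustrate what I mean (a minimal sketch with a
                hypothetical brick_readdir(); this is not DHT code), the
                per-brick requests could be issued from worker threads
                and the partial listings merged afterwards, so the
                wall-clock time is bounded by the slowest brick instead
                of the sum of all of them:<br>
<pre>
#include &lt;pthread.h&gt;
#include &lt;stdio.h&gt;

#define NUM_BRICKS 6

struct task {
        int brick;
        int entries;    /* result: number of entries read from this brick */
};

/* Hypothetical per-brick listing; here it only simulates the work. */
static int brick_readdir (int brick)
{
        return 100 + brick;
}

static void *worker (void *arg)
{
        struct task *t = arg;
        t->entries = brick_readdir (t->brick);
        return NULL;
}

int main (void)
{
        pthread_t   threads[NUM_BRICKS];
        struct task tasks[NUM_BRICKS];
        int         i, total = 0;

        for (i = 0; i &lt; NUM_BRICKS; i++) {
                tasks[i].brick = i;
                pthread_create (&threads[i], NULL, worker, &tasks[i]);
        }
        for (i = 0; i &lt; NUM_BRICKS; i++) {
                pthread_join (threads[i], NULL);
                total += tasks[i].entries;      /* combine the partial listings */
        }

        printf ("merged %d entries from %d bricks\n", total, NUM_BRICKS);
        return 0;
}
</pre>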
              </div>
            </blockquote>
            <div><br>
            </div>
            <div>The design of the directory entries in DHT makes this
              essentially a sequential operation because entries from
              servers are appended, not striped. What I mean is, the
              logical ordering of the entries is:</div>
            <div><br>
            </div>
            <div>All entries in a directory = All files and dirs in 0th
              server + All files (no dirs) in 1st server + All files (no
              dirs) in 2nd server + .. + All files (no dirs) in N'th
              server.</div>
            <div><br>
            </div>
            <div>and they must be consumed in that order. If we read the
              entries of the 2nd server along with those of the 1st
              server, we cannot "use" them until we have read all the
              entries of the 1st server and received EOD from it, which
              is why readdir-ahead is a more natural solution for this
              design than reading in parallel.</div>
            <div><br>
            </div>
          </div>
        </div>
      </div>
    </blockquote>
    As I understand it, what the readdir-ahead translator does is
    collect one or more answers from the DHT translator and combine
    them to return a single answer as big as possible. If that is
    correct, it will certainly reduce the number of readdir calls from
    the application; however, I think it will still incur considerable
    latency when used on big clusters. Anyway, I don't have any
    measurements or valid arguments to support this, so let's see how
    readdir-ahead works in real environments before discussing it
    further.<br>
    <br>
    <blockquote
cite="mid:CAFboF2w2ZJzZBN6eHY_Bmi7XHF0xhcSq7pcCp3K+QedRKabJsg@mail.gmail.com"
      type="cite">
      <div dir="ltr">
        <div class="gmail_extra">
          <div class="gmail_quote">
            <div>Also, this is a problem only if each server has fewer
              entries than what can be returned in a single readdir()
              request by the application. As long as each server has
              more than this "minimum threshold" number of files, the
              number of batched readdir() calls made by the client is
              going to be fixed, and those requests will be spread
              across the various servers (as opposed to sending them
              all to the same server).</div>
            <div><br>
            </div>
          </div>
        </div>
      </div>
    </blockquote>
    I've seen customers with large numbers of empty, or almost empty,
    directories. Don't ask me why; I don't understand it either...<br>
    <br>
    <blockquote
cite="mid:CAFboF2w2ZJzZBN6eHY_Bmi7XHF0xhcSq7pcCp3K+QedRKabJsg@mail.gmail.com"
      type="cite">
      <div dir="ltr">
        <div class="gmail_extra">
          <div class="gmail_quote">
            <div>So yes, as you add servers for a given small set of
              files the scalability drops, but only until you create
              more files, at which point the number of servers stops
              mattering again.</div>
            <div><br>
            </div>
            <div>Can you share the actual numbers from the tests you
              ran?</div>
            <div><br>
            </div>
          </div>
        </div>
      </div>
    </blockquote>
    I ran the tests on 6 physical servers (Quad Atom D525 at 1.8 GHz;
    these are the only servers I can use regularly for tests)
    connected through a dedicated 1 Gbit switch. Bricks are stored on
    1 TB SATA disks with ZFS. One of the servers was also used as the
    client running the tests.<br>
    <br>
    Initially I created a volume with a single brick. I initialized the
    volume with 50 directories with 50 subdirectories each (a total of
    2500 directories). No files.<br>
    <br>
    After each test, I added a new brick and started a rebalance. Once
    the rebalance was completed, I unmounted and stopped the volume,
    then restarted it.<br>
    <br>
    The test consisted of 4 runs of 'time ls -lR /&lt;testdir&gt; | wc -l'.
    The first result was discarded. The results shown below are the
    mean of the other 3 runs.<br>
    <br>
    1 brick: 11.8 seconds<br>
    2 bricks: 19.0 seconds<br>
    3 bricks: 23.8 seconds<br>
    4 bricks: 29.8 seconds<br>
    5 bricks: 34.6 seconds<br>
    6 bricks: 41.0 seconds<br>
    12 bricks (2 bricks on each server): 78.5 seconds<br>
    <br>
    The rebalancing time also grew considerably (these times are from a
    single rebalance each, so they might not be very accurate):<br>
    <br>
    From 1 to 2 bricks: 91 seconds<br>
    From 2 to 3 bricks: 102 seconds<br>
    From 3 to 4 bricks: 119 seconds<br>
    From 4 to 5 bricks: 138 seconds<br>
    From 5 to 6 bricks: 151 seconds<br>
    From 6 to 12 bricks: 259 seconds<br>
    <br>
    The number of disk IOPS didn't exceed 40 on any server in any test.
    The network bandwidth didn't go beyond 6 Mbit/s between any pair of
    servers, and none of them reached 100% usage on any CPU core.<br>
    <br>
    Xavi<br>
    <br>
    <blockquote
cite="mid:CAFboF2w2ZJzZBN6eHY_Bmi7XHF0xhcSq7pcCp3K+QedRKabJsg@mail.gmail.com"
      type="cite">
      <div dir="ltr">
        <div class="gmail_extra">
          <div class="gmail_quote">
            <div>Avati</div>
            <div><br>
            </div>
          </div>
        </div>
      </div>
      <br>
      <fieldset class="mimeAttachmentHeader"></fieldset>
      <br>
      <pre wrap="">_______________________________________________
Gluster-devel mailing list
<a class="moz-txt-link-abbreviated" href="mailto:Gluster-devel@nongnu.org">Gluster-devel@nongnu.org</a>
<a class="moz-txt-link-freetext" href="https://lists.nongnu.org/mailman/listinfo/gluster-devel">https://lists.nongnu.org/mailman/listinfo/gluster-devel</a>
</pre>
    </blockquote>
    <br>
  </body>
</html>