<div dir="ltr"><br><div class="gmail_extra"><div class="gmail_quote">On Fri, Sep 6, 2013 at 1:46 AM, Xavier Hernandez <span dir="ltr">&lt;<a href="mailto:xhernandez@datalab.es" target="_blank">xhernandez@datalab.es</a>&gt;</span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

  <div text="#000000" bgcolor="#FFFFFF">

    <div>Al 04/09/13 18:10, En/na Anand Avati ha

      escrit:<br>

    </div><div><div class="h5">

    <blockquote type="cite">

      <div dir="ltr">On Wed, Sep 4, 2013 at 6:37 AM, Xavier Hernandez <span dir="ltr">&lt;<a href="mailto:xhernandez@datalab.es" target="_blank">xhernandez@datalab.es</a>&gt;</span>

        wrote:<br>

        <div class="gmail_extra">

          <div class="gmail_quote">

            <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Al

              04/09/13 14:05, En/na Jeff Darcy ha escrit:

              <div>

                <div><br>

                  <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

                    On 09/04/2013 04:27 AM, Xavier Hernandez wrote:<br>

                    <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

                      I would also like to note that each node can store

                      multiple elements.<br>

                      Current implementation creates a node for each

                      byte in the key. In my<br>

                      implementation I only create a node if there is a

                      prefix coincidence between<br>

                      2 or more keys. This reduces the number of nodes

                      and the number of<br>

                      indirections.<br>

                    </blockquote>

                    <br>

                    Whatever we do, we should try to make sure that the

                    changes are profiled<br>

                    against real usage.  When I was making my own dict

                    optimizations back in March<br>

                    of last year, I started by looking at how they&#39;re

                    actually used. At that time,<br>

                    a significant majority of dictionaries contained

                    just one item. That&#39;s why I<br>

                    only implemented a simple mechanism to pre-allocate

                    the first data_pair instead<br>

                    of doing something more ambitious.  Even then, the

                    difference in actual<br>

                    performance or CPU usage was barely measurable.

                     Dict usage has certainly<br>

                    changed since then, but I think you&#39;d still be hard

                    pressed to find a case<br>

                    where a single dict contains more than a handful of

                    entries, and approaches<br>

                    that are optimized for dozens to hundreds might well

                    perform worse than simple<br>

                    ones (e.g. because of cache aliasing or branch

                    misprediction).<br>

                    <br>

                    If you&#39;re looking for other optimization

                    opportunities that might provide even<br>

                    bigger &quot;bang for the buck&quot; then I suggest that

                    stack-frame or frame-&gt;local<br>

                    allocations are a good place to start.  Or string

                    copying in places like<br>

                    loc_copy.  Or the entire fd_ctx/inode_ctx subsystem.

                     Let me know and I&#39;ll come<br>

                    up with a few more.  To put a bit of a positive spin

                    on things, the GlusterFS<br>

                    code offers many opportunities for improvement in

                    terms of CPU and memory<br>

                    efficiency (though it&#39;s surprisingly still way

                    better than Ceph in that regard).<br>

                    <br>

                  </blockquote>

                </div>

              </div>

              Yes. The optimizations on dictionary structures are not a

              big improvement in the overall performance of GlusterFS. I

              tried it on a real situation and the benefit was only

              marginal. However I didn&#39;t test new features like an

              atomic lookup and remove if found (because I would have

              had to review all the code). I think this kind of

              functionalities could improve a bit more the results I

              obtained.<br>

              <br>

              However this is not the only reason to do these changes.

              While I&#39;ve been writing code I&#39;ve found that it&#39;s tedious

              to do some things just because there isn&#39;t such functions

              in dict_t. Some actions require multiple calls, having to

              check multiple errors and adding complexity and limiting

              readability of the code. Many of these situations could be

              solved using functions similar to what I proposed.<br>

              <br>

              On the other side, if dict_t must be truly considered a

              concurrent structure, there are a lot of race conditions

              that might appear when doing some operations. It would

              require a great effort to take care of all these

              possibilities everywhere. It would be better to pack most

              of these situations into functions inside the dict_t

              itself where it is easier to combine some operations.<br>

              <br>

              By the way, I&#39;ve made some tests with multiple bricks and

              it seems that there is a clear speed loss on directory

              listings as the number of bricks increases. Since bricks

              should be independent and they can work in parallel, I

              didn&#39;t expected such a big performance degradation.</blockquote>

            <div><br>

            </div>

            <div>The likely reason is that, even though bricks are

              parallel for IO, readdir is essentially a sequential

              operation and DHT has a limitation that a readdir reply

              batch does not cross server boundaries. So if you have 10

              files and 1 server, all 10 entries are returned in one

              call to the app/libc. If you have 10 files and 10 servers

              evenly distributed, the app/libc has to perform 10 calls

              and keeps getting one file at a time. This problem goes

              away when each server has enough files to fill up a

              readdir batch. It&#39;s only when you have too few files and

              too many servers that this &quot;dilution&quot; problem shows up.

              However, this is just a theory and your problem may be

              something else too..</div>

            <div><br>

            </div>

          </div>

        </div>

      </div>

    </blockquote></div></div>

    I didn&#39;t know that DHT was doing a sequential brick scan on

    readdir(p) (my fault). Why is that ? Why it cannot return entries

    crossing a server boundary ? is it due to a technical reason or is

    it only due to the current implementation ?<br>

    <br>

    I&#39;ve made a test using only directories (50 directories with 50

    subdirectories each). I started with one brick and I measured the

    time to do a recursive &#39;ls&#39;. Then I sequentially added an additional

    brick, up to 6 (all of them physically independent), and repeated

    the ls. The time increases linearly as the number of bricks

    augments. As more bricks were added, the rebalancing time was also

    growing linearly.<br>

    <br>

    I think this is a big problem for scalability. It can be partially

    hidden by using some caching or preloading mechanisms, but it will

    be there and it will hit sooner or later.<div class="im"><br>

    <br>

    <blockquote type="cite">

      <div dir="ltr">

        <div class="gmail_extra">

          <div class="gmail_quote">

            <div>Note that Brian Foster&#39;s readdir-ahead patch should

              address this problem to a large extent. When loaded on top

              of DHT, the prefiller effectively collapses the smaller

              chunks returned by DHT into a larger chunk requested by

              the app/libc.</div>

            <div><br>

            </div>

          </div>

        </div>

      </div>

    </blockquote></div>

    I&#39;ve seen it, however I think it will only partially mitigate and

    hide an existing problem. Imagine you have some hundreds or a

    thousand of bricks. I doubt readdir-ahead or anything else can hide

    the enormous latency that the sequential DHT scan will generate in

    that case.<br>

    <br>

    The main problem I see is that the full directory structure is read

    many times sequentially. I think it would be better to do the

    readdir(p) calls in parallel and combine them (possibly in

    background). This way the time to scan the directory structure would

    be almost constant, independently of the number of bricks.<br></div></blockquote><div><br></div><div>The design of the directory entries in DHT makes this essentially a sequential operation because entries from servers are appended, not striped. What I mean is, the logical ordering of </div>

<div><br></div><div>All entries in a directory = All files and dirs in 0th server + All files (no dirs) in 1st server + All files (no dirs) in 2nd server + .. + All files (no dirs) in N&#39;th server.</div><div><br></div>

<div>in a sequential manner. If we read the entries of 2nd server along with entries of 1st server, we cannot &quot;use&quot; it till we finish reading all entries of 1st server and get EOD from it - which is why readdir-ahead is a more natural solution than reading in parallel for the above design.</div>

<div><br></div><div>Also, this is a problem only if each server has fewer entries than what can be returned in a single readdir() request by the application. As long as the server has more than this &quot;minimum threshold&quot; of number of files, the number of batched readdir() made by the client is going to be fixed, and those various requests will be spread across various servers (as opposed to, sending them all to the same server).</div>

<div><br></div><div>So yes, as you add servers for a given small set of files the scalability drops, but that is only till you create more files, when the # of servers stop mattering again.</div><div><br></div><div>Can you share the actual numbers from the tests you ran?</div>

<div><br></div><div>Avati</div><div><br></div></div></div></div>