<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<div class="moz-cite-prefix">Al 04/09/13 18:10, En/na Anand Avati ha
escrit:<br>
</div>
<blockquote
cite="mid:CAFboF2x2X+xqR2xgQ7wfsju5=w6o59Q1w5+TkVFqcVPn-zMhjQ@mail.gmail.com"
type="cite">
<div dir="ltr">On Wed, Sep 4, 2013 at 6:37 AM, Xavier Hernandez <span
dir="ltr"><<a moz-do-not-send="true"
href="mailto:xhernandez@datalab.es" target="_blank">xhernandez@datalab.es</a>></span>
wrote:<br>
<div class="gmail_extra">
<div class="gmail_quote">
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">Al
04/09/13 14:05, En/na Jeff Darcy ha escrit:
<div>
<div class="h5"><br>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
On 09/04/2013 04:27 AM, Xavier Hernandez wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
I would also like to note that each node can store
multiple elements.<br>
Current implementation creates a node for each
byte in the key. In my<br>
implementation I only create a node if there is a
prefix coincidence between<br>
2 or more keys. This reduces the number of nodes
and the number of<br>
indirections.<br>
</blockquote>
<br>
Whatever we do, we should try to make sure that the
changes are profiled<br>
against real usage. When I was making my own dict
optimizations back in March<br>
of last year, I started by looking at how they're
actually used. At that time,<br>
a significant majority of dictionaries contained
just one item. That's why I<br>
only implemented a simple mechanism to pre-allocate
the first data_pair instead<br>
of doing something more ambitious. Even then, the
difference in actual<br>
performance or CPU usage was barely measurable.
Dict usage has certainly<br>
changed since then, but I think you'd still be hard
pressed to find a case<br>
where a single dict contains more than a handful of
entries, and approaches<br>
that are optimized for dozens to hundreds might well
perform worse than simple<br>
ones (e.g. because of cache aliasing or branch
misprediction).<br>
<br>
If you're looking for other optimization
opportunities that might provide even<br>
bigger "bang for the buck" then I suggest that
stack-frame or frame->local<br>
allocations are a good place to start. Or string
copying in places like<br>
loc_copy. Or the entire fd_ctx/inode_ctx subsystem.
Let me know and I'll come<br>
up with a few more. To put a bit of a positive spin
on things, the GlusterFS<br>
code offers many opportunities for improvement in
terms of CPU and memory<br>
efficiency (though it's surprisingly still way
better than Ceph in that regard).<br>
<br>
</blockquote>
</div>
</div>
Yes. The optimizations on dictionary structures are not a
big improvement in the overall performance of GlusterFS. I
tried it on a real situation and the benefit was only
marginal. However I didn't test new features like an
atomic lookup and remove if found (because I would have
had to review all the code). I think this kind of
functionalities could improve a bit more the results I
obtained.<br>
<br>
However this is not the only reason to do these changes.
While I've been writing code I've found that it's tedious
to do some things just because there isn't such functions
in dict_t. Some actions require multiple calls, having to
check multiple errors and adding complexity and limiting
readability of the code. Many of these situations could be
solved using functions similar to what I proposed.<br>
<br>
On the other side, if dict_t must be truly considered a
concurrent structure, there are a lot of race conditions
that might appear when doing some operations. It would
require a great effort to take care of all these
possibilities everywhere. It would be better to pack most
of these situations into functions inside the dict_t
itself where it is easier to combine some operations.<br>
<br>
By the way, I've made some tests with multiple bricks and
it seems that there is a clear speed loss on directory
listings as the number of bricks increases. Since bricks
should be independent and they can work in parallel, I
didn't expected such a big performance degradation.</blockquote>
<div><br>
</div>
<div>The likely reason is that, even though bricks are
parallel for IO, readdir is essentially a sequential
operation and DHT has a limitation that a readdir reply
batch does not cross server boundaries. So if you have 10
files and 1 server, all 10 entries are returned in one
call to the app/libc. If you have 10 files and 10 servers
evenly distributed, the app/libc has to perform 10 calls
and keeps getting one file at a time. This problem goes
away when each server has enough files to fill up a
readdir batch. It's only when you have too few files and
too many servers that this "dilution" problem shows up.
However, this is just a theory and your problem may be
something else too..</div>
<div><br>
</div>
</div>
</div>
</div>
</blockquote>
I didn't know that DHT was doing a sequential brick scan on
readdir(p) (my fault). Why is that ? Why it cannot return entries
crossing a server boundary ? is it due to a technical reason or is
it only due to the current implementation ?<br>
<br>
I've made a test using only directories (50 directories with 50
subdirectories each). I started with one brick and I measured the
time to do a recursive 'ls'. Then I sequentially added an additional
brick, up to 6 (all of them physically independent), and repeated
the ls. The time increases linearly as the number of bricks
augments. As more bricks were added, the rebalancing time was also
growing linearly.<br>
<br>
I think this is a big problem for scalability. It can be partially
hidden by using some caching or preloading mechanisms, but it will
be there and it will hit sooner or later.<br>
<br>
<blockquote
cite="mid:CAFboF2x2X+xqR2xgQ7wfsju5=w6o59Q1w5+TkVFqcVPn-zMhjQ@mail.gmail.com"
type="cite">
<div dir="ltr">
<div class="gmail_extra">
<div class="gmail_quote">
<div>Note that Brian Foster's readdir-ahead patch should
address this problem to a large extent. When loaded on top
of DHT, the prefiller effectively collapses the smaller
chunks returned by DHT into a larger chunk requested by
the app/libc.</div>
<div><br>
</div>
</div>
</div>
</div>
</blockquote>
I've seen it, however I think it will only partially mitigate and
hide an existing problem. Imagine you have some hundreds or a
thousand of bricks. I doubt readdir-ahead or anything else can hide
the enormous latency that the sequential DHT scan will generate in
that case.<br>
<br>
The main problem I see is that the full directory structure is read
many times sequentially. I think it would be better to do the
readdir(p) calls in parallel and combine them (possibly in
background). This way the time to scan the directory structure would
be almost constant, independently of the number of bricks.<br>
<br>
Xavi<br>
<br>
<blockquote
cite="mid:CAFboF2x2X+xqR2xgQ7wfsju5=w6o59Q1w5+TkVFqcVPn-zMhjQ@mail.gmail.com"
type="cite">
<div dir="ltr">
<div class="gmail_extra">
<div class="gmail_quote">
<div>Avati</div>
<div><br>
</div>
<div><br>
</div>
<div> </div>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">However
the tests have not been exhaustive nor made in best
conditions so they might be misleading. Anyway it seems to
me that there might be a problem with some mutexes that
force too much serialization of requests (though I have no
real proves it's only a feeling). Maybe some more
"asynchronousity" on calls between translators could help.<br>
<br>
Only some thoughts...<br>
<br>
Best regards,<br>
<br>
Xavi
<div class="HOEnZb">
<div class="h5"><br>
<br>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
_______________________________________________<br>
Gluster-devel mailing list<br>
<a moz-do-not-send="true"
href="mailto:Gluster-devel@nongnu.org"
target="_blank">Gluster-devel@nongnu.org</a><br>
<a moz-do-not-send="true"
href="https://lists.nongnu.org/mailman/listinfo/gluster-devel"
target="_blank">https://lists.nongnu.org/mailman/listinfo/gluster-devel</a><br>
</blockquote>
<br>
<br>
_______________________________________________<br>
Gluster-devel mailing list<br>
<a moz-do-not-send="true"
href="mailto:Gluster-devel@nongnu.org"
target="_blank">Gluster-devel@nongnu.org</a><br>
<a moz-do-not-send="true"
href="https://lists.nongnu.org/mailman/listinfo/gluster-devel"
target="_blank">https://lists.nongnu.org/mailman/listinfo/gluster-devel</a><br>
</div>
</div>
</blockquote>
</div>
<br>
</div>
</div>
<br>
<fieldset class="mimeAttachmentHeader"></fieldset>
<br>
<pre wrap="">_______________________________________________
Gluster-devel mailing list
<a class="moz-txt-link-abbreviated" href="mailto:Gluster-devel@nongnu.org">Gluster-devel@nongnu.org</a>
<a class="moz-txt-link-freetext" href="https://lists.nongnu.org/mailman/listinfo/gluster-devel">https://lists.nongnu.org/mailman/listinfo/gluster-devel</a>
</pre>
</blockquote>
<br>
</body>
</html>