<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<div class="moz-cite-prefix">On 06/09/13 20:43, Anand Avati
wrote:<br>
</div>
<blockquote
cite="mid:CAFboF2w2ZJzZBN6eHY_Bmi7XHF0xhcSq7pcCp3K+QedRKabJsg@mail.gmail.com"
type="cite">
<div dir="ltr"><br>
<div class="gmail_extra">
<div class="gmail_quote">On Fri, Sep 6, 2013 at 1:46 AM,
Xavier Hernandez <span dir="ltr"><<a
moz-do-not-send="true"
href="mailto:xhernandez@datalab.es" target="_blank">xhernandez@datalab.es</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div text="#000000" bgcolor="#FFFFFF">
<div>On 04/09/13 18:10, Anand Avati wrote:<br>
</div>
<div>
<div class="h5">
<blockquote type="cite">
<div dir="ltr">On Wed, Sep 4, 2013 at 6:37 AM,
Xavier Hernandez <span dir="ltr"><<a
moz-do-not-send="true"
href="mailto:xhernandez@datalab.es"
target="_blank">xhernandez@datalab.es</a>></span>
wrote:<br>
<div class="gmail_extra">
<div class="gmail_quote">
<blockquote class="gmail_quote"
style="margin:0 0 0 .8ex;border-left:1px
#ccc solid;padding-left:1ex">On 04/09/13
14:05, Jeff Darcy wrote:
<div>
<div><br>
<blockquote class="gmail_quote"
style="margin:0 0 0
.8ex;border-left:1px #ccc
solid;padding-left:1ex"> On
09/04/2013 04:27 AM, Xavier
Hernandez wrote:<br>
<blockquote class="gmail_quote"
style="margin:0 0 0
.8ex;border-left:1px #ccc
solid;padding-left:1ex"> I would
also like to note that each node
can store multiple elements.<br>
The current implementation creates
a node for each byte in the key. In my<br>
implementation I only create a
node where two or more keys share
a common prefix.<br>
This reduces the
number of nodes and the number of<br>
indirections.<br>
</blockquote>
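The node-on-shared-prefix idea described above (a compressed, radix-style trie, as opposed to one node per key byte) can be sketched as follows. This is a minimal illustration in Python, not the actual dict_t patch; all names are illustrative:

```python
class Node:
    """Trie node holding a multi-byte key fragment, not a single byte."""
    def __init__(self, prefix="", value=None):
        self.prefix = prefix        # key fragment stored in this node
        self.value = value
        self.children = {}          # first byte of child's prefix -> Node

def insert(root, key, value):
    node = root
    while True:
        child = node.children.get(key[:1])
        if child is None:
            # no shared prefix at all: one node for the whole remainder
            node.children[key[:1]] = Node(key, value)
            return
        # length of the common prefix between key and child's fragment
        n = 0
        while n < min(len(key), len(child.prefix)) and key[n] == child.prefix[n]:
            n += 1
        if n < len(child.prefix):
            # keys diverge inside the fragment: split only at that point
            mid = Node(child.prefix[:n])
            child.prefix = child.prefix[n:]
            mid.children[child.prefix[:1]] = child
            node.children[mid.prefix[:1]] = mid
            child = mid
        if n == len(key):
            child.value = value
            return
        key, node = key[n:], child

def lookup(root, key):
    node = root
    while key:
        child = node.children.get(key[:1])
        if child is None or not key.startswith(child.prefix):
            return None
        key, node = key[len(child.prefix):], child
    return node.value
```

With keys like "user.glusterfs.key1" and "user.glusterfs.key2", the shared run of bytes lives in a single split node, so the tree needs a handful of nodes rather than one per byte.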
<br>
Whatever we do, we should try to
make sure that the changes are
profiled<br>
against real usage. When I was
making my own dict optimizations
back in March<br>
of last year, I started by looking
at how they're actually used. At
that time,<br>
a significant majority of
dictionaries contained just one
item. That's why I<br>
only implemented a simple mechanism
to pre-allocate the first data_pair
instead<br>
of doing something more ambitious.
Even then, the difference in actual<br>
performance or CPU usage was barely
measurable. Dict usage has
certainly<br>
changed since then, but I think
you'd still be hard pressed to find
a case<br>
where a single dict contains more
than a handful of entries, and
approaches<br>
that are optimized for dozens to
hundreds might well perform worse
than simple<br>
ones (e.g. because of cache aliasing
or branch misprediction).<br>
<br>
If you're looking for other
optimization opportunities that
might provide even<br>
bigger "bang for the buck" then I
suggest that stack-frame or
frame->local<br>
allocations are a good place to
start. Or string copying in places
like<br>
loc_copy. Or the entire
fd_ctx/inode_ctx subsystem. Let me
know and I'll come<br>
up with a few more. To put a bit of
a positive spin on things, the
GlusterFS<br>
code offers many opportunities for
improvement in terms of CPU and
memory<br>
efficiency (though it's surprisingly
still way better than Ceph in that
regard).<br>
<br>
</blockquote>
</div>
</div>
Yes. The optimizations on dictionary
structures are not a big improvement in
the overall performance of GlusterFS. I
tried them in a real scenario and the
benefit was only marginal. However, I
didn't test new features such as an
atomic lookup-and-remove (because I
would have had to review all the code). I
think that kind of functionality could
improve the results I obtained a bit more.<br>
<br>
However, this is not the only reason to
make these changes. While writing code
I've found it tedious to do some things
simply because dict_t lacks such
functions. Some actions require
multiple calls, each with its own error
checking, which adds complexity and
limits the readability of the code.
Many of these situations could be
solved with functions similar to the
ones I proposed.<br>
<br>
On the other hand, if dict_t must truly
be considered a concurrent structure,
a lot of race conditions can appear
when performing some operations. It
would take a great effort to handle
all these possibilities everywhere. It
would be better to pack most of these
situations into functions inside
dict_t itself, where it is easier to
combine operations.<br>
<br>
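A sketch of the kind of packed operation meant here, using a lock-protected dictionary; the class and method names are assumptions for illustration, not the dict_t API:

```python
import threading

class LockedDict:
    """Toy concurrent dictionary: combined operations hold the lock once."""
    def __init__(self):
        self._lock = threading.Lock()
        self._pairs = {}

    def set(self, key, value):
        with self._lock:
            self._pairs[key] = value

    def lookup_and_remove(self, key):
        # One critical section: there is no window between "found" and
        # "removed" in which another thread could delete or replace the
        # value, as there would be with separate lookup() + remove() calls.
        with self._lock:
            return self._pairs.pop(key, None)
```

Callers get a single call with a single error path, instead of a lookup, a check, and a remove that must each be guarded.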
By the way, I've run some tests with
multiple bricks and there is
a clear speed loss on directory listings
as the number of bricks increases. Since
bricks should be independent and can
work in parallel, I didn't expect such a
big performance degradation.</blockquote>
<div><br>
</div>
<div>The likely reason is that, even though
bricks are parallel for IO, readdir is
essentially a sequential operation and DHT
has a limitation that a readdir reply
batch does not cross server boundaries. So
if you have 10 files and 1 server, all 10
entries are returned in one call to the
app/libc. If you have 10 files and 10
servers evenly distributed, the app/libc
has to perform 10 calls and keeps getting
one file at a time. This problem goes away
when each server has enough files to fill
up a readdir batch. It's only when you
have too few files and too many servers
that this "dilution" problem shows up.
However, this is just a theory and your
problem may be something else.</div>
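The "dilution" effect described above can be put into numbers. The model below (a fixed batch size, at least one call per server, batches never spanning servers) is an assumption for illustration:

```python
import math

def client_readdir_calls(files_per_server, num_servers, batch_size):
    """Count the readdir batches the app/libc performs when a reply
    batch never crosses a server boundary."""
    calls = 0
    for _ in range(num_servers):
        # each server's entries take ceil(n / batch_size) batches,
        # and even an empty server costs one call to learn that
        calls += max(1, math.ceil(files_per_server / batch_size))
    return calls
```

With 10 files on 1 server and a batch of 10, one call suffices; the same 10 files spread over 10 servers cost 10 calls. Once every server can fill a batch, the per-file cost dominates and the number of servers stops mattering.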
<div><br>
</div>
</div>
</div>
</div>
</blockquote>
</div>
</div>
I didn't know that DHT was doing a sequential brick scan
on readdir(p) (my fault). Why is that? Why can't it
return entries that cross a server boundary? Is it due to
a technical reason or only to the current
implementation?<br>
<br>
I've made a test using only directories (50 directories
with 50 subdirectories each). I started with one brick
and measured the time of a recursive 'ls'. Then I
added one brick at a time, up to 6 (all of
them physically independent), and repeated the ls. The
time increases linearly with the number of bricks.
The rebalancing time also grew linearly as more
bricks were added.<br>
<br>
I think this is a big problem for scalability. It can be
partially hidden by using some caching or preloading
mechanisms, but it will be there and it will hit sooner
or later.
<div class="im"><br>
<br>
<blockquote type="cite">
<div dir="ltr">
<div class="gmail_extra">
<div class="gmail_quote">
<div>Note that Brian Foster's readdir-ahead
patch should address this problem to a large
extent. When loaded on top of DHT, the
prefiller effectively collapses the smaller
chunks returned by DHT into a larger chunk
requested by the app/libc.</div>
<div><br>
</div>
</div>
</div>
</div>
</blockquote>
</div>
I've seen it; however, I think it will only partially
mitigate and hide an existing problem. Imagine you have
some hundreds or a thousand bricks. I doubt
readdir-ahead or anything else can hide the enormous
latency that the sequential DHT scan will generate in
that case.<br>
<br>
The main problem I see is that the full directory
structure is read many times sequentially. I think it
would be better to do the readdir(p) calls in parallel
and combine them (possibly in the background). This way
the time to scan the directory structure would be almost
constant, independent of the number of bricks.<br>
</div>
</blockquote>
<div><br>
</div>
<div>The design of the directory entries in DHT makes this
essentially a sequential operation, because entries from
servers are appended, not striped. What I mean is, the
logical ordering of entries is:</div>
<div><br>
</div>
<div>All entries in a directory = All files and dirs on the 0th
server + All files (no dirs) on the 1st server + All files (no
dirs) on the 2nd server + .. + All files (no dirs) on the Nth
server.</div>
<div><br>
</div>
<div>If we read the entries of the 2nd
server along with the entries of the 1st server, we cannot "use"
them until we finish reading all entries of the 1st server and
get EOD from it, which is why readdir-ahead is a more
natural solution than reading in parallel for the above
design.</div>
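That appended (not striped) ordering can be written down directly; the trailing-slash marker for directories below is just an illustration device:

```python
def dht_directory_order(per_server_listings):
    """Logical order of a DHT directory: everything from server 0,
    then files only (no dirs) from each later server, appended in
    server order."""
    entries = list(per_server_listings[0])
    for listing in per_server_listings[1:]:
        entries.extend(e for e in listing if not e.endswith("/"))
    return entries
```

Because the result is a strict concatenation, entries fetched early from server 2 cannot be handed to the application until server 1 has reported EOD.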
<div><br>
</div>
</div>
</div>
</div>
</blockquote>
As I understand it, what the readdir-ahead translator does is
collect one or more answers from the DHT translator and combine them
to return a single answer as big as possible. If that is correct, it
will certainly reduce the number of readdir calls from the
application; however, I think it will still have considerable latency
when used on big clusters. Anyway, I don't have any measurements or a
valid argument to support this, so let's see how readdir-ahead works
in real environments before discussing it.<br>
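If that understanding is right, the prefilling could be sketched like this (the function and parameter names are assumptions, not the readdir-ahead code):

```python
def collapse_chunks(chunks, max_reply):
    """Combine small per-call chunks from the lower translator into
    replies as large as the application's buffer allows."""
    replies, current = [], []
    for chunk in chunks:
        for entry in chunk:
            current.append(entry)
            if len(current) == max_reply:
                replies.append(current)
                current = []
    if current:
        replies.append(current)
    return replies
```

Six one-entry chunks with a reply size of 4 become two replies instead of six, but the chunks must still be fetched from the bricks one after another first.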
<br>
<blockquote
cite="mid:CAFboF2w2ZJzZBN6eHY_Bmi7XHF0xhcSq7pcCp3K+QedRKabJsg@mail.gmail.com"
type="cite">
<div dir="ltr">
<div class="gmail_extra">
<div class="gmail_quote">
<div>Also, this is a problem only if each server has fewer
entries than what can be returned in a single readdir()
request by the application. As long as the server has more
than this "minimum threshold" number of files, the
number of batched readdir() calls made by the client is going
to be fixed, and those various requests will be spread across
various servers (as opposed to sending them all to the
same server).</div>
<div><br>
</div>
</div>
</div>
</div>
</blockquote>
I've seen customers with large amounts of empty, or almost empty,
directories. Don't ask me why, I don't understand it either...<br>
<br>
<blockquote
cite="mid:CAFboF2w2ZJzZBN6eHY_Bmi7XHF0xhcSq7pcCp3K+QedRKabJsg@mail.gmail.com"
type="cite">
<div dir="ltr">
<div class="gmail_extra">
<div class="gmail_quote">
<div>So yes, as you add servers for a given small set of
files the scalability drops, but that is only until you
create more files, at which point the # of servers stops
mattering again.</div>
<div><br>
</div>
<div>Can you share the actual numbers from the tests you
ran?</div>
<div><br>
</div>
</div>
</div>
</div>
</blockquote>
I ran the tests on 6 physical servers (quad-core Atom D525 at 1.8 GHz;
these are the only servers I can regularly use for tests)
connected through a dedicated 1 Gbit switch. Bricks are stored on
1 TB SATA disks with ZFS. One of the servers was also used as a
client to run the tests.<br>
<br>
Initially I created a volume with a single brick. I populated the
volume with 50 directories of 50 subdirectories each (a total of
2500 directories). No files.<br>
<br>
After each test, I added a new brick and started a rebalance. Once
the rebalance was complete, I unmounted and stopped the volume and
then restarted it.<br>
<br>
The test consisted of 4 runs of 'time ls -lR /&lt;testdir&gt; | wc -l'.
The first result was discarded. The result shown below is the mean of
the other 3 results.<br>
<br>
1 brick: 11.8 seconds<br>
2 bricks: 19.0 seconds<br>
3 bricks: 23.8 seconds<br>
4 bricks: 29.8 seconds<br>
5 bricks: 34.6 seconds<br>
6 bricks: 41.0 seconds<br>
12 bricks (2 bricks on each server): 78.5 seconds<br>
<br>
The rebalancing time also grew considerably (these times are the
result of a single rebalance each, so they might not be very
accurate):<br>
<br>
From 1 to 2 bricks: 91 seconds<br>
From 2 to 3 bricks: 102 seconds<br>
From 3 to 4 bricks: 119 seconds<br>
From 4 to 5 bricks: 138 seconds<br>
From 5 to 6 bricks: 151 seconds<br>
From 6 to 12 bricks: 259 seconds<br>
<br>
The number of disk IOPS didn't exceed 40 on any server in any case.
The network bandwidth didn't go beyond 6 Mbit/s between any pair of
servers, and none of them reached 100% core usage.<br>
<br>
Xavi<br>
<br>
<blockquote
cite="mid:CAFboF2w2ZJzZBN6eHY_Bmi7XHF0xhcSq7pcCp3K+QedRKabJsg@mail.gmail.com"
type="cite">
<div dir="ltr">
<div class="gmail_extra">
<div class="gmail_quote">
<div>Avati</div>
<div><br>
</div>
</div>
</div>
</div>
<br>
<fieldset class="mimeAttachmentHeader"></fieldset>
<br>
<pre wrap="">_______________________________________________
Gluster-devel mailing list
<a class="moz-txt-link-abbreviated" href="mailto:Gluster-devel@nongnu.org">Gluster-devel@nongnu.org</a>
<a class="moz-txt-link-freetext" href="https://lists.nongnu.org/mailman/listinfo/gluster-devel">https://lists.nongnu.org/mailman/listinfo/gluster-devel</a>
</pre>
</blockquote>
<br>
</body>
</html>