Hi Edward,

- It will be interesting to see the comparison with the rebalance enhancements which have gone into 3.3 (to get a sense of urgency).
- Even though the proposed scheme is not compatible with the existing DHT (+AFR), I can see ways in which it can be made backward compatible with existing volumes if care is taken in the implementation details.

Avati

On Mon, Apr 16, 2012 at 4:57 PM, Edward Shishkin <edward@redhat.com> wrote:
Hello GlusterFS developers.<br>
<br>
We have found that the current DHT translator is suboptimal: the number<br>
of files being moved during re-balancing (caused by small changes in<br>
the set of bricks) can be significantly reduced (see Appendix).<br>
<br>
To be precise, we can improve scalability: in the current DHT the amount<br>
of re-balancing work scales as O(M) (M is total number of files in the<br>
compound volume), whereas after changing the hashing technique it will<br>
scale as O(M/N) (N is the number of bricks).<br>
<br>
In the document below we first consider simple tables (section 1) and<br>
estimate minimal amount of rebalance work for them. Then we complicate<br>
them with the techniques of virtual nodes (section 2) and replication<br>
(section 3), and show that this doesn't worsen scalability.<br>
<br>
Unfortunately it is impossible to make these improvements without a<br>
format change, so this would be a new translator which won't understand<br>
layouts created by the current DHT (and vice versa).<br>
<br>
We will be happy to see any feedback on this and, if everything is<br>
OK, to proceed with development in this direction (with implementation<br>
details, etc).<br>
<br>
<br>
Thanks,<br>
Edward.<br>
<br>
<br>
Legend:<br>
<br>
<br>
C - namespace;<br>
R - 64-bit ring;<br>
N - number of real bricks that compose the volume;<br>
M - number of files in the compound volume;<br>
S - number of virtual components of any real brick;<br>
R - size of the preference set (replication level); note that the letter R<br>
    is reused: it denotes the ring in sections 1-2 and the replication<br>
    level in section 3;<br>
<br>
<br>
<br>
1. Simple DH tables based on consistent hashing<br>
<br>
<br>
<br>
We consider a 64-bit ring R, i.e. a regular 2^64-gon with vertices<br>
0, ..., 2^64 - 1. "Ring" means that for every vertex A there can be<br>
found vertices B, C of R such that C < A < B. Here "<" and "<=" denote<br>
relations between the respective angles (for any vertex Z we consider<br>
the angle between the rays O0 and OZ, where O is the center of the polygon).<br>
<br>
Then we consider a namespace C and any mapping phi: C -> R such that for<br>
every sequence {c1, c2, ...} of distinct names, {phi(c1), phi(c2),<br>
...} is a pseudo-random sequence with uniform distribution on R<br>
(see 0*).<br>
<br>
<br>
Suppose we have a volume composed of N bricks with unique names<br>
B1, B2, ... B_N. During system initialization we construct for this<br>
compound volume N respective unique tokens phi(B1), ..., phi(B_N)<br>
in the ring R, caching them, say, in an rb-tree to preserve ordering.<br>
<br>
When creating a directory (mkdir(2)) we create respective directories<br>
on all bricks B_1, B_2, ... B_N.<br>
<br>
When creating a regular file (creat(2), etc.) with a short name F,<br>
we create a respective file on only one brick B = B(F), which is<br>
determined in the following way: phi(B) is the minimal token in the<br>
ring such that phi(F) <= phi(B). This is where the notion of a ring<br>
comes into play: if there is no such token in [phi(F), 2^64 - 1], then<br>
we continue the search from vertex 0.<br>
<br>
Lookup operation for any regular file F resolves to lookup(F) on the<br>
respective brick B(F).<br>
<br>
Deleting a regular file F resolves to deleting a file from the brick<br>
B(F).<br>
<br>
Looking up a brick (i.e. computing F -> B(F)) is a well-scalable<br>
operation: log(N) actions are required.<br>
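To make this concrete, here is a minimal Python sketch of the ring and the
F -> B(F) lookup described above; it is only an illustration, not the
proposed implementation. Tokens are kept in a sorted list (an rb-tree in a
real translator) and binary search gives the O(log N) lookup; phi() here is
md5 truncated to 64 bits, chosen purely for the example.

import bisect
import hashlib

RING = 2 ** 64

def phi(name):
    # map a name to a vertex of the 64-bit ring
    return int.from_bytes(hashlib.md5(name.encode()).digest()[:8], 'big') % RING

class Ring:
    def __init__(self, bricks):
        # unique brick tokens, kept sorted (an rb-tree in a real translator)
        self.tokens = sorted((phi(b), b) for b in bricks)

    def brick_of(self, fname):
        # minimal token phi(B) such that phi(F) <= phi(B);
        # wrap around to vertex 0 if there is no such token -- O(log N)
        i = bisect.bisect_left(self.tokens, (phi(fname), ''))
        return self.tokens[i % len(self.tokens)][1]

ring = Ring(['host:/root/exp%d' % i for i in range(9)])
print(ring.brick_of('file00'))    # the single brick that should hold file00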
<br>
When adding a new brick X = B_(N+1) we add a new unique token phi(X)<br>
to the ring R and move a subset of files from B_j to X, where B_j is<br>
the brick with the smallest phi(B_j) such that phi(X) < phi(B_j).<br>
Namely, every file W of B_j with phi(W) <= phi(X) should be moved to X.<br>
Thus, the number of files to be moved during re-balancing is not<br>
larger than the number of files contained in one brick (B_j in our case).<br>
<br>
When removing a brick Y = B_k, we first find in R the "closest" brick<br>
B_s, which has the minimal phi(B_s) such that phi(Y) < phi(B_s), and move<br>
all files from Y to B_s (no scans are required).<br>
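Continuing the sketch above, the following shows which files actually move
in the S = 1, R = 1 case: adding a brick steals only part of one brick's
files, and removing a brick forwards its files to its closest successor.
The 'files' dict (brick name -> set of file names) is a hypothetical
stand-in for the bricks' contents.

def add_brick(ring, files, new_brick):
    # B_j: the brick whose arc the new token splits
    successor = ring.brick_of(new_brick)
    bisect.insort(ring.tokens, (phi(new_brick), new_brick))
    files[new_brick] = set()
    # only files of B_j with phi(W) <= phi(X) move; at most one brick's
    # worth of files is touched
    for f in list(files[successor]):
        if ring.brick_of(f) == new_brick:
            files[successor].remove(f)
            files[new_brick].add(f)

def remove_brick(ring, files, brick):
    ring.tokens.remove((phi(brick), brick))
    # everything goes to the closest successor B_s; no other brick is scanned
    files[ring.brick_of(brick)] |= files.pop(brick)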
<br>
<br>
Such a hashing technique is called "consistent hashing associated with<br>
the variable phi, which has uniform distribution". This is a<br>
relatively new technique suggested by Karger et al. (1*). The main<br>
advantage of this technique is that small changes in the set of bricks<br>
result in a small amount of rebalancing work. Namely, adding/removing 1<br>
brick results in moving only M/N files (2*) (M is the total number of<br>
files in the compound volume). This is M/(M/N) = N times better than<br>
with traditional hashing, where we need to move all M files (median<br>
value). In particular, if we have 100 bricks, then with traditional<br>
hashing we'll need to move 100 times more files than with consistent hashing.<br>
<br>
To construct a uniformly distributed phi we can use any well-mixing<br>
64-bit hash, e.g. FNV hash, etc.<br>
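For reference, a possible phi based on the 64-bit FNV-1a hash; any
comparable well-mixing 64-bit hash would do just as well.

FNV64_OFFSET = 0xcbf29ce484222325
FNV64_PRIME = 0x100000001b3

def fnv1a_64(name):
    # classic FNV-1a, reduced to the 2^64 ring
    h = FNV64_OFFSET
    for byte in name.encode('utf-8'):
        h ^= byte
        h = (h * FNV64_PRIME) & 0xFFFFFFFFFFFFFFFF
    return h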
<br>
Comment 1. The technique of consistent hashing is used in Amazon's<br>
Dynamo (4*)<br>
<br>
Comment 2. There is a disadvantage: in this approach all files<br>
<br>
/foo<br>
/dir1/foo<br>
/dir1/dir2/foo<br>
...<br>
<br>
will be accumulated on the same brick. However, it is possible to<br>
"salt" short file names with the gfid (or another id) of the respective<br>
directory before hashing, to avoid this clustering and possible attacks.<br>
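A sketch of the salting idea, assuming the parent directory's gfid is
available as bytes (GlusterFS keeps a 16-byte gfid per directory); mixing
it into the hashed key makes /foo and /dir1/foo hash independently.

def phi_salted(parent_gfid, short_name):
    # hash "gfid/short_name" instead of the bare short name
    return fnv1a_64(parent_gfid.hex() + '/' + short_name)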
<br>
<br>
<br>
2. Virtual nodes<br>
<br>
<br>
<br>
The theory above works well for a large number of bricks N. However,<br>
when N is too small (2-3 bricks), a uniform distribution can result<br>
in bad partitioning, so that one brick accumulates many more files than<br>
the others, which is not good. A graceful solution to this problem<br>
is the so-called technique of "virtual nodes": for every brick we place S<br>
tokens on the ring, where S is the number of "virtual components" of a<br>
brick. So, every brick is represented by S unique tokens on the ring<br>
(S >= 1; S is a parameter of the cluster translator).<br>
<br>
In the case of S > 1 the lookup-a-brick procedure above is unchanged:<br>
the difference is that we search in a larger set of tokens (N*S), and<br>
since log(N*S) == log(N) + log(S) == log(N) + const, this search also<br>
scales as log(N), while with a larger number of tokens the<br>
distribution of files becomes more balanced (in terms of the standard<br>
deviation; see (3*) and the Appendix for details). In particular, S=100<br>
provides a deviation of ~10% of the mean.<br>
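A small experiment sketch, building on the Python above, that places S
tokens per brick and measures the per-brick standard deviation. The token
names "brick#i" are just an illustrative convention; exact numbers will
differ from the Appendix.

import statistics

def make_tokens(bricks, S):
    # every real brick contributes S virtual tokens
    return sorted((phi('%s#%d' % (b, i)), b) for b in bricks for i in range(S))

def brick_of(tokens, fname):
    i = bisect.bisect_left(tokens, (phi(fname), ''))
    return tokens[i % len(tokens)][1]

def deviation(bricks, S, nfiles=10000):
    tokens = make_tokens(bricks, S)
    counts = {b: 0 for b in bricks}
    for k in range(nfiles):
        counts[brick_of(tokens, 'file%05d' % k)] += 1
    return statistics.pstdev(counts.values())

bricks = ['B%d' % i for i in range(9)]
for S in (1, 7, 20, 100):
    print(S, deviation(bricks, S))   # deviation generally shrinks as S grows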
<br>
Adding a new brick with S > 1 looks like the above, with the difference<br>
that we steal files from S > 1 different old virtual bricks. Note that<br>
2 or more of those virtual bricks can represent the same real brick,<br>
though. So adding a real brick with S virtual components requires<br>
(M/N)*S scans; however, the median number of files to be moved during<br>
re-balancing is the same (M/N) as in the case of S == 1.<br>
<br>
Removing a brick with S > 1 virtual components mostly looks like the<br>
case of S == 1: no scans are required. The difference is that we<br>
distribute the files of the brick to be removed among S virtual bricks<br>
(which correspond to <= S real bricks).<br>
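A sketch of removal with S > 1 virtual components, reusing the helpers
above: once the departing brick's S tokens are dropped, each of its files
is simply re-looked-up, so the files spread over the (up to S) real bricks
that now own the freed arcs; no other brick needs to be scanned.

def remove_brick_virtual(tokens, files, brick):
    # drop all S tokens of the departing brick
    tokens[:] = [t for t in tokens if t[1] != brick]
    # re-route its files to the new owners of the freed arcs
    for f in files.pop(brick):
        files[brick_of(tokens, f)].add(f)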
<br>
<br>
<br>
3. Replication and Recovery<br>
<br>
<br>
<br>
To achieve high availability and durability we replicate files on<br>
multiple bricks. In our case replication can be implemented as a set<br>
of operations with the same ring R, so we don't create a separate<br>
translator for replication.<br>
<br>
We store every file in its so-called preference set of real bricks.<br>
The size R of this set is a volume option. R is also called the<br>
replication level (R = 1 means no replication: every file is stored<br>
on only a single brick).<br>
<br>
For every file F its preference set is defined as the set of "closest"<br>
virtual bricks B_(k_1), ..., B_(k_R), which represent pairwise<br>
different real bricks, such that B_(k_1) = B(F) and<br>
phi(B_(k_1)) < phi(B_(k_2)) < ... < phi(B_(k_R)).<br>
<br>
We don't create 2 replicas of the same file on the same real brick,<br>
so, R shouldn't be larger than N.<br>
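A sketch of the preference-set computation under the definitions above,
reusing the token list from the earlier sketches: start from the token
owning phi(F) and walk the ring clockwise, collecting the first R pairwise
different real bricks (so R must not exceed the number of real bricks).

def preference_set(tokens, fname, R):
    start = bisect.bisect_left(tokens, (phi(fname), ''))
    bricks = []
    for j in range(len(tokens)):
        b = tokens[(start + j) % len(tokens)][1]
        if b not in bricks:
            bricks.append(b)
            if len(bricks) == R:
                break
    return bricks       # bricks[0] == B(F), the "primary" replica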
<br>
If we enable replication (R>1), regular file operations become more<br>
complicated: every such operation is performed for all respective<br>
files located on all bricks of the preference set.<br>
<br>
Operations on a set of bricks also become more complicated, but<br>
scalability doesn't suffer. When adding a new brick X = B_(N+1), we<br>
<br>
0) add a unique token phi(X) to the ring R.<br>
<br>
1) find R closest tokens B_(m_1), ..., B_(m_R), which represent<br>
different real bricks in the ring, so that B_(m_R) == X,<br>
and phi(B_(m_1)) < ... < phi(B_(m_R)).<br>
<br>
2) find R+1 closest tokens B_(p_0), ..., B_(p_R), which represent<br>
different real bricks in the ring, so that B_(p_0) == X,<br>
and phi(B_(p_0)) < ... < phi(B_(p_R)).<br>
<br>
3) The new brick X steals a portion of the files of B_(p_1), as<br>
described in section 1 above.<br>
<br>
4) The brick B_(p_R) no longer belongs to the preference set of<br>
the stolen files, so we need to remove all the respective replicas<br>
from B_(p_R).<br>
<br>
5) X becomes a brick belonging to the preference sets of files stored<br>
in the bricks B_(m_1),... , B_(m_(R-1)), hence we should create<br>
respective replicas on X.<br>
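Steps (1) and (2) amount to walking the ring from the new token in both
directions until R (respectively R+1) pairwise different real bricks have
been seen. A small sketch, reusing the token list from the sketches above;
it assumes the requested count does not exceed the number of real bricks.

def ring_neighbours(tokens, x_token, count, step):
    # step = +1 collects successors (step 2), step = -1 predecessors (step 1);
    # the result starts with X itself and contains 'count' distinct real bricks
    j = tokens.index(x_token)
    bricks = [x_token[1]]
    while len(bricks) < count:
        j = (j + step) % len(tokens)
        if tokens[j][1] not in bricks:
            bricks.append(tokens[j][1])
    return bricks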
<br>
So adding a new brick with replication level R > 1 results in<br>
<br>
a) moving a portion of files of one brick (step (3) above);<br>
b) replication of files located on R-1 bricks (step (5) above);<br>
c) deleting replicas of a portion of files of one brick (step (4)).<br>
<br>
(a),(b),(c) above can be considered as re-balancing of (R+1)*(M/N) =<br>
const*(M/N) files (when R == 1, then (b),(c) are absent, and we need<br>
to re-balance M/N files, as was shown in section 1 above).<br>
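As an illustrative example (numbers assumed, not taken from the Appendix):
with N = 10 bricks, M = 1,000,000 files and R = 3, adding one brick touches
about (R+1)*(M/N) = 4 * 100,000 = 400,000 files: ~100,000 moved to the new
brick (a), ~200,000 replicas created on it (b), and ~100,000 stale replicas
deleted (c), instead of anything close to all M files.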
<br>
Similarly, we can show that with replication level R, removing one<br>
brick also leads to re-balancing of const*(M/N) files.<br>
<br>
<br>
If in our configuration L < R bricks don't respond for some reason,<br>
then all regular file operations are still defined; however, our system<br>
is marked as "unhealthy" (in some papers this state is called "sloppy<br>
quorum"), non-responding bricks are marked as "missed" in the ring, and<br>
file operations are performed on the other available bricks of the<br>
preference set. In such operations, files on the available "non-<br>
primary" bricks are marked as "not synced with the missed replicas".<br>
<br>
In the state of sloppy quorum, operations like adding/removing a node may<br>
already be undefined. For example, when adding a brick we may need to<br>
steal files from a brick which doesn't respond.<br>
<br>
So we need to return the system to a "consistent" state, in which all<br>
operations are defined. This can be done in the following ways:<br>
<br>
1) Make sure that all missed bricks are available again and perform<br>
   L1-recovery. L1-recovery means syncing all marked files with the<br>
   again-available bricks, so that the resulting consistent system will<br>
   have the same number N of bricks.<br>
2) Add new empty bricks instead of the missed ones and perform L2-recovery.<br>
   This means filling the new empty bricks with files from other bricks,<br>
   so that the resulting consistent system will have the same number N of<br>
   bricks.<br>
3) Remove the missed bricks from the ring and perform L3-recovery, so that<br>
   the resulting consistent system will have a smaller number of nodes (N-L).<br>
<br>
Comment 1. For any missed brick we can specify a different type of<br>
recovery.<br>
<br>
Comment 2. When R == N, replication "clogs" the distribution: in this<br>
case our system will work like a set of mirrors: every brick will contain<br>
the same set of files.<br>
<br>
<br>
<br>
APPENDIX<br>
<br>
<br>
<br>
------------------------------------------------------------------------<br>
<br>
In 3 distributed hash tables with different hashing techniques<br>
<br>
. GlusterFS DHT translator (3.2.5)<br>
. 64-bit ring with phi based on md5, R=1 (no replication), S=7<br>
. 64-bit ring with phi based on md5, R=1 (no replication), S=20<br>
<br>
we run the same scenario:<br>
<br>
1) Create 100 files ("file00", "file01", ..., "file99") in a volume<br>
composed of 9 bricks:<br>
<br>
"host:/root/exp0",<br>
"host:/root/exp1",<br>
...<br>
<br>
"host:/root/exp8".<br>
<br>
2) Add one brick "host:/root/exp9";<br>
3) re-balance;<br>
<br>
<br>
I. GlusterFS DHT translator (3.2.5)<br>
<br>
------------------------------------------------------------------------<br>
<br>
before re-balancing:<br>
<br>
exp0: file15 file22 file34 file35 file51 file6 file68 file78<br>
file8 file81 file89 file94 file95<br>
exp1: file10 file28 file3 file4 file43 file66 file75<br>
exp2: file40 file46 file47 file48 file50 file58 file86 file9<br>
exp3: file12 file13 file32 file37 file54 file55 file7 file71<br>
file91<br>
exp4: file31 file38 file41 file42 file53 file62 file63 file69<br>
file93 file97<br>
exp5: file11 file16 file17 file24 file25 file27 file29 file44<br>
file56 file73 file74 file80 file87 file90<br>
exp6: file0 file1 file2 file33 file36 file49 file57 file59<br>
file64 file77 file79 file84 file85 file88 file98<br>
exp7: file21 file26 file39 file52 file61 file70 file72 file83<br>
file92 file99<br>
exp8: file14 file20 file23 file30 file45 file5 file60 file65<br>
file67 file76 file82 file96<br>
<br>
after re-balancing:<br>
<br>
exp0: file11 file16 file17 file24 file25 file31 file44 file53<br>
file62 file69 file73 file80 file87 file93 file97<br>
exp1: file0 file27 file29 file33 file36 file56 file57 file64<br>
file74 file77 file84 file88 file90 file98<br>
exp2: file1 file2 file39 file49 file59 file72 file79 file85<br>
file92<br>
exp3: file21 file26 file30 file45 file52 file60 file61 file65<br>
file70 file83 file99<br>
exp4: file14 file20 file23 file5 file67 file76 file82 file96<br>
exp5: file15 file22 file34 file35 file51 file6 file68 file78<br>
file8 file81 file89 file94 file95<br>
exp6: file10 file28 file4 file43 file66 file75<br>
exp7: file3 file40 file47 file58 file9<br>
exp8: file12 file13 file32 file37 file46 file48 file50 file7<br>
file71 file86<br>
exp9: file38 file41 file42 file54 file55 file63 file91<br>
<br>
<br>
As a result, 98 files have been rebalanced (139 files scanned in total).<br>
<br>
<br>
<br>
II. 64-bit ring with phi based on md5.<br>
Every brick has S=7 virtual components<br>
<br>
------------------------------------------------------------------------<br>
<br>
before re-balancing:<br>
<br>
exp0: file02 file18 file22 file42 file48 file58 file62 file70<br>
exp1: file01 file08 file15 file23 file33 file51 file64 file75<br>
file82 file85 file86 file87 file95<br>
exp2: file00 file10 file11 file14 file25 file29 file40 file45<br>
file63 file81 file91 file96<br>
exp3: file09 file16 file19 file21 file24 file28 file32 file35<br>
file36 file44 file47 file50 file52 file57 file67 file73<br>
file88 file98<br>
exp4: file27 file49 file53 file55 file69 file97<br>
exp5: file05 file20 file43<br>
exp6: file34 file68 file72 file74 file79<br>
exp7: file03 file04 file26 file39 file41 file54 file60 file71<br>
file77 file78 file83 file84 file89 file93 file94<br>
exp8: file06 file07 file12 file17 file30 file31 file37 file38<br>
file46 file56 file59 file61 file65 file66 file76 file80<br>
file90 file92 file99<br>
<br>
after re-balancing:<br>
only the following files have been moved (to exp9):<br>
<br>
exp0: file70<br>
exp1: file82 file85<br>
exp2: file00 file14 file25 file40 file45 file91 file96<br>
exp3: file88<br>
<br>
As a result, 11 files have been rebalanced (51 files scanned in total).<br>
<br>
<br>
<br>
III. 64-bit ring with phi based on md5.<br>
Every brick has S=20 virtual components<br>
<br>
------------------------------------------------------------------------<br>
<br>
before re-balancing:<br>
<br>
exp0: file00 file04 file22 file30 file40 file42 file70 file96<br>
exp1: file06 file08 file13 file15 file17 file19 file23 file32<br>
file33 file78 file81 file86 file95<br>
exp2: file11 file14 file16 file24 file29 file57 file63 file67<br>
file73 file76<br>
exp3: file02 file10 file12 file18 file21 file28 file35 file51<br>
file56 file59 file80 file87<br>
exp4: file39 file41 file49 file53 file54 file62 file69 file77<br>
file83 file84 file91<br>
exp5: file05 file20 file31 file43 file68 file74<br>
exp6: file34 file37 file38 file46 file48 file58 file66 file71<br>
file75 file79 file88<br>
exp7: file03 file07 file26 file47 file60 file72 file89 file93<br>
file94<br>
exp8: file01 file09 file25 file27 file36 file44 file45 file50<br>
file52 file55 file59 file61 file64 file65 file82 file85<br>
file90 file92 file97 file98 file99<br>
<br>
after re-balancing:<br>
only the following files have been moved (to exp9):<br>
<br>
exp0: file70<br>
exp6: file88<br>
exp7: file07 file93<br>
exp8: file45 file82 file85<br>
<br>
As a result, 7 files have been rebalanced (48 files scanned in total).<br>
<br>
<br>
------------------------------------------------------------------------<br>
<br>
Table with results<br>
<br>
Re-balance results and standard deviations (sd)<br>
for current gluster DHT translator (3.2.5), and<br>
for 64-bit rings with S=7 and S=20.<br>
<br>
------------------------------------------------------------------------<br>
                 DHT (Gluster 3.2.5)   64-bit ring, S=7   64-bit ring, S=20<br>
------------------------------------------------------------------------<br>
files moved              98                   11                  7<br>
files scanned           139                   51                 48<br>
sd before               2.8                  5.4                3.7<br>
sd after                3.2                  5.3                3.2<br>
------------------------------------------------------------------------<br>
<br>
<br>
<br>
----------<br>
<br>
<br>
(0*) http://en.wikipedia.org/wiki/Uniform_distribution_%28discrete%29<br>
(1*) Karger, D.; Lehman, E.; Leighton, T.; Panigrahy, R.; Levine, M.; Lewin, D.<br>
(1997). "Consistent Hashing and Random Trees: Distributed Caching Protocols<br>
for Relieving Hot Spots on the World Wide Web".<br>
(2*) M/N is a median value.<br>
(3*) http://weblogs.java.net/blog/tomwhite/archive/2007/11/consistent_hash.html<br>
(4*) http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html<br>
<br>