Hi Edward,<div>- It will be interesting to see the comparison with the rebalance enhancements which have gone into 3.3 (to get a sense of urgency).</div><div>- Even though the proposed scheme is not compatible with existing DHT (+AFR), I can see ways in which it can be backward compatible with existing volumes if care is taken in implementation details.</div>

<div><br></div><div>Avati<br><br><div class="gmail_quote">On Mon, Apr 16, 2012 at 4:57 PM, Edward Shishkin <span dir="ltr">&lt;<a href="mailto:edward@redhat.com">edward@redhat.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Hello GlusterFS developers.<br>

<br>

We have found that current DHT translator is suboptimal: the number<br>

of files being moved during re-balancing (caused by small changes in<br>

the set of bricks) can be significantly reduced (see Appendix).<br>

<br>

To be precise, we can improve scalability: in the current DHT the amount<br>

of re-balancing work scales as O(M) (M is total number of files in the<br>

compound volume), whereas after changing the hashing technique it will<br>

scale as O(M/N) (N is the number of bricks).<br>

<br>

In the document below we first consider simple tables (section 1) and<br>

estimate minimal amount of rebalance work for them. Then we complicate<br>

them with techniques of virtual nodes (section 2) and replication<br>

(section 3), and show that this doesn&#39;t worse scalability.<br>

<br>

Unfortunately it is impossible to perform the improvements without<br>

format change, so it would be a new translator, which won&#39;t understand<br>

layouts created by current DHT (and back).<br>

<br>

We will be happy to see any feedbacks on this, and if everything is<br>

OK, to proceed development in this direction (with implementation<br>

details, etc).<br>

<br>

<br>

Thanks,<br>

Edward.<br>

<br>

<br>

                           Legend:<br>

<br>

<br>

C - namespace;<br>

R - 64-bit ring;<br>

N - number of real bricks that compose the volume;<br>

M - number of files in the compound volume;<br>

S - number of virtual components of any real brick;<br>

R - size of preference set (replication level);<br>

<br>

<br>

<br>

         1. Simple DH tables based on consistent hashing<br>

<br>

<br>

<br>

We consider a 64-bit ring R, i.e. a regular 2^64-polygon with vertexes<br>

0, ..., 2^64 - 1. &quot;Ring&quot; means that for every vertex A there can be<br>

found vertexes B, C of R, so that C &lt; A &lt; B. Here &quot;&lt;&quot; and &quot;&lt;=&quot; mean<br>

relations between respective angles (for any vertex Z we consider the<br>

angle composed of O0 and OZ, where O is the center of the polygon).<br>

<br>

Then we consider namespace C and any mapping phi: C -&gt; R, so that for<br>

every sequence {c1, c2, ... } of different names {phi(c1), phi(c2),<br>

...) is a pseudo-random variable, which has uniform distribution on R<br>

(see 0*).<br>

<br>

<br>

Suppose we have a volume composed of N bricks with unique names<br>

B1, B2, ... B_N. During system initialization we construct for this<br>

compound volume N respective unique tokens phi(B1), ..., phi(B_N)<br>

in the ring R, caching them, say, in rb-tree to preserve ordering.<br>

<br>

When creating a directory (mkdir(2)) we create respective directories<br>

on all bricks B_1, B_2, ... B_N.<br>

<br>

When creating a regular file (creat(2), etc) with a short name F,<br>

we create a respective file only in one brick B = B(F), which is<br>

determined by the following way: phi(B) is the minimal token in the<br>

ring so that phi(F) &lt;= phi(B). This is where the the notion of ring<br>

works: if there is no any such token in the [phi(F), 2^64 - 1], then<br>

we continue search from the vertex 0.<br>

<br>

Lookup operation for any regular file F resolves to lookup(F) on the<br>

respective brick B(F).<br>

<br>

Deleting a regular file F resolves to deleting a file from the brick<br>

B(F).<br>

<br>

Looking for a brick (i.e. calculation F-&gt;B(F)) is a well-scalable<br>

operation: log(N) actions is required.<br>

<br>

When adding a new brick X = B_(N+1) we set a new unique token phi(X)<br>

to the ring R and move a subset of files from B_j to X, where B_j is<br>

the brick with the smallest phi(B_j), so that phi(X) &lt; phi(B_j).<br>

Namely, every file W of B_j with phi(W) &lt;= phi(X) should be moved to X.<br>

That said, the number of files to be moved during re-balancing is not<br>

larger than a number of files contained in one brick (B_j in our case)<br>

<br>

When removing a brick Y = B_k, we first find in R the &quot;closest&quot; brick<br>

B_s, which has minimal phi(B_s), so that phi(Y) &lt; phi(B_s), and move<br>

all files from Y to B_s (no scans is required).<br>

<br>

<br>

Such hashing technique is called &quot;consistent hashing associated with<br>

the variable phi, which has uniform distribution&quot;. This is a<br>

relatively new technique suggested by Karger at al (1*). The main<br>

advantage of this technique is that small changes in the set of bricks<br>

result in small amount of rebalancing work. Namely, adding/removing 1<br>

brick results in moving of only M/N files (2*) (M is total number of<br>

files in the compound volume). This is M/(M/N) = N times better then<br>

with traditional hashing, where we need to move all M files (median<br>

value). In particular, if we have 100 bricks, then with traditional<br>

hashing we&#39;ll need to move x100 files more than with consistent one.<br>

<br>

To construct a uniform distribution phi we can have any well-mixing<br>

64-hash, say fnv-hash, etc..<br>

<br>

Comment 1. The technique of consistent hashing is used in Amazon&#39;s<br>

Dynamo (4*)<br>

<br>

Comment 2. There is a disadvantage: in this approach all files<br>

<br>

/foo<br>

/dir1/foo<br>

/dir1/dir2/foo<br>

...<br>

<br>

will be accumulated on the same brick. However it is possible to<br>

&quot;salt&quot; a short file names with gfid (or another id) of respective<br>

directory before hashing, to avoid possible attacks.<br>

<br>

<br>

<br>

                      2. Virtual nodes<br>

<br>

<br>

<br>

The theory above works well for larger number of bricks N. However,<br>

when N is too small (2-3 bricks), then uniform distribution can result<br>

in bad partitioning, so one brick will accumulate much more files then<br>

other ones, which is not good. The graceful solution of this problem<br>

is so-called technique of &quot;virtual nodes&quot;: with every brick we set S<br>

tokens on the ring, where S is a number of &quot;virtual components&quot; of a<br>

brick. So, every brick is represented by S unique tokens on the ring<br>

(S &gt;= 1, S is a parameter of the cluster translator).<br>

<br>

In the case of S&gt;1 the lookup-a-brick procedure above is not changed:<br>

the difference is that we search in a larger set of tokens (N*S), and<br>

since log(N*S) == log(N) + log(S) == log(N) + const, this search also<br>

scales as log(N), while with a larger number of tokens the<br>

distribution of files becomes more balanced (in terms of the standard<br>

deviation, see (3*) and the Appendix for details. In particular, S=100<br>

provides deviation ~10% of the mean.<br>

<br>

Adding a new brick with S &gt; 1 looks like above with the difference<br>

that we steal files of S &gt; 1 different old virtual bricks. Note, that<br>

2 or more of those virtual bricks can represent the same real brick<br>

though. So adding a real brick with S virtual components requires<br>

(M/N)*S scans, however, a median number of files to be moved during<br>

re-balancing is the same (M/N) as in the case of S==1.<br>

<br>

Removing a brick with S &gt; 1 virtual components mostly looks like in<br>

the case of S == 1: no scans is requires. The difference is that we<br>

distribute files of the brick to be removed among S virtual bricks<br>

(which correspond to &lt;= S real bricks).<br>

<br>

<br>

<br>

                   3. Replication and Recovery<br>

<br>

<br>

<br>

To achieve high availability and durability we replicate files on<br>

multiple bricks. In our case replication can be implemented as a set<br>

of operations with the same ring R, so we don&#39;t create a separate<br>

translator for replication.<br>

<br>

We store every file in its so-called preference set of real bricks.<br>

Ordinal number R of this set is the volume option. R is also called<br>

replication level (R = 1 means no replication: every file is stored<br>

only in a single brick).<br>

<br>

For every file F its preference set is defined as a set of &quot;closest&quot;<br>

virtual bricks B_(k_1), ... , B(k_R), which represent pairwise<br>

different real bricks, so that B_(k_1) = B(F), and<br>

phi(B_(k_1)) &lt; phi(B_(k_2)) &lt; ... &lt; phi(B_(k_R)).<br>

<br>

We don&#39;t create 2 replicas of the same file on the same real brick,<br>

so, R shouldn&#39;t be larger than N.<br>

<br>

If we enable replication (R&gt;1), regular file operations become more<br>

complicated: every such operation is performed for all respective<br>

files located on all bricks of the preference set.<br>

<br>

Operations on a set of bricks also become more complicated, but<br>

scalability doesn&#39;t suffer. When adding a new brick X = B_(N+1), we<br>

<br>

0) set a unique token phi(X) to the ring R.<br>

<br>

1) find R closest tokens B_(m_1), ..., B_(m_R), which represent<br>

   different real bricks in the ring, so that B_(m_R) == X,<br>

   and phi(B_(m_1)) &lt; ... &lt; phi(B_(m_R)).<br>

<br>

2) find R+1 closest tokens B_(p_0), ..., B_(p_R), which represent<br>

   different real bricks in the ring, so that B_(p_0) == X,<br>

   and phi(B_(p_0)) &lt; ... &lt; phi(B_(p_R)).<br>

<br>

3) The new brick X steals a portion of files of B_(p_1) as it has been<br>

   described in section (1) above.<br>

<br>

4) The brick B_(p_R) becomes not belonging to the preference set of<br>

   the stolen files, so we need to remove all the respective replicas<br>

   from B_(p_R).<br>

<br>

5) X becomes a brick belonging to the preference sets of files stored<br>

   in the bricks B_(m_1),... , B_(m_(R-1)), hence we should create<br>

   respective replicas on X.<br>

<br>

So adding a new brick with replication level R &gt; 1 results in<br>

<br>

a) moving a portion of files of one brick (step (3) above);<br>

b) replication of files located on on R-1 bricks (step (5) above);<br>

c) deleting replicas of a portion of files of one brick (step (4)).<br>

<br>

(a),(b),(c) above can be considered as re-balancing of (R+1)*(M/N) =<br>

const*(M/N) files (when R == 1, then (b),(c) are absent, and we need<br>

to re-balance M/N files, as it was shown in the section 1 above).<br>

<br>

Similarly we can show that with level of replication R removing one<br>

brick also leads to re-balancing of const*(M/N) files.<br>

<br>

<br>

If in our configuration L &lt; R bricks don&#39;t respond for some reasons,<br>

then all regular file operations are still defined, however our system<br>

is marked as &quot;unhealthy&quot; (in some papers this state is called &quot;sloppy<br>

quorum&quot;), non-responding bricks are marked as &quot;missed&quot; in the ring and<br>

file operation are performed on other available bricks of the<br>

preference set. In such operations files on the available &quot;non-<br>

primary&quot; bricks are marked as &quot;not synced with the missed replicas&quot;.<br>

<br>

In the state of sloppy quorum operations like add/remove a node can<br>

be already undefined. For example, when adding a brick we&#39;ll need to<br>

steal files from a brick, which doesn&#39;t respond.<br>

<br>

So we need to return the system back to a &quot;consistent&quot; state, when all<br>

operations are defined. It can be done by the following ways:<br>

<br>

1) Make sure that all missed bricks are available again and perform<br>

   L1-recovery. L1-recovery means syncing all marked files with the<br>

   again available bricks, so that resulting consistent system will<br>

   have the same number N of bricks.<br>

2) Add new empty bricks instead of missed ones and perform L2-recovery<br>

   It means filling the new empty bricks with files from other bricks,<br>

   so that resulting consistent system will have the same number N of<br>

   bricks.<br>

3) Remove missed bricks from the ring and perform L3-recovery, so that<br>

   resulting consistent system will have smaller number of nodes (N-L)<br>

<br>

Comment 1. For any missed brick we can specify different type of<br>

recovery.<br>

<br>

Comment 2. When R == N replication &quot;clogs&quot; the distribution: in this<br>

case our system will work like mirrors: every bricks will contain<br>

the same set of files.<br>

<br>

<br>

<br>

                            APPENDIX<br>

<br>

<br>

<br>

------------------------------<u></u>------------------------------<u></u>----------<br>

<br>

In 3 distributed hash tables with different hashing techniques<br>

<br>

. GlusterFS DHT translator (3.2.5)<br>

. 64-bit ring with phi based on md5, R=1 (no replication), S=7<br>

. 64-bit ring with phi based on md5, R=1 (no replication), S=20<br>

<br>

we run the same scenario:<br>

<br>

1) Create 100 files (&quot;file00&quot;, &quot;file01&quot;, ..., &quot;file99&quot;) in a volume<br>

   composed of 9 bricks:<br>

<br>

   &quot;host:/root/exp0&quot;,<br>

   &quot;host:/root/exp1&quot;,<br>

   ...<br>

<br>

   &quot;host:/root/exp8&quot;.<br>

<br>

2) Add one brick &quot;host:/root/exp9&quot;;<br>

3) re-balance;<br>

<br>

<br>

I. GlusterFS DHT translator (3.2.5)<br>

<br>

------------------------------<u></u>------------------------------<u></u>----------<br>

<br>

before re-balancing:<br>

<br>

exp0: file15  file22  file34  file35  file51  file6  file68  file78<br>

      file8   file81  file89  file94  file95<br>

exp1: file10  file28  file3   file4   file43  file66  file75<br>

exp2: file40  file46  file47  file48  file50  file58  file86  file9<br>

exp3: file12  file13  file32  file37  file54  file55  file7  file71<br>

      file91<br>

exp4: file31  file38  file41  file42  file53  file62  file63  file69<br>

      file93  file97<br>

exp5: file11  file16  file17  file24  file25  file27  file29  file44<br>

      file56  file73  file74  file80  file87  file90<br>

exp6: file0   file1   file2   file33  file36  file49  file57  file59<br>

      file64  file77  file79  file84  file85  file88  file98<br>

exp7: file21  file26  file39  file52  file61  file70  file72  file83<br>

      file92  file99<br>

exp8: file14  file20  file23  file30  file45  file5   file60  file65<br>

      file67  file76  file82  file96<br>

<br>

after re-balancing:<br>

<br>

exp0: file11  file16  file17  file24  file25  file31  file44  file53<br>

      file62  file69  file73  file80  file87  file93  file97<br>

exp1: file0   file27  file29  file33  file36  file56  file57  file64<br>

      file74  file77  file84  file88  file90  file98<br>

exp2: file1   file2   file39  file49  file59  file72  file79  file85<br>

      file92<br>

exp3: file21  file26  file30  file45  file52  file60  file61  file65<br>

      file70  file83  file99<br>

exp4: file14  file20  file23  file5   file67  file76  file82  file96<br>

exp5: file15  file22  file34  file35  file51  file6   file68  file78<br>

      file8   file81  file89  file94  file95<br>

exp6: file10  file28  file4   file43  file66  file75<br>

exp7: file3   file40  file47  file58  file9<br>

exp8: file12  file13  file32  file37  file46  file48  file50  file7<br>

      file71  file86<br>

exp9: file38  file41  file42  file54  file55  file63  file91<br>

<br>

<br>

as the result 98 files have been rebalanced (total files scanned 139)<br>

<br>

<br>

<br>

II.  64-bit ring with phi based on md5.<br>

     Every brick has number of virtual components S=7<br>

<br>

------------------------------<u></u>------------------------------<u></u>----------<br>

<br>

before re-balancing:<br>

<br>

exp0: file02  file18  file22  file42  file48  file58  file62  file70<br>

exp1: file01  file08  file15  file23  file33  file51  file64  file75<br>

      file82  file85  file86  file87  file95<br>

exp2: file00  file10  file11  file14  file25  file29  file40  file45<br>

      file63  file81  file91  file96<br>

exp3: file09  file16  file19  file21  file24  file28  file32  file35<br>

      file36  file44  file47  file50  file52  file57  file67  file73<br>

      file88  file98<br>

exp4: file27  file49  file53  file55  file69  file97<br>

exp5: file05  file20  file43<br>

exp6: file34  file68  file72  file74  file79<br>

exp7: file03  file04  file26  file39  file41  file54  file60  file71<br>

      file77  file78  file83  file84  file89  file93  file94<br>

exp8: file06  file07  file12  file17  file30  file31  file37  file38<br>

      file46  file56  file59  file61  file65  file66  file76  file80<br>

      file90  file92  file99<br>

<br>

after re-balancing:<br>

only the following files has been moved (to exp9):<br>

<br>

exp0: file70<br>

exp1: file82  file85<br>

exp2: file00  file14  file25  file40  file45  file91  file96<br>

exp3: file88<br>

<br>

as the result 11 files have been rebalanced (total files scanned 51)<br>

<br>

<br>

<br>

III.  64-bit ring with phi based on md5.<br>

      Every brick has number of virtual components S=20<br>

<br>

------------------------------<u></u>------------------------------<u></u>-----------<br>

<br>

before re-balancing:<br>

<br>

exp0: file00  file04  file22  file30  file40  file42  file70  file96<br>

exp1: file06  file08  file13  file15  file17  file19  file23  file32<br>

      file33  file78  file81  file86  file95<br>

exp2: file11  file14  file16  file24  file29  file57  file63  file67<br>

      file73  file76<br>

exp3: file02  file10  file12  file18  file21  file28  file35  file51<br>

      file56  file59  file80  file87<br>

exp4: file39  file41  file49  file53  file54  file62  file69  file77<br>

      file83  file84  file91<br>

exp5: file05  file20  file31  file43  file68  file74<br>

exp6: file34  file37  file38  file46  file48  file58  file66  file71<br>

      file75  file79  file88<br>

exp7: file03  file07  file26  file47  file60  file72  file89  file93<br>

      file94<br>

exp8: file01  file09  file25  file27  file36  file44  file45  file50<br>

      file52  file55  file59  file61  file64  file65  file82  file85<br>

      file90  file92  file97  file98  file99<br>

<br>

after re-balancing:<br>

only the following files has been moved (to exp9):<br>

<br>

exp0: file70<br>

exp6: file88<br>

exp7: file07  file93<br>

exp8: file45  file82  file85<br>

<br>

as the result 7 files have been rebalanced (total files scanned 48)<br>

<br>

<br>

------------------------------<u></u>------------------------------<u></u>----------<br>

<br>

                        Table with results<br>

<br>

          Re-balance results and standard deviations (sd)<br>

          for current gluster DHT translator (3.2.5), and<br>

          for 64-bit rings with S=7 and S=20.<br>

<br>

<br>

------------------------------<u></u>------------------------------<u></u>----------<br>

<br>

               DHT(Gluster 3.2.5)  64-bit ring, S=7  64-bit ring, S=20<br>

<br>

------------------------------<u></u>------------------------------<u></u>----------<br>

<br>

files moved            98                 11                 7<br>

<br>

files scanned         139                 51                48<br>

<br>

sd before             2.8                5.4               3.7<br>

<br>

sd after              3.2                5.3               3.2<br>

<br>

------------------------------<u></u>------------------------------<u></u>----------<br>

<br>

<br>

<br>

----------<br>

<br>

<br>

(0*) <a href="http://en.wikipedia.org/wiki/Uniform_distribution_%28discrete%29" target="_blank">http://en.wikipedia.org/wiki/<u></u>Uniform_distribution_%<u></u>28discrete%29</a><br>

(1*) Karger, D.; Lehman, E.; Leighton, T.; Panigrahy, R.; Levine, M.; Lewin, D.<br>

    (1997). &quot;Consistent Hashing and Random Trees: Distributed Caching Protocols<br>

     for Relieving Hot Spots on the World Wide Web&quot;<br>

(2*) M/N is a median value<br>

(3*) <a href="http://weblogs.java.net/blog/tomwhite/archive/2007/11/consistent_hash.html" target="_blank">http://weblogs.java.net/blog/<u></u>tomwhite/archive/2007/11/<u></u>consistent_hash.html</a><br>

(4*) <a href="http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html" target="_blank">http://www.<u></u>allthingsdistributed.com/2007/<u></u>10/amazons_dynamo.html</a><br>

<br>

______________________________<u></u>_________________<br>

Gluster-devel mailing list<br>

<a href="mailto:Gluster-devel@nongnu.org" target="_blank">Gluster-devel@nongnu.org</a><br>

<a href="https://lists.nongnu.org/mailman/listinfo/gluster-devel" target="_blank">https://lists.nongnu.org/<u></u>mailman/listinfo/gluster-devel</a><br>

</blockquote></div><br></div>