<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On Tue, Jul 1, 2014 at 2:23 PM, Xavier Hernandez <span dir="ltr"><<a href="mailto:xhernandez@datalab.es" target="_blank">xhernandez@datalab.es</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="">On Monday 30 June 2014 16:18:09 Shyamsundar Ranganathan wrote:<br>
> > Will this "rebalance on access" feature be enabled always or only during a<br>
> > brick addition/removal to move files that do not go to the affected brick<br>
> > while the main rebalance is populating or removing files from the brick ?<br>
><br>
> The rebalance on access, in my head, stands as follows (a little more<br>
> detailed than what is in the feature page):<br>
> Step 1: Initiation of the process<br>
> - Admin chooses to "rebalance _changed_" bricks<br>
> - This could mean added/removed/changed size bricks<br>
> [3]- Rebalance on access is triggered, so as to move files when they are<br>
> accessed, but asynchronously<br>
> [1]- Background rebalance acts only to (re)move data (from)to these bricks<br>
> [2]- This would also change the layout for all directories, to include the<br>
> new configuration of the cluster, so that newer data is placed in the<br>
> correct bricks<br>
><br>
> Step 2: Completion of background rebalance<br>
> - Once background rebalance is complete, the rebalance status is noted as<br>
> success/failure based on what the background rebalance process did - This<br>
> will not stop the on access rebalance, as data is still all over the place,<br>
> and enhancements like lookup-unhashed=auto will have trouble<br>
<br>
</div>I don't see why stopping rebalance on access would be a problem when<br>
lookup-unhashed=auto is enabled. If I understand <a href="http://review.gluster.org/7702/" target="_blank">http://review.gluster.org/7702/</a> correctly, when the<br>
directory commit hash does not match that of the volume root, a global lookup<br>
will be made. If we change the layout in [3], it will also change (or it<br>
should) the commit hash of the directory. This means that even if the files of<br>
that directory have not been rebalanced yet, they will be found regardless of<br>
whether on access rebalance is enabled or not.<br>
<br>
Am I missing something ?<br>
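To make the argument concrete, here is a minimal sketch of the lookup decision as I understand it from that patch (illustrative Python with hypothetical names, not the actual Gluster symbols):<br>

```python
# Hypothetical sketch of the lookup-unhashed=auto decision from
# http://review.gluster.org/7702/ -- names are illustrative only.

def needs_global_lookup(dir_commit_hash, volume_commit_hash,
                        found_on_hashed_subvol):
    """Decide whether a lookup must be broadcast to all subvolumes.

    If the file is on its hashed subvolume, no broadcast is needed.
    If it is missing there, the broadcast can still be skipped -- but
    only when the directory's commit hash matches the volume root's,
    which proves no layout change is pending for this directory.
    """
    if found_on_hashed_subvol:
        return False
    return dir_commit_hash != volume_commit_hash
```

So after the layout change in [3] bumps the directory commit hash, a miss on the hashed subvolume still triggers the global lookup and the not-yet-migrated file is found, whether or not on access rebalance is running.<br>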
<div class=""><br>
><br>
> Step 3: Admin can initiate a full rebalance<br>
> - When this is complete then the on access rebalance would be turned off, as<br>
> the cluster is rebalanced!<br>
><br>
> Step 2.5/4: Choosing to stop the on access rebalance<br>
> - This can be initiated by the admin, post 3 which is more logical or<br>
> between 2 and 3, in which case lookup everywhere for files etc. cannot be<br>
> avoided due to [2] above<br>
><br>
<br>
</div>Giving admins the possibility to enable/disable this feature seems<br>
interesting. However, I also think it should be forcibly enabled when<br>
rebalancing _changed_ bricks.<br>
<div class=""><br>
> Issues and possible solutions:<br>
><br>
> [4] One other thought is to create link files, as a part of [1], for files<br>
> that do not belong to the right bricks but are _not_ going to be rebalanced<br>
> as their source/destination is not a changed brick. This _should_ be faster<br>
> than moving data around and rebalancing these files. It should also avoid<br>
> the problem that, post a "rebalance _changed_" command, the cluster may<br>
> have files in the wrong place based on the layout, as the link files would<br>
> be present to correct the situation. In this situation the rebalance on<br>
> access can be left on indefinitely and turning it off does not serve much<br>
> purpose.<br>
><br>
<br>
</div>I think that creating link files is a cheap task, especially if rebalance<br>
handles files in parallel. However I'm not sure whether this will make any<br>
measurable difference in performance on future accesses (in theory it should<br>
avoid one global lookup per file). This would need to be tested to decide.<br>
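As a toy model (hypothetical Python, not Gluster code) of what the link file buys on access: resolution becomes one extra hop instead of a broadcast to every brick:<br>

```python
class Brick:
    def __init__(self, name):
        self.name = name
        self.files = {}   # filename -> data present on this brick
        self.linkto = {}  # filename -> name of the brick holding the data

def dht_hash(filename, nbricks):
    # stand-in for the real DHT hash; any stable function works here
    return sum(filename.encode()) % nbricks

def lookup(filename, bricks, counters):
    hashed = bricks[dht_hash(filename, len(bricks))]
    counters["lookups"] += 1
    if filename in hashed.files:
        return hashed
    if filename in hashed.linkto:  # follow the link file: one extra hop
        counters["lookups"] += 1
        return next(b for b in bricks if b.name == hashed.linkto[filename])
    for b in bricks:               # no link file: global lookup everywhere
        counters["lookups"] += 1
        if filename in b.files:
            return b
    return None
```

With a file sitting on the wrong brick, the global lookup touches every brick while a link file on the hashed brick caps resolution at two operations; whether that difference is measurable in practice is exactly what would need testing.<br>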
<div class=""><br>
> Enabling rebalance on access permanently is fine, but I am not sure it gives<br>
> us Gluster states that declare the cluster balanced, which other actions<br>
> like the lookup-unhashed optimization mentioned above may need beyond just<br>
> having the link files in place. Examples could be mismatched or overly<br>
> space-committed bricks holding old, unaccessed data etc., but I do not<br>
> have a clear example yet.<br>
><br>
<br>
</div>As I see it, rebalance on access should be a complement to normal rebalance to<br>
keep the volume _more_ balanced (keep accessed files on the right brick to<br>
avoid unnecessary delays due to global lookups or link file redirections), but<br>
it cannot ensure that the volume is fully rebalanced.<br>
<div class=""><br>
> Just stating, the core intention of "rebalance _changed_" is to create space<br>
> in existing bricks when the cluster grows faster, or be able to remove<br>
> bricks from the cluster faster.<br>
><br>
<br>
</div>That is a very important feature. I've missed it several times when expanding<br>
a volume. In fact we needed to write some scripts to do something similar<br>
before launching a full rebalance.<br>
<div class=""><br>
> Redoing a "rebalance _changed_" due to a gluster configuration change,<br>
> i.e. expanding the cluster again, needs some thought. It does not matter<br>
> whether rebalance on access is running or not; the only thing it may impact<br>
> is the choice of files that are already put into the on access queue based<br>
> on the older layout, due to the older cluster configuration. Just noting<br>
> this here.<br>
><br>
<br>
</div>This will need to be thought through more deeply, but if we only have a<br>
queue of files that *may* need migration, and we really check the target volume<br>
at the time of migration, I think this won't pose much of a problem in case of<br>
successive rebalances.<br>
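A sketch of that check (illustrative, hypothetical names): entries queued under an older layout are simply re-validated against the layout in force when the move actually happens, so stale entries are harmless:<br>

```python
def migrate_if_needed(filename, cached_brick, current_layout):
    """current_layout maps a filename to its hashed brick under the
    layout in force *now*, not the one when the file was queued."""
    target = current_layout(filename)
    if target == cached_brick:
        return None  # stale queue entry; file is already in place
    return target    # move the file from cached_brick to target
```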
<div class=""><br>
> In short, if we do [4] then we can leave rebalance on access turned on<br>
> always, unless there are counter-examples or use cases we have not thought<br>
> of. Doing [4] seems logical, so I would state that we should, but from the<br>
> angle of improving rebalance performance, we need to weigh its cost against<br>
> the IO-path access cost of not having [4] (considering the improvement that<br>
> lookup-unhashed brings, it may be obvious that [4] should be done).<br>
><br>
> A note on [3], the intention is to start an asynchronous task that<br>
> rebalances the file on access, without impacting the IO path. So if the IO<br>
> path identifies a file as needing a rebalance, a sync task is set up that<br>
> calls setxattr with the required xattr to trigger the file move; that<br>
> should take care of the file migration while letting the IO path<br>
> progress as is.<br>
><br>
<br>
</div>Agreed. The file operation that triggered it must not be blocked while<br>
migration is performed.<br>
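The shape of that decoupling, as a hypothetical Python sketch (a queue and worker thread standing in for the setxattr-triggered migration task):<br>

```python
import queue
import threading

migration_queue = queue.Queue()
migrated = []

def migration_worker():
    # stands in for the background task performing the actual file move
    while True:
        item = migration_queue.get()
        if item is None:
            break
        migrated.append(item)        # the "migration" itself
        migration_queue.task_done()

def on_access(filename, needs_rebalance):
    # the IO path only enqueues the request and returns immediately
    if needs_rebalance:
        migration_queue.put(filename)
    return "data of " + filename     # IO proceeds without waiting

worker = threading.Thread(target=migration_worker, daemon=True)
worker.start()
```

The access path never waits on the worker; a slow or failed migration only means the file is retried on a later access.<br>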
<div class=""><br>
> Reading through your mail, a better way of doing this by sharing the load,<br>
> would be to use an index, so that each node in the cluster has a list of<br>
> files accessed that need a rebalance. The above method for [3] would be<br>
> client heavy and would incur a network read and write, whereas the index<br>
> manner of doing things on the node could help in local reads and remote<br>
> writes operations and in spreading the work. It would incur a walk/crawl of<br>
> the index, but each entry returned is a candidate, and the walk is limited,<br>
> so should not be a bad thing by itself.<br>
<br>
</div>The idea of using an index was intended more to easily detect renamed files<br>
on an otherwise balanced volume, and to be able to perform quick rebalance<br>
operations to move them to the correct brick without having to crawl the entire<br>
file system. In almost all cases, all the files present in the index will need<br>
rebalance, so the cost of crawling the index is worth it.<br></blockquote><div><br></div><div>We did consider using an index to identify files that need migration. In the normal case it suits our needs. However, after an add-brick we cannot rely on the index to avoid a crawl, since the layout itself would have changed.<br>
<br></div><blockquote class="gmail_quote" style="margin:0pt 0pt 0pt 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<br>
As I envisioned it, it was independent of the on access rebalance. However, it<br>
could be seen as something similar to the self-heal daemon. We could consider<br>
that a file not residing in the right brick is not healthy and initiate some<br>
sort of self-heal on it. Not sure if this should/could be done in the self-<br>
heal daemon or would need another daemon though.<br>
<br>
Using the daemon solution, I think that the client side "on access rebalance"<br>
is not needed. However I'm not sure which one is easier to implement.<br>
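One pass of such a daemon could look like this (hypothetical sketch, by analogy with the self-heal daemon): a file whose cached brick differs from its hashed brick is considered unhealthy and queued for migration:<br>

```python
def balance_heal_pass(index, current_layout):
    """index: iterable of (filename, cached_brick) candidates, e.g.
    produced by an index xlator; current_layout: filename -> hashed
    brick under the layout in force now. Returns the moves to perform."""
    moves = []
    for filename, cached in index:
        hashed = current_layout(filename)
        if hashed != cached:  # "unhealthy": data not on its hashed brick
            moves.append((filename, cached, hashed))
    return moves
```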
<div><div class="h5"><br>
> > I like all the proposed ideas. I think they would improve the performance<br>
> > of the rebalance operation considerably. Probably we will need to define<br>
> > some policies to limit the amount of bandwidth that rebalance is allowed<br>
> > to use and at which hours, but this can be determined later.<br>
><br>
> This [5] section of the feature page touches upon the same issue, i.e. being<br>
> aware of IO path requirements and not letting rebalance hog the node's<br>
> resources. But as you state, it needs more thought, and should probably be<br>
> done once we see some improvements and also see that we are utilizing the<br>
> resources heavily.<br>
> > I would also consider using index or changelog xlators to track renames<br>
> > and<br>
> > let rebalance consume it. Currently a file or directory rename makes that<br>
> > files correctly placed in the right brick need to be moved to another<br>
> > brick. A<br>
> > full rebalance crawling all the file system seems too expensive for this<br>
> > kind of local changes (the effects of this are orders of magnitude<br>
> > smaller than adding or removing a brick). Having a way to list pending<br>
> > moves due to rename without scanning all the file system would be great.<br>
><br>
> Hmmm... to my knowledge a rename of a file does not move the file; rather,<br>
> it creates a link file if the hashed subvolume of the new name is different<br>
> from the old subvolume where the file was placed. The rename of a<br>
> directory does not change its layout (unless a still-to-be-analyzed<br>
> lookup races with the rename for layout fetching and healing). On any<br>
> future layout fixes due to added or removed bricks, the layout<br>
> overlaps are computed so as to minimize data movement.<br>
><br>
> Are you suggesting a change in behavior here, or am I missing something?<br>
<br>
</div></div>Not really. I'm only considering the possibility of adding an additional step.<br>
The way rename works now is fine. I think that creating a link<br>
file is the most efficient way to be able to easily find the file in the<br>
future without wasting too much bandwidth and IOPS. However, as more and more<br>
file and directory renames are made, more and more data is left on the wrong<br>
brick and each access needs an additional hop. Even if this were cheap, a<br>
future layout change trying to minimize data movement will not be optimal<br>
because the data is not really where the layout thinks it is.<br>
<br>
Recording all renames in an index each time a rename is made can allow a<br>
background daemon to scan it and incrementally process them to restore volume<br>
balance.<br>
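For illustration, the producer side of that idea could look like this (hypothetical sketch; the real implementation would live in the rename path of DHT or in the index/changelog xlators):<br>

```python
def rename(oldname, newname, cached_brick, current_layout, rename_index):
    # the data stays on cached_brick behind a link file, as today; we
    # only record the debt so a background daemon can settle it later
    new_hashed = current_layout(newname)
    if new_hashed != cached_brick:
        rename_index.append((newname, cached_brick, new_hashed))
```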
<div class=""><br>
> > Another thing to consider for future versions is to modify the current DHT<br>
> > to a consistent hashing and even the hash value (using gfid instead of a<br>
> > hash of the name would solve the rename problem). The consistent hashing<br>
> > would drastically reduce the number of files that need to be moved and<br>
> > already solves some of the current problems. This change needs a lot of<br>
> > thinking though.<br>
><br>
> Firstly, I agree that this is an area to explore and nail down in the<br>
> _hopefully_ near future, and that it takes some thinking time to get this<br>
> straight, while learning from the current implementation.<br>
><br>
> Also, I would like to point to a commit that changes this for directories,<br>
> using a GFID-based hash rather than the name-based hash, here [6]. It<br>
> does not address the rename problem, but starts to do the things that you<br>
> put down here.<br>
<br>
</div>That's good. I missed this patch. I'll look at it. Thanks :)<br>
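On the consistent hashing idea above, a toy ring (illustrative Python; the vnode count and hash choice are arbitrary) shows why it would drastically reduce movement: adding a brick only claims the arcs of its own virtual nodes, leaving most keys where they were. Hashing the gfid instead of the name as the key would additionally keep placement stable across renames.<br>

```python
import bisect
import hashlib

def h(key):
    # stable hash onto the ring (md5 here purely for illustration)
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, bricks, vnodes=64):
        # each brick owns many small arcs to even out the distribution
        self.points = sorted(
            (h("%s#%d" % (b, v)), b) for b in bricks for v in range(vnodes))
        self._hashes = [p for p, _ in self.points]

    def brick_for(self, key):
        # a key belongs to the first virtual node clockwise from its hash
        i = bisect.bisect(self._hashes, h(key)) % len(self.points)
        return self.points[i][1]
```

Growing such a ring from three to four bricks remaps roughly a quarter of the keys in expectation, and every remapped key lands on the new brick, whereas a naive hash-mod-N layout would remap about three quarters of them.<br>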
<br>
Xavi<br>
<div class="HOEnZb"><div class="h5"><br>
_______________________________________________<br>
Gluster-devel mailing list<br>
<a href="mailto:Gluster-devel@gluster.org">Gluster-devel@gluster.org</a><br>
<a href="http://supercolony.gluster.org/mailman/listinfo/gluster-devel" target="_blank">http://supercolony.gluster.org/mailman/listinfo/gluster-devel</a><br>
</div></div></blockquote></div><br><br clear="all"><br>-- <br>Raghavendra G<br>
</div></div>