<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On Tue, Jul 1, 2014 at 2:23 PM, Xavier Hernandez <span dir="ltr"><<a href="mailto:xhernandez@datalab.es" target="_blank">xhernandez@datalab.es</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="">On Monday 30 June 2014 16:18:09 Shyamsundar Ranganathan wrote:<br>
> > Will this "rebalance on access" feature be enabled always or only during a<br>
> > brick addition/removal to move files that do not go to the affected brick<br>
> > while the main rebalance is populating or removing files from the brick ?<br>
><br>
> The rebalance on access, in my head, stands as follows (a little more<br>
> detailed than what is in the feature page):<br>
> Step 1: Initiation of the process<br>
> - Admin chooses to "rebalance _changed_" bricks<br>
> - This could mean added/removed/changed size bricks<br>
> [3]- Rebalance on access is triggered, so as to move files when they are<br>
> accessed, but asynchronously<br>
> [1]- Background rebalance acts only to (re)move data (from)to these bricks<br>
> [2]- This would also change the layout for all directories, to include the<br>
> new configuration of the cluster, so that newer data is placed in the<br>
> correct bricks<br>
><br>
> Step 2: Completion of background rebalance<br>
> - Once background rebalance is complete, the rebalance status is noted as<br>
> success/failure based on what the background rebalance process did - This<br>
> will not stop the on access rebalance, as data is still all over the place,<br>
> and enhancements like lookup-unhashed=auto will have trouble<br>
<br>
</div>I don't see why stopping rebalance on access would be a problem when<br>
lookup-unhashed=auto is enabled. If I understand <a href="http://review.gluster.org/7702/" target="_blank">http://review.gluster.org/7702/</a> correctly, when the<br>
directory commit hash does not match that of the volume root, a global lookup<br>
will be made. If we change the layout in [3], it will also change (or it<br>
should) the commit hash of the directory. This means that even if the files of<br>
that directory have not been rebalanced yet, they will be found regardless of<br>
whether on access rebalance is enabled or not.<br>
<br>
Am I missing something ?<br>
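To make the argument concrete, here is a minimal sketch of the lookup decision as I understand it from that patch (illustrative Python with hypothetical names, not the actual Gluster symbols):<br>

```python
# Hypothetical sketch of the lookup-unhashed=auto decision from
# http://review.gluster.org/7702/ -- names are illustrative only.

def needs_global_lookup(dir_commit_hash, volume_commit_hash,
                        found_on_hashed_subvol):
    """Decide whether a lookup must be broadcast to all subvolumes.

    If the file is on its hashed subvolume, no broadcast is needed.
    If it is missing there, the broadcast can still be skipped -- but
    only when the directory's commit hash matches the volume root's,
    which proves no layout change is pending for this directory.
    """
    if found_on_hashed_subvol:
        return False
    return dir_commit_hash != volume_commit_hash
```

So after the layout change in [3] bumps the directory commit hash, a miss on the hashed subvolume still triggers the global lookup and the not-yet-migrated file is found, whether or not on access rebalance is running.<br>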
<div class=""><br>
><br>
> Step 3: Admin can initiate a full rebalance<br>
> - When this is complete then the on access rebalance would be turned off, as<br>
> the cluster is rebalanced!<br>
><br>
> Step 2.5/4: Choosing to stop the on access rebalance<br>
> - This can be initiated by the admin, post 3 which is more logical or<br>
> between 2 and 3, in which case lookup everywhere for files etc. cannot be<br>
> avoided due to [2] above<br>
><br>
<br>
</div>Giving admins the possibility to enable/disable this feature seems<br>
interesting. However, I also think it should be forcibly enabled when<br>
rebalancing _changed_ bricks.<br>
<div class=""><br>
> Issues and possible solutions:<br>
><br>
> [4] One other thought is to create link files, as a part of [1], for files<br>
> that do not belong to the right bricks but are _not_ going to be rebalanced<br>
> as their source/destination is not a changed brick. This _should_ be faster<br>
> than moving data around and rebalancing these files. It should also avoid<br>
> the problem that, post a "rebalance _changed_" command, the cluster may<br>
> have files in the wrong place based on the layout, as the link files would<br>
> be present to correct the situation. In this situation the rebalance on<br>
> access can be left on indefinitely and turning it off does not serve much<br>
> purpose.<br>
><br>
<br>
</div>I think that creating link files is a cheap task, especially if rebalance<br>
handles files in parallel. However I'm not sure whether this will make any<br>
measurable difference in performance on future accesses (in theory it should<br>
avoid one global lookup per file). This would need to be tested to decide.<br>
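As a toy model (hypothetical Python, not Gluster code) of what the link file buys on access: resolution becomes one extra hop instead of a broadcast to every brick:<br>

```python
class Brick:
    def __init__(self, name):
        self.name = name
        self.files = {}   # filename -> data present on this brick
        self.linkto = {}  # filename -> name of the brick holding the data

def dht_hash(filename, nbricks):
    # stand-in for the real DHT hash; any stable function works here
    return sum(filename.encode()) % nbricks

def lookup(filename, bricks, counters):
    hashed = bricks[dht_hash(filename, len(bricks))]
    counters["lookups"] += 1
    if filename in hashed.files:
        return hashed
    if filename in hashed.linkto:  # follow the link file: one extra hop
        counters["lookups"] += 1
        return next(b for b in bricks if b.name == hashed.linkto[filename])
    for b in bricks:               # no link file: global lookup everywhere
        counters["lookups"] += 1
        if filename in b.files:
            return b
    return None
```

With a file sitting on the wrong brick, the global lookup touches every brick while a link file on the hashed brick caps resolution at two operations; whether that difference is measurable in practice is exactly what would need testing.<br>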
<div class=""><br>
> Enabling rebalance on access permanently is fine, but I am not sure it gives<br>
> us Gluster states that declare the cluster balanced, which other actions<br>
> like the lookup-unhashed optimization mentioned above may need beyond just<br>
> having the link files in place. Examples could be mismatched or overly<br>
> space-committed bricks holding old, unaccessed data etc., but I do not<br>
> have a clear example yet.<br>
><br>
<br>
</div>As I see it, rebalance on access should be a complement to normal rebalance to<br>
keep the volume _more_ balanced (keep accessed files on the right brick to<br>
avoid unnecessary delays due to global lookups or link file redirections), but<br>
it cannot ensure that the volume is fully rebalanced.<br>
<div class=""><br>
> Just stating, the core intention of "rebalance _changed_" is to create space<br>
> in existing bricks when the cluster grows faster, or be able to remove<br>
> bricks from the cluster faster.<br>
><br>
<br>
</div>That is a very important feature. I've missed it several times when expanding<br>
a volume. In fact we needed to write some scripts to do something similar<br>
before launching a full rebalance.<br>
<div class=""><br>
> Redoing a "rebalance _changed_" due to a gluster configuration change,<br>
> i.e. expanding the cluster again, needs some thought. It does not matter<br>
> whether rebalance on access is running or not; the only thing it may impact<br>
> is the choice of files that are already put into the on access queue based<br>
> on the older layout, due to the older cluster configuration. Just noting<br>
> this here.<br>
><br>
<br>
</div>This will need to be thought through more deeply, but if we only have a<br>
queue of files that *may* need migration, and we really check the target volume<br>
at the time of migration, I think this won't pose much of a problem in case of<br>
successive rebalances.<br>
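A sketch of that check (illustrative, hypothetical names): entries queued under an older layout are simply re-validated against the layout in force when the move actually happens, so stale entries are harmless:<br>

```python
def migrate_if_needed(filename, cached_brick, current_layout):
    """current_layout maps a filename to its hashed brick under the
    layout in force *now*, not the one when the file was queued."""
    target = current_layout(filename)
    if target == cached_brick:
        return None  # stale queue entry; file is already in place
    return target    # move the file from cached_brick to target
```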
<div class=""><br>
> In short, if we do [4] then we can leave rebalance on access turned on<br>
> always, unless there are counter-examples or use cases we have not thought<br>
> of. Doing [4] seems logical, so I would state that we should, but from the<br>
> angle of improving rebalance performance, we need to weigh its cost against<br>
> the IO-path access cost of not having [4] (considering the improvement that<br>
> lookup-unhashed brings, it may be obvious that [4] should be done).<br>
><br>
> A note on [3], the intention is to start an asynchronous task that<br>
> rebalances the file on access, without impacting the IO path. So if the IO<br>
> path identifies a file as needing a rebalance, a sync task is set up that<br>
> calls setxattr with the required xattr to trigger the file move; that<br>
> should take care of the file migration while letting the IO path<br>
> progress as is.<br>
><br>
<br>
</div>Agreed. The file operation that triggered it must not be blocked while<br>
migration is performed.<br>
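The shape of that decoupling, as a hypothetical Python sketch (a queue and worker thread standing in for the setxattr-triggered migration task):<br>

```python
import queue
import threading

migration_queue = queue.Queue()
migrated = []

def migration_worker():
    # stands in for the background task performing the actual file move
    while True:
        item = migration_queue.get()
        if item is None:
            break
        migrated.append(item)        # the "migration" itself
        migration_queue.task_done()

def on_access(filename, needs_rebalance):
    # the IO path only enqueues the request and returns immediately
    if needs_rebalance:
        migration_queue.put(filename)
    return "data of " + filename     # IO proceeds without waiting

worker = threading.Thread(target=migration_worker, daemon=True)
worker.start()
```

The access path never waits on the worker; a slow or failed migration only means the file is retried on a later access.<br>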
<div class=""><br>
> Reading through your mail, a better way of doing this by sharing the load,<br>
> would be to use an index, so that each node in the cluster has a list of<br>
> files accessed that need a rebalance. The above method for [3] would be<br>
> client heavy and would incur a network read and write, whereas the index<br>
> manner of doing things on the node could help in local reads and remote<br>
> writes operations and in spreading the work. It would incur a walk/crawl of<br>
> the index, but each entry returned is a candidate, and the walk is limited,<br>
> so should not be a bad thing by itself.<br>
<br>
</div>The idea of using an index was intended more to easily detect renamed files<br>
on an otherwise balanced volume, and to be able to perform quick rebalance<br>
operations to move them to the correct brick without having to crawl the entire<br>
file system. In almost all cases, all the files present in the index will need<br>
rebalance, so the cost of crawling the index is worth it.<br></blockquote><div><br></div><div>We did consider using an index to identify files that need migration. In the normal case it suits our needs. However, after an add-brick we cannot rely on the index to avoid a crawl, since the layout itself would have changed.<br>
<br></div><blockquote class="gmail_quote" style="margin:0pt 0pt 0pt 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<br>
As I envisioned it, it was independent of the on access rebalance. However, it<br>
could be seen as something similar to the self-heal daemon. We could consider<br>
that a file not residing in the right brick is not healthy and initiate some<br>
sort of self-heal on it. Not sure if this should/could be done in the self-<br>
heal daemon or would need another daemon though.<br>
<br>
Using the daemon solution, I think that the client side "on access rebalance"<br>
is not needed. However I'm not sure which one is easier to implement.<br>
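One pass of such a daemon could look like this (hypothetical sketch, by analogy with the self-heal daemon): a file whose cached brick differs from its hashed brick is considered unhealthy and queued for migration:<br>

```python
def balance_heal_pass(index, current_layout):
    """index: iterable of (filename, cached_brick) candidates, e.g.
    produced by an index xlator; current_layout: filename -> hashed
    brick under the layout in force now. Returns the moves to perform."""
    moves = []
    for filename, cached in index:
        hashed = current_layout(filename)
        if hashed != cached:  # "unhealthy": data not on its hashed brick
            moves.append((filename, cached, hashed))
    return moves
```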
<div><div class="h5"><br>
> > I like all the proposed ideas. I think they would improve the performance<br>
> > of the rebalance operation considerably. Probably we will need to define<br>
> > some policies to limit the amount of bandwidth that rebalance is allowed<br>
> > to use and at which hours, but this can be determined later.<br>
><br>
> This [5] section of the feature page touches upon the same issue, i.e. being<br>
> aware of IO path requirements and not letting rebalance hog the node's<br>
> resources. But as you state, it needs more thought, and should probably be<br>
> done once we see some improvements and also see that we are utilizing the<br>
> resources heavily.<br>
> > I would also consider using index or changelog xlators to track renames<br>
> > and<br>
> > let rebalance consume it. Currently a file or directory rename makes that<br>
> > files correctly placed in the right brick need to be moved to another<br>
> > brick. A<br>
> > full rebalance crawling all the file system seems too expensive for this<br>
> > kind of local changes (the effects of this are orders of magnitude<br>
> > smaller than adding or removing a brick). Having a way to list pending<br>
> > moves due to rename without scanning all the file system would be great.<br>
><br>
> Hmmm... to my knowledge a rename of a file does not move the file; rather,<br>
> it creates a link file if the hashed subvolume of the new name is different<br>
> from the old subvolume where the file was placed. The rename of a<br>
> directory does not change its layout (unless a still-to-be-analyzed<br>
> lookup races with the rename for layout fetching and healing). On any<br>
> future layout fixes due to added or removed bricks, the layout<br>
> overlaps are computed so as to minimize data movement.<br>
><br>
> Are you suggesting a change in behavior here, or am I missing something?<br>
<br>
</div></div>Not really. I'm only considering the possibility of adding an additional step.<br>
The way rename works now is fine. I think that creating a link<br>
file is the most efficient way to be able to easily find the file in the<br>
future without wasting too much bandwidth and IOPS. However, as more and more<br>
file and directory renames are made, more and more data is left on the wrong<br>
brick and each access needs an additional hop. Even if this were cheap, a<br>
future layout change trying to minimize data movement will not be optimal<br>
because the data is not really where the layout thinks it is.<br>
<br>
Recording all renames in an index each time a rename is made can allow a<br>
background daemon to scan it and incrementally process them to restore volume<br>
balance.<br>
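For illustration, the producer side of that idea could look like this (hypothetical sketch; the real implementation would live in the rename path of DHT or in the index/changelog xlators):<br>

```python
def rename(oldname, newname, cached_brick, current_layout, rename_index):
    # the data stays on cached_brick behind a link file, as today; we
    # only record the debt so a background daemon can settle it later
    new_hashed = current_layout(newname)
    if new_hashed != cached_brick:
        rename_index.append((newname, cached_brick, new_hashed))
```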
<div class=""><br>
> > Another thing to consider for future versions is to modify the current DHT<br>
> > to a consistent hashing and even the hash value (using gfid instead of a<br>
> > hash of the name would solve the rename problem). The consistent hashing<br>
> > would drastically reduce the number of files that need to be moved and<br>
> > already solves some of the current problems. This change needs a lot of<br>
> > thinking though.<br>
><br>
> Firstly, I agree that this is an area to explore and nail down in the<br>
> _hopefully_ near future, and that it takes some thinking time to get this<br>
> straight, while learning from the current implementation.<br>
><br>
> Also, I would like to point to a commit that changes this for directories,<br>
> using a GFID-based hash rather than the name-based hash, here [6]. It<br>
> does not address the rename problem, but starts to do the things that you<br>
> put down here.<br>
<br>
</div>That's good. I missed this patch. I'll look at it. Thanks :)<br>
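On the consistent hashing idea above, a toy ring (illustrative Python; the vnode count and hash choice are arbitrary) shows why it would drastically reduce movement: adding a brick only claims the arcs of its own virtual nodes, leaving most keys where they were. Hashing the gfid instead of the name as the key would additionally keep placement stable across renames.<br>

```python
import bisect
import hashlib

def h(key):
    # stable hash onto the ring (md5 here purely for illustration)
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, bricks, vnodes=64):
        # each brick owns many small arcs to even out the distribution
        self.points = sorted(
            (h("%s#%d" % (b, v)), b) for b in bricks for v in range(vnodes))
        self._hashes = [p for p, _ in self.points]

    def brick_for(self, key):
        # a key belongs to the first virtual node clockwise from its hash
        i = bisect.bisect(self._hashes, h(key)) % len(self.points)
        return self.points[i][1]
```

Growing such a ring from three to four bricks remaps roughly a quarter of the keys in expectation, and every remapped key lands on the new brick, whereas a naive hash-mod-N layout would remap about three quarters of them.<br>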
<br>
Xavi<br>
<div class="HOEnZb"><div class="h5"><br>
_______________________________________________<br>
Gluster-devel mailing list<br>
<a href="mailto:Gluster-devel@gluster.org">Gluster-devel@gluster.org</a><br>
<a href="http://supercolony.gluster.org/mailman/listinfo/gluster-devel" target="_blank">http://supercolony.gluster.org/mailman/listinfo/gluster-devel</a><br>
</div></div></blockquote></div><br><br clear="all"><br>-- <br>Raghavendra G<br>
</div></div>