<div dir="ltr">On Thu, Oct 3, 2013 at 8:57 AM, KueiHuan Chen <span dir="ltr"><<a href="mailto:kueihuan.chen@gmail.com" target="_blank">kueihuan.chen@gmail.com</a>></span> wrote:<br><div class="gmail_extra"><div class="gmail_quote">
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi, Avati<br>
<br>
In your chained configuration, how can I replace all of h1 without<br>
replace-brick? Is there a better way than replace-brick in this<br>
situation?<br>
<br>
h0:/b1 h1:/b2 h1:/b1 h2:/b2 h2:/b1 h0:/b2 (A new h3 wants to replace the old h1.)<br></blockquote><div><br></div><div><br></div><div>You have a couple of options:</div><div><br></div><div>A)</div><div><br></div><div>replace-brick h1:/b1 h3:/b1</div>
<div>replace-brick h1:/b2 h3:/b2</div><div><br></div><div>and let self-heal bring the disks up to speed, or</div><div><br></div><div>B)</div><div><br></div><div>add-brick replica 2 h3:/b1 h2:/b2a</div><div>add-brick replica 2 h3:/b2 h0:/b1a</div>
<div><br></div><div>remove-brick h0:/b1 h1:/b2 start .. commit</div><div>remove-brick h2:/b2 h1:/b1 start .. commit</div><div><br></div><div>Let me know if you still have questions.</div><div><br></div><div>Avati</div><div>
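As a sanity check, the effect of option B on the chained layout can be modeled in a few lines of Python (illustration only, no Gluster calls; brick names follow the thread's h0, h1, ... convention):

```python
# Model the replica-2 volume as a list of replica pairs and apply option B
# step by step.

def host(brick):
    return brick.split(":")[0]

# Chained layout from the question: h0:/b1 h1:/b2  h1:/b1 h2:/b2  h2:/b1 h0:/b2
pairs = [("h0:/b1", "h1:/b2"), ("h1:/b1", "h2:/b2"), ("h2:/b1", "h0:/b2")]

# add-brick replica 2 h3:/b1 h2:/b2a  /  add-brick replica 2 h3:/b2 h0:/b1a
pairs += [("h3:/b1", "h2:/b2a"), ("h3:/b2", "h0:/b1a")]

# remove-brick h0:/b1 h1:/b2 start .. commit  /  remove-brick h2:/b2 h1:/b1 start .. commit
for doomed in ({"h0:/b1", "h1:/b2"}, {"h1:/b1", "h2:/b2"}):
    pairs = [p for p in pairs if set(p) != doomed]

# h1 no longer holds any brick, and every pair still spans two distinct hosts.
assert all(host(b) != "h1" for p in pairs for b in p)
assert all(host(a) != host(b) for a, b in pairs)
print(pairs)
```

After both removals the chain is h2:/b1 h0:/b2, h3:/b1 h2:/b2a, h3:/b2 h0:/b1a, so h3 has fully taken over h1's role.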
<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
Thanks.<br>
Best Regards,<br>
<br>
KueiHuan-Chen<br>
Synology Incorporated.<br>
Email: <a href="mailto:khchen@synology.com">khchen@synology.com</a><br>
Tel: <a href="tel:%2B886-2-25521814%20ext.827" value="+886225521814">+886-2-25521814 ext.827</a><br>
<br>
<br>
2013/9/30 Anand Avati <<a href="mailto:avati@gluster.org">avati@gluster.org</a>>:<br>
<div class="HOEnZb"><div class="h5">><br>
><br>
><br>
> On Fri, Sep 27, 2013 at 1:56 AM, James <<a href="mailto:purpleidea@gmail.com">purpleidea@gmail.com</a>> wrote:<br>
>><br>
>> On Fri, 2013-09-27 at 00:35 -0700, Anand Avati wrote:<br>
>> > Hello all,<br>
>> Hey,<br>
>><br>
>> Interesting timing for this post...<br>
>> I've actually started working on automatic brick addition/removal. (I'm<br>
>> planning to add this to puppet-gluster of course.) I was hoping you<br>
>> could help out with the algorithm. I think it's a bit different if<br>
>> there's no replace-brick command as you are proposing.<br>
>><br>
>> Here's the problem:<br>
>> Given a logically optimal initial volume:<br>
>><br>
>> volA: rep=2; h1:/b1 h2:/b1 h3:/b1 h4:/b1 h1:/b2 h2:/b2 h3:/b2 h4:/b2<br>
>><br>
>> suppose I know that I want to add/remove bricks such that my new volume<br>
>> (if I had created it new) looks like:<br>
>><br>
>> volB: rep=2; h1:/b1 h3:/b1 h4:/b1 h5:/b1 h6:/b1 h1:/b2 h3:/b2 h4:/b2<br>
>> h5:/b2 h6:/b2<br>
>><br>
>> What is the optimal algorithm for determining the correct sequence of<br>
>> transforms needed to accomplish this task? Obviously there are some<br>
>> simpler corner cases, but I'd like to solve the general case.<br>
>><br>
>> The transforms are obviously things like running the add-brick {...} and<br>
>> remove-brick {...} commands.<br>
>><br>
>> Obviously we have to take into account that it's better to add bricks<br>
>> and rebalance before we remove bricks, rather than risk the file system<br>
>> while a replica is missing. The algorithm should work for any replica<br>
>> count N. We want to make sure the new layout still replicates the data<br>
>> across different servers. In many cases, this will require creating a<br>
>> circular "chain" of bricks, as illustrated at the bottom of this image:<br>
>> <a href="http://joejulian.name/media/uploads/images/replica_expansion.png" target="_blank">http://joejulian.name/media/uploads/images/replica_expansion.png</a><br>
>> for example. I'd like to optimize for safety first, and then time, I<br>
>> imagine.<br>
>><br>
>> Many thanks in advance.<br>
>><br>
><br>
> I see what you are asking. First of all, when running a 2-replica volume you<br>
> almost always want to have an even number of servers, and to add servers in<br>
> even numbers. Ideally the two "sides" of the replicas should be placed in<br>
> separate failure zones - separate racks with separate power supplies, or<br>
> separate AZs in the cloud. Having an odd number of servers with 2 replicas<br>
> is a very "odd" configuration. In all these years I have yet to come across<br>
> a customer who has a production cluster with 2 replicas and an odd number<br>
> of servers. And setting up replicas in such a chained manner makes it hard<br>
> to reason about availability, especially when you are trying to recover<br>
> from a disaster. Having clear and separate "pairs" is definitely what is<br>
> recommended.<br>
><br>
> That being said, nothing prevents one from setting up a chain like above as<br>
> long as you are comfortable with the complexity of the configuration. And<br>
> phasing out replace-brick in favor of add-brick/remove-brick does not make<br>
> the above configuration impossible either. Let's say you have a chained<br>
> configuration of N servers, with replica pairs formed as:<br>
><br>
> h(i):/b1 h((i+1) % N):/b2 | i := 0 -> N-1<br>
><br>
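The pairing rule above can be written out as a short Python helper (illustration only; for N=3 it reproduces the layout from the start of the thread):

```python
# Generate the chained replica pairs h(i):/b1 <-> h((i+1) % N):/b2
# for i = 0 .. N-1.
def chained_pairs(n):
    return [(f"h{i}:/b1", f"h{(i + 1) % n}:/b2") for i in range(n)]

print(chained_pairs(3))
# [('h0:/b1', 'h1:/b2'), ('h1:/b1', 'h2:/b2'), ('h2:/b1', 'h0:/b2')]
```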
> Now you add the (N+1)th server.<br>
><br>
> Using replace-brick, you would so far have done:<br>
><br>
> 1. add-brick hN:/b1 h0:/b2a # because h0:/b2 was "part of a previous brick"<br>
> 2. replace-brick h0:/b2 hN:/b2 start ... commit<br>
><br>
> With the add-brick/remove-brick approach, you would instead now do:<br>
><br>
> 1. add-brick h(N-1):/b1a hN:/b2<br>
> 2. add-brick hN:/b1 h0:/b2a<br>
> 3. remove-brick h(N-1):/b1 h0:/b2 start ... commit<br>
><br>
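The three steps above follow a fixed pattern, so they can be generated for any N. The sketch below emits the command arguments exactly as written in this thread; a real invocation would be prefixed with `gluster volume add-brick <VOLNAME> ...`, with the volume name filled in:

```python
# Emit the add-brick/remove-brick sequence for growing an N-server chain
# to N+1 servers: create the two new seam pairs, then retire the old
# seam pair (h(N-1):/b1, h0:/b2).
def grow_chain_commands(n):
    old_head, new = f"h{n - 1}", f"h{n}"
    return [
        f"add-brick {old_head}:/b1a {new}:/b2",
        f"add-brick {new}:/b1 h0:/b2a",
        f"remove-brick {old_head}:/b1 h0:/b2 start ... commit",
    ]

for cmd in grow_chain_commands(3):   # adding h3 to a 3-server chain
    print(cmd)
```

For N=3 this prints the same three steps as above with h2 and h3 substituted in; at no point in the sequence does any file drop below two copies.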
> You will not be left with only one copy of a file at any point in the<br>
> process, and you achieve the same "end result" as you would with<br>
> replace-brick. As mentioned before, I once again ask you to consider<br>
> whether you really want to deal with the configuration complexity of<br>
> chained replication, instead of just adding servers in pairs.<br>
><br>
> Please ask if there are any more questions or concerns.<br>
><br>
> Avati<br>
><br>
><br>
>><br>
>> James<br>
>><br>
>> Some comments below, although I'm a bit tired so I hope I said it all<br>
>> right.<br>
>><br>
>> > DHT's remove-brick + rebalance has been enhanced in the last couple of<br>
>> > releases to be quite sophisticated. It can handle graceful<br>
>> > decommissioning<br>
>> > of bricks, including open file descriptors and hard links.<br>
>> Sweet<br>
>><br>
>> ><br>
>> > This in a way is a feature overlap with replace-brick's data migration<br>
>> > functionality. Replace-brick's data migration is currently also used for<br>
>> > planned decommissioning of a brick.<br>
>> ><br>
>> > Reasons to remove replace-brick (or why remove-brick is better):<br>
>> ><br>
>> > - There are two methods of moving data. It is confusing for users and<br>
>> > hard for developers to maintain.<br>
>> ><br>
>> > - If the server being replaced is a member of a replica set, neither<br>
>> > remove-brick nor replace-brick data migration is necessary, because<br>
>> > self-healing itself will recreate the data (replace-brick actually uses<br>
>> > self-heal internally).<br>
>> ><br>
>> > - In a non-replicated config, if a server is getting replaced by a new<br>
>> > one, add-brick &lt;new&gt; + remove-brick &lt;old&gt; "start" achieves the same<br>
>> > goal as replace-brick &lt;old&gt; &lt;new&gt; "start".<br>
>> ><br>
>> > - In a non-replicated config, replace-brick is NOT glitch-free<br>
>> > (applications witness ENOTCONN if they are accessing data), whereas<br>
>> > add-brick &lt;new&gt; + remove-brick &lt;old&gt; is completely transparent.<br>
>> ><br>
>> > - Replace-brick strictly requires a server with enough free space to<br>
>> > hold the data of the old brick, whereas remove-brick will evenly spread<br>
>> > out the data of the brick being removed amongst the remaining servers.<br>
>><br>
>> Can you talk more about the replica = N case (where N is 2 or 3)?<br>
>> With remove-brick/add-brick, you will need to add/remove N (replica<br>
>> count) bricks at a time, right? With replace-brick, you could just swap<br>
>> out one, right? Isn't that a missing feature if you remove replace-brick?<br>
>><br>
>> ><br>
>> > - Replace-brick code is complex and messy (the real reason :p).<br>
>> ><br>
>> > - There is no clear reason why replace-brick's data migration is better<br>
>> > in any way than remove-brick's data migration.<br>
>> ><br>
>> > I plan to send out patches to remove all traces of replace-brick data<br>
>> > migration code by 3.5 branch time.<br>
>> ><br>
>> > NOTE that the replace-brick command itself will still exist, and you<br>
>> > can still replace one server with another in case a server dies. It is<br>
>> > only the data migration functionality that is being phased out.<br>
>> ><br>
>> > Please do ask any questions / raise concerns at this stage :)<br>
>> I heard that with 3.4 you can somehow change the replica count when<br>
>> adding new bricks... What's the full story here, please?<br>
>><br>
>> Thanks!<br>
>> James<br>
>><br>
>> ><br>
>> > Avati<br>
>> > _______________________________________________<br>
>> > Gluster-users mailing list<br>
>> > <a href="mailto:Gluster-users@gluster.org">Gluster-users@gluster.org</a><br>
>> > <a href="http://supercolony.gluster.org/mailman/listinfo/gluster-users" target="_blank">http://supercolony.gluster.org/mailman/listinfo/gluster-users</a><br>
>><br>
><br>
><br>
</div></div></blockquote></div><br></div></div>