<div dir="ltr">Hi all,<br><div class="gmail_quote"><div dir="ltr"><div><div><div><div><div><br></div>I&#39;m pretty new to Gluster, and the company I work for uses it for storage across 2 data centres. An issue has cropped up fairly recently with regard to the self-heal mechanism.<br>


<br></div><div>Occasionally the connection between these 2 Gluster servers breaks or drops momentarily. Due to the nature of the business it&#39;s highly likely that files have been written during this time. When the self-heal daemon runs it notices a discrepancy and gets the volume up to date. The problem we&#39;ve been seeing is that this appears to cause the CPU load to increase massively on both servers whilst the healing process takes place.<br>


<br></div><div>After trying to find out if there were any persistent network issues, I tried recreating this on a test system and can now reproduce it at will. Our test setup consists of 3 VMs: 2 Gluster servers and a client. The process to cause this was:<br>


Add an iptables rule to block one of the Gluster servers from being reached by the other server and the client.<br></div><div>Create some random files on the client.<br></div><div>Flush the iptables rules so the server is reachable again.<br>


</div><div>Force a self-heal to run.<br></div><div>Watch as the load on the Gluster servers goes bananas.<br><br></div><div>The problem with this is that whilst the self-heal happens, one of the Gluster servers will be inaccessible from the client, meaning no files can be read or written, causing problems for our users.<br>
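<div><br>The reproduction steps listed above can be sketched roughly as the dry-run script below. It only prints the commands for each host rather than running them. The IP addresses, the mount point and the file name are hypothetical placeholders, and the exact iptables rules are an assumption (simple INPUT drops on the isolated server), not necessarily the rules used on the test system.<br></div>
<pre>
```shell
#!/bin/sh
# Dry-run sketch of the reproduction steps: prints each command instead of running it.
# All addresses and paths below are hypothetical placeholders.
VOLUME="volume1"         # volume name as reported by gluster volume info
PEER_IP="192.0.2.11"     # hypothetical address of the other Gluster server
CLIENT_IP="192.0.2.20"   # hypothetical address of the client
MOUNT="/mnt/gluster"     # hypothetical glusterfs mount point on the client

step() {
    # Print the host a command belongs to, followed by the command itself.
    echo "[$1] $2"
}

repro_steps() {
    # 1. On the server to isolate: drop traffic from the peer and the client
    #    (assumed rule form; any rule that blocks the traffic would do).
    step isolated-server "iptables -A INPUT -s $PEER_IP -j DROP"
    step isolated-server "iptables -A INPUT -s $CLIENT_IP -j DROP"
    # 2. On the client: write a random file while the server is unreachable.
    step client "dd if=/dev/urandom of=$MOUNT/random-1 bs=1M count=10"
    # 3. On the isolated server: flush the rules so it is reachable again.
    step isolated-server "iptables -F"
    # 4. On either server: trigger a self-heal, then check the load.
    step any-server "gluster volume heal $VOLUME"
    step any-server "uptime"
}

repro_steps
```
</pre>
<div>Running the printed commands by hand on the matching hosts, in order, should give the same sequence as the steps listed above.<br></div>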


<br></div><div>I&#39;ve been searching for a solution, or at least for someone else who has had the same problem, but have not found anything. I don&#39;t know if this is a bug or a config issue (see below for config details). I&#39;ve tried a variety of different options but none of them have had any effect.<br>


<br></div><div></div>Our production setup is as follows:<br></div>2 Gluster servers (1 in each DC) replicating to each other<br></div>We then have multiple other servers that store and retrieve files on Gluster using a local glusterfs mount point.<br>


</div>Only 1 data centre is active at any one time<br><div>The Gluster servers are VMs on a Xen hypervisor.<br></div><div>All our systems are CentOS 5<br></div><div>Gluster 3.3.1 (I&#39;ve also tried 3.3.2)<br></div><div>


<br></div><div>gluster02 ~ gluster volume info rmfs<br> <br>Volume Name: volume1<br>Type: Replicate<br>Volume ID: 3fef44e1-e840-452e-b16b-a9fc698e7dfd<br>Status: Started<br>Number of Bricks: 1 x 2 = 2<br>Transport-type: tcp<br>


Bricks:<br>Brick1: gluster01:/mnt/store1<br>Brick2: gluster02:/mnt/store1<br>Options Reconfigured:<br>nfs.disable: off<br>auth.allow: 172.30.98.*<br>network.ping-timeout: 5<br><br></div><div>Any help or suggestions would be greatly appreciated. If you need anything else from me, just ask.<br>


<br></div><div>Thanks,<br><br></div><div>Darren<br></div></div>

</div><br></div>