Hi Pranith,

OK, thanks, I can do that.

Is there any sign of _how_ we got into this situation? Anything that I can go back and look for in the logs that might tell us more about how this happened, and how we can prevent it from happening again?

Thanks again,

Remi

On Thu, May 19, 2011 at 03:12, Pranith Kumar. Karampuri <pranithk@gluster.com> wrote:
Remi,
    Sorry, I think you want to keep web02 as the source and web01 as the sink, so the commands need to be executed on web01:

1) sudo setfattr -n trusted.afr.shared-application-data-client-1 -v 0sAAAAAAAAAAAAAAAA <file-name>
2) then do a find on the <file-name>,

Thanks
Pranith

----- Original Message -----
From: "Pranith Kumar. Karampuri" <pranithk@gluster.com>
To: "Remi Broemeling" <remi@goclio.com>
Cc: gluster-users@gluster.org
Sent: Thursday, May 19, 2011 2:14:52 PM
Subject: Re: [Gluster-users] Rebuild Distributed/Replicated Setup

hi Remi,
    This is a classic case of split-brain. See if the md5sum of the files in question matches on both web01 and web02. If it does, you can safely reset the xattr of the file on one of the replicas to trigger self-heal. If the md5sums don't match, you will have to select the machine you want to keep as the source (in your case web01), go to the other machine (in your case web02), and execute the following commands:

1) sudo setfattr -n trusted.afr.shared-application-data-client-0 -v 0sAAAAAAAAAAAAAAAA <file-name>
2) then do a find on the <file-name>,
   that will trigger self-heal and both copies will be in replication again.
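For the md5sum check mentioned above, comparing the backend brick copies directly should be enough, e.g. for the first affected file:

  # run the same command on web01 and on web02 and compare the output:
  md5sum /var/glusterfs/bricks/shared/agc/production/log/809223185/contact.log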

Self-heal can cause a performance hit if you trigger it for all the files at once and they are BIG files, so in that case trigger them one after another, waiting for each to complete.
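A rough sketch of doing them one at a time, assuming you keep web01 as the source (adjust the client-0/client-1 side if you keep web02 instead), the brick path from your volfile, and a placeholder client mount at /mnt/shared:

  files="agc/production/log/809223185/contact.log
  agc/production/log/809223185/event.log"        # ...plus the other affected files

  for f in $files; do
      # on web02 (the copy being discarded): clear its pending counters for web01
      sudo setfattr -n trusted.afr.shared-application-data-client-0 \
          -v 0sAAAAAAAAAAAAAAAA "/var/glusterfs/bricks/shared/$f"
      # then look the file up through the glusterfs mount to kick off self-heal,
      # one file per iteration rather than all at once
      stat "/mnt/shared/$f" > /dev/null
  done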

Let me know if you need any more help with this. Removing the whole web02 data and triggering a total self-heal is a very expensive operation; I wouldn't do that.

Pranith.
----- Original Message -----
From: "Remi Broemeling" <remi@goclio.com>
To: "Pranith Kumar. Karampuri" <pranithk@gluster.com>
Cc: gluster-users@gluster.org
Sent: Wednesday, May 18, 2011 8:21:33 PM
Subject: Re: [Gluster-users] Rebuild Distributed/Replicated Setup

Sure,

These files are just a sampling -- a lot of other files are showing the same "split-brain" behaviour.

[14:42:45][root@web01:/var/glusterfs/bricks/shared]# getfattr -d -m "trusted.afr*" agc/production/log/809223185/contact.log
# file: agc/production/log/809223185/contact.log
trusted.afr.shared-application-data-client-0=0sAAAAAAAAAAAAAAAA
trusted.afr.shared-application-data-client-1=0sAAAABQAAAAAAAAAA
[14:45:15][root@web02:/var/glusterfs/bricks/shared]# getfattr -d -m "trusted.afr*" agc/production/log/809223185/contact.log
# file: agc/production/log/809223185/contact.log
trusted.afr.shared-application-data-client-0=0sAAACOwAAAAAAAAAA
trusted.afr.shared-application-data-client-1=0sAAAAAAAAAAAAAAAA

[14:42:53][root@web01:/var/glusterfs/bricks/shared]# getfattr -d -m "trusted.afr*" agc/production/log/809223185/event.log
# file: agc/production/log/809223185/event.log
trusted.afr.shared-application-data-client-0=0sAAAAAAAAAAAAAAAA
trusted.afr.shared-application-data-client-1=0sAAAADgAAAAAAAAAA
[14:45:24][root@web02:/var/glusterfs/bricks/shared]# getfattr -d -m "trusted.afr*" agc/production/log/809223185/event.log
# file: agc/production/log/809223185/event.log
trusted.afr.shared-application-data-client-0=0sAAAGXQAAAAAAAAAA
trusted.afr.shared-application-data-client-1=0sAAAAAAAAAAAAAAAA

[14:43:02][root@web01:/var/glusterfs/bricks/shared]# getfattr -d -m "trusted.afr*" agc/production/log/809223635/contact.log
# file: agc/production/log/809223635/contact.log
trusted.afr.shared-application-data-client-0=0sAAAAAAAAAAAAAAAA
trusted.afr.shared-application-data-client-1=0sAAAACgAAAAAAAAAA
[14:45:28][root@web02:/var/glusterfs/bricks/shared]# getfattr -d -m "trusted.afr*" agc/production/log/809223635/contact.log
# file: agc/production/log/809223635/contact.log
trusted.afr.shared-application-data-client-0=0sAAAELQAAAAAAAAAA
trusted.afr.shared-application-data-client-1=0sAAAAAAAAAAAAAAAA

[14:43:39][root@web01:/var/glusterfs/bricks/shared]# getfattr -d -m "trusted.afr*" agc/production/log/809224061/contact.log
# file: agc/production/log/809224061/contact.log
trusted.afr.shared-application-data-client-0=0sAAAAAAAAAAAAAAAA
trusted.afr.shared-application-data-client-1=0sAAAACQAAAAAAAAAA
[14:45:32][root@web02:/var/glusterfs/bricks/shared]# getfattr -d -m "trusted.afr*" agc/production/log/809224061/contact.log
# file: agc/production/log/809224061/contact.log
trusted.afr.shared-application-data-client-0=0sAAAD+AAAAAAAAAAA
trusted.afr.shared-application-data-client-1=0sAAAAAAAAAAAAAAAA

[14:43:42][root@web01:/var/glusterfs/bricks/shared]# getfattr -d -m "trusted.afr*" agc/production/log/809224321/contact.log
# file: agc/production/log/809224321/contact.log
trusted.afr.shared-application-data-client-0=0sAAAAAAAAAAAAAAAA
trusted.afr.shared-application-data-client-1=0sAAAACAAAAAAAAAAA
[14:45:37][root@web02:/var/glusterfs/bricks/shared]# getfattr -d -m "trusted.afr*" agc/production/log/809224321/contact.log
# file: agc/production/log/809224321/contact.log
trusted.afr.shared-application-data-client-0=0sAAAERAAAAAAAAAAA
trusted.afr.shared-application-data-client-1=0sAAAAAAAAAAAAAAAA

[14:43:45][root@web01:/var/glusterfs/bricks/shared]# getfattr -d -m "trusted.afr*" agc/production/log/809215319/event.log
# file: agc/production/log/809215319/event.log
trusted.afr.shared-application-data-client-0=0sAAAAAAAAAAAAAAAA
trusted.afr.shared-application-data-client-1=0sAAAABwAAAAAAAAAA
[14:45:45][root@web02:/var/glusterfs/bricks/shared]# getfattr -d -m "trusted.afr*" agc/production/log/809215319/event.log
# file: agc/production/log/809215319/event.log
trusted.afr.shared-application-data-client-0=0sAAAC/QAAAAAAAAAA
trusted.afr.shared-application-data-client-1=0sAAAAAAAAAAAAAAAA
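
In case it helps with interpreting these: as I understand it, the 0s prefix in getfattr output means the value is base64-encoded, and the AFR changelog is three big-endian 32-bit counters (data/metadata/entry pending operations), so one of the non-zero values above decodes like this:

  $ echo AAAABQAAAAAAAAAA | base64 -d | od -An -tx1
   00 00 00 05 00 00 00 00 00 00 00 00

i.e. web01 has five pending data operations recorded against web02 for that file, while web02 has its own non-zero count recorded against web01 -- which I gather is exactly the "both sides accuse each other" state you described as split-brain.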

On Wed, May 18, 2011 at 01:31, Pranith Kumar. Karampuri <pranithk@gluster.com> wrote:

hi Remi,
It seems the split-brain is detected on the following files:
/agc/production/log/809223185/contact.log
/agc/production/log/809223185/event.log
/agc/production/log/809223635/contact.log
/agc/production/log/809224061/contact.log
/agc/production/log/809224321/contact.log
/agc/production/log/809215319/event.log

Could you give the output of the following command for each of the files above, on both bricks of the replica pair?

getfattr -d -m "trusted.afr*" <filepath>

Thanks

Pranith

----- Original Message -----
From: "Remi Broemeling" <remi@goclio.com>
To: gluster-users@gluster.org
Sent: Tuesday, May 17, 2011 9:02:44 PM
Subject: Re: [Gluster-users] Rebuild Distributed/Replicated Setup

Hi Pranith. Sure, here is a pastebin sampling of logs from one of the hosts: http://pastebin.com/1U1ziwjC

On Mon, May 16, 2011 at 20:48, Pranith Kumar. Karampuri <pranithk@gluster.com> wrote:

hi Remi,
Would it be possible to post the logs from the client, so that we can find out what issue you are running into?

Pranith

----- Original Message -----
From: "Remi Broemeling" <remi@goclio.com>
To: gluster-users@gluster.org
Sent: Monday, May 16, 2011 10:47:33 PM
Subject: [Gluster-users] Rebuild Distributed/Replicated Setup

Hi,

I've got a distributed/replicated GlusterFS v3.1.2 setup (installed via RPM) across two servers (web01 and web02) with the following volume config:

volume shared-application-data-client-0
    type protocol/client
    option remote-host web01
    option remote-subvolume /var/glusterfs/bricks/shared
    option transport-type tcp
    option ping-timeout 5
end-volume

volume shared-application-data-client-1
    type protocol/client
    option remote-host web02
    option remote-subvolume /var/glusterfs/bricks/shared
    option transport-type tcp
    option ping-timeout 5
end-volume

volume shared-application-data-replicate-0
    type cluster/replicate
    subvolumes shared-application-data-client-0 shared-application-data-client-1
end-volume

volume shared-application-data-write-behind
    type performance/write-behind
    subvolumes shared-application-data-replicate-0
end-volume

volume shared-application-data-read-ahead
    type performance/read-ahead
    subvolumes shared-application-data-write-behind
end-volume

volume shared-application-data-io-cache
    type performance/io-cache
    subvolumes shared-application-data-read-ahead
end-volume

volume shared-application-data-quick-read
    type performance/quick-read
    subvolumes shared-application-data-io-cache
end-volume

volume shared-application-data-stat-prefetch
    type performance/stat-prefetch
    subvolumes shared-application-data-quick-read
end-volume

volume shared-application-data
    type debug/io-stats
    subvolumes shared-application-data-stat-prefetch
end-volume

In total, four servers mount this via GlusterFS FUSE. For whatever reason (I'm really not sure why), the GlusterFS filesystem has run into a bit of a split-brain nightmare (although, to my knowledge, an actual split-brain situation has never occurred in this environment), and I have been seeing constant corruption issues across the filesystem as well as complaints that files cannot be self-healed.

What I would like to do is completely empty one of the two servers (here I am trying to empty web01), making the other one (in this case web02) the authoritative source for the data, and then have web01 completely rebuild its mirror directly from web02.

What's the easiest/safest way to do this? Is there a command that I can run that will force web01 to re-initialize its mirror directly from web02 (and thus completely eradicate all of the split-brain errors and data inconsistencies)?
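
For what it's worth, what I had in mind was roughly the following, though I'm not at all sure it's safe, which is why I'm asking (the mount point is a placeholder, and the recursive stat is just the self-heal trigger I've seen suggested elsewhere):

  # after emptying web01's brick, walk the whole volume from a client mount
  # to trigger self-heal on everything (presumably expensive for a large volume):
  find /mnt/shared -noleaf -print0 | xargs --null stat > /dev/null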

Thanks!

--
Remi Broemeling
System Administrator
Clio - Practice Management Simplified
1-888-858-2546 x(2^5) | remi@goclio.com
www.goclio.com | blog | twitter | facebook
_______________________________________________
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users

--
Remi Broemeling
System Administrator
Clio - Practice Management Simplified
1-888-858-2546 x(2^5) | remi@goclio.com
www.goclio.com | blog | twitter | facebook