<div dir="ltr">I ran another set of tests this week and was able to reproduce the issue. Going by the extended attributes, I think I ran into the same issue I saw earlier.<div><br></div><div>Do you think I need to open a bug report?<div><br></div><div><div>Brick 1</div><div><br></div><div>trusted.afr.PL2-client-0=0x000000000000000000000000</div><div>trusted.afr.PL2-client-1=0x000000010000000000000000</div><div>trusted.afr.PL2-client-2=0x000000010000000000000000</div><div>trusted.gfid=0x1cea509b07cc49e9bd28560b5f33032c</div><div><br></div><div>Brick 2</div><div><br></div><div><div>trusted.afr.PL2-client-0=0x0000125c0000000000000000</div><div>trusted.afr.PL2-client-1=0x000000000000000000000000</div><div>trusted.afr.PL2-client-2=0x000000000000000000000000</div><div>trusted.gfid=0x1cea509b07cc49e9bd28560b5f33032c</div></div><div><br></div><div>Brick 3</div><div><br></div><div><div>trusted.afr.PL2-client-0=0x0000125c0000000000000000</div><div>trusted.afr.PL2-client-1=0x000000000000000000000000</div><div>trusted.afr.PL2-client-2=0x000000000000000000000000</div><div>trusted.gfid=0x1cea509b07cc49e9bd28560b5f33032c</div></div><div><br></div><div><br></div><div><div>[root@ip-172-31-12-218 ~]# gluster volume info</div><div> </div><div>Volume Name: PL1</div><div>Type: Replicate</div><div>Volume ID: bd351bae-d467-4e8c-bbd2-6a0fe99c346a</div><div>Status: Started</div><div>Number of Bricks: 1 x 3 = 3</div><div>Transport-type: tcp</div><div>Bricks:</div><div>Brick1: 172.31.38.189:/data/vol1/gluster-data</div><div>Brick2: 172.31.16.220:/data/vol1/gluster-data</div><div>Brick3: 172.31.12.218:/data/vol1/gluster-data</div><div>Options Reconfigured:</div><div>cluster.server-quorum-type: server</div><div>network.ping-timeout: 12</div><div>nfs.addr-namelookup: off</div><div>performance.cache-size: 2147483648</div><div>cluster.quorum-type: auto</div><div>performance.read-ahead: off</div><div>performance.client-io-threads: 
on</div><div>performance.io-thread-count: 64</div><div>cluster.eager-lock: on</div><div>cluster.server-quorum-ratio: 51%</div><div> </div><div>Volume Name: PL2</div><div>Type: Replicate</div><div>Volume ID: e6ad8787-05d8-474b-bc78-748f8c13700f</div><div>Status: Started</div><div>Number of Bricks: 1 x 3 = 3</div><div>Transport-type: tcp</div><div>Bricks:</div><div>Brick1: 172.31.38.189:/data/vol2/gluster-data</div><div>Brick2: 172.31.16.220:/data/vol2/gluster-data</div><div>Brick3: 172.31.12.218:/data/vol2/gluster-data</div><div>Options Reconfigured:</div><div>nfs.addr-namelookup: off</div><div>cluster.server-quorum-type: server</div><div>network.ping-timeout: 12</div><div>performance.cache-size: 2147483648</div><div>cluster.quorum-type: auto</div><div>performance.read-ahead: off</div><div>performance.client-io-threads: on</div><div>performance.io-thread-count: 64</div><div>cluster.eager-lock: on</div><div>cluster.server-quorum-ratio: 51%</div><div>[root@ip-172-31-12-218 ~]# </div></div><div><br></div><div><b>Mount command</b></div><div><br></div><div>Client</div><div><br></div><div>mount -t glusterfs -o defaults,enable-ino32,direct-io-mode=disable,log-level=WARNING,log-file=/var/log/gluster.log,backupvolfile-server=172.31.38.189,backupvolfile-server=172.31.12.218,background-qlen=256 172.31.16.220:/PL2  /mnt/vm<br></div><div><br></div><div>Server</div><div><br></div><div><div>/dev/xvdf    /data/vol1 xfs defaults,inode64,noatime 1 2</div><div>/dev/xvdg   /data/vol2 xfs defaults,inode64,noatime 1 2</div></div><div><br></div><div><b>Packages</b></div><div><br></div><div>Client</div><div><br></div><div><div>rpm -qa | grep gluster</div><div>glusterfs-fuse-3.5.2-1.el6.x86_64</div><div>glusterfs-3.5.2-1.el6.x86_64</div><div>glusterfs-libs-3.5.2-1.el6.x86_64</div></div><div><br></div><div>Server</div><div><br></div><div><div>[root@ip-172-31-12-218 ~]# rpm -qa | grep 
gluster</div><div>glusterfs-3.5.2-1.el6.x86_64</div><div>glusterfs-fuse-3.5.2-1.el6.x86_64</div><div>glusterfs-api-3.5.2-1.el6.x86_64</div><div>glusterfs-server-3.5.2-1.el6.x86_64</div><div>glusterfs-libs-3.5.2-1.el6.x86_64</div><div>glusterfs-cli-3.5.2-1.el6.x86_64</div><div>[root@ip-172-31-12-218 ~]# </div></div><div><br></div></div></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Sat, Sep 6, 2014 at 9:01 AM, Pranith Kumar Karampuri <span dir="ltr">&lt;<a href="mailto:pkarampu@redhat.com" target="_blank">pkarampu@redhat.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="HOEnZb"><div class="h5"><br>

On 09/06/2014 04:53 AM, Jeff Darcy wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

I have a replicated GlusterFS setup on 3 bricks (replica count = 3). I have<br>

client and server quorum turned on. I rebooted one of the 3 bricks. When it<br>

came back up, the client started throwing error messages that one of the<br>

files went into split brain.<br>

</blockquote>

This is a good example of how split brain can happen even with all kinds of<br>

quorum enabled.  Let&#39;s look at those xattrs.  BTW, thank you for a very<br>

nicely detailed bug report which includes those.<br>

<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

BRICK1<br>

========<br>

[root@ip-172-31-38-189 ~]# getfattr -d -m . -e hex /data/vol2/gluster-data/apache_cp_mm1/logs/access_log.2014-09-05-17_00_00<br>

getfattr: Removing leading &#39;/&#39; from absolute path names<br>

# file: data/vol2/gluster-data/apache_cp_mm1/logs/access_log.2014-09-05-17_00_00<br>

trusted.afr.PL2-client-0=0x000000000000000000000000<br>

trusted.afr.PL2-client-1=0x000000010000000000000000<br>

trusted.afr.PL2-client-2=0x000000010000000000000000<br>

trusted.gfid=0xea950263977e46bf89a0ef631ca139c2<br>

<br>

BRICK 2<br>

=======<br>

[root@ip-172-31-16-220 ~]# getfattr -d -m . -e hex /data/vol2/gluster-data/apache_cp_mm1/logs/access_log.2014-09-05-17_00_00<br>

getfattr: Removing leading &#39;/&#39; from absolute path names<br>

# file: data/vol2/gluster-data/apache_cp_mm1/logs/access_log.2014-09-05-17_00_00<br>

trusted.afr.PL2-client-0=0x00000d460000000000000000<br>

trusted.afr.PL2-client-1=0x000000000000000000000000<br>

trusted.afr.PL2-client-2=0x000000000000000000000000<br>

trusted.gfid=0xea950263977e46bf89a0ef631ca139c2<br>

BRICK 3<br>

=========<br>

[root@ip-172-31-12-218 ~]# getfattr -d -m . -e hex /data/vol2/gluster-data/apache_cp_mm1/logs/access_log.2014-09-05-17_00_00<br>

getfattr: Removing leading &#39;/&#39; from absolute path names<br>

# file: data/vol2/gluster-data/apache_cp_mm1/logs/access_log.2014-09-05-17_00_00<br>

trusted.afr.PL2-client-0=0x00000d460000000000000000<br>

trusted.afr.PL2-client-1=0x000000000000000000000000<br>

trusted.afr.PL2-client-2=0x000000000000000000000000<br>

trusted.gfid=0xea950263977e46bf89a0ef631ca139c2<br>

</blockquote>

Here, we see that brick 1 shows a single pending operation for the other<br>

two, while they show 0xd46 (3398) pending operations for brick 1.<br>
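<br>
To make the counts above easier to check, here is a minimal Python sketch that decodes a trusted.afr.* value. It assumes the value is 12 bytes holding three big-endian 32-bit counters (data, metadata, entry pending counts); the function name decode_afr_changelog is made up for this illustration and is not part of any Gluster tool.<br>

```python
import struct


def decode_afr_changelog(hex_value):
    """Split a trusted.afr.* hex string into (data, metadata, entry) counts.

    Assumption: 12 bytes, three unsigned 32-bit big-endian counters.
    """
    digits = hex_value[2:] if hex_value.startswith("0x") else hex_value
    raw = bytes.fromhex(digits)
    if len(raw) != 12:
        raise ValueError("expected 12 bytes, got %d" % len(raw))
    return struct.unpack(">III", raw)


# Values copied from the getfattr output quoted above.
print(decode_afr_changelog("0x000000010000000000000000"))  # (1, 0, 0)
print(decode_afr_changelog("0x00000d460000000000000000"))  # (3398, 0, 0)
```

Only the first counter (data operations) is non-zero in this report, which matches the single pending write on brick 1 and the 0xd46 pending writes on the other two.<br>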

Here&#39;s how this can happen.<br>

<br>

(1) There is exactly one pending operation.<br>

<br>

(2) Brick1 completes the write first, and says so.<br>

<br>

(3) Client sends messages to all three, saying to decrement brick1&#39;s<br>

count.<br>

<br>

(4) All three bricks receive and process that message.<br>

<br>

(5) Brick1 fails.<br>

<br>

(6) Brick2 and brick3 complete the write, and say so.<br>

<br>

(7) Client tells all bricks to decrement remaining counts.<br>

<br>

(8) Brick2 and brick3 receive and process that message.<br>

<br>

(9) Brick1 is dead, so its counts for brick2/3 stay at one.<br>

<br>

(10) Brick2 and brick3 have quorum, with all-zero pending counters.<br>

<br>

(11) Client sends 0xd46 more writes to brick2 and brick3.<br>
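<br>
The eleven steps above can be replayed with a toy model. This is only a sketch under simplifying assumptions, not AFR's actual code: each brick keeps one data counter per brick, every write first increments all counters on every reachable brick, and the client then decrements the counters of the bricks that acknowledged the write. The helper names adjust and in_split_brain are hypothetical.<br>

```python
BRICKS = ("brick1", "brick2", "brick3")

# state[holder][target] = writes that `holder` believes are pending on `target`
state = {h: {t: 0 for t in BRICKS} for h in BRICKS}
alive = set(BRICKS)


def adjust(targets, delta):
    """Apply `delta` to the counters for `targets` on every brick still alive."""
    for holder in alive:
        for target in targets:
            state[holder][target] += delta


def in_split_brain(a, b):
    """Two bricks accuse each other when each holds a pending count for the other."""
    return state[a][b] > 0 and state[b][a] > 0


adjust(BRICKS, +1)                  # (1) exactly one pending operation
adjust(["brick1"], -1)              # (2)-(4) brick1 finishes first, all bricks hear about it
alive.discard("brick1")             # (5) brick1 fails
adjust(["brick2", "brick3"], -1)    # (6)-(9) only brick2 and brick3 process the decrement
for _ in range(0xd46):              # (11) further writes while brick1 is down
    adjust(BRICKS, +1)
    adjust(["brick2", "brick3"], -1)

for brick in BRICKS:
    print(brick, state[brick])
print(in_split_brain("brick1", "brick2"))  # True
```

Under these assumptions the final counters match the quoted xattrs: brick1 holds 1 for brick2 and brick3, while brick2 and brick3 each hold 0xd46 for brick1.<br>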

<br>

Note that at no point did we lose quorum. Note also the tight timing<br>

required.  If brick1 had failed an instant earlier, it would not have<br>

decremented its own counter.  If it had failed an instant later, it<br>

would have decremented brick2&#39;s and brick3&#39;s as well.  If brick1 had not<br>

finished first, we&#39;d be in yet another scenario.  If delayed changelog<br>

had been operative, the messages at (3) and (7) would have been combined<br>

to leave us in yet another scenario.  As far as I can tell, we would<br>

have been able to resolve the conflict in all those cases.<br>

*** Key point: quorum enforcement does not totally eliminate split<br>

brain.  It only makes the frequency a few orders of magnitude lower. ***<br>

</blockquote>

<br></div></div>

Not quite right. After we fixed the bug <a href="https://bugzilla.redhat.com/show_bug.cgi?id=1066996" target="_blank">https://bugzilla.redhat.com/show_bug.cgi?id=1066996</a>, there are only two possible ways to introduce split-brain:<br>

1) An implementation bug in changelog xattr marking. I believe that is the case here.<br>

2) Keep writing to the file from the mount, then:<br>

a) take brick1 down and wait until at least one write succeeds;<br>

b) bring brick1 back up and take brick2 down (self-heal should not happen), then wait until at least one write succeeds;<br>

c) bring brick2 back up and take brick3 down (self-heal should not happen), then wait until at least one write succeeds.<br>
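<br>
Steps a) to c) can be sketched in Python as follows. This is a simplified model, not AFR code: it assumes that a write succeeding while one brick is down leaves a pending count against that brick on both live bricks, and that a brick can act as a heal source only if no other brick holds a pending count against it. The names write_with_one_down and heal_sources are hypothetical.<br>

```python
BRICKS = ("brick1", "brick2", "brick3")


def write_with_one_down(pending, down):
    """Record one successful write while `down` is offline."""
    for holder in BRICKS:
        if holder != down:
            pending[holder][down] += 1


def heal_sources(pending):
    """Bricks that no other brick blames; an empty list means split-brain."""
    return [
        brick for brick in BRICKS
        if all(pending[h][brick] == 0 for h in BRICKS if h != brick)
    ]


pending = {h: {t: 0 for t in BRICKS} for h in BRICKS}
for down in BRICKS:  # a), b), c): one brick down at a time, no self-heal in between
    write_with_one_down(pending, down)
    print(down, "was down; clean sources:", heal_sources(pending))
```

Quorum holds at every step because two of the three bricks are always up, yet after step c) every brick is blamed by the other two and no clean source remains.<br>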

<br>

With the outcast implementation, case 2 will also be immune to split-brain errors.<br>

<br>

Then the only way to get split-brain in AFR is an implementation error in changelog marking. If we test it thoroughly and fix such problems, we can make it immune to split-brain :-).<span class="HOEnZb"><font color="#888888"><br>

<br>

Pranith<br>

</font></span><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">

So, is there any way to prevent this completely?  Some AFR enhancements,<br>

such as the oft-promised &quot;outcast&quot; feature[1], might have helped.<br>

NSR[2] is immune to this particular problem.  &quot;Policy based split brain<br>

resolution&quot;[3] might have resolved it automatically instead of merely<br>

flagging it.  Unfortunately, those are all in the future.  For now, I&#39;d<br>

say the best approach is to resolve the conflict manually and try to<br>

move on.  Unless there&#39;s more going on than meets the eye, recurrence<br>

should be very unlikely.<br>

<br>

[1] <a href="http://www.gluster.org/community/documentation/index.php/Features/outcast" target="_blank">http://www.gluster.org/community/documentation/index.php/Features/outcast</a><br>

<br>

[2] <a href="http://www.gluster.org/community/documentation/index.php/Features/new-style-replication" target="_blank">http://www.gluster.org/community/documentation/index.php/Features/new-style-replication</a><br>

<br>

[3] <a href="http://www.gluster.org/community/documentation/index.php/Features/pbspbr" target="_blank">http://www.gluster.org/community/documentation/index.php/Features/pbspbr</a><br></span><span class="">

_______________________________________________<br>

Gluster-users mailing list<br>

<a href="mailto:Gluster-users@gluster.org" target="_blank">Gluster-users@gluster.org</a><br>

<a href="http://supercolony.gluster.org/mailman/listinfo/gluster-users" target="_blank">http://supercolony.gluster.org/mailman/listinfo/gluster-users</a><br>

</span></blockquote>

<br>

</blockquote></div><br></div>