<div class="gmail_quote">On Thu, Apr 12, 2012 at 3:49 PM, Jeff Darcy wrote:<br><blockquote style="margin:0px 0px 0px 0.8ex;padding-left:1ex;border-left-color:rgb(204,204,204);border-left-width:1px;border-left-style:solid" class="gmail_quote">
<br>
(1) To a first approximation, it should be safe to "merge" directory contents<br>
despite there being a split-brain problem, by healing any file that exists on<br>
only one brick from there to its peer(s).</blockquote><div> </div><div> I am not sure if I got this right, but if I did, this should be the two-way scenario depicted at the end of the message.</div><blockquote style="margin:0px 0px 0px 0.8ex;padding-left:1ex;border-left-color:rgb(204,204,204);border-left-width:1px;border-left-style:solid" class="gmail_quote">
<p>
(3) The reason you continue to get I/O errors is probably that the xattrs on<br>
the *parent directory* still indicate pending operations on both sides. You<br>
can verify this with the following command on each brick:<br>
<br>
getfattr -d -e hex -n trusted.glusterfs.dht /a</p></blockquote><div> </div><div>Unfortunately:</div><div>getfattr: /a: Input/output error</div><div>And when I run it on any working instance, it says trusted.glusterfs.dht: No such attribute.</div>
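<div>For when the attribute can be read: a minimal sketch of splitting a hex value printed by getfattr -e hex into counters. It assumes the value is 12 bytes (24 hex digits) holding three 4-byte big-endian counters, with the last one being the directory-entry count. Treating the first two as data and metadata counts is my assumption, and decode_pending is a made-up helper name, not a Gluster tool.</div>
<pre>
```shell
# Hypothetical helper: split a 12-byte hex value (as printed by
# "getfattr -e hex") into three 4-byte counters.
# Assumption: the order is data, metadata, entry.
decode_pending() {
    hex=${1#0x}                      # drop the leading "0x" if present
    if [ "${#hex}" -ne 24 ]; then    # 12 bytes = 24 hex digits
        return 1
    fi
    data=$(printf '%s' "$hex" | cut -c1-8)
    meta=$(printf '%s' "$hex" | cut -c9-16)
    entry=$(printf '%s' "$hex" | cut -c17-24)
    # Convert each 8-digit hex field to decimal.
    echo "data=$((0x$data)) metadata=$((0x$meta)) entry=$((0x$entry))"
}

# Example with a made-up value:
decode_pending 0x000000000000000000000002
```
</pre>
<div>A non-zero entry count would be the case described in the next quoted paragraph.</div>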
<blockquote style="margin:0px 0px 0px 0.8ex;padding-left:1ex;border-left-color:rgb(204,204,204);border-left-width:1px;border-left-style:solid" class="gmail_quote"><p>
If the result is non-zero (most likely in the last four-byte integer indicating<br>
a directory-entry operation) then that confirms our theory. It should be safe<br>
for the self-heal code to clear these counts if (and only if) the directories<br>
are checked and found identical. In fact, I think we already do this. Thus,<br>
manual copying of files followed by self-heal on the parent directory should<br>
make the errors go away. I encourage you to try that while I go look at the code.</p></blockquote><div> </div><div> Ok, I thought of two ways to manually copy files and make gluster think the directories are identical.</div>
<div> ----BTW, I found out that if I disrupt connectivity between the nodes again, I am able to do operations on the mountpoint (/a) ----</div><div> </div><div>1st way - node1 (10.0.2.14) </div><div>scp /local/howareyou 10.0.2.15:/local</div>
<div>scp 10.0.2.15:/local/hello /local</div><div>ls /a</div><div>ls: cannot access /a: Input/output error</div><div>iptables -A INPUT -s 10.0.2.15 -j DROP - so I can access mountpoint</div><div>ls -lh /a</div><div>????????????? ? ? ? ? ? hello</div>
<div>-rw-r--r-- 1 root root 0 Apr 6 01:48 howareyou</div><div> </div><div>2nd way - node1 (10.0.2.14) (from scratch)</div><div>iptables -A INPUT -p tcp -s 10.0.2.15 -j DROP - so I can access mountpoint</div><div>-allow ssh-</div>
<div>scp 10.0.2.15:/a/hello /a</div><div>scp /a/howareyou 10.0.2.14:/a</div><div>- now they are in sync -</div><div>iptables -F INPUT</div><div>ls /a - works briefly but after a while:</div><div>ls: cannot access /a: Input/output error</div>
<div> </div><div>As per documentation, triggering a self heal is done by</div><div>find &lt;gluster-mount&gt; -noleaf -print0 | xargs --null stat (where &lt;gluster-mount&gt; is /a) - but again, /a cannot be accessed.</div>
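<div>To check whether the two sides really are identical after a manual copy, something along these lines might help. It is only a sketch: only_on_one_side is a made-up helper that compares two plain file listings, and the commented usage assumes that /local is the brick directory on both nodes and that ssh between them is allowed. It looks at top-level names only, not file contents or xattrs.</div>
<pre>
```shell
# Hypothetical helper: given two files that each hold one name per line,
# print the names that appear in only one of them.
only_on_one_side() {
    a_sorted=$(mktemp) || return 1
    b_sorted=$(mktemp) || return 1
    sort "$1" > "$a_sorted"          # comm needs sorted input
    sort "$2" > "$b_sorted"
    comm -23 "$a_sorted" "$b_sorted" | sed 's/^/only in first: /'
    comm -13 "$a_sorted" "$b_sorted" | sed 's/^/only in second: /'
    rm -f "$a_sorted" "$b_sorted"
}

# Possible usage on node1 (paths and address are assumptions):
#   ls -A /local > /tmp/node1.list
#   ssh 10.0.2.15 ls -A /local > /tmp/node2.list
#   only_on_one_side /tmp/node1.list /tmp/node2.list
```
</pre>
<div>No output would mean both listings contain the same names.</div>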
</div>