<div class="gmail_quote">On Thu, Apr 12, 2012 at 3:49 PM, Jeff Darcy wrote:<br><blockquote style="margin:0px 0px 0px 0.8ex;padding-left:1ex;border-left-color:rgb(204,204,204);border-left-width:1px;border-left-style:solid" class="gmail_quote">

<br>

(1) To a first approximation, it should be safe to &quot;merge&quot; directory contents<br>

despite there being a split-brain problem, by healing any file that exists on<br>

only one brick from there to its peer(s).</blockquote><div> </div><div>I am not sure if I got this right, but if I did, this should be the two-way scenario described at the end of this message.</div><blockquote style="margin:0px 0px 0px 0.8ex;padding-left:1ex;border-left-color:rgb(204,204,204);border-left-width:1px;border-left-style:solid" class="gmail_quote">

<p>

(3) The reason you continue to get I/O errors is probably that the xattrs on<br>

the *parent directory* still indicate pending operations on both sides.  You<br>

can verify this with the following command on each brick:<br>

<br>

        getfattr -d -e hex -n trusted.glusterfs.dht /a</p></blockquote><div> </div><div>Unfortunately, this fails:</div><div>getfattr: /a: Input/output error</div><div>When run on any instance that is still working, it reports: trusted.glusterfs.dht: No such attribute.</div>
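<div>For reference, here is a minimal Python sketch of how a hex value printed by getfattr could be split into its counters once it can be read. It rests on several assumptions: that the value is 12 bytes long, that it holds three big-endian 32-bit integers in the order data, metadata, entry, and that it comes from one of the trusted.afr.* attributes on the brick directory. My understanding is that the replicate translator keeps its pending counts there rather than in trusted.glusterfs.dht, and that something like getfattr -d -m . -e hex /local would list them if /local is the brick. The function name decode_pending and the sample value are made up for illustration.</div>

```python
import struct


def decode_pending(hex_value):
    """Split a 12-byte hex xattr value into data/metadata/entry counters.

    Assumption: the value is three big-endian 32-bit integers, in that order.
    """
    text = hex_value.strip()
    if text.lower().startswith("0x"):
        text = text[2:]
    raw = bytes.fromhex(text)
    if len(raw) != 12:
        raise ValueError("expected 12 bytes, got %d" % len(raw))
    data, metadata, entry = struct.unpack("!III", raw)
    return {"data": data, "metadata": metadata, "entry": entry}


# Hypothetical value: only the last four-byte integer (entry operations) is non-zero.
print(decode_pending("0x000000000000000000000001"))
# prints {'data': 0, 'metadata': 0, 'entry': 1}
```

<div>A non-zero "entry" counter on both bricks would match the theory that each side still records pending directory-entry operations against the other.</div>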

<blockquote style="margin:0px 0px 0px 0.8ex;padding-left:1ex;border-left-color:rgb(204,204,204);border-left-width:1px;border-left-style:solid" class="gmail_quote"><p>


If the result is non-zero (most likely in the last four-byte integer indicating<br>

a directory-entry operation) then that confirms our theory.  It should be safe<br>

for the self-heal code to clear these counts if (and only if) the directories<br>

are checked and found identical.  In fact, I think we already do this.  Thus,<br>

manual copying of files followed by self-heal on the parent directory should<br>

make the errors go away.  I encourage you to try that while I go look at the code.</p></blockquote><div> </div><div>OK, I thought of two ways to copy the files manually and make Gluster treat the directories as identical.</div>

<div>BTW, I found that if I disrupt connectivity between the nodes again, I can perform operations on the mountpoint (/a).</div><div> </div><div>1st way - node1 (10.0.2.14) </div><div>scp /local/howareyou 10.0.2.15:/local</div>

<div>scp 10.0.2.15:/local/hello /local</div><div>ls /a</div><div>ls: cannot access /a: Input/output error</div><div>iptables -A INPUT -s 10.0.2.15 -j DROP - so I can access mountpoint</div><div>ls -lh /a</div><div>????????????? ? ?      ?       ?            ? hello</div>

<div>-rw-r--r-- 1 root root 0 Apr   6 01:48 howareyou</div><div> </div><div>2nd way - node1 (10.0.2.14) (from scratch)</div><div>iptables -A INPUT -p tcp -s 10.0.2.15 -j DROP - so I can access mountpoint</div><div>-allow ssh-</div>

<div>scp 10.0.2.15:/a/hello /a</div><div>scp /a/howareyou 10.0.2.14:/a</div><div>- now they are in sync -</div><div>iptables -F INPUT</div><div>ls /a - works briefly but after a while:</div><div>ls: cannot access /a: Input/output error</div>
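<div>One way to double-check the "now they are in sync" step is to compare the directory listings taken on each node. This is only a sketch: it assumes the two listings (for example the output of ls on each side) have already been collected into Python lists, and the function name missing_entries is hypothetical. It compares names only, not file contents or extended attributes.</div>

```python
def missing_entries(local_names, peer_names):
    """Report which names exist on only one side of a replica pair."""
    local, peer = set(local_names), set(peer_names)
    return {
        "copy_to_peer": sorted(local - peer),    # present locally, absent on the peer
        "copy_from_peer": sorted(peer - local),  # present on the peer, absent locally
    }


# Listings as they looked before the manual copy in this thread:
print(missing_entries(["howareyou"], ["hello"]))
# prints {'copy_to_peer': ['howareyou'], 'copy_from_peer': ['hello']}
```

<div>Both lists coming back empty would mean the directories hold the same set of names, which is the precondition mentioned above for clearing the pending counts.</div>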

<div> </div><div>According to the documentation, a self-heal is triggered with</div><div>find &lt;gluster-mount&gt; -noleaf -print0 | xargs --null stat (where &lt;gluster-mount&gt; is /a), but again, /a cannot be accessed.</div>
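<div>The find/stat walk above can also be sketched in Python, which makes it easier to see exactly which paths fail. This is a rough equivalent under assumptions, not the documented procedure: it calls lstat on the mount point and on every entry below it, and collects errors instead of stopping. The function name stat_everything is hypothetical.</div>

```python
import os


def stat_everything(mount_point):
    """lstat the mount point and every entry below it.

    Returns (number of successful lstat calls, list of (path, error text)).
    """
    visited = 0
    failures = []

    def note_error(err):
        # Called by os.walk when a directory cannot be listed.
        failures.append((err.filename, err.strerror))

    try:
        os.lstat(mount_point)
        visited += 1
    except OSError as err:
        note_error(err)
        return visited, failures

    for root, dirs, files in os.walk(mount_point, onerror=note_error):
        for name in dirs + files:
            path = os.path.join(root, name)
            try:
                os.lstat(path)
                visited += 1
            except OSError as err:
                failures.append((path, err.strerror))
    return visited, failures


if __name__ == "__main__":
    # /a is the mount point used in this thread.
    print(stat_everything("/a"))
```

<div>With /a in its current state, I would expect this to stop at the first lstat and report the Input/output error for /a itself, the same as the find command.</div>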

</div>