<font><font face="verdana,sans-serif">As a final(?) follow-up to my problem, after restarting the rebalance with:</font></font><div><font><font face="verdana,sans-serif"><br></font></font></div><div><font><font face="verdana,sans-serif"> gluster volume rebalance [vol-name] fix-layout start</font></font></div>
<div><font><font face="verdana,sans-serif"><br></font></font></div><div><font><font face="verdana,sans-serif">it finished up last night after plowing through the entirety of the filesystem, fixing ~1M files (apparently ~2.2TB), all while the fs remained live (though probably a bit slower than users would have liked). That's a strong '+' in the gluster column for resiliency.</font></font></div>
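An illustrative sketch (my own, not part of the original commands) of how the "are all bricks actually online" check can be scripted against `gluster volume status <vol> detail` output; the fixture below is trimmed from the status output quoted later in this mail, and on a live system you would pipe the real command instead:

```shell
# Hedged sketch: flag any brick that `gluster volume status <vol> detail`
# reports as offline. On a live system (volume name 'gli' inferred from
# the 0-gli-dht log lines below):
#   gluster volume status gli detail | awk '/^Brick/{b=$NF} /^Online/{if ($NF=="N") print b" OFFLINE"}'
sample_status() {
cat <<'EOF'
Brick                : Brick pbs3ib:/bducgl
Online               : N
Brick                : Brick pbs4ib:/bducgl
Online               : Y
EOF
}

sample_status | awk '
  /^Brick/  { brick = $NF }                            # remember the current brick path
  /^Online/ { if ($NF == "N") print brick " OFFLINE" } # report bricks marked N
'
```

This catches the exact inconsistency described below, where 'gluster peer status' says Connected but a brick is down.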
<div><font><font face="verdana,sans-serif"><br></font></font></div><div><font face="verdana, sans-serif">I started the rebalance without waiting for any advice to the contrary. 3.3 is supposed to have a built-in rebalance operation, but I saw no evidence of it, and other info from <a href="http://gluster.org">gluster.org</a> suggested that starting one manually would do no harm. Do the gluster wizards have any final words on this before I write it up in our trouble report?</font></div>
<div><font face="verdana, sans-serif"><br></font></div><div><font><font face="verdana,sans-serif">best wishes</font></font></div><div><font><font face="verdana,sans-serif">harry</font></font></div><div><font><font face="verdana,sans-serif"><br>
</font></font><br><div class="gmail_quote">On Thu, Aug 2, 2012 at 4:37 PM, Harry Mangalam <span dir="ltr"><<a href="mailto:hjmangalam@gmail.com" target="_blank">hjmangalam@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
Further to what I wrote before:<br>
gluster server overload; recovers, now "Transport endpoint is not<br>
connected" for some files<br>
<<a href="http://goo.gl/CN6ud" target="_blank">http://goo.gl/CN6ud</a>><br>
<br>
I'm getting conflicting info here. On one hand, the peer whose<br>
glusterfsd locked up still appears to be in the cluster, according to<br>
the frequently referenced 'gluster peer status':<br>
<br>
Thu Aug 02 15:48:46 [1.00 0.89 0.92] root@pbs1:~<br>
729 $ gluster peer status<br>
Number of Peers: 3<br>
<br>
Hostname: pbs4ib<br>
Uuid: 2a593581-bf45-446c-8f7c-212c53297803<br>
State: Peer in Cluster (Connected)<br>
<br>
Hostname: pbs2ib<br>
Uuid: 26de63bd-c5b7-48ba-b81d-5d77a533d077<br>
State: Peer in Cluster (Connected)<br>
<br>
Hostname: pbs3ib<br>
Uuid: c79c4084-d6b9-4af9-b975-40dd6aa99b42<br>
State: Peer in Cluster (Connected)<br>
<br>
On the other hand, some errors that I provided yesterday:<br>
===================================================<br>
[2012-08-01 18:07:26.104910] W<br>
[dht-selfheal.c:875:dht_selfheal_directory] 0-gli-dht: 1 subvolumes<br>
down -- not fixing<br>
===================================================<br>
<br>
as well as this information:<br>
$ gluster volume status all detail<br>
<br>
[top 2 brick stanzas trimmed; they're online]<br>
------------------------------------------------------------------------------<br>
Brick : Brick pbs3ib:/bducgl<br>
Port : 24018<br>
Online : N <<=====================<br>
Pid : 20953<br>
File System : xfs<br>
Device : /dev/md127<br>
Mount Options : rw<br>
Inode Size : 256<br>
Disk Space Free : 6.1TB<br>
Total Disk Space : 8.2TB<br>
Inode Count : 1758158080<br>
Free Inodes : 1752326373<br>
------------------------------------------------------------------------------<br>
Brick : Brick pbs4ib:/bducgl<br>
Port : 24009<br>
Online : Y<br>
Pid : 20948<br>
File System : xfs<br>
Device : /dev/sda<br>
Mount Options : rw<br>
Inode Size : 256<br>
Disk Space Free : 4.6TB<br>
Total Disk Space : 6.4TB<br>
Inode Count : 1367187392<br>
Free Inodes : 1361305613<br>
<br>
The above implies fairly strongly that the brick did not re-establish<br>
its connection to the volume, although 'gluster peer status' still<br>
reported the peer as connected.<br>
<br>
Strangely enough, when I RE-restarted glusterd, the brick DID come<br>
back and re-join the gluster volume. The (restarted) fix-layout job is<br>
now proceeding without those "subvolumes down -- not fixing" errors,<br>
just a steady stream of 'found anomalies/fixing the layout' messages,<br>
though at the rate it's going it looks like it will take several days.<br>
<br>
Still, better to spend several days fixing the data on-disk with the<br>
fs live than to tell users that their data is gone and rebuild from<br>
zero. Luckily, it's officially a /scratch filesystem.<br>
<span class="HOEnZb"><font color="#888888"><br>
Harry<br>
<br>
--<br>
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine<br>
[m/c 2225] / 92697 Google Voice Multiplexer: <a href="tel:%28949%29%20478-4487" value="+19494784487">(949) 478-4487</a><br>
415 South Circle View Dr, Irvine, CA, 92697 [shipping]<br>
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)<br>
</font></span></blockquote></div><br><br clear="all"><div><br></div>-- <br>Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine<br>[m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487<br>415 South Circle View Dr, Irvine, CA, 92697 [shipping]<br>
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)<br><br>
</div>