<font><font face="verdana,sans-serif">As a final(?) follow-up to my problem, after restarting the rebalance with:</font></font><div><font><font face="verdana,sans-serif"><br></font></font></div><div><font><font face="verdana,sans-serif"> gluster volume rebalance [vol-name] fix-layout start</font></font></div>
<div><font><font face="verdana,sans-serif"><br></font></font></div><div><font><font face="verdana,sans-serif">it finished up last night after plowing through the entirety of the filesystem, fixing ~1M files (apparently ~2.2TB), all while the fs remained live (though probably a bit slower than users would have liked). That's a strong '+' in the gluster column for resiliency.</font></font></div>
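An illustrative sketch (my own, not part of the original commands) of how the "are all bricks actually online" check can be scripted against `gluster volume status <vol> detail` output; the fixture below is trimmed from the status output quoted later in this mail, and on a live system you would pipe the real command instead:

```shell
# Hedged sketch: flag any brick that `gluster volume status <vol> detail`
# reports as offline. On a live system (volume name 'gli' inferred from
# the 0-gli-dht log lines below):
#   gluster volume status gli detail | awk '/^Brick/{b=$NF} /^Online/{if ($NF=="N") print b" OFFLINE"}'
sample_status() {
cat <<'EOF'
Brick                : Brick pbs3ib:/bducgl
Online               : N
Brick                : Brick pbs4ib:/bducgl
Online               : Y
EOF
}

sample_status | awk '
  /^Brick/  { brick = $NF }                            # remember the current brick path
  /^Online/ { if ($NF == "N") print brick " OFFLINE" } # report bricks marked N
'
```

This catches the exact inconsistency described below, where 'gluster peer status' says Connected but a brick is down.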
<div><font><font face="verdana,sans-serif"><br></font></font></div><div><font face="verdana, sans-serif">I started the rebalance without waiting for any advice to the contrary. 3.3 is supposed to have a built-in rebalance operation, but I saw no evidence of it, and other info from <a href="http://gluster.org">gluster.org</a> suggested that starting one manually would do no harm. Do the gluster wizards have any final words on this before I write it up in our trouble report?</font></div>
<div><font face="verdana, sans-serif"><br></font></div><div><font><font face="verdana,sans-serif">best wishes</font></font></div><div><font><font face="verdana,sans-serif">harry</font></font></div><div><font><font face="verdana,sans-serif"><br>
</font></font><br><div class="gmail_quote">On Thu, Aug 2, 2012 at 4:37 PM, Harry Mangalam <span dir="ltr"><<a href="mailto:hjmangalam@gmail.com" target="_blank">hjmangalam@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
Further to what I wrote before:<br>
gluster server overload; recovers, now "Transport endpoint is not<br>
connected" for some files<br>
<<a href="http://goo.gl/CN6ud" target="_blank">http://goo.gl/CN6ud</a>><br>
<br>
I'm getting conflicting info here. On one hand, the peer whose<br>
glusterfsd locked up still appears to be in the cluster, according to<br>
the frequently referenced 'gluster peer status':<br>
<br>
Thu Aug 02 15:48:46 [1.00 0.89 0.92] root@pbs1:~<br>
729 $ gluster peer status<br>
Number of Peers: 3<br>
<br>
Hostname: pbs4ib<br>
Uuid: 2a593581-bf45-446c-8f7c-212c53297803<br>
State: Peer in Cluster (Connected)<br>
<br>
Hostname: pbs2ib<br>
Uuid: 26de63bd-c5b7-48ba-b81d-5d77a533d077<br>
State: Peer in Cluster (Connected)<br>
<br>
Hostname: pbs3ib<br>
Uuid: c79c4084-d6b9-4af9-b975-40dd6aa99b42<br>
State: Peer in Cluster (Connected)<br>
<br>
On the other hand, some errors that I provided yesterday:<br>
===================================================<br>
[2012-08-01 18:07:26.104910] W<br>
[dht-selfheal.c:875:dht_selfheal_directory] 0-gli-dht: 1 subvolumes<br>
down -- not fixing<br>
===================================================<br>
<br>
as well as this information:<br>
$ gluster volume status all detail<br>
<br>
[top 2 brick stanzas trimmed; they're online]<br>
------------------------------------------------------------------------------<br>
Brick : Brick pbs3ib:/bducgl<br>
Port : 24018<br>
Online : N <<=====================<br>
Pid : 20953<br>
File System : xfs<br>
Device : /dev/md127<br>
Mount Options : rw<br>
Inode Size : 256<br>
Disk Space Free : 6.1TB<br>
Total Disk Space : 8.2TB<br>
Inode Count : 1758158080<br>
Free Inodes : 1752326373<br>
------------------------------------------------------------------------------<br>
Brick : Brick pbs4ib:/bducgl<br>
Port : 24009<br>
Online : Y<br>
Pid : 20948<br>
File System : xfs<br>
Device : /dev/sda<br>
Mount Options : rw<br>
Inode Size : 256<br>
Disk Space Free : 4.6TB<br>
Total Disk Space : 6.4TB<br>
Inode Count : 1367187392<br>
Free Inodes : 1361305613<br>
<br>
The above implies fairly strongly that the brick did not re-establish<br>
its connection to the volume, although 'gluster peer status' still<br>
reported the peer as connected.<br>
<br>
Strangely enough, when I RE-restarted glusterd, the brick DID come<br>
back and re-join the gluster volume. The (restarted) fix-layout job is<br>
now proceeding without those "subvolumes down -- not fixing" errors,<br>
just a steady stream of 'found anomalies/fixing the layout' messages,<br>
though at the rate it's going it looks like it will take several days.<br>
<br>
Still, better to spend several days fixing the data on-disk with the<br>
fs live than to tell users that their data is gone and rebuild from<br>
zero. Luckily, it's officially a /scratch filesystem.<br>
<span class="HOEnZb"><font color="#888888"><br>
Harry<br>
<br>
--<br>
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine<br>
[m/c 2225] / 92697 Google Voice Multiplexer: <a href="tel:%28949%29%20478-4487" value="+19494784487">(949) 478-4487</a><br>
415 South Circle View Dr, Irvine, CA, 92697 [shipping]<br>
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)<br>
</font></span></blockquote></div><br><br clear="all"><div><br></div>-- <br>Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine<br>[m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487<br>415 South Circle View Dr, Irvine, CA, 92697 [shipping]<br>
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)<br><br>
</div>