<html>
<head>
<meta content="text/html; charset=UTF-8" http-equiv="Content-Type">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<div class="moz-cite-prefix">Hi Stefano,<br>
<br>
I added some comments inline...<br>
<br>
On 31/05/13 10:23, Stefano Sinigardi wrote:<br>
</div>
<blockquote
cite="mid:CANmmi2SaGux8yTQqNiJkYf+sMtwpfnV==VA7PGEUCW+GdMCTig@mail.gmail.com"
type="cite">
<div dir="ltr">
<div> Dear Xavi,</div>
<div>thank you so much. The volname is "data"</div>
<div> </div>
<div> </div>
<div># gluster volume info data</div>
<div> </div>
<div>Volume Name: data</div>
<div>Type: Distributed-Replicate</div>
<div>Volume ID: e3a99db0-8643-41c1-b4a1-6a728bb1d08c</div>
<div>Status: Started</div>
<div>Number of Bricks: 5 x 2 = 10</div>
<div>Transport-type: tcp</div>
<div>Bricks:</div>
<div>Brick1: pedrillo.bo.infn.it:/storage/1/data</div>
<div>Brick2: pedrillo.bo.infn.it:/storage/2/data</div>
<div>Brick3: pedrillo.bo.infn.it:/storage/5/data</div>
<div>Brick4: pedrillo.bo.infn.it:/storage/6/data</div>
<div>Brick5: pedrillo.bo.infn.it:/storage/arc1/data</div>
<div>Brick6: pedrillo.bo.infn.it:/storage/arc2/data</div>
<div>Brick7: osmino:/storageOsmino/1/data</div>
<div>Brick8: osmino:/storageOsmino/2/data</div>
<div>Brick9: osmino:/storageOsmino/4/data</div>
<div>Brick10: osmino:/storageOsmino/5/data</div>
<div> </div>
</div>
</blockquote>
This is not a recommended setup. You have paired bricks from the same
server to create the replica sets. This means that if one server
fails for any reason, both members of a replica set will fail and
their data will be inaccessible. With this configuration, each server
is a single point of failure.<br>
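<br>
Just as an illustration (the brick paths here are made up), with
'replica 2' every two consecutive bricks on the command line form a
replica set, so alternating the servers keeps the two copies of each
file on different machines:<br>
<br>
gluster volume create data replica 2 \<br>
    serverA:/bricks/1/data serverB:/bricks/1/data \<br>
    serverA:/bricks/2/data serverB:/bricks/2/data ...<br>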
<br>
Anyway, this shouldn't be the cause of the missing files...<br>
<br>
<blockquote
cite="mid:CANmmi2SaGux8yTQqNiJkYf+sMtwpfnV==VA7PGEUCW+GdMCTig@mail.gmail.com"
type="cite">
<div dir="ltr">
<div> </div>
<div># gluster volume status data</div>
<div> </div>
<div>Status of volume: data</div>
<div>Gluster process                                         Port   Online  Pid</div>
<div>------------------------------------------------------------------------------</div>
<div>Brick pedrillo.bo.infn.it:/storage/1/data               24009  Y       1732</div>
<div>Brick pedrillo.bo.infn.it:/storage/2/data               24010  Y       1738</div>
<div>Brick pedrillo.bo.infn.it:/storage/5/data               24013  Y       1747</div>
<div>Brick pedrillo.bo.infn.it:/storage/6/data               24014  Y       1758</div>
<div>Brick pedrillo.bo.infn.it:/storage/arc1/data            24015  Y       1770</div>
<div>Brick pedrillo.bo.infn.it:/storage/arc2/data            24016  Y       1838</div>
<div>Brick osmino:/storageOsmino/1/data                      24009  Y       1173</div>
<div>Brick osmino:/storageOsmino/2/data                      24010  Y       1179</div>
<div>Brick osmino:/storageOsmino/4/data                      24011  Y       1185</div>
<div>Brick osmino:/storageOsmino/5/data                      24012  Y       1191</div>
<div>NFS Server on localhost                                 38467  Y       1847</div>
<div>Self-heal Daemon on localhost                           N/A    Y       1855</div>
<div>NFS Server on castor.bo.infn.it                         38467  Y       1582</div>
<div>Self-heal Daemon on castor.bo.infn.it                   N/A    Y       1588</div>
<div>NFS Server on pollux.bo.infn.it                         38467  Y       1583</div>
<div>Self-heal Daemon on pollux.bo.infn.it                   N/A    Y       1589</div>
<div>NFS Server on osmino                                    38467  Y       1197</div>
<div>Self-heal Daemon on osmino                              N/A    Y       1203</div>
<div> </div>
</div>
</blockquote>
Everything looks OK here.<br>
<br>
<blockquote
cite="mid:CANmmi2SaGux8yTQqNiJkYf+sMtwpfnV==VA7PGEUCW+GdMCTig@mail.gmail.com"
type="cite">
<div dir="ltr">
<div> </div>
<div>Then I decided to select a file, as you said. It's an
executable, "leggi_particelle", that should be located at</div>
<div>/data/stefano/leggi_particelle/leggi_particelle</div>
<div> </div>
<div>It's not there:</div>
<div> </div>
<div># ll /data/stefano/leggi_particelle</div>
<div>total 61</div>
<div>drwxr-xr-x 3 stefano user 20480 May 29 05:00 ./</div>
<div>drwxr-xr-x 14 stefano user 20480 May 28 11:32 ../</div>
<div>-rwxr-xr-x 1 stefano user 286 Feb 25 17:24 Espec.plt*</div>
<div>lrwxrwxrwx 1 stefano user 53 Feb 13 11:30 parametri.cpp</div>
<div>drwxr-xr-x 3 stefano user 20480 May 24 17:16 test/</div>
<div> </div>
<div>but look at this:</div>
<div> </div>
<div># ll /storage/5/data/stefano/leggi_particelle/</div>
<div> </div>
<div>total 892</div>
<div>drwxr-xr-x 3 stefano user 4096 May 24 17:16 ./</div>
<div>drwxr-xr-x 14 stefano user 4096 May 28 11:32 ../</div>
<div>lrwxrwxrwx 2 stefano user 50 Apr 11 19:20 filtro.cpp</div>
<div>lrwxrwxrwx 2 stefano user 70 Apr 11 19:20 leggi_binario_ALaDyn_fortran.h</div>
<div>-rwxr-xr-x 2 stefano user 705045 May 22 18:24 leggi_particelle*</div>
<div>-rwxr-xr-x 2 stefano user 61883 Dec 16 17:20 leggi_particelle.old01*</div>
<div>-rwxr-xr-x 2 stefano user 106014 Apr 11 19:20 leggi_particelle.old03*</div>
<div>---------T 2 root root 0 May 24 17:16 parametri.cpp</div>
<div>drwxr-xr-x 3 stefano user 4096 Apr 11 19:19 test/</div>
<div> </div>
<div> </div>
<div># ll /storage/6/data/stefano/leggi_particelle/</div>
<div> </div>
<div>total 892</div>
<div>drwxr-xr-x 3 stefano user 4096 May 24 17:16 ./</div>
<div>drwxr-xr-x 14 stefano user 4096 May 28 11:32 ../</div>
<div>lrwxrwxrwx 2 stefano user 50 Apr 11 19:20 filtro.cpp</div>
<div>lrwxrwxrwx 2 stefano user 70 Apr 11 19:20 leggi_binario_ALaDyn_fortran.h</div>
<div>-rwxr-xr-x 2 stefano user 705045 May 22 18:24 leggi_particelle*</div>
<div>-rwxr-xr-x 2 stefano user 61883 Dec 16 17:20 leggi_particelle.old01*</div>
<div>-rwxr-xr-x 2 stefano user 106014 Apr 11 19:20 leggi_particelle.old03*</div>
<div>---------T 2 root root 0 May 24 17:16 parametri.cpp</div>
<div>drwxr-xr-x 3 stefano user 4096 Apr 11 19:19 test/</div>
<div> </div>
</div>
</blockquote>
It seems there are many more files that are not visible through the
mount point. There is also a file with permissions "---------T": this
is a link file, which normally means that the real file lives on
another brick (for example, one pending a rebalance). However, I think
this is not the case here (see below).<br>
<br>
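If you want to verify where a link file points to, you can read its
'trusted.glusterfs.dht.linkto' attribute directly on the brick, for
example:<br>
<br>
getfattr -n trusted.glusterfs.dht.linkto -e text
/storage/5/data/stefano/leggi_particelle/parametri.cpp<br>
<br>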
<blockquote
cite="mid:CANmmi2SaGux8yTQqNiJkYf+sMtwpfnV==VA7PGEUCW+GdMCTig@mail.gmail.com"
type="cite">
<div dir="ltr">
<div>So as you can see, "leggi_particelle" is there, in the
fifth and sixth brick (some files of the folder are in other
bricks; ls on the fuse mount point just lists the ones on the
first brick). It's the same executable, working in both
locations.</div>
<div>Most of all, what I found is that calling
/data/stefano/leggi_particelle/leggi_particelle correctly
launches the executable!! So ls doesn't find it but the OS
does!</div>
<div>Is it just a very bad "bug" in ls??? I don't think so,
because even remote-mounting the volume with NFS hides the
same files.</div>
</div>
</blockquote>
It's a problem with gluster not being able to correctly identify the
file during an 'ls', so it does not appear. This generally means that
there is some discrepancy in the information that gluster collects
from the bricks.<br>
<br>
It's interesting that you say that all files shown in an 'ls' come
from the first brick...<br>
<br>
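If you want to see in more detail what the client is doing during the
'ls', one option is to temporarily raise the client log level and
repeat the operation (remember to set it back to INFO afterwards,
DEBUG is very verbose):<br>
<br>
gluster volume set data diagnostics.client-log-level DEBUG<br>
<br>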
<blockquote
cite="mid:CANmmi2SaGux8yTQqNiJkYf+sMtwpfnV==VA7PGEUCW+GdMCTig@mail.gmail.com"
type="cite">
<div dir="ltr">
<div>Anyway, back to what you asked:</div>
<div> </div>
<div>this is the tail of data.log</div>
<div> </div>
<div>[2013-05-31 10:00:02.236397] W [rpc-transport.c:174:rpc_transport_load] 0-rpc-transport: missing 'option transport-type'. defaulting to "socket"</div>
<div>[2013-05-31 10:00:02.283271] I [cli-rpc-ops.c:504:gf_cli3_1_get_volume_cbk] 0-cli: Received resp to get vol: 0</div>
<div>[2013-05-31 10:00:02.283465] I [cli-rpc-ops.c:757:gf_cli3_1_get_volume_cbk] 0-cli: Returning: 0</div>
<div>[2013-05-31 10:00:02.283613] I [cli-rpc-ops.c:504:gf_cli3_1_get_volume_cbk] 0-cli: Received resp to get vol: 0</div>
<div>[2013-05-31 10:00:02.283816] I [cli-rpc-ops.c:757:gf_cli3_1_get_volume_cbk] 0-cli: Returning: 0</div>
<div>[2013-05-31 10:00:02.283919] I [cli-rpc-ops.c:504:gf_cli3_1_get_volume_cbk] 0-cli: Received resp to get vol: 0</div>
<div>[2013-05-31 10:00:02.283943] I [cli-rpc-ops.c:757:gf_cli3_1_get_volume_cbk] 0-cli: Returning: 0</div>
<div>[2013-05-31 10:00:02.283951] I [input.c:46:cli_batch] 0-: Exiting with: 0</div>
<div>[2013-05-31 10:00:07.279855] W [rpc-transport.c:174:rpc_transport_load] 0-rpc-transport: missing 'option transport-type'. defaulting to "socket"</div>
<div>[2013-05-31 10:00:07.326202] I [cli-rpc-ops.c:504:gf_cli3_1_get_volume_cbk] 0-cli: Received resp to get vol: 0</div>
<div>[2013-05-31 10:00:07.326484] I [cli-rpc-ops.c:757:gf_cli3_1_get_volume_cbk] 0-cli: Returning: 0</div>
<div>[2013-05-31 10:00:07.326493] I [input.c:46:cli_batch] 0-: Exiting with: 0</div>
<div>[2013-05-31 10:01:29.718428] W [rpc-transport.c:174:rpc_transport_load] 0-rpc-transport: missing 'option transport-type'. defaulting to "socket"</div>
<div>[2013-05-31 10:01:29.767990] I [input.c:46:cli_batch] 0-: Exiting with: 0</div>
<div> </div>
<div>but running many ls or ll commands didn't touch it</div>
<div> </div>
<div>and this is the tail of storage-5-data.log and
storage-6-data.log (they're almost the same)</div>
<div>[2013-05-31 07:59:19.090790] I [server-handshake.c:571:server_setvolume] 0-data-server: accepted client from pedrillo-2510-2013/05/31-07:59:15:067773-data-client-3-0 (version: 3.3.1)</div>
<div>[2013-05-31 08:00:56.935205] I [server-handshake.c:571:server_setvolume] 0-data-server: accepted client from pollux-2361-2013/05/31-08:00:52:937577-data-client-3-0 (version: 3.3.1)</div>
<div>[2013-05-31 08:01:03.611506] I [server-handshake.c:571:server_setvolume] 0-data-server: accepted client from castor-2629-2013/05/31-08:00:59:614003-data-client-3-0 (version: 3.3.1)</div>
<div>[2013-05-31 08:02:15.940950] I [server-handshake.c:571:server_setvolume] 0-data-server: accepted client from osmino-1844-2013/05/31-08:02:11:932993-data-client-3-0 (version: 3.3.1)</div>
<div> </div>
<div>Except for the warning in the data.log, they seem legit.</div>
</div>
</blockquote>
Nothing interesting here. Have you looked at the brick logs? Is there
anything in them?<br>
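<br>
A quick way to scan them for problems could be something like this
(adjust the path if your logs live elsewhere):<br>
<br>
grep -E "] (W|E) " /var/log/glusterfs/bricks/*.log | tail -n 50<br>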
<br>
<blockquote
cite="mid:CANmmi2SaGux8yTQqNiJkYf+sMtwpfnV==VA7PGEUCW+GdMCTig@mail.gmail.com"
type="cite">
<div dir="ltr">
<div> </div>
<div>Here are the attributes:</div>
<div> </div>
<div># getfattr -m. -e hex -d /storage/5/data/stefano/leggi_particelle/leggi_particelle</div>
<div> </div>
<div>getfattr: Removing leading '/' from absolute path names</div>
<div># file: storage/5/data/stefano/leggi_particelle/leggi_particelle</div>
<div>trusted.afr.data-client-2=0x000000000000000000000000</div>
<div>trusted.afr.data-client-3=0x000000000000000000000000</div>
<div>trusted.gfid=0x883c343b9366478da660843da8f6b87c</div>
<div> </div>
<div> </div>
<div># getfattr -m. -e hex -d /storage/6/data/stefano/leggi_particelle/leggi_particelle</div>
<div> </div>
<div>getfattr: Removing leading '/' from absolute path names</div>
<div># file: storage/6/data/stefano/leggi_particelle/leggi_particelle</div>
<div>trusted.afr.data-client-2=0x000000000000000000000000</div>
<div>trusted.afr.data-client-3=0x000000000000000000000000</div>
<div>trusted.gfid=0x883c343b9366478da660843da8f6b87c</div>
<div> </div>
<div> </div>
<div># getfattr -m. -e hex -d /storage/5/data/stefano/leggi_particelle</div>
<div> </div>
<div>getfattr: Removing leading '/' from absolute path names</div>
<div># file: storage/5/data/stefano/leggi_particelle</div>
<div>trusted.afr.data-client-2=0x000000000000000000000000</div>
<div>trusted.afr.data-client-3=0x000000000000000000000000</div>
<div>trusted.gfid=0xb62a16f0bdb94e3f8563ccfb278c2105</div>
<div>trusted.glusterfs.dht=0x00000001000000003333333366666665</div>
<div> </div>
<div> </div>
<div># getfattr -m. -e hex -d /storage/6/data/stefano/leggi_particelle</div>
<div> </div>
<div>getfattr: Removing leading '/' from absolute path names</div>
<div># file: storage/6/data/stefano/leggi_particelle</div>
<div>trusted.afr.data-client-2=0x000000000000000000000000</div>
<div>trusted.afr.data-client-3=0x000000000000000000000000</div>
<div>trusted.gfid=0xb62a16f0bdb94e3f8563ccfb278c2105</div>
<div>trusted.glusterfs.dht=0x00000001000000003333333366666665</div>
<div> </div>
</div>
</blockquote>
Based on these attributes, the file 'leggi_particelle' should not be
on these bricks: the trusted.glusterfs.dht value assigns this
directory, on this replica set, the hash range 0x33333333 to
0x66666665, and the name 'leggi_particelle' does not seem to hash
inside it. Are you sure the file is not on any other brick (even with
other attributes)? Conversely, the file 'parametri.cpp' should be on
these bricks, but there is only a link file (the one with
'---------T'). The other files do not seem to be placed in the right
place either.<br>
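<br>
To be sure, you could look for it on every brick of both servers; for
example on pedrillo (and the same for the /storageOsmino paths on
osmino):<br>
<br>
ls -l /storage/*/data/stefano/leggi_particelle/leggi_particelle<br>
<br>
Bricks that don't have the file will simply report "No such file or
directory", which is expected.<br>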
<br>
I think there is a problem with the layout of the volume. I need to
see the attributes of this directory on all the other bricks. Also,
how was this volume created? Did you start with a single replica set
and then grow the volume by adding bricks and rebalancing?<br>
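<br>
In other words, something like this on pedrillo, plus the equivalent
for the /storageOsmino/*/data paths on osmino:<br>
<br>
getfattr -m. -e hex -d /storage/*/data/stefano/leggi_particelle<br>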
<br>
Best regards,<br>
<br>
Xavi<br>
<br>
<blockquote
cite="mid:CANmmi2SaGux8yTQqNiJkYf+sMtwpfnV==VA7PGEUCW+GdMCTig@mail.gmail.com"
type="cite">
<div dir="ltr">
<div> </div>
<div> </div>
<div><span>Sorry for the very long email. But your help is very
much appreciated and I hope I'm clear enough.</span></div>
<div> </div>
<div><span>Best regards,</span></div>
<div> </div>
<div><span> Stefano</span></div>
<div> </div>
<div> </div>
</div>
<div class="gmail_extra"><br>
<br>
<div class="gmail_quote">On Fri, May 31, 2013 at 4:34 PM, Xavier
Hernandez <span dir="ltr"><<a moz-do-not-send="true"
href="mailto:xhernandez@datalab.es" target="_blank">xhernandez@datalab.es</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">Hi
Stefano,<br>
<br>
it would help to see the results of the following commands:<br>
<br>
gluster volume info <volname><br>
gluster volume status <volname><br>
<br>
It would also be interesting to see fragments of the logs
containing Warnings or Errors generated while the 'ls'
command was executing (if any). The logs from the mount
point are normally located at /var/log/glusterfs/<mount
point>.log. Logs from the bricks are at
/var/log/glusterfs/bricks/<brick name>.log.<br>
<br>
Also, identify a file (I'll call it <file>, and its
parent directory <parent>) that does not appear in an
'ls' from the mount point. Then execute the following
command on all bricks:<br>
<br>
getfattr -m. -e hex -d <brick root>/<parent><br>
<br>
On all bricks that contain the file, execute the following
command:<br>
<br>
getfattr -m. -e hex -d <brick
root>/<parent>/<file><br>
<br>
This information might help to see what is happening.<br>
<br>
Best regards,<br>
<br>
Xavi<br>
<br>
On 31/05/13 08:53, Stefano Sinigardi wrote:<br>
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;padding-left:1ex;border-left-color:rgb(204,204,204);border-left-width:1px;border-left-style:solid">
<div>
<div class="h5">
Dear all,<br>
Thanks again for your support.<br>
Files are already exactly duplicated (diff
confirms it) on the<br>
bricks, so I'd rather not touch them directly (in
order not to do<br>
any more damage to the volume than it's already
suffering).<br>
I found out, thanks to your help, that in order to
trigger the<br>
self-healing it's no longer necessary to run a find on
all the<br>
files; instead there's the<br>
<br>
gluster volume heal VOLNAME full<br>
<br>
command to run. So I ran it and it said "Launching
Heal operation on<br>
volume data has been successful. Use heal info
commands to check<br>
status". But then asking for<br>
<br>
gluster volume heal VOLNAME info<br>
<br>
it reported each and every entry at zero, like
"gluster volume heal<br>
VOLNAME info heal-failed" and "gluster volume heal
VOLNAME info<br>
split-brain". It should be a positive news, or no?<br>
If I requested a "gluster volume heal VOLNAME info
healed", on the<br>
other hand, revealed 1023 files per couple of bricks
that got healed<br>
(very strange that the number is always the same. Is
it an upper<br>
bound?). For sure all of them are missing from the
volume itself (not<br>
sure if just 1023 per pair are missing from the
total, maybe more).<br>
I thought that now at least those should have become
visible, but in<br>
fact they have not. How can I check the logs for this healing
process? Doing<br>
a "ls -ltr /var/log/glusterfs/" says that no logs are
being touched,<br>
and even looking at the most recent ones reveals that
only the heal<br>
command launch is reported in them.<br>
I rebooted the nodes but still nothing changed, files
are still<br>
missing from the FUSE mount point but not from bricks.
I relaunched<br>
the healing and again, 1023 files per pair of bricks
got<br>
self-healed. But still, I think that they are the same
as before and<br>
those are still missing from the fuse mount point
(just tried a few<br>
reported by the gluster volume heal VOLNAME info
healed).<br>
<br>
Do you have any other suggestion? Things are looking
very bad for me<br>
now because of the time that I'm forcing others to
lose (as I said,<br>
we don't have any system administrator and I do it
just "for fun", but<br>
still people see you as the one responsible)...<br>
<br>
Best regards,<br>
<br>
Stefano<br>
</div>
</div>
<div class="im">
_______________________________________________<br>
Gluster-users mailing list<br>
<a moz-do-not-send="true"
href="mailto:Gluster-users@gluster.org"
target="_blank">Gluster-users@gluster.org</a><br>
<a moz-do-not-send="true"
href="http://supercolony.gluster.org/mailman/listinfo/gluster-users"
target="_blank">http://supercolony.gluster.org/mailman/listinfo/gluster-users</a><br>
</div>
</blockquote>
<br>
</blockquote>
</div>
<br>
</div>
</blockquote>
<br>
</body>
</html>