<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<font face="Calibri">I started rebalancing my volume after updating
from 3.2.7 to 3.3.1. After a few hours, I noticed a large number
of failures in the rebalance status:<br>
<br>
<blockquote type="cite"> Node Rebalanced-files
size scanned failures status<br>
--------- ----------- ----------- -----------
----------- ------------<br>
localhost 0 0Bytes
4288805 0 stopped<br>
ml55 26275 206.2MB 4277101
14159 stopped<br>
ml29 0 0Bytes
4288844 0 stopped<br>
ml31 0 0Bytes
4288937 0 stopped<br>
ml48 0 0Bytes
4288927 0 stopped<br>
ml45 15041 50.8MB 4284304
41999 stopped<br>
ml40 40690 413.3MB 4269721
1012 stopped<br>
ml41 0 0Bytes
4288898 0 stopped<br>
ml51 28558 212.7MB 4277442
32195 stopped<br>
ml46 0 0Bytes
4288909 0 stopped<br>
ml44 0 0Bytes
4288824 0 stopped<br>
ml52 0 0Bytes
4288849 0 stopped<br>
ml30 14252 183.7MB 4270711
25336 stopped<br>
ml53 31431 354.9MB 4280450
31098 stopped<br>
ml43 13773 2.7GB 4285256
28574 stopped<br>
ml47 37618 241.3MB 4266889
24916 stopped</blockquote>
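<br>
(For completeness: the rebalance operations here were driven with the
standard 3.3 CLI; the commands were of this form:)<br>
<blockquote type="cite"># start the rebalance after the upgrade<br>
gluster volume rebalance bigdata start<br>
# check progress -- this produces the status table quoted above<br>
gluster volume rebalance bigdata status<br>
# cancel it (what I did later, see below)<br>
gluster volume rebalance bigdata stop<br>
</blockquote>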
<br>
which prompted me to look at the rebalance log:<br>
<br>
<blockquote type="cite">[2012-11-30 11:06:12.533580] W
[client3_1-fops.c:258:client3_1_mknod_cbk] 0-bigdata-client-12:
remote operation failed: File exists. Path:
/foo/data/onemil/dataset/bar/f8old/baz/85/m_85_269615212_b91ff3077e.t7
(00000000-0000-0000-0000-000000000000)<br>
[2012-11-30 11:06:12.533657] E [dht-common.c:1911:dht_getxattr]
0-bigdata-dht: layout is NULL<br>
[2012-11-30 11:06:12.533702] E
[dht-rebalance.c:1150:gf_defrag_migrate_data] 0-bigdata-dht:
Failed to get node-uuid for
/foo/data/onemil/dataset/bar/f8old/baz/85/m_85_269615212_b91ff3077e.t7<br>
[2012-11-30 11:06:12.545497] W
[client3_1-fops.c:258:client3_1_mknod_cbk] 0-bigdata-client-13:
remote operation failed: File exists. Path:
/foo/data/onemil/dataset/bar/f8old/baz/85/m_85_217961761_965f9f192b.t7
(00000000-0000-0000-0000-000000000000)<br>
[2012-11-30 11:06:12.546039] W
[client3_1-fops.c:258:client3_1_mknod_cbk] 0-bigdata-client-12:
remote operation failed: File exists. Path:
/foo/data/onemil/dataset/bar/f8old/baz/85/m_85_217961761_965f9f192b.t7
(00000000-0000-0000-0000-000000000000)<br>
[2012-11-30 11:06:12.546159] E [dht-common.c:1911:dht_getxattr]
0-bigdata-dht: layout is NULL<br>
[2012-11-30 11:06:12.546199] E
[dht-rebalance.c:1150:gf_defrag_migrate_data] 0-bigdata-dht:
Failed to get node-uuid for
/foo/data/onemil/dataset/bar/f8old/baz/85/m_85_217961761_965f9f192b.t7<br>
[2012-11-30 11:06:12.617940] W
[client3_1-fops.c:258:client3_1_mknod_cbk] 0-bigdata-client-12:
remote operation failed: File exists. Path:
/foo/data/onemil/dataset/bar/f8old/baz/85/m_85_211665292_59a24211c3.t7
(00000000-0000-0000-0000-000000000000)<br>
[2012-11-30 11:06:12.618024] W
[client3_1-fops.c:258:client3_1_mknod_cbk] 0-bigdata-client-13:
remote operation failed: File exists. Path:
/foo/data/onemil/dataset/bar/f8old/baz/85/m_85_211665292_59a24211c3.t7
(00000000-0000-0000-0000-000000000000)<br>
[2012-11-30 11:06:12.618150] E [dht-common.c:1911:dht_getxattr]
0-bigdata-dht: layout is NULL<br>
[2012-11-30 11:06:12.618189] E
[dht-rebalance.c:1150:gf_defrag_migrate_data] 0-bigdata-dht:
Failed to get node-uuid for
/foo/data/onemil/dataset/bar/f8old/baz/85/m_85_211665292_59a24211c3.t7<br>
[2012-11-30 11:06:12.620798] I
[dht-common.c:954:dht_lookup_everywhere_cbk] 0-bigdata-dht:
deleting stale linkfile
/foo/data/onemil/dataset/bar/f8old/baz/85/m_85_282643649_15d4108d95.t7
on bigdata-replicate-6<br>
</blockquote>
<br>
[...] (at this point, I stopped rebalancing, and got the following
in the logs)<br>
<blockquote type="cite">[2012-11-30 11:06:33.152153] E
[dht-rebalance.c:1374:gf_defrag_fix_layout] 0-bigdata-dht: Fix
layout failed for /foo/data/onemil/dataset/bar/f8old/baz/85<br>
[2012-11-30 11:06:33.153628] E
[dht-rebalance.c:1374:gf_defrag_fix_layout] 0-bigdata-dht: Fix
layout failed for /foo/data/onemil/dataset/bar/f8old/baz<br>
[2012-11-30 11:06:33.154641] E
[dht-rebalance.c:1374:gf_defrag_fix_layout] 0-bigdata-dht: Fix
layout failed for /foo/data/onemil/dataset/bar/f8old<br>
[2012-11-30 11:06:33.155602] E
[dht-rebalance.c:1374:gf_defrag_fix_layout] 0-bigdata-dht: Fix
layout failed for /foo/data/onemil/dataset/bar<br>
[2012-11-30 11:06:33.156552] E
[dht-rebalance.c:1374:gf_defrag_fix_layout] 0-bigdata-dht: Fix
layout failed for /foo/data/onemil/dataset<br>
[2012-11-30 11:06:33.157538] E
[dht-rebalance.c:1374:gf_defrag_fix_layout] 0-bigdata-dht: Fix
layout failed for /foo/data/onemil<br>
[2012-11-30 11:06:33.158526] E
[dht-rebalance.c:1374:gf_defrag_fix_layout] 0-bigdata-dht: Fix
layout failed for /foo/data<br>
[2012-11-30 11:06:33.159459] E
[dht-rebalance.c:1374:gf_defrag_fix_layout] 0-bigdata-dht: Fix
layout failed for /foo<br>
[2012-11-30 11:06:33.160496] I
[dht-rebalance.c:1626:gf_defrag_status_get] 0-glusterfs:
Rebalance is stopped<br>
[2012-11-30 11:06:33.160518] I
[dht-rebalance.c:1629:gf_defrag_status_get] 0-glusterfs: Files
migrated: 14252, size: 192620657, lookups: 4270711, failures:
25336<br>
[2012-11-30 11:06:33.173344] W
[glusterfsd.c:831:cleanup_and_exit]
(-->/lib64/libc.so.6(clone+0x6d) [0x3d676e811d]
(-->/lib64/libpthread.so.0() [0x3d68207851]
(-->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xdd)
[0x405d4d]))) 0-: received signum (15), shutting down<br>
</blockquote>
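<br>
(To quantify how often these errors recur, one can count occurrences
in the rebalance log; a rough sketch, assuming the default
/var/log/glusterfs/&lt;volname&gt;-rebalance.log naming on a
rebalancing node:)<br>
<blockquote type="cite"># count the two recurring DHT errors; the log path<br>
# below assumes the default rebalance-log location<br>
grep -c "layout is NULL" /var/log/glusterfs/bigdata-rebalance.log<br>
grep -c "Failed to get node-uuid" /var/log/glusterfs/bigdata-rebalance.log<br>
</blockquote>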
<br>
<br>
Errors like these kept appearing many times per second, so I
cancelled the rebalance in case it was doing any damage. There
doesn't seem to be anything unusual in the system logs of the bricks
that hold the files mentioned in the errors, and the files are still
accessible through my mounted volume.<br>
<br>
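(In case it helps with diagnosis: the "stale linkfile" mentioned in
the log is a DHT pointer file. On a brick, linkfiles show up as
zero-byte files with the sticky bit set and a
trusted.glusterfs.dht.linkto xattr naming the subvolume that holds
the data. A sketch for inspecting one directly on a brick server;
the brick mount and path below are illustrative:)<br>
<blockquote type="cite"># on the brick server, list the directory from the log;<br>
# /mnt/localb is one of the brick mounts, path is illustrative<br>
ls -l /mnt/localb/foo/data/onemil/dataset/bar/f8old/baz/85/<br>
# dump all xattrs in hex; linkfiles carry trusted.glusterfs.dht.linkto<br>
getfattr -d -m . -e hex /mnt/localb/foo/data/onemil/dataset/bar/f8old/baz/85/m_85_282643649_15d4108d95.t7<br>
</blockquote>
<br>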
Any idea what might be wrong?<br>
<br>
Thanks,<br>
<br>
Pierre<br>
<br>
<br>
<br>
volume info:<br>
<blockquote type="cite">Volume Name: bigdata<br>
Type: Distributed-Replicate<br>
Volume ID: 56498956-7b4b-4ee3-9d2b-4c8cfce26051<br>
Status: Started<br>
Number of Bricks: 20 x 2 = 40<br>
Transport-type: tcp<br>
Bricks:<br>
Brick1: ml43:/mnt/localb<br>
Brick2: ml44:/mnt/localb<br>
Brick3: ml43:/mnt/localc<br>
Brick4: ml44:/mnt/localc<br>
Brick5: ml45:/mnt/localb<br>
Brick6: ml46:/mnt/localb<br>
Brick7: ml45:/mnt/localc<br>
Brick8: ml46:/mnt/localc<br>
Brick9: ml47:/mnt/localb<br>
Brick10: ml48:/mnt/localb<br>
Brick11: ml47:/mnt/localc<br>
Brick12: ml48:/mnt/localc<br>
Brick13: ml45:/mnt/locald<br>
Brick14: ml46:/mnt/locald<br>
Brick15: ml47:/mnt/locald<br>
Brick16: ml48:/mnt/locald<br>
Brick17: ml51:/mnt/localb<br>
Brick18: ml52:/mnt/localb<br>
Brick19: ml51:/mnt/localc<br>
Brick20: ml52:/mnt/localc<br>
Brick21: ml51:/mnt/locald<br>
Brick22: ml52:/mnt/locald<br>
Brick23: ml53:/mnt/locald<br>
Brick24: ml54:/mnt/locald<br>
Brick25: ml53:/mnt/localc<br>
Brick26: ml54:/mnt/localc<br>
Brick27: ml53:/mnt/localb<br>
Brick28: ml54:/mnt/localb<br>
Brick29: ml55:/mnt/localb<br>
Brick30: ml29:/mnt/localb<br>
Brick31: ml55:/mnt/localc<br>
Brick32: ml29:/mnt/localc<br>
Brick33: ml30:/mnt/localc<br>
Brick34: ml31:/mnt/localc<br>
Brick35: ml30:/mnt/localb<br>
Brick36: ml31:/mnt/localb<br>
Brick37: ml40:/mnt/localb<br>
Brick38: ml41:/mnt/localb<br>
Brick39: ml40:/mnt/localc<br>
Brick40: ml41:/mnt/localc<br>
Options Reconfigured:<br>
performance.quick-read: on<br>
nfs.disable: on<br>
nfs.register-with-portmap: OFF<br>
</blockquote>
<br>
<br>
volume status:<br>
<blockquote type="cite">Status of volume: bigdata<br>
Gluster process Port
Online Pid<br>
------------------------------------------------------------------------------<br>
Brick ml43:/mnt/localb 24012
Y 2694<br>
Brick ml44:/mnt/localb 24012
Y 20374<br>
Brick ml43:/mnt/localc 24013
Y 2699<br>
Brick ml44:/mnt/localc 24013
Y 20379<br>
Brick ml45:/mnt/localb 24012
Y 3147<br>
Brick ml46:/mnt/localb 24012
Y 25789<br>
Brick ml45:/mnt/localc 24013
Y 3152<br>
Brick ml46:/mnt/localc 24013
Y 25794<br>
Brick ml47:/mnt/localb 24012
Y 3181<br>
Brick ml48:/mnt/localb 24012
Y 4852<br>
Brick ml47:/mnt/localc 24013
Y 3186<br>
Brick ml48:/mnt/localc 24013
Y 4857<br>
Brick ml45:/mnt/locald 24014
Y 3157<br>
Brick ml46:/mnt/locald 24014
Y 25799<br>
Brick ml47:/mnt/locald 24014
Y 3191<br>
Brick ml48:/mnt/locald 24014
Y 4862<br>
Brick ml51:/mnt/localb 24009
Y 30251<br>
Brick ml52:/mnt/localb 24012
Y 28541<br>
Brick ml51:/mnt/localc 24010
Y 30256<br>
Brick ml52:/mnt/localc 24013
Y 28546<br>
Brick ml51:/mnt/locald 24011
Y 30261<br>
Brick ml52:/mnt/locald 24014
Y 28551<br>
Brick ml53:/mnt/locald 24012
Y 9229<br>
Brick ml54:/mnt/locald 24012
Y 9341<br>
Brick ml53:/mnt/localc 24013
Y 9234<br>
Brick ml54:/mnt/localc 24013
Y 9346<br>
Brick ml53:/mnt/localb 24014
Y 9239<br>
Brick ml54:/mnt/localb 24014
Y 9351<br>
Brick ml55:/mnt/localb 24012
Y 30904<br>
Brick ml29:/mnt/localb 24012
Y 29233<br>
Brick ml55:/mnt/localc 24013
Y 30909<br>
Brick ml29:/mnt/localc 24013
Y 29238<br>
Brick ml30:/mnt/localc 24012
Y 6800<br>
Brick ml31:/mnt/localc N/A
Y 22000<br>
Brick ml30:/mnt/localb 24013
Y 6805<br>
Brick ml31:/mnt/localb N/A
Y 22005<br>
Brick ml40:/mnt/localb 24012
Y 26700<br>
Brick ml41:/mnt/localb 24012
Y 25762<br>
Brick ml40:/mnt/localc 24013
Y 26705<br>
Brick ml41:/mnt/localc 24013
Y 25767<br>
Self-heal Daemon on localhost N/A
Y 20392<br>
Self-heal Daemon on ml55 N/A
Y 30922<br>
Self-heal Daemon on ml54 N/A
Y 9365<br>
Self-heal Daemon on ml52 N/A
Y 28565<br>
Self-heal Daemon on ml29 N/A
Y 29253<br>
Self-heal Daemon on ml30 N/A
Y 6818<br>
Self-heal Daemon on ml43 N/A
Y 2712<br>
Self-heal Daemon on ml47 N/A
Y 3205<br>
Self-heal Daemon on ml46 N/A
Y 25813<br>
Self-heal Daemon on ml40 N/A
Y 26717<br>
Self-heal Daemon on ml31 N/A
Y 22038<br>
Self-heal Daemon on ml48 N/A
Y 4876<br>
Self-heal Daemon on ml45 N/A
Y 3171<br>
Self-heal Daemon on ml51 N/A
Y 30274<br>
Self-heal Daemon on ml41 N/A
Y 25779<br>
Self-heal Daemon on ml53 N/A
Y 9253<br>
</blockquote>
<br>
peer status:<br>
<blockquote type="cite">Number of Peers: 15<br>
<br>
Hostname: ml52<br>
Uuid: 4de42f67-4cca-4d28-8600-9018172563ba<br>
State: Peer in Cluster (Connected)<br>
<br>
Hostname: ml41<br>
Uuid: b404851f-dfd5-4746-a3bd-81bb0d888009<br>
State: Peer in Cluster (Connected)<br>
<br>
Hostname: ml46<br>
Uuid: af74d39b-09d6-47ba-9c3b-72d993dca4ce<br>
State: Peer in Cluster (Connected)<br>
<br>
Hostname: ml54<br>
Uuid: c55580fa-2c9d-493d-b9d1-3bce016c8b29<br>
State: Peer in Cluster (Connected)<br>
<br>
Hostname: ml51<br>
Uuid: 5491b6dc-0f96-43d9-95d9-a41018a8542c<br>
State: Peer in Cluster (Connected)<br>
<br>
Hostname: ml48<br>
Uuid: efd79145-bfd9-4eea-b7a7-50be18d9ffe0<br>
State: Peer in Cluster (Connected)<br>
<br>
Hostname: ml43<br>
Uuid: a9044e9a-39e1-4907-8921-43da870b7f31<br>
State: Peer in Cluster (Connected)<br>
<br>
Hostname: ml45<br>
Uuid: 0eebbceb-8f62-4c55-8160-41348f90e191<br>
State: Peer in Cluster (Connected)<br>
<br>
Hostname: ml47<br>
Uuid: e831092d-b196-46ec-947d-a5635e8fbd1e<br>
State: Peer in Cluster (Connected)<br>
<br>
Hostname: ml30<br>
Uuid: e56b4c57-a058-4464-a1e6-c4676ebf00cc<br>
State: Peer in Cluster (Connected)<br>
<br>
Hostname: ml40<br>
Uuid: ffcc06ae-100a-4fa2-888e-803a41ae946c<br>
State: Peer in Cluster (Connected)<br>
<br>
Hostname: ml55<br>
Uuid: 366339ed-52e5-4722-a1b3-e3bb1c49ea4f<br>
State: Peer in Cluster (Connected)<br>
<br>
Hostname: ml31<br>
Uuid: 699019f6-2f4a-45cb-bfa4-f209745f8a6d<br>
State: Peer in Cluster (Connected)<br>
<br>
Hostname: ml29<br>
Uuid: 58aa8a16-5d2b-4c06-8f06-2fd0f7fc5a37<br>
State: Peer in Cluster (Connected)<br>
<br>
Hostname: ml53<br>
Uuid: 1dc6ee08-c606-4755-8756-b553f66efa88<br>
State: Peer in Cluster (Connected)<br>
</blockquote>
<br>
gluster version:<br>
<blockquote type="cite">glusterfs 3.3.1 built on Oct 11 2012
21:49:37</blockquote>
<br>
rpms:<br>
<blockquote type="cite">glusterfs.x86_64
3.3.1-1.el6 @glusterfs-epel<br>
glusterfs-debuginfo.x86_64 3.3.1-1.el6
@glusterfs-epel<br>
glusterfs-fuse.x86_64 3.3.1-1.el6
@glusterfs-epel<br>
glusterfs-rdma.x86_64 3.3.1-1.el6
@glusterfs-epel<br>
glusterfs-server.x86_64 3.3.1-1.el6
@glusterfs-epel<br>
</blockquote>
<br>
kernel:<br>
<blockquote type="cite">Linux 2.6.32-131.17.1.el6.x86_64 #1 SMP
Wed Oct 5 17:19:54 CDT 2011 x86_64 x86_64 x86_64 GNU/Linux</blockquote>
<br>
OS: Scientific Linux 6.1 (a RHEL rebuild, similar to CentOS)<br>
</font>
</body>
</html>