<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body text="#000000" bgcolor="#FFFFFF">
Let me reiterate: I really, really want to see Gluster work for our
environment. I am hopeful this is something I did or something that
can be easily fixed.<br>
<br>
Yes, there was an error on the client server:<br>
<br>
[586898.273283] INFO: task flush-0:45:633954 blocked for more than
120 seconds.<br>
[586898.273290] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.<br>
[586898.273295] flush-0:45 D ffff8806037592d0 0 633954
2 0 0x00000000<br>
[586898.273304] ffff88000d1ebbe0 0000000000000046 ffff88000d1ebd6c
0000000000000000<br>
[586898.273312] ffff88000d1ebce0 ffffffff81054444 ffff88000d1ebc80
ffff88000d1ebbf0<br>
[586898.273319] ffff8806050ac5f8 ffff880603759888 ffff88000d1ebfd8
ffff88000d1ebfd8<br>
[586898.273326] Call Trace:<br>
[586898.273335] [<ffffffff81054444>] ?
find_busiest_group+0x244/0xb20<br>
[586898.273343] [<ffffffff811ab050>] ? inode_wait+0x0/0x20<br>
[586898.273349] [<ffffffff811ab05e>] inode_wait+0xe/0x20<br>
[586898.273357] [<ffffffff814e752f>] __wait_on_bit+0x5f/0x90<br>
[586898.273365] [<ffffffff811bbd6c>] ?
writeback_sb_inodes+0x13c/0x210<br>
[586898.273370] [<ffffffff811bab28>]
inode_wait_for_writeback+0x98/0xc0<br>
[586898.273377] [<ffffffff81095550>] ?
wake_bit_function+0x0/0x50<br>
[586898.273382] [<ffffffff811bc1f8>] wb_writeback+0x218/0x420<br>
[586898.273389] [<ffffffff814e637e>] ?
thread_return+0x4e/0x7d0<br>
[586898.273394] [<ffffffff811bc5a9>]
wb_do_writeback+0x1a9/0x250<br>
[586898.273402] [<ffffffff8107e2e0>] ?
process_timeout+0x0/0x10<br>
[586898.273407] [<ffffffff811bc6b3>]
bdi_writeback_task+0x63/0x1b0<br>
[586898.273412] [<ffffffff810953e7>] ?
bit_waitqueue+0x17/0xc0<br>
[586898.273419] [<ffffffff8114ce80>] ? bdi_start_fn+0x0/0x100<br>
[586898.273424] [<ffffffff8114cf06>] bdi_start_fn+0x86/0x100<br>
[586898.273429] [<ffffffff8114ce80>] ? bdi_start_fn+0x0/0x100<br>
[586898.273434] [<ffffffff81094f36>] kthread+0x96/0xa0<br>
[586898.273440] [<ffffffff8100c20a>] child_rip+0xa/0x20<br>
[586898.273445] [<ffffffff81094ea0>] ? kthread+0x0/0xa0<br>
[586898.273450] [<ffffffff8100c200>] ? child_rip+0x0/0x20<br>
[root@server-10 ~]# <br>
<br>
<br>
<br>
Here are the file sizes. The secure file was big, but the client was
hung for quite a long time:<br>
<br>
-rw------- 1 root root 0 Dec 20 10:17 boot.log<br>
-rw------- 1 root utmp 281079168 Jun 15 21:53 btmp<br>
-rw------- 1 root root 337661 Jun 16 16:36 cron<br>
-rw-r--r-- 1 root root 0 Jun 9 18:33 dmesg<br>
-rw-r--r-- 1 root root 0 Jun 9 16:19 dmesg.old<br>
-rw-r--r-- 1 root root 98585 Dec 21 14:32 dracut.log<br>
drwxr-xr-x 5 root root 4096 Dec 21 16:53 glusterfs<br>
drwx------ 2 root root 4096 Mar 1 16:11 httpd<br>
-rw-r--r-- 1 root root 146000 Jun 16 13:36 lastlog<br>
drwxr-xr-x 2 root root 4096 Dec 20 10:35 mail<br>
-rw------- 1 root root 1072902 Jun 9 18:33 maillog<br>
-rw------- 1 root root 50638 Jun 16 12:13 messages<br>
drwxr-xr-x 2 root root 4096 Dec 30 16:14 nginx<br>
drwx------ 3 root root 4096 Dec 20 10:35 samba<br>
-rw------- 1 root root 222214339 Jun 16 13:37 secure<br>
-rw------- 1 root root 0 Sep 13 2011 spooler<br>
-rw------- 1 root root 0 Sep 13 2011 tallylog<br>
-rw-rw-r-- 1 root utmp 114432 Jun 16 13:37 wtmp<br>
-rw------- 1 root root 7015 Jun 16 12:13 yum.log<br>
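For what it's worth, the data split-brain on log/secure reported in the self-heal log messages quoted below can be confirmed directly from the AFR changelog xattrs on each brick. This is only a sketch: the brick path /export/brick1 is an assumption (adjust for your layout), and the volume name pub2 is taken from the mount line below.<br>

```shell
# Hypothetical brick path; run this on each replica server:
#   getfattr -d -m trusted.afr -e hex /export/brick1/log/secure
#
# Each trusted.afr.pub2-client-N value is 12 bytes: the first 4 bytes
# are the pending *data* changelog counter, then metadata, then entry.
# If both bricks hold nonzero data counters blaming each other, the
# file is in data split-brain, matching the log messages quoted below.

# Small helper: does an AFR xattr hex value show pending data operations?
afr_has_pending_data() {
    hex=${1#0x}                      # strip the 0x prefix
    case $hex in
        00000000*) return 1 ;;       # first 8 hex digits (data counter) clear
        *)         return 0 ;;       # nonzero data counter: pending data ops
    esac
}

# Example values (hypothetical):
afr_has_pending_data 0x000000020000000000000000 && echo "pending data ops"
afr_has_pending_data 0x000000000000000000000000 || echo "clean"
```

Gluster 3.3 also added `gluster volume heal <name> info split-brain` to list affected files, which may be easier than reading the xattrs by hand.<br>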
<br>
<div class="moz-cite-prefix">On 06/16/2012 05:04 PM, Anand Avati
wrote:<br>
</div>
<blockquote
cite="mid:CAFboF2xg4stSCF82dSEU2V7Xadu52k6gPXuyBV6ZG932_7e=Xw@mail.gmail.com"
type="cite">Was there anything in dmesg on the servers? If you are
able to reproduce the hang, can you get the output of 'gluster
volume status <name> callpool' and 'gluster volume status
<name> nfs callpool' ?
<div>
<br>
</div>
<div>How big is the 'log/secure' file? Is it so large that the
client was just busy writing it for a very long time? Are there
any signs of disconnections or ping timeouts in the logs?</div>
<div><br>
</div>
<div>Avati<br>
<br>
<div class="gmail_quote">On Sat, Jun 16, 2012 at 10:48 AM, Sean
Fulton <span dir="ltr"><<a moz-do-not-send="true"
href="mailto:sean@gcnpublishing.com" target="_blank">sean@gcnpublishing.com</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
I do not mean to be argumentative, but I have to admit a
little frustration with Gluster. I know an enormous amount
of effort has gone into this product, and I just can't
believe that with all the effort behind it and so many
people using it, it could be so fragile.<br>
<br>
So here goes. Perhaps someone here can point to the error of
my ways. I really want this to work because it would be
ideal for our environment, but ...<br>
<br>
Please note that all of the nodes below are OpenVZ nodes
with nfs/nfsd/fuse modules loaded on the hosts.<br>
<br>
After spending months trying to get 3.2.5 and 3.2.6 working
in a production environment, I gave up on Gluster and went
with a Linux-HA/NFS cluster which just works. The problems I
had with gluster were strange lock-ups, split brains, and
too many instances where the whole cluster was off-line
until I reloaded the data.<br>
<br>
So with the release of 3.3, I decided to give it another
try. I created one replicated volume on my two NFS servers.<br>
<br>
I then mounted the volume on a client as follows:<br>
10.10.10.7:/pub2 /pub2 nfs
rw,noacl,noatime,nodiratime,soft,proto=tcp,vers=3,defaults 0
0<br>
<br>
I threw some data at it (find / -mount -print | cpio -pvdum
/pub2/test)<br>
<br>
Within 10 seconds it locked up solid. No error messages on
any of the servers, the client was unresponsive and load on
the client was 15+. I restarted glusterd on both of my NFS
servers, and the client remained locked. Finally I killed
the cpio process on the client. When I started another cpio,
it ran further than before, but now the logs on my
NFS/Gluster server say:<br>
<br>
[2012-06-16 13:37:35.242754] I [afr-self-heal-common.c:1318:afr_sh_missing_entries_lookup_done]
0-pub2-replicate-0: No sources for dir of
<gfid:4a787ad7-ab91-46ef-9b31-715e49f5f818>/log/secure,
in missing entry self-heal, continuing with the rest of the
self-heals<br>
[2012-06-16 13:37:35.243315] I [afr-self-heal-common.c:994:afr_sh_missing_entries_done]
0-pub2-replicate-0: split brain found, aborting selfheal of
<gfid:4a787ad7-ab91-46ef-9b31-715e49f5f818>/log/secure<br>
[2012-06-16 13:37:35.243350] E [afr-self-heal-common.c:2156:afr_self_heal_completion_cbk]
0-pub2-replicate-0: background data gfid self-heal failed
on <gfid:4a787ad7-ab91-46ef-9b31-715e49f5f818>/log/secure<br>
<br>
<br>
This still seems to be an INCREDIBLY fragile system. Why
would it lock solid while copying a large file? Why no
errors in the logs?<br>
<br>
Am I the only one seeing this kind of behavior?<span
class="HOEnZb"><font color="#888888"><br>
<br>
sean<br>
<br>
<br>
<br>
<br>
<br>
-- <br>
Sean Fulton<br>
GCN Publishing, Inc.<br>
Internet Design, Development and Consulting For Today's
Media Companies<br>
<a moz-do-not-send="true"
href="http://www.gcnpublishing.com" target="_blank">http://www.gcnpublishing.com</a><br>
<a moz-do-not-send="true"
href="tel:%28203%29%20665-6211%2C%20x203"
value="+12036656211" target="_blank">(203) 665-6211,
x203</a><br>
<br>
_______________________________________________<br>
Gluster-users mailing list<br>
<a moz-do-not-send="true"
href="mailto:Gluster-users@gluster.org"
target="_blank">Gluster-users@gluster.org</a><br>
<a moz-do-not-send="true"
href="http://gluster.org/cgi-bin/mailman/listinfo/gluster-users"
target="_blank">http://gluster.org/cgi-bin/mailman/listinfo/gluster-users</a><br>
</font></span></blockquote>
</div>
<br>
</div>
</blockquote>
<br>
<pre class="moz-signature" cols="72">--
Sean Fulton
GCN Publishing, Inc.
Internet Design, Development and Consulting For Today's Media Companies
<a class="moz-txt-link-freetext" href="http://www.gcnpublishing.com">http://www.gcnpublishing.com</a>
(203) 665-6211, x203
</pre>
<br>
<br>
</body>
</html>