<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body text="#000000" bgcolor="#FFFFFF">
Let me reiterate: I really, really want to see Gluster work for our
environment. I am hopeful this is something I did or something that
can be easily fixed.<br>
<br>
Yes, there was an error on the client server:<br>
<br>
[586898.273283] INFO: task flush-0:45:633954 blocked for more than
120 seconds.<br>
[586898.273290] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.<br>
[586898.273295] flush-0:45 D ffff8806037592d0 0 633954
2 0 0x00000000<br>
[586898.273304] ffff88000d1ebbe0 0000000000000046 ffff88000d1ebd6c
0000000000000000<br>
[586898.273312] ffff88000d1ebce0 ffffffff81054444 ffff88000d1ebc80
ffff88000d1ebbf0<br>
[586898.273319] ffff8806050ac5f8 ffff880603759888 ffff88000d1ebfd8
ffff88000d1ebfd8<br>
[586898.273326] Call Trace:<br>
[586898.273335] [<ffffffff81054444>] ?
find_busiest_group+0x244/0xb20<br>
[586898.273343] [<ffffffff811ab050>] ? inode_wait+0x0/0x20<br>
[586898.273349] [<ffffffff811ab05e>] inode_wait+0xe/0x20<br>
[586898.273357] [<ffffffff814e752f>] __wait_on_bit+0x5f/0x90<br>
[586898.273365] [<ffffffff811bbd6c>] ?
writeback_sb_inodes+0x13c/0x210<br>
[586898.273370] [<ffffffff811bab28>]
inode_wait_for_writeback+0x98/0xc0<br>
[586898.273377] [<ffffffff81095550>] ?
wake_bit_function+0x0/0x50<br>
[586898.273382] [<ffffffff811bc1f8>] wb_writeback+0x218/0x420<br>
[586898.273389] [<ffffffff814e637e>] ?
thread_return+0x4e/0x7d0<br>
[586898.273394] [<ffffffff811bc5a9>]
wb_do_writeback+0x1a9/0x250<br>
[586898.273402] [<ffffffff8107e2e0>] ?
process_timeout+0x0/0x10<br>
[586898.273407] [<ffffffff811bc6b3>]
bdi_writeback_task+0x63/0x1b0<br>
[586898.273412] [<ffffffff810953e7>] ?
bit_waitqueue+0x17/0xc0<br>
[586898.273419] [<ffffffff8114ce80>] ? bdi_start_fn+0x0/0x100<br>
[586898.273424] [<ffffffff8114cf06>] bdi_start_fn+0x86/0x100<br>
[586898.273429] [<ffffffff8114ce80>] ? bdi_start_fn+0x0/0x100<br>
[586898.273434] [<ffffffff81094f36>] kthread+0x96/0xa0<br>
[586898.273440] [<ffffffff8100c20a>] child_rip+0xa/0x20<br>
[586898.273445] [<ffffffff81094ea0>] ? kthread+0x0/0xa0<br>
[586898.273450] [<ffffffff8100c200>] ? child_rip+0x0/0x20<br>
[root@server-10 ~]# <br>
<br>
<br>
<br>
Here are the file sizes. The secure file was big, but the client was
hung for quite a long time:<br>
<br>
-rw------- 1 root root 0 Dec 20 10:17 boot.log<br>
-rw------- 1 root utmp 281079168 Jun 15 21:53 btmp<br>
-rw------- 1 root root 337661 Jun 16 16:36 cron<br>
-rw-r--r-- 1 root root 0 Jun 9 18:33 dmesg<br>
-rw-r--r-- 1 root root 0 Jun 9 16:19 dmesg.old<br>
-rw-r--r-- 1 root root 98585 Dec 21 14:32 dracut.log<br>
drwxr-xr-x 5 root root 4096 Dec 21 16:53 glusterfs<br>
drwx------ 2 root root 4096 Mar 1 16:11 httpd<br>
-rw-r--r-- 1 root root 146000 Jun 16 13:36 lastlog<br>
drwxr-xr-x 2 root root 4096 Dec 20 10:35 mail<br>
-rw------- 1 root root 1072902 Jun 9 18:33 maillog<br>
-rw------- 1 root root 50638 Jun 16 12:13 messages<br>
drwxr-xr-x 2 root root 4096 Dec 30 16:14 nginx<br>
drwx------ 3 root root 4096 Dec 20 10:35 samba<br>
-rw------- 1 root root 222214339 Jun 16 13:37 secure<br>
-rw------- 1 root root 0 Sep 13 2011 spooler<br>
-rw------- 1 root root 0 Sep 13 2011 tallylog<br>
-rw-rw-r-- 1 root utmp 114432 Jun 16 13:37 wtmp<br>
-rw------- 1 root root 7015 Jun 16 12:13 yum.log<br>
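For what it's worth, the data split-brain on log/secure reported in the self-heal log messages quoted below can be confirmed directly from the AFR changelog xattrs on each brick. This is only a sketch: the brick path /export/brick1 is an assumption (adjust for your layout), and the volume name pub2 is taken from the mount line below.<br>

```shell
# Hypothetical brick path; run this on each replica server:
#   getfattr -d -m trusted.afr -e hex /export/brick1/log/secure
#
# Each trusted.afr.pub2-client-N value is 12 bytes: the first 4 bytes
# are the pending *data* changelog counter, then metadata, then entry.
# If both bricks hold nonzero data counters blaming each other, the
# file is in data split-brain, matching the log messages quoted below.

# Small helper: does an AFR xattr hex value show pending data operations?
afr_has_pending_data() {
    hex=${1#0x}                      # strip the 0x prefix
    case $hex in
        00000000*) return 1 ;;       # first 8 hex digits (data counter) clear
        *)         return 0 ;;       # nonzero data counter: pending data ops
    esac
}

# Example values (hypothetical):
afr_has_pending_data 0x000000020000000000000000 && echo "pending data ops"
afr_has_pending_data 0x000000000000000000000000 || echo "clean"
```

Gluster 3.3 also added `gluster volume heal <name> info split-brain` to list affected files, which may be easier than reading the xattrs by hand.<br>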
<br>
<div class="moz-cite-prefix">On 06/16/2012 05:04 PM, Anand Avati
wrote:<br>
</div>
<blockquote
cite="mid:CAFboF2xg4stSCF82dSEU2V7Xadu52k6gPXuyBV6ZG932_7e=Xw@mail.gmail.com"
type="cite">Was there anything in dmesg on the servers? If you are
able to reproduce the hang, can you get the output of 'gluster
volume status <name> callpool' and 'gluster volume status
<name> nfs callpool' ?
<div>
<br>
</div>
<div>How big is the 'log/secure' file? Is it so large that the
client was just busy writing it for a very long time? Are there
any signs of disconnections or ping timeouts in the logs?</div>
<div><br>
</div>
<div>Avati<br>
<br>
<div class="gmail_quote">On Sat, Jun 16, 2012 at 10:48 AM, Sean
Fulton <span dir="ltr"><<a moz-do-not-send="true"
href="mailto:sean@gcnpublishing.com" target="_blank">sean@gcnpublishing.com</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
I do not mean to be argumentative, but I have to admit a
little frustration with Gluster. I know an enormous amount
of effort has gone into this product, and I just can't
believe that with all the effort behind it and so many
people using it, it could be so fragile.<br>
<br>
So here goes. Perhaps someone here can point to the error of
my ways. I really want this to work because it would be
ideal for our environment, but ...<br>
<br>
Please note that all of the nodes below are OpenVZ nodes
with nfs/nfsd/fuse modules loaded on the hosts.<br>
<br>
After spending months trying to get 3.2.5 and 3.2.6 working
in a production environment, I gave up on Gluster and went
with a Linux-HA/NFS cluster which just works. The problems I
had with gluster were strange lock-ups, split brains, and
too many instances where the whole cluster was off-line
until I reloaded the data.<br>
<br>
So with the release of 3.3, I decided to give it another
try. I created one replicated volume on my two NFS servers.<br>
<br>
I then mounted the volume on a client as follows:<br>
10.10.10.7:/pub2 /pub2 nfs
rw,noacl,noatime,nodiratime,soft,proto=tcp,vers=3,defaults 0
0<br>
<br>
I threw some data at it (find / -mount -print | cpio -pvdum
/pub2/test)<br>
<br>
Within 10 seconds it locked up solid. No error messages on
any of the servers, the client was unresponsive and load on
the client was 15+. I restarted glusterd on both of my NFS
servers, and the client remained locked. Finally I killed
the cpio process on the client. When I started another cpio,
it ran further than before, but now the logs on my
NFS/Gluster server say:<br>
<br>
[2012-06-16 13:37:35.242754] I [afr-self-heal-common.c:1318:afr_sh_missing_entries_lookup_done]
0-pub2-replicate-0: No sources for dir of
<gfid:4a787ad7-ab91-46ef-9b31-715e49f5f818>/log/secure,
in missing entry self-heal, continuing with the rest of the
self-heals<br>
[2012-06-16 13:37:35.243315] I [afr-self-heal-common.c:994:afr_sh_missing_entries_done]
0-pub2-replicate-0: split brain found, aborting selfheal of
<gfid:4a787ad7-ab91-46ef-9b31-715e49f5f818>/log/secure<br>
[2012-06-16 13:37:35.243350] E [afr-self-heal-common.c:2156:afr_self_heal_completion_cbk]
0-pub2-replicate-0: background data gfid self-heal failed
on <gfid:4a787ad7-ab91-46ef-9b31-715e49f5f818>/log/secure<br>
<br>
<br>
This still seems to be an INCREDIBLY fragile system. Why
would it lock solid while copying a large file? Why no
errors in the logs?<br>
<br>
Am I the only one seeing this kind of behavior?<span
class="HOEnZb"><font color="#888888"><br>
<br>
sean<br>
<br>
<br>
<br>
<br>
<br>
-- <br>
Sean Fulton<br>
GCN Publishing, Inc.<br>
Internet Design, Development and Consulting For Today's
Media Companies<br>
<a moz-do-not-send="true"
href="http://www.gcnpublishing.com" target="_blank">http://www.gcnpublishing.com</a><br>
<a moz-do-not-send="true"
href="tel:%28203%29%20665-6211%2C%20x203"
value="+12036656211" target="_blank">(203) 665-6211,
x203</a><br>
<br>
_______________________________________________<br>
Gluster-users mailing list<br>
<a moz-do-not-send="true"
href="mailto:Gluster-users@gluster.org"
target="_blank">Gluster-users@gluster.org</a><br>
<a moz-do-not-send="true"
href="http://gluster.org/cgi-bin/mailman/listinfo/gluster-users"
target="_blank">http://gluster.org/cgi-bin/mailman/listinfo/gluster-users</a><br>
</font></span></blockquote>
</div>
<br>
</div>
</blockquote>
<br>
<pre class="moz-signature" cols="72">--
Sean Fulton
GCN Publishing, Inc.
Internet Design, Development and Consulting For Today's Media Companies
<a class="moz-txt-link-freetext" href="http://www.gcnpublishing.com">http://www.gcnpublishing.com</a>
(203) 665-6211, x203
</pre>
<br>
<br>
</body>
</html>