<html>

  <head>

    <meta content="text/html; charset=ISO-8859-1"

      http-equiv="Content-Type">

  </head>

  <body text="#000000" bgcolor="#FFFFFF">

    Thanks Jeff, that's interesting.<br>

    <br>

    It is reassuring to know that these errors are self-repairing.&nbsp; That

    does appear to be happening, but only when I run "find -print0 |

    xargs --null stat &gt;/dev/null" in affected directories.&nbsp; I will

    run that self-heal on the whole volume as well, but I have had to

    start with specific directories that people want to work in today.&nbsp;

    Does repeating the fix-layout operation have any effect, or are the

    xattr repairs all done by the self-heal mechanism?<br>
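    <br>
    For scripting the same traversal, here is a minimal Python sketch of
    what the find/stat pipeline does.&nbsp; It assumes that a plain lstat()
    lookup from a client mount has the same effect as the stat command;
    stat_tree is a hypothetical helper name, not a GlusterFS tool.<br>
    <pre>
```python
import os


def stat_tree(root):
    """lstat root and every entry beneath it, like find -print0 | xargs stat.

    Returns (visited, failed): the number of successful lookups and the
    list of paths that raised OSError. Assumption: on a GlusterFS client
    mount each lookup is what prompts the repair, as with the stat command.
    """
    visited = 0
    failed = []

    def touch(path):
        nonlocal visited
        try:
            os.lstat(path)  # lstat does not follow symlinks, like plain stat
            visited += 1
        except OSError:
            failed.append(path)

    touch(root)
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            touch(os.path.join(dirpath, name))
    return visited, failed
```
    </pre>
    Calling stat_tree() on an affected directory under the client mount
    point returns the paths whose lookup failed, which can be retried once
    all bricks are back up.<br>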

    <br>

    I have found the cause of the transient brick failure; it happened

    again this morning on a replicated pair of bricks.&nbsp; Suddenly the

    etc-glusterfs-glusterd.vol.log file was flooded with the following

    message, repeated every few seconds.<br>

    <br>

    E [socket.c:2080:socket_connect] 0-management: connection attempt

    failed (Connection refused)<br>
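    <br>
    A minimal sketch for measuring how often this message appears.&nbsp; It
    assumes that each line in the glusterd log starts with the same
    bracketed timestamp format seen in the client log excerpts in this
    message; refused_per_minute is a hypothetical helper, not part of
    GlusterFS.<br>
    <pre>
```python
import re
from collections import Counter

# Assumed line prefix: [YYYY-MM-DD HH:MM:SS.microseconds]
STAMP = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}):\d{2}")


def refused_per_minute(lines):
    """Count socket_connect 'Connection refused' errors per minute.

    Lines without a recognisable timestamp are counted under 'unknown'.
    """
    counts = Counter()
    for line in lines:
        if "socket_connect" in line and "Connection refused" in line:
            match = STAMP.match(line)
            key = match.group(1) if match else "unknown"
            counts[key] += 1
    return counts
```
    </pre>
    Passing an open log file to refused_per_minute() gives a per-minute
    tally, which should show when the flood began.<br>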

    <br>

    One of the clients then reported errors like the following.<br>

    <br>

    [2012-02-23 11:19:22.922785] E [afr-common.c:3164:afr_notify]

    2-atmos-replicate-3: All subvolumes are down. Going offline until

    atleast one of them comes back up.<br>

    [2012-02-23 11:19:22.923682] I

    [dht-layout.c:581:dht_layout_normalize] 0-atmos-dht: found anomalies

    in /. holes=1 overlaps=0<br>

    [2012-02-23 11:19:22.923714] I

    [dht-selfheal.c:569:dht_selfheal_directory] 0-atmos-dht: 1

    subvolumes down -- not fixing<br>

    <br>

    [2012-02-23 11:19:22.941468] W

    [socket.c:1494:__socket_proto_state_machine] 1-atmos-client-7:

    reading from socket failed. Error (Transport endpoint is not

    connected), peer (192.171.166.89:24019)<br>

    [2012-02-23 11:19:22.972307] I [client.c:1883:client_rpc_notify]

    1-atmos-client-7: disconnected<br>

    [2012-02-23 11:19:22.972352] E [afr-common.c:3164:afr_notify]

    1-atmos-replicate-3: All subvolumes are down. Going offline until

    atleast one of them comes back up.<br>

    <br>

    The servers causing trouble were still showing as Connected in

    "gluster peer status" and nothing appeared to be wrong except for

    glusterd misbehaving.&nbsp; Restarting glusterd solved the problem, but

    given that this has happened twice this week already, I am worried

    that it could happen again at any time.&nbsp; Do you know what might be

    causing glusterd to stop responding like this?<br>

    <br>

    Regards<br>

    Dan.<br>

    <br>

    <br>

    On 02/22/2012 08:00 PM, <a class="moz-txt-link-abbreviated" href="mailto:gluster-users-request@gluster.org">gluster-users-request@gluster.org</a> wrote:

    <blockquote

      cite="mid:mailman.1.1329940801.15655.gluster-users@gluster.org"

      type="cite">

      <pre wrap="">Date: Wed, 22 Feb 2012 10:32:31 -0500

From: Jeff Darcy <a moz-do-not-send="true" class="moz-txt-link-rfc2396E" href="mailto:jdarcy@redhat.com">&lt;jdarcy@redhat.com&gt;</a>

Subject: Re: [Gluster-users] "mismatching layouts" errors after

        expanding volume

To: <a moz-do-not-send="true" class="moz-txt-link-abbreviated" href="mailto:gluster-users@gluster.org">gluster-users@gluster.org</a>

Message-ID: <a moz-do-not-send="true" class="moz-txt-link-rfc2396E" href="mailto:4F450A8F.6070809@redhat.com">&lt;4F450A8F.6070809@redhat.com&gt;</a>

Content-Type: text/plain; charset=ISO-8859-1

Following up on the previous reply...

On 02/22/2012 02:52 AM, Dan Bretherton wrote:

</pre>

      <blockquote type="cite" style="color: #000000;">

        <pre wrap=""><span class="moz-txt-citetags">&gt; </span>[2012-02-16 22:59:42.504907] I

<span class="moz-txt-citetags">&gt; </span>[dht-layout.c:682:dht_layout_dir_mismatch] 0-atmos-dht: subvol:

<span class="moz-txt-citetags">&gt; </span>atmos-replicate-0; inode layout - 0 - 0; disk layout - 9203501

<span class="moz-txt-citetags">&gt; </span>34 - 1227133511

<span class="moz-txt-citetags">&gt; </span>[2012-02-16 22:59:42.534399] I [dht-common.c:524:dht_revalidate_cbk]

<span class="moz-txt-citetags">&gt; </span>0-atmos-dht: mismatching layouts for /users/rle/TRACKTEMP/TRACKS

</pre>

      </blockquote>

      <pre wrap="">On 02/22/2012 09:19 AM, Jeff Darcy wrote:

</pre>

      <blockquote type="cite" style="color: #000000;">

        <pre wrap=""><span class="moz-txt-citetags">&gt; </span>OTOH, the log entries below do seem to indicate that there's something going on

<span class="moz-txt-citetags">&gt; </span>that I don't understand.  I'll dig a bit, and let you know if I find anything

<span class="moz-txt-citetags">&gt; </span>to change my mind wrt the safety of restoring write access.

</pre>

      </blockquote>

      <pre wrap="">The two messages above are paired, in the sense that the second is inevitable

after the first. The "disk layout" range shown in the first is exactly what I

would expect for subvolume 3 out of 0-13. That means the trusted.glusterfs.dht

value on disk seems reasonable. The corresponding in-memory "inode layout"

entry has the less reasonable value of all zero. That probably means we failed

to fetch the xattr at some point in the past. There might be something earlier

in your logs - perhaps a message about "holes" and/or one specifically

mentioning that subvolume - to explain why.

The good news is that this should be self-repairing. Once we get these

messages, we try to re-fetch the layout information from all subvolumes. If

<b class="moz-txt-star"><span class="moz-txt-tag">*</span>that<span class="moz-txt-tag">*</span></b> failed, we'd see more messages than those above. Since the on-disk

values seem OK and revalidation seems to be succeeding, I would say these

messages probably represent successful attempts to recover from a transient

brick failure, and that does <b class="moz-txt-star"><span class="moz-txt-tag">*</span>not<span class="moz-txt-tag">*</span></b> change what I said previously.</pre>

    </blockquote>
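    <br>
    The range arithmetic in the quoted reply can be checked with a short
    Python sketch.&nbsp; It assumes the 32-bit hash space is divided into
    equal chunks of floor(2^32 / count), with any remainder ignored; this
    illustrates the arithmetic only and is not GlusterFS's own code.&nbsp;
    layout_range is a hypothetical helper name.<br>
    <pre>
```python
def layout_range(index, count, space=2**32):
    """Return (start, end) of the hash range for subvolume index of count.

    Assumption: equal chunks of space // count; the remainder left over
    at the top of the hash space is not modelled here.
    """
    chunk = space // count
    start = index * chunk
    end = start + chunk - 1
    return start, end


# Subvolume 3 out of 0-13, i.e. 14 subvolumes in total.
print(layout_range(3, 14))  # (920350134, 1227133511)
```
    </pre>
    This matches the "disk layout" range in the quoted log entry
    (920350134 - 1227133511, once the number wrapped across two lines is
    rejoined).<br>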

  </body>

</html>