<html>

  <head>

    <meta content="text/html; charset=ISO-8859-1"

      http-equiv="Content-Type">

  </head>

  <body bgcolor="#FFFFFF" text="#000000">

    <div class="moz-cite-prefix">On 07/09/2013 06:47 AM, Greg Scott

      wrote</div>

    <blockquote

      cite="mid:838abd9323264298b279937d9b46ccd6@mail2013.infrasupport.local"

      type="cite">

      <meta http-equiv="Content-Type" content="text/html;

        charset=us-ascii">

      <meta name="Generator" content="Microsoft Exchange Server">

      <span>

        <div>I don&#8217;t get this.&nbsp; I have a replicated volume and 2 nodes.&nbsp;

          My challenge is, when I take one node offline, the other node

          can no longer access the volume until both nodes are back

          online again.</div>

        <div>&nbsp;</div>

        <div>Details:</div>

        <div>&nbsp;</div>

        <div>I have 2 nodes, fw1 and fw2.&nbsp;&nbsp; Each node has an XFS file

          system, /gluster-fw1 on node fw1 and /gluster-fw2 on node

          fw2.&nbsp;&nbsp; Node fw1 is at IP Address 192.168.253.1.&nbsp; Node fw2 is

          at 192.168.253.2.&nbsp; </div>

        <div>&nbsp;</div>

        <div>I create a gluster volume named firewall-scripts which is a

          replica of those two XFS file systems.&nbsp; The volume holds a

          bunch of config files common to both fw1 and fw2.&nbsp; The

          application is an active/standby pair of firewalls and the

          idea is to keep config

          files in a gluster volume.</div>

        <div>&nbsp;</div>

        <div>When both nodes are online, everything works as expected.&nbsp;

          But when I take either node offline, node fw2 behaves badly:</div>

        <div>&nbsp;</div>

        <div>[root@chicago-fw2 ~]# ls /firewall-scripts</div>

        <div>ls: cannot access /firewall-scripts: Transport endpoint is

          not connected</div>

        <div>&nbsp;</div>

        <div>And when I bring the offline node back online, node fw2

          eventually behaves normally again.&nbsp; </div>

        <div>&nbsp;</div>

        <div>What&#8217;s up with that?&nbsp; Gluster is supposed to be resilient

          and self-healing and able to stand up to this sort of abuse.&nbsp;

          So I must be doing something wrong.&nbsp; </div>

        <div>&nbsp;</div>

        <div>Here is how I set up everything &#8211; it doesn&#8217;t get much

          simpler than this and my setup is right out of the Getting

          Started Guide but using my own names.&nbsp; </div>

        <div>&nbsp;</div>

        <div>Here are the steps I followed, all from fw1:</div>

        <div>&nbsp;</div>

        <div>gluster peer probe 192.168.253.2</div>

        <div>gluster peer status</div>

        <div>&nbsp;</div>

        <div>Create and start the volume:</div>

        <div>&nbsp;</div>

        <div>gluster volume create firewall-scripts replica 2 transport

          tcp 192.168.253.1:/gluster-fw1 192.168.253.2:/gluster-fw2</div>

        <div>gluster volume start firewall-scripts</div>

        <div>&nbsp;</div>

        <div>On fw1:</div>

        <div>&nbsp;</div>

        <div>mkdir /firewall-scripts</div>

        <div>mount -t glusterfs 192.168.253.1:/firewall-scripts

          /firewall-scripts</div>

        <div>&nbsp;</div>

        <div>and add this line to /etc/fstab:</div>

        <div>192.168.253.1:/firewall-scripts /firewall-scripts glusterfs

          defaults,_netdev 0 0</div>

        <div>&nbsp;</div>

        <div>On fw2:</div>

        <div>&nbsp;</div>

        <div>mkdir /firewall-scripts</div>

        <div>mount -t glusterfs 192.168.253.2:/firewall-scripts

          /firewall-scripts</div>

        <div>&nbsp;</div>

        <div>and add this line to /etc/fstab:</div>

        <div>192.168.253.2:/firewall-scripts /firewall-scripts glusterfs

          defaults,_netdev 0 0</div>

        <div>&nbsp;</div>

        <div>That&#8217;s it.&nbsp; That&#8217;s the whole setup.&nbsp; When both nodes are

          online, everything replicates beautifully.&nbsp; But take one node

          offline and it all falls apart.&nbsp; </div>

        <div>&nbsp;</div>

        <div>Here is the output from gluster volume info, identical on

          both nodes:</div>

        <div>&nbsp;</div>

        <div>[root@chicago-fw1 etc]# gluster volume info</div>

        <div>&nbsp;</div>

        <div>Volume Name: firewall-scripts</div>

        <div>Type: Replicate</div>

        <div>Volume ID: 239b6401-e873-449d-a2d3-1eb2f65a1d4c</div>

        <div>Status: Started</div>

        <div>Number of Bricks: 1 x 2 = 2</div>

        <div>Transport-type: tcp</div>

        <div>Bricks:</div>

        <div>Brick1: 192.168.253.1:/gluster-fw1</div>

        <div>Brick2: 192.168.253.2:/gluster-fw2</div>

        <div>[root@chicago-fw1 etc]#</div>

        <div>&nbsp;</div>

        <div>Looking at /var/log/glusterfs/firewall-scripts.log on fw2,

          I see errors like this every couple of seconds:</div>

        <div>&nbsp;</div>

        <div>[2013-07-09 00:59:04.706390] I

          [afr-common.c:3856:afr_local_init]

          0-firewall-scripts-replicate-0: no subvolumes up</div>

        <div>[2013-07-09 00:59:04.706515] W

          [fuse-bridge.c:1132:fuse_err_cbk] 0-glusterfs-fuse: 3160:

          FLUSH() ERR =&gt; -1 (Transport endpoint is not connected)</div>

        <div>&nbsp;</div>

        <div>And then when I bring fw1 back online, I see these messages

          on fw2:</div>

        <div>&nbsp;</div>

        <div>[2013-07-09 01:01:35.006782] I

          [rpc-clnt.c:1648:rpc_clnt_reconfig]

          0-firewall-scripts-client-0: changing port to 49152 (from 0)</div>

        <div>[2013-07-09 01:01:35.006932] W [socket.c:514:__socket_rwv]

          0-firewall-scripts-client-0: readv failed (No data available)</div>

        <div>[2013-07-09 01:01:35.018546] I

          [client-handshake.c:1658:select_server_supported_programs]

          0-firewall-scripts-client-0: Using Program GlusterFS 3.3, Num

          (1298437), Version (330)</div>

        <div>[2013-07-09 01:01:35.019273] I

          [client-handshake.c:1456:client_setvolume_cbk]

          0-firewall-scripts-client-0: Connected to 192.168.253.1:49152,

          attached to remote volume '/gluster-fw1'.</div>

        <div>[2013-07-09 01:01:35.019356] I

          [client-handshake.c:1468:client_setvolume_cbk]

          0-firewall-scripts-client-0: Server and Client lk-version

          numbers are not same, reopening the fds</div>

        <div>[2013-07-09 01:01:35.019441] I

          [client-handshake.c:1308:client_post_handshake]

          0-firewall-scripts-client-0: 1 fds open - Delaying child_up

          until they are re-opened</div>

        <div>[2013-07-09 01:01:35.020070] I

          [client-handshake.c:930:client_child_up_reopen_done]

          0-firewall-scripts-client-0: last fd open'd/lock-self-heal'd -

          notifying CHILD-UP</div>

        <div>[2013-07-09 01:01:35.020282] I

          [afr-common.c:3698:afr_notify] 0-firewall-scripts-replicate-0:

          Subvolume 'firewall-scripts-client-0' came back up; going

          online.</div>

        <div>[2013-07-09 01:01:35.020616] I

          [client-handshake.c:450:client_set_lk_version_cbk]

          0-firewall-scripts-client-0: Server lk version = 1</div>

        <div>&nbsp;</div>

        <div>So how do I make glusterfs survive a node failure, which is

          the whole point of all this?</div>

      </span><br>

    </blockquote>

    It looks like the brick process on the fw2 machine is not running,

    so when fw1 is down no replica is reachable and the volume becomes

    unavailable. Can you run ps on fw2, check the status of all the

    gluster processes, and make sure the brick process is up on fw2?<br>
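    <br>
    A minimal sketch of that check, to be run on fw2. It assumes the
    brick process is named glusterfsd and that the brick path shows up on
    its command line; brick_running is a made-up helper name for
    illustration, not part of Gluster:<br>
    <pre>
#!/bin/sh
# Sketch only: succeed if ps-style output on stdin contains a
# glusterfsd (brick) process whose command line mentions the brick path.
# brick_running is a hypothetical helper, not a Gluster command.
brick_running() {
    # $1 = brick path, for example /gluster-fw2
    grep glusterfsd | grep -q -- "$1"
}

# On fw2, check the local brick from the volume info above.
if ps ax | brick_running /gluster-fw2; then
    echo "brick process for /gluster-fw2 is running"
else
    echo "no brick process found for /gluster-fw2"
fi
</pre>
    Running gluster volume status firewall-scripts on either node should
    also list each brick and whether it is online, which is a quicker
    cross-check if your release has that command.<br>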

    <br>

    Regards<br>

    Raghav<br>

  </body>

</html>