[Gluster-users] ping_timer_expired

Tue Nov 25 15:05:38 UTC 2014

We have three Gluster clusters, all three of which are exhibiting the
same symptom: FUSE clients report network ping timeouts from bricks,
disconnect from the volume, and then very quickly re-connect to all bricks.

An example from the client logs:

[2014-11-20 01:19:09.079725] C
[client-handshake.c:127:rpc_client_ping_timer_expired]
0-wp_uploads-client-2: server 192.168.135.37:49152 has not responded in
the last 5 seconds, disconnecting.
[2014-11-20 01:19:09.278701] E [rpc-clnt.c:369:saved_frames_unwind]
(-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x15d) [0x38ca00fced]
(-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xc3)
[0x38ca00f833] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)
[0x38ca00f74e]))) 0-wp_uploads-client-2: forced unwinding frame
type(GlusterFS 3.3) op(LOOKUP(27)) called at 2014-11-20 01:19:03.771414
(xid=0x7bd8a6)
[2014-11-20 01:19:09.278734] W
[client-rpc-fops.c:2774:client3_3_lookup_cbk] 0-wp_uploads-client-2:
remote operation failed: Transport endpoint is not connected. Path: /
(00000000-0000-0000-0000-000000000001)
[2014-11-20 01:19:09.278893] E [rpc-clnt.c:369:saved_frames_unwind]
(-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x15d) [0x38ca00fced]
(-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xc3)
[0x38ca00f833] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)
[0x38ca00f74e]))) 0-wp_uploads-client-2: forced unwinding frame
type(GlusterFS 3.3) op(LOOKUP(27)) called at 2014-11-20 01:19:03.771917
(xid=0x7bd8a7)
[2014-11-20 01:19:09.278901] W
[client-rpc-fops.c:2774:client3_3_lookup_cbk] 0-wp_uploads-client-2:
remote operation failed: Transport endpoint is not connected. Path: /
(00000000-0000-0000-0000-000000000001)
[2014-11-20 01:19:09.279008] E [rpc-clnt.c:369:saved_frames_unwind]
(-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x15d) [0x38ca00fced]
(-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xc3)
[0x38ca00f833] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)
[0x38ca00f74e]))) 0-wp_uploads-client-2: forced unwinding frame
type(GlusterFS Handshake) op(PING(3)) called at 2014-11-20
01:19:04.072860 (xid=0x7bd8a8)
[2014-11-20 01:19:09.279028] W [client-handshake.c:276:client_ping_cbk]
0-wp_uploads-client-2: timer must have expired
[2014-11-20 01:19:09.279090] E [rpc-clnt.c:369:saved_frames_unwind]
(-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x15d) [0x38ca00fced]
(-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xc3)
[0x38ca00f833] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)
[0x38ca00f74e]))) 0-wp_uploads-client-2: forced unwinding frame
type(GlusterFS 3.3) op(LOOKUP(27)) called at 2014-11-20 01:19:07.070544
(xid=0x7bd8a9)
[2014-11-20 01:19:09.279099] W
[client-rpc-fops.c:2774:client3_3_lookup_cbk] 0-wp_uploads-client-2:
remote operation failed: Transport endpoint is not connected. Path: /
(00000000-0000-0000-0000-000000000001)
[2014-11-20 01:19:09.287885] I [client.c:2229:client_rpc_notify]
0-wp_uploads-client-2: disconnected from 192.168.135.37:49152. Client
process will keep trying to connect to glusterd until brick's port is
available
[2014-11-20 01:19:09.377628] I [socket.c:3060:socket_submit_request]
0-wp_uploads-client-2: not connected (priv->connected = 0)
[2014-11-20 01:19:09.377669] W [rpc-clnt.c:1542:rpc_clnt_submit]
0-wp_uploads-client-2: failed to submit rpc-request (XID: 0x7bd8aa
Program: GlusterFS 3.3, ProgVers: 330, Proc: 27) to rpc-transport
(wp_uploads-client-2)
[2014-11-20 01:19:09.377692] W
[client-rpc-fops.c:2774:client3_3_lookup_cbk] 0-wp_uploads-client-2:
remote operation failed: Transport endpoint is not connected. Path:
/2014 (10537923-c903-4a34-af42-f74b9eb6cf11)
[2014-11-20 01:19:09.498741] I [rpc-clnt.c:1729:rpc_clnt_reconfig]
0-wp_uploads-client-2: changing port to 49152 (from 0)
[2014-11-20 01:19:09.501832] I
[client-handshake.c:1677:select_server_supported_programs]
0-wp_uploads-client-2: Using Program GlusterFS 3.3, Num (1298437),
Version (330)
[2014-11-20 01:19:09.538700] I
[client-handshake.c:1462:client_setvolume_cbk] 0-wp_uploads-client-2:
Connected to 192.168.135.37:49152, attached to remote volume
'/bricks/brick1/brick'.
[2014-11-20 01:19:09.538718] I
[client-handshake.c:1474:client_setvolume_cbk] 0-wp_uploads-client-2:
Server and Client lk-version numbers are not same, reopening the fds
[2014-11-20 01:19:09.548683] I
[client-handshake.c:450:client_set_lk_version_cbk]
0-wp_uploads-client-2: Server lk version = 1

As you can see, the disconnect (01:19:09.079) and the reconnection
(01:19:09.538) occur within the same second.
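
To measure that gap over a whole client log rather than one excerpt, a
minimal Python sketch follows.  It assumes each entry in the log file
sits on a single line (the excerpt above was wrapped by the mail client)
and that the rpc_client_ping_timer_expired and client_setvolume_cbk
entries look like the ones shown.  The function name recovery_times is
made up for this example.

```python
import re
from datetime import datetime

# Assumed entry layout: "[timestamp] LEVEL [source] translator: message"
ENTRY = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\]\s+"
    r"(?P<level>[A-Z])\s+\[(?P<src>[^\]]+)\]\s+(?P<xl>\S+):\s+(?P<msg>.*)$"
)


def parse_ts(text):
    """Parse a timestamp such as '2014-11-20 01:19:09.079725'."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S.%f")


def recovery_times(lines):
    """Yield (translator, disconnect_time, seconds_until_reconnect).

    A ping timer expiry opens an outage for that client translator;
    the next "Connected to" entry for the same translator closes it.
    """
    pending = {}
    for line in lines:
        m = ENTRY.match(line.strip())
        if not m:
            continue  # not an entry in the assumed layout
        xl, src, msg = m["xl"], m["src"], m["msg"]
        ts = parse_ts(m["ts"])
        if "rpc_client_ping_timer_expired" in src:
            # Keep the first expiry if several are logged before reconnect.
            pending.setdefault(xl, ts)
        elif (
            "client_setvolume_cbk" in src
            and msg.startswith("Connected to")
            and xl in pending
        ):
            start = pending.pop(xl)
            yield xl, start, (ts - start).total_seconds()
```

Run against the excerpt above with the lines unwrapped, this reports a
gap of about 0.46 seconds for 0-wp_uploads-client-2.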

The servers on cluster 1 are physical servers with 10G network devices.
The clients for this cluster are VMs in a different subnet, so all
traffic passes through a firewall.  These servers and clients run
Gluster 3.6.1.

The servers on cluster 2 are VMs.  The clients for this cluster are VMs
in a different subnet, so all traffic passes through a firewall.  These
servers and clients run Gluster 3.6.1.

The servers on cluster 3 are VMs. The clients for this cluster are the
other servers in the cluster, and all are on the same subnet.  These
servers and clients run Gluster 3.5.2.

Clients on all three of these clusters are reporting timeouts.  Recovery
is usually sub-second, but I've seen it take as long as 10 seconds to
recover.

According to the folks in the #gluster IRC channel, these errors are
always due to network problems.  According to the network folks here,
there are no indications of network problems that we can find: no
firewall logs indicating blocked traffic, and no failures reported by
the switches.

We see these errors much more frequently on the first two clusters:
usually daily, though not at consistent times.  Different clients report
failure at different times.  If it were a network issue, I would expect
the failures to line up more consistently across clients.
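
One way to check how well the failures line up is to pool the disconnect
timestamps from every client and group those that fall close together.
The Python sketch below is only an illustration: the function name
group_incidents and the 10-second window are assumptions, and the
(timestamp, client) pairs would have to be extracted from each client's
log first.

```python
from datetime import datetime, timedelta


def group_incidents(events, window=timedelta(seconds=10)):
    """Group (timestamp, client_name) pairs into incidents.

    An event joins the current incident if it falls within `window`
    of that incident's first timestamp; otherwise it starts a new one.
    Returns a list of (first_timestamp, set_of_clients).
    """
    incidents = []
    for ts, client in sorted(events):
        if incidents and ts - incidents[-1][0] <= window:
            incidents[-1][1].add(client)
        else:
            incidents.append((ts, {client}))
    return incidents


def shared_incidents(events, window=timedelta(seconds=10)):
    """Return only the incidents seen by more than one client."""
    return [i for i in group_incidents(events, window) if len(i[1]) > 1]
```

If most incidents involve a single client, that would fit the pattern
described above.  If many involve several clients at once, a shared
cause such as the network or a busy brick would look more likely.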

I'm not ruling out network issues, but the fact that we see any errors
at all on the third cluster, where all traffic stays on one subnet,
makes a network cause seem less likely.

What additional troubleshooting steps are recommended?  I know these
kinds of transient errors are the worst bug reports to receive, but I
simply don't know how to make the situation reproduce on demand.

Thanks,
Scott