<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">There are a few suspicious things going on here..</div><div class="gmail_quote"><br></div><div class="gmail_quote">On Tue, May 20, 2014 at 10:07 PM, Pranith Kumar Karampuri <span dir="ltr">&lt;<a href="mailto:pkarampu@redhat.com" target="_blank">pkarampu@redhat.com</a>&gt;</span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="HOEnZb"><div class="h5"><br>
> > > hi,
> > >      crypt.t is failing regression builds once in a while and most of
> > > the times it is because of the failures just after the remount in the
> > > script.
> > >
> > > TEST rm -f $M0/testfile-symlink
> > > TEST rm -f $M0/testfile-link
> > >
> > > Both of these are failing with ENOTCONN. I got a chance to look at
> > > the logs. According to the brick logs, this is what I see:
> > > [2014-05-17 05:43:43.363979] E [posix.c:2272:posix_open]
> > > 0-patchy-posix: open on /d/backends/patchy1/testfile-symlink:
> > > Transport endpoint is not connected

posix_open() happening on a symlink? This should NEVER happen. glusterfs itself should NEVER EVER be triggering symlink resolution on the server. In this case, for whatever reason, an open() is attempted on a symlink, and it is getting followed back onto gluster's own mount point (the test case creates an absolute symlink).
So first find out who is triggering fop->open() on a symlink, and fix the caller.

Next: add a check in posix_open() to fail with ELOOP or EINVAL if the inode is a symlink.
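
A minimal sketch of that second check, assuming nothing about the surrounding
posix.c code (the helper name and the resolved backend-path argument are
hypothetical, not the actual gluster patch):

    #include <errno.h>
    #include <fcntl.h>
    #include <sys/stat.h>

    /* Refuse to open a symlink from the brick's backend, so the kernel
     * never follows it out of the export (possibly back onto our own
     * FUSE mount). */
    static int
    open_no_symlink (const char *real_path, int flags)
    {
            struct stat st;

            /* lstat() does not follow the link, so it is safe here. */
            if (lstat (real_path, &st) == 0 && S_ISLNK (st.st_mode)) {
                    errno = ELOOP;  /* or EINVAL, as suggested above */
                    return -1;
            }

            /* O_NOFOLLOW makes open(2) itself fail with ELOOP if the
             * final path component is a symlink, closing the race
             * between the lstat() and the open(). */
            return open (real_path, flags | O_NOFOLLOW);
    }

On Linux, O_NOFOLLOW alone would be enough; the explicit lstat() just makes
the ELOOP-vs-EINVAL choice ours instead of the kernel's.
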
> > >
> > > This is the very first time I saw posix failing with ENOTCONN. Do we
> > > have these bricks on some other network mounts? I wonder why it fails
> > > with ENOTCONN.
> > >
> > > I also see that it happens right after a call_bail on the mount.
> > >
> > > Pranith
> >
> > Hello.
> > OK, I'll try to reproduce it.
>
> I tried re-creating the issue on my Fedora VM and it happened just now.
> When this issue happens I am not able to attach gdb to the process. From
> /proc, the threads have been in the following state for a while:
>
> root@pranith-vm1 - /proc/4053/task
> 10:20:50 :) ⚡ for i in `ls`; do cat $i/stack; echo "---------------------------------"; done
> [<ffffffff811ed8ce>] ep_poll+0x21e/0x330
> [<ffffffff811ee7b5>] SyS_epoll_wait+0xd5/0x100
> [<ffffffff816533d9>] system_call_fastpath+0x16/0x1b
> [<ffffffffffffffff>] 0xffffffffffffffff
> ---------------------------------
> [<ffffffff8108cb6d>] hrtimer_nanosleep+0xad/0x170
> [<ffffffff8108cc96>] SyS_nanosleep+0x66/0x80
> [<ffffffff816533d9>] system_call_fastpath+0x16/0x1b
> [<ffffffffffffffff>] 0xffffffffffffffff
> ---------------------------------
> [<ffffffff81079271>] do_sigtimedwait+0x161/0x200
> [<ffffffff81079386>] SYSC_rt_sigtimedwait+0x76/0xd0
> [<ffffffff810793ee>] SyS_rt_sigtimedwait+0xe/0x10
> [<ffffffff816533d9>] system_call_fastpath+0x16/0x1b
> [<ffffffffffffffff>] 0xffffffffffffffff
> ---------------------------------
> [<ffffffff810c277a>] futex_wait_queue_me+0xda/0x140
> [<ffffffff810c32be>] futex_wait+0x17e/0x290
> [<ffffffff810c4e26>] do_futex+0xe6/0xc30
> [<ffffffff810c59e1>] SyS_futex+0x71/0x150
> [<ffffffff816533d9>] system_call_fastpath+0x16/0x1b
> [<ffffffffffffffff>] 0xffffffffffffffff
> ---------------------------------
> [<ffffffff810c277a>] futex_wait_queue_me+0xda/0x140
> [<ffffffff810c32be>] futex_wait+0x17e/0x290
> [<ffffffff810c4e26>] do_futex+0xe6/0xc30
> [<ffffffff810c59e1>] SyS_futex+0x71/0x150
> [<ffffffff816533d9>] system_call_fastpath+0x16/0x1b
> [<ffffffffffffffff>] 0xffffffffffffffff
> ---------------------------------
> [<ffffffff810c277a>] futex_wait_queue_me+0xda/0x140
> [<ffffffff810c32be>] futex_wait+0x17e/0x290
> [<ffffffff810c4e26>] do_futex+0xe6/0xc30
> [<ffffffff810c59e1>] SyS_futex+0x71/0x150
> [<ffffffff816533d9>] system_call_fastpath+0x16/0x1b
> [<ffffffffffffffff>] 0xffffffffffffffff
> ---------------------------------
> [<ffffffffa0426229>] wait_answer_interruptible+0x89/0xd0 [fuse]  <<----------- This is the important thing I think
> [<ffffffffa0426612>] __fuse_request_send+0x232/0x290 [fuse]
> [<ffffffffa0426682>] fuse_request_send+0x12/0x20 [fuse]
> [<ffffffffa042ebea>] fuse_do_open+0xca/0x170 [fuse]
> [<ffffffffa042ee06>] fuse_open_common+0x56/0x80 [fuse]
> [<ffffffffa042ee40>] fuse_open+0x10/0x20 [fuse]
> [<ffffffff811a6e4b>] do_dentry_open+0x1eb/0x280
> [<ffffffff811a6f11>] finish_open+0x31/0x40
> [<ffffffff811b77ba>] do_last+0x4ca/0xe00
> [<ffffffff811b8510>] path_openat+0x420/0x690
> [<ffffffff811b8e4a>] do_filp_open+0x3a/0x90
> [<ffffffff811a82ee>] do_sys_open+0x12e/0x210
> [<ffffffff811a83ee>] SyS_open+0x1e/0x20
> [<ffffffff816533d9>] system_call_fastpath+0x16/0x1b
> [<ffffffffffffffff>] 0xffffffffffffffff
> ---------------------------------
> [<ffffffff810c277a>] futex_wait_queue_me+0xda/0x140
> [<ffffffff810c32be>] futex_wait+0x17e/0x290
> [<ffffffff810c4e26>] do_futex+0xe6/0xc30
> [<ffffffff810c59e1>] SyS_futex+0x71/0x150
> [<ffffffff816533d9>] system_call_fastpath+0x16/0x1b
> [<ffffffffffffffff>] 0xffffffffffffffff
> ---------------------------------
> [<ffffffff810c277a>] futex_wait_queue_me+0xda/0x140
> [<ffffffff810c32be>] futex_wait+0x17e/0x290
> [<ffffffff810c4e26>] do_futex+0xe6/0xc30
> [<ffffffff810c59e1>] SyS_futex+0x71/0x150
> [<ffffffff816533d9>] system_call_fastpath+0x16/0x1b
> [<ffffffffffffffff>] 0xffffffffffffffff
> ---------------------------------
> [<ffffffff8108cb6d>] hrtimer_nanosleep+0xad/0x170
> [<ffffffff8108cc96>] SyS_nanosleep+0x66/0x80
> [<ffffffff816533d9>] system_call_fastpath+0x16/0x1b
> [<ffffffffffffffff>] 0xffffffffffffffff
> ---------------------------------
>
> I don't know how to debug this further, but it seems like the system call hung.

The threads in the above process belong to glusterfsd, and glusterfsd is ending up making an open() attempt on a FUSE mount (its own). Pretty obvious that it is deadlocking: the open() loops back into gluster's own stack, so the request can never complete. Find the open()er on the symlink and you have your fix.
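
One way to spot the culprit links without risking the hang is to inspect them
with readlink(), which never follows the target. A minimal sketch, assuming
only a brick-side path (the helper name and output format are made up for
illustration):

    #include <limits.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Warn about symlinks whose absolute target escapes the export
     * directory -- exactly the kind that can point back into a
     * glusterfs mount and re-enter FUSE when followed server-side. */
    static void
    inspect_link (const char *brick_path)
    {
            char    target[PATH_MAX];
            ssize_t n = readlink (brick_path, target, sizeof (target) - 1);

            if (n < 0) {
                    perror ("readlink");
                    return;
            }
            target[n] = '\0';

            if (target[0] == '/')
                    fprintf (stderr, "%s -> %s: absolute target, may loop "
                             "back into a fuse mount\n", brick_path, target);
    }
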
Avati