<html><body><div style="font-family: garamond,new york,times,serif; font-size: 12pt; color: #000000"><div><br></div><div><br></div><hr id="zwchr"><blockquote style="border-left:2px solid #1010FF;margin-left:5px;padding-left:5px;color:#000;font-weight:normal;font-style:normal;text-decoration:none;font-family:Helvetica,Arial,sans-serif;font-size:12pt;" data-mce-style="border-left: 2px solid #1010FF; margin-left: 5px; padding-left: 5px; color: #000; font-weight: normal; font-style: normal; text-decoration: none; font-family: Helvetica,Arial,sans-serif; font-size: 12pt;"><b>From: </b>"Anand Avati" <aavati@redhat.com><br><b>To: </b>"Krutika Dhananjay" <kdhananj@redhat.com>, "Pranith Kumar Karampuri" <pkarampu@redhat.com><br><b>Cc: </b>gluster-devel@gluster.org<br><b>Sent: </b>Thursday, June 5, 2014 12:38:41 PM<br><b>Subject: </b>Re: [Gluster-devel] Regarding doing away with refkeeper in locks xlator<br><div><br></div>On 6/4/14, 9:43 PM, Krutika Dhananjay wrote:<br>><br>><br>> ------------------------------------------------------------------------<br>><br>> *From: *"Pranith Kumar Karampuri" <pkarampu@redhat.com><br>> *To: *"Krutika Dhananjay" <kdhananj@redhat.com>, "Anand Avati"<br>> <aavati@redhat.com><br>> *Cc: *gluster-devel@gluster.org<br>> *Sent: *Wednesday, June 4, 2014 12:23:59 PM<br>> *Subject: *Re: [Gluster-devel] Regarding doing away with refkeeper<br>> in locks xlator<br>><br>><br>> On 06/04/2014 12:02 PM, Pranith Kumar Karampuri wrote:<br>><br>><br>> On 06/04/2014 11:37 AM, Krutika Dhananjay wrote:<br>><br>> Hi,<br>><br>> Recently there was a crash in locks translator (BZ 1103347,<br>> BZ 1097102) with the following backtrace:<br>> (gdb) bt<br>> #0 uuid_unpack (in=0x8 <Address 0x8 out of bounds>,<br>> uu=0x7fffea6c6a60) at ../../contrib/uuid/unpack.c:44<br>> #1 0x00007feeba9e19d6 in uuid_unparse_x (uu=<value<br>> optimized out>, out=0x2350fc0<br>> "081bbc7a-7551-44ac-85c7-aad5e2633db9",<br>> fmt=0x7feebaa08e00<br>> 
"%08x-%04x-%04x-%02x%02x-%02x%02x%02x%02x%02x%02x") at<br>> ../../contrib/uuid/unparse.c:55<br>> #2 0x00007feeba9be837 in uuid_utoa (uuid=0x8 <Address 0x8<br>> out of bounds>) at common-utils.c:2138<br>> #3 0x00007feeb06e8a58 in pl_inodelk_log_cleanup<br>> (this=0x230d910, ctx=0x7fee700f0c60) at inodelk.c:396<br>> #4 pl_inodelk_client_cleanup (this=0x230d910,<br>> ctx=0x7fee700f0c60) at inodelk.c:428<br>> #5 0x00007feeb06ddf3a in pl_client_disconnect_cbk<br>> (this=0x230d910, client=<value optimized out>) at posix.c:2550<br>> #6 0x00007feeba9fa2dd in gf_client_disconnect<br>> (client=0x27724a0) at client_t.c:368<br>> #7 0x00007feeab77ed48 in server_connection_cleanup<br>> (this=0x2316390, client=0x27724a0, flags=<value optimized<br>> out>) at server-helpers.c:354<br>> #8 0x00007feeab77ae2c in server_rpc_notify (rpc=<value<br>> optimized out>, xl=0x2316390, event=<value optimized out>,<br>> data=0x2bf51c0) at server.c:527<br>> #9 0x00007feeba775155 in rpcsvc_handle_disconnect<br>> (svc=0x2325980, trans=0x2bf51c0) at rpcsvc.c:720<br>> #10 0x00007feeba776c30 in rpcsvc_notify (trans=0x2bf51c0,<br>> mydata=<value optimized out>, event=<value optimized out>,<br>> data=0x2bf51c0) at rpcsvc.c:758<br>> #11 0x00007feeba778638 in rpc_transport_notify (this=<value<br>> optimized out>, event=<value optimized out>, data=<value<br>> optimized out>) at rpc-transport.c:512<br>> #12 0x00007feeb115e971 in socket_event_poll_err (fd=<value<br>> optimized out>, idx=<value optimized out>, data=0x2bf51c0,<br>> poll_in=<value optimized out>, poll_out=0,<br>> poll_err=0) at socket.c:1071<br>> #13 socket_event_handler (fd=<value optimized out>,<br>> idx=<value optimized out>, data=0x2bf51c0, poll_in=<value<br>> optimized out>, poll_out=0, poll_err=0) at socket.c:2240<br>> #14 0x00007feeba9fc6a7 in event_dispatch_epoll_handler<br>> (event_pool=0x22e2d00) at event-epoll.c:384<br>> #15 event_dispatch_epoll (event_pool=0x22e2d00) at<br>> event-epoll.c:445<br>> #16 0x0000000000407e93 in main 
(argc=19,<br>> argv=0x7fffea6c7f88) at glusterfsd.c:2023<br>> (gdb) f 4<br>> #4 pl_inodelk_client_cleanup (this=0x230d910,<br>> ctx=0x7fee700f0c60) at inodelk.c:428<br>> 428 pl_inodelk_log_cleanup (l);<br>> (gdb) p l->pl_inode->refkeeper<br>> $1 = (inode_t *) 0x0<br>> (gdb)<br>><br>> pl_inode->refkeeper was found to be NULL even when there<br>> were some blocked inodelks in a certain domain of the inode,<br>> which, when dereferenced by the epoll thread in the cleanup<br>> codepath, led to a crash.<br>><br>> On inspecting the code (for want of a consistent<br>> reproducer), three things were found:<br>><br>> 1. The function where the crash happens<br>> (pl_inodelk_log_cleanup()) makes an attempt to resolve the<br>> inode to a path, as can be seen below. But the way inode_path()<br>> itself<br>> works is to first construct the path based on the given<br>> inode's ancestry and place it in the buffer provided. And if<br>> all else fails, the gfid of the inode is placed in a certain<br>> format ("<gfid:%s>").<br>> This eliminates the need for the statements from line 4<br>> through 7 below, thereby "preventing" dereferencing of<br>> pl_inode->refkeeper.<br>> Now, although this change prevents the crash<br>> altogether, it still does not fix the race that led to<br>> pl_inode->refkeeper becoming NULL, and comes at the cost of<br>> printing "(null)" in the log message on line 9 every<br>> time pl_inode->refkeeper is found to be NULL, rendering the<br>> logged messages somewhat useless.<br>><br>> <code><br>> 0 pl_inode = lock->pl_inode;<br>> 1<br>> 2 inode_path (pl_inode->refkeeper, NULL, &path);<br>> 3<br>> 4 if (path)<br>> 5 file = path;<br>> 6 else<br>> 7 file = uuid_utoa<br>> (pl_inode->refkeeper->gfid);<br>> 8<br>> 9 gf_log (THIS->name, GF_LOG_WARNING,<br>> 10 "releasing lock on %s held by "<br>> 11 "{client=%p, pid=%"PRId64" lk-owner=%s}",<br>> 12 file, lock->client, (uint64_t)<br>> lock->client_pid,<br>> 13 lkowner_utoa (&lock->owner));<br>> </code><br>><br>> I think this 
logging code is from the days when the gfid-handle<br>> concept was not there. So it wasn't returning <gfid:gfid-str> in<br>> cases where the path is not present in the dentries. I believe the<br>> else block can be deleted safely now.<br>><br>> Pranith<br>><br>> Posted patch http://review.gluster.org/#/c/7981/1 for review, to fix<br>> problem #1 above.<br>><br>> -Krutika<br>><br>><br>> 2. There is at least one codepath found that can lead to<br>> this crash:<br>> Imagine an inode on which an inodelk operation is<br>> attempted by a client and is successfully granted too.<br>> Now, between the time the lock was granted and<br>> pl_update_refkeeper() was called by this thread, the client<br>> could send a DISCONNECT event,<br>> causing the cleanup codepath to be executed, where the epoll<br>> thread crashes on dereferencing pl_inode->refkeeper, which is<br>> STILL NULL at this point.<br>><br>> Besides, there are still places in the locks xlator where<br>> the refkeeper is NOT updated whenever the lists are modified<br>> - for instance in the cleanup codepath from a DISCONNECT.<br>><br>> 3. Also, pl_update_refkeeper() seems to be not taking into<br>> account blocked locks on the inode in the<br>> __pl_inode_is_empty() check, when it should, as there could<br>> still be cases<br>> where the granted list could be empty but not the<br>> blocked list at the time of updating the refkeeper, in which<br>> case pl_inode must still take a ref on the inode.<br>><br>> Avati,<br>> There is one more issue, where pl_inode_forget is cleaning<br>> up blocked/granted locks which is not handling the new<br>> 'client_list'. But if we fix inode ref/unref like Krutika<br>> proposed, pl_forget should never come when there is a<br>> granted/blocked lock. So maybe we should delete that as well.<br>> Please feel free to comment.<br>><br>> It is pl_forget, not pl_inode_forget. Typo :-(.<br>><br>> Pranith<br>><br>><br>> Pranith<br>><br>><br>> Proposed solution to 2/3:<br>><br>> 1. 
Do away with refkeeper in pl_inode_t altogether.<br>> 2. Let every lock object (inodelk/entrylk/posix_lock) have<br>> an inode_t * member to act as a placeholder for the<br>> associated inode object on which it is locking.<br>> 3. Let each lock object hold a ref on the inode at the time<br>> of its creation and unref the inode during its destruction.<br>><br>> Please let me know what you think of the above.<br>><br>> -Krutika<br>><br>><br>><br>><br>> _______________________________________________<br>> Gluster-devel mailing list<br>> Gluster-devel@gluster.org<br>> http://supercolony.gluster.org/mailman/listinfo/gluster-devel<br>><br>><br>><br><div><br></div>To summarize, the real "problems" are:<br><div><br></div>- Deref of pl_inode->refkeeper as an inode_t in the cleanup logger. It <br>is an internal artifact of how pl_update_refkeeper() works, and nobody else <br>is supposed to "touch" it.</blockquote><div>For this, the solution I have in mind is to</div><div>a. have a placeholder for the gfid in the pl_inode_t object,</div><div>b. remember the gfid of the inode at the time of pl_inode_t creation in pl_ctx_get(), and</div><div>c. print pl_inode->gfid in the log message in pl_{inodelk,entrylk}_log_cleanup().<br></div><blockquote style="border-left:2px solid #1010FF;margin-left:5px;padding-left:5px;color:#000;font-weight:normal;font-style:normal;text-decoration:none;font-family:Helvetica,Arial,sans-serif;font-size:12pt;" data-mce-style="border-left: 2px solid #1010FF; margin-left: 5px; padding-left: 5px; color: #000; font-weight: normal; font-style: normal; text-decoration: none; font-family: Helvetica,Arial,sans-serif; font-size: 12pt;"><div><br></div>- Missing call to pl_update_refkeeper() in client_disconnect_cbk(). 
This <br>can result in a potential leak of an inode ref (not a leak if the same <br>inode gets any locking activity by another client in the future).<br></blockquote><div>As far as updates to the refkeeper during DISCONNECT are concerned, Pranith and I did discuss this at length but could<br>not find a single safe place in the cleanup functions to call pl_update_refkeeper() that does not lead to illegal memory access,</div><div>whether before or after the call to __pl_inodelk_unref() in the third for loop.</div><div><br></div><div>Case #1 explained:<br></div><div>=============<br></div><div><br></div><div><snip><br></div><div>list_for_each_entry_safe() {<br></div><div>...<br></div><div>...<br></div><div>pl_update_refkeeper(inode);<br></div><div>pthread_mutex_lock (&pl_inode->mutex);<br></div><div> __pl_inodelk_unref(l);<br></div><div>pthread_mutex_unlock (&pl_inode->mutex);<br></div><div></snip><br></div><div><br></div><div>This causes the last unref in pl_update_refkeeper() to implicitly call pl_forget(), where the pl_inode_t object is destroyed while it is<br></div><div>still needed, in terms of having to obtain pl_inode->mutex before unrefing the lock object.<br></div><div><br></div><div>Case #2 explained:<br></div><div>=============<br></div><div><div><snip><br></div><div>list_for_each_entry_safe() {<br></div><div>...<br></div><div>...<br></div><div>pthread_mutex_lock (&pl_inode->mutex);<br></div><div> __pl_inodelk_unref(l);<br></div><div>pthread_mutex_unlock (&pl_inode->mutex);<br></div><div>pl_update_refkeeper(inode);</div></snip></div><div><br></div><div>After the first for loop is processed, the domain's lists could have been emptied. 
And at this point, there could well come a second thread that could update the refkeeper,<br></div><div>triggering the last unref on the inode and leading to a call to pl_forget() (where pl_inode_t is freed), after which the DISCONNECT thread would be trying to perform illegal<br></div><div>memory access in its call to pl_update_refkeeper() during its turn.<br></div><div><br></div><div>So the solution Pranith and I came up with involves making sure that the inode_t object is alive for as long as there is at least one lock on it:<br></div><div><br></div><div>1. refkeeper will not be used for inodelks and entrylks (but will continue to be used in fcntl locks).<br></div><div>2. Each pl_inode_t object will also host an inode_t * member which is initialised during a call to pl_inode_get() in pl_common_{inodelk,entrylk}().<br></div><div>3. Every time a lock is granted/blocked, pl_inode->inode is ref'd (in pl_inode_setlk(), after successful locking).<br></div><div>4. Every time a lock is unlocked, pl_inode->inode is unref'd.<br></div><div>5. In the disconnect codepath, as part of "released" list processing, the inodes associated with the client are unref'd in the same loop, right after every lock object is unref'd.</div><div><br></div><div>With all this, there is one remaining problem: a lock object can transiently be found in both the blocked and granted lists when there's a race between the unlocking thread<br></div><div>and the disconnecting thread. This is best explained with the following example:<br></div><div><example><br></div><div>Consider an inode I on which there's a granted lock G and a blocked lock B, from clients C1 and C2 respectively. 
With this approach, at this point the number of refs taken<br></div><div>by the locks xlator on I would be 2.<br></div><div>C1 unlocks G, at which point I's refcount becomes 1.<br></div><div>Now, as part of granting other blocked locks in the unlock codepath, B is put in the granted list.<br></div><div>Now B's client C2 sends a DISCONNECT event, which would cause I to be unref'd again. This being the last unref, pl_forget() is called on I, causing its pl_inode to be destroyed.</div><div>At this point, the unlocking thread could try to do a mutex lock on pl_inode->mutex in grant_blocked_{inode,entry}_locks() (whereas pl_inode is now already destroyed), leading to a crash.</div><div></example><br></div><div><br></div><div>The problem described in the example can be fixed by making sure grant_blocked_{inode,entry}_locks() refs the inode first thing and then unrefs it just before returning.</div><div>This would fix the illegal memory access.<br></div><div> <br></div><blockquote style="border-left:2px solid #1010FF;margin-left:5px;padding-left:5px;color:#000;font-weight:normal;font-style:normal;text-decoration:none;font-family:Helvetica,Arial,sans-serif;font-size:12pt;" data-mce-style="border-left: 2px solid #1010FF; margin-left: 5px; padding-left: 5px; color: #000; font-weight: normal; font-style: normal; text-decoration: none; font-family: Helvetica,Arial,sans-serif; font-size: 12pt;">Nice to have (not necessary):<br><div><br></div>- Include blocked locks in pl_inode_is_empty(), for the sake of <br>symmetry. 
I don't see how this will make a difference though, given the <br>goal of pl_update_refkeeper (which is, ensure inode->ref > 0 while a <br>user's locks exist).</blockquote><div>Done.<br></div><blockquote style="border-left:2px solid #1010FF;margin-left:5px;padding-left:5px;color:#000;font-weight:normal;font-style:normal;text-decoration:none;font-family:Helvetica,Arial,sans-serif;font-size:12pt;" data-mce-style="border-left: 2px solid #1010FF; margin-left: 5px; padding-left: 5px; color: #000; font-weight: normal; font-style: normal; text-decoration: none; font-family: Helvetica,Arial,sans-serif; font-size: 12pt;"><div><br></div>Avati<br><div><br></div></blockquote><div><br></div></div></body></html>