<html><body><div style="font-family: garamond,new york,times,serif; font-size: 12pt; color: #000000"><div>Let me copy-paste the code from DHT here for easy reference:<br></div><div><br></div><div>&lt;code&gt;</div><div><p style="margin: 0px;" data-mce-style="margin: 0px;">&nbsp; &nbsp; &nbsp; &nbsp; linked_inode = inode_link (entry_loc.inode, loc-&gt;inode, entry-&gt;d_name, &amp;entry-&gt;d_stat); <br>&nbsp; &nbsp; &nbsp; &nbsp; inode = entry_loc.inode; <br>&nbsp; &nbsp; &nbsp; &nbsp; entry_loc.inode = linked_inode; <br>&nbsp; &nbsp; &nbsp; &nbsp; inode_unref (inode); <br>&lt;/code&gt;</p><p style="margin: 0px;" data-mce-style="margin: 0px;"><br></p><p style="margin: 0px;" data-mce-style="margin: 0px;">-Krutika</p></div><div><br></div><hr id="zwchr"><blockquote style="border-left:2px solid #1010FF;margin-left:5px;padding-left:5px;color:#000;font-weight:normal;font-style:normal;text-decoration:none;font-family:Helvetica,Arial,sans-serif;font-size:12pt;"><b>From: </b>"Krutika Dhananjay" &lt;kdhananj@redhat.com&gt;<br><b>To: </b>"Emmanuel Dreyfus" &lt;manu@netbsd.org&gt;<br><b>Cc: </b>"Gluster Devel" &lt;gluster-devel@gluster.org&gt;<br><b>Sent: </b>Monday, December 1, 2014 5:46:43 PM<br><b>Subject: </b>Re: [Gluster-devel] spurious error in self-heald.t<br><div><br></div><div style="font-family: garamond,new york,times,serif; font-size: 12pt; color: #000000"><div>Hi,</div><div><br></div><div>So here is what seems to be happening:</div><div><br></div><div>The self-heal daemon is implemented so that the shd on each node contains one healer thread for every brick local to that node.</div><div>Since all bricks in our regression tests are on the same node, the lone self-heal daemon contains two healer threads.</div><div><br></div><div>At one point, both healer threads happen to perform readdirp() simultaneously on the same directory (the call coming from syncop_readdirp() in afr_shd_full_sweep()).</div><div>The respective client xlators receive the responses. 
Both of them call inode_find() to see whether the shared inode table already contains an entry for this gfid (refer to&nbsp;unserialize_rsp_direntp() in client-helpers.c).</div><div>Neither finds the corresponding inode in the inode table, so each calls inode_new() and allocates a new in-memory inode (say I1 and I2, respectively). At that point these inodes are not completely filled in, so ia_gfid, ia_type, etc. are all zeroes.</div><div><br></div><div>Both clients then unwind the calls to their parent, AFR, which performs inode_link() via gf_link_inodes_from_dirent().</div><div>Healer thread-1 races ahead, successfully links its inode (I1) into the inode table and releases the mutexes.</div><div>Healer thread-2 then enters __inode_link(), where a call to inode_find() finds the in-memory inode object (I1) that thread-1 just linked, so the function returns I1 as @link_inode.</div><div><br></div><div>What AFR should have done after that call was set entry-&gt;inode to point to link_inode. 
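<div>To make the interleaving concrete, here is a minimal, single-threaded C sketch that replays the two healer threads in a fixed order against a toy inode table. All names in it (toy_inode, toy_table, toy_find, toy_new, toy_link, toy_unref, toy_link_and_adopt) are hypothetical and are not the libglusterfs API; the real inode table is mutex-protected and does considerably more bookkeeping. It only illustrates the point being made: when linking hands back an inode that is already in the table, the caller has to switch its pointer to that inode and drop the ref on the one it allocated, which is what the three DHT statements do.</div><pre>
```c
#include "assert.h"
#include "stdlib.h"
#include "string.h"

/* Hypothetical, heavily simplified stand-ins for the shared inode table.
 * None of these names are the libglusterfs API. The real table is
 * protected by a mutex; this sketch is single-threaded and simply replays
 * the race in a fixed order (thread-1 links first, then thread-2). */

#define TOY_GFID_LEN 16
#define TOY_SLOTS 8

typedef struct toy_inode {
    unsigned char gfid[TOY_GFID_LEN]; /* all zeroes until the inode is linked */
    int refcount;
} toy_inode;

typedef struct toy_table {
    toy_inode *slots[TOY_SLOTS]; /* linked inodes; the table holds one ref on each */
} toy_table;

static int toy_gfid_is_null(const unsigned char *gfid)
{
    static const unsigned char zero[TOY_GFID_LEN];
    return memcmp(gfid, zero, TOY_GFID_LEN) == 0;
}

/* Look up a linked inode by gfid; returns NULL on a miss (no ref taken). */
static toy_inode *toy_find(toy_table *t, const unsigned char *gfid)
{
    for (int i = 0; i < TOY_SLOTS; i++)
        if (t->slots[i] != NULL && memcmp(t->slots[i]->gfid, gfid, TOY_GFID_LEN) == 0)
            return t->slots[i];
    return NULL;
}

/* Allocate an unlinked inode, as a client does after a lookup miss.
 * The caller owns the single ref. */
static toy_inode *toy_new(void)
{
    toy_inode *in = calloc(1, sizeof *in);
    if (in == NULL)
        abort();
    in->refcount = 1;
    return in;
}

static void toy_unref(toy_inode *in)
{
    if (--in->refcount == 0)
        free(in);
}

/* Link @in under @gfid. If an inode with this gfid is already linked,
 * @in is left untouched (gfid still zero) and the existing inode is
 * returned instead. Either way the caller gets one extra ref on the
 * returned inode. */
static toy_inode *toy_link(toy_table *t, toy_inode *in, const unsigned char *gfid)
{
    toy_inode *existing = toy_find(t, gfid);
    if (existing != NULL) {
        existing->refcount++;
        return existing;
    }
    for (int i = 0; i < TOY_SLOTS; i++) {
        if (t->slots[i] == NULL) {
            memcpy(in->gfid, gfid, TOY_GFID_LEN);
            t->slots[i] = in;
            in->refcount += 2; /* one ref for the table, one for the caller */
            return in;
        }
    }
    abort(); /* toy table full; not exercised here */
}

/* The adopt-the-linked-inode pattern: point the entry at whatever inode
 * the table returned and drop the ref on the locally allocated one. */
static toy_inode *toy_link_and_adopt(toy_table *t, toy_inode **entry_inode,
                                     const unsigned char *gfid)
{
    toy_inode *linked = toy_link(t, *entry_inode, gfid);
    toy_inode *old = *entry_inode;
    *entry_inode = linked;
    toy_unref(old);
    return linked;
}
```
</pre><div>For example, with I1 and I2 both allocated after a lookup miss, calling toy_link_and_adopt() for thread-1 and then for thread-2 leaves both entry pointers at I1 (with a non-null gfid) and frees I2. Calling toy_link() without the adopt step would leave thread-2 holding I2 with its all-zero gfid, which is the situation that leads to the crash.</div>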
But because AFR does not do this, entry-&gt;inode continues to hold I2, which was not initialised.</div><div>The subsequent syncop_opendir() is then called with entry-&gt;inode (whose gfid is all zeroes), and the process crashes in the client xlator at the null-gfid check it performs there.</div><div><br></div><div>I think a proper fix would be something similar to what DHT does, for instance in gf_defrag_fix_layout(), in the three C statements following the inode_link() call in dht-rebalance.c.<br></div><div>Basically, AFR must set entry-&gt;inode to point to the linked inode object.</div><div><br></div><div>Hope that helps.</div><div><br></div><div>-Krutika</div><div><br></div><div><br></div><hr id="zwchr"><blockquote style="border-left:2px solid #1010FF;margin-left:5px;padding-left:5px;color:#000;font-weight:normal;font-style:normal;text-decoration:none;font-family:Helvetica,Arial,sans-serif;font-size:12pt;"><b>From: </b>"Emmanuel Dreyfus" &lt;manu@netbsd.org&gt;<br><b>To: </b>"Krutika Dhananjay" &lt;kdhananj@redhat.com&gt;<br><b>Cc: </b>"Emmanuel Dreyfus" &lt;manu@netbsd.org&gt;, "Gluster Devel" &lt;gluster-devel@gluster.org&gt;<br><b>Sent: </b>Monday, December 1, 2014 2:44:25 PM<br><b>Subject: </b>Re: [Gluster-devel] spurious error in self-heald.t<br><div><br></div>On Mon, Dec 01, 2014 at 04:00:46AM -0500, Krutika Dhananjay wrote:<br>&gt; Was able to recreate it. Thanks for the report. Will look<br>&gt; into why this could possibly happen. <br><div><br></div>I posted the symptom workaround:<br>http://review.gluster.com/9216<br><div><br></div>Would it be admissible as an interim measure? At least it<br>avoids a crash.<br><div><br></div>-- <br>Emmanuel Dreyfus<br>manu@netbsd.org<br></blockquote><div><br></div></div><br>_______________________________________________<br>Gluster-devel mailing list<br>Gluster-devel@gluster.org<br>http://supercolony.gluster.org/mailman/listinfo/gluster-devel<br></blockquote><div><br></div></div></body></html>