<br><br><div class="gmail_quote">On Tue Dec 16 2014 at 8:46:48 AM Shyam <<a href="mailto:srangana@redhat.com">srangana@redhat.com</a>> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">On 12/15/2014 09:06 PM, Anand Avati wrote:<br>
> Replies inline<br>
><br>
> On Mon Dec 15 2014 at 12:46:41 PM Shyam <<a href="mailto:srangana@redhat.com" target="_blank">srangana@redhat.com</a><br>
> <mailto:<a href="mailto:srangana@redhat.com" target="_blank">srangana@redhat.com</a>>> wrote:<br>
><br>
> With the changes present in [1] and [2],<br>
><br>
> A short explanation of the change: we encode the subvol ID in the<br>
> d_off, losing n+1 bits when the high-order n+1 bits of the d_off<br>
> returned by the underlying xlator are not free. (Best to read the<br>
> commit message for [1] :) )<br>
><br>
> Although not related to the latest patch, here is something to consider<br>
> for the future:<br>
><br>
> We now have DHT, AFR, EC(?), and DHT over DHT (Tier), all of which need<br>
> subvol encoding in the returned readdir offset. Due to this, the loss<br>
> of bits _may_ cause unwanted offset behavior under the current scheme,<br>
> since we would end up consuming more bits than we do at present.<br>
><br>
> Or IOW, we could be invalidating the assumption "both EXT4/XFS are<br>
> tolerant in terms of the accuracy of the value presented<br>
> back in seekdir().<br>
><br>
><br>
> XFS has not been a problem, since it always returns 32-bit d_offs. With<br>
> Ext4, it has been noted that it is tolerant of sacrificing accuracy in<br>
> the lower bits.<br>
><br>
> i.e, a seekdir(val) actually seeks to the entry which<br>
> has the "closest" true offset."<br>
><br>
> Should we reconsider an in-memory _cookie_-like approach that can help<br>
> in this case?<br>
><br>
> It would invalidate (some or all based on the implementation) the<br>
> following constraints that the current design resolves, (from, [1])<br>
> - Nothing to "remember in memory" or evict "old entries".<br>
> - Works fine across NFS server reboots and also NFS head failover.<br>
> - Tolerant to seekdir() to arbitrary locations.<br>
><br>
> But it would provide a more reliable readdir offset for use (when valid<br>
> and not evicted, say).<br>
><br>
> How would NFS adapt to this? Does Ganesha need a better scheme when<br>
> doing multi-head NFS fail over?<br>
><br>
><br>
> Ganesha just offloads the responsibility to the FSAL layer to give<br>
> stable dir cookies (as it rightly should)<br>
><br>
><br>
> Thoughts?<br>
><br>
><br>
> I think we need to analyze the actual assumption/problem here.<br>
> Remembering things in memory comes with the limitations you note above,<br>
> and may after all, still not be necessary. Let's look at the two<br>
> approaches taken:<br>
><br>
> - Small backend offsets: like XFS, the offsets fit in 32bits, and we are<br>
> left with another 32bits of freedom to encode what we want. There is no<br>
> problem here until our nested encoding requirements cross 32bits of<br>
> space. So let's ignore this for now.<br>
><br>
> - Large backend offsets: Ext4 being the primary target. Here we observe<br>
> that the backend filesystem is tolerant to sacrificing the accuracy of<br>
> lower bits. So we overwrite the lower bits with our subvolume encoding<br>
> information, and the number of bits used to encode is implicit in the<br>
> subvolume cardinality of that translator. While this works fine with a<br>
> single transformation, it is clearly a problem when the transformation<br>
> is nested with the same algorithm. The reason is quite simple: while the<br>
> lower bits were disposable when the cookie was taken fresh from Ext4,<br>
> once transformed the same lower bits are now "holy" and cannot be<br>
> overwritten carelessly, at least without dire consequences. The higher<br>
> level xlators need to take up the "next higher bits", past the previous<br>
> transformation boundary, to encode the next subvolume information. Once<br>
> the d_off transformation algorithms are fixed to give such due "respect"<br>
> to the lower layer's transformation and use a different real estate, we<br>
> might actually notice that the problem may not need such a deep redesign<br>
> after all.<br>
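The transformation discipline described above can be sketched as follows. This is a hedged illustration in Python with made-up names, not the actual DHT/AFR itransform code: each layer claims ceil(log2(nsubvols)) bits, and a nested layer must start past the previous layer's boundary instead of clobbering it.

```python
def encode(d_off, subvol_id, nsubvols, shift=0):
    """Overwrite bits [shift, shift+nbits) of d_off with subvol_id.

    nbits = ceil(log2(nsubvols)). At shift=0 this sacrifices ext4's
    disposable low bits; a higher-level xlator must pass the previous
    layer's returned boundary as `shift`, so it uses the *next higher*
    bits rather than destroying the inner encoding.
    Returns (transformed d_off, new bit boundary).
    """
    nbits = max(1, (nsubvols - 1).bit_length())  # ceil(log2(n))
    mask = (1 << nbits) - 1
    assert subvol_id <= mask
    return (d_off & ~(mask << shift)) | (subvol_id << shift), shift + nbits

def decode(d_off, nsubvols, shift=0):
    """Recover the subvol_id a layer stored at the given boundary."""
    nbits = max(1, (nsubvols - 1).bit_length())
    return (d_off >> shift) & ((1 << nbits) - 1)

# AFR (2 replicas) encodes at bit 0; DHT (128 subvols) stacks above it.
doff, boundary = encode(0b101100000, 1, 2)
doff, boundary = encode(doff, 5, 128, boundary)
```

With this, `decode(doff, 2)` still yields 1 and `decode(doff, 128, 1)` yields 5: the nested transformation did not disturb the inner one.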
<br>
Agreed; what I lack, though, is an understanding of how many bits can be<br>
sacrificed for ext4. I do not have that data; any pointers there would<br>
help. (I did go through <a href="https://lwn.net/Articles/544520/" target="_blank">https://lwn.net/Articles/544520/</a>, but that does not<br>
have the tolerance information in it)<br>
<br>
Here is what I have as the current bits lost based on the following<br>
volume configuration,<br>
- 2 Tiers (DHT over DHT)<br>
- 128 subvols per DHT<br>
- Each DHT instance is either AFR or EC subvolumes, with 2 replicas and<br>
say 6 bricks per EC instance<br>
<br>
So the EC side of the volume needs ceil(log2(6)) (EC) + ceil(log2(128))<br>
(DHT) + ceil(log2(2)) (Tier) = 3 + 7 + 1, or 11 bits of the actual d_off<br>
used to encode the volume, +1 for the high-order bit to denote the<br>
encoding. (AFR would need 1 bit less, so we can consider just the EC<br>
side of things for the maximum-loss computation at present)<br>
<br>
Is 12 bits still a tolerable loss for ext4? Or, till how many bits can<br>
we still use the current scheme?<br>
<br>
If we move to a 1000/10000-node gluster in 4.0, assuming everything<br>
remains the same except DHT, we need an additional 3-7 bits for the DHT<br>
subvol encoding (ceil(log2(1024)) = 10 and ceil(log2(16384)) = 14, vs. 7<br>
today). Would this still survive the ext4 encoding scheme for d_off?<br>
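The bit budget worked out above can be checked with a small helper (a sketch; the function name and the +1 marker-bit convention follow the mail, not any Gluster API):

```python
import math

def bits_needed(fanouts):
    """Bits of d_off consumed by nested subvol encoding.

    One field of ceil(log2(n)) bits per nested layer, plus the single
    high-order marker bit that denotes an encoded d_off.
    """
    return sum(max(1, math.ceil(math.log2(n))) for n in fanouts) + 1

# Config from the thread: 6-brick EC, 128-subvol DHT, 2-way Tier.
print(bits_needed([6, 128, 2]))    # 3 + 7 + 1 + 1 = 12

# Hypothetical 4.0 scale-out: 1024 or 16384 DHT subvols.
print(bits_needed([6, 1024, 2]))   # 15
print(bits_needed([6, 16384, 2]))  # 19
```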
<br></blockquote><div><br></div><div><br></div><div>In theory, we need at least ceil(log2(# of bricks)) bits to store the information. If we are creative enough in making the various layers co-operate, we could get away with just that minimum, independent of the number of xlator layers.</div><div><br></div><div>One example approach (not necessarily the best): make every xlator know the total number of leaf xlators (protocol/clients), and also the number of leaf xlators under each of its subvolumes. This way, the protocol/client xlators (alone) do the encoding, each knowing its global brick # and the total # of bricks. The cluster xlators blindly forward the readdir_cbk without any further transformation of the d_offs, and route the next readdir(old_doff) request to the appropriate subvolume based on the weighted graph (of counts of protocol/clients in the subtrees) until it reaches the right protocol/client to resume the enumeration.</div><div><br></div><div>There may be better or even simpler approaches too (especially one that does not need global awareness of xlator counts), and finding such a stateless solution, while remaining NFS friendly, is well worth the effort IMO.</div><div><br></div><div>Thanks</div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
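The "global brick #" idea above can be sketched like this (a hedged illustration with hypothetical names, not Gluster code): only the leaves transform the d_off, once, and cluster nodes route a resumed readdir by leaf counts.

```python
class Leaf:
    """Stands in for a protocol/client xlator."""
    leaf_count = 1
    def __init__(self, brick_id):
        self.brick_id = brick_id  # global brick number
    def resolve(self, local_id):
        assert local_id == 0
        return self

class Cluster:
    """Stands in for a cluster xlator (DHT/AFR/EC/Tier)."""
    def __init__(self, subvols):
        self.subvols = subvols
        self.leaf_count = sum(s.leaf_count for s in subvols)
    def resolve(self, brick_id):
        # Walk the weighted graph: skip subtrees whose leaf ranges
        # precede brick_id, descend into the one containing it.
        for s in self.subvols:
            if brick_id < s.leaf_count:
                return s.resolve(brick_id)
            brick_id -= s.leaf_count
        raise ValueError("brick_id out of range")

def leaf_encode(raw_doff, brick_id, total_bricks):
    # The single transformation: overwrite ext4's disposable low
    # ceil(log2(total_bricks)) bits with the global brick number.
    nbits = max(1, (total_bricks - 1).bit_length())
    return (raw_doff & ~((1 << nbits) - 1)) | brick_id

def brick_of(doff, total_bricks):
    nbits = max(1, (total_bricks - 1).bit_length())
    return doff & ((1 << nbits) - 1)
```

For example, with `tree = Cluster([Cluster([Leaf(0), Leaf(1)]), Cluster([Leaf(2), Leaf(3), Leaf(4)])])`, a readdir resuming at `doff` is routed via `tree.resolve(brick_of(doff, tree.leaf_count))` straight to the right leaf, with no intermediate layer consuming any extra bits.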
><br>
> Hope that helps<br>
> Thanks<br>
><br>
> Shyam<br>
> [1] <a href="http://review.gluster.org/#/c/4711/" target="_blank">http://review.gluster.org/#/c/4711/</a><br>
> [2] <a href="http://review.gluster.org/#/c/8201/" target="_blank">http://review.gluster.org/#/c/8201/</a><br>
> _______________________________________________<br>
> Gluster-devel mailing list<br>
> <a href="mailto:Gluster-devel@gluster.org" target="_blank">Gluster-devel@gluster.org</a><br>
> <a href="http://supercolony.gluster.org/mailman/listinfo/gluster-devel" target="_blank">http://supercolony.gluster.org/mailman/listinfo/gluster-devel</a><br>
><br>
</blockquote></div>