<br><br><div class="gmail_quote">On Tue Dec 16 2014 at 8:46:48 AM Shyam <<a href="mailto:srangana@redhat.com">srangana@redhat.com</a>> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">On 12/15/2014 09:06 PM, Anand Avati wrote:<br>
> Replies inline<br>
><br>
> On Mon Dec 15 2014 at 12:46:41 PM Shyam <<a href="mailto:srangana@redhat.com" target="_blank">srangana@redhat.com</a><br>
> <mailto:<a href="mailto:srangana@redhat.com" target="_blank">srangana@redhat.com</a>>> wrote:<br>
><br>
> With the changes present in [1] and [2],<br>
><br>
> A short explanation of the change: we encode the subvol ID in the<br>
> d_off, losing n+1 bits when the high-order n+1 bits of the d_off<br>
> returned by the underlying xlator are not free. (Best to read the<br>
> commit message for [1] :) )<br>
><br>
> Although not related to the latest patch, here is something to consider<br>
> for the future:<br>
><br>
> We now have DHT, AFR, EC(?), and DHT over DHT (Tier), all of which need<br>
> subvol encoding in the returned readdir offset. Due to this, the loss<br>
> of bits _may_ cause unwanted offset behavior under the current scheme,<br>
> since we would end up consuming more bits than we do at present.<br>
><br>
> Or IOW, we could be invalidating the assumption "both EXT4/XFS are<br>
> tolerant in terms of the accuracy of the value presented<br>
> back in seekdir().<br>
><br>
><br>
> XFS has not been a problem, since it always returns 32-bit d_offs. With<br>
> Ext4, it has been noted that it is tolerant of sacrificing accuracy in<br>
> the lower bits.<br>
><br>
> i.e, a seekdir(val) actually seeks to the entry which<br>
> has the "closest" true offset."<br>
><br>
> Should we reconsider an in-memory _cookie_-like approach that can help<br>
> in this case?<br>
><br>
> It would invalidate (some or all based on the implementation) the<br>
> following constraints that the current design resolves, (from, [1])<br>
> - Nothing to "remember in memory" or evict "old entries".<br>
> - Works fine across NFS server reboots and also NFS head failover.<br>
> - Tolerant to seekdir() to arbitrary locations.<br>
><br>
> But it would provide a more reliable readdir offset for use (when valid<br>
> and not evicted, say).<br>
><br>
> How would NFS adapt to this? Does Ganesha need a better scheme when<br>
> doing multi-head NFS fail over?<br>
><br>
><br>
> Ganesha just offloads the responsibility to the FSAL layer to give<br>
> stable dir cookies (as it rightly should)<br>
><br>
><br>
> Thoughts?<br>
><br>
><br>
> I think we need to analyze the actual assumption/problem here.<br>
> Remembering things in memory comes with the limitations you note above,<br>
> and may after all, still not be necessary. Let's look at the two<br>
> approaches taken:<br>
><br>
> - Small backend offsets: like XFS, the offsets fit in 32bits, and we are<br>
> left with another 32bits of freedom to encode what we want. There is no<br>
> problem here until our nested encoding requirements cross 32bits of<br>
> space. So let's ignore this for now.<br>
><br>
> - Large backend offsets: Ext4 being the primary target. Here we observe<br>
> that the backend filesystem is tolerant to sacrificing the accuracy of<br>
> lower bits. So we overwrite the lower bits with our subvolume encoding<br>
> information, and the number of bits used to encode is implicit in the<br>
> subvolume cardinality of that translator. While this works fine with a<br>
> single transformation, it is clearly a problem when the transformation<br>
> is nested with the same algorithm. The reason is quite simple: while the<br>
> lower bits were disposable when the cookie was taken fresh from Ext4,<br>
> once transformed the same lower bits are now "holy" and cannot be<br>
> overwritten carelessly, at least without dire consequences. The higher<br>
> level xlators need to take up the "next higher bits", past the previous<br>
> transformation boundary, to encode the next subvolume information. Once<br>
> the d_off transformation algorithms are fixed to give such due "respect"<br>
> to the lower layer's transformation and use a different real estate, we<br>
> might actually notice that the problem may not need such a deep redesign<br>
> after all.<br>
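The transformation discipline described above can be sketched as follows. This is a hedged illustration in Python with made-up names, not the actual DHT/AFR itransform code: each layer claims ceil(log2(nsubvols)) bits, and a nested layer must start past the previous layer's boundary instead of clobbering it.

```python
def encode(d_off, subvol_id, nsubvols, shift=0):
    """Overwrite bits [shift, shift+nbits) of d_off with subvol_id.

    nbits = ceil(log2(nsubvols)). At shift=0 this sacrifices ext4's
    disposable low bits; a higher-level xlator must pass the previous
    layer's returned boundary as `shift`, so it uses the *next higher*
    bits rather than destroying the inner encoding.
    Returns (transformed d_off, new bit boundary).
    """
    nbits = max(1, (nsubvols - 1).bit_length())  # ceil(log2(n))
    mask = (1 << nbits) - 1
    assert subvol_id <= mask
    return (d_off & ~(mask << shift)) | (subvol_id << shift), shift + nbits

def decode(d_off, nsubvols, shift=0):
    """Recover the subvol_id a layer stored at the given boundary."""
    nbits = max(1, (nsubvols - 1).bit_length())
    return (d_off >> shift) & ((1 << nbits) - 1)

# AFR (2 replicas) encodes at bit 0; DHT (128 subvols) stacks above it.
doff, boundary = encode(0b101100000, 1, 2)
doff, boundary = encode(doff, 5, 128, boundary)
```

With this, `decode(doff, 2)` still yields 1 and `decode(doff, 128, 1)` yields 5: the nested transformation did not disturb the inner one.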
<br>
Agreed; what I lack, though, is an understanding of how many bits can be<br>
sacrificed for ext4. I do not have that data; any pointers there would<br>
help. (I did go through <a href="https://lwn.net/Articles/544520/" target="_blank">https://lwn.net/Articles/544520/</a>, but that does not<br>
have the tolerance information in it)<br>
<br>
Here is what I have as the current bits lost based on the following<br>
volume configuration,<br>
- 2 Tiers (DHT over DHT)<br>
- 128 subvols per DHT<br>
- Each DHT instance is either AFR or EC subvolumes, with 2 replicas and<br>
say 6 bricks per EC instance<br>
<br>
So the EC side of the volume needs ceil(log2(6)) (EC) + ceil(log2(128))<br>
(DHT) + ceil(log2(2)) (Tier) = 3 + 7 + 1, or 11 bits of the actual d_off<br>
used to encode the volume, +1 for the high-order bit to denote the<br>
encoding. (AFR would need 1 bit less, so we can consider just the EC<br>
side of things for the maximum-loss computation at present)<br>
<br>
Is 12 bits still a tolerable loss for ext4? Or, till how many bits can<br>
we still use the current scheme?<br>
<br>
If we move to a 1000/10000-node gluster in 4.0, assuming everything<br>
remains the same except DHT, we need an additional 3-7 bits for the DHT<br>
subvol encoding (ceil(log2(1024)) = 10 and ceil(log2(16384)) = 14, vs. 7<br>
today). Would this still survive the ext4 encoding scheme for d_off?<br>
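The bit budget worked out above can be checked with a small helper (a sketch; the function name and the +1 marker-bit convention follow the mail, not any Gluster API):

```python
import math

def bits_needed(fanouts):
    """Bits of d_off consumed by nested subvol encoding.

    One field of ceil(log2(n)) bits per nested layer, plus the single
    high-order marker bit that denotes an encoded d_off.
    """
    return sum(max(1, math.ceil(math.log2(n))) for n in fanouts) + 1

# Config from the thread: 6-brick EC, 128-subvol DHT, 2-way Tier.
print(bits_needed([6, 128, 2]))    # 3 + 7 + 1 + 1 = 12

# Hypothetical 4.0 scale-out: 1024 or 16384 DHT subvols.
print(bits_needed([6, 1024, 2]))   # 15
print(bits_needed([6, 16384, 2]))  # 19
```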
<br></blockquote><div><br></div><div><br></div><div>In theory, we need at least ceil(log2(# of bricks)) bits to store the information. If we are creative enough in making the various layers co-operate, we could get away with just that minimum, independent of the number of xlator layers.</div><div><br></div><div>One example approach (not necessarily the best): make every xlator know the total number of leaf xlators (protocol/clients), and also the number of leaf xlators under each of its subvolumes. This way, the protocol/client xlators (alone) do the encoding, each knowing its global brick # and the total # of bricks. The cluster xlators blindly forward the readdir_cbk without any further transformation of the d_offs, and route the next readdir(old_doff) request to the appropriate subvolume based on the weighted graph (of counts of protocol/clients in the subtrees) until it reaches the right protocol/client to resume the enumeration.</div><div><br></div><div>There may be better or even simpler approaches too (especially one that does not need global awareness of xlator counts), and finding such a stateless solution, while remaining NFS friendly, is well worth the effort IMO.</div><div><br></div><div>Thanks</div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
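The "global brick #" idea above can be sketched like this (a hedged illustration with hypothetical names, not Gluster code): only the leaves transform the d_off, once, and cluster nodes route a resumed readdir by leaf counts.

```python
class Leaf:
    """Stands in for a protocol/client xlator."""
    leaf_count = 1
    def __init__(self, brick_id):
        self.brick_id = brick_id  # global brick number
    def resolve(self, local_id):
        assert local_id == 0
        return self

class Cluster:
    """Stands in for a cluster xlator (DHT/AFR/EC/Tier)."""
    def __init__(self, subvols):
        self.subvols = subvols
        self.leaf_count = sum(s.leaf_count for s in subvols)
    def resolve(self, brick_id):
        # Walk the weighted graph: skip subtrees whose leaf ranges
        # precede brick_id, descend into the one containing it.
        for s in self.subvols:
            if brick_id < s.leaf_count:
                return s.resolve(brick_id)
            brick_id -= s.leaf_count
        raise ValueError("brick_id out of range")

def leaf_encode(raw_doff, brick_id, total_bricks):
    # The single transformation: overwrite ext4's disposable low
    # ceil(log2(total_bricks)) bits with the global brick number.
    nbits = max(1, (total_bricks - 1).bit_length())
    return (raw_doff & ~((1 << nbits) - 1)) | brick_id

def brick_of(doff, total_bricks):
    nbits = max(1, (total_bricks - 1).bit_length())
    return doff & ((1 << nbits) - 1)
```

For example, with `tree = Cluster([Cluster([Leaf(0), Leaf(1)]), Cluster([Leaf(2), Leaf(3), Leaf(4)])])`, a readdir resuming at `doff` is routed via `tree.resolve(brick_of(doff, tree.leaf_count))` straight to the right leaf, with no intermediate layer consuming any extra bits.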
><br>
> Hope that helps<br>
> Thanks<br>
><br>
> Shyam<br>
> [1] <a href="http://review.gluster.org/#/c/4711/" target="_blank">http://review.gluster.org/#/c/4711/</a><br>
> [2] <a href="http://review.gluster.org/#/c/8201/" target="_blank">http://review.gluster.org/#/c/8201/</a><br>
> _______________________________________________<br>
> Gluster-devel mailing list<br>
> <a href="mailto:Gluster-devel@gluster.org" target="_blank">Gluster-devel@gluster.org</a><br>
> <a href="http://supercolony.gluster.org/mailman/listinfo/gluster-devel" target="_blank">http://supercolony.gluster.org/mailman/listinfo/gluster-devel</a><br>
><br>
</blockquote></div>