Can you run ls as &#39;strace -Ttc ls&#39; in each of the three runs, then compare the output of the first and third runs to see where most of the time is being spent?<div><br></div><div>Avati<br><br><div class="gmail_quote">On Tue, Mar 26, 2013 at 11:01 AM, Xavier Hernandez <span dir="ltr">&lt;<a href="mailto:xhernandez@datalab.es" target="_blank">xhernandez@datalab.es</a>&gt;</span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi,<br>

<br>

Since one of the improvements seemed to come from reducing the number of directories inside .glusterfs, I&#39;ve modified storage/posix so that instead of creating 2 levels of 256 directories each, it creates 4 levels of 16 directories each.<br>
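To make the two layouts concrete, here is a minimal sketch of how a GFID could map to a path under each scheme. It assumes each level is named after successive leading hex digits of the GFID; the helper name gfid_path and the sample GFID are made up for illustration and are not taken from the storage/posix code. Both layouts give the same number of leaf directories (256 * 256 = 16^4 = 65536); what changes is how many entries each directory holds.<br>

```python
import uuid


def gfid_path(gfid, levels, chars_per_level):
    """Return a relative .glusterfs path for a GFID under a given fan-out.

    Each level consumes `chars_per_level` hex digits of the GFID, so one
    level holds at most 16 ** chars_per_level directories.
    (Hypothetical helper; the digit-to-level mapping is an assumption.)
    """
    text = str(uuid.UUID(str(gfid)))  # canonical lowercase, dashed form
    hex_digits = text.replace("-", "")
    parts = [
        hex_digits[i * chars_per_level:(i + 1) * chars_per_level]
        for i in range(levels)
    ]
    return "/".join([".glusterfs", *parts, text])


gfid = "3f2a9c10-7b4e-4d1a-9e55-0c8d6a1b2e77"  # made-up GFID for illustration

# 2 levels of 256 directories each (two hex digits per level).
print(gfid_path(gfid, levels=2, chars_per_level=2))
# 4 levels of 16 directories each (one hex digit per level).
print(gfid_path(gfid, levels=4, chars_per_level=1))
```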


<br>

With this change, the first and second ls take 0.9 seconds each; the third takes 9 seconds.<br>

<br>

I don&#39;t know what causes such slowness on the third ls, however the second ls has improved a lot.<br>

<br>

Does anyone have any advice?<br>

<br>

Is there any way to improve this? Some tweak of the kernel, XFS, or Gluster?<br>

<br>

Thanks,<br>

<br>

Xavi<br>

<br>

On 26/03/13 11:02, Xavier Hernandez wrote:<div class="HOEnZb"><div class="h5"><br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Hi,<br>

<br>

I&#39;ve reproduced a problem I&#39;ve seen when listing directories that have not been accessed for a long time (some hours). The Gluster version is 3.3.1.<br>

<br>

I&#39;ve run the tests on different hardware and the behavior is quite similar.<br>

<br>

The problem can be clearly seen doing this:<br>

<br>

1. Format bricks with XFS, inode size 512, and mount them<br>

2. Create a gluster volume (I&#39;ve tried several combinations, see later)<br>

3. Start and mount it<br>

4. Create a directory &lt;vol&gt;/dirs and fill it with 300 subdirectories<br>

5. Unmount the volume, stop it and flush kernel caches of all servers (sync ; echo 3 &gt; /proc/sys/vm/drop_caches)<br>

6. Start the volume, mount it, and execute &quot;time ls -l &lt;vol&gt;/dirs | wc -l&quot;<br>

7. Create 80,000 directories at &lt;vol&gt;/ (notice that these directories are not created inside &lt;vol&gt;/dirs)<br>

8. Unmount the volume, stop it and flush kernel caches of all servers (sync ; echo 3 &gt; /proc/sys/vm/drop_caches)<br>

9. Start the volume, mount it, and execute &quot;time ls -l &lt;vol&gt;/dirs | wc -l&quot;<br>

10. Delete directory &lt;vol&gt;/dirs and recreate it, again with 300 subdirectories<br>

11. Unmount the volume, stop it and flush kernel caches of all servers (sync ; echo 3 &gt; /proc/sys/vm/drop_caches)<br>

12. Start the volume, mount it, and execute &quot;time ls -l &lt;vol&gt;/dirs | wc -l&quot;<br>
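The "unmount, stop, flush caches, start, mount, time ls" cycle repeated in steps 5-6, 8-9 and 11-12 could be scripted roughly as below. This is only a sketch under assumptions: the volume name, server name and mount point are placeholders, sysctl is used in place of the echo into /proc/sys/vm/drop_caches, and the cache flush only affects the machine the script runs on, so it would still need to be run on every server. It defaults to a dry run that just prints the commands.<br>

```python
import shlex
import subprocess
import time


def cold_cycle_commands(volume, server, mount_point):
    """Commands for one 'stop, flush caches, restart, remount' cycle."""
    return [
        ["umount", mount_point],
        # --mode=script avoids the interactive confirmation on stop.
        ["gluster", "--mode=script", "volume", "stop", volume],
        ["sync"],
        # Equivalent to writing 3 to /proc/sys/vm/drop_caches (needs root);
        # flushes the local machine only.
        ["sysctl", "-w", "vm.drop_caches=3"],
        ["gluster", "volume", "start", volume],
        ["mount", "-t", "glusterfs", f"{server}:/{volume}", mount_point],
    ]


def timed_listing(directory, dry_run=True):
    """Time 'ls -l' on a directory; returns seconds, or None in a dry run."""
    cmd = ["ls", "-l", directory]
    if dry_run:
        print(shlex.join(cmd))
        return None
    start = time.monotonic()
    subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
    return time.monotonic() - start


def run_cycle(volume, server, mount_point, dry_run=True):
    """Run (or print) one cold-cache cycle, then time the listing of dirs."""
    for cmd in cold_cycle_commands(volume, server, mount_point):
        if dry_run:
            print(shlex.join(cmd))
        else:
            subprocess.run(cmd, check=True)
    return timed_listing(f"{mount_point}/dirs", dry_run=dry_run)


# Placeholder names; dry run only prints what would be executed.
run_cycle("testvol", "server1", "/mnt/testvol")
```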

<br>

With this test, I get the following times:<br>

<br>

first ls: 1 second<br>

second ls: 3.5 seconds<br>

third ls: 10 seconds<br>

<br>

I don&#39;t understand the second ls because the &lt;vol&gt;/dirs directory still have the same 300 subdirectories. But the third one is worst.<br>

<br>

I&#39;ve tried with different kinds of volumes (distributed-replicated, distributed, and even a single brick), and the behavior is the same (though the times are smaller when fewer bricks are involved).<br>

<br>

After reaching this situation, I&#39;ve tried to get back the earlier ls times by deleting directories, but the times do not seem to improve. Only after doing some &quot;dirty&quot; tests, removing empty gfid directories from &lt;vol&gt;/.glusterfs on all bricks, do I get better times, though not as good as the first ls (3 - 4 seconds better than the third ls).<br>


<br>

This is always reproducible if the volume is stopped and the caches are emptied before each ls. With more files and/or directories, it can take up to 20 or more seconds to list a directory with 100-200 subdirectories.<br>


<br>

Without stopping anything, a second ls responds in about 0.2 seconds.<br>

<br>

I&#39;ve also tested this with ext4 and BTRFS (I know it is not supported, but tested anyway). These are the results:<br>

<br>

ext4 first ls: 0.5 seconds<br>

ext4 second ls: 0.8 seconds<br>

ext4 third ls: 7 seconds<br>

<br>

btrfs first ls: 0.5 seconds<br>

btrfs second ls: 0.5 seconds<br>

btrfs third ls: 0.5 seconds<br>

<br>

It seems clear that it depends on the file system, but if I access the bricks directly, every ls takes at most 0.1 seconds to complete.<br>

<br>

Repairing and defragmenting the bricks does not help.<br>

<br>

strace&#39;ing the glusterfs process of the bricks, I see that for each directory a lot of entries from &lt;vol&gt;/.glusterfs are lstat&#39;ed and a lot of lgetxattr are called. For 300 directories I&#39;ve counted more than 4500 lstat&#39;s and more than 5300 lgetxattr, many of them repeated. I&#39;ve also noticed that some lstat&#39;s take from 10 to 60 ms to complete (with XFS).<br>
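Counts like these can be pulled from a saved strace log with a short script. The sketch below assumes the default strace line format (an optional PID prefix, then the syscall name and a quoted path as first argument) and counts a call as "repeated" when the same syscall hits the same path again, ignoring any other arguments. The sample lines are synthetic, written in strace&#39;s style for illustration only.<br>

```python
import re
from collections import Counter

# Matches e.g. 'lstat("/path", ...' with an optional PID prefix (strace -f).
CALL_RE = re.compile(r'^(?:\[pid\s+\d+\]\s+|\d+\s+)?(\w+)\("([^"]*)"')


def summarize(lines, syscalls=("lstat", "lgetxattr")):
    """Count total and repeated calls per syscall, keyed on the path argument."""
    per_path = {name: Counter() for name in syscalls}
    for line in lines:
        match = CALL_RE.match(line)
        if match and match.group(1) in per_path:
            per_path[match.group(1)][match.group(2)] += 1
    return {
        name: {
            "total": sum(paths.values()),
            "distinct_paths": len(paths),
            "repeated": sum(paths.values()) - len(paths),
        }
        for name, paths in per_path.items()
    }


# Synthetic lines in strace's output style, for illustration only.
sample = [
    'lstat("/brick/.glusterfs/3f/2a/x", {st_mode=S_IFDIR|0755, ...}) = 0',
    'lgetxattr("/brick/.glusterfs/3f/2a/x", "trusted.gfid", 0x7f, 16) = 16',
    '[pid  4242] lstat("/brick/.glusterfs/3f/2a/x", {...}) = 0',
    'lstat("/brick/dirs/d1", {...}) = 0',
]
print(summarize(sample))
```

For a real log, the same function can be fed an open file object, since it only iterates over lines.<br>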


<br>

Is there any way to minimize these effects? Am I doing something wrong?<br>

<br>

Thanks in advance for your help,<br>

<br>

Xavi<br>

<br>

_______________________________________________<br>

Gluster-devel mailing list<br>

<a href="mailto:Gluster-devel@nongnu.org" target="_blank">Gluster-devel@nongnu.org</a><br>

<a href="https://lists.nongnu.org/mailman/listinfo/gluster-devel" target="_blank">https://lists.nongnu.org/mailman/listinfo/gluster-devel</a><br>

</blockquote>

<br>

<br>


</div></div></blockquote></div><br></div>