<div dir="ltr">please try with 'option self-heal off' in the unify volume and check if that improves your performance.<br><br>avati<br><br><div class="gmail_quote"><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
File layout is such that each host has its own directory; for example, the<br>
GlusterFS website would be located in:<br>
<fs_root>/db/org/g/<a href="http://www.glusterfd.org/" target="_blank">www.glusterfd.org/</a><br>
and each directory will have a small number of potentially large data files.<br>
A similar setup on local disks (without gluster) has proven its capabilities<br>
over the years.<br>
<br>
We use a distributed computing model, each node in the archive runs one<br>
or more processes to update the archive. We use the nufa scheduler to favor<br>
local files and we use a distributed hashing algorithm to prevent data from<br>
moving around nodes (unless the configuration changes of course).<br>
<br>
I've included the GlusterFS configuration at the bottom of this e-mail.<br>
<br>
Data access and throughput are pretty good (good enough), but calling stat()<br>
on a directory can take extraordinarily long. Here, for example, is a listing<br>
of the .nl top level domain:<br>
<br>
vagabond@spider2:~/archive/db/nl$ time ls<br>
0/ 2/ 4/ 6/ 8/ a/ c/ e/ g/ i/ k/ m/ o/ q/ s/ u/ w/ y/<br>
1/ 3/ 5/ 7/ 9/ b/ d/ f/ h/ j/ l/ n/ p/ r/ t/ v/ x/ z/<br>
<br>
real 4m28.373s<br>
user 0m0.004s<br>
sys 0m0.000s<br>
<br>
<br>
The same operation performed directly on the local filesystem of the namespace<br>
node returns almost instantly (also for large directories):<br>
<br>
time ls /local.mnt/md0/glfs-namespace/db/nl/a | wc -l<br>
17506<br>
<br>
real 0m0.043s<br>
user 0m0.032s<br>
sys 0m0.012s<br>
<br>
<br>
A trace of the namespace gluster daemon shows that it is performing an<br>
lstat() on all the subdirectories (nl/0/*, nl/1/*, etc.), information that<br>
IMO is not needed at this time. In our case the total number of directories<br>
on the filesystem goes into the many millions, so this behaviour is hurting<br>
performance.<br>
<br>
Now for our questions:<br>
<br>
* is this expected to scale to tens of millions of directories?<br>
<br>
* is this behaviour a necessity for GlusterFS to operate correctly or is<br>
it some form of performance optimisation? Is it tunable?<br>
<br>
* what exactly is the sequence of events to handle a directory listing?<br>
Is this request handled by the namespace node only?<br>
<br>
* is there anything we can tune or change to speed up directory access?<br>
<br>
Thanks for your time,<br>
<br>
Arend-Jan<br>
<br>
<br>
**** Hardware config ****<br>
data nodes<br>
- 1 x Xeon quad core 2.5 Ghz<br>
- 4 x Barracuda ES.2 SATA 3.0-Gb/s 1-TB Hard Drive<br>
Disks configured in RAID0, 128k chunks<br>
Filesystem XFS<br>
Network: gigabit LAN<br>
<br>
namespace node<br>
- 2 x Xeon quad core 2.5 Ghz<br>
- 4 x Cheetah® 15K.5 U320 SCSI Hard Drives<br>
Disks configured in RAID1 (1 mirror, 1 spare)<br>
Filesystem XFS<br>
Network: gigabit LAN<br>
<br>
<br>
Glusterfs Version: 1.3.11 with Fuse fuse-2.7.3glfs10<br>
Glusterfs Version: 1.4-pre5 with Fuse fuse-2.7.3glfs10<br>
<br>
<br>
**** GlusterFS data node config ****<br>
<br>
volume brick-posix0<br>
type storage/posix<br>
option directory /local.mnt/md0/glfs-data<br>
end-volume<br>
<br>
volume brick-lock0<br>
type features/posix-locks<br>
subvolumes brick-posix0<br>
end-volume<br>
<br>
volume brick-fixed0<br>
type features/fixed-id<br>
option fixed-uid 2224<br>
option fixed-gid 224<br>
subvolumes brick-lock0<br>
end-volume<br>
<br>
volume brick-iothreads0<br>
type performance/io-threads<br>
option thread-count 4<br>
subvolumes brick-fixed0<br>
end-volume<br>
<br>
volume brick0<br>
type performance/read-ahead<br>
subvolumes brick-iothreads0<br>
end-volume<br>
<br>
volume server<br>
type protocol/server<br>
option transport-type tcp/server<br>
subvolumes brick0<br>
option auth.ip.brick0.allow 10.1.0.*<br>
end-volume<br>
<br>
<br>
**** GlusterFS namespace config ****<br>
<br>
volume brick-posix<br>
type storage/posix<br>
option directory /local.mnt/md0/glfs-namespace<br>
end-volume<br>
<br>
volume brick-namespace<br>
type features/fixed-id<br>
option fixed-uid 2224<br>
option fixed-gid 224<br>
subvolumes brick-posix<br>
end-volume<br>
<br>
volume server<br>
type protocol/server<br>
option transport-type tcp/server<br>
subvolumes brick-namespace<br>
option auth.ip.brick-namespace.allow 10.1.0.*<br>
end-volume<br>
<br>
<br>
**** GlusterFS client config ****<br>
<br>
volume brick-0-0<br>
type protocol/client<br>
option transport-type tcp/client<br>
option remote-host archive0<br>
option remote-subvolume brick0<br>
end-volume<br>
<br>
volume brick-1-0<br>
type protocol/client<br>
option transport-type tcp/client<br>
option remote-host archive1<br>
option remote-subvolume brick0<br>
end-volume<br>
<br>
volume brick-2-0<br>
type protocol/client<br>
option transport-type tcp/client<br>
option remote-host archive2<br>
option remote-subvolume brick0<br>
end-volume<br>
<br>
volume ns0<br>
type protocol/client<br>
option transport-type tcp/client<br>
option remote-host archivens0<br>
option remote-subvolume brick-namespace<br>
end-volume<br>
<br>
volume unify<br>
type cluster/unify<br>
option namespace ns0<br>
option scheduler nufa<br>
option nufa.local-volume-name brick-2-0 # depends on data node of course<br>
option nufa.limits.min-free-disk 10%<br>
subvolumes brick-0-0 brick-1-0 brick-2-0<br>
end-volume<br>
<font color="#888888"><br>
--<br>
Arend-Jan Wijtzes -- Wiseguys -- <a href="http://www.wise-guys.nl" target="_blank">www.wise-guys.nl</a><br>
<br>
</font><br>_______________________________________________<br>
Gluster-users mailing list<br>
<a href="mailto:Gluster-users@gluster.org">Gluster-users@gluster.org</a><br>
<a href="http://zresearch.com/cgi-bin/mailman/listinfo/gluster-users" target="_blank">http://zresearch.com/cgi-bin/mailman/listinfo/gluster-users</a><br>
<br></blockquote></div><br><br clear="all"><br>-- <br>If I traveled to the end of the rainbow<br>As Dame Fortune did intend,<br>Murphy would be there to tell me<br>The pot's at the other end.<br>
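P.S. A sketch of where that option would slot into the unify volume of the quoted client config, assuming your 1.3.11 / 1.4-pre5 builds accept 'self-heal' as a cluster/unify option:<br><br>
volume unify<br>
type cluster/unify<br>
option namespace ns0<br>
option self-heal off<br>
option scheduler nufa<br>
option nufa.local-volume-name brick-2-0<br>
option nufa.limits.min-free-disk 10%<br>
subvolumes brick-0-0 brick-1-0 brick-2-0<br>
end-volume<br><br>
The only change from the quoted config is the added 'option self-heal off' line; everything else stays as you have it.<br>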
</div>