AFAIK the kernel does not allow requests bigger than 128KB, and gluster has this limit hardcoded in fuse-bridge.c. Currently it is not possible to increase or decrease this value.

I made the tests using maximum block sizes.
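(As an aside, a minimal illustration of the kind of thing being described: a 128KB cap baked in at compile time and passed to the kernel through the max_read mount option, so it cannot be tuned when mounting. The constant name below is hypothetical, not the actual fuse-bridge.c identifier.)

#include <stdio.h>

/* Hypothetical sketch: the request size cap is a compile-time constant
 * that ends up in the FUSE mount option string, so changing it means
 * rebuilding the client. */
#define MAX_REQUEST_SIZE (128 * 1024)

int main(void)
{
    char opts[256];

    snprintf(opts, sizeof(opts),
             "allow_other,default_permissions,max_read=%d",
             MAX_REQUEST_SIZE);
    printf("fuse mount options: %s\n", opts);  /* ...max_read=131072 */
    return 0;
}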
On 12/03/13 08:16, lierihanmei wrote:
When glusterfs mounts FUSE, it uses the max_read=128KB option. Any bigger request would be split. Tuning that option would make big reads and writes faster, but would be of no use for small files.
At 2013-03-11 18:49:47, "Xavier Hernandez" <xhernandez@datalab.es> wrote:
>Hello,
>
>I've recently performed some tests with gluster on a fast network (IP
>over InfiniBand) and got some unexpected results. It seems that
>mount/fuse is becoming a bottleneck when the network and disk are very fast.
>
>I started with a simple distributed volume with 2 bricks mounted on a
>ramdisk to avoid possible disk bottlenecks (however, I repeated the tests
>with an SSD and, later, with a normal hard disk, and the results were the
>same, probably due to the good work of the performance translators). With
>this configuration, a single write reached a throughput of ~420 MB/s.
>It's way below the maximum network limit, but for a single write it's
>quite acceptable. However, with two concurrent writes (carefully chosen
>so that each one goes to a different brick), the throughput was ~200
>MB/s (for each transfer). That was totally unexpected. As there was
>plenty of bandwidth available and no IO limitation, I was expecting
>something near 800 MB/s.
>
>In fact, any combination of concurrent writes always led to the same
>combined throughput of ~400 MB/s.
>
>Trying to determine the cause of this odd behavior, I noticed that
>mount/fuse uses a single thread to serve kernel requests, and once a
>request is received, it is sent down the xlator stack to be processed,
>only reading additional requests once the stack returns. This means that
>to reach a 420 MB/s throughput using 128KB per request (the current
>maximum block size), it needs to serve at least 3360 requests per
>second. In other words, it has to process each request in about 300 us.
>If we take into account that every translator will allocate memory and
>do some system calls, it's quite possible that it really takes 300 us
>to serve each request.
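(A quick check of that arithmetic, taking 1 MB = 1024 KB: 420 * 1024 KB/s / 128 KB per request = 3360 requests/s, and 1 s / 3360 = ~298 us per request, i.e. roughly 300 us.)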
>
>To see if this is the case, I added the performance/io-threads xlator
>just below mount/fuse. This would queue each request to a different
>thread, freeing the current one to read another request much sooner than
>300 us. This should improve the concurrent writes case.
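(For reference, a sketch of what loading io-threads directly below the fuse mount point looks like in a client volfile; the volume names are hypothetical and whether the test was wired exactly this way is an assumption:

volume iot-top
    type performance/io-threads
    option thread-count 16
    subvolumes previous-top-volume
end-volume

The fuse mount is then created on top of this volume.)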
>
>The results are good. Using this simple modification, 2 concurrent
>writes performed at ~300 MB/s each. However, the throughput for a
>single write dropped to ~250 MB/s. Anyway, this solution is not valid
>because there is some incompatibility with this configuration and some
>things do not work well (for example, a simple 'ls' does not show all
>the files).
>
>Then I modified the mount/fuse xlator to start several threads to serve
>kernel requests. With this modification everything seems to work as
>expected and the throughput is quite a bit better: a single write still
>performs at 420 MB/s, and 2 concurrent writes reach 330 MB/s. In fact,
>any combination of 2 or more concurrent writes has a combined throughput
>of ~650 MB/s.
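(A minimal sketch of that idea, not the actual patch: several threads each do a blocking read on the fuse device descriptor and hand whatever they get to the translator stack, so a slow request no longer blocks the next read. Function and constant names here are hypothetical.)

#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

#define NUM_READERS 4
/* 128KB of payload plus some room for the request header. */
#define READ_BUFFER_SIZE (128 * 1024 + 4096)

static int fuse_fd;  /* descriptor obtained when /dev/fuse was mounted */

/* Stand-in for decoding the request and sending it down the xlator
 * stack; in the real xlator this is where the fop would be dispatched. */
static void process_fuse_request(char *buf, ssize_t len)
{
    (void) buf;
    (void) len;
}

static void *fuse_reader(void *data)
{
    char *buf = malloc(READ_BUFFER_SIZE);

    (void) data;
    for (;;) {
        /* Each thread blocks here independently, so while one request
         * is travelling down the stack the others keep reading. */
        ssize_t len = read(fuse_fd, buf, READ_BUFFER_SIZE);
        if (len < 0)
            break;  /* unmounted or fatal error */
        process_fuse_request(buf, len);
    }
    free(buf);
    return NULL;
}

static void start_fuse_readers(void)
{
    pthread_t tid;
    int i;

    for (i = 0; i < NUM_READERS; i++)
        pthread_create(&tid, NULL, fuse_reader, NULL);
}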
>
>However, a replicate volume does not improve at all. I'm not sure why.
>It seems that there should be some kind of serialization point in
>cluster/afr. A single write has a throughput of ~175 MB/s, and 2
>concurrent writes ~85 MB/s. I'll have to investigate this further.
>
>Does all this make sense?
>
>Is this something that would be worth investing more time in?
>
>Regards,
>
>Xavi
_______________________________________________
Gluster-devel mailing list
Gluster-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/gluster-devel