<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns="http://purl.org/rss/1.0/" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:syn="http://purl.org/rss/1.0/modules/syndication/" xmlns:admin="http://webns.net/mvcb/">
  <channel rdf:about="http://blog.gmane.org/gmane.comp.file-systems.ext3.user">
    <title>gmane.comp.file-systems.ext3.user</title>
    <link>http://blog.gmane.org/gmane.comp.file-systems.ext3.user</link>
    <description/>
    <syn:updatePeriod>hourly</syn:updatePeriod>
    <syn:updateFrequency>1</syn:updateFrequency>
    <syn:updateBase>1901-01-01T00:00+00:00</syn:updateBase>
    <items>
      <rdf:Seq>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.comp.file-systems.ext3.user/4998"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.comp.file-systems.ext3.user/4997"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.comp.file-systems.ext3.user/4994"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.comp.file-systems.ext3.user/4993"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.comp.file-systems.ext3.user/4992"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.comp.file-systems.ext3.user/4991"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.comp.file-systems.ext3.user/4990"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.comp.file-systems.ext3.user/4989"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.comp.file-systems.ext3.user/4988"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.comp.file-systems.ext3.user/4987"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.comp.file-systems.ext3.user/4986"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.comp.file-systems.ext3.user/4985"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.comp.file-systems.ext3.user/4984"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.comp.file-systems.ext3.user/4983"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.comp.file-systems.ext3.user/4982"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.comp.file-systems.ext3.user/4981"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.comp.file-systems.ext3.user/4980"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.comp.file-systems.ext3.user/4979"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.comp.file-systems.ext3.user/4978"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.comp.file-systems.ext3.user/4977"/>
      </rdf:Seq>
    </items>
    <image rdf:resource="http://gmane.org/img/gmane-25t.png"/>
    <textinput rdf:resource=""/>
  </channel>
  <image rdf:about="http://gmane.org/img/gmane-25t.png">
    <title>Gmane</title>
    <url>http://gmane.org/img/gmane-25t.png</url>
    <link>http://gmane.org</link>
  </image>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.file-systems.ext3.user/4998">
    <title>Re: (LONG) Delay when writing to ext4 LVM after boot</title>
    <link>http://permalink.gmane.org/gmane.comp.file-systems.ext3.user/4998</link>
    <description>&lt;pre&gt;
This is a problem that I am very familiar with for large filesystems. The issue is that if the filesystem is relatively full, the first write needs to load and search a lot of the block bitmaps to try and find enough space to allocate blocks for the write.  Depending on how it was formatted, each block bitmap read needs a seek. 


Might I guess that this filesystem was formatted as ext3 and not as ext4?  In particular, is the "flex_bg" option missing from the Features line in the "dumpe2fs -h /dev/XXX" output?  This feature is enabled by default if formatting as ext4, but not as ext3.
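
A minimal sketch of the check described above, assuming the header text from "dumpe2fs -h" has already been captured into a string. The helper name and the sample header are hypothetical and only illustrate the idea.

```python
def has_feature(dumpe2fs_header, feature):
    """Return True if `feature` is listed on the 'Filesystem features:' line."""
    for line in dumpe2fs_header.splitlines():
        label, sep, value = line.partition(":")
        if sep and label.strip() == "Filesystem features":
            return feature in value.split()
    return False


# Hypothetical header excerpt, for illustration only.
sample = (
    "Filesystem OS type:       Linux\n"
    "Filesystem features:      has_journal ext_attr resize_inode dir_index filetype sparse_super\n"
)
print("flex_bg present:", has_feature(sample, "flex_bg"))
```

If flex_bg is absent from that line, the filesystem was most likely created without it.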

The flex_bg feature will allocate the block bitmaps in large chunks on the disk so that they can be loaded quickly at mount and e2fsck time.  On a 16TB filesystem with 10 ms seek time, in the worst case it could take up to 20 minutes to load all of the bitmaps at boot time without flex_bg...
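
As a rough check on the 20-minute figure, the arithmetic can be sketched as follows. This is only an estimate under stated assumptions: 4 KiB blocks and 32768 blocks per group (128 MiB groups) are not given in the message above, and every block bitmap read is charged one full seek.

```python
# Worst-case time to read every block bitmap when each read costs one seek.
# Assumptions (not from the message): 4 KiB blocks and 32768 blocks per
# group, i.e. 128 MiB per block group.
BLOCK_SIZE = 4096
BLOCKS_PER_GROUP = 32768


def bitmap_load_seconds(fs_bytes, seek_seconds):
    """Return (number of groups, seconds to read one bitmap per group)."""
    group_bytes = BLOCK_SIZE * BLOCKS_PER_GROUP
    groups = -(-fs_bytes // group_bytes)  # ceiling division
    return groups, groups * seek_seconds


groups, seconds = bitmap_load_seconds(16 * 2**40, 0.010)
print(groups, "groups,", round(seconds / 60, 1), "minutes")
```

With these assumptions the estimate comes to 131072 groups and roughly 22 minutes, in line with the figure quoted above.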


Unfortunately, flex_bg is a format-time option, so you would need a full backup-restore to benefit from it for your filesystem.

If there is a delay between mounting and the first write, you could prefetch the bitmaps with "dumpe2fs /dev/XXX &amp;gt; /dev/null" so that it loads all of the bitmaps before they are needed.  Some people do this in a startup script as a workaround for the initial write slowness. 
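
The startup workaround could be wrapped roughly as below. This is a sketch, not a tested boot script: the function names are hypothetical, /dev/XXX is the placeholder from the message, and running dumpe2fs on a real device normally needs root.

```python
import subprocess


def prefetch_command(device):
    """Command line for dumpe2fs on DEVICE; the caller discards the output."""
    return ["dumpe2fs", device]


def prefetch_bitmaps(device, run=subprocess.run):
    """Run dumpe2fs once so the bitmaps are read before the first large write.

    `run` is injectable only so the function can be exercised without root.
    """
    return run(
        prefetch_command(device),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        check=False,
    )
```

A startup hook would call prefetch_bitmaps("/dev/XXX") once, after the filesystem is mounted.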

Changing the allocation policy would not help in your case, since the large file would need more blocks than could be satisfied by the early groups. That is why you don't see a slowdown for small files.

In theory, it would be possible to modify resize2fs to co-locate the bitmaps on disk to enable flex_bg, in the same manner as it currently moves the inode table to add group descriptor blocks, but that would need some non-trivial development. 

Cheers, Andreas
&lt;/pre&gt;</description>
    <dc:creator>Andreas Dilger</dc:creator>
    <dc:date>2013-04-24T15:58:21</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.file-systems.ext3.user/4997">
    <title>(LONG) Delay when writing to ext4 LVM after boot</title>
    <link>http://permalink.gmane.org/gmane.comp.file-systems.ext3.user/4997</link>
    <description>&lt;pre&gt;(I previously asked this question in the LVM list, and they suggested I ask
here.)

I have a large LV, about 6.5T, consisting of 4 physical drives of various
sizes. The LV is formatted as ext4. There is no raid involved (hardware or
software).

After I first boot, if I try to write a large file (&amp;gt;~ 80M) to this LV, the
write hangs for about 1 minute or more, then continues on at full speed and
finishes successfully. Writes of small files don't show this delay. After
that first write and delay, all subsequent writes to other large files
proceed at full speed.

I am currently running Fedora 17 64bit  (kernel 3.8.4-102.fc17.x86_64) but
have noticed this also in previous systems (both 64 and 32bit). With
smaller file systems ( &amp;lt; 1T ), there was a delay, but it was small, and it
increased significantly as I increased the LV size.

I have run e2fsck with the -D option (before attempting a write), which
made no difference. Also, fwiw,  I am mounting this with the default
options. I've tried other options that were suggested to tweak ext4, but,
again, no effect. This LV is also not my system (root) partition - that is
on a separate physical drive.

Any ideas? Suggestions?

(I will gladly supply additional info as requested.)

TIA

ken
_______________________________________________
Ext3-users mailing list
Ext3-users&amp;lt; at &amp;gt;redhat.com
https://www.redhat.com/mailman/listinfo/ext3-users&lt;/pre&gt;</description>
    <dc:creator>Ken Bass</dc:creator>
    <dc:date>2013-04-24T00:14:52</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.file-systems.ext3.user/4994">
    <title>Re: e2freefrag says filesystem too large</title>
    <link>http://permalink.gmane.org/gmane.comp.file-systems.ext3.user/4994</link>
    <description>&lt;pre&gt;
Yes, it's fixed in e2fsprogs 1.42.7.  And if you're using 64-bit file
systems, you really really want to upgrade to 1.42.7 --- especially if
you are ever thinking of using resize2fs; we fixed a number of very
serious bugs, some of which could destroy file systems if you try to
do off-line resizes.  See the release notes for more details:

http://e2fsprogs.sourceforge.net/e2fsprogs-release.html#1.42.7

Regards,

- Ted
&lt;/pre&gt;</description>
    <dc:creator>Theodore Ts'o</dc:creator>
    <dc:date>2013-03-27T02:14:43</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.file-systems.ext3.user/4993">
    <title>e2freefrag says filesystem too large</title>
    <link>http://permalink.gmane.org/gmane.comp.file-systems.ext3.user/4993</link>
    <description>&lt;pre&gt;

Can someone tell me if this will be fixed?

# e2freefrag
/dev/sdl1
Device: /dev/sdl1
Blocksize: 4096 bytes
/dev/sdl1: Filesystem too large to use legacy bitmaps while reading
block bitmap

# rpm -qa|grep e2fsprogs
e2fsprogs-libs-1.42.5-1.fc18.x86_64
e2fsprogs-1.42.5-1.fc18.x86_64

# df|grep /dev/sdl1
/dev/sdl1  35143869536 30265426892  4175317880  88% /mnt/backup

Thanks!

---
Will Y.

&lt;/pre&gt;</description>
    <dc:creator>aragonx&lt; at &gt;dcsnow.com</dc:creator>
    <dc:date>2013-03-26T19:51:24</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.file-systems.ext3.user/4992">
    <title>Re: ext4 and extremely slow filesystem traversal</title>
    <link>http://permalink.gmane.org/gmane.comp.file-systems.ext3.user/4992</link>
    <description>&lt;pre&gt;
  I did not find it in  git://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git

  There are some references from Lustre doc [1] and I could find some
github repo with some code [2]. Would you know where the e2scan upstream
lives ?


  [1]
http://wiki.lustre.org/manual/LustreManual20_HTML/SystemConfigurationUtilities_HTML.html#50438219_55923
  [2] https://github.com/morrone/e2fsprogs
&lt;/pre&gt;</description>
    <dc:creator>Vincent Caron</dc:creator>
    <dc:date>2013-03-24T22:40:56</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.file-systems.ext3.user/4991">
    <title>Re: ext4 and extremely slow filesystem traversal</title>
    <link>http://permalink.gmane.org/gmane.comp.file-systems.ext3.user/4991</link>
    <description>&lt;pre&gt;* Vincent Caron &amp;lt;vcaron&amp;lt; at &amp;gt;bearstech.com&amp;gt; wrote:


Using -m technically makes the file system driver report ENOSPC when
there is in fact still free space available.

So, as long as you make sure that you have at least 5% free space at any
given time, it doesn't matter whether you have -m0 or -m5. However, tools
like df show the capacity available to user space, so 100GB with 95GB used
will show 100% used with -m5 and 95% used with -m0. Effectively, that
means that when you go with -m5 and make sure df shows at least 5% free
space, you end up with about 10% free all the time, which reduces
fragmentation, in theory. Since you're misusing your file system as a
database management system with many small files anyway,
inter-file fragmentation is the least of your problems. So it's totally
safe for you to stay with -m0.
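
The df example above can be reproduced with a small calculation. This is a simplified model, assuming df reports used / (used + available) rounded up and ignoring filesystem metadata overhead; the function name is hypothetical.

```python
import math


def df_use_percent(size_gb, used_gb, reserved_percent):
    """Approximate the Use% column of df for a given reserved-block percentage."""
    reserved = size_gb * reserved_percent / 100
    # Space an unprivileged user can still allocate; never negative.
    available = max(0.0, size_gb - used_gb - reserved)
    return math.ceil(100 * used_gb / (used_gb + available))


print("-m5:", df_use_percent(100, 95, 5), "percent used")
print("-m0:", df_use_percent(100, 95, 0), "percent used")
```

For 100GB with 95GB used, this gives 100 with -m5 and 95 with -m0, matching the figures above.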

Regards, Bodo
&lt;/pre&gt;</description>
    <dc:creator>Bodo Thiesen</dc:creator>
    <dc:date>2013-03-16T11:29:07</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.file-systems.ext3.user/4990">
    <title>Re: ext4 and extremely slow filesystem traversal</title>
    <link>http://permalink.gmane.org/gmane.comp.file-systems.ext3.user/4990</link>
    <description>&lt;pre&gt;* Vincent Caron &amp;lt;vcaron&amp;lt; at &amp;gt;bearstech.com&amp;gt; wrote:


There is a simple and reliable solution for block level backups: dd

umount &amp;amp;&amp;amp; dd if="our raid partition" of="some new big enough disc" &amp;amp;&amp;amp;
mount

and then wait for the data to go at 100MB/s or so to the new disc.

Using snapshots is not a reliable way to do backups, since you would
still have to trust the LVM code to be totally error free and protect
your data under any circumstances (including hardware failures in your
raid array etc).

For your actual problem: Ask your developers to use some mapping
system.

When they want to access a file "filename" then calculate md5sum of
"filename", take the first 6 characters of the ascii representation
(here it would be 435ed7) and create a file called "43/5e/d7-X".
This way you would end up with at most 65792 directories. The X is needed
to distinguish between files whose md5sum starts with the same 6 characters.
So, first file gets name "43/5e/d7-1", second file gets name "43/5e/d7-2" and so on.

Somewhere else, they would then store the mapping table, mapping file
"filename" to "43/5e/d7-2". All accesses go through this mapping table.

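The mapping scheme above might look roughly like this. It is a sketch only: the class name is hypothetical, and the table is kept in memory here, whereas a real one would have to be stored persistently somewhere else as described.

```python
import hashlib


class NameMapper:
    """Map arbitrary file names to paths of the form 'ab/cd/ef-N'."""

    def __init__(self):
        self.table = {}      # original name to mapped path
        self.counters = {}   # six-character md5 prefix to last suffix used

    def path_for(self, name):
        """Return the mapped path for `name`, allocating one on first use."""
        if name in self.table:
            return self.table[name]
        prefix = hashlib.md5(name.encode("utf-8")).hexdigest()[:6]
        suffix = self.counters.get(prefix, 0) + 1
        self.counters[prefix] = suffix
        path = "{}/{}/{}-{}".format(prefix[0:2], prefix[2:4], prefix[4:6], suffix)
        self.table[name] = path
        return path


mapper = NameMapper()
print(mapper.path_for("filename"))
# Directory count: 256 first-level plus 256 * 256 second-level directories.
print(256 + 256 * 256)
```

All accesses would go through path_for(), so the application never builds paths on its own.
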
Regards, Bodo
&lt;/pre&gt;</description>
    <dc:creator>Bodo Thiesen</dc:creator>
    <dc:date>2013-03-16T11:17:01</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.file-systems.ext3.user/4989">
    <title>Re: ext4 and extremely slow filesystem traversal</title>
    <link>http://permalink.gmane.org/gmane.comp.file-systems.ext3.user/4989</link>
    <description>&lt;pre&gt;

That's exactly what e2scan does. I'm pretty sure that is in upstream e2fsprogs now (not just our Lustre version), but I'm on a plane and cannot check.

It will scan the inode table directly and can generate the pathnames of files efficiently. It can filter on timestamps. 

Cheers, Andreas

&lt;/pre&gt;</description>
    <dc:creator>Andreas Dilger</dc:creator>
    <dc:date>2013-03-15T07:14:17</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.file-systems.ext3.user/4988">
    <title>Re: ext4 and extremely slow filesystem traversal</title>
    <link>http://permalink.gmane.org/gmane.comp.file-systems.ext3.user/4988</link>
    <description>&lt;pre&gt;
  Very useful. Global stats without having to scan the whole filesystems
are very precious...

  I was wondering : couldn't we use dumpe2fs or something based on
libext2fs to quickly extract a snapshot of all inodes from a given
filesystem ? For incremental backups, simply checking the mtime on
millions of inodes and discovering that only a handful of them were
updated since the previous pass looks very inefficient with
readdir()+lstat(). So many syscalls, so many spoonfed bits of
information. When I had a peek, I thought I'd get a list of inodes but
would not be able to link them back to their name(s) without inducing
the same cost as a regular find-like filesystem traversal. Does it make
sense ?

  AFAIK I would be better served with block-level snapshot solutions,
but LVM snapshots are supposed to double your writes if I got it right,
and I'm not sure there's something else in the Linux and free software
world. Plus I'd love to not migrate away from my ext3/4's without a
compelling reason. Btrfs is not (yet) an option and ZFS doesn't fit
legally with Linux.
&lt;/pre&gt;</description>
    <dc:creator>Vincent Caron</dc:creator>
    <dc:date>2013-03-14T23:07:31</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.file-systems.ext3.user/4987">
    <title>Re: ext4 and extremely slow filesystem traversal</title>
    <link>http://permalink.gmane.org/gmane.comp.file-systems.ext3.user/4987</link>
    <description>&lt;pre&gt;
  Because the man page says that reserved blocks are used to protect
root-level daemons from misbehaving should unprivileged programs try to
fill the disk. And uh, to avoid fragmentation, I missed that part.

  OTOH I monitor disk space and never let go past 95% block usage
without specific action (freeing inodes or enlarging filesystem). Would
I use -m5 and oversize my filesystems (because I sell the capacity, say
I sell 100GB then I need a 105GB blockdev), I would still monitor the
disk usage and take action before it's 100% filled up. But I'd end up
reserving more blocks without more guarantees than the -m0 case.

  So technically it looks wrong, but politically I'm not sure it's
stupid. Or is it ?
&lt;/pre&gt;</description>
    <dc:creator>Vincent Caron</dc:creator>
    <dc:date>2013-03-14T22:56:47</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.file-systems.ext3.user/4986">
    <title>Re: ext4 and extremely slow filesystem traversal</title>
    <link>http://permalink.gmane.org/gmane.comp.file-systems.ext3.user/4986</link>
    <description>&lt;pre&gt;


Why does it matter here that "no file owned by root"?

What has that got to do with the much greater difficulty to find
contiguous space the fuller the filetree is?
&lt;/pre&gt;</description>
    <dc:creator>Peter Grandi</dc:creator>
    <dc:date>2013-03-14T20:57:49</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.file-systems.ext3.user/4985">
    <title>Re: ext4 and extremely slow filesystem traversal</title>
    <link>http://permalink.gmane.org/gmane.comp.file-systems.ext3.user/4985</link>
    <description>&lt;pre&gt;
Just as a note, e2fsck -v can sometimes get this information much more
quickly than other alternatives, since it can scan the file system in
inode order, instead of the essentially random order.

Just as an aside, if you just want to get a rough count of the number of
directories, you can get that by grabbing the information out of
dumpe2fs.

Group 624: (Blocks 20447232-20479999) [ITABLE_ZEROED]
  Checksum 0xd3f5, unused inodes 4821
  Block bitmap at 20447232 (+0), Inode bitmap at 20447248 (+16)
  Inode table at 20447264-20447775 (+32)
  24103 free blocks, 4821 free inodes, 435 directories, 4821 unused inodes
                                       ^^^^^^^^^^^^^^^
  Free blocks: 20455889, 20455898-20479999
  Free inodes: 5115180-5120000

Dumpe2fs doesn't actually sum the number of directories, and you won't
be able to differentiate the number of files that are regular files versus
symlinks, device nodes, etc., but if you just want to get the number
of directories, you can get this number by getting the information out
of dumpe2fs without having to wait for e2fsck to complete.  You can
even do this with a mounted file system, but the number will of course
not necessarily be completely accurate if you do that.

(You can get the number of inodes in use by subtracting the number of
free inodes from the number of inodes in the file system.  If you then
subtract the number of directories, then you can get the number of
non-directory inodes versus directory inodes.)
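
Since dumpe2fs does not sum the per-group directory counts itself, a small script can do it. The sketch below assumes the output has been captured as text and that each group summary line has the shape shown above; the function names are hypothetical.

```python
import re

# Matches the per-group summary line printed by dumpe2fs, as shown above.
GROUP_SUMMARY = re.compile(r"(\d+) free blocks, (\d+) free inodes, (\d+) directories")


def count_directories(dumpe2fs_output):
    """Sum the per-group 'directories' counters found in dumpe2fs output."""
    return sum(int(m.group(3)) for m in GROUP_SUMMARY.finditer(dumpe2fs_output))


def non_directory_inodes(total_inodes, free_inodes, directories):
    """Inodes in use that are not directories."""
    return total_inodes - free_inodes - directories


sample = "  24103 free blocks, 4821 free inodes, 435 directories, 4821 unused inodes\n"
print(count_directories(sample))
```

On a mounted filesystem the result is only approximate, for the reason given above.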

                          - Ted
&lt;/pre&gt;</description>
    <dc:creator>Theodore Ts'o</dc:creator>
    <dc:date>2013-03-14T01:05:51</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.file-systems.ext3.user/4984">
    <title>Re: ext4 and extremely slow filesystem traversal</title>
    <link>http://permalink.gmane.org/gmane.comp.file-systems.ext3.user/4984</link>
    <description>&lt;pre&gt;
  This filesystem has no file owned by root and won't have any. I
thought in this case -m0 would be a good idea.

  Thanks a lot for your detailed insight on the various performance
figures, I didn't do the proper math to realize that this inode reading
rate was actually *good*.

  Fortunately the client is technically savvy, and pointing him at this
mailing-list thread will help him make the right decision.
&lt;/pre&gt;</description>
    <dc:creator>Vincent Caron</dc:creator>
    <dc:date>2013-03-13T21:50:13</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.file-systems.ext3.user/4983">
    <title>Re: ext4 and extremely slow filesystem traversal</title>
    <link>http://permalink.gmane.org/gmane.comp.file-systems.ext3.user/4983</link>
    <description>&lt;pre&gt;
That is roughly 1M inodes/hour and 20GB/hour, or  nearly 300
inodes/s and nearly 6MB/s. These are very good numbers for high
random IOPS loads, and as seen later, you have one.


That helps.


That's the pointless default... But does not particularly slow
things down here.


The striping or alignment are not relevant on reads, but the
stride matters a great deal as to metadata parallelism, and here
it is set to 64KiB. But the array stride is 16KiB (a 4-wide
stripe of 64KiB). But since it is an integral multiple it should
be about as good. And since the backup performance is pretty
good, that seems the case.


That's usually a truly terrible setting (20% is a much better
value), but your filesystem is not very full anyhow.


Rather they are pretty good. Each 500GB SATA disk can usually do
somewhat less than 100 random IOPS, there are 4 disks in
each stripe when reading, and you are getting nearly 300 inodes/s
and 5MB/s, quite close to the maximum. On random loads with
smallish records typical rotating disks have transfer rates of
0.5MB to 1.5 MB/s, and you are getting rather more than that
(mostly thanks to the 20KiB average inode size).
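
The rates discussed above can be rederived from the figures in the original report (9.3M inodes and 196 GB in use). The 10-hour duration is an assumption here, since the report says only that the backup takes more than 10 hours, so the results come out slightly below the rounded numbers quoted above.

```python
# Figures from the original report; HOURS is an assumed lower bound.
INODES = 9.3e6
BYTES_USED = 196e9
HOURS = 10

seconds = HOURS * 3600
inodes_per_second = INODES / seconds
mb_per_second = BYTES_USED / seconds / 1e6
avg_kb_per_inode = BYTES_USED / INODES / 1e3

# From the message above: about 100 random IOPS per disk, 4 disks per stripe.
iops_ceiling = 4 * 100

print(round(inodes_per_second), "inodes/s against a ceiling of", iops_ceiling, "IOPS")
print(round(mb_per_second, 1), "MB/s, about", round(avg_kb_per_inode), "KB per inode")
```

The measured inode rate is already a large fraction of what four such disks can deliver on random reads.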

You are getting pretty good delivery from 'ext4' and a very low
random IOPS storage system on a highly randomized workload:


That's a really bad idea.


That's the usual 1M inodes/hour.


Well, any system administrator would tell you the same: your
backup workload and your storage system are mismatched, and the
best solution is probably to use 146GB SAS 15K RPM disks for the
same capacity (or more). Or perhaps recent enterprise level SSDs.

The "small file" problem is ancient, and I call it the
"mailstore" problem from its typical incarnation:

  http://www.sabi.co.uk/blog/12-thr.html#120429


The "I use the filesystem as a DBMS" attitude is really very
common among developers. It is cost-free to them, and backup (and
system) administrators bear the cost when the filesystem fills
up.  Because at the beginning everything looks fine. Designing
stuff that seems cheap and fast at the beginning even if it
becomes very bad after some time is a good way to look like a
winner in most organizations.


It's not, it is a very bad idea. In 2013, just like in 1973, or
in 1993, it is a much better idea to use simple indexed files to
keep a collection of smallish records.

Directories are a classification system, not a database indexing
system. Here is an amusing report of the difference between the
two:

  http://www.sabi.co.uk/blog/anno05-4th.html#051016


As a backup administrator you can't get much better from your
situation. You are already getting nearly the best performance
for whole tree scans of very random small records on a low random
IOPS storage layer.
&lt;/pre&gt;</description>
    <dc:creator>Peter Grandi</dc:creator>
    <dc:date>2013-03-13T21:29:32</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.file-systems.ext3.user/4982">
    <title>Re: ext4 and extremely slow filesystem traversal</title>
    <link>http://permalink.gmane.org/gmane.comp.file-systems.ext3.user/4982</link>
    <description>&lt;pre&gt;
Wow.  You have more directories than regular files!  Given that there
are no hard links, that implies that you have at least 2,079,271
directories which are ***empty***.
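
The count follows from a pigeonhole argument on the e2fsck totals: with no hard links, each regular file sits in exactly one directory, so at most that many directories can hold a file. A sketch of the arithmetic, with a hypothetical function name:

```python
def min_directories_without_files(directories, regular_files):
    """Lower bound on directories holding no regular file (no hard links)."""
    return max(0, directories - regular_files)


# Totals from the e2fsck -fv output posted earlier in the thread.
print(min_directories_without_files(5712586, 3633315))
```

This is only a lower bound; the real number is higher whenever some directory holds more than one file.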

The inline data feature (which is still in testing and isn't something
I can recommend for production use yet) is probably the best hope for
you.  But probably the best thing you can do is to harangue your
developers to ask what the heck they are doing....

                - Ted
&lt;/pre&gt;</description>
    <dc:creator>Theodore Ts'o</dc:creator>
    <dc:date>2013-03-13T20:33:46</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.file-systems.ext3.user/4981">
    <title>Re: ext4 and extremely slow filesystem traversal</title>
    <link>http://permalink.gmane.org/gmane.comp.file-systems.ext3.user/4981</link>
    <description>&lt;pre&gt;
To be clear, that's at least two million directories assuming that all
of the other directories have but a single file in them(!).  In reality
you probably have a lot more than 2 million empty directories....

                    - Ted
&lt;/pre&gt;</description>
    <dc:creator>Theodore Ts'o</dc:creator>
    <dc:date>2013-03-13T20:52:10</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.file-systems.ext3.user/4980">
    <title>Re: ext4 and extremely slow filesystem traversal</title>
    <link>http://permalink.gmane.org/gmane.comp.file-systems.ext3.user/4980</link>
    <description>&lt;pre&gt;
  Awful, isn't it ? I knew directories were abused, but didn't know that
'e2fsck -v' would display the exact figures (since I never waited 5+
hours to scan the whole filesystem). Nice to know.



  Indeed, these filers are storing live and sensitive data and are
conservatively running stable OS and well known kernels.

  Thanks for your advice, I'll actively work with the devs in order to
refactor their filesystem layout.
&lt;/pre&gt;</description>
    <dc:creator>Vincent Caron</dc:creator>
    <dc:date>2013-03-13T20:49:20</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.file-systems.ext3.user/4979">
    <title>Re: ext4 and extremely slow filesystem traversal</title>
    <link>http://permalink.gmane.org/gmane.comp.file-systems.ext3.user/4979</link>
    <description>&lt;pre&gt;
  Same slowness, I ran :

  filer:~# gcc -shared -fPIC -ldl -o spd_readdir.so spd_readdir.c
  filer:~# LD_PRELOAD=./spd_readdir.so find /srv/vol -type d |pv -bl

  I stopped the experiment at +54min with 845k directories found (which
gives roughly the same rate of 1M directories / hour, and I know there
are more than 2M of them).
&lt;/pre&gt;</description>
    <dc:creator>Vincent Caron</dc:creator>
    <dc:date>2013-03-13T11:59:53</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.file-systems.ext3.user/4978">
    <title>Re: ext4 and extremely slow filesystem traversal</title>
    <link>http://permalink.gmane.org/gmane.comp.file-systems.ext3.user/4978</link>
    <description>&lt;pre&gt;
  I don't think I tried this specific hack, I'm having a go right now.
Is it still useful if each directory only holds a few inodes ?



  Information attached. Dumpe2fs said: dumpe2fs 1.42.5 (29-Jul-2012).

  Thanks for your help !
Device: /dev/vg-raid3/xyz
Blocksize: 4096 bytes
Total blocks: 78643200
Free blocks: 26099944 (33.2%)

Min. free extent: 4 KB 
Max. free extent: 557632 KB
Avg. free extent: 468 KB

HISTOGRAM OF FREE EXTENT SIZES:
Extent Size Range :  Free extents   Free Blocks  Percent
    4K...    8K-  :         52301         52301    0.20%
    8K...   16K-  :         36907         87967    0.34%
   16K...   32K-  :         36063        187694    0.72%
   32K...   64K-  :         41123        433236    1.66%
   64K...  128K-  :         26411        579943    2.22%
  128K...  256K-  :         16236        745344    2.86%
  256K...  512K-  :          6073        513906    1.97%
  512K... 1024K-  :          4378        774518    2.97%
    1M...    2M-  :           699        244092    0.94%
    2M...    4M-  :           392        280118    1.07%
    4M...    8M-  :           289        420979    1.61%
    8M...   16M-  :           308        906478    3.47%
   16M...   32M-  :           289       1520835    5.83%
   32M...   64M-  :            67        668899    2.56%
   64M...  128M-  :           479      13748140   52.67%
  128M...  256M-  :            16        875692    3.36%
  256M...  512M-  :            39       3920394   15.02%
  512M... 1024M-  :             1        139408    0.53%
e2fsck 1.41.12 (17-May-2010)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information

 9345910 inodes used (47.54%)
   12447 non-contiguous files (0.1%)
      34 non-contiguous directories (0.0%)
         # of inodes with ind/dind/tind blocks: 0/0/0
         Extent depth histogram: 9345889/11
52543256 blocks used (66.81%)
       0 bad blocks
       1 large file

 3633315 regular files
 5712586 directories
       0 character device files
       0 block device files
       0 fifos
       0 links
       0 symbolic links (0 fast symbolic links)
       0 sockets
--------
 9345901 files
&lt;/pre&gt;</description>
    <dc:creator>Vincent Caron</dc:creator>
    <dc:date>2013-03-13T09:19:52</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.file-systems.ext3.user/4977">
    <title>Re: ext4 and extremely slow filesystem traversal</title>
    <link>http://permalink.gmane.org/gmane.comp.file-systems.ext3.user/4977</link>
    <description>&lt;pre&gt;
Did you sort results from readdir() by inode number?  i.e., such as
what the following LD_PRELOAD hack does?

https://git.kernel.org/cgit/fs/ext2/e2fsprogs.git/tree/contrib/spd_readdir.c?h=maint
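
For readers unfamiliar with the idea, here is a rough Python analogue of sorting directory entries by inode number before calling stat on them. It is not the LD_PRELOAD hack itself, only a sketch of the same ordering trick, and the function name is hypothetical.

```python
import os


def walk_in_inode_order(root):
    """Yield (path, stat_result) for everything under root.

    Entries of each directory are visited in ascending inode order rather
    than in the order readdir() returns them.
    """
    pending = [root]
    while pending:
        directory = pending.pop()
        with os.scandir(directory) as it:
            entries = sorted(it, key=lambda e: e.inode())
        for entry in entries:
            info = entry.stat(follow_symlinks=False)
            yield entry.path, info
            if entry.is_dir(follow_symlinks=False):
                pending.append(entry.path)
```

The intent is that inode-table reads become closer to sequential, which matters most when a directory holds many entries.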


Try running "e2fsck -fv /dev/XXX" and send me the output.

Also useful would be the output of "e2freefrag /dev/XXX" and "dumpe2fs -h"

                  - Ted
&lt;/pre&gt;</description>
    <dc:creator>Theodore Ts'o</dc:creator>
    <dc:date>2013-03-13T02:52:26</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.file-systems.ext3.user/4976">
    <title>ext4 and extremely slow filesystem traversal</title>
    <link>http://permalink.gmane.org/gmane.comp.file-systems.ext3.user/4976</link>
    <description>&lt;pre&gt;Hello list,

  I have trouble with the daily backup of a modest filesystem which
tends to take more than 10 hours. I have ext4 all over the place on ~200
servers and never ran into such a problem.

  The filesystem capacity is 300 GB (19,6M inodes) with 196 GB (9,3M
inodes) used. It's mounted 'defaults,noatime'. It sits on a hardware
RAID array thru plain LVM slices. The RAID array is a RAID5 running on
5x SATA 500G disks, with a battery-backed (RAM) cache and write-back
cache policy. To be precise, it's an Areca 1231.

  The hardware RAID array uses 64kB stripes and I've configured the
filesystem with 4kB blocks and stride=16. It also has 0 reserved blocks.
In other words the fs was created with 'mkfs -t ext4 -E stride=16 -m 0
-L volname /dev/vgX/Y'. I'm attaching the mke2fs.conf for reference too.
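
The stride value can be derived as below. This sketch assumes the "64kB stripes" figure is the per-disk chunk size. The stripe_width value is shown only for illustration, since it was not passed to mkfs in the command above.

```python
def raid_layout(chunk_kib, block_kib, disks, parity_disks):
    """Return (stride, stripe_width), both in filesystem blocks."""
    stride = chunk_kib // block_kib
    data_disks = disks - parity_disks
    return stride, stride * data_disks


# RAID5 on 5 disks: one disk's worth of parity, 4 data disks per stripe.
stride, stripe_width = raid_layout(64, 4, 5, 1)
print("stride:", stride, "stripe_width:", stripe_width)
```

For this array that gives stride=16, as used above, and a stripe width of 64 blocks.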

  Everything is running with Debian Squeeze and its 2.6.32 kernel (amd64
flavour), on a 4 cores and 4 GB RAM server.

  I ran a tiobench tonight on an idle instance (I have two identical
systems - hw, sw, data - with exactly the same problem). I've attached
results as plain text to protect them from line wrapping. They look fine
to me.

  When I try to backup the problematic filesystem with tar, rsync or
whatever tool traversing the whole filesystem, things are awful. I know
that this filesystem has *lots* of directories, most with few or no
files in them. Tonight I ran a simple 'find /path/to/vol -type d |pv
-bl' (counts directories as they are found), I stopped it more than 2
hours later : it was not done, and already counted more than 2M
directories. IO stats showed 1000 read calls/sec with avq=1 and avio=5
ms. CPU is 2% so it is totally I/O bound. This looks like the worst
random read case to me.

  I even tried a hack which tries to sort directories while traversing
the filesystem to no avail.

  Right now I don't even know how to analyze my filesystem further.
Sorry for not being able to describe it more accurately. I'm looking
for any advice or direction to improve this situation, while continuing
to use ext4 of course :).

  PS: I did ask the developers not to abuse the filesystem that way,
and that in 2013 it's okay to have 10k+ files per directory... No
success, so I guess I'll have to work around it.

filer:/srv/painfulvol/bench# tiobench --size 10000
Run #1: /usr/bin/tiotest -t 8 -f 1250 -r 500 -b 4096 -d . -T

Unit information
================
File size = megabytes
Blk Size  = bytes
Rate      = megabytes per second
CPU%      = percentage of CPU used during the test
Latency   = milliseconds
Lat%      = percent of requests that took longer than X seconds
CPU Eff   = Rate divided by CPU% - throughput per cpu load

Sequential Reads
                              File  Blk   Num                   Avg      Maximum      Lat%     Lat%    CPU
Identifier                    Size  Size  Thr   Rate  (CPU%)  Latency    Latency      &amp;gt;2s      &amp;gt;10s    Eff
---------------------------- ------ ----- ---  ------ ------ --------- -----------  -------- -------- -----
2.6.32-5-amd64               10000  4096    1  215.82 42.21%     0.017     1384.29   0.00000  0.00000   511
2.6.32-5-amd64               10000  4096    2  129.51 48.53%     0.057     5115.46   0.00020  0.00000   267
2.6.32-5-amd64               10000  4096    4   89.80 66.26%     0.168     6697.64   0.00043  0.00000   136
2.6.32-5-amd64               10000  4096    8   77.11 113.3%     0.394     6750.12   0.00102  0.00000    68

Random Reads
                              File  Blk   Num                   Avg      Maximum      Lat%     Lat%    CPU
Identifier                    Size  Size  Thr   Rate  (CPU%)  Latency    Latency      &amp;gt;2s      &amp;gt;10s    Eff
---------------------------- ------ ----- ---  ------ ------ --------- -----------  -------- -------- -----
2.6.32-5-amd64               10000  4096    1    0.79 0.302%     4.951       58.56   0.00000  0.00000   260
2.6.32-5-amd64               10000  4096    2    0.41 0.328%    17.165      174.55   0.00000  0.00000   126
2.6.32-5-amd64               10000  4096    4    0.80 1.024%    18.848      358.64   0.00000  0.00000    78
2.6.32-5-amd64               10000  4096    8    0.82 1.801%    35.989      808.74   0.00000  0.00000    45

Sequential Writes
                              File  Blk   Num                   Avg      Maximum      Lat%     Lat%    CPU
Identifier                    Size  Size  Thr   Rate  (CPU%)  Latency    Latency      &amp;gt;2s      &amp;gt;10s    Eff
---------------------------- ------ ----- ---  ------ ------ --------- -----------  -------- -------- -----
2.6.32-5-amd64               10000  4096    1  243.70 78.53%     0.014      492.80   0.00000  0.00000   310
2.6.32-5-amd64               10000  4096    2  186.89 150.9%     0.037     1969.62   0.00000  0.00000   124
2.6.32-5-amd64               10000  4096    4  113.90 209.8%     0.122     6303.26   0.00137  0.00000    54
2.6.32-5-amd64               10000  4096    8   88.32 336.6%     0.307     9451.83   0.00285  0.00000    26

Random Writes
                              File  Blk   Num                   Avg      Maximum      Lat%     Lat%    CPU
Identifier                    Size  Size  Thr   Rate  (CPU%)  Latency    Latency      &amp;gt;2s      &amp;gt;10s    Eff
---------------------------- ------ ----- ---  ------ ------ --------- -----------  -------- -------- -----
2.6.32-5-amd64               10000  4096    1  107.11 101.4%     0.009        0.06   0.00000  0.00000   106
2.6.32-5-amd64               10000  4096    2  173.32 337.2%     0.010        0.04   0.00000  0.00000    51
2.6.32-5-amd64               10000  4096    4  224.92 921.3%     0.011        0.76   0.00000  0.00000    24
2.6.32-5-amd64               10000  4096    8  206.05 1598.%     0.012        1.00   0.00000  0.00000    13
[defaults]
base_features = sparse_super,filetype,resize_inode,dir_index,ext_attr
blocksize = 4096
inode_size = 256
inode_ratio = 16384

[fs_types]
ext3 = {
features = has_journal
}
ext4 = {
features = has_journal,extent,huge_file,flex_bg,uninit_bg,dir_nlink,extra_isize
inode_size = 256
}
ext4dev = {
features = has_journal,extent,huge_file,flex_bg,uninit_bg,dir_nlink,extra_isize
inode_size = 256
options = test_fs=1
}
small = {
blocksize = 1024
inode_size = 128
inode_ratio = 4096
}
floppy = {
blocksize = 1024
inode_size = 128
inode_ratio = 8192
}
news = {
inode_ratio = 4096
}
largefile = {
inode_ratio = 1048576
blocksize = -1
}
largefile4 = {
inode_ratio = 4194304
blocksize = -1
}
hurd = {
     blocksize = 4096
     inode_size = 128
}
&lt;/pre&gt;</description>
    <dc:creator>Vincent Caron</dc:creator>
    <dc:date>2013-03-12T23:56:15</dc:date>
  </item>
  <textinput rdf:about="http://search.gmane.org/?group=$group=gmane.comp.file-systems.ext3.user">
    <title>Search Engine</title>
    <description>Search the mailing list at Gmane</description>
    <name>query</name>
    <link>http://search.gmane.org/?group=$group=gmane.comp.file-systems.ext3.user</link>
  </textinput>
</rdf:RDF>
