Sunday, November 1, 2009

Effects of too many files in a directory

How to efficiently process a large directory

in reference to:

"In addition to the while suggestion above, depending on the filesystem you use, your suggestion to hash the directories can be a good one, especially if the system has to read multiple disk blocks to find the file you are looking for.

The same concept has been applied to mail spools (qmail), and suggested to help speed up access to home directories on hosts with large numbers of "users".

With the number of files you are considering, you probably want to consider something along the lines of NUMFILES < SPLIT ** DEPTH where SPLIT is the number of subdirectories that can fit in one disk block, and DEPTH is how deep your directory structure should go. Once you get to the point where NUMFILES is larger, then you start needing multiple directory reads to find the file you need to open.

Add this to the while suggestion above, and you should be able to access each individual file (such as an open() call) as quickly as the OS can handle it.

Of course, this is all IIRC, and it has been a while since I have applied this in my studies.

Now this is all based on older file systems (inode, UFS style, chained directory block style, etc). The newer filesystems (btree, reiser?) may not have this "problem" anymore."
- Re: Efficient processing of large directory
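
To make the quoted rule of thumb concrete, here is a minimal Perl sketch (not from the original post) that picks a directory depth from the NUMFILES < SPLIT ** DEPTH criterion. The $numfiles and $split values are illustrative assumptions; SPLIT, the number of directory entries that fit in one disk block, is something you would measure for your own filesystem.

#!/usr/bin/perl
# Sketch of the NUMFILES < SPLIT ** DEPTH rule of thumb; the values
# of $numfiles and $split below are illustrative assumptions only.
use strict;
use warnings;
use POSIX qw( ceil );

my $numfiles = 5_000_000;   # files you expect to store (example value)
my $split    = 128;         # assumed directory entries per disk block

# Smallest DEPTH satisfying NUMFILES < SPLIT ** DEPTH
my $depth = ceil( log( $numfiles ) / log( $split ) );
$depth++ if $split ** $depth <= $numfiles;   # guard against exact powers

printf "%d files, %d entries per block => %d directory levels\n",
    $numfiles, $split, $depth;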

Storing a graph with millions of nodes as files

A nice way of using an MD5 hash to distribute files across a directory hierarchy.

in reference to:

"I concur with rhesa's method. I've used this before with considerable success.

I just untarred a test structure containing a million files distributed this way using 3 levels of subdirectory to give an average of ~250 files per directory. I then ran a quick test of opening and reading 10,000 files at random and got an average time to locate, open, read and close each file of 12ms.

#! perl -slw
use strict;
use Math::Random::MT qw[ rand ];
use Digest::MD5 qw[ md5_hex ];
use Benchmark::Timer;

our $SAMPLE ||= 1000;

my $T = Benchmark::Timer->new;

for my $i ( 1 .. $SAMPLE ) {
    $T->start( 'encode/open/read/close' );

    # Pick a random key and derive its hashed path: the first three
    # hex digits of the MD5 name the three directory levels.
    my $md5 = md5_hex( int( rand 1e6 ) );
    my( $a, $b, $c ) = unpack 'AAA', $md5;

    $T->start( 'open' );
    open my $fh, '<', "fs/$a/$b/$c/$md5.dat" or warn "fs/$a/$b/$c/$md5 : $!";
    $T->stop( 'open' );

    $T->start( 'read' );
    my $data = do{ local $/; <$fh> };    # slurp the whole file
    $T->stop( 'read' );

    $T->start( 'close' );
    close $fh;
    $T->stop( 'close' );

    $T->stop( 'encode/open/read/close' );
}

$T->report;
__END__
c:\test>612729-r -SAMPLE=10000
10000 trials of encode/open/read/close (112.397s total), 11.240ms/trial
10000 trials of open (110.562s total), 11.056ms/trial
10000 trials of read (158.554ms total), 15us/trial
10000 trials of close (365.520ms total), 36us/trial


The files in this case are all 4k, but that doesn't affect your seek time. If you envisage needing to deal with much more than 1 million files, moving to 4 levels of hierarchy would distribute the million files at just ~15 per directory."
- Perl solution for storage of large number of small files
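
The quoted benchmark only reads files back. Below is a minimal write-side sketch under the same MD5 scheme, assuming the fs/ root, the .dat suffix, and the store_record() helper name (none of which come from the original post). It hashes a key, uses the leading hex digits as directory levels, and creates the path on demand.

#!/usr/bin/perl
# Write-side sketch (not the original poster's code): file a record
# under fs/ using the leading hex digits of its MD5 as the path.
# The key scheme and the default depth of 3 mirror the quoted test
# and are assumptions, not part of the original.
use strict;
use warnings;
use Digest::MD5 qw( md5_hex );
use File::Path  qw( make_path );

sub store_record {
    my ( $key, $data, $depth ) = @_;
    $depth ||= 3;    # 3 hex-digit levels ~ 250 files/dir for 1e6 files

    my $md5  = md5_hex( $key );
    my @lvls = ( split //, $md5 )[ 0 .. $depth - 1 ];  # one hex digit per level
    my $dir  = join '/', 'fs', @lvls;

    make_path( $dir ) unless -d $dir;    # create fs/a/b/c on demand
    open my $fh, '>', "$dir/$md5.dat" or die "$dir/$md5.dat: $!";
    print {$fh} $data;
    close $fh;

    return "$dir/$md5.dat";
}

# Example: store one graph node keyed by its numeric id.
print store_record( 12345, "node payload\n" ), "\n";

Raising $depth to 4 spreads a million files at roughly 10**6 / 16**4, i.e. about 15 per directory, which is the figure the quoted post ends on.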