What's the fastest way to get directory and subdirs size on unix using Perl?

I am using the Perl stat() function to get the size of a directory and its subdirectories. I have a list of about 20 parent directories, each with a few thousand recursive subdirectories, and every subdirectory holds a few hundred records. The main computing part of the script looks like this:

sub getDirSize {
my $dirSize = 0;
my @dirContent = <*>;

my $sizeOfFilesInDir = 0;
foreach my $dirContent (@dirContent) {
   if (-f $dirContent) {
        my $size = (stat($dirContent))[7];
        $dirSize += $size;
   } elsif (-d $dirContent) {
        $dirSize += getDirSize($dirContent);
   } 
}
return $dirSize;
}

The script has been running for more than an hour, and I want to make it faster.

I tried the shell du command, but its output (converted to bytes) is not accurate, because du reports allocated disk blocks rather than the sum of file sizes. It is also quite time-consuming. I am working on HP-UX 11i v1.


With some help from sfink and samtregar on perlmonks, try this one out:

#!/usr/bin/perl
use warnings;
use strict;
use File::Find;
my $size = 0;
find( sub { $size += -f $_ ? -s _ : 0 }, shift(@ARGV) );
print $size, "\n";

Here we recurse through all subdirectories of the specified directory, getting the size of each file, and we reuse the stat data from the -f file test by passing the special '_' filehandle to the -s size test.

I tend to believe that du would be reliable enough though.
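
For anyone who wants to go that route, shelling out from Perl is short. A minimal sketch, assuming a du that supports the POSIX -s and -k options (remember du counts allocated blocks, which is why its totals differ from stat-based sums):

#!/usr/bin/perl
use strict;
use warnings;

# du -sk prints "<kilobytes>\t<path>"; -s and -k are POSIX options.
# Note du reports allocated blocks, not the sum of file sizes.
my $dir = shift(@ARGV) || '.';
my ($kb) = split ' ', `du -sk $dir`;
print $kb * 1024, " bytes (approximate)\n";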


I once faced a similar problem, and used a parallelization approach to speed it up. Since you have ~20 top-tier directories, this might be a pretty straightforward approach for you to try. Split your top-tier directories into several groups (how many groups is best is an empirical question), call fork() a few times and analyze directory sizes in the child processes. At the end of the child processes, write out your results to some temporary files. When all the children are done, read the results out of the files and process them.
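
A minimal sketch of that approach; the group count, round-robin split, and temp-file naming are all arbitrary choices, and getDirSize() here is just the File::Find idea from above:

#!/usr/bin/perl
use strict;
use warnings;
use File::Find;

# Sum the file sizes under one tree (same File::Find idea as above).
sub getDirSize {
    my $dir  = shift;
    my $size = 0;
    find( sub { $size += -s _ if -f $_ }, $dir );
    return $size;
}

my @top_dirs = @ARGV;    # the ~20 parent directories
my $groups   = 4;        # number of child processes; tune empirically

# Deal the directories into buckets, round-robin.
my @buckets = map { [] } 1 .. $groups;
push @{ $buckets[ $_ % $groups ] }, $top_dirs[$_] for 0 .. $#top_dirs;

my @tmpfiles;
for my $i ( 0 .. $#buckets ) {
    my $tmp = "/tmp/dirsize.$$.$i";    # hypothetical temp-file naming
    push @tmpfiles, $tmp;
    defined( my $pid = fork() ) or die "fork failed: $!";
    if ( $pid == 0 ) {                 # child: size its bucket and exit
        my $total = 0;
        $total += getDirSize($_) for @{ $buckets[$i] };
        open( my $fh, '>', $tmp ) or die "cannot write $tmp: $!";
        print $fh "$total\n";
        close $fh;
        exit 0;
    }
}
wait() for 0 .. $#buckets;             # wait for every child

# Collect the per-group totals.
my $grand_total = 0;
for my $tmp (@tmpfiles) {
    open( my $fh, '<', $tmp ) or next;
    chomp( my $n = <$fh> );
    $grand_total += $n;
    close $fh;
    unlink $tmp;
}
print "total: $grand_total bytes\n";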


Whenever you want to speed something up, your first task is to find out what's slow. Use a profiler such as Devel::NYTProf to analyze the program and find out where you should concentrate your efforts.
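
With Devel::NYTProf that takes two commands (dirsize.pl is a placeholder name for your script):

perl -d:NYTProf dirsize.pl /some/dir
nytprofhtml    # writes an HTML report under ./nytprof/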

In addition to reusing that data from the last stat, I'd get rid of the recursion since Perl is horrible at it. I'd construct a stack (or a queue) and work on that until there is nothing left to process.
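
A minimal sketch of that iterative version; popping from the stack makes it depth-first, shifting instead would make it a breadth-first queue:

#!/usr/bin/perl
use strict;
use warnings;

# Iterative directory-size walk: an explicit stack replaces the recursion.
my $size  = 0;
my @stack = ( shift(@ARGV) || '.' );

while (@stack) {
    my $dir = pop @stack;              # pop = depth-first; shift = breadth-first
    opendir( my $dh, $dir ) or next;   # skip unreadable directories
    for my $entry ( readdir $dh ) {
        next if $entry eq '.' || $entry eq '..';
        my $path = "$dir/$entry";
        stat($path) or next;           # one stat(); '_' reuses it below
        if    ( -f _ ) { $size += -s _ }
        elsif ( -d _ ) { push @stack, $path }
    }
    closedir $dh;
}
print "$size\n";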


Below is another variant of getDirSize() which doesn't require a reference to a variable holding the running total, and accepts a parameter indicating whether sub-directories should be included:

#!/usr/bin/perl
use strict;
use warnings;

print 'Size (without sub-directories): ' . getDirSize(".") . " bytes\n";
print 'Size (incl. sub-directories): ' . getDirSize(".", 1) . " bytes\n";

sub getDirSize
# Returns the size in bytes of the files in a given directory and, optionally, its sub-directories
# Parameters:
#   $dirPath (string): the path to the directory to examine
#   $subDirs (optional boolean): FALSE (or missing) = consider only the files in $dirPath, TRUE = include also sub-directories
# Returns:
#   $size (int): the size of the directory's contents
{
  my ($dirPath, $subDirs) = @_;  # Get the parameters

  my $size = 0;

  opendir(my $DH, $dirPath) or return 0;  # skip unreadable directories
  foreach my $dirEntry (readdir($DH))
  {
    stat("${dirPath}/${dirEntry}");  # Stat once and then refer to "_"
    if (-f _)
    {
     # This is a file
     $size += -s _;
    }
    elsif (-d _)
    {
     # This is a sub-directory: add the size of its contents
     $size += getDirSize("${dirPath}/${dirEntry}", 1) if ($subDirs && ($dirEntry ne '.') && ($dirEntry ne '..'));
    } 
  }
  closedir($DH);

  return $size;
}


I see a couple of problems. First, @dirContent is explicitly set to <*>, which globs the current working directory; since the function never changes directory, every recursive call sees the same entries, and the result is an infinite loop, at least until you exhaust the stack. Secondly, there is special filehandle notation for retrieving information from a stat call: the underscore (_). See: http://perldoc.perl.org/functions/stat.html. Your code as-is calls stat three times for essentially the same information (-f, stat, and -d). Since file I/O is expensive, what you really want is to call stat once and then reference the data using "_". Here is some sample code that I believe accomplishes what you are trying to do:

#!/usr/bin/perl
use strict;
use warnings;

my $size = 0;
getDirSize(".",\$size);

print "Size: $size\n";

sub getDirSize {
  my $dir  = shift;
  my $size = shift;

  opendir(D, $dir) or die "cannot open $dir: $!";
  foreach my $dirContent (grep { $_ ne '.' && $_ ne '..' } readdir(D)) {
     stat("$dir/$dirContent");
     if (-f _) {
       $$size += -s _;
     } elsif (-d _) {
       getDirSize("$dir/$dirContent",$size);
     } 
  }
  closedir(D);
}


Bigs' answer is good. I modified it slightly, as I wanted to get the sizes of all the folders under a given path on my Windows machine.

This is how I did it.

#!/usr/bin/perl
use strict;
use warnings;

my $dirname = "C:\\Users\\xxx\\Documents\\initial-docs";
opendir (my $DIR, $dirname) || die "Error while opening dir $dirname: $!\n";

my $dirCount = 0;
foreach my $dirFileName(sort readdir $DIR)
{

      next if $dirFileName eq '.' or $dirFileName eq '..';

      my $dirFullPath = "$dirname\\$dirFileName";
      #only check if its a dir and skip files
      if (-d $dirFullPath )
      {
          $dirCount++;
          my $dirSize = getDirSize($dirFullPath, 1); #bytes
          my $dirSizeKB = $dirSize/1000;
          my $dirSizeMB = $dirSizeKB/1000;
          my $dirSizeGB = $dirSizeMB/1000;
          print("$dirCount - dir-name: $dirFileName  - Size: $dirSizeMB (MB) ... \n");

      }   
}

print "folders in $dirname: $dirCount ...\n";

sub getDirSize
{
  my ($dirPath, $subDirs) = @_;  # Get the parameters

  my $size = 0;

  opendir(my $DH, $dirPath) or return 0;  # skip unreadable directories
  foreach my $dirEntry (readdir($DH))
  {
    stat("${dirPath}/${dirEntry}");  # Stat once and then refer to "_"
    if (-f _)
    {
     # This is a file
     $size += -s _;
    }
    elsif (-d _)
    {
     # This is a sub-directory: add the size of its contents
     $size += getDirSize("${dirPath}/${dirEntry}", 1) if ($subDirs && ($dirEntry ne '.') && ($dirEntry ne '..'));
    } 
  }
  closedir($DH);

  return $size;
}
1;

OUTPUT:

1 - dir-name: acct-requests  - Size: 0.458696 (MB) ...
2 - dir-name: environments  - Size: 0.771527 (MB) ...
3 - dir-name: logins  - Size: 0.317982 (MB) ...
folders in C:\Users\xxx\Documents\initial-docs: 3 ...


If your main directory is overwhelmingly the largest consumer of directory and file inodes, then don't calculate it. Calculate the other half of the system and deduce the size of the rest from that (you can get used disk space from df in a couple of milliseconds). You might need to add a small 'fudge' factor to get to the same numbers. Also remember that if you calculate free space as root, you'll see some extra compared to other users (ext2/ext3 on Linux reserves 5% for root; I don't know about HP-UX).
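
A rough sketch of that idea; the mount point, the directory list, and the output column are all assumptions, and the Berkeley-style output comes from bdf on HP-UX (df -k on most other systems):

#!/usr/bin/perl
use strict;
use warnings;
use File::Find;

# Used kilobytes for the whole filesystem, from Berkeley-style output:
# Filesystem kbytes used avail %used Mounted on
my $mount = '/data';                      # hypothetical mount point
my @df    = `bdf $mount`;                 # on Linux: df -k $mount
my ($used_kb) = ( split ' ', $df[-1] )[2];

# Sizes of the cheap-to-scan directories outside the big one.
my @small_dirs  = ( '/data/a', '/data/b' );    # hypothetical
my $small_bytes = 0;
for my $dir (@small_dirs) {
    find( sub { $small_bytes += -s _ if -f $_ }, $dir );
}

# Whatever is left on the filesystem is (roughly) the big directory.
printf "big directory is roughly %.0f bytes\n",
    $used_kb * 1024 - $small_bytes;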
