Portable way to get file size (in bytes) in the shell


On Linux, I use stat --format="%s" FILE, but the Solaris machine I have access to doesn't have the stat command. What should I use then?

I'm writing Bash scripts and can't really install any new software on the system.

I've considered already using:

perl -e '@x=stat(shift);print $x[7]' FILE

or even:

ls -nl FILE | awk '{print $5}'

But neither of these looks sensible - running Perl just to get file size? Or running two programs to do the same?


wc -c < filename (short for word count; -c prints the byte count) is a portable, POSIX solution. The only caveat is that the output format may vary across platforms: some implementations prepend spaces (as Solaris does).

Do not omit the input redirection. When the file is passed as an argument, the file name is printed after the byte count.

I was worried it wouldn't work for binary files, but it works OK on both Linux and Solaris. You can try it with wc -c < /usr/bin/wc. Moreover, POSIX utilities are guaranteed to handle binary files, unless specified otherwise explicitly.
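
If you need the bare number in a variable (for example to compare sizes), a minimal sketch, assuming FILE is your file-name placeholder, could deal with the stray leading whitespace like this:

size=$(wc -c < "$FILE")    # the redirection keeps the file name out of the output
size=$((size + 0))         # arithmetic expansion strips any leading spaces (e.g. on Solaris)
echo "$FILE is $size bytes"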


I ended up writing my own program (really small) to display just the size. More information is in bfsize - print file size in bytes (and just that).

The two cleanest ways in my opinion with common Linux tools are:

stat -c %s /usr/bin/stat

50000


wc -c < /usr/bin/wc

36912

But I just don't want to be typing parameters or piping the output just to get a file size, so I'm using my own bfsize.


Even though du usually prints disk usage and not actual data size, the GNU Core Utilities du can print a file's "apparent size" in bytes:

du -b FILE

But it won't work under BSD, Solaris, macOS, etc.
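
For what it's worth, the GNU du documentation describes -b as shorthand for --apparent-size --block-size=1, so the spelled-out form below is equivalent (again, GNU Core Utilities only):

du --apparent-size --block-size=1 FILE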


Finally I decided to use ls, and Bash array expansion:

TEMP=( $( ls -ln FILE ) )
SIZE=${TEMP[4]}

It's not really nice, but at least it does only one fork+execve, and it doesn't rely on a secondary programming language (Perl, Ruby, Python, or whatever).


BSD systems have stat with different options from the GNU Core Utilities one, but with similar capabilities.

stat -f %z <file name>

This works on macOS (tested on 10.12), FreeBSD, NetBSD and OpenBSD.
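
If a script has to run on both flavors, one hedged approach (assuming one of the two stat variants is present, which is not the case on the older Solaris from the question) is to try the GNU syntax first and fall back to the BSD one:

filesize() { stat -c %s "$1" 2>/dev/null || stat -f %z "$1"; }    # GNU stat first, then BSD stat
filesize /usr/bin/wc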


When processing ls -n output, as an alternative to poorly portable shell arrays, you can use the positional parameters, which form the only array and are the only local variables in the standard shell. Wrap the overwriting of the positional parameters in a function to preserve the original arguments to your script or function.

getsize() { set -- $(ls -dn "$1") && echo $5; }
getsize FILE

This splits the output of ls -dn according to the current IFS environment variable settings, assigns it to the positional parameters, and echoes the fifth one. The -d ensures directories are handled properly and the -n ensures that user and group names do not need to be resolved, unlike with -l. Also, user and group names containing whitespace could theoretically break the expected line structure; they are usually disallowed, but this possibility still makes the programmer stop and think.


The fastest cross-platform solution (it only uses a single fork() for ls, doesn't attempt to count actual characters, and doesn't spawn unneeded awk, perl, etc.).

It was tested on Mac OS X and Linux. It may require minor modification for Solaris:

__ln=( $( ls -Lon "$1" ) )
__size=${__ln[3]}
echo "Size is: $__size bytes"

If required, simplify ls arguments, and adjust the offset in ${__ln[3]}.

Note: It will follow symbolic links.


If you use find from GNU fileutils:

size=$( find . -maxdepth 1 -type f -name filename -printf '%s' )

Unfortunately, other implementations of find usually don't support -maxdepth, nor -printf. This is the case for e.g. Solaris and macOS find.


You can use the find command to get some set of files (here temporary files are matched). Then you can use the du command to get the size of each file in a human-readable form using the -h switch.

find $HOME -type f -name "*~" -exec du -h {} \;

Output:

4.0K    /home/turing/Desktop/JavaExmp/TwoButtons.java~
4.0K    /home/turing/Desktop/JavaExmp/MyDrawPanel.java~
4.0K    /home/turing/Desktop/JavaExmp/Instream.java~
4.0K    /home/turing/Desktop/JavaExmp/RandomDemo.java~
4.0K    /home/turing/Desktop/JavaExmp/Buff.java~
4.0K    /home/turing/Desktop/JavaExmp/SimpleGui2.java~


Your first Perl example doesn't look unreasonable to me.

It's for reasons like this that I migrated from writing shell scripts (in Bash, sh, etc.) to writing all but the most trivial scripts in Perl. I found that I was having to launch Perl for particular requirements, and as I did that more and more, I realised that writing the scripts in Perl was probably a more powerful (in terms of the language and the wide array of libraries available via CPAN) and more efficient way to achieve what I wanted.

Note that other scripting languages (e.g., Python and Ruby) will no doubt have similar facilities, and you may want to evaluate those for your purposes. I only discuss Perl since that's the language I use and am familiar with.
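
As an aside, if you do reach for Perl, its -s file-test operator already returns the size in bytes, so the one-liner from the question can be shortened (FILE being the usual placeholder):

perl -e 'print -s shift' FILE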


I don't know how portable GNU Gawk's filefuncs extension is. The basic syntax is

time gawk -e '@load "filefuncs"; BEGIN {
     fnL[1] = ARGV[ARGC-1];
     fts(fnL, FTS_PHYSICAL, arr); print "";

     for (fn0 in arr) {
         print arr[fn0]["path"] \
           " :: "arr[fn0]["stat"]["size"]; };

     print ""; }' genieMV_204583_1.mp4


genieMV_204583_1.mp4 :: 259105690
real    0m0.013s


ls -Aln genieMV_204583_1.mp4

----------  1 501  20  259105690 Jan 25 09:31 genieMV_204583_1.mp4

That syntax allows checking multiple files at once. For a single file, it's

time gawk -e '@load "filefuncs"; BEGIN {
      stat(ARGV[ARGC-1], arr);
      printf("\n%s :: %s\n", arr["name"], \
           arr["size"]); }' genieMV_204583_1.mp4

   genieMV_204583_1.mp4 :: 259105690
   real    0m0.013s

It offers hardly any incremental savings, and admittedly it is slightly slower than stat straight up:

time stat -f '%z' genieMV_204583_1.mp4

259105690
real    0m0.006s (BSD-stat)


time gstat -c '%s' genieMV_204583_1.mp4

259105690
real    0m0.009s (GNU-stat)

And finally, a terse method of reading every single byte into an AWK array. This method works for binary files (whether the read happens up front in BEGIN or at the back in END makes no difference):

time mawk2 'BEGIN { RS = FS = "^$";
     FILENAME = ARGV[ARGC-1]; getline;
     print "\n" FILENAME " :: "length"\n"; }' genieMV_204583_1.mp4

genieMV_204583_1.mp4 :: 259105690
real    0m0.270s


time mawk2 'BEGIN { RS = FS = "^$";
   } END { print "\n" FILENAME " :: " \
     length "\n"; }'  genieMV_204583_1.mp4

genieMV_204583_1.mp4 :: 259105690
real    0m0.269s

But that's not the fastest way, because you're storing it all in RAM. The normal AWK paradigm operates on lines. The issue is that for binary files like MP4 files, if they don't end exactly on \n, the length + NR summing method would overcount by one. The code below is a catch-all of sorts: it explicitly uses the file's last one or two bytes as the record separator RS.

I found that the 2-byte method is much faster for binaries, while the 1-byte method suits typical text files that end with newlines. With binaries, the 1-byte RS may end up splitting rows far too often and slowing things down.

But we're close to nitpicking here, since it only took mawk2 0.95 seconds to read in every single byte of that 1.83 GB .txt file, so unless you're processing massive volumes, it's negligible.

Nonetheless, stat is still by far the fastest, as mentioned by others, since it's an OS filesystem call.

time mawk2 'BEGIN { FS = "^$";
    FILENAME = ARGV[ARGC-1];
    cmd = "tail -c 2 \""FILENAME"\"";
    cmd | getline XRS;
    close(cmd);

    RS = ( length(XRS) == 1 ) ? ORS : XRS ;

} { bytes += length } END {

    print FILENAME " :: "  bytes + NR * length(RS) }' genieMV_204583_1.mp4

        genieMV_204583_1.mp4 :: 259105690
        real    0m0.092s

        m23lyricsRTM_dict_15.txt :: 1961512986
        real    0m0.950s


ls -AlnFT "${m3t}" genieMV_204583_1.mp4

-rw-r--r--  1 501  20  1961512986 Mar 12 07:24:11 2021 m23lyricsRTM_dict_15.txt

-r--r--r--@ 1 501  20   259105690 Jan 25 09:31:43 2021 genieMV_204583_1.mp4

(The file permissions for the MP4 file were updated because the AWK method required it.)


I'd use ls for better speed instead of wc, which has to read the whole stream through the pipeline:

ls -l <filename> | cut -d ' ' -f5
  • This is in plain bytes

  • Use the flag --b M or --b G for output in megabytes or gigabytes (noted as not portable by @Andrew Henle in the comments).

BTW, if you're planning to go with du and cut:

du -b <filename> | cut -f -1
  • Use -h for human-readable output instead.

Or, with du and awk:

du -h <filename> | awk '{print $1}'

Or stat:

stat <filename> | grep Size: | awk '{print $2}'


If you have Perl on your Solaris, then use it. Otherwise, ls with AWK is your next best bet, since you don't have stat or your find is not GNU find.
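
For the ls-with-AWK route, a slightly hardened variant of the question's one-liner, borrowing -d and -n from the answers above so directories are listed themselves and no user/group name lookups happen, would be:

ls -dn FILE | awk '{print $5}'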


There is a trick I have used on Solaris. If you ask for the size of more than one file, it returns just the total size with no names, so include an empty file like /dev/null as the second file:

For example,

command fileyouwant /dev/null

I can't remember which size command this works for - ls, wc, etc. - unfortunately I don't have a Solaris box to test it.


On Linux you can use du -h $FILE. That may work on Solaris too.


Try du -ks | awk '{print $1*1024}'. That might just work.

