
Two Files with Half the Content Each vs. One File with All the Content: Which Takes More Space?

Source: https://www.devze.com, 2022-12-25 05:30 (from the web)

If I have 2 files each with this:

"Hello World" (x 1000)

Does that take up more space than 1 file with this:

"Hello World" (x 2000)

What are the drawbacks of dividing content into multiple smaller files (assuming there's reason to divide them into more files, not like this example)?

Update:

I'm using a MacBook Pro running Mac OS X 10.5. But I'd also like to know about Ubuntu Linux.


Marcelos gives the general performance case. I'd argue that worrying about this is premature optimization: you should split things into different files where it is logical to split them.

Also, if you really care about the file size of such repetitive files, you can compress them. Your example even hints at this: a simple run-length encoding of

"Hello World"x1000

is much more space-efficient than actually having "Hello World" written out 1000 times.
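The effect is easy to demonstrate with zlib from Python's standard library (a general-purpose compressor rather than a literal run-length encoder, but it exploits the same repetition):

```python
# Sketch: highly repetitive data compresses down to the pattern plus
# back-references, rather than 11,000 literal bytes.
import zlib

data = b"Hello World" * 1000        # 11,000 bytes written out in full
compressed = zlib.compress(data)    # deflate spots the repetition

print(len(data), len(compressed))   # compressed form is a tiny fraction
assert len(compressed) < len(data)
```

The same idea applies to the two-file case: two compressed copies of the repetitive content cost little more than one.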


Files take up space in the form of clusters on the disk. A cluster is a number of sectors, and the size depends on how the disk was formatted.

A typical size for clusters is 8 kilobytes. That would mean that the two smaller files would use two clusters (16 kilobytes) each, 32 kilobytes in total, while the single larger file would use three clusters (24 kilobytes).

A file will on average use half a cluster more than its size. So with a cluster size of 8 kilobytes, each file will on average have an overhead of 4 kilobytes.
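That arithmetic can be sketched directly (assuming the 8-kilobyte cluster size above and "Hello World" at 11 bytes with no separators, so the real numbers will differ slightly):

```python
# Round a file's logical size up to whole clusters to get on-disk usage.
import math

CLUSTER = 8 * 1024  # bytes; actual value depends on how the disk was formatted

def on_disk_size(file_size):
    """Space actually consumed: size rounded up to a whole cluster."""
    return math.ceil(file_size / CLUSTER) * CLUSTER

content = len("Hello World") * 1000        # 11,000 bytes per small file
two_files = 2 * on_disk_size(content)      # 2 files x 2 clusters = 32 KB
one_file = on_disk_size(content * 2)       # 1 file x 3 clusters = 24 KB
print(two_files, one_file)
```

So under these assumptions the two-file layout costs one extra cluster.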


Most filesystems use a fixed-size cluster (4 kB is typical but not universal) for storing files. Files below this cluster size will all take up the same minimum amount.

Even above this size, the proportional wastage tends to be high when you have lots of small files. Ignoring skewness of size distribution (which makes things worse), the overall wastage is about half the cluster size times the number of files, so the fewer files you have for a given amount of data, the more efficiently you will store things.
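That half-a-cluster-per-file estimate can be written as a tiny helper (assuming a 4 kB cluster and file tails uniformly distributed within their last cluster, as the paragraph above does):

```python
# Expected wasted space: each file wastes, on average, half a cluster.
def expected_waste(num_files, cluster_size=4096):
    """Rough total slack space in bytes across num_files files."""
    return num_files * cluster_size // 2

print(expected_waste(10_000))  # roughly 20 MB wasted across 10,000 small files
```

Halving the file count for the same data halves this overhead, which is the point made above.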

Another consideration is that metadata operations, especially file deletion, can be very expensive, so again smaller files aren't your friends. Some interesting work was done in ReiserFS on this front until the author was jailed for murdering his wife (I don't know the current state of that project).

If you have the option, you can also tune the file sizes to always fill up a whole number of clusters, and then small files won't be a problem. This is usually too finicky to be worth it though, and there are other costs. For high-volume throughput, the optimal file size these days is between 64 MB and 256 MB (I think).

Practical advice: Stick your stuff in a database unless there are good reasons not to. SQLite substantially reduces the number of reasons.
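A minimal sketch of that approach using Python's built-in sqlite3 module, storing many small "files" as rows in a single database file (table and column names here are illustrative, not prescribed by the answer):

```python
# One SQLite file holds thousands of small blobs with no per-file
# cluster rounding or directory-entry overhead.
import sqlite3

conn = sqlite3.connect(":memory:")  # use a real path for persistence
conn.execute("CREATE TABLE blobs (name TEXT PRIMARY KEY, data BLOB)")

payload = b"Hello World" * 1000
conn.executemany(
    "INSERT INTO blobs VALUES (?, ?)",
    [(f"file_{i}", payload) for i in range(100)],
)
conn.commit()

row = conn.execute("SELECT data FROM blobs WHERE name = 'file_7'").fetchone()
assert row[0] == payload
```

Lookup by name replaces a directory traversal, and deletion is a single SQL statement rather than a metadata operation per file.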


I think how the file(s) will be used should also be taken into consideration, along with the API and the language used to read/write them (and hence any API restrictions). Disk fragmentation, which tends to decrease when there are only big files, will penalize data access if you're reading one big file in one shot, whereas several accesses to small files, spaced out over time, will not be penalized by fragmentation.


Most filesystems allocate space in units larger than a byte (typically 4KB nowadays). Effective file sizes get "rounded up" to the next multiple of that "cluster size". Therefore, dividing up a file will almost always consume more total space. And of course there's one extra entry in the directory, which may cause it to consume more space, and many file systems have an extra intermediate layer of inodes where each file consumes one entry.

What are the drawbacks of dividing content into multiple smaller files (assuming there's reason to divide them into more files, not like this example)?

  • More wasted space
  • The possibility of running out of inodes (in extreme cases)
  • On some filesystems: very bad performance when directories contain many files (because they're effectively unordered lists)
  • Content in a single file can usually be read sequentially (i.e. without having to move the read/write head) from the HD, which is the most efficient way. When it spans multiple files, this ideal case becomes much less likely.
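On Mac OS X or Ubuntu you can observe the cluster rounding described above directly: in the result of `stat()`, `st_size` is the logical length while `st_blocks` counts 512-byte blocks actually allocated (a sketch; the exact allocated figure depends on the filesystem):

```python
# Compare a file's logical size with the space the filesystem allocated.
import os
import tempfile

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"Hello World" * 1000)  # 11,000 bytes of content
    path = f.name

st = os.stat(path)
print("logical size:", st.st_size)               # 11000 bytes
print("allocated on disk:", st.st_blocks * 512)  # rounded up by the filesystem
os.unlink(path)
```

Running this for two half-size files versus one full-size file shows the extra partially-filled cluster the smaller files incur.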
