Best way to store many files on disk


I couldn't find a good title for the question; this is what I'm trying to do:

  • This is a .NET application.
  • I need to store up to 200,000 objects (between 3 KB and 500 KB each)
  • I need to store about 10 of them per second from multiple threads
  • I use binary serialization before storing them
  • I need to access them later on by a unique integer ID

What's the best way to do this?

  • I can't keep them in memory, as I'll get out-of-memory exceptions
  • When I store them on disk as separate files, what are the possible performance issues? Would it decrease overall performance much?
  • Shall I implement some sort of caching, for example combining 100 objects and writing them out as one file, then parsing them later on? Or something similar?
  • Shall I use a database? (Access time is not important; there won't be any searching, and I'll access each object only a couple of times by its known unique ID.) In theory I don't need a database, and I don't want to complicate this.

UPDATE:

  • I assume a database would be slower than the file system; prove me wrong if you have something on that. That's why I'm also leaning towards the file system. What I'm truly worried about is writing 200 KB * 10 per second to the HDD (this can be any HDD; I don't control the hardware, as it's a desktop tool that will be deployed on different systems).
  • If I use the file system, I'll store the files in separate folders to avoid file-system-related issues (so you can ignore that limitation)


If you want to avoid using a database, you can store them as files on disk (to keep things simple). But you need to be aware of filesystem considerations when maintaining a large number of files in a single directory.

A lot of common filesystems maintain their files per directory in some kind of sequential list (e.g., simply storing file pointers or inodes one after the other, or in linked lists). This makes opening files located at the bottom of the list really slow.

A good solution is to limit your directory to a small number of nodes (say n = 1000), and create a tree of files under the directory.

So instead of storing files as:

  /dir/file1
  /dir/file2
  /dir/file3
  ...
  /dir/fileN

Store them as:

  /dir/r1/s2/file1
  /dir/r1/s2/file2
  ...
  /dir/rM/sN/fileP

By splitting up your files this way, you improve access time significantly across most file systems.

(Note that some newer filesystems index directory entries with trees or other structures; this technique works well on those too.)
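As an illustration (not from the original answer), here is a minimal C# sketch of mapping an integer ID onto such a tree so that no leaf directory holds more than 1000 files. The root folder and the .bin extension are placeholders.

using System.IO;

// Minimal sketch of the id-to-path mapping described above. The two
// directory levels keep every leaf directory at no more than 1000 files.
static class BlobPath
{
    private const string Root = @"C:\data\blobs";   // placeholder root folder

    public static string For(int id)
    {
        // e.g. id 1234567 -> C:\data\blobs\1\234\1234567.bin
        string dir = Path.Combine(
            Root,
            (id / 1000000).ToString(),
            (id / 1000 % 1000).ToString());
        Directory.CreateDirectory(dir);   // no-op if the directory already exists
        return Path.Combine(dir, id + ".bin");
    }
}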

Other considerations are tuning your filesystem (block sizes, partitioning etc.) and your buffer cache such that you get good locality of data. Depending on your OS and filesystem, there are many ways to do this - you'll probably need to look them up.

Alternatively, if this doesn't cut it, you can use some kind of embedded database like SQLite or Firebird.

HTH.


I would be tempted to use a database; in C++ it would be either SQLite or CouchDB.
These would both work in .NET, but I don't know if there is a better .NET-specific alternative.

Even on filesystems that can handle 200,000 files in a directory, it will take forever to open the directory.

Edit - The DB will probably be faster!
The filesystem isn't designed for huge numbers of small objects; the DB is.
It will implement all sorts of clever caching/transaction strategies that you never thought of.

There are photo sites that chose the filesystem over a DB, but they are mostly doing reads on rather large blobs, and they have lots of admins who are experts in tuning their servers for this specific application.


I recommend making a class that has a single-threaded queue for dumping images (gzipped) onto the end of a file, and then saving the file offsets/meta-info into a small database like SQLite. This lets you store all of your files quickly and compactly from multiple threads, and read them back efficiently, without having to deal with any filesystem quirks (other than the maximum file size, which can be handled with some extra metadata).

File:
file.1.gzipack

Table:
compressed_files {
  id,
  storage_file_id,
  storage_offset,
  storage_compressed_length,
  mime_type,
  original_file_name
}
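A rough C# sketch of the append side of this design, assuming GZipStream for compression. The in-memory dictionary here is only for illustration; in the real design the offsets would be inserted into the compressed_files table above.

using System.Collections.Concurrent;
using System.IO;
using System.IO.Compression;

// Sketch of the "pack file" idea: gzip each blob and append it to one big
// file, remembering (offset, length) so it can be located again by id.
// The in-memory index stands in for the compressed_files table above.
class PackFileStore
{
    private readonly FileStream _pack;
    private readonly ConcurrentDictionary<int, (long Offset, int Length)> _index =
        new ConcurrentDictionary<int, (long Offset, int Length)>();
    private readonly object _writeLock = new object();   // one appender at a time

    public PackFileStore(string path)
    {
        _pack = new FileStream(path, FileMode.Append, FileAccess.Write);
    }

    public void Store(int id, byte[] data)
    {
        // Compress the blob into a memory buffer first.
        var buffer = new MemoryStream();
        using (var gzip = new GZipStream(buffer, CompressionMode.Compress, leaveOpen: true))
            gzip.Write(data, 0, data.Length);
        byte[] compressed = buffer.ToArray();

        // Serialize the appends; callers on any thread just queue up here.
        lock (_writeLock)
        {
            long offset = _pack.Position;
            _pack.Write(compressed, 0, compressed.Length);
            _pack.Flush();
            _index[id] = (offset, compressed.Length);
        }
    }
}

Reading back is the reverse: look up the offset and length by id, seek to that offset in a separate read-only handle on the pack file, and decompress.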


You can check out MongoDB; it supports storing files.


The only way to know for sure would be to know more about your usage scenario.

For instance, will later usage of the files need them in clusters of 100 files at a time? If it does, perhaps it would make sense to combine them.

In any case, I would try to make a simple solution to begin with, and only change it if you later on find that you have a performance problem.

Here's what I would do:

  1. Make a class that deals with the storage and retrieval (so that you can later change this class, rather than every point in your application that uses it)
  2. Store the files on disk as-is, don't combine them
  3. Spread them out over sub-directories, keeping 1000 or fewer files in each directory (directory access adds overhead if you have many files in a single directory); see the sketch below
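A rough C# sketch of points 1-3, where the bucket-by-id scheme is an assumption for illustration:

using System.IO;

// One class hides the storage strategy (point 1), objects are written as-is
// (point 2), and id % 1000 picks one of 1000 sub-folders, so 200,000 objects
// means roughly 200 files per directory (point 3).
class ObjectStore
{
    private readonly string _root;

    public ObjectStore(string root)
    {
        _root = root;
    }

    private string PathFor(int id)
    {
        string dir = Path.Combine(_root, (id % 1000).ToString());
        Directory.CreateDirectory(dir);               // no-op if it already exists
        return Path.Combine(dir, id + ".bin");
    }

    public void Save(int id, byte[] serialized)
    {
        File.WriteAllBytes(PathFor(id), serialized);  // store as-is
    }

    public byte[] Load(int id)
    {
        return File.ReadAllBytes(PathFor(id));
    }
}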


I actually don't use .NET so I'm not sure what is easy there, but in general I'd offer two pieces of advice.

If you need to write a lot and read rarely (e.g. log files), you should create a .zip file or the like (choose a compression level that doesn't slow down performance too much; in the 1-9 rating, 5 or so usually works for me). This gives you several advantages: you don't hit the filesystem so hard, your storage space is reduced, and you can naturally group files in blocks of 100 or 1000 or whatever.
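A hedged C# sketch of this idea using the built-in ZipArchive from System.IO.Compression. Note that .NET exposes a CompressionLevel enum rather than the 1-9 zlib scale (Fastest is roughly the low end), and the block-of-100 grouping is an assumption.

using System.IO;
using System.IO.Compression;

// Sketch of grouping blobs into .zip archives with the built-in ZipArchive.
static class ZipBlockStore
{
    public static void Append(string archiveDir, int id, byte[] data)
    {
        // One archive per block of 100 ids, e.g. block_0042.zip for ids 4200-4299.
        string archivePath = Path.Combine(archiveDir, $"block_{id / 100:D4}.zip");

        var mode = File.Exists(archivePath) ? ZipArchiveMode.Update : ZipArchiveMode.Create;
        using (var archive = ZipFile.Open(archivePath, mode))
        {
            var entry = archive.CreateEntry(id + ".bin", CompressionLevel.Fastest);
            using (var stream = entry.Open())
                stream.Write(data, 0, data.Length);
        }
    }
}

One caveat: ZipArchiveMode.Update holds the whole archive in memory until it is disposed, which is another reason to keep the blocks small.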

If you need to write a lot and read a lot, you could define your own flat file format (unless you have access to utilities to read and write .tar files or the like, or cheat and put binary data in an 8-bit grayscale TIFF). Define a fixed-size header record for each file, perhaps 1024 bytes each, containing the offset into the file, the file name, and anything else you need to store, and then write the data in chunks. When you need to read a chunk, you first read the headers (perhaps 100 KB of them) and then jump to the offset you need and read the amount that you need. The advantage of fixed-size header records is that you can write them out empty at the beginning, keep appending new data to the end of the file, and then go back and overwrite the corresponding record.
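A rough C# sketch of such a format, using the 1024-byte record size from the answer. The slot layout (8-byte offset, 4-byte length, then the name) and the maximum record count are assumptions, and locking for concurrent writers is left out.

using System.IO;
using System.Text;

// Sketch of the fixed-size-header flat file described above.
class FlatPack
{
    private const int SlotSize = 1024;
    private const int MaxRecords = 200_000;
    private const long DataStart = (long)SlotSize * MaxRecords;

    private readonly FileStream _file;

    public FlatPack(string path)
    {
        _file = new FileStream(path, FileMode.OpenOrCreate, FileAccess.ReadWrite);
        if (_file.Length < DataStart)
            _file.SetLength(DataStart);              // pre-allocate the empty header area
    }

    public void Write(int id, string name, byte[] data)
    {
        // Append the payload at the current end of the file.
        long offset = _file.Length;
        _file.Seek(offset, SeekOrigin.Begin);
        _file.Write(data, 0, data.Length);

        // Go back and overwrite this id's header record.
        _file.Seek((long)id * SlotSize, SeekOrigin.Begin);
        using (var w = new BinaryWriter(_file, Encoding.UTF8, leaveOpen: true))
        {
            w.Write(offset);
            w.Write(data.Length);
            w.Write(name);                           // must stay within the 1024-byte slot
        }
    }

    public byte[] Read(int id)
    {
        _file.Seek((long)id * SlotSize, SeekOrigin.Begin);
        long offset;
        int length;
        using (var r = new BinaryReader(_file, Encoding.UTF8, leaveOpen: true))
        {
            offset = r.ReadInt64();
            length = r.ReadInt32();
        }

        var buffer = new byte[length];
        _file.Seek(offset, SeekOrigin.Begin);
        int read = 0;
        while (read < length)
        {
            int n = _file.Read(buffer, read, length - read);
            if (n == 0) break;                       // unexpected end of file
            read += n;
        }
        return buffer;
    }
}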

Finally, you could possibly look into something like HDF5; I don't know what the .NET support for that is, but it's a good way to store generic data.


You might consider using Microsoft's Caching Application Block. You can configure it to use IsolatedStorage as a backing store, so items in the cache will be serialized to disk. Performance could be a problem - I think that out of the box it blocks on writes, so you might need to tweak it to do async writes instead.


In your case, memcached may cover some of the performance problems.
