How Should I Implement a Huge but Simple Indexed StringList in Delphi?_问答_开发者

I am using Delphi 2009. I have a very simple data structure, with 2 fields:

A string that is the key field I need to retrieve by and is usually 4 to 15 characters in length.
A string that is the data field that may be any size, from 1 character up to say 10,000 characters.

The difficulty is that I may have several million of these records, so they may total as much or more than 10 GB in size. Obviously, I'm looking for an on-disk solution rather than an in-memory solution.

My program needs to randomly retrieve these records, based on the key field. That's that part that needs to be made as efficient as possible.

Should I use a database for such a simple structure, and if so, which database would be best to handle this and be simplest to implement?

Alternatively, is there a simple on-disk data structure not requiring a full-blown database that would work just as well?

Well, all I needed was the one answer to kick me back into reality. I was looking for something simpler than even a simple database. But when the no-duh answer is to use a database, then I realize I've already answered this question with my own answer to another question: Best database for small applications and tools.

My answer was DISQLite3 for the reasons I specified there. And that's what I'll probably with for my implementation.

A few more good answers with some possibilities. That's great. I'll be able to try a few different methods to see what works best.

More contemplation, and I have had to change the accepted answer to the GpStructuredStorage solution.

In my case, a million records totalling several Gigabytes will put a strain on a database structure. Specifically, the B* tree that is used to store the index in most databases is fast, but will slow down for some operations such as reindexing a million values.

The only thing you'll find faster than B* for an index is a hash table. And that is precisely what is provided in gabr's suggested addition to the GpStructuredStorage solution. I think it's quite elegant the way he segmented the hash value to give a 4 level directory structure.

The key reason why I can go to a hash solution is that I only need random access by the keys. I do not need to sort by the keys. If sorting was needed, then the gains of the speed of the hash table would be lost and the database system would be a no-brain winner.

When I get down to implementing this, I should do a comparison of this technique versus a database. Maybe I'll compare with both Firebird and SQLite which would both be worthy opponents.

One more followup:

I just discovered Synopse Big Table by A. Bouchez which is designed for speed and 开发者_JAVA百科meets the specs of my question almost precisely. I'll be trying it out first when I do my implementation in a few months and will report back here with my results.

A much later followup (July 2015)

I never did get to try Synopse Big Table. I've stuck with my B* tree up to now. But now I've upgraded to Delphi XE8 and plan to go with the database solution using FireDAC with SQLite.

For more than 10GB in data, a database is exactly what you need. It will handle indexing for rapidly locating data (your random retrieval), the functionality for adding, modifying, and deleting data, and the actual storage, as well as much more if you so choose.

There are dozens of posts here related to which databases are available for use in Delphi, including built-ins and FOS ones like Firebird.

why all the bragging? just use GpStructuredStorage(4 TB restriction and with little time invested in class you can go over), it will take you few hours to get used to it but it worths the time. Hope it helps

GpStructuredStorage can retrieve the file names extremely fast(I've tested it), you need to save each record as a file in GpStructuredStorage and retrieve each name as a string in a string list, 1 milion string names(because you mentioned about stringlist) needs a few MB in RAM which is not much, use a TStream descendant to write data to a file in GpStructuredStorage, I do not have time today to write a simple example, but on Saturday or Sunday I will write a tutorial for GPStructuredStorage on my blog.

_{[Added by gabr - I hope that will not be considered a terrible breach of netiquette. It's just that my thoughts don't fit in a comment (sizewise) and that it feels stupid to add another answer just to write this ...]}

While GpStructuredStorage can store loads of data, finding it may be a slow process. What I usually do in such cases is to create a hash of the key and convert it into hex (say 00FFA784). Then I convert this hex hash into folder structure (in this case it would be /00/FF/A7/84) and store relevant data in this folder, either as a file, as attributes or a combination of both.

This approach speeds data retrieval but slows down data insertions and is therefore recommended only for mostly static data. If the data is fairly dynamic I'd certainly recommend using database and not GpStructuredStorage.

You should analyse your data. If

a sizeable part of the data values is larger than the default file system block size,
you don't want to search in the data values using SQL (so it doesn't matter what format they are stored in), and
you really need random access over the whole database,

then you should test whether compressing your data values increases performance. The decompression of data values (especially on a modern machine with multiple cores, performed in background threads) should incur only a small performance hit, but the gains from having to read fewer blocks from the hard disc (especially if they are not in the cache) could be much larger.

But you need to measure, maybe the database engine stores compressed data anyway.

BerkleyDB is exactly that

Since your data is more than 3GB, you will need to make sure what ever database engine you select either handles tables that large, or split things up into multiple tables, which I would suggest doing no matter what the maximum size of a single table. If you perform the split, perform it as evenly as possible on a logical key break so that its easy to determine which table to use by the first or first two characters of the key. This will greatly reduce the search times by eliminating any records which could never match your query to start with.

If you just want raw performance, and will only be performing read only lookups into the data, then your better served by an ordered index file(s) using a fixed size record for your keys which points to your data file. You can then perform a binary search easily on this data and avoid any database overhead. For even more of a performance gain, you can pre-load/cache the midpoints into memory to reduce repetitive reads.

A simple fixed size record for your specs might look like:

type
  rIndexRec = record
    KeyStr  : String[15];  // short string 15 chars max
    DataLoc : integer;     // switch to int64 if your using gpHugeFile
  end;

For initial loading, use the Turbo Power sort found in the SysTools, which the latest version for Delphi 2009/2010 can be downloaded on the songbeamers website. The DataLoc would be the stream position of your datastring record, which writing/reading might look like the following:

function WriteDataString(aDataString:String;aStream:tStream):integer;
var
  aLen : integer;
begin
  Result := aStream.Position;
  aLen := Length(aDataString);
  aStream.Write(aLen,sizeOf(aLen));
  aStream.Write(aDataString[1],aLen*sizeOf(Char));
end;

function ReadDataString(aPos:Integer;aStream:tStream):String;
var
  aLen : integer;
begin
  if aStream.Position <> aPos then
    aStream.Seek(aPos,soFromBeginning);
  result := '';
  aStream.Read(aLen,SizeOf(aLen));
  SetLength(Result,aLen);
  if aStream.Read(Result[1],aLen*sizeOf(Char)) <> aLen*SizeOf(Char) then
    raise Exception.Create('Unable to read entire data string');
end;

When you are creating your index records, the dataloc would be set to datastring record position. It doesn't matter the order in which records are loaded, as long as the index records are sorted. I used just this technique to keep a 6 billion record database up to date with monthly updates, so it scales to the extreme easily.

EDIT: Yes, the code above is limited to around 2GB per datafile, but you can extend it by using gpHugeFile, or segmenting. I prefer the segmenting into multiple logical files < 2gb each, which will take up slightly less disk space.

If you more often need large datasets, and have some money to spare, simply stuff 16GB ( 500-750 Eur) in a machine, and make a separate process with some 64-bit compiler (*) of it that you query over e.g. shared mem or other IPC method.

In that case you can use the in-memory approach till Delphi 64-bit finally comes out. Since your data seems to be simple ( a map from array of char to array of char) it is easily to export over IPC.

This is of course if this approach has any merit for your case (like it is a cache or so), which I can't determine from your question.

(*) I recommend FPC of course :-)

I did this once, till about 5 million objects, 5 GB of data.

I got permission to open source the container types I made for it, they are at:

http://www.stack.nl/~marcov/lightcontainers.zip (warning: very dirty code)

mghie: to answer in another cliche: There is no silver bullet

Databases have a lot of other assumptions too

their generalized approach make relative inefficient use of memory. Most notably your dataset using normal memory storage techniques falls inside the affordable memory ranges, which are of course typically bigger for a server (my bad assumption here, apparantly) than for a client.
databases assume that their resultsets can reduced to small sets within the database-server with a relative straight kind of processing, and assisted by indexing.
they have a relatively high latency.
they are relatively bad in some kinds of processing (like multidimensional analysis/ OLAP, which is why databases need to be extended for that)

This makes databases relatively bad for use in e.g. caches, loadbalancers etc. Of course that is all provided that you need the speed. But the initial question felt a bit speed-sensitive to me.

In a past job my function in an database oriented firm was to do everything but that, IOW fix the problems when the standard approach couldn't hack it (or required 4 socket Oracle servers for jobs where the budget didn't warrant such expenses). The solution/hack written above was a bit of OLAPpy, and connected to hardware (a rfid chipprogramming device), requiring some guaranteed response time. Two months of programming time, still runs, and couldn't even buy a windows server + oracle license for the cost.

Synopse Big Table by A. Bouchez. See his answer to my other question about SQLite/DISQLite.

It wasn't even developed when I first asked this question, but now it's a quite mature and fully functional unit.