I've got field bus data that gets sent in packets and contains a datum (e.g. a float) from a source.
=> I get timestamps with a source ID and a value.
Now I want to create a little program (actually a logging daemon in C++ that offers a query interface over HTTP for displaying the data in a plot diagram) where the user can select a few of the sources and the interesting time range and then gets the data drawn. This daemon will run on a Linux-based embedded system.
So my question is: what is the most efficient (query performance and memory consumption) data storage scheme for that data?
Addendum #1:
Although I think the algorithm question is interesting on its own, here is some information about the problem that caused it:
- Data rate is typically 3 packets/second (bursts of up to 30/s are common)
- Interesting data might be as old as a month (the more the better; the algorithm might use a hierarchy that allows ultra-fast lookup for the last day, fast lookup for the last week and acceptable lookup for older data)
- The IDs are (at the moment) 32 bits wide.
- There are roughly 1000 IDs in use - but it is not known in advance which ones, and the user might add a new ID at any time
- The values stored will have different data types (boolean, integer, float - even 14-byte strings are possible)
Doing a bit of math:
- Assuming a 32-bit timestamp + 32-bit ID + 32-bit value on average gives a datum of 12 bytes to store
- For a month that is 12 * 3 * 60 * 60 * 24 * 30 = about 100 MB of data to filter through (in real time on a 500 MHz Geode CPU)
- Showing the plot for the last day reduces the data to 1/30th - that leaves about 3 MB to filter through.
- Those 3 MB are reduced to 1/1000th (= 3 KB) by showing only the relevant ID.
Addendum #2:
This problem basically asks how to map a 2D dataset (time and ID being the dimensions) into memory (and from there serialize it to a file), under the constraint that both dimensions will be filtered.
The suggested time-sorted array is an obvious solution for the time dimension. (To increase query performance a tree-based index might be used. A binary search by itself isn't straightforward because each entry might have a different size - but an index tree covers that nicely and is basically the same underlying idea.)
Going that route (first one dimension (time), then the other) will, I fear, result in poor performance for the ID filtering, as it requires a brute-force scan. A rough sketch of such an index over variable-size records is shown below.
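To make the index idea concrete, here is a rough sketch (the names are made up; it assumes records are appended to a flat data file in timestamp order, and keeps one small fixed-size index entry per record):

#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// One fixed-size index entry per stored record (or per block of records).
struct IndexEntry {
    uint32_t timestamp;  // timestamp of the record starting at 'offset'
    uint64_t offset;     // byte offset of the (variable-size) record in the data file
};

// The index is append-only, hence already sorted by timestamp.
// Returns the byte range [from, to) of the data file that covers [begin, end).
std::pair<uint64_t, uint64_t>
locate(const std::vector<IndexEntry>& index,
       uint32_t begin, uint32_t end, uint64_t file_size)
{
    auto cmp = [](const IndexEntry& e, uint32_t ts) { return e.timestamp < ts; };
    auto lo = std::lower_bound(index.begin(), index.end(), begin, cmp);
    auto hi = std::lower_bound(index.begin(), index.end(), end, cmp);
    uint64_t from = (lo == index.end()) ? file_size : lo->offset;
    uint64_t to   = (hi == index.end()) ? file_size : hi->offset;
    return {from, to};
}

The records in the returned byte range still have to be decoded and filtered by ID one by one - which is exactly the brute-force part I'm worried about.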
You could just store your data in SQLite and have your web-server run SQL queries against it. Using existing tools you can prototype rapidly and see how well it scales for your purposes.
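For example (the schema, file and table names here are just placeholders; the calls are the standard sqlite3 C API):

#include <sqlite3.h>
#include <cstdio>

int main()
{
    sqlite3* db = nullptr;
    sqlite3_open("fieldbus.db", &db);

    // One row per datum; the composite index serves the (id, time range) queries.
    sqlite3_exec(db,
        "CREATE TABLE IF NOT EXISTS samples (ts INTEGER, id INTEGER, value BLOB);"
        "CREATE INDEX IF NOT EXISTS idx_id_ts ON samples (id, ts);",
        nullptr, nullptr, nullptr);

    // Query all samples of one source in a time window, oldest first.
    sqlite3_stmt* stmt = nullptr;
    sqlite3_prepare_v2(db,
        "SELECT ts, value FROM samples WHERE id = ?1 AND ts BETWEEN ?2 AND ?3 ORDER BY ts;",
        -1, &stmt, nullptr);
    sqlite3_bind_int(stmt, 1, 42);          // source ID
    sqlite3_bind_int64(stmt, 2, 1000000);   // begin timestamp
    sqlite3_bind_int64(stmt, 3, 2000000);   // end timestamp

    while (sqlite3_step(stmt) == SQLITE_ROW)
        std::printf("%lld\n", (long long)sqlite3_column_int64(stmt, 0));

    sqlite3_finalize(stmt);
    sqlite3_close(db);
    return 0;
}

Wrapping bursts of inserts in a single transaction should keep the 3-30 inserts/second from the question well within reach, even on a 500 MHz Geode.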
It really depends on the specific case, but a possible solution would be to store events in pages and keep only the page directory in memory:
struct Page
{
    int id;                      // source ID of all events stored in this page
    int timestamp0, timestamp1;  // time range covered by the events in this page
    int bytes_used;              // bytes occupied so far (every page has the same fixed size, e.g. 4 KB)
};
Every page holds events for only one specific ID and all pages are of the same size (e.g. 4 KB). When you receive an event you add it to that ID's current page if it fits; otherwise you allocate a new page for that event ID.
When searching, you can quickly decide by looking at the index which pages of your data file are worth processing, so you don't have to scan the whole file.
Pseudocode for adding an event is:
- find last page for ID x
- if the event doesn't fit in the page allocate a new fresh page
- store the event and update the index entry for the page
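A rough C++ sketch of the append path (just one possible arrangement: it reuses the Page struct from above, keeps page payloads in memory for simplicity, and uses a std::map to find the open page per ID):

#include <cstddef>
#include <cstring>
#include <map>
#include <vector>

const std::size_t kPageSize = 4096;           // example page size

std::vector<Page> directory;                  // the in-memory page directory (struct from above)
std::vector<std::vector<char>> page_data;     // page payloads; on the target these live in the data file
std::map<int, std::size_t> last_page_for_id;  // index of the currently open page per source ID

void add_event(int id, int timestamp, const void* payload, int size)
{
    std::size_t page_index;
    auto it = last_page_for_id.find(id);
    if (it != last_page_for_id.end() &&
        directory[it->second].bytes_used + size <= (int)kPageSize) {
        page_index = it->second;              // the event still fits into the ID's current page
    } else {
        page_index = directory.size();        // allocate a fresh page for this ID
        directory.push_back(Page{id, timestamp, timestamp, 0});
        page_data.push_back(std::vector<char>(kPageSize));
        last_page_for_id[id] = page_index;
    }
    Page& p = directory[page_index];
    std::memcpy(page_data[page_index].data() + p.bytes_used, payload, size);
    p.bytes_used += size;
    p.timestamp1 = timestamp;                 // keep the page's time range up to date
}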
For doing a search you:
- for each entry in the index
- if the entry is about an ID you don't care about, skip to the next one
- if (page.timestamp0 >= tsmax || page.timestamp1 < tsmin), the page doesn't contain any event in the range; skip to the next one
- otherwise this page contains at least one relevant event; load the page and process the events that fall within the period tsmin ... tsmax you are interested in.
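And the corresponding search over the page directory, again as a sketch (it reuses the Page struct from above):

#include <set>
#include <vector>

// Returns the pages that may contain events for the requested IDs
// inside the time window tsmin ... tsmax.
std::vector<const Page*> find_pages(const std::vector<Page>& directory,
                                    const std::set<int>& wanted_ids,
                                    int tsmin, int tsmax)
{
    std::vector<const Page*> hits;
    for (const Page& page : directory) {
        if (!wanted_ids.count(page.id))
            continue;                                     // ID we don't care about
        if (page.timestamp0 >= tsmax || page.timestamp1 < tsmin)
            continue;                                     // time ranges don't overlap
        hits.push_back(&page);                            // worth loading and decoding
    }
    return hits;
}

Decoding the events inside each returned page (and discarding the few that fall just outside tsmin ... tsmax) is then cheap, since a page holds at most a few KB of data.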
You can also avoid storing the index in the file (making the event logging operation faster) if you add an ID field once per page. When starting the server a full scan of all the data will then be needed... this may or may not be an acceptable option.
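A sketch of that startup scan (it assumes each page begins with a small fixed header mirroring the directory entry; this header layout is just one possible choice):

#include <cstdint>
#include <cstdio>
#include <vector>

// On-disk header at the start of every page (one possible layout).
struct PageHeader {
    int32_t id;
    int32_t timestamp0, timestamp1;
    int32_t bytes_used;
};

// Rebuild the in-memory page directory by reading only the header of each page.
std::vector<PageHeader> rebuild_directory(const char* path, long page_size)
{
    std::vector<PageHeader> directory;
    std::FILE* f = std::fopen(path, "rb");
    if (!f)
        return directory;
    PageHeader h;
    for (long offset = 0;
         std::fseek(f, offset, SEEK_SET) == 0 && std::fread(&h, sizeof h, 1, f) == 1;
         offset += page_size)
        directory.push_back(h);               // one directory entry per page on disk
    std::fclose(f);
    return directory;
}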
You can also easily decide how much data to keep... when no more free pages are available you can reuse (maybe after storing away a zipped copy for archival) all pages that only contain events older than a certain date.
"most efficient (query performance and memory consumption)"
By this you probably mean something that is well balanced between the two. Also, I think that data insertion must be fast.
The easiest, and maybe sufficient, solution would IMO be a plain array, as it is the most memory-efficient uncompressed form in which you can store the data. Each array element contains the timestamp, ID and value.
When you query the data with two timestamps, begin and end, you determine the locations of the two timestamps in the array using binary search. Then you traverse the elements in that range and fetch only the ones whose IDs belong to the data sources you are interested in. The array elements must of course be ordered by timestamp.
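A minimal sketch of this idea (the names are made up; it assumes 32-bit timestamps/IDs and a 32-bit raw value, matching the estimate in the question):

#include <algorithm>
#include <cstdint>
#include <unordered_set>
#include <vector>

struct Entry {
    uint32_t timestamp;
    uint32_t id;
    uint32_t value;   // raw 32-bit payload; interpret per the ID's data type
};

// Entries are appended in timestamp order, so the vector stays sorted.
std::vector<Entry> query(const std::vector<Entry>& log,
                         uint32_t begin, uint32_t end,
                         const std::unordered_set<uint32_t>& ids)
{
    auto less_ts = [](const Entry& e, uint32_t ts) { return e.timestamp < ts; };
    auto first = std::lower_bound(log.begin(), log.end(), begin, less_ts);
    auto last  = std::lower_bound(log.begin(), log.end(), end, less_ts);

    std::vector<Entry> result;
    for (auto it = first; it != last; ++it)   // linear scan of the time slice
        if (ids.count(it->id))                // cheap ID filter via the set
            result.push_back(*it);
    return result;
}

The range is half-open ([begin, end)); switch the second search to std::upper_bound if end should be inclusive.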
- The data takes O(n) memory, where the number of log entries is n.
- Data insertion is O(1), since entries arrive in timestamp order and are simply appended
- Data retrieval is roughly O(2*log(n) + k*m), where k is the number of elements in the queried time range and m is the number of selected source IDs. If you have many data sources in the query, you can store the source IDs in a set, which brings this down to O(2*log(n) + k*log(m)).
There are of course other possibilities, such as storing the entries in trees, hash tables or something that mixes these with lists to get a finer balance between performance and memory consumption.
Problems also arise when the amount of log data gets large. In that case you should split the array across several files and store, for each file, the begin/end timestamps of the logs it contains. The implementation then gets a little more complex, as sketched below.
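For instance (a hypothetical catalogue structure), per-file time ranges let you open only the files that overlap the query window:

#include <cstdint>
#include <string>
#include <vector>

// One catalogue entry per log file (hypothetical naming scheme).
struct LogFile {
    std::string path;     // e.g. "log_000123.bin"
    uint32_t begin_ts;    // timestamp of the first entry in the file
    uint32_t end_ts;      // timestamp of the last entry in the file
};

// Pick only the files whose time range overlaps the query window [begin, end).
std::vector<LogFile> relevant_files(const std::vector<LogFile>& catalogue,
                                    uint32_t begin, uint32_t end)
{
    std::vector<LogFile> out;
    for (const LogFile& f : catalogue)
        if (f.end_ts >= begin && f.begin_ts < end)
            out.push_back(f);
    return out;
}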
Hopefully this helps you decide on the best data structure / implementation for your solution.