I have a database which stores temperature-logging data from various instruments. Data may be logged as often as once per minute. One approach to designing a log table would be to put each log entry in its own row along with the device ID, a time stamp, and a sequence number (even if the clock on a device is changed, it should still be possible to sort entries in the order the measurements were actually taken). That would seem grossly inefficient, however, since every 16-bit measurement would have probably 16 bytes of other data attached to it, in addition to whatever the system adds for indexing. I recognize that it is often senseless to try to optimize every last byte out of a database, but expanding data by a factor of 9:1 or worse seems silly.
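For concreteness, the per-reading layout I have in mind would be something like this (the names and types here are only illustrative, not an actual schema):

```sql
-- One row per reading; names and types are illustrative only.
CREATE TABLE reading (
    device_id INTEGER   NOT NULL,  -- which instrument took the measurement
    logged_at TIMESTAMP NOT NULL,  -- device-reported time stamp
    seq_no    INTEGER   NOT NULL,  -- per-device sequence number, survives clock changes
    value     SMALLINT  NOT NULL,  -- the 16-bit measurement itself
    PRIMARY KEY (device_id, seq_no)
);
```

That layout is where the roughly 16 bytes of overhead per 2-byte value comes from, before any index overhead.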
At present, I aggregate the records into groups of equally-spaced readings, and store one group per record in a variable-length opaque binary format along with the device ID, time stamp and sequence number for the first reading, and the interval between readings. This works nicely, and for all I know may be the best approach, but it doesn't allow for much in the way of queries.
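Roughly, each aggregated record looks something like this (again, names and types are only illustrative):

```sql
-- One row per group of equally-spaced readings; names and types are illustrative.
CREATE TABLE reading_block (
    device_id  INTEGER   NOT NULL,  -- which instrument
    start_at   TIMESTAMP NOT NULL,  -- time stamp of the first reading in the group
    start_seq  INTEGER   NOT NULL,  -- sequence number of the first reading
    interval_s INTEGER   NOT NULL,  -- seconds between consecutive readings
    samples    BLOB      NOT NULL,  -- packed 16-bit measurements, opaque to SQL
    PRIMARY KEY (device_id, start_seq)
);
```

The packed samples column is what keeps the size down, and also what keeps SQL from seeing the individual readings.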
Is there any nice approach for handling such data sets without excessive redundancy?
Your data doesn't expand by a factor of 9. It stays roughly the same size, because you do not have a bare 16-bit measurement to start with. Your measurement is the Nth measurement from a specific device at a specific moment. So your data already has a sequence number, a device ID and a timestamp even before you add it to the database, whether you're willing to account for that or not.
If you store data in a relational table (SQL), store it in a relational format: normalized. One record per row. Store information in a queryable format. 'Aggregating' records into an opaque binary format makes your entire database useless, as the data cannot be queried, aggregated, or filtered. Your definition of 'this works nicely' is basically 'I can write the data and nobody can make any use of it', which is hardly 'nice'. You might just as well dump the data to /dev/null.
...
Store the data as proper records, using proper database types; don't use 'opaque blobs'. And 'data may be logged as often as once per minute' is not 'frequent' by any database standard. If you said '100 times per second', then we'd have something to talk about.
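As a rough illustration of what one-reading-per-row buys you (assuming a table along the lines sketched in the question, not any real schema), queries like this become trivial:

```sql
-- Daily average, minimum and maximum per device; not expressible in SQL
-- when the measurements are hidden inside an opaque blob.
SELECT device_id,
       CAST(logged_at AS DATE) AS day,
       AVG(value) AS avg_value,
       MIN(value) AS min_value,
       MAX(value) AS max_value
FROM   reading
GROUP  BY device_id, CAST(logged_at AS DATE)
ORDER  BY device_id, day;
```

Exact date-truncation syntax varies by database, but the point stands: normalized rows are queryable, blobs are not.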
Is this really a problem? Imagine we overestimate a bit and say that there are 50 bytes of data plus metadata per measurement. Google suggests that you won't have many problems unless you are in a really tight environment.
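To put rough numbers on that: at one reading per minute, 50 bytes per reading is 1,440 × 50 = 72,000 bytes, or about 72 KB per device per day and roughly 26 MB per device per year before indexes. Even a few hundred devices stays well within what any ordinary database handles comfortably.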
I believe you should use RRDtool to store such data; see the Wikipedia article.