I have a table in my database which stores logs. The log entries are timestamped with one-second resolution and store the values of various sensors along with their source:
log_id, log_date, primary_system_source, sub_system_source, values
Where log_id, primary_system_source and sub_system_source are integers, log_date is the timestamp, and values is a variable-length byte array (datatype: bytea).
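The table definition looks roughly like this (the table name and exact integer/timestamp types here are illustrative; only bytea is fixed):

    CREATE TABLE sensor_log (
        log_id                integer   NOT NULL,
        log_date              timestamp NOT NULL,  -- one-second resolution
        primary_system_source integer   NOT NULL,
        sub_system_source     integer   NOT NULL,
        "values"              bytea     NOT NULL   -- raw sensor readings
        -- no primary key yet; that is the question
    );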
In most cases a combination of the log_id, log_date, primary_system_source and sub_system_source fields would be sufficient as the primary key. Unfortunately, because of the limited resolution of the timestamping in the logging system, some rows are distinguishable only by their sensor values, so the values field would also have to be included in the primary key.
It appears I have a choice between having no primary key (bad?) and including the values field in the primary key. I am concerned about the second choice, as I understand it could be seriously detrimental to performance (the table will have hundreds of millions of rows).
Any hints as to which is the best solution?
That's a difficult issue, since in the example you presented the entire row effectively functions as your primary key. Because your timestamps are not absolutely precise, your logs may not contain unique rows at all (two identical sensor readings within the same second). If that holds true, you have no way to uniquely identify your data and therefore cannot impose a unique constraint on it.
I would recommend simply adding a SERIAL primary key field for linking to other relations and not worrying about the uniqueness of your entries, since you cannot reasonably enforce it anyway. You can spot duplicated log entries if you find a greater number of entries within a certain time period than you were expecting. I'm not sure of the performance implications, but running SELECT DISTINCT may be more prudent than attempting to enforce uniqueness.
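A minimal sketch of that approach (the column and index names are illustrative; BIGSERIAL is assumed because of the expected row count):

    -- Surrogate key: every row gets a unique id regardless of duplicate content.
    ALTER TABLE sensor_log ADD COLUMN log_pk bigserial PRIMARY KEY;

    -- Non-unique index to keep lookups by time and source fast.
    CREATE INDEX sensor_log_date_source_idx
        ON sensor_log (log_date, primary_system_source, sub_system_source);

    -- Spot suspicious duplicates after the fact instead of enforcing uniqueness up front.
    SELECT log_id, log_date, primary_system_source, sub_system_source, COUNT(*)
    FROM sensor_log
    GROUP BY log_id, log_date, primary_system_source, sub_system_source
    HAVING COUNT(*) > 1;

The GROUP BY deliberately leaves out the values column, so the last query surfaces rows that collide on timestamp and source, which is exactly the situation you described.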