DB Design for a high volume of data (20 million rows / day) [closed]

Closed. This question is opinion-based. It is not currently accepting answers.

Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.

Closed 4 years ago.


We are looking to create software that receives log files from a large number of devices. We expect around 20 million log rows per day, at roughly 2 KB per log line.

I have developed a lot of software, but never anything with this quantity of input data. The data needs to be searchable, sortable, and groupable by source IP, destination IP, alert level, etc.

It should also combine similar log entries (e.g. "occurred 6 times").

Any ideas and suggestions on the type of design, the choice of database, and the general approach would be much appreciated.

UPDATE:

Found this presentation, which seems to cover a similar scenario. Any thoughts on it? http://skillsmatter.com/podcast/cloud-grid/mongodb-humongous-data-at-server-density


I see a couple of things you may want to consider:

1) A message queue: drop each log line onto the queue and let another part of the system (a worker) take care of it when time permits. A minimal sketch of this split follows below.

2) A NoSQL store: Redis, MongoDB, Cassandra.

I think your real problem will be in querying the data, not in storing it.

You will also probably need a scalable solution. Some NoSQL databases are distributed, and you may need that.
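
Here is a minimal sketch of that queue/worker split, assuming Redis (a list used as a queue) as the broker. The queue name "log_lines", the JSON log format, and the parse/store helpers are all illustrative assumptions, not part of the question:

```python
# Sketch: ingest endpoint enqueues; a separate worker process drains.
import json
import redis

r = redis.Redis(host="localhost", port=6379)

def receive_log_line(raw_line: str) -> None:
    """Called by the ingest endpoint: enqueue and return immediately."""
    r.rpush("log_lines", raw_line)

def worker() -> None:
    """Runs in a separate process; drains the queue when time permits."""
    while True:
        # BLPOP blocks until a line is available, returning (key, value).
        _, raw = r.blpop("log_lines")
        entry = parse(raw.decode("utf-8"))
        store(entry)  # write to the backing database (NoSQL or SQL)

def parse(raw: str) -> dict:
    # Hypothetical log format: JSON with source/dest IP and alert level.
    return json.loads(raw)

def store(entry: dict) -> None:
    ...  # insert into whichever datastore is chosen
```

The point of the split is that ingest stays cheap and constant-time even at 20 million lines per day, while the expensive parsing and indexing work happens asynchronously in the worker.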


Check this out; it might be helpful: https://github.com/facebook/scribe


A web search on "Stackoverflow logging device data" yielded dozens of hits.

Here is one of them. The question asked may not be exactly the same as yours, but you should get dozens of interesting ideas from the responses.


I'd base many decisions on how users will most often be selecting subsets of the data: by device? By date? By source IP? You want to keep indexes to a minimum and use only those you need to get the job done.
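
For instance, if it turned out that users most often filter by device and date, one composite index matched to that pattern could do the work of several single-column indexes. A sketch, using SQLite for brevity; the table and column names are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE logs (device_id TEXT, created_at TEXT, "
    "source_ip TEXT, alert_level INTEGER, message TEXT)"
)
# One composite index matched to the dominant access pattern keeps
# write overhead low while still serving the common query quickly.
conn.execute("CREATE INDEX idx_logs_device_date ON logs (device_id, created_at)")

# A query shaped like the index can be satisfied from it directly:
rows = conn.execute(
    "SELECT * FROM logs WHERE device_id = ? AND created_at >= ?",
    ("dev-42", "2011-01-01"),
).fetchall()
```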

For low-cardinality columns where indexing overhead is high yet the value of an index is low, such as alert level, I'd recommend a trigger that creates rows in another table identifying the rows corresponding to emergency situations (e.g. where alert level > x). That way alert level itself would not have to be indexed, and yet you could rapidly find all high-alert rows.
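
A minimal sketch of that trigger, again in SQLite; the table names (logs, alerts), the columns, and the threshold 3 standing in for "x" are all assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE logs (
    id          INTEGER PRIMARY KEY,
    source_ip   TEXT,
    dest_ip     TEXT,
    alert_level INTEGER,
    message     TEXT
);
-- Side table holding only the high-alert rows, so alert_level itself
-- never needs an index on the big logs table.
CREATE TABLE alerts (log_id INTEGER REFERENCES logs(id));

CREATE TRIGGER high_alert AFTER INSERT ON logs
WHEN NEW.alert_level > 3
BEGIN
    INSERT INTO alerts (log_id) VALUES (NEW.id);
END;
""")

conn.execute("INSERT INTO logs (source_ip, dest_ip, alert_level, message) "
             "VALUES ('10.0.0.1', '10.0.0.2', 5, 'disk failure')")
# All high-alert rows are now reachable through the small alerts table.
print(conn.execute("SELECT log_id FROM alerts").fetchall())  # [(1,)]
```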

Since users are updating the logs, you could move handled/managed rows older than 'x' days out of the active log and into an archive log, which would improve performance for ad-hoc queries.
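
A sketch of that archiving step, under the assumptions that a logs_archive table with the same schema already exists, that hypothetical handled and created_at columns mark a row's state and age, and that 30 days stands in for "x":

```python
import sqlite3

def archive_old_rows(conn: sqlite3.Connection, days: int = 30) -> None:
    cutoff = f"-{days} days"
    with conn:  # one transaction: copy, then delete
        conn.execute(
            "INSERT INTO logs_archive SELECT * FROM logs "
            "WHERE handled = 1 AND created_at < datetime('now', ?)",
            (cutoff,),
        )
        conn.execute(
            "DELETE FROM logs "
            "WHERE handled = 1 AND created_at < datetime('now', ?)",
            (cutoff,),
        )
```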

For identifying recurrent problems (the same problem on the same device, on the same IP address, or on all devices made by the same manufacturer or from the same manufacturing run, for example), you could identify the subset of columns that defines that particular kind of problem and then, in a trigger, create a hash of the values in those columns. All problems of the same kind would then have the same hash value. You could have multiple columns like this; it would depend on your definition of "similar problem", how many different problem kinds you wanted to track, and which subset of columns you'd need to define each kind of problem.

If you index the hash-value column, your users would be able to answer the question "Are we seeing this kind of problem frequently?" very quickly: they'd look at the current row, grab its hash value, and then search the database for other rows with that hash value.
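
A sketch of one such hash, computed application-side here for clarity (the answer suggests doing it in a trigger). The columns chosen to define "the same kind of problem" (device_id, error_code) are assumptions; each definition of similarity would get its own hash column:

```python
import hashlib

def problem_hash(entry: dict, columns=("device_id", "error_code")) -> str:
    # Concatenate the defining column values with a separator unlikely to
    # appear in the data, then hash, so identical problems collide on purpose.
    key = "\x1f".join(str(entry[c]) for c in columns)
    return hashlib.sha1(key.encode("utf-8")).hexdigest()

a = {"device_id": "dev-42", "error_code": "E100", "message": "fan stall"}
b = {"device_id": "dev-42", "error_code": "E100", "message": "fan stall again"}
assert problem_hash(a) == problem_hash(b)  # same kind of problem, same hash
```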

