structured vs. unstructured data in db

The question is one of design. I'm gathering a big chunk of performance data with lots of key-value pairs: pretty much everything in /proc/cpuinfo, /proc/meminfo, and /proc/loadavg, plus a bunch of other stuff, from several hundred hosts. Right now, I just need to display the latest chunk of data in my UI. I will probably end up doing some analysis of the data gathered to figure out performance problems down the road, but this is a new application, so I'm not sure what exactly I'm looking for performance-wise just yet.

I could structure the data in the DB -- have a column for each key I'm gathering. The table would end up being O(100) columns wide, it would be a pain to put into the db, and I would have to add new columns if I start gathering a new stat. But it would be easy to sort and analyze the data using plain SQL.
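For concreteness, the structured variant would look something like this sketch (column names here are made up, showing only a few of the ~100 keys):

    -- Hypothetical wide table: one column per collected stat (only a few shown).
    CREATE TABLE host_stats (
        host_id      INTEGER   NOT NULL,
        collected_at TIMESTAMP NOT NULL,
        cpu_mhz      NUMERIC,   -- from /proc/cpuinfo
        mem_total_kb BIGINT,    -- from /proc/meminfo
        mem_free_kb  BIGINT,
        load_avg_1m  NUMERIC,   -- from /proc/loadavg
        -- ... roughly a hundred more columns, one per key ...
        PRIMARY KEY (host_id, collected_at)
    );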

Or I could just dump my unstructured data blob into the table: maybe three columns -- host ID, timestamp, and a serialized version of my array, probably as JSON in a TEXT field.
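The unstructured variant would be something like this (again, names are illustrative):

    -- Hypothetical blob table: the whole stats array serialized as JSON.
    CREATE TABLE host_stats_raw (
        host_id      INTEGER   NOT NULL,
        collected_at TIMESTAMP NOT NULL,
        stats_json   TEXT      NOT NULL,  -- e.g. {"MemTotal": 8388608, "MemFree": 123456, ...}
        PRIMARY KEY (host_id, collected_at)
    );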

Which should I do? Am I going to be sorry if I go with the unstructured approach? When doing analysis, should I just convert the fields I'm interested in and create a new, more structured table? What are the trade-offs I'm missing here?


I say if you need to run SQL queries to calculate things like min/max/avg or to perform sorting, restrictions, or joins based on the values, then you should create the 100+ columns. That's what I would do.

You don't state which brand of database you are using, but most should support 100+ columns in a table without risk of inefficiency.

Please don't use the Entity-Attribute-Value antipattern -- the key/value design that some people will suggest. It's nice and easy to insert any arbitrary collection of key/value pairs into such a design, but any of the queries that would be simple to do in a conventional table with one column per attribute become insanely difficult and inefficient with the EAV design. You also lose many advantages of using an SQL database, like data types and constraints.
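To make that concrete, here is a sketch (table and column names are hypothetical) of the same restriction written against a conventional wide table and against an EAV table. Every additional attribute in the EAV version costs another self-join and another cast, because the values have to be stored as text.

    -- Wide table: one column per attribute, so the restriction is trivial.
    SELECT host_id, load_avg_1m, mem_free_kb
    FROM   host_stats
    WHERE  load_avg_1m > 4 AND mem_free_kb < 100000;

    -- EAV table eav_stats(host_id, stat_key, stat_value, collected_at):
    -- one self-join per attribute, plus casts because every value is text.
    SELECT la.host_id,
           CAST(la.stat_value AS NUMERIC) AS load_avg_1m,
           CAST(mf.stat_value AS BIGINT)  AS mem_free_kb
    FROM   eav_stats la
    JOIN   eav_stats mf
      ON   mf.host_id = la.host_id AND mf.collected_at = la.collected_at
    WHERE  la.stat_key = 'load_avg_1m'
      AND  mf.stat_key = 'mem_free_kb'
      AND  CAST(la.stat_value AS NUMERIC) > 4
      AND  CAST(mf.stat_value AS BIGINT)  < 100000;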


I think

performance_data

        host_id
        key
        value
        timestamp

is the proper structure. You will be able to query specific subsets from specific hosts at specific times to generate your analysis.
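For example, a query against that layout might look like this (host ID, key names, and dates are hypothetical, and some databases will want identifiers such as key, value, and timestamp quoted since they can be reserved words):

    -- Pull a subset of stats for one host over a time window.
    SELECT key, value, timestamp
    FROM   performance_data
    WHERE  host_id = 42
      AND  key IN ('MemFree', 'MemTotal')
      AND  timestamp BETWEEN '2022-12-26 00:00' AND '2022-12-27 00:00'
    ORDER BY timestamp, key;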


Here is an alternative solution: use more than one table.

An obvious schema design would be a table each for cpuinfo, meminfo, loadavg, etc. You might end up with a miscellaneous_stats table, depending on what you're including in "a bunch of other stuff".

This approach has several attractive features:

  • simplified column naming.
  • easy to report against a related subset of statistics, e.g. all of meminfo. Probably better performance too.
  • less problematic to add a column. If you start gathering a new cpuinfo statistic, it stays clumped together with the rest, whereas in the One Big Table you would end up with columns 1-15 and column 94.
  • granularity of recording. For instance you might not want to log cpuinfo as often as meminfo.

You should have a master table of stats_runs to hold things like HOST, TIMESTAMP, etc rather than duplicating those details on each table.
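Roughly, the layout would be along these lines (a sketch only, with made-up column names):

    -- Master table: one row per collection run, holding the shared details.
    CREATE TABLE stats_runs (
        run_id       INTEGER   PRIMARY KEY,
        host_id      INTEGER   NOT NULL,
        collected_at TIMESTAMP NOT NULL
    );

    -- One table per group of statistics, each keyed off the run.
    CREATE TABLE cpuinfo_stats (
        run_id   INTEGER PRIMARY KEY REFERENCES stats_runs(run_id),
        cpu_mhz  NUMERIC,
        cache_kb INTEGER
        -- ... further /proc/cpuinfo columns ...
    );

    CREATE TABLE meminfo_stats (
        run_id       INTEGER PRIMARY KEY REFERENCES stats_runs(run_id),
        mem_total_kb BIGINT,
        mem_free_kb  BIGINT
        -- ... further /proc/meminfo columns ...
    );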

I have two working assumptions underlying this proposal:

  1. You are going to do some analysis of this data (because if you're not going to analyze it, why bother collecting it?).
  2. SQL remains the best mechanism for data crunching, although flat file tools are improving all the time.


http://blogs.technet.com/fort_sql/archive/2010/03/26/the-myth-of-unstructured-data.aspx

http://tonyandrews.blogspot.com/2004/10/otlt-and-eav-two-big-design-mistakes.html


Thank you for your suggestions.

After thinking about this issue some more, I've decided to go with a two-table approach. One table holds the very latest dump of the raw data, in the same JSON format I originally get it in. I use this to display the most recent stats -- the most common use case -- and it would be silly to try to parse out all the fields in the dump only to reassemble them all again when someone wants to see the current status.

I've picked out a few stats that I want to do long-term analysis on from this raw data, and I'm storing those in a wide table (lots of columns). This will allow me to easily render trends graphs and to spot performance issues.
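In rough terms (column names below are illustrative, not my real ones), the two tables look something like:

    -- Latest raw dump only: one row per host, replaced on each collection.
    CREATE TABLE latest_stats_raw (
        host_id      INTEGER   PRIMARY KEY,
        collected_at TIMESTAMP NOT NULL,
        stats_json   TEXT      NOT NULL   -- the JSON blob exactly as gathered
    );

    -- Long-term history of the handful of stats picked out for analysis.
    CREATE TABLE stats_history (
        host_id      INTEGER   NOT NULL,
        collected_at TIMESTAMP NOT NULL,
        load_avg_1m  NUMERIC,
        mem_free_kb  BIGINT,
        -- ... the other selected columns ...
        PRIMARY KEY (host_id, collected_at)
    );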

Based on my experience with EAV, I don't think it is a good idea. It doesn't make long-term analysis easy (the 40-way JOIN or pivot problem), and because my data is not flat, it wouldn't make storing the raw data much easier either.
