
Tips/links/books for designing a very large, low-granularity database?


SAS programmers at my company work with researchers to analyze data stored in a number of text files around 1 TB in size. The resulting SAS processes can take days to run. Whenever the researchers want to change a question slightly, the processes have to be re-run, requiring further hours or days.

The SAS programmers approached our DBA team for a way of storing their data with the aim of greatly improving query performance.

Two main difficulties are:

  1. We have only a handful of example queries, and there is no particularly typical set of queries to expect.
  2. Many of the queries will be of a form like

    SELECT COUNT(DISTINCT id) FROM t WHERE a = true AND b = 3 AND c BETWEEN 3 AND 10;

but in which the WHERE filter parameters are not known in advance and could involve any combination of columns and values. That is to say, having read a bit about data warehouses, it seems to me that our requirements rule out the typical data warehouse approach of pre-aggregating and working with a coarser granularity of records.

I'm looking for any resources that speak to designing databases with similar constraints. In Bill Inmon's Building the Data Warehouse, he briefly mentions "exploration warehouses" and "data mining warehouses". Using these terms I found this article that was slightly helpful: "Designing the Data Warehouse for Effective Data Mining" [pdf], but that's more or less it. Most of what I find when searching re: "data mining" regards OLAP.

I'm a novice DBA and I've been tasked with coming up with some suggestions for this design. I think at this point my most helpful contribution will be to suggest that we design to avoid expensive joins as much as possible. I'm out on a limb here -- not expecting miracles, but any sage advice or reading recommendations would be very welcome.

Thanks!


Read everything you can by Ralph Kimball.

http://www.amazon.com/Data-Warehouse-Toolkit-Complete-Dimensional/dp/0471200247

Your typical query (SELECT aggregate FROM fact JOIN dimension WHERE criteria) is the sweet spot for the star schema.

Forget "data mining". It isn't a helpful term.

Focus on "Star Schema". Build the right data structure.


I wanted to leave a comment to ask for a bit more clarification, but it seems I can't yet. A couple of questions:

  • How long does the text file read take?
  • Is it possible to have incremental text files sent and maintain a bunch of SAS datasets to which you append the incremental data?

Here are some suggestions...

If funding is not an issue, then switching to a back-end DBMS like Netezza would help with this problem.

A simpler approach may be to split the data into smaller datasets and then change the queries to dynamically read only the relevant datasets. For example, if every query filters on variable A being either true or false, and the split is roughly 50/50, then dividing the data into two datasets on A could halve the query time for that case. The only problem with this approach is that it really depends on finding a split that accommodates all the query types.
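
As a rough illustration of that split idea, assuming the table name t and columns a, b, c from the example query in the question:

    -- Split the single large table on the A flag so each query scans roughly half the rows.
    CREATE TABLE t_a_true  AS SELECT * FROM t WHERE a = TRUE;
    CREATE TABLE t_a_false AS SELECT * FROM t WHERE a = FALSE;

    -- A query that previously scanned all of t now reads only one half.
    SELECT COUNT(DISTINCT id)
    FROM t_a_true            -- chosen because this query asks for a = true
    WHERE b = 3
      AND c BETWEEN 3 AND 10;

Many DBMSs can also do this transparently with native table partitioning, which avoids having to rewrite the queries by hand.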

Indexing may also help speed things up. You would need to analyse which variables are good candidates for an index.
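
For example (column names taken from the query in the question; whether these are actually good candidates depends on that analysis):

    -- Single-column and composite indexes on the most frequently filtered columns.
    CREATE INDEX ix_t_a   ON t (a);
    CREATE INDEX ix_t_b_c ON t (b, c);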

Please let me know if you need any more information.

Thanks, M
