开发者

Database Design Question regarding duplicate information

开发者 https://www.devze.com 2022-12-25 19:08 出处:网络
I have a database that contains a history of product sales.For example the following table CREATE TABLE SalesHistoryTable (

I have a database that contains a history of product sales. For example the following table

CREATE TABLE SalesHistoryTable (
OrderID, // Order Number Unique to all orders
ProductID, // Product ID can be used as a Key to look up product info in another table
Price, // Price of the product per unit at the time of the order
Quantity, // quantity of the product for the order
Total, // total cost of the order for the product. (Price * Quantity)
Date, // Date of the order
StoreID, // The store that created the Order
PRIMARY KEY(OrderID)); 

The table will eventually have millions of transactions. From this, profiles can be created for products in different geographical regions (based on the StoreID). Creating these profiles can be very time consuming as a database query. For example.

SELECT ProductID, StoreID,
 SUM(Total) AS Total,
SUM(Quantity) QTY,
SUM(Total)/SUM(Quantity) AS AvgPrice
FROM SalesHistoryTable
GROUP BY ProductID, StoreID;

The above query could be used to get the Information based on products for any particular store. You could then determine which store has sold the most, has made the most money, and on average sells for the most/least. This would be very costly to use as a normal query run anytime. What are some design descisions in order to allow these types of queries to run faster assuming storage size isn’t an issue. For example, I could create another Table with duplicate information. Store ID (Key), Product ID, TotalCost, QTY, AvgPrice And provide a trigger so that when a new order is received, the entry for that store is updated in a new table. The cost for the update is almost nothing.

What开发者_开发知识库 should be considered when given the above scenario?


This is normally something you would use a data warehouse for, but aside from that, using a trigger to update a second table is a perfectly viable option.

You could also have a second table that is populated by a batch job on a periodic basis (a more data-warehouse like option). You could also use a materialized view if your database supports them.


I'd consider:

  • a data warehouse/OLAP solution
  • (as you said) run your data mining queries against a separate precomputed table/dataset
  • indexed/materialised views which is almost the same as the previous point

There are some questions though:

  • do you expect real time data?
  • what is your write volume?
  • what DB engine?


You may want to look into using materialized views, which will only be queried periodically.


"The cost for the update is almost nothing."

Except that all updates must now be serialized. Because no matter what, the ancient law of physics still remains that no two things can be in the same place at the same time.

0

精彩评论

暂无评论...
验证码 换一张
取 消