What's the best way to back up a large real-time data warehouse?

I've got a DB2 ESE 9.7 non-DPF data warehouse using data compression, with 20 TB of data. It gets 100 million rows a day via loads every 10 minutes, and another million rows a day via 50,000 imports. Additionally, there is a small amount of transactional data associated with those two large data sets.

Currently we're using application-level backups, relying on loading previously exported summary tables or reloading the 100 million rows a day in the event of a recovery. But for the sake of the small amount of transactional data and the imports, I'd like online backups.

However, it appears that online tablespace-specific backups require an initial offline backup, and that's the problem: even if I redirect the offline backup to /dev/null, it will take approximately 48 hours of downtime, which is unacceptable. And it may be required again at some point in the future.
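For reference, a minimal sketch of the sequence behind that requirement, assuming the standard DB2 9.7 behavior where switching from circular to archive logging leaves the database in BACKUP PENDING state; the database name (DWDB), archive path, and tablespace name are placeholders, not taken from my environment:

    -- enable archive logging so online backups become possible at all
    UPDATE DB CFG FOR DWDB USING LOGARCHMETH1 DISK:/db2/archlogs;

    -- the database is now in BACKUP PENDING state; a full offline backup
    -- must complete before any application can connect again
    BACKUP DATABASE DWDB TO /dev/null;

    -- only after that are tablespace-level online backups allowed
    BACKUP DATABASE DWDB TABLESPACE (TS_CURRENT) ONLINE TO /backup/dwdb;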

At some point we'll probably split this up into 8+ partitions, which would help both with this and with index builds during loads. But that may not happen for a while, and it's hard to justify for tasks that shouldn't be necessary in the first place.

EDIT: The reason we didn't initially go with DPF, and why it isn't a driving issue for our queries, is that over 99% of our queries hit summary tables, and the 1% that do have to dip into the table with 3+ billion rows per month can almost always take advantage of table partitioning, MDC and indexes to scan a much smaller subset. The point is that the traditional heuristics about how much data per CPU don't always apply.

Any way to get around that offline backup requirement? Any third-party tools that can help me out? Any other suggestions?


Unfortunately there isn't really a way to get around this: you have to plan for recovery when you do the physical design of the database. Using separate tablespaces with range partitioning allows you to back up only the tablespaces that contain new data (assuming you know which tablespaces are changing...).
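To make that concrete, a minimal sketch of such a tablespace-level online backup is shown below; the database name (DWDB), the tablespace names, and the target path are illustrative placeholders, not from the original post:

    -- back up only the tablespaces that hold the current data range
    BACKUP DATABASE DWDB
        TABLESPACE (TS_2011Q1_DATA, TS_2011Q1_IDX)
        ONLINE
        TO /backup/dwdb;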

Generally this would fall into the domain of using split mirrors or snapshots at the disk level. This, of course, requires that your disk subsystem supports this functionality (unless you're using software like Veritas Volume Manager), AND that you have the capacity to actually enable this. DB2 fully supports this, though, and it's very useful. I have done this with EMC Symmetrix and Clariion; but it does require a brief "outage" where you freeze database I/O in order to issue the operating system commands to handle breaking the mirrors.
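As a rough outline rather than a definitive procedure, the freeze/split sequence looks like the following from the DB2 side; the storage-level split itself is vendor-specific and only indicated as a comment, and DWDB is a placeholder database name:

    -- on the production database: suspend I/O just long enough to split
    CONNECT TO DWDB;
    SET WRITE SUSPEND FOR DATABASE;

    -- ...issue the vendor-specific storage commands here to break the
    -- mirror or take the snapshot (EMC, NetApp, Veritas, etc.)...

    SET WRITE RESUME FOR DATABASE;

    -- on the host that mounts the split copy, initialize it from the
    -- shell, e.g. as a snapshot to back up from or a standby to roll
    -- forward:  db2inidb DWDB as snapshot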

In v9.5, DB2 added a feature called Advanced Copy Services (ACS), which allows the storage vendors to integrate into the BACKUP DATABASE command. IBM supports this with some of their storage subsystems, and NetApp added support for this very quickly as well. It is pretty amazing to say, "BACKUP DATABASE HUGEDB USE SNAPSHOT" and watch it take 10 seconds. And then "RESTORE DATABASE HUGEDB USE SNAPSHOT TAKEN AT timestamp".
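In command form, assuming DB2 9.5 or later with an ACS-capable storage subsystem (the timestamp below is purely illustrative), that flow is roughly:

    -- integrated snapshot backup through Advanced Copy Services
    BACKUP DATABASE HUGEDB ONLINE USE SNAPSHOT;

    -- restore from a specific snapshot image
    RESTORE DATABASE HUGEDB USE SNAPSHOT TAKEN AT 20230116050800;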
