开发者

Data Warehouse: One Database or many?

开发者 https://www.devze.com 2022-12-31 16:25 出处:网络
At my new company, they keep all data associated with the data warehouse, including import, staging, audit, dimension and fact tables, together in the same physical database.

At my new company, they keep all data associated with the data warehouse, including import, staging, audit, dimension and fact tables, together in the same physical database.

I've been a database developer for a number of years now and this consolidation of function and form seems counter to everything I know.

It seems to make security, backup/restore and performance management issues more manually intensive.

Is this something that is done in the 开发者_StackOverflow中文版industry? Are there substantial reasons for doing or not doing it?

The platform is Netezza. The size is in terabytes, hundreds of millions of rows.

What I'm looking to get from answers to this question is a solid understanding of how right or wrong this path is. From your experience, what are the issues I should be focused on arguing if this is a path that will cause trouble for us down the road. If it is no big deal, then I'd like to know that as well.


In general I would recommend using separate databases. This is the configuration I have always seen used in production and it really makes a lot of sense since - as you mentioned - both databases have fundamentally different purposes / usage patterns / etc.


Edit

If you're using one physical server, the fewer instances on that server the simpler the management and the more efficient the process.

If you put TWO instances on the same Physical Server you get:

Negatives:

  1. Half the memory to use
  2. Twice the count of database process

Positives:

  1. You could take the entire staging db down without affecting the DW

So which is more precious to you, outage windows or CPU and Memory?

On the same the physical server multiple instances make performance management issues MUCH more manual to solve. If you look at the health of one of the instances, it might look fine but users are reporting poor performance, so you have to look at the next instance to see if the problem may be coming from there... and so on per instance.

Security is also harder with more than one instance. At best it's just as hard as a single instance but it's never easier. You'll have two admin accounts (SYS or something), Duplicate process accounts, etc.

Tell us why you think it's better to have more than one instance.

ORIGINAL POST

Can we be clear on terms. When you say "in the same Database" do you mean to say the same instance, or the same physical server. If you did move the staging to a new instance would it reside on the same physical hardware?

I think people get a little too hung up on instances. If you're going to put two instances on the same piece of hardware, you're only doubling the number of everything to very little advantage. All the server processes will be running twice... all the memory pools will be cut in half.

so let's say you really did mean two separate physical boxes...

Let's say you buy 2 12-way boxes (just say). When you're staging db server is done for the day, those 12 CPU's are wasting away. When your users pack up and go home, your prod DW CPUs are wasting away. CPU cycles are perishable, you can't get them back. BUT, if you had one 24 way box... then the staging DB COULD use 20 CPUs at night for some excellent Parallel Execution for building summary tables and your users will have double the capacity for processes during the day.

so let's say you meant the same hardware.

"It seems to make security, backup/restore and performance management issues more manually intensive."

Guaranteed that performance issues are harder to solve the more instances that share the same hardware. Guaranteed.

Security

What security do you do at the instance level?

Backup

What DW are you backing up at the instance level? You're not backing up tablespaces, but rather whole instances? Seems like that pattern will fail at a certain size.

PLATFORM: NETEZZA

Not familiar with the tool specifically. So if it's a single instance on a single box, then the division would seem more logical than physical and therefore the reasons they exist is for management, not performance. You don't increase your CPUs or memory by adding a database, right? So it doesn't seem like there's no performance upside to it. Each DB may be adding separate processes (performance hit), or it might be completely logical like schemas in Oracle. If each database is managed by new processes than data going between them will mean IPC.

Maybe the addition of the Netezza tag will get some traction.


We use databases for every segment (INVENTORY, CRM, BILLING...). There are no performance downsides and maintenance and overview is much better.


Better late than never, but for Netezza:

There are no performance hits while querying cross database. Netezza allows only SELECT operations cross database, no INSERT, UPDATE or DELETEstatements allowed.

This means you cannot do:

THISDB(ADMIN)=>INSERT INTO OTHERDB..TBL SELECT * FROM THISDBTABLE;

but you can do \c OTHERDB then

OTHERDB(ADMIN)=>INSERT INTO TBL SELECT * FROM THISDB..THISDBTABLE;

You are also not able to create a materialized view on a cross-database object, for example: OTHERDB(ADMIN)=>CREATE MATERIALIZED VIEW BLAH AS SELECT * FROM THISDB..THISDBTABLE;

Administration might be where you will decide (though you probably already did long ago) on what kind of database(s) you'll create. Depending on your infrastructure, you might have a TEST/QA system and a PROD system on the same box, or on separate boxes.


You will gain speed in the load and the output if the tables are in the same schema (database). Obvious...but hey, I said it.

There is more overhead the more tables you put into one schema. Backups time, size of backups, ease of use.

Where I am, we have many multiple TB databases within one data-warehouse. Our rule of thumb is that a single loading process or a single report query should NOT have to span database. This keeps "like" tables together but gives some allowances for our backups and contingency processes. It also makes it a bit easier to "find" data.

For those processes that need to break this rule, we will either move data from one database to the other or allow the process to join across schemas.

I'm not as familiar with Netezza, so I'm not 100% sure what your options might be.


Few points for you to consider a) If the data in one or more staging, audit, dimension and fact table has to be joined, you are better off keeping them in one database

b) Typically you will retain dimension tables and fact tables in the same database and distribute on most frequently joined columns to leverage "co-located join" functionality of Netezza

c) You should be able to use SQL grant permission to manage access to all objects (DB, tables, views etc)

0

精彩评论

暂无评论...
验证码 换一张
取 消