A solution to one of our business problems would be to store about 500 million records in a database. Each record would have about 40 to 50 columns.
I have a text file containing all of this data; it is about 150 GB (a third of my hard disk).
If I were to load (somehow) all of this data into a DB (Oracle?), how well would it perform?
A colleague of mine insists that it would be absolutely fine, and that I could even index all of the 40 to 50 columns; then it is just a matter of writing some SQL to get the data out.
Is he right? Or is 500 million records too much for a DB?
P.S. Just adding more information following some very good answers: the 40 to 50 columns will hold small strings and/or numbers. By small strings I mean fewer than 64 characters.
Without any information about the RDBMS you're using, how it is hosted and what kind of data this is (large text, small numbers and such), a solid answer isn't easy.
The sheer number of records should not be an issue; almost every modern RDBMS will easily cope with 500 million records and more.
It gets more interesting when you look at how the data is stored by your RDBMS, e.g. what kind of filesystem it is using, how much disk space is available for the table, how the table is spread out over the hard disk(s) and so on, all of which should be taken into consideration.
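As a rough illustration of the storage side (Oracle syntax; the file paths, sizes and column names are all made up), a dedicated tablespace with datafiles spread over separate disks keeps a 150 GB table from competing with everything else for I/O:

```
-- Sketch only: paths, sizes and column names are hypothetical.
CREATE TABLESPACE big_records_ts
    DATAFILE '/u01/oradata/big_records01.dbf' SIZE 50G AUTOEXTEND ON NEXT 1G,
             '/u02/oradata/big_records02.dbf' SIZE 50G AUTOEXTEND ON NEXT 1G;

CREATE TABLE big_records (
    record_id NUMBER(19)   NOT NULL,
    col01     VARCHAR2(64),
    col02     NUMBER
    -- ... the remaining 40-50 columns ...
) TABLESPACE big_records_ts;
```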
Generally speaking, I recommend indexing only the columns which are really necessary for the applications and queries the data is used for; otherwise the indexes will just slow your inserts down, use precious disk space and not help you at all.
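For example (generic SQL, with hypothetical table and column names), that means a handful of targeted indexes on the columns your queries actually filter or join on, not all 40-50:

```
-- Index only the columns the real queries search on; everything else stays unindexed.
CREATE INDEX ix_records_customer_id ON records (customer_id);
CREATE INDEX ix_records_created_on  ON records (created_on);

-- A typical query that can then use those indexes:
SELECT record_id, customer_id, created_on
FROM   records
WHERE  customer_id = 12345;
```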
Here are a few SO links which might help you further:
- Just what is 'A big database'?
- What's TOO BIG for a database?
- How does database indexing work?
Your colleague is kind of correct - 500M records in a DB is fine; I've used DBs with 2 billion rows, and that was 10 years back. Indexing every column is a problem - the indexes will slow every new record insert, and building the indexes will take a long time. You need to identify what kind of queries you'll run and then index appropriately. With so many records you may get benefits by normalising the data - a flat structure is often faster, but if you have repetitive long text fields, then replacing them with lookups may give storage and indexing benefits. Without seeing the data it's hard to give more precise advice.
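As a hypothetical sketch of that lookup idea (names and types are invented; adjust them to your RDBMS): the repeated text value moves into a small table, and the big table stores only a narrow numeric key, which is far cheaper to store and index.

```
-- Small lookup table holds each distinct text value once.
CREATE TABLE category_lookup (
    category_id   INT PRIMARY KEY,
    category_name VARCHAR(64) NOT NULL
);

-- The big table keeps just the key instead of repeating the text 500M times.
CREATE TABLE records (
    record_id   BIGINT PRIMARY KEY,
    category_id INT REFERENCES category_lookup (category_id)
    -- ... the other 40-50 columns ...
);
```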
BTW, if you do hit performance problems you can also partition the data into physically separate tables, maybe by year?
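A rough sketch of how that could look if you used SQL Server table partitioning rather than literally separate tables (the function, scheme and column names are all invented for illustration):

```
-- Partition rows by year; each range can sit on its own filegroup if you have the disks.
CREATE PARTITION FUNCTION pf_record_year (int)
    AS RANGE RIGHT FOR VALUES (2008, 2009, 2010);

CREATE PARTITION SCHEME ps_record_year
    AS PARTITION pf_record_year ALL TO ([PRIMARY]);

CREATE TABLE dbo.BigRecords (
    RecordId   bigint      NOT NULL,
    RecordYear int         NOT NULL,
    Col01      varchar(64) NULL,
    -- ... remaining columns ...
    CONSTRAINT PK_BigRecords PRIMARY KEY CLUSTERED (RecordYear, RecordId)
) ON ps_record_year (RecordYear);
```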
My next step (after you've chosen your DB platform and found a server) would be to get the data loaded and see how it performs. I'd take a look at bulk loading your data - I'm a SQL Server bloke, so Integration Services is the way to go. I'd make sure you have a single unique key, and if it's not in the data, add an identity column. Then you're ready to test some of this out. SQL Server Express is free and comes with SSIS, but it can only handle 10 GB DBs - still, that's enough to get familiar with the issues.
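If you want to try the load with plain T-SQL before setting up SSIS, a minimal sketch might look like this (the file path, delimiters, column names and staging-table layout are all assumptions):

```
-- Staging table matching the layout of the text file.
CREATE TABLE dbo.RecordsStaging (
    Col01 varchar(64) NULL,
    Col02 varchar(64) NULL,
    Col03 int         NULL
    -- ... one column per field in the file ...
);

-- Bulk load the flat file; TABLOCK and batching keep the transaction log manageable.
BULK INSERT dbo.RecordsStaging
FROM 'C:\data\records.txt'
WITH (
    FIELDTERMINATOR = '\t',
    ROWTERMINATOR   = '\n',
    TABLOCK,
    BATCHSIZE = 100000
);

-- If the data has no natural unique key, add a surrogate identity column afterwards.
ALTER TABLE dbo.RecordsStaging
    ADD RecordId bigint IDENTITY(1,1) NOT NULL;
```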
I regularly bulk load a 4M-row table with 50+ columns and it takes around 2 minutes. I'm happy to take this offline if you want further one-on-one advice.