I'm looking at building some data warehousing/querying infrastructure, right now on top of Map/Reduce solutions like Hadoop.
However, it strikes me that all the M/R work is just repeating what the RDBMS guys have been solving for the last 20 years with parallel SQL databases. Parallel SQL implementations scale reads and writes across nodes, just like M/R, but they also already contain the niceties of regular databases (SQL, existing integration libraries, etc.).
The problem is: you don't seem to find the customers of those companies posting much online. So, does anyone here have experience with those kinds of solutions, and can give me some insight and/or links?
I have used Netezza and Hadoop, and I have secondhand knowledge of Infobright, a column database.
Netezza is a true database and implements ACID properties, which has both a cost and a benefit. Netezza is moving toward allowing more M/R code to run on its table data with the new TwinFin architecture. The previous version of the appliance supported user-defined functions and aggregations. In the new version, which runs Linux on the SPUs and uses Intel processors, the door is opening to running more custom code close to the data. My experience with Netezza has been very positive, with both the technology and the company.
Hadoop is pure map-reduce computing. It doesn't incur the cost of ACID database properties, so it's really a different beast than Netezza. Depending on the usage pattern it may be better, and it is certainly cheaper, than Netezza. The Hadoop ecosystem also has HBase and Hive, which may give you the query convenience you need at a lower cost.
Another developer on our team evaluated Infobright (so this is secondhand) and found the load performance to be poor and some of the aggregations to be slow. It has some parallels with Netezza (e.g. Netezza uses zone maps to help narrow the scope of a scan). Infobright is open source, with both a community edition and a supported enterprise edition.
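To make "pure map-reduce" a bit more concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps in plain Python. It is not Hadoop's API; the function names and the sample sales data are made up for illustration, and a real framework would spread each phase across many nodes.

```python
from collections import defaultdict


def map_phase(records):
    """Emit (key, value) pairs: here, (region, amount) for each sale."""
    for region, amount in records:
        yield region, amount


def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Collapse each key's list of values into a single total."""
    return {key: sum(values) for key, values in groups.items()}


# Hypothetical sample data, invented for this example.
sales = [("east", 100), ("west", 250), ("east", 50)]
totals = reduce_phase(shuffle(map_phase(sales)))
print(totals)
```

Note that nothing here deals with transactions or consistency; that is the part an ACID database pays for and a map-reduce job skips.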
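For anyone unfamiliar with zone maps, the general idea can be sketched in a few lines of Python: keep the minimum and maximum value for each block of data, and skip any block whose range cannot contain what the query is looking for. This is only an illustration of the concept, with made-up function names and data; it is not how Netezza or Infobright actually implement it.

```python
def build_zone_map(values, block_size):
    """Split values into fixed-size blocks and record each block's (min, max)."""
    blocks = [values[i:i + block_size] for i in range(0, len(values), block_size)]
    return [(min(block), max(block), block) for block in blocks]


def scan_range(zone_map, low, high):
    """Return values in [low, high] and the number of blocks actually read."""
    hits = []
    blocks_read = 0
    for block_min, block_max, block in zone_map:
        if block_max < low or block_min > high:
            continue  # the block's range cannot overlap the query, so skip it
        blocks_read += 1
        hits.extend(v for v in block if low <= v <= high)
    return hits, blocks_read


# Hypothetical column of 12 sorted values, stored in 3 blocks of 4.
zone_map = build_zone_map(list(range(1, 13)), block_size=4)
hits, blocks_read = scan_range(zone_map, 6, 7)
print(hits, blocks_read)
```

The pruning only pays off when the data is roughly ordered on the filtered column; on randomly ordered data almost every block's range overlaps the query.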
There is much more that can be said in context of your particular problem - probably beyond the scope of this forum. Hope this helps.
You haven't specified what questions you are trying to answer with your queries, or how your data is structured. Before you choose what solution to use you probably need to think about those two things.
You're correct: the major RDBMS vendors offer clustering solutions, both for parallel processing and for high availability. They've had this technology for a while, and any enterprise with a lot of data is probably using it. When you buy ($$$) the product they will give you lots of documentation and help you set it up (more $$$), if you can afford it.
RDBMSs are good for online transactions (OLTP), for answering questions about specific rows (where does Mary live?), and for answering some summary-type questions (how much did we sell in the first quarter?). Although they can be made to answer detailed summary questions (how much did we sell in the first quarter, broken down by product, salesperson, month, and region?), at that point you're usually starting to tax their limits, because any query that needs to visit all of the rows is going to be slow.
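As a rough illustration of the two kinds of summary question, here is a small sketch using Python's built-in sqlite3 module. The `sales` table, its columns, and the rows are all invented for this example; the point is only the shape of the two queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales ("
    "product TEXT, salesperson TEXT, region TEXT, sale_date TEXT, amount REAL)"
)
# Hypothetical rows; the last one falls outside the first quarter.
rows = [
    ("widget", "mary", "east", "2010-01-15", 100.0),
    ("widget", "mary", "east", "2010-02-03", 40.0),
    ("gadget", "bob", "west", "2010-03-20", 75.0),
    ("gadget", "bob", "west", "2010-04-02", 999.0),
]
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?, ?)", rows)

# Simple summary: how much did we sell in the first quarter?
q1_total = conn.execute(
    "SELECT SUM(amount) FROM sales "
    "WHERE sale_date >= '2010-01-01' AND sale_date < '2010-04-01'"
).fetchone()[0]

# Detailed summary: the same total, broken down by product, salesperson,
# month, and region.
breakdown = conn.execute(
    """
    SELECT product, salesperson, strftime('%m', sale_date) AS month,
           region, SUM(amount)
    FROM sales
    WHERE sale_date >= '2010-01-01' AND sale_date < '2010-04-01'
    GROUP BY product, salesperson, month, region
    ORDER BY product, month
    """
).fetchall()

print(q1_total)
for row in breakdown:
    print(row)
```

Both queries have to read every first-quarter row, but the second also has to group the rows on four columns, which is where the cost grows as the table gets large.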
For those types of queries most enterprises have a data warehouse that structures the data into multi-dimensional "cubes." (See Cognos, Hyperion, others). That may be appropriate for what you're trying to do.
I don't have any experience with MapReduce, but I've read the Wikipedia section on Uses, so if what you're trying to do falls into those categories I'd continue with it.
If you are in a fast-paced, growing organization, you should use Teradata. We have had a really good experience with it. It gives you scalability that, in our experience, no other vendor matches. Once you get used to its SQL and working style you will really appreciate the design and architecture of Teradata.