Sorry in advance if this is a basic question. I'm reading a book on HBase and learning, but most of the examples in the book (as well as online) tend to use Java (I guess because HBase is native to Java). There are a few Python examples, and I know I can access HBase from Python (using Thrift or other modules), but I'm wondering about the additional functions.
For example, HBase has a 'coprocessors' feature that pushes the data to where you're doing your computing. Does this type of thing work with Python or other apps that use Hadoop streaming jobs? It seems that with Java, it can know what you're doing and manage the data flow accordingly, but how does this work with streaming? If it doesn't work, is there a way to get this type of functionality (via streaming, without switching to another language)?
Maybe another way of asking this is: what can a non-Java programmer do to get all the benefits of Hadoop's features when streaming?
Thanks in advance!
As far as I know, you are talking about two (or more) totally different concepts.
"Hadoop Streaming" is there to stream data through your executable (regardless of your choice of programming language). When using streaming there can't be any loss of functionality, since the functionality is basically to map/reduce the data you are getting from the Hadoop stream.
For the Hadoop part you can even use the Pig or Hive big data query languages to get things done efficiently. With the newest versions of Pig you can even write custom functions in Python and use them inside your Pig scripts.
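To give a rough idea of what such a Python function for Pig can look like, here is a minimal sketch. It assumes Pig's Jython UDF support, where an `outputSchema` decorator is made available to the script; the function name `extract_domain`, its schema string, and the fallback decorator are made up for illustration and are not from any book or from the Pig distribution.

```python
# Inside Pig's Jython runtime, `outputSchema` is already defined for the script.
# This fallback (an assumption, for local testing only) lets the same file run
# under plain Python, where the decorator would otherwise be missing.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def decorate(func):
            func.output_schema = schema  # recorded only so it can be inspected
            return func
        return decorate


@outputSchema("domain:chararray")
def extract_domain(email):
    """Return the lower-cased domain of an e-mail address, or None if malformed."""
    if email is None or "@" not in email:
        return None
    return email.rsplit("@", 1)[1].lower()
```

In a Pig script, a file like this is typically registered with something along the lines of `REGISTER 'udfs.py' USING jython AS udfs;` and then called as `udfs.extract_domain(email)`; the file name and alias here are hypothetical.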
Although there are tools that let you use the language you are comfortable with, never forget that the Hadoop framework is mostly written in Java. There could be times when you need to write a specialized InputFormat, or a UDF inside Pig, etc. Then a decent knowledge of Java would come in handy.
Your "HBase coprocessors" example is kind of unrelated to the streaming functionality of Hadoop. HBase coprocessors consist of two parts: a server-side part and a client-side part. I am pretty sure there are some useful server-side coprocessors bundled with the HBase release, but other than that you would need to write your own coprocessor (and bad news: it's Java). For the client side, I am sure you would be able to use them from your favorite programming language through Thrift without too much trouble.
So as an answer to your question: you can always dodge learning Java and still use Hadoop to its potential (using 3rd-party libraries/applications). But when shit hits the fan, it's better to understand the underlying internals and to be able to develop in Java. Knowing Java would give you full control over the Hadoop/HBase environment.
Hope you find this helpful.
Yes, you should get data-local code execution with streaming. You do not push the data to where the program is; you push the program to where the data is. Streaming simply takes the local input data and runs it through stdin to your Python program. Instead of each map running inside a Java task, it spins up an instance of your Python program and just pumps the input through that.
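To make that concrete, here is a minimal word-count sketch of what a streaming program can look like in Python. It assumes the usual streaming convention of tab-separated `key<TAB>value` lines, with the reducer receiving its input sorted by key; the function names and the "map"/"reduce" command-line argument are just choices made for this example.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' record for every word in the input."""
    for line in lines:
        for word in line.split():
            yield word + "\t1"


def reduce_lines(sorted_lines):
    """Reducer: sum the counts per word; the input must be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield word + "\t" + str(total)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical convention: pass "map" or "reduce" as the first argument.
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for record in step(sys.stdin):
        print(record)
```

Saved as, say, `wc.py` (a made-up name), this can be tried without a cluster using `cat input.txt | python wc.py map | sort | python wc.py reduce`, which mimics the map, sort and reduce phases on a single machine.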
If you really want fast processing, though, you really should learn Java. Having to pipe everything through stdin and stdout is a lot of overhead.
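To see where that overhead comes from, the sketch below imitates on one machine what the streaming task does for each map: it starts the program as a child process, writes the records to its stdin and reads the results back from its stdout. This is only an illustration of the piping, not Hadoop's actual implementation; `MAPPER_SOURCE`, `run_like_streaming` and the sample records are invented for the example.

```python
import subprocess
import sys

# Stand-in for a user-written mapper script: it upper-cases every input line.
MAPPER_SOURCE = (
    "import sys\n"
    "for line in sys.stdin:\n"
    "    sys.stdout.write(line.upper())\n"
)


def run_like_streaming(command, input_lines):
    """Run `command` as a child process, feed it the lines on stdin, return its stdout lines."""
    result = subprocess.run(
        command,
        input="".join(input_lines),  # every record is serialized to text and sent down a pipe
        capture_output=True,         # and every result comes back through another pipe
        text=True,
        check=True,
    )
    return result.stdout.splitlines()


# Pretend these two records are the input split stored on the local node.
local_split = ["row1\tfoo\n", "row2\tbar\n"]
output = run_like_streaming([sys.executable, "-c", MAPPER_SOURCE], local_split)
print(output)
```

Every record is turned into text, written to a pipe, parsed by the child, and then the same happens in reverse for the output. A Java map task avoids that round trip.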