I am currently working on syste开发者_高级运维m that generated product recommendations like those on Amazon : "People who bought this also bought this.."
Current Scenario:
Extract the Google Analytics data of the client and insert it in database.
On the website of the client, on load of product page the API call is made to get the recommendations of the product being viewed.
When API receives the product ID as request it looks in the database and retrieves (using association rules) the recommended product IDs and sends them as response.
The list of these product Ids will be processed to get the product details(image,price..) at the client end and displayed on website.
Currently I am using PHP and MYSQL with gapi package and REST api storage on AMAZON EC2 .
My Question is: Now, if I have to choose amongst the following, which will be the best choice to implement the above mentioned concept.
PHP with SimpleDB or BIGQuery.
R language with BIGQuery.
RHIPE-(R and hadoop ) with SimpleDB.
Apache Mahout.
Plese help!
This isn't so easy to answer, because the constraints are fairly specialized.
The following considerations can be made, though:
- BIGQuery is not yet public. Thus, with a small usage base, even if you are in the preview population, it will be harder to get advice on improvement.
- Each of your answers asked about a modeling system & a storage system. Apache Mahout is not a storage mechanism, so it won't necessarily work on its own. I used to believe that its machine learning implementations were a a pastiche of a few Google Summer of Code, but I've updated that view on the suggestion of a commenter. It still looks like it has rather uneven and spotty coverage of different algorithms, and it's not particularly clear how the components are supported or maintained. I encourage an evangelist for Mahout to address this.
As a result, this eliminates the 1st, 2nd, and 4th options.
What I don't quite get is the need for a real-time server to utilize Hadoop and RHIPE. That should be done in your batch processing for developing the recommendation models, not in real-time. I suppose you could use RHIPE as a simple one-stop front end for firing off queries.
I'd recommend using RApache instead of RHIPE, because you can get your packages and models pre-loaded. I see no advantage to using Hadoop in the front end, but it would be a very natural back end system for the model fitting.
(Update 1) Other interface options include RServe (http://www.rforge.net/Rserve/) and possibly RStudio in server mode. There are R/PHP interfaces (see comments below), but I suspect it would be better to access R through HTTP or TCP/IP.
(Update 2) Addressing the whole process, the basic idea I see is that you could query the data from PHP and pass to R or, if you wish to query from within R, look at the link in the comments (to the OmegaHat tools) or post a new question about R & SimpleDB - I'm sure someone else on SO would be able to give better insight on this particular connection. RApache would let you instantiate many R processes already prepared with packages loaded and data in RAM; thus you would only need to pass whatever data needs to be used for prediction. If your new data is a small vector then RApache should be fine, and it seems this is correct for the data being processed in real-time.
If you want a real-time API for recommendations based on data in a database, Apache Mahout does this directly. You want to use ReloadFromJDBCDataModel
, put on top a GenericItemBasedRecommender
, and use the servlet-based wrapper in the examples
module. It's probably a day or two of work to get familiar with the code and customize it to your needs, but it's pretty simple.
When you get past about 100M data points you would need to look at distributing the computation Hadoop. That's a fair bit more complex. Mahout has a distributed recommender too which you can customize.
精彩评论