I want to track which articles a user reads on a website.
Then, with that data, I want to be able to:
1) find the top N articles read in the last hour/day/week/month
2) show recommendations ("users who read this also read that")
3) do the same as (1), but for a specific section of the site
Since the site has high traffic (>1M views/day), I don't think I can use an RDBMS for this.
I started looking at NoSQL (Cassandra specifically), but since it's all new to me, I'm not sure whether it's what I need.
I'm positive I'm not the first one who needs something like this, but I couldn't find links or articles giving me pointers on how to do it. Is NoSQL the best approach? Any tips on the data model?
Thanks.
SQL will do this quite happily. A million views a day averages out to only about twelve per second; most databases will handle several hundred writes per second easily.
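To make the arithmetic concrete, here is a quick back-of-the-envelope calculation. The peak factor of 5 is purely an assumed number for illustration; measure your own traffic curve before relying on it.

```python
# Average request rate for one million views spread evenly over a day.
VIEWS_PER_DAY = 1_000_000
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

avg_per_second = VIEWS_PER_DAY / SECONDS_PER_DAY

# Traffic is never perfectly even, so allow for a busy hour.
# ASSUMED_PEAK_FACTOR is a made-up illustrative value, not a measurement.
ASSUMED_PEAK_FACTOR = 5
peak_per_second = avg_per_second * ASSUMED_PEAK_FACTOR

print(f"average: {avg_per_second:.1f} views/s, assumed peak: {peak_per_second:.1f} views/s")
```

Even with a generous peak factor, the rate stays well below "several hundred per second".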
You should already have a table for Articles and a table for Users; you will need to create a table Read, which is a many-to-many relationship between Users and Articles, probably with a timestamp. Every time you serve an article, you add a row to the Read table, in essence saying "User x just read Article y".
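A minimal sketch of that data model, using Python's built-in sqlite3 module so it can be run anywhere. The table and column names (users, articles, article_reads, section, read_at), the two indexes, and the record_read helper are all my own assumptions for illustration; I avoided the name Read because READ is a reserved word in some databases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users    (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT NOT NULL,
                           section TEXT NOT NULL);
    -- One row per page view: 'user x just read article y' at time read_at.
    CREATE TABLE article_reads (
        user_id    INTEGER NOT NULL REFERENCES users(id),
        article_id INTEGER NOT NULL REFERENCES articles(id),
        read_at    TEXT    NOT NULL DEFAULT CURRENT_TIMESTAMP
    );
    -- Assumed indexes: one for time-window queries, one for per-user joins.
    CREATE INDEX idx_reads_time ON article_reads (read_at, article_id);
    CREATE INDEX idx_reads_user ON article_reads (user_id, article_id);
""")

def record_read(conn, user_id, article_id):
    """Log one page view; call this each time an article is served."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO article_reads (user_id, article_id) VALUES (?, ?)",
            (user_id, article_id),
        )

# Tiny demo with made-up data.
conn.execute("INSERT INTO users (id, name) VALUES (1, 'alice')")
conn.execute("INSERT INTO articles (id, title, section) VALUES (10, 'Hello', 'news')")
record_read(conn, 1, 10)
rows = conn.execute("SELECT user_id, article_id FROM article_reads").fetchall()
```

The same schema translates directly to MySQL or PostgreSQL; only the timestamp type and default would change.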
You can then ask questions like "How many times was Article y read in the past week", or "How many articles does the average reader look at on Thursdays".
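For example, "top N articles in a time window, optionally limited to one section" could look roughly like this. It is a sketch using sqlite3, and it assumes a read log table called article_reads with a text timestamp column read_at and a section column on articles; top_articles and the sample data are hypothetical.

```python
import sqlite3
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"  # sorts correctly as plain text

def top_articles(conn, since, limit=10, section=None):
    """Return (id, title, views) for the most-read articles since `since`."""
    sql = """
        SELECT a.id, a.title, COUNT(*) AS views
        FROM article_reads AS r
        JOIN articles AS a ON a.id = r.article_id
        WHERE r.read_at >= ?
    """
    params = [since.strftime(FMT)]
    if section is not None:  # question (3): restrict to one section
        sql += " AND a.section = ?"
        params.append(section)
    sql += " GROUP BY a.id, a.title ORDER BY views DESC, a.id ASC LIMIT ?"
    params.append(limit)
    return conn.execute(sql, params).fetchall()

# Made-up sample data, with a fixed 'now' so the demo is reproducible.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT, section TEXT);
    CREATE TABLE article_reads (user_id INTEGER, article_id INTEGER, read_at TEXT);
    INSERT INTO articles VALUES (1, 'Budget vote', 'news'),
                                (2, 'Cup final', 'sports'),
                                (3, 'Old story', 'news');
""")
now = datetime(2024, 1, 8, 12, 0, 0)
sample = (
    [(u, 1, now - timedelta(minutes=10)) for u in range(3)]
    + [(u, 2, now - timedelta(minutes=20)) for u in range(2)]
    + [(u, 3, now - timedelta(days=10)) for u in range(5)]
)
conn.executemany(
    "INSERT INTO article_reads VALUES (?, ?, ?)",
    [(u, a, t.strftime(FMT)) for u, a, t in sample],
)

last_day = top_articles(conn, now - timedelta(days=1))
last_day_sports = top_articles(conn, now - timedelta(days=1), section="sports")
last_month = top_articles(conn, now - timedelta(days=30), limit=1)
```

Changing the `since` argument gives the hour/day/week/month variants without any extra tables.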
For speed, you may also find it useful to preprocess some of this information and do selective denormalization, for example, keeping per-Article counts of how often it has been read.
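One way to keep such a per-article counter is to update it in the same transaction that logs the read. This is only a sketch: the article_read_counts table and the record_read helper are names I made up, and under heavy write load you might prefer to batch the counter updates instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE article_reads (
        user_id    INTEGER NOT NULL,
        article_id INTEGER NOT NULL,
        read_at    TEXT    NOT NULL DEFAULT CURRENT_TIMESTAMP
    );
    -- Denormalized running total, one row per article.
    CREATE TABLE article_read_counts (
        article_id INTEGER PRIMARY KEY,
        reads      INTEGER NOT NULL DEFAULT 0
    );
""")

def record_read(conn, user_id, article_id):
    """Log the read and bump the counter atomically."""
    with conn:  # one transaction: both changes commit or neither does
        conn.execute(
            "INSERT INTO article_reads (user_id, article_id) VALUES (?, ?)",
            (user_id, article_id),
        )
        # Create the counter row on first sight, then increment it.
        conn.execute(
            "INSERT OR IGNORE INTO article_read_counts (article_id, reads) VALUES (?, 0)",
            (article_id,),
        )
        conn.execute(
            "UPDATE article_read_counts SET reads = reads + 1 WHERE article_id = ?",
            (article_id,),
        )

for user_id, article_id in [(1, 10), (2, 10), (1, 20)]:
    record_read(conn, user_id, article_id)

counts = dict(conn.execute("SELECT article_id, reads FROM article_read_counts"))
```

An all-time "most read" list then becomes a simple ordered scan of the small counter table rather than an aggregate over the full read log.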
Edit:
I am tempted to refer you to http://nosql.mypopescu.com/post/1016320617/mongodb-is-web-scale - being "NoSQL" does not reduce the amount of work needed or magically make it run faster (although it often does make it easier to throw more hardware at it, if you can phrase your problem in a form it likes).
"Users who read this also read:"
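-- Note: READ is a reserved word in some databases (e.g. MySQL);
-- quote or rename the table if your database complains.
SELECT
    Article.id, OtherArticle.id AS oid, OtherArticle.title, COUNT(*) AS cnt
FROM
    Article
    JOIN Read AS R1 ON Article.id = R1.article_id
    JOIN Read AS R2 ON R1.user_id = R2.user_id AND R1.article_id <> R2.article_id
    JOIN Article AS OtherArticle ON R2.article_id = OtherArticle.id
GROUP BY
    -- group by both sides of the pair so each (article, other article) gets its own count
    Article.id, OtherArticle.id, OtherArticle.title
ORDER BY
    cnt DESC, OtherArticle.title ASC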
By all means see how long this takes to run; I would probably keep the result as a reference table for immediate use, and update it with a background process every couple of hours.
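The reference-table idea could be sketched like this with sqlite3. The also_read table, refresh_also_read, and recommend are hypothetical names, and counting distinct users per pair (rather than raw rows) is my own choice so that repeat visits by one user do not inflate the score. The refresh function is what a cron job or background worker would call every couple of hours.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE article_reads (user_id INTEGER NOT NULL, article_id INTEGER NOT NULL);
    -- Precomputed 'users who read X also read Y' pairs.
    CREATE TABLE also_read (
        article_id       INTEGER NOT NULL,
        other_article_id INTEGER NOT NULL,
        cnt              INTEGER NOT NULL,
        PRIMARY KEY (article_id, other_article_id)
    );
""")

def refresh_also_read(conn):
    """Rebuild the reference table from the read log in one transaction."""
    with conn:
        conn.execute("DELETE FROM also_read")
        conn.execute("""
            INSERT INTO also_read (article_id, other_article_id, cnt)
            SELECT r1.article_id, r2.article_id, COUNT(DISTINCT r1.user_id)
            FROM article_reads AS r1
            JOIN article_reads AS r2
              ON r1.user_id = r2.user_id AND r1.article_id <> r2.article_id
            GROUP BY r1.article_id, r2.article_id
        """)

def recommend(conn, article_id, limit=5):
    """Cheap lookup at page-serve time: no joins against the read log."""
    return conn.execute(
        "SELECT other_article_id, cnt FROM also_read "
        "WHERE article_id = ? ORDER BY cnt DESC, other_article_id ASC LIMIT ?",
        (article_id, limit),
    ).fetchall()

# Made-up reads: user 1 read 10 and 20; user 2 read 10, 20, 30; user 3 read 20, 30.
conn.executemany(
    "INSERT INTO article_reads VALUES (?, ?)",
    [(1, 10), (1, 20), (2, 10), (2, 20), (2, 30), (3, 20), (3, 30)],
)
refresh_also_read(conn)
recs_for_10 = recommend(conn, 10)
recs_for_30 = recommend(conn, 30)
```

Serving recommendations then costs one indexed lookup per page, and the expensive self-join runs only during the periodic refresh.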
easyrec has exactly the functionality you need and can handle 1M actions (it uses MySQL). Check out the thread on the easyrec forum about the maximum number of actions.