This is a question that's a bit different from the typical "how do I make a tagging system" question, which has been thoroughly discussed on SO and elsewhere.
I want to crowdsource tagging so you don't have to rely on the owner of each document to completely list out applicable tags. At the same time, I don't want a random schmoe to be able to mess up everyone's tags by intentionally mistagging lots of documents.
How does a system like this work in general? For example, Slashdot.org relies on something like this to come up with story tags. (never having edited tags, I would be interested in hearing more about how this works.)
Now to make this more concrete: Suppose my tagging DB schema looks like this:
doc: id, name, ...
tag: id, tag_name
doc_tag: doc_id, tag_id, user_id
Now, each user can assign his/her own tags to the documents. One way to determine a consensus is to look at the fraction of people who labeled a document with a particular tag. This results in the below monstrosity of an SQL statement.
SELECT
doc_id, tag_id,
num_times_tagged, taggers_count,
num_times_tagged/taggers_count AS popularity
FROM doc_tag
LEFT JOIN (
SELECT doc_id, tag_id, COUNT(*) AS num_times_tagged
F开发者_如何转开发ROM doc_tag GROUP BY doc_id, tag_id
) num_times
ON doc_tag.doc_id = num_times.doc_id AND
doc_tag.tag_id = num_times.tag_id
LEFT JOIN (
SELECT doc_id, COUNT(DISTINCT user_id) AS taggers_count
FROM doc_tag GROUP BY doc_id
) num_taggers
ON doc_tag.doc_id = num_taggers.doc_id
GROUP BY doc_tag.doc_id, doc_tag.tag_id
Am I going about this completely wrong? This seems like a really expensive query to make. Suppose I just wanted to get a list of documents and top tags for each one -- how would I even write a join for this? I don't want to run this query for every document that I get!
Thanks for any advice.
David
Here's a cleaner query:
SELECT
doc_id,
tag_id,
COUNT(*) AS num_times_tagged,
COUNT(DISTINCT user_id) AS taggers_count,
COUNT(*)/COUNT(DISTINCT user_id) AS popularity
FROM doc_tag
GROUP BY doc_tag.doc_id, doc_tag.tag_id
Also, I'm not familiar with all RBDMS, but if you're using Sql Server, you can create a view, then a clustered index on top of the view. This will slow down your inserts to doctag
, but make your reads from that view really fast.
精彩评论