Slope One implementation offers poor recommendations

I'm attempting to implement the Slope One algorithm in PHP for user-based item recommendation. To do this, I'm using the OpenSlopeOne library. The problem I'm having is that the recommendations generated aren't at all relevant to the user.

Currently I have two tables: user_ratings and slope_one. The user_ratings table is fairly straightforward. It contains a per-item rating given by that particular user (user_id, item_id and user_item_rating). The slope_one table follows OpenSlopeOne's default schema: item_id1, item_id2, times and rating.

The slope_one table is populated using the following SQL procedure:

CREATE PROCEDURE `slope_one`()
begin
    DECLARE tmp_item_id int;
    DECLARE done int default 0;
    DECLARE mycursor CURSOR FOR select distinct item_id from user_ratings;
    DECLARE CONTINUE HANDLER FOR NOT FOUND set done=1;
    open mycursor;
    while (!done) do
        fetch mycursor into tmp_item_id;
        if (!done) then
            -- For each pair (tmp_item_id, other item) rated by the same user,
            -- store the number of co-rating users (times) and the SUM, not the
            -- average, of rating(item_id1) - rating(item_id2).
            insert into slope_one (
                select a.item_id as item_id1,
                       b.item_id as item_id2,
                       count(*) as times,
                       sum(a.rating - b.rating) as rating
                from user_ratings a, user_ratings b
                where a.item_id = tmp_item_id
                  and b.item_id != a.item_id
                  and a.user_id = b.user_id
                group by a.item_id, b.item_id
            );
        end if;
    END while;
    close mycursor;
end


And to fetch the most relevant recommendations for a given user, I perform the following query:

SELECT
    item.* 
FROM
    slope_one s,
    user_ratings u,
    item
WHERE 
    u.user_id = '{USER_ID}' AND 
    s.item_id1 = u.item_id AND 
    s.item_id2 != u.item_id AND
    item.id = s.item_id2
GROUP BY 
    s.item_id2 
ORDER BY
    SUM(u.rating * s.times - s.rating) / SUM(s.times) DESC
LIMIT 20
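
For reference, the arithmetic that the procedure and the query above appear to implement (weighted Slope One) can be sketched in a few lines of Python. This is only a sketch for checking the math against a tiny data set: the function names build_diffs and predict and the sample ratings are made up for illustration and are not part of OpenSlopeOne. Unlike the query as written, the sketch skips items the user has already rated.

```python
from collections import defaultdict

def build_diffs(ratings):
    """ratings: {user: {item: rating}}.
    Returns (times, diff_sum), both keyed by (item1, item2), mirroring the
    slope_one table: co-rating count and SUM of rating(item1) - rating(item2)."""
    times = defaultdict(int)
    diff_sum = defaultdict(float)
    for user_ratings in ratings.values():
        for i, ri in user_ratings.items():
            for j, rj in user_ratings.items():
                if i != j:
                    times[(i, j)] += 1
                    diff_sum[(i, j)] += ri - rj
    return times, diff_sum

def predict(ratings, user, times, diff_sum):
    """Weighted Slope One: SUM(rating * times - diff_sum) / SUM(times) per candidate."""
    num = defaultdict(float)
    den = defaultdict(int)
    rated = ratings[user]
    for (i, j), t in times.items():
        # i must be an item the user rated; j is a candidate the user has not rated
        if i in rated and j not in rated:
            num[j] += rated[i] * t - diff_sum[(i, j)]
            den[j] += t
    return {j: num[j] / den[j] for j in num}

# Hypothetical sample data, for illustration only.
ratings = {
    "john": {"A": 5, "B": 3, "C": 2},
    "mark": {"A": 3, "B": 4},
    "lucy": {"B": 2, "C": 5},
}
times, diff_sum = build_diffs(ratings)
lucy = predict(ratings, "lucy", times, diff_sum)
mark = predict(ratings, "mark", times, diff_sum)
print(lucy, mark)
```

Running the same handful of ratings through both the SQL and a sketch like this is one way to tell an implementation bug apart from a property of the algorithm itself.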

As previously stated, this just doesn't seem to be working. I'm working with a fairly large data set (10,000+ recommendations) but I'm just not seeing any form of correlation. In fact, the majority of recommendations seem to be identical for users, even with totally disparate item ratings.

(Yes I'm purposely giving another answer.)

The other answer is that all these algorithms have strengths and weaknesses and do well on some data but not others. But I had a similar observation about Slope One some time ago and even got some comments from Daniel Lemire, who originally proposed the algorithm.

Consider what happens as the data becomes 100% dense -- each user rates every item. The rating diff between item A and item B is the average, over all co-rating users u, of the rating difference: average(r_uB - r_uA). But as all users rate, that approaches simply the average rating (over all users) for B, minus average rating for A: average(r_uB) - average(r_uA). Call those average(B) and average(A) for ease.

Imagine the item P with the highest average rating overall. The diff between A and P will be larger than the diff between A and any other B; it's (average(P) - average(A)), versus (average(B) - average(A)). P's diffs are always higher than those of any other B, by (average(P) - average(B)).

But since the algorithm estimates a preference by adding these diffs to a user's ratings, and averaging those, P becomes the top recommendation for all users, always. No matter what the user's ratings, and no matter what the diffs, the sum for P (and thus average) is largest. And so on.
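
The effect is easy to reproduce on synthetic data. The sketch below is only an illustration under stated assumptions: the training data is random and fully dense, the two test users (fan and critic) are hypothetical, and exact fractions are used so that ties are broken identically. Because every pair of items has the same number of co-rating users here, plain and weighted Slope One coincide.

```python
import random
from fractions import Fraction

rng = random.Random(0)
n_users, n_items = 50, 8

# Fully dense training data: every user rates every item (1 to 5 stars).
train = [[rng.randint(1, 5) for _ in range(n_items)] for _ in range(n_users)]

def avg_diff(j, i):
    """Average over all users of (rating of item j) - (rating of item i)."""
    return Fraction(sum(row[j] - row[i] for row in train), n_users)

item_mean = [Fraction(sum(row[j] for row in train), n_users) for j in range(n_items)]

def rank_candidates(user_ratings):
    """Rank the items a user has not rated by their Slope One prediction."""
    preds = {}
    for j in range(n_items):
        if j in user_ratings:
            continue
        estimates = [r + avg_diff(j, i) for i, r in user_ratings.items()]
        preds[j] = sum(estimates) / len(estimates)
    return sorted(preds, key=lambda j: (-preds[j], j))

# Two hypothetical users with opposite tastes on the same three items.
fan = {0: 5, 1: 5, 2: 1}
critic = {0: 1, 1: 1, 2: 5}

# Ranking of the remaining items purely by their global average rating.
by_mean = sorted(range(3, n_items), key=lambda j: (-item_mean[j], j))

print(rank_candidates(fan))
print(rank_candidates(critic))
print(by_mean)
```

All three printed rankings come out the same: with dense diffs, the user's own ratings only shift every candidate's prediction by the same constant, so the ordering is decided by the global averages alone.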

That's the tendency as data gets dense, and I think you see some echo of that effect already. It's not "wrong" (after all P is highly rated!) but feels intuitively suboptimal as the recommendations become unpersonalized.

Daniel Lemire says that the better approach, described in some follow-on papers, is to segment the data model into "positive" and "negative" ratings and build independent models from both. It avoids some of this and gives better performance.

Another variant, implemented in Apache Mahout, is to use better weighting in the estimated preference calculation. It has an option to weight against diffs that have a high standard deviation, and in favor of those with a low standard deviation. This favors diffs computed over many users. It's a crude step, but it helps.
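
A minimal sketch of that idea in Python follows. The weight used here, count / (1 + standard deviation), is an assumption chosen for illustration and is not claimed to be Mahout's exact formula; the function name and the sample numbers are likewise hypothetical.

```python
from statistics import pstdev

def weighted_prediction(user_ratings, diffs_by_item):
    """Estimate a user's rating for one target item.

    user_ratings:  {item: rating given by this user}
    diffs_by_item: {item: [r_target - r_item for each co-rating user]}
    Each per-item estimate (rating + mean diff) is weighted by
    count / (1 + stddev), an assumed form that penalizes noisy diffs.
    """
    num = den = 0.0
    for item, rating in user_ratings.items():
        diffs = diffs_by_item.get(item)
        if not diffs:
            continue  # nobody co-rated this item with the target
        mean_diff = sum(diffs) / len(diffs)
        weight = len(diffs) / (1.0 + pstdev(diffs))
        num += weight * (rating + mean_diff)
        den += weight
    return num / den if den else None

# Hypothetical numbers: diffs against A are consistent, diffs against B are noisy.
user_ratings = {"A": 4, "B": 2}
diffs_by_item = {"A": [1, 1, 1, 1], "B": [-2, 2]}
estimate = weighted_prediction(user_ratings, diffs_by_item)
print(estimate)
```

With count-only weighting the same inputs would give (4*5 + 2*2) / 6 = 4.0; down-weighting the noisy B diffs pulls the estimate toward the consistent A-based value of 5.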

You could try the Java implementation in Apache Mahout. There is an excerpt from Mahout in Action which covers its usage. That might be useful as a second data point and help differentiate algorithm versus implementation issues.

As of Mahout 0.9 the recommenders are discontinued. See https://mahout.apache.org/