开发者

Choosing data pseudo-randomly with even distribution

开发者 https://www.devze.com 2023-02-15 01:39 出处:网络
I\'m currently working on a medium-sized web project, and I\'ve ran into a problem. What I want to do is display a question, together with an image. I have a (global) list of questions, and a (globa

I'm currently working on a medium-sized web project, and I've ran into a problem.

What I want to do is display a question, together with an image. I have a (global) list of questions, and a (global) list of images, all questions should be asked for all images.

As far as the user can see the question and image should be chosen at random. However the statistics from the answers (question/image-pair) will be used for research purposes. This means that all the question/image-pair must be chosen such that the answers will be distributed evenly across all question, and across all images.

A user should only be able to answer a specific question/image-pair one time.

I am using a mysql database and php. Currently, i have three database tables:

tbl_images (image_id)

tbl_questions (question_id)

tbl_answers (answer_id, image_id, question_id, user_id)

The other columns are not related to this specific problem.

Solution 1: Track how many times each image/question has been used (add a column in each table). Always choose the image and question that has been asked the least.

Problem: What I'm actually interested in is distribution among questions for an image and vice versa, not that each question is even globally.

Solution 2: Add another table, containing all question/image-pairs along with how many times it has been asked. Choose the lowest combination (first row if count 开发者_如何学运维column is sorted by ascending order).

Problem: Does not enforce that the user can only answer a question once. Also does not give the appearance that the choice is random to the user.

Solution 3: Same as #2, but store question/image/user_id in table.

Problem: Performance issues (?), a lot of space wasted for each user. There will probably be semi-large amounts of data (thousands of questions/images and atleast hundreds of users).

Solution 4: Choose a question and image at true random from all available. With a large enough amount of answers they will be distributed evenly.

Problem: If i add a new question or image they will not get more answers than the others and therefore never catch up. I want an even amount of statistics for all question/image-pairs.

Solution 5: Weighted random. Choose a number of question/image pairs (say about 10-100) at true random and pick the best (as in, lowest global count) of these that the user has not answered.

Problem: Does not guarantee that a recently added question or image gets a lot of answers quickly.

Solution #5 is probably the best once I've come up with so far.

Your input is very much appreciated, thank you for your time.


From what I understand of your problem, I would go with #1. However, you do not need a new column. I would create an SQL View instead becuase it sounds like you'll need to report on things like that anyway. A view is basically a cached select, but acts similar to a table. Thus you would create a view for keeping the total of each question answered for each image:

DROP VIEW IF EXISTS "main"."view_image_question_count";
CREATE VIEW "view_image_question_count" AS 
SELECT a.image_id, a.question_id, SUM(b.question_id) as "total"
FROM answer AS a
INNER JOIN answer AS b ON a.question_id = b.question_id
GROUP BY a.image_id, a.question_id;

Then, you need a quick and easy way to get the next best image/question combo to ask:

DROP VIEW IF EXISTS "main"."view_next_best_question";
CREATE VIEW "view_next_best_question" AS 
SELECT a.*, user_id
    FROM view_image_question_count a
    JOIN answer USING( image_id, question_id )
    JOIN question USING(question_id)
    JOIN image USING(image_id)
ORDER BY total ASC;

Now, if you need to report on your image to question performace, you can do so by:

SELECT * FROM view_image_question_count

If you need the next best image+question to ask for a user, you would call:

SELECT * FROM view_next_best_question WHERE user_id != {USERID} LIMIT 1

The != {USERID} part is to prevent getting a question the user has already answered. The LIMIT optimizes to only get one.

Disclaimer: There is probably a lot that could be done to optimize this. I just wanted to post something for thought.

Also, here is the database dump I used for testing. http://pastebin.com/yutyV2GU

0

精彩评论

暂无评论...
验证码 换一张
取 消