Should I use a unique ID for a row in a junction table?_问答_开发者

I am using SQL Server 2008.

A while back, I asked the question "should I use RecordID in a junction table". The tables would look like this:

// Images
ImageID// PK

// Persons
PersonID // pk

// Images_Persons
RecordID // pk
ImageID // fk
PersonID // fk

I was strongly advised NOT to use RecordID because it's useless in a table where the two IDs create a unique combination, meaning there will be no duplicate records.

Now, I am trying to find a random r开发者_运维问答ecord in the junction table to create a quiz. I want to pull the first id and see if someone can match the second id. Specifically, I grab a random image and display it with three possible choices of persons.

The following query works, but I've quite a bit of negativity that suggests that it's very slow. My database might have 10,000 records, so I don't think that matters much. I've also read that the values generated aren't truly random.

SELECT TOP 1 * FROM Images_Persons ORDER BY newid();

Should I add the RecordID column or not? Is there a better way to find a random record in this case?

Previous questions for reference

Should I use "RecordID" as a column name?
SQL - What is the best table design to store people as musicians and artists?

NEWID is random enough and probably best
10k rows is peanuts
You don't need a surrogate key for a junction (link, many-many) table

Edit: in case you want to prematurely optimise...

You could ignore this and read these from @Mitch Wheat. But with just 10k rows your development time will be longer than any saved execution time..

Efficiently select random rows from large resultset with LINQ (ala TABLESAMPLE)
Efficiently randomize (shuffle) data in Sql Server table

Personally, I don't think that having the RecordID column should be advised AGAINST. Rather I'd advise that often it is UNNECESSARY.

There are cases where having a single value to identify a row makes for simpler code. But they're at the cost of additional storage, often additional indexes, etc. The overheads realistically are small, but so are the benefits.

In terms of the selection of random records, the existence of a single unique identifier can make the task easier if the identifiers are both sequential and consecutive.

The reason I say this is because your proposed solution requires the assignment of NEWID() to every record, and the sorting of all records to find the first one. As the table size grows this operation grows, and can become relatively expensive. Whether it's expensive enough to be worth optimising depends on whatever else is happening, how often, etc.

Where there are sequential consecutive unique identifiers, however, one can then choose a random value between MIN(id) and MAX(id), and then SEEK that value out. The requirement that all value are consecutive, however, is often a constraint too far; you're never allowed to delete a value mid-table, for example...

To overcome this, and depending on indexes, you may find the following approach useful.

DECLARE
  @max_id INT
SELECT
  @id = COUNT(*)
FROM
  Images_Persons

SELECT
  *
FROM
(
  SELECT
    *,
    ROW_NUMBER() OVER (ORDER BY ImageID, PersonID) AS id
  FROM
    Images_Persons
)
  AS data
WHERE
  Images_Persons.id = CAST(@max_id * RAND() + 1 AS INT)

-- Assuming that `ImageID, PersonID` is the clustered index.

A down side here is that RAND() is notoriously poor at being truly random. Yet it normally perfectly suitable if executed at a random time relative to any other call to RAND().

Consider what you've got.

SELECT TOP 1 * FROM Images_Persons ORDER BY newid();

Not truly random? Excluding the 'truly random is impossible' bit, you're probably right - I believe that there are patterns in generated uniqueidentifiers. But you should test this yourself. It'd be simple; just create a table with 1 to 100 in it, order by newid() a lot of times, and look at the results. If it's random 'enough' for you (which it probably will be, for a quiz) then it's good enough.

Very slow? I wouldn't worry about that. I'd be very surprised if the newid() is slower than reading the record from the table. But again, test and benchmark.

I'd be happy with the solution you have, pending tests if you're concerned about it.

I've always used order by newid().