We have a table in our database (MySQL) that has the following: id, title, and discription.
It's a mushup site and much of the data we get from the other site has the same titles and discriptions, though unique ids. We do not wish to pick up data that have the same titles and discriptions as we already hav开发者_开发技巧e similiar (same) data in our database.
How could we make it easy to drop the same data? (But remember, ids are unique.)
You do not insert them first and drop them later, you create a UNIQUE KEY on these two columns and make INSERT IGNORE INTO statements when inserting your data.
If you already have your data and want to remove them, you can ALTER IGNORE TABLE ADD UNIQUE INDEX ...
Depending on the size of your database you could calculate the Levenshtein distance betwen records or use soundex.
Your question is how to delete duplicate data from a table. Right?
You want to find all rows that have the same title and same description with other rows and from those keep only one and delete the others.
Suppose your table name is named table1 and your ID column is numeric.
DELETE t
FROM table1 t
JOIN (
SELECT title, description, MIN(ID) AS idNotToDelete
FROM table1
GROUP BY title, description
HAVING COUNT(*) > 1
) t1
ON t.title = t1.title AND t.description = t1.description AND t1.idNotToDelete <> t.id
The query above will find all rows with more than one occurrence and mark the minimum ID per occurrence. Then it will delete all the duplicate rows with the same title and description EXCEPT from the one with the minimum ID. so this
id title description
1 myTitle myDescription
2 myTitle myDescription
3 myTitle2 myDescription2
4 myTitle2 myDescription2
5 myTitle myDescription
will become
id title description
1 myTitle myDescription
3 myTitle2 myDescription2
精彩评论