ok, this is bugging me. i got a phonebook DB from a client where some of the results contain accented names,
and by some i mean mainly the city field, or category, which makes my query results look ridiculous.
DB Charset: UTF-8
for example:
CompanyName | City | etc...
DemoCompany | Hauptstraße 18 | Whatever
DemoCompany | Hauptstrabe 18 | Whatever
the DB has around 360k records, so manual checking is not an option. does anyone have an idea how i can find the accented/unaccented values? something like a duplicate column check...
EDIT: when i query the table, i get results for both, that is not the problem. the problem is, when i display the results, some are displayed with accent, and some without.
EDIT:
CREATE TABLE `enc` (
`company` varchar(255) DEFAULT NULL,
`address` varchar(255) DEFAULT NULL,
`postcode` varchar(255) DEFAULT NULL,
`city` varchar(255) DEFAULT NULL,
`Telefon1` varchar(255) DEFAULT NULL,
`Telefon2` varchar(255) DEFAULT NULL,
`Telefon3` varchar(255) DEFAULT NULL,
`Telefon4` varchar(255) DEFAULT NULL,
`Telefon5` varchar(255) DEFAULT NULL,
`Branche1` varchar(255) DEFAULT NULL,
`Branche2` varchar(255) DEFAULT NULL,
`Branche3` varchar(255) DEFAULT NULL,
`Branche4` varchar(255) DEFAULT NULL,
`Branche5` varchar(255) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8$$
You can start with something like this, that will show if there are rows that are exact duplicates of each other (and their count):
SELECT
CompanyName, City -- list every remaining column here as well
, COUNT(*) AS DuplicateCount
FROM
TableToCheck
GROUP BY
CompanyName, City -- same column list: all columns except the Primary Key
HAVING
COUNT(*) > 1
If you want to find only duplicate addresses, you do something like this:
SELECT
Address
, COUNT(*) AS DuplicateCount
FROM
TableToCheck
GROUP BY
Address
HAVING
COUNT(*) > 1
Reading your question again, I think I misunderstood what you are asking. If you don't want to find duplicates (as there are none) but want to find accented words (and perhaps replace them with unaccented ones):
The table you have now is probably using a case-insensitive (and accent-insensitive) collation (like utf8_general_ci or utf8_unicode_ci), so you could copy the table into a new one that has the same charset but a binary collation that distinguishes case and accents, like utf8_bin.
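The same idea can be tried without copying the table by forcing the collation inside the query. This is a minimal sketch, assuming the `enc` table from the question, a MySQL `utf8` column, and MySQL's collation names `utf8_unicode_ci` and `utf8_bin`; the aliases `spelling_a` and `spelling_b` are made up for illustration. It lists pairs of city values that compare equal when accents and case are ignored but are stored with different characters:

```sql
-- Sketch only: assumes the `enc` table from the question (utf8 column `city`).
-- DISTINCT is done under utf8_bin on purpose: under a _ci collation,
-- DISTINCT would collapse the accented and unaccented spelling into one row.
SELECT a.city AS spelling_a, b.city AS spelling_b
FROM (SELECT DISTINCT city COLLATE utf8_bin AS city FROM enc) AS a
JOIN (SELECT DISTINCT city COLLATE utf8_bin AS city FROM enc) AS b
  -- equal when accents and case are ignored ...
  ON a.city COLLATE utf8_unicode_ci = b.city COLLATE utf8_unicode_ci
  -- ... but not the same characters; "<" also reports each pair only once
 AND a.city COLLATE utf8_bin < b.city COLLATE utf8_bin;
```

Two caveats: pairs that differ only in letter case show up too, and the collation decides what counts as equal. In MySQL, utf8_general_ci compares 'ß' as equal to 's', while utf8_unicode_ci compares it as equal to 'ss'. A spelling like 'Hauptstrabe' matches neither, so it would need one of the fuzzier approaches.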
You could then create a list of accented characters and then write a query to check for this list in fields of your new table (this will be real slow):
SELECT nt.*
FROM NewTable AS nt
JOIN AccentedList AS al
WHERE nt.field LIKE CONCAT('%', al.AccentedChar, '%')
GROUP BY nt.PK
or run a query to REPLACE() those characters, like 'ß' with 'ss' for example.
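A hedged sketch of the REPLACE() route, assuming the `enc` table from the question, a connection that sends string literals as UTF-8 (for example after `SET NAMES utf8`), and 'ß' to 'ss' as the mapping you actually want; `city_folded` is an illustrative alias:

```sql
-- Preview first: which rows would change, and how?
-- COLLATE utf8_bin makes LIKE match the literal character only; under an
-- accent-insensitive collation the pattern could also match plain letters.
SELECT city, REPLACE(city, 'ß', 'ss') AS city_folded
FROM enc
WHERE city COLLATE utf8_bin LIKE '%ß%';

-- Apply once the preview looks right (back up the table beforehand).
UPDATE enc
SET city = REPLACE(city, 'ß', 'ss')
WHERE city COLLATE utf8_bin LIKE '%ß%';
```

REPLACE() matches case-sensitively, so uppercase accented letters need their own call. The same pattern works in the other direction if you would rather standardize on the accented spelling.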
You don't only have to consider accents but many other equivalent characters:
- in German you can write 'ß' as 'ss', ä as 'ae', 'ü' as 'ue' and so on
- in Italian and French you can search for letters without the accent but the accent is also sometimes substituted with an apostrophe (e.g., giocherò as giochero' in Italian)
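To see where both spellings of such an equivalence occur in the data, one option is to group on a folded key. The sketch below assumes the `enc` table from the question and a UTF-8 connection, and it only folds the four lowercase German characters; `folded_city`, `spellings` and `variants` are illustrative names:

```sql
-- Group city values by a key in which German special characters are
-- written out (ß -> ss, ä -> ae, ö -> oe, ü -> ue), then keep the keys
-- that are stored with more than one distinct spelling.
SELECT
  REPLACE(REPLACE(REPLACE(REPLACE(city, 'ß', 'ss'), 'ä', 'ae'), 'ö', 'oe'), 'ü', 'ue') AS folded_city,
  COUNT(DISTINCT city COLLATE utf8_bin) AS spellings,
  GROUP_CONCAT(DISTINCT city COLLATE utf8_bin SEPARATOR ' | ') AS variants
FROM enc
GROUP BY folded_city
HAVING spellings > 1;
```

Uppercase umlauts and the apostrophe convention would need further REPLACE() calls, and GROUP_CONCAT() output is cut off at `group_concat_max_len`, so long variant lists may appear truncated.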
Once you have identified these equivalences, write a function that compares the strings without considering these differences, or try to match using a function that works on phonetic similarity.
Examples are (many databases implement them):
- Soundex
- Distance similarity
- Jaro Winkler
MySQL has a SOUNDEX function; for the others you will have to define your own function (there are several examples on the web).
The results are not perfect but looking for similar entries will help a manual check.
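As a quick illustration of the built-in function, assuming the `enc` table from the question (the search term 'Muenchen' is just a made-up example):

```sql
-- Compare the phonetic keys of two spellings directly.
-- expr1 SOUNDS LIKE expr2 is shorthand for SOUNDEX(expr1) = SOUNDEX(expr2).
SELECT
  SOUNDEX('Meyer') AS meyer_key,
  SOUNDEX('Mayer') AS mayer_key,
  'Meyer' SOUNDS LIKE 'Mayer' AS same_sound;

-- List stored city spellings that sound like a given name.
-- COLLATE utf8_bin keeps DISTINCT from merging accented and unaccented forms.
SELECT DISTINCT city COLLATE utf8_bin AS city
FROM enc
WHERE city SOUNDS LIKE 'Muenchen';
```

The MySQL manual notes that SOUNDEX() is designed for English, treats letters outside A-Z as vowels, and is not guaranteed to give consistent results on multibyte character sets such as utf8, so it is safer to treat matches as candidates for review.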
I'm pretty sure this is a case for a phonetic search. You could create a temporary (possibly in-memory) table, insert the phonetic equivalent of each row into it, then count how many are duplicates. This works very well for names (Meyer, Mayer) as well as streets (Straße, Strasse).
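A rough sketch of that approach with MySQL's SOUNDEX(), assuming the `enc` table from the question; `city_sounds` and `sound_key` are made-up names. Since SOUNDEX() treats letters outside A-Z as vowels, 'ß' is written out as 'ss' first so that Straße and Strasse have a chance of getting the same key:

```sql
-- One row per distinct stored spelling, plus its phonetic key.
CREATE TEMPORARY TABLE city_sounds AS
SELECT DISTINCT
  city COLLATE utf8_bin AS city,
  SOUNDEX(REPLACE(city, 'ß', 'ss')) AS sound_key
FROM enc
WHERE city IS NOT NULL;

-- Phonetic keys shared by more than one spelling: candidates for review.
SELECT sound_key,
       COUNT(*) AS spellings,
       GROUP_CONCAT(city SEPARATOR ' | ') AS variants
FROM city_sounds
GROUP BY sound_key
HAVING COUNT(*) > 1
ORDER BY spellings DESC;
```

ENGINE=MEMORY is left out here because MEMORY tables store fixed-width rows and are capped by `max_heap_table_size`; it can be added if the distinct values fit. Soundex keys are coarse, so unrelated names will share a key and the output still needs a manual pass.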