I am writing a parser. I have taken care of all the encoding conversion to output UTF-8 correctly, but sometimes the source material itself is incorrect, such as ☐ or â€tm, the results of a bad encoding conversion.
I know this is a long shot, but does anyone know of a list of common strings resulting from bad character conversions, or anything similar, so I don't have to build my own list?
Yes, I know I am being lazy, but I read somewhere that that makes me a good programmer?
tl;dr: See last two paragraphs.
I hate/love encoding problems.
We're looking at a mutated copy of Unicode Character 'RIGHT SINGLE QUOTATION MARK' (U+2019). The byte sequence for that character is 0xE2 0x80 0x99. In Windows-1252, those bytes correspond to a-circumflex (â), the Euro sign (€), and the trademark symbol (™). The tm we see is a further transliteration of that trademark symbol into ASCII t and ASCII m, 0x74 0x6D, making our final corrupted sequence of bytes 0xE2 0x80 0x74 0x6D.
Chances are that the actual representation of a+hat-euro-t-m is already in UTF-8. That is, that a+hat is a UTF-8 sequence and the Euro symbol is also a UTF-8 sequence, because someone Copied from a Windows-1252 document that was already improperly encoded, and Pasted into a UTF-8 document. You'll find it's plenty more bytes than just the four from the original corruption.
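You can watch the double encoding happen in a few lines of PHP. This is just a sketch (the variable names are mine), but iconv and bin2hex are stock functions:

$right_quote = "\xE2\x80\x99"; // well-formed UTF-8 for U+2019
// Misread those three bytes as Windows-1252, then re-encode as UTF-8:
$mojibake = iconv('Windows-1252', 'UTF-8', $right_quote);
echo $mojibake;          // â€™
echo bin2hex($mojibake); // c3a2e282ace284a2: eight bytes where three belong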
One way to solve this would be to first turn the UTF-8 encoding of those characters back into Windows-1252, then treat that Windows-1252 string as UTF-8 when writing it back out.
You can use iconv with the //TRANSLIT flag for this purpose:
$less_bad = iconv('UTF-8', 'Windows-1252//TRANSLIT', $bad);
This tells iconv to try turning any characters that can't be represented in Windows-1252 into something similar. This translation is imperfect and will destroy any legitimate UTF-8 characters that aren't representable in Windows-1252.
Once you have the Windows-1252 string, save it back out and serve it up as UTF-8. If all went well, the corruption should be gone, and you shouldn't have any problems.
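As a sanity check, the round trip on the hypothetical $mojibake string from the sketch above recovers the original three bytes:

$less_bad = iconv('UTF-8', 'Windows-1252//TRANSLIT', $mojibake);
echo bin2hex($less_bad); // e28099, the valid UTF-8 sequence for U+2019 again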
Yeah, right.
In this specific case, the final byte of the proper sequence, 0x99, has been munged into two bytes by a bad Copy/Paste. You aren't going to get it back through character set encoding hoop jumping.
While the hoop jumping could work for some documents, you will surely find many things that are even more poorly re-encoded. Your best bet is going to be a byte-level search and replace operation, looking for incorrectly encoded sequences and replacing them with a plain-ASCII or properly UTF-8 encoded alternative. There are lots of ways the encoding could be wrong. For example, if the corruption source was in the ISO-8859 family, the final corrupted sequence would have been different, or the final ™ might not have been munged into t and m in certain places.
A byte-level search and replace is guaranteed to impact only incorrectly re-encoded sequences, and carries no risk of mangling single-encoded UTF-8 characters that can't be represented in inferior character sets. It's safer and faster.
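In PHP, strtr with an array argument does exactly this kind of byte-level, longest-match-first replacement. The table below is only a sketch: $dirty is a hypothetical input string, and the entries are assumptions you would grow from whatever corruptions your source material actually shows:

$fixes = array(
    "\xC3\xA2\xE2\x82\xAC\xE2\x84\xA2" => "\xE2\x80\x99", // â€™  -> ’
    "\xC3\xA2\xE2\x82\xACtm"           => "\xE2\x80\x99", // â€tm -> ’
    "\xE2\x80tm"                       => "\xE2\x80\x99", // the 0xE2 0x80 0x74 0x6D variant
);
$clean = strtr($dirty, $fixes);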
edit: I totally didn't actually catch that you were already planning on doing this. ;) Unfortunately I've never seen such a handy list. Perhaps you should publish and publicize your work so that others may benefit. yourcharacterencodingsucks.com is available!