I need some UTF-32 test strings to exercise some cross-platform string manipulation code. I'd like a suite of test strings that exercise the UTF-32 <-> UTF-16 <-> UTF-8 encodings, to validate that characters outside the BMP can be transformed from UTF-32, through UTF-16 surrogates, through UTF-8, and back again properly.
And I always find it a bit more elegant if the strings in question aren't just composed of random bytes, but are actually meaningful in the (various) languages they encode.
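For example, a handful of strings along these lines is the kind of thing I have in mind (a minimal C++11 sketch; the particular texts are just illustrative, not a curated suite):

    #include <string>
    #include <vector>

    // Candidate test strings: each mixes BMP text with characters that
    // need UTF-16 surrogate pairs (code points above U+FFFF).
    static const std::vector<std::u32string> test_strings = {
        U"ASCII only",
        U"Mixed BMP: caf\u00E9 \u65E5\u672C\u8A9E",     // 1-, 2- and 3-byte UTF-8
        U"Music: \U0001D11E clef",                      // U+1D11E MUSICAL SYMBOL G CLEF
        U"Gothic: \U00010330\U00010331\U00010332",      // U+10330..U+10332
        U"Emoji: \U0001F600 \U0001F680",                // U+1F600, U+1F680
        U"Last code point: \U0010FFFF"                  // highest valid scalar value
    };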
Although this isn't quite what you asked for, I've always found this test document useful.
http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
The same site offers this
http://www.cl.cam.ac.uk/~mgk25/ucs/examples/quickbrown.txt
... which contains equivalents of the English "quick brown fox" text for a variety of languages, exercising all of the characters each language uses. That page refers to a larger list of "pangrams" which used to be on Wikipedia but was apparently deleted there. It is still available here:
http://clagnut.com/blog/2380/
https://github.com/noct/cutf/tree/master/bin
It includes the following files:
UTF-8-demo.txt
big.txt
quickbrown.txt
utf8_invalid.txt
To really test all possible conversions between formats (as opposed to character conversions, i.e. towupper(), towlower()), you should test all characters. The following loop gives you all of them:
    for(char32_t c(0); c < 0x110000; ++c)  // char32_t, since wint_t may be only 16 bits on some platforms
    {
        if(c >= 0xD800 && c <= 0xDFFF)
        {
            continue;  // skip the UTF-16 surrogate range, which is not a valid character range
        }
        // here 'c' is any one Unicode character in UTF-32
        ...
    }
That way you can make sure you don't miss anything (i.e. a 100% complete test). This is only 1,112,064 characters, so it will run very quickly on a modern computer.
Note that for basic conversions between encodings my loop above is more than enough. However, there are other features in Unicode which would require testing pairs of characters that behave differently when used together; that is really not necessary here.
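A round-trip test built around that loop might look roughly like the following (a sketch with hand-rolled converters, not any particular library's API; the function names are made up for illustration):

    #include <cassert>
    #include <string>

    // UTF-32 -> UTF-16: one unit, or a surrogate pair above U+FFFF.
    std::u16string utf32_to_utf16(char32_t c)
    {
        if(c < 0x10000)
            return std::u16string(1, static_cast<char16_t>(c));
        char32_t v = c - 0x10000;
        return { static_cast<char16_t>(0xD800 | (v >> 10)),     // high surrogate
                 static_cast<char16_t>(0xDC00 | (v & 0x3FF)) };  // low surrogate
    }

    // UTF-16 -> UTF-32 for a single code point.
    char32_t utf16_to_utf32(const std::u16string& s)
    {
        if(s.size() == 1)
            return s[0];
        return 0x10000 + ((char32_t(s[0] & 0x3FF) << 10) | (s[1] & 0x3FF));
    }

    // UTF-32 -> UTF-8: one to four bytes.
    std::string utf32_to_utf8(char32_t c)
    {
        std::string o;
        if(c < 0x80)
        {
            o += static_cast<char>(c);
        }
        else if(c < 0x800)
        {
            o += static_cast<char>(0xC0 | (c >> 6));
            o += static_cast<char>(0x80 | (c & 0x3F));
        }
        else if(c < 0x10000)
        {
            o += static_cast<char>(0xE0 | (c >> 12));
            o += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
            o += static_cast<char>(0x80 | (c & 0x3F));
        }
        else
        {
            o += static_cast<char>(0xF0 | (c >> 18));
            o += static_cast<char>(0x80 | ((c >> 12) & 0x3F));
            o += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
            o += static_cast<char>(0x80 | (c & 0x3F));
        }
        return o;
    }

    // UTF-8 -> UTF-32 for a single code point.
    char32_t utf8_to_utf32(const std::string& s)
    {
        unsigned char b0 = static_cast<unsigned char>(s[0]);
        if(b0 < 0x80)
            return b0;
        if(b0 < 0xE0)
            return ((b0 & 0x1F) << 6) | (s[1] & 0x3F);
        if(b0 < 0xF0)
            return ((b0 & 0x0F) << 12) | ((s[1] & 0x3F) << 6) | (s[2] & 0x3F);
        return (char32_t(b0 & 0x07) << 18) | ((s[1] & 0x3F) << 12)
             | ((s[2] & 0x3F) << 6) | (s[3] & 0x3F);
    }

    int main()
    {
        for(char32_t c(0); c < 0x110000; ++c)
        {
            if(c >= 0xD800 && c <= 0xDFFF)
                continue;  // surrogates are not characters
            assert(utf16_to_utf32(utf32_to_utf16(c)) == c);  // UTF-32 -> UTF-16 -> UTF-32
            assert(utf8_to_utf32(utf32_to_utf8(c)) == c);    // UTF-32 -> UTF-8  -> UTF-32
        }
        return 0;
    }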
Also, I now have a separate C++ libutf8 library to convert characters between UTF-32, UTF-16, and UTF-8. Its tests use loops like the one shown above, and they also verify that invalid character codes are caught properly.
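On the invalid-input side, the essential cases are surrogate code points and values above U+10FFFF; a self-contained check of that rule (a hedged sketch, not the libutf8 API) is simply:

    #include <cassert>

    // A Unicode scalar value excludes surrogates and anything above U+10FFFF.
    bool is_valid_scalar(char32_t c)
    {
        return c < 0x110000 && !(c >= 0xD800 && c <= 0xDFFF);
    }

    int main()
    {
        assert(!is_valid_scalar(0xD800));    // lone high surrogate
        assert(!is_valid_scalar(0xDFFF));    // lone low surrogate
        assert(!is_valid_scalar(0x110000));  // beyond U+10FFFF
        assert(is_valid_scalar(0x10FFFF));   // last valid code point
        return 0;
    }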
Hmmm
You could find a lot of incidental data by googling (and see the right column for questions like these on SO...)
However, I recommend you pretty much build your test strings as byte arrays. It is not really about 'what data', just that Unicode gets handled correctly.
E.g. you will want to make sure that identical strings in different normalization forms (i.e. even if not in canonical form) still compare equal.
You will want to check that string length detection is robust (and recognizes single-, double-, triple- and quadruple-byte characters), and that traversing a string from beginning to end honours the same logic. Add more targeted tests for random access of Unicode characters; a sketch of the length check follows below.
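For instance, a length/traversal test string can be assembled directly from bytes so the expected counts are explicit (a sketch; count_code_points() is a hypothetical stand-in for your own length routine):

    #include <cassert>
    #include <cstddef>
    #include <string>

    // Hypothetical stand-in for your own UTF-8 length routine:
    // counts code points by skipping continuation bytes (0b10xxxxxx).
    std::size_t count_code_points(const std::string& s)
    {
        std::size_t n = 0;
        for(char ch : s)
        {
            unsigned char b = static_cast<unsigned char>(ch);
            if((b & 0xC0) != 0x80)
                ++n;
        }
        return n;
    }

    int main()
    {
        // "A" U+0041 (1 byte), "é" U+00E9 (2 bytes),
        // "€" U+20AC (3 bytes), G clef U+1D11E (4 bytes).
        const std::string s = "\x41" "\xC3\xA9" "\xE2\x82\xAC" "\xF0\x9D\x84\x9E";
        assert(s.size() == 10);             // byte length
        assert(count_code_points(s) == 4);  // code point length
        return 0;
    }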
These are all things you knew, I'm sure. I'm just spelling them out to remind you that you need test data catered to exactly the edge cases, the logical properties that are intrinsic to Unicode.
Only then will you have proper test data.
Beyond this scope (technically correct Unicode handling) is actual localization (collation, charset conversion, etc.). Here I refer you to the Turkey Test; see the short example after the links below.
Here are helpful links:
- http://minaret.info/test/collate.msp
- http://www.moserware.com/2008/02/does-your-code-pass-turkey-test.html
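As a quick illustration of the Turkey Test (assuming a system that has the tr_TR.UTF-8 locale installed; the exact result depends on the platform's locale data):

    #include <clocale>
    #include <cstdio>
    #include <cwctype>

    int main()
    {
        // With the Turkish locale active, uppercasing 'i' should give U+0130
        // (LATIN CAPITAL LETTER I WITH DOT ABOVE), not plain 'I' (U+0049).
        if(std::setlocale(LC_ALL, "tr_TR.UTF-8") == nullptr)
        {
            std::puts("tr_TR.UTF-8 locale not installed");
            return 1;
        }
        std::printf("towupper('i') -> U+%04X\n",
                    static_cast<unsigned>(std::towupper(L'i')));
        return 0;
    }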
You can try this one (there are some sentences in Russian, Greek, Chinese, etc. to test Unicode):
http://www.madore.org/~david/misc/unitest/