I wrote a UTF-8 string validator function.
The function takes a buffer of bytes and its length in UTF-8 characters, and validates that the buffer consists of exactly the given number of valid UTF-8 characters.
If the buffer is too short or too long, or if it contains invalid UTF-8 characters, validation fails.
Now I want to write auto-tests for my validator.
Is there a data-set that I can reuse?
I've found this file: http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt, but it does not seem to suit my purposes well; as I understand it, it is meant more for visualization tests.
Any clues?
- Valid UTF-8 data, to see that it passes
- Strings containing characters that need 1, 2, 3, and 4 code units (don't just test "ABC" or "café")
- Clearly invalid data, say some ISO-8859-1 string (that isn't also valid UTF-8)
- A string containing overlong forms (a 1-byte character encoded as 2 bytes, for example). These should not pass as UTF-8.
- A string containing code points above U+10FFFF
- Everything listed here: http://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences
Depending on how good your code is:
- Catching a UTF-8 string that encodes anything from U+D800 to U+DFFF (surrogate code points, which should never be present in a UTF-8 string)
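As a rough sketch of how the categories above could be turned into a reusable data set, here is a small table of byte strings with the expected verdict for each. The names `TEST_VECTORS` and `reference_is_valid` are made up for illustration, and Python's built-in strict UTF-8 decoder is used only as a stand-in oracle to sanity-check the table; your own validator would be called in its place.

```python
# Each entry: (description, raw bytes, expected to be valid UTF-8?)
# The table and helper names are hypothetical, for illustration only.
TEST_VECTORS = [
    ("1-byte character U+0041", b"\x41", True),
    ("2-byte character U+00C7", b"\xc3\x87", True),
    ("3-byte character U+1E08", b"\xe1\xb8\x88", True),
    ("4-byte character U+1F600", b"\xf0\x9f\x98\x80", True),
    ("ISO-8859-1 encoding of 'café'", b"caf\xe9", False),
    ("overlong 2-byte form of 'A'", b"\xc1\x81", False),
    ("overlong 2-byte form of NUL", b"\xc0\x80", False),
    ("overlong 3-byte form of '/'", b"\xe0\x80\xaf", False),
    ("code point above U+10FFFF (U+110000)", b"\xf4\x90\x80\x80", False),
    ("surrogate U+D800", b"\xed\xa0\x80", False),
    ("lone continuation byte", b"\x80", False),
    ("truncated 3-byte sequence", b"\xe1\xb8", False),
    ("byte 0xFF, never valid in UTF-8", b"\xff", False),
]


def reference_is_valid(data: bytes) -> bool:
    """Stand-in oracle: Python's strict UTF-8 decoder."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def mismatches():
    """Return descriptions of vectors where the oracle disagrees with the table."""
    return [desc for desc, data, ok in TEST_VECTORS
            if reference_is_valid(data) != ok]
```

Running `mismatches()` should return an empty list; swapping `reference_is_valid` for a wrapper around the validator under test gives a quick regression check.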
Those test cases:
Should pass: "ABC" 41 42 43
Should pass: "ABÇ" 41 42 c3 87
Should pass: "ABḈ" 41 42 e1 b8 88
Should pass: "AB
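Since the validator in the question also takes the expected number of characters, the byte sequences listed above can be paired with both correct and incorrect counts. The sketch below assumes a hypothetical `validate_utf8(buf, expected_chars)` function, implemented here with Python's decoder purely as a placeholder for the real validator.

```python
def validate_utf8(buf: bytes, expected_chars: int) -> bool:
    """Hypothetical stand-in for the validator under test.

    True only if buf is valid UTF-8 and decodes to exactly
    expected_chars code points.
    """
    try:
        return len(buf.decode("utf-8")) == expected_chars
    except UnicodeDecodeError:
        return False


# (bytes, claimed character count, expected result)
CASES = [
    (bytes.fromhex("41 42 43"), 3, True),        # "ABC"
    (bytes.fromhex("41 42 c3 87"), 3, True),     # "ABÇ"
    (bytes.fromhex("41 42 e1 b8 88"), 3, True),  # "ABḈ"
    (bytes.fromhex("41 42 c3 87"), 4, False),    # buffer holds fewer characters than claimed
    (bytes.fromhex("41 42 c3 87"), 2, False),    # buffer holds more characters than claimed
    (bytes.fromhex("41 42 c3"), 3, False),       # last character truncated
]


def run_cases() -> bool:
    """Return True if every case produces the expected result."""
    return all(validate_utf8(buf, n) == ok for buf, n, ok in CASES)
```

The mismatched-count rows matter as much as the invalid-byte rows, because a validator that only checks byte patterns would accept them.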