A test data set for auto-testing UTF-8 string validator

Developer https://www.devze.com 2023-02-21 18:39 Source: Internet
I wrote a UTF-8 string validator function.

The function takes a buffer of bytes and its length in UTF-8 characters, and validates that the buffer consists of exactly the given number of valid UTF-8 characters.

If the buffer is too short or too long, or if it contains invalid UTF-8 characters, validation fails.
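For cross-checking, the contract described above could be modelled with a small reference oracle. This is only a sketch: the name `is_valid_utf8` and the parameter `expected_chars` are hypothetical, and it assumes that "UTF-8 characters" means Unicode code points. It relies on Python's strict UTF-8 decoder rather than on the validator under test.

```python
def is_valid_utf8(buf: bytes, expected_chars: int) -> bool:
    """Hypothetical reference oracle, not the validator from the question.

    Returns True only if buf is well-formed UTF-8 and decodes to exactly
    expected_chars code points (assumption: one "character" = one code point).
    """
    try:
        text = buf.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        # Malformed or truncated input.
        return False
    # Too few or too many characters also counts as a failure.
    return len(text) == expected_chars


if __name__ == "__main__":
    print(is_valid_utf8(b"AB\xc3\x87", 3))  # well-formed, count matches
    print(is_valid_utf8(b"AB\xc3\x87", 4))  # well-formed, count does not match
    print(is_valid_utf8(b"AB\xc3", 3))      # truncated 2-byte sequence
```

An oracle like this can be run side by side with the real validator on the same inputs, so any disagreement points at a case worth inspecting.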

Now I want to write auto-tests for my validator.

Is there a data-set that I can reuse?

I've found this file: http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt, but it does not seem to suit my purposes well; as I understand it, it is meant more for visualization tests.

Any clues?


  • Valid UTF-8 data, to see that it passes
    • Strings containing characters that need 1, 2, 3, and 4 code units (bytes). Don't just test "ABC" or "café".
  • Clearly invalid data, say some ISO-8859-1 string (that isn't also valid UTF-8)
  • A string containing overlong forms (for example, a 1-byte character encoded as 2 bytes). These should not pass as UTF-8
  • A string containing code points above U+10FFFF
  • Everything listed here: http://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences
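As a rough illustration, the categories in the list above might be collected into a small data set like the one below. The specific byte strings are my own picks rather than an official test suite, and `strict_decode_ok` is a hypothetical helper that only sanity-checks the samples against Python's strict decoder.

```python
# Valid samples: one character for each encoded length (1 to 4 bytes).
VALID = {
    "1 byte, U+0041": "A".encode("utf-8"),
    "2 bytes, U+00C7": "\u00c7".encode("utf-8"),        # c3 87
    "3 bytes, U+1E08": "\u1e08".encode("utf-8"),        # e1 b8 88
    "4 bytes, U+1F600": "\U0001f600".encode("utf-8"),   # f0 9f 98 80
}

# Invalid samples, one or more per category from the list.
INVALID = {
    "ISO-8859-1 'cafe' with e-acute": "caf\u00e9".encode("iso-8859-1"),  # 63 61 66 e9
    "overlong 2-byte form of U+0041": b"\xc1\x81",
    "overlong 2-byte form of U+0000": b"\xc0\x80",
    "overlong 3-byte form of U+002F": b"\xe0\x80\xaf",
    "code point U+110000 (above U+10FFFF)": b"\xf4\x90\x80\x80",
    "lead byte F5 (never valid)": b"\xf5\x80\x80\x80",
    "lone continuation byte": b"\x80",
    "truncated 3-byte sequence": b"\xe1\xb8",
    "byte FF (never valid)": b"\xff",
}


def strict_decode_ok(buf: bytes) -> bool:
    """Hypothetical helper: True if Python's strict decoder accepts buf."""
    try:
        buf.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    for name, buf in {**VALID, **INVALID}.items():
        print(f"{name}: {buf.hex(' ')} -> {strict_decode_ok(buf)}")
```

Feeding each entry to the validator under test, with the expected character count where relevant, should then give a pass for every `VALID` entry and a failure for every `INVALID` one.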

Depending on how good your code is:

  • Catching a UTF-8 string that encodes anything from U+D800 to U+DFFF (the surrogate code points, which should never be present in a UTF-8 string)
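One way to produce surrogate test inputs is sketched below. Python's regular encoder refuses surrogates, so this uses the `surrogatepass` error handler to emit the 3-byte forms anyway. The helper name `encode_surrogate` is hypothetical, and the two neighbouring code points are included as valid boundary cases.

```python
def encode_surrogate(cp: int) -> bytes:
    """Hypothetical helper: the (invalid) 3-byte UTF-8 form of a surrogate."""
    if not 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a surrogate code point")
    # "surrogatepass" lets the encoder emit bytes that strict UTF-8 forbids.
    return chr(cp).encode("utf-8", "surrogatepass")


# First and last high surrogate, first and last low surrogate.
SURROGATE_CASES = [encode_surrogate(cp) for cp in (0xD800, 0xDBFF, 0xDC00, 0xDFFF)]

# Code points just outside the surrogate range; these must stay valid.
BOUNDARY_VALID = [chr(0xD7FF).encode("utf-8"), chr(0xE000).encode("utf-8")]

if __name__ == "__main__":
    for buf in SURROGATE_CASES:
        print("should fail:", buf.hex(" "))
    for buf in BOUNDARY_VALID:
        print("should pass:", buf.hex(" "))
```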

Those test cases:

Should pass: "ABC"    41 42 43
Should pass: "ABÇ"    41 42 c3 87
Should pass: "ABḈ"    41 42 e1 b8 88
Should pass: "AB
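The complete cases listed above could be turned into a table-driven test along these lines. This is only a sketch: `run_cases` and the `validate(buf, nchars)` callback are hypothetical names for wiring in the validator under test, and the character count of 3 for each case is inferred from the quoted strings.

```python
# (label, bytes as hex, expected character count); all three should pass.
CASES = [
    ("AB\u0043", "41 42 43", 3),          # "ABC"
    ("AB\u00c7", "41 42 c3 87", 3),       # C with cedilla, 2 bytes
    ("AB\u1e08", "41 42 e1 b8 88", 3),    # C with cedilla and acute, 3 bytes
]


def run_cases(validate):
    """Return the labels of the cases that validate() wrongly rejects.

    validate is assumed to take (buf: bytes, nchars: int) and return a bool.
    """
    failures = []
    for label, hex_bytes, nchars in CASES:
        buf = bytes.fromhex(hex_bytes)  # fromhex ignores the spaces
        if not validate(buf, nchars):
            failures.append(label)
    return failures


if __name__ == "__main__":
    # Stand-in for the real validator, using Python's strict decoder.
    def stand_in(buf, nchars):
        try:
            return len(buf.decode("utf-8")) == nchars
        except UnicodeDecodeError:
            return False

    print(run_cases(stand_in))
```

Keeping the bytes as hex strings makes the table easy to compare against the listing by eye.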