How to generate all the characters in the UTF-8 charset in .NET

I have been given the task of generating all the characters in the UTF-8 character set to test how a system handles each of them. I do not have much experience with character encoding. The approach I was going to try was to increment a counter and then translate that base-ten number into its equivalent UTF-8 character, but so far I have not been able to find an effective way to do this in C# 3.5.

Any suggestions would be greatly appreciated.

System.Net.WebClient client = new System.Net.WebClient();
string definedCodePoints = client.DownloadString(
                         "http://unicode.org/Public/UNIDATA/UnicodeData.txt");
System.IO.StringReader reader = new System.IO.StringReader(definedCodePoints);
System.Text.UTF8Encoding encoder = new System.Text.UTF8Encoding();
while(true) {
  string line = reader.ReadLine();
  if(line == null) break;
  int codePoint = Convert.ToInt32(line.Substring(0, line.IndexOf(";")), 16);
  if(codePoint >= 0xD800 && codePoint <= 0xDFFF) {
    //surrogate boundary; not valid codePoint, but listed in the document
  } else {
    string utf16 = char.ConvertFromUtf32(codePoint);
    byte[] utf8 = encoder.GetBytes(utf16);
    //TODO: something with the UTF-8-encoded character
  }
}

The above code should iterate over the currently assigned Unicode characters. You'll probably want to parse the UnicodeData file locally and fix any C# blunders I've made.
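One detail worth handling when parsing the file: UnicodeData.txt lists some large blocks (CJK ideographs and Hangul syllables, for example) as just two lines whose names end in ", First>" and ", Last>", so a line-by-line loop only sees the endpoints of those ranges. Below is a minimal sketch of expanding such pairs, assuming the lines have already been read from a local copy. The class and method names are hypothetical, and surrogate ranges still need to be filtered out by the caller as in the code above.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

static class UnicodeDataRanges
{
    // Yields every code point described by the given UnicodeData.txt lines,
    // expanding "<..., First>" / "<..., Last>" pairs into the full range.
    public static IEnumerable<int> ExpandCodePoints(IEnumerable<string> lines)
    {
        int rangeStart = -1;
        foreach (string line in lines)
        {
            if (string.IsNullOrEmpty(line)) continue;
            string[] fields = line.Split(';');
            if (fields.Length < 2) continue;

            int codePoint = int.Parse(fields[0], NumberStyles.HexNumber,
                                      CultureInfo.InvariantCulture);
            string name = fields[1];

            if (name.EndsWith(", First>", StringComparison.Ordinal))
            {
                rangeStart = codePoint; // remember the start, emit on "Last"
            }
            else if (name.EndsWith(", Last>", StringComparison.Ordinal) && rangeStart >= 0)
            {
                for (int cp = rangeStart; cp <= codePoint; cp++) yield return cp;
                rangeStart = -1;
            }
            else
            {
                yield return codePoint; // ordinary single-character line
            }
        }
    }
}
```

Passing the result of System.IO.File.ReadAllLines on a downloaded copy of the file to ExpandCodePoints would then give the full list of assigned code points, range interiors included.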

The set of currently assigned Unicode characters is less than the set that could be defined. Of course, whether you see a character when you print one of them out depends on a great many other factors, like fonts and the other applications it'll pass through before it is emitted to your eyeball.

There are no "UTF-8 characters". Do you mean Unicode characters or the UTF-8 encoding of Unicode characters?

It's easy to convert an int to a Unicode character, provided of course that there is a mapping for that code:

char c = (char)theNumber;

If you want the UTF-8 encoding for that character, that's not very hard either:

byte[] encoded = Encoding.UTF8.GetBytes(c.ToString());

You would have to check the Unicode standard to see the number ranges where there are Unicode characters defined.
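Note that a cast to char only covers values up to 0xFFFF. For a number that may be any Unicode code point, a small helper built on char.ConvertFromUtf32 can be used instead. This is a sketch, and the class and method names are made up for illustration.

```csharp
using System;
using System.Text;

static class CodePointEncoding
{
    // Returns the UTF-8 bytes for one Unicode code point.
    // char.ConvertFromUtf32 throws ArgumentOutOfRangeException for surrogate
    // values (0xD800-0xDFFF) and for anything above 0x10FFFF.
    public static byte[] ToUtf8(int codePoint)
    {
        string text = char.ConvertFromUtf32(codePoint); // one or two UTF-16 chars
        return Encoding.UTF8.GetBytes(text);
    }
}
```

For example, CodePointEncoding.ToUtf8(0x41) gives the single byte 0x41, while CodePointEncoding.ToUtf8(0x1F600) gives the four bytes F0 9F 98 80.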

Even once you generate all the characters, you'll find it's not an effective test. Some of the characters are combining marks, which means they combine with the character that comes before them, so a string full of combining marks won't make much sense. There are other special cases too. You'll be much better off using actual text in the languages you need to support.
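To make the combining-mark point concrete, here is a small sketch (the helper names are hypothetical) that detects combining marks by Unicode category and counts what the framework treats as one text element rather than counting UTF-16 chars.

```csharp
using System;
using System.Globalization;

static class CombiningMarkDemo
{
    // True if the char belongs to one of the three combining-mark categories.
    public static bool IsCombiningMark(char c)
    {
        UnicodeCategory category = char.GetUnicodeCategory(c);
        return category == UnicodeCategory.NonSpacingMark
            || category == UnicodeCategory.SpacingCombiningMark
            || category == UnicodeCategory.EnclosingMark;
    }

    // Counts text elements (a base character plus its combining marks is one).
    public static int CountTextElements(string text)
    {
        return new StringInfo(text).LengthInTextElements;
    }
}
```

The string "e\u0301" (an "e" followed by a combining acute accent) has a Length of 2, but CountTextElements reports 1 for it, and IsCombiningMark('\u0301') is true.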

UTF-8 isn't a character set - it's a character encoding which is capable of encoding any character in the Unicode character set into binary data.

Could you give more information about what you're trying to do? You could encode all the possible Unicode characters (including ones which aren't allocated at the moment), although if you need to cope with characters outside the Basic Multilingual Plane (i.e. those above U+FFFF) then it becomes slightly trickier...
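A minimal sketch of walking every possible code point, including those above U+FFFF, might look like this. The names are hypothetical. It skips the surrogate range, which is reserved for UTF-16 and cannot be encoded on its own, and it makes no attempt to tell allocated code points from unallocated ones.

```csharp
using System;
using System.Collections.Generic;

static class AllCodePoints
{
    // Yields each Unicode scalar value (U+0000..U+10FFFF minus surrogates)
    // as a .NET string. Values above U+FFFF come back as a two-char
    // surrogate pair, which is why a plain char loop cannot reach them.
    public static IEnumerable<string> AllScalarValuesAsStrings()
    {
        for (int codePoint = 0; codePoint <= 0x10FFFF; codePoint++)
        {
            if (codePoint >= 0xD800 && codePoint <= 0xDFFF) continue;
            yield return char.ConvertFromUtf32(codePoint);
        }
    }
}
```

Each yielded string can be passed straight to Encoding.UTF8.GetBytes. The sequence contains 1,112,064 entries (0x110000 code points minus 2,048 surrogates).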

You can brute-force an Encoding to figure out which code points it supports. To do so, simply go through all possible code points, convert them to strings, and see if Encoding.GetBytes() throws an exception or not (after setting Encoding.EncoderFallback to EncoderExceptionFallback).

IEnumerable<int> GetAllWritableCodepoints(Encoding encoding)
{
    encoding = Encoding.GetEncoding(encoding.WebName, new EncoderExceptionFallback(), new DecoderExceptionFallback());

    var i = -1;
    // Docs for char.ConvertFromUtf32() say that 0x10ffff is the maximum code point value.
    while (i != 0x10ffff)
    {
        i++;

        var success = false;
        try
        {
            encoding.GetByteCount(char.ConvertFromUtf32(i));
            success = true;
        }
        catch (ArgumentException)
        {
        }
        if (success)
        {
            yield return i;
        }
    }
}

This method should support discovering characters represented by surrogate pairs of Char values in .NET. However, it is very slow (it takes minutes to run on my machine) and probably impractical.

UTF-8 is not a charset, it's an encoding. Any Unicode code point can be encoded in UTF-8, using between one and four bytes depending on its value.
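The byte length follows directly from the code point's value. The sketch below (hypothetical names) computes it from the standard UTF-8 range boundaries and includes a helper that cross-checks the result against the framework's own encoder.

```csharp
using System;
using System.Text;

static class Utf8Lengths
{
    // Number of bytes UTF-8 uses for a code point, derived from its value alone.
    public static int Utf8Length(int codePoint)
    {
        if (codePoint < 0 || codePoint > 0x10FFFF
            || (codePoint >= 0xD800 && codePoint <= 0xDFFF))
            throw new ArgumentOutOfRangeException("codePoint");

        if (codePoint < 0x80) return 1;     // U+0000..U+007F (ASCII)
        if (codePoint < 0x800) return 2;    // U+0080..U+07FF
        if (codePoint < 0x10000) return 3;  // U+0800..U+FFFF
        return 4;                           // U+10000..U+10FFFF
    }

    // Cross-check against what the framework's encoder actually produces.
    public static bool MatchesFramework(int codePoint)
    {
        string text = char.ConvertFromUtf32(codePoint);
        return Encoding.UTF8.GetByteCount(text) == Utf8Length(codePoint);
    }
}
```

For instance, Utf8Length returns 1 for U+0041 (A), 2 for U+00E9 (é), 3 for U+20AC (€) and 4 for U+1F600.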

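In .NET, a char is 16 bits wide (that does not cover the complete Unicode set, but it is the most practical range), so you can try this:

 // Use an int counter: a char can never reach 65536, so a char loop would never end.
 for (int i = 0; i <= char.MaxValue; i++) {
     if (i >= 0xD800 && i <= 0xDFFF) continue; // lone surrogates are not characters
     string s = ((char)i).ToString();
     byte[] bytes = Encoding.UTF8.GetBytes(s);
     // do something with bytes
 }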

This will give you all the characters that an encoding supports; just make sure you name the charset you want when getting the Encoding (the example checks the values 0 to 9999 against ISO-8859-1):

var results = new ConcurrentBag<int> ();
Parallel.For (0, 10, set => {
    var encoding = Encoding.GetEncoding ("ISO-8859-1");
    var c = encoding.GetEncoder ();
    c.Fallback = new EncoderExceptionFallback ();
    var start = set * 1000;
    var end = start + 1000;
    Console.WriteLine ("Worker #{0}: {1} - {2}", set, start, end);

    char[] input = new char[1];
    byte[] output = new byte[5];
    for (int i = start; i < end; i++) {
        try {
            input[0] = (char)i;
            c.GetBytes (input, 0, 1, output, 0, true);
            results.Add (i);
        }
        catch {
        }
    }
});
var hashSet = new HashSet<int> (results);
//hashSet.Remove ((int)'\r');
//hashSet.Remove ((int)'\n');
var sorted = hashSet.ToArray ();
Array.Sort (sorted);
var charset = new string (sorted.Select (i => (char)i).ToArray ());

This code writes its output to a file. All characters, printable or not, will be in it.

Encoding enc = (Encoding)Encoding.GetEncoding("utf-8").Clone();
enc.EncoderFallback = new EncoderReplacementFallback("");
char[] chars = new char[1];
byte[] bytes = new byte[16];

using (StreamWriter sw = new StreamWriter(@"C:\utf-8.txt"))
{
    for (int i = 0; i <= char.MaxValue; i++)
    {
        chars[0] = (char)i;
        int count = enc.GetBytes(chars, 0, 1, bytes, 0);

        if (count != 0)
        {
            sw.WriteLine(chars[0]);
        }
    }
}

As other people have said, UTF-8 is an encoding, not a character set; the character set is Unicode.

If you skim through http://www.joelonsoftware.com/articles/Unicode.html it should help clarify what Unicode is.

Below is the PowerShell code I used to join the lines produced by the code Jake suggested into a text file with 256 characters per line.

Control characters ("service symbols") create two blank lines that do not exist in the original. These must be removed from the input text file before the PowerShell processing so that the resulting file is created correctly.

I'll just post here what the ASCII part should look like.

NUL SOH STX ETX EOT ENQ ACK BEL BS TAB LF VT FF CR SO SI DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US Space ! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = > ? @ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ] ^ _ ` a b c d e f g h i j k l m n o p q r s t u v w x y z { | } ~ PAD HOP BPH NBH IND NEL SSA ESA HTS HTJ VTS PLD PLU RI SS2 SS3 DCS PU1 PU2 STS CCH MW SPA EPA SOS SGCI SCI CSI ST OSC PM APC Non-breakingSpace ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ ® ¯ ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿ À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ

In the initial file, each character will be on a new line.

It is better to use Notepad++ to see the control characters, and best to replace them with their text names by hand.

There are two more control characters just below the ASCII part, and many more at the end.

To admire the colored emoji, though, you can simply copy the text you like into Word or a social network. Word renders the characters better than Notepad, but worse than a website.

$arrayFromFile = [IO.File]::ReadAllLines('C:\utf-8.txt')
$counter = [pscustomobject] @{ Value = 0 }
$groupSize = 256
$text=''
$groups = $arrayFromFile | Group-Object -Property { [math]::Floor($counter.Value++ / $groupSize) }
foreach ($group in $groups){
    $text+=$group.Group -join (' ')
    $text+="`n"
}
$text | Out-File -FilePath 'C:\utf-8 (sorted).txt'