Is there a way to dumb down text from Unicode to ASCII?

Developer https://www.devze.com 2023-02-25 02:52 Source: Internet
What I need is something like, for each ASCII character, a list of equivalent Unicode characters.

The problem is that programs like Microsoft Excel and Word insert non-ASCII double-quotes, single-quotes, dashes, etc. when people type into documents. I want to store this text in a database field of type "varchar", which requires single-byte characters.

For the sake of storing ASCII (single-byte) text, some of those Unicode characters are equivalent to, or similar enough to, a particular ASCII character that substituting the ASCII character would be fine.

I would like a simple function like MapToASCII, that would convert Unicode text to an ASCII equivalent, allowing me to specify a replacement character for any Unicode characters that are not similar to any ASCII character.


The Win32 API WideCharToMultiByte can be used for this conversion (Unicode to ANSI). Use CP_ACP as the first parameter. Something like that would likely be better than trying to build your own mapping function.

Edit: At the risk of sounding like I am trying to promote this as a solution against the OP's wishes, it seems worth pointing out that this API does much (all?) of what is being asked for. The goal is to map (I think) a Unicode string as much as possible to "ANSI" (where ANSI may be something of a moving target in this case). An additional requirement is to be able to specify some alternative character for those that cannot be mapped. The following example does this. It "converts" a Unicode string to char and uses an underscore (the second-to-last parameter) for those characters that cannot be converted.

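/* Needs <windows.h>, <stdio.h> and <string.h>; place inside a function such as main(). */
char ac[64];
int ret = WideCharToMultiByte( CP_ACP, 0, L"abc個חあЖdef", -1,
                               ac, sizeof( ac ), "_", NULL );  /* "_" stands in for unmappable characters */
if ( ret > 0 )  /* 0 means the call failed and ac is not valid */
  for ( size_t i = 0; i < strlen( ac ); i++ )
    printf( "%c %02x\n", ac[i], (unsigned char)ac[i] );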
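
One caveat with the call above: CP_ACP selects the system's ANSI code page, not ASCII. On a machine whose ANSI code page is Windows-1252, the curly quotes and dashes that Word inserts have code-page equivalents (bytes 0x91 to 0x97), so they come through as non-ASCII bytes rather than being replaced by the default character. Below is a minimal sketch of a follow-up pass, assuming the buffer really is Windows-1252. The name cp1252_to_ascii is a hypothetical helper, not a Win32 function.

```c
#include <stddef.h>

/* Hypothetical helper (not part of Win32): rewrites a NUL-terminated
   buffer in place so that only 7-bit ASCII remains.
   Assumption: the bytes are Windows-1252, e.g. the output of a CP_ACP
   conversion on a system whose ANSI code page is 1252. */
void cp1252_to_ascii(char *s, char replacement)
{
    for (size_t i = 0; s[i] != '\0'; i++) {
        unsigned char b = (unsigned char)s[i];
        if (b < 0x80)
            continue;                              /* already ASCII */
        switch (b) {
        case 0x91: case 0x92: s[i] = '\''; break;  /* curly single quotes */
        case 0x93: case 0x94: s[i] = '"';  break;  /* curly double quotes */
        case 0x96: case 0x97: s[i] = '-';  break;  /* en dash, em dash */
        default:              s[i] = replacement;  /* no close ASCII match */
        }
    }
}
```

Calling cp1252_to_ascii( ac, '_' ) after the conversion would leave ac holding only bytes below 0x80. The three substitutions shown are illustrative; other code pages would need their own table.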


A highly relevant question is here: Replacing unicode punctuation with ASCII approximations

Although the answer there is insufficient, it gave me an idea. I could map each of the Unicode code points in the Basic Multilingual Plane (plane 0) to an equivalent ASCII character, if one exists. The following C# code will help by creating an HTML form in which you can type a replacement character for each value.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Globalization;
using System.IO;

namespace UnicodeCharacterCategorizer
{
    class Program
    {
        static void Main(string[] args)
        {
            string output_filename = args.Length > 0 ? args[0] : "output.htm"; //use the first command-line argument as the filename, else this default
            Dictionary<UnicodeCategory,List<char>> category_character_sets = new Dictionary<UnicodeCategory,List<char>>();
            foreach (UnicodeCategory c in Enum.GetValues(typeof(UnicodeCategory)))
                category_character_sets.Add( c, new List<char>() );
            for (int i = 0; i <= 0xFFFF; i++)
            {
                if (i >= 0xD800 && i <= 0xDFFF) continue; //Skip ranges reserved for high/low surrogate pairs.
                char c = (char)i;
                UnicodeCategory category = char.GetUnicodeCategory( c );
                category_character_sets[category].Add( c );
            }
            StringBuilder file_data = new StringBuilder( @"<!DOCTYPE html PUBLIC ""-//W3C//DTD XHTML 1.0 Transitional//EN"" ""http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd""><html xmlns=""http://www.w3.org/1999/xhtml""><head><title>Unicode Category Character Sets</title><style>.categoryblock{border:3px solid black;margin-bottom:10px;padding:5px;} .characterblock{display:inline-block;border:1px solid grey;padding:5px;margin-right:5px;} .character{display:inline-block;font-weight:bold;background-color:#ffeeee} .numericvalue{color:blue;}</style></head><body><form id=""charactermap"">" );
            foreach (KeyValuePair<UnicodeCategory,List<char>> entry in category_character_sets)
            {
                file_data.Append( @"<div class=""categoryblock""><h1>" + entry.Key.ToString() + ":</h1><br />" );
                foreach (char c in entry.Value)
                {
                    string hex_value = ((int)c).ToString( "x" );
                    file_data.Append( @"<div class=""characterblock""><span class=""character"">&#x" + hex_value + @";</span><br /><span class=""numericvalue"">" + hex_value + @"</span><br /><input type=""text"" name=""r_" + hex_value + @""" /></div>" );
                }
                file_data.Append( "</div>" );
            }
            file_data.Append("</form></body></html>" );
            File.WriteAllText( output_filename, file_data.ToString(), Encoding.Unicode );
        }
    }
}

Specifically, that code will generate an HTML form containing all characters in the BMP, along with input text boxes named after the hex values prefixed with "r_" (r is for "replacement value"). If this were ported over to an ASP.NET page, additional code could be written to pre-populate replacement values as much as possible:

  • with their own value if already ASCII, or
  • with Unicode normalized FormD or FormKD decomposed equivalents, or
  • a single ASCII value for an entire category (e.g. all "punctuation initial" characters with an ASCII double quote)

You could then go through manually and make adjustments, and it probably wouldn't take as long as you'd think. There are only 63,488 code points to cover (65,536 in the BMP minus the 2,048 surrogates the loop skips), and large chunks of entire categories can probably be dismissed as "not even close to anything ASCII". So, I'm going to build this map and function.
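
As a rough sketch of where this is heading, the finished map could drive a lookup function along these lines. It is written in portable C rather than C#, assumes the input is an array of BMP code points (UTF-16 code units with no surrogate pairs), and shows only a handful of illustrative table entries. The MapToASCII signature, the ascii_equiv struct and the kMap table are all hypothetical, not the final code.

```c
#include <stddef.h>
#include <stdint.h>

/* One entry of the hand-built map: a BMP code point and its ASCII stand-in. */
struct ascii_equiv { uint16_t code_point; char ascii; };

/* Illustrative subset only, sorted by code point. A real table would hold
   every replacement chosen through the generated form. */
static const struct ascii_equiv kMap[] = {
    { 0x00A0, ' '  },  /* no-break space */
    { 0x2013, '-'  },  /* en dash */
    { 0x2014, '-'  },  /* em dash */
    { 0x2018, '\'' },  /* left single quotation mark */
    { 0x2019, '\'' },  /* right single quotation mark */
    { 0x201C, '"'  },  /* left double quotation mark */
    { 0x201D, '"'  },  /* right double quotation mark */
};

/* Hypothetical MapToASCII: converts len code points from src into a
   NUL-terminated ASCII string in dst, which must hold len + 1 bytes.
   Code points with no table entry become `replacement`. */
void MapToASCII(const uint16_t *src, size_t len, char *dst, char replacement)
{
    for (size_t i = 0; i < len; i++) {
        uint16_t cp = src[i];
        char out = replacement;
        if (cp < 0x80) {
            out = (char)cp;                      /* already ASCII */
        } else {
            /* Binary search over the sorted table. */
            size_t lo = 0, hi = sizeof kMap / sizeof kMap[0];
            while (lo < hi) {
                size_t mid = lo + (hi - lo) / 2;
                if (kMap[mid].code_point == cp) { out = kMap[mid].ascii; break; }
                if (kMap[mid].code_point < cp) lo = mid + 1; else hi = mid;
            }
        }
        dst[i] = out;
    }
    dst[len] = '\0';
}
```

Each input code point yields exactly one output byte, so the result always fits in a varchar column of the same character length. Replacements that need more than one character (an ellipsis becoming "...", for instance) would need a string-valued table instead.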
