.NET string replace Russian to English

Developer https://www.devze.com 2022-12-30 21:08 Source: web
I have a strange problem replacing characters in a string.

I read a .txt file containing Russian text. Starting from a list of Russian-to-English letter mappings (ru=en), I loop over the list and want to replace each Russian character with its English counterpart.

The problem is: in the debugger I can see that both the Russian and the English characters are read correctly, but after myWord = myWord.Replace(ruChar, enChar) the string is unchanged.

My .txt file is UTF-8 encoded.


String.Replace() is going to be horribly inefficient: you'd have to call it once for each Cyrillic letter you want to replace. Use a Dictionary instead (no pun intended). For example:

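    // Requires: using System.Collections.Generic; using System.Text;
    // Both tables are elided here. The letters in Cyrillic must be real Cyrillic
    // code points (e.g. А is U+0410), not look-alike Latin letters.
    private const string Cyrillic = "АаБбВвГг...";
    // Bar-separated because one Cyrillic letter may map to several Latin letters.
    private const string Latin = "A|a|B|b|V|v|G|g|...";
    private Dictionary<char, string> mLookup;

    public string Romanize(string russian) {
        if (mLookup == null) {
            // Build the lookup table once, on first use.
            mLookup = new Dictionary<char, string>();
            var replace = Latin.Split('|');
            for (int ix = 0; ix < Cyrillic.Length; ++ix) {
                mLookup.Add(Cyrillic[ix], replace[ix]);
            }
        }
        var buf = new StringBuilder(russian.Length);
        foreach (char ch in russian) {
            string latin;
            // Single lookup; unmapped characters are copied through unchanged.
            if (mLookup.TryGetValue(ch, out latin)) buf.Append(latin);
            else buf.Append(ch);
        }
        return buf.ToString();
    }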

Note how the bars and the Split() call are needed in the Latin replacement string, because some Cyrillic letters require more than one Latin letter for their transliteration. The key idea is to use a dictionary for fast lookup and a StringBuilder for fast string construction.
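For something you can paste and run, here is a minimal sketch of the same dictionary idea with a small concrete table. The class name TranslitSketch and the handful of lowercase mappings are made up for illustration; they are not a complete alphabet and do not follow any particular romanization standard.

```csharp
using System.Collections.Generic;
using System.Text;

public static class TranslitSketch
{
    // Illustrative subset only (an assumption for this sketch): a real table
    // would cover the whole alphabet, upper and lower case, per a chosen standard.
    private static readonly Dictionary<char, string> Map = new Dictionary<char, string>
    {
        { 'а', "a" }, { 'б', "b" }, { 'в', "v" }, { 'г', "g" }, { 'д', "d" },
        { 'и', "i" }, { 'л', "l" }, { 'о', "o" }, { 'с', "s" },
        // Some letters need more than one Latin letter.
        { 'ж', "zh" }, { 'ш', "sh" }, { 'щ', "shch" },
    };

    public static string Romanize(string text)
    {
        var buf = new StringBuilder(text.Length);
        foreach (char ch in text)
        {
            string latin;
            if (Map.TryGetValue(ch, out latin)) buf.Append(latin);
            else buf.Append(ch); // leave unmapped characters untouched
        }
        return buf.ToString();
    }
}
```

With this table, TranslitSketch.Romanize("слово") returns "slovo". A collection initializer is used here instead of the bar-separated string only because it keeps each pair side by side; the lookup and StringBuilder logic is the same.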

This United Nations document might be helpful.


Don't -1 me if this doesn't work; I'm just guessing that you must convert the English string you want to substitute to UTF-8, for example:

    myWord = Encoding.UTF8.GetString(Encoding.ASCII.GetBytes(myWord));
    myWord = myWord.Replace("слово", Encoding.UTF8.GetString(Encoding.ASCII.GetBytes("letter")));

I'm assuming that myWord is in ASCII, so the first line converts it to a UTF-8 string; leave that line out if it is already UTF-8.

The second line converts the English word to UTF-8 so that it can replace the Russian word.
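To check this guess, here is a small sketch of what the ASCII-to-UTF-8 round trip does to a string. The class name AsciiRoundTrip is made up for illustration. Keep in mind that a .NET string is UTF-16 in memory whatever the file's encoding was, so this conversion should not be needed for Replace() to match.

```csharp
using System.Text;

public static class AsciiRoundTrip
{
    // Encode the string as ASCII bytes, then decode those bytes as UTF-8.
    public static string Apply(string s)
    {
        return Encoding.UTF8.GetString(Encoding.ASCII.GetBytes(s));
    }
}
```

AsciiRoundTrip.Apply("letter") returns "letter" unchanged, while AsciiRoundTrip.Apply("слово") returns "?????", because the default ASCII encoder substitutes '?' for every character it cannot represent. So the first line is only safe when myWord really contains nothing but ASCII.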


Very strange.

    Console.WriteLine("слово".Replace("слово", "word")); // prints 'word'

This works as expected. Maybe that's because I have Russian set as the system language for non-Unicode programs.
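Since Replace() itself behaves, one thing worth ruling out (a guess, nothing in the question confirms it) is that the characters in the ru=en list are not the same code points as those in the file, for example a Latin "a" where a Cyrillic "а" was intended. A minimal sketch for dumping code points, with a made-up helper name CodePointDump:

```csharp
using System.Linq;

public static class CodePointDump
{
    // Render every UTF-16 code unit of the string as U+XXXX, space separated.
    public static string Describe(string s)
    {
        return string.Join(" ", s.Select(c => "U+" + ((int)c).ToString("X4")));
    }
}
```

CodePointDump.Describe("a") gives "U+0061" for the Latin letter, while CodePointDump.Describe("\u0430") gives "U+0430" for the Cyrillic one. Running it on both ruChar and the matching part of myWord shows whether they are really equal.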
