A faster way of doing multiple string replacements

Developer https://www.devze.com 2023-01-24 19:38 Source: Internet
I need to do the following:

    static string[] pats = { "å", "Å", "æ", "Æ", "ä", "Ä", "ö", "Ö", "ø", "Ø" ,"è", "È", "à", "À", "ì", "Ì", "õ", "Õ", "ï", "Ï" };
    static string[] repl = { "a", "A", "a", "A", "a", "A", "o", "O", "o", "O", "e", "E", "a", "A", "i", "I", "o", "O", "i", "I" };
    static int i = pats.Length;
    int j;

    // function for the replacement(s)
    public string DoRepl(string Inp) {
        string tmp = Inp;
        for( j = 0; j < i; j++ ) {
            tmp = Regex.Replace(tmp,pats[j],repl[j]);
        }
        return tmp.ToString();
    }
    /* Main flow processes about 45000 lines of input */

Each line has 6 elements that go through DoRepl. Approximately 300,000 function calls. Each does 20 Regex.Replace, totalling ~6 million replaces.

Is there any more elegant way to do this in fewer passes?


static Dictionary<char, char> repl = new Dictionary<char, char>() { { 'å', 'a' }, { 'ø', 'o' } }; // etc...
public string DoRepl(string Inp)
{
    var tmp = Inp.Select(c =>
    {
        char r;
        if (repl.TryGetValue(c, out r))
            return r;
        return c;
    });
    return new string(tmp.ToArray());
}

Each char is checked only once against a dictionary and replaced if found in the dictionary.


How about this "trick"?

string conv = Encoding.ASCII.GetString(Encoding.GetEncoding("Cyrillic").GetBytes(input));


Without regex it might be way faster.

    for( j = 0; j < i; j++ ) 
    {
        tmp = tmp.Replace(pats[j], repl[j]);
    }

Edit

Another way using Zip and a StringBuilder:

StringBuilder result = new StringBuilder(input);
foreach (var zipped in patterns.Zip(replacements, (p, r) => new {p, r}))
{
  result = result.Replace(zipped.p, zipped.r);
}
return result.ToString();


First, I would use a StringBuilder to perform the translation inside a buffer and avoid creating new strings all over the place.

Next, ideally we'd like something akin to XPath's translate(), so we can work with strings instead of arrays or mappings. Let's do that in an extension method:

public static StringBuilder Translate(this StringBuilder builder,
    string inChars, string outChars)
{
    int length = Math.Min(inChars.Length, outChars.Length);
    for (int i = 0; i < length; ++i) {
        builder.Replace(inChars[i], outChars[i]);
    }
    return builder;
}

Then use it:

StringBuilder builder = new StringBuilder(yourString);
yourString = builder.Translate("åÅæÆäÄöÖøØèÈàÀìÌõÕïÏ",
    "aAaAaAoOoOeEaAiIoOiI").ToString();


The problem with your original regex is that you're not using it to its fullest potential. Remember, a regex pattern can have alternations. You will still need a dictionary, but you can do it in one pass without looping through each character.

This would be achieved as follows:

string[] pats = { "å", "Å", "æ", "Æ", "ä", "Ä", "ö", "Ö", "ø", "Ø" ,"è", "È", "à", "À", "ì", "Ì", "õ", "Õ", "ï", "Ï" };
string[] repl = { "a", "A", "a", "A", "a", "A", "o", "O", "o", "O", "e", "E", "a", "A", "i", "I", "o", "O", "i", "I" };
// using Zip as a shortcut, otherwise setup dictionary differently as others have shown
var dict = pats.Zip(repl, (k,v) => new { Key = k, Value = v }).ToDictionary(o => o.Key, o => o.Value);

string input = "åÅæÆäÄöÖøØèÈàÀìÌõÕïÏ";
string pattern = String.Join("|", dict.Keys.Select(k => k)); // use ToArray() for .NET 3.5
string result = Regex.Replace(input, pattern, m => dict[m.Value]);

Console.WriteLine("Pattern: " + pattern);
Console.WriteLine("Input: " + input);
Console.WriteLine("Result: " + result);

In general, you should escape each key with Regex.Escape before joining them into a pattern. It is not needed here because the set of characters is fixed and none of them is a regex metacharacter.
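
For the general case, here is a minimal sketch of the same alternation idea with escaping applied and the Regex built once instead of on every call. The class name, the mapping, and the "$" and "æ" → "ae" entries are made up for illustration; they are not from the question. Longer keys are placed first so that, if one key is a prefix of another, the longer one wins.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class AlternationSketch
{
    // Hypothetical mapping. "$" is included only to show why escaping matters:
    // unescaped, it would match the end of the input instead of a dollar sign.
    private static readonly Dictionary<string, string> Map = new Dictionary<string, string>
    {
        { "å", "a" }, { "ø", "o" }, { "æ", "ae" }, { "$", "USD" }
    };

    // Built once and reused for every call. Declared after Map because
    // static field initializers run in textual order.
    private static readonly Regex Pattern = new Regex(
        string.Join("|", Map.Keys
            .OrderByDescending(k => k.Length)
            .Select(k => Regex.Escape(k))),
        RegexOptions.Compiled);

    public static string Replace(string input)
    {
        // Each match is one of the keys, so the lookup cannot miss.
        return Pattern.Replace(input, m => Map[m.Value]);
    }
}
```

Unlike the char-to-char approaches, this variant also handles replacements that are longer or shorter than the text they replace.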


If you want to remove accents, then perhaps this solution would be helpful: "How do I remove diacritics (accents) from a string in .NET?"
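
The usual approach to that problem is Unicode normalization; a rough sketch follows, with a helper name of my own choosing. It decomposes each character and drops the combining marks. Note the limitation for this question: as far as I know, "æ" and "ø" have no canonical decomposition, so they pass through unchanged and would still need an explicit mapping.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class DiacriticsSketch
{
    // Hypothetical helper name; not part of the original answers.
    public static string RemoveDiacritics(string text)
    {
        // FormD splits e.g. "å" into "a" followed by a combining ring above.
        string decomposed = text.Normalize(NormalizationForm.FormD);
        var sb = new StringBuilder(decomposed.Length);

        foreach (char c in decomposed)
        {
            // Drop the combining marks, keep everything else.
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
                sb.Append(c);
        }

        // Recompose whatever is left into the usual composed form.
        return sb.ToString().Normalize(NormalizationForm.FormC);
    }
}
```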

Otherwise I would do this in a single pass:

Dictionary<char, char> replacements = new Dictionary<char, char>();
...
StringBuilder result = new StringBuilder(str.Length);
foreach (char c in str)
{
  char rc;
  // Fall back to the original character when there is no mapping.
  if (!replacements.TryGetValue(c, out rc))
  {
    rc = c;
  }
  result.Append(rc);
}


The fastest (IMHO) way (compared even with the dictionary) in the special case of one-to-one character replacement would be a full character map:

public class Converter
{
    private readonly char[] _map;

    public Converter()
    {
        // char is a 16-bit unsigned value, so the map needs one slot per
        // possible value: 0 through char.MaxValue inclusive (65,536 entries).
        _map = new char[char.MaxValue + 1];

        for (int i = 0; i < _map.Length; i++)
            _map[i] = (char)i;

        _map['å'] = 'a';  // Note that 'å' is used as an integer index into the array.
        _map['Å'] = 'A';
        _map['æ'] = 'a';
        // ... the rest of overriding map
    }

    public string Convert(string source)
    {
        if (string.IsNullOrEmpty(source))
            return source;

        var result = new char[source.Length];

        for (int i = 0; i < source.Length; i++)
            result[i] = _map[source[i]]; // convert using the map

        return new string(result);
    }
}

To further speed up this code, you might want to use the "unsafe" keyword and pointers. That way, traversing the character array could be done without bounds checks (which in theory would be optimized away by the JIT compiler, but might not be).


I'm not familiar with the Regex class, but most regular expression engines have a transliterate operation that would work well here. Then you would only need one call per line.
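
As far as I know, .NET's Regex class has no tr-style operator, but the same effect is easy to get by hand. Below is a small sketch of a one-pass transliterate modelled on tr / XPath translate(); the class and method names are invented for this example. It visits each input character once and looks it up in the "from" string, which is fine for a 20-character set (a dictionary or array map would scale better for large sets).

```csharp
using System;

public static class TransliterateSketch
{
    // Hypothetical helper: replaces every character found in 'from'
    // with the character at the same position in 'to'.
    public static string Translate(string input, string from, string to)
    {
        if (from.Length != to.Length)
            throw new ArgumentException("from and to must have the same length");

        char[] buffer = input.ToCharArray();
        for (int i = 0; i < buffer.Length; i++)
        {
            // IndexOf(char) is an ordinal search; -1 means "no mapping".
            int index = from.IndexOf(buffer[i]);
            if (index >= 0)
                buffer[i] = to[index];
        }
        return new string(buffer);
    }
}
```

With this, DoRepl would make one call per input string, e.g. Translate(Inp, "åÅæÆäÄöÖøØèÈàÀìÌõÕïÏ", "aAaAaAoOoOeEaAiIoOiI").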
