
Regex questions

开发者 https://www.devze.com 2023-02-15 08:09 Source: Internet

I'm trying to get some text from a large text file, the text I'm looking for is:

Type:Production Color:Red

I pass the whole text to the following method to get the pairs (Type:Production, Color:Red):

  private static void FindKeys(IEnumerable<string> keywords, string source)
    {
        var found = new Dictionary<string, string>(10);
        var keys = string.Join("|", keywords.ToArray());
        var matches = Regex.Matches(source, @"(?<key>" + @"\B\s" + keys + @"\B\s" + "):",
                              RegexOptions.Singleline);

        foreach (Match m in matches)
        {
            var key = m.Groups["key"].ToString();
            var start = m.Index + m.Length;
            var nx = m.NextMatch();
            var end = (nx.Success ? nx.Index : source.Length);
            found.Add(key, source.Substring(start, end - start));

        }

        foreach (var n in found)
        {
            Console.WriteLine("Key={0}, Value={1}", n.Key, n.Value);
        }
    }
}

My problems are the following:

  1. The search returns _Type: as well, where I only need Type:
  2. The search returns Color:Red\n\n\n\n\n (along with the rest of the text), where I only need Color:Red

So, basically:

  • How can I force the regex to match exactly Type and ignore _Type?
  • How can I get only the text after the : and ignore \n\n and any other text?

I hope this is clear

Thanks,


Your regex currently looks like this:

(?<key>\B\sWord1|Word2|Word3\B\s):

I see the following issues here:

  • First, Word1|Word2|Word3 should be put in parentheses. Otherwise, the regex will search for \B\sWord1 or Word2 or Word3\B\s, which is not what you want (I guess).

  • Why \B\s? A non-boundary followed by a whitespace character? That doesn't make sense. I guess you just want \b (= word boundary). There's no need for one at the end, because the position between the key and the colon is already a word boundary.

So, I would suggest using the following. It fixes the _Type problem, because there is no word boundary between _ and Type (since _ is considered a word character).

\b(?<key>Word1|Word2|Word3):
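As a quick check of the word-boundary behaviour, here is a minimal sketch. The sample text and the names BoundaryDemo and BuildPattern are made up for illustration, and the call to Regex.Escape is an extra precaution that is not part of the pattern above (it only matters if a keyword contains regex metacharacters).

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class BoundaryDemo
{
    // Builds \b(?<key>Word1|Word2|...): from the given keywords.
    // BuildPattern is a hypothetical helper, not part of the original code.
    public static string BuildPattern(params string[] keywords)
    {
        var alternation = string.Join("|", keywords.Select(k => Regex.Escape(k)));
        return @"\b(?<key>" + alternation + "):";
    }

    public static void Main()
    {
        // Hypothetical sample input, not taken from the real file.
        var source = "_Type:Ignored Type:Production Color:Red";

        foreach (Match m in Regex.Matches(source, BuildPattern("Type", "Color")))
        {
            // _Type: is skipped because there is no word boundary between '_' and 'T'.
            Console.WriteLine("{0} at index {1}", m.Groups["key"].Value, m.Index);
        }
    }
}
```

With this input only the standalone Type: and Color: should be reported; the _Type: at the start of the string is not matched.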

If the text following the key is always just a single word, I'd match it in the regex as well. (\s* allows for whitespace after the colon; I don't know if you need this. \w+ ensures that only word characters, i.e. no line breaks etc., are matched as the value.)

\b(?<key>Word1|Word2|Word3):\s*(?<value>\w+)

Then you just need to iterate through all the matches and extract the key and value groups. No need for any string operations or index arithmetic.
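A rough sketch of that loop, reworking the FindKeys method from the question so that it returns the dictionary instead of printing it. The class name KeyValueDemo and the sample text are assumptions for illustration; the keywords are assumed to contain no regex metacharacters.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class KeyValueDemo
{
    public static Dictionary<string, string> FindKeys(IEnumerable<string> keywords, string source)
    {
        var found = new Dictionary<string, string>();
        var pattern = @"\b(?<key>" + string.Join("|", keywords) + @"):\s*(?<value>\w+)";

        foreach (Match m in Regex.Matches(source, pattern))
        {
            // Indexer assignment: a repeated key overwrites the earlier value
            // instead of throwing, as Dictionary.Add would.
            found[m.Groups["key"].Value] = m.Groups["value"].Value;
        }

        return found;
    }

    public static void Main()
    {
        // Hypothetical sample input with a _Type decoy and trailing line breaks.
        var source = "_Type:Ignored\nType:Production Color:Red\n\n\nmore text";

        foreach (var pair in FindKeys(new[] { "Type", "Color" }, source))
        {
            Console.WriteLine("Key={0}, Value={1}", pair.Key, pair.Value);
        }
    }
}
```

One thing to keep in mind: \s* also matches line breaks, so if a key can appear with an empty value, the first word on the following line would be captured as its value.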


So if I understand correctly, you have:

  • Pairs of key:values
  • Each pair is separated by a space
  • Within each pair, the key and value are separated by “:”

Then I would not use regex at all. I would:

  • use String.Split(' ') to get an array of pairs
  • loop over all the pairs
  • use String.Split(':') to get the key and value from each pair
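
The steps above could look roughly like this. It is only a sketch under the stated assumptions (one line of space-separated key:value pairs); the names SplitDemo and ParsePairs are hypothetical. Unlike the regex approach, it does not filter by keyword, so a pair such as _Type:x would simply be stored under the key _Type.

```csharp
using System;
using System.Collections.Generic;

public static class SplitDemo
{
    public static Dictionary<string, string> ParsePairs(string line)
    {
        var result = new Dictionary<string, string>();

        // Split on spaces and drop empty entries caused by repeated spaces.
        foreach (var pair in line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
        {
            // Limit the split to 2 parts so a value containing ':' stays intact.
            var parts = pair.Split(new[] { ':' }, 2);
            if (parts.Length == 2)
            {
                result[parts[0]] = parts[1];
            }
        }

        return result;
    }

    public static void Main()
    {
        // The sample line from the question.
        foreach (var pair in ParsePairs("Type:Production Color:Red"))
        {
            Console.WriteLine("Key={0}, Value={1}", pair.Key, pair.Value);
        }
    }
}
```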