I'm trying to extract some text from a large text file. The text I'm looking for is:
Type:Production Color:Red
I pass the whole text to the following method to get (Type:Production, Color:Red):
private static void FindKeys(IEnumerable<string> keywords, string source)
{
    var found = new Dictionary<string, string>(10);
    var keys = string.Join("|", keywords.ToArray());
    var matches = Regex.Matches(source, @"(?<key>" + @"\B\s" + keys + @"\B\s" + "):",
        RegexOptions.Singleline);
    foreach (Match m in matches)
    {
        var key = m.Groups["key"].ToString();
        var start = m.Index + m.Length;
        var nx = m.NextMatch();
        var end = (nx.Success ? nx.Index : source.Length);
        found.Add(key, source.Substring(start, end - start));
    }
    foreach (var n in found)
    {
        Console.WriteLine("Key={0}, Value={1}", n.Key, n.Value);
    }
}
My problems are the following:
- The search returns _Type: as well, where I only need Type:
- The search returns Color:Red\n\n\n\n\n (with the rest of the text), where I only need Color:Red
So, basically:
- How can I force Regex to get the exact match for Type and ignore _Type?
- How can I get only the text after : and ignore \n\n and any other text?
I hope this is clear
Thanks,
Your regex currently looks like this:
(?<key>\B\sWord1|Word2|Word3\B\s):
I see the following issues here:
First, Word1|Word2|Word3 should be put in parentheses. Otherwise, the regex will search for \B\sWord1 or Word2 or Word3\B\s, which is not what you want (I guess).
Second, why \B\s? A non-boundary followed by a whitespace character? That doesn't make sense. I guess you just want \b (= word boundary). There's no need to use it at the end, because the transition from the key to the colon already constitutes a word boundary.
So, I would suggest using the following. It will fix the _Type problem, because there is no word boundary between _ and Type (since _ is considered to be a word character).
\b(?<key>Word1|Word2|Word3):
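To see the effect of \b on the _Type case, here is a minimal sketch. The class name BoundaryDemo, the method MatchesKey, and the keywords Type and Color are placeholders chosen for illustration and stand in for Word1|Word2|Word3.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BoundaryDemo
{
    // Returns true if the input contains one of the (assumed) keys
    // "Type" or "Color" starting at a word boundary and followed by a colon.
    public static bool MatchesKey(string input)
    {
        return Regex.IsMatch(input, @"\b(?<key>Type|Color):");
    }
}
```

With this pattern, MatchesKey("Type:Production") should be true, while MatchesKey("_Type:Production") should be false, because _ and T are both word characters and no boundary lies between them.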
If the text following the key is always just a single word, I'd match it in the regex as well. Here \s* allows for whitespace after the colon (I don't know if you need this), and \w+ ensures that only word characters, i.e. no line breaks etc., are matched as the value:
\b(?<key>Word1|Word2|Word3):\s*(?<value>\w+)
Then you just need to iterate through all the matches and extract the key and value groups. No need for any string operations or index arithmetic.
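As a rough sketch of how the method from the question might look with this regex: the class name KeyFinder and the decision to return the dictionary instead of printing it are assumptions made for illustration. The call to Regex.Escape is also an addition, so that keywords containing regex metacharacters are matched literally.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class KeyFinder
{
    public static Dictionary<string, string> FindKeys(IEnumerable<string> keywords, string source)
    {
        var found = new Dictionary<string, string>();

        // Build the alternation Word1|Word2|Word3, escaping each keyword (an addition).
        var keys = string.Join("|", keywords.Select(k => Regex.Escape(k)));

        // \b rejects keys glued to a preceding word character such as "_";
        // \w+ captures a single word as the value, so line breaks are never included.
        var pattern = @"\b(?<key>" + keys + @"):\s*(?<value>\w+)";

        foreach (Match m in Regex.Matches(source, pattern))
        {
            // If a key occurs more than once, the last occurrence wins.
            found[m.Groups["key"].Value] = m.Groups["value"].Value;
        }
        return found;
    }
}
```

Called with the keywords Type and Color on a text containing "_Type:Staging Type:Production Color:Red" followed by line breaks and other text, this should yield Type=Production and Color=Red only.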
So if I understand correctly, you have:
- Pairs of key:value
- Each pair is separated by a space
- Within each pair, the key and value are separated by ":"
Then I would not use regex at all. I would:
- use String.Split(' ') to get an array of pairs
- loop over all the pairs
- use String.Split(':') to get the key and value from each pair
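The steps above could be sketched roughly as follows. The class name PairSplitter is hypothetical. Two things go beyond the steps as listed and are assumptions on my part: the text is also split on tabs and line breaks (since the question mentions trailing newlines), and only pairs whose key exactly equals one of the wanted keywords are kept, which is what excludes _Type.

```csharp
using System;
using System.Collections.Generic;

public static class PairSplitter
{
    public static Dictionary<string, string> FindKeys(IEnumerable<string> keywords, string source)
    {
        var wanted = new HashSet<string>(keywords);
        var found = new Dictionary<string, string>();

        // Assumption: pairs may be separated by spaces, tabs, or line breaks.
        var separators = new[] { ' ', '\t', '\r', '\n' };

        foreach (var pair in source.Split(separators, StringSplitOptions.RemoveEmptyEntries))
        {
            // Split at the first colon only, so a value may itself contain ':'.
            var parts = pair.Split(new[] { ':' }, 2);

            // Keep only well-formed pairs whose key is an exact keyword match.
            if (parts.Length == 2 && wanted.Contains(parts[0]))
            {
                found[parts[0]] = parts[1];
            }
        }
        return found;
    }
}
```

This avoids regex entirely, at the cost of allocating an array of every whitespace-separated token in the file, which may matter if the text is very large.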