Hola. I'm failing to write a method to test for words within a plain text or HTML document. I'm reasonably literate with regex, but I'm newer to C# (coming from a lot more Java).
Just 'cause,
string html = source.ToLower();
string plaintext = Regex.Replace(html, @"<(.|\n)*?>", " "); // remove tags
plaintext = Regex.Replace(plaintext, @"\s+", " "); // remove excess white space
and then,
string tag = "c++";
bool foundAsRegex = Regex.IsMatch(plaintext,@"\b" + Regex.Escape(tag) + @"\b");
bool foundAsContains = plaintext.Contains(tag);
For a case where "c++" should be found, sometimes foundAsRegex is true and sometimes false. My google-fu is weak, so I didn't get much back on "what the hell". Any ideas or pointers welcome!
edit:
I'm searching for matches on skills in resumes. For example, the distinct value "c++".
edit:
a real excerpt is given below:
"...administration- c, c++, perl, shell programming..."
The problem is that \b matches between a word character and a non-word character. Given the expression \bc\+\+\b, you have a problem: "+" is a non-word character. So searching for the pattern in "xxx c++, xxx", you're not going to find anything. There's no "word break" after the final "+", because the "," that follows it is also a non-word character.
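To see why the result flips between true and false, here is a minimal sketch that reuses the pattern from the question. The class and method names (WordBoundaryDemo, FoundWithBoundaries) are made up for illustration; only Regex.IsMatch and Regex.Escape come from the original code.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordBoundaryDemo
{
    // Hypothetical helper: builds the same pattern as the question,
    // i.e. \b + escaped tag + \b.
    public static bool FoundWithBoundaries(string text, string tag)
    {
        return Regex.IsMatch(text, @"\b" + Regex.Escape(tag) + @"\b");
    }

    public static void Main()
    {
        // "+" is followed by ",": both are non-word characters, so the
        // trailing \b cannot match.
        Console.WriteLine(FoundWithBoundaries("administration- c, c++, perl", "c++")); // False

        // "+" is followed by a letter: a boundary exists, so the pattern matches.
        Console.WriteLine(FoundWithBoundaries("c++programming", "c++")); // True

        // "+" at the very end of the string: still no boundary after it.
        Console.WriteLine(FoundWithBoundaries("c++", "c++")); // False
    }
}
```

So the regex only reports a hit when "c++" happens to be followed directly by a letter, digit, or underscore, which is the opposite of what a skill search wants.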
If you're looking for tags that contain non-word characters then you'll have to change your logic. Not sure what the best thing would be. I suppose you can use \W, but then it's not going to match at the beginning or end of the line, so you'll need (^|\W) and (\W|$) ... which is ugly. And slow, although perhaps still fast enough depending on your needs.
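A rough sketch of the (^|\W) and (\W|$) idea, in case it helps. SkillMatcher and ContainsTag are hypothetical names, not part of the question's code, and this is just one way to wire it up.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SkillMatcher
{
    // Hypothetical helper: the tag must be preceded by the start of the
    // text or a non-word character, and followed by a non-word character
    // or the end of the text.
    public static bool ContainsTag(string text, string tag)
    {
        string pattern = @"(^|\W)" + Regex.Escape(tag) + @"(\W|$)";
        return Regex.IsMatch(text, pattern);
    }

    public static void Main()
    {
        string excerpt = "...administration- c, c++, perl, shell programming...";

        Console.WriteLine(ContainsTag(excerpt, "c++")); // True
        Console.WriteLine(ContainsTag("c++", "c++"));   // True
        Console.WriteLine(ContainsTag("abc++", "c++")); // False
    }
}
```

One caveat with this approach: searching for the tag "c" would also hit inside "c++", because the "+" after the "c" counts as a non-word character. If that matters, the trailing group probably needs to be stricter, for example whitespace or a comma instead of \W.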
Your regular expression is turning into:
/\bc\+\+\b/
Which means you're looking for a word boundary, followed by the string c++, followed by another word boundary. This means it won't match on strings like abc++, whereas plaintext.Contains will succeed.
If you can give us examples of where your regex fails when you expected it to succeed, then we can give you a more definite answer.
Edit: My original regex was /\bc++\b/, which is incorrect, as c++ is being passed to Regex.Escape(), which escapes out regular expression metacharacters like +. I've fixed it above.