I've tried to accomplish this with regex, but it doesn't seem to work at all. I tried the same regex pattern in PHP and JavaScript and it worked like a charm. I have no idea why it's not working in C#.
Here is my code sample:
Regex mysReg = new Regex(@"<form[^>]*action=""do\.php""[^>]*>(.*)<\/form>", RegexOptions.IgnoreCase | RegexOptions.Multiline);
MatchCollection form = mysReg.Matches(html);
If I remove the <\/form> part, the regex matches, but it doesn't capture the content inside the parentheses.
Now some of you will tell me to use "HtmlAgilityPack". I've tried to use it but, since I'm still unfamiliar with C#, I found it hard to work with, since no documentation came with it.
So is there any way to work around this problem?
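Your (.*) isn't matching newlines. In .NET, RegexOptions.Multiline only changes how ^ and $ behave; it does not make . match a newline. ([\S\s]*?) will work, or you can turn newline matching on with RegexOptions.Singleline.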
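As a rough illustration of the fix, here is a minimal sketch that swaps Multiline for Singleline and makes the capture lazy. The class name FormExtractor and the method ExtractFormBodies are made up for this example; only Regex, RegexOptions and Match come from System.Text.RegularExpressions. It assumes the form's action attribute is written exactly as action="do.php" with double quotes.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class FormExtractor
{
    // Singleline makes '.' match '\n' as well.
    // (.*?) is lazy, so each match stops at the nearest </form>.
    private static readonly Regex FormRegex = new Regex(
        @"<form[^>]*action=""do\.php""[^>]*>(.*?)</form>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns the inner markup of every matching form, in document order.
    public static List<string> ExtractFormBodies(string html)
    {
        var bodies = new List<string>();
        foreach (Match m in FormRegex.Matches(html))
        {
            // Group 1 is the text between the opening tag and </form>.
            bodies.Add(m.Groups[1].Value);
        }
        return bodies;
    }
}
```

Calling FormExtractor.ExtractFormBodies(html) on a page whose form spans several lines should then return the form's inner markup, newlines included.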
However, as others have pointed out, you should be using something like the HTML Agility Pack instead of trying to use regex to parse HTML.
Instead of regex, use the HTML Agility Pack to parse the document. You might not be comfortable with it, but that's the way to go.
The download comes with examples - projects that do all sorts of things, so you can read through the code to see how they were accomplished.
You will then be able to query it with XPath syntax; it exposes an interface similar to XmlDocument.
See here for a compelling reason to not use RegEx for parsing HTML.
I was playing with this in RegexBuddy and got
@"<form[^>]*action=""do\.php""[^>]*>([\s\S]*)<\/form>"
to work with my (hastily put together) sample data.
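One thing to watch with a greedy [\s\S]*: if the page contains more than one form, the capture runs on to the last </form> in the document. The sketch below compares the greedy and lazy variants side by side. GreedyVsLazyDemo and FirstCapture are hypothetical names used only for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class, not part of any library.
public static class GreedyVsLazyDemo
{
    // Opening tag of the target form, as in the pattern above.
    private const string OpenTag = @"<form[^>]*action=""do\.php""[^>]*>";

    // Returns group 1 of the first match, or "" when nothing matches.
    public static string FirstCapture(string html, bool lazy)
    {
        // [\s\S] matches any character including newlines, so no
        // RegexOptions.Singleline is needed for this variant.
        string body = lazy ? @"([\s\S]*?)" : @"([\s\S]*)";
        Match m = Regex.Match(html, OpenTag + body + "</form>", RegexOptions.IgnoreCase);
        return m.Groups[1].Value;
    }
}
```

With two forms on the page, the lazy variant stops at the first closing tag, while the greedy one swallows everything up to the last closing tag.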