
Best practice to parse HTML (not XML) to XElement?

Developer https://www.devze.com 2022-12-17 12:23 Source: Internet



I have this code:

var url = textBox1.Text;
WebClient wc = new WebClient();
var page = wc.DownloadString(url);
XElement doc = XElement.Parse(page);

It fails with an exception about unexpected characters. Obviously, the HTML I'm trying to parse in such a dumb way is not strict XML. What's the next easiest way to parse arbitrary HTML into something IQueryable?

What I actually want is to grab a table inside the page and the paging links, then parse them on my own with LINQ.

Have a look at the HTML Agility Pack:
http://www.codeplex.com/htmlagilitypack
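As a rough sketch of how that could look for the table and the paging links: this assumes the HtmlAgilityPack NuGet package is referenced. The PageScraper class, its method names, and the inline sample page are hypothetical; the sample string stands in for the result of wc.DownloadString(url).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // assumption: NuGet package "HtmlAgilityPack" is installed

// Hypothetical helper class, not part of the library.
static class PageScraper
{
    // Returns the cell texts of every row in the first <table> of the page.
    public static List<List<string>> ReadTableRows(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // tolerant of markup that XElement.Parse rejects

        var table = doc.DocumentNode.Descendants("table").FirstOrDefault();
        if (table == null) return new List<List<string>>();

        return table.Descendants("tr")
            .Select(tr => tr.Elements("td")
                            .Select(td => td.InnerText.Trim())
                            .ToList())
            .ToList();
    }

    // Returns the href of every <a> element that has a non-empty href.
    public static List<string> ReadLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        return doc.DocumentNode.Descendants("a")
            .Select(a => a.GetAttributeValue("href", ""))
            .Where(href => href.Length > 0)
            .ToList();
    }
}

class Program
{
    static void Main()
    {
        // Sample page with an unclosed <br>, which is not well-formed XML.
        const string page =
            "<html><body><p>Results<br></p>" +
            "<table><tr><td>Alice</td><td>30</td></tr>" +
            "<tr><td>Bob</td><td>25</td></tr></table>" +
            "<a href='?page=2'>2</a> <a href='?page=3'>3</a>" +
            "</body></html>";

        foreach (var row in PageScraper.ReadTableRows(page))
            Console.WriteLine(string.Join(" | ", row));

        foreach (var href in PageScraper.ReadLinks(page))
            Console.WriteLine(href);
    }
}
```

Descendants and Elements return ordinary IEnumerable sequences, so the usual LINQ operators apply to them directly. A real page will likely need a narrower selection than "first table" and "all links", for example filtering on an id or class attribute.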

The best way that I can think of is to search for the <table> tags and parse everything inside, and do the same for the tags containing the paging links. Hopefully narrowing it down to that should make a manual parser easier to write.
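A minimal sketch of that narrowing step, using only the base class library. The TableSlicer class and the sample page are hypothetical. It assumes the table fragment itself happens to be well-formed XML and contains no nested tables; if either assumption fails, XElement.Parse will still throw or the slice will be cut short.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper class for illustration only.
static class TableSlicer
{
    // Returns the text from the first "<table" through the first following
    // "</table>", or null if either marker is missing.
    public static string ExtractFirstTable(string html)
    {
        int start = html.IndexOf("<table", StringComparison.OrdinalIgnoreCase);
        if (start < 0) return null;

        const string close = "</table>";
        int end = html.IndexOf(close, start, StringComparison.OrdinalIgnoreCase);
        if (end < 0) return null;

        return html.Substring(start, end - start + close.Length);
    }
}

class Program
{
    static void Main()
    {
        // The page as a whole is not well-formed (unclosed <p> and <br>),
        // but the table inside it is.
        const string page =
            "<html><body><p>Unclosed paragraph<br>" +
            "<table><tr><td>Alice</td><td>30</td></tr>" +
            "<tr><td>Bob</td><td>25</td></tr></table>" +
            "</body></html>";

        string fragment = TableSlicer.ExtractFirstTable(page);
        XElement table = XElement.Parse(fragment);

        foreach (var row in table.Elements("tr"))
            Console.WriteLine(string.Join(" | ", row.Elements("td").Select(td => td.Value)));
        // prints:
        // Alice | 30
        // Bob | 25
    }
}
```

This is brittle compared with a tolerant HTML parser: entities such as &nbsp;, unquoted attributes, or unclosed cells inside the table would all make the fragment fail to parse as XML.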