I am using the WebClient class to download HTML data from a web page. Now I want to get the complete href links and their titles from the HTML data. Initially I used loops; feeling that was inefficient, I switched to RegExp, but I didn't get an efficient solution.
Here is the initial code:
for (int i = 0; i < htmldata.Length - 5; i++)
{
    if (htmldata.Substring(i, 5) == "href=")
    {
        n1 = htmldata.Substring(i + 6, htmldata.Length - (i + 6)).IndexOf("\"");
        Sublink = htmldata.Substring(i + 6, n1);
        var absoluteUri = new Uri(baseUri, temp);
        n2 = htmldata.Substring(i + n1 + 1, htmldata.Length - (i + n1 + 1)).IndexOf("<");
        subtitle = htmldata.Substring(i + 6 + n1 + 2, n2 - 7);
    }
}
This code returns some links like these:
/l.href.replace(new RegExp(
/advanced_search?hl=en&q=&hl=en&
and titles like these:
onclick=gbar.qs(this) class=gb2>Photos
")+"q="+encodeURIComponent(b)})}i.qs=n;function o(a,b,d,c,f,e){var g=document.getElementById(a);if(g){var
These are clearly invalid. Please suggest correct code for getting valid relative href links and their titles.
Use the HTML Agility Pack to parse the HTML for you; you can then use XPath expressions to select all links in the page and their associated data.
Trying to parse HTML yourself is error-prone and brittle, as you have already discovered.
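As a rough illustration of that approach, here is a minimal sketch, assuming the HtmlAgilityPack NuGet package is referenced. The class and method names (LinkExtractor, ExtractLinks) are made up for this example; htmldata and baseUri are the variables from the question. It treats the anchor's inner text as the "title", which may or may not be what you need.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

// Hypothetical helper, named for illustration only.
public static class LinkExtractor
{
    // Returns (absolute URL, link text) pairs for every <a> element that has an href attribute.
    public static List<KeyValuePair<Uri, string>> ExtractLinks(string htmldata, Uri baseUri)
    {
        var results = new List<KeyValuePair<Uri, string>>();

        var doc = new HtmlDocument();
        doc.LoadHtml(htmldata);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
            return results;

        foreach (HtmlNode a in anchors)
        {
            // Decode entities such as &amp; in case the parser hands them back still encoded.
            string href = HtmlEntity.DeEntitize(a.GetAttributeValue("href", string.Empty));
            string title = HtmlEntity.DeEntitize(a.InnerText).Trim();

            // Resolve relative links such as /advanced_search?hl=en against the page URL.
            Uri absolute;
            if (Uri.TryCreate(baseUri, href, out absolute))
                results.Add(new KeyValuePair<Uri, string>(absolute, title));
        }

        return results;
    }
}
```

Because the XPath only selects real <a> elements, text such as href= inside a <script> block (the source of the garbage results in the question) is not picked up. You would call it as LinkExtractor.ExtractLinks(htmldata, baseUri) and loop over the returned pairs.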
See also: RegEx match open tags except XHTML self-contained tags