I have the following HTML in a string and need to extract only the content inside the paragraph tags. Any ideas?
The link is http://www.public-domain-content.com/books/Coming_Race/C1P1.shtml
I have tried:
const string HTML_TAG_PATTERN = "<[^>]+.*?>";
static string StripHTML(string inputString)
{
    return Regex.Replace(inputString, HTML_TAG_PATTERN, string.Empty);
}
It removes all HTML tags, but I don't want to remove all of them, because the tags are how I can pick out content such as paragraphs.
Secondly, line breaks show up as \n in the text, and applying Replace("\n", "") does not help. Another problem is that when I apply
int UrlStart = e.Result.IndexOf("<p>"), urlEnd = e.Result.IndexOf("<p> </p></td>\r" );
string paragraph = e.Result.Substring(UrlStart, urlEnd);
extractedContent.Text = paragraph.Replace(Environment.NewLine, "");
<p> </p></td>\r
This appears at the end of the paragraph, but urlEnd does not ensure that only the paragraph is shown.
The extracted string, as shown in Visual Studio, looks like this:
this page is downloaded by Webclient End of HTMLpageWe will provide ourselves with ropes of\rsuitable length and strength- and- pardon me- you must not\rdrink more to-night. our hands and feet must be steady and\rfirm tomorrow.\"\r<p> </p> </td>\r </tr>\r\r <tr>\r <td height=\"25\" width=\"10%\">\r \r </td><td height=\"25\" width=\"80%\" align=\"center\">\r <font color=\"#FFFFFF\">\r <font size=\"4\">1</font> \r </font></td>\r <td height=\"25\" width=\"10%\" align=\"right\"><a href=\"C2P1.shtml\">Next</a></td>\r </tr>\r </table>\r </center>\r</div>\r<p align=\"center\"><a href=\"index.shtml\"><b>The Coming Race -by- Edward Bulwer Lytton</b></a></p>\r<P><B><center><A HREF=\"http://www.public-domain-content.com/encyclopedia.shtml\">Encyclopedia</a> - <A HREF=\"http://www.public-domain-content.com/books.shtml\">Books</a> - <A HREF=\"http://www.public-domain-content.com/religion.shtml\">Religion<a/> - <A HREF=\"http://www.public-domain-content.com/links2.shtml\">Links</a> - <A HREF=\"http://www.public-domain-content.com/\">Home</a> - <A HREF=\"http://www.webmaster-headquarters.com/mb/\">Message Boards</a></B><BR>This <a HREF=\"http://www.wikipedia.org/\">Wikipedia</a> content is licensed under the <a href=\"http://www.gnu.org/copyleft/fdl.html\">GNU Fr
Don't use regular expressions to parse HTML. Use the HTML Agility Pack (or something similar) instead.
Here is a quick example of what you could do:
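// requires: using HtmlAgilityPack;
HtmlDocument document = new HtmlDocument();
document.Load("your_file_here.htm"); // or document.LoadHtml(html) for a string already in memory
// SelectNodes returns null when nothing matches
HtmlNodeCollection paragraphs = document.DocumentNode.SelectNodes("//p");
if (paragraphs != null)
{
    foreach (HtmlNode paragraph in paragraphs)
    {
        // do something with the paragraph node here
        string content = paragraph.InnerText; // or something similar
    }
}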
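As a side note on the Substring attempt in the question: String.Substring takes a start index and a length, not a start and an end index, so passing urlEnd as the second argument does not stop at the end marker. IndexOf also returns -1 when the marker is not found exactly, and Environment.NewLine is "\r\n" on Windows, while the dump above seems to contain bare "\r" characters. Below is a minimal sketch of those two fixes using only the standard library. The class and method names (ParagraphSlice, Between, FlattenLineBreaks) are made up for illustration, and a real parser is still the more robust route.

```csharp
using System;

static class ParagraphSlice
{
    // Hypothetical helper: returns the text between the first startMarker and
    // the next endMarker after it, or null when either marker is missing.
    public static string Between(string html, string startMarker, string endMarker)
    {
        int start = html.IndexOf(startMarker, StringComparison.OrdinalIgnoreCase);
        if (start < 0) return null;
        start += startMarker.Length; // skip past the opening marker itself

        int end = html.IndexOf(endMarker, start, StringComparison.OrdinalIgnoreCase);
        if (end < 0) return null;

        // Substring takes (startIndex, length), so pass the difference,
        // not the end index.
        return html.Substring(start, end - start);
    }

    // Replaces every kind of line break ("\r\n", bare "\r", bare "\n")
    // with a single space.
    public static string FlattenLineBreaks(string text)
    {
        return text.Replace("\r\n", " ").Replace('\r', ' ').Replace('\n', ' ');
    }
}
```

For example, with html set to "<td><p>first\rsecond</p><p> </p></td>\r", calling ParagraphSlice.Between(html, "<p>", "</p>") gives "first\rsecond", and FlattenLineBreaks on that result gives "first second".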