Extract content in paragraph tags

Developer https://www.devze.com 2023-02-01 00:09 Source: Internet

Related topic: webclient

I have the following HTML in a string, and I need to extract only the content inside the paragraph tags. Any ideas?

The link is http://www.public-domain-content.com/books/Coming_Race/C1P1.shtml

I have tried

    const string HTML_TAG_PATTERN = "<[^>]+.*?>";

    static string StripHTML(string inputString)
    {
        return Regex.Replace(inputString, HTML_TAG_PATTERN, string.Empty);
    }

It removes all HTML tags, but I don't want to remove all of them, because the tags are how I can pick out content such as the paragraphs.

Secondly, line breaks show up as \n in the text, and applying Replace("\n", "") does not help. Another problem is that when I apply

    int UrlStart = e.Result.IndexOf("<p>"), urlEnd = e.Result.IndexOf("<p>&nbsp;</p></td>\r");
    string paragraph = e.Result.Substring(UrlStart, urlEnd);
    extractedContent.Text = paragraph.Replace(Environment.NewLine, "");

<p>&nbsp;</p></td>\r appears at the end of the paragraph text, but urlEnd does not ensure that only the paragraph text is shown.

The extracted string, as shown in Visual Studio, looks like this:

This page is downloaded by WebClient. End of the HTML page:

We will provide ourselves with ropes of\rsuitable length and strength- and- pardon me- you must not\rdrink more to-night.  our hands and feet must be steady and\rfirm tomorrow.\"\r<p>&nbsp;</p>     </td>\r    </tr>\r\r    <tr>\r     <td height=\"25\" width=\"10%\">\r     \r     </td><td height=\"25\" width=\"80%\" align=\"center\">\r       <font color=\"#FFFFFF\">\r       <font size=\"4\">1</font> &nbsp;\r       </font></td>\r     <td height=\"25\" width=\"10%\" align=\"right\"><a href=\"C2P1.shtml\">Next</a></td>\r    </tr>\r   </table>\r  </center>\r</div>\r<p align=\"center\"><a href=\"index.shtml\"><b>The Coming Race -by- Edward Bulwer Lytton</b></a></p>\r<P><B><center><A HREF=\"http://www.public-domain-content.com/encyclopedia.shtml\">Encyclopedia</a> - <A HREF=\"http://www.public-domain-content.com/books.shtml\">Books</a> - <A HREF=\"http://www.public-domain-content.com/religion.shtml\">Religion<a/> - <A HREF=\"http://www.public-domain-content.com/links2.shtml\">Links</a> - <A HREF=\"http://www.public-domain-content.com/\">Home</a> - <A HREF=\"http://www.webmaster-headquarters.com/mb/\">Message Boards</a></B><BR>This <a HREF=\"http://www.wikipedia.org/\">Wikipedia</a> content is licensed under the <a href=\"http://www.gnu.org/copyleft/fdl.html\">GNU Fr

Don't use regular expressions to parse HTML. Use the HTML Agility Pack (or something similar) instead.

As a quick example, you could do something like this:

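using HtmlAgilityPack;

HtmlDocument document = new HtmlDocument();
document.Load("your_file_here.htm");
// DocumentNode is the root node in current HTML Agility Pack releases.
// SelectNodes returns null (not an empty collection) when nothing matches.
HtmlNodeCollection paragraphs = document.DocumentNode.SelectNodes("//p");
if (paragraphs != null)
{
    foreach (HtmlNode paragraph in paragraphs)
    {
        // do something with the paragraph node here
        string content = paragraph.InnerText; // or something similar
    }
}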
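
Since the question has the HTML in a string (downloaded with WebClient) rather than in a file, the same idea can be sketched with LoadHtml. This is only a minimal sketch under a few assumptions: the HtmlAgilityPack NuGet package is referenced, and the names ParagraphExtractor and ExtractParagraphs are hypothetical helpers, not part of the library. It also replaces the bare \r characters visible in the sample output; Environment.NewLine is "\r\n" on Windows, so replacing it would not match a lone \r.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

// Hypothetical helper, named here for illustration only.
static class ParagraphExtractor
{
    // Returns the text of every non-empty <p> element found in the HTML string.
    public static List<string> ExtractParagraphs(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html); // parse from a string instead of a file

        var result = new List<string>();

        // SelectNodes returns null when no node matches the XPath expression.
        HtmlNodeCollection nodes = document.DocumentNode.SelectNodes("//p");
        if (nodes == null)
        {
            return result;
        }

        foreach (HtmlNode paragraph in nodes)
        {
            // Decode entities such as &nbsp;, then turn line breaks into spaces.
            string text = HtmlEntity.DeEntitize(paragraph.InnerText)
                .Replace("\r", " ")
                .Replace("\n", " ")
                .Trim();

            // Skip spacer paragraphs such as <p>&nbsp;</p>.
            if (text.Length > 0)
            {
                result.Add(text);
            }
        }

        return result;
    }
}
```

It could be called as ParagraphExtractor.ExtractParagraphs(e.Result) from the download-completed handler. How unclosed <p> tags are parsed may differ between library versions, so the result is worth checking against the real page. Separately, String.Substring(startIndex, length) takes a length as its second argument, so the question's call would need urlEnd - UrlStart rather than urlEnd to stop at the marker.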