I am trying to convert a webpage into plain text, but whenever the page contains a table I get the td and tr tags in the output too. If I strip out the table tags, I lose some of the content.
Here is my code:
string s = Regex.Replace(htmldoc, "<script.*?</script>", "", RegexOptions.Singleline | RegexOptions.IgnoreCase);
s = Regex.Replace(s, "<!--.*?-->", "", RegexOptions.Singleline | RegexOptions.IgnoreCase);
s = Regex.Replace(s, "<style.*?style>", "", RegexOptions.Singleline | RegexOptions.IgnoreCase);
s = Regex.Replace(s, "<a.*?a>", "", RegexOptions.Singleline | RegexOptions.IgnoreCase);
s = Regex.Replace(s, "<img.*?img>", "", RegexOptions.Singleline | RegexOptions.IgnoreCase);
s = Regex.Replace(s, "<table.*?table>", "", RegexOptions.Singleline | RegexOptions.IgnoreCase);
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(s);
s = doc.DocumentNode.SelectSingleNode("//body").InnerText.Trim();
Please check it and tell me how I can get the contents of a table without getting the td and tr tags.
If you are using the HTML Agility Pack to parse the table, you don't need to remove the HTML tags with regexes first. There are some good examples of parsing tables with the HTML Agility Pack here on SO, for example: "HTML Agility pack - parsing tables".
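As a rough illustration of that approach, here is a minimal sketch that walks the table rows with XPath and joins the text of each cell. The class name TableText and the method ExtractTables are made up for this example, and the tab/newline separators are an arbitrary choice. It assumes simple, non-nested tables; with nested tables the inner rows would be visited a second time.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class TableText
{
    // Hypothetical helper: returns one line per table row,
    // with the cell texts separated by tabs.
    public static string ExtractTables(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var rows = doc.DocumentNode.SelectNodes("//table//tr");
        if (rows == null) return string.Empty;

        var lines = new List<string>();
        foreach (HtmlNode row in rows)
        {
            // Direct th/td children of this row only.
            var cells = row.SelectNodes("th|td");
            if (cells == null) continue;

            // InnerText drops the markup; DeEntitize turns &amp; etc. back into characters.
            lines.Add(string.Join("\t",
                cells.Select(c => HtmlEntity.DeEntitize(c.InnerText).Trim())));
        }
        return string.Join("\n", lines);
    }
}
```

Calling TableText.ExtractTables(htmldoc) on the original page, without any of the regex replacements, should give the cell contents without any td or tr tags.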
You can use the body's InnerText:
string html = @"
<html>
<title>title</title>
<body>
<h1> The wheel.</h1>
Stop reinventing the wheel ! Use powerful APIs
for manipulating html docs !
<h3> I am fine </h3>
<img src=""da_wheel_in_my_mind.png""/>
</body>
</html>";
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
string text = doc.DocumentNode.SelectSingleNode("//body").InnerText;
Next, you may want to collapse spaces and new lines:
text = Regex.Replace(text, @"\s+", " ").Trim();
Note, however, that while this works here, markup such as hello<br>world or hello<i>world</i> will be converted by InnerText to helloworld, because the tags are simply removed. That issue is difficult to solve in general, as display is often determined by the CSS, not just by the markup.
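One partial workaround is to put whitespace into the document before reading InnerText. The sketch below is only a heuristic, not a general solution: it swaps each br for a space and appends a space to every td and th so that neighbouring cells do not run together. The names PlainText and FromHtml are invented for this example, and the choice of which elements get a separator is an assumption; inline tags such as i are deliberately left alone, and CSS is ignored entirely.

```csharp
using System.Text.RegularExpressions;
using HtmlAgilityPack;

public static class PlainText
{
    // Hypothetical helper: HTML to single-line text, with separators
    // inserted for <br>, <td> and <th> before the tags are dropped.
    public static string FromHtml(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Replace every <br> with a text node holding a single space.
        var breaks = doc.DocumentNode.SelectNodes("//br");
        if (breaks != null)
        {
            foreach (HtmlNode br in breaks)
                br.ParentNode.ReplaceChild(doc.CreateTextNode(" "), br);
        }

        // Append a space inside every table cell so adjacent cells stay apart.
        var cells = doc.DocumentNode.SelectNodes("//td|//th");
        if (cells != null)
        {
            foreach (HtmlNode cell in cells)
                cell.AppendChild(doc.CreateTextNode(" "));
        }

        // Fall back to the whole document when there is no <body> element.
        HtmlNode root = doc.DocumentNode.SelectSingleNode("//body") ?? doc.DocumentNode;
        string text = HtmlEntity.DeEntitize(root.InnerText);

        // Collapse runs of whitespace, as in the answer above.
        return Regex.Replace(text, @"\s+", " ").Trim();
    }
}
```

The same pattern can be extended to other elements that usually start a new line (list items, headings and so on) by adding them to the XPath expression, but which elements count as "block" ultimately depends on the page's stylesheet.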