I have an HTML document in .txt format containing multiple tables and other text, and I am trying to delete any HTML (anything within "<>") if it's inside a table (between <table> and </table>). For example:
===================
other text
<other HTML>
<table>
<b><u><i>bold underlined italic text</b></u></i>
</table>
other text
<other HTML>
==============
The final output would be as follows. Note that only the HTML between <table> and </table> is removed.
==============
other text
<other HTML>
<table>
bold underlined italic text
</table>
other text
<other HTML>
=============
Any help is greatly appreciated!
Use the HtmlDocument class instead of a regex:
Imports System.IO
Imports System.Windows.Forms

Public Class Form1
    Private Sub Form1_Load(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles MyBase.Load
        Dim myHTMLString As String
        Dim myDoc As HtmlDocument
        Dim myTables As HtmlElementCollection
        Dim myTable As HtmlElement
        Dim myAllTags As HtmlElementCollection
        Dim myHTMLTag As HtmlElement

        ' Read the whole input file into a string.
        myHTMLString = File.ReadAllText("C:\Users\Geoffrey Van Wyk\Desktop\myPage1.txt")

        ' WebBrowser1 is a WebBrowser control placed on the form.
        ' Write the HTML into a new document so it is parsed into a DOM.
        WebBrowser1.DocumentText = myHTMLString
        myDoc = WebBrowser1.Document.OpenNew(True)
        myDoc.Write(myHTMLString)

        ' Take the first table. To handle every table, loop over myTables instead.
        myTables = myDoc.GetElementsByTagName("table")
        myTable = myTables.Item(0)

        ' Replace each child element of the table with its plain text.
        For Each child As HtmlElement In myTable.Children
            child.OuterText = child.InnerText
        Next

        ' Write the modified document back out, starting from the <html> element.
        myAllTags = myDoc.GetElementsByTagName("html")
        myHTMLTag = myAllTags.Item(0)
        File.WriteAllText("C:\Users\Geoffrey Van Wyk\Desktop\myPage2.txt", myHTMLTag.OuterHtml)
    End Sub
End Class
I have tested it. It works.
input = Regex.Replace(input, @"<table>(.|\n)*?</table>", string.Empty, RegexOptions.Singleline);
Here input is the string that contains the HTML. This regex removes each table entirely: the <table> and </table> tags and everything between them, text included. Try it!
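Since the question asks to keep the table and its text and drop only the tags inside it, a variation on the same regex idea might look like the sketch below. It is not from the answers above: the class name TableTagStripper and the method StripTagsInsideTables are made up for illustration, and it assumes tables are not nested and that no ">" appears inside an attribute value. For messier HTML, a real parser is the safer route.

```csharp
using System;
using System.Text.RegularExpressions;

static class TableTagStripper
{
    // One table: opening tag (attributes allowed), body, closing tag.
    // Assumption: tables are not nested.
    private static readonly Regex TableBlock = new Regex(
        @"(<table\b[^>]*>)(.*?)(</table\s*>)",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Any single tag such as <b> or </i>.
    private static readonly Regex AnyTag = new Regex(@"<[^>]+>");

    // Keeps the <table> and </table> tags, strips every tag between them,
    // and leaves everything outside tables untouched.
    public static string StripTagsInsideTables(string input)
    {
        return TableBlock.Replace(input, m =>
            m.Groups[1].Value
            + AnyTag.Replace(m.Groups[2].Value, string.Empty)
            + m.Groups[3].Value);
    }

    static void Main()
    {
        string input =
            "other text\n<other HTML>\n<table>\n" +
            "<b><u><i>bold underlined italic text</b></u></i>\n" +
            "</table>\nother text\n<other HTML>";

        Console.WriteLine(StripTagsInsideTables(input));
    }
}
```

The outer pattern finds each table, and the match evaluator runs the inner tag-stripping pattern only on the table body, so "<other HTML>" outside the table is left alone.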