Regex to delete HTML within <table> tags

开发者 https://www.devze.com 2023-01-31 11:13 Source: the web
I have an HTML document in .txt format containing multiple tables and other text, and I am trying to delete any HTML (anything within "<>") if it's inside a table (between <table> and </table>). For example:

===================
other text
<other HTML>
<table>
<b><u><i>bold underlined italic text</b></u></i>
</table>
other text
<other HTML>
==============

The final output would be as follows. Note that only the HTML between <table> and </table> is removed.

==============
other text
<other HTML>
<table>
bold underlined italic text        
</table>
other text
<other HTML>
=============

Any help is greatly appreciated!


Use the HtmlDocument Class Instead of Regex

Imports System.Windows.Forms.HtmlDocument
Imports System.IO.File

Public Class Form1

    Private Sub Form1_Load(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles MyBase.Load
        Dim myHTMLString As String

        Dim myDoc As HtmlDocument
        Dim myTables As HtmlElementCollection
        Dim myTable As HtmlElement

        Dim myAllTags As HtmlElementCollection
        Dim myHTMLTag As HtmlElement

        myHTMLString = ReadAllText("C:\Users\Geoffrey Van Wyk\Desktop\myPage1.txt")
        WebBrowser1.DocumentText = myHTMLString

        myDoc = WebBrowser1.Document.OpenNew(True)
        myDoc.Write(myHTMLString)

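        ' Take the first <table> element in the document.
        myTables = myDoc.GetElementsByTagName("table")
        myTable = myTables.Item(0)

        ' Replace each direct child of the table with its plain text, dropping the markup.
        For Each child As HtmlElement In myTable.Children
            child.OuterText = child.InnerText
        Next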

        myAllTags = myDoc.GetElementsByTagName("html")
        myHTMLTag = myAllTags.Item(0)

        WriteAllText("C:\Users\Geoffrey Van Wyk\Desktop\myPage2.txt", myHTMLTag.OuterHtml)
    End Sub
End Class

I have tested it. It works.


input = Regex.Replace(input, @"<table>.*?</table>", string.Empty, RegexOptions.Singleline);

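Here input is the string that contains the HTML. This regex removes everything from the opening <table> tag through the closing </table> tag, including the tags themselves and the text between them. Note that this deletes the table's text as well, which is more than the question asks for. Try it!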
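
The one-line replace above deletes each table entirely, text included. To keep the <table> tags and the text, and drop only the tags in between as in the question's expected output, one option is a two-step replace: match each table, then strip tags from its body inside a MatchEvaluator. The following is a minimal sketch, not a tested production solution. The class name TableTagStripper and the method name StripTagsInsideTables are hypothetical, and it assumes tables are not nested and that no attribute value contains a ">" character.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TableTagStripper
{
    // Matches one table: group 1 = opening tag, group 2 = body, group 3 = closing tag.
    // Singleline lets "." match newlines; the lazy ".*?" stops at the first </table>.
    private static readonly Regex TableBlock = new Regex(
        @"(<table\b[^>]*>)(.*?)(</table\s*>)",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Matches any single tag, such as <b>, </i>, or <td class="x">.
    private static readonly Regex AnyTag = new Regex(@"<[^>]+>");

    // Hypothetical helper: removes tags inside each table but keeps the
    // <table> and </table> tags, the inner text, and everything outside tables.
    public static string StripTagsInsideTables(string html)
    {
        return TableBlock.Replace(html, m =>
            m.Groups[1].Value
            + AnyTag.Replace(m.Groups[2].Value, string.Empty)
            + m.Groups[3].Value);
    }

    public static void Main()
    {
        // Sample input taken from the question.
        string input =
            "other text\n<other HTML>\n<table>\n" +
            "<b><u><i>bold underlined italic text</b></u></i>\n" +
            "</table>\nother text\n<other HTML>";

        Console.WriteLine(StripTagsInsideTables(input));
    }
}
```

Because the inner replace only sees the text captured between the table tags, <other HTML> outside the table is left untouched. For nested tables or otherwise irregular markup, a parser-based approach such as the HtmlDocument answer is likely to be more robust than a regex.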
