Removing incomplete P Tags (using REGEX or any other method)

Developer https://www.devze.com 2023-01-16 20:41 Source: Internet

My problem is a bit case-specific.

First of all, it applies only to <p> tags, not to any other tag, so you need not worry about any other tag.

I have an HTML document that is the output of a piece of software, but it has some errors such as unclosed <p> tags.

For example, I have read the whole document into a string, and it looks like this:

    <html>
    ....
    ....
      <head>
      </head>
    ....
    ....
       <body>

    ...
    ...
    <p>                 // tag is to be removed as no closing tag

    <p align="left">   AAA   </p>
    <p class="style6">   BBB    </P>
    <p class="style1" align="center">    CCC    </P>

    <p align="left">  DDD               // tag is to be removed as no closing tag
    <p class="style6">   EEE              // tag is to be removed as no closing tag
    <p class="style1" align="center">    FFF             // tag is to be removed as no closing tag

    <p class="style15"><strong>xxyyzz</strong><br/></p>

    <p>                // tag is to be removed as no closing tag



    <p> stack Overflow </P>


       </body>
      </html>

The tags around DDD, EEE and FFF and the bare unclosed <p> tags are to be removed. As you can see, it should work for every unclosed <p> tag, whether or not it has attributes such as class or align.

I also want to mention that there is never a <p> tag inside another <p> tag. I mean:

<p>
    <p>
    </p>

     <p>
     </p>

</p>

Such a condition will never occur.

I tried using regex and StringBuilder but could not get a perfect answer.

Thanks a lot in advance to those who help.

Regards


You might get better results using the Html Agility Pack:

It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML.

Just load the document into the DOM, iterate over the elements looking for <p> and filter them out, almost like you were doing valid XML manipulation.


Disclaimer: Please note that I do not advocate trying to parse arbitrary HTML with regular expressions or simple substring matches. The solution below is for this specific problem, which appears to be purposely limited to make parsing possible with simple methods. In general, I agree with the consensus: To parse HTML, use an HTML parser.

That said . . .

Given that nested <p> tags aren't allowed, and assuming that there aren't any HTML comments allowed, it should be relatively easy to do the following in a loop to find and eliminate all <p> tags that have no corresponding </p>.

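// Requires "using System;" for StringComparison.
string inputText = GetHtmlText();
int startTag = inputText.IndexOf("<p>", StringComparison.OrdinalIgnoreCase);
while (startTag != -1)
{
    // Resume scanning just past the opening "<p>" (3 characters).
    int scanPos = startTag + 3;
    // Now look for a closing tag or another open tag
    int closeTag = inputText.IndexOf("</p>", scanPos, StringComparison.OrdinalIgnoreCase);
    int nextStartTag = inputText.IndexOf("<p>", scanPos, StringComparison.OrdinalIgnoreCase);
    // A missing next open tag (-1) must not count as "before the closing tag".
    if (closeTag == -1 || (nextStartTag != -1 && nextStartTag < closeTag))
    {
        // Error at position startTag.  No closing tag.
    }
    else
    {
        // You have a full paragraph between startTag and closeTag + 4 (the end of "</p>").
    }
    startTag = nextStartTag;
}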

The code assumes that the strings <p> and </p> cannot exist in the text except as actual paragraph opening and closing tags. If you can make that guarantee, then the above (or something very similar) should work quite well.
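As a rough illustration of that loop, here is a minimal sketch that collects the positions of plain <p> tags with no matching </p>. The class and method names (UnclosedParagraphFinder, FindUnclosed) are made up for this example, and it relies on the same assumptions as above: no nested paragraphs, no comments, and only attribute-free <p> tags.

```csharp
using System;
using System.Collections.Generic;

public static class UnclosedParagraphFinder
{
    // Hypothetical helper: returns the start index of every "<p>" that is
    // followed by another "<p>" (or by the end of the text) before any "</p>".
    public static List<int> FindUnclosed(string text)
    {
        var unclosed = new List<int>();
        int startTag = text.IndexOf("<p>", StringComparison.OrdinalIgnoreCase);
        while (startTag != -1)
        {
            int scanPos = startTag + 3; // skip past "<p>"
            int closeTag = text.IndexOf("</p>", scanPos, StringComparison.OrdinalIgnoreCase);
            int nextStartTag = text.IndexOf("<p>", scanPos, StringComparison.OrdinalIgnoreCase);
            if (closeTag == -1 || (nextStartTag != -1 && nextStartTag < closeTag))
            {
                unclosed.Add(startTag); // no closing tag before the next open tag
            }
            startTag = nextStartTag;
        }
        return unclosed;
    }
}
```

Calling FindUnclosed("<p>a<p>b</p><p>c") should report the first and the last <p>, since only the middle one is closed.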

ADDED:

Handling things like <p class="classname">, etc., gets a little less sure. If you can guarantee that there won't be any > characters between the opening <p and the closing >, then you can modify the code above to search for <p as well as for <p>, and if found then locate the closing >. It's a little bit messy, but not particularly difficult.

All that said, I would not recommend this approach for parsing arbitrary HTML, because of the caveats I've already stated: it won't handle comments and it makes what are probably invalid assumptions about the format of the HTML in general. It also won't handle things like <p > and </p >, both of which are perfectly valid (and that I've encountered in the wild).
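For this narrowly scoped case, one possible way to cover attributes, <p >, </p > and mixed letter case is to let a regular expression find only the paragraph tags and then decide which opening tags are unclosed. This is a sketch under the stated assumptions (no nested paragraphs, no comments, no > inside attribute values), not a general HTML cleaner; ParagraphCleaner and RemoveUnclosedParagraphTags are names invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

public static class ParagraphCleaner
{
    // Matches "<p>", "<p attr=...>", "<p >", "</p>" and "</p >" in any letter
    // case, but not other tags that merely start with "p", such as "<pre>".
    private static readonly Regex PTag =
        new Regex(@"<(/?)p(?:\s[^>]*)?>", RegexOptions.IgnoreCase);

    // Hypothetical helper: removes every opening <p ...> tag that is not
    // followed by a closing tag before the next opening tag. Only the tag
    // itself is removed; the text after it is kept.
    public static string RemoveUnclosedParagraphTags(string html)
    {
        var unclosed = new List<Match>();
        Match pendingOpen = null;

        foreach (Match m in PTag.Matches(html))
        {
            bool isClosing = m.Groups[1].Length > 0;
            if (isClosing)
            {
                pendingOpen = null; // the pending opening tag is now closed
            }
            else
            {
                // A new opening tag while one is still pending means the
                // pending one was never closed.
                if (pendingOpen != null) unclosed.Add(pendingOpen);
                pendingOpen = m;
            }
        }
        if (pendingOpen != null) unclosed.Add(pendingOpen);

        // Remove from the end so earlier match indexes stay valid.
        var sb = new StringBuilder(html);
        for (int i = unclosed.Count - 1; i >= 0; i--)
        {
            sb.Remove(unclosed[i].Index, unclosed[i].Length);
        }
        return sb.ToString();
    }
}
```

Recording the matches first and deleting them back to front avoids the index bookkeeping that is needed when tags are removed while the string is still being scanned.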


I really appreciate the help from all of you, especially Jim and Alex. I tried it and it's working nicely. Thanks a lot.

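public static string CleanUpXHTML(string xhtml)
{
    // pOpen:  index of the current "<p"
    // pClose: index of the ">" that ends that opening tag
    // pSlash: index of the next "</p>" after it
    // pNext:  index of the next "<p" after it
    // Note: the searches are case-sensitive, so "</P>" is not treated as a closing tag,
    // and "<p" also matches other tags that start with "p" (for example "<pre>").
    int pOpen = xhtml.IndexOf("<p", 0);
    if (pOpen < 0)
    {
        return xhtml; // no <p> tag at all, nothing to clean
    }
    int pClose = xhtml.IndexOf(">", pOpen);
    int pSlash = xhtml.IndexOf("</p>", pClose);
    int pNext = xhtml.IndexOf("<p", pClose);
    int length = 0;

    while (pSlash > -1)
    {
        if (pSlash < pNext)
        {
            // The current <p> is closed before the next one opens: keep it and move on.
            pOpen = pNext;
            pClose = xhtml.IndexOf(">", pOpen);
            pSlash = xhtml.IndexOf("</p>", pClose);
            pNext = xhtml.IndexOf("<p", pClose);
        }
        else
        {
            length = pClose - pOpen + 1;
            if (pNext < 0 && pSlash > 0)
            {
                break; // the last <p> is closed, nothing left to do
            }

            // Another <p> opens before the closing tag: remove the current opening tag.
            xhtml = xhtml.Remove(pOpen, length);

            pOpen = pNext - length;
            pClose = xhtml.IndexOf(">", pOpen);
            pSlash = xhtml.IndexOf("</p>", pClose);
            pNext = xhtml.IndexOf("<p", pClose);
        }

        if (pSlash < 0)
        {
            // No closing tag remains: remove the current and every following opening <p> tag.
            int lastp = xhtml.IndexOf("<p", pOpen - 1);
            int lastclosep = xhtml.IndexOf(">", lastp);
            int lastnextp = xhtml.IndexOf("<p", lastclosep);
            int length3 = 0;

            while (lastp > 0)
            {
                length3 = lastclosep - lastp + 1;
                xhtml = xhtml.Remove(lastp, length3);
                if (lastnextp < 0)
                {
                    break;
                }
                lastp = lastnextp - length3;
                lastclosep = xhtml.IndexOf(">", lastp);
                lastnextp = xhtml.IndexOf("<p", lastclosep);
            }

            break;
        }
    }

    return xhtml;
}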


First of all, please have a look here. If that didn't deter you from using regular expressions for parsing HTML (and because I understand it's a very specific case that might not warrant using a full DOM parser, even though that's the absolute best recommended way), I've posted an answer to a similar question here; you can easily adapt it for your case, but please understand that it's not recommended and many things can go wrong if you decide to use it (including, as outlined in the first link above, the end of the universe etc. :P).

If the regex I pointed you to seems too complex or you're having problems understanding or simplifying it, post a comment and I'll add more clarifications.
