开发者

HTML Agility pack removes break tag close

开发者 https://www.devze.com 2023-02-22 11:50 出处:网络
I am creating an HTML document using HTML agility pack. I load a template file then append content to it. All of this works, but when I view the output file it has removed the closing tag from my <

I am creating an HTML document using HTML agility pack. I load a template file then append content to it. All of this works, but when I view the output file it has removed the closing tag from my <br/> tags to look like this <br>. What is causing this?

Dim doc As New HtmlDocument()
doc.Load(Server.MapPath("Template.htm"))

Dim title As HtmlNode = doc.DocumentNode.SelectSingleNode("//title")

title.InnerHtml = title.InnerHtml & "CEU Classes"
Dim topContent As HtmlAgilityPack.HtmlNode = doc.GetElementbyId("topContent")

topContent.InnerHtml = html.ToString
doc.OptionWriteEmptyNodes = True
doc.Save(outputFileName, Encoding.UTF8)

More info:

It was removing my closing image tags, after I added doc.OptionWriteEmptyNodes = True, it quite doing that.

Update

This is my code as it stands now that removes the closing BR tag

Dim html As String = "Words<br/>more words"
Dim doc As New HtmlDocument()
Dim title As HtmlNode
Dim topContent As HtmlNode

HtmlNode.ElementsFlags("br") = HtmlElementFlag.Empty
doc.Load(Server.MapPath("Template.htm"))

Title = doc.DocumentNode.SelectSingleNode("//title")
title.InnerHtml = title.InnerHtml & "CEU Classes"

topContent = doc.GetElementbyId("topContent")
topContent.InnerHtml = html.ToString

doc.OptionWriteEmptyNodes = True
doc.Save(outputFileName, Encoding.UTF8)

Update 2

I ended up just reading in my template file as a standard string then loading the html like this

Dim TemplateHTML As String = File.ReadAllText(Server.MapPath("Template.htm"))

TemplateHTML = TemplateHTML.Insert开发者_如何学JAVA(TemplateHTML.IndexOf("<div id=""topContent"">") + "<div id=""topContent"">".Length, _
                                   html.ToString)

doc.LoadHtml(TemplateHTML)


It happens because the Html Agility Pack handles the BR in a special way. It still supports old (but existing on the web today) HTML 3.2 syntax where the BR could be declared without a closing tag at all (browsers also still handle it gracefully by the way...).

To change this default behavior, you need to modify the HtmlNode.ElementFlags property, like this:

Dim doc As New HtmlDocument()
HtmlNode.ElementsFlags("br") = HtmlElementFlag.Empty
doc.LoadHtml("<test>before<br/>after</test>")
doc.OptionWriteEmptyNodes = True   
doc.Save(Console.Out)

which will display:

<test>before<br />after</test>


As per @Simon Mourier, the following C# code works in version 1.4

var doc = new HtmlDocument();
HtmlNode.ElementsFlags["br"] = HtmlElementFlag.Empty;
doc.OptionWriteEmptyNodes = true;
doc.LoadHtml("Lorem ipsum dolor sit<br/>Lorem ipsum dolor sit");

var postParsed = doc.DocumentNode.WriteTo();

has the following string value for postParsed

"Lorem ipsum dolor sit<br />Lorem ipsum dolor sit"


Seems this is a standard setting in Html Agility Pack. By default, it does not conform to XHTML and many tags are not closed.

There are 2 ways to do this. At the document level you can do the following which will turn on ALL closing tags. (This is my preferred method).

HtmlDocument doc = new HtmlDocument();
doc.OptionWriteEmptyNodes = true;
doc.LoadHtml(content);

However, this may not be desirable. There is another way to do it at the node level.

if (HtmlNode.ElementsFlags.ContainsKey("img"))
{
    HtmlNode.ElementsFlags["img"] = HtmlElementFlag.Closed;
}
else
{
    HtmlNode.ElementsFlags.Add("img", HtmlElementFlag.Closed);
}


I have encountered same kind of problem and I solved it by manually re-parsing HTML chunk using new HtmlDocument object with correct settings.

Problem as I see it is that HtmlDocument has all those nice settings to let you close
tags etc, but when you select a node or do some other soft of operation with nodes and use their OuterHtml or InnerHtml some of those closing tags are lost (probably because those properties do not use same settings as document itself, or meybe there is some other reason). So when you get that incorrect html string from InnerHtml or OuterHtml, you can just re-parse it with HtmlDocument again and use document.DocumentElement.InnerHtml to get correct HTML string.

0

精彩评论

暂无评论...
验证码 换一张
取 消