Losing the 'less than' sign in HtmlAgilityPack loadhtml

开发者 https://www.devze.com 2023-02-19 18:11 Source: web
I recently started experimenting with the HtmlAgilityPack. I am not familiar with all of its options, so I think I am doing something wrong.

I have a string with the following content:

string s = "<span style=\"color: #0000FF;\"><</span>";

You see that in my span I have a 'less than' sign. I process this string with the following code:

HtmlDocument htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(s);

But when I do a quick and dirty look in the span like this:

htmlDocument.DocumentNode.ChildNodes[0].InnerHtml

I see that the span is empty.

Which option do I need to set to keep the 'less than' sign? I already tried this:

htmlDocument.OptionAutoCloseOnEnd = false;
htmlDocument.OptionCheckSyntax = false;
htmlDocument.OptionFixNestedTags = false;

but with no success.

I know it is invalid HTML. I am using HtmlAgilityPack to fix invalid HTML, and I want to apply HTMLEncode to the 'less than' signs.

Please point me in the right direction. Thanks in advance.


The Html Agility Pack detects this as an error and creates an HtmlParseError instance for it. You can read all errors through the ParseErrors property of the HtmlDocument class. So, if you run this code:

    string s = "<span style=\"color: #0000FF;\"><</span>";
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(s);
    doc.Save(Console.Out);

    Console.WriteLine();
    Console.WriteLine();

    foreach (HtmlParseError err in doc.ParseErrors)
    {
        Console.WriteLine("Error");
        Console.WriteLine(" code=" + err.Code);
        Console.WriteLine(" reason=" + err.Reason);
        Console.WriteLine(" text=" + err.SourceText);
        Console.WriteLine(" line=" + err.Line);
        Console.WriteLine(" pos=" + err.StreamPosition);
        Console.WriteLine(" col=" + err.LinePosition);
    }

It will display this (the corrected text first, then the details of the error):

<span style="color: #0000FF;"></span>

Error
 code=EndTagNotRequired
 reason=End tag </> is not required
 text=<
 line=1
 pos=30
 col=31

So you can try to fix this error, as you have all the required information (including line, column, and stream position), but the general process of fixing (as opposed to detecting) errors in HTML is very complex.


As mentioned in another answer, the best solution I found was to pre-parse the HTML to convert orphaned < symbols to their HTML encoded value &lt;.

return Regex.Replace(html, "<(?![^<]+>)", "&lt;");
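To make the one-liner above easier to try out, here is a minimal sketch that wraps it in a small console program. The class and method names (OrphanLessThanDemo, EscapeOrphanLessThan) are made up for illustration; only the pattern comes from the answer. The negative lookahead treats a '<' as orphaned when it is not followed by at least one non-'<' character and then a '>'.

```csharp
using System;
using System.Text.RegularExpressions;

public static class OrphanLessThanDemo
{
    // Hypothetical helper name; the pattern is the one quoted in the answer.
    public static string EscapeOrphanLessThan(string html)
    {
        // Replace every '<' that is NOT followed by one or more non-'<'
        // characters and then a closing '>' (i.e. does not look like a tag).
        return Regex.Replace(html, "<(?![^<]+>)", "&lt;");
    }

    public static void Main()
    {
        string s = "<span style=\"color: #0000FF;\"><</span>";

        // prints: <span style="color: #0000FF;">&lt;</span>
        Console.WriteLine(EscapeOrphanLessThan(s));

        // prints: a &lt; b
        Console.WriteLine(EscapeOrphanLessThan("a < b"));
    }
}
```

The result can then be passed to LoadHtml as usual. One limitation to keep in mind: the lookahead only checks whether a '>' appears before the next '<', so text such as "1 < 2 > 1" is left unchanged because the '<' looks like the start of a tag.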


Fix the markup, because your HTML string is invalid:

string s = "<span style=\"color: #0000FF;\">&lt;</span>";


Although it is true that the given HTML is invalid, HtmlAgilityPack should still be able to parse it. Forgetting to encode "<" is not an uncommon mistake on the web, and if HtmlAgilityPack is used in a crawler, it should anticipate bad HTML. I tested the example in IE, Chrome and Firefox, and they all show the extra < as text.

I wrote the following method that you can use to preprocess the html string and replace all 'unclosed' '<' characters with "&lt;":

// Requires:
// using System.Collections.Generic;
// using System.Text;
static string PreProcess(string htmlInput)
{
    // Index of the last unclosed '<' character, or -1 if the last '<' character is closed.
    int lastLt = -1;

    // This list will be populated with the positions of all the unclosed '<' characters.
    List<int> ltPositions = new List<int>();

    // Collect the unclosed '<' characters.
    for (int i = 0; i < htmlInput.Length; i++)
    {
        if (htmlInput[i] == '<')
        {
            // A new '<' before any '>' means the previous '<' was never closed.
            if (lastLt != -1)
                ltPositions.Add(lastLt);

            lastLt = i;
        }
        else if (htmlInput[i] == '>')
            lastLt = -1;
    }

    // A '<' still pending at the end of the input is unclosed as well.
    if (lastLt != -1)
        ltPositions.Add(lastLt);

    // If no unclosed '<' characters are found, then just return the input string.
    if (ltPositions.Count == 0)
        return htmlInput;

    // Build the output string, replacing every unclosed '<' character with "&lt;".
    // Each replacement grows the string by 3 characters.
    StringBuilder htmlOutput = new StringBuilder(htmlInput.Length + 3 * ltPositions.Count);
    int start = 0;

    foreach (int ltPosition in ltPositions)
    {
        htmlOutput.Append(htmlInput, start, ltPosition - start);
        htmlOutput.Append("&lt;");
        start = ltPosition + 1;
    }

    htmlOutput.Append(htmlInput, start, htmlInput.Length - start);
    return htmlOutput.ToString();
}


The string "s" is invalid HTML. Encode the 'less than' sign instead:

string s = "<span style=\"color: #0000FF;\">&lt;</span>";

This version is valid.
