How to parse HTML to modify all words_问答_开发者

开发者 https://www.devze.com 2023-02-09 19:47 出处：网络

This seems to be a recurring question, but here goes. I have HTML which is well-formatted (it comes from a controlled source, so this can be taken to be a given). I need to iterate through the conten

相关专题：

This seems to be a recurring question, but here goes.

I have HTML which is well-formatted (it comes from a controlled source, so this can be taken to be a given). I need to iterate through the contents of the body of the HTML, look for all the words in the document, perform some editing on those words, and save the results.

For example, I have file sample.html and I want to run it through my application and product output.html, which is exactly the same as the original, plus my edits.

I found the following using HTMLAgilityPack, but all the examples I've found look at the attributes of the specified tags - is there an easy modification that will look at the contents and perform my edits?

HtmlDocument HD = new HtmlDocument();
HD.Load (@"e:\test.htm");
var NoAltElements = HD.DocumentNode.SelectNodes("//img[not(@alt)]");
if (NoAltElements != null)
{
    foreach (HtmlNode HN in NoAltElements)
    {
       HN.Attributes.Append("alt", "no alt image");
    }
}

HD.Save(@"e:\test.htm");

The above looks for image tags with no ALT tags. I want to look for all tags in the <body> of the file and do something with the contents (whi开发者_StackOverflowch may involve creating new tags in the process).

A very simple sample of what I might do is take the following input:

<html>
    <head><title>Some Title</title></head>
    <body>
        <h1>This is my page</h1>
        <p>This is a paragraph of text.</p>
    </body>
</html>

and produce the output, which takes every word and alternates between making it uppercase and making it italics:

<html>
    <head><title>Some Title</title></head>
    <body>
        <h1>THIS <em>is</em> MY <em>page</em></h1>
        <p>THIS <em>is</em> A <em>paragraph</em> OF <em>text</em>.</p>
    </body>
</html>

Ideas, suggestions?

Personally, given this setup, I'd work with the InnerText property of HtmlNode to find the words (probably with Regex so I can exclude for punctuation and not simply rely on spaces) and then use the InnerHtml property to make the changes using iterative calls to Regex.Replace (because the Regex.Replace has a method that allows you to specify both start position and number of times to replace).

Processing code:

IEnumerable<HtmlNode> nodes = doc.DocumentNode.DescendantNodes().Where(n => n.InnerText == "something");
foreach (HtmlNode node in nodes)
{
    string[] words = getWords(node.InnerText);

    node.InnerHtml = processHtml(node.InnerHtml, words);
}

identify words (there's probably some slicker way to do this but here's an initial stab):

private string[] getWords(string text)
{
    Regex reg = new Regex("/w+");
    MatchCollection matches = reg.Matches(text);
    List<string> words = new List<string>();
    foreach (Match match in matches)
    {
        words.Add(match.Value);
    }
    return words.ToArray();
}

process the html:

private string processHtml(string html, string[] words)
{
    int startPosition = 0;
    foreach (string word in words)
    {
        startPosition = html.IndexOf(word, startPosition);
        Regex reg = new Regex(word);
        html = reg.Replace(html, alterWord(word), 1, startPosition);
    }

    return html;
}

I'll leave the details of alterWord() to you. :)

Try .SelectNodes("//body//*"). That'll get you all elements within any body element, at any depth.