Regex read html values

Developer https://www.devze.com 2023-02-28 09:14 Source: Internet

Related topic: regex

I have this file that contains the following text (html):

<tr> 
<th scope="row">X:</th> 
<td>343</td> 
</tr> 
<tr> 
<th scope="row">Y:</th> 
<td>6,995 sq ft / 0.16 acres</td> 
</tr>

And I have this method to read the values for X and Y:

    private static Dictionary<string, string> FindKeys(IEnumerable<string> keywords, string source)
    {
        var found = new Dictionary<string, string>();
        var keys = string.Join("|", keywords.ToArray());
        var matches = Regex.Matches(source, @"\b(?<key>" + keys + @"):\s*(?<value>)");

        foreach (Match m in matches)
        {
            try
            {
                var key = m.Groups["key"].ToString();
                var value = m.Groups["value"].ToString();
                found.Add(key, value);
            }
            catch
            {
            }
        }
        return found;
    }

I can't get the method to return the values for X and Y.

Is anything wrong with the regular expression?

You have "</th> <td>" between the keyword and the value, so you need to skip that markup in your regex, like this:

@"\b(?<key>" + keys + @"):\s*</th>[^<]*<td>(?<value>[^<]*)"

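Also, you need to give the "value" group a pattern; as written it is empty, so it always captures an empty string. I've specified it as [^<]*, which matches everything up to the next "<".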
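For reference, here is a minimal sketch of the whole method with that pattern applied. The class name KeyFinder, the Regex.Escape call, the Trim() on the value, and the indexer assignment (in place of Add inside an empty try/catch) are my own additions for illustration, not part of the original code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class KeyFinder // hypothetical container class for the sketch
{
    public static Dictionary<string, string> FindKeys(IEnumerable<string> keywords, string source)
    {
        var found = new Dictionary<string, string>();

        // Escape each keyword so regex metacharacters in a key are matched literally.
        var keys = string.Join("|", keywords.Select(Regex.Escape));

        // Skip the closing </th>, any whitespace, and the opening <td>,
        // then capture everything up to the next '<' as the value.
        var pattern = @"\b(?<key>" + keys + @"):\s*</th>[^<]*<td>(?<value>[^<]*)";

        foreach (Match m in Regex.Matches(source, pattern))
        {
            // Indexer assignment keeps the last value if a key repeats, instead of throwing.
            found[m.Groups["key"].Value] = m.Groups["value"].Value.Trim();
        }
        return found;
    }
}
```

Calling KeyFinder.FindKeys(new[] { "X", "Y" }, html) on the fragment from the question should give "343" for X and "6,995 sq ft / 0.16 acres" for Y. The pattern still assumes that each <td> directly follows its <th> and has no attributes or nested tags.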

As I'm sure you know, parsing HTML with a regex is never fun. Your current regex does not come close to capturing what you're looking for. As such, I would recommend a few possible alternatives:

Option 1 - If adding a library is acceptable, use the Html Agility Pack. It's blazing fast and very accurate.

Option 2 - If you're looking for a lighter-weight solution, these source files contain a regex parser for XML/HTML. To use it directly, implement IXmlLightReader and then call the XmlLightParser.Parse method. If your document is a complete HTML document and not a fragment, you can also use HtmlLightDocument as follows:

HtmlLightDocument doc = new HtmlLightDocument(@"<html> ... </html>");
foreach(XmlLightElement row in doc.Select(@"//tr"))
    found.Add(
        row.SelectSingleNode(@"th").InnerText, 
        row.SelectSingleNode(@"td").InnerText
    );

Option 3 - As always, if the HTML is XHTML-compliant, you can just use an XML parser.
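A rough sketch of Option 3 using LINQ to XML (System.Xml.Linq) from the base class library. The class name XhtmlRowReader, the method name ReadRows, and the wrapping "root" element are hypothetical names chosen for this example. It assumes the fragment is well-formed XML with no default namespace; a full XHTML document that declares xmlns would need namespace-qualified element names.

```csharp
using System.Collections.Generic;
using System.Xml.Linq;

public static class XhtmlRowReader // hypothetical helper for the sketch
{
    public static Dictionary<string, string> ReadRows(string fragment)
    {
        // The fragment has several top-level <tr> elements, so wrap it in a
        // single (hypothetical) root element to make it well-formed XML.
        var root = XElement.Parse("<root>" + fragment + "</root>");
        var found = new Dictionary<string, string>();

        foreach (var row in root.Descendants("tr"))
        {
            var th = row.Element("th");
            var td = row.Element("td");
            if (th == null || td == null) continue; // skip rows without a header/value pair

            // Drop the trailing colon so "X:" becomes the key "X".
            found[th.Value.Trim().TrimEnd(':')] = td.Value.Trim();
        }
        return found;
    }
}
```

Because this walks the element tree instead of matching text, it keeps working if whitespace or attributes on the tags change. XElement.Parse throws an XmlException on markup that is not well-formed.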