C# extracting certain parts of a string

Developer https://www.devze.com 2022-12-09 19:31 Source: Internet
I have a console application which is parsing HTML documents via the WebRequest method (http). The issue is really with extracting data from the html code that is returned.

Below is a fragment of the html I am interested in:

<span class="header">Number of People:</span>
<span class="peopleCount">1001</span>  <!-- this is the line we are interested in! -->
<span class="footer">As of June 2009.</span>

Assume that the above html is contained in a string called "responseHtml". I would like to just extract the 'People Count' value, (second line).

I've searched Stack Overflow and found some code that could work:

How do I extract text that lies between parentheses (round brackets)?

But when I implement it, it doesn't work - I don't think it likes the way I have placed HTML tags into the regex:

        string responseHtml; // this is already filled with html code above ^^
        string insideBrackets = null;


        Regex regex = new Regex("\\<span class=\"peopleCount\">?<TextInsideBrackets>\\w+\\</span>");

        Match match = regex.Match(responseHtml);
        if (match.Success)
        {
            insideBrackets = match.Groups["TextInsideBrackets"].Value;
            Console.WriteLine(insideBrackets);
        }

The above just fails to work. Is it something to do with the html span brackets? All I want is the text value in between the tags for that specific span.

Thanks in advance!


Try this one:

Regex regex = new Regex("class=\\\"peopleCount\\\"\\>(?<data>[^\\<]*)",
RegexOptions.CultureInvariant
| RegexOptions.Compiled
);

It should be a tad faster, as you are basically saying the data you are looking for starts after peopleCount"> and ends at the first <.

(I changed the group name to data)
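
For reference, here is a minimal sketch of how a pattern like this might be used end to end. The class name PeopleCountDemo and the helper ExtractPeopleCount are hypothetical, and the sample markup is just the fragment from the question. The pattern is written as a verbatim string with the unnecessary escapes dropped, which should be equivalent. It assumes the attribute is written exactly as class="peopleCount" with no extra whitespace or attributes after it.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PeopleCountDemo
{
    // Equivalent to the pattern above, as a verbatim string ("" is a literal quote).
    // The named group "data" captures everything up to the next '<'.
    private static readonly Regex PeopleCountRegex = new Regex(
        @"class=""peopleCount"">(?<data>[^<]*)",
        RegexOptions.CultureInvariant | RegexOptions.Compiled);

    // Hypothetical helper: returns the captured text, or null when nothing matches.
    public static string ExtractPeopleCount(string responseHtml)
    {
        Match match = PeopleCountRegex.Match(responseHtml);
        return match.Success ? match.Groups["data"].Value.Trim() : null;
    }

    public static void Main()
    {
        // Sample markup taken from the question.
        string responseHtml =
            "<span class=\"header\">Number of People:</span>\n" +
            "<span class=\"peopleCount\">1001</span>\n" +
            "<span class=\"footer\">As of June 2009.</span>";

        string raw = ExtractPeopleCount(responseHtml);
        int count;
        if (raw != null && int.TryParse(raw, out count))
        {
            Console.WriteLine(count); // prints 1001
        }
        else
        {
            Console.WriteLine("peopleCount span not found");
        }
    }
}
```

Returning null on a failed match keeps the "not found" case separate from an empty span, which would come back as an empty string.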

Cheers, Florian


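?<TextInsideBrackets> is incorrect. Without the enclosing parentheses it is not a named group: the ? just makes the preceding > optional, and <TextInsideBrackets> is matched as literal text.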

You need:

(?<TextInsideBrackets>...)


I assume you want to do a named capture.

You should use

Regex regex = new Regex("\\<span class=\"peopleCount\">(?<TextInsideBrackets>\\w+)\\</span>");

and not

Regex regex = new Regex("\\<span class=\"peopleCount\">?<TextInsideBrackets>\\w+\\</span>");
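
To see the difference between the two patterns, here is a small sketch that runs both against the peopleCount line from the question. The class name NamedGroupDemo and the TryExtract helper are made up for illustration; the two pattern strings are copied from above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NamedGroupDemo
{
    // Hypothetical helper: applies a pattern and reports the named group's value.
    public static bool TryExtract(string pattern, string html, out string value)
    {
        Match match = Regex.Match(html, pattern);
        value = match.Success ? match.Groups["TextInsideBrackets"].Value : null;
        return match.Success;
    }

    public static void Main()
    {
        // The line of interest from the question's fragment.
        string responseHtml = "<span class=\"peopleCount\">1001</span>";

        // Without parentheses: ">?" is an optional '>' followed by literal text.
        string broken = "\\<span class=\"peopleCount\">?<TextInsideBrackets>\\w+\\</span>";
        // With parentheses: a proper named capture group.
        string corrected = "\\<span class=\"peopleCount\">(?<TextInsideBrackets>\\w+)\\</span>";

        string value;
        Console.WriteLine(TryExtract(broken, responseHtml, out value));    // False
        Console.WriteLine(TryExtract(corrected, responseHtml, out value)); // True
        Console.WriteLine(value);                                          // 1001
    }
}
```

Note that \w+ only matches letters, digits and underscores, so a value such as 1,001 would not match this pattern.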
