Regex matching HTML option tags which are unselected and also selected

Developer https://www.devze.com 2023-01-13 18:48 Source: Internet
Can someone recommend a regex that returns the value whether an item is selected or unselected, as seen below?

<option value="32_1002_ACCT1001" selected="selected">ACCT1001 -- Accounting 1a</option>
<option value="32_1002_ACCT1002">ACCT1002 -- Accounting 1b</option>

My regex currently works only for the unselected option seen below.

(<option value="([^"]+)">([^<]+)<\/option>)

EDIT:

Thanks for the great responses, guys. However, I should have been a bit more detailed and specific.

I am using it in a screen-scraper extractor pattern as follows:

<option value="~@COURSE_ID@~">~@COURSE_CODE@~ -- ~@COURSE_NAME@~</option>

where ~@COURSE_ID@~ specifies the following regex query:

([^"]+)

This works fine for all option tags EXCEPT the first one, which is already selected.

I am testing out your suggestions at the moment, but if anyone wants to jump in with a surefire solution, that would be great.

I'm really struggling with this one, nothing seems to be working!


First, it's a bad idea to use regex for parsing HTML. Use an HTML parser. (I am tired of writing this, but I put it as the first sentence because people tend to downvote immediately without this statement :) )

Anyway, just modify your regex to allow for any attributes, like this:

(<option[^>]*?>([^<]+)<\/option>)

Well, I don't say it's an optimal one; it's just yours with minimal modifications.
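
As written, the pattern above captures only the option text, not the value. A minimal sketch of the same idea that keeps the value group from the question's pattern follows. Python is used purely for illustration, and the names `OPTION_RE` and `extract_options` are hypothetical. It assumes `value` is the first attribute and is double-quoted, as in the sample markup.

```python
import re

# Hypothetical pattern: value="..." must come first, then any other
# attributes (such as selected="selected") are skipped by [^>]*.
OPTION_RE = re.compile(r'<option value="([^"]+)"[^>]*>([^<]+)</option>')

def extract_options(html):
    """Return (value, label) pairs for every <option> the pattern matches."""
    return OPTION_RE.findall(html)

# The two option tags from the question.
sample = (
    '<option value="32_1002_ACCT1001" selected="selected">ACCT1001 -- Accounting 1a</option>\n'
    '<option value="32_1002_ACCT1002">ACCT1002 -- Accounting 1b</option>'
)

for value, label in extract_options(sample):
    print(value, "|", label)
```

This still breaks if the attribute order or quoting changes, which is the usual argument for a parser.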


I agree with Kobi, but if you really want to use a regex, here is a solution in Perl:

#!/usr/bin/perl
use strict;
use warnings;

while (<DATA>) {
    print $_;
    if (/^(<option value="([^"]+).*?(?:selected="selected")?.*)$/) {
        print "match\t value=$2\n";
    } else {
        print "NOT match\n";
    }
}

__DATA__
<option value="32_1002_ACCT1001" selected="selected">ACCT1001 -- Accounting 1a</option>
<option value="32_1002_ACCT1002">ACCT1002 -- Accounting 1b</option>

Output:

<option value="32_1002_ACCT1001" selected="selected">ACCT1001 -- Accounting 1a</option>
match    value=32_1002_ACCT1001
<option value="32_1002_ACCT1002">ACCT1002 -- Accounting 1b</option>
match    value=32_1002_ACCT1002


Here's an alternative way to load these values in C# using the Html Agility Pack:

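using System.Collections.Generic;
using System.Linq;          // needed for Select
using HtmlAgilityPack;

HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://jsbin.com/unasu/");
// Note: SelectNodes returns null when no node matches the XPath.
HtmlNodeCollection options = doc.DocumentNode.SelectNodes("//option[@value]");
IEnumerable<string> values = options.Select(o => o.Attributes["value"].Value);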

Loading a local file, for completeness, is done using:

HtmlDocument doc = new HtmlDocument();
doc.Load(@"c:\file.html");

As you can see, this solution is a lot more robust than a regex: it copes with most real-world markup and doesn't care about attribute order, quote style (single, double, or none), or many other common variations.
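
For comparison, the same parser-based approach can be sketched with only Python's standard library. This is an illustration under assumptions, not a port of the C# code: the names `OptionValueParser` and `option_values` are hypothetical, and the second and third option tags in the sample are made-up variants of the question's markup that reorder attributes and change the quoting.

```python
from html.parser import HTMLParser

class OptionValueParser(HTMLParser):
    """Collect the value attribute of every <option> start tag."""

    def __init__(self):
        super().__init__()
        self.values = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; quotes are already stripped.
        if tag == "option":
            attr_map = dict(attrs)
            if "value" in attr_map:
                self.values.append(attr_map["value"])

def option_values(html):
    """Return the value of each <option value=...> found in html."""
    parser = OptionValueParser()
    parser.feed(html)
    parser.close()
    return parser.values

# First tag is from the question; the other two are illustrative variants
# (attribute order swapped, single quotes, no quotes).
sample = """<select>
<option value="32_1002_ACCT1001" selected="selected">ACCT1001 -- Accounting 1a</option>
<option selected='selected' value='32_1002_ACCT1002'>ACCT1002 -- Accounting 1b</option>
<option value=32_1002_ACCT1002>ACCT1002 -- Accounting 1b</option>
</select>"""

print(option_values(sample))
```

All three forms yield their value without any change to the extraction code, which is the robustness the answer is describing.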
