Scraping digit values from a webpage?

Developer https://www.devze.com 2023-02-09 05:19 Source: Internet
I would like to scrape 17 values from a website.

This is the url of the page with the data: http://www.bungie.net/stats/reach/online.aspx

On the lower left side of the page there is an unordered list titled "ONLINE PLAYLIST". I want to scrape the number of players from each list item that contains that information. The number should contain only digits, i.e. no commas.


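<?php
// Fetch the page source as a string.
$c = curl_init();
curl_setopt_array($c, array(
    CURLOPT_URL => 'http://www.bungie.net/stats/reach/online.aspx',
    CURLOPT_RETURNTRANSFER => true,
));
$r = curl_exec($c);
curl_close($c);
if ($r === false) {
    die('Request failed');
}

// Capture each playlist name and its player count (which may contain commas).
preg_match_all('|([^<>]+)</a> </h4>\s*([0-9,]+) Players|s', $r, $m);
$teams = array_combine($m[1], $m[2]);
// Strip the thousands separators so only digits remain.
foreach ($teams as &$v) $v = str_replace(',', '', $v);
unset($v); // break the reference left behind by the foreach
echo '<pre>'.print_r($teams, 1).'</pre>';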

Output at the moment:

Array
(
    [NOBLE MAP PACK] => 997
    [RUMBLE PIT] => 4117
    [LIVING DEAD] => 6638
    [TEAM SLAYER] => 7730
    [MLG] => 586
    [TEAM SWAT] => 6358
    [TEAM SNIPERS] => 2145
    [TEAM OBJECTIVE] => 758
    [MULTI TEAM] => 1707
    [BIG TEAM BATTLE] => 5706
    [INVASION] => 2881
    [FIREFIGHT] => 2780
    [SCORE ATTACK] => 1121
    [CO-OP CAMPAIGN] => 695
    [TEAM ARENA] => 393
    [DOUBLES ARENA] => 680
    [FFA ARENA] => 120
)

Edit: Fixed the name capturing group so that "CO-OP" would be captured, instead of just "OP".
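
To illustrate why the name group matters, here is a minimal sketch run against a made-up snippet. Both the snippet and the narrower `[A-Z ]+` class are assumptions for illustration only; the pattern that was actually used before the edit is not shown above.

```php
<?php
// Hypothetical snippet shaped like what the pattern above expects.
$snippet = '<h4><a href="#">CO-OP CAMPAIGN</a> </h4> 695 Players';

// An assumed narrower name class without a hyphen: the match can only
// begin after "CO-", so part of the name is lost.
preg_match('|([A-Z ]+)</a> </h4>\s*([0-9,]+) Players|', $snippet, $narrow);

// [^<>]+ accepts any character except angle brackets, hyphen included.
preg_match('|([^<>]+)</a> </h4>\s*([0-9,]+) Players|', $snippet, $wide);

echo $narrow[1], "\n"; // OP CAMPAIGN
echo $wide[1], "\n";   // CO-OP CAMPAIGN
```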


Seems to me that a bit of regex is all you need here. I did something like this in Perl recently; it was not terribly tricky, and it is well documented online with many useful threads and tutorials.

Inspecting the page, it looks like each list item is assigned a class called "glowBox". I'd try getting the full source of the page and then filtering it so you have only the sections that begin with this class. Alternatively, you could use a lookahead or lookbehind to check that the number is preceded or followed by ". Once you've got it narrowed down, you'll need a capture group to pull in the number as something you can use later. In Perl, captured strings are automatically assigned to the variables $1, $2, $3, etc. If you just loop through each line of the unordered list performing the regex, you should only need $1 to capture the number.
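
The filter-then-capture idea could be sketched roughly as follows. Note that this swaps in PHP's DOMDocument and DOMXPath for the filtering step instead of a pure regex, and the sample markup and the function name `playlistCounts` are assumptions; the real page structure may well differ.

```php
<?php
// Hypothetical markup: the structure of the real page is an assumption.
$html = <<<HTML
<ul>
  <li class="glowBox"><h4><a href="#">TEAM SLAYER</a> </h4> 7,730 Players</li>
  <li class="glowBox"><h4><a href="#">CO-OP CAMPAIGN</a> </h4> 695 Players</li>
  <li class="other">No player count here</li>
</ul>
HTML;

// Hypothetical helper: returns playlist name => player count (as integers).
function playlistCounts(string $html): array
{
    $doc = new DOMDocument();
    // Suppress warnings about imperfect real-world HTML.
    @$doc->loadHTML($html);
    $xpath = new DOMXPath($doc);

    $counts = array();
    // Select every element whose class attribute contains the token "glowBox".
    $query = '//*[contains(concat(" ", normalize-space(@class), " "), " glowBox ")]';
    foreach ($xpath->query($query) as $item) {
        // Look only at the item's text; skip items without a player count.
        if (preg_match('/^\s*(.+?)\s+([\d,]+) Players/s', $item->textContent, $m)) {
            $counts[$m[1]] = (int) str_replace(',', '', $m[2]);
        }
    }
    return $counts;
}

print_r(playlistCounts($html));
```

Restricting the regex to one list item at a time keeps the pattern short, since it no longer has to describe the surrounding tags.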

Your capture group might look like this: (\d+)

The parentheses make it a capture group, \d matches only digit characters, and the + means that \d must match at least once for anything to be captured. Not sure what your requirements are, but if you need both the name and the number, Perl makes it a breeze to scrape the page for the necessary data and turn it into a hash of key/value pairs.
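
As a small illustration in PHP (the language of the answer above), here is a sketch of that capture group on a made-up line of text. The sample string is an assumption. One thing to watch: (\d+) on its own stops at the first comma, so the group has to allow commas if the page prints counts like "4,117".

```php
<?php
// Hypothetical list-item text; the real page markup may differ.
$line = 'RUMBLE PIT 4,117 Players';

// (\d+) alone stops at the first non-digit, so it captures only "4".
preg_match('/(\d+)/', $line, $m);
$firstRun = $m[1];

// Allowing commas inside the group captures the whole figure,
// and str_replace() then removes the separators.
preg_match('/([\d,]+) Players/', $line, $m);
$digitsOnly = str_replace(',', '', $m[1]);

echo $firstRun, "\n";   // 4
echo $digitsOnly, "\n"; // 4117
```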

Definitely check out http://www.regexr.com, sort of the regex equivalent of a CSS zen garden. You can paste the full page source into it and play with regular expressions until you have one that matches what you want, and only what you want. For more info and explanation of regular expressions' weird syntax, start here, and obviously, use Google.

Edit: too late, it seems.
