
Remove from scrape

Developer https://www.devze.com 2023-03-09 21:52 Source: web
Hey all, I've successfully created a website scraper that gets the top 40 from the record industry website. However, one of the columns in the table I'm scraping might sometimes not be there. Basically, what I need is a way to remove any instances of this from my scrape:

<td><img src="/images/bullet_red.gif" width="8" height="8" title="Red Dot" /></td>

Here's what I've got from a tutorial so far.

$url = "http://www.ariacharts.com.au/pages/charts_display_singles.asp?chart=1U50";
$raw = file_get_contents($url);
$newlines = array("\t","\n","\r","\x20\x20","\0","\x0B");

$content = str_replace($newlines, "", html_entity_decode($raw));

$start = strpos($content,'<table class="chartTable"');
$end = strpos($content,'</table>',$start) + 8;

$table = substr($content,$start,$end-$start);

preg_match_all("|<tr(.*)</tr>|U",$table,$rows);

foreach ($rows[0] as $row){

if ((strpos($row,'<th')===false)){

    preg_match_all("|<td(.*)</td>|U",$row,$cells);

    $number = strip_tags($cells[0][1]);

    $name = strip_tags($cells[0][5]);

    $artist = strip_tags($cells[0][6]);

    $name = strtolower($name);
    $name = ucwords($name);

    echo "{$artist} - {$name} - Number {$number} <br>\n";

}

}


Try using PHP Simple HTML DOM Parser instead of complex regexes: http://simplehtmldom.sourceforge.net/

require_once 'simple_html_dom.php';

$html = file_get_html('http://www.ariacharts.com.au/pages/charts_display_singles.asp?chart=1U50');
$table = $html->find('table.chartTable');

foreach ($table[0]->find('tr') as $row) {
    $columns = $row->find('td');
    if (count($columns) < 8) continue; // $columns[7] is read below, so a row needs at least 8 cells

    $number = $columns[1]->plaintext;
    $name = ucwords($columns[6]->plaintext);
    $artist = $columns[7]->plaintext;

    echo "$artist - $name - Number $number <br />\n";
}
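
If adding a third-party library is not an option, roughly the same idea can be sketched with PHP's built-in DOMDocument and DOMXPath. This is a swapped-in technique, not what the answer above uses. The sketch drops every cell that holds the red dot image before reading cells by position, so the indexes stay the same whether or not the dot is present. The function name `chartRows` and the sample markup are assumptions for illustration; the real page layout and column positions may differ.

```php
<?php
// Hypothetical sample markup; the real chart table has more columns.
$html = '<table class="chartTable">'
      . '<tr><th>Pos</th><th>Title</th><th>Artist</th></tr>'
      . '<tr><td>1</td><td><img src="/images/bullet_red.gif" width="8" height="8" title="Red Dot" /></td>'
      . '<td>SOME SONG</td><td>Some Artist</td></tr>'
      . '<tr><td>2</td><td>OTHER SONG</td><td>Other Artist</td></tr>'
      . '</table>';

// chartRows() is an illustrative helper, not part of the original script.
function chartRows(string $html): array
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);

    // Copy the matches into an array first so removal cannot disturb the iteration.
    $dots = iterator_to_array($xpath->query('//td[img[contains(@src, "bullet_red.gif")]]'));
    foreach ($dots as $td) {
        $td->parentNode->removeChild($td);
    }

    $rows = [];
    // tr[td] skips header rows that contain only <th> cells.
    foreach ($xpath->query('//table[@class="chartTable"]//tr[td]') as $tr) {
        $cells = [];
        foreach ($xpath->query('td', $tr) as $td) {
            $cells[] = trim($td->textContent);
        }
        $rows[] = $cells;
    }
    return $rows;
}

$rows = chartRows($html);
foreach ($rows as $cells) {
    $name = ucwords(strtolower($cells[1]));
    echo "{$cells[2]} - {$name} - Number {$cells[0]} <br>\n";
}
```

For the live page, the string passed to `chartRows` would be the result of `file_get_contents($url)`, and the cell indexes would need adjusting to the real column order.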


For the quick and dirty method you want, put this line before the line that assigns $start:

$content = str_replace('<td><img src="/images/bullet_red.gif" width="8" height="8" title="Red Dot" /></td>', '', $content);
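
The exact-string replacement only matches when the cell markup is byte-for-byte identical, and the script has already stripped tabs, newlines and double spaces from `$content` by that point. A slightly more tolerant variant, still quick and dirty, could use a regular expression that removes any cell whose only content is the bullet_red.gif image. The helper name `stripRedDotCells` and the sample row are assumptions for illustration.

```php
<?php
// stripRedDotCells() is an illustrative helper, not part of the original script.
function stripRedDotCells(string $content): string
{
    // A <td> containing nothing but an <img> whose tag mentions bullet_red.gif,
    // regardless of attribute order or whitespace around the image.
    $pattern = '~<td[^>]*>\s*<img[^>]*bullet_red\.gif[^>]*>\s*</td>~i';

    // preg_replace returns null on error; fall back to the unchanged input.
    return preg_replace($pattern, '', $content) ?? $content;
}

// Hypothetical sample row.
$sample = '<tr><td>1</td><td><img src="/images/bullet_red.gif" width="8" height="8" title="Red Dot" /></td><td>Song</td></tr>';
echo stripRedDotCells($sample), "\n";
```

In the original script this would be called as `$content = stripRedDotCells($content);` at the same place as the `str_replace` line.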