Remove from scrape

Developer https://www.devze.com 2023-03-09 21:52 Source: Internet

Hey all, I've successfully created a website scraper that gets the top 40 from the record industry website. However, one of the columns in the table I'm scraping is sometimes missing. What I need is a way to remove any instances of this cell from my scrape:

<td><img src="/images/bullet_red.gif" width="8" height="8" title="Red Dot" /></td>

Here's what I've got from a tutorial so far.

$url = "http://www.ariacharts.com.au/pages/charts_display_singles.asp?chart=1U50";
$raw = file_get_contents($url);
$newlines = array("\t","\n","\r","\x20\x20","\0","\x0B");

$content = str_replace($newlines, "", html_entity_decode($raw));

$start = strpos($content,'<table class="chartTable"');
$end = strpos($content,'</table>',$start) + 8;

$table = substr($content,$start,$end-$start);

preg_match_all("|<tr(.*)</tr>|U",$table,$rows);

foreach ($rows[0] as $row){

if ((strpos($row,'<th')===false)){

    preg_match_all("|<td(.*)</td>|U",$row,$cells);

    $number = strip_tags($cells[0][1]);

    $name = strip_tags($cells[0][5]);

    $artist = strip_tags($cells[0][6]);

    $name = strtolower($name);
    $name = ucwords($name);

    echo "{$artist} - {$name} - Number {$number} <br>\n";

}

}

Try using PHP Simple HTML DOM Parser instead of a complex regex: http://simplehtmldom.sourceforge.net/

require_once 'simple_html_dom.php';

$html = file_get_html('http://www.ariacharts.com.au/pages/charts_display_singles.asp?chart=1U50');
$table = $html->find('table.chartTable');

foreach ($table[0]->find('tr') as $row) {
    $columns = $row->find('td');
    if (sizeof($columns) < 8) continue; // $columns[7] is read below, so at least 8 cells are needed

    $number = $columns[1]->plaintext;
    $name = ucwords($columns[6]->plaintext);
    $artist = $columns[7]->plaintext;

    echo "$artist - $name - Number $number <br />\n";
}
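
If you would rather not pull in an extra library, the same idea can be sketched with PHP's built-in DOMDocument and DOMXPath (a different parser from the Simple HTML DOM one above). The markup below is a made-up stand-in for the chart table, and the cell order (number, name, artist) is an assumption, so adjust the indices to the real page. The XPath filter simply skips any cell that contains the red dot image, so the remaining cells line up whether or not that column is present.

```php
<?php
// Hypothetical markup standing in for the real chart table.
$html = '<table class="chartTable">'
      . '<tr><td>1</td><td><img src="/images/bullet_red.gif" title="Red Dot" /></td>'
      . '<td>Song A</td><td>Artist A</td></tr>'
      . '<tr><td>2</td><td>Song B</td><td>Artist B</td></tr>'
      . '</table>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$rows = [];
foreach ($xpath->query('//table[@class="chartTable"]//tr') as $tr) {
    $cells = [];
    // Select only the <td> children that do NOT contain the red dot image.
    $query = 'td[not(img[contains(@src, "bullet_red.gif")])]';
    foreach ($xpath->query($query, $tr) as $td) {
        $cells[] = trim($td->textContent);
    }
    $rows[] = $cells;
}

foreach ($rows as $cells) {
    // Assumed order in this sample: number, name, artist.
    echo "{$cells[2]} - {$cells[1]} - Number {$cells[0]} <br />\n";
}
```

With the sample markup, both rows end up with three cells each, so the same indices work for every row.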

For the quick and dirty method you want, put this code before you declare the "start" variable:

$content = str_replace('<td><img src="/images/bullet_red.gif" width="8" height="8" title="Red Dot" /></td>', '', $content);
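
To see why this works, here is a small sketch on two made-up rows, one with the red dot cell and one without. The row markup and the extractCells helper are hypothetical and only there for illustration; the regex is the one from the question. Once the optional cell is stripped, both rows yield the same cell positions, so fixed indices stay valid.

```php
<?php
// The optional cell, copied from the question.
$redDot = '<td><img src="/images/bullet_red.gif" width="8" height="8" title="Red Dot" /></td>';

// Hypothetical sample rows; the real table has more columns.
$rowWithDot    = '<tr><td>1</td>' . $redDot . '<td>Song</td><td>Artist</td></tr>';
$rowWithoutDot = '<tr><td>1</td><td>Song</td><td>Artist</td></tr>';

// Hypothetical helper: strip the optional cell, then split the row into cells.
function extractCells(string $row, string $redDot): array
{
    // Remove the red dot cell first so the remaining cells keep fixed positions.
    $row = str_replace($redDot, '', $row);
    preg_match_all('|<td(.*)</td>|U', $row, $cells);
    return array_map('strip_tags', $cells[0]);
}

$a = extractCells($rowWithDot, $redDot);
$b = extractCells($rowWithoutDot, $redDot);

echo implode(' | ', $a), "\n";
echo implode(' | ', $b), "\n";
```

Note that str_replace only matches the exact string, so if the site changes the attributes or spacing of that cell, the replacement will silently stop matching.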
