Greetings. I have a PHP script that is supposed to scrape a wholesaler's website for product information and enter that information into a database.
I have successfully collected all information for a sample product, and when doing a simple echo of all $v variables, everything outputs to the screen correctly.
Now, after I add the check to see if the categories of the products exist in the database, and actually insert the information, I get:
[phpBB Debug] PHP Notice: in file /rip.php on line 35: Trying to get property of non-object
[phpBB Debug] PHP Notice: in file /rip.php on line 36: Trying to get property of non-object
[phpBB Debug] PHP Notice: in file /rip.php on line 38: Undefined offset: 3
[phpBB Debug] PHP Notice: in file /rip.php on line 38: Undefined offset: 2
[phpBB Debug] PHP Notice: in file /rip.php on line 41: Trying to get property of non-object
Fatal error: Call to a member function find() on a non-object in /XXXXX/public_html/XXXXX/rip.php on line 42
However, all of the product's information is still entered into the database.
The script is supposed to go page by page, gathering info, but stops after the first product.
I am using S.C. Chen's Simple HTML DOM scraper script (http://sourceforge.net/projects/simplehtmldom/) and phpBB's core system for database calls. Here is my PHP source:
<?php
define('IN_PHPBB', true);
$phpbb_root_path = (defined('PHPBB_ROOT_PATH')) ? PHPBB_ROOT_PATH : './';
$phpEx = substr(strrchr(__FILE__, '.'), 1);
include($phpbb_root_path . 'common.' . $phpEx);
include($phpbb_root_path . 'includes/simple_html_dom.' . $phpEx);
// Start session management
$user->session_begin();
$auth->acl($user->data);
$user->setup();
function save($in, $out)
{
$tempDir = './rip_images';
$finalDir = $out;
$imageUrl = $in;
$file = basename($imageUrl);
exec("cd $tempDir && wget --quiet $imageUrl");
if (rename("$tempDir/$file", "$finalDir") === false) {
die('Failed while trying to move image file from temp dir to final dir');
}
}
function scrape($i)
{
$html = file_get_html('XXXXXXXXX.com/index.php?main_page=product_info&products_id='.$i.'&zenid=e4b7dde8de02e1df005d4549e2e3e529');
foreach($html->find('body') as $html)
{
$item['title'] = trim($html->find('#productName', 0)->plaintext);
$item['price'] = trim($html->find('#productPrices', 0)->plaintext);
$item['cat'] = $html->find('#navBreadCrumb', 0)->plaintext;
list($home, $item['cat'], $item['subcat'], $title) = explode("::", $item['cat']);
$item['cat'] = str_replace(" ", "", $item['cat']);
$item['subcat'] = str_replace("\n", "", str_replace(" ", "", $item['subcat']));
$item['desc'] = trim($html->find('#productDescription', 0)->plaintext);
$item['model'] = $html->find('ul#productDetailsList', 0)->find('li', 0)->plaintext;
$item['model'] = explode(":", $item['model']);
$item['model'] = trim($item['model'][1]);
$item['manufacturer'] = $html->find('ul#productDetailsList', 0)->find('li', 1)->plaintext;
$item['manufacturer'] = explode(":", $item['manufacturer']);
$item['manufacturer'] = trim($item['manufacturer'][1]);
foreach($html->find('img') as $img)
{
if($img->alt == $item['title'])
{
$item['img_sm'] = $img->src;
$thumb_width = $img->width;
$thumb_height = $img->height;
}
}
$sm_img_src = "http://XXXXXXXXXX.com/".$item['img_sm'];
$lg_img_src = "http://XXXXXXXXXX.com/index.php?main_page=popup_image&pID=".$i;
$ext = strrchr($item['img_sm'], '.');
$filename = $item['model'] . $ext;
$new_sm = "./rip_images/small/{$filename}";
$new_lg = "./rip_images/large/{$filename}";
$item['image'] = $filename;
$file = file_get_contents($lg_img_src);
$f = fopen($new_lg,'w+');
fwrite($f,$file);
fclose($f);
save($sm_img_src,$new_sm);
$ret[] = $item;
}
$html->clear();
unset($html);
return $ret;
}
$i = 1;
$end = 9999999;
while($i < $end)
{
$ret = scrape($i);
foreach($ret as $v)
{
$item['title'] = $v['title'];
$item['price'] = $v['price'];
$item['desc'] = $v['desc'];
$item['model'] = $v['model'];
$item['manufacturer'] = $v['manufacturer'];
$item['image'] = $v['image'];
$item['cat'] = $v['cat'];
$item['subcat'] = $v['subcat'];
}
//see if parent cat exists
$sql = 'SELECT cat_id FROM ' . SHOP_CAT_TABLE . ' WHERE cat_name = "'.$db->sql_escape($item['cat']).'"';
$result = $db->sql_query($sql);
$parent = $db->sql_fetchrow($result);
// if not exists
if($parent['cat_id'] == '')
{
//add the parent cat to the db
$sql_ary = array(
'cat_name' => $item['cat'],
'cat_parent' => 0
);
$sql = 'INSERT INTO '.SHOP_CAT_TABLE.' '.$db->sql_build_array('INSERT', $sql_ary);
$db->sql_query($sql);
$cat_id = $db->sql_nextid();
//see if subcat exists
$sql = 'SELECT cat_id FROM ' . SHOP_CAT_TABLE . ' WHERE cat_name = "'.$db->sql_escape($item['subcat']).'"';
$result = $db->sql_query($sql);
$row = $db->sql_fetchrow($result);
// if not exists
if($row['cat_id'] == '')
{
//add subcat to db
$sql_ary = array(
'cat_name' => $db->sql_escape($item['subcat']),
'cat_parent' => $cat_id
);
$sql = 'INSERT INTO '.SHOP_CAT_TABLE.' '.$db->sql_build_array('INSERT', $sql_ary);
$db->sql_query($sql);
$item_cat = $db->sql_nextid();
}
else //if exists
{
$item_cat = $row['cat_id'];
}
}
else //if parent cat exists
{
//see if subcat exists
$sql = 'SELECT cat_id FROM ' . SHOP_CAT_TABLE . ' WHERE cat_name = "'.$db->sql_escape($item['subcat']).'"';
$result = $db->sql_query($sql);
$row = $db->sql_fetchrow($result);
// if not exists
if($row['cat_id'] == '')
{
//add the subcat to the db
$sql_ary = array(
'cat_name' => $db->sql_escape($item['subcat']),
'cat_parent' => $parent['cat_id']
);
$sql = 'INSERT INTO '.SHOP_CAT_TABLE.' '.$db->sql_build_array('INSERT', $sql_ary);
$db->sql_query($sql);
$item_cat = $db->sql_nextid();
}
else //if exists
{
$item_cat = $row['cat_id'];
}
}
$sql_ary = array(
'item_title' => $db->sql_escape($item['title']),
'item_price' => $db->sql_escape($item['price']),
'item_desc' => $db->sql_escape($item['desc']),
'item_model' => $db->sql_escape($item['model']),
'item_manufacturer' => $db->sql_escape($item['manufacturer']),
'item_image' => $db->sql_escape($item['image']),
'item_cat' => $db->sql_escape($item_cat)
);
$sql = 'INSERT INTO ' . SHOP_ITEM_TABLE . ' ' . $db->sql_build_array('INSERT', $sql_ary);
$db->sql_query($sql);
$i++;
}
?>
Any suggestions on how to clear these notices/errors and get the script to iterate through the pages correctly? I'm almost positive that it's something very simple that I'm overlooking...
It doesn't stop; it dies because of the fatal error on line 42.
Warnings come from these lines:
35: $item['title'] = trim($html->find('#productName', 0)->plaintext);
36: $item['price'] = trim($html->find('#productPrices', 0)->plaintext);
37: $item['cat'] = $html->find('#navBreadCrumb', 0)->plaintext;
38: list($home, $item['cat'], $item['subcat'], $title) = explode("::", $item['cat']);
39: $item['cat'] = str_replace(" ", "", $item['cat']);
40: $item['subcat'] = str_replace("\n", "", str_replace(" ", "", $item['subcat']));
41: $item['desc'] = trim($html->find('#productDescription', 0)->plaintext);
42: $item['model'] = $html->find('ul#productDetailsList', 0)->find('li', 0)->plaintext;
Your script doesn't go from page to page; it loops over every products_id from 1 up to $end (9999999). Apparently there's no product with id 2, so this URL returns something unexpected:
index.php?main_page=product_info&products_id='.$i.'&zenid=e4b7dde8de02e1df005d4549e2e3e529
Since you are expecting that this page has ul#productDetailsList (and it does not) and call ->find() on it, the script dies, because you are calling a method on a null.
The solution is to check first whether the page has the expected selectors, and to extract title, price, cat and so on only if they are present.
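A minimal sketch of that check, assuming $html is any object offering find($selector, $index) the way Simple HTML DOM does. The helper names text_or_null() and extract_product() are hypothetical, not part of the library or of the original script:

```php
<?php
// Hypothetical helper: trimmed plaintext of a node, or null when find() matched nothing.
function text_or_null($node)
{
    return is_object($node) ? trim($node->plaintext) : null;
}

// Hypothetical helper: returns the product fields, or null as soon as a
// required selector is missing (for example on a page with no product).
function extract_product($html)
{
    $selectors = array(
        'title' => '#productName',
        'price' => '#productPrices',
        'desc'  => '#productDescription',
    );

    $item = array();
    foreach ($selectors as $field => $selector) {
        $text = text_or_null($html->find($selector, 0));
        if ($text === null) {
            return null; // selector not present, so this is not a product page
        }
        $item[$field] = $text;
    }

    // Only chain the second find() once the list itself is known to exist.
    $list  = $html->find('ul#productDetailsList', 0);
    $model = is_object($list) ? text_or_null($list->find('li', 0)) : null;
    if ($model === null) {
        return null;
    }
    $parts = explode(':', $model);
    $item['model'] = isset($parts[1]) ? trim($parts[1]) : '';

    return $item;
}
```

With something like this, the caller can test the return value against null and move on to the next products_id instead of dying on line 42.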
Why don't you just check that the objects you are retrieving are not null before trying to use them?
Your errors indicate that you're not looking at the correct HTML source.
Couldn't find property plaintext because #productName wasn't found.
Line 35: $item['title'] = trim($html->find('#productName', 0)->plaintext);
Couldn't find property plaintext because #productPrices wasn't found.
Line 36: $item['price'] = trim($html->find('#productPrices', 0)->plaintext);
Exploding on '::' didn't produce a third item
Line 38: list($home, $item['cat'], $item['subcat'], $title) = explode("::", $item['cat']);
Fatal: couldn't call find() on the result because ul#productDetailsList wasn't found.
Line 42: $item['model'] = $html->find('ul#productDetailsList', 0)->find('li', 0)->plaintext;
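For the "Undefined offset" notices on line 38 specifically, one option is to pad the exploded array before handing it to list(). This is only a sketch: split_breadcrumb() is a hypothetical name, and it strips spaces and newlines from both parts, which mirrors what the original str_replace() calls do (so a name like "Remote Control" still becomes "RemoteControl"):

```php
<?php
// Hypothetical helper: splits a "Home :: Cat :: Subcat :: Title" breadcrumb.
// array_pad() guarantees four elements, so list() never hits a missing offset.
function split_breadcrumb($crumb)
{
    $parts = array_pad(explode('::', (string) $crumb), 4, '');
    list($home, $cat, $subcat, $title) = $parts;

    // Same cleanup as the original script: remove spaces and newlines.
    $strip = array(' ', "\n");
    return array(
        'cat'    => str_replace($strip, '', $cat),
        'subcat' => str_replace($strip, '', $subcat),
    );
}
```

An empty 'cat' in the result is then a signal that the breadcrumb was missing or shorter than expected.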
I would add two things to your script, 1) debug dumps to log the raw html you are trying to parse, and 2) checks to see if the finds are successful.
1) around line 32:
// put the $url into a variable
$url = 'XXXXXXXXX.com/index.php?blahblah';
echo "$url<br>\n"; // echo it so you can see the progress
$html = file_get_html($url);
// log for debugging if problems:
$log_file = tempnam('/tmp/', 'rip_log_');
file_put_contents($log_file, $url . "\n" . $html);
2) add a check for each successful find. (example):
$test = $html->find('#productName', 0);
if ($test) {
$item['title'] = trim($test->plaintext);
} else {
echo "Could not find #productName";
// maybe call break?
break;
}
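Note that if scrape() bails out early like this, the calling loop also has to cope with an empty result, otherwise foreach($ret as $v) complains in turn. A rough sketch of such a loop follows; collect_items() is a hypothetical name, and $scrape stands for any callable that returns an array of items, or null/empty when the page has no product:

```php
<?php
// Hypothetical driver loop: skips ids for which the scraper found nothing.
function collect_items($start, $end, $scrape)
{
    $items = array();
    for ($i = $start; $i <= $end; $i++) {
        $found = call_user_func($scrape, $i);
        if (empty($found)) {
            continue; // no product at this id, try the next one
        }
        foreach ($found as $item) {
            $items[] = $item;
        }
    }
    return $items;
}
```

In the real script the database inserts would go where the items are appended, so that nothing is written for ids that have no product page.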