
What's the most efficient way to get this data, thousands of times?

Developer https://www.devze.com 2023-02-17 07:11 Source: Internet

What would be the best way to get the following data (the 4.0m after the </b> tag) using PHP's DOMDocument->loadHTML() system? I'm guessing some kind of CSS-style selector?

(LINE 240, always 240) <b>Current Price:</b> 4.0m

I have been looking around the documentation, but to be honest this is all completely alien to me! Furthermore, how would I be able to get this data for thousands of pages, from URLs such as:

http://site.com/q=item/viewitem.php?obj=11928

The minimum and maximum obj=# values are known (so I know how many pages I will need to scrape). I want to grab all of them, incrementally, and write the name, description, and price (I'm not terribly concerned about the percentage rise/drop yet) to a MySQL database, so I can read it from there and display it on my site.

Here is the main block of code that I am interested in:

<div class="subsectionHeader"> 
<h2> 
Item Name
</h2> 
</div> 
<div id="item_additional" class="inner_brown_box">  
Description of item goes here.
<br> 
<br> 
<b>Current Price:</b> 4.0m
<br><br> 
<b>Change in Price:</b><br> 
<span> 
<b>30 Days:</b> <span class="rise">+2.5%</span> 
</span> 
<span class="spaced_span"> 
<b>90 Days:</b> <span class="drop">-30.4%</span> 
</span> 
<span class="spaced-span"> 
<b>180 Days:</b> <span class="drop">-33.3%</span> 
</span> 
<br class="clear"> 
</div> </div> <div class="brown_box main_page"> 
<div class="subsectionHeader">

If anyone could provide any skeletal hints on how to go about this, it would be much appreciated!


Parsing HTML with regular expressions is usually a bad idea, but in your case it may be the right (and easy) way. It's fast enough, and arguably more flexible than chunking with strpos and plain-text patterns.

Try this example with source HTML given above:

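//checked with php 5.3.3
// $src holds the page HTML.
// Flags: s = dot matches newlines, i = case-insensitive, x = ignore whitespace in the pattern.
if (preg_match('#<h2>(?P<itemName>[^<]+)</h2>.*?<div[^>]+id=([\'"])item_additional(\2)[^>]*>\s*(?P<description>[^<]+).*?<b>\s*Current\s+Price\s?:?</b>\s*(?P<price>[^<]+)#six',$src, $matches))
{
    // Captured values keep their surrounding whitespace, so trim them before use.
    print_r(array_map('trim', $matches));
}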

Regular expressions might look too complex, but with documentation and nice tools like RegexBuddy or Expresso anyone can write simple ones ;)


You could use Simple HTML DOM Parser - http://simplehtmldom.sourceforge.net/

Extract the contents using:

echo file_get_html('http://www.google.com/')->plaintext; 

And then locate the 4.0m using a PHP string function such as strpos.
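
A minimal sketch of that last step, assuming the plain text has already been extracted. The sample string is a hand-written stand-in for the ->plaintext output, and priceFromText is a hypothetical helper name, not part of Simple HTML DOM Parser:

```php
<?php
// Stand-in for the plain text of one item page (assumed shape, not real output).
$text = "Item Name Description of item goes here. Current Price: 4.0m Change in Price: 30 Days: +2.5%";

// Hypothetical helper: returns the token that follows $label, or null if the label is absent.
function priceFromText($text, $label = 'Current Price:')
{
    $start = strpos($text, $label);
    if ($start === false) {
        return null; // label not present on this page
    }
    // Skip the label and any whitespace after it.
    $rest = ltrim(substr($text, $start + strlen($label)));
    // The price runs up to the next whitespace character.
    $length = strcspn($rest, " \t\r\n");
    return substr($rest, 0, $length);
}

echo priceFromText($text), "\n"; // prints 4.0m
```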


DOM parsing is the most robust way to do this.
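
A minimal sketch of the DOM route, assuming the page markup matches the snippet in the question. The sample HTML is trimmed from that snippet, extractItem is a hypothetical helper name, and the XPath expressions are one possible way to address the nodes:

```php
<?php
// Sample markup copied from the question; a real page would be downloaded first.
$html = <<<HTML
<div class="subsectionHeader">
<h2>
Item Name
</h2>
</div>
<div id="item_additional" class="inner_brown_box">
Description of item goes here.
<br>
<br>
<b>Current Price:</b> 4.0m
<br><br>
<b>Change in Price:</b><br>
</div>
HTML;

// Hypothetical helper: pulls name, description and price out of one page.
function extractItem($html)
{
    // Real-world HTML is rarely valid; collect libxml warnings instead of printing them.
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);

    // string(...) returns the text of the first matching node, or '' if none matches.
    $name = $xpath->evaluate('string(//div[@class="subsectionHeader"]/h2)');
    // The description is the first text node directly inside the box.
    $description = $xpath->evaluate('string(//div[@id="item_additional"]/text()[1])');
    // The price is the text node right after the "Current Price:" <b> element.
    $price = $xpath->evaluate(
        'string(//div[@id="item_additional"]/b[normalize-space(.)="Current Price:"]/following-sibling::text()[1])'
    );

    return array(
        'name' => trim($name),
        'description' => trim($description),
        'price' => trim($price),
    );
}

print_r(extractItem($html));
```

Addressing the price relative to its label, rather than by line number, means the lookup keeps working if the page gains or loses lines above it.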

If you want the fastest way, and know that the HTML structure is consistent, it would probably be faster to use strpos to search for offsets. It is more likely to break if the page structure changes, though. Something like this:

$needles = array(
  'name' => "<div class=\"subsectionHeader\">\n<h2>\n",
  'description' => "<div id=\"item_additional\" class=\"inner_brown_box\">\n",
  'price' => "<b>Current Price:</b> "
);
$buffer = file_get_contents("http://site.com/q=item/viewitem.php?obj=1234");
$result = array();
foreach ($needles as $key => $needle) {
  $index1 = strpos($buffer, $needle);
  if ($index1 === false) {
    continue; // needle not found; leave this field out
  }
  $index1 += strlen($needle); // the value starts just past the needle
  $index2 = strpos($buffer, "\n", $index1);
  if ($index2 === false) {
    $index2 = strlen($buffer); // no newline left; read to the end
  }
  $result[$key] = trim(substr($buffer, $index1, $index2 - $index1));
}

You will need to get the needles exactly right, including any trailing whitespace.
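
For the thousands-of-pages part of the question, the loop itself can stay small. The sketch below assumes a PDO connection and a hypothetical items table (obj as primary key, plus name, description and price columns). scrapeRange, $fetch and $parse are illustrative names, and any of the extraction approaches above could be plugged in as $parse:

```php
<?php
/**
 * Fetch every page with an id in [$minId, $maxId], parse it, and store the result.
 * $fetch($id) should return the page HTML or false on failure.
 * $parse($html) should return array('name' => ..., 'description' => ..., 'price' => ...)
 * or null when the page does not have the expected layout.
 * Returns the number of rows written.
 */
function scrapeRange(PDO $pdo, $minId, $maxId, $fetch, $parse)
{
    // Prepared once and executed per page. REPLACE overwrites an existing
    // row with the same primary key, so re-running does not duplicate rows.
    $stmt = $pdo->prepare(
        'REPLACE INTO items (obj, name, description, price) VALUES (?, ?, ?, ?)'
    );

    $saved = 0;
    for ($id = $minId; $id <= $maxId; $id++) {
        $html = call_user_func($fetch, $id);
        if ($html === false) {
            continue; // download failed; skip this id
        }
        $item = call_user_func($parse, $html);
        if ($item === null) {
            continue; // unexpected layout; skip this id
        }
        $stmt->execute(array($id, $item['name'], $item['description'], $item['price']));
        $saved++;
    }
    return $saved;
}

// Example wiring (placeholders only; credentials, database name and id range are assumptions):
// $pdo = new PDO('mysql:host=localhost;dbname=scraper', 'user', 'password');
// $pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
// $fetch = function ($id) {
//     return @file_get_contents('http://site.com/q=item/viewitem.php?obj=' . $id);
// };
// scrapeRange($pdo, $minObj, $maxObj, $fetch, $parse);
```

Preparing the statement once avoids re-parsing the SQL for every page. With thousands of requests the downloads will dominate the run time, so a short pause between them may be worth adding to avoid hammering the remote site.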
