Extracting portions of a loaded page in PHP (RegEx)

I have a newsletter system I am trying to incorporate within a PHP site. The PHP site loads a content area and also loads scripts into the head of the page. This works fine for the code that is generated for the site, but now I have the newsletter I am trying to incorporate.

Originally I was going to use an iFrame but the amount of AJAX and jQuery calls makes this quite complex.

So I thought I could use cURL to load the newsletter page as a variable. Then I was going to use RegEx to grab the content between the body tags and place this in the content area. Finally I was going to use RegEx again to search through the head and grab any scripts.

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $config_live_site."lib/alerts/user/update.php?email=test@test.com.au"); # URL to fetch (a plain GET request)
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1 ); # return into a variable
curl_setopt($ch, CURLOPT_HEADER, 0);
$loaded_result = curl_exec( $ch ); # run!
curl_close($ch);

// Capture the body content and place in $_content
if (preg_match('%<body>([\s\S]*)</body>%', $loaded_result, $regs)) {
 $_content .= $regs[1];
} else {
 $_content .= "<p>No content to display.</p>";
}

// Capture the scripts and place in the head
if (preg_match('%(<script type="text/javascript">[\s\S]*</script>)%', $loaded_result, $regs)) {
 $headDetails .= $regs[0];
}

This works most of the time, but if there is a script in the body of the document, the match runs all the way down to the last </script>.

My question is two-fold I guess...

A. Is there a better overall approach (My deadline is very short so it needs to be a quick solution without too much editing of the newsletter code)?

B. What RegEx would I need to use to just capture the first script?

I think you'll need to add a ? to the script regex after the * so it's not greedy. Greedy quantifiers match as much as possible (everything between the first opening tag and the last closing tag); non-greedy quantifiers match as little as possible (only what's between the opening tag and the first closing tag after it). Try:

%(<script type="text/javascript">[\s\S]*?</script>)%

As mentioned, change it to preg_match_all, and you should just match the individual script sections instead of everything between the first and last script tags.
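
To make the difference concrete, here is a minimal sketch that runs both patterns against a made-up page. The $sample string is an assumption for illustration only; the real input would be $loaded_result from cURL.

```php
<?php
// Hypothetical sample page: one script in the head, one in the body.
$sample = '<html><head><script type="text/javascript">var a = 1;</script></head>'
        . '<body><p>Hi</p><script type="text/javascript">var b = 2;</script></body></html>';

// Greedy: [\s\S]* runs on to the LAST </script> in the string.
preg_match('%<script type="text/javascript">[\s\S]*</script>%', $sample, $greedy);

// Lazy: [\s\S]*? stops at the FIRST </script> after the opening tag.
preg_match('%<script type="text/javascript">[\s\S]*?</script>%', $sample, $lazy);

echo $greedy[0], "\n"; // spans both scripts and all the markup between them
echo $lazy[0], "\n";   // prints: <script type="text/javascript">var a = 1;</script>
```

Note that the pattern still requires the opening tag to be exactly <script type="text/javascript">, so scripts with other attributes (src, for instance) would not match.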

A: I see no issues with using regular expressions to extract the bits you need from HTML pages which are not necessarily valid. In fact some of the spidering solutions I worked with did exactly that.

B: Use preg_match_all() instead of preg_match(). preg_match() stops after the first match, while preg_match_all() continues to the end of the string and returns every match.
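
As a rough sketch of how that could look, the helper below collects every script block and joins them into one string. The function name collect_scripts is made up for this example, and it assumes the non-greedy form of the pattern so that each match is a single script block.

```php
<?php
// Hypothetical helper (not part of the original code): returns every
// <script type="text/javascript">...</script> block in $html, one per line.
function collect_scripts(string $html): string
{
    $pattern = '%<script type="text/javascript">[\s\S]*?</script>%';

    // preg_match_all() returns the number of matches (0 if none, false on error)
    // and fills $matches[0] with the full matches in document order.
    if (!preg_match_all($pattern, $html, $matches)) {
        return '';
    }

    return implode("\n", $matches[0]);
}
```

Usage would then be something like $headDetails .= collect_scripts($loaded_result); keep in mind this picks up scripts from the body as well as the head.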

A quick and dirty fix would be to delete the body content right after capturing it, so the script regex never sees it. Alternatively, capture the head on its own first:

$_header = '';
if (preg_match('%<head>([\s\S]*?)</head>%', $loaded_result, $regs)) {
   $_header = $regs[1]; // everything between <head> and </head>
}

Then apply the script regex to the head only:

if (preg_match('%(<script type="text/javascript">[\s\S]*</script>)%', $_header, $regs)) {
   $headDetails .= $regs[0];
}

If the HTML you get from cURL is well formed (that is, it parses as XML, like valid XHTML), you should use SimpleXML to perform your extraction. As its name suggests, it is very simple to use.

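// Returns false if the markup is not well-formed XML
$xml = simplexml_load_string($loaded_result);

if ($xml !== false) {
  $body = $xml->body->asXML(); // includes the <body> tags themselves

  $_scripts = '';
  foreach ($xml->xpath('//head/script') as $script) {
    $_scripts .= $script->asXML();
  }
}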

If your HTML is not well formed, then you have to resort to tidy to normalize it (or, better, fix the scripts that output invalid HTML).

$doc = new DOMDocument();
$doc->loadHTML($loaded_result);
$xpath = new DOMXpath($doc);

$kod = $xpath->query("//head/script");
$i = 0;
foreach($kod as $node){
    echo 'im the script nº'.(++$i).' in the head and this is my content: ';
    echo $doc->saveXML($node)."\n";
}
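
Building on the DOMDocument approach, here is a minimal sketch that returns both pieces the question asks for: the inner HTML of the body and the scripts found in the head. The function name split_page and the array keys are made up for this example, and it assumes the DOM extension is available.

```php
<?php
// Hypothetical helper: splits a fetched page into the body's inner HTML
// and the <script> elements that sit in the head.
function split_page(string $html): array
{
    if ($html === '') {
        return ['content' => '', 'head' => ''];
    }

    // Collect parser warnings internally instead of printing them;
    // real-world HTML is rarely perfectly valid.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // Serialise each child of <body>, so the <body> tags themselves are left out.
    $content = '';
    foreach ($xpath->query('//body/node()') as $child) {
        $content .= $doc->saveHTML($child);
    }

    // Only scripts that are direct children of <head>.
    $headDetails = '';
    foreach ($xpath->query('//head/script') as $script) {
        $headDetails .= $doc->saveHTML($script) . "\n";
    }

    return ['content' => $content, 'head' => $headDetails];
}
```

With this, something like $parts = split_page($loaded_result); followed by $_content .= $parts['content']; and $headDetails .= $parts['head']; would replace both regex steps. Scripts inside the body stay with the body content rather than being pulled into the head.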