Admittedly, I am not the best at working with regular expressions. I am doing a screen scrape, then trying to fix the src values of the embedded img tags so that they point back to the original domain. This is the regex I have been trying variations of (too many to list; here's the current one):
preg_match_all('/<img\b[^>]*>/i', $html, $images);
What this ends up doing is replacing every < with />. What I need it to do is just return the (currently) five images on the page in an array so that I can work with those to fix their src values, then write them back to $html, which is set at the beginning of the file:
$html = file_get_contents($target_url);
Basically, don't do this with regex. You can parse HTML with regex, but it is almost certainly not worth the effort.
Do it with genuine DOM parsing instead, using the DOMDocument class:
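$dom = new DOMDocument;
$dom->loadHTML($html);

// Prefix every <img> src with the original domain (assumes the src values are relative).
$images = $dom->getElementsByTagName('img');
foreach ($images as $image) {
    $image->setAttribute('src', 'http://example.com/' . $image->getAttribute('src'));
}

// Serialize the modified document back to a string.
$html = $dom->saveHTML();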
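If some of the images on the page already use absolute URLs, prefixing every src would break them. Here is a minimal sketch of one way to guard against that. The function name absolutizeImageSources and the sample markup are hypothetical, and it assumes every relative src should be resolved against the site root, which is a simplification (it does not handle ../ paths or a <base> tag). It also silences the libxml warnings that loadHTML tends to emit on real-world markup.

```php
<?php
// Hypothetical helper: rewrite only relative <img src> values against $baseUrl.
function absolutizeImageSources(string $html, string $baseUrl): string
{
    $dom = new DOMDocument;

    // Scraped markup is rarely valid; collect libxml warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $base = rtrim($baseUrl, '/');

    foreach ($dom->getElementsByTagName('img') as $image) {
        $src = $image->getAttribute('src');

        // Skip empty values, URLs with a scheme (http:, https:, data:, ...)
        // and protocol-relative URLs (//host/path).
        if ($src === '' || preg_match('#^(?:[a-z][a-z0-9+.-]*:|//)#i', $src)) {
            continue;
        }

        // Assumption: treat every remaining src as relative to the site root.
        $image->setAttribute('src', $base . '/' . ltrim($src, '/'));
    }

    return $dom->saveHTML();
}

// Sample input (made up for illustration).
$sample = '<p><img src="/img/a.png"><img src="http://cdn.example.org/b.png"><img src="c.gif"></p>';
$fixed = absolutizeImageSources($sample, 'http://example.com/');
```

In the scraper itself, this would be called as $html = absolutizeImageSources($html, 'http://example.com/'); right after the file_get_contents call. Note that saveHTML returns a full document, so a doctype and html/body wrappers are added around fragments.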