I am trying to get the src of every image in a page, but some pages use absolute paths and some do not. What is the best way to do this?
Right now I am using this:
$imgsrc_regex = '#<\s*img [^\>]*src\s*=\s*(["\'])(.*?)\1#im';
preg_match_all($imgsrc_regex, $html, $matches);
For example, one page might have its images as src="xyz.png" while others might use src="b.com/xyz.png". Is there a way to automatically prepend the URL when necessary?
The best way (imo) would be to use DOMDocument and DOMXPath to get the URLs:
$dom = new DOMDocument;
libxml_use_internal_errors(true); // real-world HTML is rarely valid; keep parser warnings quiet
$dom->loadHTML($html);
and
$xpath = new DOMXPath($dom);
$result = $xpath->query("//img/@src");
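To actually read the values out of that query result, something like the following should work. It is only a minimal sketch: the sample markup and the $sources variable are made up for illustration and are not part of the question.

```php
<?php
// Hypothetical sample markup standing in for the real $html.
$html = '<html><body><img src="xyz.png"><img alt="logo" src="http://b.com/logo.png"></body></html>';

$dom = new DOMDocument;
libxml_use_internal_errors(true);   // tolerate imperfect markup
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

$sources = [];
foreach ($xpath->query('//img/@src') as $attr) {
    // Each match is a DOMAttr node; its value is the raw src string.
    $sources[] = $attr->value;
}

print_r($sources);
```

The values come back exactly as written in the markup, so relative and absolute paths are still mixed at this point.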
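Using regex to parse HTML is fragile: real markup varies in attribute order, quoting, and whitespace in ways a single pattern rarely covers.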
Otherwise, please clarify what you really want. Do you only want the image URLs that are absolute? If so, you can check whether the src starts with http: or https: like this:
$result = $xpath->query("//img[starts-with(@src, 'http:') or starts-with(@src, 'https:') or starts-with(@src, 'HTTP:')]/@src");
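If what you want is to turn the relative ones into absolute URLs instead, you need the URL of the page the HTML came from and have to resolve each src against it. Below is a rough sketch of that idea. resolveUrl() and $pageUrl are hypothetical names of my own, the base is assumed to be a full URL including the scheme, and the helper is simplified (it ignores a <base href> tag and does not preserve trailing slashes), so treat it as a starting point and not a complete resolver.

```php
<?php
/**
 * Hypothetical helper: resolve an image src against the URL of the page
 * it was found on. Assumes $base is a full URL such as http://b.com/dir/page.html.
 */
function resolveUrl(string $base, string $src): string
{
    // Already absolute: starts with a scheme such as http:, https: or data:.
    if (preg_match('#^[a-z][a-z0-9+.\-]*:#i', $src)) {
        return $src;
    }

    $parts  = parse_url($base);
    $scheme = $parts['scheme'] ?? 'http';
    $origin = $scheme . '://' . ($parts['host'] ?? '')
            . (isset($parts['port']) ? ':' . $parts['port'] : '');

    // Protocol-relative: //cdn.example.com/a.png
    if (strpos($src, '//') === 0) {
        return $scheme . ':' . $src;
    }

    // Set any query string or fragment aside so it is not touched below.
    $cut  = strcspn($src, '?#');
    $tail = (string) substr($src, $cut);
    $src  = (string) substr($src, 0, $cut);

    if (strpos($src, '/') === 0) {
        // Root-relative: /img/a.png
        $path = $src;
    } else {
        // Relative to the directory of the page.
        $basePath = $parts['path'] ?? '/';
        $path = substr($basePath, 0, strrpos($basePath, '/') + 1) . $src;
    }

    // Collapse "." and ".." segments.
    $out = [];
    foreach (explode('/', $path) as $segment) {
        if ($segment === '..') {
            array_pop($out);
        } elseif ($segment !== '.' && $segment !== '') {
            $out[] = $segment;
        }
    }

    return $origin . '/' . implode('/', $out) . $tail;
}

$pageUrl = 'http://b.com/gallery/index.html'; // hypothetical page URL
foreach (['xyz.png', '/xyz.png', '../img/a.png', 'http://c.com/a.png'] as $src) {
    echo resolveUrl($pageUrl, $src), PHP_EOL;
}
```

You would call this on each value returned by the //img/@src query, passing the URL you fetched the page from as the base.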
Use an HTML parser, not a regular expression
Seriously, searching for tags in HTML is the wrong problem domain for a regular expression.