I have loaded the entire HTML of a page and want to retrieve all the URLs that start with http and end with pdf. I wrote the following, which didn't work:
$html = file_get_contents( "http://www.example.com" );
preg_match( '/^http(pdf)$/', $html, $matches );
I'm pretty new to regex, but from what I've learned, ^ marks the beginning of a pattern and $ marks the end. What am I doing wrong?
You need to match the characters in the middle of the URL:
/\bhttp[\w%+\/:.-]+?pdf\b/
\b matches a word boundary.
^ and $ mark the beginning and end of the entire string. You don't want them here.
[...] matches any character in the brackets.
\w matches any word character.
+ matches one or more of the previous match.
? makes the + lazy rather than greedy.
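To see why the anchored pattern from the question finds nothing, here is a minimal sketch. The sample URL is made up for illustration; the point is only that ^http(pdf)$ can match nothing except the literal string httppdf.

```php
<?php
// The pattern from the question: "http" immediately followed by "pdf",
// anchored to the start and end of the whole subject string.
$pattern = '/^http(pdf)$/';

// Matches only when the entire string is exactly "httppdf".
$matchesLiteral = preg_match($pattern, 'httppdf');

// A real-looking URL (hypothetical) has characters between "http" and "pdf",
// and the pattern makes no allowance for them.
$matchesUrl = preg_match($pattern, 'http://www.example.com/file.pdf');

var_dump($matchesLiteral, $matchesUrl);
```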
preg_match( '/http[^\s]+pdf/', $html, $matches );
Matches http, followed by one or more (+) characters that are not ([^...]) whitespace (\s), followed by pdf.
Try this:
preg_match( '/\bhttp\S*pdf\b/', $html, $matches );
You need to match the part between the http and the pdf; this is what .*? is doing.
^ matches the start of the string and $ the end, but that is not what you want when extracting those links from a longer text.
\b matches on word boundaries.
Update: for completeness, the .*? would still match too much, so I exchanged it with \S*.
\S matches a non-whitespace character.
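The difference between .*? and \S* can be sketched as follows. The sample text is invented for illustration: it contains one non-PDF link and, later in the sentence, an unrelated file name ending in pdf.

```php
<?php
// Hypothetical text: a non-PDF link, then a separate mention of a PDF file.
$text = 'See http://www.example.com/index.html and also report.pdf';

// .*? may cross whitespace, so the match runs from "http" to the distant "pdf".
preg_match('/\bhttp.*?pdf\b/', $text, $loose);

// \S* cannot cross whitespace, so this pattern finds no match in the text.
$found = preg_match('/\bhttp\S*pdf\b/', $text, $strict);

echo $loose[0], "\n";
var_dump($found);
```

Here the first pattern returns the whole span up to report.pdf, which is not a URL at all, while the second reports no match.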
Try this one:
preg_match_all('/\bhttp\S*?pdf\b/', $html, $matches);
Note that you need to use the preg_match_all() function here, since you are trying to match more than one occurrence. ^ and $ won't work, because they only apply to line or string boundaries (depending on the modifiers used).
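As a rough sketch of how this behaves, the snippet below runs the pattern over a small piece of made-up markup instead of a downloaded page. The links are hypothetical; with preg_match_all() the full matches end up in $matches[0].

```php
<?php
// Hypothetical sample markup standing in for the downloaded page.
$html = '<a href="http://www.example.com/docs/a.pdf">A</a> '
      . '<a href="https://www.example.com/b.pdf">B</a> '
      . '<a href="http://www.example.com/index.html">Home</a>';

// \S*? lazily consumes non-whitespace characters up to the nearest "pdf".
$count = preg_match_all('/\bhttp\S*?pdf\b/', $html, $matches);

// $matches[0] holds every full-pattern match, in document order.
$urls = $matches[0];

print_r($urls);
```

With this input the two PDF links are collected and the index.html link is skipped. On real pages, tags that run together without whitespace can still produce matches that are too long, so treat the pattern as a starting point.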
preg_match( '/^http.*pdf$/', $html, $matches );
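is better (working), but only if the whole string is a single URL, since ^ and $ anchor the match to the start and end of the string.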