Regular expression starting with http and ending with pdf?

Developer (开发者) https://www.devze.com 2023-03-12 00:34 Source: web

I have loaded the entire HTML of a page and want to retrieve all the URLs that start with http and end with pdf. I wrote the following, which didn't work:

$html = file_get_contents( "http://www.example.com" );
preg_match( '/^http(pdf)$/', $html, $matches );

I'm pretty new to regex, but from what I've learned, ^ marks the beginning of a pattern and $ marks the end. What am I doing wrong?

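You need to match the characters in the middle of the URL. The character class has to list every character that can appear between http and pdf, including : and . (without them, http://... never matches):

/\bhttp[\w%+\/:.-]+?pdf\b/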

\b matches a word boundary.
^ and $ mark the beginning and end of the entire string. You don't want them here.
[...] matches any one of the characters in the brackets.
\w matches any word character (letter, digit, or underscore).
+ matches one or more of the preceding element.
? makes the + lazy rather than greedy.
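
As a rough illustration of the points above, here is a minimal sketch that runs a character-class pattern of this kind over a sample string. The HTML snippet and variable names are made up for the example, and the class shown includes : and . so that complete URLs can match; treat it as one possible variant, not a full URL grammar.

```php
<?php
// Hypothetical sample input; in the question this would come from file_get_contents().
$html = '<a href="http://www.example.com/docs/report.pdf">Report</a> '
      . '<a href="http://www.example.com/index.html">Home</a> '
      . '<a href="https://example.org/a/b/paper.pdf">Paper</a>';

// The class lists the characters allowed between "http" and "pdf".
// +? is lazy, so each match stops at the first "pdf" followed by a word boundary.
$pattern = '/\bhttp[\w%+\/:.-]+?pdf\b/';

// preg_match_all() collects every match; full matches end up in $matches[0].
preg_match_all($pattern, $html, $matches);

print_r($matches[0]);
```

The link to index.html is skipped because the closing quote is not in the character class, so the pattern never reaches a pdf from that starting point.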

preg_match( '/http[^\s]+pdf/', $html, $matches );

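This matches http, followed by one or more (+) non-whitespace characters ([^\s]), followed by pdf. Because + is greedy here, the match extends to the last pdf in an unbroken run of non-whitespace characters.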
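
One thing worth checking with this pattern is greediness. The sketch below uses an invented HTML fragment in which the link text also ends in pdf and no whitespace separates it from the URL; it compares the greedy quantifier with a lazy variant (+?), which is my addition for comparison and not part of the answer above.

```php
<?php
// Hypothetical sample: no whitespace between the URL and the link text.
$html = '<a href="http://example.com/a.pdf">a.pdf</a>';

// Greedy: [^\s]+ grabs as much as it can, then backs off to the LAST "pdf".
preg_match('/http[^\s]+pdf/', $html, $greedy);

// Lazy: [^\s]+? grows one character at a time and stops at the FIRST "pdf".
preg_match('/http[^\s]+?pdf/', $html, $lazy);

echo $greedy[0], "\n"; // http://example.com/a.pdf">a.pdf
echo $lazy[0], "\n";   // http://example.com/a.pdf
```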

Try this:

preg_match( '/\bhttp\S*pdf\b/', $html, $matches );

You need to match the part between http and pdf; this is what \S* does.

^ matches the start of the string and $ the end, which is not what you want when extracting links from a longer text.

\b matches at word boundaries.

Update

For completeness: the .*? I used originally would still match too much, so I replaced it with \S*.

\S matches any non-whitespace character.
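
To see why .*? can match too much, here is a small sketch comparing the two forms on an invented sentence in which the word http appears on its own before any URL. The sample text and variable names are assumptions for illustration only.

```php
<?php
// Hypothetical text: "http" appears as a plain word before the real URL.
$text = 'The http protocol is described in spec.pdf and at http://example.com/spec.pdf';

// . matches spaces too, so the match starts at the bare word "http".
preg_match('/\bhttp.*?pdf\b/', $text, $dot);

// \S cannot cross whitespace, so only the actual URL can match.
preg_match('/\bhttp\S*pdf\b/', $text, $nonSpace);

echo $dot[0], "\n";      // http protocol is described in spec.pdf
echo $nonSpace[0], "\n"; // http://example.com/spec.pdf
```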

Try this one:

preg_match_all('/\bhttp\S*?pdf\b/', $html, $matches);

Note that you need to use preg_match_all() here, since you are trying to match more than one occurrence. ^ and $ won't work, because they only match at string or line boundaries (depending on the modifiers used).
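
The difference between the two functions can be sketched as follows. The HTML fragment is made up; the return values and the shape of the result arrays follow the documented behaviour of preg_match() and preg_match_all() with default flags.

```php
<?php
// Hypothetical sample with two PDF links.
$html = '<a href="http://example.com/one.pdf">One</a> '
      . '<a href="http://example.com/two.pdf">Two</a>';
$pattern = '/\bhttp\S*?pdf\b/';

// preg_match() stops after the first match and returns 1 (or 0 if none).
$found = preg_match($pattern, $html, $single);

// preg_match_all() returns the number of matches; $all[0] lists the full matches.
$count = preg_match_all($pattern, $html, $all);

echo $single[0], "\n"; // http://example.com/one.pdf
echo $count, "\n";     // 2
print_r($all[0]);
```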

preg_match( '/^http.*pdf$/', $html, $matches );

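is a working form of your original pattern, but because of ^ and $ it only matches when the whole string starts with http and ends with pdf.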
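
A quick way to see what the anchored pattern does is to run it against a single URL and against a page fragment. Both sample strings below are invented for the sketch.

```php
<?php
$pattern = '/^http.*pdf$/';

// Hypothetical inputs: a bare URL, and the same URL embedded in HTML.
$url  = 'http://example.com/a.pdf';
$html = '<a href="http://example.com/a.pdf">A</a>';

// The anchors require the subject itself to start with "http" and end with "pdf".
$urlResult  = preg_match($pattern, $url);   // 1: the whole string fits
$htmlResult = preg_match($pattern, $html);  // 0: the string starts with "<"

echo $urlResult, "\n";
echo $htmlResult, "\n";
```

So the anchored form suits validating one URL at a time, while extracting links from a full page needs an unanchored pattern.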