Regular expression starting with http and ending with pdf?

Developer https://www.devze.com 2023-03-12 00:34 Source: Internet
I have loaded the entire HTML of a page and want to retrieve all the URLs that start with http and end with pdf. I wrote the following, which didn't work:

$html = file_get_contents( "http://www.example.com" );
preg_match( '/^http(pdf)$/', $html, $matches );

I'm pretty new to regex, but from what I've learned ^ marks the beginning of a pattern and $ marks the end. What am I doing wrong?


You need to match the characters in the middle of the URL:

/\bhttp[\w%+\/.:-]+?pdf\b/
  • \b matches a word boundary

  • ^ and $ mark the beginning and end of the entire string. You don't want them here.

  • [...] matches any character in the brackets

  • \w matches any word character

  • + matches one or more of the previous match

  • ? makes the + lazy rather than greedy
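
Here is a minimal sketch of this kind of pattern in use. The sample $html string and its URL are made up for illustration, and the character class used here includes . and : so that http:// and dotted host names can match; treat that as an assumption rather than a complete list of URL characters.

```php
<?php
// Hypothetical sample input; in the question this would come from file_get_contents().
$html = '<p><a href="http://www.example.com/docs/report.pdf">Report</a></p>';

// \b ... \b       word boundaries around the match
// [\w%+\/.:-]+?   one or more URL characters, as few as possible (lazy)
$pattern = '/\bhttp[\w%+\/.:-]+?pdf\b/';

$found = preg_match($pattern, $html, $matches);

if ($found === 1) {
    echo $matches[0], PHP_EOL; // http://www.example.com/docs/report.pdf
}
```

URLs containing characters outside the class (for example ? or =) would not be matched by this sketch.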


preg_match( '/http[^\s]+pdf/', $html, $matches );

Matches http, followed by one or more (+) non-whitespace characters ([^\s]), followed by pdf.
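
One thing to watch with a greedy + on HTML: it runs to the end of the non-whitespace stretch and then backs up to the last pdf. A small sketch comparing it with a lazy +? (the sample markup is hypothetical):

```php
<?php
// Hypothetical sample: the link text also ends in "pdf" and no whitespace follows the URL.
$html = '<a href="http://example.com/a.pdf">a.pdf</a>';

// Greedy: [^\s]+ grabs as much as it can, then backs up to the last "pdf".
preg_match('/http[^\s]+pdf/', $html, $greedy);

// Lazy: [^\s]+? stops at the first "pdf" it can reach.
preg_match('/http[^\s]+?pdf/', $html, $lazy);

echo $greedy[0], PHP_EOL; // http://example.com/a.pdf">a.pdf
echo $lazy[0], PHP_EOL;   // http://example.com/a.pdf
```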


Try this,

preg_match( '/\bhttp\S*pdf\b/', $html, $matches );

You need to match the part between the http and the pdf; this is what \S* is doing.

^ matches the start of the string and $ the end, but that is not what you want when extracting links from a longer text.

\b matches on word boundaries.

Update

For completeness: the .*? I used originally would still match too much, so I exchanged it with \S*.

\S matches a non-whitespace character.
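
To illustrate what the trailing \b buys you, here is a rough sketch with a made-up URL where pdf only appears inside the word pdfs:

```php
<?php
// Hypothetical URL that contains "pdf" only as part of the word "pdfs".
$text = 'See http://example.com/pdfs/list.html for the index.';

// Without the trailing \b the match ends inside "pdfs".
$loose = preg_match('/http\S*pdf/', $text, $m1);

// With \b, "pdf" must be followed by a non-word character or the end of the string.
$strict = preg_match('/\bhttp\S*pdf\b/', $text, $m2);

echo $loose, ' ', $m1[0], PHP_EOL; // 1 http://example.com/pdf
echo $strict, PHP_EOL;             // 0
```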


Try this one:

preg_match_all('/\bhttp\S*?pdf\b/', $html, $matches);

Note that you need to use preg_match_all() here, since you are trying to match more than one occurrence. ^ and $ won't work, because they only match at string or line boundaries (depending on the modifiers used).
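
A short sketch of the difference between the two functions, using a made-up fragment with two links:

```php
<?php
// Hypothetical page fragment with two PDF links.
$html = '<a href="http://example.com/a.pdf">A</a> <a href="https://example.org/b.pdf">B</a>';

$pattern = '/\bhttp\S*?pdf\b/';

// preg_match() stops after the first match.
preg_match($pattern, $html, $first);

// preg_match_all() collects every match; the full matches are in $all[0].
$count = preg_match_all($pattern, $html, $all);

echo $first[0], PHP_EOL; // http://example.com/a.pdf
echo $count, PHP_EOL;    // 2
print_r($all[0]);        // both URLs, in document order
```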


preg_match( '/^http.*pdf$/', $html, $matches );

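works, but only when $html is itself a single URL, that is, when the whole string starts with http and ends with pdf. It does not extract links from inside a full page.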
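
A quick check of when an anchored pattern like this matches. Both input strings are made up for illustration:

```php
<?php
$pattern = '/^http.*pdf$/';

// A string that is nothing but one URL: both anchors are satisfied.
$singleUrl = preg_match($pattern, 'http://example.com/a.pdf');

// A page fragment: the string starts with "<", so ^http cannot match.
$inPage = preg_match($pattern, '<a href="http://example.com/a.pdf">A</a>');

echo $singleUrl, PHP_EOL; // 1
echo $inPage, PHP_EOL;    // 0
```

So the anchored form is more of a validator for one already-extracted URL than a way to pull links out of HTML.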
