Regex help: returning page titles that do and don't contain \n newlines, etc.

Developer https://www.devze.com 2023-01-17 16:58 Source: Internet
I have been looking for ages for a regular expression which will return all page titles. Unfortunately some titles have newline characters and other weird stuff in them, which is preventing me from getting a result.

Here are some of the regexes I have tried:

"/\<title.*\>(.+)\<\/title\>/"

"#\<title.*\>(.+)\<\/title\>#s"

But none of them return titles that contain \n characters. Can anyone help me out please?

Many thanks, Luke

Edit

Here is the full code:

$data = file_get_contents("http://www.awin1.com/pclick.php?p=116824093&a=79524&m=2694&platform=cs");
$subject = $data;
$pattern = '#<title.*>(.+)</title>#s';
preg_match($pattern,$subject,$matches);
var_dump($matches);

Obviously the link changes. Thanks.


As long as you put 'dot matches newline' on, this will work just fine:

<title>.*?</title>

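For 'dot matches newline' in PHP, add the s modifier after the closing delimiter of the pattern. With / as the delimiter, the slash in the closing tag has to be escaped:

preg_match('/<title>(.*?)<\/title>/s', $someTextToSearch, $matches);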
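
To make the effect of the s modifier concrete, here is a minimal sketch. The sample HTML string and the variable names are made up for illustration, and # is used as the delimiter so the closing tag needs no escaping:

```php
<?php
// Sample input (made up for illustration): a title that spans two lines.
$html = "<html><head><title>First line\nSecond line</title></head></html>";

// Without the s modifier the dot stops at "\n", so the pattern cannot match.
$withoutS = preg_match('#<title>(.*?)</title>#', $html, $m1);

// With the s modifier the dot also matches "\n".
$withS = preg_match('#<title>(.*?)</title>#s', $html, $m2);

var_dump($withoutS); // int(0)
var_dump($withS);    // int(1)
var_dump($m2[1]);    // the title text, line break included
```

The lazy .*? stops at the first closing tag, which matters if the s modifier lets the dot run across the rest of the page.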


Firstly, have you considered using PHP's DOM functions instead of regex? Using regex can be quite fraught when trying to parse HTML.

If you still want to use regex...

1) The dot operator (which you're using already) matches "any character except a newline". However, there is an option you can enable (the s modifier, PCRE_DOTALL) that switches it to "any character including newlines".

2) Or you could keep the dot and add \n and \r, the two line-break characters you're likely to encounter, so (.|\n|\r) where you currently have just the dot.

3) Another alternative would be to use str_replace() to get rid of all the line-break characters before running the regex. (Line breaks inside a title do not change how the browser displays it, so nothing is lost.)
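
For the DOM route mentioned at the top of this answer, a rough sketch might look like the following. It assumes the dom extension is enabled (it usually is), and the sample markup is invented for the example:

```php
<?php
// Sample markup (made up): upper-case tag and a line break plus a tab inside the title.
$html = "<html><head><TITLE>Some page\n\ttitle</TITLE></head><body></body></html>";

$doc = new DOMDocument();
// Real-world HTML is often invalid; collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

// The HTML parser normalises tag names, so 'title' also finds <TITLE>.
$nodes = $doc->getElementsByTagName('title');
$title = $nodes->length > 0 ? $nodes->item(0)->textContent : null;

// Collapse runs of whitespace (newlines, tabs) into single spaces.
if ($title !== null) {
    $title = trim(preg_replace('/\s+/', ' ', $title));
}

var_dump($title); // string(15) "Some page title"
```

The parser deals with case, attributes and line breaks itself, so no regex is needed to find the element, only to tidy the whitespace afterwards.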


For me it works just fine (with \n):

$sgml = <<<HTML
<title>fooo bar ? \n
baz! </title>
HTML;

preg_match('#\<title.*\>(.+)\<\/title\>#s',$sgml,$matches);

var_dump($matches); // dumps array(2): [0] => string(33), the whole <title>...</title> element; [1] => string(18) "fooo bar ? " followed by two line breaks and "baz! "

Or did I misunderstand you?


Erm, this works. Am I missing something?

$data = file_get_contents("http://www.awin1.com/pclick.php?p=116824093&a=79524&m=2694&platform=cs"); 
$subject = $data; 
// [^>]* allows attributes on the opening tag; the lazy .+? stops at the first </title>
preg_match('!<title[^>]*>(.+?)</title>!is', $subject, $matches);
var_dump(trim($matches[1]));


I couldn't get any of the solutions on this page to work 100%: some title tags contain newline characters, some contain tabs, and some are cased irregularly. In each of these cases the regex fails.

Thus far the best all-inclusive expression I have found (and tested) is this:

$res = preg_match('/<title>(.*?)<\/title>/is', $fp, $title_matches);
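
As a sketch of how that expression could be wrapped up for reuse: extract_title() below is a hypothetical helper name, not a PHP built-in, and the [^>]* for attributes and the whitespace clean-up are additions of mine rather than part of the expression above. It assumes PHP 7.1 or later for the nullable return type.

```php
<?php
// extract_title() is a hypothetical helper name, not part of PHP.
function extract_title(string $html): ?string
{
    // i: match <TITLE> as well as <title>; s: let the dot cross line breaks.
    // \b[^>]* tolerates attributes on the opening tag without matching e.g. <titlebar>.
    if (preg_match('/<title\b[^>]*>(.*?)<\/title>/is', $html, $m) !== 1) {
        return null; // no title element found
    }
    // Collapse tabs and newlines into single spaces and trim the ends.
    return trim(preg_replace('/\s+/', ' ', $m[1]));
}

var_dump(extract_title("<TITLE>\n\tHello\n\tWorld\n</TITLE>")); // string(11) "Hello World"
var_dump(extract_title("<p>no title here</p>"));                // NULL
```

Returning null when nothing matches avoids reading an undefined $m[1] on pages that have no title at all.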