Regex help to return page titles which do and don't include \n characters etc.

Developer https://www.devze.com 2023-01-17 16:58 Source: Internet

I have been looking for ages for a regular expression which will return all page titles. Unfortunately some have newline characters in them and other weird stuff which is preventing me from finding a resul

Related topics: php regex

Here are some of the regexes I have tried:

"/\<title.*\>(.+)\<\/title\>/"


"#\<title.*\>(.+)\<\/title\>#s"

But none of them return titles that contain \n characters. Can anyone help me out, please?

Many thanks, Luke

Edit

Here is the full code:

$data = file_get_contents("http://www.awin1.com/pclick.php?p=116824093&a=79524&m=2694&platform=cs");
$subject = $data;
$pattern = '#<title.*>(.+)</title>#s';
preg_match($pattern,$subject,$matches);
var_dump($matches);

Obviously the link changes. Thanks.

As long as you put 'dot matches newline' on, this will work just fine:

<title>.*?</title>

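To turn on 'dot matches newline' in PHP, append the s modifier after the closing delimiter of the regex. Note that a / inside a /-delimited pattern has to be escaped:

preg_match('/<title>(.*?)<\/title>/s', $someTextToSearch, $matches);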
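A minimal sketch of the difference the s modifier makes. The $html string is a made-up sample, not the page from the question, and # is used as the delimiter only to avoid escaping the slash:

```php
<?php
// Hypothetical sample input: a title that spans two lines.
$html = "<html><head><title>First line\nSecond line</title></head></html>";

// Without s, the dot stops at "\n", so the pattern never reaches </title>.
$withoutS = preg_match('#<title>(.*?)</title>#', $html, $m1);

// With s, the dot also matches "\n".
$withS = preg_match('#<title>(.*?)</title>#s', $html, $m2);

var_dump($withoutS); // int(0)
var_dump($withS);    // int(1)
var_dump($m2[1]);    // the captured title, newline included
```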

Firstly, have you considered using PHP's DOM functions instead of regex? Using regex can be quite fraught when trying to parse HTML.
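As a rough sketch of the DOM route, assuming the dom extension is available and using a made-up $html sample in place of the fetched page:

```php
<?php
// Hypothetical sample input; in practice this would be the fetched page.
$html = "<html><head><TITLE>\n  My page title\n</TITLE></head><body></body></html>";

$doc = new DOMDocument();
// Collect parser warnings internally instead of printing them;
// real-world HTML is often malformed.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

// Take the first <title> element, if there is one.
$nodes = $doc->getElementsByTagName('title');
$title = $nodes->length > 0 ? trim($nodes->item(0)->textContent) : null;

var_dump($title);
```

The parser deals with case and line breaks inside the tag, so no regex modifiers are needed here.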

If you still want to use regex...

1) The dot operator (that you're using already) matches "any character except line feeds". However, there is an option you can enable, the s modifier, that switches it to "any character including line feeds".

2) Or you could continue using the dot, plus \n and \r, which are the two line feed characters you're likely to encounter: (.|\n|\r) where you currently have just the dot.

3) Another alternative would be to use str_replace() to get rid of all the line feed characters before running the regex (this won't affect your HTML output in the browser).
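Options 2 and 3 could look roughly like this. The $html sample and the variable names are illustrative assumptions, not taken from the question:

```php
<?php
// Hypothetical sample input with a Windows-style line break inside the title.
$html = "<title>Line one\r\nLine two</title>";

// Option 2: spell out the line-break characters next to the dot.
preg_match('#<title>((?:.|\n|\r)+?)</title>#', $html, $viaAlternation);

// Option 3: strip line breaks first, then use the plain dot.
$flattened = str_replace(["\r", "\n"], '', $html);
preg_match('#<title>(.+?)</title>#', $flattened, $viaStrReplace);

var_dump($viaAlternation[1]); // title with the line break kept
var_dump($viaStrReplace[1]);  // title with the line break removed
```

Note that option 3 as written joins "one" and "Line" together; replacing the line breaks with a space instead of an empty string may be preferable if the title wraps between words.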

For me it works just fine (with \n):

$sgml = <<<HTML
<title>fooo bar ? \n
baz! </title>
HTML;

preg_match('#\<title.*\>(.+)\<\/title\>#s',$sgml,$matches);

var_dump($matches); // dumps array(2) { [0]=>  string(33) "<title>fooo bar ? baz! </title>" [1]=>  string(18) "fooo bar ? baz! " } (line breaks not shown)

Or did I misunderstand you?

Erm, this works? Am I missing something?

$data = file_get_contents("http://www.awin1.com/pclick.php?p=116824093&a=79524&m=2694&platform=cs"); 
$subject = $data; 
preg_match('!<title[^>]*>(.+?)</title>!is', $subject, $matches);
var_dump(trim($matches[1]));

I couldn't get any of the solutions on this page to work 100% of the time: some title tags have newline characters, some have tabs, and some are cased irregularly. In all these cases the regex will fail.

Thus far, the best all-inclusive expression I have found (and tested) is this:

$res = preg_match('/<title>(.*?)<\/title>/is', $fp, $title_matches);
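One way to wrap that expression for reuse is sketched below. The function name extract_title is hypothetical, and two things go beyond the expression above: [^>]* tolerates attributes on the opening tag, and the captured text has its whitespace collapsed:

```php
<?php
/**
 * Hypothetical helper: returns the text of the first <title>, or null if none is found.
 */
function extract_title(string $html): ?string
{
    // i: ignore case; s: let the dot match newlines;
    // [^>]*: tolerate attributes on the opening tag (an addition, see above).
    if (!preg_match('/<title[^>]*>(.*?)<\/title>/is', $html, $m)) {
        return null;
    }
    // Collapse newlines, tabs and repeated spaces into single spaces.
    return trim(preg_replace('/\s+/', ' ', $m[1]));
}

var_dump(extract_title("<TITLE>\n\tSome\tpage\n title\n</Title>"));
var_dump(extract_title('<p>no title here</p>'));
```

Returning null on a failed match avoids reading an undefined $m[1] when a page has no title at all.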
