
regex: find new line character in string that isn't in textarea

开发者 (Developer) https://www.devze.com, 2023-01-23 17:55, source: the web
heya, so I'm looking for a regex that would allow me to basically replace a newline character with whatever (e.g. 'xxx'), but only if the newline character isn't within textarea tags.

For example, the following:

<strong>abcd
efg</strong>
<textarea>curious
george
</textarea>
<span>happy</span>

Would become:

<strong>abcdxxxefg</strong>xxx<textarea>curious
george
</textarea>xxx<span>happy</span>

Anyone have any idea on where I should start? I'm kinda clueless here :( Thanks for any help possible.


I've got it, but you're not gonna like it. ;)

$result = preg_replace(
  '~[\r\n]++(?=(?>[^<]++|<(?!/?textarea\b))*+(?!</textarea\b))~',
  'XYZ', $source);

After matching a line break, the lookahead scans ahead, consuming any character that's not a left angle bracket, or any left angle bracket that's not the beginning of a <textarea> or </textarea> tag. When it runs out of those, the next thing it sees has to be one of those tags or the end of the string. If it's a </textarea> tag, that means the line break was found inside a textarea element, so the match fails, and that line break is not replaced.

I've included an expanded version below, and you can see it in action on ideone. You can adapt it to handle those other tags too, if you really want to. But it sounds to me like what you need is an HTML minimizer (or minifier); there are plenty of those available.

  $re=<<<EOT
~
[\r\n]++
(?=
  (?>
    [^<]++            # not left angle brackets, or
  |
    <(?!/?textarea\b) # bracket if not for TA tag (opening or closing)
  )*+
  (?!</textarea\b)    # first TA tag found must be opening, not closing
)
~x
EOT;
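
To check the pattern against the example from the question, a small script along these lines should do. The function name replace_outside_textarea is made up for illustration, and the scalar type declarations assume PHP 7 or later; the pattern itself is the compact one from above.

```php
<?php
// Hypothetical helper wrapping the pattern from this answer (not part of any library).
function replace_outside_textarea(string $source, string $replacement): string
{
    // Line breaks whose next textarea tag (if any) is an opening one.
    $pattern = '~[\r\n]++(?=(?>[^<]++|<(?!/?textarea\b))*+(?!</textarea\b))~';
    $result = preg_replace($pattern, $replacement, $source);
    // preg_replace returns null on a PCRE error; fall back to the input then.
    return $result === null ? $source : $result;
}

// The sample markup from the question.
$source = "<strong>abcd\nefg</strong>\n<textarea>curious\ngeorge\n</textarea>\n<span>happy</span>";

echo replace_outside_textarea($source, 'xxx'), "\n";
```

The two line breaks inside the textarea should come through untouched, while the other three become xxx.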


If you still want to go with regexp, you may try this - escape newlines inside special tags, delete newlines and then unescape:

<?php // requires PHP 5.3+ (anonymous functions)

// Match each textarea, pre or code element. The lazy .*? stops at the first
// matching closing tag, so text between two such elements is not swallowed.
$str = preg_replace_callback('#<(?P<tag>textarea|pre|code)\b[^>]*>.*?</(?P=tag)>#si',
    function ($matches) {
        // Protect every newline inside the element with a placeholder.
        return str_replace("\n", "%ESCAPED_NEWLINE%", $matches[0]);
    }, $str);
// Now it is safe to remove the remaining newlines,
// and then turn the placeholders back into newlines.
$str = str_replace(array("\n", "%ESCAPED_NEWLINE%"), array('', "\n"), $str);


Why use a regex for this? Why not use a very simple state machine to do it? Work through the string looking for opening <textarea> tags, and when inside them look for the closing tag instead. When you come across a newline, convert it or not based on whether you're currently inside a <textarea> or not.
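
A minimal sketch of that state machine, assuming PHP 7 or later. The function name replace_newlines_outside_textarea is hypothetical, and the tag detection is deliberately naive: it only looks for the literal text <textarea and </textarea, case-insensitively, so comments, scripts or attribute values containing that text would confuse it.

```php
<?php
// Hypothetical function; a sketch of the state-machine idea, not a full HTML parser.
function replace_newlines_outside_textarea(string $html, string $replacement): string
{
    $out = '';
    $pos = 0;
    $len = strlen($html);
    $inside = false; // true while between "<textarea" and "</textarea"

    while ($pos < $len) {
        // Outside a textarea, look for the opening tag; inside, for the closing one.
        $needle = $inside ? '</textarea' : '<textarea';
        $next = stripos($html, $needle, $pos);
        if ($next === false) {
            $next = $len; // no more tags: the rest of the string is one chunk
        }

        $chunk = substr($html, $pos, $next - $pos);
        // Only text outside a textarea has its runs of line breaks replaced.
        $out .= $inside
            ? $chunk
            : implode($replacement, preg_split('~[\r\n]+~', $chunk));

        if ($next < $len) {
            // Copy the tag start as written in the input and flip the state.
            $out .= substr($html, $next, strlen($needle));
            $inside = !$inside;
        }
        $pos = $next + strlen($needle);
    }

    return $out;
}

$source = "<strong>abcd\nefg</strong>\n<textarea>curious\ngeorge\n</textarea>\n<span>happy</span>";
echo replace_newlines_outside_textarea($source, 'xxx'), "\n";
```

Splitting and re-joining with implode, rather than calling preg_replace, keeps any $ or \ characters in the replacement text from being treated as backreferences.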


What you are doing is parsing HTML. You cannot parse HTML with a regular expression.
