First off, if it's not clear from the tag, I'm doing this in PHP - but that probably doesn't matter much.
I have this code:
$inputStr = strip_tags($inputStr);
$inputStr = preg_replace("/[^a-zA-Z\s]/", " ", $inputStr);
Which seems to remove all HTML tags and virtually all special and non-alphabetic characters perfectly. The one problem is that, for some reason, it doesn't filter out carriage return/line feed pairs (just that combination).
If I add this line:
$inputStr = preg_replace("/\s+/", " ", $inputStr);
at the end, however, it works great. Can someone tell me:
- Why doesn't the first preg_replace filter out the CR/LFs?
- What is this second preg_replace actually doing? I understand the first one for the most part, but the second one is confusing me - it works, but I don't know why.
- Can I combine them into 1 line somehow?
- You told it to remove everything except letters and whitespace. Newlines are whitespace, so they don't get removed. You could use \h instead of \s so that only horizontal whitespace (such as spaces and tabs) is kept.
- It simply means "replace every sequence of one or more whitespace characters (\s+) with a single space."
- preg_replace("/[^A-Za-z]+/", " ", ...) might do.
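To make the difference concrete, here is a minimal sketch comparing the original pattern with the combined one. The function names keepWhitespace and lettersOnly and the sample input are made up for illustration; only strip_tags and preg_replace come from the question.

```php
<?php
// Hypothetical helper: the original pattern from the question.
// \s inside the negated class means whitespace (including \r and \n) is NOT replaced.
function keepWhitespace(string $s): string {
    return preg_replace("/[^a-zA-Z\s]/", " ", strip_tags($s));
}

// Hypothetical helper: the combined pattern suggested above.
// Every run of one or more non-letters becomes a single space.
function lettersOnly(string $s): string {
    return preg_replace("/[^A-Za-z]+/", " ", strip_tags($s));
}

// Made-up sample input with a CR/LF line break.
$sample = "<p>Hello,\r\nworld!</p>";

var_dump(keepWhitespace($sample)); // the CR/LF pair is still in the string
var_dump(lettersOnly($sample));    // the CR/LF pair is gone
```

Note that the combined pattern can still leave a single leading or trailing space, so you may want to trim() the result.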
Your first regex is removing all characters that are not letters or whitespace. CRLFs are whitespace, so they aren't filtered out.
The second one is replacing whitespace with a space character. Essentially it condenses sequences of whitespace into a single space (due to the quantifier being greedy).
I suggest removing the \s from the first regex and seeing if that works.
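As a rough check of that suggestion, this sketch drops the \s from the first pattern and then applies the whitespace-condensing pass. The variable names and the sample string are assumptions for illustration only.

```php
<?php
// Made-up sample input with a CR/LF line break between two words.
$inputStr = "one\r\ntwo";

// First pattern without \s: \r and \n are each replaced by their own space,
// so the CR/LF pair turns into two spaces.
$withoutS = preg_replace("/[^a-zA-Z]/", " ", $inputStr);

// The second pattern then condenses that run of whitespace into one space.
$condensed = preg_replace("/\s+/", " ", $withoutS);

var_dump($withoutS);  // two spaces between the words
var_dump($condensed); // one space between the words
```

So without the \s the line breaks do disappear, but you would still want the \s+ pass (or a + quantifier on the first pattern) to avoid doubled spaces.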
- \s matches whitespace such as \n.
- It is replacing all whitespace characters with a space.
- You could make it one unreadable line, but probably not one regex.
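For what it's worth, a one-line version keeping both original patterns could look roughly like the sketch below. The function names cleanNested and cleanArray are hypothetical; the second variant relies on preg_replace accepting an array of patterns, which it applies in order with the same replacement string.

```php
<?php
// Hypothetical helper: the three steps from the question nested into one expression.
function cleanNested(string $s): string {
    return preg_replace("/\s+/", " ", preg_replace("/[^a-zA-Z\s]/", " ", strip_tags($s)));
}

// Hypothetical helper: same two patterns passed as an array.
// Each pattern runs on the result of the previous one.
function cleanArray(string $s): string {
    return preg_replace(["/[^a-zA-Z\s]/", "/\s+/"], " ", strip_tags($s));
}

// Made-up sample input.
$sample = "<b>Hi</b>,\r\nthere 42!";

var_dump(cleanNested($sample));
var_dump(cleanArray($sample));
```

Both are still two regex passes under the hood; they just fit on one line.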