What is a cross platform regex for removal of line breaks?

I am sure this has been asked before, but I cannot find it.

Basically, assuming you are parsing a text file of unknown origin and want to replace line breaks with some other delimiter, is this the best regex, or is there another?

(\r\n)|(\n)|(\r)

Fletcher - this did get asked once before.

Here you go: Regular Expression to match cross platform newline characters

Spoiler Alert!

The regex I use when I want to be precise is "\r\n?|\n".
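
As a minimal sketch of how that pattern might be used in JavaScript to swap each line break for another delimiter (the helper name replaceLineBreaks and the "|" delimiter are illustrative assumptions, not part of the original question):

```javascript
// Hypothetical helper: replaces CRLF, a lone CR, or a lone LF with a delimiter.
function replaceLineBreaks(text, delimiter) {
  // \r\n? matches CRLF or a lone CR; \n matches a lone LF.
  // A replacer function is used so "$" in the delimiter is taken literally.
  return text.replace(/\r\n?|\n/g, () => delimiter);
}

console.log(replaceLineBreaks("a\r\nb\nc\rd", "|")); // a|b|c|d
```

Because \r\n? consumes a CRLF pair as one match, a Windows line ending yields one delimiter rather than two.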

Do check whether your regex engine supports \R as a shorthand character class; if it does, you will not need to be concerned with the various Unicode newline / linefeed combos. If it is implemented correctly, \R matches all the various ASCII or Unicode line endings transparently.

In Unicode you need to detect NEL (OS/390 line ending, \x85), LS (Line Separator, \x2028), and PS (Paragraph Separator, \x2029) if you want to be completely cross platform these days.

It is debatable whether LS, NEL, and PS should be treated as line breaks, line endings, or white space. The XML 1.0 standard, for example, does not recognize NEL as a line break character. ECMAScript treats LS and PS as line breaks but NEL as whitespace. Perl Unicode regexes will treat VT, FF, CR, CRLF, NEL, LS and PS as line breaks for the purpose of the ^ and $ regex metacharacters.

The Unicode Implementation Guide (section 5.8 and table 5.3) is probably the best bet for a definitive treatment of what a "newline" is.

If you are only concerned with ASCII and the DOS/Windows/Unix/Mac classic variants, the regex equivalent to \R is (?>\r\n|[\r\n]).

In Unicode, the equivalent to \R is (?>\r\n|\n|\x0b|\f|\r|\x85|\x2028|\x2029). The \x0b in there is a vertical tab; once again, this may or may not fit your definition of what a line break is, but it does match the recommendation of the Unicode Implementation Guide. (FF, or \x0C is not included in the regex since a Form Feed is a new page, not a new line in the definition.)

The regex to find any Unicode line terminator should be (?>\x0D\x0A?|[\x0A-\x0C\x85\x{2028}\x{2029}]) rather than as drewk wrote it, at least in Perl. This is taken directly from the Perl 5.10.0 documentation (it was removed in later versions). Note the braces after \x: U+2029 is \x{2029}, whereas \x2029 is an ASCII space (U+0020) followed by the digit 2 and the digit 9. Also, \n outside a character class is not guaranteed to match \x{0a}.
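
For comparison, here is a rough JavaScript rendering of the same idea. It is only a sketch: JavaScript has no atomic groups, so a plain alternation is used, and since \x takes exactly two hex digits in JavaScript, LS and PS are written as \u2028 and \u2029. The names UNICODE_LINE_BREAK and replaceUnicodeLineBreaks are made up for illustration, and whether VT, FF and NEL belong in the set is the judgment call discussed above.

```javascript
// Assumed terminator set: CRLF, CR, LF, VT, FF, NEL, LS, PS.
// \r\n? comes first so a CRLF pair is consumed as a single match.
const UNICODE_LINE_BREAK = /\r\n?|[\n\v\f\x85\u2028\u2029]/g;

// Hypothetical helper: replaces every terminator in the set with a delimiter.
function replaceUnicodeLineBreaks(text, delimiter) {
  return text.replace(UNICODE_LINE_BREAK, () => delimiter);
}

const sample = "a\r\nb\u0085c\u2028d\u2029e";
console.log(replaceUnicodeLineBreaks(sample, "|")); // a|b|c|d|e
```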

If your platform does not support the \R class as suggested by @dawg above, you may still be able to make a pretty elegant and robust solution if your platform supports negative lookaround or character class subtraction (e.g. in Java class subtraction is through the syntax [x&&[^y]]).

In most regular expression grammars, the dot character is defined to mean "any character except the newline character" (see for example, for JavaScript, here). So you have found a line break if you match something with both of the following characteristics:

not (any character except the newline character) → the newline character; and
is whitespace

Since I'm currently working in JavaScript, which AFAIK doesn't have the \R shorthand or character class subtraction, I can still use negative lookahead to get what I want. The following regular expression matches all newlines:

/((?!.)\s)+/g

And the following JavaScript code, at least when run in Chrome 42.0.2311.90m on Windows 7, wipes out all the kinds of newlines that JavaScript (i.e. the "ECMAScript" mentioned in @dawg's third paragraph) recognizes:

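var input = "hello\r\n\f\v\u2028\u2029 world";
// (?!.) only succeeds before \r, \n, \u2028 or \u2029, so those are removed;
// \f and \v are not line terminators in ECMAScript and stay in the string.
var output = input.replace(/((?!.)\s)+/g, "");
document.write(output); // hello world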

Just replace /[\r\n]+/g with an empty string "".

It'll replace all \r and \n no matter what order they appear in the string.
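
One thing to keep in mind if the replacement is a delimiter rather than an empty string: the + makes a run of consecutive breaks collapse into a single match, so blank lines disappear. A small sketch of the difference (the "|" delimiter and the variable names are illustrative):

```javascript
const text = "a\r\n\r\nb"; // two CRLF line breaks, i.e. one blank line

// The character class with + treats the whole run as one match.
const collapsed = text.replace(/[\r\n]+/g, "|");

// Matching one line break at a time keeps the blank line as an empty field.
const perBreak = text.replace(/\r\n?|\n/g, "|");

console.log(collapsed); // a|b
console.log(perBreak);  // a||b
```

Which behavior is right depends on whether empty lines carry meaning in the file being parsed.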
