Regular expression needed to identify where sentences don't have a space between them

开发者 https://www.devze.com 2023-01-29 17:25 Source: the web
I need a regular expression to identify all instances where a sentence begins without a space following the previous period.


For example, this is a bad sentence:

I'm sentence one.This is sentence two.

This needs to be fixed as follows:

I'm sentence one. This is sentence two.

It's not simply a case of doing a string replace of '.' with '. ', because there are also a lot of instances where the rest of the sentences in the paragraph already have the correct spacing, and this would give those an extra space.
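To make the problem concrete, here is a minimal sketch of the naive replacement. It is written in Python purely for illustration, and the sample text is made up for the example.

```python
# Sample paragraph: the first boundary is missing its space, the second is fine.
text = "I'm sentence one.This is sentence two. This is sentence three."

# Naive approach: blindly turn every period into a period plus a space.
naive = text.replace(".", ". ")

# The correctly spaced boundary now has two spaces, and the final period
# has picked up a trailing space.
print(repr(naive))
# "I'm sentence one. This is sentence two.  This is sentence three. "
```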


\.(?!\s) will match any period that is not followed by whitespace. You probably want to catch exclamation marks and question marks as well, though: [\.\!\?](?!\s)

Edit: C# supports lookaheads, so you can also try this: [\.\!\?](?!\s|$). It won't match the punctuation at the end of the string.
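A minimal sketch of how this lookahead pattern could drive the actual fix. It uses Python for illustration rather than C#, and the function name add_missing_spaces is made up for the example. The backslashes inside the character class are dropped because they are not needed there.

```python
import re

# Sentence-ending punctuation that is NOT followed by whitespace
# and is NOT at the end of the string.
MISSING_SPACE = re.compile(r"[.!?](?!\s|$)")

def add_missing_spaces(text: str) -> str:
    # \g<0> re-inserts the matched punctuation mark; a space is appended after it.
    return MISSING_SPACE.sub(r"\g<0> ", text)

print(add_missing_spaces("I'm sentence one.This is sentence two. Third one!Fourth?"))
# I'm sentence one. This is sentence two. Third one! Fourth?
```

Correctly spaced boundaries are left alone because the lookahead fails on them, so no double spaces are introduced.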


You could search for \w\.[A-Z] to find a word character, followed by a period, followed by a capital letter, to identify these. For a find/replace, find (\w\.)([A-Z]) and replace with $1 $2.


I doubt that you can create a regular expression that will work in the general case.

Any regex solution you come up with is going to have some interesting edge cases that you'll have to look at carefully. For example, the abbreviation "i.e." would become "i. e." (i.e., it will have an extra space and, if this parenthetical comment were run through the regex, it would become "i. e. ,").

Also, the proper way to quote text is to include the punctuation inside the quotes, as in "He said it was okay." If you had ["He said it was okay."This is a new sentence.], your regex solution might put a space before the final quote, or might ignore the error altogether.

Those are just two cases that come to mind immediately. There are plenty of others.
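Both edge cases can be reproduced with a short sketch. It assumes the lookahead pattern suggested in an earlier answer and uses Python for illustration; the helper name naive_fix is hypothetical.

```python
import re

def naive_fix(text: str) -> str:
    # Insert a space after any ., ! or ? that is not followed by
    # whitespace or the end of the string.
    return re.sub(r"[.!?](?!\s|$)", r"\g<0> ", text)

# The abbreviation is split apart, and a space lands before the comma.
print(naive_fix("(i.e., like this)"))
# (i. e. , like this)

# The space is inserted before the closing quote instead of after it.
print(naive_fix('"He said it was okay."This is a new sentence.'))
# "He said it was okay. "This is a new sentence.
```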

Whereas a regular expression will work in a limited set of simple sentences, real written language will quickly show that regular expressions are insufficient to provide a general solution to this problem.


If a sentence ends with an ellipsis, e.g. "...", you probably don't want to change it to ". . .".

I think the previous answers don't consider this case.

Try inserting a space wherever a word and its closing punctuation are immediately followed by a new word that starts with an uppercase letter.

Find: (\w+[\.!?])([A-Z]'?\w+)
Replace: $1 $2
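A minimal sketch of this find/replace, written in Python for illustration. Python writes the backreferences as \1 and \2 rather than $1 and $2, and the function name split_run_on is made up for the example.

```python
import re

# Group 1: a word plus its closing punctuation.
# Group 2: a capitalized word, optionally with an apostrophe after the
#          first letter (as in "I'm").
RUN_ON = re.compile(r"(\w+[.!?])([A-Z]'?\w+)")

def split_run_on(text: str) -> str:
    return RUN_ON.sub(r"\1 \2", text)

print(split_run_on("I'm sentence one.This is sentence two."))
# I'm sentence one. This is sentence two.

print(split_run_on("Use a regex, i.e. a pattern... Done.I'm happy."))
# Use a regex, i.e. a pattern... Done. I'm happy.
```

Because group 2 requires an uppercase letter, "i.e." and "..." are left untouched. One limitation of the pattern as written: a one-letter word such as "A" or "I" followed by a space does not satisfy the trailing \w+, so "Done.A new one" is not changed.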


Best website ever: http://www.regular-expressions.info/reference.html
