Select capitalized & all-caps words using RegEx

开发者 https://www.devze.com 2023-03-27 09:17 Source: the web
I'm trying to find names of people and companies (everything that is capitalized but not in the beginning of a sentence) in a large body of text. The purpose is to find as many instances as possible so that they can be XML-tagged properly.

This is what I've come up with so far:

[^\W](\s\b[\p{Lu}][\p{Lu}|\p{Ll}]+\b)+

It has two problems:

  1. It selects two characters too many in front of the hit. In the sentence "Is this Beetle ugly?" it finds "s Beetle", which complicates the subsequent tagging.
  2. When a capitalized word is preceded by an apostrophe or a colon, it isn't found. If possible, I'd like to limit the characters that mark the end of a sentence to just !?.
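
Problem 1 can be reproduced outside jEdit. The following is a minimal Python sketch, not the jEdit setup itself: Python's built-in re module has no \p{Lu} or \p{Ll}, so [A-Z] and [a-z] are used as ASCII-only stand-ins (an assumption that ignores accented capitals), and the function name first_hit is made up for illustration.

```python
import re

# The asker's pattern with ASCII stand-ins for \p{Lu} and \p{Ll}
# (assumption: Python's re module has no Unicode property escapes).
ORIGINAL = re.compile(r"[^\W](\s\b[A-Z][A-Z|a-z]+\b)+")

def first_hit(text):
    """Return (whole match, group 1) for the first hit, or None."""
    m = ORIGINAL.search(text)
    return (m.group(0), m.group(1)) if m else None

print(first_hit("Is this Beetle ugly?"))  # ('s Beetle', ' Beetle')
```

The leading [^\W] consumes the last letter of the preceding word and \s consumes the space, which is where the two extra characters come from.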

Here's the sample text I'm using to test it out:

John Adams is my hero. There's just no limits to his imagination! Is this Beetle ugly? It sings at the: La Scala opera house. I have a dream that I will find work at' Frame Store but not in the USA! This way ILM could do whatever they pleased. ILM was very sweet. Visual Effects did a good job... Neither did Animatronix?

I'm using jEdit (http://jedit.org) since I need something that works on both Windows and OS X.


Update: this now avoids matching at the start of the string.

(?<!(?:[!?\.]\s|^))(\b[\p{Lu}][\p{Lu}\p{Ll}]+\b)+

(?<!(?:[!?\.]\s|^)) is a negative lookbehind that ensures the match is not preceded by one of !?. followed by a whitespace character, or by the start of a line.
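
As a rough illustration of the lookbehind, here is a Python sketch under two assumptions that differ from the jEdit regex: [A-Z] and [a-z] stand in for \p{Lu} and \p{Ll}, and the single lookbehind is split into two, because Python's re module only accepts fixed-width lookbehinds. The name capitalized_words is hypothetical.

```python
import re

# Assumptions: ASCII classes replace \p{Lu}/\p{Ll}, and the lookbehind
# (?<!(?:[!?\.]\s|^)) is written as two fixed-width lookbehinds.
SINGLE_WORD = re.compile(r"(?<![!?.]\s)(?<!^)\b[A-Z][A-Za-z]+\b", re.MULTILINE)

def capitalized_words(text):
    """Capitalized words that do not start a sentence or a line."""
    return [m.group(0) for m in SINGLE_WORD.finditer(text)]

sample = "Is this Beetle ugly? It sings at the: La Scala opera house."
print(capitalized_words(sample))  # ['Beetle', 'La', 'Scala']
```

"Is" is skipped because it sits at the start of the line, "It" because it follows "? ", while "La" is kept because a colon is not in the set !?.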

I tested it with jEdit.

Update to cover names consisting of multiple words:

(?<!(?:[!?\.]\s|^))(\b[\p{Lu}][\p{Lu}\p{Ll}]*\b(?:\s\b[\p{Lu}][\p{Lu}\p{Ll}]*\b)*)+
                                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (added)
                                            ^ (changed)

I added the group (?:\s\b[\p{Lu}][\p{Lu}\p{Ll}]*\b)* to match optional following words that start with an uppercase letter. I also changed the + to a * so that the A in your example "My company's called A Few Good Men" is matched. A side effect of this change is that the regex now also matches the pronoun I as a name.
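
The multi-word behaviour, including the side effect on I, can be sketched in Python. This is only an approximation of the regex above: [A-Z] and [a-z] are ASCII stand-ins for \p{Lu} and \p{Ll}, the lookbehind is split into two fixed-width ones as Python's re requires, and the outer (...)+ wrapper is omitted, which should not change what is matched. The name find_names is hypothetical.

```python
import re

NAMES = re.compile(
    r"(?<![!?.]\s)(?<!^)"          # not after !?. plus whitespace, not at line start
    r"\b[A-Z][A-Za-z]*\b"          # first word; * lets a lone capital such as A match
    r"(?:\s\b[A-Z][A-Za-z]*\b)*",  # optional further capitalized words
    re.MULTILINE,
)

def find_names(text):
    """Runs of capitalized words that do not start a sentence or a line."""
    return [m.group(0) for m in NAMES.finditer(text)]

print(find_names("My company's called A Few Good Men. Then I left."))
# ['A Few Good Men', 'I']
```

If the lone I is unwanted, one option is to filter single-letter hits after matching rather than complicating the pattern further.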

See tchrist's comment. Names are not simple, and it gets really difficult if you want to cover the more complex cases.

This also works:

(?<!\p{P}\s)(\b[\p{Lu}][\p{Lu}|\p{Ll}]+\b)+

But \p{P} covers all punctuation, which, as I understand it, is not what you want. Maybe you can find a property that fits your needs at regular-expressions.info/unicode.html.

Another mistake in your expression is the | in the character class. It's not needed: inside a character class it is a literal character, so the class also matches words like U|S|A. Just remove it:

(?<![!?\.]\s)(\b[\p{Lu}][\p{Lu}\p{Ll}]+\b)+