I'm trying to find names of people and companies (everything that is capitalized but not at the beginning of a sentence) in a large body of text. The purpose is to find as many instances as possible so that they can be XML-tagged properly.
This is what I've come up with so far:
[^\W](\s\b[\p{Lu}][\p{Lu}|\p{Ll}]+\b)+
It has two problems:
- It selects two characters too many in front of the hit. In the sentence "Is this Beetle ugly?" it finds "s Beetle", which complicates the subsequent tagging.
- When a capitalized word is preceded by an apostrophe or a colon, it isn't found. If possible, I'd like to limit the characters that mark the end of a sentence to just !?.
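Both problems can be reproduced outside the editor. The following is a minimal Java sketch, assuming that java.util.regex behaves like jEdit's regex search for this expression; the class name OriginalAttempt and the helper matches are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class OriginalAttempt {
    // The expression from the question, unchanged (including the '|' inside the class).
    static final Pattern ORIGINAL =
        Pattern.compile("[^\\W](\\s\\b[\\p{Lu}][\\p{Lu}|\\p{Ll}]+\\b)+");

    // Collects every full match (group 0) in the order it is found.
    static List<String> matches(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = ORIGINAL.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // Problem 1: the leading [^\W] and \s become part of the hit.
        System.out.println(matches("Is this Beetle ugly?"));
        // Problem 2: ':' is not a word character, so "La" is never reached;
        // only "Scala" is found, again with two extra characters in front.
        System.out.println(matches("It sings at the: La Scala opera house."));
    }
}
```

The first call yields "s Beetle" and the second yields "a Scala", which matches the two symptoms described in the list.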
Here's the sample text I'm using to test it out:
John Adams is my hero. There's just no limits to his imagination! Is this Beetle ugly? It sings at the: La Scala opera house. I have a dream that I will find work at' Frame Store but not in the USA! This way ILM could do whatever they pleased. ILM was very sweet. Visual Effects did a good job... Neither did Animatronix?
I'm using jEdit (http://jedit.org) since I need something that works on both Windows and OS X.
Update: this version now avoids matching at the start of the string.
(?<!(?:[!?\.]\s|^))(\b[\p{Lu}][\p{Lu}\p{Ll}]+\b)+
(?<!(?:[!?\.]\s|^)) is a negative lookbehind that ensures the match is not preceded by one of !?. followed by a whitespace character, or by the start of a new row.
I tested it with jEdit.
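For checking the lookbehind version outside jEdit, here is a minimal Java sketch. It assumes java.util.regex accepts the expression as written (its lookbehind allows alternatives of different, bounded lengths); the class name LookbehindFix and the method find are hypothetical. Note that without the MULTILINE flag, ^ only matches at the very start of the input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LookbehindFix {
    // Capitalized word that is NOT preceded by "[!?.] + whitespace" or by the start of the input.
    static final Pattern NAME =
        Pattern.compile("(?<!(?:[!?\\.]\\s|^))(\\b[\\p{Lu}][\\p{Lu}\\p{Ll}]+\\b)+");

    // Returns every match in document order.
    static List<String> find(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = NAME.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // "Is" (start of input) and "It" (after "? ") are skipped;
        // "La" is found even though a colon precedes it.
        System.out.println(find("Is this Beetle ugly? It sings at the: La Scala opera house."));
    }
}
```

On this input the sketch reports Beetle, La and Scala as three separate hits, with no extra characters in front of them.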
Update to cover names consisting of multiple words:
(?<!(?:[!?\.]\s|^))(\b[\p{Lu}][\p{Lu}\p{Ll}]*\b(?:\s\b[\p{Lu}][\p{Lu}\p{Ll}]*\b)*)+
                                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (added)
                                            ^ (changed)
I added the group (?:\s\b[\p{Lu}][\p{Lu}\p{Ll}]*\b)* to match optional following words that start with an uppercase letter. I also changed the + to a * so that the A in your example "My company's called A Few Good Men" is matched. However, this change now also causes the regex to match a standalone I as a name.
See tchrist's comment. Names are not a simple thing, and it gets really difficult if you want to cover the more complex cases.
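The multi-word behaviour, including the side effect on a standalone I, can be checked with a small Java sketch. It assumes java.util.regex as the engine; the class name MultiWordNames and the method find are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MultiWordNames {
    // First capitalized word (one or more letters), then any number of
    // whitespace-separated capitalized words.
    static final Pattern NAME = Pattern.compile(
        "(?<!(?:[!?\\.]\\s|^))"
        + "(\\b[\\p{Lu}][\\p{Lu}\\p{Ll}]*\\b(?:\\s\\b[\\p{Lu}][\\p{Lu}\\p{Ll}]*\\b)*)+");

    // Returns every match in document order.
    static List<String> find(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = NAME.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(find("It sings at the: La Scala opera house."));  // one hit: "La Scala"
        System.out.println(find("My company's called A Few Good Men"));      // one hit: "A Few Good Men"
        System.out.println(find("I think I saw it"));                        // side effect: the second "I"
    }
}
```

If the standalone I is a problem in practice, one option is to filter single-letter hits after matching rather than making the expression more complicated.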
This also works:
(?<!\p{P}\s)(\b[\p{Lu}][\p{Lu}|\p{Ll}]+\b)+
But \p{P} covers all punctuation, which, as I understand it, is not what you want. Maybe you can find a property that fits your needs on regular-expressions.info/unicode.html.
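To see why \p{P} is too broad for this task, the following Java sketch compares a \p{P} lookbehind with one restricted to !?. on text where a colon or an apostrophe precedes the name. It assumes java.util.regex; the class and method names are made up, and both patterns use the letter class without the | so that only the lookbehind differs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PunctuationLookbehind {
    // Rejects a word after ANY punctuation followed by whitespace.
    static final Pattern ANY_PUNCT =
        Pattern.compile("(?<!\\p{P}\\s)(\\b[\\p{Lu}][\\p{Lu}\\p{Ll}]+\\b)+");
    // Rejects a word only after '!', '?' or '.' followed by whitespace.
    static final Pattern SENTENCE_END =
        Pattern.compile("(?<![!?\\.]\\s)(\\b[\\p{Lu}][\\p{Lu}\\p{Ll}]+\\b)+");

    // Returns every match of the given pattern in document order.
    static List<String> find(Pattern p, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = p.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String colon = "it sings at the: La Scala opera house";
        String apostrophe = "work at' Frame Store";
        System.out.println(find(ANY_PUNCT, colon));         // "La" is lost
        System.out.println(find(SENTENCE_END, colon));      // "La" is kept
        System.out.println(find(ANY_PUNCT, apostrophe));    // "Frame" is lost
        System.out.println(find(SENTENCE_END, apostrophe)); // "Frame" is kept
    }
}
```

Because the colon and the apostrophe both belong to \p{P}, the broad lookbehind drops exactly the cases the question wants to keep.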
Another mistake in your expression is the | in the character class. It's not needed: inside a character class it is a literal, so you are just adding that character to your class, and with it the expression will match "words" like U|S|A. So just remove it:
(?<![!?\.]\s)(\b[\p{Lu}][\p{Lu}\p{Ll}]+\b)+
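The effect of the stray | can be shown in isolation. This is a minimal Java sketch, assuming java.util.regex; it uses only the word part of the expression (no lookbehind), and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PipeInClass {
    // '|' inside [...] is a literal pipe, not an alternation.
    static final Pattern WITH_PIPE =
        Pattern.compile("\\b[\\p{Lu}][\\p{Lu}|\\p{Ll}]+\\b");
    static final Pattern WITHOUT_PIPE =
        Pattern.compile("\\b[\\p{Lu}][\\p{Lu}\\p{Ll}]+\\b");

    // Returns every match of the given pattern in document order.
    static List<String> find(Pattern p, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = p.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "the U|S|A and the USA";
        System.out.println(find(WITH_PIPE, text));    // also picks up "U|S|A"
        System.out.println(find(WITHOUT_PIPE, text)); // only "USA"
    }
}
```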