Parsing Spanish family names

A Spanish family name consists of three parts:

The paternal name,
The optional maternal name,
The optional spouse's paternal name.

Each of these three parts is one single word that may be preceded by "De", "Del", "De La", "De Los" or "De Las". Each of these prefixes starts with a capital and there may be only one of them for each part. The spouse's paternal name is separated from the rest by the word "de" (no capital).

So valid family names would be:

Pérez
Pérez De León
López de López
De La Oca Ordóñez
Castillo Ramírez de Del Valle

I can parse these names with this regex:

^((?:De |Del |De La |De Los |De Las )?\w+)?( (?:De |Del |De La |De Los |De Las )?\w+)?( de (?:De |Del |De La |De Los |De Las )?\w+)?$

1.) Can this ugly regex be simplified?

2.) When the paternal name is the same as the maternal name, the word "y" is inserted between them. So "López y López de De León" and "Pérez y Pérez" are both valid, but "López y Pérez" and "Gómez y de Gómez" are not. How can I capture this case?

Thank you very much.

The exact answer depends on what programming language and/or regex engine you're using, but for most implementations, you should be able to do the following:

(1.) Make a separate regex that matches a single part of a name and then include this in the final regex, e.g., in Perl:

my $name1 = qr/(?:De |Del |De La |De Los |De Las )?\w+/;
my $name2 = qr/^($name1)( $name1)?( de $name1)?$/;

(I assume you don't want the ? after the first capture, as otherwise you'd match the empty string.) $name2 is then the regex to match against.
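For comparison, here is a rough sketch of the same composition in Java. The class name FamilyNamePattern and the parse helper are made up for illustration. It also swaps \w+ for \p{L}+, because Java's \w does not match accented letters unless you enable UNICODE_CHARACTER_CLASS; that substitution is my assumption, not part of the regex above.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FamilyNamePattern {
    // One name part: an optional capitalized prefix followed by a single word.
    // \p{L}+ (any Unicode letters) stands in for \w+ so "Pérez" matches in Java.
    static final String PART = "(?:De La |De Los |De Las |Del |De )?\\p{L}+";

    // Paternal part, then an optional maternal part, then an optional
    // spouse part introduced by a lowercase " de ".
    static final Pattern FAMILY_NAME = Pattern.compile(
            "^(" + PART + ")(?: (" + PART + "))?(?: de (" + PART + "))?$");

    /** Returns {paternal, maternal, spouse} (absent parts are null), or null if there is no match. */
    static String[] parse(String name) {
        Matcher m = FAMILY_NAME.matcher(name);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String[] samples = {
            "Pérez", "Pérez De León", "López de López",
            "De La Oca Ordóñez", "Castillo Ramírez de Del Valle"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + Arrays.toString(parse(s)));
        }
    }
}
```

With this sketch, parse("Castillo Ramírez de Del Valle") gives paternal "Castillo", maternal "Ramírez" and spouse "Del Valle". Like the original regex, it still accepts a lowercase "de" or "y" as an ordinary name word (for example "López de"), so treat it as a starting point only.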

(2.) Strictly speaking, regular expressions in the formal, computer-science sense cannot test whether an arbitrary substring that appears at one point in the string also appears at another point. However, most regex implementations (e.g., Perl-compatible "regular expressions") support more features than a true regular-expression engine would, so you could use a backreference like:

my $name2 = qr/^(?:($name1)( $name1)?|($name1) y \3)( de $name1)?$/;

In PCREs, the \3 matches the exact same string that the third (...) group matches. If you can't use backreferences for some reason, your only option is to use a regex like:

my $name2 = qr/^(?:($name1)( $name1)?|($name1) y ($name1))( de $name1)?$/;

and then, if $3 and $4 are defined after matching, test to see if they're equal or not. (Note that both of the above will allow names like "López López" without a "y"; if you want to prohibit those, it'll be a bit harder.)
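Here is one way the backreference idea might look in Java, as a sketch under a few assumptions of mine. The class name FamilyNameWithY, the parse helper and the group names (same, pat, mat, spouse) are all hypothetical. It uses named groups with \k<same> in place of \3, and \p{L}+ in place of \w+. It also goes two steps beyond the regexes above: a lookahead stops a bare "y" or "de" from being read as a name word, and a negative lookahead rejects "López López" without the "y".

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FamilyNameWithY {
    // One name part: optional capitalized prefix plus a single word.
    // The lookahead refuses a bare "y" or "de" as the word itself.
    static final String PART =
            "(?:De La |De Los |De Las |Del |De )?(?!(?:y|de)(?: |$))\\p{L}+";

    static final Pattern FAMILY_NAME = Pattern.compile(
            "^(?:"
            // Case 1: identical paternal and maternal parts joined by " y ".
            + "(?<same>" + PART + ") y \\k<same>"
            + "|"
            // Case 2: paternal part, then an optional maternal part that must
            // differ from it (the negative lookahead rejects a repeat without "y").
            + "(?<pat>" + PART + ")(?: (?!\\k<pat>(?: |$))(?<mat>" + PART + "))?"
            // Either case may be followed by the spouse part after " de ".
            + ")(?: de (?<spouse>" + PART + "))?$");

    /** Returns {paternal, maternal, spouse} (absent parts are null), or null if there is no match. */
    static String[] parse(String name) {
        Matcher m = FAMILY_NAME.matcher(name);
        if (!m.matches()) {
            return null;
        }
        String same = m.group("same");
        String paternal = (same != null) ? same : m.group("pat");
        String maternal = (same != null) ? same : m.group("mat");
        return new String[] { paternal, maternal, m.group("spouse") };
    }

    public static void main(String[] args) {
        String[] samples = {
            "López y López de De León", "Pérez y Pérez",
            "López y Pérez", "Gómez y de Gómez", "López López"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + Arrays.toString(parse(s)));
        }
    }
}
```

In this sketch the first two samples parse and the last three return null. If your engine has no backreferences, you would drop \k<same> and the negative lookahead, capture both parts, and compare them in code after matching, as described above.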

Here's my attempt. It seems to work with the examples given:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Foo {

    public static void main(String[] args) throws Exception {
        System.out.println(new SpanishName("Pérez"));
        System.out.println(new SpanishName("Pérez De León"));
        System.out.println(new SpanishName("López de López"));
        System.out.println(new SpanishName("De La Oca Ordóñez"));
        System.out.println(new SpanishName("Castillo Ramírez de Del Valle"));
        System.out.println(new SpanishName("López y López de De León"));
        System.out.println(new SpanishName("Pérez y Pérez"));

        // System.out.println(new SpanishName("López y Pérez")); - Throws IAE
        // System.out.println(new SpanishName("Gómez y de Gómez")); - Throws IAE
    }

    public static class SpanishName {

        private final String paternal;
        private final String maternal;
        private final String spousePaternal;

        private static final Pattern NAME_REGEX = Pattern
                .compile("^([\\p{Ll}\\p{Lu}]+?)(?:\\s([\\p{Ll}\\p{Lu}]+?))?(?:\\s([\\p{Ll}\\p{Lu}]+?))?$");

        public SpanishName(String str) {
            str = stripJoinWords(str);
            str = removeYJoin(str);
            final Matcher matcher = NAME_REGEX.matcher(str);
            if (str.contains(" y ") || !matcher.matches()) {
                throw new IllegalArgumentException(String.format("'%s' is not a valid Spanish name", str));
            } else {
                paternal = matcher.group(1);
                maternal = matcher.group(2);
                spousePaternal = matcher.group(3);
            }
        }

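        // Collapses "X y X" into "X X". The trailing lookahead stops the
        // backreference from matching only the start of a longer word.
        private String removeYJoin(final String str) {
            return str.replaceFirst("^([\\p{Ll}\\p{Lu}]+?) y \\1(?![\\p{Ll}\\p{Lu}])", "$1 $1");
        }

        // Removes the prefixes "De", "Del", "De La", "De Los", "De Las" and the "de" separator.
        // The second lookbehind leaves words that merely end in "de" (e.g. "Valverde") intact.
        private String stripJoinWords(final String str) {
            return str.replaceAll("(?<!\\sy\\s)(?<![\\p{Ll}\\p{Lu}])[Dd]e(?:l| La| Los| Las)?\\s", "");
        }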

        @Override
        public String toString() {
            return String.format("paternal = %s, maternal = %s, spousePaternal = %s", paternal, maternal,
                    spousePaternal);
        }
    }
}

Rather than using a regex, you could try a service that does a pretty amazing job at this: https://www.nameapi.org/en/demos/name-parser/. It's open source, and instead of a regex it combines data gathered from phone books with a pretty sophisticated set of rules.