开发者

Regular expression matching fully qualified class names

开发者 https://www.devze.com 2023-02-14 20:36 出处:网络
What is the best way to match fully quali开发者_StackOverflow社区fied Java class name in a text?

What is the best way to match fully quali开发者_StackOverflow社区fied Java class name in a text?

Examples: java.lang.Reflect, java.util.ArrayList, org.hibernate.Hibernate.


A Java fully qualified class name (lets say "N") has the structure

N.N.N.N

The "N" part must be a Java identifier. Java identifiers cannot start with a number, but after the initial character they may use any combination of letters and digits, underscores or dollar signs:

([a-zA-Z_$][a-zA-Z\d_$]*\.)*[a-zA-Z_$][a-zA-Z\d_$]*
------------------------    -----------------------
          N                           N

They can also not be a reserved word (like import, true or null). If you want to check plausibility only, the above is enough. If you also want to check validity, you must check against a list of reserved words as well.

Java identifiers may contain any Unicode letter instead of "latin only". If you want to check for this as well, use Unicode character classes:

([\p{Letter}_$][\p{Letter}\p{Number}_$]*\.)*[\p{Letter}_$][\p{Letter}\p{Number}_$]*

or, for short

([\p{L}_$][\p{L}\p{N}_$]*\.)*[\p{L}_$][\p{L}\p{N}_$]*

The Java Language Specification, (section 3.8) has all details about valid identifier names.

Also see the answer to this question: Java Unicode variable names


Here is a fully working class with tests, based on the excellent comment from @alan-moore

import static org.junit.Assert.assertFalse;
import static org.junit.Assert.assertTrue;

import java.util.regex.Pattern;

import org.junit.Test;

public class ValidateJavaIdentifier {

    private static final String ID_PATTERN = "\\p{javaJavaIdentifierStart}\\p{javaJavaIdentifierPart}*";
    private static final Pattern FQCN = Pattern.compile(ID_PATTERN + "(\\." + ID_PATTERN + ")*");

    public static boolean validateJavaIdentifier(String identifier) {
        return FQCN.matcher(identifier).matches();
    }


    @Test
    public void testJavaIdentifier() throws Exception {
        assertTrue(validateJavaIdentifier("C"));
        assertTrue(validateJavaIdentifier("Cc"));
        assertTrue(validateJavaIdentifier("b.C"));
        assertTrue(validateJavaIdentifier("b.Cc"));
        assertTrue(validateJavaIdentifier("aAa.b.Cc"));
        assertTrue(validateJavaIdentifier("a.b.Cc"));

        // after the initial character identifiers may use any combination of
        // letters and digits, underscores or dollar signs
        assertTrue(validateJavaIdentifier("a.b.C_c"));
        assertTrue(validateJavaIdentifier("a.b.C$c"));
        assertTrue(validateJavaIdentifier("a.b.C9"));

        assertFalse("cannot start with a dot", validateJavaIdentifier(".C"));
        assertFalse("cannot have two dots following each other",
                validateJavaIdentifier("b..C"));
        assertFalse("cannot start with a number ",
                validateJavaIdentifier("b.9C"));
    }
}


The pattern provided by Renaud works, but his original answer will always backtrack at the end.

To optimize it, you can essentially swap the first half with the last. Note the dot match that you also need to change.

The following is my version of it that, when compared to the original, runs about twice as fast:

String ID_PATTERN = "\\p{javaJavaIdentifierStart}\\p{javaJavaIdentifierPart}*";
Pattern FQCN = Pattern.compile(ID_PATTERN + "(\\." + ID_PATTERN + ")*");

I cannot write comments, so I decided to write an answer instead.


I came (on my own) to a similar answer (as Tomalak's answer), something as M.M.M.N:

([a-z][a-z_0-9]*\.)*[A-Z_]($[A-Z_]|[\w_])*

Where,

M = ([a-z][a-z_0-9]*\.)*
N = [A-Z_]($[A-Z_]|[\w_])*

However, this regular expression (unlike Tomalak's answer) makes more assumptions:

  1. The package name (The M part) will be only in lower case, the first character of M will be always a lower letter, the rest can mix underscore, lower letters and numbers.

  2. The Class Name (the N part) will always start with an Upper Case Letter or an underscore, the rest can mix underscore, letters and numbers. Inner Classes will always start with a dollar symbol ($) and must obey the class name rules described previously.

Note: the pattern \w is the XSD pattern for letters and digits (it does not includes the underscore symbol (_))

Hope this help.


The following class validates that a provided package name is valid:

import java.util.HashSet;

public class ValidationUtils {

    // All Java reserved words that must not be used in a valid package name.
    private static final HashSet reserved;

    static {
        reserved = new HashSet();
        reserved.add("abstract");reserved.add("assert");reserved.add("boolean");
        reserved.add("break");reserved.add("byte");reserved.add("case");
        reserved.add("catch");reserved.add("char");reserved.add("class");
        reserved.add("const");reserved.add("continue");reserved.add("default");
        reserved.add("do");reserved.add("double");reserved.add("else");
        reserved.add("enum");reserved.add("extends");reserved.add("false");
        reserved.add("final");reserved.add("finally");reserved.add("float");
        reserved.add("for");reserved.add("if");reserved.add("goto");
        reserved.add("implements");reserved.add("import");reserved.add("instanceof");
        reserved.add("int");reserved.add("interface");reserved.add("long");
        reserved.add("native");reserved.add("new");reserved.add("null");
        reserved.add("package");reserved.add("private");reserved.add("protected");
        reserved.add("public");reserved.add("return");reserved.add("short");
        reserved.add("static");reserved.add("strictfp");reserved.add("super");
        reserved.add("switch");reserved.add("synchronized");reserved.add("this");
        reserved.add("throw");reserved.add("throws");reserved.add("transient");
        reserved.add("true");reserved.add("try");reserved.add("void");
        reserved.add("volatile");reserved.add("while");
    }

    /**
     * Checks if the string that is provided is a valid Java package name (contains only
     * [a-z,A-Z,_,$], every element is separated by a single '.' , an element can't be one of Java's
     * reserved words.
     *
     * @param name The package name that needs to be validated.
     * @return <b>true</b> if the package name is valid, <b>false</b> if its not valid.
     */
    public static final boolean isValidPackageName(String name) {
        String[] parts=name.split("\\.",-1);
        for (String part:parts){
            System.out.println(part);
            if (reserved.contains(part)) return false;
            if (!validPart(part)) return false;
        }
        return true;
    }

    /**
     * Checks that a part (a word between dots) is a valid part to be used in a Java package name.
     * @param part The part between dots (e.g. *PART*.*PART*.*PART*.*PART*).
     * @return <b>true</b> if the part is valid, <b>false</b> if its not valid.
     */
    private static boolean validPart(String part){
        if (part==null || part.length()<1){
            // Package part is null or empty !
            return false;
        }
        if (Character.isJavaIdentifierStart(part.charAt(0))){
            for (int i = 0; i < part.length(); i++){
                char c = part.charAt(i);
                if (!Character.isJavaIdentifierPart(c)){
                    // Package part contains invalid JavaIdentifier !
                    return false;
                }
            }
        }else{
            // Package part does not begin with a valid JavaIdentifier !
            return false;
        }

        return true;
    }
}


shorter version of a working regexp:

\p{Alnum}[\p{Alnum}._]+\p{Alnum}


For string like com.mycompany.core.functions.CustomFunction I'm using ((?:(?:\w+)?\.[a-z_A-Z]\w+)+)


Following expression works perfectly fine for me.

^[a-z][a-z0-9_]*(\.[a-z0-9_]+)+$


I'll say something like ([\w]+\.)*[\w]+

But maybe I can be more specific knowing what you want to do with it ;)

0

精彩评论

暂无评论...
验证码 换一张
取 消

关注公众号