The Mystery

In exploring precisely which characters were permitted in Java identifiers, I have stumbled upon something so extremely curious that it seems nearly certain to be a bug.

I’d expected to find that Java identifiers conformed to the requirement that they start with characters that have the Unicode property ID_Start and are followed by those with the property ID_Continue, with an exception granted for leading underscores and for dollar signs. That did not prove to be the case, and what I found is at extreme variance with that or any other idea of a normal identifier that I have heard of.

Short Demo

Consider the following demonstration proving that an ASCII ESC character (octal 033) is permitted in Java identifiers:

$ perl -le 'print qq(public class escape { public static void main(String argv[]) { String var_\033 = "i am escape: \033"; System.out.println(var_\033); }})' > escape.java
$ javac escape.java
$ java escape | cat -v
i am escape: ^[

It’s even worse than that, though. Almost infinitely worse, in fact. Even NULLs are permitted! And thousands of other code points that are not even identifier characters. I have tested this on Solaris, Linux, and a Mac running Darwin, and all give the same results.

Long Demo

Here is a test program that will show all these unexpected code points that Java quite outrageuosly allows as part of a legal identifier name.

#!/usr/bin/env perl
# 
# test-java-idchars - find which bogus code points Java allows in its identifiers
# 
#   usage: test-java-idchars [low high]
#   e.g.:  test-java-idchars 0 255
#
# Without arguments, tests Unicode code points
# from 0 .. 0x1000.  You may go further with a
# higher explicit argument.
#
# Produces a report at the end.
#
# You can ^C it prematurely to end the program then
# and get a report of its progress up to that point.
#
# Tom Christiansen
# tchrist@perl.com
# Sat Jan 29 10:41:09 MST 2011

use strict;
use warnings;

use encoding "Latin1";
use open IO => ":utf8";

use charnames ();

$| = 1;

my @legal;

my ($start, $stop) = (0, 0x1000);

if (@ARGV != 0) {
    if (@ARGV == 1) {
        for (($stop) = @ARGV) { 
            $_ = oct if /^0/;   # support 0OCTAL, 0xHEX, 0bBINARY
        }
    }
    elsif (@ARGV == 2) {
        for (($start, $stop) = @ARGV) { 
            $_ = oct if /^0/;
        }
    } 
    else {
        die "usage: $0 [ [start] stop ]\n";
    } 
} 

for my $cp ( $start .. $stop ) {
    my $char = chr($cp);

    next if $char =~ /[\s\w]/;

    my $type = "?";
    for ($char) {
        $type = "Letter"      if /\pL/;
        $type = "Mark"        if /\pM/;
        $type = "Number"      if /\pN/;
        $type = "Punctuation" if /\pP/;
        $type = "Symbol"      if /\pS/;
        $type = "Separator"   if /\pZ/;
        $type = "Control"     if /\pC/;
    } 
    my $name = $cp ? (charnames::viacode($cp) || "<missing>") : "NULL";
    next if $name eq "<missing>" && $cp > 0xFF;
    my $msg = sprintf("U+%04X %s", $cp, $name);
    print "testing \\p{$type} $msg...";
    open(TESTPROGRAM, ">:utf8", "testchar.java") || die $!;

print TESTPROGRAM <<"End_of_Java_Program";

public class testchar { 
    public static void main(String argv[]) { 
        String var_$char = "variable name ends in $msg";
        System.out.println(var_$char); 
    }
}

End_of_Java_Program

    close(TESTPROGRAM) || die $!;

    system q{
        ( javac -encoding UTF-8 testchar.java \
            && \
          java -Dfile.encoding=UTF-8 testchar | grep variable \
        ) >/dev/null 2>&1
    };

    push @legal, sprintf("U+%04X", $cp) if $? == 0;

    if ($? && $? < 128) {
        print "<interrupted>\n";
        exit;  # from a ^C
    } 

    printf "is %s in Java identifiers.\n",  
        ($? == 0) ? uc "legal" : "forbidden";

} 

END { 
    print "Legal but evil code points: @legal\n";
}

Here is a sample of running that program on just the first 33 code points that are neither whitespace nor identifier characters:

$ perl test-java-idchars 0 0x20
testing \p{Control} U+0000 NULL...is LEGAL in Java identifiers.
testing \p{Control} U+0001 START OF HEADING...is LEGAL in Java identifiers.
testing \p{Control} U+0002 START OF TEXT...is LEGAL in Java identifiers.
testing \p{Control} U+0003 END OF TEXT...is LEGAL in Java identifiers.
testing \p{Control} U+0004 END OF TRANSMISSION...is LEGAL in Java identifiers.
testing \p{Control} U+0005 ENQUIRY...is LEGAL in Java identifiers.
testing \p{Control} U+0006 ACKNOWLEDGE...is LEGAL in Java identifiers.
testing \p{Control} U+0007 BELL...is LEGAL in Java identifiers.
testing \p{Control} U+0008 BACKSPACE...is LEGAL in Java identifiers.
testing \p{Control} U+000B LINE TABULATION...is forbidden in Java identifiers.
testing \p{Control} U+000E SHIFT OUT...is LEGAL in Java identifiers.
testing \p{Control} U+000F SHIFT IN...is LEGAL in Java identifiers.
testing \p{Control} U+0010 DATA LINK ESCAPE...is LEGAL in Java identifiers.
testing \p{Control} U+0011 DEVICE CONTROL ONE...is LEGAL in Java identifiers.
testing \p{Control} U+0012 DEVICE CONTROL TWO...is LEGAL in Java identifiers.
testing \p{Control} U+0013 DEVICE CONTROL THREE...is LEGAL in Java identifiers.
testing \p{Control} U+0014 DEVICE CONTROL FOUR...is LEGAL in Java identifiers.
testing \p{Control} U+0015 NEGATIVE ACKNOWLEDGE...is LEGAL in Java identifiers.
testing \p{Control} U+0016 SYNCHRONOUS IDLE...is LEGAL in Java identifiers.
testing \p{Control} U+0017 END OF TRANSMISSION BLOCK...is LEGAL in Java identifiers.
testing \p{Control} U+0018 CANCEL...is LEGAL in Java identifiers.
testing \p{Control} U+0019 END OF MEDIUM...is LEGAL in Java identifiers.
testing \p{Control} U+001A SUBSTITUTE...is LEGAL in Java identifiers.
testing \p{Control} U+001B ESCAPE...is LEGAL in Java identifiers.
testing \p{Control} U+001C INFORMATION SEPARATOR FOUR...is forbidden in Java identifiers.
testing \p{Control} U+001D INFORMATION SEPARATOR THREE...is forbidden in Java identifiers.
testing \p{Control} U+001E INFORMATION S开发者_开发百科EPARATOR TWO...is forbidden in Java identifiers.
testing \p{Control} U+001F INFORMATION SEPARATOR ONE...is forbidden in Java identifiers.
Legal but evil code points: U+0000 U+0001 U+0002 U+0003 U+0004 U+0005 U+0006 U+0007 U+0008 U+000E U+000F U+0010 U+0011 U+0012 U+0013 U+0014 U+0015 U+0016 U+0017 U+0018 U+0019 U+001A U+001B

And here is another demo:

$ perl test-java-idchars 0x600 0x700 | grep -i legal
testing \p{Control} U+0600 ARABIC NUMBER SIGN...is LEGAL in Java identifiers.
testing \p{Control} U+0601 ARABIC SIGN SANAH...is LEGAL in Java identifiers.
testing \p{Control} U+0602 ARABIC FOOTNOTE MARKER...is LEGAL in Java identifiers.
testing \p{Control} U+0603 ARABIC SIGN SAFHA...is LEGAL in Java identifiers.
testing \p{Control} U+06DD ARABIC END OF AYAH...is LEGAL in Java identifiers.
Legal but evil code points: U+0600 U+0601 U+0602 U+0603 U+06DD

The Question

Can anyone please explain this seemingly insane behavior? There are many, many, many other inexplicably permitted code points all over the place, starting right off with U+0000, which is perhaps the strangest of all. If you run it on the first 0x1000 code points, you do see certain patterns appear, such as permitting any and all code points with the property Current_Symbol. But too much else is wholly inexplicable, at least by me.

The Java Language Specification section 3.8 defers to Character.isJavaIdentifierStart() and Character.isJavaIdentifierPart(). The latter, among other conditions, has Character.isIdentifierIgnorable(), which allows non-whitespace control characters (including whole C1 range, see the link for the list).

Another question might be: Why shouldn't Java allow control characters in its identifiers?

A good principle when designing a language or other system, is to not forbid anything without good cause, since you never know how it might be used, and the less rules implementers and users have to contend with, the better.

It is true that you certainly shouldn't take advantage of this, by actually embedding escapes into your variable names, and you won't see any popular libraries that expose classes with null characters in them.

Certainly, this could be abused, but it isn't the language designers job to protect programmers from themselves in this way, any more than by forcing proper indentation or well-chosen variable names.

I don't see what's the big deal. How does it affect you in anyway?

If a developer wants to obfuscate his code, he can do it with ASCII.

If a developer wants to make his code understandable, he will use the lingua franca of the industry: English. Not only identifiers are ASCII only, but also from common English words. Otherwise, nobody will use or read his code, he can use whatever crazy characters he likes.