The objective of this post is to determine the number of single-character identifiers that can be constructed in the Java programming language. The first response is usually 54, the underlying logic being that there are 26 lowercase characters, 26 uppercase characters, the underscore (_) character and the dollar ($) character. But there is more to Java identifiers than meets the eye.
Let us see how the Java Language Specification defines an identifier.
An identifier is an unlimited-length sequence of Java letters and Java digits, the first of which must be a Java letter.
Since we have already restricted ourselves to single-character identifiers, the 'unlimited-length sequence' part is not relevant. Further, a 'Java letter' is any Unicode character that counts as a letter, and this is where things start becoming interesting as well as complex. Java allows programmers to code using identifiers of their native language. This means that the letters are not limited to the English language alone, but include those of other languages such as Hindi, Urdu, Japanese, Korean and so on. Since the first character must be a Java letter, special characters and numerals (of any language) cannot appear as the first character.
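As a quick illustration, here is a small sketch using identifiers written in Devanagari and Greek. The variable names are made up for the example, and it assumes the source file is compiled with a UTF-8 encoding (for instance, javac -encoding UTF-8):

class NativeIdentifiers {
    public static void main( String args[] ) {
        int संख्या = 42;        // "number" in Hindi, written in Devanagari letters
        double πλάτος = 3.14;  // "width" in Greek
        System.out.println( संख्या + " " + πλάτος );
    }
}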
Now, the specification also says that the dollar sign and the underscore are treated as letters for historical reasons. What the specification does not spell out is that currency symbols other than the dollar are acceptable as well, meaning that the Euro, Pound, Yen, Rupee etc. are also allowed. However, the specification does mention that the method Character.isJavaIdentifierStart( int ) returns true for exactly those characters that are treated as Java letters.
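We can verify this with a small sketch; the particular currency symbols below are my own picks, and each line prints true:

class CurrencySymbolCheck {
    public static void main( String args[] ) {
        // '$' and '_' from the specification, plus a few other currency symbols
        char[] symbols = { '$', '_', '€', '£', '¥', '₹' };
        for( char c : symbols ) {
            System.out.println( c + " : " + Character.isJavaIdentifierStart( c ) );
        }
    }
}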
So we can easily write a loop that iterates through all Unicode code points and counts the number of times the method Character.isJavaIdentifierStart( int ) returns true. However, one more issue must be addressed, and that is the choice of the initial and final values for the loop. The original Unicode specification defined characters as fixed-width 16-bit entities, but the Unicode Standard has since been changed to allow for characters whose representation requires more than 16 bits. This may seem complex, but the good thing is that two fields in the Character class simplify our job: MIN_CODE_POINT and MAX_CODE_POINT. MIN_CODE_POINT stores the minimum value of a Unicode code point, the constant U+0000, and MAX_CODE_POINT stores the maximum value, the constant U+10FFFF.
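Before the full program, here is a short sketch that prints the two constants and shows that a code point above U+FFFF, such as U+1D400 (MATHEMATICAL BOLD CAPITAL A, chosen purely as an example), no longer fits in a single 16-bit char yet can still start an identifier:

class CodePointRange {
    public static void main( String args[] ) {
        System.out.printf( "MIN_CODE_POINT: U+%04X%n", Character.MIN_CODE_POINT );
        System.out.printf( "MAX_CODE_POINT: U+%04X%n", Character.MAX_CODE_POINT );
        int cp = 0x1D400; // a code point beyond the 16-bit range
        System.out.println( "chars needed: " + Character.charCount( cp ) );                        // prints 2
        System.out.println( "can start an identifier: " + Character.isJavaIdentifierStart( cp ) ); // prints true
    }
}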
The Java program to find out the number of valid single-character identifiers using the above logic is as follows:
class SingleCharacterIdentifiers {
    public static void main( String args[] ) {
        int count = 0;
        // Walk the entire Unicode code point range and count the code
        // points that may start (and hence form) a Java identifier.
        for( int i = Character.MIN_CODE_POINT; i <= Character.MAX_CODE_POINT; i++ ) {
            if( Character.isJavaIdentifierStart( i ) ) {
                count++;
            }
        }
        System.out.println( "Count: " + count );
    }
}
The output of the above program in Java 8 is as follows:
101296
Thus, we see that, due to the evolutionary nature of Unicode, one would get different answers on different Java versions. One additional point worth mentioning is that the use of the underscore character as a single-character identifier is deprecated in Java 8: the compiler warns that it might not be supported in the releases after Java SE 8.
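For illustration, the snippet below triggers that warning when compiled with javac on Java 8; from Java 9 onward, '_' is in fact a reserved keyword, so the same line becomes a compile-time error:

class UnderscoreIdentifier {
    public static void main( String args[] ) {
        // javac 8: warning that '_' may not be supported after Java SE 8
        // javac 9 and later: compile-time error, '_' is a keyword
        int _ = 10;
        System.out.println( _ );
    }
}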