File.listFiles() mangles unicode names with JDK 6 (Unicode Normalization issues)

In Unicode, there is more than one valid way to represent the same letter.
The characters you’re using in your Tricky Name are a “latin small letter i with circumflex” and a “latin small letter a with ring above”.

You say “Note the %CC versus %C3 character representations”, but looking closer, what you actually see are the following byte sequences:

i 0xCC 0x82 vs. 0xC3 0xAE
a 0xCC 0x8A vs. 0xC3 0xA5

That is, the first is the letter i followed by 0xCC 0x82, which is the UTF-8 encoding of Unicode \u0302 “combining circumflex accent”, while the second is the UTF-8 encoding of \u00EE “latin small letter i with circumflex”. Similarly for the other pair: the first is the letter a followed by 0xCC 0x8A, the UTF-8 encoding of \u030A “combining ring above”, and the second is the UTF-8 encoding of \u00E5 “latin small letter a with ring above”. Both are valid UTF-8 encodings of valid Unicode character strings, but the first of each pair is in “decomposed” form and the second is in “composed” form.
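To see the two byte sequences from Java, here is a minimal sketch that prints the UTF-8 bytes of each form. The class name `Utf8Forms` and the helper `utf8Hex` are made up for this illustration; only `String.getBytes(Charset)` and `Charset.forName` are standard API (both are available in JDK 6).

```java
import java.nio.charset.Charset;

class Utf8Forms {
    private static final Charset UTF8 = Charset.forName("UTF-8");

    // Hypothetical helper: returns the UTF-8 bytes of s as space-separated hex.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(UTF8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask to 0..255 so negative bytes print as two hex digits.
            sb.append(String.format("0x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String composed = "\u00EE";    // single precomposed character
        String decomposed = "i\u0302"; // base letter plus combining accent

        System.out.println(utf8Hex(composed));   // 0xC3 0xAE
        System.out.println(utf8Hex(decomposed)); // 0x69 0xCC 0x82

        // The two strings look identical on screen but are not equal.
        System.out.println(composed.equals(decomposed)); // false
    }
}
```

The last line is the root of the problem: `String.equals` compares code units, so two names that look identical can still compare as different.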

OS X HFS Plus volumes store strings (e.g. filenames) in “fully decomposed” form. On other Unix filesystems, how a name is stored is up to the filesystem driver, so you can’t make any blanket statements across different types of filesystems.

See the Wikipedia article on Unicode Equivalence for general discussion of composed vs decomposed forms, which mentions OS X specifically.

See Apple’s Tech Q&A QA1235 (in Objective-C unfortunately) for information on converting forms.

A recent email thread on Apple’s java-dev mailing list could be of some help to you.

Basically, you need to normalize the decomposed form into a composed form before you can compare the strings.
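In Java 6 and later, `java.text.Normalizer` can do this conversion. The following is only a sketch of one way to apply it: the class `NameCompare` and the helpers `toComposed`, `sameName` and `findByName` are hypothetical names invented for this example, and it assumes NFC (canonical composition) is the form you want to compare in.

```java
import java.io.File;
import java.text.Normalizer;

class NameCompare {
    // Converts a name to composed form (NFC).
    static String toComposed(String name) {
        return Normalizer.normalize(name, Normalizer.Form.NFC);
    }

    // Compares two names after bringing both into the same form.
    static boolean sameName(String a, String b) {
        return toComposed(a).equals(toComposed(b));
    }

    // Hypothetical helper: finds an entry in dir whose name matches
    // wanted regardless of composed/decomposed form, or returns null.
    static File findByName(File dir, String wanted) {
        File[] entries = dir.listFiles();
        if (entries == null) {
            return null; // not a directory, or an I/O error occurred
        }
        for (File f : entries) {
            if (sameName(f.getName(), wanted)) {
                return f;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String composed = "\u00EE";    // as typed in source code
        String decomposed = "i\u0302"; // as a decomposing filesystem may return it

        System.out.println(composed.equals(decomposed));   // false
        System.out.println(sameName(composed, decomposed)); // true
    }
}
```

Normalizing both sides, rather than only the name returned by `File.listFiles()`, means the comparison does not depend on which form the filesystem or the caller happens to use.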
