UTF8 locale MFC for DragonflyBSD [tjr at FreeBSD.org: cvs commit: src/etc/mtree BSD.usr.dist src/share/colldef Makefile src/share/mklocale Makefile UTF-8.src src/share/monetdef Makefile be_BY.UTF-8.src bg_BG.UTF-8.src cs_CZ.UTF-8.src en_GB.UTF-8.src...]

Sun Mar 28 18:35:58 PST 2004

Xin LI wrote:
> Having utf8'ized locales in DragonFlyBSD
> will bring better internationalization, and personally, I believe that
> making utf-8 the default internal locale will make DragonFly the best
> UNIX-like platform to write internationalized applications.

Agreed wholeheartedly.

Out of curiosity, do you know how the FreeBSD folks handled some of the 
stranger UTF-8 behavior?  In particular:

1. Representation of embedded NUL characters.  The UTF-8 spec says this 
is one byte == 00000000b; Java and a few others, though, have used a 
double-byte encoding (110/00000 + 10/000000) so that stuff like strlen() 
works "reasonably."

2. Maximum size of a character.  To represent UCS-2 characters, you only 
needed up to 3 bytes (1110/xxxx + 10/xxxxxx + 10/xxxxxx). 
Unfortunately, characters beyond U+FFFF (the ones UTF-16 reaches with 
surrogate pairs) need 4 bytes when encoded directly, or 6 bytes when 
each surrogate is encoded as its own 3-byte sequence. 
Ironically, UTF-8 unaware routines handle these fine; some of my older 
UTF-8 handling routines, though, barf on anything at or above U+10000. 
(Fortunately, none of these have escaped Neolinear... I hope...)

3. Security issues.  The UTF-8 and Unicode FAQ [1] states that "a UTF-8 
decoder must not accept UTF-8 sequences that are longer than necessary 
to encode a character," noting that "any overlong UTF-8 sequence could 
be abused to bypass UTF-8 substring tests that look only for the 
shortest possible encoding."
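
To make the three points concrete, here is a rough sketch of a strict 
decoder.  It is not FreeBSD's or DragonFly's code; utf8_decode_strict() 
is a name I made up, and it assumes the RFC 3629 rules (at most 4 bytes 
per character, no encoded surrogates, nothing above U+10FFFF).  Under 
those assumptions the two-byte NUL from point 1, the 6-byte surrogate 
form from point 2, and the overlong sequences from point 3 are all 
rejected.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, for illustration only.  Decodes one character
 * from the n bytes at s.  Returns the number of bytes consumed (1-4)
 * and stores the code point in *out, or returns 0 if the input is
 * truncated, malformed, overlong, a surrogate, or above U+10FFFF.
 */
size_t
utf8_decode_strict(const unsigned char *s, size_t n, uint32_t *out)
{
	/* Smallest code point that legitimately needs each length. */
	static const uint32_t min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
	size_t len, i;
	uint32_t cp;

	assert(s != NULL && out != NULL);
	if (n == 0)
		return 0;

	if (s[0] < 0x80) {			/* 0xxxxxxx */
		len = 1;
		cp = s[0];
	} else if ((s[0] & 0xE0) == 0xC0) {	/* 110xxxxx */
		len = 2;
		cp = s[0] & 0x1F;
	} else if ((s[0] & 0xF0) == 0xE0) {	/* 1110xxxx */
		len = 3;
		cp = s[0] & 0x0F;
	} else if ((s[0] & 0xF8) == 0xF0) {	/* 11110xxx */
		len = 4;
		cp = s[0] & 0x07;
	} else {
		return 0;	/* stray continuation byte or 5/6-byte lead */
	}

	if (n < len)
		return 0;	/* truncated sequence */

	for (i = 1; i < len; i++) {
		if ((s[i] & 0xC0) != 0x80)
			return 0;	/* not a 10xxxxxx continuation byte */
		cp = (cp << 6) | (s[i] & 0x3F);
	}

	if (cp < min_cp[len])
		return 0;	/* overlong, e.g. C0 80 for NUL */
	if (cp >= 0xD800 && cp <= 0xDFFF)
		return 0;	/* surrogate encoded as a 3-byte sequence */
	if (cp > 0x10FFFF)
		return 0;	/* beyond the Unicode range */

	*out = cp;
	return len;
}
```

With this, the bytes C0 80 and C0 AF (an overlong '/') both return 0, 
while F0 90 80 80 decodes to U+10000 in 4 bytes.  A decoder that wants 
to accept the Java-style NUL would have to special-case C0 80 before 
the overlong check.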

None of these issues is a show-stopper.  However, they are all things 
that we should check for and document.

[1] http://www.cl.cam.ac.uk/~mgk25/unicode.html