UTF8 locale MFC for DragonflyBSD [tjr at FreeBSD.org: cvs commit: src/etc/mtree BSD.usr.dist src/share/colldef Makefile src/share/mklocale Makefile UTF-8.src src/share/monetdef Makefile be_BY.UTF-8.src bg_BG.UTF-8.src cs_CZ.UTF-8.src en_GB.UTF-8.src...]
David Cuthbert
dacut at kanga.org
Sun Mar 28 18:35:58 PST 2004
Xin LI wrote:
> Having utf8'ized locales in DragonFlyBSD
> will bring better internationalization, and personally, I believe that
> making utf-8 the default internal locale will make DragonFly the best
> UNIX-like platform to write internationalized applications.
Agreed wholeheartedly.
Out of curiosity, do you know how the FreeBSD folks handled some of the
stranger UTF-8 behavior? In particular:
1. Representation of embedded NUL characters. The UTF-8 spec says this
is one byte == 00000000b; Java and a few others, though, have used a
double-byte encoding (110/00000 + 10/000000) so that stuff like strlen()
works "reasonably."
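2. Maximum size of a character. To represent UCS-2 characters, you only
needed up to 3 bytes (1110/xxxx + 10/xxxxxx + 10/xxxxxx).
Unfortunately, characters beyond U+FFFF need more: 4 bytes when encoded
directly, or 6 bytes when each half of a surrogate pair is encoded
separately as its own 3-byte sequence.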
Ironically, UTF-8 unaware routines handle these fine; some of my older
UTF-8 handling routines, though, barf on stuff above and beyond U+10000.
(Fortunately, none of these have escaped Neolinear... I hope...)
3. Security issues. The UTF-8 and Unicode FAQ [1] states that "a UTF-8
decoder must not accept UTF-8 sequences that are longer than necessary
to encode a character," noting that "any overlong UTF-8 sequence could
be abused to bypass UTF-8 substring tests that look only for the
shortest possible encoding."
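To make the three cases concrete, here is a minimal Python sketch that pokes at each one using the interpreter's built-in strict UTF-8 codec. The helper names (utf8_len, is_strict_utf8) and the constants are made up for illustration; this says nothing about how the FreeBSD code behaves.

```python
def utf8_len(ch: str) -> int:
    """Number of bytes standard UTF-8 uses for a single character."""
    return len(ch.encode("utf-8"))


def is_strict_utf8(data: bytes) -> bool:
    """True if Python's strict UTF-8 decoder accepts the byte string."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Point 1: the Java-style two-byte NUL (110/00000 + 10/000000).
MODIFIED_NUL = bytes([0b11000000, 0b10000000])

# Point 2: U+10000 written as two separately encoded surrogates
# (U+D800, U+DC00). "surrogatepass" tells the encoder to allow this.
SURROGATE_PAIR_BYTES = "\ud800\udc00".encode("utf-8", "surrogatepass")

# Point 3: an overlong two-byte form of "/" (U+002F).
OVERLONG_SLASH = bytes([0xC0, 0xAF])

if __name__ == "__main__":
    print("bytes for U+0000: ", utf8_len("\x00"))
    print("bytes for U+FFFF: ", utf8_len("\uffff"))
    print("bytes for U+10000:", utf8_len("\U00010000"))
    print("surrogate-pair form length:", len(SURROGATE_PAIR_BYTES))
    for name, raw in [
        ("modified NUL", MODIFIED_NUL),
        ("surrogate pair", SURROGATE_PAIR_BYTES),
        ("overlong slash", OVERLONG_SLASH),
    ]:
        print(name, "accepted by strict decoder:", is_strict_utf8(raw))
```

A substring test for b"/" or b"\x00" would miss both of the two-byte forms above, which is exactly the abuse the FAQ warns about.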
None of these issues is a show-stopper. However, it is stuff that we
should check for and document.
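As a starting point for what "check for" might mean, here is a rough sketch of a hand-rolled decoder that rejects overlong sequences, encoded surrogates, and anything past U+10FFFF. The names decode_utf8_strict and is_well_formed are hypothetical, and the policy (reject rather than tolerate the two-byte NUL, cap sequences at 4 bytes) is an assumption for illustration, not a description of any existing libc.

```python
# Smallest code point that legitimately needs a sequence of each length.
MIN_FOR_LENGTH = {1: 0x0, 2: 0x80, 3: 0x800, 4: 0x10000}


def decode_utf8_strict(data: bytes) -> list:
    """Decode bytes into a list of code points; raise ValueError on bad input."""
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        # Work out the sequence length and payload bits from the lead byte.
        if lead < 0x80:
            length, cp = 1, lead
        elif lead & 0xE0 == 0xC0:
            length, cp = 2, lead & 0x1F
        elif lead & 0xF0 == 0xE0:
            length, cp = 3, lead & 0x0F
        elif lead & 0xF8 == 0xF0:
            length, cp = 4, lead & 0x07
        else:
            raise ValueError(f"invalid lead byte at offset {i}")
        if i + length > len(data):
            raise ValueError(f"truncated sequence at offset {i}")
        # Every following byte must look like 10/xxxxxx.
        for b in data[i + 1 : i + length]:
            if b & 0xC0 != 0x80:
                raise ValueError(f"bad continuation byte after offset {i}")
            cp = (cp << 6) | (b & 0x3F)
        # Point 3 (and the two-byte NUL from point 1): no overlong forms.
        if cp < MIN_FOR_LENGTH[length]:
            raise ValueError(f"overlong encoding at offset {i}")
        # Point 2: surrogates must not be encoded on their own.
        if 0xD800 <= cp <= 0xDFFF:
            raise ValueError(f"encoded surrogate at offset {i}")
        if cp > 0x10FFFF:
            raise ValueError(f"code point out of range at offset {i}")
        out.append(cp)
        i += length
    return out


def is_well_formed(data: bytes) -> bool:
    """Convenience wrapper: True if decode_utf8_strict accepts the input."""
    try:
        decode_utf8_strict(data)
    except ValueError:
        return False
    return True
```

A decoder that wanted to interoperate with Java-style data would have to special-case the two-byte NUL before the overlong check; whether that is desirable is the sort of thing that should be documented either way.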
[1] http://www.cl.cam.ac.uk/~mgk25/unicode.html