UTF8 locale MFC for DragonflyBSD [tjr at FreeBSD.org: cvs commit: src/etc/mtree BSD.usr.dist src/share/colldef Makefile src/share/mklocale Makefile UTF-8.src src/share/monetdef Makefile be_BY.UTF-8.src bg_BG.UTF-8.src cs_CZ.UTF-8.src en_GB.UTF-8.src...]

Joerg Sonnenberger joerg at britannica.bec.de
Mon Mar 29 03:34:32 PST 2004


On Sun, Mar 28, 2004 at 09:35:58PM -0500, David Cuthbert wrote:
> Out of curiosity, do you know how the FreeBSD folks handled some of the 
> stranger UTF-8 behavior?  In particular:

I don't know how FreeBSD handles it, but I'm commenting on this to
sketch what might become _our_ attitude towards UTF-8.

> 1. Representation of embedded NUL characters.  The UTF-8 spec says this 
> is one byte == 00000000b; Java and a few others, though, have used a 
> double-byte encoding (110/00000 + 10/000000) so that stuff like strlen() 
> works "reasonably."

This is a clear violation of the shortest-form encoding requirement.
Since you should always use explicitly sized strings instead of
NUL-delimited strings when processing strings with possible embedded
NULs, I don't think we want this. Actually, this is one of the few
checks we might want to do in the kernel if we want to go the UTF-8
road in the future. It would be pretty bad to be able to create
undeletable / unviewable files :)
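
To make that concrete, here is a minimal sketch of the kind of check I
mean (hypothetical code, not anything in the tree): it catches both a
raw NUL byte and the overlong two-byte form 0xC0 0x80 that Java's
"modified UTF-8" uses for U+0000.

#include <stddef.h>

/*
 * Hypothetical filename check: reject a raw NUL byte as well as the
 * overlong two-byte sequence 0xC0 0x80 that Java uses for U+0000.
 */
static int
utf8_name_has_nul(const unsigned char *s, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++) {
        if (s[i] == 0x00)
            return (1);    /* raw embedded NUL */
        if (s[i] == 0xC0 && i + 1 < len && s[i + 1] == 0x80)
            return (1);    /* overlong-encoded NUL */
    }
    return (0);
}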

> 2. Maximum size of a character.  To represent UCS-2 characters, you only 
> needed up to 3 bytes (1110/xxxx + 10/xxxxxx + 10/xxxxxx). 
> Unfortunately, surrogates make it necessary to do 6 byte encodings. 
> Ironically, UTF-8 unaware routines handle these fine; some of my older 
> UTF-8 handling routines, though, barf on stuff above and beyond U+10000. 
>  (Fortunately, none of these have escaped Neolinear... I hope...)

I don't really want to force us to UCS-2 just because MS did. It is
pretty pointless if you think of Unicode as a means to encode every
_written_ script in the world. Therefore, if we want to apply any
length checks, the correct way is as specified by at least Unicode 3,
i.e. UCS-4.
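
For concreteness, a sketch (a made-up helper, not existing code) of the
sequence length implied by each UTF-8 lead byte under the original
RFC 2279 definition, which covers the full UCS-4 range with up to six
bytes; RFC 3629 has since capped sequences at four bytes (U+10FFFF).

/*
 * Hypothetical helper: sequence length implied by a UTF-8 lead byte,
 * per the original RFC 2279 form of UTF-8 (full UCS-4 range).
 */
static int
utf8_seqlen(unsigned char lead)
{
    if (lead < 0x80)
        return (1);     /* 0xxxxxxx: ASCII */
    if (lead < 0xC0)
        return (-1);    /* 10xxxxxx: continuation byte, not a lead */
    if (lead < 0xE0)
        return (2);     /* 110xxxxx */
    if (lead < 0xF0)
        return (3);     /* 1110xxxx */
    if (lead < 0xF8)
        return (4);     /* 11110xxx */
    if (lead < 0xFC)
        return (5);     /* 111110xx */
    if (lead < 0xFE)
        return (6);     /* 1111110x */
    return (-1);        /* 0xFE/0xFF never appear in UTF-8 */
}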

> 3. Security issues.  The UTF-8 and Unicode FAQ [1] states that "a UTF-8 
> decoder must not accept UTF-8 sequences that are longer than necessary 
> to encode a character," noting that "any overlong UTF-8 sequence could 
> be abused to bypass UTF-8 substring tests that look only for the 
> shortest possible encoding."

This is a reasonable requirement, since rejecting overlong forms gives
every character a unique byte sequence and thus allows us to use normal
strcmp(3). It might still be necessary to do normalisation, but that is
an advanced issue; the size of certain parts of ICU speaks for itself.
We still need a regex engine for Unicode :)
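
As a sketch of what the shortest-form rule buys us (hypothetical code
again, reusing the utf8_seqlen() helper from above): once every
sequence passes this check, each character has exactly one byte
representation, so byte-wise strcmp(3) is a correct equality test.

#include <stddef.h>

/*
 * Decode one UTF-8 sequence, rejecting overlong forms: a decoded
 * value below the minimum for its length has a shorter encoding.
 * Surrogate and range checks are omitted for brevity.
 */
static const unsigned long utf8_minval[7] = {
    0, 0, 0x80, 0x800, 0x10000, 0x200000, 0x4000000
};

static int
utf8_decode_strict(const unsigned char *s, size_t len, unsigned long *cp)
{
    unsigned long v;
    int n, i;

    n = utf8_seqlen(s[0]);
    if (n < 1 || (size_t)n > len)
        return (-1);
    v = s[0] & (0xFF >> (n == 1 ? 1 : n + 1));
    for (i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return (-1);    /* missing continuation byte */
        v = (v << 6) | (s[i] & 0x3F);
    }
    if (v < utf8_minval[n])
        return (-1);        /* overlong: a shorter form exists */
    *cp = v;
    return (n);
}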

> None of these issues is a show-stopper.  However, it is stuff that we 
> should check for and document.

Fully agreed.

Joerg

> [1] http://www.cl.cam.ac.uk/~mgk25/unicode.html
