UTF8 locale MFC for DragonflyBSD

Dave Cuthbert dacut at neolinear.com
Mon Mar 29 08:09:08 PST 2004


Joerg Sonnenberger wrote:
On Sun, Mar 28, 2004 at 09:35:58PM -0500, David Cuthbert wrote:
1. Representation of embedded NUL characters.  The UTF-8 spec says this 
is one byte == 00000000b; Java and a few others, though, have used a 
double-byte encoding (110/00000 + 10/000000) so that stuff like strlen() 
works "reasonably."
This is a clear violation of the minimum requirement. Since you should
always use explicitly sized strings instead of delimited strings for
processing of strings with possible embedded NULs, I don't think we want
this. Actually this is one of the few checks we might want to do in the
kernel if we want to go the UTF-8 road in the future. It would be pretty
bad to be able to create undeletable / unviewable files :)
Agreed.  Actually, I thought that this was a fairly elegant hack. 
Nonetheless, it is still a hack.
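
For illustration, the kind of check discussed above could look roughly
like the sketch below. The function name has_overlong_nul() is made up
for this example and is not an existing kernel or libc interface. It
relies on the fact that the byte 0xC0 never appears in well-formed
UTF-8, so a plain scan for the pair 0xC0 0x80 is enough to spot the
double-byte NUL.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper, for illustration only: returns true if the
 * explicitly sized buffer contains the overlong two-byte NUL encoding
 * 0xC0 0x80 (110/00000 + 10/000000).
 */
bool has_overlong_nul(const unsigned char *buf, size_t len)
{
    /* i + 1 < len keeps buf[i + 1] in bounds and handles len == 0. */
    for (size_t i = 0; i + 1 < len; i++) {
        if (buf[i] == 0xC0 && buf[i + 1] == 0x80)
            return true;
    }
    return false;
}
```

A caller would pass the buffer together with its explicit length, never
relying on strlen(), and reject the name if the function returns true.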

I don't really want to force us to UCS2, just because MS did. It is pretty
pointless if you think about Unicode as a means to encode every _written_
script in the world. Therefore, if we want to apply any length checks, the
correct way is as specified by at least Unicode 3, i.e. UCS4.
Well, not just MS; a lot of folks (notably Sun/Java) were caught off 
guard when Unicode was extended beyond the base 64k characters.  I won't 
replicate the flame wars here, they're all on Google. :-)

My personal opinion: UCS-4 wastes a lot of space given that Unicode 3.1 
is a ~21-bit set and nobody is really using the >=U+10000 space in a 
practical manner (yet?).  But if you need to have a one-to-one mapping, 
you don't have much choice.

Unless you have a machine which uses 21-bit bytes, of course. ;-)
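
To make the space trade-off concrete, here is a rough sketch that
computes how many bytes one code point takes in UTF-8, UTF-16 and
UCS-4, and how many bits it actually needs. All function names are
invented for this example, and the upper limit of U+10FFFF is the
Unicode code space limit rather than anything specific to this thread.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes needed for one code point in UTF-8; 0 if it cannot be encoded. */
size_t utf8_len(uint32_t cp)
{
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;               /* surrogates are not valid on their own */
    if (cp < 0x80)
        return 1;
    if (cp < 0x800)
        return 2;
    if (cp < 0x10000)
        return 3;
    if (cp <= 0x10FFFF)
        return 4;
    return 0;                   /* beyond the Unicode code space */
}

/* Bytes needed in UTF-16: one 16-bit unit, or a surrogate pair. */
size_t utf16_len(uint32_t cp)
{
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;
    if (cp < 0x10000)
        return 2;
    if (cp <= 0x10FFFF)
        return 4;
    return 0;
}

/* UCS-4 is fixed width: always four bytes per code point. */
size_t ucs4_len(uint32_t cp)
{
    (void)cp;
    return 4;
}

/* Number of significant bits in a code point value. */
unsigned bits_needed(uint32_t cp)
{
    unsigned n = 0;
    while (cp != 0) {
        n++;
        cp >>= 1;
    }
    return n;
}
```

With these, bits_needed(0x10FFFF) gives 21, while ucs4_len() always
spends 32 bits; a character below U+10000 costs at most three bytes in
UTF-8 and two in UTF-16.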




