UTF8 locale MFC for DragonflyBSD
Dave Cuthbert
dacut at neolinear.com
Mon Mar 29 08:09:08 PST 2004
Joerg Sonnenberger wrote:
On Sun, Mar 28, 2004 at 09:35:58PM -0500, David Cuthbert wrote:
1. Representation of embedded NUL characters. The UTF-8 spec says this
is one byte == 00000000b; Java and a few others, though, have used a
double-byte encoding (110/00000 + 10/000000) so that stuff like strlen()
works "reasonably."
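
A minimal Python sketch of the two NUL representations described above. It is an illustration only: the helper names `encode_nul_as_c080` and `c_strlen` are made up here, `c_strlen` merely imitates what C's strlen() does on a terminated buffer, and only the NUL handling of Java's "modified UTF-8" is reproduced (that format has other differences which are ignored).

```python
def encode_nul_as_c080(text: str) -> bytes:
    """Encode text as UTF-8, but write U+0000 as the two-byte pair C0 80.

    In standard UTF-8 a 0x00 byte can only ever stand for U+0000, so a
    plain byte replacement is enough for this illustration.
    """
    return text.encode("utf-8").replace(b"\x00", b"\xc0\x80")


def c_strlen(buf: bytes) -> int:
    """Imitate C strlen(): count the bytes before the first 0x00."""
    end = buf.find(b"\x00")
    return len(buf) if end < 0 else end


if __name__ == "__main__":
    text = "ab\x00cd"  # a string with an embedded NUL

    # Both buffers get a trailing 0x00, as a C string would.
    standard = text.encode("utf-8") + b"\x00"
    two_byte = encode_nul_as_c080(text) + b"\x00"

    print(c_strlen(standard))  # stops at the embedded NUL
    print(c_strlen(two_byte))  # runs through to the real terminator
```

With the standard encoding the scan stops after "ab"; with the two-byte form it reaches the terminator, which is the sense in which strlen() works "reasonably."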
This is a clear violation of the minimum requirement. Since you should
always use explicitly sized strings instead of delimited strings for
processing of strings with possible embedded NULs, I don't think we want
this. Actually this is one of the few checks we might want to do in the
kernel if we want to go the UTF-8 road in the future. It would be pretty
bad to be able to create undeletable / unviewable files :)
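
One way such a check could look, sketched in Python rather than kernel C. The function name `is_shortest_form_utf8` is hypothetical, and the limits used (at most four bytes, nothing above U+10FFFF, no surrogates) are assumed from RFC 3629 rather than taken from this thread. The point is the `minimum` comparison, which rejects any sequence that is longer than it needs to be, including the C0 80 form of NUL.

```python
def is_shortest_form_utf8(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8 with no overlong sequences."""
    i = 0
    n = len(data)
    while i < n:
        lead = data[i]
        # Pick the sequence length, the payload bits of the lead byte, and
        # the smallest code point that legitimately needs that length.
        if lead < 0x80:
            length, cp, minimum = 1, lead, 0x0
        elif 0xC0 <= lead <= 0xDF:
            length, cp, minimum = 2, lead & 0x1F, 0x80
        elif 0xE0 <= lead <= 0xEF:
            length, cp, minimum = 3, lead & 0x0F, 0x800
        elif 0xF0 <= lead <= 0xF7:
            length, cp, minimum = 4, lead & 0x07, 0x10000
        else:
            return False  # stray continuation byte or invalid lead byte

        if i + length > n:
            return False  # sequence is cut off at the end of the buffer

        for byte in data[i + 1:i + length]:
            if byte & 0xC0 != 0x80:
                return False  # not a 10xxxxxx continuation byte
            cp = (cp << 6) | (byte & 0x3F)

        if cp < minimum:
            return False  # overlong: a shorter encoding exists
        if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            return False  # outside Unicode, or a surrogate

        i += length
    return True
```

Note that a single 0x00 byte passes this check, since it is the shortest form of U+0000; whether a file name may contain it at all is a separate question.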
Agreed. Actually, I thought that this was a fairly elegant hack.
Nonetheless, it is still a hack.
I don't really want to force us to UCS2, just because MS did. It is pretty
pointless if you think about Unicode as a means to encode every _written_
script in the world. Therefore if we want to apply any length checks, the
correct way is as specified by at least Unicode 3, i.e. UCS4.
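
A small Python sketch of why the unit of a length check matters. The two helper names are invented for this example, and it assumes Python 3.3 or later, where `len()` on a `str` counts code points. A character outside the first 64k takes two 16-bit units but is still one character.

```python
def length_in_16bit_units(text: str) -> int:
    """Count 16-bit code units, as a UCS2/UTF-16 based check would."""
    return len(text.encode("utf-16-le")) // 2


def length_in_code_points(text: str) -> int:
    """Count code points, as a UCS4 based check would."""
    return len(text)


if __name__ == "__main__":
    sample = "a\U00010400"  # 'a' plus one character at U+10400
    print(length_in_16bit_units(sample), length_in_code_points(sample))
```

For the sample string the two counts disagree, so a limit expressed in 16-bit units and one expressed in characters are not interchangeable.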
Well, not just MS; a lot of folks (notably Sun/Java) were caught off
guard when Unicode was extended beyond the base 64k characters. I won't
replicate the flame wars here; they're all on Google. :-)
My personal opinion: UCS-4 wastes a lot of space given that Unicode 3.1
is a ~21-bit set and nobody is really using the >=U+10000 space in a
practical manner (yet?). But if you need to have a one-to-one mapping,
you don't have much choice.
Unless you have a machine which uses 21-bit bytes, of course. ;-)
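
To put rough numbers on the space argument, here is a short Python sketch. The function name `encoded_sizes` is made up for illustration, and Python's "utf-32-le" codec is used as a stand-in for UCS-4 (fixed four bytes per character, no byte-order mark).

```python
def encoded_sizes(text: str) -> dict:
    """Return the byte count of text in three encodings (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "ucs-4": len(text.encode("utf-32-le")),
    }


# The highest code point, U+10FFFF, needs this many bits.
BITS_NEEDED = (0x10FFFF).bit_length()

if __name__ == "__main__":
    print(BITS_NEEDED)
    print(encoded_sizes("hello"))        # plain ASCII text
    print(encoded_sizes("\U00010400"))   # one character at U+10400
```

For ASCII text UCS-4 is four times the size of UTF-8, while for characters at or above U+10000 all three encodings need four bytes each.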