git: UTF-8: Multiple improvements (and detection of possible issue)

John Marino marino at crater.dragonflybsd.org
Tue Sep 1 02:21:11 PDT 2015


commit 7294feb84bbe3070ca901983bd143802d793d494
Author: John Marino <draco at marino.st>
Date:   Tue Sep 1 11:11:41 2015 +0200

    UTF-8: Multiple improvements (and detection of possible issue)
    
    This commit started out intending to fix "digit" definition on unicode,
    which it mostly does, but a lot more happened in the end, namely:
    
     * Digits apparently are not part of the CLDR definition.  I added a
       section to the manual portion of the UTF-8 source file that defines
       digit classes for the generated sections.
     * Add numbers classification for entire UTF-8.  Currently neither
       DragonFly nor any other BSD supports the "number" type.  localedef
       does understand it (it's supported on Illumos), but the number flag
       value is currently zero, so it's a no-op.  A short term goal is to
       have DragonFly be the first BSD with proper number ctype handling.
     * Redefine "special" ctype once and for all.  There is no definitive
       agreement on what "special" characters are.  According to wiki, which
       got it from Unicode, it starts with 33 characters (0x20 - 0x2F, 0x3A -
       0x40, 0x5B - 0x60, 0x7B - 0x7E).  However, localedef objects to <space>
       because "special" sets the "graph" and "print" flags, and <space>
       can't be graph.  As a result, <space> is not considered "special"
       here.  Moreover, the punctuation in the Latin-1 supplement is
       "special".  The division and multiplication signs are ambiguous, so I
       set them to special (since the plus and minus signs are special).
       Finally, and with the most doubt, the punctuation of the "general
       punctuation" block is also considered special, although I couldn't
       find convincing evidence either way.  Given the lack of definition, I
       don't think the "special" classification is really used, especially
       not in Unicode.
     * Fix NON-BREAK_SPACE classification (set as graph and space in the
       previous commit)
     * The MICRO character also triggered a warning because it was classified
       as both lower (in the Greek section) and punctuation, so remove the
       punct class.
     * When possible, don't define graph if digit is defined, and similarly
       with graph and punct.  Both digit and punct also set graph flag so
       having both is redundant.
     * add several new block definitions:
       - Syloti Nagri
       - Common Indic Number Forms
       - Phags-pa
       - Saurashtra
       - Kayah Li
       - Rejang
       - Javanese
       - Cham
       - Tai Viet
       - Meetei Mayek & extension
     * Detection of possible bug in localedef
       The Tai Tham definitions are producing the wrong code, but there's
       nothing wrong with the definitions.  The 6 unused characters between
       the two digit definitions should not be graphable, but as soon as
       one "digit" is defined after the first digit range is defined, all
       the characters between are marked as graphable and digits.  There
       are similar "fill-ins", but so far only with Tai Tham.  It was
       detected by comparing the output of all "digit" types against a
       Python program that does the same, which revealed this error.  It
       requires further investigation into exactly what is causing it (and
       thus where the bug is), but right now it's either a bad definition
       elsewhere that affects Tai Tham or localedef has a bug somewhere
       (avl lookup?)

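The ASCII starting point for the "special" class in the commit message can be
checked with a few lines of Python.  This is an illustrative sketch only; the
names RANGES, candidates and special are made up for the example, and it covers
just the four ASCII ranges quoted above, not the Latin-1 or general punctuation
additions.

```python
import string

# The four ASCII ranges quoted in the commit message (inclusive bounds).
RANGES = [(0x20, 0x2F), (0x3A, 0x40), (0x5B, 0x60), (0x7B, 0x7E)]

# Expand the ranges into individual characters.
candidates = [chr(cp) for lo, hi in RANGES for cp in range(lo, hi + 1)]

# Drop <space>, since a graph character cannot also be a space.
special = [c for c in candidates if c != " "]

print(len(candidates), "candidates,", len(special), "after removing <space>")
print("".join(special))
```

With <space> removed, the remaining characters are the same set as Python's
string.punctuation.
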
Summary of changes:
 share/ctypedef/en_US.UTF-8.src            | 324 ++++++++++++++++++++++--------
 tools/tools/locale/etc/common.UTF-8.src   | 324 ++++++++++++++++++++++--------
 tools/tools/locale/etc/manual-input.UTF-8 | 324 ++++++++++++++++++++++--------
 3 files changed, 729 insertions(+), 243 deletions(-)

http://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/7294feb84bbe3070ca901983bd143802d793d494


-- 
DragonFly BSD source repository


