Naive HAMMER question

Mon Jan 14 19:55:04 PST 2008

Matthew Dillon wrote:
:Hello Matt,
:
:Just had a fight with ext3fs/FreeBSD/Win2K running on the same
:computer with not-fully-cp866-compliant Ukrainian filenames.
:The problem is old enough and several approaches are known, most of
:them are local fs oriented (like "just use the same charset
:everywhere"). But HAMMER is network/cluster/multisystem oriented, so
:different charsets are in use on different nodes, or even different
:users require different charsets at the same time even locally.
:
:So, please answer these couple of questions:
:
:1. Will HAMMER carry any charset/language info for non-ASCII filenames?
:2. Will it map on-disk names to user-defined charset in any way?
:
:I'd prefer having UTF8 names on-disk (at least it will work for me
:and most of other people, I think).
:Upper layers could specify their one-byte charsets if needed and
:provide names translation on their own.
:
:Your vision?
:
:PS. I'm not an expert on FS/i18n issues.
:-- 
:Dennis Melentyev

    My personal opinion is that the kernel should be responsible for
    filename translation rather than the filesystem.  HAMMER just sees
    a character string, it doesn't know or care what format it is in.
Heavily dependent on UTF-8 (and several Chinese encodings) here - I'd 
take that a step further towards isolation/agnosticism, and place the 
burden on the userland application - not the kernel or the fs at all.
(as seems to have been the *BSD 'way')
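A rough sketch of what "burden on the userland application" can look like, in Python. It assumes a POSIX-like system where names are stored as opaque bytes; the helper names list_raw_names and decode_for_user are made up for illustration. The filesystem hands back bytes, and only the application picks a charset.

```python
import os
import tempfile

def list_raw_names(directory: bytes) -> list[bytes]:
    """Return directory entries as raw bytes; no charset is assumed."""
    # Passing a bytes path makes os.listdir return bytes names.
    return sorted(os.listdir(directory))

def decode_for_user(raw: bytes, user_charset: str) -> str:
    """Userland decision: interpret the stored bytes in the user's charset."""
    # Undecodable bytes become U+FFFD instead of raising.
    return raw.decode(user_charset, errors="replace")

with tempfile.TemporaryDirectory() as tmp:
    # UTF-8 bytes for a two-character Chinese name plus ".txt".
    raw_name = b"\xe4\xb8\xad\xe6\x96\x87.txt"
    path = os.path.join(os.fsencode(tmp), raw_name)
    with open(path, "wb"):
        pass  # create an empty file under that byte-string name
    names = list_raw_names(os.fsencode(tmp))

shown = decode_for_user(raw_name, "utf-8")
```

The stored name never changes here; a user with a different charset would simply call decode_for_user with a different argument.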

    Ultimately I think UTF8 has to be used for maximum compatibility.
When I look at a UTF-8 titled Chinese-named file in, for example OS X 
'Finder' - it shows the correct Chinese characters.

'ls' in a CLI shows '\xxx' escape sequences for the same file.

I prefer that minor nuisance because it tells me that the all-important 
OS and fs are not making *potentially wrong* guesses - just sticking 
with binary is binary.

Much less risk of damage if a browser or editor gets it wrong than if 
the fs or kernel get it wrong, as they can simply be optioned or 
configured otherwise. IOW - it is display conversion, not source conversion.
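To illustrate "display conversion, not source conversion", here is a minimal Python sketch. The function display_name is hypothetical, and the octal escaping only approximates what a tool like 'ls' might print; the point is that the raw bytes are left untouched and only the rendering varies.

```python
def display_name(raw: bytes, encoding: str = "utf-8") -> str:
    """Return a printable rendering of a filename; the bytes are not modified."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        # Fallback: keep printable ASCII, show every other byte as \ooo (octal).
        return "".join(
            chr(b) if 0x20 <= b < 0x7F else "\\%03o" % b for b in raw
        )

# UTF-8 bytes for a two-character Chinese name plus ".txt".
raw = b"\xe4\xb8\xad\xe6\x96\x87.txt"

shown_utf8 = display_name(raw, "utf-8")    # a UTF-8 aware viewer
shown_ascii = display_name(raw, "ascii")   # an ASCII-only terminal
```

If the viewer guesses wrong, the worst case is an ugly rendering that a different setting fixes; the name on disk is the same either way.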

grep and friends can, of course, be a PITA, but cut n' paste in a 
half-smart terminal window seems to make the conversion to '\xxx' just 
fine (though not always the reverse...).

    The issue is not specifically addressed in DragonFly (UTF8 is kinda
    a cop-out but I still think it's better than using UTF16 or UTF32).
					-Matt
					Matthew Dillon 
					<dillon at backplane.com>
More 'effective compromise' than cop-out.

UTF-8's great advantage is that it doesn't much intrude when one sticks 
with plain ASCII or any of the several common encodings based on same.
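A small Python check of that property, offered as a sketch rather than anything from DragonFly itself (the helper safe_for_c_paths is invented for illustration). ASCII text is byte-identical in UTF-8, and multi-byte UTF-8 sequences use only bytes >= 0x80, so they cannot collide with NUL or '/'. UTF-16 offers neither guarantee.

```python
def safe_for_c_paths(encoded: bytes) -> bool:
    """True if the bytes contain neither NUL nor '/', which path APIs treat specially."""
    return b"\x00" not in encoded and b"/" not in encoded

ascii_name = "README.txt"
e_acute = "\u00e9"       # LATIN SMALL LETTER E WITH ACUTE
i_ogonek = "\u012f"      # LATIN SMALL LETTER I WITH OGONEK

# Plain ASCII passes through UTF-8 unchanged.
same_as_ascii = ascii_name.encode("utf-8") == ascii_name.encode("ascii")

# Every byte of a multi-byte UTF-8 sequence has the high bit set.
high_bytes_only = all(b >= 0x80 for b in e_acute.encode("utf-8"))

# UTF-16 embeds NUL bytes in ASCII text, and U+012F contains a 0x2F ('/') byte.
utf8_ok = safe_for_c_paths(ascii_name.encode("utf-8"))
utf16_ascii_ok = safe_for_c_paths(ascii_name.encode("utf-16-le"))
utf16_ogonek_ok = safe_for_c_paths(i_ogonek.encode("utf-16-le"))
utf8_ogonek_ok = safe_for_c_paths(i_ogonek.encode("utf-8"))
```

This is why a filesystem that treats names as NUL-terminated byte strings can carry UTF-8 without knowing about it, while UTF-16 or UTF-32 names would need explicit support.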

UTF-16 thus seems to have fallen into the gap. Not widely seen in the wild.

As to UTF-32. Does *anyone* actually use it?

Bill