Portable vkernel (emulator)

Thu Jul 10 21:35:09 PDT 2008

:I don't doubt the features, but if it has to compete with modern Linux
:filesystems for single-node file server roles, it'll need a lot more
:optimization. I'm not trying to troll, but it's fair to say that there
:are still plenty of use cases that HAMMER won't suit without a lot
:more work, even if most of that is because DragonFly itself still has
:a fair way to go in some areas.

    Well, one thing that can be said for Linux is that it has a fairly
    long list of varied filesystems, but they all do different things
    and they are all in various states of repair or disrepair.  The
    only commonality is instant crash recovery and (typically) snapshot
    support, and that's it.

    The interesting thing about reliability is that it isn't entirely an
    issue to be solved by the filesystem any more.  People lose Ext
    and Reiser filesystems all the time.  Failover doesn't always
    work.  Clustered filesystems don't always stay in sync.  Software
    bugs, hardware bugs, RAID bugs... things that are out of the control
    of the filesystem can wreak havoc.  I know several people who scream
    about there being no Linux setup that does everything right.  You
    have to choose your poison and live with the quirks.  In the linux
    world, quantity does not equal quality, something Linus himself has
    commented on numerous times.

    There are a couple of cluster filesystems, Coda, GFS, IBM's GPFS,
    and a few others.  There's LVM at the block layer (which is really nice,
    I wish we had it).  I frankly don't know how well any of them work
    or what their trade-offs are.  Most people use, what, ext3/4 or
    reiser4 now?  Ext and Reiser aren't cluster filesystems.

    ZFS is the only current filesystem that tries to do self-repair, and
    even that won't save it from software bugs.  Multi-master replication
    and self-repair (off of other masters, or slaves) is the holy grail.
    I'm sure someone does it.  HAMMER will, eventually (probably another year
    for multi-master replication, but we will have master->multi-slave
    replication for this release).

    So there's a wide selection, but no single filesystem has the full set
    of features.  If one were to compare HAMMER against all of them as a
    group then, sure, I have a ton of work to do.  But if you compare
    HAMMER against any single Linux filesystem (well, the non-cluster ones
    for the moment), I think you'd be surprised.

:I don't want to trivialise your work at all though. I've been
:following it through the mailing lists and it's very impressive. I
:respect that you've been uncompromising in getting the best possible
:on-disk format, while a lot of filesystems have obviously stopped
:short and left themselves with unfixable problems.
:
:>    But from a stability and use point of view I would agree that HAMMER
:>    needs to age a bit to achieve the reputation that UFS has acquired
:>    over the last 20 years.  This means people have to start using it for
:>    more than testing (once we release).
:
:It's a bit of a difficult thing to do in the traditionally
:conservative BSD community. You want to test it as if it's in a real
:world production environment, without actually trusting data to it. So
:you could drop files onto it and replicate them to another non-hammer
:FS, but that's not exactly real world usage. Or you could dump data on
:it directly, like on a file server, and trust it completely,
:immediately after its official release.
:
:Although to be honest, at this point I'd *rather* use HAMMER than UFS
:for a file server, because as a developer I intuitively trust new code
:by experienced developers more than old code that hasn't been
:maintained for years.

    We have a small developer and user community and it is not likely
    to increase all that much, even with HAMMER.  Keep in mind, though,
    that both Ext and Reiser were originally developed when Linux was
    a much smaller project.  Having a large project gives you more eyeballs
    and more people pounding the filesystem, but filesystem development
    isn't really governed by that pounding.  The bugs don't get worked out
    any faster with 1000 people pounding on something versus 100 people.

    All the major filesystems available today can be traced down to usually
    just one or two primary developers (for each one).  All the work flows
    through to them and they only have so many hours in a day to work with.
    Very few people in the world can do filesystem development; it's harder
    to do than OS development in my view.

    UFS went essentially undeveloped for 20 years because the original
    developers stopped working on it.  All the work done on it since FFS
    and BSD4-light (circa 1983-1991-ish) has mostly been in the form of
    hacks.  Even the UFS2 work in FreeBSD is really just a minor extension
    to UFS, making fields 64 bits instead of 32 bits and a few other
    little things, and nothing else.

    Ext, Reiser, Coda... I think every single filesystem you see on Linux
    *EXCEPT* the four commercial ones (Veritas, IBM's stuff, Sun's ZFS,
    and SGI's XFS) is almost a single-person project.  So I'm not too worried
    about my ability to develop HAMMER :-)

:>    This is why I'm making a big push to get it running as smoothly as
:>    possible for this release.  A lot of people outside the DragonFly
:>    project are going to be judging HAMMER based on its early use cases,
:>    and the clock starts ticking with the first official release.
:
:I agree. And you've obviously done very well already. I'm just saying
:that some people are getting a little too excited.
:
:Even if HAMMER was the best file system in the world, that wouldn't do
:much for DragonFly's adoption overall. In that case people would
:rather port it from DragonFly than run DragonFly itself, while
:DragonFly has severe limitations on the hardware it can run on, and
:some performance problems once running.
:
:But in the real world there are plenty of alternative file systems.
:It's a commodity. Look at Linux - it has half a dozen file systems,
:most of which perform at least as well as HAMMER with similar (or
:better) reliability guarantees. Sure, they're messy and stale. But
:they work and they work well. Most people don't care about clustering,
:and those that do have found other ways to do it.

    I kinda half agree and half disagree.  I don't think HAMMER will get
    much penetration in the world as a DragonFly-only FS, and making it
    conducive to porting is something I will be working very hard on.

    But DragonFly serves perfectly well as a development platform for
    HAMMER and regardless of where HAMMER is ported that will continue to
    be the case.  My work would not be any easier on Linux, or Solaris, or
    any of the other BSDs.  In fact, it would be harder.  I might get more
    eyeballs, but eyeballs won't make the filesystem work better.

    Filesystems are not commodities.  Just because there are a lot of
    choices in the Linux world doesn't make them all equally viable or
    drop-in replacements for each other, even for the most generic of
    purposes.  I have friends that use Linux in production environments
    who tear their hair out because they can't find one filesystem that
    actually does everything they want to do.  Once you select a filesystem
    to use with Linux you are pretty much stuck with it.  Definitely *not*
    commodities.

    There's a reason why most Linux installations stick with Ext or Reiser
    (or XFS I think now too), but don't dive into the many other FSs
    available.  And, similarly, there is a reason why certain folks have
    to use some of the others, particularly in clustered environments.
    It's a real mess.

:>    I think I stopped using the mailbox signals (they got replaced by the
:>    co-thread I/O model), but the vkernel still needs the syscall support
:>    for managing VM spaces and virtualized page tables.  I'm not sure
:>    what kind of APIs Linux has to support their UVM stuff.
:
:Ah, thanks for clearing that up. I had a feeling I missed a few email
:threads as my mailing list activity dropped for a while.
:
:>    The KVM stuff is pretty cool but the performance claims, particularly
:>    by companies such as VMWare, are all hype.  The bare fact of the matter
:>    is that no matter what you do you still have to cross a protection
:>    boundary to make a system call or do I/O or take a page fault.
:
:That's true, but those constant costs matter less and less with modern
:hardware and software. Now even production filesystems can run in
:userland containers like FUSE. That's become a trade-off we can make
:on modern hardware with modern software. Virtualisation is one of
:those things. It's heavyweight and hacky, but we can afford that, and
:the benefits are well worth it. So hype or not, the performance is
:good enough and since that translates into costs and feasibility,
:people are investing in it and using it in production. "Hype" would
:imply it's not meeting expectations. I don't think that's fair to say.

    On to VM.  Well, the thing is that those constant costs in fact
    *DO* matter.  Only an idle system can be hacked into having very
    low cost.  Everything is relative.  If the cpu requirements of the
    workloads aren't changing very quickly these days then the huge
    relative cost of the system
    calls becomes less important for those particular workloads, but if
    you have a workload that needs 100% of your machine resources you
    will quickly get annoyed at the VMs and start running your application
    on native hardware.
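
    A rough way to see why "everything is relative" here is the sketch
    below.  It is only an illustration under stated assumptions: the
    function name, the fraction of time spent crossing protection
    boundaries, and the 5x crossing penalty are all made-up inputs, not
    measurements of any real hypervisor.

```python
def relative_runtime(boundary_fraction, crossing_penalty):
    """Runtime under virtualization relative to native (native = 1.0).

    boundary_fraction: share of native runtime spent crossing protection
        boundaries (system calls, I/O, page faults).  Hypothetical input.
    crossing_penalty: how many times more expensive each crossing becomes
        inside the VM.  Hypothetical input.
    """
    if not 0.0 <= boundary_fraction <= 1.0:
        raise ValueError("boundary_fraction must be between 0 and 1")
    # Time outside crossings is unchanged; time inside them is scaled up.
    return (1.0 - boundary_fraction) + boundary_fraction * crossing_penalty


# Made-up workloads, for illustration only.
for name, fraction in (("cpu-bound compile", 0.02),
                       ("syscall-heavy server", 0.40)):
    print(f"{name}: {relative_runtime(fraction, 5.0):.2f}x native runtime")
```

    With these invented inputs the cpu-bound job barely notices (about
    1.08x) while the syscall-heavy one takes about 2.6x as long, which
    is the sort of gap that sends a fully loaded application back to
    native hardware.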

:Linux solved this problem almost completely by having dynamic ticks in
:its kernel. That means it doesn't have a fixed scheduler interrupt per
:se, or at least, it doesn't have nearly the impact it does on other
:kernels.
:
:I see what you mean by the way. FreeBSD 7 while idle in KVM takes up a
:FEW % CPU just because of all that overhead. Modern Linux takes up
:virtually nothing. They've solved this problem pretty well. And from
:what I hear pure Xen is still even better than KVM.

    Yah, I read about the Linux work.  That was mainly IBM I think,
    though to be truthful it was primarily solved simply by IBM reducing
    the clock interrupt rate, which I think pushed the Linux folks to
    move to a completely dynamic timing system.

    Similarly for something like FreeBSD or DragonFly, reducing the clock
    rate makes a big difference.   DragonFly's VKERNEL drops it down to
    20Hz.
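
    To put the clock rate in perspective, here is a back-of-the-envelope
    sketch.  The per-tick cost of 20 microseconds is purely an assumed
    figure (the real cost depends on the host, the hypervisor and the
    guest), and the function name is hypothetical; only the 20Hz value
    comes from the vkernel.

```python
def idle_tick_load(tick_hz, cost_us_per_tick):
    """Fraction of one CPU an idle guest spends just servicing clock ticks.

    tick_hz: periodic clock interrupt rate of the guest.
    cost_us_per_tick: assumed host-side cost of delivering and handling
        one tick, in microseconds (hypothetical, not measured).
    """
    return tick_hz * cost_us_per_tick / 1_000_000.0


ASSUMED_COST_US = 20.0  # illustrative guess only

for hz in (1000, 100, 20):
    load = idle_tick_load(hz, ASSUMED_COST_US)
    print(f"{hz:5d} Hz -> {load * 100:.2f}% of a CPU while idle")
```

    Whatever the true per-tick cost is, the idle load scales linearly
    with the tick rate, so going from 1000Hz to 20Hz cuts it by a
    factor of 50.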

:>    So the performance for something running under a KVM depends a lot on
:>    what that something is doing.  Cpu-bound programs which don't make many
:>    system calls (such as gcc), or I/O-bound programs which would be
:>    blocked on I/O much of the time anyway (such as sendmail), will
:>    perform fairly well.   System-call intensive programs, such as a web
:>    server, will lose a lot in the translation.
:
:Modern web servers don't have this problem as much as you'd think. The
:fashion these days is to serve static files off a very simple, highly
:optimized server or cluster, and serve dynamic (CPU-bound) content
:from application servers. The application servers are the ones that
:would be virtualised, and since they're mostly Python or PHP or J2EE,
:virtualisation is the least of their performance problems. It's just
:that the performance problem is at most 10% of their overall cost so
:they don't care if that tiny figure even doubles. These days it's much
:less than double, and shrinking every few months. But the one constant
:is that Linux is always at the top in terms of performance and
:efficiency.

    I don't think virtualization is used for performance reasons.  Most
    such deployments are going to assume at least a 20% loss in performance
    across the board.  The reason virtualization is used is because crazily
    enough it is far, far easier to migrate and hot-swap whole virtualized
    environments than it is to migrate or hot-swap a single process.

    Is that nuts?  But that's why.  CPU power is nearly free, but a loss
    of reliability costs real money.  It is not so much about 100% uptime;
    keeping the downtime in the sub-second range when something fails is
    what is important.  Virtualization turns racks of hardware into commodities
    that can simply be powered up and down at a whim without impacting the
    business.  At least as long as we're not talking about financial
    transactions.

    No open source OS today is natively clusterable.  Not one.  Well, don't
    quote me on that :-).  I don't think OpenSolaris is, but I don't know
    much about it.  Linux sure as hell isn't; it takes reams of hacks to
    get any sort of clustering working on Linux and it isn't native to the
    OS.  None of the BSDs.  Not DragonFly, not yet.  Only some of the
    big commercial mainframe OSs have it.

:...
:>    a performance standpoint.
:
:Of course. Hardware emulation has been the domain of VMWare and
:VirtualBox, which provide highly optimized drivers to the guest
:operating systems, and use KVM-like features to optimize out the
:inevitable overheads. VMWare even has experimental DirectX emulation.
:We'll see what happens with that, but it proves they're solving so
:many bottlenecks that they're finally stepping up to the task of
:virtualising modern games, the biggest virtualisation holdout to date.
:
:-- 
:Dmitri Nikulin

    Yah, and it's doable up to a point.  It works great for racks of
    servers, but emulating everything needed for a workstation environment
    is a real mess.  VMWare might as well be its own OS, and in that respect
    the hypervisor support that Linux is developing is probably a better
    way to advance the field.  VMWare uses it too but VMWare is really a
    paper tiger... it wants to be an OS and a virtualization environment,
    so what happens to it when Linux itself becomes a virtualization
    environment?

					-Matt
					Matthew Dillon 
					<dillon at backplane.com>