the 'why' of pseudofs

Bill Hacker wbh at conducive.org
Wed Feb 18 02:14:40 PST 2009


Matthew Dillon wrote:
    There are several reasons for using PFSs.

    * Shared allocation space.  You don't have to worry about blowing
      out small filesystems and having to resize them.
    * Each PFS has its own inode space, allowing mirroring to be used
      to manage backups on a per-PFS basis.  Thus mirroring slaves can
      be conveniently created and destroyed, and masters can be used
      to differentiate what you do and do not want to back up.  e.g.
      I want to backup /home, I don't want to backup /usr/obj.
      In this respect there is actually a lot more to it... PFSs are
      the primary enabler for most of the future multi-master clustering
      work.  Even slaves are extremely inconvenient to do without PFSs
      to manage independent inode spaces.
    * Each PFS can have its own history/snapshot retention policy.
      For example you want to retain history on /home but who cares
      about /tmp or /usr/obj ?  You might want to retain only a few
      days worth of snapshots for /var but hundreds of days for /home.
    * Each PFS can be pruned / reblocked independently of the others.
      For example /build on pkgbox is configured to spend a lot longer
      pruning and reblocking than /archive.
    With regards to softlinks vs null mounts, null mounts are preferred
    because softlinks are not always handled properly, or handled in
    the expected way, by utilities.
    An example of this would be, say, /usr/src.  If /usr/src is a softlink
    then the /usr/obj paths generated would be the expanded softlink.
    So instead of getting /usr/obj/usr/src/... you would instead get
    /usr/obj/pfs/@@0xffffffffffffffff:0007/src/...
    It can get messy very quickly when the filesystem space is glued 
    together with softlinks instead of mounts.

						-Matt
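
For the record - and as a rough sketch only, with the PFS names, mount
points and retention numbers invented for illustration rather than taken
from Matt's setup - the per-PFS arrangement described above looks
something like this:

  # create independent pseudo-filesystems on the HAMMER volume
  hammer pfs-master /hammer/pfs/home
  hammer pfs-master /hammer/pfs/usr.obj

  # null-mount them into place rather than gluing with softlinks
  mount_null /hammer/pfs/home /home
  mount_null /hammer/pfs/usr.obj /usr/obj

  # give each PFS its own snapshot / prune / reblock policy
  # (e.g. 'snapshots 1d 60d' for /home, next to nothing for /usr/obj)
  hammer viconfig /hammer/pfs/home
  hammer viconfig /hammer/pfs/usr.obj

  # and run the maintenance per-PFS
  hammer cleanup /home /usr/obj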

OK - let's say one of the goals here will be more comprehensive 
documentation - not only of what 'is now' (above) but with regard to the 
perils and tribulations - or possibly the advantages, even if edge-case - 
of doing otherwise.

I'll first append Michael's response - embedding a few words:

> PFS is the smallest unit of mirroring, and [therefore also the smallest
> | the] unit to which you can apply specific retention policies. For
> example, while you do not want to retain much history for /tmp, you
> might want to do so for /home. When it comes to mirroring, you clearly
> do not want to mirror changes to PFS /tmp, while you want to mirror
> changes to PFS /home.
Good concept. Bad choice of examples [1].

> If everything lay on a single huge filesystem "/", we could not
> decide what to mirror and what not. That's the major design decision.

. .. my /, /usr, /var, /home, /tmp are (traditionally) on separate 
partitions, if not slices.

I don't need HAMMER there. Logs aside, these could damn near be in ROM 
for as much as two years at a go.

I need HAMMER on the 500 GB single to several TB arrays where client 
applications, IMAP mailstore, web sites, and other *dynamic* data reside.

> You might ask, why not simply specify which directories to mirror
> and which to leave out (without considering PFS)? The issue here is
> that, AFAIK, mirroring works on a very low level, where only inode
> numbers are available and not full pathnames, so something like:
>
>   tar -cvzf /tmp/backup.tgz --exclude=/tmp --exclude=/var/tmp /
>
> would not work, or would be slow.
>
> Another issue is locality. Metadata from one PFS lies closer
> together and as such is faster to iterate.
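
Which is to say - sketching it with invented paths and a made-up backup
host, purely for illustration - the selection happens by what you choose
to mirror, not by exclusion lists:

  # create a slave PFS sharing the master's UUID (shown by pfs-status),
  # then pull the data across; /tmp and /usr/obj simply never get slaves
  hammer pfs-slave /backup/pfs/home shared-uuid=<uuid-from-pfs-status>
  hammer mirror-copy /hammer/pfs/home /backup/pfs/home
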
====================== Questions and counterpoints ================

Some GOSPEL we can agree on came at the end of Matt's reply.

Let's take these as givens:

>    With regards to softlinks vs null mounts, null mounts are preferred
>    *because softlinks are not always handled properly, or handled in
>    the expected way*, by utilities.
>
====
>    *It can get messy very quickly when the filesystem space is glued 
>    together with softlinks instead of mounts.*

====

To which I'll add the reminder that neither (Matt's) excellent 'cpdup', 
nor rsync, nor a seasoned admin working manually can be *certain* that 
all softlinks will ever and always be handled as one wished they had 
been - a fact usually discovered mere seconds after the bullet holes 
appear in the feet.

They can be problem-solvers some of the time, but they are a potential 
maintenance hand-grenade ALL the time.

Further, a 'pseudo' mount is, IMNSHO, just another 'virtual' band-aid of 
a different flavor. It is NOT (yet) sufficiently more assured of being 
handled correctly. Not because the fs & utils cannot be made to do the 
right thing, but because the *sysadmin*, with the habits of long years, 
may not.

New game. More education needed. New habits to be built, and old ones 
'unlearned'.

If/as/when a utility must actually *care* about the foundation level of 
inodes, AND THEN 'virtualize' a known-to-not-match set of these, we 
should be aware that hazards lurk, and have them 'boxed' ahead of time.

'virtual' means the next word is a lie, and if it hides, it bites.

So we need different utilities to avoid foot-shooting.
As with the 'version-display' capability in 'ls', already in the works.
More will be needed...
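
Meanwhile the raw pieces are already visible if you go looking - a rough
illustration only, with the file name invented and the transaction id
left as a placeholder:

  # list the transaction ids at which a file changed
  hammer history /home/wbh/notes.txt

  # any listed version is reachable by appending its transaction id
  cat /home/wbh/notes.txt@@0x<transaction-id>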

I *want* the benefits of a file system that is robust in new ways that 
address real needs. Enter hammerfs.

I DO NOT want to add back fragility that negates those very benefits by 
moving the failure modes from device and OS and fs into the space 
between an admin's ears. Exit 'too much' dependence on softlinks and 
pseudo mounts.

Re-balancing seems to be in order.

I'll now motor off for a day or two to do more research, 
experimentation, and preparation.

Back with specifics 'soon'.

Thanks,

Bill

NOTE: /, /usr, /var, /home, /tmp are not even on the radar.

Properly reserved for system and sysadmins, these are easily cloned and 
kept synced, backed-up, even swapped in/out 'en bloc' - so long as one 
keeps non-admin users, webish-ness, databases, mailstore, 
disaster-recovery image storage, etc. entirely OFF them.

Those needs should be mounted from different slices as a minimum, 
different devices (or arrays) *preferably*.

Think 20-200 GB system device, RAID1 - and UFS/FFS is good enough.

Elsewhere - it is on the 500 GB, 1 TB, 2 TB - and up - 'working storage' 
where HAMMER may rule.

- hammer mirror-stream over 100BT or Gig-E,

- the ability to up-rank the slave to master, *quickly*,

. .. and the incremental changes to even very large storage can be kept 
in sync faster, better, and with fewer resources than by any other method.

Provided one doesn't blow it away with fat-fingers....
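
That up-ranking, sketched with paths and a host name invented for
illustration (and assuming a slave PFS was already created on the
standby with the master's shared-uuid), comes down to roughly:

  # keep the big PFS continuously replicated to a second box
  hammer mirror-stream /hammer/pfs/mail root@standby:/hammer/pfs/mail

  # if the master box dies, promote the slave and null-mount it in place
  hammer pfs-upgrade /hammer/pfs/mail          # run on the standby
  mount_null /hammer/pfs/mail /var/mail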




