BSD Magazine article on DragonFlyBSD hammer filesystem volume management
Tomohiro Kusumi
kusumi.tomohiro at gmail.com
Sun Oct 8 04:29:30 PDT 2017
2017-10-08 1:47 GMT+03:00 Predrag Punosevac <punosevac72 at gmail.com>:
> Tomohiro Kusumi <kusumi.tomohiro at gmail.com> wrote:
>
>> Hi
>
> Hi Tomohiro,
>
> I figured that, in light of the imminent DF 5.0 release with the HAMMER 2
> preview, the following diagram might be useful for noobs like me
>
>
> Linux
>
>
> --------------------------------------------
> | /data1 | /data2 | /data3 |
> --------------------------------------------
> | XFS | XFS | XFS | <--- mkfs.xfs
> -------------------------------------------
> | LVM Volume 1 | LVM Volume 2 | LVM Volume 3 | <--- lvcreate
> -------------------------------------------
> | LVM Volume Group | <--- pvcreate & vgcreate
> -------------------------------------------
> | RAID Volume | <--- mdadm
> -------------------------------------------
> | GPT | GPT | GPT | GPT | <--- parted
> -------------------------------------------
> | /dev/sda | /dev/sdb | /dev/sdc | /dev/sdd |
> --------------------------------------------
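>
> For reference, a rough bottom-up sketch of building that stack (device
> names, RAID level and sizes are hypothetical):
>
>   parted -s /dev/sda mklabel gpt              # repeat for sdb..sdd
>   parted -s /dev/sda mkpart primary 0% 100%
>   mdadm --create /dev/md0 --level=5 --raid-devices=4 \
>       /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1
>   pvcreate /dev/md0                           # physical volume
>   vgcreate vg0 /dev/md0                       # volume group
>   lvcreate -L 100G -n lv_data1 vg0            # repeat for lv_data2, lv_data3
>   mkfs.xfs /dev/vg0/lv_data1
>   mount /dev/vg0/lv_data1 /data1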
>
>
>
>
> DragonFly BSD HAMMER 1
>
>
> --------------------------------------------
> | | /data1 | /data2 |
> -------------------------------------------
> | | /pfs/@@-1:01 | /pfs/@@-1:02 | <-- hammer pfs-master
> -------------------------------------------
> | | HAMMER 1 | <-- newfs_hammer -L DATA
> -------------------------------------------
> | | /dev/ar0s1a | <-- disklabel64
> -------------------------------------------
> | /dev/ar0s0 | /dev/ar0s1 | <-- gpt (slices)
> -------------------------------------------
> | Hardware RAID |
> -------------------------------------------
> | /dev/da0 | /dev/da1 | /dev/da2 | /dev/da3 |
> --------------------------------------------
--------------------------------------------
| | /data1 | /data2 |
-------------------------------------------
| | /pfs/@@-1:01 | /pfs/@@-1:02 | <-- hammer pfs-master
-------------------------------------------
| | HAMMER 1 | <-- newfs_hammer -L DATA
The above is correct from the user's point of view, but the physical layout
is more like below (assuming these figures are meant to show physical layout).
The @@ things are channels to dive into sub-trees, each of which represents
a PFS, within hammer's single large B-Tree per filesystem (not per PFS).
The @@ things, aka PFS, are entry points to logically clustered sub-trees
within the B-Tree, thus PFS contents are not linearly mapped to the
underlying low-level storage layer.
--------------------------------------------
|      | /data1         | /data2        |
--------------------------------------------
|      | /pfs/@@-1:01   | /pfs/@@-1:02  |  <-- hammer PFS channels to B-Tree
                                               localizations (sub-trees)
        \\\/\\\ / \\\ /// \\\ ///\\\//
        \/\\/\\/// \\\\// \\\//\\ ///\     <-- A single B-Tree which contains
        \\/\\ // / \\\ /// \\\ ///\            everything (data/metadata)
--------------------------------------------
|      | HAMMER 1 low-level storage      | <-- hammer low-level storage layer
--------------------------------------------
|      |meta|vol-1 |meta|vol-2 |meta...  | <-- hammer volumes (up to 256 volumes)
--------------------------------------------
|      | HAMMER 1                        | <-- newfs_hammer -L DATA
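
Side note: a HAMMER1 filesystem can span multiple volumes, either at newfs
time or later.  A rough sketch (device paths and the mount point are
hypothetical):

  newfs_hammer -L DATA /dev/da1s1a /dev/da2s1a   # multi-volume filesystem
  hammer volume-add /dev/da3s1a /DATA            # grow by adding a volume
  hammer volume-list /DATA                       # list the volumes
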
> To my knowledge, disklabel64 is mandatory for creating BSD partitions.
>
>
>
>
>
>
>
> FreeBSD ZFS
>
>
> -------------------------------------------
> | /data1 | /data2 |
> -------------------------------------------
> | dataset1 | dataset2 | <-- zfs create
> -------------------------------------------
> | ZFS Storage Pool | <-- zpool create
> -------------------------------------------
> | GPT | GPT | GPT | GPT | <-- gpart
> -------------------------------------------
> | /dev/da0 | /dev/da1 | /dev/da2 | /dev/da3 |
> --------------------------------------------
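>
> For reference, a rough sketch of that stack (pool and dataset names are
> hypothetical):
>
>   gpart create -s gpt da0                 # repeat for da1..da3
>   gpart add -t freebsd-zfs da0
>   zpool create tank raidz da0p1 da1p1 da2p1 da3p1
>   zfs create tank/dataset1
>   zfs create tank/dataset2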
>
>
>
>
> Note that the GPT layer for ZFS can be skipped. It is merely put there to
> protect a user from problems with HDDs of slightly unequal sizes, as well
> as to make it easier to identify the block devices (whole disks or
> partitions on a disk), so one doesn't get the ambiguity caused by the OS
> renumbering devices depending on which devices were found in hardware
> (which can turn da0 into da1). The labeling part can be accomplished with
> glabel from the GEOM framework. One should use zfs create with some kind
> of compression option.
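>
> For example (names are hypothetical), the labeling and the compression
> would look something like:
>
>   glabel label disk0 /dev/da0              # shows up as /dev/label/disk0
>   zfs create -o compression=lz4 tank/data1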
>
>
>
>
>> Regarding LVM part,
>>
>> 2017-10-07 7:34 GMT+03:00 Predrag Punosevac <punosevac72 at gmail.com>:
>> > Siju George wrote:
>> >
>> >
>> >> Hi,
>> >>
>> >> I have written an article about DragonFlyBSD hammer filesystem volume
>> >> management. This will be particularly useful for Linux users familiar
>> >> with lvm.
>> >>
>> >> https://bsdmag.org/download/military-grade-data-wiping-freebsd-bcwipe
>> >>
>> >> Thanks
>> >>
>> >> --Siju
>> >
>> > I just read the article twice and I am completely confused (probably to
>> > be expected from a guy who is trained in theoretical mathematics and
>> > astronomy). I would appreciate it if somebody could correct my
>> > understanding of things.
>> >
>> > On the most basic level, OSs manage disks through device nodes. For many
>> > years engineers created filesystems assuming that the filesystem would go
>> > directly onto the physical disk. The proliferation of commodity hardware
>> > forced a re-evaluation of this idea: not all drives have the same
>> > rotational speed, number of platters, etc. So, much like the French
>> > approach in mathematics, somebody came up with the idea of abstracting
>> > the physical devices and of no longer lying to OSs about the geometry of
>> > hard drives. That is why all modern HDDs use Logical Block Addressing
>> > (LBA). The abstraction didn't end there.
>> >
>> > At the next abstraction layer we have disk slicing (Linux people,
>> > oblivious to BSD partitions, call these slices partitions). I am
>> > semi-familiar with two slicing schemes. One is the old MBR scheme, still
>> > the default on OpenBSD, and the other is the GPT scheme. On DragonFly one
>> > should always use the GPT scheme; see man gpt. For example
>> >
>> > dfly# gpt show /dev/da3
>> >       start       size  index  contents
>> >           0          1      -  PMBR
>> >           1          1      -  Pri GPT header
>> >           2         32      -  Pri GPT table
>> >          34          1      0  GPT part - DragonFly Label64
>> >          35  976773100      1  GPT part - DragonFly Label64
>> >   976773135         32      -  Sec GPT table
>> >   976773167          1      -  Sec GPT header
>> >
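>> > Setting up a fresh data disk would look roughly like this (hypothetical
>> > device; check gpt(8) for the exact flags and the default partition type):
>> >
>> > gpt create da4       # write primary and secondary GPT structures
>> > gpt add da4          # add a GPT partition in the free space
>> > gpt show da4         # verify the resulting layout
>> >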
>> >
>> > On Linux one uses parted to set up the GPT scheme, while FreeBSD uses
>> > gpart. Once you slice your disk (typically Linux will create three
>> > slices, while historically DragonFly used only one slice (dangerously
>> > dedicated) or two slices as in the example above), we can start talking
>> > about BSD partitions. On the above hard drive the data is located in the
>> > slice indexed by 1. The slice with index 0 is created for historical
>> > reasons so that the drive is not dangerously dedicated.
>> >
>> >
>> > A slice can be further divided into BSD partitions (up to 16) on
>> > DragonFly (Linux has no equivalent) using the disklabel64 command. Let's
>> > look at how the slices on the above disk are divided
>> >
>> > dfly# disklabel64 -r /dev/da3s0
>> > disklabel64: /dev/da3s0: No such file or directory
>> > dfly# disklabel64 -r /dev/da3s1
>> > # /dev/da3s1:
>> > #
>> > # Informational fields calculated from the above
>> > # All byte equivalent offsets must be aligned
>> > #
>> > # boot space: 1059328 bytes
>> > # data space: 488385505 blocks # 476938.97 MB (500106757632 bytes)
>> > #
>> > # NOTE: If the partition data base looks odd it may be
>> > # physically aligned instead of slice-aligned
>> > #
>> > diskid: a6a0a2ef-a4d1-11e7-98d9-b9aeed3cce35
>> > label:
>> > boot2 data base: 0x000000001000
>> > partitions data base: 0x000000103a00
>> > partitions data stop: 0x007470bfc000
>> > backup label: 0x007470bfc000
>> > total size: 0x007470bfd800 # 476939.99 MB
>> > alignment: 4096
>> > display block size: 1024 # for partition display only
>> >
>> > 16 partitions:
>> > #          size     offset    fstype   fsuuid
>> >   a:  104857600          0    4.2BSD   # 102400.000MB
>> >   b:  262144000  104857600    4.2BSD   # 256000.000MB
>> >   a-stor_uuid: 5ad35da2-a4d1-11e7-98d9-b9aeed3cce35
>> >   b-stor_uuid: 5ad35dba-a4d2-11e7-98d9-b9aeed3cce35
>> >
>> >
>> > As you can see, the first slice contains no partitions. The second slice
>> > is partitioned into two parts:
>> >
>> > da3s1a and da3s1b
>> >
>> > One could put a file system, UFS or HAMMER, on those partitions. It is
>> > entirely possible to have different file systems on different partitions:
>> > for example, da3s1a is formatted to use UFS while da3s1b is formatted to
>> > use HAMMER. DragonFly's cousin FreeBSD can do the same. Linux can't.
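>> >
>> > A rough sketch of how such a layout is created (using the slice from the
>> > example above):
>> >
>> > disklabel64 -r -w da3s1 auto       # write an initial label on the slice
>> > disklabel64 -e da3s1               # edit it to add the a: and b: partitions
>> > newfs /dev/da3s1a                  # UFS on the first partition
>> > newfs_hammer -L DATA /dev/da3s1b   # HAMMER on the second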
>> >
>> > On Linux a partition such as sda1 can't be further subdivided and must
>> > contain a file system like XFS. Not so fast: on Linux some people decided
>> > to add another abstraction layer called the logical volume manager (LVM).
>>
>> DragonFly has LVM too.
>> To be exact, it's a port of Linux's LVM implementation called LVM2, along
>> with its kernel-side subsystem Device Mapper, which is an abstraction
>> layer for various block-device-level features, like LVM, snapshots,
>> software RAID, multipath, and much more.
>>
>> The problem with DragonFly is that most of the (useful) DM target
>> drivers are either not working or not ported.
>> I'm not even sure if LVM2 is stable, let alone software RAID.
>>
>>
>
> I am aware that DF has LVM2, but I have never read the code. I truly
> appreciate your comments regarding its usability. At some point in the
> distant past I foolishly asked whether LVM2 could be used to resize a
> HAMMER1 partition. The answer is of course no.
>
>
>
>
>> > One can
>> > combine several physical volumes (not recommended, as software RAID
>> > should be used for that) into an LVM volume group. The volume group is
>> > further divided into logical volumes. The file system goes on top of the
>> > logical volumes. A cool thing about logical volumes is that they can be
>> > expanded while mounted (shrinking is also possible, though not with XFS).
>> > Actually the coolest thing is that one can take a snapshot of a logical
>> > volume, although I have not seen many Linux users do that in practice.
>> > In terms of file systems the only real choice is XFS. Historically LVM
>> > didn't provide any options for redundancy or parity, so software RAID
>> > (mdadm) or hardware RAID was needed. It is perfectly OK to initialize a
>> > RAID device /dev/md0 as a volume group, partition it into a few logical
>> > volumes, and then put XFS on top of it.
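>> >
>> > For example (hypothetical names), growing and snapshotting a mounted
>> > logical volume looks roughly like:
>> >
>> > lvextend -L +10G /dev/vg0/lv_data1                  # grow the LV
>> > xfs_growfs /data1                                   # grow XFS online
>> > lvcreate -s -L 5G -n data1_snap /dev/vg0/lv_data1   # snapshot of the LV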
>>
>> LVM is usually a different thing from RAID.
>> In the case of LVM on Linux (which is DM-based LVM2), you can stack DM
>> devices to synthesize these (or other block-level features provided by
>> the DM infrastructure).
>> You can stack DM devices on DragonFly too, but there aren't that many
>> good and/or stable DM target drivers worth stacking...
>
> RedHat now recommends RAIDing with LVM over MDRAID. I am not sure how
> mature and featureful LVM RAID is; true storage experts could pitch in
> on this question.
I also usually see people use md for software RAID (rather than what dm
provides for RAID).
>
>>
>>
>> > Going back to cousin FreeBSD now. Unlike Linux, FreeBSD uses GEOM
>> > instead of LVM. My understanding is that GEOM combines LVM, RAID, and
>> > even encryption (I am not sure whether GELI is part of GEOM) into one
>> > framework. GEOM used as LVM allows for UFS journaling.
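>> >
>> > A rough sketch of those pieces (hypothetical devices; see gmirror(8),
>> > geli(8) and gjournal(8)):
>> >
>> > gmirror label -v gm0 /dev/ada1 /dev/ada2   # RAID1 mirror -> /dev/mirror/gm0
>> > geli init /dev/mirror/gm0                  # encryption on top of the mirror
>> > gjournal label /dev/ada3                   # journaling provider for UFS
>> > newfs -J /dev/ada3.journal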
>>
>> The reason DragonFly doesn't have GEOM is because that's the way they
>> wanted it to be at least in 2010.
>> https://www.dragonflybsd.org/mailarchive/users/2010-07/msg00085.html
>>
>>
>
> I read the discussion many times in the past, but I just read it one more
> time. It feels after all these years that Matt's position has been
> validated. However, I am not sure about the
>
> gpt+disklabel64 vs gpart
>
> part of that discussion. I remember the first time I initialized a storage
> HDD, wondering several times why I need to create a partition inside a
> slice if I am going to use the entire slice for just one partition which
> is going to hold the data. I do understand Matt's point about disklabel64
> though.
>
>
>
>> > So where are HAMMER1 and ZFS in all this story?
>> >
>> > On one hand ZFS makes hardware/software RAID obsolete, as it is a volume
>> > manager (in the sense of RAID). It is also a volume manager in the sense
>> > of LVM, with caveats (ZFS volumes on FreeBSD IIRC could be grown only by
>> > off-lining physical drives and replacing them with larger drives before
>> > re-silvering). However, ZFS brings a lot of new goodies: COW, checksums,
>> > self-healing, compression, snapshots (the ones that people actually use,
>> > unlike LVM's), and remote replication. It is possible to use a ZFS volume
>> > as an iSCSI target and to boot from a ZFS volume (all my file servers do
>> > that and even use beadm). It is the best thing since sliced bread if you
>> > are willing to reimplement a large part of the Solaris kernel (which the
>> > FreeBSD people did). ZFS pools, which can be thought of as LVM volume
>> > groups, can be divided into datasets, which are in some sense equivalent
>> > to LVM logical volumes.
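>> >
>> > For example (pool, dataset and host names are hypothetical):
>> >
>> > zfs snapshot tank/dataset1@before-upgrade
>> > zfs send tank/dataset1@before-upgrade | ssh backuphost zfs receive backup/dataset1
>> > zfs create -V 50G tank/iscsivol        # a zvol, usable as an iSCSI target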
>> >
>> >
>> > HAMMER1 is unable to manage volumes in the sense of RAID. It requires
>> > software or hardware RAID for high availability and redundancy.
>>
>> This was actually the design of HAMMER1, as mentioned in section 11 of
>> the document below, from 2008.
>> https://www.dragonflybsd.org/hammer/hammer.pdf
>>
>
> I am aware of the document. I wish somebody who has read the natacontrol
> code could chime in on the state of software RAID on DF. Based upon this
> document
>
> https://www.dragonflybsd.org/docs/docs/howtos/howtosoftwareraid/
>
> that thing probably has not been touched since 2005 (like jails). That
> further amplifies the need for regular testing of hardware RAID cards on
> DF.
>
>>
>> > To my
>> > knowledge, software RAID level 1 is achieved on DragonFly via the old
>> > FreeBSD 4.8 framework natacontrol. I am not sure if that thing should be
>> > used in production any longer. Anyhow, hardware RAID seems the way to go.
>> > HAMMER1 can't be expanded, so unlike ZFS it is much more of a pure file
>> > system. However, as a file system it is second to none: COW, checksums,
>> > healing via history, fine-grained journaling, snapshots, etc. The HAMMER1
>> > equivalent of datasets are pseudo file systems (PFS), which are very
>> > cheap (for example, the home directory of each user on a file server
>> > could be a PFS, which could be destroyed in the case of a policy
>> > violation). HAMMER1 comes with built-in backup (mirror-stream).
>> > Unfortunately slave PFSs are read only. DF can't boot from HAMMER1.
>> > Anyhow, the PFSs do indeed to some extent look like Linux logical
>> > volumes, but they are so much more advanced. HAMMER1 is NFS and Samba
>> > aware (NAS), but I am not sure about DragonFly's SAN (iSCSI)
>> > capabilities.
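>> >
>> > For example (paths are hypothetical, on HAMMER mounts at /data and /backup):
>> >
>> > hammer pfs-master /data/pfs/home_alice           # per-user master PFS
>> > hammer pfs-slave /backup/pfs/home_alice \
>> >     shared-uuid=<uuid from 'hammer pfs-status'>  # read-only slave
>> > hammer mirror-stream /data/pfs/home_alice /backup/pfs/home_alice
>> > hammer pfs-destroy /data/pfs/home_alice          # e.g. on policy violation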
>> >
>
> I was wrong about iSCSI on DragonFly. Of course it has it
>
> https://www.dragonflybsd.org/cgi/web-man?command=iscsi.conf&section=5
>
> It would be interesting to hear from somebody who is actually using it
> and who has read the actual code.
>
>
>> >
>> > I would really appreciate it if people could point out mistakes in the
>> > above write-up and give me some references so that I can actually learn
>> > something.
>> >
>> >
> I am finishing this post by saying that I am in suspense awaiting the
> release of DF 5.0 and the preview of HAMMER2.
>> >
>> > Cheers,
>> > Predrag