BSD Magazine article on DragonFlyBSD hammer filesystem volume management
Tomohiro Kusumi
kusumi.tomohiro at gmail.com
Sun Oct 8 04:29:30 PDT 2017
2017-10-08 1:47 GMT+03:00 Predrag Punosevac <punosevac72 at gmail.com>:
> Tomohiro Kusumi <kusumi.tomohiro at gmail.com> wrote:
>
>> Hi
>
> Hi Tomohiro,
>
> I figured that, in light of the imminent DF 5.0 release with the HAMMER 2
> preview, the following diagram might be useful for noobs like me
>
>
> Linux
>
>
> --------------------------------------------
> | /data1 | /data2 | /data3 |
> --------------------------------------------
> | XFS | XFS | XFS | <--- mkfs.xfs
> -------------------------------------------
> | LVM Volume 1 | LVM Volume 2 | LVM Volume 3 | <--- lvcreate
> -------------------------------------------
> | LVM Volume Group | <--- pvcreate & vgcreate
> -------------------------------------------
> | RAID Volume | <--- mdadm
> -------------------------------------------
> | GPT | GPT | GPT | GPT | <--- parted
> -------------------------------------------
> | /dev/sda | /dev/sdb | /dev/sdc | /dev/sdd |
> --------------------------------------------
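>
> For reference, a rough bottom-up sketch of building that stack (device
> names, RAID level and sizes are hypothetical):
>
>   parted -s /dev/sda mklabel gpt              # repeat for sdb..sdd
>   parted -s /dev/sda mkpart primary 0% 100%
>   mdadm --create /dev/md0 --level=5 --raid-devices=4 \
>       /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1
>   pvcreate /dev/md0                           # physical volume
>   vgcreate vg0 /dev/md0                       # volume group
>   lvcreate -L 100G -n lv_data1 vg0            # repeat for lv_data2, lv_data3
>   mkfs.xfs /dev/vg0/lv_data1
>   mount /dev/vg0/lv_data1 /data1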
>
>
>
>
> DragonFly BSD HAMMER 1
>
>
> --------------------------------------------
> | | /data1 | /data2 |
> -------------------------------------------
> | | /pfs/@@-1:01 | /pfs/@@-1:02 | <-- hammer pfs-master
> -------------------------------------------
> | | HAMMER 1 | <-- newfs_hammer -L DATA
> -------------------------------------------
> | | /dev/ar0s1a | <-- disklabel64
> -------------------------------------------
> | /dev/ar0s0 | /dev/ar0s1 | <-- gpt (slices)
> -------------------------------------------
> | Hardware RAID |
> -------------------------------------------
> | /dev/da0 | /dev/da1 | /dev/da2 | /dev/da3 |
> --------------------------------------------
--------------------------------------------
| | /data1 | /data2 |
-------------------------------------------
| | /pfs/@@-1:01 | /pfs/@@-1:02 | <-- hammer pfs-master
-------------------------------------------
| | HAMMER 1 | <-- newfs_hammer -L DATA
The above is correct from the user's point of view, but the physical layout
is more like below (assuming these figures are meant to show physical layout).
The @@ things are channels to dive into sub-trees, each of which represents
a PFS, within hammer's single large B-Tree per filesystem (not per PFS).
The @@ things, aka PFS, are entry points to logically clustered sub-trees
within the B-Tree, thus PFS contents are not linearly mapped to the
underlying low-level storage layer.
--------------------------------------------
|      | /data1         | /data2        |
--------------------------------------------
|      | /pfs/@@-1:01   | /pfs/@@-1:02  |  <-- hammer PFS channels to B-Tree
                                               localizations (sub-trees)
        \\\/\\\ / \\\ /// \\\ ///\\\//
        \/\\/\\/// \\\\// \\\//\\ ///\     <-- A single B-Tree which contains
        \\/\\ // / \\\ /// \\\ ///\            everything (data/metadata)
--------------------------------------------
|      | HAMMER 1 low-level storage      | <-- hammer low-level storage layer
--------------------------------------------
|      |meta|vol-1 |meta|vol-2 |meta...  | <-- hammer volumes (up to 256 volumes)
--------------------------------------------
|      | HAMMER 1                        | <-- newfs_hammer -L DATA
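
Side note: a HAMMER1 filesystem can span multiple volumes, either at newfs
time or later.  A rough sketch (device paths and the mount point are
hypothetical):

  newfs_hammer -L DATA /dev/da1s1a /dev/da2s1a   # multi-volume filesystem
  hammer volume-add /dev/da3s1a /DATA            # grow by adding a volume
  hammer volume-list /DATA                       # list the volumes
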
> To my knowledge, disklabel64 is mandatory for creating BSD partitions.
>
>
>
>
>
>
>
> FreeBSD ZFS
>
>
> -------------------------------------------
> | /data1 | /data2 |
> -------------------------------------------
> | dataset1 | dataset2 | <-- zfs create
> -------------------------------------------
> | ZFS Storage Pool | <-- zpool create
> -------------------------------------------
> | GPT | GPT | GPT | GPT | <-- gpart
> -------------------------------------------
> | /dev/da0 | /dev/da1 | /dev/da2 | /dev/da3 |
> --------------------------------------------
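>
> For reference, a rough sketch of that stack (pool and dataset names are
> hypothetical):
>
>   gpart create -s gpt da0                 # repeat for da1..da3
>   gpart add -t freebsd-zfs da0
>   zpool create tank raidz da0p1 da1p1 da2p1 da3p1
>   zfs create tank/dataset1
>   zfs create tank/dataset2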
>
>
>
>
> Note that the GPT layer for ZFS can be skipped. It is merely put there to
> protect a user from problems with HDDs of slightly unequal sizes, as well
> as to make it easier to identify the block devices (whole disks or
> partitions on a disk), so one doesn't get the ambiguity caused by the OS
> renumbering devices depending on which devices were found in hardware
> (which can turn da0 into da1). The labeling part can be accomplished with
> glabel from the GEOM framework. One should use zfs create with some kind
> of compression option.
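>
> For example (names are hypothetical), the labeling and the compression
> would look something like:
>
>   glabel label disk0 /dev/da0              # shows up as /dev/label/disk0
>   zfs create -o compression=lz4 tank/data1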
>
>
>
>
>> Regarding LVM part,
>>
>> 2017-10-07 7:34 GMT+03:00 Predrag Punosevac <punosevac72 at gmail.com>:
>> > Siju George wrote:
>> >
>> >
>> >> Hi,
>> >>
>> >> I have written an article about DragonFlyBSD hammer filesystem volume
>> >> management. This will be particularly useful for Linux users familiar
>> >> with lvm.
>> >>
>> >> https://bsdmag.org/download/military-grade-data-wiping-freebsd-bcwipe
>> >>
>> >> Thanks
>> >>
>> >> --Siju
>> >
>> > I just read the article twice and I am completely confused (probably to
>> > be expected from a guy who is trained in theoretical mathematics and
>> > astronomy). I would appreciate it if somebody could correct my
>> > understanding of things.
>> >
>> > On the most basic level, OSs manage disks through device nodes. For many
>> > years engineers created filesystems assuming that the filesystem would go
>> > directly onto the physical disk. The proliferation of commodity hardware
>> > forced a re-evaluation of this idea: not all drives have the same
>> > rotational speed, number of platters, etc. So, much like the French
>> > approach in mathematics, somebody came up with the idea of abstracting
>> > the physical devices and of no longer lying to OSs about the geometry of
>> > hard drives. That is why all modern HDDs use Logical Block Addressing
>> > (LBA). The abstraction didn't end there.
>> >
>> > At the next abstraction layer we have disk slicing (Linux people,
>> > oblivious to BSD partitions, call these slices partitions). I am
>> > semi-familiar with two slicing schemes. One is the old MBR scheme, still
>> > the default on OpenBSD, and the other is the GPT scheme. On DragonFly one
>> > should always use the GPT scheme; see man gpt. For example
>> >
>> > dfly# gpt show /dev/da3
>> >       start       size  index  contents
>> >           0          1      -  PMBR
>> >           1          1      -  Pri GPT header
>> >           2         32      -  Pri GPT table
>> >          34          1      0  GPT part - DragonFly Label64
>> >          35  976773100      1  GPT part - DragonFly Label64
>> >   976773135         32      -  Sec GPT table
>> >   976773167          1      -  Sec GPT header
>> >
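>> > Setting up a fresh data disk would look roughly like this (hypothetical
>> > device; check gpt(8) for the exact flags and the default partition type):
>> >
>> > gpt create da4       # write primary and secondary GPT structures
>> > gpt add da4          # add a GPT partition in the free space
>> > gpt show da4         # verify the resulting layout
>> >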
>> >
>> > On Linux one uses parted to set up the GPT scheme, while FreeBSD uses
>> > gpart. Once you slice your disk (typically Linux will create three
>> > slices, while historically DragonFly used only one slice (dangerously
>> > dedicated) or two slices as in the example above), we can start talking
>> > about BSD partitions. On the above hard drive the data is located in the
>> > slice indexed by 1. The slice with index 0 is created for historical
>> > reasons so that the drive is not dangerously dedicated.
>> >
>> >
>> > A slice can be further divided into BSD partitions (up to 16) on
>> > DragonFly (Linux has no equivalent) using the disklabel64 command. Let's
>> > look at how the slices on the above disk are divided
>> >
>> > dfly# disklabel64 -r /dev/da3s0
>> > disklabel64: /dev/da3s0: No such file or directory
>> > dfly# disklabel64 -r /dev/da3s1
>> > # /dev/da3s1:
>> > #
>> > # Informational fields calculated from the above
>> > # All byte equivalent offsets must be aligned
>> > #
>> > # boot space: 1059328 bytes
>> > # data space: 488385505 blocks # 476938.97 MB (500106757632 bytes)
>> > #
>> > # NOTE: If the partition data base looks odd it may be
>> > # physically aligned instead of slice-aligned
>> > #
>> > diskid: a6a0a2ef-a4d1-11e7-98d9-b9aeed3cce35
>> > label:
>> > boot2 data base: 0x000000001000
>> > partitions data base: 0x000000103a00
>> > partitions data stop: 0x007470bfc000
>> > backup label: 0x007470bfc000
>> > total size: 0x007470bfd800 # 476939.99 MB
>> > alignment: 4096
>> > display block size: 1024 # for partition display only
>> >
>> > 16 partitions:
>> > #          size     offset    fstype   fsuuid
>> >   a:  104857600          0    4.2BSD   # 102400.000MB
>> >   b:  262144000  104857600    4.2BSD   # 256000.000MB
>> >   a-stor_uuid: 5ad35da2-a4d1-11e7-98d9-b9aeed3cce35
>> >   b-stor_uuid: 5ad35dba-a4d2-11e7-98d9-b9aeed3cce35
>> >
>> >
>> > As you can see, the first slice contains no partitions. The second slice
>> > is partitioned into two parts:
>> >
>> > da3s1a and da3s1b
>> >
>> > One could put a file system, UFS or HAMMER, on those partitions. It is
>> > entirely possible to have different file systems on different partitions:
>> > for example, da3s1a is formatted to use UFS while da3s1b is formatted to
>> > use HAMMER. DragonFly's cousin FreeBSD can do the same. Linux can't.
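>> >
>> > A rough sketch of how such a layout is created (using the slice from the
>> > example above):
>> >
>> > disklabel64 -r -w da3s1 auto       # write an initial label on the slice
>> > disklabel64 -e da3s1               # edit it to add the a: and b: partitions
>> > newfs /dev/da3s1a                  # UFS on the first partition
>> > newfs_hammer -L DATA /dev/da3s1b   # HAMMER on the second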
>> >
>> > On Linux a partition such as sda1 can't be further subdivided and must
>> > contain a file system like XFS. Not so fast: on Linux some people decided
>> > to add another abstraction layer called the logical volume manager (LVM).
>>
>> DragonFly has LVM too.
>> To be exact, it's a port of Linux's LVM implementation called LVM2, along
>> with its kernel-side subsystem Device Mapper, which is an abstraction
>> layer for various block-device-level features, like LVM, snapshots,
>> software RAID, multipath, and much more.
>>
>> The problem with DragonFly is that most of the (useful) DM target
>> drivers are either not working or not ported.
>> I'm not even sure if LVM2 is stable, let alone software RAID.
>>
>>
>
> I am aware that DF has LVM2, but I have never read the code. I truly
> appreciate your comments regarding its usability. At some point in the
> distant past I foolishly asked whether LVM2 could be used to resize a
> HAMMER1 partition. The answer is of course no.
>
>
>
>
>> > One can
>> > combine several physical volumes (not recommended, as software RAID
>> > should be used for that) into an LVM volume group. The volume group is
>> > further divided into logical volumes. The file system goes on top of the
>> > logical volumes. A cool thing about logical volumes is that they can be
>> > expanded while mounted (shrinking is also possible, though not with XFS).
>> > Actually the coolest thing is that one can take a snapshot of a logical
>> > volume, although I have not seen many Linux users do that in practice.
>> > In terms of file systems the only real choice is XFS. Historically LVM
>> > didn't provide any options for redundancy or parity, so software RAID
>> > (mdadm) or hardware RAID was needed. It is perfectly OK to initialize a
>> > RAID device /dev/md0 as a volume group, partition it into a few logical
>> > volumes, and then put XFS on top of it.
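>> >
>> > For example (hypothetical names), growing and snapshotting a mounted
>> > logical volume looks roughly like:
>> >
>> > lvextend -L +10G /dev/vg0/lv_data1                  # grow the LV
>> > xfs_growfs /data1                                   # grow XFS online
>> > lvcreate -s -L 5G -n data1_snap /dev/vg0/lv_data1   # snapshot of the LV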
>>
>> LVM is usually a different thing from RAID.
>> In the case of LVM on Linux (which is DM-based LVM2), you can stack DM
>> devices to synthesize these (or other block-level features provided by
>> the DM infrastructure).
>> You can stack DM devices on DragonFly too, but there aren't that many
>> good and/or stable DM target drivers worth stacking...
>
> RedHat now recommends RAIDing with LVM over MDRAID. I am not sure how
> mature and featureful LVM RAID is; true storage experts could pitch in
> on this question.
I also usually see people use md for software RAID (rather than what dm
provides for RAID).
>
>>
>>
>> > Going back to cousin FreeBSD now. Unlike Linux, FreeBSD uses GEOM
>> > instead of LVM. My understanding is that GEOM combines LVM, RAID, and
>> > even encryption (I am not sure whether GELI is part of GEOM) into one
>> > framework. GEOM used as LVM allows for UFS journaling.
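>> >
>> > A rough sketch of those pieces (hypothetical devices; see gmirror(8),
>> > geli(8) and gjournal(8)):
>> >
>> > gmirror label -v gm0 /dev/ada1 /dev/ada2   # RAID1 mirror -> /dev/mirror/gm0
>> > geli init /dev/mirror/gm0                  # encryption on top of the mirror
>> > gjournal label /dev/ada3                   # journaling provider for UFS
>> > newfs -J /dev/ada3.journal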
>>
>> The reason DragonFly doesn't have GEOM is because that's the way they
>> wanted it to be at least in 2010.
>> https://www.dragonflybsd.org/mailarchive/users/2010-07/msg00085.html
>>
>>
>
> I read the discussion many times in the past, but I just read it one more
> time. It feels after all these years that Matt's position has been
> validated. However, I am not sure about the
>
> gpt+disklabel64 vs gpart
>
> part of that discussion. I remember the first time I initialized a storage
> HDD, wondering several times why I need to create a partition inside a
> slice if I am going to use the entire slice for just one partition which
> is going to hold the data. I do understand Matt's point about disklabel64
> though.
>
>
>
>> > So where are HAMMER1 and ZFS in all this story?
>> >
>> > On one hand ZFS makes hardware/software RAID obsolete, as it is a volume
>> > manager (in the sense of RAID). It is also a volume manager in the sense
>> > of LVM, with caveats (ZFS volumes on FreeBSD IIRC could be grown only by
>> > off-lining physical drives and replacing them with larger drives before
>> > re-silvering). However, ZFS brings a lot of new goodies: COW, checksums,
>> > self-healing, compression, snapshots (the ones that people actually use,
>> > unlike LVM's), and remote replication. It is possible to use a ZFS volume
>> > as an iSCSI target and to boot from a ZFS volume (all my file servers do
>> > that and even use beadm). It is the best thing since sliced bread if you
>> > are willing to reimplement a large part of the Solaris kernel (which the
>> > FreeBSD people did). ZFS pools, which can be thought of as LVM volume
>> > groups, can be divided into datasets, which are in some sense equivalent
>> > to LVM logical volumes.
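>> >
>> > For example (pool, dataset and host names are hypothetical):
>> >
>> > zfs snapshot tank/dataset1@before-upgrade
>> > zfs send tank/dataset1@before-upgrade | ssh backuphost zfs receive backup/dataset1
>> > zfs create -V 50G tank/iscsivol        # a zvol, usable as an iSCSI target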
>> >
>> >
>> > HAMMER1 is unable to manage volumes in the sense of RAID. It requires
>> > software or hardware RAID for high availability and redundancy.
>>
>> This was actually the design of HAMMER1, as mentioned in section 11 of
>> the document below, from 2008.
>> https://www.dragonflybsd.org/hammer/hammer.pdf
>>
>
> I am aware of the document. I wish somebody who has read the natacontrol
> code could chime in on the state of software RAID on DF. Based upon this
> document
>
> https://www.dragonflybsd.org/docs/docs/howtos/howtosoftwareraid/
>
> that thing probably has not been touched since 2005 (like jails). That
> further amplifies the need for regular testing of hardware RAID cards on
> DF.
>
>>
>> > To my
>> > knowledge, software RAID level 1 is achieved on DragonFly via the old
>> > FreeBSD 4.8 framework natacontrol. I am not sure if that thing should be
>> > used in production any longer. Anyhow, hardware RAID seems the way to go.
>> > HAMMER1 can't be expanded, so unlike ZFS it is much more of a pure file
>> > system. However, as a file system it is second to none: COW, checksums,
>> > healing via history, fine-grained journaling, snapshots, etc. The HAMMER1
>> > equivalent of datasets are pseudo file systems (PFS), which are very
>> > cheap (for example, the home directory of each user on a file server
>> > could be a PFS, which could be destroyed in the case of a policy
>> > violation). HAMMER1 comes with built-in backup (mirror-stream).
>> > Unfortunately slave PFSs are read only. DF can't boot from HAMMER1.
>> > Anyhow, the PFSs do indeed to some extent look like Linux logical
>> > volumes, but they are so much more advanced. HAMMER1 is NFS and Samba
>> > aware (NAS), but I am not sure about DragonFly's SAN (iSCSI)
>> > capabilities.
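>> >
>> > For example (paths are hypothetical, on HAMMER mounts at /data and /backup):
>> >
>> > hammer pfs-master /data/pfs/home_alice           # per-user master PFS
>> > hammer pfs-slave /backup/pfs/home_alice \
>> >     shared-uuid=<uuid from 'hammer pfs-status'>  # read-only slave
>> > hammer mirror-stream /data/pfs/home_alice /backup/pfs/home_alice
>> > hammer pfs-destroy /data/pfs/home_alice          # e.g. on policy violation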
>> >
>
> I was wrong about iSCSI on DragonFly. Of course it has it
>
> https://www.dragonflybsd.org/cgi/web-man?command=iscsi.conf&section=5
>
> It would be interesting to hear from somebody who is actually using it
> and who has read the actual code.
>
>
>> >
>> > I would really appreciate it if people could point out mistakes in the
>> > above write-up and give me some references so that I can actually learn
>> > something.
>> >
>> >
> I am finishing this post by saying that I am in suspense awaiting the
> release of DF 5.0 and the preview of HAMMER2.
>> >
>> > Cheers,
>> > Predrag