BSD Magazine article on DragonFlyBSD hammer filesystem volume management

Predrag Punosevac punosevac72 at gmail.com
Sat Oct 7 15:47:29 PDT 2017


Tomohiro Kusumi <kusumi.tomohiro at gmail.com> wrote:

> Hi

Hi Tomohiro,

I figured that, in light of the imminent DF 5.0 release with the
HAMMER2 preview, the following diagrams might be useful for noobs like me.


                    Linux 


 --------------------------------------------
|     /data1   |    /data2    |    /data3    |
 --------------------------------------------
|     XFS      |     XFS      |    XFS       |  <--- mkfs.xfs
 --------------------------------------------
| LVM Volume 1 | LVM Volume 2 | LVM Volume 3 |  <--- lvcreate
 --------------------------------------------
|            LVM Volume Group                |  <--- pvcreate & vgcreate
 --------------------------------------------
|               RAID Volume                  |  <--- mdadm
 --------------------------------------------
|   GPT    |    GPT   |   GPT    |    GPT    |  <--- parted
 --------------------------------------------
| /dev/sda | /dev/sdb | /dev/sdc | /dev/sdd  |
 --------------------------------------------
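
In command form the stack is built from the bottom up, roughly like
this (device names, the RAID level, and the datavg/data1 names are made
up; this is a sketch of the idea, not a recipe):

# parted -s /dev/sda mklabel gpt
# parted -s /dev/sda mkpart primary 1MiB 100%
  (repeat for sdb, sdc and sdd)
# mdadm --create /dev/md0 --level=10 --raid-devices=4 \
        /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1
# pvcreate /dev/md0
# vgcreate datavg /dev/md0
# lvcreate -L 100G -n data1 datavg
# mkfs.xfs /dev/datavg/data1
# mount /dev/datavg/data1 /data1
  (and similarly for data2 and data3)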



              
            DragonFly BSD HAMMER 1


 --------------------------------------------
|            |     /data1    |    /data2     |
 --------------------------------------------
|            |  /pfs/@@-1:01 | /pfs/@@-1:02  |  <-- hammer pfs-master
 --------------------------------------------
|            |       HAMMER 1                |  <-- newfs_hammer -L DATA
 --------------------------------------------
|            |     /dev/ar0s1a               |  <-- disklabel64
 --------------------------------------------
| /dev/ar0s0 |     /dev/ar0s1                |  <-- gpt (slices)
 --------------------------------------------
|                Hardware RAID               |
 --------------------------------------------
| /dev/da0 | /dev/da1 | /dev/da2 | /dev/da3  |
 --------------------------------------------


To my knowledge, disklabel64 for creating BSD partitions is mandatory.
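
In command form the stack above would be built roughly like this (ar0,
the DATA label, the PFS names and the mount points are only an
illustration; I am writing this from memory of gpt(8), disklabel64(8),
newfs_hammer(8) and hammer(8), so double check against the man pages):

dfly# gpt create /dev/ar0                 # fresh GPT on the RAID volume
dfly# gpt add /dev/ar0                    # add a slice (ar0s1 above)
dfly# disklabel64 -r -w /dev/ar0s1 auto   # write a virgin label
dfly# disklabel64 -e /dev/ar0s1           # add an 'a' partition in the editor
dfly# newfs_hammer -L DATA /dev/ar0s1a
dfly# mount_hammer /dev/ar0s1a /DATA
dfly# hammer pfs-master /DATA/pfs/data1
dfly# hammer pfs-master /DATA/pfs/data2
dfly# mount_null /DATA/pfs/data1 /data1   # or just use the PFS symlink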






              
                 FreeBSD ZFS 


 --------------------------------------------
|      /data1         |      /data2          |
 --------------------------------------------
|     dataset1        |      dataset2        |  <-- zfs create
 --------------------------------------------
|              ZFS Storage Pool              |  <-- zpool create
 --------------------------------------------
|   GPT    |    GPT   |   GPT    |    GPT    |  <-- gpart
 --------------------------------------------
| /dev/da0 | /dev/da1 | /dev/da2 | /dev/da3  |
 --------------------------------------------




Note that the GPT layer for ZFS can be skipped. It is merely put there
to protect a user from problems with HDDs of slightly unequal sizes, as
well as to make it easier to identify the block devices (whole disks or
partitions on a disk), so one doesn't get the ambiguity caused by the
OS renumbering devices depending on which ones were found in hardware
(which turns da0 into da1). The labeling part can be accomplished with
glabel from the GEOM framework. One should use zfs create with some
kind of compression option.
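
In command form (tank, the gpt labels, and lz4 are just one possible
choice):

# gpart create -s gpt da0
# gpart add -t freebsd-zfs -l disk0 da0
  (repeat for da1, da2 and da3 with labels disk1, disk2, disk3)
# zpool create tank raidz gpt/disk0 gpt/disk1 gpt/disk2 gpt/disk3
# zfs create -o compression=lz4 -o mountpoint=/data1 tank/dataset1
# zfs create -o compression=lz4 -o mountpoint=/data2 tank/dataset2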




> Regarding LVM part,
> 
> 2017-10-07 7:34 GMT+03:00 Predrag Punosevac <punosevac72 at gmail.com>:
> > Siju George wrote:
> >
> >
> >> Hi,
> >>
> >> I have written an article about DragonFlyBSD hammer filesystem volume
> >> management. This will be particularly useful for Linux users familiar
> >> with lvm.
> >>
> >> https://bsdmag.org/download/military-grade-data-wiping-freebsd-bcwipe
> >>
> >> Thanks
> >>
> >> --Siju
> >
> > I just read the article twice and I am completely confused (probably to
> > be expected from a guy who is trained in theoretical mathematics and
> > astronomy). I would appreciate it if somebody could correct my
> > understanding of things.
> >
> > On the most basic level OSs manage disks through device nodes. For many
> > years engineers created filesystems assuming that the filesystem would
> > go directly onto the physical disk. The proliferation of commodity
> > hardware forced a re-evaluation of this idea. Namely, not all drives
> > have the same rotational speed, number of platters, etc. So, much like
> > the French approach in mathematics, somebody came up with the idea of
> > abstracting the physical devices and stopped lying to OSs about the
> > geometry of hard drives. That is why all modern HDDs use Logical Block
> > Addressing (LBA). The abstraction didn't end there.
> >
> > At the next abstraction layer we have disk slicing (Linux guys, oblivious
> > to BSD partitions, call these slices partitions). I am semi-familiar with
> > two slicing schemes. One is the old MBR scheme, still the default on
> > OpenBSD, and the other is the GPT scheme. On DragonFly one should always
> > use the GPT scheme. man gpt. For example
> >
> > dfly# gpt show /dev/da3
> >       start       size  index  contents
> >           0          1      -  PMBR
> >           1          1      -  Pri GPT header
> >           2         32      -  Pri GPT table
> >          34          1      0  GPT part - DragonFly Label64
> >          35  976773100      1  GPT part - DragonFly Label64
> >   976773135         32      -  Sec GPT table
> >   976773167          1      -  Sec GPT header
> >
> >
> > On Linux one uses parted to set up the GPT scheme while FreeBSD uses
> > gpart. Once you slice your disk (typically Linux will create three
> > slices, while historically DragonFly uses only one slice (dangerously
> > dedicated) or two slices like in the example above) we can start talking
> > about BSD partitions. On the above hard drive the data is located in the
> > slice indexed by 1. The slice with index 0 is created for historical
> > reasons so that the drive is not dangerously dedicated.
> >
> >
> > A slice can be further divided into BSD partitions (up to 16) on
> > DragonFly (Linux has no equivalent) using the command disklabel64. Let's
> > look at how the slices on the above disk are divided
> >
> > dfly# disklabel64 -r /dev/da3s0
> > disklabel64: /dev/da3s0: No such file or directory
> > dfly# disklabel64 -r /dev/da3s1
> > # /dev/da3s1:
> > #
> > # Informational fields calculated from the above
> > # All byte equivalent offsets must be aligned
> > #
> > # boot space:    1059328 bytes
> > # data space:  488385505 blocks # 476938.97 MB (500106757632 bytes)
> > #
> > # NOTE: If the partition data base looks odd it may be
> > #       physically aligned instead of slice-aligned
> > #
> > diskid: a6a0a2ef-a4d1-11e7-98d9-b9aeed3cce35
> > label:
> > boot2 data base:      0x000000001000
> > partitions data base: 0x000000103a00
> > partitions data stop: 0x007470bfc000
> > backup label:         0x007470bfc000
> > total size:           0x007470bfd800    # 476939.99 MB
> > alignment: 4096
> > display block size: 1024        # for partition display only
> >
> > 16 partitions:
> > #          size     offset    fstype   fsuuid
> >   a:  104857600          0    4.2BSD    #  102400.000MB
> >   b:  262144000  104857600    4.2BSD    #  256000.000MB
> >   a-stor_uuid: 5ad35da2-a4d2-11e7-98d9-b9aeed3cce35
> >   b-stor_uuid: 5ad35dba-a4d2-11e7-98d9-b9aeed3cce35
> >
> >
> > As you can see, the first slice contains no partitions. The second slice
> > is partitioned into two parts:
> >
> > da3s1a and da3s1b
> >
> > One could put a file system such as UFS or HAMMER on those partitions.
> > It is entirely possible to have different file systems on different
> > partitions. For example, da3s1a can be formatted to use UFS while
> > da3s1b is formatted to use HAMMER. DragonFly's cousin FreeBSD can do
> > the same. Linux can't.
> >
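For completeness, formatting those two partitions might look roughly
like this (the HAMMER label and the mount points are made up):

dfly# newfs /dev/da3s1a                    # UFS
dfly# newfs_hammer -L BACKUP /dev/da3s1b   # HAMMER
dfly# mount /dev/da3s1a /mnt/ufs
dfly# mount_hammer /dev/da3s1b /mnt/hammer
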
> > On Linux a slice such as sda1 can't be further subdivided and must
> > contain a file system like XFS. Not so fast: on Linux some people decided
> > to add another abstraction layer called the logical volume manager (LVM).
> 
> DragonFly has LVM too.
> To be exact it's a port of Linux's LVM implementation called LVM2,
> along with its kernel side subsystem Device Mapper which is an
> abstraction layer for various block device level features, like LVM,
> snapshot, software RAID, multipath, and much more.
> 
> The problem with DragonFly is that most of the (useful) DM target
> drivers are either not working or not ported.
> I'm not even sure if LVM2 is stable, let alone software RAID.
> 
> 

I am aware that DF has LVM2 but I have never read the code. I truly
appreciate your comments regarding its usability. At some point in the
distant past I foolishly asked whether LVM2 could be used to resize a
HAMMER1 partition. The answer is of course no.




> > One can
> > combine several physical volumes (not recommended, as software RAID
> > should be used for that) into an LVM volume group. The volume group is
> > further divided into logical volumes. The file system goes on top of
> > the logical volumes. A cool thing about logical volumes is that they
> > can be expanded, and sometimes shrunk, while mounted. Actually the
> > coolest thing is that one can take a snapshot of a logical volume,
> > although I have not seen many Linux users do that in practice. In
> > terms of file systems the only real choice is XFS. Historically LVM
> > didn't provide any options for redundancy or parity, thus software
> > RAID (mdadm) or hardware RAID was needed. It is perfectly OK to
> > initialize a RAID /dev/md0 as a volume group, partition it into a few
> > logical volumes, and then put XFS on top of it.
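
As an aside, growing a mounted XFS logical volume is a two step affair,
something like the following (made-up names and sizes; note that XFS
can only be grown, never shrunk):

# lvextend -L +50G /dev/datavg/data1
# xfs_growfs /data1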
> 
> LVM is usually a different thing from RAID.
> In case of LVM in Linux (which is DM based LVM2), you can stack DM
> devices to synthesize these (or other block level features provided by
> DM infrastructure).
> You can stack DM devices on DragonFly too, but there aren't that many
> good and/or stable DM target drivers worth stacking...

Red Hat now recommends RAIDing with LVM over MDRAID. I am not sure how
mature and featureful LVM RAID is. True storage experts could pitch in
on this question.
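
If I understand it correctly, the LVM RAID approach skips mdadm and lets
LVM itself create the mirror, roughly along these lines on Linux
(made-up names; I have no idea whether the DM raid target this needs was
ever ported to DF):

# pvcreate /dev/sda1 /dev/sdb1
# vgcreate datavg /dev/sda1 /dev/sdb1
# lvcreate --type raid1 -m 1 -L 100G -n data1 datavg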


> 
> 
> > Going back to cousin FreeBSD now. Unlike Linux, FreeBSD uses GEOM instead
> > of LVM. My understanding is that GEOM combines LVM, RAID, and even
> > encryption (I am not sure whether GELI is part of GEOM) into one
> > framework. GEOM used in the LVM role also allows for UFS journaling.
> 
> The reason DragonFly doesn't have GEOM is because that's the way they
> wanted it to be at least in 2010.
> https://www.dragonflybsd.org/mailarchive/users/2010-07/msg00085.html
> 
> 

I have read that discussion many times in the past, but I just read it
one more time. It feels, after all these years, that Matt's position
has been validated. However, I am not sure about the

gpt+disklabel64   vs  gpart

part of that discussion. I remember the first time I initialized a
storage HDD, wondering several times why I need to create a partition
inside a slice if I am going to use the entire slice for just one
partition which is going to hold the data. I do understand Matt's point
about disklabel64, though.



> > So where are HAMMER1 and ZFS in all this story?
> >
> > On one hand ZFS makes hardware/software RAID obsolete, as it is a volume
> > manager (in the sense of RAID). It is also a volume manager in the sense
> > of LVM, with caveats (ZFS pools on FreeBSD IIRC can be grown only by
> > off-lining physical drives and replacing them with larger ones before
> > re-silvering). However, ZFS brings a lot of new goodies: COW, checksums,
> > self-healing, compression, snapshots (the ones that people actually use,
> > unlike LVM's), and remote replication. It is possible to use a ZFS volume
> > as an iSCSI target and to boot from a ZFS pool (all my file servers do
> > that and even use beadm). It is the best thing since sliced bread, if you
> > are willing to reimplement a large part of the Solaris kernel (which the
> > FreeBSD people did). ZFS pools, which can be thought of as LVM volume
> > groups, are divided into datasets, which are in some sense equivalent to
> > LVM logical volumes.
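
To make those goodies concrete, the day-to-day operations I have in mind
look roughly like this (pool, dataset and host names are made up):

# zfs snapshot tank/dataset1@2017-10-07
# zfs send tank/dataset1@2017-10-07 | ssh backuphost zfs receive backup/dataset1
# zpool set autoexpand=on tank
# zpool replace tank da1 da5     # swap in a larger disk, wait for resilver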
> >
> >
> > HAMMER1 is unable to manage volumes in the sense of RAID. It requires
> > software or hardware RAID for high availability and redundancy.
> 
> This was actually the design of HAMMER1, as mentioned in section 11 of
> the paper below, from 2008.
> https://www.dragonflybsd.org/hammer/hammer.pdf
> 

I am aware of the document. I wish somebody who has read the natacontrol
code could chime in on the state of software RAID on DF. Based upon this
document

https://www.dragonflybsd.org/docs/docs/howtos/howtosoftwareraid/

that thing has probably not been touched since 2005 (like Jails). That
further amplifies the need for regular testing of hardware RAID cards on
DF.

> 
> > To my
> > knowledge software RAID level 1 is achieved on DragonFly via the old
> > FreeBSD 4.8 framework natacontrol. I am not sure if that thing should be
> > used in production any longer. Anyhow, hardware RAID seems the way to go.
> > HAMMER1 can't be expanded, so unlike ZFS it is much more of a pure file
> > system. However, as a file system it is second to none: COW, checksums,
> > healing via history, fine-grained journaling, snapshots, etc. The HAMMER1
> > equivalent of datasets are pseudo file systems (PFS), which are very
> > cheap (for example, the home directory of each user on a file server
> > could be a PFS which could be destroyed in the case of a policy
> > violation). HAMMER1 comes with built-in backup (mirror-stream).
> > Unfortunately slave PFSs are read-only. DF can't boot from HAMMER1.
> > Anyhow, the PFSs do indeed to some extent look like Linux logical
> > volumes, but they are so much more advanced. HAMMER1 is NFS and Samba
> > aware (NAS) but I am not sure about DragonFly's SAN (iSCSI) capabilities.
> >

I was wrong about iSCSI on DragonFly. Of course it has it:

https://www.dragonflybsd.org/cgi/web-man?command=iscsi.conf&section=5

It would be interesting to hear from somebody who is actually using it
and who has read the actual code.


> >
> > I would really appreciate it if people could point out mistakes in the
> > above write-up and give me some references so that I can actually learn
> > something.
> >
> >
> > I am finishing this post by saying that I am in suspense awaiting the
> > release of DF 5.0 and the preview of HAMMER2.
> >
> > Cheers,
> > Predrag


