pfs-delete seems to hang

Aleksej Lebedev root at zta.lk
Wed Nov 11 08:48:33 PST 2020


I managed to reproduce the problem:

    1. Created a fresh hammer2 filesystem with newfs_hammer2 on my home server.
    2. Created a dedicated pfs on it (hammer2 -s / pfs-create AR-HOME) and mounted it under /backup/ar/home.
    3. Ran cpdup to copy around ~0.8 TB from my remote server:
        # cpdup ar.zta.lk:/snapshot/ROOT-2020-11-10-19-49-25.UTC/home /backup/ar/home
        while periodically running hammer2 bulkfree /
    4. When the size of the pfs reached ~0.57 TB, I started to see errors in dmesg:

Nov 11 17:27:40 sa kernel: chain 0000006622a7200a.01 (Inode) meth=32 CHECK FAIL (flags=00144002, bref/data f78823c269c295ed/d1e4c998a0cf8c7c)
Nov 11 17:27:40 sa kernel: Resides at/in inode 444566
Nov 11 17:27:40 sa kernel: In pfs UNKNOWN on device serno/WCJ35GE0.s1d
Nov 11 17:27:40 sa kernel: chain 0000006622a7200a.01 (Inode) meth=32 CHECK FAIL (flags=00144002, bref/data f78823c269c295ed/d1e4c998a0cf8c7c)
Nov 11 17:27:40 sa kernel: Resides at/in inode 444566
Nov 11 17:27:40 sa kernel: In pfs UNKNOWN on device serno/WCJ35GE0.s1d
Nov 11 17:27:40 sa kernel: chain 000000d98584800e.02 (Indirect-Block) meth=30 CHECK FAIL (flags=00144002, bref/data 8699a945505b4181/1a70a68a06527f63)
Nov 11 17:27:40 sa kernel: Resides at/in inode 444567
Nov 11 17:27:40 sa kernel: In pfs UNKNOWN on device serno/WCJ35GE0.s1d
Nov 11 17:27:40 sa kernel: chain 000000d98584800e.02 (Indirect-Block) meth=30 CHECK FAIL (flags=00144002, bref/data 8699a945505b4181/1a70a68a06527f63)
Nov 11 17:27:40 sa kernel: Resides at/in inode 444567
Nov 11 17:27:40 sa kernel: In pfs UNKNOWN on device serno/WCJ35GE0.s1d
Nov 11 17:27:40 sa kernel: chain 000000d98584c00e.02 (Indirect-Block) meth=30 CHECK FAIL (flags=00144002, bref/data d65716aaa2f0f748/4994d79a156ea6f4)
Nov 11 17:27:40 sa kernel: Resides at/in inode 444568

I don't see any errors from the da driver. Also, smartctl shows Current_Pending_Sector = 0 and Reallocated_Sector_Ct = 0.
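
For reference, those attributes can be checked with something like this
(assuming the disk is da0; adjust the device name as needed):

    # smartctl -A /dev/da0 | egrep 'Current_Pending_Sector|Reallocated_Sector_Ct'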

Matthew Dillon, if you're interested, I can give you access to my data so that you can reproduce the problem on your machine.
The only catch is that it requires copying >0.5 TB. I am still not sure it's 100% reproducible, but I got these errors 3 times in a row, though at different stages.
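
For anyone trying to reproduce this, the periodic bulkfree can be
approximated with a loop along these lines in a separate terminal (the
interval is arbitrary):

    # while sleep 3600; do hammer2 bulkfree /; done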

--
Aleksej Lebedev

On Sun, Nov 8, 2020, at 23:45, Aleksej Lebedev wrote:
> Hi, users!
> 
> I am experiencing multiple problems with hammer2. I already wrote about 
> having some corrupted directory entries a few weeks ago. I thought it 
> was either bad media or maybe an incorrect shutdown (plus me being very 
> unlucky). Later I noticed that uname on my machine reports 
> "5.8-RELEASE" even though I am sure I rebuilt and reinstalled 
> kernel+world after the tag v5.8.2.
> 
> This is suspicious because there were two commits 
> (c41d6deadc7a5f0c2fb8cc6f2b8ad7db230db467 and 
> f4a0b284eb39609e24aadc7a61905d37d319bed6) with the message that starts 
> with "hammer2 - Fix serious de-duplication bug and a few other things.."
> 
> To be completely sure, I pulled the most recent sources from the 
> DragonFly_RELEASE_5_8 branch and rebuilt and reinstalled on all my 
> machines.
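> 
> A quick way to double-check that those two commits actually made it
> into the tree that was built (assuming /usr/src is the checkout that
> was used) is something like:
> 
>     $ cd /usr/src
>     $ git merge-base --is-ancestor c41d6deadc7a5f0c2fb8cc6f2b8ad7db230db467 HEAD && echo present
>     $ git merge-base --is-ancestor f4a0b284eb39609e24aadc7a61905d37d319bed6 HEAD && echo present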
> 
> But uname still reports the following:
> DragonFly ar.zta.lk 5.8-RELEASE DragonFly 5.8-RELEASE #1: Wed Oct 21 
> 15:34:15 CEST 2020     
> root at ar.zta.lk:/usr/obj/usr/src/sys/X86_64_GENERIC  x86_64
> 
> Is that OK? I vaguely remember that it used to be different: I think I 
> saw a commit hash in the uname -a output. Maybe I'm mixing something 
> up. Anyway, the date of the kernel shows that it's new.
> 
> After that I re-formatted the 5T volume on my home server that I use 
> for backups and backed up all my data as usual: with hammer2 -s / 
> pfs-snapshot <label> (on the remote server) and cpdup.
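> 
> Concretely the backup boils down to roughly this, with the snapshot
> PFS mounted under /snapshot on the remote side before copying (labels
> and paths are from my setup):
> 
>     remote# hammer2 -s / pfs-snapshot ROOT-2020-11-10-19-49-25.UTC
>     local#  cpdup ar.zta.lk:/snapshot/ROOT-2020-11-10-19-49-25.UTC/home /backup/ar/home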
> 
> At some point I started to notice the same kind of errors on my remote 
> server, which has a hardware RAID and ECC memory and didn't reboot.
> 
> I was planning to re-format and re-populate the volume from scratch on 
> the server.
> 
> I am writing right now because I accidentally executed the following 
> command twice. The first invocation reported that it deleted the 
> snapshot, but the second one hung:
> 
> a at ar:~$ doas hammer2 -s / pfs-delete SA-ROOT-2020-11-03-16-21-03.UTC
> 
> On a separate tmux window ps shows the following:
> 
> a at ar:~$ ps ax | grep hammer2
>   6363 ??  I4s      0:04.37 hammer2: hammer2 autoconn_thread (hammer2)
> 949481  3  S3+      0:00.00 grep hammer2
> 948504  5  D0+      0:00.01 hammer2 -s / pfs-delete 
> SA-ROOT-2020-11-03-16-21-03.UTC
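> 
> To see what the stuck process is sleeping on, something like the
> following should print the wait channel (assuming ps here supports the
> wchan keyword, as on the other BSDs):
> 
>     $ ps -o pid,state,wchan,command -p 948504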
> 
> The CPU load is high (rising and falling):
> 
> a at ar:~$ uptime
> 10:33PM  up 18 days,  7:49, 1 user, load averages: 5.61, 3.46, 1.84
> a at ar:~$ uptime
> 10:35PM  up 18 days,  7:51, 1 user, load averages: 5.68, 4.28, 2.38
> a at ar:~$ uptime
> 10:39PM  up 18 days,  7:55, 1 user, load averages: 7.33, 6.27, 3.77
> 
> Meanwhile, top doesn't show any process that would be using the CPU.
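> 
> Running top with -S (to include system processes/kernel threads in the
> display) would presumably show where that load is going:
> 
>     $ top -S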
> 
> The last few messages in /var/log/messages:
> 
> Nov  8 22:28:06 ar kernel: FOUND PFS ROOT-2020-10-20-19-27-34.UTC CLINDEX 0
> Nov  8 22:28:39 ar kernel: FOUND PFS SA-ROOT-2020-11-03-16-21-03.UTC CLINDEX 0
> Nov  8 22:29:06 ar kernel: FOUND PFS SA-ROOT-2020-11-03-16-21-03.UTC CLINDEX 0
> 
> The server started to respond slower and slower, and finally I lost 
> it. I ordered a "soft reboot" (ctrl+alt+del), which didn't help, and 
> then a "hardware reset". It is now alive again.
> 
> All in all, there seem to be problems with hammer2. To eliminate the 
> last possibility of bad media or human error (there is a tiny chance I 
> indeed had kernel 5.8 at the moment I formatted the disk on my remote 
> server the first time), I will make sure I have a good backup and then 
> try to reformat and repopulate my server (the one with RAID and ECC). 
> This will probably take another week (I have around 3T of data and my 
> internet connection is not the fastest - I live in the city center of 
> Haarlem, the Netherlands). I will report how it went when I'm done.
> 
> --
> Aleksej Lebedev
>

