pfs-delete seems to hang
Aleksej Lebedev
root at zta.lk
Wed Nov 11 08:48:33 PST 2020
I managed to reproduce the problem:
1. Created a fresh hammer2 filesystem with newfs_hammer2 on my home server.
2. Created a dedicated pfs on it (hammer2 -s / pfs-create AR-HOME) and mounted it under /backup/ar/home.
3. Ran cpdup to copy around 0.8 TB from my remote server:
# cpdup ar.zta.lk:/snapshot/ROOT-2020-11-10-19-49-25.UTC/home /backup/ar/home
   while periodically running "hammer2 bulkfree /".
4. When the size of the pfs reached ~0.57 TB I started to see errors in dmesg (the steps above are condensed into shell commands after the log excerpt):
Nov 11 17:27:40 sa kernel: chain 0000006622a7200a.01 (Inode) meth=32 CHECK FAIL (flags=00144002, bref/data f78823c269c295ed/d1e4c998a0cf8c7c)
Nov 11 17:27:40 sa kernel: Resides at/in inode 444566
Nov 11 17:27:40 sa kernel: In pfs UNKNOWN on device serno/WCJ35GE0.s1d
Nov 11 17:27:40 sa kernel: chain 0000006622a7200a.01 (Inode) meth=32 CHECK FAIL (flags=00144002, bref/data f78823c269c295ed/d1e4c998a0cf8c7c)
Nov 11 17:27:40 sa kernel: Resides at/in inode 444566
Nov 11 17:27:40 sa kernel: In pfs UNKNOWN on device serno/WCJ35GE0.s1d
Nov 11 17:27:40 sa kernel: chain 000000d98584800e.02 (Indirect-Block) meth=30 CHECK FAIL (flags=00144002, bref/data 8699a945505b4181/1a70a68a06527f63)
Nov 11 17:27:40 sa kernel: Resides at/in inode 444567
Nov 11 17:27:40 sa kernel: In pfs UNKNOWN on device serno/WCJ35GE0.s1d
Nov 11 17:27:40 sa kernel: chain 000000d98584800e.02 (Indirect-Block) meth=30 CHECK FAIL (flags=00144002, bref/data 8699a945505b4181/1a70a68a06527f63)
Nov 11 17:27:40 sa kernel: Resides at/in inode 444567
Nov 11 17:27:40 sa kernel: In pfs UNKNOWN on device serno/WCJ35GE0.s1d
Nov 11 17:27:40 sa kernel: chain 000000d98584c00e.02 (Indirect-Block) meth=30 CHECK FAIL (flags=00144002, bref/data d65716aaa2f0f748/4994d79a156ea6f4)
Nov 11 17:27:40 sa kernel: Resides at/in inode 444568
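For reference, here are the steps above condensed into shell commands.
The device path is the one from the kernel messages above (I assume it
is the backup volume) and the mount invocation is from memory, so treat
both as approximate:

# newfs_hammer2 /dev/serno/WCJ35GE0.s1d
# hammer2 -s / pfs-create AR-HOME
# mount_hammer2 /dev/serno/WCJ35GE0.s1d@AR-HOME /backup/ar/home
# cpdup ar.zta.lk:/snapshot/ROOT-2020-11-10-19-49-25.UTC/home /backup/ar/home
# hammer2 bulkfree /     (run periodically while cpdup is copying)
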
I don't see any errors from the da driver. Also, smartctl shows Current_Pending_Sector = 0 and Reallocated_Sector_Ct = 0.
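A check along these lines (the device name here is only an example):

# smartctl -A /dev/da0 | egrep 'Reallocated_Sector_Ct|Current_Pending_Sector'
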
Matthew Dillon, if you're interested, I can give you access to my data so that you can reproduce the problem on your machine.
The only thing is that it requires copying >0.5 TB. I am still not sure it's 100% reproducible, but I got these errors three times in a row, albeit at different stages.
--
Aleksej Lebedev
On Sun, Nov 8, 2020, at 23:45, Aleksej Lebedev wrote:
> Hi, users!
>
> I am experiencing multiple problems with hammer2. I already wrote about
> having some corrupted directory entries a few weeks ago. I thought it
> was either bad media or maybe an incorrect shutdown (plus me being very
> unlucky). Later I noticed that uname on my machine reports
> "5.8-RELEASE" even though I am sure I rebuilt and reinstalled
> kernel+world after the tag v5.8.2.
>
> This is suspicious because there were two commits
> (c41d6deadc7a5f0c2fb8cc6f2b8ad7db230db467 and
> f4a0b284eb39609e24aadc7a61905d37d319bed6) with the message that starts
> with "hammer2 - Fix serious de-duplication bug and a few other things.."
>
> To be completely sure I pulled the most recent version of sources in
> the DragonFly_RELEASE_5_8 branch, rebuilt and reinstalled on all my
> machines.
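>
> To double-check that both fixes are actually ancestors of what I am
> building, something like this (run from /usr/src) should do:
>
>   $ git merge-base --is-ancestor c41d6deadc7a5f0c2fb8cc6f2b8ad7db230db467 HEAD && echo present
>   $ git merge-base --is-ancestor f4a0b284eb39609e24aadc7a61905d37d319bed6 HEAD && echo present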
>
> But uname still reports the following:
> DragonFly ar.zta.lk 5.8-RELEASE DragonFly 5.8-RELEASE #1: Wed Oct 21
> 15:34:15 CEST 2020
> root at ar.zta.lk:/usr/obj/usr/src/sys/X86_64_GENERIC x86_64
>
> Is that OK? I vaguely remember that it used to be different: I think I
> saw a commit hash in the uname -a output. Maybe I'm mixing something up.
> Anyway, the kernel's build date shows that it's new.
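>
> (One way to pin this down is to compare what git says in the source
> tree with what the kernel reports, e.g.:
>
>   $ cd /usr/src && git describe --tags
>   $ uname -v
>
> if the tree really is on DragonFly_RELEASE_5_8 past v5.8.2, git
> describe should say so.)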
>
> After that I re-formatted the 5 TB volume on my home server that I use
> for backups and backed up all my data as usual: with hammer2 -s /
> pfs-snapshot <label> (on the remote server) and cpdup.
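>
> Concretely, one backup round looks roughly like this (the mount step
> and the exact names and paths here are illustrative):
>
>   remote$ doas hammer2 -s / pfs-snapshot ROOT-<timestamp>
>   remote$ mount_hammer2 <device>@ROOT-<timestamp> /snapshot/ROOT-<timestamp>
>   home# cpdup ar.zta.lk:/snapshot/ROOT-<timestamp>/home /backup/ar/home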
>
> At some point I started to notice the same kind of errors on my remote
> server, which has a hardware RAID and ECC memory and has not rebooted.
>
> I was planning to re-format and re-populate the volume from scratch on
> the server.
>
> I am writing right now because I accidentally executed the following
> command twice. The first invocation reported that it deleted the
> snapshot, but the second one hung:
>
> a at ar:~$ doas hammer2 -s / pfs-delete SA-ROOT-2020-11-03-16-21-03.UTC
>
> In a separate tmux window, ps shows the following:
>
> a at ar:~$ ps ax | grep hammer2
> 6363 ?? I4s 0:04.37 hammer2: hammer2 autoconn_thread (hammer2)
> 949481 3 S3+ 0:00.00 grep hammer2
> 948504 5 D0+ 0:00.01 hammer2 -s / pfs-delete
> SA-ROOT-2020-11-03-16-21-03.UTC
>
> The CPU load is high (rising and falling):
>
> a at ar:~$ uptime
> 10:33PM up 18 days, 7:49, 1 user, load averages: 5.61, 3.46, 1.84
> a at ar:~$ uptime
> 10:35PM up 18 days, 7:51, 1 user, load averages: 5.68, 4.28, 2.38
> a at ar:~$ uptime
> 10:39PM up 18 days, 7:55, 1 user, load averages: 7.33, 6.27, 3.77
>
> Meanwhile, top doesn't show any process that would be using the CPU.
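>
> (To see what the stuck process is actually waiting on, the long ps
> listing should help, e.g.:
>
>   $ ps axl | grep pfs-delete
>
> and look at the wait channel (WCHAN) column for the process in
> state D.)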
>
> The last few messages in /var/log/messages:
>
> Nov 8 22:28:06 ar kernel: FOUND PFS ROOT-2020-10-20-19-27-34.UTC CLINDEX 0
> Nov 8 22:28:39 ar kernel: FOUND PFS SA-ROOT-2020-11-03-16-21-03.UTC CLINDEX 0
> Nov 8 22:29:06 ar kernel: FOUND PFS SA-ROOT-2020-11-03-16-21-03.UTC CLINDEX 0
>
> The server started to respond more and more slowly and finally I lost
> it. I ordered a "soft reboot" (ctrl+alt+del), which didn't help, and
> then a "hardware reset". It is now alive.
>
> All in all, there seem to be problems with hammer2. To eliminate the
> last possibility of bad media or human error (there is a tiny chance I
> indeed had a 5.8 kernel at the moment I formatted the disk on my remote
> server the first time), I will make sure I have a good backup and then
> try to reformat and repopulate my server (the one with RAID and ECC).
> This will probably take another week (I have around 3 TB of data and my
> internet connection is not the fastest - I live in the city center of
> Haarlem, the Netherlands). I will report how it went when I'm done.
>
> --
> Aleksej Lebedev
>