pfs-delete seems to hang

Sun Nov 8 14:45:17 PST 2020

Hi, users!

I am experiencing multiple problems with hammer2. I already wrote about having some corrupted directory entries a few weeks ago. I thought it was either bad media or maybe an incorrect shutdown (+ me being very unlucky). Later I noticed that uname on my machine reports "5.8-RELEASE" even though I am sure I rebuilt and reinstalled kernel+world after the tag v5.8.2.

This is suspicious because there were two commits (c41d6deadc7a5f0c2fb8cc6f2b8ad7db230db467 and f4a0b284eb39609e24aadc7a61905d37d319bed6) whose message starts with "hammer2 - Fix serious de-duplication bug and a few other things.."

To be completely sure, I pulled the most recent version of the sources in the DragonFly_RELEASE_5_8 branch, then rebuilt and reinstalled on all my machines.

But uname still reports the following:
DragonFly ar.zta.lk 5.8-RELEASE DragonFly 5.8-RELEASE #1: Wed Oct 21 15:34:15 CEST 2020     root@ar.zta.lk:/usr/obj/usr/src/sys/X86_64_GENERIC  x86_64

Is that OK? I vaguely remember that it used to be different before: I think I saw a commit hash in the uname -a message. Maybe I'm mixing something up. Anyway, the date of the kernel shows that it's new.

After that I re-formatted the 5T volume on my home server that I am using for backups and backed up all my data as usual: with hammer2 -s / pfs-snapshot <label> (on the remote server) and cpdup.

At some point I started to notice the same kind of errors () on my remote server, which has hardware RAID and ECC memory and had not rebooted.

I was planning to re-format and re-populate the volume from scratch on the server.

I am writing right now because I accidentally executed the following command twice. The first run reported that it deleted the snapshot, but the second one hung:

a@ar:~$ doas hammer2 -s / pfs-delete SA-ROOT-2020-11-03-16-21-03.UTC

In a separate tmux window, ps shows the following:

a@ar:~$ ps ax | grep hammer2
  6363 ??  I4s      0:04.37 hammer2: hammer2 autoconn_thread (hammer2)
949481  3  S3+      0:00.00 grep hammer2
948504  5  D0+      0:00.01 hammer2 -s / pfs-delete SA-ROOT-2020-11-03-16-21-03.UTC
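
For what it's worth, the D at the start of the STAT column on the pfs-delete row is how ps(1) marks a process in uninterruptible (usually disk) wait. Below is a minimal Python sketch for picking such rows out of saved ps ax output. It assumes the five-column layout shown above (PID, TT, STAT, TIME, COMMAND); the function name is made up for illustration.

```python
def uninterruptible(ps_output):
    """Return (pid, command) for every row whose STAT field starts with 'D'."""
    hits = []
    for line in ps_output.splitlines():
        # Assumed column layout: PID TT STAT TIME COMMAND
        fields = line.strip().split(None, 4)
        if len(fields) < 5 or not fields[0].isdigit():
            continue  # skip a header line or a malformed row
        pid, _tty, stat, _time, command = fields
        if stat.startswith("D"):
            hits.append((int(pid), command))
    return hits


# The three rows from the ps output in this message.
SAMPLE = """\
  6363 ??  I4s      0:04.37 hammer2: hammer2 autoconn_thread (hammer2)
949481  3  S3+      0:00.00 grep hammer2
948504  5  D0+      0:00.01 hammer2 -s / pfs-delete SA-ROOT-2020-11-03-16-21-03.UTC
"""

for pid, command in uninterruptible(SAMPLE):
    print(pid, command)
```

On the sample above this selects only the second pfs-delete invocation (pid 948504).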

The cpu load is high (rising and falling):

a@ar:~$ uptime
10:33PM  up 18 days,  7:49, 1 user, load averages: 5.61, 3.46, 1.84
a@ar:~$ uptime
10:35PM  up 18 days,  7:51, 1 user, load averages: 5.68, 4.28, 2.38
a@ar:~$ uptime
10:39PM  up 18 days,  7:55, 1 user, load averages: 7.33, 6.27, 3.77

Meanwhile, top doesn't show any process that is using the CPU.

The last few messages in /var/log/messages:

Nov  8 22:28:06 ar kernel: FOUND PFS ROOT-2020-10-20-19-27-34.UTC CLINDEX 0
Nov  8 22:28:39 ar kernel: FOUND PFS SA-ROOT-2020-11-03-16-21-03.UTC CLINDEX 0
Nov  8 22:29:06 ar kernel: FOUND PFS SA-ROOT-2020-11-03-16-21-03.UTC CLINDEX 0
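
The label SA-ROOT-2020-11-03-16-21-03.UTC appears twice here, which may correspond to the two pfs-delete invocations, but that is only a guess: I don't know what exactly triggers this kernel message. A small Python sketch for counting these messages per label in a larger log follows. The regular expression is derived only from the three lines above, and the function name is hypothetical.

```python
import re
from collections import Counter

# Pattern inferred from the log lines quoted in this message.
FOUND_PFS = re.compile(r"kernel: FOUND PFS (\S+) CLINDEX (\d+)")


def count_found_pfs(log_text):
    """Count 'FOUND PFS' kernel messages per PFS label."""
    counts = Counter()
    for line in log_text.splitlines():
        match = FOUND_PFS.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


LOG = """\
Nov  8 22:28:06 ar kernel: FOUND PFS ROOT-2020-10-20-19-27-34.UTC CLINDEX 0
Nov  8 22:28:39 ar kernel: FOUND PFS SA-ROOT-2020-11-03-16-21-03.UTC CLINDEX 0
Nov  8 22:29:06 ar kernel: FOUND PFS SA-ROOT-2020-11-03-16-21-03.UTC CLINDEX 0
"""

for label, n in count_found_pfs(LOG).most_common():
    print(n, label)
```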

The server started to respond slower and slower, and finally I lost it. I requested a "soft reboot" (ctrl+alt+del), which didn't help, and then a "hardware reset". It is now alive.

All in all, there seem to be problems with hammer2. In order to eliminate the last possibility of bad media or human error (there is a tiny chance I was indeed running kernel 5.8 at the moment I formatted the disk on my remote server the first time), I will make sure I have a good backup and then try to reformat and repopulate my server (the one with RAID and ECC). This will probably take another week (I have around 3T of data while my internet connection is not the fastest - I live in the city center of Haarlem, the Netherlands). I will report how it went when I'm done.

--
Aleksej Lebedev