Hammer errors.

Thu Jul 1 11:45:47 PDT 2021

Thanks, Matt, this is very helpful. I pulled some metadata dumps and
started poking at them. But in the interest of getting the machine back
online as soon as possible, I popped in another NVMe (a brand new part),
reinstalled Dragonfly (from scratch) and restored user data (which I
managed to grab with tar from the old NVMe -- no errors there,
fortunately). My intent was to poke at the funky NVMe on another machine.

Interestingly, the system started exhibiting the same errors sometime
overnight, with the new NVMe and a completely rebuilt filesystem.

Given that the kernel hasn't been upgraded since 6.0 came out, and
two separate NVMe parts started showing the exact same problem within ~a
day, I'm guessing something other than a bad storage device is
afoot. Potential culprits could be bad RAM (dropping a bit could certainly
manifest as a bad checksum), or perhaps a signal integrity (SI) issue with
the NVMe interface
in the machine. I'm going to poke at it a bit more.
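
To illustrate the bit-flip point, here is a minimal sketch. It uses CRC32
from Python's zlib purely as a stand-in (the check that failed in the log
quoted below is xxhash64, which isn't in the Python standard library), and
the block contents and bit position are arbitrary choices:

```python
import zlib

def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with exactly one bit inverted."""
    buf = bytearray(data)
    buf[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(buf)

# A 64 KiB stand-in for a data block as it should be.
block = bytes(range(256)) * 256

# The same block after a single bit is corrupted in memory.
damaged = flip_bit(block, 12345)

# The two checksums differ, so a verify-on-read reports a failure
# even though the storage device returned what it was given.
print(hex(zlib.crc32(block)), hex(zlib.crc32(damaged)))
```

The point is only that one flipped bit anywhere in a block is enough to
fail the check, regardless of where the flip happened.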

Not exactly how I envisioned spending my Thursday, but hey.

        - Dan C.

(PS: My "find bad files" trick was to use `ripgrep`: `rg laskdfjasdof891m
/` as root shows IO errors on a number of files -- the search pattern to rg
doesn't matter, I just typed some random gibberish that's unlikely to show
up in any real file.)
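
The same idea works without ripgrep: read every file and note which reads
fail. A rough Python sketch follows; `find_unreadable` is a name made up
for illustration, and unlike the tar invocation quoted below it makes no
attempt to stay on one filesystem:

```python
import os
import stat

def find_unreadable(root, chunk_size=1 << 20):
    """Read every regular file under root; return (path, reason) for failures."""
    bad = []

    def note(err):
        # os.walk reports directories it cannot list through this callback.
        bad.append((err.filename, err.strerror))

    for dirpath, _dirnames, filenames in os.walk(root, onerror=note):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if not stat.S_ISREG(os.lstat(path).st_mode):
                    continue  # skip symlinks, devices, FIFOs, sockets
                with open(path, "rb") as f:
                    # Read to EOF and discard the data; only errors matter.
                    while f.read(chunk_size):
                        pass
            except OSError as err:
                bad.append((path, err.strerror))
    return bad
```

Run as root, something like `for path, why in find_unreadable("/"):
print(path, why)` should list the files whose reads fail, and it keeps
going after the first error.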

On Wed, Jun 30, 2021 at 12:31 PM Matthew Dillon <dillon at backplane.com>
wrote:

> It looks like several different blocks failed a CRC test in your logs.  It
> would make sense to try to track down exactly where.  If you want to dive
> the filesystem meta-data you can dump it with full CRC tests using:
>
> hammer2 -vv show /dev/serno/S59ANMFNB34055E-1.s1d > (save to a file not on the filesystem)
>
> And then look for 'failed)' lines in the output and track the inodes back
> to see which files are affected.  It's a bit roundabout and you have to get
> familiar with the meta-data format, but that gives the most comprehensive
> results.   The output file is typically a few gigabytes (depends how big
> the filesystem is).   For example, I wound up with a single data block
> error in a mail file on one of my systems, easily rectified by copying-away
> the file and then deleting it.  I usually dump the output to a file and
> then run less on it, then search for failed crc checks.
>
>                   data.106     000000051676000f 00000000206a0000/16
>                                vol=0 mir=0000000000149dc6 mod=0000000002572acb lfcnt=0
>                                (xxhash64 32:b65e740a8f5ce753/799af250bfaf8651 failed)
>
> A 'quick' way to try to locate problems is to use tar, something like
> this.  However, tar exits when it encounters the first error so that won't
> find everything, and if the problem is in a directory block that can
> complicate matters.
>
>     tar --one-file-system -cvf /dev/null /
>
> -Matt
>
>
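
For the saved `hammer2 -vv show` output described above, searching for
'failed)' lines can also be scripted instead of done by hand in less. This
is a minimal sketch that assumes nothing about the dump format beyond
'failed)' appearing on the check line; `failed_checks` is a made-up helper
name, and the sample lines are the ones from the quoted message:

```python
from collections import deque

def failed_checks(lines, context=3):
    """Yield (line_number, preceding_lines, line) for lines containing 'failed)'."""
    recent = deque(maxlen=context)  # the last few lines, for context
    for lineno, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if "failed)" in line:
            yield lineno, list(recent), line
        recent.append(line)

sample = [
    "data.106     000000051676000f 00000000206a0000/16",
    "vol=0 mir=0000000000149dc6 mod=0000000002572acb lfcnt=0",
    "(xxhash64 32:b65e740a8f5ce753/799af250bfaf8651 failed)",
]
hits = list(failed_checks(sample))
for lineno, before, line in hits:
    print(lineno, line.strip())
    for prev in before:
        print("    ", prev.strip())
```

Because it reads one line at a time, the same function can be handed an
open file object for a multi-gigabyte dump without loading it into memory.
Tracking each hit back to an inode and a file name is still a manual step.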