crossd at gmail.com
Thu Jul 1 11:45:47 PDT 2021
Thanks, Matt, this is very helpful. I pulled some metadata dumps and
started poking at them. But in the interest of getting the machine back
online as soon as possible, I popped in another NVMe (a brand new part),
reinstalled Dragonfly (from scratch) and restored user data (which I
managed to grab with tar from the old NVMe -- no errors there,
fortunately). My intent was to poke at the funky NVMe on another machine.
Interestingly, the system started exhibiting the same errors sometime
overnight, with the new NVMe and a completely rebuilt filesystem.
Given that the kernel hasn't been upgraded since 6.0 came out, and
two separate NVMe parts started showing the exact same problem within ~a
day, I'm guessing something other than a bad storage device is
afoot. Potential culprits could be bad RAM (dropping a bit could certainly
manifest as a bad checksum), or perhaps an SI issue with the NVMe interface
in the machine. I'm going to poke at it a bit more.
Not exactly how I envisioned spending my Thursday, but hey.
- Dan C.
(PS: My "find bad files" trick was to use `ripgrep`:
`rg laskdfjasdof891m /` as root shows IO errors on a number of files -- the
search pattern to rg doesn't matter, I just typed some random gibberish
that's unlikely to show up in any real file.)
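
The same idea works without ripgrep. Below is a rough Python sketch that reads every regular file under a root directory and records the ones whose read fails, carrying on after each error instead of stopping at the first one. The function name `find_unreadable` and the 1 MiB chunk size are illustrative choices, not part of any tool mentioned in this thread. It assumes a checksum failure surfaces to userland as an I/O error on read, which is what the rg output above suggests.

```python
import os
import stat


def find_unreadable(root, chunk_size=1 << 20):
    """Read every regular file under root; return [(path, error)] for failures."""
    bad = []
    root_dev = os.lstat(root).st_dev

    def on_walk_error(err):
        # A directory that cannot be listed is recorded, not fatal.
        bad.append((err.filename, err))

    for dirpath, dirnames, filenames in os.walk(root, onerror=on_walk_error):
        # Stay on the starting filesystem by pruning other mount points.
        kept = []
        for d in dirnames:
            sub = os.path.join(dirpath, d)
            try:
                if os.lstat(sub).st_dev == root_dev:
                    kept.append(d)
            except OSError as err:
                bad.append((sub, err))
        dirnames[:] = kept

        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # Skip symlinks, devices, FIFOs and sockets.
                if not stat.S_ISREG(os.lstat(path).st_mode):
                    continue
                with open(path, "rb") as f:
                    while f.read(chunk_size):
                        pass  # Data is discarded; only read errors matter.
            except OSError as err:
                bad.append((path, err))
    return bad


# Example use (run as root so permission errors don't drown out IO errors):
#     for path, err in find_unreadable("/"):
#         print(f"{path}: {err}")
```

Pruning directories by device number keeps the walk off other mounted filesystems, much like tar's --one-file-system.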
On Wed, Jun 30, 2021 at 12:31 PM Matthew Dillon <dillon at backplane.com> wrote:
> It looks like several different blocks failed a CRC test in your logs. It
> would make sense to try to track down exactly where. If you want to dive
> the filesystem meta-data you can dump it with full CRC tests using:
> hammer2 -vv show /dev/serno/S59ANMFNB34055E-1.s1d > (save to a file not
> on the filesystem)
> And then look for 'failed)' lines in the output and track the inodes back
> to see which files are affected. It's a bit round-about and you have to get
> familiar with the meta-data format, but that gives the most comprehensive
> results. The output file is typically a few gigabytes (depends how big
> the filesystem is). For example, I wound up with a single data block
> error in a mail file on one of my systems, easily rectified by copying-away
> the file and then deleting it. I usually dump the output to a file and
> then run less on it, then search for failed crc checks.
> data.106 000000051676000f 00000000206a0000/16
> vol=0 mir=0000000000149dc6
> mod=0000000002572acb lfcnt=0
> 32:b65e740a8f5ce753/799af250bfaf8651 failed)
> A 'quick' way to try to locate problems is to use tar, something like
> this. However, tar exits when it encounters the first error so that won't
> find everything, and if the problem is in a directory block that can
> complicate matters.
> tar --one-file-system -cvf /dev/null /
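
Since the dump can run to a few gigabytes, paging through it in less gets tedious. Here is a hedged Python sketch that streams the saved output and prints each 'failed)' line together with the few lines before it. The name `failed_crc_context`, the file name `hammer2-show.txt`, and the default of three lines of context are assumptions for illustration; the context size is guessed from the sample record quoted above, and real dumps may lay records out differently.

```python
from collections import deque


def failed_crc_context(lines, before=3, marker="failed)"):
    """Yield (line_number, context) for each line that contains marker.

    context is the up-to-`before` preceding lines followed by the
    matching line itself, with trailing newlines removed.
    """
    recent = deque(maxlen=before)  # Sliding window of preceding lines.
    for lineno, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if marker in line:
            yield lineno, list(recent) + [line]
        recent.append(line)


# Example use, streaming so the whole dump is never held in memory
# (hammer2-show.txt is a hypothetical name for the saved dump):
#     with open("hammer2-show.txt", errors="replace") as dump:
#         for lineno, context in failed_crc_context(dump):
#             print(f"--- line {lineno}")
#             print("\n".join(context))
```

This only locates the failing records; mapping them back to inodes and file names still means reading the metadata by hand, as described above.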