Thousands of "lookupdotdot failed" messages - should I be worried?

Thu Jun 17 10:40:44 PDT 2021

On Wed, Jun 16, 2021 at 10:10:48PM PDT, Matthew Dillon wrote:
> Ok, you could try this kernel patch and see if it helps.  If you know how
> to rebuild the system from sources.  The client-side I/O errors would go
> hand-in-hand with the dotdot stuff, but only because this is a NFS mount.
> The server-side should not be having any I/O errors, and no filesystem data
> is in error or corrupted or anything like that.
> 
> http://apollo.backplane.com/DFlyMisc/nullfs01.patch
> 
> Other questions I have are
> 
> (1) Just how many NULLFS filesystems are mounted?
> (2) Are you making multiple NFS mounts based on the same source path?
> (3) And finally, what is the underlying filesystem type that the nullfs is
> taking as its source?
> 
> What I try to do in this patch is construct a FSID based on the nullfs's
> destination path instead of its source path, to try to reduce conflicts
> there.  Another possible problem is that the nullfs's underlying filesystem
> has a variable fsid due to not being a storage-based filesystem.  i.e. if
> the underlying filesystem is a NFS or TMPFS filesystem, for example.
> 
> The only other thing that I can think of that could cause dotdot lookup
> failures is if you rename a directory on one client and try to access it
> from another, or rename a directory on the server and try to access it from
> a client that had already cached it.  Directories are kinda finicky in NFS.
> 
> -Matt

Thanks for the patch - I'll give it a try asap.  The server side is, 
thankfully, not having any I/O errors, and FUSE-capable clients can use 
the NFS mounts over sshfs without encountering any.  To answer your 
other questions:

(1) I have 12 null mounts that are NFS exported; each of these is the 
root of a PFS.  I also have a number of unexported mounts; one being a 
PFS root and the others being subdirectories of a single PFS.  The total 
number of null mounts on the system is 24.

(2) I'm not sure I understood the question, but only one null mount of 
any underlying PFS is exported (for all $node, $node is only accessible 
over NFS through one real path).

(3) In all cases the underlying filesystem is HAMMER.

This is happening in directories that have never been renamed since they 
were created years ago, on files that have been there for years, even in 
directories that are only used by one client.
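
To check my understanding of the FSID change described above, here is a 
minimal sketch of the idea in Python.  It is not the code from the 
patch: the function name fsid_from_path, the SHA-256 hash, and the 
example paths are all assumptions made purely for illustration.  It 
only shows why an ID derived from the destination path stays distinct 
when several null mounts share one source path.

```python
import hashlib
import struct
from typing import Tuple


def fsid_from_path(path: str) -> Tuple[int, int]:
    """Hypothetical helper: map a mount path to two 32-bit words.

    This is an illustration only, not the kernel's actual FSID logic.
    """
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    # Take the first 8 bytes of the digest as two unsigned 32-bit values.
    return struct.unpack("<II", digest[:8])


# Hypothetical layout: two null mounts taken from the same source path.
source = "/pfs/data"
destinations = ["/export/data", "/jail/data"]

# Keyed on the source path, both mounts end up with the same ID.
by_source = [fsid_from_path(source) for _ in destinations]
print("same source collides:", by_source[0] == by_source[1])

# Keyed on the destination path, each mount gets its own ID.
by_dest = [fsid_from_path(d) for d in destinations]
print("destinations differ:", by_dest[0] != by_dest[1])
```

In my setup each exported PFS has only one null mount, so this kind of 
collision would not obviously apply, but the patch may still help.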

On Wed, Jun 16, 2021 at 10:52:09PM PDT, Matthew Dillon wrote:
> Oh, I forgot to mention... this patch changes the FSID for null mounts, so
> any clients should umount, then reboot the server, then clients can
> remount.  Or reboot the clients after rebooting the server.
> 
> -Matt

That's my SOP for all NFS file server reboots, so no worries there.  :)

If it helps, it looks like this started happening after upgrading the 
server from 5.x to 6.0 a month[1] ago and has been exponentially ramping 
up in frequency since then.  It had never happened before as far as I'm 
aware.  The server has not been rebooted since, but some of the clients 
have been a few times.  At least three PFSs are affected for sure and I 
surmise the rest of them are as well.

[1] according to my own records, it's been a month, whereas uptime(1) 
would have me believe it's been up for 117 years, kern.boottime being in 
October 1903.  I have no idea how that happened.  Probably unrelated but 
sometimes things affect seemingly unrelated things in unforeseen ways...
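
The 117-year figure is at least consistent with a boot time in October 
1903.  A quick check in Python, assuming the date of this message as 
"now" and trying both ends of October since I did not note the exact 
day (the helper name whole_years_between is made up for this example):

```python
from datetime import date


def whole_years_between(start: date, end: date) -> int:
    """Count the complete calendar years elapsed from start to end."""
    years = end.year - start.year
    # Subtract one if the anniversary has not yet occurred in the end year.
    if (end.month, end.day) < (start.month, start.day):
        years -= 1
    return years


now = date(2021, 6, 17)  # date of this message
for day in (1, 31):      # exact day in October 1903 is an assumption
    boot = date(1903, 10, day)
    print(boot.isoformat(), whole_years_between(boot, now))
```

Either way it comes out to 117 whole years, so uptime(1) is reporting 
the bogus boot time faithfully; this says nothing about how 
kern.boottime got that value.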

-- 
A Dog