Current stable tag slip status 23-Mar-2005
Bill Hacker
wbh at conducive.org
Wed Mar 23 21:12:49 PST 2005
Matthew Dillon wrote:
:Matt,
:
:Dodgy TCP/IP - or even the possibility of it - is a
:'showstopper' to pushing this into a 'production' setting.
:
:Can a 'blueprint' be assembled and posted with all
:necessary environment, configuration, test procedures,
:and necessary documentation of same, so that some of
:us 'non-coders' can help with *relevant* tests?
:
:I can put 4 to 8 servers onto it over the next 48 hours, but
:am not sure where to start, what to look for, or what
:coders need as output from the process in order to
:isolate - and either fix or confirm as safe to ignore.
:
:Bill
It's not quite so easy.
If it was 'easy' I'd have asked a local script kiddie ;-)
Just reproducing these issues can be a major
task, because there are a huge number of unknowns... the kernel the
originator was running might be old, the bug might have been
indirectly fixed by some other commit made later on, the originator
may be using odd compiler optimizations or (as in the case of the ppbus
report) gcc-3.4 (which nobody running a production system should be using
yet). The particular hardware could be bad, the issue could be driver
related... the originator might have tweaked some sysctls in odd ways
that are causing problems. There are a lot of unknowns.
ACK. The very reason I was/am trying for 'group think' advice as to where
to look first...
Most of the bugs were reported on machines running older kernels, and
those people are now trying to reproduce them on the latest kernels.
This is a particular liability for TCP related bugs due to the number of
bug fixes that have been committed recently.
IF any have NOT been reproduced (at least) by the
same folks on the same platforms, I am happy to
consider those 'no longer of interest'.
If you want to have a go at reproducing some of these issues please do!
I'm working on NFS right now. If you have an SMP box see if you can
reproduce the IPV4 connection issue reported by Peter Avalos.
Not running SMP presently. Can do, but this isn't a casus belli.
I'll give a quick summary of where my thinking is on the issues still
open. The stable tag is still going to be slipped today regardless,
because the existing stable has become a liability... there are just
too many bugs that have been fixed since then, some quite serious
(certainly more serious than a non-fatal, repeatable TCP issue).
ACK. Put that way, it makes good sense. Risk/reward ratio is positive.
If we suddenly get a flurry of bug reports from other people related to
these particular bugs, you can be sure that the issue will be tracked
down quickly and fixed.
ACK. A Fresh starting point, as it were.
IPV4 connection problems - still diagnosing (probably SMP related)
(reported by Peter Avalos)
This issue is either due to an (as yet unknown) wildcard listen
socket problem or some resource is getting blown out and we just
haven't figured out which resource it is.
Symptoms: connections to apache or ftp on localhost sometimes
timeout, but then a few seconds later connect just fine. A packet
trace shows the TCP connection requests going out but no
acknowledgement or RST coming back. Only one person has reported
the problem, so we don't know if it is a software bug or if some
resource is being blown out or if the machine is being attacked or
what.
Archaeology in both cases. Neither ftp nor Apache of current interest.
I am going to try to reproduce it on my SMP test box today.
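For anyone else who wants to take a run at it, even a dumb
connect-loop with a short timeout should surface the symptom.
A minimal sketch in Python (host, port, and attempt count are
placeholders of mine, not from Peter's report):

    import socket, time

    HOST, PORT = "127.0.0.1", 80   # point this at a local apache or ftpd
    ATTEMPTS = 1000

    failures = 0
    for i in range(ATTEMPTS):
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.settimeout(2.0)          # an unanswered SYN shows up as a timeout
        try:
            s.connect((HOST, PORT))
        except socket.timeout:
            failures += 1
            print("attempt %d: connect timed out" % i)
        finally:
            s.close()
        time.sleep(0.01)

    print("%d timeouts in %d attempts" % (failures, ATTEMPTS))

A nonzero timeout count, plus a tcpdump of the same run, is
presumably the sort of output the coders need.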
TCP Flooding issue - still diagnosing
(reported by Atte Peltomaki)
Atte's report is:
Every once in a while, when an application crashes and leaves an open TCP
connection, data starts flowing full speed back and forth between the boxes.
Here's tcpdump output from one occasion where Opera crashed:
06:56:33.123424 IP webserverip.80 > myboxip.1632: . ack 1 win 33580 <nop,nop,timestamp 859 139667162>
06:56:33.123461 IP myboxip.1632 > webserverip.80: F 838440838:838440838(0) ack 3219133104 win 65535 <nop,nop,timest
[repeats at a high rate]
netstat says the connection is in CLOSE_WAIT state.
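For reference, CLOSE_WAIT is the state a socket sits in once the
remote end has sent its FIN but the local application still holds
the descriptor open without calling close(). A minimal sketch that
parks a localhost connection in CLOSE_WAIT so it can be seen in
netstat (the port is an arbitrary pick of mine; this shows only the
quiescent state, not the ack/FIN flood Atte reported):

    import socket, subprocess, time

    # listener that accepts one connection and immediately closes it,
    # which sends a FIN to the client
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("127.0.0.1", 9099))
    srv.listen(1)

    cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    cli.connect(("127.0.0.1", 9099))
    conn, addr = srv.accept()
    conn.close()                   # FIN goes out; cli never close()es

    time.sleep(1)                  # let the FIN arrive
    # look for 127.0.0.1.9099 in CLOSE_WAIT in the output
    subprocess.call(["netstat", "-an"])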
This worried me, as I had a similar happening within
the 'house' side of an ADSL router on 3 occasions,
evidenced by switching hub LED's going bug-fsck
and loss of usable connectivity all around. Simply
updated DragonFly 'til it ceased rather than analyze.
Two other boxen were active at the time: a PowerBook 17"
running Mac OS X patched to 2005-002, and a FreeBSD
4.11-STABLE box of 2 FEB 05. No stack tweaks; 'factory' defaults.
Atte was originally running a fairly old kernel and is now retesting
with the latest kernel.
'Whatever it is' has not reappeared *here* in
either gcc2 or gcc3 ISO's, or cvs head, with
back-to-back make cycles in any release since
o/a 13 MAR. May have gone away earlier,
I am not updating every day.
It is unknown whether the bug exists in the latest kernel. I have
not been able to replicate it as yet. Clearly some TCP parameters
have been tweaked (e.g. the default window size is not 65535); perhaps
the issue is related to some of the tweaks that have been made.
Highly probable. My undocumented issues, above,
were on bog-standard, as-issued stacks with no TCP tweaks.
I still have that ISO on CD, could try to back-pedal
and replicate, but disinclined to bother with scatology.
Whether fixed by design, or 'sideswipe' fixed as
peripheral to some other work, that one seems
to have gone away.
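For anyone still seeing it, it may be worth attaching proof that the
stack is stock. A quick way to snapshot the TCP sysctl tree for a
diff against a fresh install (a sketch; net.inet.tcp is the standard
tree on DragonFly and FreeBSD):

    import os

    # snapshot the TCP sysctls; take one on a stock install and one on
    # the problem box, then diff the two files to spot tweaked values
    snapshot = os.popen("sysctl net.inet.tcp").read()
    open("tcp-sysctls.txt", "w").write(snapshot)
    print(snapshot)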
NFS TCP connection failure - still diagnosing
(reported by someone through Jeffrey Hsu)
The report is that a large volume of NFS operations over a TCP
mount result in the TCP connection dying. A TCP trace seems to
show the TCP connection getting a FIN and resetting.
We don't have any further information, so we don't know why the TCP
connection was closed. It could have been an nfsd being killed on
the server, or it could have been something else.
Way too little info here, and far too many external
variables in general with NFS. Not generally used
here for other reasons.
I have run gigabytes through an NFS tcp connection and not yet been
able to replicate the problem.
-Matt
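For anyone wanting to push comparable volume through an NFS TCP
mount, a minimal sketch (the mount point is a placeholder of mine;
add checksumming if silent corruption is suspected):

    import os

    MOUNT = "/mnt/nfs"            # placeholder: an NFS-over-TCP mount
    CHUNK = 1024 * 1024           # 1 MB blocks
    BLOCKS = 4096                 # ~4 GB total

    path = os.path.join(MOUNT, "stress.dat")

    # write phase: stream random data out over the mount
    out = open(path, "wb")
    for i in range(BLOCKS):
        out.write(os.urandom(CHUNK))
    out.close()

    # read phase: pull it all back; a dying TCP connection surfaces
    # here as an I/O error rather than a silent failure
    inp = open(path, "rb")
    total = 0
    while True:
        block = inp.read(CHUNK)
        if not block:
            break
        total += len(block)
    inp.close()

    print("read back %d bytes" % total)
    os.remove(path)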
Recap:
Too little meat on the NFS bone, other items appear to
have been local aberrations, and/or not reproduced with
any certainty, if at all, on later releases.
Benefits of slipping the tag clearly outweigh risks, IMO.
I'm for it!
Thanks for the analysis and sitrep.
Bill