Current stable tag slip status 23-Mar-2005
Matthew Dillon
dillon at apollo.backplane.com
Wed Mar 23 13:23:33 PST 2005
:Matt,
:
:Dodgy TCP/IP - or even the possibility of it - is a
:'showstopper' to pushing this into a 'production' setting.
:
:Can a 'blueprint' be assembled and posted with all
:necessary environment, configuration, test procedures,
:and necessary documentation of same, so that some of
:us 'non-coders' can help with *relevant* tests?
:
:I can put 4 to 8 servers onto it over the next 48 hours, but
:am not sure where to start, what to look for, or what
:coders need as output from the process in order to
:isolate - and either fix or confirm as safe to ignore.
:
:Bill
It's not quite so easy. Just reproducing these issues can be a major
task, because there are a huge number of unknowns... the kernel the
originator was running might be old, the bug might have been
indirectly fixed by some other commit made later on, the originator
may be using odd compiler optimizations or (as in the case of the ppbus
report) gcc-3.4 (which nobody running a production system should be using
yet). The particular hardware could be bad, the issue could be driver
related... the originator might have tweaked some sysctls in odd ways
that are causing problems. There are a lot of unknowns.
Most of the bugs were reported on machines running older kernels, and
those people are now trying to reproduce them on the latest kernels.
This is a particular liability for TCP related bugs due to the number of
bug fixes that have been committed recently.
If you want to have a go at reproducing some of these issues please do!
I'm working on NFS right now. If you have an SMP box see if you can
reproduce the IPV4 connection issue reported by Peter Avalos.
I'll give a quick summary of where my thinking is on the issues still
open. The stable tag is still going to be slipped today regardless,
because the existing stable has become a liability... there are just
too many bugs that have been fixed since then, some quite serious
(certainly more serious than a non-fatal repeatable TCP issue).
If we suddenly get a flurry of bug reports from other people related to
these particular bugs, you can be sure that the issue will be tracked
down quickly and fixed.
IPV4 connection problems - still diagnosing (probably SMP related)
(reported by Peter Avalos)
This issue is either due to an (as yet unknown) wildcard listen
socket problem or some resource is getting blown out and we just
haven't figured out which resource it is.
Symptoms: connections to apache or ftp on localhost sometimes
time out, but then a few seconds later connect just fine. A packet
trace shows the TCP connection requests going out but no
acknowledgement or RST coming back. Only one person has reported
the problem, so we don't know if it is a software bug or if some
resource is being blown out or if the machine is being attacked or
what.
I am going to try to reproduce it on my SMP test box today.
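For anyone who wants to help look for the symptom described above, here is a minimal sketch of a connect loop. It is not a procedure anyone on the list has posted; the function names `probe` and `summarize`, the 3-second timeout, and the choice of host and port are all assumptions made for illustration. It only records which connection attempts fail or stall, which is the kind of output that could be lined up against a packet trace.

```python
import socket
import time


def probe(host, port, attempts, timeout=3.0, pause=0.0):
    """Make `attempts` TCP connects; return a list of (index, seconds, error).

    `error` is None on success, otherwise the exception class name
    (for example a timeout or a refused connection).
    """
    results = []
    for i in range(attempts):
        start = time.monotonic()
        err = None
        try:
            # Open the connection and close it again immediately.
            with socket.create_connection((host, port), timeout=timeout):
                pass
        except OSError as exc:  # socket timeouts are OSError subclasses
            err = type(exc).__name__
        results.append((i, time.monotonic() - start, err))
        if pause:
            time.sleep(pause)
    return results


def summarize(results):
    """Reduce probe() output to a few counters worth reporting."""
    failed = [r for r in results if r[2] is not None]
    slowest = max((r[1] for r in results), default=0.0)
    return {"attempts": len(results), "failed": len(failed), "slowest": slowest}
```

Something like `summarize(probe("127.0.0.1", 80, 1000))` against a local apache would be a starting point. A nonzero `failed` count proves nothing on its own, but the failing attempts give timestamps to search for in a tcpdump capture.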
TCP Flooding issue - still diagnosing
(reported by Atte Peltomaki)
Atte's report is:
Every once in a while, when application crashes and leaves an open TCP
connection, data starts flowing full speed back and forth the boxes.
Here's tcpdump output from one occasion where Opera crashed:
06:56:33.123424 IP webserverip.80 > myboxip.1632: . ack 1 win 33580 <nop,nop,timestamp 859 139667162>
06:56:33.123461 IP myboxip.1632 > webserverip.80: F 838440838:838440838(0) ack 3219133104 win 65535 <nop,nop,timest
[repeats at a high rate]
netstat says the connection is in CLOSE_WAIT state.
Atte was originally running a fairly old kernel and is now retesting
with the latest kernel.
It is unknown whether the bug exists in the latest kernel. I have
not been able to replicate it as yet. Clearly some TCP parameters
have been tweaked (e.g. the default window size is not 65535); perhaps
the issue is related to some of the tweaks that have been made.
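Since the report mentions netstat showing the connection in CLOSE_WAIT, a tester watching for this could pick such sockets out of `netstat -an` output with something like the sketch below. The function name `sockets_in_state` is made up for illustration, and the parsing assumes the usual BSD column layout (proto, recv-q, send-q, local address, foreign address, state); adjust it if your output differs.

```python
def sockets_in_state(netstat_text, state="CLOSE_WAIT"):
    """Return (local, remote) address pairs for TCP sockets in `state`.

    Assumes each TCP line looks like:
        proto recv-q send-q local-address foreign-address state
    Header lines and non-TCP lines are skipped.
    """
    matches = []
    for line in netstat_text.splitlines():
        fields = line.split()
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        if fields[-1] == state:
            matches.append((fields[3], fields[4]))
    return matches
```

The text can come from `subprocess.run(["netstat", "-an"], capture_output=True, text=True).stdout`. A socket that sits in CLOSE_WAIT while tcpdump shows the FIN/ACK exchange repeating at a high rate would match the behaviour Atte described.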
NFS TCP connection failure - still diagnosing
(reported by someone through Jeffrey Hsu)
The report is that a large volume of NFS operations over a TCP
mount result in the TCP connection dying. A TCP trace seems to
show the TCP connection getting a FIN and resetting.
We don't have any further information, so we don't know why the TCP
connection was closed. It could have been an nfsd being killed on
the server, or it could have been something else.
I have run gigabytes through an NFS TCP connection and have not yet
been able to replicate the problem.
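For those with spare machines, one rough way to push a large volume of data through a TCP NFS mount is sketched below. This is not the test used above; the function name `write_and_verify`, the chunk size, and the idea of checking a SHA-256 digest are assumptions for illustration. The path is expected to point at a file on the NFS mount. If the connection dies mid-run, the script should either hang or raise an I/O error, which is the point at which a packet trace becomes interesting.

```python
import hashlib
import os


def write_and_verify(path, total_bytes, chunk_size=1 << 20):
    """Write `total_bytes` of data to `path`, read it back, compare digests.

    Returns True if the data read back matches what was written.
    Any OSError from the underlying mount is allowed to propagate.
    """
    chunk = os.urandom(chunk_size)  # one random chunk, reused for speed
    written = hashlib.sha256()
    remaining = total_bytes
    with open(path, "wb") as f:
        while remaining > 0:
            piece = chunk[:min(chunk_size, remaining)]
            f.write(piece)
            written.update(piece)
            remaining -= len(piece)
        f.flush()
        os.fsync(f.fileno())  # push the data out to the server

    read_back = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            read_back.update(block)
    return written.hexdigest() == read_back.hexdigest()
```

Note that the read-back may be satisfied partly from the client's cache, so the write phase is what reliably generates traffic. Running several copies in parallel against different files is closer to "a large volume of NFS operations" than a single stream.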
-Matt
Matthew Dillon
<dillon at xxxxxxxxxxxxx>