Current stable tag slip status 23-Mar-2005

Bill Hacker wbh at conducive.org
Wed Mar 23 21:12:49 PST 2005


Matthew Dillon wrote:

:Matt,
:
:Dodgy TCP/IP  - or even the possibility of it - is a
:'showstopper' to pushing this into a 'production' setting.
:
:Can a 'blueprint' be assembled and posted with all
:necessary environment, configuration, test procedures,
:and necessary documentation of same, so that some of
:us 'non-coders' can help with *relevant* tests?
:
:I can put 4 to 8 servers onto it over the next 48 hours, but
:am not sure where to start, what to look for, or what
:coders need as output from the process in order to
:isolate - and either fix or confirm as safe to ignore.
:
:Bill
    It's not quite so easy. 
If it was 'easy' I'd have asked a local script kiddie ;-)

    Just reproducing these issues can be a major
    task, because there are a huge number of unknowns... the kernel the
    originator was running might be old, the bug might have been 
    indirectly fixed by some other commit made later on, the originator 
    may be using odd compiler optimizations or (as in the case of the ppbus
    report) gcc-3.4 (which nobody running a production system should be using
    yet).  The particular hardware could be bad, the issue could be driver
    related... the originator might have tweaked some sysctls in odd ways 
    that are causing problems.  There are a lot of unknowns.
ACK.  The very reason I was/am trying for 'group think' advice as to where
to look first...
    Most of the bugs were reported on machines running older kernels, and 
    those people are now trying to reproduce them on the latest kernels.
    This is a particular liability for TCP related bugs due to the number of
    bug fixes that have been committed recently.

IF any have NOT been reproduced (at least) by the
same folks on the same platforms, I am happy to
consider those 'no longer of interest'.
    If you want to have a go at reproducing some of these issues please do!
    I'm working on NFS right now.  If you have an SMP box see if you can
    reproduce the IPV4 connection issue reported by Peter Avalos.
Not running SMP presently. Can do, but this isn't a casus belli.

    I'll give a quick summary of where my thinking is on the issues still
    open.  The stable tag is still going to be slipped today regardless,
    because the existing stable has become a liability... there are just
    too many bugs that have been fixed since then, some quite serious
    (certainly more serious than a non-fatal repeatable tcp issue).  

ACK. Put that way, it makes good sense. Risk/reward ratio is positive.

    If we suddenly get a flurry of bug reports from other people related to
    these particular bugs, you can be sure that the issue will be tracked
    down quickly and fixed.
ACK.  A fresh starting point, as it were.



    IPV4 connection problems    - still diagnosing (probably SMP related)
        (reported by Peter Avalos)
	This issue is either due to an (as yet unknown) wildcard listen
	socket problem or some resource is getting blown out and we just
	haven't figured out which resource it is.
	Symptoms:  connections to apache or ftp on localhost sometimes
	time out, but then a few seconds later connect just fine.  A packet
	trace shows the TCP connection requests going out but no
	acknowledgement or RST coming back.  Only one person has reported
	the problem, so we don't know if it is a software bug or if some
	resource is being blown out or if the machine is being attacked or
	what.
Archaeology in both cases. Neither ftp nor Apache is of current interest here.

	I am going to try to reproduce it on my SMP test box today. 
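
What I *can* do from this side is hammer the local listeners on a
UP box and log any connect that stalls.  Roughly the sketch below
is what I'd run - the host/port and the 5-second cutoff are my own
guesses, nothing Matt specified:

#!/usr/bin/env python
# Crude harness: hammer a local listener and log any connect that
# stalls.  Target host/port and the 5 second cutoff are guesses;
# point it at whatever the affected box is actually serving.
import socket, time

HOST, PORT = "127.0.0.1", 80      # assumed target (apache on localhost)
ATTEMPTS   = 10000
TIMEOUT    = 5.0                  # seconds before we call it a stall

failures = 0
for i in range(ATTEMPTS):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(TIMEOUT)
    start = time.time()
    try:
        s.connect((HOST, PORT))
    except socket.error as e:
        failures += 1
        print("%d: connect failed after %.1fs: %s" % (i, time.time() - start, e))
    else:
        s.close()
print("%d failures out of %d attempts" % (failures, ATTEMPTS))

If it ever logs a stall, a tcpdump running on lo0 at the same time
should show whether the connection request got any answer at all.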

    TCP Flooding issue          - still diagnosing
        (reported by Atte Peltomaki)
	Atte's report is:

	Every once in a while, when an application crashes and leaves an open TCP
	connection, data starts flowing full speed back and forth between the boxes.
	Here's tcpdump output from one occasion where Opera crashed:

06:56:33.123424 IP webserverip.80 > myboxip.1632: . ack 1 win 33580 <nop,nop,timestamp 859 139667162>
06:56:33.123461 IP myboxip.1632 > webserverip.80: F 838440838:838440838(0) ack 3219133104 win 65535 <nop,nop,timest
	[repeats at a high rate]
	netstat says the connection is in CLOSE_WAIT state. 
This worried me, as I had a similar happening within
the 'house' side of an ADSL router on 3 occasions,
evidenced by switching hub LEDs going bug-fsck
and loss of usable connectivity all around.  Simply
updated DragonFly 'til it ceased rather than analyze.
Two other boxen were active at the time: a PowerBook 17",
Mac OS X patched to 2005-002, and FreeBSD 4.11-STABLE
of 2 FEB 05. No stack tweaks.  'factory' defaults.
	Atte was originally running a fairly old kernel and is now retesting
	with the latest kernel.
'Whatever it is' has not reappeared *here* in
either gcc2 or gcc3 ISOs, nor cvs head with
back-to-back make cycles, in any release since
o/a 13 MAR.  It may have gone away earlier;
I am not updating every day.
	It is unknown whether the bug exists in the latest kernel.  I have
	not been able to replicate it as yet.  Clearly some TCP parameters
	have been tweaked (e.g. the default window size is not 65535), perhaps
	the issue is related to some of the tweaks that have been made.
Highly probable. My undocumented issues, above,
were on bog-standard, as-issued stacks with no TCP tweaks.
I still have that ISO on CD, and could try to back-pedal
and replicate, but am disinclined to bother with scatology.
Whether fixed by design, or 'sideswipe' fixed as
peripheral to some other work, that one seems
to have gone away.
    NFS TCP connection failure  - still diagnosing
        (reported by someone through Jeffrey Hsu)
	The report is that a large volume of NFS operations over a TCP
	mount result in the TCP connection dying.  A TCP trace seems to
	show the TCP connection getting a FIN and resetting. 

	We don't have any further information, so we don't know why the TCP
	connection was closed.  It could have been an nfsd being killed on
	the server, or it could have been something else.
Way too little info here, and far too many external
variables in general with NFS. Not generally used
here, for other reasons.
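
If I do set up a test mount over TCP, the starting point would be
nothing fancier than churning data through it and checking it comes
back intact - the mount point below is just a placeholder for
wherever such a mount would live:

#!/usr/bin/env python
# Write ~2 GB of pseudo-random blocks to a file on the NFS mount,
# read them back, and compare checksums, while watching whether the
# TCP connection survives.  Mount point is a placeholder; read-back
# may be partly satisfied from the client cache.
import hashlib, os

MOUNT  = "/mnt/nfstest"            # hypothetical NFS-over-TCP mount point
PATH   = os.path.join(MOUNT, "churn.dat")
BLOCK  = 64 * 1024
BLOCKS = 32768                     # 32768 * 64 KB = 2 GB

wsum = hashlib.md5()
with open(PATH, "wb") as f:
    for _ in range(BLOCKS):
        buf = os.urandom(BLOCK)
        wsum.update(buf)
        f.write(buf)

rsum = hashlib.md5()
with open(PATH, "rb") as f:
    while True:
        buf = f.read(BLOCK)
        if not buf:
            break
        rsum.update(buf)

print("write md5:", wsum.hexdigest())
print("read  md5:", rsum.hexdigest())
print("match" if wsum.hexdigest() == rsum.hexdigest() else "MISMATCH")
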
	I have run gigabytes through an NFS tcp connection and not yet been
	able to replicate the problem.
					-Matt
Recap:

Too little meat on the NFS bone; other items appear to
have been local aberrations, and/or not reproduced with
any certainty, if at all, on later releases.
Benefits of slipping the tag clearly outweigh the risks, IMO.

I'm for it!

Thanks for the analysis and sitrep.

Bill




