Current stable tag slip status 23-Mar-2005

Wed Mar 23 13:23:33 PST 2005

:Matt,
:
:Dodgy TCP/IP  - or even the possibility of it - is a
:'showstopper' to pushing this into a 'production' setting.
:
:Can a 'blueprint' be assembled and posted with all
:necessary environment, configuration, test procedures,
:and necessary documentation of same, so that some of
:us 'non-coders' can help with *relevant* tests?
:
:I can put 4 to 8 servers onto it over the next 48 hours, but
:am not sure where to start, what to look for, or what
:coders need as output from the process in order to
:isolate - and either fix or confirm as safe to ignore.
:
:Bill

    It's not quite so easy.  Just reproducing these issues can be a major
    task, because there are a huge number of unknowns... the kernel the
    originator was running might be old, the bug might have been 
    indirectly fixed by some other commit made later on, the originator 
    may be using odd compiler optimizations or (as in the case of the ppbus
    report) gcc-3.4 (which nobody running a production system should be using
    yet).  The particular hardware could be bad, the issue could be driver
    related... the originator might have tweaked some sysctls in odd ways 
    that are causing problems.  There are a lot of unknowns.

    Most of the bugs were reported on machines running older kernels, and 
    those people are now trying to reproduce them on the latest kernels.
    This is a particular liability for TCP related bugs due to the number of
    bug fixes that have been committed recently.

    If you want to have a go at reproducing some of these issues please do!
    I'm working on NFS right now.  If you have an SMP box see if you can
    reproduce the IPV4 connection issue reported by Peter Avalos.

    I'll give a quick summary of where my thinking is on the issues still
    open.  The stable tag is still going to be slipped today regardless,
    because the existing stable has become a liability... there are just
    too many bugs that have been fixed since then, some quite serious
    (certainly more serious than a non-fatal repeatable tcp issue).  

    If we suddenly get a flurry of bug reports from other people related to
    these particular bugs, you can be sure that the issue will be tracked
    down quickly and fixed.

    IPV4 connection problems    - still diagnosing (probably SMP related)
        (reported by Peter Avalos)

	This issue is either due to an (as yet unknown) wildcard listen
	socket problem or some resource is getting blown out and we just
	haven't figured out which resource it is.

	Symptoms:  connections to apache or ftp on localhost sometimes
	time out, but then a few seconds later connect just fine.  A packet
	trace shows the TCP connection requests going out but no
	acknowledgement or RST coming back.  Only one person has reported
	the problem, so we don't know if it is a software bug or if some
	resource is being blown out or if the machine is being attacked or
	what.

	I am going to try to reproduce it on my SMP test box today. 
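
	For anyone who wants to help reproduce this, a minimal probe
	along the following lines might be a starting point.  It is only
	a sketch: the function names, the timeout and the attempt count
	are illustrative assumptions, not an established test procedure.
	It repeatedly opens and closes a TCP connection and records which
	attempts failed or were slow.

```python
import socket
import time


def probe(host, port, attempts, timeout=3.0, pause=0.0):
    """Try `attempts` TCP connects; return a list of (ok, seconds, error)."""
    results = []
    for _ in range(attempts):
        start = time.monotonic()
        try:
            # create_connection raises OSError (including timeouts) on failure
            with socket.create_connection((host, port), timeout=timeout):
                results.append((True, time.monotonic() - start, None))
        except OSError as exc:
            results.append((False, time.monotonic() - start, repr(exc)))
        time.sleep(pause)
    return results


def summarize(results):
    """Count attempts and failures and report the slowest attempt in seconds."""
    failures = [r for r in results if not r[0]]
    slowest = max((r[1] for r in results), default=0.0)
    return {"attempts": len(results), "failures": len(failures),
            "slowest": slowest}
```

	Something like summarize(probe("127.0.0.1", 80, attempts=1000,
	pause=0.01)) run on the affected box, together with the kernel
	date and whether the machine is SMP, would be the kind of output
	worth posting.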

    TCP Flooding issue          - still diagnosing
        (reported by Atte Peltomaki)

	Atte's report is:

	Every once in a while, when an application crashes and leaves an open
	TCP connection, data starts flowing at full speed back and forth
	between the boxes.
	Here's tcpdump output from one occasion where Opera crashed:

06:56:33.123424 IP webserverip.80 > myboxip.1632: . ack 1 win 33580 <nop,nop,timestamp 859 139667162>
06:56:33.123461 IP myboxip.1632 > webserverip.80: F 838440838:838440838(0) ack 3219133104 win 65535 <nop,nop,timest
	[repeats at a high rate]

	netstat says the connection is in CLOSE_WAIT state. 

	Atte was originally running a fairly old kernel and is now retesting
	with the latest kernel.

	It is unknown whether the bug exists in the latest kernel.  I have
	not been able to replicate it as yet.  Clearly some TCP parameters
	have been tweaked (e.g. the default window size is not 65535); perhaps
	the issue is related to some of those tweaks.
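
	If someone captures a trace like the one above, a rough way to
	spot the flood is to count packets per flow per second in the
	tcpdump text output.  The sketch below assumes lines shaped like
	the ones in Atte's report ("HH:MM:SS.usec IP src > dst: ...");
	the function names and the threshold of 1000 packets per second
	are arbitrary illustrative choices.

```python
import re
from collections import Counter

# Matches the start of a tcpdump text line: timestamp, "IP", src, ">", dst.
LINE_RE = re.compile(r"^(\d{2}:\d{2}:\d{2})\.\d+ IP (\S+) > (\S+):")


def packets_per_second(lines):
    """Count packets keyed by (second, source, destination)."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            counts[m.groups()] += 1
    return counts


def floods(counts, threshold=1000):
    """Return the (key, count) pairs at or above the threshold, sorted."""
    return sorted((key, n) for key, n in counts.items() if n >= threshold)
```

	Feeding saved tcpdump output through packets_per_second() and
	then floods() gives a short list of the seconds and flows where
	the rate spiked, which is easier to post than the raw trace.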

    NFS TCP connection failure  - still diagnosing
        (reported by someone through Jeffrey Hsu)

	The report is that a large volume of NFS operations over a TCP
	mount results in the TCP connection dying.  A TCP trace seems to
	show the TCP connection getting a FIN and resetting. 

	We don't have any further information, so we don't know why the TCP
	connection was closed.  It could have been an nfsd being killed on
	the server, or it could have been something else.

	I have run gigabytes through an NFS tcp connection and not yet been
	able to replicate the problem.
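
	For those with spare boxes, one simple way to push volume through
	an NFS TCP mount is to write a large file of random data and read
	it back, comparing checksums.  This is only a sketch and not the
	test I have been running; write_and_verify() and its parameters
	are hypothetical names, and the path is assumed to live on the
	NFS mount under test.  Note that the read-back may be served from
	the client's cache rather than over the wire.

```python
import hashlib
import os


def write_and_verify(path, total_bytes, block_size=1 << 20):
    """Write random data to `path`, read it back, and compare SHA-256 sums."""
    written = hashlib.sha256()
    remaining = total_bytes
    with open(path, "wb") as f:
        while remaining > 0:
            block = os.urandom(min(block_size, remaining))
            f.write(block)
            written.update(block)
            remaining -= len(block)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data out to the server

    read_back = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            read_back.update(block)
    return written.hexdigest() == read_back.hexdigest()
```

	Running this in a loop with a total size in the gigabyte range,
	while a packet trace is being captured on the client, should show
	whether the connection receives a FIN under load.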

					-Matt
					Matthew Dillon 
					<dillon at xxxxxxxxxxxxx>