Delayed ACK triggered by Header Prediction
Matthew Dillon
dillon at apollo.backplane.com
Wed Mar 16 01:28:22 PST 2005
Hmmm. There is something going on but it doesn't have anything to do
with tcpcbackq[]. I will investigate why the acks are getting seriously
delayed on Wednesday. Here is an explanation of how tcpcbackq[] works:
DragonFly doesn't send back-to-back acks which would otherwise occur
due to normal aggregation of receive packets by the ethernet hardware.
So, for example, a GigE interface typically only generates an interrupt
once every 8 to 10 received packets (if the packets are coming in at
full speed). If all the packets are associated with the same tcp
connection, DragonFly will only send one ACK back after processing
the whole set rather than sending 4 back-to-back acks. This greatly
reduces both the return channel bandwidth and the receiver's overhead,
AND it reduces the overhead of processing the acks on the sender.
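Roughly, the idea can be modeled in user space like this. This is only an
illustrative sketch and not the kernel code; the struct, the function names
and the burst/segment sizes are made up for the example. The input path
merely notes that an ack is owed, and one up-to-date ack is flushed after
the whole burst has been processed:

    #include <stdio.h>

    struct conn {
            unsigned long rcv_nxt;      /* next sequence number expected */
            int           ack_pending;  /* set by the input path */
    };

    /* input path: header prediction hit, just note that an ack is owed */
    static void
    input_segment(struct conn *c, unsigned long seglen)
    {
            c->rcv_nxt += seglen;
            c->ack_pending = 1;
    }

    /* "willblock" path: flush one up-to-date ack for the whole burst */
    static void
    flush_ack(struct conn *c)
    {
            if (c->ack_pending) {
                    printf("ACK %lu\n", c->rcv_nxt);
                    c->ack_pending = 0;
            }
    }

    int
    main(void)
    {
            struct conn c = { 0, 0 };
            int i;

            /* one GigE interrupt delivering a burst of 8 full-sized segments */
            for (i = 0; i < 8; i++)
                    input_segment(&c, 1448);
            flush_ack(&c);              /* one ack instead of four */
            return 0;
    }

In the real code the input path queues an output request on tcpcbackq[] and
tcp_willblock() processes those requests later, but the net effect is the
same: a single ack covering the whole burst.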
A 100BaseT interface typically does NOT aggregate packets... the packets
are coming in too slowly (usually) for such aggregation to occur. This
means that tcpcbackq[] would not cause the acks to occur less often than
every other packet over 100BaseT, not unless the cpu load is very, very
high.
It should be noted that DragonFly is *NOT* delaying the ack in the
time domain. In fact, the sender will get a more up-to-date ack more
quickly because it won't have to wade through 4 ack packets before
it gets the most up-to-date ack (in the GigE case).
I have never noticed any performance degradation from this. I get
10MBytes/sec over 100BaseT, at least between two DragonFly boxes.
Sure the congestion window might open up a bit more slowly on some
senders, but it requires a multi-packet data burst to trigger the
effect (i.e. 3 or more packets sent back-to-back) and at that point
the congestion window should already be sufficiently open to not affect
performance. At those speeds it would only take a few milliseconds
at most for the congestion window to open up completely. Unless your
link has a lot of packet loss, you shouldn't notice any degradation
in performance, and even if your link has packet loss you shouldn't
notice much (because the effect doesn't occur until the congestion
window is at least 3 packets long).
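As a rough sanity check on the "few milliseconds" claim, here is a small
back-of-the-envelope calculation. It assumes classic slow start (the
congestion window roughly doubles every round trip), a 1448-byte MSS, a
64KB receive window and a 0.5ms LAN round-trip time; all of those numbers
are illustrative assumptions, not measurements from this test:

    #include <stdio.h>

    int
    main(void)
    {
            double mss    = 1448.0;     /* bytes per segment (assumed) */
            double rwin   = 65535.0;    /* receive window (assumed) */
            double rtt_ms = 0.5;        /* LAN round-trip time (assumed) */
            double cwnd   = 2.0 * mss;  /* initial window of two segments */
            int rtts = 0;

            /* classic slow start: cwnd roughly doubles each round trip */
            while (cwnd < rwin) {
                    cwnd *= 2.0;
                    rtts++;
            }
            printf("~%d RTTs (~%.1f ms) to open a %.0fKB window\n",
                rtts, rtts * rtt_ms, rwin / 1024.0);
            return 0;
    }

With those assumptions it works out to about 5 round trips, i.e. a couple
of milliseconds, which is consistent with the statement above.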
The business about sending one ack for every second segment is a very
old part of the RFC (if I remember correctly), and might have made
sense for a 10BaseT connection, but it makes very little sense for
a 100BaseT or GigE connection with packet-aggregating interrupt
hardware.
--
In any case, I *AM* seeing a performance reduction when I FTP with
a DragonFly box as a receiver over 100BaseT. I am seeing 7-8MBytes/sec
instead of 10+MB/sec. It is NOT related to the way delayed acks work or
how tcpcbackq[] works, however. It looks like there is an output delay
being imposed somewhere but it is occurring outside the TCP stack. You
can verify this by doing a tcpdump in the middle of a transfer on the
sender, and then doing a tcpdump on the receiver. The receiver believes
it is sending an ack out every other packet (at 100BaseT speeds), but
the sender is seeing those acks globbed together. When I run the same
test with a FreeBSD box as the receiver the sender is NOT seeing the acks
globbed together (at least not anywhere near as badly). Clearly there
is something wrong here. I don't yet know what it is but I am fairly
sure from the tcpdump output that the TCP stack is not to blame.
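For reference, a capture along these lines on both boxes shows the effect
clearly; the interface name is a placeholder, 'port 20' assumes an
active-mode FTP data connection, and -ttt makes tcpdump print the time
delta between successive packets:

    tcpdump -n -ttt -i <iface> 'tcp and port 20'    # run on the sender
    tcpdump -n -ttt -i <iface> 'tcp and port 20'    # run on the receiver

The receiver's trace shows an ack going out for every other data segment,
while the sender's trace shows those same acks arriving bunched together,
so the delay is being introduced somewhere below the TCP stack.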
-Matt
Matthew Dillon
<dillon at xxxxxxxxxxxxx>
:I have been using DragonFlyBSD as a TCP receiver since yesterday night in
:my experiments, and I found that the number of ACK segments sent in reply
:to received data segments is less than expected.
:
:Normally, ACK segments are sent for every second full-sized data segment.
:
: (As many of you know, it is called Delayed ACK and is specified in
: section 4.2.3.2 of RFC1122 as follows:
:
: A TCP SHOULD implement a delayed ACK, but an ACK should not
: be excessively delayed; in particular, the delay MUST be
: less than 0.5 seconds, and in a stream of full-sized
: segments there SHOULD be an ACK for at least every second
: segment.
: )
:
:But the Header Prediction code in DragonFlyBSD TCP sends ACK segments
:less frequently. It just queues an output request into tcpcbackq[].
:And tcp_willblock() processes the request later. It seems that
:tcp_willblock() is called less frequently than once per two received
:full-sized data segments in my environment (100Mbps). (I put printf()'s
:in tcp_input(), tcp_output() and tcp_willblock() and found this.)
:That would be the reason why the number of ACK segments is less than
:expected.
:
:In my experiments, since DragonFlyBSD sends fewer ACK segments than
:expected, the congestion window in the sender machine grows slowly
:and the TCP performance becomes poor.
:
:I tried the following:
:
: 1. "sysctl -w net.inet.tcp.avoid_pure_win_update=0"
: But my problem was not solved.
:
: 2. I replaced the code fragment that inserts an output request in
: Header Prediction with a code that simply calls tcp_output().
: With this change, the TCP performance became normal
: (compared with the performance when a Linux box is the receiver).
:
:I checked "cvs log". tcpcbackq[] was introduced on Aug 3, 2004 to
:reduce the number of ACK segments across GbE. Unfortunately, it reduces
:the TCP performance on a 100Mbps path when DragonFlyBSD acts as a receiver.
:I think the same phenomenon will occur when DragonFlyBSD acts as a receiver
:across 10GbE.
:
: What I would like to say here is that when acting as a receiver,
: if the number of ACK segments sent in reply to data segments is reduced,
: TCP performance from the peer node would also be reduced because of
: the standard congestion control algorithm.
:
:So, I think it is better to send an ACK segment for every second
:full-sized data segment even on GbE. But I have not tried
:DragonFlyBSD on GbE yet, so I may be wrong. I am sorry if that is
:the case.
:
:Regards,
:Noritoshi Demizu