Delayed ACK triggered by Header Prediction

Matthew Dillon dillon at
Wed Mar 16 01:28:22 PST 2005

    Hmmm.  There is something going on but it doesn't have anything to do
    with tcpcbackq[].  I will investigate why the acks are getting seriously
    delayed on Wednesday.  Here is an explanation of how tcpcbackq[] works:
    DragonFly doesn't send back-to-back acks which would otherwise occur
    due to normal aggregation of receive packets by the ethernet hardware.
    So, for example, a GigE interface typically only generates an interrupt
    once every 8 to 10 received packets (if the packets are coming in at 
    full speed).  If all the packets are associated with the same tcp
    connection, DragonFly will only send one ACK back after processing
    the whole set rather than sending 4 back-to-back acks.  This greatly
    reduces both the return channel bandwidth and the overhead of sending
    the acks, AND reduces the sender's overhead in processing them.
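
    As a rough illustration of the difference described above, here is a
    minimal Python sketch (not the kernel code).  The function names, the
    OrderedDict standing in for tcpcbackq[], and the 8-packet batch are all
    assumptions made up for the example.

```python
from collections import OrderedDict

MSS = 1460  # assumed segment payload size in bytes


def acks_every_second_segment(batch):
    """Classic delayed ACK: emit an ACK on every second segment per connection."""
    seen = {}
    acks = []
    for conn, seq_end in batch:
        seen[conn] = seen.get(conn, 0) + 1
        if seen[conn] % 2 == 0:
            acks.append((conn, seq_end))
    return acks


def acks_once_per_batch(batch):
    """Coalesced ACKs: note each connection once while the batch is processed,
    then emit a single cumulative ACK per connection afterwards."""
    pending = OrderedDict()  # hypothetical stand-in for tcpcbackq[]
    for conn, seq_end in batch:
        pending[conn] = max(seq_end, pending.get(conn, 0))
    return list(pending.items())


# One interrupt delivering 8 full-sized segments for a single connection.
batch = [("conn-A", (i + 1) * MSS) for i in range(8)]
every_second = acks_every_second_segment(batch)
coalesced = acks_once_per_batch(batch)
print(len(every_second), "acks vs", len(coalesced), "ack")
```

    In this toy model both schemes end up acknowledging the same final
    sequence number; the coalesced version just skips the intermediate acks.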

    A 100BaseT interface typically does NOT aggregate packets... the packets
    are coming in too slowly (usually) for such aggregation to occur.  This
    means that tcpcbackq[] would not cause the acks to occur less often than
    every other packet over 100BaseT, not unless the cpu load is very, very
    high.

    It should be noted that DragonFly is *NOT* delaying the ack in the
    time domain.  In fact, the sender will get a more up-to-date ack more 
    quickly because it won't have to wade through 4 ack packets before
    it gets the most up-to-date ack (in the GigE case).

    I have never noticed any performance degradation from this.  I get
    10MBytes/sec over 100BaseT, at least between two DragonFly boxes.
    Sure the congestion window might open up a bit more slowly on some
    senders, but it requires a multi-packet data burst to trigger the
    effect (i.e. 3 or more packets sent back-to-back) and at that point
    the congestion window should already be sufficiently open to not affect
    performance.  At those speeds it would only take a few milliseconds
    at most for the congestion window to open up completely.  Unless your
    link has a lot of packet loss, you shouldn't notice any degradation
    in performance, and even if your link has packet loss you shouldn't
    notice much (because the effect doesn't occur until the congestion
    window is at least 3 packets long).
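
    To put rough numbers on the congestion window argument, here is a
    simplified slow-start model in Python.  It assumes the sender grows the
    window by one segment per ACK received (no byte counting), no loss, and
    a whole window sent as one burst per round trip.  The 64-segment target,
    the initial window of 2, and the 0.2 ms LAN round-trip time are
    illustrative assumptions, not measurements.

```python
def rtts_to_reach(target_segments, segments_per_ack, initial_cwnd=2):
    """Count round trips until cwnd reaches target_segments.

    Simplified model: each ACK received grows cwnd by one segment, and the
    receiver sends one ACK per `segments_per_ack` segments.  A trailing
    partial group is assumed to be ACKed within the same round trip.
    """
    cwnd = initial_cwnd
    rtts = 0
    while cwnd < target_segments:
        acks = -(-cwnd // segments_per_ack)  # ceiling division
        cwnd += acks
        rtts += 1
    return rtts


ASSUMED_RTT_MS = 0.2  # hypothetical LAN round-trip time

for segments_per_ack in (1, 2, 4):
    rtts = rtts_to_reach(64, segments_per_ack)
    print(segments_per_ack, "segments/ack:", rtts, "RTTs,",
          round(rtts * ASSUMED_RTT_MS, 1), "ms")
```

    Under these assumptions fewer acks do mean more round trips before the
    window is fully open, but at LAN round-trip times the total stays in
    the low milliseconds.  On a long-delay path the same number of round
    trips would cost proportionally more.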

    The business about sending one ack for every second segment is a very
    old part of the RFC (if I remember correctly), and might have made
    sense for a 10BaseT connection, but it makes very little sense for
    a 100BaseT or GigE connection with packet aggregation interrupt


    In any case, I *AM* seeing a performance reduction when I FTP with
    a DragonFly box as a receiver over 100BaseT.  I am seeing 7-8MBytes/sec
    instead of 10+MB/sec.  It is NOT related to the way delayed acks work or
    how tcpcbackq[] works, however.  It looks like there is an output delay
    being imposed somewhere but it is occurring outside the TCP stack.  You
    can verify this by doing a tcpdump in the middle of a transfer on the
    sender, and then doing a tcpdump on the receiver.  The receiver believes
    it is sending an ack out every other packet (at 100BaseT speeds), but
    the sender is seeing those acks globbed together.  When I run the same
    test with a FreeBSD box as the receiver the sender is NOT seeing the acks
    globbed together (at least not anywhere near as badly).  Clearly there 
    is something wrong here.  I don't yet know what it is but I am fairly
    sure from the tcpdump output that the TCP stack is not to blame.
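
    One way to quantify the "globbing" when comparing the two traces is to
    group ACK timestamps into bursts.  The sketch below is a hypothetical
    helper, not part of any tool: it assumes the ACK arrival times have
    already been extracted from the tcpdump output as integer microseconds,
    and the sample timestamps and the 50 microsecond threshold are made up
    for illustration.

```python
def burst_sizes(timestamps_us, gap_threshold_us):
    """Group packet timestamps into bursts.

    Two consecutive packets belong to the same burst when they are at most
    gap_threshold_us apart.  Returns the number of packets in each burst.
    """
    bursts = []
    prev = None
    for t in sorted(timestamps_us):
        if prev is not None and t - prev <= gap_threshold_us:
            bursts[-1] += 1
        else:
            bursts.append(1)
        prev = t
    return bursts


# Made-up data: the receiver emits an ACK every 240 us (about two
# full-sized frames at 100 Mbit/s), but the sender sees them in clumps.
receiver_view = [i * 240 for i in range(8)]
sender_view = [1000, 1010, 1020, 1030, 2000, 2010, 2020, 2030]

print("receiver:", burst_sizes(receiver_view, 50))
print("sender:  ", burst_sizes(sender_view, 50))
```

    Evenly spaced acks on the receiver side but multi-packet bursts on the
    sender side would point at a delay somewhere between the TCP stack's
    output and the wire, rather than at the ack generation itself.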

					Matthew Dillon 
					<dillon at xxxxxxxxxxxxx>

:I have been using DragonFlyBSD as a TCP receiver since last night in my
:experiments, and I found that the number of ACK segments sent in reply
:to received data segments is less than expected.
:Normally, ACK segments are sent for every second full-sized data segment.
:  (As many of you know, it is called Delayed ACK and is specified in
:   section of RFC1122 as follows:
:            A TCP SHOULD implement a delayed ACK, but an ACK should not
:            be excessively delayed; in particular, the delay MUST be
:            less than 0.5 seconds, and in a stream of full-sized
:            segments there SHOULD be an ACK for at least every second
:            segment.
:  )
:But the Header Prediction code in DragonFlyBSD TCP sends ACK segments
:less frequently.  It just queues an output request into tcpcbackq[].
:And tcp_willblock() processes the request later.  It seems that
:tcp_willblock() is called less frequently than receiving two
:full-sized data segments in my environment (100Mbps).  (I put printf()'s
:in tcp_input(), tcp_output() and tcp_willblock() and found this.)
:That would be the reason why the number of ACK segments is less than
:expected.
:In my experiments, since DragonFlyBSD sends fewer ACK segments than
:expected, the congestion window in the sender machine grows slowly
:and the TCP performance becomes poor.
:I tried the following:
:  1. "sysctl -w net.inet.tcp.avoid_pure_win_update=0"
:     But my problem was not solved.
:  2. I replaced the code fragment that inserts an output request in
:     Header Prediction with a code that simply calls tcp_output().
:     With this change, the TCP performance becomes normal.
:     (compared with the performance when a Linux box is a receiver.)
:I checked "cvs log".  tcpcbackq[] was introduced on Aug 3, 2004 to
:reduce the number of ACK segments across GbE.  Unfortunately, it reduces
:the TCP performance on 100Mbps path when DragonFlyBSD acts as a receiver.
:I think the same phenomenon will occur when DragonFlyBSD acts as a receiver
:across 10GbE.
:  What I would like to say here is that when acting as a receiver,
:  if the number of ACK segments sent in reply to data segments is reduced,
:  TCP performance from peer node would also be reduced because of
:  the standard congestion control algorithm.
:So, I think it is better to send an ACK segment for every second
:full-sized data segment even on GbE.  But I have not tried
:DragonFlyBSD on GbE yet.  So, I may be wrong.  I am sorry in such
:Noritoshi Demizu
