DP performance
Marko Zec
zec at icir.org
Tue Nov 29 04:23:28 PST 2005
On Monday 28 November 2005 22:13, Matthew Dillon wrote:
> If we are talking about maxing out a machine in the packet
> routing role, then there are two major issues that have to be
> considered:
>
> * Bus bandwidth. e.g. PCI, PCIX, PCIE, etc etc etc. A standard
> PCI bus is limited to ~120 MBytes/sec, not enough for even a single
> GigE link going full duplex at full speed. More recent buses can do
> better.
>
> * Workload separation. So e.g. if one has four interfaces and
> two cpus, each cpu could handle two interfaces.
>
> An MP system would not reap any real gains over UP until one had
> three or more network interfaces, since two interfaces is no
> different from one interface from the point of view of trying to
> route packets.
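To put rough numbers on the bus point: a single GigE NIC running full
duplex at line rate moves ~125 MB/s across the bus in each direction,
~250 MB/s total, roughly double what a classic 32-bit/33 MHz PCI bus
delivers. A back-of-the-envelope sketch in C (the 120 MB/s figure is
the usual practical estimate, an assumption rather than a
measurement):

    /*
     * Can a classic 32-bit/33 MHz PCI bus carry one GigE link at full
     * duplex line rate?  RX traffic DMAs in across the bus and TX
     * traffic DMAs out, so both directions load the same bus.
     */
    #include <stdio.h>

    int
    main(void)
    {
        double pci_mbs  = 120.0;          /* practical PCI throughput, MB/s */
        double gige_mbs = 1000.0 / 8.0;   /* 125 MB/s per direction */
        double needed   = 2.0 * gige_mbs; /* RX DMA in + TX DMA out */

        printf("PCI budget ~%.0f MB/s, needed ~%.0f MB/s\n",
            pci_mbs, needed);
        printf("shortfall before descriptor overhead: ~%.0f MB/s\n",
            needed - pci_mbs);
        return (0);
    }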
Should we really be that pessimistic about potential MP performance,
even with only two NICs? Packet flows are typically bi-directional,
and if we could have one CPU/core taking care of each direction, there
should be at least some room for parallelism, especially once the
parallelized routing tables see the light of day. That assumes, of
course, that each NIC is handled by a separate core, and that IPC
doesn't become the actual bottleneck.
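As a toy illustration of the kind of separation I mean (the interface
names and the modulo mapping below are made up for the example,
nothing DragonFly-specific):

    /*
     * Illustrative only: steer each NIC's RX path to a fixed core, so
     * the two directions of a bi-directional flow are processed in
     * parallel and never contend for the same CPU.
     */
    #include <stdio.h>

    #define NCPUS 2

    /* Hypothetical rule: NIC i is serviced by CPU i mod NCPUS. */
    static int
    cpu_for_nic(int ifindex)
    {
        return (ifindex % NCPUS);
    }

    int
    main(void)
    {
        const char *nic[] = { "em0", "em1" };
        int i;

        for (i = 0; i < 2; i++)
            printf("%s RX (one flow direction) -> cpu%d\n",
                nic[i], cpu_for_nic(i));
        return (0);
    }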
> Main memory bandwidth used to be an issue but isn't so much any
> more.
Memory bandwidth isn't, but latency _is_ now the major performance
bottleneck, IMO. DRAM access latencies are now in the 50 ns range and
will not noticeably decrease in the foreseeable future. Consider the
number of independent memory accesses that need to be performed on a
per-packet basis: DMA RX descriptor read, DMA RX buffer write, DMA RX
descriptor update, RX descriptor update/refill, TX descriptor update,
DMA TX descriptor read, DMA TX buffer read, DMA TX descriptor
update... Without doing any smart work at all we have to waste a few
hundred ns of DRAM bus time per packet, provided we are lucky and the
memory bus is not congested. So to push forwarding performance
anywhere above 1 Mpps, UP or MP, having the CPU touch DRAM in the
forwarding path has to be avoided like the plague. Stack
parallelization seems to be the right step in this direction.
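To put a number on it (assuming ~50 ns per access and the eight
accesses listed above; the exact count naturally varies by NIC and
driver):

    /*
     * Rough per-packet DRAM time budget and the packet rate ceiling
     * it implies on an uncongested memory bus.
     */
    #include <stdio.h>

    int
    main(void)
    {
        double ns_per_access = 50.0;  /* assumed DRAM access latency */
        int    accesses      = 8;     /* descriptor + buffer touches */
        double ns_per_pkt    = ns_per_access * accesses;

        printf("~%.0f ns of DRAM time per packet\n", ns_per_pkt);
        printf("=> ceiling of ~%.1f Mpps\n", 1e9 / ns_per_pkt / 1e6);
        return (0);
    }

That works out to ~400 ns per packet, i.e. a hard ceiling around
2.5 Mpps even before the CPU does any useful work.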
Cheers
Marko
> Insofar as DragonFly goes, we can almost handle the workload
> separation case now, but not quite. We will be able to handle it
> with the work going in after the release. Even so, it will probably
> only matter if the majority of packets being routed are tiny. Bigger
> packets eat far less cpu for the amount of data transferred.
>
> -Matt