em driver - issue #2

EM1897 at aol.com
Sun Feb 6 13:04:28 PST 2005


In a message dated 2/6/2005 3:10:50 PM Eastern Standard Time, Matthew Dillon 
<dillon at xxxxxxxxxxxxxxxxxxxx> writes:

>:I think there are a couple of things wrong with that solution.
>:First, controllers know what to do with empty descriptors, in that they
>:fall into an RNR condition. That's part of the basic design. It's the
>:driver's responsibility to clear up such conditions. At 145Kpps, you're
>:not going to achieve much by trying to fool the driver into thinking that
>:it has memory, except losing a lot of packets. The point of the RNR
>:condition is to get the other end to stop sending until you can handle
>:it. The driver is doing something wrong in this case, and it needs to be
>:cleaned up properly.
>:
>:The second thing that's wrong is that the "problem" is that the memory
>:MUST be available. That has to be corrected. It's not acceptable for it
>:to fail the way it's failing. There's no excuse for a system with 20K
>:clusters supposedly allocated to not be able to get the 1600th cluster
>:because of a "bucket" problem. The reason that many drivers don't handle
>:the "can't get memory" condition is that it almost never happens in
>:real-world scenarios. It's a serious problem that it happens so quickly;
>:1000 packets at gigabit speeds is a tiny amount of time. It makes little
>:sense to redesign the mbuf system only to leave it with such an
>:inefficiency. I don't know how other OSes do it, but they don't fail the
>:way dfly does in this instance.
>
>    Well, we haven't resolved why the memory allocation is failing.  You
>    need to do a vmstat -m to see the real memory use.  A machine which
>    has, say, a gigabyte of RAM will allow the mbuf subsystem to allocate
>    ~100 MBytes by default.
>
>    In the current design the processing of the input packet is decoupled
>    from the network interrupt.  This means that the machine can potentially
>    handle a 145Kpps rate at the interrupt layer but still not have
>    sufficient cpu to actually process packets at that rate.  The packets
>    are almost certainly backing up on the message port to the protocol
>    threads.
>
>    So there's a tradeoff here...  we can virtually guarantee that memory
>    will be available if we flow control the interface or start to drop
>    packets early, or we can allow the interrupt to queue packets at the
>    maximum rate and allow memory allocations to fail if the memory limit
>    is reached.  But we can't do both.  If you stuff in more packets than
>    the cpu can handle and don't flow control it, the machine will hit its
>    allocation limit no matter how much memory is reserved for the network.
>
>    Perhaps what is needed here is some sort of feedback so the network
>    interrupt can be told that the system is running low on mbufs and flow 
>    control itself before we actually run out.
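
The per-driver half of that doesn't have to be elaborate. The usual
defensive pattern (sketched below with made-up names, not the actual em(4)
code) is to allocate the replacement cluster before handing a packet up,
and to recycle the old cluster when the allocation fails, so the ring
never empties into an RNR state:

    /*
     * Hypothetical refill pattern, not the actual em(4) code: allocate
     * the replacement cluster first; on failure, keep the old cluster
     * in the descriptor and drop just this packet, so the ring never
     * runs empty and the chip never wedges in an RNR state.
     */
    #include <stddef.h>

    struct rx_desc {
            void    *buf;   /* cluster currently owned by this descriptor */
            size_t   len;   /* length of the received frame */
    };

    extern void *cluster_alloc(void);            /* stand-in for the mbuf allocator */
    extern void  deliver(void *buf, size_t len); /* hand the packet up the stack */

    void
    rx_input(struct rx_desc *d)
    {
            void *fresh = cluster_alloc();

            if (fresh == NULL) {
                    /* Out of clusters: drop this one packet, keep the ring full. */
                    return;
            }
            deliver(d->buf, d->len);        /* pass the filled cluster up */
            d->buf = fresh;                 /* re-arm the descriptor */
    }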

One thing that's needed, if it's not there already, is a receive/network
queue threshold so that you don't queue an unbounded number of packets.
The value could reasonably be tied to the memory settings (perhaps even
be dynamic, as an option, based on what's available in terms of clusters).
It's much better to drop packets gracefully than to let memory run out,
because the processing in these drivers can get very ugly when you have
to clean up the rings (as I'm sure Joerg can attest to), and as I said,
many drivers don't handle that case well. I would make the queue threshold
a tunable, since it's rather easy to predict what any particular machine
can handle based on its horsepower, with perhaps a 0 setting for "dynamic"
based on memory available.
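
To be concrete, the check I have in mind is nothing fancier than this
(hypothetical names, purely a sketch):

    /*
     * Hypothetical queue threshold, names made up: a cap on packets
     * queued from the interrupt to the protocol threads, with 0
     * meaning "derive the limit dynamically from the free cluster
     * count".
     */
    extern int net_queue_limit;          /* the proposed tunable */
    extern int clusters_free(void);      /* stand-in for an mbuf-pool query */

    struct pktq {
            int     depth;               /* packets currently queued */
            /* ... list head, lock, etc. ... */
    };

    /* Returns 1 if the packet was queued, 0 if it was dropped early. */
    int
    pktq_enqueue(struct pktq *q, void *m)
    {
            int limit = net_queue_limit;

            if (limit == 0)                         /* "dynamic" setting */
                    limit = clusters_free() / 2;    /* leave headroom for the rest */

            if (q->depth >= limit) {
                    /*
                     * Drop gracefully here, while the driver rings are
                     * still fully populated, instead of running the
                     * cluster pool dry further down.
                     */
                    return (0);
            }
            q->depth++;
            /* ... append m to the queue, wake the protocol thread ... */
            (void)m;
            return (1);
    }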

It would be good if a user could decide that mbuf allocations should be
a priority. For a normal system the current settings are likely adequate,
while a network appliance would require a different set of priorities.
What we do is test the absolute capacity of the box to forward packets
under best-case conditions, so it would be rather easy to set a
threshold.

Also, in FreeBSD I tune the kernel to use about 120M of RAM no matter
how much is in the system, as that's about the most that is needed.
I assume that kern.vm.kmem.size works the same in dfly? Is the default
formula still 1/3 of available memory allocated to the kernel? I admit
that I haven't done this on my dfly test system, so I'll have to try it
and see what difference it makes. My test system only has 256M in it,
and now that I think of it, without tuning that may only allocate about
80M to the kernel, which could contribute to the problem. Still, it
seems to me that if a user overrides the default mbuf cluster count in
the kernel config, those clusters should be preallocated: the entire
point of changing the setting is that you really NEED that many, and you
want to avoid running out at all costs. Isn't the point of making the
mbuf cluster count a kernel option that the clusters can be
preallocated, since the kernel knows how many it needs in advance?
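
In other words, something like the following at boot time (hypothetical
code; a real kernel would wire pages rather than malloc() them) would
make the option a guarantee instead of a ceiling:

    /*
     * Hypothetical boot-time preallocation, so a configured cluster
     * count is a guarantee rather than a ceiling.  Illustrative only:
     * a real kernel would wire pages, not call malloc().
     */
    #include <stdlib.h>

    #define NMBCLUSTERS     20000   /* value from the kernel config */
    #define MCLBYTES        2048    /* standard cluster size */

    static void *cluster_pool[NMBCLUSTERS];

    void
    mbuf_prealloc(void)
    {
            int i;

            for (i = 0; i < NMBCLUSTERS; i++) {
                    cluster_pool[i] = malloc(MCLBYTES);
                    if (cluster_pool[i] == NULL)
                            abort();  /* fail at boot, not at the 1600th cluster */
            }
    }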




