MSI prototype

Matthew Dillon dillon at apollo.backplane.com
Tue Feb 21 02:00:41 PST 2006


:method used for SYS_RES_IRQ. You might wonder why we don't use the
:same technique as for IRQs and pre-assign vectors. Lazy allocation is
:a better strategy for MSI and MSI-X ("MSI extended") as the capability
:structure can only indicate the max supported number of vectors (up to
:1K vectors in some cases) which may be much larger than the actual
:number used. Therefore, it is best left to the drivers to decide how
:many vectors to allocate AND what to do if they can't get as many as
:they want/need.
:
:The problems
:
:I basically like the bus_alloc_resource and bus_setup_intr philosophy,
:but they currently pose separate problems for MSI and MSI-X.
:
:1) MSI allows multiple vectors per function but requires that the
:vectors are contiguous (e.g. vectors 1,2,3 or 2,3,4 are OK but 2,4,5
:are not). So while the current implementation finds the first unused
:vector, there isn't a way to specify that this must be the first of 3
:free vectors (for example). Additionally, when allocating the
:subsequent vectors, there isn't a way to correlate the current
:allocation with a previous one.
:
:2) MSI-X avoids the above problem because it has separate address/data
:registers for each supported interrupt (i.e. MSI only has a single
:address/data register for ALL interrupts). The problem here is with
:setting up the interrupt. Once an interrupt is allocated, there isn't
:a way to say "use this address/data for the 3rd interrupt" (which
:translates into "write this to the 3rd addr/data registers").
:
:3) Another minor problem is in the resource implementation which
:currently panics (IIRC ...) if the system tries to add more than one
:resource of the same type to the resource list (i.e. can't have
:multiple SYS_RES_MSI). This one I'm less worried about than the
:previous 2.
:
:So, what to do? If it were up to me, I'd be tempted to modify the
:current interfaces to support the additional information needed
:because I like the alloc/setup metaphors. The downside to this is
:modifying "a couple" ;) of drivers to support the new API. Thoughts?
:
:---chuck

    I think (3) isn't a big deal if the MSI irq abstraction is given
    the 'number of vectors' required and returns just the base vector,
    letting the driver do base+0, base+1, base+2 for the individual
    vectors.  One resource would thus represent all N requested contiguous
    IRQs.
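
    A minimal sketch of that "ask for N, get back a base" idea, assuming a
    simple in-use bitmap. The names (vector_inuse, msi_alloc_vectors,
    msi_free_vectors) and the 32-entry vector space are hypothetical and
    are not existing kernel interfaces. Real MSI hardware, as far as the
    PCI specification goes, additionally wants a power-of-two count with a
    matching aligned base; that constraint is left out here.

```c
#include <stdint.h>

#define NVECTORS 32   /* assumed size of the vector space (illustration only) */

/* Bit i set means vector i is already in use. */
uint32_t vector_inuse;

/*
 * Find the lowest run of 'count' contiguous free vectors, mark the whole
 * run as used, and return its base.  Returns -1 if no such run exists.
 */
int
msi_alloc_vectors(int count)
{
    int base, i;

    if (count <= 0 || count > NVECTORS)
        return -1;
    for (base = 0; base + count <= NVECTORS; ++base) {
        for (i = 0; i < count; ++i) {
            if (vector_inuse & ((uint32_t)1 << (base + i)))
                break;          /* run is interrupted, try the next base */
        }
        if (i == count) {
            for (i = 0; i < count; ++i)
                vector_inuse |= (uint32_t)1 << (base + i);
            return base;
        }
    }
    return -1;
}

/* Release a run previously returned by msi_alloc_vectors(). */
void
msi_free_vectors(int base, int count)
{
    int i;

    for (i = 0; i < count; ++i)
        vector_inuse &= ~((uint32_t)1 << (base + i));
}
```

    A driver would then use base+0, base+1, base+2 itself, and a single
    resource entry stands for the whole run.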

    Reservation and allocation can be solved by removing the software
    limits on the number of supported IRQs.  In 1.4 I separated out the
    software interrupt bits from the hardware interrupt bits, increasing
    the number of supported IRQs from 24 to 32.  See the gd_fpending and
    gd_ipending fields in src/sys/i386/include/globaldata.h.
    (Those fields are used to 'remember' interrupts which occurred during
    a critical section and then process them when the critical section is
    released at crit_exit/doreti/splz time.)
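
    The pending-bit mechanism can be sketched roughly as follows. This is
    an illustration only: the structure, field, and function names are
    hypothetical stand-ins, not the actual globaldata layout or the real
    crit_exit/doreti/splz code. It does show why a single 32-bit word caps
    the number of IRQs at 32.

```c
#include <stdint.h>

/* Hypothetical per-cpu state, loosely modeled on the idea of gd_ipending. */
struct gd_sketch {
    uint32_t ipending;   /* one bit per deferred hardware IRQ (max 32) */
    int      critcount;  /* nesting depth of critical sections */
};

int handled[32];         /* how many times each IRQ's handler has run */

void
run_handler(int irq)
{
    handled[irq]++;
}

/* Called when a hardware interrupt arrives. */
void
irq_arrived(struct gd_sketch *gd, int irq)
{
    if (gd->critcount > 0)
        gd->ipending |= (uint32_t)1 << irq;   /* remember it for later */
    else
        run_handler(irq);
}

void
crit_enter_sketch(struct gd_sketch *gd)
{
    gd->critcount++;
}

/* On leaving the outermost critical section, run everything deferred. */
void
crit_exit_sketch(struct gd_sketch *gd)
{
    int irq;

    if (--gd->critcount > 0)
        return;
    for (irq = 0; irq < 32; ++irq) {
        if (gd->ipending & ((uint32_t)1 << irq)) {
            gd->ipending &= ~((uint32_t)1 << irq);
            run_handler(irq);
        }
    }
}
```

    Note that two arrivals of the same IRQ inside one critical section
    collapse into a single pending bit in this sketch.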

    Fortunately, most of the other code dependencies, namely SPL masks, no
    longer exist in DragonFly.  We also now have a fairly nicely abstracted
    interrupt vector management layer, aka src/sys/i386/{apic,icu}.

    If we do further work to increase the number of supported IRQs in
    software to some reasonable number, like 256, then the device driver
    model would be greatly simplified.  We basically would never have to
    worry about running out of IRQs.

    Device driver code tends to assume level-triggered interrupts, so
    adapting the drivers would probably be the hardest bit once the
    infrastructure is in place.

    My understanding of MSI(X) is that it basically treats the interrupt
    as edge-triggered.  On trigger it writes the specified data to the
    specified address which, I assume, would be an IOAPIC or LAPIC address?
    I can't imagine how it would work with an IOAPIC with its stupid 
    register window crap, so I assume the MSI(X) address and data would
    inject an ICR command into the LAPIC.  I don't quite understand how
    the device is able to do that without polling the 'S'end Pending
    bit in the ICR, though.
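
    For reference, here is a sketch of how the address/data pair is
    commonly composed on x86.  The field layout is taken from Intel's
    architecture documentation, not from this thread, and the macro and
    function names are hypothetical.  Under that layout the device does an
    ordinary memory write into the 0xFEExxxxx range, which the chipset
    turns into an interrupt message for the target local APIC, so no ICR
    access appears to be involved.

```c
#include <stdint.h>

/* Assumed layout per Intel's documentation for x86 MSI messages. */
#define MSI_ADDR_BASE           0xFEE00000u  /* bits 31:20 are fixed at 0xFEE */
#define MSI_ADDR_DEST_SHIFT     12           /* bits 19:12: destination APIC ID */
#define MSI_DATA_DELMODE_FIXED  (0u << 8)    /* bits 10:8: delivery mode */
#define MSI_DATA_TRIGGER_EDGE   (0u << 15)   /* bit 15: 0 = edge-triggered */

/* Message address targeting one local APIC (physical destination mode). */
uint32_t
msi_compose_addr(uint8_t apic_id)
{
    return MSI_ADDR_BASE | ((uint32_t)apic_id << MSI_ADDR_DEST_SHIFT);
}

/* Message data: fixed delivery, edge-triggered, given IDT vector. */
uint32_t
msi_compose_data(uint8_t vector)
{
    return MSI_DATA_DELMODE_FIXED | MSI_DATA_TRIGGER_EDGE | vector;
}
```

    With plain MSI there is one such pair per function; with MSI-X there
    is one pair per table entry, which is where "write this to the 3rd
    addr/data registers" comes in.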

						-Matt





