cvs commit: src/lib/libc/sys syslink.2

Matthew Dillon dillon at apollo.backplane.com
Tue Apr 3 10:46:46 PDT 2007


:Hey, that's exacly my area of interest!  So maybe I can contribute a bit =
:to the design...
:
:What do you mean with "optimal physical route"?  There are multiple metri=
:cs I could imagine:
:
:1. shortest hop
:2. maximize minimum bandwidth
:3. lowest latency
:4. node/link availability of the routing nodes

    All of the above.  What I want to do is have a multi-layered approach
    to finding a good route between two nodes.  I want a solid algorithm
    to serve as a fail-safe (such as forging a path by making a broadcast
    to the entire mesh)... it doesn't matter how inefficient it is, it
    just has to work regardless of the state of the network.  Then I want
    to add additional layers which provide for more optimal path discovery.  

    For example, lets say we limit the mesh to 50 broadcasts a second
    across any given link.  We could implement a heuristic that eats up,
    say, 10 broadcasts a second to implement an algorithm that is
    constantly trying to find better paths for existing connections.

    What is important here is that the mesh be able to 'settle down' into
    using more optimal paths within a reasonable period of time after a
    major change to the topology.
    
:for efficiently designing the routing overlay, we also should have an est=
:imate of the maximum number of nodes joining one cluster.  are we talking=
: 100, 1000 or 1000000?

    The limitation is the number of hops and the complexity (number
    of switch elements) per hop.  So, for example, if you want to support
    256 machines on an ethernet switch you would need 8 bits to represent
    the number of elements on that etherswitch.

    A 'physical' address is always relative.  Each individual hop can use
    a different number of bits, but for arguments sake lets say that each
    hop needed 4 bits.  A physical address like this:

    0xA1451CE3114762AE

    Would thus be broken down into blocks of 4 bits.  The originator 
    sends the packet to its route node (everyone has to connect to a
    route node somewhere, where the route node is represented by the
    syslink descriptor returned from the syslink system call).  This
    route node needs to know how to direct the packet.  It would pull
    off the low 4 bits, i.e. the 'E', and use that as an index to forward
    the packet to one of the 16 entities connecting to it in this example,
    and it would rewrite the target address as 0x0A1451CE3114762A.  That
    is, it would 'eat' the bits.

    The next entity would pull off the bits it needs, say another 4 bits,
    then forward the packet and rewrite the address again.

    This would continue until there are no bits left... well, in reality,
    until the address becomes '0' (since we are shifting 0 bits into the
    MSB).

    For any given link, all 0's always addresses the route node 
    itself, and all 1's represents a broadcast.

    It is important to note that the 'physical' address is a relative
    route, not an absolute address.  The physical address for node X
    will be different for a packet sent from node A verses a packet
    sent from node B.  This is the basic core needed to create a
    self-identifying network.

    Forging a route will be the function of the routing node (the syslink
    entity), and NOT usually a leaf on the network.  If you have 14
    entities connected to a syslink node, that node can forge a route
    to a target that can then be cached and supplied to all 14 of those
    entities.  We wouldn't have to forge 14 separate routes... that is,
    the trivial collapse cases would be collapsed.

    Also, it needs to be noted that the resources that are being routed
    are fairly coarse.  i.e. we might route to a filesystem, not to an
    individual file, which means that if there are 5000 processes talking
    to that filesystem, there is still only one route (well, maybe a 
    few in order to support multi-pathing or multi-homed targets as a 
    means of improving the bandwidth to said targets)... but not 5000.

:there are a number of possible self-structuring methods (preferably incre=
:mental, dynamic) to produce an effective mesh.  also, should we optimize =
:on writer vs readers, or are all nodes considered writers and readers?  o=
:ooh, that sounds exciting :)
:
:cheers
:  simon

    All nodes are considered to be full duplex.  Any given transaction of
    course has an originator and a target, but the major data flow can be in
    both directions (e.g. a read request verses a write request).

    The return direction need not use the same route that the sending
    direction used.  In fact, we would probably want the two directions
    to be separately routed in order to be able to make use of known
    or heuristically detected bandwidth delay limitations.  If someone
    connects to two parts of the mesh via a cable modem at home we do not
    want to try to route 50 MBits/sec of traffic over that link to get
    from one side of the mesh to the other!  heh.

    Even though the return direction does not need to use the same route
    in reverse, the message structure will construct a reverse route as
    the message is being forwarded (the syslink_msg structure has two 
    physical fields... one for the source address, one for the target.
    The target field is deconstructed as the message is routed while the
    source field is constructed as the message is routed).

					-Matt
					Matthew Dillon 
					<dillon at backplane.com>





More information about the Commits mailing list