Next: About this document ... Up: NIST Switch Project Description Previous: Architecture diagram

NIST Switch design and hardware notes

Supported operating systems

For Linux, we are starting with a 2.0.33 level kernel, but will need to migrate to 2.1.9x ($x \geq 8$) to use the new queueing support. We'll also need to be at that level for IPv6 support. For FreeBSD, we use the latest 3.0+ALTQ version.

The main point in getting code to work on two different operating systems is to provide real proof of its portability. (As in the old slogan, ``There is no such thing as portable code, only code which has been ported.'') It is not really an attempt to increase some potential "market size" for NIST Switch. Hence, if it proves difficult to provide some desirable, but not essential, feature on one or the other operating system, because of a major incompatibility or omission in the operating system base, we will generally not attempt to remedy it. (As a particular instance, if we can't find a sufficiently up-to-date version of RSVP for Linux, we will simply not implement RSVP-based label distribution for Linux.) For our purposes, it is sufficient that both implementations provide enough functionality that they can interoperate in at least a minimal fashion. (This amounts to ``basic'' label handling and distribution for Ethernet and IPv4.)

SMP support

SMP kernel support is available as extensions to current Linux and FreeBSD distributions. The main design considerations we have in making our code ``SMP-almost-ready'' at this point are:

To the extent possible, shared data structures should be designed to allow multiple simultaneous readers, but generally only one writer. Where allowing multiple writers is necessary (for example, in attaching packets to output queues), writes need to be fast enough that exclusive locking of the data structure won't have too adverse an impact. Global locks should never be presupposed.

When updating shared structures, it is good if they can always be kept in a consistent state, so that no locks on readers are required (a corollary to the previous point). In particular, it is almost certainly a bad idea for code running at task time to try to lock out interrupt-time code.

The locking mechanisms used in the code need to be isolated so they can be appropriately defined in each version.

Extendible/replaceable label routing tables

The label table routines will be packaged as a kernel module, with a set of standard interfaces defined. The one currently implemented is (hopefully) SMP-safe, with a reasonably fast lookup structure. The main thing remaining to do here is allow more general definition and generation of indexing fields.
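As a rough illustration of the ``many readers, one writer, always consistent'' approach described above, here is a minimal sketch in Python. It is not the NIST Switch kernel module: the names `LabelEntry`, `LabelTable`, `install`, `remove` and `lookup` are hypothetical, and the whole-table copy is used only to keep the example short. The idea shown is that writers build a new table privately and publish it with a single reference swap, so readers never need a lock.

```python
import threading
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class LabelEntry:
    """Hypothetical forwarding entry: where a labeled packet goes next."""
    out_interface: str
    out_label: int


class LabelTable:
    """Label table with lock-free readers and serialized writers."""

    def __init__(self) -> None:
        self._entries: Dict[int, LabelEntry] = {}  # currently published snapshot
        self._writer_lock = threading.Lock()       # excludes other writers only

    def lookup(self, label: int) -> Optional[LabelEntry]:
        # Readers take no lock: they use whichever snapshot is current.
        return self._entries.get(label)

    def install(self, label: int, entry: LabelEntry) -> None:
        with self._writer_lock:
            updated = dict(self._entries)  # copy and modify privately
            updated[label] = entry
            self._entries = updated        # publish with one reference swap

    def remove(self, label: int) -> None:
        with self._writer_lock:
            updated = dict(self._entries)
            updated.pop(label, None)
            self._entries = updated


table = LabelTable()
table.install(17, LabelEntry("eth1", 42))
```

A reader calling `table.lookup(17)` sees either the old snapshot or the new one, never a half-updated table. In kernel code the same effect would presumably come from a pointer swap with suitable memory barriers, hidden behind the per-platform locking definitions.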

Replaceable queueing algorithms

The BSD queueing algorithms are based on ALTQ from Sony. The Linux native versions are in the new kernel code, but not yet thoroughly tested. At this time, it looks like they should be sufficient for at least diffserv-style class of service handling, though. This (and some form of explicit routing) seems a reasonable minimal goal for the label-based QoS we need to implement.

[Figure: queueing.pstex]

For both implementations, we structure the queueing code to be callable both from the label level (for forwarding labeled packets) and from the IP level (for labeling, where appropriate, and routing plain IP packets). This allows proper queueing treatment in a router which handles both exterior (non-labeled) and interior (labeled) requests, including the case where a labeled route extends only partway through a domain, so the packet must ``bubble up'' to the IP layer for rerouting. The queueing code still needs to be updated to be SMP-safe, but this is a relatively trivial change.
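The shared entry point described above might be pictured as follows. This is only a sketch under stated assumptions: the tables `LABEL_TABLE` and `IP_ROUTES`, the two service classes, and the strict-priority `dequeue` are all made up for illustration and stand in for the real label table, IP classifier and ALTQ or Linux schedulers.

```python
from collections import deque
from typing import Deque, Dict, Optional, Tuple

# Hypothetical label table: incoming label -> (outgoing label, service class).
LABEL_TABLE: Dict[int, Tuple[int, str]] = {17: (42, "premium")}

# Hypothetical IP classifier: destination -> (label or None, service class).
IP_ROUTES: Dict[str, Tuple[Optional[int], str]] = {
    "10.0.0.5": (17, "premium"),
    "10.0.9.9": (None, "best-effort"),
}

QUEUES: Dict[str, Deque[dict]] = {"premium": deque(), "best-effort": deque()}


def enqueue(packet: dict, service_class: str) -> None:
    """Single queueing entry point shared by the label and IP levels."""
    QUEUES[service_class].append(packet)


def forward_ip(packet: dict) -> None:
    """IP level: attach a label if one applies, then queue by service class."""
    label, service_class = IP_ROUTES.get(packet["dst"], (None, "best-effort"))
    enqueue(dict(packet, label=label), service_class)


def forward_labeled(packet: dict) -> None:
    """Label level: swap the label, or bubble up to IP if the path ends here."""
    entry = LABEL_TABLE.get(packet["label"])
    if entry is None:
        forward_ip(dict(packet, label=None))  # labeled route ended: reroute at IP
        return
    out_label, service_class = entry
    enqueue(dict(packet, label=out_label), service_class)


def dequeue() -> Optional[dict]:
    """Strict priority between two classes; a stand-in for a real scheduler."""
    for name in ("premium", "best-effort"):
        if QUEUES[name]:
            return QUEUES[name].popleft()
    return None
```

A packet arriving with an unknown label takes the ``bubble up'' path through `forward_ip` and still ends up in the same set of queues, which is the property the structure is meant to provide.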

Hooks for label distribution protocols and path determination algorithms

Our assumption is that these generally run at user level, feeding information into the routing tables as needed. RSVP label distribution gets the label information into the kernel through its own means; for other label distribution protocols, we need to define a set of label APIs.


As mentioned, we view labels as corresponding to path or tree segments with defined QoS characteristics, e.g., assigned bandwidth or delay properties. These sticks will in general form a covering forest for the network graph.

From a graph-theoretic viewpoint, the question of interest then becomes how to allocate a labeled covering forest to achieve the desired routing goals. Typically the number of labels allowed will be fairly limited (say, in the low thousands), allowing only some fraction of the set of possible useful paths to be specified. While the details are too much to consider here, this is a quick description of a heuristic method for label allocation.

First, reserve some fraction (say, for the sake of simplicity, 1/2) of the label table space for ``on-demand'' allocated labels (through RSVP or the like). Then, for the remaining pool of labels to be preallocated, proceed as follows until space is exhausted:

Label implementation for Ethernet

There are two proposed methods of implementing labels for Ethernet. The ``shim'' proposal (draft-rosen-tag-stack-03.txt) simply inserts the label between the normal Ethernet header and the IP header. The ``MAC'' proposal (draft-srinivasan-mpls-lans-label-00.txt) actually makes the (24-bit) label part of the Ethernet destination MAC address. It would be useful to implement both proposals, for the purpose of meaningful comparisons of how they work on actual networks.

To implement the MAC proposal, we can use the ability of certain cards to respond to multiple Ethernet addresses. In particular, the ``Tulip'' cards (such as the SMC EtherPower series (9332)) have a table of up to 16 addresses they will respond to. Many cards have (for multicast support) a small (e.g. 3 entry) hash table for picking out addresses they will respond to. For any card, we can always put it in ``promiscuous'' mode to pick up all packets - maybe not great in general, but enough for testing purposes. To implement the shim proposal, and support for multiple labels with MAC, we can use the ability of certain cards to accept large (> 1518 bytes) packets. This avoids problems with packet fragmentation for now.
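To make the two encodings and the frame-size issue concrete, here is a small sketch. Several details are assumptions rather than facts taken from the drafts: the shim field layout (20-bit label, 3-bit class of service, 1-bit bottom-of-stack flag, 8-bit TTL) is the one later standardized for MPLS label stacks, the EtherType `0x8847` is the value later assigned to MPLS unicast, and the MAC layout (a 3-byte prefix followed by the 24-bit label) is purely illustrative. The function names are hypothetical.

```python
import struct

SHIM_ETHERTYPE = 0x8847  # assumption: EtherType later assigned to MPLS unicast
FCS_LENGTH = 4           # Ethernet frame check sequence, appended by the card


def shim_entry(label: int, cos: int = 0, bottom: bool = True, ttl: int = 64) -> bytes:
    """Pack one 32-bit entry: 20-bit label, 3-bit CoS, 1-bit S, 8-bit TTL."""
    if not 0 <= label < (1 << 20):
        raise ValueError("shim label must fit in 20 bits")
    word = (label << 12) | ((cos & 0x7) << 9) | (int(bottom) << 8) | (ttl & 0xFF)
    return struct.pack("!I", word)


def shim_frame(dst: bytes, src: bytes, label: int, ip_packet: bytes) -> bytes:
    """Shim style: the label sits between the Ethernet header and the IP header."""
    header = dst + src + struct.pack("!H", SHIM_ETHERTYPE)
    return header + shim_entry(label) + ip_packet


def mac_label_address(prefix: bytes, label: int) -> bytes:
    """MAC style (assumed layout): 3-byte prefix followed by the 24-bit label."""
    if len(prefix) != 3:
        raise ValueError("prefix must be 3 bytes")
    if not 0 <= label < (1 << 24):
        raise ValueError("MAC label must fit in 24 bits")
    return prefix + label.to_bytes(3, "big")


def wire_length(frame: bytes) -> int:
    """Frame length as counted against the 1518-byte limit (includes the FCS)."""
    return len(frame) + FCS_LENGTH
```

With a full 1500-byte IP packet, the shim frame comes to 14 + 4 + 1500 + 4 = 1522 bytes on the wire, 4 bytes over the usual 1518-byte limit. This is why cards that accept oversize frames let us sidestep fragmentation. The MAC-style address leaves the frame length unchanged for a single label.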

For Ethernet, we will first target Digital ``Tulip''-chip based boards, such as the SMC EtherPower series (9332), and the 3Com ``Vortex/Boomerang'' cards, such as the 3c90X series. These cards offer well-tested drivers, good performance and easy assignment of multiple addresses. Both cards can also send and accept large packets (4.5K for the 3Com boards, 48956 bytes for the Tulip-based boards). They are also reasonably cheap and common. There are also multiport Tulip-based cards such as the 4-port Znyx ZX345 and the Adaptec Cogent EM400 which would be useful for larger-scale testing of label switching implementations.

(As an aside, actually implementing both proposals reveals a technical difficulty with the MAC proposal: the position chosen for the label by necessity always falls across a word boundary. This is not a difficulty for any custom label-switching hardware, but for a general-purpose computer implementation like this one, it is a bit unfortunate. Of course, given the code path length even for the best case in label-switched routing, the extra memory references are not that significant (and are certainly overwhelmed by the savings if any packet fragmentation can be avoided), but esthetically it's displeasing.)

Ethernet switches

In implementing label switching, we have a similar fortunate situation with Ethernet switches. The 3Com SuperStack II 3000s support a ``switch database'' of some 4080 MAC addresses, which we can stock as desired. MAC addresses can be added to the databases either through discovery (by sending out label distribution messages tagged with the desired MAC addresses as the sender's source addresses) or via SNMP requests. At the low end, the cheap hubs will blast everything to everybody in any case.
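The discovery route can be pictured with a generic learning-switch model. This is a simplified sketch, not a description of the 3Com firmware: the class `LearningSwitch`, its `receive` method, and the policy of ignoring new addresses once the database is full are all assumptions made for illustration.

```python
from typing import Dict, List


class LearningSwitch:
    """Toy learning switch with a bounded MAC address database."""

    def __init__(self, ports: List[int], capacity: int = 4080) -> None:
        self.ports = ports
        self.capacity = capacity
        self.database: Dict[bytes, int] = {}  # MAC address -> port

    def receive(self, in_port: int, src: bytes, dst: bytes) -> List[int]:
        """Learn the source address, then return the ports the frame leaves on."""
        if src in self.database or len(self.database) < self.capacity:
            self.database[src] = in_port
        out_port = self.database.get(dst)
        if out_port is None:
            # Unknown destination: flood to every port except the incoming one.
            return [p for p in self.ports if p != in_port]
        return [] if out_port == in_port else [out_port]
```

In this model, a router on port 2 that sends a label distribution message with a label-bearing MAC address as its source causes the switch to record that address against port 2. Later frames addressed to that label then leave on port 2 only instead of being flooded.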

The 3Com switches also accept large packets, so we can again avoid the fragmentation issue for shim.

Label implementation for ATM

For ATM, it looks like the Efficient Networks ENI-155p-U5-S (155 Mb/s rate, PCI, 2MB RAM, UTP-5 cable) is a good choice, with Linux and FreeBSD drivers available and support for 1,024 VCIs. It is the main card used by the Linux ATM developers. Another reasonable choice is the SMC ATM Power155 9746D, with very similar specs and performance characteristics. The 9741D is also similar, but with only 512K RAM (though this should suffice for our purposes). There's also a Fore Systems card, the PCA-200EPC/UTP5, with 256K RAM, which has been used successfully.

Unfortunately, these cards are not really cheap. The Fore card is around $640 (also available under SEWP II, but for $734), the SMC 9741D around $700, and the others over $1000. The Efficient Networks cards are also somewhat hard to find (though we do have a few in stock). A cheaper alternative is the Adaptec ANA-5930, at around $350, with alpha-level Linux drivers supposedly available somewhere. There are other cheaper cards, too (including 25 Mb/s cards under $200), but I haven't found any Linux drivers for them.

System platform

To deploy a larger environment for meaningful testing of label distribution protocols, we need a good high-performance platform which can support a number of Fast Ethernet interfaces. ASUS, FIC and Tyan currently have 440BX (100MHz system bus) AGP motherboards that include 5 or more PCI slots, all available for network cards, since there is an AGP slot for video. With a fast processor and memory in this configuration, the limit in switching speed will probably be the Ethernet cards and/or PCI bus.
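A back-of-the-envelope calculation suggests why the PCI bus is a plausible bottleneck. The sketch below assumes a single shared 32-bit, 33 MHz PCI bus, ignores all bus overhead, and assumes every forwarded packet crosses the bus twice (card to memory, then memory to card). These assumptions and the function name `bus_load_fraction` are illustrative, not measurements.

```python
PCI_BUS_BITS = 32
PCI_CLOCK_HZ = 33_000_000  # nominal 33 MHz clock, one transfer per cycle at best

# Theoretical peak bus throughput in Mbit/s (real throughput is lower).
pci_peak_mbps = PCI_BUS_BITS * PCI_CLOCK_HZ // 1_000_000


def bus_load_fraction(cards: int, line_rate_mbps: int = 100, crossings: int = 2) -> float:
    """Fraction of peak PCI bandwidth used when every card receives at line rate."""
    return cards * line_rate_mbps * crossings / pci_peak_mbps
```

Under these assumptions the peak is about 1056 Mbit/s, and five Fast Ethernet cards receiving at full rate would need about 1000 Mbit/s of bus traffic, roughly 95% of that peak. A sixth card would exceed it. In practice the usable bus bandwidth is well below the theoretical figure, so the bus would likely saturate before all the cards do.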

There is interest in getting things like NIST Switch running in an SMP environment. This is as much to ensure ``purity'' of design and implementation as to show incredible performance. Here are some ATX form factor 440BX and 440LX dual processor motherboards with AGP which are supposed to be supported by Linux SMP (and probably FreeBSD SMP as well). I've included prices (as of 4/27) from Price Watch.
Name                          PCI slots  ISA slots  SCSI            DIMM sockets  Price

440BX (100MHz)
Abit IT6B                     4          2          Y               4             (na)
Supermicro P6DBE              4          3          N               4             $285
Supermicro P6DBS              4          3          Y               4             $408
Tyan S1832DL Tiger            5          2          N               4             $265
Tyan S1836 DLUAN (Thunder)    6          1          Y (sound, too)  4             $585

440LX (66MHz)
ASUS P2L97-D                  5          2          N               3             $298
ASUS P2L97-DS                 4          2          Y               4             $358
Intel DK440LX                 4          2          Y               4             $473
Supermicro P6DLE              4          3          N               4             $225
Supermicro P6DLS              4          3          Y               4             $309
Tyan S1692DL (Tiger 2)        5          2          N               4             $225
Tyan S1696DLUA (Thunder 2)    5          3          Y               4             $325

For network-type development, SCSI seems unnecessary, and many PCI slots are nice. So of the ones listed, the Tyan S1832DL Tiger looks like the best bet.

Mark Carson