pkg://howto-html-en-10.1-4mdv2008.1.noarch.rpm:16697283/
usr/
share/
doc/
HOWTO/
HTML/
en/
Parallel-Processing/Parallel-Processing-HOWTO-3.html
info downloads
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<HTML>
<HEAD>
<META NAME="GENERATOR" CONTENT="SGML-Tools 1.0.9">
<TITLE>Linux Parallel Processing HOWTO: Clusters Of Linux Systems</TITLE>
<LINK HREF="Parallel-Processing-HOWTO-4.html" REL=next>
<LINK HREF="Parallel-Processing-HOWTO-2.html" REL=previous>
<LINK HREF="Parallel-Processing-HOWTO.html#toc3" REL=contents>
</HEAD>
<BODY>
<A HREF="Parallel-Processing-HOWTO-4.html">Next</A>
<A HREF="Parallel-Processing-HOWTO-2.html">Previous</A>
<A HREF="Parallel-Processing-HOWTO.html#toc3">Contents</A>
<HR>
<H2><A NAME="s3">3. Clusters Of Linux Systems</A></H2>
<P>
<P>This section attempts to give an overview of cluster parallel
processing using Linux. Clusters are currently both the most popular
and the most varied approach, ranging from a conventional network of
workstations (<B>NOW</B>) to essentially custom parallel machines
that just happen to use Linux PCs as processor nodes. There is also
quite a lot of software support for parallel processing using clusters
of Linux machines.
<P>
<H2><A NAME="ss3.1">3.1 Why A Cluster?</A>
</H2>
<P>
<P>Cluster parallel processing offers several important advantages:
<P>
<UL>
<LI>Each of the machines in a cluster can be a complete system,
usable for a wide range of other computing applications. This leads
many people to suggest that cluster parallel computing can simply
claim all the "wasted cycles" of workstations sitting idle on people's
desks. It is not really so easy to salvage those cycles, and it will
probably slow your co-worker's screen saver, but it can be done.
</LI>
<LI>The current explosion in networked systems means that most of the
hardware for building a cluster is being sold in high volume, with
correspondingly low "commodity" prices as the result. Further savings
come from the fact that only one video card, monitor, and keyboard are
needed for each cluster (although you may need to swap these into each
machine to perform the initial installation of Linux, once running, a
typical Linux PC does not need a "console"). In comparison, SMP and
attached processors are much smaller markets, tending toward somewhat
higher price per unit performance.
</LI>
<LI>Cluster computing can <EM>scale to very large systems</EM>.
While it is currently hard to find a Linux-compatible SMP with many
more than four processors, most commonly available network hardware
easily builds a cluster with up to 16 machines. With a little work,
hundreds or even thousands of machines can be networked. In fact, the
entire Internet can be viewed as one truly huge cluster.
</LI>
<LI>The fact that replacing a "bad machine" within a cluster is
trivial compared to fixing a partly faulty SMP yields much higher
availability for carefully designed cluster configurations. This
becomes important not only for particular applications that cannot
tolerate significant service interruptions, but also for general use
of systems containing enough processors so that single-machine
failures are fairly common. (For example, even though the average
time to failure of a PC might be two years, in a cluster with 32
machines, the probability that at least one will fail within 6 months
is quite high.)</LI>
</UL>
<P>
<P>OK, so clusters are free or cheap and can be very large and highly
available... why doesn't everyone use a cluster? Well, there are
problems too:
<P>
<UL>
<LI>With a few exceptions, network hardware is not designed for
parallel processing. Typically latency is very high and bandwidth
relatively low compared to SMP and attached processors. For example,
SMP latency is generally no more than a few microseconds, but is
commonly hundreds or thousands of microseconds for a cluster. SMP
communication bandwidth is often more than 100 MBytes/second; although
the fastest network hardware (e.g., "Gigabit Ethernet") offers
comparable speed, the most commonly used networks are between 10 and
1000 times slower.
The performance of network hardware is poor enough as an <EM>isolated
cluster network</EM>. If the network is not isolated from other
traffic, as is often the case using "machines that happen to be
networked" rather than a system designed as a cluster, performance can
be substantially worse.
</LI>
<LI>There is very little software support for treating a cluster as a
single system. For example, the <CODE>ps</CODE> command only reports the
processes running on one Linux system, not all processes running
across a cluster of Linux systems.</LI>
</UL>
<P>
<P>Thus, the basic story is that clusters offer great potential, but that
potential may be very difficult to achieve for most applications. The
good news is that there is quite a lot of software support that will
help you achieve good performance for programs that are well suited to
this environment, and there are also networks designed specifically to
widen the range of programs that can achieve good performance.
<P>
<H2><A NAME="ss3.2">3.2 Network Hardware</A>
</H2>
<P>
<P>Computer networking is an exploding field... but you already knew
that. An ever-increasing range of networking technologies and
products are being developed, and most are available in forms that
could be applied to make a parallel-processing cluster out of a group
of machines (i.e., PCs each running Linux).
<P>Unfortunately, no one network technology solves all problems best; in
fact, the range of approach, cost, and performance is at first hard to
believe. For example, using standard commercially-available hardware,
the cost per machine networked ranges from less than $5 to over
$4,000. The delivered bandwidth and latency each also vary
over four orders of magnitude.
<P>Before trying to learn about specific networks, it is important to
recognize that these things change like the wind (see
<A HREF="http://www.linux.org.uk/NetNews.html">http://www.linux.org.uk/NetNews.html</A> for Linux networking news),
and it is very difficult to get accurate data about some networks.
<P>Where I was particularly uncertain,
I've placed a <EM>?</EM>. I have spent a lot of time researching this
topic, but I'm sure my summary is full of errors and has omitted many
important things. If you have any corrections or additions, please
send email to
<A HREF="mailto:hankd@engr.uky.edu">hankd@engr.uky.edu</A>.
<P>Summaries like the LAN Technology Scorecard at
<A HREF="http://web.syr.edu/~jmwobus/comfaqs/lan-technology.html">http://web.syr.edu/~jmwobus/comfaqs/lan-technology.html</A> give
some characteristics of many different types of networks and LAN
standards. However, the summary in this HOWTO centers on the network
properties that are most relevant to construction of Linux clusters.
The section discussing each network begins with a short list of
characteristics. The following defines what these entries mean.
<P>
<DL>
<DT><B>Linux support:</B><DD><P>If the answer is <EM>no</EM>, the meaning is pretty clear. Other
answers try to describe the basic program interface that is used to
access the network. Most network hardware is interfaced via a kernel
driver, typically supporting TCP/UDP communication. Some other
networks use more direct (e.g., library) interfaces to reduce latency
by bypassing the kernel.
<P>
<P>
<P>Years ago, it used to be considered perfectly acceptable to access a
floating point unit via an OS call, but that is now clearly ludicrous;
in my opinion, it is just as awkward for each communication between
processors executing a parallel program to require an OS call. The
problem is that computers haven't yet integrated these communication
mechanisms, so non-kernel approaches tend to have portability problems.
You are going to hear a lot more about this in the near future, mostly
in the form of the new <B>Virtual Interface (VI) Architecture</B>,
<A HREF="http://www.viarch.org/">http://www.viarch.org/</A>, which is a standardized method for
most network interface operations to bypass the usual OS call layers.
The VI standard is backed by Compaq, Intel, and Microsoft, and is sure
to have a strong impact on SAN (System Area Network) designs over the
next few years.
<P>
<DT><B>Maximum bandwidth:</B><DD><P>This is the number everybody cares about. I have generally used the
theoretical best case numbers; your mileage <EM>will</EM> vary.
<P>
<DT><B>Minimum latency:</B><DD><P>In my opinion, this is the number everybody should care about even more
than bandwidth. Again, I have used the unrealistic best-case numbers,
but at least these numbers do include <EM>all</EM> sources of latency,
both hardware and software. In most cases, the network latency is just
a few microseconds; the much larger numbers reflect layers of
inefficient hardware and software interfaces.
<P>
<DT><B>Available as:</B><DD><P>Simply put, this describes how you get this type of network hardware.
Commodity stuff is widely available from many vendors, with price as
the primary distinguishing factor. Multiple-vendor things are
available from more than one competing vendor, but there are
significant differences and potential interoperability problems.
Single-vendor networks leave you at the mercy of that supplier
(however benevolent it may be). Public domain designs mean that even
if you cannot find somebody to sell you one, you or anybody else can
buy parts and make one. Research prototypes are just that; they are
generally neither ready for external users nor available to them.
<P>
<DT><B>Interface port/bus used:</B><DD><P>How does one hook-up this network? The highest performance and most
common now is a PCI bus interface card. There are also EISA, VESA
local bus (VL bus), and ISA bus cards. ISA was there first, and is
still commonly used for low-performance cards. EISA is still around
as the second bus in a lot of PCI machines, so there are a few cards.
These days, you don't see much VL stuff (although
<A HREF="http://www.vesa.org/">http://www.vesa.org/</A> would beg to differ).
<P>
<P>
<P>Of course, any interface that you can use without having to open your
PC's case has more than a little appeal. IrDA and USB interfaces are
appearing with increasing frequency. The Standard Parallel Port (SPP)
used to be what your printer was plugged into, but it has seen a lot
of use lately as an external extension of the ISA bus; this new
functionality is enhanced by the IEEE 1284 standard, which specifies
EPP and ECP improvements. There is also the old, reliable, slow RS232
serial port. I don't know of anybody connecting machines using VGA
video connectors, keyboard, mouse, or game ports... so that's about
it.
<P>
<DT><B>Network structure:</B><DD><P>A bus is a wire, set of wires, or fiber. A hub is a little box that
knows how to connect different wires/fibers plugged into it; switched
hubs allow multiple connections to be actively transmitting data
simultaneously.
<P>
<DT><B>Cost per machine connected:</B><DD><P>Here's how to use these numbers. Suppose that, not counting the
network connection, it costs $2,000 to purchase a PC for use as
a node in your cluster. Adding a Fast Ethernet brings the per node
cost to about $2,400; adding a Myrinet instead brings the cost
to about $3,800. If you have about $20,000 to spend,
that means you could have either 8 machines connected by Fast Ethernet
or 5 machines connected by Myrinet. It also can be very reasonable to
have multiple networks; e.g., $20,000 could buy 8 machines
connected by both Fast Ethernet and TTL_PAPERS. Pick the
network, or set of networks, that is most likely to yield a cluster
that will run your application fastest.
<P>
<P>
<P>By the time you read this, these numbers will be wrong... heck,
they're probably wrong already. There may also be quantity discounts,
special deals, etc. Still, the prices quoted here aren't likely to be
wrong enough to lead you to a totally inappropriate choice. It
doesn't take a PhD (although I do have one ;-) to see that expensive
networks only make sense if your application needs their special
properties or if the PCs being clustered are relatively expensive.
</DL>
<P>Now that you have the disclaimers, on with the show....
<P>
<H3>ArcNet</H3>
<P>
<UL>
<LI>Linux support: <EM>kernel drivers</EM></LI>
<LI>Maximum bandwidth: <EM>2.5 Mb/s</EM></LI>
<LI>Minimum latency: <EM>1,000 microseconds?</EM></LI>
<LI>Available as: <EM>multiple-vendor hardware</EM></LI>
<LI>Interface port/bus used: <EM>ISA</EM></LI>
<LI>Network structure: <EM>unswitched hub or bus (logical ring)</EM></LI>
<LI>Cost per machine connected: <EM>$200</EM></LI>
</UL>
<P>
<P>ARCNET is a local area network that is primarily intended for use in
embedded real-time control systems. Like Ethernet, the network is
physically organized either as taps on a bus or one or more hubs,
however, unlike Ethernet, it uses a token-based protocol logically
structuring the network as a ring. Packet headers are small (3 or 4
bytes) and messages can carry as little as a single byte of data.
Thus, ARCNET yields more consistent performance than Ethernet, with
bounded delays, etc. Unfortunately, it is slower than Ethernet and
less popular, making it more expensive. More information is available
from the ARCNET Trade Association at
<A HREF="http://www.arcnet.com/">http://www.arcnet.com/</A>.
<P>
<H3>ATM</H3>
<P>
<UL>
<LI>Linux support: <EM>kernel driver, AAL* library</EM></LI>
<LI>Maximum bandwidth: <EM>155 Mb/s</EM> (soon, <EM>1,200 Mb/s</EM>)</LI>
<LI>Minimum latency: <EM>120 microseconds</EM></LI>
<LI>Available as: <EM>multiple-vendor hardware</EM></LI>
<LI>Interface port/bus used: <EM>PCI</EM></LI>
<LI>Network structure: <EM>switched hubs</EM></LI>
<LI>Cost per machine connected: <EM>$3,000</EM></LI>
</UL>
<P>
<P>Unless you've been in a coma for the past few years, you have probably
heard a lot about how ATM (Asynchronous Transfer Mode) <EM>is</EM> the
future... well, sort-of. ATM is cheaper than HiPPI and faster than
Fast Ethernet, and it can be used over the very long distances that
the phone companies care about. The ATM network protocol is also
designed to provide a lower-overhead software interface and to more
efficiently manage small messages and real-time communications (e.g.,
digital audio and video). It is also one of the highest-bandwidth
networks that Linux currently supports. The bad news is that ATM isn't
cheap, and there are still some compatibility problems across
vendors. An overview of Linux ATM development is available at
<A HREF="http://lrcwww.epfl.ch/linux-atm/">http://lrcwww.epfl.ch/linux-atm/</A>.
<P>
<H3>CAPERS</H3>
<P>
<UL>
<LI>Linux support: <EM>AFAPI library</EM></LI>
<LI>Maximum bandwidth: <EM>1.2 Mb/s</EM></LI>
<LI>Minimum latency: <EM>3 microseconds</EM></LI>
<LI>Available as: <EM>commodity hardware</EM></LI>
<LI>Interface port/bus used: <EM>SPP</EM></LI>
<LI>Network structure: <EM>cable between 2 machines</EM></LI>
<LI>Cost per machine connected: <EM>$2</EM></LI>
</UL>
<P>
<P>CAPERS (Cable Adapter for Parallel Execution and Rapid
Synchronization) is a spin-off of the PAPERS project,
<A HREF="http://garage.ecn.purdue.edu/~papers/">http://garage.ecn.purdue.edu/~papers/</A>, at the Purdue University
School of Electrical and Computer Engineering. In essence, it defines
a software protocol for using an ordinary "LapLink" SPP-to-SPP cable
to implement the PAPERS library for two Linux PCs. The idea doesn't
scale, but you can't beat the price. As with TTL_PAPERS, to improve
system security, there is a minor kernel patch recommended, but not
required:
<A HREF="http://garage.ecn.purdue.edu/~papers/giveioperm.html">http://garage.ecn.purdue.edu/~papers/giveioperm.html</A>.
<P>
<H3>Ethernet</H3>
<P>
<UL>
<LI>Linux support: <EM>kernel drivers</EM></LI>
<LI>Maximum bandwidth: <EM>10 Mb/s</EM></LI>
<LI>Minimum latency: <EM>100 microseconds</EM></LI>
<LI>Available as: <EM>commodity hardware</EM></LI>
<LI>Interface port/bus used: <EM>PCI</EM></LI>
<LI>Network structure: <EM>switched or unswitched hubs, or hubless bus</EM></LI>
<LI>Cost per machine connected: <EM>$100</EM> (hubless, <EM>$50</EM>)</LI>
</UL>
<P>
<P>For some years now, 10 Mbits/s Ethernet has been the standard network
technology. Good Ethernet interface cards can be purchased for well
under $50, and a fair number of PCs now have an Ethernet controller
built-into the motherboard. For lightly-used networks, Ethernet
connections can be organized as a multi-tap bus without a hub; such
configurations can serve up to 200 machines with minimal cost, but are
not appropriate for parallel processing. Adding an unswitched hub
does not really help performance. However, switched hubs that can
provide full bandwidth to simultaneous connections cost only about
$100 per port. Linux supports an amazing range of Ethernet
interfaces, but it is important to keep in mind that variations in the
interface hardware can yield significant performance differences. See
the Hardware Compatibility HOWTO for comments on which are supported
and how well they work; also see
<A HREF="http://cesdis1.gsfc.nasa.gov/linux/drivers/">http://cesdis1.gsfc.nasa.gov/linux/drivers/</A>.
<P>An interesting way to improve performance is offered by the 16-machine
Linux cluster work done in the Beowulf project,
<A HREF="http://cesdis.gsfc.nasa.gov/linux/beowulf/beowulf.html">http://cesdis.gsfc.nasa.gov/linux/beowulf/beowulf.html</A>, at NASA
CESDIS. There, Donald Becker, who is the author of many Ethernet card
drivers, has developed support for load sharing across multiple
Ethernet networks that shadow each other (i.e., share the same network
addresses). This load sharing is built-into the standard Linux
distribution, and is done invisibly below the socket operation level.
Because hub cost is significant, having each machine connected to two
or more hubless or unswitched hub Ethernet networks can be a very
cost-effective way to improve performance. In fact, in situations
where one machine is the network performance bottleneck, load sharing
using shadow networks works much better than using a single switched
hub network.
<P>
<H3>Ethernet (Fast Ethernet)</H3>
<P>
<UL>
<LI>Linux support: <EM>kernel drivers</EM></LI>
<LI>Maximum bandwidth: <EM>100 Mb/s</EM></LI>
<LI>Minimum latency: <EM>80 microseconds</EM></LI>
<LI>Available as: <EM>commodity hardware</EM></LI>
<LI>Interface port/bus used: <EM>PCI</EM></LI>
<LI>Network structure: <EM>switched or unswitched hubs</EM></LI>
<LI>Cost per machine connected: <EM>$400?</EM></LI>
</UL>
<P>
<P>Although there are really quite a few different technologies calling
themselves "Fast Ethernet," this term most often refers to a hub-based
100 Mbits/s Ethernet that is somewhat compatible with older "10 BaseT"
10 Mbits/s devices and cables. As might be expected, anything called
Ethernet is generally priced for a volume market, and these interfaces
are generally a small fraction of the price of 155 Mbits/s ATM cards.
The catch is that having a bunch of machines dividing the bandwidth of
a single 100 Mbits/s "bus" (using an unswitched hub) yields
performance that might not even be as good on average as using 10
Mbits/s Ethernet with a switched hub that can give each machine's
connection a full 10 Mbits/s.
<P>Switched hubs that can provide 100 Mbits/s for each machine
simultaneously are expensive, but prices are dropping every day, and
these switches do yield much higher total network bandwidth than
unswitched hubs. The thing that makes ATM switches so expensive is
that they must switch for each (relatively short) ATM cell; some Fast
Ethernet switches take advantage of the expected lower switching
frequency by using techniques that may have low latency through the
switch, but take multiple milliseconds to change the switch path...
if your routing pattern changes frequently, avoid those switches. See
<A HREF="http://cesdis1.gsfc.nasa.gov/linux/drivers/">http://cesdis1.gsfc.nasa.gov/linux/drivers/</A> for information
about the various cards and drivers.
<P>Also note that, as described for Ethernet, the Beowulf project,
<A HREF="http://cesdis.gsfc.nasa.gov/linux/beowulf/beowulf.html">http://cesdis.gsfc.nasa.gov/linux/beowulf/beowulf.html</A>, at NASA
has been developing support that offers improved performance by load
sharing across multiple Fast Ethernets.
<P>
<H3>Ethernet (Gigabit Ethernet)</H3>
<P>
<UL>
<LI>Linux support: <EM>kernel drivers</EM></LI>
<LI>Maximum bandwidth: <EM>1,000 Mb/s</EM></LI>
<LI>Minimum latency: <EM>300 microseconds?</EM></LI>
<LI>Available as: <EM>multiple-vendor hardware</EM></LI>
<LI>Interface port/bus used: <EM>PCI</EM></LI>
<LI>Network structure: <EM>switched hubs or FDRs</EM></LI>
<LI>Cost per machine connected: <EM>$2,500?</EM></LI>
</UL>
<P>
<P>I'm not sure that Gigabit Ethernet,
<A HREF="http://www.gigabit-ethernet.org/">http://www.gigabit-ethernet.org/</A>, has a good technological
reason to be called Ethernet... but the name does accurately reflect
the fact that this is intended to be a cheap, mass-market, computer
network technology with native support for IP. However, current
pricing reflects the fact that Gb/s hardware is still a tricky thing
to build.
<P>Unlike other Ethernet technologies, Gigabit Ethernet provides for a
level of flow control that should make it a more reliable network.
FDRs, or Full-Duplex Repeaters, simply multiplex lines, using
buffering and localized flow control to improve performance. Most
switched hubs are being built as new interface modules for existing
gigabit-capable switch fabrics. Switch/FDR products have been shipped
or announced by at least
<A HREF="http://www.acacianet.com/">http://www.acacianet.com/</A>,
<A HREF="http://www.baynetworks.com/">http://www.baynetworks.com/</A>,
<A HREF="http://www.cabletron.com/">http://www.cabletron.com/</A>,
<A HREF="http://www.networks.digital.com/">http://www.networks.digital.com/</A>,
<A HREF="http://www.extremenetworks.com/">http://www.extremenetworks.com/</A>,
<A HREF="http://www.foundrynet.com/">http://www.foundrynet.com/</A>,
<A HREF="http://www.gigalabs.com/">http://www.gigalabs.com/</A>,
<A HREF="http://www.packetengines.com/">http://www.packetengines.com/</A>.
<A HREF="http://www.plaintree.com/">http://www.plaintree.com/</A>,
<A HREF="http://www.prominet.com/">http://www.prominet.com/</A>,
<A HREF="http://www.sun.com/">http://www.sun.com/</A>, and
<A HREF="http://www.xlnt.com/">http://www.xlnt.com/</A>.
<P>There is a Linux driver,
<A HREF="http://cesdis.gsfc.nasa.gov/linux/drivers/yellowfin.html">http://cesdis.gsfc.nasa.gov/linux/drivers/yellowfin.html</A>, for
the Packet Engines "Yellowfin" G-NIC,
<A HREF="http://www.packetengines.com/">http://www.packetengines.com/</A>. Early tests under Linux achieved
about 2.5x higher bandwidth than could be achieved with the best 100
Mb/s Fast Ethernet; with gigabit networks, careful tuning of PCI bus
use is a critical factor. There is little doubt that driver
improvements, and Linux drivers for other NICs, will follow.
<P>
<H3>FC (Fibre Channel)</H3>
<P>
<UL>
<LI>Linux support: <EM>no</EM></LI>
<LI>Maximum bandwidth: <EM>1,062 Mb/s</EM></LI>
<LI>Minimum latency: <EM>?</EM></LI>
<LI>Available as: <EM>multiple-vendor hardware</EM></LI>
<LI>Interface port/bus used: <EM>PCI?</EM></LI>
<LI>Network structure: <EM>?</EM></LI>
<LI>Cost per machine connected: <EM>?</EM></LI>
</UL>
<P>
<P>The goal of FC (Fibre Channel) is to provide high-performance block
I/O (an FC frame carries a 2,048 byte data payload), particularly for
sharing disks and other storage devices that can be directly connected
to the FC rather than connected through a computer. Bandwidth-wise,
FC is specified to be relatively fast, running anywhere between 133
and 1,062 Mbits/s. If FC becomes popular as a high-end SCSI
replacement, it may quickly become a cheap technology; for now, it is
not cheap and is not supported by Linux. A good collection of FC
references is maintained by the Fibre Channel Association at
<A HREF="http://www.amdahl.com/ext/CARP/FCA/FCA.html">http://www.amdahl.com/ext/CARP/FCA/FCA.html</A><P>
<H3>FireWire (IEEE 1394)</H3>
<P>
<UL>
<LI>Linux support: <EM>no</EM></LI>
<LI>Maximum bandwidth: <EM>196.608 Mb/s</EM> (soon, <EM>393.216 Mb/s</EM>)</LI>
<LI>Minimum latency: <EM>?</EM></LI>
<LI>Available as: <EM>multiple-vendor hardware</EM></LI>
<LI>Interface port/bus used: <EM>PCI</EM></LI>
<LI>Network structure: <EM>random without cycles (self-configuring)</EM></LI>
<LI>Cost per machine connected: <EM>$600</EM></LI>
</UL>
<P>
<P>FireWire,
<A HREF="http://www.firewire.org/">http://www.firewire.org/</A>, the IEEE 1394-1995
standard, is destined to be the low-cost high-speed digital network
for consumer electronics. The showcase application is connecting DV
digital video camcorders to computers, but FireWire is intended to be
used for applications ranging from being a SCSI replacement to
interconnecting the components of your home theater. It allows up to
64K devices to be connected in any topology using busses and bridges
that does not create a cycle, and automatically detects the
configuration when components are added or removed. Short (four-byte
"quadlet") low-latency messages are supported as well as ATM-like
isochronous transmission (used to keep multimedia messages
synchronized). Adaptec has FireWire products that allow up to 63
devices to be connected to a single PCI interface card, and also has
good general FireWire information at
<A HREF="http://www.adaptec.com/serialio/">http://www.adaptec.com/serialio/</A>.
<P>Although FireWire will not be the highest bandwidth network available,
the consumer-level market (which should drive prices very low) and low
latency support might make this one of the best Linux PC cluster
message-passing network technologies within the next year or so.
<P>
<H3>HiPPI And Serial HiPPI</H3>
<P>
<UL>
<LI>Linux support: <EM>no</EM></LI>
<LI>Maximum bandwidth: <EM>1,600 Mb/s</EM> (serial is <EM>1,200 Mb/s</EM>)</LI>
<LI>Minimum latency: <EM>?</EM></LI>
<LI>Available as: <EM>multiple-vendor hardware</EM></LI>
<LI>Interface port/bus used: <EM>EISA, PCI</EM></LI>
<LI>Network structure: <EM>switched hubs</EM></LI>
<LI>Cost per machine connected: <EM>$3,500</EM> (serial is <EM>$4,500</EM>)</LI>
</UL>
<P>
<P>HiPPI (High Performance Parallel Interface) was originally intended to
provide very high bandwidth for transfer of huge data sets between a
supercomputer and another machine (a supercomputer, frame buffer, disk
array, etc.), and has become the dominant standard for
supercomputers. Although it is an oxymoron, <B>Serial HiPPI</B> is
also becoming popular, typically using a fiber optic cable instead of
the 32-bit wide standard (parallel) HiPPI cables. Over the past few
years, HiPPI crossbar switches have become common and prices have
dropped sharply; unfortunately, serial HiPPI is still pricey, and that
is what PCI bus interface cards generally support. Worse still, Linux
doesn't yet support HiPPI. A good overview of HiPPI is maintained by
CERN at
<A HREF="http://www.cern.ch/HSI/hippi/">http://www.cern.ch/HSI/hippi/</A>; they also maintain
a rather long list of HiPPI vendors at
<A HREF="http://www.cern.ch/HSI/hippi/procintf/manufact.htm">http://www.cern.ch/HSI/hippi/procintf/manufact.htm</A>.
<P>
<H3>IrDA (Infrared Data Association)</H3>
<P>
<UL>
<LI>Linux support: <EM>no?</EM></LI>
<LI>Maximum bandwidth: <EM>1.15 Mb/s</EM> and <EM>4 Mb/s</EM></LI>
<LI>Minimum latency: <EM>?</EM></LI>
<LI>Available as: <EM>multiple-vendor hardware</EM></LI>
<LI>Interface port/bus used: <EM>IrDA</EM></LI>
<LI>Network structure: <EM>thin air</EM> ;-)</LI>
<LI>Cost per machine connected: <EM>$0</EM></LI>
</UL>
<P>
<P>IrDA (Infrared Data Association,
<A HREF="http://www.irda.org/">http://www.irda.org/</A>) is
that little infrared device on the side of a lot of laptop PCs. It is
inherently difficult to connect more than two machines using this
interface, so it is unlikely to be used for clustering. Don Becker
did some preliminary work with IrDA.
<P>
<H3>Myrinet</H3>
<P>
<UL>
<LI>Linux support: <EM>library</EM></LI>
<LI>Maximum bandwidth: <EM>1,280 Mb/s</EM></LI>
<LI>Minimum latency: <EM>9 microseconds</EM></LI>
<LI>Available as: <EM>single-vendor hardware</EM></LI>
<LI>Interface port/bus used: <EM>PCI</EM></LI>
<LI>Network structure: <EM>switched hubs</EM></LI>
<LI>Cost per machine connected: <EM>$1,800</EM></LI>
</UL>
<P>
<P>Myrinet
<A HREF="http://www.myri.com/">http://www.myri.com/</A> is a local area network (LAN)
designed to also serve as a "system area network" (SAN), i.e., the
network within a cabinet full of machines connected as a parallel
system. The LAN and SAN versions use different physical media and
have somewhat different characteristics; generally, the SAN version
would be used within a cluster.
<P>Myrinet is fairly conventional in structure, but has a reputation for
being particularly well-implemented. The drivers for Linux are said
to perform very well, although shockingly large performance variations
have been reported with different PCI bus implementations for the host
computers.
<P>Currently, Myrinet is clearly the favorite network of cluster groups
that are not too severely "budgetarily challenged." If your idea of a
Linux PC is a high-end Pentium Pro or Pentium II with at least 256 MB
RAM and a SCSI RAID, the cost of Myrinet is quite reasonable. However,
using more ordinary PC configurations, you may find that your choice
is between <EM>N</EM> machines linked by Myrinet or <EM>2N</EM> linked
by multiple Fast Ethernets and TTL_PAPERS. It really depends
on what your budget is and what types of computations you care about
most.
<P>
<H3>Parastation</H3>
<P>
<UL>
<LI>Linux support: <EM>HAL or socket library</EM></LI>
<LI>Maximum bandwidth: <EM>125 Mb/s</EM></LI>
<LI>Minimum latency: <EM>2 microseconds</EM></LI>
<LI>Available as: <EM>single-vendor hardware</EM></LI>
<LI>Interface port/bus used: <EM>PCI</EM></LI>
<LI>Network structure: <EM>hubless mesh</EM></LI>
<LI>Cost per machine connected: <EM>> $1,000</EM></LI>
</UL>
<P>
<P>The ParaStation project
<A HREF="http://wwwipd.ira.uka.de/parastation">http://wwwipd.ira.uka.de/parastation</A> at University of Karlsruhe
Department of Informatics is building a PVM-compatible custom
low-latency network. They first constructed a two-processor ParaPC
prototype using a custom EISA card interface and PCs running BSD UNIX,
and then built larger clusters using DEC Alphas. Since January 1997,
ParaStation has been available for Linux. The PCI cards are being
made in cooperation with a company called Hitex (see
<A HREF="http://www.hitex.com:80/parastation/">http://www.hitex.com:80/parastation/</A>). Parastation hardware
implements both fast, reliable, message transmission and simple barrier
synchronization.
<P>
<H3>PLIP</H3>
<P>
<UL>
<LI>Linux support: <EM>kernel driver</EM></LI>
<LI>Maximum bandwidth: <EM>1.2 Mb/s</EM></LI>
<LI>Minimum latency: <EM>1,000 microseconds?</EM></LI>
<LI>Available as: <EM>commodity hardware</EM></LI>
<LI>Interface port/bus used: <EM>SPP</EM></LI>
<LI>Network structure: <EM>cable between 2 machines</EM></LI>
<LI>Cost per machine connected: <EM>$2</EM></LI>
</UL>
<P>
<P>For just the cost of a "LapLink" cable, PLIP (Parallel Line Interface
Protocol) allows two Linux machines to communicate through standard
parallel ports using standard socket-based software. In terms of
bandwidth, latency, and scalability, this is not a very serious
network technology; however, the near-zero cost and the software
compatibility are useful. The driver is part of the standard Linux
kernel distributions.
<P>
<H3>SCI</H3>
<P>
<UL>
<LI>Linux support: <EM>no</EM></LI>
<LI>Maximum bandwidth: <EM>4,000 Mb/s</EM></LI>
<LI>Minimum latency: <EM>2.7 microseconds</EM></LI>
<LI>Available as: <EM>multiple-vendor hardware</EM></LI>
<LI>Interface port/bus used: <EM>PCI, proprietary</EM></LI>
<LI>Network structure: <EM>?</EM></LI>
<LI>Cost per machine connected: <EM>> $1,000</EM></LI>
</UL>
<P>
<P>The goal of SCI (Scalable Coherent Interconnect, ANSI/IEEE 1596-1992)
is essentially to provide a high performance mechanism that can
support coherent shared memory access across large numbers of
machines, as well various types of block message transfers. It is
fairly safe to say that the designed bandwidth and latency of SCI are
both "awesome" in comparison to most other network technologies. The
catch is that SCI is not widely available as cheap production units,
and there isn't any Linux support.
<P>SCI primarily is used in various proprietary designs for
logically-shared physically-distributed memory machines, such as the
HP/Convex Exemplar SPP and the Sequent NUMA-Q 2000 (see
<A HREF="http://www.sequent.com/">http://www.sequent.com/</A>). However, SCI is available as a PCI
interface card and 4-way switches (up to 16 machines can be connected
by cascading four 4-way switches) from Dolphin,
<A HREF="http://www.dolphinics.com/">http://www.dolphinics.com/</A>, as their CluStar product line. A
good set of links overviewing SCI is maintained by CERN at
<A HREF="http://www.cern.ch/HSI/sci/sci.html">http://www.cern.ch/HSI/sci/sci.html</A>.
<P>
<H3>SCSI</H3>
<P>
<UL>
<LI>Linux support: <EM>kernel drivers</EM></LI>
<LI>Maximum bandwidth: <EM>5 Mb/s</EM> to over <EM>20 Mb/s</EM></LI>
<LI>Minimum latency: <EM>?</EM></LI>
<LI>Available as: <EM>multiple-vendor hardware</EM></LI>
<LI>Interface port/bus used: <EM>PCI, EISA, ISA card</EM></LI>
<LI>Network structure: <EM>inter-machine bus sharing SCSI devices</EM></LI>
<LI>Cost per machine connected: <EM>?</EM></LI>
</UL>
<P>
<P>SCSI (Small Computer Systems Interconnect) is essentially an I/O bus
that is used for disk drives, CD ROMS, image scanners, etc. There are
three separate standards SCSI-1, SCSI-2, and SCSI-3; Fast and Ultra
speeds; and data path widths of 8, 16, or 32 bits (with FireWire
compatibility also mentioned in SCSI-3). It is all pretty confusing,
but we all know a good SCSI is somewhat faster than EIDE and can handle
more devices more efficiently.
<P>What many people do not realize is that it is fairly simple for two
computers to share a single SCSI bus. This type of configuration is
very useful for sharing disk drives between machines and implementing
<B>fail-over</B> - having one machine take over database requests
when the other machine fails. Currently, this is the only mechanism
supported by Microsoft's PC cluster product, WolfPack. However, the
inability to scale to larger systems renders shared SCSI uninteresting
for parallel processing in general.
<P>
<H3>ServerNet</H3>
<P>
<UL>
<LI>Linux support: <EM>no</EM></LI>
<LI>Maximum bandwidth: <EM>400 Mb/s</EM></LI>
<LI>Minimum latency: <EM>3 microseconds</EM></LI>
<LI>Available as: <EM>single-vendor hardware</EM></LI>
<LI>Interface port/bus used: <EM>PCI</EM></LI>
<LI>Network structure: <EM>hexagonal tree/tetrahedral lattice of hubs</EM></LI>
<LI>Cost per machine connected: <EM>?</EM></LI>
</UL>
<P>
<P>ServerNet is the high-performance network hardware from Tandem,
<A HREF="http://www.tandem.com">http://www.tandem.com</A>. Especially in the online transation
processing (OLTP) world, Tandem is well known as a leading producer of
high-reliability systems, so it is not surprising that their network
claims not just high performance, but also "high data integrity and
reliability." Another interesting aspect of ServerNet is that it
claims to be able to transfer data from any device directly to any
device; not just between processors, but also disk drives, etc., in a
one-sided style similar to that suggested by the MPI remote memory
access mechanisms described in section 3.5. One last comment about
ServerNet: although there is just a single vendor, that vendor is
powerful enough to potentially establish ServerNet as a major
standard... Tandem is owned by Compaq.
<P>
<H3>SHRIMP</H3>
<P>
<UL>
<LI>Linux support: <EM>user-level memory mapped interface</EM></LI>
<LI>Maximum bandwidth: <EM>180 Mb/s</EM></LI>
<LI>Minimum latency: <EM>5 microseconds</EM></LI>
<LI>Available as: <EM>research prototype</EM></LI>
<LI>Interface port/bus used: <EM>EISA</EM></LI>
<LI>Network structure: <EM>mesh backplane (as in Intel Paragon)</EM></LI>
<LI>Cost per machine connected: <EM>?</EM></LI>
</UL>
<P>
<P>The SHRIMP project,
<A HREF="http://www.CS.Princeton.EDU/shrimp/">http://www.CS.Princeton.EDU/shrimp/</A>,
at the Princeton University Computer Science Department is building a
parallel computer using PCs running Linux as the processing elements.
The first SHRIMP (Scalable, High-Performance, Really Inexpensive
Multi-Processor) was a simple two-processor prototype using a
dual-ported RAM on a custom EISA card interface. There is now a
prototype that will scale to larger configurations using a custom
interface card to connect to a "hub" that is essentially the same mesh
routing network used in the Intel Paragon (see
<A HREF="http://www.ssd.intel.com/paragon.html">http://www.ssd.intel.com/paragon.html</A>). Considerable effort
has gone into developing low-overhead "virtual memory mapped
communication" hardware and support software.
<P>
<H3>SLIP</H3>
<P>
<UL>
<LI>Linux support: <EM>kernel drivers</EM></LI>
<LI>Maximum bandwidth: <EM>0.1 Mb/s</EM></LI>
<LI>Minimum latency: <EM>1,000 microseconds?</EM></LI>
<LI>Available as: <EM>commodity hardware</EM></LI>
<LI>Interface port/bus used: <EM>RS232C</EM></LI>
<LI>Network structure: <EM>cable between 2 machines</EM></LI>
<LI>Cost per machine connected: <EM>$2</EM></LI>
</UL>
<P>
<P>Although SLIP (Serial Line Interface Protocol) is firmly planted at
the low end of the performance spectrum, SLIP (or CSLIP or PPP) allows
two machines to perform socket communication via ordinary RS232 serial
ports. The RS232 ports can be connected using a null-modem RS232
serial cable, or they can even be connected via dial-up through a
modem. In any case, latency is high and bandwidth is low, so SLIP
should be used only when no other alternatives are available. It is
worth noting, however, that most PCs have two RS232 ports, so it would
be possible to network a group of machines simply by connecting the
machines as a linear array or as a ring. There is even load sharing
software called EQL.
<P>
<H3>TTL_PAPERS</H3>
<P>
<UL>
<LI>Linux support: <EM>AFAPI library</EM></LI>
<LI>Maximum bandwidth: <EM>1.6 Mb/s</EM></LI>
<LI>Minimum latency: <EM>3 microseconds</EM></LI>
<LI>Available as: <EM>public-domain design, single-vendor hardware</EM></LI>
<LI>Interface port/bus used: <EM>SPP</EM></LI>
<LI>Network structure: <EM>tree of hubs</EM></LI>
<LI>Cost per machine connected: <EM>$100</EM></LI>
</UL>
<P>
<P>The PAPERS (Purdue's Adapter for Parallel Execution and Rapid
Synchronization) project,
<A HREF="http://garage.ecn.purdue.edu/~papers/">http://garage.ecn.purdue.edu/~papers/</A>, at the Purdue University
School of Electrical and Computer Engineering is building scalable,
low-latency, aggregate function communication hardware and software
that allows a parallel supercomputer to be built using unmodified
PCs/workstations as nodes.
<P>There have been over a dozen different types of PAPERS hardware built
that connect to PCs/workstations via the SPP (Standard Parallel Port),
roughly following two development lines. The versions called "PAPERS"
target higher performance, using whatever technologies are appropriate;
current work uses FPGAs, and high bandwidth PCI bus interface designs
are also under development. In contrast, the versions called
"TTL_PAPERS" are designed to be easily reproduced outside
Purdue, and are remarkably simple public domain designs that can be
built using ordinary TTL logic. One such design is produced
commercially,
<A HREF="http://chelsea.ios.com:80/~hgdietz/sbm4.html">http://chelsea.ios.com:80/~hgdietz/sbm4.html</A>.
<P>Unlike the custom hardware designs from other universities,
TTL_PAPERS clusters have been assembled at many universities
from the USA to South Korea. Bandwidth is severely limited by the SPP
connections, but PAPERS implements very low latency aggregate function
communications; even the fastest message-oriented systems cannot
provide comparable performance on those aggregate functions. Thus,
PAPERS is particularly good for synchronizing the displays of a video
wall (to be discussed further in the upcoming Video Wall HOWTO),
scheduling accesses to a high-bandwidth network, evaluating global
fitness in genetic searches, etc. Although PAPERS clusters have been
built using IBM PowerPC AIX, DEC Alpha OSF/1, and HP PA-RISC HP-UX
machines, Linux-based PCs are the platforms best supported.
<P>User programs using TTL_PAPERS AFAPI directly access the SPP
hardware port registers under Linux, without an OS call for each
access. To do this, AFAPI first gets port permission using either
<CODE>iopl()</CODE> or <CODE>ioperm()</CODE>. The problem with these calls is
that both require the user program to be privileged, yielding a
potential security hole. The solution is an optional kernel patch,
<A HREF="http://garage.ecn.purdue.edu/~papers/giveioperm.html">http://garage.ecn.purdue.edu/~papers/giveioperm.html</A>, that
allows a privileged process to control port permission for any process.
<P>
<H3>USB (Universal Serial Bus)</H3>
<P>
<UL>
<LI>Linux support: <EM>kernel driver</EM></LI>
<LI>Maximum bandwidth: <EM>12 Mb/s</EM></LI>
<LI>Minimum latency: <EM>?</EM></LI>
<LI>Available as: <EM>commodity hardware</EM></LI>
<LI>Interface port/bus used: <EM>USB</EM></LI>
<LI>Network structure: <EM>bus</EM></LI>
<LI>Cost per machine connected: <EM>$5?</EM></LI>
</UL>
<P>
<P>USB (Universal Serial Bus,
<A HREF="http://www.usb.org/">http://www.usb.org/</A>) is a
hot-pluggable conventional-Ethernet-speed, bus for up to 127
peripherals ranging from keyboards to video conferencing cameras. It
isn't really clear how multiple computers get connected to each other
using USB. In any case, USB ports are quickly becoming as standard on
PC motherboards as RS232 and SPP, so don't be surprised if one or two
USB ports are lurking on the back of the next PC you buy. Development
of a Linux driver is discussed at
<A HREF="http://peloncho.fis.ucm.es/~inaky/USB.html">http://peloncho.fis.ucm.es/~inaky/USB.html</A>.
<P>In some ways, USB is almost the low-performance, zero-cost, version of
FireWire that you can purchase today.
<P>
<H3>WAPERS</H3>
<P>
<UL>
<LI>Linux support: <EM>AFAPI library</EM></LI>
<LI>Maximum bandwidth: <EM>0.4 Mb/s</EM></LI>
<LI>Minimum latency: <EM>3 microseconds</EM></LI>
<LI>Available as: <EM>public-domain design</EM></LI>
<LI>Interface port/bus used: <EM>SPP</EM></LI>
<LI>Network structure: <EM>wiring pattern between 2-64 machines</EM></LI>
<LI>Cost per machine connected: <EM>$5</EM></LI>
</UL>
<P>
<P>WAPERS (Wired-AND Adapter for Parallel Execution and Rapid
Synchronization) is a spin-off of the PAPERS project,
<A HREF="http://garage.ecn.purdue.edu/~papers/">http://garage.ecn.purdue.edu/~papers/</A>, at the Purdue University
School of Electrical and Computer Engineering. If implemented
properly, the SPP has four bits of open-collector output that can be
wired together across machines to implement a 4-bit wide wired AND.
This wired-AND is electrically touchy, and the maximum number of
machines that can be connected in this way critically depends on the
analog properties of the ports (maximum sink current and pull-up
resistor value); typically, up to 7 or 8 machines can be networked by
WAPERS. Although cost and latency are very low, so is bandwidth;
WAPERS is much better as a second network for aggregate operations
than as the only network in a cluster. As with TTL_PAPERS, to
improve system security, there is a minor kernel patch recommended,
but not required:
<A HREF="http://garage.ecn.purdue.edu/~papers/giveioperm.html">http://garage.ecn.purdue.edu/~papers/giveioperm.html</A>.
<P>
<H2><A NAME="ss3.3">3.3 Network Software Interface</A>
</H2>
<P>
<P>Before moving on to discuss the software support for parallel
applications, it is useful to first briefly cover the basics of
low-level software interface to the network hardware. There are
really only three basic choices: sockets, device drivers, and
user-level libraries.
<P>
<H3>Sockets</H3>
<P>
<P>By far the most common low-level network interface is a socket
interface. Sockets have been a part of unix for over a decade, and
most standard network hardware is designed to support at least two
types of socket protocols: UDP and TCP. Both types of socket allow
you to send arbitrary size blocks of data from one machine to another,
but there are several important differences. Typically, both yield a
minimum latency of around 1,000 microseconds, although performance can
be far worse depending on network traffic.
<P>These socket types are the basic network software interface for most
of the portable, higher-level, parallel processing software; for
example, PVM uses a combination of UDP and TCP, so knowing the
difference will help you tune performance. For even better
performance, you can also use these mechanisms directly in your
program. The following is just a simple overview of UDP and TCP; see
the manual pages and a good network programming book for details.
<P>
<H3>UDP Protocol (SOCK_DGRAM)</H3>
<P>
<P><B>UDP</B> is the User Datagram Protocol, but you more easily can
remember the properties of UDP as Unreliable Datagram Processing. In
other words, UDP allows each block to be sent as an individual message,
but a message might be lost in transmission. In fact, depending on
network traffic, UDP messages can be lost, can arrive multiple times,
or can arrive in an order different from that in which they were
sent. The sender of a UDP message does not automatically get an
acknowledgment, so it is up to user-written code to detect and
compensate for these problems. Fortunately, UDP does ensure that if a
message arrives, the message contents are intact (i.e., you never get
just part of a UDP message).
<P>The nice thing about UDP is that it tends to be the fastest socket
protocol. Further, UDP is "connectionless," which means that each
message is essentially independent of all others. A good analogy is
that each message is like a letter to be mailed; you might send
multiple letters to the same address, but each one is independent of
the others and there is no limit on how many people you can send
letters to.
<P>
<H3>TCP Protocol (SOCK_STREAM)</H3>
<P>
<P>Unlike UDP, <B>TCP</B> is a reliable, connection-based, protocol.
Each block sent is not seen as a message, but as a block of data
within an apparently continuous stream of bytes being transmitted
through a connection between sender and receiver. This is very
different from UDP messaging because each block is simply part of the
byte stream and it is up to the user code to figure-out how to extract
each block from the byte stream; there are no markings separating
messages. Further, the connections are more fragile with respect to
network problems, and only a limited number of connections can exist
simultaneously for each process. Because it is reliable, TCP
generally implies significantly more overhead than UDP.
<P>There are, however, a few pleasant surprises about TCP. One is that,
if multiple messages are sent through a connection, TCP is able to
pack them together in a buffer to better match network hardware packet
sizes, potentially yielding better-than-UDP performance for groups of
short or oddly-sized messages. The other bonus is that networks
constructed using reliable direct physical links between machines can
easily and efficiently simulate TCP connections. For example, this was
done for the ParaStation's "Socket Library" interface software, which
provides TCP semantics using user-level calls that differ from the
standard TCP OS calls only by the addition of the prefix
<CODE>PSS</CODE> to each function name.
<P>
<H3>Device Drivers</H3>
<P>
<P>When it comes to actually pushing data onto the network or pulling data
off the network, the standard unix software interface is a part of the
unix kernel called a device driver. UDP and TCP don't just transport
data, they also imply a fair amount of overhead for socket management.
For example, something has to manage the fact that multiple TCP
connections can share a single physical network interface. In
contrast, a device driver for a dedicated network interface only needs
to implement a few simple data transport functions. These device
driver functions can then be invoked by user programs by using
<CODE>open()</CODE> to identify the proper device and then using system
calls like <CODE>read()</CODE> and <CODE>write()</CODE> on the open "file."
Thus, each such operation could transport a block of data with little
more than the overhead of a system call, which might be as fast as
tens of microseconds.
<P>Writing a device driver to be used with Linux is not hard... if you
know <EM>precisely</EM> how the device hardware works. If you are not
sure how it works, don't guess. Debugging device drivers isn't fun
and mistakes can fry hardware. However, if that hasn't scared you
off, it may be possible to write a device driver to, for example, use
dedicated Ethernet cards as dumb but fast direct machine-to-machine
connections without the usual Ethernet protocol overhead. In fact,
that's pretty much what some early Intel supercomputers did.... Look
at the Device Driver HOWTO for more information.
<P>
<H3>User-Level Libraries</H3>
<P>
<P>If you've taken an OS course, user-level access to hardware device
registers is exactly what you have been taught never to do, because
one of the primary purposes of an OS is to control device access.
However, an OS call is at least tens of microseconds of overhead. For
custom network hardware like TTL_PAPERS, which can perform a
basic network operation in just 3 microseconds, such OS call overhead
is intolerable. The only way to avoid that overhead is to have
user-level code - a user-level library - directly access hardware
device registers. Thus, the question becomes one of how a use