fftw - Fast Fourier Transform library
This version of FFTW contains specific support for the Cell Broadband
Engine (``Cell'') processor.

ACKNOWLEDGMENTS
---------------

The code in the cell/ directory was written and graciously donated to
the FFTW project by the IBM Austin Research Laboratory. We are
grateful to Pat Bohrer and Lorraine Herger of IBM for this generous
contribution.

SCOPE
-----

Cell consists of one PowerPC core (``PPE'') and a number of
Synergistic Processing Elements (``SPE'') to which the PPE can
delegate computation. The IBM QS20 Cell blade offers 8 SPEs per Cell
chip. The Sony PlayStation 3 contains 6 usable SPEs.

This version of FFTW fully utilizes the SPEs for one- and
multi-dimensional complex FFTs of sizes that can be factored into
small primes, in both single and double precision. Transforms of real
data use the SPEs only partially at this time. If FFTW cannot use the
SPEs, it falls back to a slower computation on the PPE.

This library is meant to use the SPEs transparently, without user
intervention. However, certain caveats apply, which are discussed
later in this document.

INSTALLATION
------------

To enable support for Cell in double precision:

    configure --enable-cell
    make
    make install

In single precision:

    configure --enable-cell --enable-single
    make
    make install

In addition, the PPE supports the Altivec (or VMX) instruction set in
single precision. (Altivec is Apple/Freescale terminology; VMX is IBM
terminology for the same thing.) You can enable support for Altivec
with the ``--enable-altivec'' flag (single precision only).

The software compiles with the Cell SDK 2.0, and probably with earlier
ones as well.

CAVEATS
-------

* The benchmark program allocates memory using malloc() or equivalent
  library calls, reflecting the common usage of the FFTW library.
  However, you can sometimes improve performance significantly by
  allocating memory in system-specific large TLB pages. For example,
  we have seen 39 GFLOP/s for a 256x256x256 problem using large pages,
  compared with about 25 GFLOP/s with normal pages. YMMV.

* FFTW hoards all available SPEs for itself. You can optionally choose
  a different number of SPEs by calling the undocumented function
  fftw_cell_set_nspe(n), where ``n'' is the number of desired SPEs.
  Expect this interface to go away once we figure out how to make FFTW
  play nicely with other Cell software.

  In particular, if you link both the single- and double-precision
  versions of FFTW into the same program (which you can do), they will
  both try to grab all SPEs and the second one will hang.

* The SPEs demand that data be stored in contiguous arrays aligned at
  16-byte boundaries. If you instruct FFTW to operate on noncontiguous
  or nonaligned data, the SPEs will not be used, resulting in slow
  execution.

* The FFTW_ESTIMATE mode may produce seriously suboptimal plans, and
  it becomes particularly confused if you enable both the SPEs and
  Altivec. If you care about performance, please use FFTW_MEASURE
  until we figure out a more reliable performance model.

ACCURACY
--------

The SPEs are fully IEEE-754 compliant in double precision. In single
precision, they only implement round-towards-zero as opposed to the
standard round-to-even mode. (The PPE is fully IEEE-754 compliant like
all other PowerPC implementations.)

Because of the rounding mode, single-precision FFTW is less accurate
when running on the SPEs than on the PPE. The accuracy loss is hard to
quantify in general, but as a rough guideline, the L2 norm of the
relative roundoff error for random inputs is 4-8 times larger than
that of the corresponding calculation in round-to-even arithmetic. In
other words, expect to lose 2 to 3 bits of accuracy.

FFTW currently does not use any algorithm that degrades accuracy to
gain performance on the SPE. One implication of this choice is that
large 1D transforms run slower than they would if we were willing to
sacrifice another bit or so of accuracy.
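
EXAMPLE: CHECKING 16-BYTE ALIGNMENT
-----------------------------------

The alignment caveat above can be checked before planning a transform.
The following is a minimal sketch, not part of FFTW: the helper names
is_aligned16 and alloc_aligned16 and the SPE_ALIGNMENT macro are
hypothetical, and the code uses only standard C11 (aligned_alloc) so
that it builds without FFTW installed. In a real program, FFTW's own
fftw_malloc is the usual way to obtain suitably aligned buffers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Alignment that this README says the SPEs require. */
#define SPE_ALIGNMENT 16

/* Hypothetical helper: nonzero if p lies on a 16-byte boundary. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p % SPE_ALIGNMENT) == 0;
}

/*
 * Hypothetical helper: allocate at least nbytes of 16-byte-aligned
 * storage with C11 aligned_alloc. The size is rounded up to a multiple
 * of the alignment, as C11 requires. Returns NULL on failure; release
 * the buffer with free().
 */
static void *alloc_aligned16(size_t nbytes)
{
    size_t rounded;
    void *p;

    if (nbytes == 0)
        nbytes = 1; /* avoid an implementation-defined zero-size request */
    if (nbytes > SIZE_MAX - (SPE_ALIGNMENT - 1))
        return NULL; /* rounding up would overflow */

    rounded = (nbytes + SPE_ALIGNMENT - 1) / SPE_ALIGNMENT * SPE_ALIGNMENT;
    p = aligned_alloc(SPE_ALIGNMENT, rounded);
    assert(p == NULL || is_aligned16(p));
    return p;
}
```

A caller might test is_aligned16() on both the input and output arrays
before creating a plan, and copy into a buffer obtained from
alloc_aligned16() (or fftw_malloc) when the test fails. Note that an
offset into an aligned array, such as a pointer 8 bytes past its start,
is no longer 16-byte aligned.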