
BENCHMARKING OF FFT ALGORITHMS*

Michael Balducci, Aravind Ganapathiraju, Ajitha Choudary, Anthony Skjellum,
Jonathan Hamaker, Joseph Picone

Department of Electrical and Computer Engineering      Department of Computer Science
Mississippi State University                           Mississippi State University
Mississippi State, Mississippi 39762, USA              Mississippi State, Mississippi 39762, USA
Ph (601) 325-3149 - Fax (601) 325-3149                 Ph (601) 325-8435 - Fax (601) 325-8997
{ganapath, hamaker, picone}@isip.msstate.edu           {ajitha, tony}@cs.msstate.edu

Abstract - A large number of Fast Fourier Transform (FFT) algorithms have been developed over the years. Among these, the most promising are the Radix-2, Radix-4, Split-Radix, Fast Hartley Transform (FHT), Quick Fourier Transform (QFT), and Decimation-in-Time-Frequency (DITF) algorithms. In this paper, we present a rigorous analysis of these algorithms that includes the number of mathematical operations, computational time, memory requirements, and object code size. The results of this work will serve as a framework for creating an object-oriented, poly-functional FFT implementation which will automatically choose the most efficient algorithm given user-specified constraints.

INTRODUCTION

Though the development of Fast Fourier Transform (FFT) algorithms is a fairly mature area, several interesting algorithms have been introduced in the last ten years that provide unprecedented levels of performance. The first major breakthrough was the Cooley-Tukey algorithm, developed in the mid-sixties, which resulted in a flurry of activity on FFTs [2]. Further research led to the development of the Fast Hartley Transform and the Split-Radix algorithm. Recently, two new algorithms have also emerged: the Quick Fourier Transform and the Decimation-in-Time-Frequency algorithm.

While there has been extensive research on the theoretical efficiency of these algorithms, there has been little research to date comparing these algorithms on practical terms. Architectures have become quite complex today, with multi-level caches, super-pipelined processors, long-word instruction sets, etc. Efficiency, as we will show, is intricately related to how an algorithm can be implemented on a given architecture. The issues to be considered include computation speed, memory, algorithm complexity, machine architecture, and compiler design. In this paper, we report on preliminary work to find the most efficient algorithm under application-specific constraints. Our approach is to benchmark each algorithm under a variety of constraints, and to use the benchmark statistics to create a wrapper capable of choosing the algorithm best suited for a given application. Obviously, object-oriented programming methodologies will play a large role.

ALGORITHMS

The definition of the Discrete Fourier Transform (DFT) is shown in (1):

X(k) = \sum_{n=0}^{N-1} x(n) e^{-j 2\pi k n / N},  k = 0, 1, ..., N-1   (1)

Most of the algorithms take the divide-and-conquer approach to reduce computations. The Radix-2 and Radix-4 approaches decompose the N-point DFT computation into sets of two- and four-point DFTs, respectively [2]. The core computation in a Radix-4 butterfly involves fewer complex multiplications than the Radix-2 butterfly, yielding an increase in efficiency when the order of the transform is a power of 4. To take advantage of this fact, the Split-Radix algorithm makes use of both the Radix-2 and Radix-4 decompositions [3].

The Hartley Transform, shown in (2), further reduces computation by replacing the complex exponential term in the DFT with a kernel using real variables [4]:

X_H(k) = \frac{1}{\sqrt{N}} \sum_{n=0}^{N-1} x(n) \left[ \cos\left(\frac{2\pi k n}{N}\right) + \sin\left(\frac{2\pi k n}{N}\right) \right]   (2)

This reduces the number of real multiplications and additions, with only a modest gain in memory. The main drawback of the Hartley Transform is the additional computation needed to transform the results from the real Hartley coefficients to the standard complex Fourier coefficients. However, since the relationship between the two forms of coefficients is linear, the additional cost incurred is less than the efficiency gained.

* This work was supported by DARPA through the US Air Force's Rome Laboratories under contract F30602-96-1-0329.

0-7803-3844-8/97/$10.00 © 1997 IEEE


The Quick Fourier Transform (QFT) uses the symmetry of the cosine and sine terms to reduce the number of complex calculations [5]. The QFT breaks a signal into its even and odd components. A Discrete Cosine Transform (DCT) is used on the even samples to calculate the real portions of the Fourier coefficients, while a Discrete Sine Transform (DST) is used on the odd samples to compute the imaginary portions. Both the DCT and DST are in turn computed recursively. An important aspect of the QFT is that all complex operations occur at the last stage of recursion, making it well-suited for real data.

Finally, the Decimation-In-Time-Frequency (DITF) algorithm leverages the Radix-2 approach in both the time domain (DIT) and the frequency domain (DIF). The DITF is based on the observation that the DIT algorithm has a majority of its complex operations towards the end of the computation cycle, while the DIF algorithm has a majority towards the beginning. The DITF makes use of this fact by performing the DIT at the outset and then switching to a DIF to complete the transform. Combining these algorithms comes at the cost of computing complex conversion factors at the transition stage [6].

EVALUATION METHODOLOGY

Criteria: Computation speed was selected as the core criterion for comparison, since the fastest method is generally the most desirable one. However, the amount of memory available is not unlimited; therefore, memory was also included as a measure of efficiency. The number of mathematical operations is also important, since it is directly related to the computation time and the hardware requirements. The additions and multiplications were broken into floating-point and integer operations because floating-point operations are more costly in computation time than integer operations (on most hardware). Of course, all of this is mitigated by the degree to which the compiler can perform optimizations. Modern compilers are able to optimize code for speed and hardware usage by such techniques as loop unrolling, delayed branching, etc. The level of optimization performed on an algorithm will be highly algorithm and implementation dependent.

Implementation: Bearing in mind the evaluation criteria, we designed a class structure that is intuitive and allows for easy inclusion of new algorithms into the existing set. Computation speed is measured using system utilities accurate to 1 ms. For evaluation of memory usage, we developed floating-point and integer classes that accumulate a count for every variable that is declared and count the number of mathematical operations. This gives a very efficient method for viewing the dynamic memory usage. An iterative approach to testing was also used to reduce the transients of processor loading. This method involved running each test for a large number of iterations, using median values for comparison.

RESULTS

An often-used criterion for comparison of FFT algorithms thus far has been the number of mathematical operations involved. These statistics usually do not, however, account for the cost of incrementing integer counters and indices. Integer operations can be a significant portion of the computation time. Operation counts for a 1024-point complex FFT are presented in Table 1. Most algorithms trade floating-point operations for integer operations.

A criterion closely related to the number of operations is the computation speed. A summary of the computation times for the selected algorithms is shown in Table 2. One would expect the algorithm with the least number of operations to be the fastest. However, we show that compiler optimizations play a large role, by virtue of the large difference between the FHT and all other algorithms. It is interesting to note that the difference in performance tends to decrease as the order increases. As the order increases in the FHT, the cost of converting the Hartley coefficients to Fourier coefficients seems to become substantial, explaining the less dramatic difference in performance.

It is a well-known fact that most FFT algorithms achieve O[N log N] complexity. The constants of proportionality are what differentiate the performance of the algorithms. As shown in Table 2, second-order effects can be dominated by compiler efficiency. For lower orders, the FHT algorithm in its present implementation seems to make better use of the cache than the other algorithms. This advantage flattens out as the order increases, though.

Table 1. Comparison of mathematical operations involved in a 1024-point complex FFT. Only the following rows are legible in this copy.

Algorithm | Float Mults | Float Adds | Int Mults | Int Adds | Bin Shifts
RAD-2     |       20480 |      30720 |         0 |     15357 |      1024
RAD-4     |       15701 |      28842 |       336 |      8877 |      2738
SRFFT     |       10016 |      25488 |       502 |     12448 |      2937
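The instrumented arithmetic and the iterative median-timing procedure described in the methodology can be sketched as follows (an illustrative Python reconstruction; `CountedFloat` and `benchmark` are hypothetical names, not the authors' original classes):

```python
import time
import statistics

class CountedFloat:
    """Float wrapper that tallies operations, mirroring the instrumented
    floating-point classes used to gather counts like those in Table 1."""
    mults = 0
    adds = 0

    def __init__(self, value):
        self.value = float(value)

    def __mul__(self, other):
        CountedFloat.mults += 1
        return CountedFloat(self.value * other.value)

    def __add__(self, other):
        CountedFloat.adds += 1
        return CountedFloat(self.value + other.value)

    def __sub__(self, other):
        CountedFloat.adds += 1
        return CountedFloat(self.value - other.value)

def benchmark(transform, signal, iterations=101):
    """Time a transform repeatedly and report the median elapsed time,
    which suppresses transients caused by processor loading."""
    times = []
    for _ in range(iterations):
        start = time.perf_counter()
        transform(signal)
        times.append(time.perf_counter() - start)
    return statistics.median(times)
```

Running a transform on `CountedFloat` inputs yields operation tallies, while `benchmark` reflects the median-of-many-iterations timing strategy.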

We also see that, in general, the number of integer additions is a good indication of speed. Note that the ranking of the algorithms by speed is almost the same as the ranking of the algorithms by integer additions. The only exception is the QFT, which seems to make up for its high number of integer operations by using very few floating-point multiplications.

Table 2. Computation times for the selected algorithms. Only the following rows are legible in this copy; the column headings are not recoverable.

QFT:  143, 762, 3476, 15952, 78000
DITF: 238, 1000, 4191, 18714, 80333

Another common design criterion in practical systems is memory. Algorithm efficiency can always be traded for memory and code size. Table 3 illustrates this fact.

Table 3. Comparison of memory usage [in bytes] for a 1024-point complex FFT on a 200 MHz UltraSparc compiled using gcc with optimization level 3. Note that peak memory requirements are shown. Only the following rows are legible in this copy.

QFT:  49152, 7520
DITF: 24616, 3060

CONCLUSIONS

We have presented results on a comprehensive collection of FFT algorithms, each of which was programmed in a similar framework. We have generated statistics to supplement the mathematical formulation of the complexity of these algorithms. Combined, this data gives the developer a clear picture of the computational requirements of each algorithm.

At our current level of implementation, the FHT appears to be the best overall algorithm. It is interesting to note that a higher number of mathematical operations does not necessarily translate into a reduction in speed. The FHT requires a marginally higher number of floating-point operations than the Radix-4, yet the FHT is faster than the Radix-4 by more than a factor of 2. This implies that the FHT makes more efficient use of the resources available (cache, pipelining, etc.) than does the Radix-4. It also makes it even more important to look at the order of operations, because multiplications and additions that can be pipelined need much less computation time than algorithms whose flow of operations cannot be easily pipelined. This is an especially important consideration for the current generation of complex architectures (such as the Pentium Pro chip).

Our work has laid the foundation for an "intelligent" environment which will automatically choose the best algorithm and execute it for a given set of user constraints. As the available hardware becomes more specialized, it becomes imperative that software take full advantage of the hardware capabilities. In the future, our work will be extended to parallel processing environments. Detailed documentation of our experiments can be downloaded from http://www.isip.msstate.edu/software/parallel_dsp.

ACKNOWLEDGEMENTS

We gratefully acknowledge the suggestions and assistance given by Shane Hebert of ICDCRL in the Computer Science Department at Mississippi State University.

REFERENCES

[1] J.W. Cooley and J.W. Tukey, "An Algorithm for the Machine Calculation of Complex Fourier Series," Math. Comp., vol. 19, pp. 297-301, April 1965.
[2] C.S. Burrus and T.W. Parks, DFT/FFT and Convolution Algorithms: Theory and Implementation, John Wiley and Sons, New York, NY, USA, 1985.
[3] P. Duhamel and H. Hollmann, "Split Radix FFT Algorithm," Electronics Letters, vol. 20, pp. 14-16, Jan. 1984.
[4] R. Bracewell, The Hartley Transform, Oxford Press, Oxford, England, 1985.
[5] H. Guo, G.A. Sitton, and C.S. Burrus, "The Quick Discrete Fourier Transform," Proc. of ICASSP, vol. III, pp. 445-447, Adelaide, Australia, April 1994.
[6] A. Saidi, "Decimation-In-Time-Frequency FFT Algorithm," Proc. of ICASSP, vol. III, pp. 453-456, Adelaide, Australia, April 1994.

