J. Software Engineering & Applications, 2010, 3: 391-403
doi:10.4236/jsea.2010.34044 Published Online April 2010 (http://www.SciRP.org/journal/jsea)
DSPs/FPGAs Comparative Study for Power
Consumption, Noise Cancellation, and Real
Time High Speed Applications
Alon Hayim, Michael Knieser, Maher Rizkalla
Department of Electrical and Computer Engineering, Indiana University Purdue University Indianapolis, Indianapolis,
USA.
Email: mrizkall@iupui.edu, mrizkall@yahoo.com
Received December 24th, 2009; revised January 6th, 2010; accepted February 3rd, 2010.
ABSTRACT
Adaptive noise data filtering in real-time requires dedicated hardware to meet demanding time requirements. Both DSP
processors and FPGAs were studied with respect to their performance in power consumption, hardware architecture,
and speed for real time applications. For testing purposes, real time adaptive noise filters have been implemented and
simulated on two different platforms, the Motorola DSP56303 EVM and Xilinx Spartan III boards. This study has shown
that in high speed applications, FPGAs are advantageous over DSPs with respect to their speed and noise reduction
because of their parallel architecture. FPGAs can handle more processes at the same time when compared to DSPs,
while the latter can only handle a limited number of parallel instructions at a time. The speed of both processors impacts
the noise reduction in real time: as the DSP core gets slower, noise removal in real time gets harder to achieve.
With respect to power, DSPs are advantageous over FPGAs. FPGAs have a reconfigurable gate structure, which consumes
more power; in the case of DSPs, the hardware has already been configured, which requires less power consumption.
FPGAs are built for general purposes, and their silicon area in the core is bigger than that of DSPs. This is another
factor that affects power consumption. As a result, in high frequency applications, FPGAs are advantageous as
compared to DSPs. In low frequency applications, DSPs and FPGAs both satisfy the requirements for noise cancelling.
For low frequency applications, DSPs are advantageous in their power consumption and in applications for battery
powered devices. Software utilizing Matlab, VHDL code run on the Xilinx system, and assembly running on Motorola development
systems have been used for the demonstration of this study.
Keywords: DSP, FPGA, Adaptive Noise Cancellation, LMS Algorithm, Power Consumption, Real Time Applications
1. Introduction
The performance of real-time data processing is often
limited to the processing capability of the system.
Therefore, evaluation of different digital signal process-
ing platforms to determine the most efficient platform is
an important task. There have been many discussions
regarding the preference of Digital Signal Processors
(DSPs) or Field Programmable Gate Arrays (FPGAs) in
real time noise cancellation. The purpose of this work is
to study features of DSPs and FPGAs with respect to
their power consumption, speed, architecture and cost.
DSP is found in a wide variety of applications, such as
filtering, speech recognition, image enhancement, data
compression, and neural networks, as well as in place of
analog linear-phase filters. Signals from the real world are received
in analog form and then discretely sampled so that a digital
computer can understand and manipulate them. There are many
advantages to hardware that can be reconfigured with
different programming. Reconfigurable hardware devices
offer both the flexibility of computer software, and the
ability to construct custom high performance computing
circuits. In space applications, it may be necessary to
install new functionality into a system, which may have
been unforeseen. For example, satellite applications need
to adjust to changing operation requirements. With a re-
configurable chip, functionality that is not normally pre-
dicted at the outset can be uploaded to the satellite when
needed. To test the adaptive noise cancelling, the least
mean square (LMS) approach has been used. Besides the
standard LMS algorithm, the modified algorithms that
are proposed by Stefano [1] and by Das [2] have been
implemented for the noise cancellation approach, giving
the opportunity of comparing both platforms with respect
to their speed, noise, architecture, cost, and power.
2. Adaptive Filter Design on Motorola
DSP56300
Adaptive filters have the ability to adjust their own pa-
rameters and coefficients automatically. Hence, their
design requires little or no prior knowledge of the input
signal or noise characteristics of the system. Adaptive
filters have two inputs, x(n) and d(n), which are usually
correlated in some manner. Figure 1 gives the basic con-
cept of the adaptive filter.
The filter's output y(n), which is computed with the parameter estimates, is compared with the input signal d(n). The resulting prediction error e(n) is fed back through a parameter adaptation algorithm that produces a new estimate of the parameters, so that as the next input sample is received a new prediction error can be generated. The adaptive filter thus seeks the minimum prediction error. Two aspects of the adaptive filter are its internal structure and its adaptation algorithm. Its internal structure can be either that of a nonrecursive (FIR) filter or that of a recursive (IIR) filter. Adaptation algorithms can be divided into two major classes: gradient algorithms and nongradient algorithms. A gradient algorithm is used to adjust the parameters of the FIR filter. The least mean square (LMS) algorithm is the most widely applied gradient algorithm; it adjusts the filter's parameters to minimize the mean-square error between the filter's output y(n) and the desired response input d(n) [3].

When an adaptive filter is implemented on the DSP56300 processor, the processor makes use of: an address pointer to mimic FIFO (First-In-First-Out)-like shifting of the RAM data; modulo addressing to provide wrap-around data buffers; the multiply/accumulate (MAC) instruction to both multiply two operands and add the product to a third operand in a single instruction cycle; data moves in parallel with the MAC instruction to keep the multiplier running at 100% capacity; and the Repeat Next Instruction (REP) to provide compact filter code.
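To make the data flow of Figure 1 concrete before turning to the DSP56300 mechanics, a minimal LMS sketch in Python/NumPy is given below. This is illustrative only; the implementations in this study are in DSP56300 assembly and VHDL, and the function name, tap count, and step size here are assumptions.

    import numpy as np

    def lms_filter(x, d, ntaps=32, mu=0.05):
        """Minimal LMS adaptive filter: x(n) is the reference input, d(n)
        the desired-response input; e(n) = d(n) - y(n) is fed back to
        adapt the FIR parameters toward minimum mean-square error."""
        x = np.asarray(x, dtype=float)
        w = np.zeros(ntaps)                   # FIR parameter estimates
        y = np.zeros(len(x))
        e = np.zeros(len(x))
        for n in range(ntaps - 1, len(x)):
            u = x[n - ntaps + 1:n + 1][::-1]  # newest-first state vector
            y[n] = w @ u                      # filter output y(n)
            e[n] = d[n] - y[n]                # prediction error e(n)
            w = w + mu * e[n] * u             # gradient (LMS) parameter update
        return y, e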
The processor's capability to perform modulo addressing
allows an address register (Rn) value to be incremented
(or decremented) and yet remain within an address range
of size L, where L is defined by a lower and an upper
address boundary.

Figure 1. Basic concept of the adaptive filter

For the adaptive FIR filter, L is the
number of coefficients (taps). The value L-1 is stored in
the processor’s Modifier Register (Mn). The upper ad-
dress boundary is calculated by the processor and is not
stored in a register. When modulo addressing is used, the
Address Register (Rn) points to a modulo data buffer
located in X-Memory and/or Y-Memory. The address
pointer (Rn) is not required to point at the lower address
boundary; it can point anywhere within the defined
modulo address range L. If the address pointer incre-
ments past the upper address boundary (base address plus
L-1 plus 1), it will wrap around to the base address.
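The wrap-around behavior can be mimicked in a few lines of host-side pseudocode (a Python analogy, not DSP56300 code; the base address and buffer size below are arbitrary):

    class ModuloAddress:
        """Analogy of DSP56300 modulo addressing: an address register
        Rn that stays within a range of size L after post-increment."""
        def __init__(self, base, L):
            self.base, self.L = base, L
            self.rn = base                    # address register Rn

        def post_increment(self):
            addr = self.rn                    # value used by the instruction
            offset = (self.rn - self.base + 1) % self.L
            self.rn = self.base + offset      # wraps past base + L - 1
            return addr

    r1 = ModuloAddress(base=0x100, L=8)       # L = NTAPS
    addresses = [r1.post_increment() for _ in range(10)]
    # -> 0x100..0x107, then wraps back to 0x100, 0x101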
Modulo Register M1 is programmed to the value
NTAPS-1 (modulo NTAPS). Address Register R1 is
programmed to point to the state variable modulo buffer
located in X-Memory. Modulo Register M4 is pro-
grammed to the value NTAPS-1. Address Register R4 is
programmed to point to the coefficient buffer located in
Y-Memory. Given that the FIR filter algorithm has been
executing for some time and is ready to process the input
sample x(n) in the Data ALU input Register X0, the ad-
dress in R4 is the base address (lower boundary) of the
coefficient buffer. The address in R1 is M, where M is
greater than or equal to the lower boundary of X-Memory
address and less than or equal to the upper boundary of
X-Memory address. The X-Memory map for the filter
states, the Y-Memory map for the coefficients, and the
contents of the processor’s A and B Accumulators and
Data ALU Input Registers X0, X1, Y0 and Y1 are shown
in Figure 2.

Figure 2. Memory map and data registers after last MAC instruction

The CLR instruction clears the A-Accumulator and simultaneously moves the input sample x(n)
from the Data ALU’s Input Register X0 to the
X-Memory location pointed to by address register R1,
and moves the first coefficient from the Y-Memory loca-
tion pointed to by address register R4 to the Data ALU’s
Input Register Y0. Both Address Registers R1 and R4
are automatically incremented by one at the end of the
CLR instruction (post-incremented). The REP instruction
regulates execution of NTAPS-1 iterations of the MAC
instruction. The MAC instruction multiplies the filter
state variable X0 by the coefficient in Y0, adds the
product to the A-Accumulator and simultaneously moves
the next state variable from the X-Memory location
pointed to by the Address Register R1 to the Input Reg-
ister X0, and moves the next coefficient from the
Y-Memory location pointed to by Address Register R4 to
Input Register Y0. Both Address Registers R1 and R4
are automatically incremented by one at the end of the
MAC instruction (post-incremented).
During the execution of the filter algorithm, Address
Register R4 is post-incremented a total of NTAPS
times: once in conjunction with the CLR instruction and
NTAPS-1 times (due to the REP instruction) in conjunc-
tion with the MAC instruction. Since the modulus for R4
is NTAPS and R4 is incremented NTAPS times, the ad-
dress value in R4 wraps around and points to the coeffi-
cient buffer’s lower boundary location [3]. Also Address
Register R1 is post-incremented a total of NTAPS times:
once in conjunction with the CLR instruction and
NTAPS-1 times (due to the REP instruction) in conjunc-
tion with the MAC instruction. Also at the beginning of
the algorithm, the input sample x(n) is moved from the
Data ALU Input Register X0 to the X-Memory location
pointed to by R1. Since the modulus for R1 is NTAPS
and R1 is incremented NTAPS times, the address value in
R1 wraps around and points to the state variable buffer's
X-Memory location M. The MACR instruction calculates
the final tap of the filter algorithm and performs conver-
gent rounding of the result. The data move portion of this
instruction loads the input sample x(n) into the B-Ac-
cumulator. At the end of the MACR instruction, the ac-
cumulator contains the filter output sample y(n) as shown
in Figure 3.
The two MOVE instructions transfer the loop gain K to
the Data Register Y1 and the error sample e(n) to the Data
Input Register X1. The first MOVE instruction in the "do
loop" transfers the parameter bi(n) to the A-Accumulator
and the filter state x(n-i) to the Data Input Register X0.
Address Register R1 is incremented by one to point to
the next filter state. The MAC instruction multiplies the
filter state, in X0, by the product of the loop g ain and the
error sample, in Y1, and adds the product to the A-Ac-
cumulator. The result in the A-Accumulator is the up-
dated parameter bi(n+1). The second Move instruction in
the “do loop” transfers the parameter bi(n+1) to the
Y-Memory location pointed to by the Address Register
R4. R4 is incremented by one to point to the next filter
parameter as shown in Figure 4. The LUA instruction
decrements R1 by one, and R1 then points to the state
variable buffer’s X-Memory location M-1. When the
algorithm is executed, a new (next) input sample x(n+1)
will overwrite the value in X-Memory location M-1.
Thus FIFO-like shifting of the filter state variables is
accomplished by adjusting the R1 address pointer as
shown in Figure 5.
Figure 3. Memory map and data registers after MACR
instruction
Figure 4. Memory map and data registers after last pass of
do loop
Figure 5. Memory map and data registers after LUA in-
struction
Consider the problem of finding the linear minimum
mean square estimate (LMMSE) of a zero-mean signal
vector, S, from a noisy zero-mean data vector, X = S + N,
where N denotes the additive noise vector. The LMMSE of
S is given in Equation (1), where A denotes a matrix of
filter coefficients as given in Equation (2):

$\hat{S} = A X$    (1)

$A = C_{SS}\,(C_{SS} + C_{nn})^{-1}$    (2)

Here, $C_{SS}$ and $C_{nn}$ denote the covariance matrices of signal and noise, respectively. Notice that if X has a non-zero mean vector, $\mu$, Equation (1) becomes:

$\hat{S} = \mu + C_{SS}\,(C_{SS} + C_{nn})^{-1}\,(X - \mu)$    (3)

For point-wise processing of a non-stationary signal with a local mean, $\mu_S$, and a local variance, $\sigma_S^2$, and noise that is zero-mean and white with a local variance, $\sigma_n^2$, the point-wise LMMSE is given by:

$\hat{S} = \mu_S + \dfrac{\sigma_S^2}{\sigma_S^2 + \sigma_n^2}\,(x - \mu_S)$    (4)
Here $\sigma_n^2$ is constant, while $\sigma_S^2$ and $\mu_S$ vary with the time index, k. Thus the filtered estimate at time k can be written as:

$\hat{S}(k) = \mu_S(k) + \dfrac{\sigma_S^2(k)}{\sigma_S^2(k) + \sigma_n^2}\,\bigl(x(k) - \mu_S(k)\bigr)$    (5)

where $\mu_S(k)$ and $\sigma_S^2(k)$ denote the time-varying estimates of the local mean and the local variance of S(k).
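As an illustration of Equations (4)-(5), the following sketch applies the point-wise LMMSE (Lee) filter to a 1-D signal with sliding-window estimates of the local statistics. The window length and the assumed-known noise variance are hypothetical parameters; the need to know the noise power in advance is precisely the limitation addressed next.

    import numpy as np

    def lee_filter(x, window=7, noise_var=0.01):
        """Point-wise LMMSE of Equations (4)-(5) with sliding-window
        estimates of the local mean and local variance."""
        x = np.asarray(x, dtype=float)
        pad = window // 2
        xp = np.pad(x, pad, mode="edge")
        out = np.empty_like(x)
        for k in range(len(x)):
            seg = xp[k:k + window]
            mu_s = seg.mean()                        # local mean estimate
            var_s = max(seg.var() - noise_var, 0.0)  # local signal variance
            gain = var_s / (var_s + noise_var)       # sigma_s^2 / (sigma_s^2 + sigma_n^2)
            out[k] = mu_s + gain * (x[k] - mu_s)     # Equation (5)
        return out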
An improved version of Lee's adaptive Wiener filter has been proposed by Das [4]. The main contributions of this algorithm include a better technique for estimation of the noise variance, and the incorporation of a data window for adaptive filtering. Lee's adaptive Wiener filter suffers from two major drawbacks. First, it requires prior knowledge of the noise power; second, its performance deteriorates when the signal-to-noise ratio (SNR) is low and the noise power is imprecisely known. The improved Wiener filter incorporates two modifications. First, the de-noising performance of the filter is improved by introducing a non-rectangular window to process weighted data samples; second, a scheme for online estimation of the noise power is incorporated, based on analyzing the power spectral density, S(ω), of the data. Assuming that the observed data consist of predominantly low-frequency signal components and additive white noise, S(ω) can be modeled as a sum of the spectral density of the signal and a constant, $\sigma_n^2$, which represents the variance of the noise. The estimated $\sigma_n^2$ is the average value of the high-frequency section of S(ω) [2]. The improved Wiener filtering can be done in a fashion similar to that of Lee's Wiener filter, but Equation (1) now takes the form $\hat{S} = AWX$, where A denotes a matrix of filter coefficients and W is a (diagonal) data weighting matrix. The LMMSE of S is now given by Equation (6), where $X_W = WX$; similarly, the point-wise LMMSE is given by Equation (7):

$\hat{S} = C_{SS}\,(C_{SS} + C_{nn})^{-1}\,X_W$    (6)

$\hat{S} = \mu_S + \dfrac{\sigma_S^2}{\sigma_S^2 + \sigma_n^2}\,\bigl(x_W - \mu_S\bigr)$    (7)
3. FPGAs Adaptive Filter Design
The efficient realization of complex algorithms on
FPGAs requires a familiarity with their specific archi-
tectures. The modifications needed to implement an al-
gorithm on an FPGA and also the specific architectures
for adaptive filtering and their advantages are given be-
low.
3.1 FPGA Realization Issues
FPGAs are ideally suited for the implementation of adaptive
filters. However, there are several issues that need to
be addressed. When performing software simulations of
adaptive filters, calculations are normally carried out with
floating point precision. Unfortunately, the resources
required of an FPGA to perform floating point arithmetic
are normally too large to be justified. Another concern is the filter tap itself. Numerous techniques have been devised to efficiently calculate the convolution operation when the filter's coefficients are fixed in advance. For an adaptive filter whose coefficients change over time, these methods will not work or need to be modified significantly [5]. The reconfigurable filter tap is the most important issue for a high performance adaptive filter architecture, and as such it will be discussed at length.
3.2 Finite Precision Effects
Although computing floating point arithmetic in an FPGA is
possible, it is usually accomplished with the inclusion of
custom floating point units, which are costly in terms of
logic resources. Therefore, only a small number of floating
point units can be used in the entire design, and they must be
shared between processes. This does not take full advantage
of the parallelization that is possible with FPGAs
and is therefore not the most efficient method. All calculations
should therefore be mapped into fixed point only,
but this can introduce some errors. The main errors in
DSP include ADC quantization error, coefficient quantization
error, overflow error caused by an impermissible word
length, and round-off error. The other three issues will be
addressed later.
3.2.1 Scale Factor Adjustment
A suitable compromise for dealing with the loss of precision when transitioning from a floating point to a fixed-point representation is to keep a limited number of decimal digits. Normally, two to three decimal places is adequate, but the number required for a given algorithm to converge must be found through experimentation. When performing software simulations of a digital filter, for example, it may be determined that two decimal places is sufficient for accurate data processing. This can easily be obtained by multiplying the filter's coefficients by 100 and truncating to an integer value. Dividing the output by 100 recovers the anticipated value. Since multiplying and dividing by powers of two can be done easily in hardware by shifting bits, a power of two should be used to simplify the process. In this case, one would multiply by 128, which requires seven extra bits in hardware. If it is determined that three decimal digits are needed, then ten extra bits are needed in hardware, while one decimal digit requires only four bits. For simple convolution, multiplying by a preset scale and then dividing the output by the same scale has no effect on the calculation. For a more complex algorithm, there are several modifications that are required for this scheme to work [6]. The first change needed to maintain the original algorithm's consistency requires dividing by the scale constant any time two previously scaled values are multiplied together.
Consider, for example, the values a and b and the scale constant s; the scaled integer values are represented by $as$ and $bs$. Multiplying these values requires dividing by s to correct for the $s^2$ term that would be introduced, and to recover the scaled product $abs$:

$\dfrac{(as)(bs)}{s} = abs$    (8)
Likewise, division must be corrected with a subsequent multiplication. It should now be evident why a power of two is chosen for the scale constant, since multiplication and division by a power of two result in simple bit shifting. Addition and subtraction require no additional adjustment. The aforementioned procedure must be applied with caution, however, and does not work in all circumstances. While it is perfectly legal to apply it to the convolution operation of a filter, it may need to be tailored for certain aspects of a given algorithm. Consider the tap-weight adaptation equation for the LMS algorithm in Equation (9):

$\hat{w}(n+1) = \hat{w}(n) + \mu\,u(n)\,e(n)$    (9)

where μ is the learning rate parameter; its purpose is to control the speed of the adaptation process. The LMS algorithm is convergent in the mean square provided that the condition in Equation (10) holds:

$0 < \mu < \dfrac{2}{\lambda_{MAX}}$    (10)

where $\lambda_{MAX}$ is the largest eigenvalue of the correlation matrix $R_x$ of the filter's input. Typically this is a
fraction value and its product with the error term has the
effect of keeping the algorithm from diverging. If µ is
blindly multiplied by some scale factor and truncated to a
fixed-point integer, it will take on a value greater than
one. The effect will be to make the LMS algorithm di-
verge, as its inclusion will now amplify the added error
term. The heuristic adopted in this case is to divide by
the inverse value, which will be greater than one. Simi-
larly, division by values smaller than one should be re-
placed by multiplication with its inverse. The outputs of the algorithm will then need to be divided by the scale to obtain the true output. The following algorithm describes the fixed point conversion (a compact code sketch of these rules follows the list):

Determine Scale
- Through simulations, find the needed accuracy (# decimal places).
- Scale = accuracy rounded up to a power of two.
- Multiply all constants by scale.
Replace
- Divide by scale when two scaled values are multiplied.
- Multiply by scale when two scaled values are divided.
For multiplication by values less than 1
- Replace with division by the reciprocal value.
Likewise, for division by values less than 1
- Replace with multiplication by the reciprocal value.
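The rules above can be condensed into a toy sketch (values and helper names are hypothetical; with the scale a power of two, the corrections are single shifts):

    SCALE_BITS = 7
    SCALE = 1 << SCALE_BITS            # 128: roughly two decimal digits

    def to_fixed(v):                   # float -> scaled integer
        return int(round(v * SCALE))

    def fixed_mul(a, b):               # divide by scale once: cancels the s^2 term
        return (a * b) >> SCALE_BITS

    def fixed_div(a, b):               # multiply by scale once: restores the lost s
        return (a << SCALE_BITS) // b

    a, b = to_fixed(1.23), to_fixed(0.57)
    print(fixed_mul(a, b) / SCALE)     # ~0.70, i.e. 1.23 * 0.57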
3.2.2 Training Algorithm Modification
The training algorithms for the adaptive filter need some
minor modifications in order to converge for a fixed-
point implementation. Changes to the LMS weight up-
date equation were discussed in the previous section.
Specifically, the learning rate µ and all other constants should be multiplied by the scale factor. When µ is adjusted it takes the form in Equation (11). With µ modified, the weight update equation can be rewritten as in Equation (12):

$\hat{\mu} = \dfrac{1}{\mu}\cdot scale$    (11)

$\hat{w}(n+1) = \hat{w}(n) + \dfrac{u(n)\,e(n)}{\hat{\mu}}$    (12)
The direct form FIR structure has a delay that is determined by the depth of the output adder tree, which is dependent on the filter's order. The transposed FIR, on the other hand, has a delay of only one multiplier and one adder, regardless of the filter length. It is therefore advantageous to use the transposed form for FPGA implementation to achieve maximum bandwidth. Figure 6 shows the direct and Figure 7 shows the transposed FIR structures for a three tap filter. The relevant nodes have been labeled A, B and C for a data flow analysis. Each filter has three coefficients, labeled h0[n], h1[n] and h2[n]. The coefficients' subscript denotes the relevant filter tap, and the n subscript represents the time index, which is required since adaptive filters adjust their coefficients at every time instance.

Figure 6. Direct form FIR structure

Figure 7. Transposed form FIR structure

The direct FIR structure is shown in Figure 6 and the output y at any time n is given by Equation (13), where nodes B and C are described in Equations (14) and (15) respectively:

$y[n] = A[n] = x[n]\,h_0[n] + B[n]$    (13)

$B[n] = x[n-1]\,h_1[n] + C[n]$    (14)

$C[n] = x[n-2]\,h_2[n]$    (15)

$y[n] = x[n]\,h_0[n] + x[n-1]\,h_1[n] + x[n-2]\,h_2[n]$    (16)

$y[n] = \sum_{k=0}^{2} x[n-k]\,h_k[n]$    (17)

The transposed FIR structure is shown in Figure 7 and the output y at any time n is given below:

$y[n] = x[n]\,h_0[n] + B[n-1]$    (18)

$B[n] = x[n]\,h_1[n] + C[n-1]$    (19)

$C[n] = x[n]\,h_2[n]$    (20)

$y[n] = x[n]\,h_0[n] + x[n-1]\,h_1[n-1] + x[n-2]\,h_2[n-2]$    (21)

$y[n] = \sum_{k=0}^{2} x[n-k]\,h_k[n-k]$    (22)
Compared with the direct FIR output, the difference in the [n-k] index of the coefficient indicates that the two filters produce equivalent output only when the coefficients don't change with time. This means that if the transposed FIR architecture is used, the LMS algorithm will not converge in the same way as when the direct implementation is used [7]. The change needed is to account for the delayed weights as shown in Equation (23). A suitable approximation is to update the weights only at every Nth input, where N is the length of the filter. This obviously will converge N times slower, and simulations show that it never converges with results as good as the traditional LMS algorithm. It may still be acceptable, though, due to the increased bandwidth of the transposed form FIR, when high convergence rates are not required:

$\hat{w}(n+1+M) = \hat{w}(n+M) + \dfrac{u(n)\,e(n)}{\hat{\mu}}$    (23)
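A small sketch of the transposed data flow (illustrative, not this study's VHDL) shows how the registers hold running partial sums so that each output needs only one multiply and one add in the critical path; with a 3-tap h it reproduces Equations (18)-(20):

    import numpy as np

    def transposed_fir(x, h):
        """Transposed-form FIR (Figure 7 style); assumes len(h) >= 2."""
        h = np.asarray(h, dtype=float)
        regs = np.zeros(len(h) - 1)          # z^-1 registers between the adders
        y = np.zeros(len(x))
        for n, xn in enumerate(x):
            p = xn * h                       # x[n] * h_k for every tap at once
            y[n] = p[0] + regs[0]            # Equation (18): output node
            regs[:-1] = p[1:-1] + regs[1:]   # Equation (19): propagate partial sums
            regs[-1] = p[-1]                 # Equation (20): deepest register
        return y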
3.3 Implementing Adaptive Noise Filter with
FPGAs
Adaptive noise filtering techniques are applied to low
frequency like voice signals, and high frequency signals
such as video streams, modulated data, and multiplexed
data coming from an array of sensors. Unfortunately in
all high frequency and high speed applications, a soft-
ware implementation of the adaptive noise filtering usu-
ally doesn’t meet the required processing speed, unless a
high end DSP processor is used. A convenient solution
can be represented by a dedicated hardware implementa-
tion using a Field Programmable Gate Array (FPGA). In
this case the limiting factor is the number of multiplications required by the adaptive noise cancellation algorithm. By using a novel modified version of the LMS algorithm, the proposed implementation allows the use of a reduced number of hardware multipliers. Moreover, experimental data showed that the modified algorithm achieves the same or even better performances than the standard LMS version. There are many possible implementations for an adaptive noise filter, but the most widely used employs a Finite Impulse Response (FIR) digital filter, whose coefficients are iteratively updated using the LMS algorithm. The algorithm is described in Equations (24) to (26), leading to the evaluation of the FIR output, the error, and the weights update:

$Y_i = X_i^{T}\,W_i$    (24)

$e_i = D_i - Y_i$    (25)

$W_{i+1} = W_i + 2\mu\,e_i\,X_i$    (26)
In the above equations, Xi is a vector containing the
reference noise samples, Di is the primary input signal,
Wi is the filter weights vector at the ith iteration, and ei is
the error signal. The µ coefficient is often empirically
chosen to optimize the learning rate of the LMS algo-
rithm. The hardware implementation of the algorithm in
an FPGA device is not trivial, since the FIR filter does not
have constant coefficients, so the multipliers cannot be synthesized
by using a look-up table (LUT) based approach,
which would otherwise be straightforward in an FPGA architecture.
Multipliers with changing inputs instead need to
be built by using a significantly greater number of inter-
nal logic resources (either elementary logic blocks or
embedded multipliers). In an Nth order filter the algorithm requires at least 2N multiplications and 2N additions. Note that the factor 2µ is usually chosen to be a power of two in order to be executed by shifting. This makes a fully parallel hardware implementation of the algorithm impractical as the value of N grows, due to the huge number of multipliers required. In order to reduce the complexity of the algorithm, the weights update expression (Equation (26)) is simplified as in Equation (27):

$W_{i+1} = W_i + \alpha_i\,e_i\,\mathrm{sgn}(X_i)$    (27)

As a consequence the weights are updated using a factor proportional to the error and the sign of the current reference noise sample, instead of its value. This implies that the weights can be updated by using an addition (or subtraction) instead of a multiplication. This simplified algorithm requires only N multiplications and 2N additions. However, the simplification of the weights update rule usually results in worse learning performance, i.e. in a slower adaptation capability of the filter. To overcome this weakness, and significantly improve the learning characteristics, a dynamic learning rate coefficient α has been used. Generally this can be done by updating it with an adaptive rule, or by using a heuristic function. Simulations of the above mentioned method show that a dynamic learning rate gives an advantage not only in the learning characteristics, but also in the accuracy of the final solution (in terms of the improvement of the signal to noise ratio of the steady state solution). The product αei is used to update all weights, so only one additional multiplication is required. A sketch of the simplified update appears below.
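A per-iteration sketch of the simplified update (hypothetical function names; sgn follows NumPy's sign convention):

    import numpy as np

    def modified_lms_step(w, x_vec, d, alpha):
        """One iteration of the modified LMS: the weight update of
        Equation (27) uses only the sign of the reference samples, so
        per-tap multiplies become adds/subtracts; alpha*e is the single
        extra multiplication shared by all weights."""
        y = w @ x_vec                         # FIR output, Equation (24)
        e = d - y                             # error, Equation (25)
        w = w + (alpha * e) * np.sign(x_vec)  # Equation (27)
        return w, e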
3.4 Architecture for Implementation on FPGA
The architecture of the adaptive noise filtering based on
the modified LMS algorithm is shown in Figure 8. It was
designed to implement a 32-tap adaptive noise filter in a
medium density FPGA device. It has a modular and
scalable structure composed of 8 parallel stages, each
one capable of executing 1 to 4 multiply and accumulate
(MAC) operations and weights update. By controlling
the number of operations performed by each block, it is possible to implement an adaptive filter whose order can range from 8 to 32. In the first case, by exploiting maximum parallelism, the filter is capable of processing a data sample per clock cycle. In the other cases 2 to 4 clock cycles are required. Some of the FPGA's internal RAM blocks were used to implement the tap delays and to store the weight coefficients. Each weights update block is mainly composed of an adder/subtractor accumulator. The weights update coefficients Δi are computed by a separate block, which also handles the learning rate update function, following the above mentioned heuristic algorithm, and implements its multiplication with the error signal. By slightly modifying this unit, a more sophisticated adaptation function can easily be obtained, thus enhancing the performance of the adaptive noise filtering for non-stationary signals.

Figure 8. Architecture of the modified LMS filter
4. Simulations and Results
Adaptive noise filters have been implemented on DSPs
and FPGAs. Motorola DSP56303 has been used for DSP
platform, while Xilinx Spartan III boards are used to im-
plement FPGA adaptive noise filtering. Matlab Simulink
has been used to test the effectiveness and correctness of
the adaptive filters before hardware implementation.
4.1 Matlab Simulink Simulations and Results
To test the theory and to see the improvements visually, the adaptive filters proposed by Lee and by Das have been compared through Matlab Simulink (see Figure 9). The target Simulink model is responsible for code generation whereas the host Simulink model is responsible for testing. The host drives the target model with heavy wavelet noisy test data consisting of 4096 samples generated from the wnoise function in Matlab. Matlab's fdatool is used for designing the bandpass filter to color the noise source. A colored Gaussian noise is then added to the input test signal. This noisy signal and the reference noise are the inputs to the terminals of the LMS filter Simulink block. Figure 10 shows the desired signal (top), the received signal (middle), and the output (bottom).

Figure 9. Block diagram of the Matlab Simulink model

Figure 10. Desired signal (top), received signal (middle), output (bottom)

This code has been implemented in the C programming language. The LMS filter is placed in the virtual internal RAM of the Simulink model. In the code, breakpoints are placed in the corresponding section of the code where the FIR filtering takes place. It takes 46, 213, and 266 clock cycles to run the filtering section. The computation time is the measured clock cycles divided by 225 MHz, which is the virtual clock speed; the execution time is 20 µs. The implementation of the LMS filter takes a worst case time of 38.95 ms to compute the filtering of a heavy sine noisy signal consisting of 4096 samples per frame. Figure 11 shows the comparison between the Das proposal of the Wiener filter and Lee's Wiener filter proposal in terms of signal to noise ratio. As can be seen from Figure 11, the performance of the Das proposal is higher than that of Lee's Wiener filter. The improved adaptive Wiener filter provides an SNR improvement of 2.5 to 4 dB as compared to Lee's adaptive Wiener filter.
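For reference, an SNR comparison like the one in Figure 11 can be computed with a generic helper of this form (not the study's Matlab code; the dB definition is the conventional one):

    import numpy as np

    def snr_db(clean, estimate):
        """SNR (dB) of a filtered estimate measured against the clean signal."""
        clean = np.asarray(clean, dtype=float)
        residual = np.asarray(estimate, dtype=float) - clean
        return 10.0 * np.log10(np.sum(clean ** 2) / np.sum(residual ** 2))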
4.2 Motorola DSP56300 Results
The DSP system consists of two analog-to-digital (A/D) converters and two digital-to-analog (D/A) converters. The DSP56303EVM evaluation module is used to provide and control the DSP56300 processor, the two A/D converters, and the two D/A converters. The left analog input signal x(t) consists of the desired input signal s(n) plus a white noise signal w(n). The left analog input signal x(t) is first digitized using the A/D converter on the evaluation board. As the DSP processor executes the adaptive filter algorithm to process the left digitized input signal x(n), the left and right output signals y1(n) and y2(n) are generated. The left output signal y1(n) is the error signal. The right output signal y2(n) is the filtered version of the left digitized input signal x(n), which is an estimate of the desired input signal s(n). The two D/A converters on the evaluation board are then used to convert the left and right digital output signals y1(n) and y2(n) to the left and right analog output signals y1(t) and y2(t).
Figure 11. SNR performance comparison between Lee and
Das proposals
The continuous analog signal was sampled at a rate of twice the highest frequency present in the spectrum of the sampled analog signal in order to accurately recreate the analog audio signal from the discrete samples. The analog audio signal was mixed with noise using a sum block, emulating the noise that is bound to occur when the audio signal passes through the channel. The noise, however, was first low-pass filtered using a finite impulse response filter to make it finite in bandwidth. The FIR noise filter was observed to have little or no significant effect on the signal with noise. The information bearing signal is a sine wave of 0.055 cycles/sample, shown in Figure 12. The noise picked up by the secondary microphone is the input for the adaptive filter, as shown in Figure 13. The noise that corrupts the sine wave is a low pass filtered version of the noise. The sum of the filtered noise and the information bearing signal is the desired signal for the adaptive filter. The noise corrupting the information bearing signal is a filtered version of the noise, as shown in Figure 14. Figure 15 shows that the adaptive filter converges and follows the desired filter response. The filtered noise should be completely subtracted from the signal-noise combination, and the error signal should contain only the original signal. The results can be seen in Figures 12 to 16.
Figure 12. Plot showing the input signal

Figure 13. Plot of the noise signal

Figure 14. Noise corrupting the original signal

Figure 15. Convergence of the adaptive filter response to the response of the FIR filter

Figure 16. Plot of the error and the original signal

4.3 Xilinx Spartan III Results
The algorithms for adaptive filtering were coded in Matlab and experimented with to determine optimal parameters such as the learning rate for the LMS algorithm. After the parameters had been determined, the algorithms were coded for the Xilinx device in the VHDL language.

4.3.1 Standard LMS Algorithm Results
The desired signal output was a sine wave, and it was corrupted by a higher frequency sinusoid and random Gaussian noise with a signal to noise ratio of 5.86 dB. The input signal can be seen in Figure 17. A direct form FIR filter of length 32 is used to filter the input signal. The adaptive filter is trained with the LMS algorithm with a learning rate of
. It appears that the filter with the
standard LMS has learned the signal statistics
and is filtering within 200-250 iterations. Since te re-
that the clock for standard
LMS algorithm is 25 MHz. The input and output sig-
nals fhe standard LMS algorithms are given in Fig-
ures and 18.
4.3.2odified LMS Algorithm Results
The se reduction obtained by both the standard LMS
algorithm and the modified algorithm as applied to a sta-
algorithm h
sults have shown that the standard LMS algorithm re-
moves the noise from the signal, the next section. The
timing analyzer has showed
or t
17
M
noi
Figure 17. Input signal for standard LMS algorithm
Voltage (V)
Voltage (V)
Voltage (V)
Voe (V)
) Time (s
ltag
Time (s)
Time (s)
Time (s) Time (s)
Copyright © 2010 SciRes JSEA
DSPs/FPGAs Compara tive Study for Power Consumption, Noise Cancellation, and Real Time High Speed Applications
400
Figure 18. Output signal for standard LMS algorithm
tionary signal composed by 3 frequencies, corrupted by a
random Gaussian noise, with signal to noise ratio of 5.86
dB were studied. Both algo rithms used 16 bit fixed point
representation for data and filter coefficients [14]. The
frequency spectrum of the original signal, standard LMS,
and modified LMS filter are given in Figure 19. The
modified LMS used a dynamic learning rate coefficient α based on a heuristic function formerly proposed by Widrow [8], consisting of a 1/n decaying function whose coefficients were approximated by a piecewise linear curve, starting from the value 0.1 down to 0.001 (in about 1000 steps; a code sketch of such a schedule is given at the end of this section). This heuristic function achieved a faster convergence and less gradient noise, and it has proved to be effective when applied to stationary signals. On the other hand, the standard LMS used a static learning rate, with the best performances obtained by setting the µ parameter equal to 0.05. The two algorithms reported noise attenuations greater than 40 dB and 36 dB, respectively. As can be seen from the two learning characteristics in Figure 20, the modified LMS offered a faster convergence. A large class of signals (either stationary or short-term stationary) and noises showed similar simulation results.

Figure 19. Frequency spectrum of a signal processed with the standard and modified LMS

Figure 20. Learning characteristics of both LMS algorithms

The adaptive noise filtering was implemented using a 16-bit 2's complement fixed point representation for samples and weights. The floor-planned design required 1776 slices (logic blocks) of the 3072 available (about 57%), and allowed a running clock frequency of 50 MHz (with a non-optimized, fully automatic place & route process). The standard LMS implementation would require 2750 slices (89%) and would run at less than 25 MHz (due mainly to routing congestion). The assembly file used for the simulation is given in Appendix A; the full assembly code is provided elsewhere [24].
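The decaying learning-rate schedule described above might be generated as follows (a sketch; the text states only the 0.1 to 0.001 endpoints, the roughly 1000 steps, and the piecewise-linear approximation, so the breakpoint count is an assumption):

    import numpy as np

    def alpha_schedule(n_steps=1000, start=0.1, stop=0.001, segments=8):
        """Piecewise-linear approximation of a 1/n-decaying learning rate."""
        n = np.arange(1, n_steps + 1)
        knots = np.linspace(1, n_steps, segments + 1)    # breakpoint iterations
        knot_vals = np.clip(start / knots, stop, start)  # sample the 1/n decay
        return np.interp(n, knots, knot_vals)            # linear between knots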
5. Conclusions
As discussed in the previous sections, adaptive noise filtering can be implemented both in DSP processors like the Motorola DSP56300 series and in Field Programmable Gate Arrays such as Xilinx Spartan III boards. In high performance signal processing applications, FPGAs have several advantages over high-end DSP processors. The literature survey has shown that high-end FPGAs have a huge throughput advantage over high performance DSP processors for certain types of signal processing applications. FPGAs use highly flexible architectures, which can be their greatest advantage over regular DSP processors. However, FPGAs come with a hardware cost. The flexibility comes with a great number of gates, which means more silicon area, more routing, and higher power consumption. DSP processors are highly efficient for common DSP tasks, but in a DSP only a tiny fraction of the silicon area is dedicated to computation purposes; most of the area is designated for instruction codes and data moving. In high performance signal processing applications like video processing, FPGAs can take on highly parallel architectures and offer much higher throughput as compared to DSP processors. As a result an FPGA's overall energy consumption may be significantly lower than a DSP processor's, in spite of the fact that its chip-level power consumption is often higher. DSP processors can consume 2-3 watts, while FPGAs can consume on the order of 10 watts. With the pipeline technique, more computation area, and more gates, FPGAs can process more channels at the same time; thus the power consumption per channel is significantly less in FPGAs [15]. DSPs
are specialized forms of microprocessors, while FPGAs
are a form of highly configurable hardware. In the
past, the usage of DSPs has been nearly ubiquitous, but
with the needs of many applications outstripping the
processing capabilities (MIPS) of DSPs, the use of
FPGAs has become very prevalent. It has generally come
to be expected that all software (DSP code is considered
a type of software) will contain some bugs and that the
best that can be done is to minimize them. Common DSP software bugs are caused by: failure of interrupts to completely restore processor state upon completion; non-uniform assumptions regarding processor resources by multiple engineers simultaneously developing and integrating disparate functions; blocking of a critical interrupt by another interrupt or by an uninterruptible process; undetected corruption or non-initialization of pointers; failure to properly initialize or disable circular buffering addressing modes; memory leaks, the gradual consumption of available volatile memory due to failure of a thread to release all memory when finished; dependency of DSP routines on specific memory arrangements of variables; use of special DSP "core mode" instruction options; conflict or excessive latency between peripheral accesses, such as DMA, serial ports, L1, L2, and external SDRAM memories; corrupted stacks or semaphores; subroutine execution times dependent on input data or configuration; mixtures of "C" or other high-level language subroutines with assembly language subroutines; and pipeline restrictions of some assembly instructions [15]. Both FPGA and DSP implementation routes offer
the option of using third party implementation for com-
mon signal processing algorithms, interfaces and proto-
cols. Each offers the ability to reuse existing IP in the
future designs. FPGAs are a more natural implementation vehicle for many DSP algorithms. Figures 21 and 22 give the block diagrams of the DSP and FPGA, respectively.
The Motorola DSP56300 series can only do one arithmetic computation and two move instructions at a time. However, in the case of FPGAs, each task can be computed by its own configurable core with its designated input and output interface.

Figure 21. Digital signal processor block diagram

Figure 22. FPGA block diagram
5.1 Speed Comparison
Speed is one of the most important factors that determine the computation time, and it is also one of the most important factors in the market. In adaptive filters the parameters are updated with each iteration, and after each iteration the error between the input and the desired signal gets smaller. After some number of iterations the error becomes zero and the desired signal is achieved. According to the specifications in the manufacturer manuals, the Motorola DSP56300 series has a CPU clock of 100 MHz, but this speed depends on the instruction fetch, the computation speed, and also the speed of the peripherals. On the DSP56303EVM board the audio codec runs at 24.57 MHz; this clock speed is determined by an external crystal. On the other hand, the Xilinx Spartan 3 has a maximum clock frequency of 125 MHz, but this speed can be reduced because of the number of instructions, the gates, and the congestion in the routing of the signals. Both of the modified adaptive noise filtering applications take about 200-250 iterations to cancel the noise and achieve the desired signal. In the Motorola DSP processor case, because of the lower actual clock speed, causality conditions, and the speed limitation coming from the audio codec part of the board, the running speed of the modified LMS algorithm is 20 MHz, whereas in the case of the FPGAs the running speed is around 50 MHz. This follows from the discussion in the previous sections: the FPGA's flexibility and reconfigurable gates allow the clock to be faster. A back-of-envelope comparison follows.
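As a rough illustration of what the quoted clocks imply, assume (hypothetically) one LMS iteration per clock cycle:

    ITERATIONS = 250                       # observed iterations to cancel the noise
    for name, clock_hz in [("DSP56303 (effective)", 20e6), ("Spartan III", 50e6)]:
        t_us = ITERATIONS / clock_hz * 1e6
        print(f"{name}: {t_us:.1f} microseconds to converge")
    # DSP56303 (effective): 12.5 microseconds; Spartan III: 5.0 microseconds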
5.2 General Conclusion
As discussed in the previous sections, we have shown the differences between DSP processors and FPGAs. As far as power and cost are concerned, DSP processors in general have lower power consumption, which makes them suitable for battery-powered applications. Typical examples are audio applications: voice applications are very straightforward and do not require sophisticated pipelining or parallel moves. Audio applications are often filtering applications, used especially in voice transmission lines and cell phones. When it comes to high frequency and high speed applications, however, DSP processors have some restrictions compared to FPGAs: FPGAs are much faster than DSP processors. FPGAs can also offer more channels, and because they offer more channels, the cost per channel is lower than that of DSPs. Also, the partitioning of an FPGA can offer more throughput as compared to DSP processors. Thus FPGAs can handle multiple tasks when their controls and finite state machines are configured correctly.
According to our study, the final conclusion is that for simple audio applications like adaptive noise cancelling, the Motorola DSP56300 is more beneficial, because the requirements for audio applications are met with DSP processors; they are also more power efficient and can be used for battery powered devices. But when adaptive noise filtering is considered in high speed applications like video streaming and multiplexed array signals, FPGAs offer a faster approach and thus they are more suitable for high frequency applications.
5.3 Future Work
In the future, the adaptive noise filtering can be imple-
mented on high frequency applications, such as noise
removal from video streaming and noise removal from
multiplexed data arrays. These applications may be ap-
plied first to FPGAs with Verilog HDL or VHDL. After
the application has been verified, the hardware code can be converted to a netlist, and through Synopsys a custom ASIC design can be created. The ASIC design and the FPGA design may then be compared in terms of cost, power, architecture, noise removal, and speed. These comparisons would help provide a more educated choice for future applications.
REFERENCES
[1] A. Di Stefano, A. Scaglione and C. Giaconia, "Efficient FPGA Implementation of an Adaptive Noise Canceller," Proceedings of the Seventh International Workshop on Computer Architecture for Machine Perception, Palermo, 2005, pp. 87-89.
[2] M. El-Sharkawy, "Digital Signal Processing Applications with Motorola's DSP56002 Processor," Prentice Hall, Upper Saddle River, 1996.
[3] K. Joonwan and A. D. Poularikas, "Performance of Noise Canceller Using Adjusted Step Size LMS Algorithm," Proceedings of the Thirty-Fourth Southeastern Symposium on System Theory, Huntsville, 2002, pp. 248-250.
[4] R. M. Mersereau and M. J. T. Smith, "Digital Filtering: A Computer Laboratory Textbook," John Wiley & Sons, Inc., New York, 1994.
[5] J. Proakis and D. Manolakis, "Digital Signal Processing: Principles, Algorithms, and Applications," 4th Edition, Pearson Prentice Hall, Upper Saddle River, 2007.
[6] G. Saxena, S. Ganesan and M. Das, "Real Time Implementation of Adaptive Noise Cancellation," IEEE International Conference on Electro/Information Technology (EIT 2008), Ames, 2008, pp. 431-436.
[7] K. L. Su, "Analog Filters," Chapman & Hall, London, 1996.
[8] B. Widrow, J. R. Glover, Jr., J. M. McCool, J. Kaunitz, C. S. Williams, R. H. Hearn, J. R. Zeidler, E. Dong, Jr., and R. C. Goodlin, "Adaptive Noise Cancelling: Principles and Applications," Proceedings of the IEEE, Vol. 63, 1975, pp. 1692-1716.
[9] S. M. Kuo and D. R. Morgan, "Active Noise Control: A Tutorial Review," Proceedings of the IEEE, Vol. 87, No. 6, June 1999, pp. 943-973.
[10] K. C. Zangi, "A New Two-Sensor Active Noise Cancellation Algorithm," IEEE International Conference on Acoustics, Speech, and Signal Processing, Minneapolis, Vol. 2, 1993, pp. 351-354.
[11] A. V. Oppenheim, E. Weinstein, K. C. Zangi, M. Feder and D. Gauger, "Single-Sensor Active Noise Cancellation," IEEE Transactions on Speech and Audio Processing, Vol. 2, 1994, pp. 285-290.
[12] T. H. Yeap, D. K. Fenton and P. D. Lefebvre, "Novel Common Mode Noise Cancellation Techniques for xDSL Applications," Proceedings of the 19th IEEE Instrumentation and Measurement Technology Conference, Anchorage, Vol. 2, 2002, pp. 1125-1128.
[13] Xilinx Corp., "Spartan-IIE 1.8V FPGA Family: Functional Description," November 2002.
[14] B. Dukel, M. E. Rizkalla and P. Salama, "Implementation of Pipelined LMS Adaptive Filter for Low-Power VLSI Applications," The 45th Midwest Symposium on Circuits and Systems, Tulsa, Vol. 2, 2002, pp. II-533-II-536.
[15] M. Das, "An Improved Adaptive Wiener Filter for De-Noising and Signal Detection," International Association of Science and Technology for Development, International Conference on Signal and Image Processing, Honolulu, 2005, p. 258.
[16] K. Schutz, "Code Verification Using RTDX," MathWorks Matlab Central File Exchange.
[17] S. Haykin, "Adaptive Filter Theory," Prentice Hall, Englewood Cliffs, 1991.
[18] D. L. Donoho and I. M. Johnstone, "Ideal Spatial Adaptation by Wavelet Shrinkage," Biometrika, Vol. 81, September 1994, pp. 425-455.
[19] J. Petrone, "Adaptive Filter Architectures for FPGA Implementation," Master's Thesis, Department of Electrical and Computer Engineering, Florida State University, Tallahassee, 2004.
[20] S. Manikandan and M. Madheswaran, "A New Design of Adaptive Noise Cancellation for Speech Signals Using Grazing Estimation of Signal Method," International Conference on Intelligent and Advanced Systems, Kuala Lumpur, 2007, pp. 1265-1269.
[21] C.-M. Kim, H.-M. Park, T. Kim, Y.-K. Choi and S.-Y. Lee, "FPGA Implementation of ICA Algorithm for Blind Signal Separation and Adaptive Noise Canceling," IEEE Transactions on Neural Networks, Vol. 14, 2003, pp. 1038-1046.
[22] S. M. Kay, "Fundamentals of Statistical Signal Processing," Prentice Hall, Upper Saddle River, 1996.
[23] J.-S. Lee, "Digital Image Enhancement and Noise Filtering by Use of Local Statistics," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. PAMI-2, 1980, pp. 165-168.
[24] A. Halim, "Real Time Noise Cancellation Field Programmable Gate Arrays," MSECE Thesis, Purdue University, Lafayette, 2009.

Appendix A

init_filter macro                           ; set up the modulo buffers
    move    #states,r1                      ; r1 -> state (tap delay) buffer in X memory
    move    #ntaps-1,m1                     ; modulo NTAPS addressing for r1
    move    #coef,r4                        ; r4 -> coefficient buffer in Y memory
    move    #ntaps-1,m4                     ; modulo NTAPS addressing for r4
    endm

stafir macro ntaps,lg,foutput,ferror        ; adaptive FIR: output plus LMS update
    clr     a   x0,x:(r1)+  y:(r4)+,y0      ; store x(n), fetch first coefficient
    rep     #ntaps-1
    mac     x0,y0,a x:(r1)+,x0  y:(r4)+,y0  ; NTAPS-1 taps with parallel data moves
    macr    x0,y0,a x:(r1),b                ; last tap, convergent rounding; b <- d(n)
    move    a,x:foutput                     ; filter output y(n)
    sub     a,b                             ; e(n) = d(n) - y(n)
    nop
    move    b,x:ferror
    move    #lg,y1                          ; loop gain K
    move    b,x1
    mpy     x1,y1,b                         ; K * e(n)
    move    b,y1
    do      #ntaps,_update                  ; coefficient update loop
    move    y:(r4),a    x:(r1)+,x0          ; fetch b_i(n) and state x(n-i)
    mac     x0,y1,a                         ; b_i(n+1) = b_i(n) + K*e(n)*x(n-i)
    move    a,y:(r4)+
_update
    lua     (r1)-,r1                        ; back r1 up: FIFO-like state shift
    nop
    move    x:foutput,b
    move    x:ferror,a
    endm
endm