Journal of Applied Mathematics and Physics, 2015, 3, 1056-1061
Published Online August 2015 in SciRes. http://www.scirp.org/journal/jamp
http://dx.doi.org/10.4236/jamp.2015.38131
Cheetah: A Library for Parallel Ultrasound
Beamforming in Multi-Core Systems
David Romero-Laorden1, Carlos Julián Martín-Arguedas2, Javier Villazón-Terrazas1,
Oscar Martinez-Graullera1, Matilde Santos Peñas3, César Gutierrez-Fernandez2,
Ana Jiménez Martín2
1ITEFI, Spanish National Research Council, Madrid, Spain
2Department of Electronics, University of Alcalá, Alcalá de Henares, Spain
3Department of Computer Architecture and Automation, Complutense University of Madrid, Madrid, Spain
Email: cj.martin@uah.es
Received 15 August 2015; accepted 19 August 2015; published 26 August 2015
Abstract
Developing new imaging methods requires establishing proofs of concept before implementing
them in real-time scenarios. Nowadays, the high computational power reached by multi-core
CPUs and GPUs has driven the development of software-based beamformers. Taking this into account,
a library for the fast generation of ultrasound images is presented. It is based on Synthetic
Aperture Focusing Techniques (SAFT) and it is fast because of the use of parallel computing
techniques. Any kind of transducer as well as any SAFT technique can be defined, although the
library includes some pre-built SAFT methods such as 2R-SAFT and TFM. Furthermore, 2D and 3D
imaging (slice-based or full volume computation) is supported, along with the ability to generate
both rectangular and angular images. For interpolation, linear and polynomial schemes can be
chosen. The versatility of the library is ensured by interfacing it to Matlab, Python and any
other programming language over different operating systems. On a standard PC equipped with a
single NVIDIA Quadro 4000 (256 cores), the library is able to calculate 262,144 pixels in 105 ms
using a linear transducer with 64 elements, and 2,097,152 voxels in 5 seconds using a matrix
transducer with 121 elements when TFM is applied.
Keywords
GPGPU, Ultrasound Image, Beamforming, Array Transducer
1. Introduction
In recent years, the computing industry has opened the way to parallel computing. First, dual-core processors
(CPUs) were introduced in personal systems at the beginning of 2005, and it is currently common to find them
in laptops as well as in 8- and 16-core workstation computers, which means that parallel computing is no longer
relegated to big supercomputers. On the other hand, Graphics Processing Units (GPUs), as their name suggests,
came about as accelerators for graphics applications, predominantly those using the OpenGL and DirectX
programming interfaces. Although originally they were pure fixed-function devices, the demand for real-time and 3D graphics
made them evolve into highly parallel, multithreaded processors with extremely high computational
power and very high memory bandwidth that are now available to anyone with a standard PC or laptop.
Since 2006, GPUs can be programmed directly in C/C++ using CUDA or OpenCL [1], allowing each and
every arithmetic logic unit on the chip to be used by programs intended for general-purpose computations
(GPGPU). A CUDA program consists of one or more stages that are executed on either the host (CPU) or an
NVIDIA GPU. The stages that exhibit little or no data parallelism are implemented in CPU code, whereas those
that exhibit a rich amount of data parallelism are implemented in GPU code. These parallel functions are
called kernels, and they typically launch a large number of threads to exploit data parallelism, as the sketch below illustrates.
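As a minimal illustration of this host/kernel split (a generic CUDA sketch, not part of CHEETAH), the host below allocates device memory and launches a kernel with one thread per sample:

// Generic CUDA sketch (not CHEETAH code): host sets up memory and launches a kernel.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(float* data, float gain, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;    // global thread index
    if (i < n)
        data[i] *= gain;                              // each thread processes one sample
}

int main()
{
    const int n = 1 << 20;
    float* d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));           // GPU global memory
    cudaMemset(d_data, 0, n * sizeof(float));
    scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n); // data-parallel kernel launch
    cudaDeviceSynchronize();                          // wait for the GPU to finish
    cudaFree(d_data);
    printf("kernel finished\n");
    return 0;
}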
From an architectural analysis viewpoint, beamforming techniques are particularly interesting because
they can be seen as a data-parallel process, which makes their implementation possible on machines with diverse
computational and I/O capabilities. Previous works such as [2] present toolboxes for beamforming computation on
CPUs and multi-core CPUs, obtaining very good timing results. Our research group has been working on GPUs
applied to field modelling acceleration [3] [4] as well as to fast beamforming [5]-[7], achieving speed-ups of
150× over conventional CPU beamforming. Nowadays, GPU beamformers are a reality and there are many research
groups working on solutions for NDT and medical applications [8].
The aim of this work is to present CHEETAH, a fast ultrasonic imaging library that assists in the fast development
of new ultrasound beamforming strategies (currently, only SAFT methods are supported), making it possible to
generate 2D and 3D images on a standard PC or laptop in just a few milliseconds. The input data can originate
either from a simulation program or from an experimental setup. The library is composed of several routines
written in CUDA for fast execution, thus an NVIDIA© GPU is required at the present time. Version 1.0
(for Windows and Linux) will be available soon.
2. CHEETAH Core Features
CHEETAH has been designed as a free multi-platform library written in C++ and CUDA which can handle a
multitude of focusing methods, interpolation schemes and apodizations to generate images from real RF signals
obtained from any application and any acquisition process. The main features currently supported are:
Custom transducers. Commands for defining linear and matrix arrays are given. Likewise, arbitrary-geometry
transducers such as sparse arrays can also be specified, so different transducers can be used depending
on the specific application (a geometry sketch is given after this list).
Custom SAFT techniques. Commands for easily defining specific SAFT emission/reception sequences
are given. In addition, the library comes with some predefined techniques, such as 2R-SAFT [6] and TFM [9].
2D/3D imaging. Commands for composing two-dimensional and volumetric images (also C-Scan images) are
given, which makes it possible to span a wide range of applications.
Coherence factor. Commands for the application of coherence factors are given [10].
Matlab©/Python bindings. As the library is written in C++ and CUDA, its functionality is available on
Windows and Linux. Likewise, we have developed specific bindings to connect it to Matlab©, which allows
reusing existing code or utilizing specific toolbox functionalities.
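To make the transducer-definition feature concrete, the following sketch shows how linear and matrix array geometries can be described by their element positions. The struct and function names are illustrative assumptions for this example, not CHEETAH's actual commands:

// Illustrative sketch only: element positions for linear and matrix arrays.
#include <vector>

struct Element { float x, y, z; };   // element centre coordinates

// Linear array of n elements with a given pitch, centred on the origin.
std::vector<Element> linear_array(int n, float pitch)
{
    std::vector<Element> e(n);
    for (int i = 0; i < n; ++i)
        e[i] = { (i - (n - 1) * 0.5f) * pitch, 0.0f, 0.0f };
    return e;
}

// Matrix array of nx by ny elements with the same pitch in both directions.
std::vector<Element> matrix_array(int nx, int ny, float pitch)
{
    std::vector<Element> e;
    e.reserve(static_cast<std::size_t>(nx) * ny);
    for (int j = 0; j < ny; ++j)
        for (int i = 0; i < nx; ++i)
            e.push_back({ (i - (nx - 1) * 0.5f) * pitch,
                          (j - (ny - 1) * 0.5f) * pitch, 0.0f });
    return e;
}

A sparse array would simply be an arbitrary list of such elements.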
3. TFM Design in the CHEETAH Library
In this section, the TFM method has been chosen as the case study to analyze the main implementation aspects of
the beamforming algorithm in the library.
3.1. TFM Imaging Principles
Synthetic Aperture Focusing Techniques (SAFT) are based on the sequential activation of the array elements in
emission and reception, and the separate acquisition of all the signals involved in the process. Then, a
beamforming algorithm is applied to focus the image dynamically in emission and reception, obtaining the maximum
quality at each image point. One of the most common beamforming methods is the Total Focusing Method (TFM)
[9] [11], which is based on the Full Matrix Array (FMA), i.e. the complete data matrix X(t) created by every
transmitter-receiver combination:
$$\mathbf{X}(t) = \left\{ X_{tx,rx}(t) \;\big|\; \forall\, 1 \le tx, rx \le N \right\} \qquad (1)$$
where $X_{tx,rx}(t)$ is the signal corresponding to transmitter tx and receiver rx, and N is the number of array elements.
To ease the envelope computation, the acquired signals are decomposed into their analytic-signal form (in-phase I
and quadrature Q components) by applying the Hilbert Transform, being now expressed as:
$$\mathbf{X}(t) = \mathbf{I}(t) + j\,\mathbf{Q}(t) \qquad (2)$$
According to the Hilbert transformation applied in Equation (2), a complex data matrix has been created.
Then, for the case of a 2D scenario, a Delay-and-Sum (DAS) beamforming process is used to calculate two
images in the following way:
$$A_I(x,z) = \sum_{i=1}^{N} \sum_{k=1}^{N} I_{tx_i,\,rx_k}\big(D(x,z)\big) \qquad (3)$$

$$A_Q(x,z) = \sum_{i=1}^{N} \sum_{k=1}^{N} Q_{tx_i,\,rx_k}\big(D(x,z)\big) \qquad (4)$$
where $A_I(x,z)$ and $A_Q(x,z)$ are the in-phase and quadrature images respectively, and $D(x,z)$ is the delay
corresponding to the focus point $(x_{fp}, z_{fp})$ in space, which is calculated as follows:
$$D(x,z) = \frac{\sqrt{(x_{fp}-x_{tx})^2 + z_{fp}^2} + \sqrt{(x_{fp}-x_{rx})^2 + z_{fp}^2}}{c} \qquad (5)$$
where $x_{tx}$ and $x_{rx}$ are the coordinates of the transducer elements tx and rx, respectively. Over this general
scheme it is possible to introduce any type of apodization.
$$A = \sqrt{A_I^2 + A_Q^2} \qquad (6)$$
Finally, the envelope is calculated according to Equation (6) to obtain the final image.
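As a quick numerical check of Equation (5), take hypothetical values (chosen for illustration only, not from the experiments below): $c = 1540$ m/s, $x_{tx} = 0$ mm, $x_{rx} = 3$ mm and focus point $(x_{fp}, z_{fp}) = (0, 20)$ mm. Then

$$D(x,z) = \frac{\sqrt{0^2 + 20^2} + \sqrt{(0-3)^2 + 20^2}}{c} \approx \frac{(20.00 + 20.22)\ \text{mm}}{1.54\ \text{mm}/\mu\text{s}} \approx 26.1\ \mu\text{s},$$

so the in-phase and quadrature samples contributing to that pixel are those found at roughly 26.1 µs in $X_{tx,rx}(t)$ (the nearest sample, or an interpolated value, at the sampling frequency in use).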
3.2. Parallel GPU Implementation
The background behind the CHEETAH library comes from the experience our group has in beamforming
acceleration using GPUs [5]-[7]. Nevertheless, it is worth briefly reviewing the main concepts involved
in the parallel implementation of the TFM process in the library.
During the processing, the input raw data and the output image pixels are stored as single-precision floating-point
numbers (4 bytes). We have also used fast intrinsic math routines, which provide better performance at the
price of IEEE compliance (the precision is slightly reduced). In our case, this causes a minimal numerical
difference which has little influence on the quality of the final ultrasound image.
The designed parallel strategies fit the SIMD (Single Instruction Multiple Data) model of the GPU architecture
perfectly [1]. The general parallelization scheme can be summarized as Figure 1 suggests:
a) The process starts when the Full Matrix Array X(t) defined in Equation (1) is transferred from CPU memory
to GPU global memory via the PCI Express bus.
b) According to Equation (2), the Hilbert Transform is applied to every signal using the cuFFT library [1]. The
parallelization strategy is signal-oriented, which properly splits the algorithm and computes the FFT of the data.
This means that N×N threads work concurrently. The result of these operations is a complex data matrix, I(t) and
Q(t), which is stored in GPU texture memory (see the first sketch after step d).
c) The DAS kernel is applied to the complex data matrix to calculate N low-resolution images (LRI_I and LRI_Q),
each one corresponding to one emitter and all receivers [8], as Equations (3) and (4) suggest. The parallelization is
performed by launching as many threads as image pixels, which, assuming image dimensions of SX × SZ, means N ×
SX × SZ threads. Several optimizations (reuse of computed values, symmetries, shared/constant memories [7]) have
been used to maximize performance. The focusing delay is calculated on the fly and used to index the I(t) and
Q(t) complex signals, interpolating real and imaginary parts. Finally, the complex samples are multiplied by the
corresponding apodization gains and added together to beamform each of the images (see the second sketch after step d).
d) A new kernel is defined to calculate the final A_I and A_Q images by summing the LRI_I and LRI_Q
images, respectively. Once the final values for both images are computed, the envelope calculation is carried out.
This strategy follows a pixel-oriented parallelism, launching SX × SZ threads.
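The first sketch below illustrates step b) for a single signal: a common way to obtain the analytic signal with cuFFT is to apply a forward FFT, keep the DC and Nyquist bins, double the positive frequencies, zero the negative ones, and apply an inverse FFT. This is a minimal sketch under that assumption; it does not reproduce CHEETAH's batched, texture-backed implementation:

// Analytic signal (Hilbert transform) of one real signal of length L using cuFFT.
// d_signal holds the real samples in .x and zeros in .y on entry.
#include <cufft.h>

__global__ void analytic_weights(cufftComplex* S, int L)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k >= L) return;
    float w = 0.0f;
    if (k == 0 || (L % 2 == 0 && k == L / 2)) w = 1.0f;  // keep DC and Nyquist
    else if (k < (L + 1) / 2)                 w = 2.0f;  // double positive frequencies
    S[k].x *= w;                                         // negative frequencies become zero
    S[k].y *= w;
}

void hilbert(cufftComplex* d_signal, int L)
{
    cufftHandle plan;
    cufftPlan1d(&plan, L, CUFFT_C2C, 1);
    cufftExecC2C(plan, d_signal, d_signal, CUFFT_FORWARD);
    analytic_weights<<<(L + 255) / 256, 256>>>(d_signal, L);
    cufftExecC2C(plan, d_signal, d_signal, CUFFT_INVERSE);  // result is I(t) + jQ(t), scaled by L
    cufftDestroy(plan);
}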
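The second sketch condenses steps c) and d), for illustration, into a single pixel-oriented kernel: one thread per pixel of an SX × SZ image accumulates the delayed analytic samples of every transmitter-receiver pair and computes the envelope. Geometry, nearest-sample indexing and variable names are simplifying assumptions; the library additionally uses interpolation, apodization, texture memory and the optimizations mentioned above, and splits the work into low-resolution images:

// Pixel-parallel DAS + envelope sketch (Equations (3)-(6)). iq holds the analytic
// signals of the FMA laid out as [tx][rx][sample]; xe holds element x-coordinates.
__global__ void das_tfm(const float2* iq, const float* xe, int N, int L,
                        float dx, float dz, int SX, int SZ,
                        float c, float fs, float* image)
{
    int px = blockIdx.x * blockDim.x + threadIdx.x;
    int pz = blockIdx.y * blockDim.y + threadIdx.y;
    if (px >= SX || pz >= SZ) return;

    const float x = px * dx;                 // focus point (x_fp, z_fp)
    const float z = pz * dz;
    float AI = 0.0f, AQ = 0.0f;

    for (int tx = 0; tx < N; ++tx) {
        const float dtx = sqrtf((x - xe[tx]) * (x - xe[tx]) + z * z);
        for (int rx = 0; rx < N; ++rx) {
            const float drx = sqrtf((x - xe[rx]) * (x - xe[rx]) + z * z);
            const int s = (int)((dtx + drx) / c * fs);   // Equation (5), nearest sample
            if (s >= 0 && s < L) {
                const float2 v = iq[(tx * N + rx) * L + s];
                AI += v.x;                               // Equation (3)
                AQ += v.y;                               // Equation (4)
            }
        }
    }
    image[pz * SX + px] = sqrtf(AI * AI + AQ * AQ);      // Equation (6): envelope
}

A launch configuration such as dim3 block(16, 16) and dim3 grid((SX + 15) / 16, (SZ + 15) / 16) provides the SX × SZ threads mentioned in step d).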
Figure 1. Parallel scheme for beamforming acceleration.
We must mention that this scheme has been further optimized for those SAFT strategies based on the coarray
model and their particular characteristics, as well as for 2D and 3D imaging, in order to squeeze the maximum
performance and speed [6] [7].
4. Library Performance Evaluation
To evaluate the performance of the library, two experimental scenarios have been chosen: a Multi-Tissue
ultrasound phantom (040 GSE model by CIRS Inc.) to evaluate 2D imaging, and a methacrylate piece
with five drills for 3D imaging. TFM has been chosen as the beamforming method for both cases. Details are
given in Table 1.
As a 2D imaging example, we use a linear transducer of 64 elements, which results in an FMA of 4096
signals. The result of the processing is shown in Figure 2(a).
Table 2 and Table 3 show the computing times for several CPUs (tested in single-core and multi-core
mode) and for GPUs equipped with different numbers of cores.
Given that a standard PC cannot use an NVIDIA Tesla, we underline the results of the NVIDIA Quadro 4000,
which has 256 cores and 2 GB of RAM and took around 105 ms for an image of 512 × 512 pixels, as well as the
results of the NVIDIA Quadro K5000, which took less than 50 ms.
As a 3D imaging example, we use a matrix array of 11 × 11 elements. The 3D-TFM method has been applied, using
an FMA of 14,641 signals and the same GPUs, to generate a volume of 128 × 128 × 128 voxels, which is
shown in Figure 2(b). The image generation took less than 5 seconds on the NVIDIA Quadro 4000 and less
than 2 seconds on the NVIDIA Quadro K5000; more details are given in Table 4.
5. Conclusions and Future Work
A fast and versatile library for the generation of ultrasound images has been created, which exploits GPU
technology to implement the beamformer in software. We have given a detailed description of the library features
and quantified the benefits of using the GPU as a processing tool. Thus, by using a simple graphics card
equipped with NVIDIA CUDA technology, it is now possible to accelerate the development of new imaging
techniques.
Figure 2. (a) TFM image from the Multi-Tissue ultrasound phantom; (b) Volumetric image of the 5 drills using the 3D-TFM method.
Table 1. Beamformation scenarios tested for 2D and 3D imaging.
Parameters          2D Imaging              3D Imaging
Scenario            Multi-Tissue Phantom    Methacrylate piece
Medium velocity     1540 [m/s]              2690 [m/s]
Array size          64 elements             121 elements
Array pitch         0.28 [mm]               1.0 [mm] both directions
Imaging frequency   2.6 [MHz]               3.16 [MHz]
Sample frequency    40 [MHz]                40 [MHz]
Table 2. Performance of the CPU-based TFM algorithm.
CPU processor              RAM      # Cores   TFM time [s]
Intel Core 2 Quad Q9450    4 GB     1         24.13
                                    4         6.96
Intel Core i7 3632QM       8 GB     1         6.91
                                    8         2.08
Intel Xeon E5-1650         30 GB    1         6.84
                                    4         2.23
                                    8         1.80
                                    12        1.64
Table 3. Performance of the GPU-based TFM algorithm using CHEETAH.
NVIDIA GPU           RAM      # Cores   TFM time [ms]
GeForce 540M         1 GB     96        265.57
GeForce 9800 GTX+    512 MB   128       215.01
GeForce 635M         1 GB     144       239.27
Quadro 4000          2 GB     256       105.07
Quadro K2000         2 GB     384       119.23
Quadro K5000         6 GB     1536      45.11
Table 4. Performance of the GPU-based 3D-TFM algorithm using CHEETAH.
NVIDIA GPU           RAM      # Cores   TFM time [ms]
GeForce 540M         1 GB     96        7972.70
GeForce 9800 GTX+    512 MB   128       7056.01
GeForce 635M         1 GB     144       7587.25
Quadro 4000          2 GB     256       4874.28
Quadro K2000         2 GB     384       6697.05
Quadro K5000         6 GB     1536      1929.13
We are currently working on the implementation of more beamforming algorithms (e.g. adaptive beamforming
and PA), on supporting CPU capabilities for parallel computing (OpenACC, OpenMP, MPI) and multi-GPU
processing, and on improving the overall performance. We are open to collaboration with and feedback from
any group interested in using our library in their own research.
Acknowledgements
This work has been supported by the Spanish Government and the University of Alcalá under projects
DPI2010-19376 and CCG2014/EXP-084, respectively.
References
[1] Hwu, W.-M.W. and Kirk, D.B. (2010) Programming Massively Parallel Processors: A Hands-On Approach. Morgan Kaufmann, Burlington.
[2] Hansen, J.M., et al. (2011) An Object-Oriented Multi-Threaded Software Beamformation Toolbox. In: D'hooge, J. and Doyley, M.M., Eds., SPIE Medical Imaging: Ultrasonic Imaging, Tomography, and Therapy, 79680Y.
[3] Romero-Laorden, D., et al. (2011) Field Modelling Acceleration on Ultrasonic Systems Using Graphic Hardware. Computer Physics Communications, 182, 590-599. http://dx.doi.org/10.1016/j.cpc.2010.10.032
[4] Villazón-Terrazas, J., et al. (2012) A Fast Acoustic Field Simulator. In: 43º Congreso Español de Acústica (TECNIACUSTICA), Evora, 1-9.
[5] Romero-Laorden, D., et al. (2009) Using GPUs for Beamforming Acceleration on SAFT Imaging. IEEE International Ultrasonics Symposium, Rome, 1334-1337.
[6] Martín-Arguedas, C.J., et al. (2012) An Ultrasonic Imaging System Based on a New SAFT Approach and a GPU Beamformer. IEEE Transactions on Ultrasonics, Ferroelectrics and Frequency Control, 59, 1402-1412. http://dx.doi.org/10.1109/TUFFC.2012.2341
[7] Romero-Laorden, D., et al. (2013) Strategies for Hardware Reduction on the Design of Portable Ultrasound Imaging Systems. In: Gunarathne, G.P.P., Ed., Advancements and Breakthroughs in Ultrasound Imaging. http://dx.doi.org/10.5772/55910
[8] So, H.K.H., et al. (2011) Medical Ultrasound Imaging: To GPU or Not to GPU. IEEE Micro, 31, 54-65. http://dx.doi.org/10.1109/MM.2011.65
[9] Holmes, C., et al. (2008) Advanced Post-Processing for Scanned Ultrasonic Arrays: Application to Defect Detection and Classification in Non-Destructive Evaluation. Ultrasonics, 48, 636-642. http://dx.doi.org/10.1016/j.ultras.2008.07.019
[10] Martínez-Graullera, O., et al. (2011) A New Beamforming Process Based on the Phase Dispersion Analysis. The International Congress on Ultrasonics, Gdansk, 185-188.
[11] Hunter, A.J., et al. (2008) The Wavenumber Algorithm for Full-Matrix Imaging Using an Ultrasonic Array. IEEE Transactions on Ultrasonics, Ferroelectrics and Frequency Control, 55, 2450-2462. http://dx.doi.org/10.1109/TUFFC.952