System-Level Performance Evaluation of Very High Complexity Media Applications: A H264/AVC Encoder Case Study

doi:10.4236/ijcns.2011.47052


Int. J. Communications, Network and System Sciences, 2011, 4, 436-446

doi:10.4236/ijcns.2011.47052 Published Online July 2011 (http://www.SciRP.org/journal/ijcns)


Hajer Krichene Zrida1, Abderrazek Jemai2, Ahmed C. Ammari3, Mohamed Abid1

1Computer & Embedded Systems Laboratory, National School of Engineers of Sfax, Sfax University, Sfax, Tunisia

2LIP2 Laboratory, Faculty of Science of Tunis, Tunis, Tunisia

3Research Unit in Materials Measurements and Applications, National Institute of Applied Sciences and Technology, Carthage University, Carthage, Tunisia

E-mail: hajer_kri@yahoo.co.nz

Received March 23, 2011; revised May 20, 2011; accepted June 10, 2011

Abstract

Given the steadily increasing complexity of embedded systems, relatively detailed clock cycle-accurate simulators are impractical for design-space exploration in the early design stages. Raising the abstraction level is now widely seen as a way to bridge the gap between increasing system complexity and low design productivity. To this end, several system-level design tools and methodologies have been introduced to efficiently explore the design space of heterogeneous signal processing systems. In this paper, we demonstrate the effectiveness and flexibility of the Sesame/Artemis system-level modeling and simulation methodology for efficient performance evaluation and rapid architectural exploration of increasingly complex heterogeneous embedded media systems. For this purpose, we have selected a system-level design of a very high complexity media application: an H.264/AVC (Advanced Video Coding) video encoder. The encoding performance is evaluated using system-level simulations targeting multiple heterogeneous multiprocessor platforms.

Keywords: System-Level Performance Evaluation, Embedded Systems Design Space Exploration Tools, the Sesame/Artemis Design Tool, a Parallel H.264/AVC Video Encoder

1. Introduction

The architectural complexity of System-on-Chip (SoC)-based embedded systems, together with design requirements regarding real-time performance, high flexibility, low power consumption, and cost, greatly complicates system design. Classical design methods, which typically start from a single application specification, now fall short when designing such embedded systems. To address the increasing design complexity, researchers have introduced a design concept called system-level design [1]. A new generation of system-level tools and methodologies has accordingly been introduced to efficiently explore the design space of heterogeneous signal processing systems. Each tool/methodology directly reflects a well-defined design flow.

The Y-chart layer-based approach, considered the most popular approach for designing multimedia-oriented systems, is followed in most recent system-level design works [1]. It addresses the shortcomings of the classical HW/SW co-design approach by abandoning the use of low-level (instruction-level or cycle-accurate) simulators for design space exploration at an early stage of the flow, and by abandoning a single system specification that describes both hardware and software parts. Indeed, the Y-chart methodology recognizes a clear separation between an application model, an architecture model, and an explicit mapping step that relates the application model to the architecture model. The application model describes the functional behavior of an application, independent of architectural specifics such as the HW/SW partitioning or timing characteristics. The architecture model defines the architecture resources, captures their timing characteristics, and then simulates the performance consequences of the application events


(communication and computation operations) for software (programmable components) and hardware (reconfigurable/dedicated) executions.

As shown in Figure 1, the general Y-chart design scheme is composed of four steps [1]. The first step, “Application Modeling”, captures a functional specification of the system in the form of a set of benchmark applications. The second step, “Architecture Modeling”, models the target architecture in terms of the resources available in the system. In embedded systems, these resources are typically processors, operating systems, buses, and memories. In the third step, the parallel application processes are mapped onto the resources of the architecture. The result of the mapping step is an implementation of the system, which serves as input to the fourth step, performance analysis. Typically, following the Y-chart approach, a system designer studies the set of benchmark applications, makes some initial calculations, and proposes an architecture. The designer then evaluates and compares several instances of the platform by mapping each application onto the platform architecture and analyzing the resulting performance. The resulting performance numbers may inspire the designer to improve the architecture, restructure the application, or change the mapping. These possible designer actions are shown with light bulbs in Figure 1.

The outline of the paper is as follows. In Section 2, we first present the underlying properties of several system-level design tools. Based on the most important design criteria considered in our comparative synthesis, the Sesame software framework is selected as one of the best system-level design methodologies. In Section 3, we describe the main features, tools, and methods provided by the Sesame/Artemis simulation and modeling environment. Section 4 presents a complexity analysis of the H.264/AVC standard and reviews previous work on the development of an optimized parallel encoder model. In Section 5, Sesame is used to evaluate the encoder performance on multiprocessor architectures, which demonstrates the effectiveness and flexibility of this design methodology.

Figure 1. The Y-chart: a general scheme for heterogeneous system design.

2. System-Level Exploration Tools Comparative Synthesis

In the literature, a number of exploration environments facilitate system-level design space exploration by providing support for mapping a behavioral application specification onto an architecture platform model [1]. Although all system-level design methodologies target the same field, namely designing embedded systems at a high abstraction level, there is wide diversity among them. Table 1 summarizes the most interesting design properties of some representative ones.

The selected methodologies and tools shown in Table 1 differ first in their HW/SW design approach. Some of them, like the Ptolemy tool [2], do not support a layered abstraction level design approach and use a single specification including both functional behavior and architecture models. Others support the platform-based design approach (like Metropolis [3]) or the top-down design methodology (like SpecC [4]). However, the Y-chart layer-based approach, which is followed by several recent works, has become the most popular approach for designing heterogeneous multiprocessor embedded systems.

Although the differences among the seven tools are not absolute, the features shown in Table 1 indicate that the most preferable methodologies/tools are Metropolis, VCC [4], and Sesame [5,6], because they have the largest number of positive marks (“+”). Of these, the Metropolis environment is excluded because it does not explicitly facilitate the Y-chart approach. Between the VCC and Sesame tools, mixed-level simulation is supported only by Sesame. We have therefore selected the Artemis/Sesame methodology to implement, at system level, the H.264/AVC video encoding application on a multiprocessor SoC-based architecture. Indeed, the Sesame/Artemis system-level modeling and simulation framework was developed to directly reflect the Y-chart design approach. It provides several methods and tools to quickly and separately build the application process network model, the target architecture model, and the model of the mapping of the application onto the architecture.

To date, Sesame has been evaluated for the design of two medium complexity media applications: an MPEG-2 decoder and a variant of an M-JPEG encoder [7,8]. Our objective in this paper is to use this methodology for the design of very complex media systems. The H.264/AVC reference video encoder is an example of a very complex case study typical of the multimedia domain. It has been designed with the goal of enabling significantly improved compression performance relative to all existing video coding standards [9]. The standard uses advanced compression techniques that, in turn, require high computational power [10]. Implementing an H.264/AVC video encoder on an embedded SoC is thus a big challenge, since real-time encoding requires very high computation power. In this study, both the modeling and mapping stages of the Sesame design flow are explored to obtain an H.264/AVC encoder implementation that satisfies the design constraints. This demonstrates the effectiveness and flexibility of the methods and tools provided by this methodology for rapid system-level design space exploration of complex embedded systems.

Table 1. Overview of some properties of presented tools and methodologies.

Methodology/tool                           Ptolemy  Polis  Metropolis  SpecC  VCC  SystemC  Sesame/Artemis
Y-chart supported                          -        +      -           -      +    x        +
MoC variety supported                      +        x      +           x      +    x        -
Dynamic performance models                 +        +      +           +      +    +        +
Formal analysis and verification           -        +      +           +      x    x        x
Reusability supported                      +        +      +           +      +    +        +
Complex application domains supported      x        -      +           x      +    +        +
Target architecture variety supported      x        -      +           +      +    +        +
All abstraction levels supported           -        +      +           +      +    +        +
HW synthesis and IP integration supported  -        +      +           +      x    -        +
Mixed simulation supported                 -        x      x           x      -    +        +
Automatic HW/SW partitioning supported     -        -      -           -      -    -        -
Automatic mapping                          -        -      +           -      -    -        -
Rapid prototyping                          x        +      +           +      +    +        +

Legend: True, +; False, -; Maybe, x.

3. The Sesame/Artemis Simulation and Modeling Environment

In this section, we briefly describe the Sesame/Artemis simulation and modeling environment [11,12]. The required software model layers are presented first. The implementation of these layers is based on specific tools, which are briefly described next, along with their application in the literature to some medium complexity multimedia systems.

3.1. The Sesame Layer’s Software Model

Using the Sesame system-level design software framework, three software specification model layers are required: the application process network layer, the target architecture layer, and the layer for mapping the application onto the architecture, as shown in Figure 2 [6].

3.1.1. Application Modeling Layer

Applications in Sesame are modeled using the Kahn Process Network (KPN) model of computation, in which parallel processes, implemented in a high-level language, communicate with each other via unbounded FIFO channels [13]. Each process executes sequentially. Reading from channels is blocking; writing to channels is non-blocking. The execution of a Kahn Process Network is deterministic: for a given input, the same output is always produced and the same workload is generated, irrespective of the execution schedule. The model fits signal processing applications well, as it can model stream processing with the guarantee that no data is lost. The key characteristic of the KPN model is that it specifies an application in terms of distributed control and distributed memory, which allows the application to be mapped onto a multiprocessor platform in a systematic and efficient way.
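The KPN semantics described above can be illustrated with a minimal Python sketch. It is not Sesame or PNRunner code: the process names, the `None` end-of-stream marker, and the use of threads with `queue.Queue` as the unbounded FIFO are assumptions made purely for illustration.

```python
import queue
import threading


def producer(out_ch, n):
    """Hypothetical source process: emits the tokens 0..n-1."""
    for i in range(n):
        out_ch.put(i)      # writing to an unbounded FIFO never blocks
    out_ch.put(None)       # end-of-stream marker (an assumption of this sketch)


def scale(in_ch, out_ch, factor):
    """Hypothetical worker process: multiplies every token by `factor`."""
    while True:
        token = in_ch.get()    # reading blocks until a token is available
        if token is None:
            out_ch.put(None)
            return
        out_ch.put(token * factor)


def run_network(n=5, factor=3):
    """Run the two-process network and collect the output stream."""
    a, b = queue.Queue(), queue.Queue()   # maxsize=0 means unbounded
    threads = [
        threading.Thread(target=producer, args=(a, n)),
        threading.Thread(target=scale, args=(a, b, factor)),
    ]
    for t in threads:
        t.start()
    result = []
    while True:
        token = b.get()
        if token is None:
            break
        result.append(token)
    for t in threads:
        t.join()
    return result
```

Because each process only performs blocking reads and non-blocking writes, the collected output stream is the same regardless of how the threads are interleaved, which is the determinism property stated in the text.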

3.1.2. Architecture Modeling Layer

An architecture model is constructed from generic building blocks provided by a library containing template performance models for processors, co-processors, memories, buffers, buses, and so on. The architecture is evaluated by simulating the performance consequences of the application model events that are mapped onto the architecture model. This requires each process and channel of the Kahn process network to be associated with, or mapped onto, one component of the architecture model. When executed, each Kahn process generates a trace of events, and these event traces are routed towards a specific component of the architecture model through a trace event queue. A Kahn process places its application events into this queue while the corresponding architecture component consumes them (Figure 2).
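The trace-driven principle can be sketched as follows. This is a simplified illustration, not the Sesame implementation: the event names (`read`, `execute`, `write`), the latency values in cycles, and the class and function names are all hypothetical.

```python
from collections import deque

# Hypothetical latency table, in cycles per event type.
LATENCY = {"read": 4, "execute": 10, "write": 4}


def kahn_process_trace(n_blocks):
    """Trace of a hypothetical process: read, compute, write for each block."""
    for _ in range(n_blocks):
        yield "read"
        yield "execute"
        yield "write"


class Processor:
    """Architecture component that consumes events and accumulates time."""

    def __init__(self, latency):
        self.latency = latency
        self.time = 0

    def consume(self, trace_queue):
        # Each event only advances simulated time; no functional work is done.
        while trace_queue:
            event = trace_queue.popleft()
            self.time += self.latency[event]
        return self.time


def simulate(n_blocks):
    """Fill the trace event queue, then let the component consume it."""
    trace_queue = deque(kahn_process_trace(n_blocks))
    return Processor(LATENCY).consume(trace_queue)
```

The point of the sketch is the separation of concerns: the application side only produces events, while the architecture side only assigns them a cost, so either side can be changed without touching the other.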


Figure 2. The three layers within Sesame: the application model layer, the architecture model layer, and the mapping layer.

3.1.3. Mapping Layer

The mapping layer maps the event traces generated by the Kahn processes of an application model onto the resources in the architecture model. In addition, it maps the Kahn communication channels onto communication resources at the architecture level. Each Kahn channel can thus be mapped onto a point-to-point FIFO channel between two processors or onto a software buffer in shared memory. As shown in Figure 2, multiple Kahn processes can be mapped onto a single architecture component (e.g., a programmable component). Such mappings require the events from the event traces that are mapped onto the same architecture resource to be scheduled. This scheduling is also performed by the mapping layer [7].
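As an illustration of this scheduling step, the sketch below merges the traces of all processes mapped onto the same component into a single ordered event stream. The round-robin policy, the function name, and the data layout are assumptions of this sketch; the text does not state which scheduling policy Sesame applies.

```python
from collections import defaultdict, deque


def schedule(mapping, traces):
    """Merge per-process traces into one event stream per component.

    mapping: process name -> component name
    traces:  process name -> list of events
    Round-robin interleaving is a hypothetical policy chosen for illustration.
    """
    per_component = defaultdict(list)
    for process, component in mapping.items():
        per_component[component].append((process, deque(traces[process])))

    merged = {}
    for component, queues in per_component.items():
        order = []
        # Visit the queues in turn until every trace is exhausted.
        while any(q for _, q in queues):
            for process, q in queues:
                if q:
                    order.append((process, q.popleft()))
        merged[component] = order
    return merged
```

A component that hosts a single process simply receives that process's trace unchanged, whereas a shared component sees the interleaved events of all processes mapped onto it.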

3.2. Sesame Implementation Tools

In the previous section, we saw that the Sesame software structure is composed of three layers: the application layer, the architecture layer, and the mapping layer, which is an interface between the other two. All three layers in Sesame are composed of components that must be instantiated and connected using some form of object creation and initialization mechanism, as shown in Figure 3. This allows code reuse and provides the flexibility to easily modify the model based on performance results, as dictated by the Y-chart methodology (Figure 1). The three model layers are implemented by the following tools:

3.2.1. YML Modeling Language

Sesame was developed to allow rapid construction of simulation models through the use of libraries of pre-built architecture simulation components. To enable quick modification, a flexible description format for the interconnection of these components is required. For this purpose, YML (Y-chart Modeling Language) is defined to describe the structure of Sesame’s simulation models. YML is an XML-based language. Using XML is attractive because it is simple and flexible, encourages reuse of model descriptions, and comes with good programming language support. The core elements of YML are network, node, port, link, and property [6].
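To give a feel for such a description, the following sketch builds a small XML tree with Python's standard `xml.etree.ElementTree` module. Only the five element names (network, node, port, link, property) come from the text; every attribute name and value below is hypothetical and should not be read as the actual YML schema.

```python
import xml.etree.ElementTree as ET


def build_network():
    """Build a two-node toy network using the five YML element names."""
    net = ET.Element("network", {"name": "toy_app"})

    producer = ET.SubElement(net, "node", {"name": "producer"})
    ET.SubElement(producer, "property", {"name": "class", "value": "Producer"})
    ET.SubElement(producer, "port", {"name": "out", "dir": "out"})

    consumer = ET.SubElement(net, "node", {"name": "consumer"})
    ET.SubElement(consumer, "property", {"name": "class", "value": "Consumer"})
    ET.SubElement(consumer, "port", {"name": "in", "dir": "in"})

    # A link connects an output port of one node to an input port of another.
    ET.SubElement(net, "link", {"from": "producer.out", "to": "consumer.in"})
    return net


def to_text(net):
    """Serialize the tree to an XML string."""
    return ET.tostring(net, encoding="unicode")
```

Because the structure is plain XML, changing the interconnection of components amounts to editing or regenerating a few elements, which is the kind of quick modification the text refers to.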

3.2.2. Application PNRunner Simulator

Sesame’s application simulator is called PNRunner, or Process Network Runner. PNRunner implements the semantics of Kahn process networks in C++. It reads a YML application description file and executes the corresponding application model. During execution, PNRunner generates a trace of application events through a trace API to drive an architecture simulation (Figure 3). Using this API, PNRunner can send application events (communication and computation operations) to the architecture simulator, where their performance consequences are simulated. Hence, trace-driven co-simulation of application and architecture is possible.

3.2.3. Architecture Pearl Simulator

The target architecture model in Sesame is implemented in the Pearl discrete-event simulation language [14]. Pearl is a small but powerful object-based language that provides easy construction of abstract architecture models and fast performance simulation. It has a C-like syntax with a few additional primitives for simulation purposes. A Pearl program is a collection of concurrent objects that communicate with each other through synchronous or asynchronous message passing. After sending an asynchronous message, the sending object continues execution, whereas after sending a synchronous message, it blocks until it receives a reply message from the receiver.

3.3. Medium Complexity Media Case Studies

The Sesame modeling and simulation methodology has been applied to two medium complexity media applications: an MPEG-2 decoder and a variant of an M-JPEG encoder [7,8]. Both studies were performed at the black-box architecture model level and showed promising results. These media applications have been implemented on various multiprocessor architecture models. For these architectures, different hardware-software partitionings, application-to-architecture mappings, processor speeds, and interconnect structures (bus, crossbar, and Omega networks) were evaluated [7]. In terms of the measured number of frames per second, the crossbar model performed better than the Omega network and common bus structures (about 5% faster than the Omega network) [7].

Figure 3. Sesame software tools overview.

A large part of the design space was thus explored in the search for the optimal system design. For these medium complexity media applications, the performance evaluation is straightforward and very fast. For all configurations used, the simulation times range from 5 to 10 seconds. This is many orders of magnitude faster than classical RTL-level simulators and is very acceptable for design space exploration. Given this, the next sections aim to further demonstrate the effectiveness and flexibility of the Sesame methods and tools for the design of very complex media applications. This is done by using the methodology for the design and performance evaluation of an H.264/AVC reference encoder targeting multiple heterogeneous multiprocessor platforms.

4. The H.264/AVC Video Encoder Case Study

H.264/AVC has been designed with the goal of enabling significantly improved compression performance relative to all existing video coding standards [9]. The standard uses advanced compression techniques that, in turn, require high computational power [10]. Implementing an H.264 video encoder on an embedded SoC therefore requires very high computation power to achieve real-time encoding. This section first presents a complexity analysis of the H.264/AVC reference encoder in comparison to the M-JPEG application. It then reviews previous work on obtaining an optimized parallel model of the encoder using an appropriate high-level, target-architecture-independent parallelization approach.

4.1. Complexity Analysis of the H.264/AVC Reference Video Encoder

The complexity of the H.264/AVC encoder application depends on the algorithm, the encoding tool options, the input sequences, and the architecture on which it is implemented. In previous work [15], we performed a complete high-level performance and complexity analysis of an H.264/AVC video encoding application. The experiments were performed on a General-Purpose Processor (GPP) platform, a 1.6 GHz Intel Centrino, using version JM 10.2 of the reference software [16] with the Main profile at level 4. For an optimal balance between encoding efficiency and implementation cost, a proper use of the H.264/AVC tools was proposed to maintain acceptable performance while considerably reducing complexity. Using the resulting optimal encoding tools for a very low bit rate, 7-frame QCIF “bridge far” sequence, the computing time for the encoding process on the GPP platform is 15.2 seconds. The associated complexity in frames per second is 2.16 fps. For this test sequence, the peak memory usage was also measured using the “memprof” GNU profiler [17]; the peak memory cost obtained is 5.02 MB. This result refers to non-optimized source code. Platform-independent memory optimizations through C-level code transformations may be applied to obtain a memory-optimized and algorithmically optimized version of the reference code.

In comparison to the Motion JPEG application presented in Section 3, the non-optimized H.264/AVC reference encoder is about two to three orders of magnitude more complex in terms of computing time and memory usage. To illustrate this, the dynamic instruction distributions by operation type were obtained for both applications using an “objdump” utility and are reported in Figure 4. For H.264/AVC, the dominant instruction types are the “arithmetic” and “memory” (load/store) operations. These results confirm the very high complexity of this new standard, the large memory allocation needed, and the high volume of computation required. The SoC implementation of such a complex application will show what the Sesame methodology can deliver for the design of such complex systems.

4.2. High Level Parallel Specification of a H.264/AVC Video Encoder

To speed up the computation of this encoder, a multiprocessor implementation is likely needed. Prior to this implementation, the sequential reference encoder C source code [16] must be transformed into concurrent KPN tasks communicating via dedicated FIFO channels. The goal of this step is to extract the available task-level parallelism from the application by splitting compute nodes as far as possible to obtain a valid parallel KPN model of the encoder. For an optimal design flow, our aim is to provide a parallel specification of the application that forms a good starting point for mapping onto different system-on-chip platforms. To do so, we proposed in previous work a high-level, target-architecture-independent parallelization approach [18,19] to obtain an optimized parallel model of the encoder with the best computation and communication workload balance.

Figure 4. The dynamic instruction distribution by operation types for both the H.264/AVC and M-JPEG applications.

The proposed parallelization approach is based on the KPN/YAPI [20] parallel programming model of computation and on the selection of a fine-grain macroblock communication granularity level. The key characteristic of the approach is the simultaneous exploration of the two predominant forms of parallelism: data-level partitioning and task-level splitting and merging. This means that communication and computation workload analyses are needed to provide global guidance when optimizing concurrency between processes. In general, once the concurrency bottlenecks are identified, task-level and data-level splitting and/or merging are performed to better distribute the computing workload over the processes. For the most computationally expensive tasks, data splitting is proposed for better concurrency optimization [18].

Following the proposed parallelization approach, Task Level Parallelism (TLP) is considered first. The goal of this step is to extract the available task parallelism by splitting compute nodes as far as possible to obtain an initial valid parallel KPN model of the encoder. Here, the encoder block diagram [21] served as the starting point for extracting the task-level parallelism. Then, to obtain a parallel implementation of the encoder with the best computation and communication workload balance, successive steps of task-level splitting or merging and data-level splitting are used to derive a final optimized model in a structured way. Further details on these steps are given in [19]. The final optimized model is given in Figure 5. This figure shows that the low-complexity DCT, Quantization, Decoder, and Filter modules have been merged into a single “Dct_Dec_Filter” process. For the most computationally expensive task, motion estimation and compensation (“Mec”), a data partitioning strategy has been applied to distribute the computation over three processes, “Mec1”, “Mec2”, and “Mec3”, tripling the associated input/output FIFO channels.
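The data partitioning of the “Mec” task can be illustrated with a small sketch. It assumes the standard 16 × 16 H.264 macroblock size and the QCIF resolution of 176 × 144 pixels used in the experiments; the round-robin assignment of macroblocks to the three processes is a hypothetical policy, since the text does not specify how the data is actually divided.

```python
MB_SIZE = 16  # H.264 macroblocks cover 16 x 16 luma samples


def macroblock_count(width, height, mb_size=MB_SIZE):
    """Number of whole macroblocks in a frame of the given size."""
    return (width // mb_size) * (height // mb_size)


def partition(n_items, n_workers):
    """Round-robin split of item indices over workers (assumed policy)."""
    return [list(range(w, n_items, n_workers)) for w in range(n_workers)]


def mec_partition(width=176, height=144, n_workers=3):
    """Macroblock indices handled by each hypothetical Mec process."""
    return partition(macroblock_count(width, height), n_workers)
```

With these assumptions, a QCIF frame contains 99 macroblocks, so each of the three processes would receive 33 of them, which gives an even computation load as long as the per-macroblock cost is roughly uniform.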

Given the “Gprof” [22] computation profiling results of the resulting parallel model, reported in Figure 6, it is clear that the final proposed model has good concurrency properties with an acceptable computation and communication workload balance.

5. System Level Performance Evaluation of the H.264/AVC Video Encoder

This section demonstrates the effectiveness and flexibility of the Sesame system-level design methodology for efficient and rapid design space exploration of such complex systems. The base target architecture and the mapping strategy are presented first. The Sesame system-level design flow is then used for performance evaluation and design space exploration of the encoder targeting multiple multiprocessor architectures. Finally, the efficacy of the methodology for the design of very complex systems is evaluated.

5.1. The Base Target Architecture and Mapping

Following the Sesame system-level design methodology presented in Section 3, three software model specifications are required: the application process network model, the target architecture model, and the model of the mapping of the application onto the architecture. To this end, the optimized parallel model of the H.264/AVC encoder of Figure 5 is first ported to the Sesame framework. This was done by transforming the previously validated YAPI model into a C++ PNRunner network model. The resulting network model is then simulated with the PNRunner simulator to generate the computation and communication event traces of the application execution, called trace-event queues [6].


Figure 5. Proposed optimized parallel KPN model of the H.264 encoder.

Figure 6. Computation workload profiling of the final parallel model.

In parallel with the application model specification, the target architecture is modeled with the Pearl object-based simulation language. The Sesame environment provides a small library of black-box architecture base models: processing cores, a generic bus, a generic memory, and several interfaces for connecting these building blocks. Once a target architecture model is validated, a trace-driven co-simulation is carried out, in which the application event trace queues are mapped to the architectural components. Such a co-simulation requires an explicit mapping of the KPN processes and channels to particular components of the target architecture. More than one KPN process can be mapped to the same processor, as the system simulator automatically schedules the events from the different queues.

In our case, the base target architecture, given in Figure 7, is a multiprocessor platform communicating with a shared DRAM memory through a common bus. For this platform, we used general-purpose processors (assumed to be MIPS R3000) and assumed a conservative timing of 100 ns to read/write a 64-bit word from/to DRAM. The instruction latencies for the MIPS R3000 microprocessor components were estimated using technical documentation. Communication between components is performed through buffers in shared memory.
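To put the 100 ns DRAM timing in perspective, the following back-of-the-envelope sketch estimates the time needed to move one QCIF frame over the memory interface. The 100 ns access time and 64-bit word width come from the text; the 4:2:0 chroma subsampling with 8 bits per sample is an assumption (the text only says the input consists of YUV frames), and the function names are hypothetical.

```python
DRAM_ACCESS_NS = 100   # time to read/write one word, from the text
WORD_BITS = 64         # DRAM word width, from the text


def frame_bytes(width, height):
    """Size of one frame, assuming YUV 4:2:0 with 8 bits per sample."""
    # One full-size luma plane plus two quarter-size chroma planes.
    return width * height * 3 // 2


def transfer_time_ns(n_bytes):
    """Time to move n_bytes through DRAM, one word per access."""
    bits = n_bytes * 8
    words = (bits + WORD_BITS - 1) // WORD_BITS   # round up to whole words
    return words * DRAM_ACCESS_NS
```

Under these assumptions, a single pass over a 176 × 144 frame costs 4752 word accesses, or about 0.48 ms. Since the encoder reads and writes frame data many times per encoded frame, memory traffic of this kind accumulates quickly on a single shared bus.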


Figure 7. H.264 encoder’s application to architecture mapping.

For sufficient design space exploration, several platform models are used. The platforms differ in the number of processors: the first uses two processors, the second four, and the third six. Given the optimized parallel model of the H.264/AVC encoder, the Sesame design space exploration consisted of changing the mapping combination or adding another architectural component without touching the parallel specification of the H.264 encoder, since this application already has good concurrency properties with an acceptable computation and communication workload balance [19]. When such a system modification is made, we first have to recompile the hardware architecture and then the entire system to regenerate the YML files of the target architecture and mapping layers. Adding a new architectural component consists of accessing the Sesame library of black-box component models through the “YMLEditor” graphical YML editor, and then, with a few clicks, adding the component to the architecture model and connecting it to the bus. The “YMLEditor” editor is also used to quickly map the application tasks and Kahn communication channels onto the different architecture resources.

The mapping of application processes to this platform was decided explicitly, based on the computation and communication load distribution results of Figure 6. For the bi-processor platform, the total computation load is distributed between the two processors: the “Mec1”, “Mec2”, and “Dct_Dec_Filter” processes are mapped to one processor, and all the others to the second. The mapping strategy used with the four-processor platform is shown in Figure 7. In this case, the most complex tasks, “Mec1”, “Mec2”, “Mec3”, and “Intra-Pred”, are first each mapped to a separate core so that they execute concurrently. Then, the “Dct_Dec_Filter” process is added to run with the “Mec2” process on the same core. The “Vlc” process is likewise added to the “Intra-Pred” process on the fourth processor.
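The four-processor mapping described above can be written down as a simple table and checked for load balance. In the sketch below, the grouping of processes follows the text, but the processor labels `P1` to `P4` (other than “Intra-Pred” and “Vlc” sharing the fourth processor) and all load percentages are hypothetical placeholders, not the measured values of Figure 6. Processes of the model that the text does not name are omitted.

```python
# Grouping taken from the text; the labels P1..P3 are an arbitrary choice.
FOUR_CORE_MAPPING = {
    "Mec1": "P1",
    "Mec2": "P2",
    "Dct_Dec_Filter": "P2",
    "Mec3": "P3",
    "Intra-Pred": "P4",
    "Vlc": "P4",
}

# Hypothetical computation loads in percent, for illustration only.
EXAMPLE_LOAD = {
    "Mec1": 25, "Mec2": 25, "Mec3": 25,
    "Intra-Pred": 12, "Dct_Dec_Filter": 8, "Vlc": 5,
}


def load_per_processor(mapping, load):
    """Sum the load of all processes mapped onto each processor."""
    totals = {}
    for process, processor in mapping.items():
        totals[processor] = totals.get(processor, 0) + load[process]
    return totals
```

Evaluating a candidate mapping this way before simulation gives a quick first check of whether the computation load is spread evenly, which is the criterion the text uses to decide the mapping.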

5.2. System Level Performance Evaluation and Design Space Exploration

After mapping the optimized H.264/AVC PNRunner network model to the platforms with two, four, and six microprocessors, the performance analysis step is carried out by system-level simulation. In all the experiments, the input test video sequence consists of YUV frames captured at QCIF resolution (176 × 144 pixels). The simulation results for H.264/AVC encoding of the QCIF “Bridge-close” sequence on the different platforms are presented in Figure 8. This figure shows that the encoding performance, in frames per second, improves linearly as the number of simulated microprocessors increases. In each case, since the application model is considered optimal, the execution/communication performance may be further improved by changing the mapping policy and/or the platform architecture.


Figure 8. H.264/AVC encoding performances vs. simulated processors with the common bus structure.

To modify the architecture, a designer can also explore other communication models or enhance the architecture with hardware components using appropriate HW/SW partitioning. For example, for the four-processor platform with the common bus structure, performance numbers for the execution/communication workload were obtained for each architecture component. The results are shown in Figure 9. For each component, a bar shows the breakdown of the time spent reading/writing, being busy, and being idle. The figure shows that the computation cost is much larger than the time spent reading from and writing to the shared memory, and that the communication and computation loads are nearly balanced across the components. This result confirms the good concurrency properties of the proposed optimized parallel application model and the suitability of the mapping policy. However, for all the architecture components, the idle times are very large compared with the busy times, which has probably caused a substantial degradation of the final encoding performance. Given the large amount of data communicated between processes during encoding, the common memory bus structure clearly constitutes a serious communication bottleneck. Indeed, the strong data dependency between processors requires substantial memory access and allocation for the read/write operations. For a common-bus-based multiprocessor architecture, this saturates the bus, so that much time is spent waiting to read data from, or write data to, other components.
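
The per-component breakdown plotted in such a figure is simply each time category expressed as a share of the total simulated time. The Python sketch below shows that arithmetic; the function name and the cycle counts in the usage lines are placeholders chosen for illustration and are not values read from Figure 9.

```python
def time_breakdown(io_cycles, busy_cycles, idle_cycles):
    """Return the percentage of time a component spends on I/O, busy, and idle."""
    total = io_cycles + busy_cycles + idle_cycles
    if total == 0:
        raise ValueError("component has no recorded cycles")
    return {
        "read/write": 100 * io_cycles / total,
        "busy": 100 * busy_cycles / total,
        "idle": 100 * idle_cycles / total,
    }


if __name__ == "__main__":
    # Placeholder counts (not from the paper): an idle-dominated component.
    print(time_breakdown(io_cycles=10, busy_cycles=30, idle_cycles=60))
```

A component whose idle share dominates its busy share, as in the placeholder call, is the pattern the text attributes to bus saturation.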

For further design space exploration, and in order to reduce the communication bottleneck observed for the common-bus-based architecture, other inter-processor communication structures and topologies should be tested. The Sesame framework provides Pearl models of a crossbar and of an Omega network [7]. We therefore selected the crossbar switch structure in our experiments as a replacement for the common bus model. A Pearl simulation model of a 4 × 4 crossbar switch is implemented, as shown in Figure 10. In this architecture, the processors communicate with each other over the crossbar. The memory is distributed per processor and resides in the Virtual Buffers (VBs). Data is written to the virtual buffer associated with the writing processor. Only reads are forwarded over the crossbar, although the crossbar can also be used for write calls. The performance results obtained for the different platforms are presented in Figure 11. As Figure 11 shows, the crossbar structure yields a substantial gain in encoding performance, in frames per second (fps), compared with the common bus architecture: we achieved 9.6 fps with six processors (MIPS R3000) connected via a 4 × 4 crossbar communication model. In addition, for the four-processor platform, the execution/communication workload was obtained for each component, and the results are reported in Figure 12. The statistics of Figure 12 clearly show that the components spend much more time busy doing work and much less time waiting for reads and writes. This confirms the performance gain obtained.
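
The effect of replacing a shared bus by a crossbar can be illustrated with a toy serialization model. The Python sketch below is not the Pearl model used in the experiments: it assumes every processor alternates a fixed compute phase and a fixed transfer phase, that the shared bus serves one transfer at a time in request order, and that the crossbar is idealized as conflict-free. All function names and parameters are hypothetical.

```python
import heapq


def simulate_shared_bus(n_procs, rounds, compute, transfer):
    """Return (makespan, total_wait) when all transfers share a single bus."""
    bus_free_at = 0
    total_wait = 0
    makespan = 0
    # Each entry: (time the processor requests the bus, processor id, rounds left)
    requests = [(compute, p, rounds) for p in range(n_procs)]
    heapq.heapify(requests)
    while requests:
        t_req, p, left = heapq.heappop(requests)
        start = max(t_req, bus_free_at)      # wait until the bus is released
        total_wait += start - t_req
        bus_free_at = start + transfer
        makespan = max(makespan, bus_free_at)
        if left > 1:
            # Next round: compute again, then request the bus once more.
            heapq.heappush(requests, (bus_free_at + compute, p, left - 1))
    return makespan, total_wait


def simulate_ideal_crossbar(n_procs, rounds, compute, transfer):
    """Idealized conflict-free crossbar: transfers never wait for each other."""
    return rounds * (compute + transfer), 0


if __name__ == "__main__":
    print(simulate_shared_bus(4, 2, 10, 5))      # (50, 50)
    print(simulate_ideal_crossbar(4, 2, 10, 5))  # (30, 0)
```

Even in this simplified setting, the waiting time on the shared bus grows with the number of processors, while the idealized crossbar keeps every processor busy. A real crossbar still has conflicts when two processors read from the same virtual buffer, which this sketch ignores.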

5.3. Evaluation of the Methodology for Rapid

Design Space Exploration

The Sesame framework has been used for the design of a very complex media application that must satisfy its design constraints. Given the complexity of the case studied, the previous section outlined the difficulty of evaluating even one design using detailed clock cycle-accurate simulators. For the system-level design case, the simulation times did not exceed 5 minutes for any of the configurations used. Measurements were made on a General-Purpose Processor (GPP) platform based on an Intel Centrino at 1.6 GHz with 512 MB of RAM, running a Linux operating system. Compared with classical RTL-level simulators, this is many orders of magnitude faster and is acceptable for design space exploration.

Figure 9. Reading-Writing/Execution/Idle statistics for the

common-bus-based architecture.

Figure 10. Used Crossbar-structure-based four processors

architecture model.

Figure 11. H.264 encoding performances vs. simulated

processors with the Crossbar model.

Figure 12. Reading-Writing/Execution/Idle statistics for the

crossbar-model-based architecture.

Due to the simplicity and expressive power of Sesame’s Pearl simulation language, all the platform architectures were modeled rapidly. Indeed, system-level modeling relieves the designer of low-level implementation details. Performance evaluation at high abstraction levels makes it possible to control the speed, the required modeling effort, and the attainable accuracy of the simulations. This makes it possible to explore the large design space efficiently in the early design stages. Applying more detailed models at a later stage then allows focused architectural exploration.

Finally, we find that the Sesame methodology facilitates the performance analysis of embedded systems architectures in a way that directly reflects the Y-chart design approach. Essential to this modeling methodology is that an application model is independent of architectural specifics, assumptions on hardware/software partitioning, and timing characteristics. Thus, the application is studied in isolation by means of a functional (behavioral) software model written in a high-level language. Given the complexity of the case studied, an appropriate parallelization methodology was proposed separately and independently to obtain the optimal application model with the best computation and communication workload balance [19]. This results in a good starting model with preliminary estimates of its performance requirements. As a result, a single application model, designed optimally and separately, has been used to exercise different mappings onto a range of architecture models. This clearly demonstrates the strength of decoupling application models from architecture models, which enables the reuse of both types of models.
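
The decoupling of application, mapping, and architecture can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Sesame's implementation: the application is reduced to counts of abstract execute/read/write events per process, an architecture is reduced to a per-event latency table, and all process names, latencies, and counts are hypothetical placeholders.

```python
# Application model: abstract event counts per process, with no timing
# information. All names and numbers are hypothetical placeholders.
APPLICATION = {
    "producer": {"execute": 10, "write": 2},
    "consumer": {"read": 2, "execute": 6},
}

# Architecture models: cycles charged per event kind (assumed values).
ARCHITECTURES = {
    "common_bus": {"execute": 1, "read": 5, "write": 5},
    "crossbar": {"execute": 1, "read": 2, "write": 1},
}

# Mappings: process -> processor, kept separate from both models above.
ONE_CORE = {"producer": "P1", "consumer": "P1"}
TWO_CORES = {"producer": "P1", "consumer": "P2"}


def evaluate(application, mapping, latencies):
    """Accumulate the cycle cost of each process on the processor it is mapped to."""
    load = {}
    for process, events in application.items():
        core = mapping[process]
        cost = sum(count * latencies[kind] for kind, count in events.items())
        load[core] = load.get(core, 0) + cost
    return load


if __name__ == "__main__":
    for arch_name, latencies in ARCHITECTURES.items():
        for mapping in (ONE_CORE, TWO_CORES):
            print(arch_name, mapping, evaluate(APPLICATION, mapping, latencies))
```

The same `APPLICATION` table is reused unchanged for every mapping and architecture combination, which is the reuse property the paragraph above describes.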

6. Conclusions

In this paper, we motivated the use of the Sesame/Artemis system-level design methodology for efficient architectural exploration of heterogeneous embedded media systems of increasing complexity. The case studied is the design of an optimal H.264/AVC encoder SoC that satisfies its design constraints. The complexity analysis of the H.264/AVC reference encoder confirmed the very high complexity of this new standard, the substantial memory allocation needed, and the high volume of computation required. For the design of this complex media system, the outcome, effectiveness, and flexibility of the methods and tools provided by the Sesame methodology have been clearly illustrated. Both the modeling and mapping stages of the Sesame design flow were explored, and a large part of the design space was considered in the search for an optimal design. For all the configurations used, the simulation times did not exceed 5 minutes. In addition, due to the simplicity and expressive power of the architecture specification language, all the proposed platform architectures were modeled rapidly. This makes it possible to explore the large design space efficiently in the early design stages.

7. References

[1] C. Erbas, “System-Level Modeling and Design Space Exploration for Multiprocessor Embedded System-on-Chip Architectures,” Ph.D. Thesis, University of Amsterdam, Amsterdam, 2006.

[2] E. A. Lee, et al., “Overview of the Ptolemy Project,” Technical Memorandum UCB/ERL M01/11, University of California, Berkeley, May 2001.

[3] X. Chen, H. Hsieh and F. Balarin, “Verification Approach of Metropolis Design Framework for Embedded Systems,” International Journal of Parallel Programming, Vol. 34, No. 1, February 2006, pp. 3-27. doi:10.1007/s10766-005-0002-x

[4] L. Cai and D. D. Gajski, “C/C++ Based System Design Flow Using SpecC, VCC and SystemC,” Technical Report CECS_02_30, 14 June 2002.

[5] J. Coffland, “SESAME Users Guide,” Technical Report, University of Amsterdam, 5 April 2006. http://sesamesim.sourceforge.net/docs/SESAME/SESAME_Users_Guide.html

[6] J. E. Coffland and A. D. Pimentel, “A Software Framework for Efficient System Level Performance Evaluation of Embedded Systems,” Proceedings of the 2003 ACM Symposium on Applied Computing, Melbourne, March 2003. doi:10.1145/952532.952663

[7] A. D. Pimentel, S. Polstra, F. Terpstra, A. W. van Halderen, J. E. Coffland and L. O. Hertzberger, “Towards Efficient Design Space Exploration of Heterogeneous Embedded Media Systems,” Technical Report, Department of Computer Science, University of Amsterdam, 2001.

[8] P. van der Wolf, P. Lieverse, M. Goel, D. La Hei and K. Vissers, “An MPEG-2 Decoder Case Study as a Driver for a System Level Design Methodology,” Proceedings of the 7th International Workshop on Hardware/Software Codesign, Rome, 3-5 May 1999, pp. 33-37. doi:10.1145/301177.301196

[9] I. E. G. Richardson, “H.264 and MPEG-4 Video Compression: Video Coding for Next-generation Multimedia,” John Wiley & Sons Ltd, Hoboken, 2003.

[10] M. Alvarez, A. Salami, A. Ramirez and M. Valero, “A Performance Characterization of High Definition Digital Video Decoding Using H264/AVC,” Proceedings of IEEE International Symposium on Workload Characterization, Austin, 6-8 October 2005, pp. 24-33.

[11] A. D. Pimentel, P. Lieverse, P. van der Wolf, L. O. Hertzberger and E. F. Deprettere, “Exploring Embedded-Systems Architectures with Artemis,” IEEE Computer, Vol. 34, No. 11, November 2001, pp. 57-63.

[12] P. Lieverse, P. van der Wolf, E. F. Deprettere and K. A. Vissers, “A Methodology for Architecture Exploration of Heterogeneous Signal Processing Systems,” Journal of VLSI Signal Processing for Signal, Image and Video Technology, Special Issue on SiPS’99, Vol. 29, No. 3, November 2001, pp. 197-207.

[13] G. Kahn, “The Semantics of a Simple Language for Parallel Programming,” In: J. L. Rosenfeld, Ed., Information Processing, Proceedings of the IFIP Congress 74, North-Holland Publishing Co., Stockholm, 5-10 August 1974.

[14] H. Muller, “Pearl: A Language for Architecture Simulation,” 25 February 1993. http://www.pearl.org/

[15] H. Krichene Zrida, A. C. Ammari, A. Jemai and M. Abid, “Performance/Complexity Analysis of a H264 Video Encoder,” International Review on Computers and Software, Vol. 2, No. 4, July 2007, pp. 401-414.

[16] H264 Reference Software Version JM 10.2, November 2005. http://iphome.hhi.de/suehring/tml/

[17] MemProf - Profiling and leak detection, July 2008. http://www.gnome.org/projects/memprof/

[18] H. Krichene Zrida, A. C. Ammari, A. Jemai and M. Abid, “A YAPI-KPN Parallel Model of a H264/AVC Video Encoder,” Proceedings of the 4th IEEE International Conference on Ph.D Research in Microelectronics and Electronics, Istanbul, 22-25 June 2008, pp. 109-112.

[19] H. K. Zrida, A. Jemai, A. C. Ammari and M. Abid, “High Level H.264/AVC Video Encoder Parallelization for Multiprocessor Implementation,” Proceedings of the 12th ACM/IEEE Design Automation and Test in Europe Conference and Exhibition, Nice, 20-24 April 2009, pp. 940-945.

[20] E. A. Kock, G. Essink, W. J. M. Smits, P. van der Wolf, J. Y. Brunel, W. M. Kruijtzer, P. Lieverse and K. A. Vissers, “YAPI: Application Modeling for Signal Processing System,” Proceeding 37th Design Automation Conference, Los Angeles, 5-9 June 2000, pp. 402-405. doi:10.1109/DAC.2000.855344

[21] R. Schäfer, T. Wiegand and H. Schwarz, “The Emerging H264/AVC Standard,” EBU Technical Review, January 2003.

[22] S. L. Graham, P. B. Kessler and M. K. McKusick, “Gprof: A Call Graph Execution Profiler,” Proceedings of the SIGPLAN ’82 Symposium on Compiler Construction, Boston, 23-25 June 1982. http://sourceware.org/binutils/docs/gprof/ (October 2009)