Int. J. Communications, Network and System Sciences, 2011, 4, 436-446
doi:10.4236/ijcns.2011.47052 Published Online July 2011 (
Copyright © 2011 SciRes. IJCNS
System-Level Performance Evaluation of Very High
Complexity Media Applications: A H264/AVC Encoder
Case Study
Hajer Krichene Zrida1, Abderrazek Jemai2, Ahmed C. Ammari3, Mohamed Abid1
1Computer & Embedded Systems Laboratory, National School of Engineers of Sfax, Sfax University, Sfax, Tunisia
2LIP2 Laboratory, Faculty of Science of Tunis, Tunis, Tunisia
3Research Unit in Materials Measurements and Applications, National Institute of Applied Sciences and of the
Technology, Carthage University, Carthage, Tunisia
Received March 23, 2011; revised May 20, 2011; accepted June 10, 2011
Given the substantially increasing complexity of embedded systems, the use of relatively detailed clock cy-
cle-accurate simulators for the design-space exploration is impractical in the early design stages. Raising the
abstraction level is nowadays widely seen as a solution to bridge the gap between the increasing system
complexity and the low design productivity. For this, several system-level design tools and methodologies
have been introduced to efficiently explore the design space of heterogeneous signal processing systems. In
this paper, we demonstrate the effectiveness and the flexibility of the Sesame/Artemis system-level modeling
and simulation methodology for efficient performance evaluation and rapid architectural exploration of
increasingly complex heterogeneous embedded media systems. For this purpose, we have selected a
system-level design of a very high complexity media application: an H.264/AVC (Advanced Video Coding)
video encoder. The encoding performance is evaluated using system-level simulations targeting multiple
heterogeneous multiprocessor platforms.
Keywords: System-Level Performance Evaluation, Embedded Systems Design Space Exploration Tools, the
Sesame/Artemis Design Tool, a Parallel H.264/AVC Video Encoder
1. Introduction
The architectural complexity of System-on-Chip (SoC)-based
embedded systems, as well as the design requirements
regarding real-time performance, high flexibility,
low power consumption and cost, greatly complicate
system design. Nowadays, the classical design methods,
which typically start from a single application specification,
fall short for designing such embedded systems.
In order to cope with the increasing design complexity,
researchers have recently come up with a new design
concept called system-level design [1]. For this purpose,
a new generation of system-level tools and methodolo-
gies has been introduced to efficiently explore the design
space of heterogeneous signal processing systems. Each
tool/methodology directly reflects a well-defined design
The Y-chart layer-based approach, considered the
most popular approach for designing multimedia-oriented
systems, is followed in most recent
system-level design works [1]. It tries to overcome the
shortcomings of the classical HW/SW co-design approach
by abandoning the use of low-level (instruction-level
or cycle-accurate) simulators for design space
exploration at an early stage of the flow, and by abandoning
the use of a single system specification to describe both hardware
and software parts. Indeed, the Y-chart methodology recognizes
a clear separation between an application model,
an architecture model and an explicit mapping step that
relates the application model to the architecture model.
The application model describes the functional behavior
of an application, independent of architectural specifics
like the HW/SW partitioning or timing characteristics.
The architecture model defines the architecture resources,
captures their timing characteristics, and then simulates
the performance consequences of the application events
(communication and computation operations) for soft-
ware (programmable components) and hardware (recon-
figurable/dedicated) executions.
As shown in Figure 1, the Y-chart general design
scheme is composed of four steps [1]. The first step,
“Application Modeling”, aims to capture a functional
specification of the system in the form of a set of benchmark
applications. The second step, “Architecture Modeling”,
consists in modeling the target architecture by the re-
sources available in the system. In embedded systems,
these resources typically are processors, operating sys-
tems, buses and memories. After that, the parallel appli-
cation processes are mapped onto the resources of the
architecture. The result of the mapping step is an imple-
mentation of the system which can be used as an input
for the performance analysis step. Typically, based on
the Y-chart approach principle, a system designer studies
the set of benchmark applications, makes some initial
calculations, and proposes the architecture. The designer
then evaluates and compares several instances of the
platform by mapping each application onto the platform
architecture by means of performance analysis. The
resulting performance numbers may inspire the designer to
improve the architecture, restructure the application, or
change the mapping. The possible designer actions are
shown with the light bulbs in Figure 1.
The outline of the paper is as follows. In Section 2, we
first present the underlying properties of several system-level
design tools. Based on the most important design criteria
considered in our comparative synthesis, the Sesame software
framework is selected as one of the best system-level
design methodologies. In Section 3, we describe the main
features, tools, and methods provided by the Sesame/
Artemis simulation and modeling environment. Section 4
presents a complexity analysis of the H.264/AVC standard
and reviews previous work on the development
of an optimized parallel encoder model. In Section 5,
Sesame is used to evaluate the encoder performance
on multiprocessor architectures, demonstrating
the effectiveness and the flexibility of this design
methodology.
Figure 1. The Y-chart: a general scheme for heterogeneous
system design.
2. System-Level Exploration Tools
Comparative Synthesis
In the literature, there are a number of exploration envi-
ronments that facilitate the system-level design space
exploration by providing support for mapping a behav-
ioral application specification to an architecture platform
model [1]. Although all system-level design methodologies
target the same field, namely designing
embedded systems at a high abstraction level, there is
wide diversity among them. In Table 1, we summarize
the most interesting design properties of some
representative ones.
The study found that the selected methodologies and tools
shown in Table 1 differ first of all in their
HW/SW design approach. Some of them, like the Ptolemy
tool [2], do not support a layered abstraction-level
design approach and use a single specification including
both the functional behavior and the architecture models. Others
support the platform-based design approach (like
Metropolis [3]) or the top-down design methodology (like
SpecC [4]). However, the Y-chart layer-based approach,
which is followed by several recent works, has nowadays
become the most popular approach for designing
heterogeneous multiprocessor embedded systems.
Although the differences among the seven tools are
not absolute, the features shown in Table 1 indicate that
the most preferable methodologies/tools are Metropolis,
VCC [4], and Sesame [5,6], because they have the
largest number of positive marks (“+”). Among these,
the Metropolis environment is excluded from the list of
preferred methodologies since it does not explicitly
facilitate the Y-chart approach. Between the VCC and
Sesame tools, we observe that mixed-level simulation
is only supported by Sesame. For this reason, we have
selected the Artemis/Sesame methodology to
implement the H.264/AVC video encoding application
at system level on a multiprocessor SoC-based architecture.
Indeed, the Sesame/Artemis system-level modeling and
simulation framework was developed to directly reflect
the Y-chart design approach. It provides several methods
and tools to quickly and separately build the application
process network model, the target architecture model,
and the mapping model of the application onto the
architecture.
Currently, Sesame has been evaluated for the design of
two medium complexity media applications: an MPEG-2
decoder and a variant of an M-JPEG encoder [7,8]. Our
objective in this paper is to use this methodology for the
Table 1. Overview of some properties of presented tools and methodologies.
Methodology/tool                           Ptolemy  Polis  Metropolis  SpecC  VCC  SystemC  Sesame/Artemis
Y-chart supported                          -        +      -           -      +    x        +
MoC variety supported                      +        x      +           x      +    x        -
Dynamic performance models                 +        +      +           +      +    +        +
Formal analysis and verification           -        +      +           +      x    x        x
Reusability supported                      +        +      +           +      +    +        +
Complex applications domains supported     x        -      +           x      +    +        +
Target architecture variety supported      x        -      +           +      +    +        +
All abstraction levels supported           -        +      +           +      +    +        +
HW synthesis and IP Integration supported  -        +      +           +      x    -        +
Mixed simulation supported                 -        x      x           x      -    +        +
Automatic HW/SW partitioning supported     -        -      -           -      -    -        -
Automatic mapping                          -        -      +           -      -    -        -
Rapid prototyping                          x        +      +           +      +    +        +
Legend: True, +; False, -; Maybe, x.
design of very complex media systems. The H.264/AVC
reference video encoder represents an example of a very
complex case study typical of the multimedia domain. It
has been designed with the goal of enabling significantly
improved compression performance relative to all exist-
ing video coding standards [9]. Such a standard uses ad-
vanced compression techniques that in turn, require high
computational power [10]. Implementing an H.264/AVC
video encoder on an embedded SoC is thus a big challenge,
since this encoder requires very high computation
power to achieve real-time encoding. In this study, both
the modeling and mapping stages of the Sesame design flow
are explored to obtain an optimal H.264/AVC encoder
implementation that satisfies the design constraints. This will
demonstrate the effectiveness and the flexibility of the methods and
tools provided by this methodology for rapid system-level
design space exploration of complex embedded systems.
3. The Sesame/Artemis Simulation and
Modeling Environment
In this section, we will briefly describe the Sesame/Arte-
mis simulation and modeling environment [11,12]. The
required software model layers are first presented. The
implementation of these layers is based on specific tools.
A brief description of these tools is given, along with
their application in the literature to some medium
complexity multimedia systems.
3.1. The Sesame Layer’s Software Model
Using the Sesame system-level design software framework,
three software specification model layers are required:
the application process network layer, the target
architecture layer, and the layer for mapping the application
onto the architecture, as shown in Figure 2 [6].
3.1.1. Application Modeling Layer
Applications in Sesame are modeled using the Kahn Pro-
cess Network (KPN) model of computation in which pa-
rallel processes, implemented in a high-level language,
communicate with each other via unbounded FIFO chan-
nels [13]. Each process is executed sequentially. Reading
from channels is blocking; writing to channels is non-
blocking. The execution of a Kahn Process Network is
deterministic, meaning that for a given input always the
same output is produced and the same workload is gen-
erated, irrespective of the execution schedule. The model
fits nicely with signal processing applications, as it can
model stream processing with the guarantee that no data
is lost. The key characteristic of the KPN model is that it
specifies an application in terms of distributed control
and distributed memory which allows us to map the ap-
plication onto a multiprocessor platform in a systematic
and efficient way.
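The KPN semantics described above can be illustrated with a minimal Python sketch. It is not Sesame or PNRunner code: the three processes (producer, scale, consumer), the None end-of-stream marker and the use of threads are assumptions made for illustration only. Unbounded queues give non-blocking writes, while get() blocks until a token arrives.

```python
import queue
import threading

def producer(out_ch, n):
    # Writes never block: the channel is an unbounded FIFO.
    for i in range(n):
        out_ch.put(i)
    out_ch.put(None)  # end-of-stream marker (a convention of this sketch)

def scale(in_ch, out_ch, factor):
    # Reads block until a token is available on the input channel.
    while True:
        token = in_ch.get()
        if token is None:
            out_ch.put(None)
            return
        out_ch.put(token * factor)

def consumer(in_ch, result):
    while True:
        token = in_ch.get()
        if token is None:
            return
        result.append(token)

def run_network(n=5, factor=3):
    a, b = queue.Queue(), queue.Queue()  # maxsize=0 means unbounded
    result = []
    threads = [
        threading.Thread(target=producer, args=(a, n)),
        threading.Thread(target=scale, args=(a, b, factor)),
        threading.Thread(target=consumer, args=(b, result)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result
```

Whatever the thread interleaving, repeated calls to run_network with the same arguments return the same token sequence, which is the determinism property noted above.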
3.1.2. Architecture Modeling Layer
An architecture model is constructed from generic build-
ing blocks provided by a library containing template per-
formance models for processors, co-processors, memo-
ries, buffers, busses, and so on. The evaluation of archi-
tecture is performed by simulating the performance con-
sequences of the application model events that are
mapped onto the architecture model. This requires each
process and channel of the Kahn process network to be
associated with, or mapped onto, one component of the
architecture model. When executed, each Kahn process
generates a trace of events, and these event traces are
routed towards a specific component of the architecture
model through a trace event queue. A Kahn process places
its application events into this queue while the corre-
sponding architecture component consumes them (Fig-
ure 2).
Figure 2. The three layers within Sesame: the application
model layer, the architecture model layer, and the mapping
3.1.3. Mapping Layer
The mapping layer maps the event traces generated by
the Kahn processes of an application model onto the re-
sources in the architecture model. In addition, it maps the
Kahn communication channels onto communication
resources at the architecture level. Each Kahn channel can
thus be mapped onto a point-to-point FIFO channel
between two processors or onto a software buffer in shared
memory. As shown in Figure 2, it is possible to map
multiple Kahn processes onto a single architecture
component (e.g., in the case of a programmable component).
Such mappings require the events from the event traces
that are mapped onto the same architecture resource to be
scheduled. This scheduling is also performed by the
mapping layer [7].
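To make the interplay between event traces, mapping and scheduling more concrete, the following Python sketch accumulates the time consumed on each processor by the traces mapped onto it. It is a simplified illustration, not the Sesame implementation: the event kinds, the latency values in LATENCY and the round-robin scheduling policy are assumptions, and blocking on inter-processor communication is ignored.

```python
from collections import deque

# Hypothetical per-event latencies in cycles (not taken from the paper).
LATENCY = {"execute": 10, "read": 4, "write": 4}

def simulate(traces, mapping):
    """Return the time (in cycles) consumed on each processor.

    traces:  process name -> list of event kinds, in program order
    mapping: process name -> processor name
    """
    # Group the trace-event queues by the processor they are mapped onto.
    queues = {}
    for process, events in traces.items():
        queues.setdefault(mapping[process], []).append(deque(events))

    busy = {}
    for cpu, cpu_queues in queues.items():
        clock = 0
        pending = deque(q for q in cpu_queues if q)
        # Round-robin scheduling: consume one event from each queue in turn.
        while pending:
            q = pending.popleft()
            clock += LATENCY[q.popleft()]
            if q:
                pending.append(q)
        busy[cpu] = clock
    return busy

example_traces = {
    "A": ["read", "execute", "write"],
    "B": ["execute", "execute"],
    "C": ["read", "execute"],
}
example_mapping = {"A": "cpu0", "B": "cpu0", "C": "cpu1"}
```

With these example inputs, processes A and B share cpu0, so their events are interleaved by the scheduler and their latencies add up on that processor, while C runs alone on cpu1.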
3.2. Sesame Implementation Tools
In the previous section, we have seen that the Sesame
software structure is composed of three layers: the application
layer, the architecture layer, and the mapping layer,
which is an interface between the two others. All
three layers in Sesame are composed of components
which should be instantiated and connected using some
form of object creation and initialization mechanism, as
shown in Figure 3. This allows code reuse and
guarantees the flexibility to easily manipulate the model
based on performance results, as dictated by the Y-chart
methodology (Figure 1). The three model layers are
implemented by the following tools:
3.2.1. YML Modeling Language
Sesame was developed to guarantee a rapid construction
of the simulation models through the use of libraries of
pre-built architecture simulation components. In order to
enable quick modification, a flexible description format
for the interconnection of these components is required.
For this, the YML (Y-chart Modeling Language) is
defined to create the structure of Sesame’s simulation
models. YML is an XML-based language. Using XML is
attractive because it is simple and flexible, encourages
reuse of model descriptions, and comes with good
programming language support. The core elements of YML
are network, node, port, link, and property [6].
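As an illustration of how such an XML-based description can be processed, the sketch below parses a small YML-like document with Python's standard xml.etree.ElementTree module. Only the five element names (network, node, port, link, property) come from the text; every attribute name and value (name, class, dir, src_node, dst_node, and so on) is a hypothetical placeholder and should not be read as the actual YML schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical YML-like description: element names follow the text,
# attribute names and values are assumptions made for this sketch.
YML_SKETCH = """
<network name="demo" class="KPN">
  <property name="library" value="libdemo.so"/>
  <node name="producer" class="Process">
    <port name="out" dir="out"/>
  </node>
  <node name="consumer" class="Process">
    <port name="in" dir="in"/>
  </node>
  <link src_node="producer" src_port="out" dst_node="consumer" dst_port="in"/>
</network>
"""

def summarize(yml_text):
    """Return (nodes, links): node name -> port names, and (src, dst) pairs."""
    root = ET.fromstring(yml_text.strip())
    nodes = {
        node.get("name"): [port.get("name") for port in node.findall("port")]
        for node in root.findall("node")
    }
    links = [
        (link.get("src_node"), link.get("dst_node"))
        for link in root.findall("link")
    ]
    return nodes, links
```

Calling summarize(YML_SKETCH) recovers the two nodes with their ports and the single link between them, which is the kind of structural information a simulator needs in order to instantiate and connect components.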
3.2.2. Application PNRunner Simulator
Sesame’s application simulator is called PNRunner, or
Process Network Runner. PNRunner implements the
semantics of Kahn process networks in C++. It reads a
YML application description file and executes the
corresponding application model. The PNRunner execution
generates a trace of application events through a trace API
to drive an architecture simulation (Figure 3). Using this
API, PNRunner can send application events (communication
and computation operations) to the architecture
simulator, where their performance consequences are
simulated. Hence, trace-driven application/architecture
co-simulation is possible.
3.2.3. Architecture Pearl Simulator
The target architecture model in Sesame is implemented
in the Pearl discrete event simulation language [14]. Pearl
is a small but powerful object-based language which
provides easy construction of abstract architecture models
and fast performance simulation. It has a C-like syntax
with a few additional primitives for simulation purposes.
A Pearl program is a collection of concurrent objects
which communicate with each other through synchronous
or asynchronous message-passing. After sending an
asynchronous message, the sending object continues
execution, whereas after sending a synchronous message it
waits for a reply message from the receiver.
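The difference between the two message-passing styles can be sketched in Python as follows. This is not Pearl code and does not reproduce Pearl syntax: the Counter class, its message kinds ("add", "get", "stop") and the helper names send_async and send_sync are all hypothetical, chosen only to show a sender that continues immediately versus one that blocks until a reply arrives.

```python
import queue
import threading

class Counter(threading.Thread):
    """A concurrent object with a mailbox (illustrative, not a Pearl object)."""

    def __init__(self):
        super().__init__(daemon=True)
        self.mailbox = queue.Queue()
        self.total = 0

    def run(self):
        while True:
            kind, payload, reply_to = self.mailbox.get()
            if kind == "add":
                self.total += payload
            elif kind == "get":
                reply_to.put(self.total)
            elif kind == "stop":
                return

    def send_async(self, kind, payload=None):
        # Asynchronous: enqueue the message and return immediately.
        self.mailbox.put((kind, payload, None))

    def send_sync(self, kind, payload=None):
        # Synchronous: block until the receiver has replied.
        reply = queue.Queue(maxsize=1)
        self.mailbox.put((kind, payload, reply))
        return reply.get()

def demo():
    counter = Counter()
    counter.start()
    counter.send_async("add", 2)     # sender continues at once
    counter.send_async("add", 3)
    total = counter.send_sync("get")  # sender waits for the reply
    counter.send_async("stop")
    counter.join()
    return total
```

Because the mailbox is a FIFO, both asynchronous "add" messages are handled before the synchronous "get", so the reply reflects them.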
3.3. Medium Complexity Media Case Studies
The Sesame modeling and simulation methodology has
been applied to two medium complexity media applications:
an MPEG-2 decoder and a variant of an M-JPEG
encoder [7,8]. Both studies have been performed at
the black-box architecture model level and showed
promising results. These media applications have been
implemented on various multiprocessor architecture models.
For these architectures, different hardware-software
partitionings, application-to-architecture mappings,
processor speeds, and interconnect structures (common bus,
crossbar, and Omega networks) are evaluated [7]. Based
on the obtained execution performance results, the crossbar
model is shown to perform better, in terms of the measured
number of frames per second, than the Omega network
and common bus structures (about 5% faster than
the Omega network) [7].
Figure 3. Sesame software tools overview.
A large part of the design space has thus been explored
in order to obtain the optimal system design. For these
medium complexity media applications, the performance
evaluation is straightforward and obtained very quickly. For
all the configurations used, the obtained simulation times
range from 5 to 10 seconds. This is many orders of magnitude
faster than using classical RTL-level simulators
and is very acceptable for design space exploration. Given
this, the next sections aim to further demonstrate the
effectiveness and the flexibility of the Sesame methods and
tools even for the design of very complex media applications.
This will be performed by using this methodology
for the design and performance evaluation of an H.264/AVC
reference encoder targeting multiple heterogeneous
multiprocessor platforms.
4. The H.264/AVC Video Encoder Case Study
H.264/AVC has been designed with the goal of enabling
significantly improved compression performance
relative to all existing video coding standards [9]. Such a
standard uses advanced compression techniques that, in
turn, require high computational power [10]. Implementing
an H.264 video encoder on an embedded SoC therefore requires
very high computation power to achieve real-time encoding.
This section first presents a complexity analysis
of the H.264/AVC reference encoder in comparison to
the M-JPEG application case. Then, it reviews previous
work carried out to obtain an optimized parallel model of the
encoder using an appropriate high-level,
target-architecture-independent parallelization approach.
4.1. Complexity Analysis of the H264/AVC
Reference Video Encoder
The complexity of the H.264/AVC encoder application
depends on the algorithm, the encoding tool options, the
input sequences and the architecture on which it is
implemented. In a previous work [15], we performed a
complete high-level performance and complexity analysis
of an H.264/AVC video encoding application. The
experiments have been performed on a General-Purpose
Processor (GPP) platform, a 1.6 GHz Intel Centrino,
using the JM 10.2 reference software version [16] with the
Main profile @ level 4. For an optimal balance between
the encoding efficiency and the implementation cost, a
proper use of the H.264/AVC tools has been proposed to
maintain an acceptable performance while considerably
reducing complexity. Using the obtained optimal encoding
tools for a very low bit rate 7 frames QCIF “bridge
far” sequence, the computing time for the encoding process
on the GPP platform is 15.2 seconds. The associated
complexity in frames per second is 2.16 fps. For
this test sequence, the peak memory usage is also measured
using the “memprof” GNU profiler [17]; the obtained
peak memory cost is 5.02 MB. This result refers to
non-optimized source code. Platform-independent memory
optimizations through C-level code transformations may be
applied to get a memory- and algorithm-optimized version
of the reference code.
In comparison to the Motion JPEG application presented
in Section 3.3, the non-optimized H.264/AVC reference
encoder is about two to three orders of magnitude
more complex in terms of computing time and memory
usage. To illustrate this, the dynamic instruction distribution
by operation type has been obtained for both applications
using an “objdump” utility and is reported in
Figure 4. For H.264/AVC, the dominant instruction
types are the “arithmetic” and the “Memory” (Load/Store)
operations. These results confirm the
very high complexity of this new standard, the substantial
memory allocation needed and the high volume of
computation required. The SoC implementation of such a
complex application will point out the outcome of the
Sesame methodology for the design of such complex
4.2. High Level Parallel Specification of a
H.264/AVC Video Encoder
To speed up the computation of this encoder, a multiprocessor
implementation is probably needed. Prior to this
implementation, the sequential encoding reference C source
code [16] should be transformed into concurrent KPN
tasks communicating via dedicated FIFO channels. The
goal of this step is to extract the available task-parallelism
from the application by splitting compute nodes as
far as possible to get a valid parallel KPN model of the
encoder. For an optimal design flow, it is our aim to
provide a parallel specification of the application which
forms a good starting point for mapping onto different
systems-on-chip platforms. To do so, we proposed in a
previous work a high-level, target-architecture-independent
parallelization approach [18,19] to get an optimized
parallel model of the encoder with the best computation
and communication workload balance.
Figure 4. The dynamic instruction distribution by operation types for both the H.264/AVC and M-JPEG applications.
The proposed parallelization approach is based on the
use of the KPN/YAPI [20] parallel programming models
of computation, and the selection of a fine-grain macroblock
communication granularity level. The key characteristic
of the approach is the simultaneous exploration of
the two predominant concepts of parallelism: data-level
partitioning and task-level splitting and merging.
This means that communication and computation workload
analyses are needed to provide global guidance
when optimizing concurrency between processes. In general,
when the concurrency bottlenecks are identified,
task-level and data-level splitting and/or merging are
performed to better distribute the computing workload
over the processes. For the most computationally expensive
tasks, data splitting is proposed for better concurrency
optimization [18].
Given the proposed parallelization approach, the Task
Level Parallelism (TLP) is first considered. The goal of
this step is to extract the available task-parallelism by
splitting compute nodes as far as possible to get a first
valid parallel KPN model of the encoder. For
this case, the encoder block diagram [21] has served as a
starting point for extracting the task-level parallelism.
Then, to get a parallel implementation of the encoder with
the best computation and communication workload balance,
different steps of task-level splitting or merging and
data-level splitting are used to derive a final optimized
model in a structured way. Further details on the different
steps used are given in [19]. Finally, the optimal model
obtained is given in Figure 5. This figure shows that the
low-complexity DCT, Quantization, Decoder, and Filter
modules have been merged into a single “Dct_Dec_Filter”
process. For the most computationally expensive
motion estimation and compensation “Mec” task, a data
partitioning strategy has been considered to distribute the
computation of this process over three processes, “Mec1”,
“Mec2”, and “Mec3”, with tripling of the associated
input/output FIFO channels.
Given the “Gprof” [22] computation profiling results
of the obtained parallel model reported in Figure 6, it is
clear that the final proposed model has good concurrency
properties with an acceptable computation and
communication workload balance.
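The data partitioning of the “Mec” task can be pictured with a small Python sketch that distributes the macroblocks of one frame over three workers. The 16 × 16 macroblock size is that of H.264 and the 176 × 144 frame size is the QCIF resolution used in the experiments; the round-robin assignment by macroblock index is an assumption made for illustration, since the actual partitioning scheme is detailed in [19] rather than here.

```python
MB_SIZE = 16  # an H.264 macroblock covers 16x16 luma samples

def macroblock_count(width, height):
    """Number of whole macroblocks in a frame of the given luma size."""
    return (width // MB_SIZE) * (height // MB_SIZE)

def split_round_robin(n_macroblocks, n_workers):
    """Assign macroblock indices to workers in round-robin order (assumed policy)."""
    parts = [[] for _ in range(n_workers)]
    for mb in range(n_macroblocks):
        parts[mb % n_workers].append(mb)
    return parts

qcif_macroblocks = macroblock_count(176, 144)
mec_parts = split_round_robin(qcif_macroblocks, 3)  # stand-ins for Mec1..Mec3
```

Under these assumptions, a QCIF frame contains 99 macroblocks, so each of the three workers receives 33 of them.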
5. System Level Performance Evaluation of
the H.264/AVC Video Encoder
This section demonstrates the effectiveness and the flexibility
of the Sesame system-level design methodology for
efficient and rapid design space exploration of such complex
systems. For this, the base target architecture and
the mapping strategy are first presented. The Sesame
system-level design is then used for performance evaluation
and design space exploration of the encoder targeting
multiple multiprocessor architectures. Finally, the efficacy
of the methodology is evaluated for the design of
very complex systems.
5.1. The Base Target Architecture and Mapping
Starting with the Sesame system-level design methodology
presented in Section 3, three software model specifications
are required: the application process network
model, the target architecture model, and the mapping
model of the application onto the architecture. For this,
the optimized parallel model of the H.264/AVC encoder
of Figure 5 is first ported to the Sesame framework. This
has been performed by transforming the previously validated
YAPI model into a C++ PNRunner network model.
The obtained network model is then simulated with the
PNRunner simulator to generate the computation and
communication event traces of the application execution,
called trace-event queues [6].
Figure 5. Proposed optimized parallel KPN model of the H.264 encoder.
Figure 6. Computation workload profiling of the final parallel model.
In parallel with the application model specification, the target
architecture is modeled with the Pearl object-based
simulation language. The Sesame environment provides
a small library of architecture black-box base models:
processing cores, a generic bus, a generic memory, and
several interfaces for connecting these base model building
blocks. Once a target architecture model is validated,
a trace-driven co-simulation of the application event
trace queues mapped onto the architectural components is
carried out. Such a co-simulation requires an explicit
mapping of the KPN processes and channels to the particular
components of the target architecture. More than
one KPN process can be mapped to the same processor, as
the system simulator automatically schedules the events
from the different queues.
In our case, the base target architecture, given in Figure 7,
is a multiprocessor platform communicating
with a shared DRAM memory through a common
bus. For this platform, we have used general purpose
processors (assumed to be MIPS R3000), and assumed
a conservative timing of 100 ns to read/write a
64-bit word from/to DRAM. The instruction latencies for
the MIPS R3000 microprocessor components were estimated
using technical documentation. Communication
between components is performed through buffers in
shared memory.
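To give a feeling for what this DRAM timing implies, the following Python sketch estimates the time needed to move one QCIF frame through the shared memory. The 100 ns access time and the 64-bit word size are taken from the text; the 4:2:0 chroma subsampling and the assumption of one fully serialized access per word, with no bursts or caching, are ours.

```python
WORD_BYTES = 8    # 64-bit word, as stated in the text
ACCESS_NS = 100   # conservative DRAM access time, as stated in the text

def frame_bytes_yuv420(width, height):
    """Size of one 8-bit YUV frame, assuming 4:2:0 chroma subsampling."""
    luma = width * height
    chroma = 2 * (width // 2) * (height // 2)
    return luma + chroma

def transfer_time_us(n_bytes):
    """Time to read or write n_bytes, one serialized access per 64-bit word."""
    words = -(-n_bytes // WORD_BYTES)  # ceiling division
    return words * ACCESS_NS / 1000.0

qcif_bytes = frame_bytes_yuv420(176, 144)
qcif_time_us = transfer_time_us(qcif_bytes)
```

Under these assumptions, one QCIF frame occupies 38016 bytes, that is 4752 words, so a single pass over the frame costs about 475 microseconds of memory time.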
Figure 7. H.264 encoder’s application to architecture mapping.
For sufficient design space exploration, several platform
models are used. The platforms differ in the number
of processors used: one platform has two processors,
a second has four, and a third has
six. Given the optimized parallel model of the
H.264/AVC encoder, the Sesame design space exploration
consisted in changing the mapping combination or
adding another architectural component without touching
the H.264 encoding parallel specification, since this
application already presents good concurrency properties
with an acceptable computation and communication
workload balance [19]. When such a system modification
is performed, we first have to recompile the hardware
architecture, and then the entire system, to regenerate
the new YML files of the target architecture and mapping
layers. Adding a new architectural component
consists in accessing the Sesame library of black-box
component models through the “YMLEditor” graphical YML
editor, and then, with a few clicks, adding the component
to the architecture model and connecting it to the bus.
The “YMLEditor” editor is also used to quickly map
the application tasks and Kahn communication channels
onto the different architecture resources.
The mapping of application processes onto this platform has
been decided explicitly, given the computation
and communication load distribution results of Figure 6.
For the two-processor platform, the total computation
load has been distributed between the two processors:
the “Mec1”, “Mec2”, and “Dct_Dec_Filter” processes
are mapped to one processor, and all the others to the second.
The mapping strategy used with the four-processor
platform is shown in Figure 7. In this case,
the most complex “Mec1”, “Mec2”, “Mec3”, and “Intra-Pred”
tasks are first mapped separately, each to its own core, to
guarantee their concurrent execution. Then,
the “Dct_Dec_Filter” process is added to run with the
“Mec2” process on the same core. The “Vlc” process is also
added alongside the “Intra-Pred” process on the
fourth processor.
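The four-processor mapping described above can be written down as a simple table and inverted to see which processes share a core. In this Python sketch, the process names come from the text, whereas the processor labels P1 to P4 and the assignment of “Mec1” and “Mec3” to P1 and P3 are assumptions; processes of the model that are not named in this paragraph are left out.

```python
# Four-processor mapping as described in the text. The labels P1..P4 and
# the choice of P1/P3 for Mec1/Mec3 are assumptions of this sketch.
FOUR_CPU_MAPPING = {
    "Mec1": "P1",
    "Mec2": "P2",
    "Dct_Dec_Filter": "P2",
    "Mec3": "P3",
    "Intra-Pred": "P4",
    "Vlc": "P4",
}

def processes_per_processor(mapping):
    """Invert a process -> processor mapping into processor -> processes."""
    groups = {}
    for process, cpu in mapping.items():
        groups.setdefault(cpu, []).append(process)
    return groups

groups = processes_per_processor(FOUR_CPU_MAPPING)
```

The inverted view makes the co-located processes explicit: “Mec2” shares a core with “Dct_Dec_Filter”, and “Intra-Pred” shares the fourth core with “Vlc”.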
5.2. System Level Performance Evaluation and
Design Space Exploration
After having mapped the optimized PNRunner H.264/AVC
network model to the different platforms with
two, four and six microprocessors, the performance analysis
step is performed by system-level simulations. In all
the experiments, the input test video sequence consists of
YUV frames captured in QCIF resolution (176 × 144
pixels). The simulation results of the H.264/AVC encoding
process for the QCIF “Bridge-close” sequence are obtained for
the different platforms and are presented in
Figure 8. It is clear from this figure that the
encoding performance obtained in frames per second
improves linearly as the number of simulated
microprocessors is increased. For each case, as the
application model is considered to be optimal, the
execution/communication performance may be improved
by changing the mapping policy and/or the platform
architecture.
Figure 8. H.264/AVC encoding performances vs. simulated
processors with the common bus structure.
To modify the architecture, a designer can also explore
other communication models or enhance the architecture
with hardware components using appropriate HW/SW
partitioning. For example, for the four-processor platform
with the common bus structure, the execution/communication
workload is obtained for each architecture component. The
results are shown in Figure 9. For each component, a bar
shows the breakdown of the time spent reading/writing,
being busy, and being idle. The figure shows that the
computation cost is much higher than the time spent
reading from and writing to the shared memory. The
communication and computation loads are nearly balanced
across all components. This result confirms the good
concurrency properties of the proposed optimized parallel
application model and the suitability of the mapping
policy used. However, for all architecture components the
idle times are very large compared with the busy times.
This has probably caused a substantial degradation of the
final encoding performance. Given the large amount of data
communicated between processes during encoding, it is
clear that the common memory bus structure constitutes a
serious communication bottleneck. Indeed, the strong data
dependency between processors requires substantial memory
access and allocation for the read/write operations. For a
common-bus-based multiprocessor architecture, this
saturates the bus, so a lot of time is spent waiting to
read/write data from/to other components.
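The per-component breakdown plotted in Figure 9 can be derived from raw cycle counters along the following lines. This is only a sketch: the counter names and the sample numbers are hypothetical and are not taken from the Sesame output or from the measurements reported here.

```python
# Turn per-component cycle counters into a read-write/busy/idle breakdown.
# The counter names and the sample values below are hypothetical.

def breakdown(io_cycles: int, busy_cycles: int, idle_cycles: int) -> dict:
    """Return the percentage of total time spent in each state."""
    total = io_cycles + busy_cycles + idle_cycles
    if total == 0:
        raise ValueError("component recorded no cycles")
    return {
        "io": 100.0 * io_cycles / total,
        "busy": 100.0 * busy_cycles / total,
        "idle": 100.0 * idle_cycles / total,
    }


def is_starved(shares: dict) -> bool:
    """Flag a component that idles longer than it computes."""
    return shares["idle"] > shares["busy"]


if __name__ == "__main__":
    # Made-up counters for one processor attached to the shared bus.
    shares = breakdown(io_cycles=50, busy_cycles=250, idle_cycles=700)
    print(shares)              # {'io': 5.0, 'busy': 25.0, 'idle': 70.0}
    print(is_starved(shares))  # True
```

A component flagged by `is_starved` matches the situation described above: little time is spent on the transfers themselves, but most of the time is spent waiting.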
For further design space exploration, and in order to
reduce the communication bottleneck observed for the
common-bus-based architecture, other interprocessor
communication structures and topologies should be tested.
The Sesame framework provides Pearl models of a crossbar
and of an Omega network [7]. We therefore selected the
crossbar switch structure in our experiments as a
replacement for the common bus model. A Pearl simulation
model of a 4 × 4 crossbar switch is implemented, as shown
in Figure 10. In this architecture, the processors
communicate with each other over the crossbar. The memory
is distributed per processor and resides in the Virtual
Buffers (VBs). Data is written to the virtual buffer
associated with the writing processor. Only reads are
forwarded over the crossbar, although it is possible to
use it for write calls as well. The performance results
for the different platforms are presented in Figure 11. As
shown in Figure 11, the crossbar structure yields a
substantial gain in encoding performance, in frames per
second (fps), compared with the common bus architecture.
We achieved 9.6 fps with six processors (MIPS R3000)
connected via a 4 × 4 crossbar communication model. In
addition, for the four-processor platform, the
execution/communication workload is obtained for each
component. The results are reported in Figure 12. The
statistics of Figure 12 clearly show that the components
spend much more time busy doing work and much less time
waiting for reads and writes. This confirms the
performance gain obtained.
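The difference between the two interconnects can be illustrated with an idealized back-of-the-envelope model. The sketch below is not the Pearl model used in the experiments: it ignores arbitration and setup costs and simply assumes that a shared bus serializes every transfer, whereas a crossbar lets transfers between disjoint reader/buffer pairs overlap. The transfer lists and cycle counts are invented for illustration.

```python
from collections import defaultdict

# A transfer is (reader, buffer_owner, cycles): processor `reader` reads data
# from the virtual buffer that belongs to processor `buffer_owner`.


def bus_cycles(transfers):
    """Shared bus: only one transfer at a time, so all durations add up."""
    return sum(cycles for _reader, _owner, cycles in transfers)


def crossbar_cycles(transfers):
    """Idealized crossbar: transfers overlap unless they share a reader or a buffer.

    The result is a lower bound on the completion time, set by the busiest
    reader port or buffer port.
    """
    reader_load = defaultdict(int)
    owner_load = defaultdict(int)
    for reader, owner, cycles in transfers:
        reader_load[reader] += cycles
        owner_load[owner] += cycles
    return max(list(reader_load.values()) + list(owner_load.values()), default=0)


if __name__ == "__main__":
    # Four processors, each reading 100 cycles' worth of data from a neighbour.
    ring = [(0, 1, 100), (1, 2, 100), (2, 3, 100), (3, 0, 100)]
    print(bus_cycles(ring))       # 400
    print(crossbar_cycles(ring))  # 100

    # Hot spot: three processors all read from the buffer of processor 0.
    hot_spot = [(1, 0, 100), (2, 0, 100), (3, 0, 100)]
    print(bus_cycles(hot_spot))       # 300
    print(crossbar_cycles(hot_spot))  # 300
```

In this toy model the crossbar only helps when the traffic is spread over several buffers; when every read targets the same buffer it behaves like the bus.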
5.3. Evaluation of the Methodology for Rapid
Design Space Exploration
The Sesame framework has been used for the design of a
very complex media application satisfying constraints.
Given the complexity of the case studied, the previous
section outlined the difficulty of evaluating even one
design using detailed clock cycle-accurate simulators. In
the system-level design case, the simulation times did not
exceed 5 minutes for any of the configurations used.
Measurements were made on a General-Purpose Processor
(GPP) platform based on an Intel Centrino at 1.6 GHz with
512 MB of RAM, running a Linux operating system. Compared
with classical RTL-level simulators, this is many orders
of magnitude faster and is acceptable for design space
exploration.
Figure 9. Reading-Writing/Execution/Idle statistics for the
common-bus-based architecture.
Figure 10. Four-processor architecture model based on the
crossbar structure.
Figure 11. H.264 encoding performances vs. simulated
processors with the Crossbar model.
Figure 12. Reading-Writing/Execution/Idle statistics for the
crossbar-model-based architecture.
Owing to the simplicity and expressive power of Sesame’s
Pearl simulation language, all the platform architectures
were modeled rapidly. Indeed, system-level modeling
relieves the designer of low-level implementation details.
Performance evaluation at high abstraction levels makes it
possible to control the speed, required modeling effort,
and attainable accuracy of the simulations. This makes it
possible to explore the large design space efficiently in
the early design stages. Applying more detailed models at
a later stage allows focused architectural exploration.
Finally, we find that the Sesame methodology facilitates
the performance analysis of embedded systems architectures
in a way that directly reflects the Y-chart design
approach. Essential to this modeling methodology is that
an application model is independent of architectural
specifics, assumptions on hardware/software partitioning,
and timing characteristics. Thus, the application is
studied in isolation by means of a functional (behavioral)
software model written in a high-level language. Given the
complexity of the case studied, an appropriate
parallelization methodology was proposed separately and
independently to obtain the optimal application model with
the best computation and communication workload balance
[19]. This results in a good starting model with
preliminary estimates of its performance requirements. As
a result, a single application model, designed optimally
and separately, has been used to exercise different
mappings onto a range of architecture models. This clearly
demonstrates the strength of decoupling application models
from architecture models, which enables the reuse of both
types of models.
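The decoupling described above can be sketched as follows: one application model, here reduced to a table of per-process computation costs, is mapped unchanged onto architecture models with different processor counts. This is a toy illustration rather than Sesame's actual interface; the process names, the cycle costs, and the round-robin policy are all assumptions made for the example.

```python
# One application model, several architecture models (toy illustration).
# Process names, costs, and the round-robin policy are hypothetical.

APPLICATION = {  # process name -> computation cost in cycles per frame
    "p0": 400, "p1": 300, "p2": 200, "p3": 100, "p4": 250, "p5": 150,
}


def round_robin_mapping(processes, n_processors):
    """Assign processes to processors 0..n-1 in turn."""
    return {name: i % n_processors for i, name in enumerate(processes)}


def processor_loads(application, mapping, n_processors):
    """Sum the cost of the processes mapped onto each processor."""
    loads = [0] * n_processors
    for name, cost in application.items():
        loads[mapping[name]] += cost
    return loads


if __name__ == "__main__":
    for n in (2, 4, 6):  # the platform sizes explored in this study
        mapping = round_robin_mapping(APPLICATION, n)
        loads = processor_loads(APPLICATION, mapping, n)
        # The busiest processor bounds the achievable frame time.
        print(n, loads, max(loads))
```

The application table never changes between runs; only the mapping and the processor count do, which mirrors how a single application model was reused across the explored platforms.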
6. Conclusions
In this paper, we motivated the use of the Sesame/Artemis
system-level design methodology for efficient
architectural exploration of increasingly complex
heterogeneous embedded media systems. The case studied is
the design of an optimal H.264/AVC encoder SoC satisfying
constraints. The complexity analysis of the H.264/AVC
reference encoder confirmed the very high complexity of
this new standard, the substantial memory allocation
needed, and the high volume of computation required. For
the design of this complex media system, the benefits,
effectiveness, and flexibility of the methods and tools
provided by the Sesame methodology have been clearly
illustrated. Both the modeling and mapping stages of the
Sesame design flow were explored. Extensive design space
exploration was carried out in the search for an optimal
design. The simulation times did not exceed 5 minutes for
any of the configurations used. In addition, owing to the
simplicity and expressive power of the architecture
specification language, all the proposed platform
architectures were modeled rapidly. This makes it possible
to explore the large design space efficiently in the early
design stages.
7. References
[1] C. Erbas, “System-Level Modeling and Design Space
Exploration for Multiprocessor Embedded System-on-
Chip Architectures,” Ph.D Thesis, University of Amster-
dam, Amsterdam, 2006.
[2] E. A. Lee, et al., “Overview of the Ptolemy Project,”
Technical Memorandum UCB/ERL M01/11, University
of California, Berkeley, May 2001.
[3] X. Chen, H. Hsieh and F. Balarin, “Verification Approach
of Metropolis Design Framework for Embedded Sys-
tems,” International Journal of Parallel Programming,
Vol. 34, No. 1, February 2006, pp. 3-27.
[4] L. Cai and D. D. Gajski, “C/C++ Based System Design
Flow Using SpecC, VCC and SystemC,” Technical Re-
port CECS_02_30, 14 June 2002.
[5] J. Coffland, “SESAME Users Guide,” Technical Report,
University of Amsterdam, 5 April 2006.
[6] J. E. Coffland and A. D. Pimentel, “A Software Frame-
work for Efficient System Level Performance Evaluation
of Embedded Systems,” Proceedings of the 2003 ACM
Symposium on Applied Computing, Melbourne, March
2003. doi:10.1145/952532.952663
[7] A. D. Pimentel, S. Polstra, F. Terpstra, A. W. van Hal-
deren, J. E. Coffland and L. O. Hertzberger, “Towards Ef-
ficient Design Space Exploration of Heterogeneous Em-
bedded Media Systems,” Technical Report, Department of
Computer Science, University of Amsterdam, 2001.
[8] P. van der Wolf, P. Lieverse, M. Goel, D. La Hei and K.
Vissers, “An MPEG-2 Decoder Case Study as a Driver
for a System Level Design Methodology,” Proceedings
of the 7th International Workshop on Hardware/Software
Codesign, Rome, 3-5 May 1999, pp. 33-37.
[9] I. E. G. Richardson, “H.264 and MPEG-4 Video
Compression: Video Coding for Next-generation Multimedia,”
John Wiley & Sons Ltd, Hoboken, 2003.
[10] M. Alvarez, A. Salami, A. Ramirez and M. Valero, “A
Performance Characterization of High Definition Digital
Video Decoding Using H264/AVC,” Proceedings of the IEEE
International Symposium on Workload Characterization,
Austin, 6-8 October 2005, pp. 24-33.
[11] A. D. Pimentel, P. Lieverse, P. van der Wolf, L. O.
Hertzberger and E. F. Deprettere, “Exploring Embedded-
Systems Architectures with Artemis,” IEEE Computer,
Vol. 34, No. 11, November 2001, pp. 57-63.
[12] P. Lieverse, P. van der Wolf, E. F. Deprettere and K. A.
Vissers, “A Methodology for Architecture Exploration of
Heterogeneous Signal Processing Systems,” Journal of
VLSI Signal Processing for Signal, Image and Video
Technology, Special Issue on SiPS’99, Vol. 29, No. 3,
November 2001, pp. 197-207.
[13] G. Kahn, “The Semantics of a Simple Language for Par-
allel Programming,” In: J. L. Rosenfeld, Ed., Information
Processing, Proceedings of the IFIP Congress 74, North-
Holland Publishing Co., Stockholm, 5-10 August 1974.
[14] H. Muller, “Pearl: A Language for Architecture Simula-
tion,” 25 February 1993.
[15] H. Krichene Zrida, A. C. Ammari, A. Jemai and M. Abid,
“Performance/Complexity Analysis of a H264 Video
Encoder,” International Review on Computers and Soft-
ware, Vol. 2, No. 4, July 2007, pp. 401-414.
[16] H264 Reference Software Version JM 10.2, November
[17] MemProf—Profiling and leak detection, July 2008.
[18] H. Krichene Zrida, A. C. Ammari, A. Jemai and M. Abid,
“A YAPI-KPN Parallel Model of a H264/AVC Video
Encoder,” Proceedings of the 4th IEEE International
Conference on Ph.D Research in Microelectronics and
Electronics, Istanbul, 22-25 June 2008, pp. 109-112.
[19] H. K. Zrida, A. Jemai, A. C. Ammari and M. Abid, “High
Level H.264/AVC Video Encoder Parallelization for
Multiprocessor Implementation,” Proceedings of the 12th
ACM/IEEE Design Automation and Test in Europe Conference
and Exhibition, Nice, 20-24 April 2009, pp. 940-945.
[20] E. A. Kock, G. Essink, W. J. M. Smits, P. van der Wolf, J.
Y. Brunel, W. M. Kruijtzer, P. Lieverse and K. A. Vissers,
“YAPI: Application Modeling for Signal Processing
Systems,” Proceedings of the 37th Design Automation
Conference, Los Angeles, 5-9 June 2000, pp. 402-405.
[21] R. Schäfer, T. Wiegand and H. Schwarz, “The Emerging
H264/AVC Standard,” EBU Technical Review, January
[22] S. L. Graham, P. B. Kessler and M. K. McKusick, “Gprof:
A Call Graph Execution Profiler,” Proceedings of the
SIGPLAN 82 Symposium on Compiler Construction,
Boston, 23-25 June 1982.