Int. J. Communications, Network and System Sciences, 2011, 4, 436-446
doi:10.4236/ijcns.2011.47052 Published Online July 2011 (http://www.SciRP.org/journal/ijcns)
Copyright © 2011 SciRes. IJCNS

System-Level Performance Evaluation of Very High Complexity Media Applications: A H264/AVC Encoder Case Study

Hajer Krichene Zrida1, Abderrazek Jemai2, Ahmed C. Ammari3, Mohamed Abid1
1Computer & Embedded Systems Laboratory, National School of Engineers of Sfax, Sfax University, Sfax, Tunisia
2LIP2 Laboratory, Faculty of Science of Tunis, Tunis, Tunisia
3Research Unit in Materials Measurements and Applications, National Institute of Applied Sciences and of the Technology, Carthage University, Carthage, Tunisia
E-mail: hajer_kri@yahoo.co.nz
Received March 23, 2011; revised May 20, 2011; accepted June 10, 2011

Abstract

Given the substantially increasing complexity of embedded systems, the use of relatively detailed clock cycle-accurate simulators for the design-space exploration is impractical in the early design stages. Raising the abstraction level is nowadays widely seen as a solution to bridge the gap between the increasing system complexity and the low design productivity. For this, several system-level design tools and methodologies have been introduced to efficiently explore the design space of heterogeneous signal processing systems. In this paper, we demonstrate the effectiveness and the flexibility of the Sesame/Artemis system-level modeling and simulation methodology for efficient performance evaluation and rapid architectural exploration of the increasing complexity heterogeneous embedded media systems. For this purpose, we have selected a system level design of a very high complexity media application; a H.264/AVC (Advanced Video Codec) video encoder. The encoding performances will be evaluated using system-level simulations targeting multiple heterogeneous multiprocessors platforms.
Keywords: System-Level Performance Evaluation, Embedded Systems Design Space Exploration Tools, the Sesame/Artemis Design Tool, a Parallel H.264/AVC Video Encoder

1. Introduction

The architectural complexity of System-on-Chip (SoC)-based embedded systems, together with design requirements regarding real-time performance, high flexibility, low power consumption, and cost, greatly complicates system design. The classical design methods, which typically start from a single application specification, now fall short for designing such embedded systems. To cope with the increasing design complexity, researchers have introduced a design concept called system-level design [1]. Accordingly, a new generation of system-level tools and methodologies has been introduced to efficiently explore the design space of heterogeneous signal processing systems. Each tool or methodology directly reflects a well-defined design flow.

The Y-chart layer-based approach, considered the most popular approach for designing multimedia-oriented systems, is followed in most recent system-level design works [1]. It addresses the shortcomings of the classical HW/SW co-design approach in two ways: it abandons the use of low-level (instruction-level or cycle-accurate) simulators for design space exploration at the early stages of the flow, and it abandons the use of a single system specification to describe both the hardware and the software parts. Indeed, the Y-chart methodology recognizes a clear separation between an application model, an architecture model, and an explicit mapping step that relates the application model to the architecture model. The application model describes the functional behavior of an application, independent of architectural specifics such as the HW/SW partitioning or timing characteristics.
The architecture model defines the architecture resources, captures their timing characteristics, and then simulates the performance consequences of the application events (communication and computation operations) for software (programmable components) and hardware (reconfigurable/dedicated) executions.

As shown in Figure 1, the general Y-chart design scheme is composed of four steps [1]. The first step, “Application Modeling”, captures a functional specification of the system in the form of a set of benchmark applications. The second step, “Architecture Modeling”, models the target architecture in terms of the resources available in the system. In embedded systems, these resources are typically processors, operating systems, buses, and memories. In the third step, the parallel application processes are mapped onto the resources of the architecture. The result of the mapping step is an implementation of the system, which serves as input to the fourth step, performance analysis. Typically, following the Y-chart approach, a system designer studies the set of benchmark applications, makes some initial calculations, and proposes an architecture. The designer then evaluates and compares several instances of the platform by mapping each application onto the platform architecture and analyzing the resulting performance. The resulting performance numbers may inspire the designer to improve the architecture, restructure the application, or change the mapping. These possible designer actions are shown as light bulbs in Figure 1.

The outline of the paper is as follows. In Section 2, we first present the underlying properties of several system-level design tools. Based on the most important design criteria considered in our comparative synthesis, the Sesame software framework is selected as one of the best system-level design methodologies.
In Section 3, we describe the main features, tools, and methods provided by the Sesame/Artemis simulation and modeling environment. Section 4 presents a complexity analysis of the H.264/AVC standard and reviews previous work on the development of an optimized parallel encoder model. In Section 5, Sesame is used to evaluate the encoder performance on multiprocessor architectures, which demonstrates the effectiveness and the flexibility of this design methodology.

Figure 1. The Y-chart: a general scheme for heterogeneous system design.

2. System-Level Exploration Tools Comparative Synthesis

In the literature, a number of exploration environments facilitate system-level design space exploration by providing support for mapping a behavioral application specification onto an architecture platform model [1]. Although all system-level design methodologies target the same field, namely designing embedded systems at a high abstraction level, there is wide diversity among them. In Table 1, we summarize the most interesting design properties of some representative ones.

The selected methodologies and tools shown in Table 1 differ first of all in their HW/SW design approach. Some of them, like the Ptolemy tool [2], do not support a layered design approach and use a single specification that includes both the functional behavior and the architecture models. Others support the platform-based design approach (like Metropolis [3]) or the top-down design methodology (like SpecC [4]). However, the Y-chart layer-based approach, which is followed by several recent works, has become the most popular and widely used approach for designing heterogeneous multiprocessor embedded systems.
Although the differences among the seven tools are not absolute, the features shown in Table 1 indicate that the preferable methodologies/tools are Metropolis, VCC [4], and Sesame-like tools [5,6], because they have the largest number of positive marks (“+”). Metropolis is then excluded from this list since it does not explicitly facilitate the Y-chart approach. Between VCC and Sesame, we observe that mixed-level simulation is supported only by Sesame. For this reason, we selected the Artemis/Sesame methodology to implement the H.264/AVC video encoding application at system level on a multiprocessor SoC-based architecture. Indeed, the system-level modeling and simulation framework Sesame/Artemis was developed to directly reflect the Y-chart design approach. It provides several methods and tools to quickly and separately build the application process network model, the target architecture model, and the mapping model of the application onto the architecture.

Table 1. Overview of some properties of presented tools and methodologies.

| Methodology/tool | Ptolemy | Polis | Metropolis | SpecC | VCC | SystemC | Sesame/Artemis |
|---|---|---|---|---|---|---|---|
| Y-chart supported | - | + | - | - | + | x | + |
| MoC variety supported | + | x | + | x | + | x | - |
| Dynamic performance models | + | + | + | + | + | + | + |
| Formal analysis and verification | - | + | + | + | x | x | x |
| Reusability supported | + | + | + | + | + | + | + |
| Complex applications domains supported | x | - | + | x | + | + | + |
| Target architecture variety supported | x | - | + | + | + | + | + |
| All abstraction levels supported | - | + | + | + | + | + | + |
| HW synthesis and IP Integration supported | - | + | + | + | x | - | + |
| Mixed simulation supported | - | x | x | x | - | + | + |
| Automatic HW/SW partitioning supported | - | - | - | - | - | - | - |
| Automatic mapping | - | - | + | - | - | - | - |
| Rapid prototyping | x | + | + | + | + | + | + |

Legend: True, +; False, -; May be, x.

So far, Sesame has been evaluated for the design of two medium complexity media applications: an MPEG-2 decoder and a variant of an M-JPEG encoder [7,8]. Our objective in this paper is to use this methodology for the design of very complex media systems. The H.264/AVC reference video encoder is an example of a very complex case study typical of the multimedia domain. It has been designed with the goal of enabling significantly improved compression performance relative to all existing video coding standards [9]. The standard uses advanced compression techniques that, in turn, require high computational power [10]. Implementing a H.264/AVC video encoder for an embedded SoC is thus a big challenge, since the encoder requires very high computation power to achieve real-time encoding. In this study, both the modeling and the mapping stages of the Sesame design flow are explored to obtain an optimal H.264/AVC encoder implementation that satisfies the design constraints. This will demonstrate the effectiveness and the flexibility of the methods and tools provided by this methodology for rapid system-level design space exploration of complex embedded systems.

3. The Sesame/Artemis Simulation and Modeling Environment

In this section, we briefly describe the Sesame/Artemis simulation and modeling environment [11,12]. The required software model layers are presented first. The implementation of these layers is based on specific tools.
A brief description of these tools is given, along with their application in the literature to some medium complexity multimedia systems.

3.1. The Sesame Layer’s Software Model

Using the Sesame system-level design software framework, three software specification model layers are required: the application process network layer, the target architecture layer, and the layer for mapping the application onto the architecture, as shown in Figure 2 [6].

3.1.1. Application Modeling Layer

Applications in Sesame are modeled using the Kahn Process Network (KPN) model of computation, in which parallel processes, implemented in a high-level language, communicate with each other via unbounded FIFO channels [13]. Each process executes sequentially. Reading from channels is blocking; writing to channels is non-blocking. The execution of a Kahn Process Network is deterministic: for a given input, the same output is always produced and the same workload is generated, irrespective of the execution schedule. The model fits signal processing applications well, as it can model stream processing with the guarantee that no data is lost. The key characteristic of the KPN model is that it specifies an application in terms of distributed control and distributed memory, which allows the application to be mapped onto a multiprocessor platform in a systematic and efficient way.

3.1.2. Architecture Modeling Layer

An architecture model is constructed from generic building blocks provided by a library containing template performance models for processors, co-processors, memories, buffers, buses, and so on. The architecture is evaluated by simulating the performance consequences of the application model events that are mapped onto the architecture model. This requires each process and channel of the Kahn process network to be associated with, or mapped onto, one component of the architecture model.
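The Kahn process network semantics of Section 3.1.1 can be illustrated with a minimal sketch. It is not taken from Sesame or PNRunner: the three processes, the `None` end-of-stream marker, and all function names are assumptions made for illustration. Python threads stand in for Kahn processes and `queue.Queue` objects (unbounded by default) stand in for the FIFO channels, so that `get` blocks and `put` does not.

```python
import queue
import threading


def producer(out_ch, n):
    # Writing never blocks: the FIFO channel is unbounded.
    for i in range(n):
        out_ch.put(i)
    out_ch.put(None)  # end-of-stream marker (an assumption of this sketch)


def scale(in_ch, out_ch, factor):
    # Reading blocks until a token is available on the input channel.
    while True:
        token = in_ch.get()
        if token is None:
            out_ch.put(None)
            return
        out_ch.put(token * factor)


def consumer(in_ch, result):
    while True:
        token = in_ch.get()
        if token is None:
            return
        result.append(token)


def run_network(n=5, factor=3):
    """Run producer -> scale -> consumer and return the consumed tokens."""
    a, b = queue.Queue(), queue.Queue()  # maxsize=0 means unbounded
    result = []
    threads = [
        threading.Thread(target=producer, args=(a, n)),
        threading.Thread(target=scale, args=(a, b, factor)),
        threading.Thread(target=consumer, args=(b, result)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result
```

Whatever order the thread scheduler picks, `run_network()` returns the same token sequence, which is the determinism property that makes a KPN application model independent of the execution schedule.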
When executed, each Kahn process generates a trace of events, and these event traces are routed towards a specific component of the architecture model through a trace event queue. A Kahn process places its application events into this queue while the corresponding architecture component consumes them (Figure 2).

Figure 2. The three layers within Sesame: the application model layer, the architecture model layer, and the mapping layer.

3.1.3. Mapping Layer

The mapping layer maps the event traces generated by the Kahn processes of an application model onto the resources in the architecture model. In addition, it maps the Kahn communication channels onto communication resources at the architecture level. Each Kahn channel can thus be mapped onto a point-to-point FIFO channel between two processors or onto a software buffer in shared memory. As shown in Figure 2, it is possible to map multiple Kahn processes onto a single architecture component (e.g., in the case of a programmable component). Such mappings require the events from the event traces that are mapped onto the same architecture resource to be scheduled. This scheduling is also performed by the mapping layer [7].

3.2. Sesame Implementation Tools

In the previous section, we have seen that the Sesame software structure is composed of three layers: the application layer, the architecture layer, and the mapping layer, which is an interface between the other two. All three layers in Sesame are composed of components that are instantiated and connected using some form of object creation and initialization mechanism, as shown in Figure 3. This allows code to be reused and guarantees the flexibility to easily manipulate the model based on performance results, as dictated by the Y-chart methodology (Figure 1). The three model layers are implemented by the following tools.

3.2.1. YML Modeling Language

Sesame was developed to guarantee rapid construction of simulation models through the use of libraries of pre-built architecture simulation components. In order to enable quick modification, a flexible description format for the interconnection of these components is required. For this, the YML (Y-chart Modeling Language) is defined to create the structure of Sesame’s simulation models. YML is an XML-based language. Using XML is attractive because it is simple and flexible, reinforces reuse of model descriptions, and comes with good programming language support. The core elements of YML are network, node, port, link, and property [6].

3.2.2. Application PNRunner Simulator

Sesame’s application simulator is called PNRunner, or Process Network Runner. PNRunner implements the semantics of Kahn process networks in C++. It reads a YML application description file and executes the corresponding application model. The PNRunner execution generates a trace of application events (through a trace API) to drive an architecture simulation (Figure 3). Using this API, PNRunner can send application events (communication and computation operations) to the architecture simulator, where their performance consequences are simulated. Hence, a trace-driven co-simulation of the application and the architecture is possible.

3.2.3. Architecture Pearl Simulator

The target architecture model in Sesame is implemented in the Pearl discrete event simulation language [14]. Pearl is a small but powerful object-based language which provides easy construction of abstract architecture models and fast performance simulation. It has a C-like syntax with a few additional primitives for simulation purposes. A Pearl program is a collection of concurrent objects which communicate with each other through synchronous or asynchronous message-passing.
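The trace-driven co-simulation between the application simulator and the architecture simulator can be pictured with the following minimal sketch. It uses neither the PNRunner trace API nor Pearl: the event names, the latency values, and the `Processor` class are hypothetical and only illustrate how an architecture component could consume application events and accumulate simulated time.

```python
from collections import deque

# Hypothetical latency table (in cycles) for one processor model.
LATENCY = {"read": 10, "execute": 50, "write": 10}


def application_trace(n_tokens):
    """Build the event trace of one Kahn process that handles n_tokens tokens."""
    trace = deque()
    for _ in range(n_tokens):
        trace.append(("read", "in"))       # communication event
        trace.append(("execute", "work"))  # computation event
        trace.append(("write", "out"))     # communication event
    return trace


class Processor:
    """Architecture component that consumes application events from a queue."""

    def __init__(self, latency):
        self.latency = latency
        self.time = 0      # simulated time in cycles
        self.counts = {}   # number of events seen per kind

    def consume(self, trace):
        # Each event only advances simulated time; no functional work is redone.
        while trace:
            kind, _arg = trace.popleft()
            self.time += self.latency[kind]
            self.counts[kind] = self.counts.get(kind, 0) + 1
        return self.time
```

For example, `Processor(LATENCY).consume(application_trace(4))` charges four reads, four executions, and four writes to the component. The application model decides what happens, while the architecture model only decides how long it takes.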
After sending an asynchronous message, the sending object continues execution, whereas after sending a synchronous message it waits for a reply message from the receiver.

3.3. Medium Complexity Media Case Studies

The Sesame modeling and simulation methodology has been applied to two medium complexity media applications: an MPEG-2 decoder and a variant of an M-JPEG encoder [7,8]. Both studies were performed at the black-box architecture model level and showed promising results. These media applications have been implemented on various multiprocessor architecture models. For these architectures, different hardware-software partitionings, application-to-architecture mappings, processor speeds, and interconnect structures (bus, Crossbar, and Omega networks) were evaluated [7]. Based on the obtained execution performance results, the Crossbar model performs better, in terms of the measured number of frames per second, than the Omega network and common bus structures (about 5% faster than the Omega network) [7].

Figure 3. Sesame software tools overview.

A large part of the design space has thus been explored to find the optimal system design. For these medium complexity media applications, the performance evaluation is straightforward and is obtained very quickly. For all the configurations used, the simulation times range from 5 to 10 seconds. This is many orders of magnitude faster than using classical RTL-level simulators and is very acceptable for design space exploration. Given this, the next sections aim to further demonstrate the effectiveness and the flexibility of the Sesame methods and tools even for the design of very complex media applications. This is done by using the methodology for the design and performance evaluation of a H.264/AVC reference encoder targeting multiple heterogeneous multiprocessor platforms.

4. The H.264/AVC Video Encoder Case Study

The H.264/AVC standard has been designed with the goal of enabling significantly improved compression performance relative to all existing video coding standards [9]. The standard uses advanced compression techniques that, in turn, require high computational power [10]. Implementing a H.264 video encoder for an embedded SoC therefore requires very high computation power to achieve real-time encoding. This section first presents a complexity analysis of the H.264/AVC reference encoder in comparison to the M-JPEG application case. Then, it reviews previous work that derived an optimized parallel model of the encoder using an appropriate high-level, target-architecture-independent parallelization approach.

4.1. Complexity Analysis of the H264/AVC Reference Video Encoder

The complexity of the H.264/AVC encoder application depends on the algorithm, the encoding option tools, the input sequences, and the architecture on which it is implemented. In a previous work [15], we performed a complete high-level performance and complexity analysis of a H.264/AVC video encoding application. The experiments were performed on a General-Purpose Processor (GPP) platform, a 1.6 GHz Intel Centrino, using the JM 10.2 reference software version [16] with a main profile @ level 4. For an optimal balance between the encoding efficiency and the implementation cost, a proper use of the H.264/AVC tools has been proposed to maintain an acceptable performance while considerably reducing complexity. Using the obtained optimal encoding tools for a very low bit rate, 7-frame QCIF “bridge far” sequence, the computing time of the encoding process on the GPP platform is 15.2 seconds. The associated complexity in frames per second is 2.16 fps. For this test sequence, the peak memory usage is also measured using the “memprof” GNU profiler [17]. The obtained peak memory cost is 5.02 MB.
This result refers to non-optimized source code. Platform-independent memory optimizations, applied through C-level code transformations, may be used to obtain a memory- and algorithm-optimized version of the reference code.

In comparison to the Motion JPEG application presented in Section 3.3, the non-optimized H.264/AVC reference encoder is about two to three orders of magnitude more complex in terms of computing time and memory usage. To illustrate this, the dynamic instruction distribution by operation type has been obtained for both applications using an “objdump” utility and is reported in Figure 4. For H.264/AVC, the dominant instruction types are the “arithmetic” and the “Memory” (Load/Store) operations. These results confirm the very high complexity of this new standard, the substantial memory allocation needed, and the high volume of computation required. The SoC implementation of such a complex application will show how well the Sesame methodology performs for the design of such complex systems.

Figure 4. The dynamic instruction distribution by operation types for both the H.264/AVC and M-JPEG applications.

4.2. High Level Parallel Specification of a H.264/AVC Video Encoder

To speed up the computation of this encoder, a multiprocessor implementation is probably needed. Prior to this implementation, the sequential encoding reference C source code [16] should be transformed into concurrent KPN tasks communicating via dedicated FIFO channels. The goal of this step is to extract the available task-level parallelism from the application by splitting compute nodes as far as possible to get a valid parallel KPN model of the encoder. For an optimal design flow, our aim is to provide a parallel specification of the application that forms a good starting point for mapping onto different systems-on-chip platforms.
To do so, we proposed in a previous work a high-level, target-architecture-independent parallelization approach [18,19] to get an optimized parallel model of the encoder with the best computation and communication workload balance.

The proposed parallelization approach is based on the use of the KPN/YAPI [20] parallel programming models of computation and on the selection of a fine-grain Macro-Block communication granularity level. The key characteristic of the approach is the simultaneous exploration of the two predominant concepts of parallelism: data-level partitioning, and task-level splitting and merging. This means that communication and computation workload analyses are needed to provide global guidance when optimizing concurrency between processes. In general, once the concurrency bottlenecks are identified, task-level and data-level splitting and/or merging are performed to better distribute the computing workload over the processes. For the most computationally expensive tasks, data splitting is proposed for a better concurrency optimization [18].

Following the proposed parallelization approach, the Task Level Parallelism (TLP) is considered first. The goal of this step is to extract the available task-level parallelism by splitting compute nodes as far as possible to get a first valid parallel KPN model of the encoder. For this case, the encoder block diagram [21] has served as a starting point for extracting the task-level parallelism. Then, to get a parallel implementation of the encoder with the best computation and communication workload balance, successive steps of task-level splitting or merging and data-level splitting are used to derive a final optimized model in a structured way. Further details on the different steps used are given in [19]. Finally, the optimal model obtained is given in Figure 5.
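As a rough illustration of data-level splitting at macroblock granularity, the sketch below partitions the macroblocks of one QCIF frame among three workers, matching the three “Mec” processes of the final model. The 16 × 16 macroblock size follows the H.264 standard; the round-robin assignment and the function names are assumptions of this sketch, since the actual partitioning policy is detailed in [18,19] and not here.

```python
MB_SIZE = 16  # H.264 macroblocks cover 16 x 16 luma samples


def macroblocks(width, height):
    """List the (row, column) indices of the macroblocks of one frame."""
    cols, rows = width // MB_SIZE, height // MB_SIZE
    return [(r, c) for r in range(rows) for c in range(cols)]


def split_round_robin(items, n_workers):
    """Assumed policy: deal the items to the workers in round-robin order."""
    return [items[i::n_workers] for i in range(n_workers)]


# QCIF frame (176 x 144 pixels) split over three workers.
qcif_mbs = macroblocks(176, 144)
parts = split_round_robin(qcif_mbs, 3)
```

With this policy, the 99 macroblocks of a QCIF frame are divided into three equal shares of 33, which is one simple way to balance the computation load of the most expensive task.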
This figure shows that the low-complexity DCT, Quantization, Decoder, and Filter modules have been merged into a single “Dct_Dec_Filter” process. For the most computationally expensive motion estimation and compensation task, “Mec”, a data partitioning strategy has been used to distribute the computation over three processes, “Mec1”, “Mec2”, and “Mec3”, with the associated input/output FIFO channels tripled.

Given the “Gprof” [22] computation profiling results of the obtained parallel model reported in Figure 6, the final proposed model has good concurrency properties with an acceptable computation and communication workload balance.

5. System Level Performance Evaluation of the H.264/AVC Video Encoder

This section demonstrates the effectiveness and the flexibility of the Sesame system-level design methodology for efficient and rapid design space exploration of such complex systems. The base target architecture and the mapping strategy are presented first. The Sesame system-level design is then used for performance evaluation and design space exploration of the encoder targeting multiple multiprocessor architectures. Finally, the efficacy of the methodology is evaluated for the design of very complex systems.

5.1. The Base Target Architecture and Mapping

Following the Sesame system-level design methodology presented in Section 3, three software model specifications are required: the application process network model, the target architecture model, and the mapping model of the application onto the architecture. For this, the optimized parallel model of the H.264/AVC encoder of Figure 5 is first ported to the Sesame framework. This has been done by transforming the previously validated YAPI model into a C++ PNRunner network model.
The obtained network model is then simulated with the PNRunner simulator to generate computation and communication event traces of the application execution, called trace-event queues [6].

Figure 5. Proposed optimized parallel KPN model of the H.264 encoder.

Figure 6. Computation workload profiling of the final parallel model.

In parallel with the application model specification, the target architecture is modeled with the Pearl object-based simulation language. The Sesame environment provides a small library of black-box architecture base models: processing cores, a generic bus, a generic memory, and several interfaces for connecting these base model building blocks. Once a target architecture model is validated, a trace-driven co-simulation of the application event trace queues mapped onto the architectural components is carried out. Such a co-simulation requires an explicit mapping of the KPN processes and channels onto particular components of the target architecture. More than one KPN process can be mapped onto the same processor, as the system simulator automatically schedules the events from the different queues.

In our case, the base target architecture is given in Figure 7. It represents a multiprocessor platform communicating with a shared DRAM memory through a common bus. For this platform, we have used general-purpose processors (assumed to be MIPS R3000) and assumed a conservative timing of 100 ns to read/write a 64-bit word from/to DRAM. The instruction latencies for the MIPS R3000 microprocessor components were estimated using technical documentation. Communication between components is performed through buffers in shared memory.

Figure 7. H.264 encoder’s application to architecture mapping.

For sufficient design space exploration, several platform models are used.
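To give a feel for the DRAM timing assumed above, the following back-of-the-envelope sketch estimates the time needed to move one QCIF frame through the shared memory at 100 ns per 64-bit word. The 4:2:0 chroma sampling (12 bits per pixel) and the absence of bursts or caching are assumptions of this sketch, not statements from the experiments.

```python
WORD_BITS = 64        # DRAM word size used in the platform model
DRAM_ACCESS_NS = 100  # assumed time to read or write one 64-bit word


def frame_bytes(width, height):
    """Size of one YUV frame, assuming 4:2:0 sampling (1.5 bytes per pixel)."""
    return width * height * 3 // 2


def dram_transfer_ns(n_bytes):
    """Time to read or write n_bytes, one 64-bit word per DRAM access."""
    words = (n_bytes * 8 + WORD_BITS - 1) // WORD_BITS  # round up to whole words
    return words * DRAM_ACCESS_NS


qcif_bytes = frame_bytes(176, 144)      # 38016 bytes
qcif_ns = dram_transfer_ns(qcif_bytes)  # 4752 words, i.e. 475200 ns
```

Under these assumptions, a single pass of one QCIF frame over the bus costs roughly 0.48 ms, so the number of times the processes read and write frame data directly drives the load on the shared bus.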
The platforms differ in the number of processors used: one platform has two processors, a second has four, and a third has six. Given the optimized parallel model of the H.264/AVC encoder, the Sesame design space exploration consisted in changing the mapping or adding another architectural component, without touching the H.264 encoding parallel specification, since this application already presents good concurrency properties with an acceptable computation and communication workload balance [19]. When such a system modification is performed, we first have to recompile the hardware architecture, and then the entire system, to regenerate the new YML files of the target architecture and mapping layers. Adding a new architectural component consists in accessing the Sesame library of black-box component models through the “YMLEditor” graphical YML editor and then, with a few clicks, adding the component to the architecture model and connecting it to the bus. The “YMLEditor” is also used to quickly map the application tasks and Kahn communication channels onto the different architecture resources.

The mapping of application processes onto this platform has been decided explicitly, based on the computation and communication load distribution results of Figure 6. For the bi-processor platform, the total computation load has been distributed between the two processors: the “Mec1”, “Mec2”, and “Dct_Dec_Filter” processes are mapped to one processor, and all the others to the second. The mapping strategy used with the four-processor platform is shown in Figure 7. In this case, the most complex tasks, “Mec1”, “Mec2”, “Mec3”, and “Intra-Pred”, are first mapped each to a separate core to guarantee concurrent execution between them. Then, the “Dct_Dec_Filter” process is added to run with the “Mec2” process on the same core.
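The four-processor mapping of Figure 7 can be written down as a simple table, as in the sketch below. Only the processes named in this section are listed, including the “Vlc” process, which shares the fourth processor with “Intra-Pred”. The processor labels `P1` to `P4`, the assignment of the “Mec” processes to the first three labels, and the helper names are hypothetical; in Sesame, the mapping itself is described in YML.

```python
# Process -> processor table for the four-processor platform.
# Labels P1..P4 are hypothetical; only processes named in the text appear.
MAPPING_4P = {
    "Mec1": "P1",
    "Mec2": "P2",
    "Dct_Dec_Filter": "P2",
    "Mec3": "P3",
    "Intra-Pred": "P4",
    "Vlc": "P4",
}


def processes_per_processor(mapping):
    """Invert a process -> processor mapping into processor -> [processes]."""
    groups = {}
    for process, processor in mapping.items():
        groups.setdefault(processor, []).append(process)
    return groups


def on_distinct_processors(mapping, processes):
    """True if the given processes are all mapped to different processors."""
    targets = [mapping[p] for p in processes]
    return len(set(targets)) == len(targets)
```

A check such as `on_distinct_processors(MAPPING_4P, ["Mec1", "Mec2", "Mec3", "Intra-Pred"])` makes the intent of the mapping explicit: the four heaviest tasks never compete for the same core.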
The “Vlc” process is also added to the “Intra-Pred” process and is mapped to the fourth processor.

5.2. System Level Performance Evaluation and Design Space Exploration

After mapping the optimized PNRunner H264/AVC network model onto the different platforms with two, four, and six microprocessors, the performance analysis step is performed by system-level simulations. In all the experiments, the input test video sequence consists of YUV frames captured at QCIF resolution (176 × 144 pixels). The simulation results of the H.264/AVC encoding of the QCIF “Bridge-close” sequence are obtained for the different platforms and are presented in Figure 8. This figure shows that the encoding performance, in frames per second, improves linearly as the number of simulated microprocessors increases. In each case, as the application model is considered to be optimal, the execution/communication performance may be further improved by changing the mapping policy and/or the platform architecture.

Figure 8. H.264/AVC encoding performances vs. simulated processors with the common bus structure.

To modify the architecture, a designer can also explore the use of other communication models or enhance the architecture with hardware components using appropriate HW/SW partitioning. For example, for the four-processor platform with the common bus structure, performance numbers for the execution/communication workload are obtained for each architecture component. The obtained results are shown in Figure 9. For each component, a bar shows the breakdown of the time spent on reading/writing, being busy, and being idle. This figure shows that the computation cost is much larger than the time spent reading/writing from/to the shared memory. The communication and computation loads are nearly balanced for all the used components.
Such a result confirms the good concurrency properties of the proposed optimized parallel application model and the appropriateness of the mapping policy used. However, for all the architecture components, the idle times are far too large in comparison with the busy times. This has probably caused a substantial degradation of the final encoding performance. Given the large amount of data communicated between processes during encoding, it is clear that the common memory bus structure constitutes a serious communication bottleneck. Indeed, the very strong data dependency between processors requires considerable memory access and allocation for the read/write operations. For a common-bus-based multiprocessor architecture, this saturates the bus, so a lot of time is spent waiting to read data from, or write data to, other components.

For further design space exploration, and in order to reduce the communication bottleneck observed for the common-bus-based architecture, other inter-processor communication structures and topologies should be tested. In the Sesame framework, Pearl models of a crossbar and of an Omega network are implemented [7]. We therefore selected the crossbar switch structure in our experiments as a replacement for the common bus model. A Pearl simulation model of a 4 × 4 crossbar switch is implemented, as shown in Figure 10. In the architecture of Figure 10, the processors communicate with each other over the crossbar. The memory is distributed per processor and resides in the Virtual Buffers (VBs). Data is written to the virtual buffer associated with the writing processor. Only reads are forwarded over the crossbar, although it is possible to use it for write calls as well. The performance results were obtained for the different platforms and are presented in Figure 11.
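
The contention argument made for the common bus and the crossbar can be illustrated with a toy model. The sketch below is not a Pearl model and is not part of Sesame; the function names, the transfer tuples, and the cycle counts are all hypothetical and chosen only for illustration. It assumes that a shared bus serializes every transfer, and that a crossbar lets transfers on disjoint ports overlap, so that an optimistic estimate of its busy time is the load on its most heavily used port.

```python
from collections import defaultdict

# Each transfer is (reader, owner, cycles): processor `reader` reads
# `cycles` worth of data from the virtual buffer owned by `owner`.
# All names and cycle counts are hypothetical, not measurements.


def bus_time(transfers):
    """Shared bus: only one transfer at a time, so durations add up."""
    return sum(cycles for _, _, cycles in transfers)


def crossbar_time(transfers):
    """Crossbar: transfers that use disjoint ports can overlap.

    Optimistic estimate: the interconnect is busy for as long as its
    most heavily loaded port (on the reader side or the owner side).
    """
    load = defaultdict(int)
    for reader, owner, cycles in transfers:
        load[("reader", reader)] += cycles
        load[("owner", owner)] += cycles
    return max(load.values(), default=0)


# Reads spread over four different buffers.
spread = [(0, 1, 10), (2, 3, 10), (1, 0, 5), (3, 2, 5)]
# Three processors all reading from the buffer of processor 3.
hot = [(0, 3, 4), (1, 3, 4), (2, 3, 4)]

print(bus_time(spread), crossbar_time(spread))  # 30 10
print(bus_time(hot), crossbar_time(hot))        # 12 12
```

Under these assumptions, the crossbar estimate falls well below the bus total when reads are spread over different buffers, whereas the two coincide when every processor reads from the same buffer. A crossbar therefore relieves a saturated bus only to the extent that the mapping spreads the traffic.
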
As shown in Figure 11, the use of the crossbar structure yields a substantial encoding performance gain, in frames per second (fps), in comparison with the common bus architecture. In effect, we achieved 9.6 fps with six processors (MIPS R3000) connected via a 4 × 4 crossbar communication model. In addition, for the four-processor platform, the execution/communication workload was obtained for each component. The results are reported in Figure 12. The statistics of Figure 12 clearly show that the components spend much more time busy doing work and much less time waiting to read and write. This confirms the performance gain obtained.

5.3. Evaluation of the Methodology for Rapid Design Space Exploration

The Sesame framework has been used for the design of a very complex media application that must satisfy design constraints. Given the complexity of the case studied, the previous section outlined the difficulty of evaluating even one design using detailed clock-cycle-accurate simulators. In the system-level design case, the simulation time did not exceed 5 minutes for any of the configurations used. Measurements were made on a General-Purpose Processor (GPP) platform based on a 1.6 GHz Intel Centrino with 512 MB of RAM, running a Linux operating system. In comparison with classical RTL-level simulators, this is many orders of magnitude faster and is acceptable for design space exploration.

Figure 9. Reading-Writing/Execution/Idle statistics for the common-bus-based architecture.

Figure 10. Used Crossbar-structure-based four processors architecture model.

Figure 11. H.264 encoding performances vs. simulated processors with the Crossbar model.

Figure 12. Reading-Writing/Execution/Idle statistics for the crossbar-model-based architecture.
Owing to the simplicity and expressive power of Sesame’s Pearl simulation language, all the platform architectures were modeled rapidly. Indeed, system-level modeling relieves the designer of low-level implementation details. Performance evaluation at high abstraction levels makes it possible to control the speed, the required modeling effort, and the attainable accuracy of the simulations. This allows the large design space to be explored efficiently in the early design stages. Applying more detailed models at a later stage allows focused architectural exploration.

Finally, we find that the Sesame methodology facilitates the performance analysis of embedded systems architectures in a way that directly reflects the Y-chart design approach. Essential to this modeling methodology is that an application model is independent of architectural specifics, assumptions on hardware/software partitioning, and timing characteristics. Thus, the application is studied in isolation by means of a functional (behavioral) software model written in a high-level language. Given the complexity of the case studied, an appropriate parallelization methodology was proposed separately and independently to obtain the optimal application model with the best computation and communication workload balance [19]. This results in a good starting model with preliminary estimates of its performance requirements. As a result, a single application model, designed optimally and separately, was used to exercise different mappings onto a range of architecture models. This clearly demonstrates the strength of decoupling application models from architecture models, and it enables the reuse of both types of models.

6. Conclusions

In this paper, we motivated the use of the Sesame/Artemis system-level design methodology for efficient architectural exploration of heterogeneous embedded media systems of increasing complexity.
The case studied concerns an optimal H.264/AVC encoder SoC design that satisfies its design constraints. The complexity analysis of the H.264/AVC reference encoder confirmed the very high complexity of this new standard, the considerable memory allocation needed, and the high volume of computation required. For the design of this complex media system, the effectiveness and flexibility of the methods and tools provided by the Sesame methodology have been clearly illustrated. Both the modeling and mapping stages of the Sesame design flow were explored. Extensive design space exploration was carried out to arrive at an optimal design. For all the configurations used, the simulation time did not exceed 5 minutes. In addition, owing to the simplicity and expressive power of the architecture specification language, all the proposed platform architectures were modeled rapidly. This allows the large design space to be explored efficiently in the early design stages.

7. References

[1] C. Erbas, “System-Level Modeling and Design Space Exploration for Multiprocessor Embedded System-on-Chip Architectures,” Ph.D Thesis, University of Amsterdam, Amsterdam, 2006.
[2] E. A. Lee, et al., “Overview of the Ptolemy Project,” Technical Memorandum UCB/ERL M01/11, University of California, Berkeley, May 2001.
[3] X. Chen, H. Hsieh and F. Balarin, “Verification Approach of Metropolis Design Framework for Embedded Systems,” International Journal of Parallel Programming, Vol. 34, No. 1, February 2006, pp. 3-27. doi:10.1007/s10766-005-0002-x
[4] L. Cai and D. D. Gajski, “C/C++ Based System Design Flow Using SpecC, VCC and SystemC,” Technical Report CECS_02_30, 14 June 2002.
[5] J. Coffland, “SESAME Users Guide,” Technical Report, University of Amsterdam, 5 April 2006. http://sesamesim.sourceforge.net/docs/SESAME/SESAME_Users_Guide.html
[6] J. E. Coffland and A. D. Pimentel, “A Software Framework for Efficient System Level Performance Evaluation of Embedded Systems,” Proceedings of the 2003 ACM Symposium on Applied Computing, Melbourne, March 2003. doi:10.1145/952532.952663
[7] A. D. Pimentel, S. Polstra, F. Terpstra, A. W. van Halderen, J. E. Coffland and L. O. Hertzberger, “Towards Efficient Design Space Exploration of Heterogeneous Embedded Media Systems,” Technical Report, Department of Computer Science, University of Amsterdam, 2001.
[8] P. van der Wolf, P. Lieverse, M. Goel, D. La Hei and K. Vissers, “An MPEG-2 Decoder Case Study as a Driver for a System Level Design Methodology,” Proceedings of the 7th International Workshop on Hardware/Software Codesign, Rome, 3-5 May 1999, pp. 33-37. doi:10.1145/301177.301196
[9] I. E. G. Richardson, “H.264 and MPEG-4 Video Compression: Video Coding for Next-generation Multimedia,” John Wiley & Sons Ltd, Hoboken, 2003.
[10] M. Alvarez, A. Salami, A. Ramirez and M. Valero, “A Performance Characterization of High Definition Digital Video Decoding Using H264/AVC,” Proceedings of IEEE International Symposium on Workload Characterization, Austin, 6-8 October 2005, pp. 24-33.
[11] A. D. Pimentel, P. Lieverse, P. van der Wolf, L. O. Hertzberger and E. F. Deprettere, “Exploring Embedded-Systems Architectures with Artemis,” IEEE Computer, Vol. 34, No. 11, November 2001, pp. 57-63.
[12] P. Lieverse, P. van der Wolf, E. F. Deprettere and K. A. Vissers, “A Methodology for Architecture Exploration of Heterogeneous Signal Processing Systems,” Journal of VLSI Signal Processing for Signal, Image and Video Technology, Special Issue on SiPS’99, Vol. 29, No. 3, November 2001, pp. 197-207.
[13] G. Kahn, “The Semantics of a Simple Language for Parallel Programming,” In: J. L. Rosenfeld, Ed., Information Processing, Proceedings of the IFIP Congress 74, North-Holland Publishing Co., Stockholm, 5-10 August 1974.
[14] H. Muller, “Pearl: A Language for Architecture Simulation,” 25 February 1993. http://www.pearl.org/
[15] H. Krichene Zrida, A. C. Ammari, A. Jemai and M. Abid, “Performance/Complexity Analysis of a H264 Video Encoder,” International Review on Computers and Software, Vol. 2, No. 4, July 2007, pp. 401-414.
[16] H264 Reference Software Version JM 10.2, November 2005. http://iphome.hhi.de/suehring/tml/
[17] MemProf: Profiling and Leak Detection, July 2008. http://www.gnome.org/projects/memprof/
[18] H. Krichene Zrida, A. C. Ammari, A. Jemai and M. Abid, “A YAPI-KPN Parallel Model of a H264/AVC Video Encoder,” Proceedings of the 4th IEEE International Conference on Ph.D Research in Microelectronics and Electronics, Istanbul, 22-25 June 2008, pp. 109-112.
[19] H. K. Zrida, A. Jemai, A. C. Ammari and M. Abid, “High Level H.264/AVC Video Encoder Parallelization for Multiprocessor Implementation,” Proceedings of the 12th ACM/IEEE Design Automation and Test in Europe Conference and Exhibition, Nice, 20-24 April 2009, pp. 940-945.
[20] E. A. Kock, G. Essink, W. J. M. Smits, P. van der Wolf, J. Y. Brunel, W. M. Kruijtzer, P. Lieverse and K. A. Vissers, “YAPI: Application Modeling for Signal Processing System,” Proceedings of the 37th Design Automation Conference, Los Angeles, 5-9 June 2000, pp. 402-405. doi:10.1109/DAC.2000.855344
[21] R. Schäfer, T. Wiegand and H. Schwarz, “The Emerging H264/AVC Standard,” EBU Technical Review, January 2003.
[22] S. L. Graham, P. B. Kessler and M. K. McKusick, “Gprof: A Call Graph Execution Profiler,” Proceedings of the SIGPLAN’82 Symposium on Compiler Construction, Boston, 23-25 June 1982. http://sourceware.org/binutils/docs/gprof/ (October 2009)