Int. J. Communications, Network and System Sciences, 2010, 3, 453-461
doi:10.4236/ijcns.2010.35060 Published Online May 2010 (http://www.SciRP.org/journal/ijcns/)

ASIP Solution for Implementation of H.264 Multi Resolution Motion Estimation

Fethi Tlili, Akram Ghorbel
CITRA'COM Research Laboratory, Engineering School of Communications (SUP'COM), Tunis, Tunisia
E-mail: fethi.tlili@supcom.rnu.tn
Received March 19, 2010; revised April 20, 2010; accepted May 15, 2010

Abstract

Motion estimation is the most important module in the H.264 video encoding algorithm, since it offers the best compression ratio compared to intra prediction and entropy encoding. However, using the features allowed for inter prediction, such as variable block size matching, multiple reference frames and fractional pel search, requires a large number of computation cycles. For this purpose, we propose in this paper an Application Specific Instruction-set Processor (ASIP) solution for implementing inter prediction. An exhaustive full and fractional pel search combined with variable block size matching is used. The solution, implemented in FPGA, offers both performance and the flexibility for the user to reconfigure the search algorithm.

Keywords: Motion Estimation, Half Pel, Quarter Pel, ASIP

1. Introduction

The fast growth of digital transmission services has created a great interest in digital transmission of image and video signals. These signals require very high bit rates in order to guarantee good video quality. Therefore, compression is used to reduce the amount of data needed to represent such signals. Compression is achieved by exploiting spatial and temporal redundancies in the signals [1].

The H.264 video coding standard currently allows an approximately 2:1 advantage in terms of bandwidth savings over MPEG-2, and it has the potential to allow further bandwidth savings of 3:1 and beyond. In other words, an H.264 coded stream needs roughly half the bit rate to provide the same quality obtained with an MPEG-2 encoder. The standard also includes a video coding layer, which efficiently represents the video content independently of the targeted application, and a network adaptation layer, which formats the video data and provides header information in a manner appropriate to a particular transport layer. Finally, in order to decrease decoder complexity, several application-targeted profiles and levels are defined, which enable its successful use in different video applications and markets [2].

Although H.264 keeps the same coding structure as previous standards, based mainly on prediction, transform and entropy encoding, it introduces key feature modules that considerably increase the coding efficiency and bring more flexibility to most of the coding process. However, H.264 is also a substantially more complex standard than MPEG-2, and both H.264 encoders and decoders are much more demanding in terms of computation and memory than their MPEG-2 counterparts [3]. This, coupled with the substantial amount of research needed to properly implement and optimize all the relevant H.264 features, makes the development of high-quality H.264 encoders a daunting task.

In addition to the complexity added by the H.264 standard, low power consumption, high performance and scalability are the major constraints imposed on designers in the development of video encoders and decoders [4].
In fact, with the diversity of configurations supported by this standard in terms of resolutions and applications, scalable architectures for video encoders are much appreciated by service providers. In this context, pure hardware implementations are not efficient since they lack flexibility, while pure software solutions do not offer sufficient performance, since general purpose processors can no longer satisfy the computational requirements of these tasks [5].

To meet all these constraints, processor characteristics can be customized to match the application profile. Customizing a processor for a specific application keeps the system cost down, which is particularly important for embedded consumer products manufactured in high volume. Application Specific Instruction set Processors (ASIPs) lie between custom hardware architectures, which offer good processing performance, and commercial programmable DSP processors, which offer high programmability. They provide a good level of both programmability and performance, but are targeted to a certain class of applications so as to limit the amount of hardware area and power needed [6].

This paper is organized as follows: Section 2 presents a complexity analysis of the different encoder modules, followed by a description of the motion estimation standardized by H.264. Section 3 presents the proposed algorithm for multi resolution motion estimation. Section 4 presents the proposed ASIP solution. Section 5 presents implementation results. Finally, Section 6 concludes this work.

2. H.264 Video Encoder Study

2.1. Main Innovations of H.264

To achieve the required performance, H.264 introduces some key features that ensure good coding efficiency. The main innovations of this standard are:

- Intra prediction process.
- Tree structured motion estimation, weighted prediction, multiple resolution search.
- Spatial in-loop deblocking filter.
- Integer DCT-like transform.
- Efficient macro-block field/frame coding.
- CABAC, which provides a reduction in bit-rate from 5% to 15% over CAVLC.

2.2. Complexity Analysis of H.264 Video Encoder

In order to analyze the complexity of the H.264 encoding procedure, profiling was performed on the encoder modules mentioned above. For this purpose, implementations were carried out on a single-chip DSP using CIF resolution in the baseline profile, so that inter-chip communication could not distort the profiling results. Figure 1 presents the profiling results of the UBVideo encoder implemented on the DM642 DSP of Texas Instruments [7].

Figure 1. UBVideo encoder profile (motion search 30%, data transfers 32%, intra prediction/motion compensation/encoding 23%, others 15%).

We can see that the most time-consuming video task is the motion search, which uses about 30% of the processing time, while intra prediction, motion compensation and encoding (including transform, quantization and entropy encoding) together use only 23% of the system resources. The motion search figure covers only the best-matching search; all load and store operations are counted in the data transfer task, which uses about 32% of the system resources. The remaining 15% of the resources are used by other tasks such as rate control, video effect detection and bitstream formatting.

Hence, motion estimation is a bottleneck for video encoding algorithms, taking most of the system resources. At the same time, motion estimation is the most important module in the compression procedure due to its efficiency.
In this context, some video encoders use FPGA solutions to implement motion estimators as hardware accelerators, since DSPs cannot handle the processing required by such tasks.

3. Proposed Motion Estimation Implementation

3.1. H.264 Motion Estimation

The luminance component of each macro-block (16 × 16 samples) may be split up in 4 ways: 16 × 16, 16 × 8, 8 × 16 or 8 × 8, as shown in Figure 2. Each of the sub-divided regions corresponds to a macro-block partition. If the 8 × 8 mode is chosen, each of the four 8 × 8 macro-block partitions within the macro-block may be split in a further 4 ways: 8 × 8, 8 × 4, 4 × 8 or 4 × 4, as presented in Figure 3. Partitions and sub-partitions give rise to a large number of possible combinations within each macro-block. This method of partitioning macro-blocks into motion compensated sub-blocks of varying size is known as tree structured motion compensation.

Figure 2. Macro-block partitions (16 × 16, 16 × 8, 8 × 16, 8 × 8).

Figure 3. Macro-block sub-partitions (8 × 8, 8 × 4, 4 × 8, 4 × 4).

In addition to variable block size matching, H.264 defines a multi resolution search process in order to provide better quality, especially for non-translational motion and for aliasing caused by camera noise. Experimental analysis shows that the half- and quarter-sample-accuracy motion search adopted by H.264/AVC provides a coding gain of 2 dB compared with MPEG-2 and H.263, which corresponds to bit-rate savings of up to 30% [8]. The half pel search is performed on pixels interpolated using a 6-tap low pass filter. Furthermore, a quarter pel resolution search is performed using a bilinear filter applied to the half-pel interpolated pixels.

3.2. Proposed Motion Estimation Algorithm

The first step of the proposed ME algorithm is a full-pel resolution search. The current MB is searched in a predefined search area of the reference frame. In order to avoid unnecessary computations and data loads, the search is performed on the 4 × 4 partitions of the MB. For each 4 × 4 block, we search for the best matching position in the reference area; every 4 × 4 block is independently matched over the whole reference area. After that, a merging process is started in order to determine the best partitioning of the current MB, based on the best positions, which are stored relative to the top-left pixel of each 4 × 4 block. The merging process first determines whether the current MB can be coded with partitions larger than 4 × 4: we compare the best positions of adjacent blocks for each 8 × 8 partition, and if all four blocks have the same best position, the sub-partition is 8 × 8; otherwise, it can be 8 × 4, 4 × 8 or 4 × 4. If the 8 × 8 mode is selected, the best position of the top-left pixel is stored. We then determine the MB prediction type, which can be 16 × 16, 16 × 8, 8 × 16 or 8 × 8, using a merging process similar to the previous one: if all 8 × 8 sub-partitions have the same type and the same best position, the MB prediction type is 16 × 16; otherwise it can be 16 × 8, 8 × 16 or 8 × 8. After fixing the MB prediction type, a motion vector is stored for each partition. A minimal sketch of this merging decision is given below.
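The sketch assumes the best full-pel position of every 4 × 4 block is already available as an (x, y) pair; it checks whether four 4 × 4 blocks can be merged into one 8 × 8 partition and whether the whole macro-block can be predicted as 16 × 16. The names (Pos, mv4x4, merge_8x8) are illustrative only, and the tol parameter anticipates the pixel tolerance discussed in the next paragraph.

#include <stdbool.h>
#include <stdlib.h>

/* Illustrative type: best full-pel position of a block, stored relative
 * to the top-left pixel of the search area.                             */
typedef struct { int x, y; } Pos;

/* Positions are considered equal within a tolerance of 'tol' pixels;
 * tol = 0 gives the strict comparison described above.                  */
static bool same_pos(Pos a, Pos b, int tol)
{
    return abs(a.x - b.x) <= tol && abs(a.y - b.y) <= tol;
}

/* mv4x4[r][c] is the best position of the 4 x 4 block at row r, column c
 * of the macro-block (r, c in 0..3).  The 8 x 8 partition whose top-left
 * 4 x 4 block is at (r0, c0) can be merged when its four blocks share
 * the same best position.                                               */
static bool merge_8x8(Pos mv4x4[4][4], int r0, int c0, int tol)
{
    Pos ref = mv4x4[r0][c0];
    return same_pos(ref, mv4x4[r0][c0 + 1], tol) &&
           same_pos(ref, mv4x4[r0 + 1][c0], tol) &&
           same_pos(ref, mv4x4[r0 + 1][c0 + 1], tol);
}

/* Same test one level up: if the four 8 x 8 partitions are mergeable and
 * share one best position, the macro-block is predicted as 16 x 16.     */
static bool merge_16x16(Pos mv4x4[4][4], int tol)
{
    if (!merge_8x8(mv4x4, 0, 0, tol) || !merge_8x8(mv4x4, 0, 2, tol) ||
        !merge_8x8(mv4x4, 2, 0, tol) || !merge_8x8(mv4x4, 2, 2, tol))
        return false;
    return same_pos(mv4x4[0][0], mv4x4[0][2], tol) &&
           same_pos(mv4x4[0][0], mv4x4[2][0], tol) &&
           same_pos(mv4x4[0][0], mv4x4[2][2], tol);
}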
Obviously, the more sub-partitions are used, the more data has to be transferred: at least 40% of the inter prediction data is used to code motion vectors. For this reason, it is better to use larger partitions whenever possible. A prediction cost can therefore be introduced by relaxing the conditions of the merge process with a tolerance of one or two pixels on the best positions: for example, if two 8 × 8 blocks have best positions displaced by one pixel, we can decide to merge them into one 16 × 8 partition.

After searching for the best matching and the best partitioning, we start the fractional pel search. According to the best position, for each MB partition we interpolate the 8 possible half-pixel positions around the selected partition, as shown in Figure 4. The interpolation is equivalent to an up-sampling of the frame pixels using a 6-tap low pass filter. After that, a further search is performed at quarter-pel accuracy using another interpolation process: based on the best position obtained in the half-pel search, we generate the pixels of all 8 possible positions around that location. We note that motion vectors are multiplied by 4 in order to indicate to the decoder whether it has to interpolate pixels for motion compensation or not.

Figure 4. Fractional accuracy pixel search (full-pel position, surrounding half-pel positions H0-H7 and quarter-pel positions Q0-Q7).

4. Proposed ASIP Solution

4.1. Analysis of the Proposed Motion Estimation Algorithm

In our work, we adopt an instruction selection methodology based on the hardware architecture: the hardware architecture is fixed first with the selected functional units (FUs), and the instruction set architecture is then determined according to these FUs. For this purpose, the proposed algorithm is analyzed in order to identify its most complex modules, which are then implemented as independent hardware blocks (dedicated FUs).

The proposed algorithm is composed mainly of three parts: the full-pel search, the half-pel interpolation with its associated search and, finally, the quarter-pel interpolation with its associated search. In the full-pel search, the MB is matched against the whole reference area and 4 × 4 SADs are computed. In this step, the most complex process is the SAD computation, since it includes difference computation, absolute value determination and accumulation. In [9], an analysis of a motion estimation algorithm using the SAD as distortion measure showed that the SAD computation uses more than 97% of the system resources.

In addition, sub-pel motion estimation is also complex. The interpolation process for half pel uses a 6-tap filter: half samples are calculated through a 6-tap Wiener filter in both the horizontal and vertical dimensions. The interpolation proceeds as represented in Figure 5: dashed pixels correspond to full pixels of an 8 × 8 block, and non-dashed pixels are the half pixels to be calculated. For example, to interpolate the half pixel 'b', we use the full pixels E, F, G, H, I and J. The calculation is done as follows: b = Clip1(((E − 5 × F + 20 × G + 20 × H − 5 × I + J) + 16) >> 5). The Clip1 function keeps the result in the interval [0, 255]: if the result is less than 0, 'b' is set to 0, and if it is greater than 255, 'b' is set to 255. The same calculation is performed vertically for half pixels such as 'h'.

Figure 5. Half pel interpolation process.

Hence, half-pel interpolation, like any filtering process, is a very time-consuming task and needs a large number of data loads and stores; a minimal sketch of this computation is given below.
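The sketch applies the filter exactly as written above for one half pixel; the function names are illustrative, and the same routine covers the horizontal case ('b', step of one pixel) and the vertical case ('h', step of one frame stride).

#include <stdint.h>

/* Clip1: saturate the filter output to the 8-bit pixel range [0, 255]. */
static inline uint8_t clip1(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Half pixel from the six full pixels E, F, G, H, I, J with the
 * (1, -5, 20, 20, -5, 1) kernel:
 *   b = Clip1(((E - 5F + 20G + 20H - 5I + J) + 16) >> 5).
 * 'p' points at pixel E; 'step' is 1 for a horizontal half pixel and
 * the frame stride for a vertical one.                                  */
static uint8_t half_pel(const uint8_t *p, int step)
{
    int E = p[0],        F = p[step],     G = p[2 * step];
    int H = p[3 * step], I = p[4 * step], J = p[5 * step];
    return clip1((E - 5 * F + 20 * G + 20 * H - 5 * I + J + 16) >> 5);
}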
Similarly, the quarter-pel interpolation uses a bilinear filter to generate quarter pixels. Despite the simplicity of this filter, the process is still time consuming, since it is applied to a large amount of data. In conclusion, the most complex modules of our proposed algorithm are the motion search, the half-pel interpolation and the quarter-pel interpolation. In our architecture, we use hardware accelerators for these modules to obtain better performance for our ASIP.

4.2. Functional Unit Selection

In our proposal, three hardware accelerators are used: a SAD calculator, a half-pel interpolator and a quarter-pel interpolator. The SAD calculator handles the whole SAD computation process, including data load from internal memory and the SAD calculation itself; the result is stored in a general purpose register. The half-pel interpolator module interpolates half pixels according to the standardized filter: it loads data from internal memory and interpolates the pixels. Due to the complexity of this interpolation, the half pixels are stored in an internal memory to be reused in further processing tasks such as the half-pel search or the quarter-pel interpolation. Finally, the quarter-pel interpolator loads data from internal memory and applies the bilinear filter to generate quarter pixels. In order to avoid storing quarter pixels in memory, a SAD calculator is integrated in this module: reference pixels are loaded and the quarter-pel resolution SAD is computed directly. In the motion compensation process, these quarter pixels are re-computed, since their computation is much less complex than that of half pixels.

In addition to the hardware accelerators for video processing, an Arithmetic and Logic Unit (ALU) is used in the solution in order to accumulate SADs and to generate pixel locations and memory addresses.

4.3. Instruction Set Selection

4.3.1. Video Instructions

• SAD4Pix(DestReg, Curr_Pix_Addr, Ref_Pix_Addr, Pitch): this instruction computes the SAD of 4 pixels based on the current and reference pixel locations and the Pitch value. The choice of a 4-pixel granularity comes from the fact that the smallest allowed partition is 4 × 4; so, instead of defining a SAD instruction for every partition size, this instruction is called once per line of 4 pixels contained in the current partition. Since we adopt a RISC (Reduced Instruction Set Computer) architecture, the current and reference pixel locations as well as the Pitch value are stored in Special Purpose Registers (SPRs); these registers are used only by the video instructions, which need more than 2 input operands. The output of this instruction is stored in a General Purpose Register (GPR), DestReg, and is then accumulated to form the required SAD. This granularity also gives the user the flexibility to choose which block lines are compared: only specific lines (for example odd or even lines) can be computed in order to reduce the processing.

• Interp4HafPix(RefPixAddr, Pitch): interpolates 4 half pixels and stores the result in internal memory. The input operands are the reference pixel address, which refers to the first full pixel from which interpolation starts, and a pitch value used for data loading in the case of vertical interpolation; this value gives the programmer the flexibility to modify the search window size. These operands are loaded from SPRs, while the interpolated output pixels are stored in the half-pel memory, since there is no need to keep them in registers. In our motion estimation algorithm, after calling this instruction to interpolate the half pixels of one MB, the SAD4Pix instruction can be called in order to compute the SAD at half-pel resolution.
For this reason, the pitch value is also used by SAD4Pix, since the loading step in the half-pel memory is equal to 2; hence, we avoid the use of two separate SAD instructions (one for the full-pel SAD and the other for the half-pel SAD).

• Interp4QpixSAD(DestReg, Ref_pix, Curr_pix, Pitch): interpolates 4 quarter pixels and computes the quarter-pel resolution SAD. We have chosen to separate half-pel interpolation from quarter-pel interpolation in order to give the user the flexibility to stop the search at any resolution, according to the complexity of the algorithm. Quarter pixels, however, are not stored, and the corresponding SAD is computed immediately. In fact, quarter pixels are no longer used by the system except for the best match, which is used in motion compensation. So, to avoid the large memory that storing all interpolated pixels would require, we chose not to store them and to recompute the best-matching pixels when required in motion compensation, since their re-computation is easy as opposed to that of half pixels. This instruction returns the SAD of the current position, and the ALU decides on the best one to be used in motion compensation. The input operands of this instruction, the reference and current pixel positions as well as the pitch value, are stored in SPRs. The output is stored in a GPR, DestReg, to be processed by the ALU for further decisions.

4.3.2. Memory Instructions

Memory instructions are used to transfer data between memory and registers or between registers. Four instructions are used for this purpose:

• MOVSG(Src, Dest) moves data from a special purpose register to a general purpose register. The operands of this instruction are the addresses of the registers to be manipulated.

• MOVGS(Src, Dest) performs the inverse operation of MOVSG.

• LOAD(SrcAddr, DestReg) loads data from data memory into a general purpose register. SrcAddr is the source address of the data to be loaded, while DestReg is the destination register ID.

• STORE(SrcReg, DestAddr) stores the content of a general purpose register in memory. The operands are SrcReg, the source register ID, and DestAddr, the destination memory address.

4.3.3. Arithmetic and Logic Instructions

The main purpose of these instructions is to accumulate the SAD values computed for each group of 4 pixels, to compute pixel addresses, to compare MB SADs and to provide data for conditional jumps. ALU instructions process only data from general purpose registers. We defined three arithmetic instructions and a shift instruction:

• ADD, SUB and MUL are used respectively for addition, subtraction and multiplication. These instructions have 3 operands: the first is the ID of the destination register that receives the result, while the 2 remaining operands are the IDs of the registers containing the source data.

• SHIFT(SrcReg1, SrcReg2, SrcReg3) shifts the data contained in SrcReg1 by the number of bits contained in SrcReg2; the shift direction is indicated by SrcReg3.

4.3.4. Control Instruction

The JUMP instruction introduces a change in the control flow of a program by updating the program counter with an immediate value that corresponds to an effective address. The instruction has a 2-bit condition field (cc) that specifies the condition to be verified for the jump to be taken: whether the outcome of the last executed arithmetic instruction is negative, positive or zero. This instruction is important not only for algorithmic purposes but also for code density, since it minimizes the number of instructions required to implement a ME algorithm and therefore reduces the required program memory capacity. A behavioral sketch of how the proposed instructions combine in a motion search loop is given below.
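The sketch models SAD4Pix behaviorally and shows how four calls accumulate the SAD of a 4 × 4 block inside an exhaustive full-pel loop; the SPR set-up, the ADD-based accumulation and the JUMP-controlled loops of the real ASIP program are abstracted into ordinary C code, and all names are illustrative rather than taken from the implementation.

#include <stdint.h>
#include <stdlib.h>
#include <limits.h>

/* Behavioral model of SAD4Pix: absolute differences of 4 pixels.  The
 * 'pitch' operand is modeled as the step between consecutive reference
 * pixels: 1 in the full-pel buffer, 2 in the half-pel memory.           */
static int sad4pix(const uint8_t *cur, const uint8_t *ref, int pitch)
{
    int s = 0;
    for (int i = 0; i < 4; i++)
        s += abs((int)cur[i] - (int)ref[i * pitch]);
    return s;
}

/* SAD of one 4 x 4 block: four SAD4Pix calls, one per pixel line,
 * accumulated as the ALU ADD instruction would do.                      */
static int sad_4x4(const uint8_t *cur, int cur_stride,
                   const uint8_t *ref, int ref_stride, int pitch)
{
    int s = 0;
    for (int line = 0; line < 4; line++)
        s += sad4pix(cur + line * cur_stride, ref + line * ref_stride, pitch);
    return s;
}

/* Exhaustive full-pel search of one 4 x 4 block over a window of w x h
 * candidate positions; on the ASIP the two loops are implemented with
 * the conditional JUMP instruction.                                      */
static int full_search_4x4(const uint8_t *cur, int cur_stride,
                           const uint8_t *win, int win_stride,
                           int w, int h, int *best_x, int *best_y)
{
    int best = INT_MAX;
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++) {
            int s = sad_4x4(cur, cur_stride,
                            win + y * win_stride + x, win_stride, 1);
            if (s < best) { best = s; *best_x = x; *best_y = y; }
        }
    return best;
}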
4.4. Architecture of the Proposed ASIP

4.4.1. Data Word Length

The data word length is a trade-off between performance and complexity. In fact, the data word length corresponds to the instruction word length stored and manipulated by the processor. With a longer instruction word, more instructions and more registers can be used, which accelerates the processing since memory accesses are reduced; however, the instruction decoder and the interconnection between components become more complex, and the processor area therefore grows. In our proposal we have only 12 instructions, which can be coded on 4 bits. In order to simplify the hardware architecture, we have chosen a 16-bit instruction word, so that 12 bits remain to address the register file.

4.4.2. Register File Size

Since the instruction length is 16 bits and 4 bits are used for the opcode, the 12 remaining bits are used to code the registers. Since the arithmetic instructions use 3 GPRs, each GPR is coded on 4 bits, so 16 GPRs can be used in our architecture. The video instructions, on the other hand, use both GPRs and SPRs: only 8 bits remain to code 3 registers in the instruction call, so each of these registers is addressed on 2 bits and 4 SPRs are used. At this stage, the benefit of using both GPRs and SPRs becomes clear: with a single register type, a video instruction would need its 12 bits to code 4 registers, i.e., 3 bits per register, and only 8 registers could be used; in our design, 20 registers are available with the same instruction length. Table 1 presents the different instructions with their corresponding opcodes and operand fields (bit positions 15 down to 0).

Table 1. Instruction set architecture of the proposed ASIP.

Instruction       Opcode (bits 15-12)   Operand fields (bits 11-0)
SAD4Pix           0000                  DestReg, R1, R2, R3, -
Interp4HafPix     0001                  -, R1, R2, -
Interp4QpixSAD    0010                  DestReg, R1, R2, R3, -
MOVSG             0010                  -, Src, DestReg
MOVGS             0011                  SrcReg, Dest
LOAD              0100                  #addr, DestReg
STORE             0101                  SrcReg, #addr
ADD               0110                  DestReg, SrcReg1, SrcReg2
SUB               0111                  DestReg, SrcReg1, SrcReg2
MUL               1000                  DestReg, SrcReg1, SrcReg2
SHIFT             1001                  SrcReg1, SrcReg2, SrcReg3
JUMP              1010                  CC, #addr

The sketch below illustrates how such 16-bit instruction words can be assembled from the opcode and register fields.
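Since the exact bit positions of the operand fields are not spelled out in the text, the sketch assumes one plausible layout consistent with Table 1 and Section 4.4.2 (4-bit opcode, three 4-bit GPR fields for ALU instructions, and a 4-bit destination GPR plus 2-bit SPR fields for video instructions); it is an illustration, not the actual encoder.

#include <stdint.h>

/* 4-bit opcodes taken from Table 1. */
enum { OP_SAD4PIX = 0x0, OP_ADD = 0x6 };

/* Assumed ALU format: [15:12] opcode | [11:8] DestReg | [7:4] SrcReg1
 * | [3:0] SrcReg2 (three 4-bit GPR fields, as in Section 4.4.2).        */
static uint16_t encode_alu(unsigned op, unsigned dest,
                           unsigned src1, unsigned src2)
{
    return (uint16_t)(((op & 0xF) << 12) | ((dest & 0xF) << 8) |
                      ((src1 & 0xF) << 4) | (src2 & 0xF));
}

/* Assumed video format: [15:12] opcode | [11:8] DestReg (GPR) | three
 * 2-bit SPR fields | 2 unused bits ('-' in Table 1).                    */
static uint16_t encode_video(unsigned op, unsigned dest,
                             unsigned spr1, unsigned spr2, unsigned spr3)
{
    return (uint16_t)(((op & 0xF) << 12) | ((dest & 0xF) << 8) |
                      ((spr1 & 0x3) << 6) | ((spr2 & 0x3) << 4) |
                      ((spr3 & 0x3) << 2));
}

/* Example: ADD r1, r2, r3 and a SAD4Pix writing its result to r5.       */
/* uint16_t w1 = encode_alu(OP_ADD, 1, 2, 3);                            */
/* uint16_t w2 = encode_video(OP_SAD4PIX, 5, 0, 1, 2);                   */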
4.4.3. Micro Architecture

Figure 6 presents the micro architecture of the proposed ASIP. The solution is composed of an instruction fetch module that loads instructions from the program memory, an instruction decoder that enables the different functional units, and a register file that stores the processed data. The video functional units are connected to the internal data memory and to the ALU. Data load from external memory to internal memory is handled by a direct memory access controller.

Figure 6. Architecture of the proposed ASIP.

5. Implementation Solution and Results

The proposed ASIP was implemented and synthesized on a Virtex II Pro FPGA.

5.1. Memory Management

In our motion estimation algorithm, the search region is fixed to 31 × 23 pixels. This search region has to be extended by 16 pixels on the right and bottom sides, since the last (bottom-right) candidate position is displaced by a (15, 12) vector from the centre. Furthermore, to interpolate the boundary pixels, an extension of three pixels is needed on each side. Figure 7 describes the search area with these extensions.

Figure 7. Search area organization.

Hence, the total reference area is 53 × 45 pixels, so 2385 pixels have to be loaded from external to internal memory. The internal memory is built from 2 × 18 Kb block RAMs integrated in the Virtex II FPGA; a further 18 Kb block RAM is needed to store the current MB. The internal memory is 8 bits wide for implementation reasons: since we adopt an exhaustive search, the whole reference area is parsed in order to find the best matching MB, and loading more than one pixel per access from the reference area would raise an alignment problem. To avoid such problems, we chose to load one pixel per cycle, even though this procedure takes more time. Data load into the internal memory is ensured by a Direct Memory Access (DMA) controller, which handles the transfer while the CPU is running; when the transfer is finished, an interrupt signal is raised.

The synthesis results of the DMA controller, shown in Table 2, indicate that this module uses roughly 10% of the available FPGA resources and can run at a 205 MHz clock frequency.

Table 2. Synthesis results of DMA controller.

Device utilization summary
  Number of Slices: 190 out of 1408 (13%)
  Number of Slice Flip Flops: 178 out of 2816 (6%)
  Number of 4-input LUTs: 300 out of 2816 (10%)
  Number of GCLKs: 1 out of 16 (6%)
Timing summary
  Minimum period / Maximum frequency: 4.877 ns / 205.025 MHz
  Minimum input arrival time before clock: 5.294 ns
  Maximum output required time after clock: 4.968 ns
  Maximum combinational path delay: no path found

5.2. SAD Engine

This engine computes the SAD of 4 pixels: it loads the reference and current pixels from the internal memory and performs the SAD of 4 pixels in one call. The SAD module can be used for the SAD computation of both the full-pel and the half-pel search. As described in Figure 8, the SAD engine provides its output 9 cycles after the start signal; the output is then returned to the register file.

Figure 8. Timing diagram of SAD engine.

We note that the TMS320C64 DSP provides the SAD of four 4 × 4 blocks (split_sad8 × 8) in 200 cycles in the best case, when all data paths are fully used [10], while our system can provide the same result after 144 cycles (four 4 × 4 blocks × four line SADs × 9 cycles) without using pipelining.

5.3. Half Pel Interpolator

In our implementation, the proposed algorithm is derived by minimizing the number of memory accesses. The formulas used to compute the half-pixel interpolations exploit the symmetry of the 6-tap FIR filter coefficients, resulting in a significant reduction of the number of multiplications [11], as the sketch below illustrates.
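The sketch illustrates only this symmetry argument, not the exact SIMD datapath of Figure 9: grouping the symmetric taps of the (1, −5, 20, 20, −5, 1) kernel lets each half pixel be computed with shifts and additions only; the function names are illustrative.

#include <stdint.h>

/* Same clipping as in Section 4.1. */
static inline uint8_t clip1(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Factored form of the 6-tap filter: grouping the symmetric taps gives
 *   b = Clip1(((E + J) - 5*(F + I) + 20*(G + H) + 16) >> 5),
 * and 5*x = (x << 2) + x, 20*x = (x << 4) + (x << 2), so no hardware
 * multiplier is strictly required.                                      */
static uint8_t half_pel_sym(int E, int F, int G, int H, int I, int J)
{
    int outer = E + J;                                /* taps  1,  1 */
    int mid   = F + I;                                /* taps -5, -5 */
    int inner = G + H;                                /* taps 20, 20 */
    int five_mid     = (mid << 2) + mid;              /*  5 * (F + I) */
    int twenty_inner = (inner << 4) + (inner << 2);   /* 20 * (G + H) */
    return clip1((outer - five_mid + twenty_inner + 16) >> 5);
}

/* Four adjacent half pixels, one interpolator call's worth, computed
 * from the nine consecutive full pixels they depend on.                 */
static void half_pel_line4(const uint8_t *p, uint8_t out[4])
{
    for (int i = 0; i < 4; i++)
        out[i] = half_pel_sym(p[i], p[i + 1], p[i + 2],
                              p[i + 3], p[i + 4], p[i + 5]);
}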
This engine provides 4 interpolated pixels per call. The input pixels are stored in six 32-bit registers, as described in Figure 9; pixels P3 to P6 form one line of the selected 4 × 4 block to be interpolated, and the output pixels are H0 to H3. A Single Instruction Multiple Data (SIMD) scheme is adopted in our implementation: adders and multipliers are applied simultaneously to the pixels of these registers in order to obtain all interpolated pixels at the same time. All control signals are provided by an FSM. We note that the interpolation takes 15 cycles, including the load from internal memory. Synthesis results are shown in Table 3.

Figure 9. Input registers for half pel interpolation.

Table 3. Synthesis results of half pel interpolator.

Device utilization summary
  Number of Slices: 354 out of 1408 (25%)
  Number of Slice Flip Flops: 460 out of 2816 (16%)
  Number of 4-input LUTs: 343 out of 2816 (12%)
  Number of MULT18X18s: 4 out of 12 (33%)
  Number of GCLKs: 1 out of 16 (6%)
Timing summary
  Minimum period / Maximum frequency: 5.504 ns / 181.689 MHz
  Minimum input arrival time before clock: 4.679 ns
  Maximum output required time after clock: 3.638 ns
  Maximum combinational path delay: 3.802 ns

5.4. Quarter Pel Interpolator

When the Interp4QpixSAD(Ref_pix, Curr_pix, Pitch) instruction is received, the quarter-pel interpolation and the SAD computation are started. First, the pixels loaded from the half-pel memory are fed into the interpolator module; then, the resulting quarter pixels are transmitted to the SAD module to be compared with the current pixels. The QP interpolator interpolates and generates the SAD of 4 pixels in each call. The quarter-pel SADs are returned after 14 cycles, as shown in the timing diagram of Figure 10.

Figure 10. Timing diagram of quarter pel interpolator.

6. Conclusions

This paper has presented efficient instructions for implementing the motion estimation process using most of the key features standardized in H.264. First, we analyzed the complexity of a typical H.264 encoder and concluded that ME is a bottleneck of the implementation. Then, we presented and analyzed an algorithm for ME. Based on this analysis, we proposed efficient accelerators for the modules that consume most of the processing time. Based on the suggested hardware architecture, we fixed an instruction set architecture that gives users large coding flexibility, ensuring scalability and multi-standard support. The proposed ASIP was implemented on a Virtex II Pro FPGA, using about 61% of the FPGA slices and 43% of the total LUTs. The implemented modules can run at a 172 MHz clock.

7. References

[1] Q. Y. Shi and H. F. Sun, “Image and Video Compression for Multimedia Engineering: Fundamentals, Algorithms, and Standards,” 2nd Edition, CRC Press, Boca Raton, 2008.

[2] Draft 3rd Edition of ISO/IEC 14496-10 (E), Redmond, WA, USA, July 2004.

[3] F. Kossentini and A. Jerbi, “Exploring the Full Potential of H.264,” NAB, 2007.

[4] S. D. Kim, J. H. Lee, C. J. Hyun and M. H. Sunwoo, “ASIP Approach for Implementation of H.264/AVC,” Journal of Signal Processing Systems, Vol. 50, No. 1, 2008, pp. 53-67.

[5] P. Harm, et al., “Application Specific Instruction-Set Processor Template for Motion Estimation in Video Applications,” IEEE Transactions on Circuits and Systems for Video Technology, Vol. 15, No. 4, April 2005, pp. 508-527.

[6] M. Kumar, M. Balakrishnan and A. Kumar, “ASIP Design Methodologies: Survey and Issues,” 14th International Conference on VLSI Design, Bangalore, 2001.

[7] I. Werda and F. Kossentini, “Analysis and Optimization of UB Video’s H.264 Baseline Encoder on Texas Instruments’ TMS320DM642 DSP,” IEEE International Conference on Image Processing, Atlanta, October 2006.

[8] S. Yang, et al., “A VLSI Architecture for Motion Compensation Interpolation in H.264/AVC,” 6th International Conference on ASIC, Shanghai, 2005.
[9] W. Geurts, et al., “Design of Application-Specific Instruction-Set Processors for Multi-Media, Using a Retargetable Compilation Flow,” Proceedings of the Global Signal Processing (GSPx) Conference, Target Compiler Technologies, Santa Clara, 2005.

[10] M. A. Benayed, A. Samet and N. Masmoudi, “SAD Implementation and Optimization for H.264/AVC Encoder on TMS320C64 DSP,” 4th International Conference on Sciences of Electronics, Technologies of Information and Telecommunications (SETIT 2007), Tunisia, 25-29 March 2007.

[11] C.-B. Sohn and H.-J. Cho, “An Efficient SIMD-based Quarter-Pixel Interpolation Method for H.264/AVC,” International Journal of Computer Science and Security, Vol. 6, No. 11, November 2006, pp. 85-89.