Communications and Network, 2010, 2, 207-215
doi:10.4236/cn.2010.24030 Published Online November 2010 (http://www.SciRP.org/journal/cn)

SimNP: A Flexible Platform for the Simulation of Network Processing Systems

David Bermingham, Zhen Liu, Xiaojun Wang
School of Electronic Engineering, Dublin City University, Collins Avenue, Glasnevin, Dublin, Ireland
E-mail: {david.bermingham, liuzhen, wangx}@eeng.dcu.ie

Received September 16, 2010; revised September 30, 2010; accepted October 20, 2010

Abstract

Network processing plays an important role in the development of the Internet as more and more complicated applications are deployed throughout the network. With the advent of new platforms such as network processors (NPs), which incorporate novel architectures to speed up packet processing, there is an increasing need for an efficient method to facilitate the study of their performance. In this paper, we present a tool called SimNP, which provides a flexible platform for the simulation of a network processing system in order to provide information for workload characterization, architecture development, and application implementation. The simulator models several architectural features that are commonly employed by NPs, including multiple processing engines (PEs), an integrated network interface and memory controller, and hardware accelerators. The ARM instruction set is emulated and a simple memory model is provided, so that applications implemented in a high-level programming language such as C can easily be compiled into an executable binary using a common compiler like gcc. Moreover, new features or new modules can easily be added to the simulator. Experiments have shown that our simulator provides abundant information for the study of network processing systems.

Keywords: Network Processors; SimNP

1. Introduction

Driven by the ever-increasing link speed of the Internet and the complexity of network applications, networking device providers have never ceased their efforts to develop a packet processing platform for next-generation network infrastructures. One of the most promising solutions is the network processor (NP), which combines the programming flexibility of a microprocessor with the high performance of custom hardware [1]. Ever since the advent of this concept, the design goals have been set as: 1) enabling rapid development and deployment of new network applications; and 2) providing sufficient performance scalability to prolong the life cycle of the products.

During nearly ten years of development, the architecture of the NP has continually evolved to meet the stringent requirements of these design goals. For example, driven by the demand to make programming easier, the specialized instruction sets employed by early generations of NP products [2] have been replaced by standard ones such as MIPS [3] that possess a wealth of existing software, including development tools, libraries, and application code. Besides, most NPs fall within the System-on-Chip (SoC) paradigm. Compared with a general purpose processor (GPP), new generations of NPs often possess some or all of the following architectural features:

1) Multi-core. Due to the large amount of packet- and flow-level parallelism that naturally exists in network applications, multi-core schemes are commonly used in various ways [4]. Although the number of processing engines (PEs) in most NPs has been on the order of ten, Cisco's Silicon Packet Processor (SPP) has pushed the boundary to 188 32-bit Reduced Instruction Set Computer (RISC) cores per chip [5].
2) Integrated memory controller. For general purpose processing that is not sensitive to access latency, the memory subsystem is optimized for bandwidth rather than latency. Due to the semi-real-time nature of packet processing, NPs must perform fast memory operations to keep up with the packet arrival rate on the network interface [6]. Therefore, most NPs have integrated memory controllers in order to achieve lower latency.

3) Integrated network interface. To reduce the latency of packet loading, network interfaces are integrated instead of being external I/O devices linked to the processor through a common bus and sometimes a bus bridge. A typical implementation is to include on-chip MACs so that the processor can connect directly to external PHY devices. SPI-4.2 (System Packet Interface Level 4 Phase 2) for 10 Gigabit optical networking or Ethernet, and GMII (Gigabit Media Independent Interface) for Gigabit Ethernet, are commonly supported interfaces [7]. In some cases, CSIX (Common Switch Interface), which is used for switch fabrics, is also supported to ease the deployment of the NP on a line card.

4) Hardware accelerator. Offloading special applications that are relatively stable and suitable for hardware implementation has been adopted as an important method to achieve high performance [8]. Hardware accelerators usually function as coprocessors and have the potential to execute concurrently with other parts of the program. They can be implemented either private to or shared by the PEs, or as external devices interacting with the NP. Commonly accelerated applications include table lookup, checksum verification and generation, encryption/decryption, hashing, and even regular expression matching. A hardware accelerator can be accessed through memory mapping, or through specialized instructions, which are extensions to the standard instruction set with corresponding modifications to compilers and libraries.

Just as for GPPs, the development of such sophisticated systems needs an efficient methodology that can facilitate the study of NP architecture and the deployment of network applications on this platform. At the beginning of the study of NPs, academic research either focused on a specific type of NP product or relied heavily on GPP simulators such as the SimpleScalar toolset [9]. While the conclusions obtained from the former method can hardly be extended to other types of NP [10], convincing results are hard to obtain from the latter either [11,12]. GPP simulators often devote a lot of simulation effort to features that do not play an important role in network processing systems. For example, instruction-level parallelism (ILP) is aggressively exploited to increase the utilization of processing power in a GPP. Therefore, techniques that are hardly employed by NPs, such as multiple-instruction issue, out-of-order instruction scheduling, branch prediction, and speculative execution, are widely simulated in GPP simulators [13]. However, the effects of an NP's unique architecture, which is optimized for packet processing, cannot be effectively reflected in GPP simulators.
This motivates the development by the authors of a simulator called SimNP, which provides a flexible platform for the fast simulation and evaluation of network processing systems. A simplified prototype version of this simulator was briefly introduced in [14]. As more modules were added and experiments were performed, part of our work was summarised in a short paper [15]. Since the architecture of network processing systems keeps evolving at a fast pace, the design of our simulator is guided by the following criteria: the simulator should provide enough detail to represent the features of today's network processing systems and, at the same time, enough flexibility that components can easily be modified, extended or deleted.

Our simulation platform incorporates all four of the architectural features mentioned above. The simulator only models the most basic characteristics of the hardware units, such as queuing and resource contention, without involving the details of a specific design. It adopts the software architecture of the SimpleScalar toolset in order to provide a clean and expressive interface and to guarantee that individual components can easily be replaced by other modules. Just like SimpleScalar, our simulator does not try to memorize each internal state of the execution, and it uses an event queue to reduce the need to examine the status of hardware modules during each cycle. A large number of parameters can be tuned by modifying the configuration file of the simulator, including the number of PEs, the operating frequency of all devices, the bus bandwidth, and the latency of hardware modules.
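For illustration, a configuration for a small system might resemble the following sketch, where the key names, values and units are hypothetical rather than the actual file syntax:

    # Hypothetical SimNP configuration sketch; key names and units are
    # illustrative only, not the real file format.
    num_pes        = 4       # number of processing engines (up to 32)
    pe_freq_mhz    = 600     # operating frequency of the PEs
    bus_bandwidth  = 32      # system bus bandwidth (bits per cycle, assumed)
    sram_latency   = 1       # global SRAM access latency in cycles
    dram_latency   = 10      # DRAM access latency in cycles
    accel_latency  = 20      # hardware accelerator latency in cycles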
This simulator provides a unified tool for a wide range of studies. Interactions among the different modules provide insight into the execution of packet processing workloads, from which architects can generate ideas to improve performance. Newly designed modules can be tested and evaluated before actually being implemented. The simulator also offers a platform for programmers to collect information such as instruction characteristics, memory utilization, and inter-processor communication, which helps them perform tasks like tuning the software implementation of algorithms and evenly allocating tasks to different PEs.

The rest of the paper is organized as follows. Section 2 gives a brief description of related works. Section 3 introduces several aspects of our simulator, including its software organization, the simulated hardware architecture, and the programming environment. Section 4 reports some experiences with our simulator. In Section 5, we list some future directions for extending our work. Section 6 gives a conclusion.

2. Related Works

A set of tools for the simulation of network interfaces is introduced in [16]. PacketBench, a tool that provides a framework for implementing network applications, is presented in [17]. This tool only adds several APIs for loading packets into SimpleScalar and omits most of the impact of NP architectural features. Simulators targeted at real-life network processors have also been developed. Yan Luo et al. have developed an NP simulator called NePSim that complies with most of the functionality described in the Intel IXP1200 specification [18]. Compared with the IXP1200's own cycle-accurate architectural simulator, NePSim estimates packet throughput with an average error of only 1% and processing time with an error of 6% on the tested benchmark applications. However, application development on the IXP1200 requires a thorough understanding of the underlying hardware details, and this difficulty in programming has greatly constrained the usage of NePSim in architectural studies. In [19], Deepak Suryanarayanan et al. present a component network simulator called ComNetSim which models a Cisco Toaster network processor. However, it only provides functional simulation and is implemented according to the execution of the applications being modeled.

3. The SimNP Platform

SimNP derives its software architecture from the widely used SimpleScalar/ARM toolset, which allows us to perform accurate simulation of modules such as device command queues and bus arbitration using the execution-driven method. Our simulation platform organizes the simulated hardware components to permit their reuse over a wide range of modeling tasks. It consists of several interchangeable modules that model a range of architectural features. It can be used to model the entire life cycle of packet processing, from packet reception to transmission onto the external link. Although it models the architecture of a typical network processor, its usage can easily be extended to other network processing systems. Most of the packet processing is based on software implementation, in addition to the support for simulating hardware accelerators.

3.1. Software Architecture

The simulator follows the traditional design of a front-end functional simulator that interprets instructions and handles I/O operations, and a back-end performance model that calculates the expected behavior according to the executed instructions. To maintain compatibility, the execution of system calls still depends on calling the host operating system. For the simulated packet interface, we provide a specific programming paradigm to avoid using system calls.

The simulator accepts instruction binaries and packet traces as input. For applications that need table accesses, a memory image file of these tables should be loaded before the program executes. Both real-life packet traces and synthesized traffic can be used within simulations. In the case of packet headers collected from a website such as the National Laboratory for Applied Network Research (NLANR), any encoding format (TSH, TCPDUMP, ERF, FR+) can be used [20]. A dedicated trace loader is able to turn each of them into the SimNP native packet format, with anonymized IP addresses being replaced by random IP addresses generated from a real-life route table or rule-set, and payloads being padded with random content to the length indicated in the packet header.
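As a rough illustration of the loader's output, the native packet record and the payload-padding step might look like the following C sketch, where the record layout and field names are assumptions:

    /* Hypothetical native packet record; layout and names are illustrative. */
    #include <stdint.h>
    #include <stdlib.h>

    struct simnp_pkt {
        uint32_t len;          /* packet length taken from the trace header */
        uint8_t  data[2048];   /* captured header bytes, then padded payload */
    };

    /* Pad the payload with random content up to the length indicated in the
       packet header; hdr_len is the number of bytes actually captured. */
    static void pad_payload(struct simnp_pkt *p, uint32_t hdr_len)
    {
        for (uint32_t i = hdr_len; i < p->len && i < sizeof(p->data); i++)
            p->data[i] = (uint8_t)(rand() & 0xFF);
    }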
3.2. Simulated Hardware Architecture

As mentioned before, the design choices for the simulator concentrate on selecting only the necessary features of packet processing, so that the simulated architecture can represent a wide range of network processing systems without getting involved in too many specific details. As shown in Figure 1, all four of the architectural features described in Section 1 are covered by our simulator. As NP architectures and software advance, new features are expected to be easily added.

Figure 1. Hardware architecture.

3.2.1. Processing Engines

SimNP supports up to 32 PEs, with each PE functionally emulating the ARM instruction set. The ARM instruction set is chosen for two reasons. Firstly, the Instruction Set Architecture (ISA) provided by the ARM processor closely resembles the small RISC-type PEs used within an NP; secondly, the maturity of the ARM architecture provides a number of efficient compiler solutions, with free compiler suites such as gcc [21] allowing fast generation of ARM code from languages such as C and C++. The processing cores support the ARM7 integer instruction set and the FPA floating-point extensions, without the 16-bit Thumb extension.

Each PE also has a Control Store, Local SRAM and Local Hardware Accelerators, as will be explained later. Communication between PEs and other devices is performed through a System Bus. Since a shared bus system can result in long access latencies to both I/O and memory, PEs are simulated with an automatic suspension mechanism once a command has been issued. When the command has completed, the PE resumes from its previous state. At the current stage we do not implement a synchronization mechanism.

3.2.2. Memory Subsystem

Global SRAM, DRAM and TCAM are shared by all PEs, as shown in Figure 1. The operating parameters of these memory devices, such as access latency and the number of banks in DRAM, are configurable.

Both the Control Store and the Local SRAM provide single-cycle access. As shown in Figure 2, they can be used to store the program execution environment for each PE, including instructions, initialized and uninitialized data, stack, heap, and arguments. Packets are normally loaded from the network interface into DRAM, with the associated queuing information stored in global SRAM. Other shared data structures, such as routing tables and packet classification rule sets, can be stored in either global SRAM or DRAM. The TCAM is expected to accelerate Longest Prefix Matching (LPM) applications such as route table lookup. In addition to the normal content, both the global mask (GM) and the local masks need to be loaded from a memory image before processing begins.

Figure 2. An example of memory usage and PE address space mapping.

Figure 2 also shows a possible PE memory map. The address space [0000 0000h, 0FFF FFFFh] is private to each PE, while the address space from 1000 0000h is shared among all PEs. Unlike Intel IXP instructions, ARM uses the same instructions to access different types of memory device. The device-to-address mapping is implemented by modifying the parameters for the compiler and specifying starting addresses in the memory image. Since the ARM instructions themselves are unchanged, it is easier and more flexible for the programmer to create new applications.
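A minimal C sketch of this address split is shown below; only the two address ranges come from the description above, while the macro names are assumptions:

    /* PE address map sketch: [0000 0000h, 0FFF FFFFh] is private to each PE,
       everything from 1000 0000h upward is shared. Names are hypothetical. */
    #define PE_PRIVATE_BASE  0x00000000UL
    #define SHARED_BASE      0x10000000UL

    /* Ordinary ARM loads/stores work for both regions; only the address
       decides which device is targeted. */
    static inline int addr_is_private(unsigned long addr)
    {
        return addr < SHARED_BASE;
    }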
3.2.3. Network Interface

The SimNP network interface is designed to model the behavior of an SPI-like interface. To simplify the programming of network applications, packets are maintained in a linked list, instead of the fixed-length blocks used in some real-life NPs, when stored in the memory pool of the network interface. Once the Rx Buffer has been filled, newly arriving packets are dropped. Similarly, a full Tx Buffer blocks the processing of some PEs until enough space is released. PEs request packets from the Rx Buffer or write packets to the Tx Buffer by issuing commands to the network interface. The actual transfers between the network interface and memory are handled by a DMA controller which uses either the main system bus or a dedicated bus (not shown in Figure 1). The dedicated bus is provided to reduce contention on the system bus, since the traffic volume generated by the parallel architecture can potentially be very high.

The interaction between the network interface and the PEs is modeled in a relatively simple way, so that the overhead in a real network processing system is reflected without the programmer being overwhelmed by unnecessary communication details. Figure 3 shows the programming framework used for SimNP, where a run-to-completion polling model is used for packet transfer. The registers in the network interface are mapped to memory, and their addresses are defined in <simnp_defs.h> by the macros starting with NET_INTF_. The macro PE_PKT_BASE_ADDR indicates the starting address in memory that a packet should be written to or read from. It can be either a value predefined for each PE in <simnp_defs.h> or a value passed from shell scripts at compile time. A packet request is issued by the primitive WRITE_WORD, which sends the necessary information to the register specified by its parameter. Similarly, the primitive READ_WORD returns the content of the specified register.

3.2.4. Hardware Accelerator

In SimNP, both local and shared hardware accelerators are simulated. As shown in Figure 2, their memories and registers are accessed through mapped addresses, so that new components can easily be added, removed and accessed without major changes. Local hardware accelerators are suitable for simple calculations such as the checksum in the IP header.

As for shared hardware accelerators, two clusters are provided, with each of them supporting up to four separate hardware accelerators, as shown in Figure 1. Cluster A targets data-intensive payload applications, such as packet encryption/decryption and Deep Packet Inspection (DPI) [22,23]. To speed up the calculation, additional SRAM is provided to temporarily hold the packet data block transferred from the packet buffer. As with the network interface, given a source address, destination address and buffer length, a DMA controller will automatically fetch and store data without any PE intervention. The programming framework for Cluster A is similar to that of Figure 3. Cluster B is aimed at accelerators for header-based applications such as packet classification [24] and IP lookup [25], where the SRAM is used to hold the rule-set or route table and only a small number of packet data transfers is needed. Figure 4 shows the programming framework for calling the packet classification hardware accelerator.

3.2.5. System Bus

All devices connected to the system bus use one or two Command FIFO(s) (labeled as CMD in Figure 1, with the PEs' CMDs omitted) to buffer data requests. The commands are arbitrated by the system bus in a weighted round-robin manner. The bandwidth of the system bus is configured by the user. Some devices, such as SRAM and DRAM, also have a data buffer to hold the content demanded by PEs or DMA controllers. During each cycle, at least one command can be processed by the system bus unless all of the buffers are empty.
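A plausible C sketch of the weighted round-robin selection over the command FIFOs is shown below; the data structures, weights and device count are illustrative rather than the exact implementation:

    #define NUM_DEVICES 8    /* assumed device count */

    struct cmd_fifo {
        int weight;   /* configured arbitration weight */
        int credit;   /* commands still allowed in the current round */
        int pending;  /* commands waiting in this FIFO */
    };

    /* Serve the next device with pending commands and remaining credit;
       when a round is exhausted, refill all credits from the weights.
       Returns the chosen device index, or -1 if every FIFO is empty. */
    static int wrr_pick(struct cmd_fifo f[], int *cursor)
    {
        for (int pass = 0; pass < 2; pass++) {
            for (int i = 0; i < NUM_DEVICES; i++) {
                int d = (*cursor + i) % NUM_DEVICES;
                if (f[d].pending > 0 && f[d].credit > 0) {
                    f[d].credit--;
                    f[d].pending--;
                    *cursor = (d + 1) % NUM_DEVICES;
                    return d;
                }
            }
            for (int d = 0; d < NUM_DEVICES; d++)
                f[d].credit = f[d].weight;   /* start a new round */
        }
        return -1;
    }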
    #include "simnp_defs.h"

    int application(void *pkt_addr, int pkt_len);

    void main()
    {
        unsigned long pkt_len;
        int action;
        while (1) {
            /* Request Packet from Interface */
            WRITE_WORD(NET_INTF_REQUEST, PE_PKT_BASE_ADDR);
            pkt_len = READ_WORD(NET_INTF_STATUS);
            /* Process Packet */
            action = application((void *)PE_PKT_BASE_ADDR, pkt_len);
            if (action == FORWARD) {
                /* Queue Packet At Egress */
                WRITE_WORD(NET_INTF_TRANSMIT, PE_PKT_BASE_ADDR);
            }
        }
    }

Figure 3. Sample programming framework for workloads.

    #include "simnp_defs.h"

    void main()
    {
        struct ip *iphdr;
        struct tcp *tcphdr;
        int port, classify_result;

        iphdr = (struct ip *)PKT_ADDR;
        tcphdr = (struct tcp *)((unsigned int *)PKT_ADDR + (IP_SIZE >> 2));
        /* Create Request */
        WRITE_WORD(CLASSIFY_UNIT, iphdr->src_addr);
        WRITE_WORD(CLASSIFY_UNIT + 4, iphdr->dst_addr);
        port = (tcphdr->sport) << 16 | (tcphdr->dport);
        WRITE_WORD(CLASSIFY_UNIT + 8, port);
        WRITE_WORD(CLASSIFY_UNIT + 12, iphdr->prot);
        /* Get the Result from Hardware Accelerator */
        classify_result = READ_WORD(CLASSIFY_STATUS);
    }

Figure 4. An example calling procedure for the packet classification hardware accelerator.

4. Experiments

4.1. Experimental Setup

We choose three algorithms commonly used within network processors for our experiments on SimNP. The first is a header processing application, Level Compressed Trie (LC-Trie) based IP forwarding [25]. The other two are payload processing applications: packet fragmentation [26], and a packet encryption/decryption algorithm, the Advanced Encryption Standard (AES). AES is implemented in Cipher Block Chaining (CBC) mode with 128-bit encryption, which is typical of today's routers [22]. Under this configuration, AES requires 10 rounds per 16-byte data block.

The three programs are compiled with gcc-3.4.3 and the object code is copied to the Control Store of each PE. An OC-3 packet header trace from NLANR is used, which contains a large percentage of small packets. A 127,000-entry AT&T East route table is used for the LC-Trie application. The simulation is performed on a Linux computer with a 2.0 GHz Intel® Core Duo processor and 2 GB of memory.

4.2. Performance of Multiple PEs

Figure 5 presents the performance of packet fragmentation and LC-Trie as we increase the number of PEs from 1 to 32, without changing the device latencies or the system bus bandwidth. The solid lines represent the number of non-stall cycles needed to finish processing 10,000 packets, while the dashed lines indicate the number of stall cycles. Here, a "stall" state occurs when all of the PEs in the system are in a suspended state, i.e., no instructions are executed in that cycle.
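The stall statistic can be pictured as a per-cycle check in the simulator's main loop, as in the following illustrative C sketch (the internal variable names are assumptions):

    #include <stdbool.h>
    #include <stdint.h>

    #define MAX_PES 32

    extern bool pe_suspended[MAX_PES];  /* true while a PE waits on a command */
    extern int  num_pes;

    static uint64_t total_cycles, stall_cycles;

    /* Called once per simulated cycle: a stall cycle is one in which every
       PE is suspended, i.e. no instruction is executed anywhere. */
    static void account_cycle(void)
    {
        bool all_suspended = true;
        for (int i = 0; i < num_pes; i++) {
            if (!pe_suspended[i]) {
                all_suspended = false;
                break;
            }
        }
        total_cycles++;
        if (all_suspended)
            stall_cycles++;
    }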
Similar to other payload processing applications, fragmentation places a high demand on DRAM accesses to fetch the packet data. For LC-Trie, the major memory accesses for each packet involve retrieving a number of entries from the route table, which is stored in SRAM. The bandwidth requirement of LC-Trie is thus much lower than that of fragmentation, which results in a lower stall percentage. As can be seen in Figure 5, when the number of PEs is larger than 4, no stall state occurs for LC-Trie, while for fragmentation the number of stall cycles increases rapidly as more PEs are added. Under this configuration, 2 PEs are the most efficient for fragmentation, with the stall cycles increasing from over 69.94×10^6 for a single PE to over 1044×10^6 for a 32-PE system. In this case, more system bus bandwidth and DRAM bandwidth are needed to maintain the efficiency of multiple PEs. As for LC-Trie, although the stall cycles quickly fall from 425×10^6 to 1.67×10^6 when the application is implemented on 4 PEs, 8 PEs is the optimum configuration. The reason is that if more than 8 PEs are used, even though at any time at least one PE is executing an instruction, the percentage of time each individual PE spends suspended also increases. Therefore, the total number of cycles used to process the same number of packets shows only a moderate decrease.

Figure 5. NP performance of fragmentation/LC-Trie for various numbers of PEs.

4.3. Impact of Memory Latencies

Figure 6 shows the number of execution cycles and stall cycles needed to process 10,000 packets with the LC-Trie algorithm as the CPU relative latency changes. For simplicity, we assume the PEs and the System Bus work at the same clock speed, as do the DRAM and the global SRAM. The ratio between the working frequency of the PE/System Bus and that of the DRAM/global SRAM is then defined as the CPU relative latency.

Figure 6. NP performance of LC-Trie under various memory latencies.

It can be seen that, when the number of PEs is less than 8, the long latency of external memory does not have a significant impact on the number of processing cycles required. The reason is that, for each packet, only a small number of global SRAM and DRAM accesses are needed. As long as the bandwidth of the System Bus is sufficient for these communications, the increased memory latency is amortized among the different PEs and the chance of all PEs being suspended is low. So it can be observed that when there are fewer than 4 PEs, the total number of stall cycles only slightly increases when the relative memory latency is higher than 10. With 8 PEs, the number of stall cycles becomes more sensitive to changes in memory latency. However, since the bandwidth of the System Bus is still sufficient, the number of cycles needed to process the packets remains stable across different values of memory latency.

However, as more PEs are added, the bandwidth of the System Bus becomes unable to accommodate the data traffic in a timely way. In this case, an increase in memory access latency results not only in a higher number of stall cycles, but also in a rapid increase in the number of processing cycles. When the relative latency is higher than 5, processing the same number of packets on 16 or 32 PEs takes more cycles than on only 4 or 8 PEs.

4.4. Effectiveness of Hardware Accelerators

Using either Cluster A or B, SimNP provides an efficient means of simulating hardware accelerators and of evaluating their effectiveness through figures such as device utilization and speedup. To demonstrate this, we implement the AES algorithm described in Subsection 4.1 as a hardware accelerator. For hardware accelerators, the trade-off is typically between increasing performance and reducing area cost. By configuring the latency of the hardware accelerator, we can evaluate the benefit gained by offloading calculation-intensive tasks.
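Following the WRITE_WORD/READ_WORD style of Figures 3 and 4, offloading one packet to the Cluster A AES unit might look like the sketch below; only the source/destination/length command format is described above, while the register names and addresses are hypothetical:

    #include "simnp_defs.h"   /* assumed to provide WRITE_WORD/READ_WORD */

    #define AES_UNIT    0x30000000UL     /* hypothetical register base */
    #define AES_STATUS  (AES_UNIT + 12)  /* hypothetical status register */

    static int encrypt_packet(unsigned long pkt_addr, unsigned long pkt_len)
    {
        /* Program the DMA transfer; the accelerator fetches and stores the
           data itself, without PE intervention. */
        WRITE_WORD(AES_UNIT, pkt_addr);        /* source address in DRAM */
        WRITE_WORD(AES_UNIT + 4, pkt_addr);    /* destination: in place */
        WRITE_WORD(AES_UNIT + 8, pkt_len);     /* buffer length in bytes */
        /* The issuing PE suspends until the command completes, then reads
           back the result. */
        return READ_WORD(AES_STATUS);
    }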
Figure 7 shows the number of NP processing cycles required to encrypt 30,000 packets as the latency of the AES hardware accelerator increases. With only one PE being evaluated, the bandwidth of the System Bus is sufficient for the packet data transfer between DRAM and Cluster A. Therefore, a linear increase in the number of processing cycles is observed as the latency for processing one data block in the AES hardware accelerator grows. Note that the number of instructions executed for AES is at least hundreds of times higher than for LC-Trie, depending on the length of the packet. Nevertheless, compared with Figure 5, the number of cycles needed for AES is lower than that of LC-Trie when normalized to the same number of packets processed. Besides, the use of a hardware accelerator also makes the processing time more deterministic. Such behavior is helpful for the implementation of load balancing.

Figure 7. NP performance of AES under various hardware accelerator latencies.

5. Future Work

Components that we plan to implement for SimNP in the future include a more accurate PE execution core and a cache hierarchy for latency-hiding techniques. Introducing a cache hierarchy in a multi-core environment brings the problem of cache coherence, but it will reduce the need for multiple types of memory devices and make programming much easier. Finally, flexibility will be improved by providing a debugging environment within the simulator, removing the need for any intermediate stages during application verification.

6. Conclusions

As more and more network applications have been moved to the NP platform, the availability of an infrastructure for the simulation and evaluation of such a complicated system becomes increasingly crucial. After nearly ten years of evolution, the modern NP has developed its own collection of architectural features, which are tailored for packet processing. In this article, we have proposed and described a new NP simulator called SimNP. It models the components commonly seen within an NP, such as multiple PEs, an integrated network interface and memory controllers, and hardware accelerators. Supporting the ARM instruction set, SimNP can easily be programmed in high-level languages such as C with no modifications to compilers. The use of memory-mapped I/O allows rapid addition or removal of components, as well as complex NP design space exploration, balancing a flexible and appropriate abstraction level while providing meaningful statistics and analysis.

7. Acknowledgements

This work is funded by the Irish Research Council for Science, Engineering and Technology (IRCSET) and Enterprise Ireland (EI). The authors would also like to thank Ms. Yachao Zhou and Mr. Feng Guo for their work in preparing this manuscript.

8. References

[1] D. Comer and L. Peterson, "Network Systems Design Using Network Processors," 1st Edition, Prentice-Hall, Inc., USA, 2003.

[2] Intel Inc., "IXP2800 Hardware Reference Manual."

[3] M. R. Hussain, "Octeon Multi-Core Processor," Keynote Speech at ANCS 2006, San Jose, California, USA, December 2006.

[4] M. Venkatachalam, P. Chandra and R. Yavatkar, "A Highly Flexible, Distributed Multiprocessor Architecture for Network Processing," Computer Networks, Vol. 41, No. 5, 2003, pp. 563-586.

[5] W. Eatherton, "The Push of Network Processing to the Top of the Pyramid," Keynote Address at the Symposium on Architectures for Networking and Communication Systems (ANCS 2005), Princeton, New Jersey, USA, October 2005.
[6] J. Mudigonda, H. M. Vin and R. Yavatkar, "Managing Memory Access Latency in Packet Processing," Proceedings of the International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS 2005), Banff, Alberta, Canada, June 2005, pp. 396-397.

[7] M. Peyravian and J. Calvignac, "Fundamental Architecture Considerations for Network Processors," Computer Networks, Vol. 41, No. 5, 2003, pp. 587-600.

[8] W. Bux, W. E. Denzel, T. Engbersen, A. Herkersdorf and R. P. Luijten, "Technologies and Building Blocks for Fast Packet Forwarding," IEEE Communications Magazine, Vol. 39, No. 1, 2001, pp. 70-77.

[9] D. Burger and T. Austin, "The SimpleScalar Tool Set Version 2.0," Computer Architecture News, Vol. 25, No. 3, 1997, pp. 13-25.

[10] N. Shah, W. Plishker and K. Keutzer, "NP-Click: A Productive Software Development Approach for Network Processors," IEEE Micro, Vol. 24, No. 5, 2004, pp. 45-54.

[11] T. Wolf and M. Franklin, "CommBench: A Telecommunications Benchmark for Network Processors," IEEE International Symposium on Performance Analysis of Systems and Software, Austin, USA, April 2000.

[12] G. Memik, W. H. Mangione-Smith and W. Hu, "NetBench: A Benchmarking Suite for Network Processors," Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, Digest of Technical Papers, San Jose, USA, November 2001, pp. 39-42.

[13] J. L. Hennessy and D. A. Patterson, "Computer Architecture: A Quantitative Approach," 4th Edition, Morgan Kaufmann, USA, 2006.

[14] Z. Liu, D. Bermingham and X. Wang, "Towards Fast and Flexible Simulation of Network Processors," The IET 2008 China-Ireland International Conference on Information and Communications Technologies (CIICT 2008), Beijing, China, September 2008, pp. 611-614.

[15] D. Bermingham, Z. Liu and X. Wang, "SimNP: A Flexible Platform for the Simulation of Network Processing System," ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS 2008), California, USA, November 2008, pp. 123-124.

[16] P. Willman, M. Broglioli and V. Pai, "Spinach: A Liberty-Based Simulator for Programmable Network Interface Architectures," LCTES, 2004, pp. 20-29.

[17] R. Ramaswamy and T. Wolf, "PacketBench: A Tool for Workload Characterization of Network Processing," Proceedings of the IEEE 6th Annual Workshop on Workload Characterization (WWC-6), Austin, TX, 2003, pp. 42-50.

[18] Y. Luo, J. Yang, L. N. Bhuyan and L. Zhao, "NePSim: A Network Processor Simulator with a Power Evaluation Framework," IEEE Micro, Vol. 24, No. 5, 2004, pp. 34-44.

[19] D. Suryanarayanan, J. Marshall and G. T. Byrd, "A Methodology and Simulator for the Study of Network Processors," In: P. Crowley, M. A. Franklin, H. Hadimioglu and P. Z. Onufryk, Eds., Network Processor Design: Issues and Practices, Morgan Kaufmann Publishers, USA, 2003, pp. 27-54.

[20] "NLANR Passive Measurement Analysis," 2007. http://pma.nlanr.net/

[21] "The GNU Compiler Collection," 2009. http://gcc.gnu.org

[22] B. Schneier, "Applied Cryptography," 2nd Edition, John Wiley & Sons, New York, 1996.

[23] S. Kumar, J. Turner and J. Williams, "Advanced Algorithms for Fast and Scalable Deep Packet Inspection," Proceedings of the ACM/IEEE Symposium on Architectures for Networking and Communications Systems, San Jose, USA, December 2006, pp. 81-92.
[24] P. Gupta and N. McKeown, "Packet Classification on Multiple Fields," Proceedings of the 1999 Conference on Applications, Technologies, Architectures and Protocols for Computer Communications (ACM SIGCOMM '99), Massachusetts, USA, September 1999, pp. 147-160.

[25] S. Nilsson and G. Karlsson, "IP-Address Lookup Using LC-Tries," IEEE Journal on Selected Areas in Communications, Vol. 17, No. 6, June 1999, pp. 1083-1092.

[26] RFC 791, "Internet Protocol: DARPA Internet Program Protocol Specification," 1981.