10Gbps Network Packet Throughput on Intel Atom with DPDK
At Netadvia, we are firm believers in getting the most out of less when it comes to network devices. This philosophy is fuelled by the need to balance device performance against hardware costs, particularly as intelligence and processing power are being pushed towards the network edge.
Today, uCPEs* come in many form factors and price ranges. In general, the differences between uCPE offerings are subtle. The key difference, particularly when it comes to cost, is the CPU architecture at the core of the device. Intel processors are commonly used in this space, with two key CPU families: Intel Xeon and Intel Atom. Put simply, the difference between these processors is processing power, but with this power comes a significant price difference.
We were interested in testing the performance of lower-end, Atom-based uCPEs in order to identify whether our clients could maximise product performance without incurring the significant cost increase of moving to Xeon. In particular, we wanted to push the limits of what we could achieve on a single Atom core with network traffic driven through a 10G interface.
We developed a simple Linux application that receives network packets from a 10G network interface and filters them based on the destination IP address. The application was developed using DPDK (Data Plane Development Kit), a collection of libraries focused on the acceleration of packet processing workloads. DPDK is mainly leveraged by network software engineers for its poll mode drivers, which bypass the Linux kernel and ensure that packets arrive at the application level as fast as possible.
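To give a flavour of the per-packet work, here is a minimal sketch of a destination-IP filter over a raw Ethernet frame. The function name is ours, not part of the application or of DPDK, and it assumes untagged IPv4 frames; in the real application the frames would be delivered by DPDK's poll mode driver.

```c
#include <stdint.h>
#include <string.h>

/* Illustrative filter (hypothetical name): returns 1 if the frame is
 * IPv4 and its destination address matches `target` (network byte
 * order as stored in memory), 0 otherwise.
 * Assumes an untagged Ethernet frame: EtherType at bytes 12-13,
 * IPv4 header starting at byte 14, destination address at IP
 * header offset 16. */
static int match_dst_ip(const uint8_t *frame, uint32_t target)
{
    /* EtherType 0x0800 == IPv4 */
    if (frame[12] != 0x08 || frame[13] != 0x00)
        return 0;

    uint32_t dst;
    memcpy(&dst, frame + 14 + 16, sizeof(dst));
    return dst == target;
}
```

In the actual application this check sits inside a tight receive loop that polls the NIC in bursts rather than waiting on kernel interrupts.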
Within the application, we categorised packets using a highly optimised, vectorised pattern matching algorithm that maximises the use of the SSE (Streaming SIMD Extensions) instructions available on the Atom processor. This instruction set enables the processing of multiple data elements per instruction, providing a highly efficient mechanism for processing packets in parallel. It's true that compilers do a reasonably good job of generating these instructions. However, when it comes to maximising packet throughput whilst ensuring deterministic behaviour, there are significant efficiencies to be gained by developing customised algorithms using SSE instructions directly.
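To illustrate the idea (this is not our production algorithm), a single SSE2 compare instruction can test four packets' destination addresses against a target at once, where a scalar loop would need four separate comparisons:

```c
#include <emmintrin.h>  /* SSE2, available on Atom C3000 */
#include <stdint.h>

/* Illustrative sketch: compare four destination addresses against one
 * target in a single _mm_cmpeq_epi32. Returns a 4-bit mask with bit i
 * set if ips[i] == target. */
static int match4_dst_ip(const uint32_t ips[4], uint32_t target)
{
    __m128i v  = _mm_loadu_si128((const __m128i *)ips);
    __m128i t  = _mm_set1_epi32((int)target);
    __m128i eq = _mm_cmpeq_epi32(v, t);  /* 0xFFFFFFFF where equal */
    /* Collect the sign bit of each 32-bit lane into an integer mask */
    return _mm_movemask_ps(_mm_castsi128_ps(eq));
}
```

The hand-written version buys determinism: the same fixed instruction sequence runs per batch of packets, independent of what the compiler's auto-vectoriser decides to emit.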
We also optimised the use of CPU caches to ensure that data was always available to the CPU when required, therefore reducing the impact of expensive memory reads. The application used 1G hugepages in order to avoid costly TLB misses* that would significantly impact packet throughput.
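One simple cache technique of this kind (the struct and field names here are ours, purely for illustration) is sizing and aligning the hot per-flow data so that the per-packet path touches exactly one 64-byte cache line:

```c
#include <stdalign.h>
#include <stdint.h>

/* Illustrative layout: keep the counters the per-packet path updates
 * together on a single 64-byte cache line, so one memory access (or
 * one cache hit) covers all of them. */
struct flow_stats {
    alignas(64) uint64_t packets;   /* packets matched        */
    uint64_t bytes;                 /* bytes matched          */
    uint32_t last_dst_ip;           /* last matching address  */
    uint8_t  pad[64 - 2 * sizeof(uint64_t) - sizeof(uint32_t)];
};

_Static_assert(sizeof(struct flow_stats) == 64,
               "hot per-flow data must fit exactly one cache line");
```

The padding also prevents false sharing if adjacent entries are ever touched by different cores.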
Our application was deployed on an Atom-based C3000 series uCPE clocked at 2.2 GHz, with Turbo Boost disabled for consistency. The application was allocated a single core, a 10G network interface and a single 1G hugepage. A packet generator was used to transmit packets of varying sizes at line rate, including an IMIX* distribution to observe a more realistic traffic flow.
The results are shown below.
With this configuration, we were able to receive and analyse packets at line rate (10Gbps) with zero packet loss on a single Atom core for all tested packet sizes. Notably, this included 64-byte packets, a useful indicator of worst-case performance, since the smallest packets produce the highest packet rate and therefore the smallest per-packet processing budget. It is clear that a single Atom core is capable of receiving and processing network traffic at 10Gbps for any of the tested packet distributions.
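For context, the 64-byte worst case can be quantified from standard Ethernet framing overheads; a small helper (ours, not part of the application) makes the arithmetic explicit:

```c
/* Worst-case packet rate at a given line rate, assuming minimum-size
 * 64-byte Ethernet frames. Each frame occupies 64 + 20 bytes on the
 * wire (8-byte preamble + 12-byte inter-frame gap), so 10 Gbps works
 * out at roughly 14.88 million packets per second, i.e. a processing
 * budget of about 67 ns per packet on a single core. */
static double min_frame_pps(double gbps)
{
    return gbps * 1e9 / ((64.0 + 20.0) * 8.0);
}
```

Sustaining zero loss at that rate means the entire receive-and-classify path fits within that ~67 ns budget.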
We were impressed by the packet processing power of the Intel Atom device considering the price difference when compared with Intel Xeon-based uCPEs. It is important to note that a reduction in the received rate may be observed when introducing further processing on the received packets. However, it is also worth noting that the more realistic IMIX distribution leaves more headroom for processing on this Atom processor. Keeping that in mind, as well as the fact that these devices typically include at least 4 cores, we believe that Atom-based uCPEs are an excellent option when trying to keep costs down while maximising the packet throughput performance of network appliance solutions.
Are you interested in availing of our software engineering & consultancy services? Contact us today and request a meeting.
* uCPE (or universal Customer Premises Equipment) is a general-purpose network appliance deployed at a customer site, typically used to host virtualised network functions.
* TLB (or Translation Lookaside Buffer) is a memory cache that stores recent virtual-to-physical address translations, reducing the time taken to access a memory location.
* IMIX (or Internet Mix) refers to a typical Internet traffic distribution of varying packet sizes that resembles what can be observed in “real-world” networks.