Data plane offloading on a high-speed parallel processing architecture 

Solutions and Protocols used for fabric networks

Recently, several proprietary data center networking solutions based on the fabric network concept have emerged. They aim to provide higher throughput across the network and to reduce latency. Such solutions replace the traditional three-layer network topology with a full mesh of links, often referred to as the "network fabric". Thus, any pair of nodes has direct connectivity and predictable network performance. Moreover, fabric networks have a flat architecture that is resilient and scalable and that decreases the number of hops needed for communication inside a data center. When compared with the three-tier architecture, fabric networks reduce the amount of north-south traffic, while the east-west traffic (which is typical for virtualized environments) encounters less network congestion. One of the main advantages is that VMs can use multiple paths to communicate with each other by using protocols such as TRILL or SPB (Shortest Path Bridging) instead of a single fixed path calculated by STP. Additionally, the links that would be cut off by STP can now carry traffic, so link utilization is much higher when multipathing is available.

QFabric

Juniper Networks Quantum Fabric (QFabric) architecture delivers a data center design by creating a single-tier network that operates like a single Ethernet switch [57]. Such an architecture brings significant changes in scale, performance and simplicity while providing support for virtualized environments. Most data center traffic today is server-to-server, or east-west, because of the adoption of service-oriented architecture (SOA)-based applications and virtualization [57]. In traditional three-tier architectures, packets need to transit multiple switches up and down the hierarchy in order to reach their destination. This can have a significant impact on transaction latency, workload mobility and real-time cloud computing interactions [57]. Additionally, operational expenses are higher in traditional architectures because of the complex network management that needs to track VM migrations and maintain connectivity between VMs in a tree structure. Juniper Networks QFabric creates a single-tier network that is operated and managed like a single, logical, distributed switch [57]. It consists of edge, interconnect and control devices that run the Junos operating system. In traditional data center networks, when the number of servers increases, new switches are added to the network in order to interconnect them. Each switch needs to be managed individually, which is complicated in large data centers. Moreover, the amount of control traffic in the network also increases. In [57], the authors claim that in tree-based networks, 50% of the network's ports (which are the most expensive ports) are used to interconnect switches rather than to link servers, storage and other devices. Also, around 50% of network bandwidth is unavailable if STP is run, as it disables almost half of the available links in order to avoid loops. With this in mind, capital and operational expenses are very high in traditional data center networks. The physical location of a server in a three-tier architecture also affects application performance. A larger number of hops can increase latency, which can further contribute to unpredictable application performance [57].
The inspiration for QFabric was drawn from switch design: inside every switch is a fabric, a completely flat mesh that provides any-to-any connectivity between ports [57]. The fabric network concept retains simplicity by allowing multiple physical switches to be managed like, and to behave as, a single logical switch [57]. Any-to-any connectivity means that each device is only one hop away from any other device. The QFabric architecture includes a distributed data plane that directly interconnects all ports with one another and an SOA-based control plane that provides high scalability, resiliency and a single point of management. Minimal administrative overhead is needed whenever a new compute node is added in a plug-and-play fashion. The three basic components of a self-contained switch fabric (line cards, backplane and Routing Engines) are broken out into independent standalone devices: the QF/Node, the QF/Interconnect and the QF/Director, respectively [57]. The QF/Node is a high-density, fixed-configuration 1 RU edge device which provides access into and out of the fabric [57]. The QF/Interconnect connects all QF/Node edge devices in a full-mesh topology. The QF/Director integrates control and management services for the fabric and builds a global view of the entire QFabric, which can consist of thousands of server-facing ports [57].

FabricPath

FabricPath [58] is an innovation in Cisco NX-OS Software that brings the stability and scalability of routing to layer 2 [59]. Unlike in traditional data centers, the network no longer needs to be segmented. FabricPath also provides network-wide VM mobility with massive scalability. It introduces an entirely new layer 2 data plane that is similar to the TRILL protocol. When a frame enters the fabric, it is encapsulated with a new header that carries routable source and destination addresses. From there, the frame is routed, and when it reaches the destination switch it is decapsulated back into the original Ethernet format. Each switch in the FabricPath network has a unique identifier called a Switch-ID which is assigned dynamically. Layer 2 forwarding tables in FabricPath are built by associating end-host MAC addresses with Switch-IDs [58]. One of the advantages of FabricPath is that it preserves the plug-and-play benefits of classical Ethernet, as it requires minimal configuration [58]. FabricPath uses a single control protocol (IS-IS) for unicast forwarding, multicast forwarding and VLAN pruning [58]. It also allows the use of N-way multipathing while providing high availability. Forwarding in FabricPath is always performed across the shortest path to the destination, unlike classical layer 2 forwarding where STP does not necessarily compute the shortest path between two endpoints.
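
As an illustration, the sketch below models such a layer 2 forwarding table in C. The structure and function names are hypothetical and only show the association between end-host MAC addresses and Switch-IDs; they do not reflect Cisco's implementation.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical FabricPath-style forwarding entry: an end-host MAC
 * address learned behind an edge switch that is identified by its
 * dynamically assigned Switch-ID. */
struct fp_fib_entry {
    uint8_t  mac[6];    /* end-host MAC address        */
    uint16_t switch_id; /* egress switch in the fabric */
};

/* Linear lookup over a small table (real hardware would use a hash
 * table or CAM). Returns the Switch-ID toward which the encapsulated
 * frame must be routed, or 0 if the destination is unknown. */
static uint16_t fp_lookup(const struct fp_fib_entry *tbl, size_t n,
                          const uint8_t mac[6])
{
    for (size_t i = 0; i < n; i++)
        if (memcmp(tbl[i].mac, mac, 6) == 0)
            return tbl[i].switch_id;
    return 0; /* unknown destination: flood/learn as appropriate */
}
```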

Brocade VCS Technology

Brocade defines Ethernet fabrics in which the control and management planes are extracted from the physical switch into the fabric, while the data plane remains at the switch level [60]. In this way, control and management become scalable distributed services that are integrated into the network instead of being integrated into the switch chassis. The advantage of such an approach is that the fabric scales automatically when another fabric-enabled switch is added [60]. In fact, a new switch automatically joins a logical chassis, which is similar to adding a port card to a chassis switch [60]. The security and policy configuration parameters are automatically inherited by the new switch, which simplifies monitoring, management and operations when compared with traditional layer 2 networks. Instead of performing configuration and management multiple times for each switch and each port, these functions are performed only once for the whole fabric. All switches in the fabric have information about device connections to servers and storage, which enables Automated Migration of Port Profiles (AMPP) within the fabric [60]. Furthermore, AMPP ensures that all network policies and security settings are maintained whenever a VM or server moves to another port of the fabric, without having to reconfigure the entire network [60]. Brocade VCS Technology implements all the properties and requirements of the Ethernet fabric and removes the limitations of classic Ethernet. For instance, it uses link-state routing instead of STP. Moreover, equal-cost multipath forwarding is implemented at layer 2, which brings more efficient usage of the network links. VCS technology is divided into three parts: VCS Ethernet for the data plane, VCS Distributed Intelligence for the control plane and VCS Logical Chassis for the management plane.

Communication protocols for fabric networks

TRILL

TRILL (TRansparent Interconnection of Lots of Links) [54] is the IETF protocol used for communications inside a data center. The main goal of the TRILL protocol is to overcome the limitations of conventional Ethernet networks by introducing certain advantages of network layer protocols. It is considered a layer 2.5 protocol since it terminates traditional Ethernet clouds, similarly to what IP routers do. It is as simple to configure and deploy as Ethernet and it is based on the IS-IS protocol. TRILL provides least-cost pair-wise data forwarding without configuration, supports multipathing for both unicast and multicast traffic and provides safe forwarding during periods of temporary loops. Additionally, TRILL makes it possible to create a cloud of links that acts as a single IP subnet from the IP nodes' point of view. Also, TRILL is transparent to layer 3 protocols. Existing Ethernet deployments can be converted into a TRILL network by replacing classical bridges with RBridges (Routing Bridges), which implement the TRILL protocol.
RBridges run the SPF (Shortest Path First) algorithm, thus providing optimal pair-wise forwarding. TRILL supports multipathing, which provides better link utilization in the network by using multiple paths to transfer data between two end hosts. This is enabled by the encapsulation of the traffic with an additional TRILL header. This header introduces a hop count field, which is a characteristic of layer 3 protocols, and allows for safe forwarding during periods of temporary loops. RBridges are also backward-compatible with IEEE 802.1 customer bridges [54]. More details about TRILL can be found in Appendix A.
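
For illustration, the fixed 6-byte TRILL header defined in RFC 6325 (version, multi-destination flag, option length, hop count and the egress/ingress RBridge nicknames) can be sketched in C as follows. This is a didactic layout only; byte-order handling and options are omitted.

```c
#include <stdint.h>

/* Sketch of the 6-byte fixed TRILL header (RFC 6325). The first two
 * octets pack: version (2 bits), reserved (2 bits), multi-destination
 * flag (1 bit), option length (5 bits) and hop count (6 bits). The
 * hop count is the layer 3-like field that makes temporary loops safe. */
struct trill_hdr {
    uint16_t flags_hopcount;   /* V | R | M | Op-Length | Hop Count */
    uint16_t egress_nickname;  /* nickname of the egress RBridge    */
    uint16_t ingress_nickname; /* nickname of the ingress RBridge   */
};

/* Decrement the hop count, as every transit RBridge does; the frame
 * is dropped when the count reaches zero. Network byte order is
 * assumed to have been handled elsewhere. */
static inline int trill_dec_hopcount(struct trill_hdr *h)
{
    uint16_t hop = h->flags_hopcount & 0x3F; /* low 6 bits: hop count */
    if (hop == 0)
        return -1;                           /* drop the frame        */
    h->flags_hopcount = (uint16_t)((h->flags_hopcount & ~0x3F) | (hop - 1));
    return 0;
}
```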

SPB

SPB (Shortest Path Bridging) is the IEEE protocol standardized as 802.1aq [61]. It was conceived as a replacement for older spanning tree protocols such as IEEE 802.1D, which disabled all the links that did not belong to the spanning tree. Instead, it uses shortest path trees (SPTs), which guarantee that traffic is always forwarded over the shortest path between two bridges. Due to the bidirectional property of SPTs, forward and reverse traffic always take the same path; the same holds for unicast and multicast traffic between two bridges. SPB allows the use of multiple equal-cost paths, which brings better link utilization in the network. The control plane of SPB is based on the IS-IS (Intermediate System to Intermediate System) link-state routing protocol (ISIS-SPB) with new TLV extensions. This protocol is used to exchange information between the bridges, which is then used for SPT calculation. There are two variants of SPB: one uses the 802.1ad Q-in-Q datapath (Shortest Path Bridging VID, SPBV) and the other uses the hierarchical 802.1ah MAC-in-MAC datapath (Shortest Path Bridging MAC, SPBM) [62]. Both variants share the same control plane, algorithms and a common routing mechanism. SPB's control plane has a global view of the network topology, which allows fast restoration after a failure.
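
As a rough illustration of the SPBM datapath, the sketch below shows the MAC-in-MAC (IEEE 802.1ah) encapsulation that wraps a customer frame in a backbone Ethernet header with a B-TAG and an I-TAG carrying a 24-bit service identifier (I-SID). The field packing is simplified and the struct is purely illustrative, not an implementation.

```c
#include <stdint.h>

/* Simplified IEEE 802.1ah (MAC-in-MAC) encapsulation as used by the
 * SPBM variant: the customer frame is carried behind a backbone
 * Ethernet header, a backbone VLAN tag (B-TAG) and a service instance
 * tag (I-TAG). Byte-order handling and padding are ignored here. */
struct spbm_encap {
    uint8_t  b_da[6];   /* backbone destination MAC (egress bridge)  */
    uint8_t  b_sa[6];   /* backbone source MAC (ingress bridge)      */
    uint16_t btag_tpid; /* 0x88A8                                    */
    uint16_t btag_tci;  /* PCP/DEI + 12-bit backbone VLAN (B-VID)    */
    uint16_t itag_tpid; /* 0x88E7                                    */
    uint32_t itag_tci;  /* flags + 24-bit service identifier (I-SID) */
    /* ... original customer Ethernet frame follows ...              */
};
```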
Both TRILL-based and SPB-based Ethernet clouds scale much better than Ethernet networks based on spanning tree protocols [63]. One of the motivations for this thesis lies in the concept of fabric networks. However, the goal is to build a mesh network that uses highly performant smart NICs, interconnected just like the switches in a network fabric, instead of switches. Additionally, in our experimental evaluation we use the TRILL protocol for the communication between smart NICs, which allows higher utilization of links in the data center.

Hardware implementations

In this section, we introduce two types of hardware implementations: GPU-based solutions and FPGA-based solutions. We emphasize that we classify as hardware implementations all the solutions that rely on specific hardware such as GPUs and FPGAs, while the solutions that rely solely on the processing power of CPUs have been described in the previous section. For this reason, even several solutions that are based on the Click Modular Router are presented in this section, as they offload part of the processing onto specialized hardware.

GPU-based solutions

Snap [32] is a packet processing system based on Click which offloads some of the computation load onto GPUs. Its goal is to improve the speed of Click packet processing by offloading heavy computation to GPUs while preserving Click's flexibility. Therefore, the authors designed Snap to offload to GPUs only those specific elements which, if not offloaded, could create bottlenecks in the packet processing. The processing performed on the GPU is considered to be of the fast path type, while the slow path processing is performed on the CPU. To achieve that, Snap adds a new type of "batch" element and a set of adapter elements which allow the construction of pipelines toward GPUs. However, the standard Click pipeline between elements only lets through one packet at a time, which is not enough to benefit from GPU offloading. As a matter of fact, offloading a single packet to a GPU incurs a lot of overhead and, additionally, the GPU architecture is not suited to such small workloads. Therefore, Snap modifies the Click pipeline in order to transmit batches of packets, thus increasing the performance gain obtained by offloading.
The authors define and implement new methods to exchange data between elements using a new structure called PacketBatch. Thanks to these modifications, Snap provides a GPU-based parallel element which is composed of a GPU part and a CPU part. The GPU part is the GPU kernel code, and the CPU part receives PacketBatches and sends them to the GPU kernel. To improve Snap's interaction with the GPU, the authors developed a GPURuntime object programmed and managed with NVIDIA's CUDA toolkit [78]. However, to send and retrieve batches of packets to and from the GPU, Snap needs two new elements called Batcher and Debatcher. The Batcher collects packets and sends them in batches to the GPU, whereas the Debatcher does the inverse: it receives batches of packets and sends them one at a time back to the CPU. Both the Batcher and Debatcher elements manage the necessary copies of data between the host memory and the GPU memory in order to simplify the use of Snap and to improve packet copy times.
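
The PacketBatch, Batcher and Debatcher names come from the Snap paper, but the layout and functions below are only a hypothetical C sketch of the idea: packets are accumulated on the CPU side and handed to the GPU part of the element in a single transfer. gpu_process_batch() and emit_packet() are placeholders, not Snap's API.

```c
#include <stdint.h>

#define BATCH_SIZE 1024 /* hypothetical batch size */

/* Illustrative PacketBatch-like container: packets accumulated on the
 * CPU side so that the GPU part of a GPU-based parallel element can
 * process them in one operation instead of one packet at a time. */
struct packet_batch {
    void    *pkt[BATCH_SIZE]; /* host-side packet buffers       */
    uint32_t len[BATCH_SIZE]; /* packet lengths                 */
    uint32_t count;           /* number of packets in the batch */
};

/* Placeholders for the GPU-side processing and for handing packets
 * back to the downstream elements (Debatcher behaviour). */
void gpu_process_batch(struct packet_batch *b);
void emit_packet(void *pkt, uint32_t len);

/* Batcher-like accumulation: gather packets until the batch is full,
 * process the whole batch on the GPU, then re-emit packets one by one. */
void batcher_push(struct packet_batch *b, void *pkt, uint32_t len)
{
    b->pkt[b->count] = pkt;
    b->len[b->count] = len;
    if (++b->count == BATCH_SIZE) {
        gpu_process_batch(b); /* single host-to-GPU transfer */
        for (uint32_t i = 0; i < b->count; i++)
            emit_packet(b->pkt[i], b->len[i]);
        b->count = 0;
    }
}
```
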
To benefit from GPU offloading, the GPU has to manage several or all the packets of a batch in parallel. Parallel processing often reorders the packets, which is not desirable for TCP or streaming performance. To prevent this issue, Snap uses a GPUCompletionQueue element which waits for all packets of a batch to be processed before sending the batch to the Debatcher. Snap faces another issue: in Click, not all packets follow the same path element-wise, therefore packets from a batch might take different paths. This separation of packets can happen before reaching the GPU or inside the GPU. The authors identify two main classes of packet divergence: routing/classification divergence and exception-path divergence. Routing/classification divergence results in fragmented memory in the pipeline and happens mostly before reaching the GPU. The problem is that unnecessary packets would be copied into GPU memory, which is time-consuming because of the scarce PCIe bandwidth. To solve this issue, Snap copies only the necessary packets into a contiguous memory space at the host level to create a batch of packets, which is then sent to the GPU memory using a single transfer over PCIe. Regarding path divergence happening inside the GPU, Snap solves the issue by attaching predicate bits to each packet. This way, path divergence is only handled once the batch is back in the CPU part, thanks to the predicate bits which indicate which element should process each packet next.
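
The predicate-bit mechanism is Snap's, but the following C sketch of the CPU-side dispatch is hypothetical; to_forward(), to_exception() and drop_packet() stand in for the downstream Click elements.

```c
#include <stdint.h>

/* Hypothetical per-packet predicate bits set by the GPU code: instead
 * of splitting or reordering the batch on the device, each packet is
 * tagged with the path it should take once back on the CPU. */
enum path_predicate {
    PRED_FORWARD   = 1 << 0, /* normal fast-path forwarding       */
    PRED_EXCEPTION = 1 << 1, /* exception path (slow path on CPU) */
    PRED_DROP      = 1 << 2
};

/* Placeholders standing in for the downstream Click elements. */
void to_forward(void *pkt);
void to_exception(void *pkt);
void drop_packet(void *pkt);

/* CPU-side dispatch: the batch is walked once and each packet is sent
 * to the element indicated by its predicate bits. */
void dispatch_batch(void **pkts, const uint8_t *pred, uint32_t count)
{
    for (uint32_t i = 0; i < count; i++) {
        if (pred[i] & PRED_DROP)
            drop_packet(pkts[i]);
        else if (pred[i] & PRED_EXCEPTION)
            to_exception(pkts[i]);
        else
            to_forward(pkts[i]);
    }
}
```
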
In order to further improve Snap's performance, the authors used packet slicing to reduce the amount of data that needs to be copied between the host memory and the GPU memory. This slicing mechanism, defined in PacketBatch, allows GPU processing elements to operate on specific regions of a packet. For example, an IP route lookup only needs the destination address, so only this IP address has to be copied to the GPU, which reduces the amount of copied data by roughly 94% when the packet is 64 bytes long.
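
A minimal sketch of such a slice for an IP route lookup is given below, assuming an untagged Ethernet/IPv4 frame; the offsets and function name are illustrative, not Snap's code. Copying 4 bytes instead of a 64-byte packet corresponds to the reduction mentioned above (1 - 4/64 = 93.75%, roughly 94%).

```c
#include <stdint.h>
#include <string.h>

/* Offset of the IPv4 destination address in a plain (untagged)
 * Ethernet/IPv4 frame: 14-byte Ethernet header + 16 bytes into the
 * IPv4 header. Illustrative values; real code would parse headers. */
#define ETH_HDR_LEN     14
#define IPV4_DST_OFFSET (ETH_HDR_LEN + 16)

/* Copy only the 4-byte destination address of each packet into a
 * contiguous staging buffer that is then sent to the GPU in a single
 * PCIe transfer. For a 64-byte packet this copies 4 bytes instead of
 * 64, i.e. about a 94% reduction in copied data. */
void slice_ipv4_dst(uint8_t *staging, uint8_t *const *pkts, uint32_t count)
{
    for (uint32_t i = 0; i < count; i++)
        memcpy(staging + 4u * i, pkts[i] + IPV4_DST_OFFSET, 4);
}
```
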
Thanks to its improvements over Click, Snap enables fast packet processing with the help of GPU offloading. As a matter of fact, the authors implemented a packet forwarder with Snap whose performance was 4 times better than that of standard Click. Nevertheless, some Snap elements, such as HostToDeviceMemcpy and GPUCompletionQueue, are GPU-specific, which means that Snap is not compatible with all GPUs.

PacketShader

PacketShader [8] is a software router framework which uses Graphics Processing Units (GPUs) for fast packet processing, in order to alleviate the costly computing burden on CPUs. This solution takes advantage of fast path processing of packets by exploiting the GPU's processing power. In comparison to CPU cores, GPU cores can attain an order of magnitude higher raw computation power and they are very well suited to the data-parallel execution model which is typical for the majority of router applications. The PacketShader idea consists of two parts (a minimal sketch follows the list):
• optimization of I/O through the elimination of per-packet memory management overhead and batch packet processing.
• offloading of core packet processing tasks to GPUs and the use of massive parallelism for packet processing.
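
The sketch below is only a hypothetical C illustration of these two parts: batched packet I/O on the CPU and a data-parallel step offloaded to the GPU. rx_batch(), gpu_lookup_batch() and tx_batch() are placeholders, not the PacketShader API.

```c
#include <stdint.h>

#define CHUNK 256 /* hypothetical batch size */

/* Placeholders standing in for batched NIC I/O and a GPU kernel launch. */
uint32_t rx_batch(void **pkts, uint32_t max);       /* batched receive    */
void     gpu_lookup_batch(void **pkts, uint32_t n); /* data-parallel work */
void     tx_batch(void **pkts, uint32_t n);         /* batched transmit   */

/* Sketch of the two-part idea: amortize per-packet I/O overhead by
 * moving packets in batches, and hand the compute-heavy step (e.g. a
 * route lookup) to the GPU, which processes the packets of a batch in
 * parallel. */
void forward_loop(void)
{
    void *pkts[CHUNK];
    for (;;) {
        uint32_t n = rx_batch(pkts, CHUNK); /* no per-packet allocation */
        if (n == 0)
            continue;
        gpu_lookup_batch(pkts, n);          /* offload to the GPU       */
        tx_batch(pkts, n);                  /* send the whole batch     */
    }
}
```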

Table of contents:

1 Introduction 
1.1 Motivation
1.2 Problematics
1.3 Contributions
1.4 Plan of the thesis
2 State of the art 
2.1 Introductory remark
2.2 Solutions and Protocols used for fabric networks
2.2.1 Fabric network solutions
2.2.1.1 QFabric
2.2.1.2 FabricPath
2.2.1.3 Brocade VCS Technology
2.2.2 Communication protocols for fabric networks
2.2.2.1 TRILL
2.2.2.2 SPB
2.3 Fast packet processing
2.3.1 Terminology
2.3.1.1 Fast Path
2.3.1.2 Slow Path
2.3.2 Background on packet processing
2.4 Software implementations
2.4.1 Click-based solutions
2.4.1.1 Click
2.4.1.2 RouteBricks
2.4.1.3 FastClick
2.4.2 Netmap
2.4.3 NetSlices
2.4.4 PF_RING (DNA)
2.4.5 DPDK
2.5 Hardware implementations
2.5.1 GPU-based solutions
2.5.1.1 Snap
2.5.1.2 PacketShader
2.5.1.3 APUNet
2.5.1.4 GASPP
2.5.2 FPGA-based solutions
2.5.2.1 ClickNP
2.5.2.2 GRIP
2.5.2.3 SwitchBlade
2.5.2.4 Chimpp
2.5.3 Performance comparison of different IO frameworks
2.5.4 Other optimization techniques
2.6 Integration possibilities in virtualized environments
2.6.1 Packet processing in virtualized environments
2.6.2 Integration constraints and usage requirements
2.7 Latest approaches and future directions in packet processing
2.8 Conclusion
3 Fabric network architecture by using hardware acceleration cards 
3.1 Introduction
3.2 Problems and limitations of traditional layer 2 architectures
3.3 Fabric networks
3.4 TRILL protocol for communication inside a data center
3.5 Comparison of software and hardware packet processing implementations
3.5.1 Comparison of software solutions
3.5.1.1 Operations in the user-space and kernel-space
3.5.1.2 Zero-copy technique
3.5.1.3 Batch processing
3.5.1.4 Parallelism
3.5.2 Comparison of hardware solutions
3.5.2.1 Hardware used
3.5.2.2 Usage of CPU
3.5.2.3 Connection type
3.5.2.4 Operations in the user-space and kernel-space
3.5.2.5 Zero-copy technique
3.5.2.6 Batch processing
3.5.2.7 Parallelism
3.5.3 Discussion on GPU-based solutions
3.5.4 Discussion on FPGA-based solutions
3.5.5 Other hardware solutions
3.6 Kalray MPPA processor
3.6.1 MPPA architecture
3.6.2 MPPA AccessCore SDK
3.6.3 Reasons for choosing MPPA for packet processing
3.7 ODP (OpenDataPlane) API
3.7.1 ODP API concepts
3.7.1.1 Packet
3.7.1.2 Thread
3.7.1.3 Queue
3.7.1.4 PktIO
3.7.1.5 Pool
3.8 Architecture of the fabric network by using the MPPA smart NICs
3.9 Conclusion
4 Data plane offloading on a high-speed parallel processing architecture 
4.1 Introduction
4.2 System model and solution proposal
4.2.1 System model
4.2.2 Frame journey
4.2.2.1 Control frame
4.2.2.2 Data frame
4.2.3 Implementation of TRILL data plane on the MPPA machine
4.3 Performance evaluation
4.3.1 Experimental setup and methodology
4.3.2 Throughput, latency and packet processing rate
4.4 Conclusion
5 Analysis of the fabric network’s control plane for the PoP data centers use case 
5.1 Data center network architectures
5.2 Control plane
5.2.1 Calculation of the control plane metrics
5.2.1.1 Full-mesh topology
5.2.1.2 Fat-tree topology
5.2.1.3 Hypercube topology
5.2.2 Discussion on parameters used
5.2.3 Overhead of the control plane traffic
5.2.4 Convergence time
5.2.5 Resiliency and scalability
5.2.6 Topology choice
5.3 Conclusion
6 Conclusion 
6.1 Contributions
6.1.1 Fabric network architecture by using hardware acceleration cards
6.1.2 Data plane offloading on a high-speed parallel processing architecture
6.1.3 Analysis of the fabric network’s control plane for the PoP data centers use case
6.2 Future works
Publications
Bibliography
