---
layout: post
title: Bluebird - SDN for baremetal cloud services
categories: networking programming
date: 2023-06-09 17:32:05 +0530
---

I recently finished reading the paper [Bluebird: High-Performance SDN for Bare-metal Cloud Services](https://www.usenix.org/system/files/nsdi22-paper-arumugam.pdf). Here are some notes.

A cloud provides an environment for customers to set up their own infrastructure and deliver their services. Typically the cloud provider lets the customer create virtual machines, add them to virtual networks and define policies that govern how the machines (and services) can interact with one another. In some cases a customer needs bare-metal servers (instead of VMs) to deliver certain services. The paper cites examples such as storage systems from NetApp, Cray etc. that need to run on bare metal. In such environments the cloud provider has no control over the bare-metal node - i.e. the customer is free to choose the OS/software stack that runs on it. When hosting VMs, the hypervisor (a.k.a. VMM) runs its own virtual switch that connects the VMs to the customer's virtual network. There is no such luxury with bare-metal nodes.

There are a couple of options for connecting bare-metal nodes: 1) put a smart-NIC on the bare-metal node and let a controller program it so that the node is connected to the customer's virtual network, or 2) let the ToR switch in the rack take care of connecting the bare-metal node to the customer's virtual network. This paper is about the second option.

This is not a new concept; there are existing solutions (for example Nuage's VSG) that make the ToR a VxLAN gateway and create `service interfaces` on it to which bare-metal nodes can connect to become part of the customer's virtual network. What is interesting here is that Microsoft chose to use a ToR built on Intel's Tofino chipset, and Arista is already part of that game, shipping `Arista 7170` boxes based on Tofino. Tofino is a P4-programmable chip, and we saw what it is capable of at my erstwhile startup: one of our engineers wrote a complex pipeline that delivered broadband services with PPPoE over L2TP tunnels. The same thing required quite some work on Broadcom chipsets - rewriting microcode, shipping a new version of the SDK etc. Given reasonable expertise, one can build powerful pipelines with P4-programmable Tofino chipsets.

That said, let us now delve into some points in the paper that caught my attention:

* Prior to the advent of smart-NICs and smart ToRs, vendors used to provide software-based gateways to connect bare-metal nodes to the customer's virtual network; Nuage's VRS-G is an example. This requires a dedicated server connected to an L2 switch to which one or more bare-metal nodes are also connected. The bare-metal nodes use the server as their gateway and forward all traffic to it. The paper claims that having this functionality in the ToR consumes much less power and is far more performant and scalable than having it on a server.
* The authors talk about how they re-designed the table sizes and pipelines to give more space to VxLAN VTEP mappings as opposed to IPv4/IPv6 routes, which are not needed in anywhere near the same volume for this use case. That is the power of P4-programmable chipsets.
* In their implementation, forwarding is based only on the L3 destination, not the L2 destination. Depending on the destination, a bare-metal node either does an ARP resolution to get a MAC address and forwards directly (intra-subnet traffic) or sends the packet to the gateway (inter-subnet traffic); in either case the (inner) destination MAC is just a dummy MAC. The ToR encapsulates the packet and sends it to the remote gateway, which decapsulates it, looks at the IP, does an ARP resolution to find where the destination resides and forwards accordingly. In principle that still needs an L3 table, an ARP table and an L2 table; the difference is that remote nodes need not hold MAC information for all the endpoints - the IP information suffices, and only the local nodes need MAC information. A rough sketch of this decision logic is given after this list.
* One special thing about the Arista 7170 box is that it has two 10G interfaces connected directly to the CPU; the OS can do a PCI scan and find two 10G interfaces. One can run DPDK on the CPU with a couple of cores to make a fast slow-path interface from the ASIC to the CPU! Generally, the slow path from ASIC to CPU takes the standard PCI path - an interrupt is raised and the driver comes and drains the queues. This DPDK-based approach is pretty interesting and innovative, and mirroring traffic to the CPU for examination also scales well with DPDK.
* One thing that I did not understand in the paper is the use of a specially formatted UDP source port (in the VxLAN header) so that VFP can identify packets coming from a bare-metal node. It is not clear to me why such a distinction needs to be made.
* The other thing the paper discusses in some detail is the route cache. Not all routes may find space in the hardware tables, so a software table that can hold orders of magnitude more entries is also maintained; in effect, the table size grows. Caching is not a new concept in computer science - there are memory caches, file system caches etc. A cache follows the well-known statistical multiplexing principle: keep only the active entries while giving the notion of one big bucket. (A toy sketch of such a two-level cache, including the hit bits and adaptive aging described in the next two points, is given after this list.)
* There are hit bits associated with entries in the table - both in software and in hardware. A hit bit in software means the entry is being visited and is worth promoting to hardware. Entries in hardware are aged out and evicted based on an LRU algorithm.
* Aging of hardware entries is adaptive - the aging timeout is chosen based on how full the table is: a sparsely filled table gets a larger timeout while an almost full table gets more aggressive timeouts. I wonder if OVS could adopt something like this.
* The paper shows a glimpse of the layers that sit above the ToR and manage it - that is Bluebird. Bluebird is based on a microservices architecture. Different components send policy down (in the form of a JSON object); a policy describes the desired end state. An orchestrator receives the policy objects and translates them into policies to be applied to different ToRs, which are then taken up by an abstraction layer and translated into the corresponding ToR-specific configuration. (A made-up example of what such a policy object and its translation might look like is given after this list.)
* The paper also gives a hierarchical overview of the Azure cloud. Azure contains a set of regions, each region is divided into availability zones and each zone comprises one or more data centers. The Bluebird service is deployed per availability zone.
* The paper also talks about redundancy, where two SDN ToRs can form an MLAG. It also mentions that the Bluebird service uses an anycast IP to program the ToRs - though it is not very clear how they achieve that with a single IP.
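
To make the L3-only forwarding idea concrete, here is a minimal Python sketch of the decision a gateway ToR could make. This is my own illustration, not the paper's P4 pipeline; the table contents, the dummy MAC value and the packet representation are all assumptions.

```python
from ipaddress import ip_address, ip_network

DUMMY_MAC = "00:00:00:00:00:01"  # inner destination MAC carries no real information

def forward(pkt, local_subnets, arp_table, vtep_table):
    """Illustrative L3-only forwarding decision at a gateway ToR (not the real pipeline)."""
    dst = ip_address(pkt["dst_ip"])
    if any(dst in net for net in local_subnets):
        # Local destination: resolve the real MAC and forward on a local port.
        pkt["dst_mac"] = arp_table[pkt["dst_ip"]]
        return ("local", pkt)
    # Remote destination: only the IP -> remote VTEP mapping matters here; the
    # inner MAC stays a dummy and the remote ToR resolves it after decapsulation.
    pkt["dst_mac"] = DUMMY_MAC
    return ("vxlan", vtep_table[pkt["dst_ip"]], pkt)

# Example with made-up addresses:
local_subnets = [ip_network("10.0.1.0/24")]
arp_table = {"10.0.1.5": "aa:bb:cc:dd:ee:01"}
vtep_table = {"10.0.2.7": "192.0.2.10"}   # destination IP -> remote VTEP
print(forward({"dst_ip": "10.0.1.5"}, local_subnets, arp_table, vtep_table))
print(forward({"dst_ip": "10.0.2.7"}, local_subnets, arp_table, vtep_table))
```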
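
The route cache, hit bits and adaptive aging from the points above can be pictured with the following toy sketch. The capacities, timeouts and the 0.9 scaling factor are numbers I made up for illustration; the paper does not specify its algorithm at this level of detail.

```python
import time

class RouteCache:
    """Toy two-level route cache: a small 'hardware' table backed by a large
    'software' table. Entries are promoted on hits and aged out faster as the
    hardware table fills up. Purely illustrative, not the paper's implementation."""

    def __init__(self, hw_capacity=4, base_timeout=300.0):
        self.hw_capacity = hw_capacity
        self.base_timeout = base_timeout
        self.software = {}   # prefix -> next hop (large, always complete)
        self.hardware = {}   # prefix -> timestamp of last hit (small, hot set)

    def add_route(self, prefix, next_hop):
        self.software[prefix] = next_hop

    def aging_timeout(self):
        # Adaptive aging: the fuller the hardware table, the shorter the timeout.
        fill = len(self.hardware) / self.hw_capacity
        return self.base_timeout * (1.0 - 0.9 * fill)

    def lookup(self, prefix):
        now = time.monotonic()
        if prefix in self.hardware:
            self.hardware[prefix] = now        # refresh the hit bit / timestamp
            return self.software[prefix]
        next_hop = self.software.get(prefix)   # miss: served from the software table
        if next_hop is not None:
            self._promote(prefix, now)
        return next_hop

    def _promote(self, prefix, now):
        # Evict aged-out entries first, then the least recently hit one if needed.
        timeout = self.aging_timeout()
        for p, last_hit in list(self.hardware.items()):
            if now - last_hit > timeout:
                del self.hardware[p]
        if len(self.hardware) >= self.hw_capacity:
            lru = min(self.hardware, key=self.hardware.get)
            del self.hardware[lru]
        self.hardware[prefix] = now
```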
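
Finally, a made-up example of what a declarative policy object and its translation into per-ToR configuration might look like. The field names and values are entirely my own guesses; the paper does not publish the Bluebird schema.

```python
# Hypothetical policy object describing the desired end state for one attachment.
policy = {
    "customer": "tenant-42",
    "virtual_network": {"vni": 73001, "subnet": "10.0.1.0/24"},
    "attachments": [
        {"baremetal": "bm-node-07", "tor": "tor-rack-12", "port": "Ethernet17"},
    ],
}

def to_tor_config(policy):
    """Group the desired state into vendor-agnostic, per-ToR configuration blobs."""
    configs = {}
    for att in policy["attachments"]:
        configs.setdefault(att["tor"], []).append({
            "port": att["port"],
            "vni": policy["virtual_network"]["vni"],
            "subnet": policy["virtual_network"]["subnet"],
        })
    return configs

print(to_tor_config(policy))
```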

All said and done, it is not clear how security policies are programmed on the ToR and realized; the paper does not talk about that. Also, if some advanced forwarding policies are needed (like mirroring customer traffic, redirecting to a service etc.), it is not clear how a ToR can achieve that. It was a nice read, though.