Cumulus – Pure Layer-3 Data Centre 28th September, 2015 · Matt Conran


The challenges of designing a true Layer-3-only data centre surface at
the access layer. Dual-connected servers terminating on separate
Top-of-Rack (ToR) switches cannot have more than one IP address, a
limitation that results in VLAN sprawl, unnecessary ToR inter-switch
links, and uplink broadcast domain sharing. Dinesh Dutt and Cumulus
Networks devised a clever solution that redistributes Address Resolution
Protocol (ARP) entries, avoiding Multi-Chassis Link Aggregation (MLAG)
designs and allowing pure Layer-3 data centre networks. Layer 2 was not
built with security in mind. Introducing a Layer-3-only data centre
removes the security problems associated with large, shared Layer-2
domains. For a brief introduction to Cumulus architecture, kindly visit
my previous post on Cumulus networks.


Are we using the “right” layer 2 protocol?

Layer 1 is the easy layer. It defines the encoding scheme needed to pass
ones and zeros between devices. Things get more interesting at Layer 2,
where adjacent devices exchange frames (Layer-2 packets) for
reachability. Layer-2 addresses, known as MAC addresses, are commonly
used at Layer 2 but are not always needed. The need arises when more
than two devices are attached to the same physical network. Imagine a
device receiving a stream of bits: does it matter whether Ethernet,
native IP or CLNS/CLNP arrives at the “second” layer? The question we
should ask ourselves is whether we are using the “right” Layer-2
protocol.

Many networks implement VLANs to support random IP address assignment
and IP mobility. The switches perform Layer-2 forwarding even though
they might be capable of Layer-3 forwarding. They forward packets based
on MAC addresses within a subnet, yet a Layer-3 switch does not need
Layer-2 information to route IPv4 or IPv6 packets. Cumulus has gone one
step further and made it possible to configure every server-to-ToR
interface as a Layer-3 interface. Their design permits multipath
default-route forwarding, removing the need for ToR interconnects and
for uplinks to share a common broadcast domain.


Bonding Vs ECMP

A typical server environment consists of a single server with two
uplinks. For device and link redundancy, the uplinks are bonded into a
port channel and terminated on different ToR switches, forming an MLAG.
Because this is an MLAG design, the ToR switches need an inter-switch
link. You cannot bond server NICs to two separate ToR switches without
creating an MLAG.




If you don’t want to use an MLAG, other Linux bonding modes are
available on hosts, such as “active | passive” and “active | passive on
receive” (these appear to correspond to the active-backup and
balance-tlb modes of the Linux bonding driver). The “active | passive”
mode is popular as it offers predictable packet forwarding and easier
troubleshooting. The “active | passive on receive” mode receives on one
link but transmits on both. Usually, you can only receive on one
interface, as that is the interface whose MAC address is in the ARP
cache of your neighbors. To prevent MAC address flapping at the ToR
switch, each link transmits with a separate source MAC address. If a
switch receives the same MAC address over two separate interfaces, it
generates a MAC address flapping error. A third mode is available, but
it relies on a trick: sending different ARP replies to different
neighbors. This forces both MAC addresses into the ARP caches of your
neighbors, allowing both interfaces to receive.

In each of the bonding examples, we have a common problem: we can’t
associate one IP address with two MAC addresses. These solutions also
require ToR inter-switch links. The only way around this is to implement
a pure Layer-3 Equal-Cost Multi-Path (ECMP) routing solution between
host and ToR.
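
To make the ECMP idea concrete, here is a minimal Python sketch of per-flow path selection. It is an illustration only: the uplink names and addresses are hypothetical, and SHA-256 is simply a stand-in for whatever hash a real kernel or switch ASIC applies to the packet headers.

```python
import hashlib


def pick_uplink(flow, uplinks):
    """Map a flow's 5-tuple onto one of the equal-cost uplinks."""
    # Hash the whole 5-tuple so every packet of a flow takes the same path.
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    return uplinks[int.from_bytes(digest[:4], "big") % len(uplinks)]


# Hypothetical host with one routed uplink to each ToR.
uplinks = ["eth0 (ToR1)", "eth1 (ToR2)"]

# 5-tuple: source IP, destination IP, protocol, source port, destination port.
flows = [("10.1.1.10", "10.2.2.20", 6, 49152 + n, 443) for n in range(200)]

usage = {name: 0 for name in uplinks}
for flow in flows:
    usage[pick_uplink(flow, uplinks)] += 1

print(usage)  # both uplinks carry a share of the flows
```

Because the choice is made per flow rather than per packet, packets within a flow stay in order while different flows spread across both ToR switches, with no bond and no inter-switch link involved.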


Pure Layer-3 Solution Complexities

Firstly, we cannot have one IP address with two MAC addresses. To
overcome this, we use additional Linux features. Linux supports
unnumbered interfaces, which permits assigning the same IP address to
both interfaces: one IP address for two physical NICs. Next, we assign a
/32 anycast IP address to the host via a loopback interface.
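
As a rough sketch of what that addressing could look like, the Python helper below only builds the iproute2 `ip addr add` command strings; it does not run them. The address 10.1.1.10 and the interface names eth0, eth1 and lo are assumptions for illustration, and placing the same /32 on each NIC is one common way of emulating an unnumbered interface, not necessarily the exact configuration Cumulus recommends.

```python
import ipaddress


def host_address_commands(host_ip, nics, loopback="lo"):
    """Return the `ip addr add` commands for one /32 shared by several NICs."""
    # Validate the address before building any command strings.
    address = ipaddress.ip_address(host_ip)
    prefix = f"{address}/{address.max_prefixlen}"  # /32 for IPv4
    devices = [loopback, *nics]
    return [f"ip addr add {prefix} dev {device}" for device in devices]


# Hypothetical host address and NIC names.
for command in host_address_commands("10.1.1.10", ["eth0", "eth1"]):
    print(command)
```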




Secondly, the end hosts need to send to a next hop that is not on a
shared subnet. Linux allows you to attach an attribute called “on link”
to the received default route. This attribute tells the end host: “I
might not be on a directly connected subnet to the next hop, but trust
me that the next hop is on the other side of this link”. It forces the
host to send ARP requests for the next hop out of that interface,
regardless of common subnet assignment. Together, these techniques
enable the assignment of the same IP address to both interfaces and
permit forwarding to a default route out of both interfaces. Each
interface is in its own broadcast domain. Subnets can span two ToRs
without requiring bonding or an inter-switch link.
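
The effect of “on link” can be sketched in a few lines of Python. The first function is a simplified model of the check a host applies to a next hop, not the kernel's actual logic. The second builds, but does not run, a static multipath default route using the iproute2 `onlink` next-hop flag. The gateway address 10.1.1.1 and the interface names are hypothetical, and in the design described above the default route is received rather than configured by hand.

```python
import ipaddress


def next_hop_accepted(next_hop, connected, onlink=False):
    """Simplified model: is this next hop usable on the given interface?"""
    hop = ipaddress.ip_address(next_hop)
    # Normally the next hop must fall inside a directly connected prefix.
    directly_connected = any(hop in ipaddress.ip_network(p) for p in connected)
    # "onlink" tells the host to trust that the hop sits across the link.
    return directly_connected or onlink


def default_route_command(gateway, nics):
    """Build one multipath default route with an onlink next hop per NIC."""
    hops = " ".join(f"nexthop via {gateway} dev {nic} onlink" for nic in nics)
    return f"ip route add default {hops}"


# The host only owns a /32, so the ToR gateway is never "directly connected".
print(next_hop_accepted("10.1.1.1", ["10.1.1.10/32"]))
print(next_hop_accepted("10.1.1.1", ["10.1.1.10/32"], onlink=True))
print(default_route_command("10.1.1.1", ["eth0", "eth1"]))
```

The same anycast gateway address appears behind both NICs, which is what lets the host load-share its default route across the two ToR switches.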


Normal ARP processing still works. Even though the Layer-3 ToR switch
doesn’t need Layer-2 information to route IP packets, the Linux end host
believes it has to deal with a traditional L2/L3 forwarding environment.
As a result, the Layer-3 switch continues to reply to incoming ARP
requests. The host will ARP for the ToR anycast gateway (even though
it’s not on the same subnet) and the ToR will respond with its MAC
address. The host ARP table will only have one ARP entry because the
default route points to a next hop, not an interface.


Return traffic is slightly different and depends on what the ToR
advertises back to the network. There are two modes. In the first, the
ToR advertises a /24 to the rest of the network, and everything works
fine until the server-to-ToR link fails. It then becomes a Layer-2
problem, because the ToR has already said it can reach the subnet, and
return traffic has to traverse an inter-switch ToR link to get back to
the server. This goes against our earlier design requirement of removing
any ToR inter-switch links. Essentially, you need to opt for the second
mode and advertise a /32 for each host back into the network. Take the
information learnt in ARP, treat it as a host routing protocol, and
redistribute it into the data centre routing protocol, i.e. redistribute
ARP. The ARP table gives you the list of neighbors, and the
redistribution pushes those entries into the routed fabric as /32 host
routes. This lets you advertise only the /32s that are active and
present in the ARP tables. Note that this is not the default mode and is
currently an experimental feature.
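
The redistribution step can be sketched as follows. This is not the Cumulus implementation: it is a small Python illustration that parses sample text in the format printed by `ip neigh show` and derives the /32 host routes a ToR could advertise. The sample entries, the swp port names and the decision to skip FAILED and INCOMPLETE entries are all assumptions.

```python
import ipaddress

# Sample text in the format printed by `ip neigh show` (made-up entries).
SAMPLE_NEIGHBORS = """\
10.1.1.10 dev swp1 lladdr 00:25:90:aa:bb:01 REACHABLE
10.1.1.11 dev swp2 lladdr 00:25:90:aa:bb:02 STALE
10.1.1.12 dev swp3  FAILED
"""

# Assumption: unresolved entries should not become host routes.
UNUSABLE_STATES = {"FAILED", "INCOMPLETE"}


def host_routes_from_neighbors(text):
    """Turn resolved neighbor entries into (host route, port) pairs."""
    routes = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        state = fields[-1]
        # Skip entries that never resolved to a MAC address.
        if "lladdr" not in fields or state in UNUSABLE_STATES:
            continue
        address = ipaddress.ip_address(fields[0])
        port = fields[fields.index("dev") + 1]
        routes.append((f"{address}/{address.max_prefixlen}", port))
    return routes


for prefix, port in host_routes_from_neighbors(SAMPLE_NEIGHBORS):
    print(f"advertise {prefix} (learnt on {port})")
```

If the link to a server fails and its neighbor entry disappears, the corresponding /32 is no longer produced, so only the ToR that still reaches the host attracts its return traffic.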

Additional information on layer-3 data centre for Cumulus at

Scott Suehle, Alum


Posted 2 years ago
