MLAG inside Rack, IP Fabric between racks?

Curious what others are doing in this space.

Working on a possible topology for a new data center buildout -- we are small enough that a single switch serving a rack is not ideal (so we want MLAG to tolerate switch failures and maintenance events), but we also want to be able to scale out appropriately and limit fault domains, so we would like to run an L3 Clos leaf/spine fabric between racks.

The challenge I'm running into is how best to design the failover and routing for that. I am using the VX VMs to lab this up as best I can, and I'm curious whether anyone else has done this or has any insight or input.

What I've thought about so far:

  • iBGP between the two switches in an MLAG pair, eBGP from the MLAG leafs to the spines; each leaf MLAG pair would be its own AS. (Not sure whether it's best for the spines to share one AS or each have their own AS; the published designs out in the wild seem to conflict.) I'm running this in the VM lab environment now and it seems to work alright, although in its current form each spine gets two ECMP routes for the MLAG loopbacks rather than only one. (Rough config sketch after this list.)
  • eBGP everywhere, with every switch as its own AS. The MLAG peers run eBGP with each other and with every spine. Not sure of the implications yet; I haven't labbed it up.
  • Similar to option 2, but instead of eBGP between the MLAG pair, a static default route at a higher administrative distance, so that if an MLAG switch somehow loses all of its eBGP sessions to the spines while its peer still has them, traffic would ride the static default to the peer.
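For reference, here is roughly what I have for option 1 on one leaf in the lab. Addresses, port names, and AS numbers are just placeholders, and the unnumbered syntax may vary with the Quagga version, so treat it as a sketch:

! leaf01 -- MLAG pair shares AS 65101, spines are external
router bgp 65101
 bgp router-id 10.0.0.11
 ! iBGP to the MLAG peer over the peerlink SVI
 neighbor 169.254.1.2 remote-as 65101
 ! eBGP unnumbered up to the spines
 neighbor swp51 interface remote-as external
 neighbor swp52 interface remote-as external
 ! advertise the loopback so it is reachable across the fabric
 network 10.0.0.11/32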

I think the only time any of this comes into play is when one MLAG peer loses its eBGP sessions to every spine while the other peer still has its sessions up; the impaired switch would then need to know to forward to its peer rather than discard the packets.

Anyone have any ideas / comments?



Thanks,

Will

Will McLendon


Sean Cavanaugh, Alum

  • iBGP between the two switches in an MLAG pair, eBGP from the MLAG leafs to the spines; each leaf MLAG pair would be its own AS. (Not sure whether it's best for the spines to share one AS or each have their own AS; the published designs out in the wild seem to conflict.)

No reason to mix iBGP and eBGP. It is possible to use only eBGP in your fabric, as per your 2nd option. Highly recommend reading this. It also means there is no IGP. (Were you doing eBGP and iBGP without an IGP?)
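Per leaf it can be as simple as something like this -- just a sketch, with BGP unnumbered, placeholder loopback/ports/AS numbers, and syntax that depends on the release you are on:

! leaf01 -- its own AS, eBGP unnumbered to both spines, no IGP and no iBGP
router bgp 65101
 bgp router-id 10.0.0.11
 neighbor swp51 interface remote-as external
 neighbor swp52 interface remote-as external
 ! the loopback rides in BGP, so no IGP is needed for reachability
 network 10.0.0.11/32
 ! ECMP across the spines
 maximum-paths 64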

  • I'm running this in the VM lab environment now and it seems to work alright, although in its current form each spine gets two ECMP routes for the MLAG loopbacks rather than only one.
You mean each ToR is represented differently? This is fine. L3 is happening from ToR to spine, so you would expect this. Another option is to use anycast, although I think that makes it more complicated than it needs to be.

  • eBGP everywhere, with every switch as its own AS. The MLAG peers run eBGP with each other and with every spine. Not sure of the implications yet; I haven't labbed it up.
With or without MLAG? What is the actual requirement for MLAG? A lot of people treat MLAG as something they have to do rather than asking whether it is the right network architecture for the problem. Can you possibly route on the host?

Don't peer L3 between the MLAG peers. Use ifplugd to shut down CLAG and VRR when all uplinks go down. Routing between the MLAG peers is a suboptimal config for a corner case (all spine links down on one switch). Furthermore, it results in degraded network behavior that takes more troubleshooting than simply black-holing the traffic. How much do you over-provision the peerlink for this corner-case failure? The traffic going over the peerlink will be unpredictable.
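Roughly, the idea is along these lines, assuming the stock Debian ifplugd layout; the port names, flags, and the peerlink interface name are placeholders, so check the docs for the supported recipe on your release. In /etc/default/ifplugd, point ifplugd at the spine-facing uplinks:

INTERFACES="swp51 swp52"
ARGS="-q -f -u0 -d1 -w -I"
SUSPEND_ACTION="stop"

and drop a hook under /etc/ifplugd/action.d/ along these lines:

#!/bin/sh
# ifplugd calls this with $1 = interface, $2 = up|down
UPLINKS="swp51 swp52"   # placeholder spine-facing ports

any_uplink_up() {
    for u in $UPLINKS; do
        # operstate is "up" only when the port actually has carrier
        [ "$(cat /sys/class/net/$u/operstate 2>/dev/null)" = "up" ] && return 0
    done
    return 1
}

case "$2" in
  down)
      # last uplink just went away: take the peerlink down so clagd/VRR
      # fail over to the MLAG peer instead of black-holing traffic here
      any_uplink_up || ifdown peerlink
      ;;
  up)
      ifup peerlink
      ;;
esac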

  • Similar to option 2, but instead of eBGP between the MLAG pair, a static default route at a higher administrative distance, so that if an MLAG switch somehow loses all of its eBGP sessions to the spines while its peer still has them, traffic would ride the static default to the peer.
Never use static routes if it can be helped. This is such a corner case that you could just add more spines and never hit the problem; it only really becomes an issue in single-spine scenarios (one way out). I like the eBGP option more than the static route :) But ifplugd seems to be the consensus internally after I bounced this thread around.

Will McLendon

The reason I need MLAG is that our footprint is not large enough to tolerate a switch failure (or taking a switch down for maintenance) knocking out an entire rack -- today our entire footprint fits inside a single rack, so I need MLAG for availability. We are looking at scaling beyond a single rack pretty quickly, but not to hundreds of racks or anything, so it will be a fairly small leaf/spine for a while. We're a VMware shop, so no routing to hosts. NSX may be implemented in the future; the exact plan there is unclear. If it is deployed, the Edge gateways would likely peer with the edge routers, not the leaf switches.

I think I like the all-eBGP option best (still haven't had a chance to lab it up yet, however). The ifplugd approach can still be used when all spine interfaces actually go down, to prevent unnecessary peerlink oversaturation (with our current traffic levels that's not likely to be an issue anyway). However, if the failure is a control-plane problem or operator error taking down the leaf-spine eBGP sessions, the eBGP session to the MLAG peer switch helps keep traffic flowing. (Rough sketch of that extra session below.)
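On top of the all-eBGP leaf config, that would just be one more session over the peerlink SVI, something like this (the interface name and AS numbers are placeholders):

router bgp 65101
 ! backup eBGP session to the MLAG peer, which has its own AS in this design
 neighbor peerlink.4094 interface remote-as 65102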

Thanks,

Will

Joshua Bowers

I'm in the same boat as you. Our initial deployment will be eBGP with MLAG facing compute/storage, two spines, and MLAG at the WAN edge. The MLAG facing compute/storage is due to scalability issues with our vSAN, which we will overcome when we add more nodes. The MLAG at the WAN edge is to span VLANs for legacy physical servers in the DMZ. I'll split the leaf MLAG when we add enough nodes for compute/storage, and I'll split the WAN edge MLAG once the physical servers are decommissioned and NSX is overlaid.

Sean Cavanaugh, Alum

I think if you reduce the MLAG down to ToR -> servers there is no problem. I always try my hardest to get rid of any layer 2 where I can, but it does not always make sense.

What is nice is that it is possible to have two Cumulus leaf switches act as a single VTEP, so if you need to bridge L2 domains across your fabric, that becomes possible as well. An NSX solution or our LNV solution will allow hardware VTEPs to work (encapsulation and decapsulation on the Broadcom Trident 2).
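As a rough sketch of the active-active VTEP idea -- this is ifupdown2 syntax from newer releases, so check the VNI, the addresses, and the clagd anycast attribute against the docs for your version:

# /etc/network/interfaces on both MLAG peers
auto lo
iface lo inet loopback
    # loopback address is unique per switch
    address 10.0.0.11/32
    # shared anycast VTEP address, same on both MLAG peers
    clagd-vxlan-anycast-ip 10.0.0.100

auto vni100
iface vni100
    vxlan-id 100
    vxlan-local-tunnelip 10.0.0.11
    bridge-access 100

auto bridge
iface bridge
    bridge-vlan-aware yes
    bridge-ports peerlink vni100
    bridge-vids 100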

Honestly, I would also check whether your NICs can do encap/decap at 10Gbps; if they can't, it makes a lot of sense to use the ToR for encap/decap.

I love these design questions.

Joshua Bowers

Thanks for the reminder that Cumulus supports a lightweight version of VXLAN. I totally forgot. :$

Dele Adeolu

I am working on a similar project to scale out our data center fabric. Fortunately we are a Linux shop with a full-blown OpenStack deployment. After listening to Dinesh Dutt's BGP talk on ipSpace, I am considering migrating to eBGP across the fabric rather than an IGP (OSPF/iBGP), if my testing pans out.


I currently have MLAG on the downlinks, OSPF on the uplinks to the spine, and eBGP at the edge.

Based on what I have gathered thus far, running eBGP with a single ASN at the spine and multiple ASNs on the leafs seems the most efficient design for large-scale deployments.
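As a sketch of that layout (hostnames and AS numbers are just examples):

! spine01 and spine02 share one AS; each leaf (or MLAG pair) gets its own AS
! (the router-id is unique per spine)
router bgp 65100
 bgp router-id 10.0.0.21
 neighbor swp1 interface remote-as external
 neighbor swp2 interface remote-as external
 neighbor swp3 interface remote-as external
 neighbor swp4 interface remote-as external

A nice side effect of the shared spine AS is that anything one spine re-learns from the other carries 65100 in the AS path, so eBGP loop prevention drops it and you never end up with leaf-spine-spine-leaf paths.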
 
I think balancing efficient design and cost is critical irrespective of the size of the deployment. I have to figure out what will work best for me.

All the best,

Thanks