my cheat sheet modern network fabrics

1 Background

For many years layer 2 networks avoided loops with the Spanning Tree Protocol (STP, IEEE 802.1D). STP elects a root bridge, the switch with the lowest bridge ID, and then builds a loop-free, least-cost spanning tree from that root by blocking redundant links.
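As a rough illustration of the root election described above, here is a minimal Python sketch. The `BridgeId` type, the sample priorities, and the MAC addresses are made up for this example, and it assumes all MACs are written in the same lowercase format so that plain string comparison orders them correctly. It only models "lowest bridge ID wins", not the BPDU exchange itself.

```python
from typing import NamedTuple


class BridgeId(NamedTuple):
    """Hypothetical bridge ID: priority first, MAC address as tie-breaker."""
    priority: int
    mac: str


def elect_root(bridges):
    # Tuples compare field by field, so min() picks the lowest priority
    # and falls back to the lowest MAC when priorities are equal.
    return min(bridges)


bridges = [
    BridgeId(32768, "00:1b:54:aa:00:03"),
    BridgeId(4096, "00:1b:54:aa:00:09"),   # priority lowered on purpose
    BridgeId(32768, "00:1b:54:aa:00:01"),
]
root = elect_root(bridges)
print(root)
```

This is also why pinning the root matters: if nobody lowers a priority, the switch with the lowest MAC address wins, which may well be an old access switch.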

Any change to the topology, such as a new switch being inserted, a switch leaving the tree (shut down), or a link state change between bridges, would cause spanning tree to block traffic while root bridge selection and the new spanning tree calculation completed. States for L2 ports were:

  • blocking
  • listening
  • learning
  • forwarding
  • disabled

1.1 BPDU

Switches communicate their state by sending BPDUs (bridge protocol data units).

1.2 STP shortcomings

The two biggest issues with spanning tree protocol are: 1) with default timers, switches typically take 30 to 50 seconds to go from a topology change back to forwarding packets. 2) an STP error on any switch in the network can cause an L2 loop that very quickly (exponentially, in fact) brings the network down, as packets get duplicated until the links are saturated.

1.3 Flooding BUM Traffic

L2 networks let end devices discover other L2 end devices by sending broadcasts. A basic example is an ARP broadcast for an IP address. Switches always flood broadcasts out all ports except the one they arrived on. Unknown unicast packets, where the switch does NOT yet have the destination MAC address in its MAC address (CAM) table, get the same treatment. When a switch sees a packet with a source MAC address it does not yet know, it takes that opportunity to record which port that MAC address arrived on, thus learning and populating a table of where each MAC address should be sent. This is called source-based learning, because it is based on the source MAC address.
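The flood-and-learn behaviour above can be sketched in a few lines of Python. This is a toy model, not how any real switch is implemented: the `LearningSwitch` class and its port numbers are invented for the example, and multicast, VLANs, and table aging are ignored.

```python
BROADCAST = "ff:ff:ff:ff:ff:ff"


class LearningSwitch:
    """Toy model of source-based MAC learning with flooding."""

    def __init__(self, ports):
        self.ports = set(ports)
        self.mac_table = {}  # MAC address -> port it was last seen on

    def receive(self, in_port, src_mac, dst_mac):
        """Process one frame and return the set of ports it is sent out of."""
        # Source-based learning: remember where the sender lives.
        self.mac_table[src_mac] = in_port
        # Broadcast or unknown unicast: flood out every port but the ingress.
        if dst_mac == BROADCAST or dst_mac not in self.mac_table:
            return self.ports - {in_port}
        out_port = self.mac_table[dst_mac]
        # Known unicast: forward on one port (never back out the ingress).
        return set() if out_port == in_port else {out_port}


sw = LearningSwitch([1, 2, 3])
print(sw.receive(1, "aa:aa:aa:aa:aa:aa", BROADCAST))            # flooded
print(sw.receive(2, "bb:bb:bb:bb:bb:bb", "aa:aa:aa:aa:aa:aa"))  # known unicast
```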

1.4 Taming STP

Throughout the years many ancillary techniques were used to make STP more resilient. Techniques used:

  • rapid spanning tree (RSTP), which took convergence from tens of seconds down to a few seconds of outage by pre-calculating alternative links to bring up immediately when the main link goes down. It standardized ideas from these earlier Cisco enhancements to classic STP:
    • uplink fast
    • port fast
    • backbone fast
  • fixing which switch would be the root switch. Root switches should be the actual core switches where traffic should flow through, and not some edge access switch that does not have the capacity to handle all traffic
  • restricting which ports will even accept BPDUs, thus preventing STP recalculation
  • PVST+ (per VLAN spanning tree), which runs a separate spanning tree per VLAN so different VLANs can use different links
  • parallel link or dual switch protocols that avoid spanning tree on certain links. Examples include:
    • vPC (virtual port channel)
    • MCEC (multi-chassis EtherChannel)
    • LAG (link aggregation group, negotiated with LACP)
    • MLAG (multi-chassis LAG)
    • EtherChannel
    • VSS (virtual switching system)

1.5 Avoid STP altogether

A newer approach is to move away from STP altogether. Whatever replaces it still has to handle these fundamental tasks:

  1. Must allow for redundant paths with automatic failover for resiliency
  2. Must allow for parallel links when more bandwidth is needed
  3. Must allow automatic discovery of end nodes and broadcast traffic
  4. Must learn where each new MAC address seen on the network should be sent, ideally without resorting to BUM traffic flooding.

2 STP replacement

There are several ways to approach the problem, most often starting with proprietary methods, but typically moving to standards-based approaches when they become available.

The following table shows some key techniques and their approach

Method        Encapsulation             Company /         Control    New MAC addr learning method        Default gw   Multipath
                                        standard          plane
FabricPath    MAC in MAC                Cisco             L3 IS-IS   2 FTag trees are built by IS-IS.    to ODMAC     Up to 16
              L2 in L2 (sw. to sw.)     (TRILL +)                    BUM is flooded to 1 FTag and                     ECMP
                                                                     multicast flooded to other FTag.
LISP          IP in IP                                    IGP        EIDs to RLOC mapping saved in
              by ETR and ITR                                         MS (mapping server)
VXLAN         MAC in IP (UDP)           IETF (RFC 7348)              source VTEP sends unknown MAC to    anycast
              VTEPs encap L2 frames                                  multicast group. All VTEPs join
              with VNID and source                                   this group. Correct dest VTEP
              VTEP IP                                                replies (if on same VNID and
                                                                     knows dmac)
VXLAN         MAC in IP                 IETF (newer)      MP-BGP     Uses head end replication, in the
(BGP EVPN)                                                           control plane. The VTEP
                                                                     duplicates/copies the BUM packet
                                                                     and sends to all other VTEPs.
                                                                     (multicast scales better)
VXLAN GPO
SDA                                                       LISP

3 Conversational learning

A switch learns and stores a MAC address ONLY for packets that actually pass through it. Packets strictly traversing other switches are never learned. This allows for greater scale, faster learning, and better performance.

For locally connected MAC addresses, the switch learns the MAC address as usual. For remote MAC addresses (from devices connected to other FabricPath switches), the switch does NOT bother to learn a remote source MAC address unless the destination MAC (dmac) is already known (typically local). Only when a packet arrives with a known destination MAC address will the switch record the source MAC address as well, thereby only remembering addresses that are actually used by the switch. This matches the 3-way handshake nicely.

This automatically defeats the MAC address flooding denial of service attack, where a compromised host floods broadcasts with random source MAC addresses. Regular STP switches would fill their MAC address tables with these bogus MAC addresses.
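The learning rule above can be sketched as follows. This is a simplified model with invented names (`ConversationalSwitch`, the `S20`/`S30` switch IDs), and it assumes "known destination" means a MAC already learned on a local port.

```python
class ConversationalSwitch:
    """Toy model of conversational MAC learning on a fabric edge switch."""

    def __init__(self):
        self.local = {}   # MAC -> local port
        self.remote = {}  # MAC -> remote switch the MAC sits behind

    def learn_local(self, mac, port):
        # Locally attached hosts are learned the traditional way.
        self.local[mac] = port

    def receive_from_fabric(self, src_mac, dst_mac, remote_switch):
        """Learn the remote source only if the destination is a known local MAC."""
        if dst_mac in self.local:
            self.remote[src_mac] = remote_switch
            return True
        return False


sw = ConversationalSwitch()
sw.learn_local("aa:aa:aa:aa:aa:aa", port=1)

# A real conversation towards a local host: the remote source is learned.
sw.receive_from_fabric("11:11:11:11:11:11", "aa:aa:aa:aa:aa:aa", "S20")

# A flood of broadcasts with random source MACs: nothing is learned.
for i in range(1000):
    sw.receive_from_fabric(f"bogus-{i}", "ff:ff:ff:ff:ff:ff", "S30")

print(len(sw.remote))
```

Only the one remote address that actually talked to a local host ends up in the table, no matter how many bogus sources were flooded.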

4 VXLAN

The core is an L3 network that passes UDP packets from the source to the destination VTEP (VXLAN Tunnel End Point). The ingress VTEP encapsulates the L2 frames in IP/UDP packets with the destination IP address of the egress VTEP. The egress VTEP strips the outer IP, UDP, and VXLAN headers and forwards the original L2 frame to the destination.
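To make the encapsulation a bit more concrete, here is a minimal Python sketch of just the 8-byte VXLAN header that sits between the outer UDP header and the original L2 frame. The function names are made up for this example, and the outer Ethernet, IP, and UDP headers are not built here. The header layout (flags byte with the VNI-valid bit, 24-bit VNI) and UDP port 4789 follow RFC 7348.

```python
import struct

VXLAN_UDP_PORT = 4789        # IANA-assigned destination port
VXLAN_FLAG_VNI_VALID = 0x08  # "I" flag: the VNI field is valid

# Outer Ethernet (14) + outer IPv4 (20) + UDP (8) + VXLAN (8), no VLAN tag.
VXLAN_OVERHEAD_BYTES = 14 + 20 + 8 + 8


def vxlan_header(vni: int) -> bytes:
    """Build the 8-byte VXLAN header for a given VNI."""
    if not 0 <= vni < 2 ** 24:
        raise ValueError("VNI must fit in 24 bits")
    # Word 1: flags (8 bits) + reserved (24 bits).
    # Word 2: VNI (24 bits) + reserved (8 bits).
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)


def encapsulate(inner_frame: bytes, vni: int) -> bytes:
    """Prepend the VXLAN header; this is what becomes the UDP payload."""
    return vxlan_header(vni) + inner_frame


def decapsulate(udp_payload: bytes):
    """Return (vni, inner_frame) as the egress VTEP would recover them."""
    _flags_word, vni_word = struct.unpack("!II", udp_payload[:8])
    return vni_word >> 8, udp_payload[8:]


payload = encapsulate(b"original L2 frame", vni=5000)
print(decapsulate(payload))
```

The 50 bytes of overhead are the reason the underlay MTU has to be raised above the usual 1500.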

But how does the ingress VTEP know that the destination MAC addr of the L2 packet should be sent to a particular egress VTEP? The next section on address learning shows you.

4.1 Address Learning

VXLAN can use data plane learning, as in traditional Ethernet, where BUM traffic is flooded out all interfaces. Or VXLAN can use control plane learning, which is more complicated but much more efficient. Control plane learning uses BGP to distribute reachability information to all BGP peers, but instead of IP prefixes, it is L2 MAC addresses and ARP info that is distributed.

Remember that you can split up L2 packets into two types: I, known unicast traffic, where the packet has a destination MAC that is known, and II, BUM traffic, which is broadcast, unknown unicast, and multicast traffic.

Known unicast traffic by definition is known, so a simple lookup tells the switch where to send traffic.

VXLAN handles BUM traffic in one of two ways: through multicast, or through headend replication.

4.1.1 Multicast for BUM traffic

Each VXLAN segment, identified by its VNI (VXLAN Network Identifier), is mapped to a multicast group. For example:

VNI Mcast Group
678 224.0.2.5
17 224.0.2.17
1501 224.0.2.1
41214 224.0.2.14
18 224.0.2.18

Each VNI maps to a single Mcast group. However, multiple VNIs can map to the same multicast group, i.e. the Mcast groups can be shared.

For each VNI present on a VTEP, that VTEP will join the corresponding Mcast group, and only those multicast groups.

Finally, an ingress VTEP that receives a BUM packet on a VNI will forward that BUM packet to the Mcast group mapped to that VNI.
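The three rules above (one group per VNI, join only the groups for local VNIs, send BUM to the group of the ingress VNI) can be sketched like this. The mapping reuses the example table above, and the function names are made up for illustration.

```python
# VNI -> multicast group, taken from the example table above.
VNI_TO_GROUP = {
    678: "224.0.2.5",
    17: "224.0.2.17",
    1501: "224.0.2.1",
    41214: "224.0.2.14",
    18: "224.0.2.18",
}


def groups_to_join(local_vnis):
    """A VTEP joins only the groups for the VNIs it actually hosts."""
    return {VNI_TO_GROUP[vni] for vni in local_vnis}


def bum_destination(vni):
    """Outer destination IP used for a BUM frame received on this VNI."""
    return VNI_TO_GROUP[vni]


print(groups_to_join([17, 18]))
print(bum_destination(678))
```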

4.1.2 Headend Replication for BUM traffic

The ingress VTEP is also called the "Headend" VTEP. So when the ingress VTEP receives a BUM packet on a particular VNI, it makes a copy of that packet, or "replicates" the packet as a unicast packet for each remote VTEP that has that same VNI.

Because the ingress VTEP makes one copy of each BUM packet for every remote VTEP on that same VNI, this really only works for smaller networks, up to 20 VTEPs. To scale to bigger networks, use Mcast. Headend replication is easier to configure though, as you do not need any multicast setup.
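A small Python sketch of headend replication, to show where the scaling cost comes from. The membership table and VTEP addresses are hypothetical, and in a real BGP EVPN fabric the list of remote VTEPs per VNI would come from the control plane rather than a static dict.

```python
def headend_replicate(frame, vni, ingress_vtep, vni_membership):
    """Return one (destination VTEP, frame) unicast copy per remote VTEP on the VNI."""
    remote_vteps = sorted(vni_membership.get(vni, set()) - {ingress_vtep})
    return [(vtep, frame) for vtep in remote_vteps]


# Hypothetical membership: which VTEPs host which VNI.
membership = {
    5000: {"1.1.1.1", "2.2.2.2", "3.3.3.3"},
    6000: {"1.1.1.1", "3.3.3.3"},
}

copies = headend_replicate(b"arp request", 5000, "1.1.1.1", membership)
print(copies)
```

Every BUM frame costs the ingress VTEP one unicast copy per remote VTEP, whereas with multicast it sends a single packet and lets the underlay replicate it.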

4.2 Data Plane Learning

BUM packets arrive on the data plane on a given VNI, and are flooded out to a multicast group for that particular VNI. Thus each remote VTEP that has joined that multicast group will receive that BUM packet.

A couple of drawbacks to data plane learning are:

  1. ONLY works for bridged, L2 traffic
  2. To route to a different subnet (different VNI) you have to use an external router, which means you have to trombone to a VTEP that supports L3 interfaces on multiple VNIs
  3. Less secure, as a rogue VTEP could announce itself as a default gw for internet traffic and thus insert itself as a man-in-the-middle router for traffic, i.e. in data plane learning there is no authentication of VTEPs.

4.2.1 Configuration with Data Plane learning

This is easier, but not preferred. See Configuring Control Plane Learning for a more scalable solution.

vxlan-dataplane-learning.PNG

Figure 1: VXLAN Dataplane Learning
  1. enable features
feature ospf    # needed by the underlay
feature pim     # needed for multicast
feature nv overlay   # enables VXLAN (network virtualization)
# ignore the warning about routing template

feature vn-segment-vlan-based   # lets you tag frames with VXLAN header


router ospf 50

  2. set MTU
system jumbomtu 9216

system routing template-vxlan-scale  # not needed with newer NXOS versions
# will need save config and reboot router for this change.
  3. map VLAN to VNI
vlan 555
  # vlan 555 for regular hosts
  vn-segment 5000      # maps vlan 555 to vxlan 5000
  4. routed link (underlay)
interface eth 1/24
no switchport                    # makes it a routed link
ip address 10.250.0.1/30
ip router ospf 50 area 0         # routing for the underlay
ip pim sparse-mode               # needed for multicast over the underlay
no shut

At this point the ospf backbone should be up.

  5. host port. Nothing special here, just a plain access port on a vlan
int eth 1/1
description end device, maybe a laptop or a server.  No VXLAN configs here
switchport
switchport access vlan 555
no shut
  6. loopback interfaces
# needed for both rendezvous point for multicast and VTEP ip address

interface loopback 0
description  VTEP gets its ip address from the loopback interface
ip address 1.1.1.1/32
ip router ospf 50 area 0
ip pim sparse-mode


interface loopback 1
description for multicast rendezvous point  #SAME on all hosts i.e. anycast ip addr
ip address 172.17.17.17/32
ip router ospf 50 area 0
ip pim sparse-mode
  7. setup multicast for the underlay:

7.1) setup the rendezvous point, this time based on the loopback 1 interface 172.17.17.17. The loopback 1 interface has the SAME 172.17.17.17 address on all switches, as it is an anycast RP address.

ip pim rp-address 172.17.17.17 group-list 224.0.0.0/4   # same on all VTEPs

# the anycast ip address is bound to the real ip address.  So assuming the
# other two VTEPs have addresses 2.2.2.2 and 3.3.3.3 (loopback 0) then:
ip pim anycast-rp 172.17.17.17 1.1.1.1
ip pim anycast-rp 172.17.17.17 2.2.2.2
ip pim anycast-rp 172.17.17.17 3.3.3.3
  8. create the VTEP interface that encapsulates the L2 traffic. This interface config would be identical on all VTEPs, i.e. they each have an nve1 interface mapped to the local loopback 0 interface, and each VTEP participating in VNI 5000 needs to join the same mcast group, in this case 230.1.55.55
    interface nve1           # on the Nexus platform
      description ip addr is from loopback 0  Also where VNIs mapped to multicast groups
      no shut
      source-interface loopback 0
      member vni 5000 mcast-group 230.1.55.55      # where BUM traffic is sent 
    
  9. Confirming:
    show nve interface
    
    show nve peers
    # initially won't show any peers, until a host initiates a broadcast.  Then
    # that broadcast is flooded onto the multicast group and the VTEPs will see
    # each other.
    
    show nve vni
    # will show the multicast group associated with the VNI and its state
    # mode is DP for data plane learning
    

    Other troubleshooting commands:

    show nve interface
    show nve vni
    show nve peers    # shows VTEP peers, but these are cached and will time-out
                      # with no traffic.
    

4.3 Control plane learning

Switches learn MAC addresses ahead of time, using BGP. Each VTEP takes any new MAC address that it learns locally and advertises it to all of its BGP peers. Thus every VTEP knows every MAC address on every other VTEP, for each VNI that it serves.

4.3.1 BGP address families

Some background on BGP. BGP can share reachability information for different protocols. These are called address families. The first one was IPv4, followed by IPv6 and the VPN address families for MPLS. BGP also supports the EVPN address family.

4.3.2 BGP control plane learning

Each VTEP runs BGP and peers with every other VTEP using iBGP, i.e. a full mesh, which means n-1 peers per VTEP and n*(n-1)/2 sessions in total. Beyond 8 or 9 VTEPs this full mesh becomes too cumbersome, but BGP also supports route reflectors, where each VTEP peers with only two or a few BGP peers, known as route reflectors. The route reflectors handle replicating the learned reachability info to all of their peers, so all VTEPs get all the information without having to create the n-squared peering.
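The session counts are easy to check with a few lines of Python. This sketch assumes the route reflectors are dedicated devices (for example spine switches, not VTEPs themselves) that are fully meshed with each other; the function names are made up for the example.

```python
def full_mesh_sessions(n):
    """iBGP sessions needed when every one of n VTEPs peers with every other."""
    return n * (n - 1) // 2


def route_reflector_sessions(n_vteps, n_reflectors=2):
    """Sessions when each VTEP peers only with each route reflector.

    Assumes dedicated reflectors that also peer with each other.
    """
    return n_vteps * n_reflectors + full_mesh_sessions(n_reflectors)


for n in (4, 9, 20, 100):
    print(n, full_mesh_sessions(n), route_reflector_sessions(n))
```

Full mesh grows with the square of the number of VTEPs, while the route reflector design grows linearly, which is why it takes over somewhere around the 8 or 9 VTEP mark.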

When VTEPs are learned via BGP, they get authenticated, and are thus trusted. Any other VTEP coming online is not trusted. That is another benefit of using control plane learning via BGP.

4.3.3 ARPs

When a host sends out an ARP request, the local switch, i.e. the ingress VTEP, suppresses that ARP and just replies immediately to the host asking. It can do that because it already knows both where the destination MAC resides and its IP address. Thus the reachability info that each VTEP learns is both the MAC address and the IP address related to that MAC address.

So when an ingress VTEP sees a host for the first time, typically through an ARP request, it sends both the MAC address and IP address to all of its BGP peers, along with its own VTEP ID, so all other VTEPs know both the MAC address and the ARP information.
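ARP suppression boils down to a lookup in the MAC/IP table that BGP has already populated. The following Python sketch uses invented names (`ArpSuppressionCache`, `learn_from_bgp`) and sample addresses; it only shows the decision "answer locally or fall back to BUM handling", not actual ARP packet handling.

```python
class ArpSuppressionCache:
    """Toy model of the MAC/IP bindings a VTEP learns through BGP EVPN."""

    def __init__(self):
        self.bindings = {}  # (vni, ip) -> (mac, vtep that advertised it)

    def learn_from_bgp(self, vni, ip, mac, vtep):
        self.bindings[(vni, ip)] = (mac, vtep)

    def handle_arp_request(self, vni, target_ip):
        """Return the MAC to answer with, or None if the ARP must be flooded."""
        entry = self.bindings.get((vni, target_ip))
        return None if entry is None else entry[0]


cache = ArpSuppressionCache()
cache.learn_from_bgp(6006, "192.168.20.50", "00:50:56:aa:bb:cc", "2.2.2.2")

print(cache.handle_arp_request(6006, "192.168.20.50"))  # answered locally
print(cache.handle_arp_request(6006, "192.168.20.99"))  # unknown, falls back to BUM
```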

4.3.4 BUM traffic when control plane learning is used

Although every VTEP will have learned every MAC and IP address on each VNI, there are still times when a host will send a broadcast, unknown unicast, or multicast packet. VXLAN will use either 1) Headend Replication for BUM traffic or 2) Multicast for BUM traffic to send BUM traffic to all remote VTEPs.

4.4 Configuring Control Plane Learning

This configuration is slightly more complex than Configuration with Data Plane learning. In particular, we will use headend replication (not multicast).

vxlan-controlplane-learning.PNG

Figure 2: VXLAN Controlplane Learning

4.4.1 Configure the underlay

  1. enable features
feature ospf    # needed by the underlay

router ospf 50

  2. set MTU
system jumbomtu 9216
  3. routed link (underlay)
interface eth 1/24
description routed backbone for underlay in OSPF area 0
no switchport                    # makes it a routed link
ip address 10.250.0.1/30
ip router ospf 50 area 0         # routing for the underlay
no shut
  4. loopback interfaces
# needed for the VTEP ip address

interface loopback 0
description  VTEP gets its ip address from the loopback interface
ip address 1.1.1.1/32
ip router ospf 50 area 0

At this point the ospf backbone should be up.

4.4.2 configure the overlay

This time we need BGP, as that will share EVPN reachability info between VTEPs, i.e. each MAC addr and ARP entry of the local VTEP gets disseminated to all other VTEPs.

  1. Turn on features we need
feature bgp
feature interface-vlan     # same as when we want to create SVIs (switched
                           # virtual interfaces)  Here we  use it to
                           # create a virtual  anycast gateway interface
feature vn-segment-vlan-based   # lets you tag frames with VXLAN header
feature nv overlay         # enables VXLAN (network virtualization)
nv overlay evpn            # adds the evpn address family 

  2. Configure the anycast gateway on all VTEPs. It is the SAME vmac on all switches, so I like to pick one that is easy to remember.
fabric forwarding anycast-gateway-mac dead.beaf.cafe
  3. create the VTEP interface that encapsulates the L2 traffic

This interface config would be identical on all VTEPs, i.e. they each have an nve1 interface mapped to the local loopback 0 interface. No multicast group is needed here, since this setup uses headend replication; the per-VNI settings are added to nve1 in the tenant section.

interface nve1           # on the Nexus platform
  description nve1 ip addr is from loopback 0    Also use bgp to learn hosts
  source-interface loopback 0
  host-reachability protocol bgp    # enables control plane learning on VTEP
  no shut
  4. each VTEP needs to be running BGP
# note: 65535 is used here as a private AS number, but strictly it is
# reserved; the private 16-bit range is 64512-65534
router bgp 65535
router-id 1.1.1.1   # i.e. the loopback0 interface
neighbor 2.2.2.2    # statically specify the other VTEPs that you will peer with
  remote-as 65535   # same AS number so will be iBGP
  update-source loopback0
  address-family l2vpn evpn       # tells bgp to share L2 reachability to peer
    send-community                # uses communities to send L2 reachability
    send-community extended       # and route target info to the bgp neighbors
neighbor 3.3.3.3
  remote-as 65535
  update-source loopback0
  address-family l2vpn evpn       # tells bgp to share L2 reachability to peer
    send-community
    send-community extended
  

4.4.3 Add a tenant

  1. create a L3VNI by creating a vlan first. This allows routing on the local switch and lets you attach it to a VRF. In this case the L3VNI is 660000 and its associated vlan is 66
    # on both vtep a and vtep b for Pepsi L3VNI
    vlan 66
      description vlan for Pepsi L3VNI 660000   Will have a matching int vlan 66
      vn-segment 660000
      # vrf member is configed on the matching l3 int vlan 66
    
    
    # on both vtep a and vtep b for Coke L3VNI
    vlan 99
      description vlan for Coke L3VNI 990000  Will have a matching int vlan 99
      vn-segment 990000
      # vrf member is configured on the matching l3 int vlan 99
    
    
    # Identical on all VTEPS that need that vrf
    
  2. create a VRF per tenant, with route distinguishers (RD) and route targets (RT)
vrf context Pepsi
  vni 660000           # adds the Pepsi L3VNI 660000 to the Pepsi VRF
  rd auto
  # adding route distinguishers to keep the vrf unique in the bgp database.
  # If you are peering with non-cisco, best to manually set the rd value, 
  # Here we can leave it auto, which lets the switch assign the rd automatically

  # route targets are used to both import and export routes and in this case
  # evpn addresses for Pepsi
  address-family ipv4 unicast
     route-target both auto     # auto lets the switch pick a unique value
     route-target both auto evpn

  # identical on vtep 1 and vtep 2, just not shown here.
 
vrf context Coke
  vni 990000
  rd auto
  address-family ipv4 unicast
     route-target both auto
     route-target both auto evpn
     

Recapping BGP: BGP RD and RT let you control sharing of routing info and MAC addresses between peers, while still keeping them isolated to a particular VRF/tenant. It works like this:

a) exporting marks the routes from the local vrf bgp database with the route target value you want for a tenant

b) bgp peers receive these routes and look for the RT values that they would like to import for the tenants on this VTEP.

c) the imported routes are thus added to the appropriate vrf (based on the matching RT markings)
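The export/import steps a) to c) can be modelled in a few lines of Python. This is only a sketch: the RT strings are hypothetical values in ASN:VNI form (with `auto` the switch picks the real ones), the prefixes are samples, and real BGP carries RTs as extended communities rather than plain strings.

```python
def export_routes(prefixes, export_rt):
    """Step a): tag every route leaving a VRF with that VRF's export RT."""
    return [{"prefix": prefix, "rt": export_rt} for prefix in prefixes]


def import_routes(received, import_rt_by_vrf):
    """Steps b) and c): place each received route in the VRF whose import RT matches."""
    tables = {vrf: [] for vrf in import_rt_by_vrf}
    for route in received:
        for vrf, rt in import_rt_by_vrf.items():
            if route["rt"] == rt:
                tables[vrf].append(route["prefix"])
    return tables


PEPSI_RT = "65535:660000"  # hypothetical value
COKE_RT = "65535:990000"   # hypothetical value

# What a remote VTEP advertises for its two tenants.
advertised = (export_routes(["192.168.20.0/24"], PEPSI_RT)
              + export_routes(["172.16.99.0/24"], COKE_RT))

# A VTEP hosting both tenants keeps them apart; one hosting only Pepsi
# simply ignores the Coke routes.
both = import_routes(advertised, {"Pepsi": PEPSI_RT, "Coke": COKE_RT})
pepsi_only = import_routes(advertised, {"Pepsi": PEPSI_RT})
print(both)
print(pepsi_only)
```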

  3. create the svi interfaces for any l3vni vlans, in this case 66 (and later 99)
int vlan 66
   no shut
   vrf member Pepsi
   ip forward        # essentially turns on routing 


int vlan 99
   no shut
   vrf member Coke
   ip forward        # even if this is a l2 ethernet interface, you will need this
                     # to route to other subnets, like the internet.
   # if truly you only need bridging here then  no need for "ip forward"  confirm this
  4. configure the VTEP, which is done differently from the data plane approach, including ARP suppression and head-end replication
    int nve1
      description VTEP or network virtualized edge, nve, switch
      member vni 6006 
         suppress-arp   # this lets the local switch answer ARP requests
                        # directly, using the MAC/IP info learned via BGP
                        # (all VTEPs also have their anycast gateway locally)
         ingress-replication protocol bgp   # this is the head-end replication
    
      member vni 6007
         suppress-arp
         ingress-replication protocol bgp
    
      member vni 660000 associate-vrf    # tells the VTEP that the vni 660000
                                            # is used for routing 
    
      member vni 9009
         suppress-arp
         ingress-replication protocol bgp
    
    
      member vni 990000 associate-vrf
    
    
    
    

    Continuing with BGP additions:

    router bgp 65535
       vrf Pepsi
           address-family ipv4 unicast
              advertise l2vpn evpn
              # this allows external routes to be advertised as L3 routes within
              # the vrf, in  this case Pepsi.
    # identical on all vteps with Pepsi vrf
    
  5. create client vlans and associated vnis, i.e. VXLAN network identifiers
# on vtep-a
vlan 666 
  description part of Pepsi clients
  # this is a client vlan (L2VNI) for "Pepsi".  I add int vlan 666 later
  vn-segment 6006      # maps vlan 666 to vxlan 6006  i.e. vni 6006

# on vtep-b
vlan 667
  description part of Pepsi clients
  # this is a client vlan (L2VNI) for "Pepsi".  I add int vlan 667 later
  vn-segment 6007      # maps vlan 667 to vxlan 6007  i.e.  vni 6007


# note: vtep-b could also have vlan 666, but for Pepsi they wanted two separate
# vlans, thus routing between vlan 666 and vlan 667

  6. create the L3 interface vlan (i.e. SVI) that is associated with the L3VNI as well as a customer's VRF
    int vlan 666
      description the SVI for Pepsi, i.e. vlan 666 that was created earlier
      no shut
      vrf member Pepsi
      ip address 192.168.20.1/24
      fabric forwarding mode anycast-gateway    # same anycast gw on all vteps
    
    # identical on vtep 1 and vtep 2
    
    int vlan 667
      description the other SVI for Pepsi, i.e. vlan 667 that was created earlier
      no shut
      vrf member Pepsi
      ip address 192.168.22.1/24
      fabric forwarding mode anycast-gateway    # same anycast gw on all vteps
    
    
    int vlan 900
       description Coke vlan
       no shut
       vrf member Coke
       #  ip addr for anycast gateway is added even if Coke does not need to
       #  route between vlans.  That is because the anycast gateway is always
       #  the default gw to get to any outside L3 network
       ip addr 172.16.99.1/24
       fabric forwarding mode anycast-gateway 
    
    
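  7. configure anycast gateway (already done above: the shared gateway MAC is set with "fabric forwarding anycast-gateway-mac" and each client SVI uses "fabric forwarding mode anycast-gateway")
  8. finally configure the EVPN i.e. vteps for each tenant
    evpn
    # used to advertise MAC address reachability in BGP
      vni 6006 l2    # l2 information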
          rd auto
          route-target import auto 
          route-target export auto
    # identical on all vteps
    
      vni 6007 l2
          rd auto
          route-target import auto
          route-target export auto
    
      vni 9009 l2
          rd auto
          route-target import auto
          route-target export auto 
    
  9. configure edge ports connecting to end devices
int eth 1/1
description Pepsi end device, maybe a laptop or a server.  No VXLAN configs here.
switchport
switchport access vlan 666
no shut

int eth 1/2
description Coke end device, maybe a laptop or a server.  No VXLAN configs here.
switchport
switchport access vlan 900
no shut

int eth1/3
description  other vlan on the Pepsi client facing interface
switchport
switchport access vlan 667
no shut
  10. Confirming:
    show nve interface
    
    show nve peers
    # with BGP EVPN the peers are learned through the control plane, so they
    # should show up once EVPN routes have been exchanged, without waiting
    # for a host broadcast.
    
    show nve vni
    # will show each VNI and its state
    # mode should be CP for control plane learning
    
    


5 Routing and multitenancy with VXLAN EVPN

VXLAN EVPN supports IRB (integrated routing and bridging) as well as multitenancy.

5.1 IRB

With data plane learning, as discussed above, you need an external router when traffic passes from one VLAN/subnet to another. With VXLAN BGP EVPN, i.e. control plane learning, both layer 2 VNIs (L2VNI) and layer 3 VNIs (L3VNI) can be created. You specify the type of VNI when you create it. L2VNIs pass traffic within the same L2 VNI, i.e. bridging traffic. L3VNIs let you route traffic from one VNI to another VNI. L3VNIs are optional, as you can still use external routers to route traffic, but if you want the local VTEP to route traffic, then you need an L3VNI.

Each VTEP needs to know only the L2VNIs that are local to it. However, all VTEPs need to know all L3VNI networks. This is a requirement for all VTEPs to support anycast gateway, where all VTEPs present themselves as the L3 gateway to local hosts, and then route traffic to the L3 destination.

5.2 Anycast gateway

With L3VNIs each VTEP presents itself as the L3 gateway to end hosts. All VTEPs present the same IP address and MAC address for each anycast gateway, so hosts can easily move between VTEPs and retain their ARP table. VMs can move between VTEPs very easily. They do not even need to re-ARP for their default gateway.

5.3 Multitenancy

To create an isolated network, say for Pepsi, simply map the Pepsi L3VNI to the Pepsi VRF, and the Coke L3VNI to the Coke VRF.

                    vrf-1              vrf-2              vrf-3
customer/tenant     pepsi              coke               maple leaf foods
                                       vni 9000
                                       L3VNI 99000
route               65000:5            65000:10           65000:20
distinguisher
route targets       export 65000:10    export 65000:10    export 65000:10
                    import 65000:10    import 65000:10    import 65000:10
l3vni               660000             990000             1111111
associated vlan     66                 99                 111
client vlans        666, 667           900                not yet configured
client ip addr      192.168.20.0/24    192.168.20.0/24    10.55.66.0/24
subnets             192.168.22.0/24

Many VNIs can be associated with a tenant, or customer.

** unfinished. check back later.