VMware has multiple SDN solutions in its portfolio, all under the NSX umbrella. Today I’m going to talk about NSX-T, the go-to SDN solution for multi-cloud and container-based environments.
NSX-T is still under development and lacks some of the features that NSX-V already has. BUT, from a networking perspective NSX-T already delivers features that NSX-V doesn’t have (and never will). For example, NSX-T gives you much greater control over the BGP configuration (for instance AS-path prepending) in comparison to NSX-V, which doesn’t allow you to do all the fancy BGP stuff.
One of the most requested features is a multi-site active/active architecture, which cannot be easily fulfilled by SDN solutions (in general). I will explain this in more detail in the paragraphs below.
If you’re familiar with stretched and overlay networks you can skip the following paragraph (you can use your time wisely for important stuff).
Traditional Multi-site challenges
Here comes a brief history of traditional multi-site challenges “a.k.a. where are we coming from”.
In traditional multi-site network topologies, sites were connected with layer 3 (routed) links, which means that each site was identified by its own IP space. This imposes major challenges for applications:
- When migrating workloads between sites, they will need to change their IP address.
- All in-place firewall rules must be changed to accommodate the modified IP address.
- Load-balancer configuration must be changed to accommodate the modified IP address.
- Security policies must be (manually) recreated on the secondary site, as the sites are operated in a stand-alone manner.
The above-mentioned challenges can be overcome by stretching (or bridging) the layer 2 networks over both sites, which allows a workload to keep the same IP address when it is moved to the other site.
But stretching networks brings new challenges: in a stretched (redundant) network, loops must be prevented, otherwise the datacenter interconnects become saturated with packets looping over the redundant links. This problem can be solved with Spanning Tree, which basically disables one of the redundant links and allows only one active path between the sites at any time. Yup, there goes your available bandwidth.
Meanwhile, new technologies have been developed to overcome the looping problem, which basically bundle the available links into one logical connection between sites.
Also, with stretching/bridging the failure domain is stretched over both sites: a broadcast storm can take out both sites and cause a total blackout/outage.
So, as you may already understand (or already know): from a network perspective stretching network topologies must be kept to a minimum for the sake of the overall availability.
Multi-site SDN opportunities and challenges
“SDN” and “overlay networks” usually go hand in hand or are mentioned in the same sentence. SDN and overlay networks can create excellent opportunities for multi-site deployments (but there are some caveats):
An overlay network enables you to create a layer 2 network on top of (and decoupled from) any layer 3 routable network, which allows you to stretch a broadcast domain over multiple sites independent of the underlay network (hardware) or layer 3 topology. A workload is connected to the overlay network and can be seamlessly transported or migrated between sites. The overlay network is usually created by an SDN, but this is not mandatory (it can also be delivered by the underlay hardware itself).
For an overlay network to be functional, an (uplink) connection between the underlay and the overlay network must be realized. This can be done through bridging (by connecting two broadcast domains) or through routing between the overlay and the underlay networks. Usually the routing option provides the most reliable and scalable connectivity between the overlay and the underlay.
For the configuration and operation of the overlay network a (separate) control plane must be operational, which keeps track of the overlay IP/MAC addresses and the corresponding underlay devices (hosts). An additional advantage of an overlay network is that the control plane keeps track of all MAC addresses and their corresponding locations, so broadcast messages don’t have to be flooded throughout the network: packets are sent directly to the nodes (servers) that are connected to the overlay network. This minimizes bandwidth utilization and protects against broadcast storms (a.k.a. outages).
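The MAC-learning role of the control plane can be sketched as a simple lookup table. This is a toy model for illustration only (not NSX’s actual data structures); all names and addresses are made up:

```python
# Toy model of an overlay control-plane MAC table: the controller learns
# which underlay host (tunnel endpoint) each overlay MAC lives behind,
# so unicast traffic is tunneled directly instead of flooded.
class OverlayControlPlane:
    def __init__(self):
        # (vni, mac) -> underlay tunnel endpoint IP
        self.mac_table = {}

    def learn(self, vni, mac, tep_ip):
        """Register an overlay MAC as living behind a tunnel endpoint."""
        self.mac_table[(vni, mac)] = tep_ip

    def lookup(self, vni, mac):
        """Return the tunnel endpoint for a destination MAC, or None if
        unknown (only then would flooding/replication be needed)."""
        return self.mac_table.get((vni, mac))

cp = OverlayControlPlane()
cp.learn(5001, "00:50:56:aa:bb:01", "10.0.1.11")  # VM on a host in site A
cp.learn(5001, "00:50:56:aa:bb:02", "10.0.2.12")  # VM on a host in site B

# A frame destined for ...:02 is tunneled straight to host 10.0.2.12:
assert cp.lookup(5001, "00:50:56:aa:bb:02") == "10.0.2.12"
```

The key point: because the controller already knows every MAC’s location, unknown-unicast flooding (the fuel of broadcast storms) is largely avoided.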
So, each SDN solution comes with its own control plane, which itself imposes some new challenges for multi-site deployments. The control plane is (usually) provided by a cluster of an odd number of nodes (3 or 5) which (for the sake of availability) together form a node-majority controller cluster: this applies to NSX for vSphere (NSX-V), NSX-T and also Cisco ACI.
The challenge of this node-majority (controller) cluster is that, in a multi-site environment, a cluster can only be active in one site at any given time: the majority of cluster nodes can only be located in one site. If that particular site goes down, the control plane no longer functions. The same applies to the management plane of the SDN, which can also only be active in one site.
A non-functioning control and/or management plane doesn’t cause direct problems for the data plane. Communication between servers keeps functioning without an operational control and management plane, as the data plane operates decoupled from the control plane. Only when something is changed/migrated is the control plane unable to push the corresponding changes to the data plane. If nothing is changed, communication keeps working as always.
When comparing NSX-T to NSX for vSphere (NSX-V) and/or Cisco ACI: NSX-V and Cisco ACI deliver solutions to overcome this multi-site management and control plane issue, by either allowing a disconnected control plane (changes can be resolved by the underlying hosts themselves) or by using multiple control plane clusters and/or multiple management planes.
To conclude: NSX-T (currently version 2.3 and lower) doesn’t have a solution for this problem (yet; I’m sure this will come).
To continue, I’ll start with some NSX-T basics.
NSX-T basics (summed up)
NSX-T data plane
The management and control plane are summarized in the above paragraphs. The NSX-T data plane is built on the Generic Network Virtualization Encapsulation (GENEVE) protocol, where NSX-V is built on the VXLAN protocol.
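For reference, the fixed part of a GENEVE header is only 8 bytes and, like VXLAN, carries a 24-bit virtual network identifier (VNI). A minimal sketch of packing that fixed header per RFC 8926 (variable-length options omitted; the VNI value is just an example):

```python
import struct

def geneve_header(vni, opt_len_words=0, protocol=0x6558):
    """Pack the fixed 8-byte GENEVE header (RFC 8926).
    protocol 0x6558 = Trans Ether Bridging, i.e. an inner Ethernet frame."""
    ver = 0
    byte0 = (ver << 6) | (opt_len_words & 0x3F)  # version + option length
    byte1 = 0                                    # O and C flags cleared
    vni_field = (vni & 0xFFFFFF) << 8            # 24-bit VNI + reserved byte
    return struct.pack("!BBHI", byte0, byte1, protocol, vni_field)

hdr = geneve_header(vni=5001)
assert len(hdr) == 8
assert hdr.hex() == "0000655800138900"
```

The practical consequence of this encapsulation (for GENEVE just as for VXLAN) is the extra header overhead, which is why overlay transport networks need an MTU above the default 1500 bytes.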
Hypervisor hosts and Edge nodes that are part of the GENEVE overlay topology are called Transport Nodes. As the name implies: these hosts and nodes transport the GENEVE-encapsulated packets and are part of the control plane topology (which is somewhat out of scope for now).
NSX-T uses the NSX Virtual Distributed Switch (called the N-VDS) for the connectivity of virtual machines, and it requires dedicated NICs on the Transport Node.
The N-VDS spans the Transport Nodes and hosts the Logical Switches, which are used to connect virtual machines. Where NSX-V was only capable of hosting VXLAN-backed Logical Switches, NSX-T is capable of hosting GENEVE- and VLAN-backed Logical Switches.
With NSX-T there is a clear differentiation between Uplinks and physical NICs (pNIC). Uplinks refer to the uplink of the N-VDS, where a pNIC refers to a physical NIC of the Transport Node. Link Aggregation is supported in NSX-T on the pNIC level.
NSX-T uses the concept of Uplink profiles, which act as a template for the definition of the N-VDS uplinks. It defines the format of the uplinks, the overlay transport VLAN, MTU size and teaming policies.
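To make the uplink profile concept concrete, here is an illustrative sketch of what such a profile captures. The field names below are descriptive only (not the exact NSX-T API schema), and the VLAN/MTU values are assumptions:

```python
# Sketch of the settings an uplink profile templates for Transport Nodes:
# uplink layout, overlay transport VLAN, MTU and teaming policy are
# defined once and reused across nodes. Field names are illustrative.
uplink_profile = {
    "name": "tn-uplink-profile",
    "transport_vlan": 120,           # VLAN carrying GENEVE overlay traffic
    "mtu": 1600,                     # GENEVE needs headroom above 1500
    "teaming": {
        "policy": "failover_order",  # or a load-balancing policy
        "active_uplinks": ["uplink-1"],
        "standby_uplinks": ["uplink-2"],
    },
}

assert uplink_profile["mtu"] >= 1600
assert "uplink-2" not in uplink_profile["teaming"]["active_uplinks"]
```

Because the profile is a template, changing it in one place updates every Transport Node that consumes it, which keeps large environments consistent.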
NSX-T routing model
NSX-T utilizes a “tiered logical router model” to provide easy multi-tenant integration. There are two main types of logical routers available:
- Tier-0 logical routers provide ramp-on/ramp-off connectivity from the (virtualized) overlay to the (physical) underlay. A Tier-0 router also provides a connection platform for Tier-1 logical routers. In a multi-tenant environment this router is managed by the provider administrators.
- Tier-1 logical routers provide gateway functionality for overlay logical switches and are ALWAYS connected to a single Tier-0 logical router (except for standalone Tier-1 routers, but those are out of scope for now). In a multi-tenant environment each individual logical router can be managed by the tenant administrator.
The connectivity between the Tier-0 Logical Router and the Tier-1 Logical Router is provided by a “Routerlink”, a /31 subnet within the 100.64.0.0/10 reserved address space (RFC 6598), which is automatically created on deployment.
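Carving those /31 Routerlinks out of the RFC 6598 space can be illustrated with Python’s `ipaddress` module. NSX-T allocates these automatically; the sequential pool below is just a sketch of the idea:

```python
import ipaddress

# Carve /31 router-link subnets out of the reserved 100.64.0.0/10 space,
# one per Tier-0 <-> Tier-1 connection (illustration of the addressing,
# not NSX-T's actual allocator).
pool = ipaddress.ip_network("100.64.0.0/10")
routerlinks = pool.subnets(new_prefix=31)

first = next(routerlinks)
second = next(routerlinks)

assert str(first) == "100.64.0.0/31"
assert str(second) == "100.64.0.2/31"
# A /31 holds exactly two usable addresses (RFC 3021): one for the
# Tier-0 side and one for the Tier-1 side of the Routerlink.
assert [str(ip) for ip in first] == ["100.64.0.0", "100.64.0.1"]
```

Using /31s instead of /30s halves the address consumption per link, which matters when every tenant Tier-1 router gets its own Routerlink.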
NSX-T Logical Router components
Each logical router consists of two components: a Distributed Router and a Service Router component, which are connected to each other through an intra-tier transit link. The intra-tier transit link is provided with an APIPA (Automatic Private IP Addressing) subnet, to avoid interference with the normal (RFC 1918) private IP address ranges. The used APIPA range can be changed if it’s interfering somewhere in the network: but I think you really should re-design your network if this is true.
- The Distributed Router (DR) component provides dynamic (BGP) and static routing capabilities. The router is distributed in the hypervisor of all Transport Nodes (TNs): this concept is similar to the NSX-V Distributed Logical Router (DLR). Note: the DR is Equal Cost Multi Path (ECMP) capable.
- The Service Router (SR) provides (stateful and/or stateless) network services (firewalling, load-balancing, NAT, DHCP, etc).
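The ECMP capability mentioned for the DR can be illustrated with a toy flow-hashing function: every packet of a flow takes the same path (no reordering), while different flows spread across the equal-cost next hops. This is a sketch only; real routers hash in hardware and the path names are made up:

```python
import hashlib

def ecmp_next_hop(src_ip, dst_ip, src_port, dst_port, proto, next_hops):
    """Pick a next hop by hashing the flow 5-tuple, so a given flow is
    pinned to one path while flows spread over all equal-cost paths."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:4], "big") % len(next_hops)
    return next_hops[index]

paths = ["tier0-site-a", "tier0-site-b"]
hop = ecmp_next_hop("172.16.10.5", "8.8.8.8", 49152, 443, "tcp", paths)

# The same flow always hashes to the same next hop:
assert hop == ecmp_next_hop("172.16.10.5", "8.8.8.8", 49152, 443, "tcp", paths)
assert hop in paths
```

This per-flow pinning is also why ECMP combines badly with stateful services: a stateful firewall on one path never sees the return traffic if it hashes to the other path.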
NSX-T Edge nodes
NSX-T provides Edge nodes, which are appliances that provide capacity and physical network connectivity to run network services (logical Service Routers) that cannot be distributed to the hypervisors. To deploy a Service Router, an Edge node is mandatory.
When deployed, an NSX-T Edge Node is an empty shell: there are no Service Routers running in it yet. Edge Nodes are available in two form factors: as bare metal servers and as virtual machines.
Compared to NSX-V Edge Service Gateways, NSX-T Edge Nodes do need to connect to the management network. The NSX-T Manager needs this connection to be able to configure the NSX-T Edge Node.
The bare metal NSX-T Edge Node has four physical NICs (called fp-ethX) for uplink connectivity or overlay transport traffic. The management NIC can be redirected to a 1 Gb physical NIC.
The virtual machine NSX-T Edge Node has four NICs (called eth0, fp-eth0, fp-eth1 and fp-eth2). Eth0 is used for management; the other NICs are available for uplink connectivity or overlay transport traffic.
NSX-T Edge clusters
For high availability, redundancy and scalability, multiple NSX-T Edge nodes can be placed into an NSX-T Edge Cluster.
An NSX-T Edge Cluster must not be compared with an NSX-V Edge cluster: an NSX-T Edge cluster consists of NSX-T Edge Nodes (bare metal servers or VMs), whereas an NSX-V Edge cluster consists of vSphere (ESXi) hypervisor hosts purpose-built for hosting NSX-V Edge Service Gateways.
An NSX-T Edge cluster consists of a set of homogeneous Edge Nodes (bare metal OR VMs); the two form factors cannot be mixed in one NSX-T Edge cluster.
An NSX-T Edge cluster can host one Tier-0 Logical Router and/or multiple Tier-1 Logical Routers. It’s not mandatory to host the Tier-0 Logical Router and the Tier-1 Logical Routers in the same NSX-T Edge cluster; they can be hosted in separate NSX-T Edge clusters.
I’m aware this is a summary, but it’s enough to continue:
Active/Passive vs Active/Active
When talking about multisite network topologies, there are always two options:
- Active/Passive – all workloads are operational in one (primary) site; the other (secondary) site is standby for when a failure in the primary site occurs. When a failure occurs, all workloads are migrated to the secondary (passive) site until the primary site is functioning again. Only one site can be active at any given time, and a switchover between sites is always accompanied by a small outage.
- Active/Active – workloads are able to migrate freely between sites, both sites actively serve workloads and/or applications.
An active/passive topology is by all means the easiest to create and manage: no topology complexity, and you are able to power a complete site down and power everything up on the other site, where it will be working again without any complexity. The downtime involved with the switchover is a calculated risk.
Note: this topology may not be sufficient for business-critical applications (BCAs) or line-of-business (LoB) applications that must be operational 24/7 (also when a site is down). For these LoBs or BCAs an active/active network topology must be deployed (or the application must be made highly available through application clustering, which is out of scope for now).
An active/active environment imposes some design complexity, as the management, control and data planes, layer 2, layer 3 and uplink connectivity must all be addressed within one architecture.
As said: NSX-T currently doesn’t contain any built-in features to support active/active topologies for the management and control plane; it only supports active/active topologies from the data plane perspective.
With that being said: let’s continue with an NSX-T active/active topology (from a data plane standpoint).
NSX-T in an Active/Active topology
A prerequisite for an active/active topology is having layer 3 routing instances in the (physical) underlay network at both sites, connected to each other. After all, the IP subnets must be routable at both sites to create an active/active topology, so that when a site outage occurs the other site can operate autonomously.
Each site needs uplink connectivity from the overlay to the underlay. And as said: NSX-T provides a multi-tier routing solution, which sadly cannot be used for multi-site active/active topologies, because a Tier-1 Logical Router can only be connected to one Tier-0 Logical Router. The Tier-0 router does support an active/active configuration, which can be spanned over both sites, but when talking about scalability, you are bound to this single Tier-0 Logical Router instance.
To create a scalable solution, multiple Tier-0 Logical Routers must be deployed: each site should host an NSX-T Edge cluster and a minimum of one Tier-0 Logical Router. And when using multiple Tier-0 Logical Routers, we cannot use Tier-1 Logical Routers. I will call these Tier-0 routers “transit routers (TRs)”, as they provide a transit between the logical switches and the physical (underlay) network. In this case we use ECMP in combination with BGP to distribute the network traffic over both TRs.
I recommend using NSX-T Edge Nodes in the “large” virtual machine form factor, as they provide sufficient capacity. DRS host-affinity rules are created to pin an NSX-T Edge cluster to a particular site.
A quick note: when using a Tier-0 Logical Router in active/active mode, you cannot use stateful services such as firewalling and load balancing.
For ECMP to work properly, a centralized routing instance must be available from where the routes can be distributed over both TRs. Both TRs must be connected to a transit logical switch, which provides backbone connectivity between both TRs and the centralized routing instance.
The centralized routing instance is provided by another Tier-0 Logical Router, which can be deployed in an active/standby configuration, allowing stateful services and the connection of Tier-1 Logical Routers. This centralized Tier-0 Logical Router spans both sites. Between the Tier-0 Logical Routers, aggressive BGP timers of 1/3 seconds (keepalive/hold) are used: when a site outage occurs, an interruption of at most 3 seconds may occur. From the TRs towards the upstream layer 3 devices, more relaxed timers (4/12 seconds) are configured to lower the bandwidth utilization.
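The trade-off behind those two timer pairs can be expressed in a few lines (a sketch, assuming plain keepalive/hold behavior as in RFC 4271, where a neighbor is declared dead once the hold timer expires without a keepalive):

```python
def bgp_timer_profile(keepalive, hold):
    """Summarize a BGP timer pair: worst-case failure detection equals
    the hold time; chattiness is keepalives per minute per session."""
    assert hold >= 3 * keepalive  # common convention: hold = 3x keepalive
    return {"detection_s": hold, "keepalives_per_min": 60 // keepalive}

aggressive = bgp_timer_profile(keepalive=1, hold=3)  # between the Tier-0s
relaxed = bgp_timer_profile(keepalive=4, hold=12)    # towards upstream L3

assert aggressive["detection_s"] == 3    # fast site-failure detection...
assert aggressive["keepalives_per_min"] == 60  # ...at 4x the chatter
assert relaxed["detection_s"] == 12
assert relaxed["keepalives_per_min"] == 15
```

Aggressive timers inside the NSX domain buy fast failover where it matters; relaxed timers towards the underlay keep the session overhead down where the physical network already has its own failure detection.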
A diagram is shown below:
In this setup we are able to manipulate ingress and egress traffic by using (for example) BGP AS-path prepending. This allows us to steer the network traffic over one site, giving you more control over the preferred network path. When the preferred site goes down, the other site will send and receive the network packets, as it merely has a lower priority.
Good to know: the centralized active/standby Tier-0 router already uses BGP AS-path prepending for its active/standby configuration.
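Why prepending steers traffic can be shown with a toy best-path selection: BGP prefers the route with the shortest AS path (all earlier tie-breakers being equal), so prepending your own AS on the non-preferred site makes that path look longer. The ASNs below are made up:

```python
def best_path(routes):
    """Toy BGP decision: pick the route with the shortest AS path
    (ignoring the earlier tie-breakers like local preference)."""
    return min(routes, key=lambda r: len(r["as_path"]))

routes_to_prefix = [
    {"via": "site-a", "as_path": [65001]},                # preferred site
    {"via": "site-b", "as_path": [65001, 65001, 65001]},  # prepended twice
]
assert best_path(routes_to_prefix)["via"] == "site-a"

# If site A goes down, its advertisement disappears and site B takes over:
assert best_path(routes_to_prefix[1:])["via"] == "site-b"
```

The prepended path is never invalid, only less attractive, which is exactly the behavior you want for a standby site.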
Providing an active/active NSX-T topology from the management and control plane point of view is not possible (yet). From a data plane perspective you can achieve an active/active multi-site topology, but there are some pros and cons.
I can already hear some criticism about having an active/standby router as the centralized routing instance. I’ve only configured it here so that I’m able to have stateful services at this layer, which is not uncommon. You are able to create an active/active Tier-0 Logical Router, but then without stateful services.