vSphere with Tanzu Validation and Testing of Network MTU

Blog Date: June 18, 2021
Tested on vSphere 7.0.1 Build 17327586
vSphere with Tanzu Standard


On a recent customer engagement, we ran into an issue where vSphere with Tanzu would not deploy successfully, and we had intermittent connectivity to the internal Tanzu landing page IP. What we were fighting turned out to be inconsistent MTU values, both on the VMware infrastructure side and in the customer’s network. One of the many prerequisites for a successful installation of vSphere with Tanzu is a consistent MTU of at least 1600 along the entire data path.


The Issue:


Tanzu was just deployed to an NSX-T backed cluster, but you are unable to connect to the vSphere with Tanzu landing page address to download the Kubernetes CLI package via wget. Troubleshooting in the NSX-T interface shows that the load balancer, which has the control plane VMs connected to it, is up.


Symptoms:

  • You can ping the IP address of the vSphere with Tanzu landing page
  • You can also telnet to it on port 443
  • Intermittent connectivity to the vSphere with Tanzu landing page
  • Intermittent TLS handshake errors
  • vmkping tests between host TEPs are successful
  • vmkping tests from the hosts to the NSX edge node TEPs with large (1600+) packets are unsuccessful


The Cause:


Improper or inconsistent MTU settings in the network data path. vSphere with Tanzu requires a minimum MTU of 1600: the MTU must be 1600 or greater on any network that carries overlay traffic. See the VMware documentation here: System Requirements for Setting Up vSphere with Tanzu with vSphere Networking and NSX Advanced Load Balancer (vmware.com)
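
When an MTU is checked with ping, the size typed on the command line is the ICMP payload, not the MTU itself. The short Python sketch below shows the arithmetic. It is only an illustration: the function names are made up for this post, and it assumes IPv4 with a 20-byte header (no IP options) plus the 8-byte ICMP header.

```python
IPV4_HEADER = 20  # bytes, assuming an IPv4 header without options
ICMP_HEADER = 8   # bytes, ICMP echo header


def max_ping_payload(mtu):
    """Largest ICMP payload that fits in one unfragmented packet at this MTU."""
    return mtu - IPV4_HEADER - ICMP_HEADER


def required_mtu(payload):
    """Smallest MTU that carries this ICMP payload without fragmentation."""
    return payload + IPV4_HEADER + ICMP_HEADER


# Print the largest unfragmented payload for a few common MTU values.
for mtu in (1500, 1600, 9000):
    print(mtu, max_ping_payload(mtu))
```

Under these assumptions, a path with an MTU of exactly 1600 passes a payload of at most 1572, and a standard 1500 path passes at most 1472.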


vSphere with Tanzu Network MTU Validations:


These validations should have been completed prior to the deployment. However, in this case we found inconsistent MTU settings. To simplify, this is what you need to look for:

  • In NSX-T, validate that the MTU on the tier-0 gateway is set to a minimum of 1600.
  • In NSX-T, validate that the MTU on the edge transport node profile is set to a minimum of 1600.
  • In NSX-T, validate that the MTU on the host uplink profile is set to a minimum of 1600.
  • In vSphere, validate that the MTU on the vSphere Distributed Switch (vDS) is set to a minimum of 1600.
  • In vSphere, validate that the MTU on the ESXi management interface (vmk0) is set to a minimum of 1600.
  • In vSphere, validate that the MTU on the vxlan interfaces on the hosts is set to a minimum of 1600.
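
If you note down the MTU you find at each of these places, a few lines of Python can flag the ones that fall short. This is a minimal sketch rather than a tool that queries NSX-T or vSphere: the values in the dictionary are hypothetical and would be typed in by hand, and the function name is invented for illustration.

```python
MIN_MTU = 1600  # minimum stated in the requirements above


def find_undersized(mtus, minimum=MIN_MTU):
    """Return the names of settings whose MTU is below the minimum, sorted."""
    return sorted(name for name, mtu in mtus.items() if mtu < minimum)


# Hypothetical readings, entered manually after checking each setting.
observed = {
    "NSX-T tier-0 gateway": 1600,
    "NSX-T edge transport node profile": 1600,
    "NSX-T host uplink profile": 1600,
    "vSphere Distributed Switch": 1500,
    "ESXi vmk0": 1600,
    "ESXi vxlan vmk10": 1600,
    "ESXi vxlan vmk11": 1600,
}

for name in find_undersized(observed):
    print(f"{name}: MTU {observed[name]} is below {MIN_MTU}")
```

With the sample values above, only the distributed switch is reported. A single low value anywhere in the path is enough to cause the symptoms described earlier.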


Troubleshooting:

In the Tanzu enabled vSphere compute cluster, SSH into an ESXi host and ping from the host’s vxlan interface to the edge TEP interface. The edge TEP IPs can be found in NSX-T under System > Fabric > Nodes > Edge Transport Nodes: find the edges used for Tanzu, and the TEP interface IPs will be to the right. In this lab, I only have the one edge. Production environments will have more.

  • In this example, vxlan was configured on vmk10 and vmk11 on the hosts. Your mileage may vary.
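  • We are disabling fragmentation with -d so the packet will stay whole, and we are using a size of 1600. Keep in mind that -s sets the ICMP payload size and another 28 bytes of IP and ICMP headers are added on top, so a payload of 1572 is the largest that fits an MTU of exactly 1600.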
# vmkping -I vmk10 -S vxlan -s 1600 -d <edge_TEP_IP>
# vmkping -I vmk11 -S vxlan -s 1600 -d <edge_TEP_IP>
  • If the ping is unsuccessful, we need to identify the largest packet size that can get through. Try a packet size of 1572. If that is unsuccessful, try 1500, then 1476, then 1472, and so on.
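
Stepping down through guessed sizes works, but the same hunt can be done as a binary search. The Python sketch below is an illustration only. The names largest_passing_payload and simulated_probe are hypothetical, and the probe here simulates a path with a 1500 MTU instead of sending real pings. On a host, the probe would be whatever runs the vmkping command above with -d and reports whether replies came back. The search assumes that if one size passes, every smaller size passes too.

```python
def largest_passing_payload(probe, low=0, high=9000):
    """Binary-search the largest size in [low, high] for which probe(size) is True.

    Assumes that if a size passes, every smaller size passes as well.
    Returns None when even the smallest size fails.
    """
    if not probe(low):
        return None
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always makes progress
        if probe(mid):
            low = mid        # mid passes, so the answer is mid or larger
        else:
            high = mid - 1   # mid fails, so the answer is smaller
    return low


def simulated_probe(size, path_mtu=1500):
    """Stand-in for a real ping: passes if payload plus 28 header bytes fits."""
    return size + 28 <= path_mtu


best = largest_passing_payload(simulated_probe)
print("largest payload:", best, "implied path MTU:", best + 28)
```

For a search range of 0 to 9000 this needs roughly 14 probes, and adding 28 to the result gives the MTU that is actually in effect somewhere along the path.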

To test farther up the network stack, we can ping something that is on a different VLAN and subnet and sits on a routable network. In this example, the vMotion network is on a different network that is routable. It has a different VLAN, subnet, and gateway. We can use two ESXi hosts from the Tanzu enabled cluster.

  1. Open SSH sessions to ESXi-01 and ESXi-02.
  2. On ESXi-02, get the PortNum for the vMotion vmk by running the following command. The PortNum for the vMotion enabled vmk is in the far-left column of the output:
# net-stats -l

  3. Run a packet capture on ESXi-02 like so:

# pktcap-uw --switchport <vMotion_vmk_PortNum> --proto 0x01 --dir 2 -o - | tcpdump-uw -enr -

  4. On the ESXi-01 session, use the vmkping command to ping the vMotion interface of ESXi-02. In this example we use a packet size of 1472 because that was the packet size that could get through, and option -d to prevent fragmentation.

# vmkping -I vmk0 -s 1472 -d <ESXi-02_vMotion_IP>

  5. On the ESXi-02 session, we should now see six or more entries. Press CTRL+C to stop the packet capture.

  6. Looking at the packet capture output on ESXi-02, we can see on the request line that the ESXi-01 MAC address made a request to the ESXi-02 MAC address.

  • On the next line, for the reply, we might see a new MAC address that belongs to neither ESXi-01 nor ESXi-02. If that’s the case, then give this MAC address to the network team to troubleshoot further.
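
If the capture is long, picking out the unfamiliar MAC address by eye gets tedious. The Python sketch below scans capture text for MAC addresses and lists any that are not in a known set. Treat it as a rough illustration: the function name is hypothetical, and the two sample lines and all addresses in them are made up to resemble link-level tcpdump output, not copied from a real capture.

```python
import re

# Six colon-separated pairs of hex digits, e.g. 00:50:56:aa:aa:01
MAC_RE = re.compile(r"\b[0-9a-f]{2}(?::[0-9a-f]{2}){5}\b", re.IGNORECASE)


def unexpected_macs(capture_lines, known_macs):
    """Return MAC addresses seen in the capture that are not in known_macs.

    Results are lowercased, de-duplicated, and kept in order of appearance.
    """
    known = {mac.lower() for mac in known_macs}
    found = []
    for line in capture_lines:
        for mac in MAC_RE.findall(line):
            mac = mac.lower()
            if mac not in known and mac not in found:
                found.append(mac)
    return found


# Made-up sample lines and addresses, for illustration only.
sample_capture = [
    "10:15:01.000001 00:50:56:aa:aa:01 > 00:50:56:bb:bb:02, ethertype IPv4 "
    "(0x0800), length 1514: 192.0.2.11 > 198.51.100.12: ICMP echo request",
    "10:15:01.000150 00:50:56:bb:bb:02 > 00:00:5e:00:01:0a, ethertype IPv4 "
    "(0x0800), length 1514: 198.51.100.12 > 192.0.2.11: ICMP echo reply",
]
esxi_macs = ["00:50:56:AA:AA:01", "00:50:56:bb:bb:02"]  # hypothetical host MACs

print(unexpected_macs(sample_capture, esxi_macs))
```

Any address it prints is a candidate to hand over to the network team.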



Testing:

Using the ESXi hosts in the Tanzu enabled vSphere compute cluster, we can ping from the host’s vxlan interface to the edge TEP interface.

  • The edge TEP interface can be found in NSX-T under System > Fabric > Nodes > Edge Transport Nodes. Find the edges used for Tanzu, and the TEP interface IPs will be to the far right.
  • You will need to know which host vmks have vxlan enabled. In this example we are using vmk10 and vmk11 again.

We are disabling fragmentation with -d so the packet will stay whole, and we are using a packet size of 1600. These pings should now be successful.

The commands will look something like:

# vmkping -I vmk10 -S vxlan -s 1600 -d <edge_TEP_IP>
# vmkping -I vmk11 -S vxlan -s 1600 -d <edge_TEP_IP>

On the ESXi-01 session, use the vmkping command to ping something on a different routable network, so that traffic is forced out of the vSphere environment and routed. Just like before, we will be using the vMotion interface of ESXi-02. A packet size of 1600 should now work. We still use option -d to prevent fragmentation.

# vmkping -I vmk0 -s 1600 -d <ESXi-02_vMotion_IP>

On the developer VM, you should now be able to download the vsphere-plugin.zip from the vSphere with Tanzu landing page with the wget command.

# wget https://<cluster_ip>/wcp/plugin/linux-amd64/vsphere-plugin.zip

With those validations out of the way, you should now be able to carry on with the vSphere with Tanzu deployment.