vSphere with Tanzu: Project Contour TLS Delegation Invalid – Secret Not Found

Blog Date: June 25, 2021
Tested on vSphere 7.0.1 Build 17327586
vSphere with Tanzu Standard

On a recent customer engagement, we ran into an issue where after we deployed project Contour, and created a TLS delegation “contour-tls”, but we ran into an issue where Contour did not like the public wildcard certificate we provided. We were getting an error message “TLS Secret “projectcontour/contour-tls” is invalid: Secret not found.”

After an intensive investigation to make sure everything in the environment was sound, we came to the conclusion that the “is invalid” part of the error message suggested that there was something wrong with the certificate. After working with the customer we discovered that they included the Root, the certificate, and the intermediate authorities in the PEM file. The root doesn’t need to be in the pem.  Just the certificate, and the intermediate authorities in descending order. Apparently that root being in the pem file made Contour barf. Who knew?

You could possibly see that the certificate is the issue by checking the pem data for both the <PrivateKeyName>.key and the <CertificateName>.crt by running the following commands, and comparing the pem output. IF it doesn’t match this could be your issue as well. The “<>” should be updated with your values, and don’t include these “<” “>”.

openssl pkey -in <PrivateKeyName>.key -pubout -outform pem | sha256sum
openssl x509 -in <CertificateName>.crt -pubkey -noout -outform pem | sha256sum

Below are the troubleshooting steps we took, and what we did to resolve the issue. We were using Linux, and had been logged into vSphere with Tanzu already. Did I mention that I hate certificates? But I digress….

The Issue:

You had just deployed a TKG cluster, and then deployed Project Contour as the ingress controller that uses a load balancer to be the single point of entry for all external users. This connection terminates SSL connections, and you have applied a public wildcard certificate to it. You created the TLS secret, and have created the TLS delegation, so that new apps deployed to this TKG cluster can delegate TLS connection terminations to contour. However, after you deployed your test app to verify the TLS delegation is working, you see a status of “Invalid. At least one error present, see errors for details.”, after running the following command:


kubectl get httpproxies

Troubleshooting:

  1. You run the following command to gather more information, and see in the error message: “Secret not found” Reason: “SecretNotValid
kubectl describe httpproxies.projectcontour.io

2. You check to make sure the TLS Secret was created in the right namespace with the following command, and you see that it is apart of the desired namespace. In this example, our namespace was called projectcontour, and the TLS secret was called contour-tls.


kubectl get secrets -A

3. You check the TLS delegation to make sure it was created with the following command. In this example ours was called contour-delegation, and our namespace is projectcontour.

kubectl get tlscertificatedelegations -A

4. You look at the contents of the tlscertificatedelegations with the following command, and nothing looks out of the ordinary.

kubectl describe tlscertificatedelegations -A

5. You check to see the secrets of the namespace with the following command. In this example our namespace is called projectcontour and we can see our TLS delegation contour-tls.


kubectl get secrets --namespace projectcontour

6. You validate contour-tls has data in it with the following command. In this example, our namespace is projectcontour and our TLS is contour-tls.

kubectl get secrets --namespace projectcontour contour-tls -o yaml

In the yaml output, up at the top you should see tls.crt: with data after

Down towards the bottom of the yaml output, you should see tls.key with data after

Conclusion: Everything looks proper on the Tanzu side. Based on the error message we saw “TLS Secret “projectcontour/contour-tls” is invalid: Secret not found.” The “is invalid” part could suggest that there is something wrong with the contents of the TLS secret.  At this point, the only other possibility would be that the public certificate has something wrong and needs to be re-generated.   The root doesn’t need to be in the pem.  Just the certificate for the site, and intermediate authorities in descending order.

The Resolution:

  1. Create and upload the new public certificate for contour to vSphere with Tanzu.
  2. We’ll need to delete the secret and re-create it.  Our secret was called “contour-tls”, and the namespace was called “projectcontour”.
kubectl delete secrets contour-tls -n projectcontour

3. We needed to clean our room, and delete the httpproxies we created in our test called test-tls.yml, and an app that was using the TLS delegation. In this example it was called tls-delegation.yml

kubectl delete -f test-tls.yml
kubectl delete -f tls-delegation.yml

4. Now we created a new secret called contour-tls with the new cert. The <> indicates you need to replace that value with your specific information. The “<>” does not belong in the command.

kubectl create secret tls contour-tls -n projectcontour --cert=<wildcard.tanzu>.pem --key=<wildcard.tanzu>.key

5. Validate the pem values for .key and .crt match. The <> indicates you need to replace that value with your specific information. The “<>” does not belong in the command.


openssl pkey -in <PrivateKeyName>.key -pubout -outform pem | sha256sum

openssl x509 -in <CertificateName>.crt -pubkey -noout -outform pem | sha256sum

6. If the pem numbers match, the certificate is valid. Lets go ahead an re-create the tls-delegation.  Example command:

kubectl apply -f tls-delegation.yml

7. Now you should be good to go. After you deploy your app, you should be able to check the httpproxies again for Project Contour, and see that it has a valid status

kubectl get httpproxies.projectcontour.io

If all else fails, you can open a ticket with VMware GSS to troubleshoot further.

vSphere with Tanzu Validation and Testing of Network MTU

Blog Date: June 18, 2021
Tested on vSphere 7.0.1 Build 17327586
vSphere with Tanzu Standard


On a recent customer engagement, we ran into an issue where vSphere with Tanzu wasn’t successfully deploying. We had intermittent connectivity to the internal Tanzu landing page IP. What we were fighting ended up being inconsistent MTU values configured both on the VMware infrastructure side, and also in the customers network. One of the many prerequisites to a successful installation of vSphere with Tanzu, is having a consistent MTU value of 1600.


The Issue:


Tanzu was just deployed to an NSX-T backed cluster, however you are unable to connect to the vSphere with Tanzu landing page address to download Kubernetes CLI Package via wget.  Troubleshooting in NSX-T interface shows that the load balancer is up that has the control plane VMs connected to it.


Symptoms:

  • You can ping the site address IP of the vSphere with Tanzu landing page
  • You can also telnet to it over 443
  • Intermittent connectivity to the vSphere with Tanzu landing page
  • Intermittent TLS handshake errors
  • vmkping tests between host vteps is successful.
  • vmkping tests from hosts with large 1600+ packet to nsx edge node TEPs is unsuccessful.


The Cause:


Improper/inconsistent MTU settings in the network data path.  vSphere with Tanzu requires minimum MTU of 1600.   The MTU size must be 1600 or greater on any network that carries overlay traffic.  See VMware documentation here:   System Requirements for Setting Up vSphere with Tanzu with vSphere Networking and NSX Advanced Load Balancer (vmware.com)


vSphere with Tanzu Network MTU Validations:


These validations should have been completed prior to the deployment. However, in this case we were finding inconsistent MTU settings. So to simplify, these are what you need to look for.

  • In NSX-T, validate that the MTU on the tier-0 gateway is set to a minimum of 1600.
  • In NSX-T, validate that the MTU on the edge transport node profile is set to a minimum of 1600.
  • In NSX-T, validate that the MTU on the host uplink profile is set to a minimum of 1600.
  • In vSphere, validate that the MTU on the vSphere Distributed Switch (vDS) is set to a minimum of 1600.
  • In vSphere, validate that the MTU on the ESXi management interface (vmk0) is set to a minimum of 1600.
  • In vSphere, validate that the MTU on the vxlan interfaces on the hosts is set to a minimum of 1600.


Troubleshooting:

In the Tanzu enabled vSphere compute cluster, SSH into an ESXi host, and ping from the host’s vxlan interface to the edge TEP interface.  This can be found in NSX-T via: System, Fabric and select Nodes, edge transport nodes, and find the edges for Tanzu. The TEP interface IPs will be to the right.  In this lab, I only have the one edge. Production environments will have more.

  • In this example, vxlan was configured on vmk10 and vmk11 on the hosts. Your mileage may vary.
  • We are disabling fragmentation with (-d) so the packet will stay whole. We are using a packet size of 1600
# vmkping -I vmk10 -S vxlan -s 1600 -d <edge_TEP_IP>
# vmkping -I vmk11 -S vxlan -s 1600 -d <edge_TEP_IP>
  • If the ping is unsuccessful, we need to identify the size of the packet that can get through.  Try a packet size of 1572. If unsuccessful try 1500.  If unsuccessful try 1476. If unsuccessful try 1472, etc.

To test farther up the network stack, we can perform a ping something that has a different VLAN, subnet, and is on a routable network.  In this example, the vMotion network is on a different network that is routable. It has a different VLAN, subnet, and gateway.  We can use two ESXi hosts from the Tanzu enabled cluster.

  1. Open SSH sessions to ESXi-01 and ESXi-02.
  2. On ESXi-02, get the PortNum for the vMotion vmk. On the far left you will see the PortNum for the vMotion enabled vmk. Run the following command:
# net-stats -l

3. Run a packet capture on ESXi-02 like so:

# pktcap-uw --switchport <vMotion_vmk_PortNum> --proto 0x01 --dir 2 -o - | tcpdump-uw -enr -

4. On the ESXi-01 session, use the vmkping command to ping the vMotion interface of ESXi-02.  In this example we use a packet size of 1472 because that was the packet size the could get through, and option -d to prevent fragmentation.

# vmkping -I vmk0 -s 1472 -d <ESXi-02_vMotion_IP>

5. On the ESXi-02 session, we should now see six or more entries. Do a CTRL+C to cancel the packet capture.

6. Looking at the packet capture output on ESXi-02, We can see on the request line that ESXi-01 MAC address made a request to ESXi-02 MAC address.

  • On the next line for reply, we might see a new MAC address that is not ESXi-01 or ESXi-02.  If that’s the case, then give this MAC address to the Network team to troubleshoot further.



Testing:

Using the ESXi hosts in the Tanzu enabled vSphere compute cluster, we can ping from the host’s vxlan interface to the edge TEP interface.

  • The edge TEP interface can be found in NSX-T via: System, Fabric and select Nodes, edge transport nodes, and find the edges for Tanzu. The TEP interface IPs will be to the far right.
  • You will need to know what host vmks the vxlan is enabled. In this example we are using vmk10 and vmk11 again.

In this example we are using vmk10 and vmk11 again. We are disabling fragmentation with (-d) so the packet will stay whole. We are using a packet size of 1600. These should now be successful.

The commands will look something like:

# vmkping -I vmk10 -S vxlan -s 1600 -d <edge_TEP_IP>
# vmkping -I vmk11 -S vxlan -s 1600 -d <edge_TEP_IP>

On the ESXi-01 session, use the vmkping command to ping something on a different routable network, so that we can force traffic out of the vSphere environment and be routed. In this example just like before, we will be using the vMotion interface of ESXi-02. Packet size of 1600 should now work. We still use option -d to prevent fragmentation.

# vmkping -I vmk0 -s 1600 -d <ESXi-02_vMotion_IP>

On the developer VM, you should now be able to download the vsphere-plugin.zip from the vSphere with Tanzu landing page with the wget command.

# wget https://<cluster_ip>/wcp/plugin/linux-amd64/vsphere-plugin.zip

With those validations out of the way, you should now be able to carry on with the vSphere with Tanzu deployment.

Kubernetes API Server Endpoint FQDN Missing from Certificate SAN in vSphere with Tanzu Deployment.

Blog Date: June 17, 2021
Tested on vSphere 7.0.1 Build 17327586
vSphere with Tanzu Standard

The Issue:

On a recent customer engagement we ran into an issue when we applied the CA signed certificate to the vSphere with Tanzu enabled cluster. The customer could reach the Tanzu Landing Page (internal Kubernetes site address) with the assigned IP, and they received the secure lock on the site. However, they received an invalid certificate warning when trying to connect to the internal Kubernetes site with the FQDN.  Upon closer inspection we realized that the FQDN is not apart of the certificate Subject Alternative Name (SAN). We had also found that this customer had MTU inconsistencies in their environment, and we ended up redeploying vSphere with Tanzu a couple of times. There will be another blog post for that, but in regards to this blog, on the last deployment we missed adding the FQDN during the setup.

The Cause:

When enabling vSphere with Tanzu on a compute cluster, during the deployment wizard, you are asked for the API Server endpoint FQDN (fully-qualified domain name).  You will notice this says “Optional”. 

However, because this value was not filled out during the deployment, it will not be in the SAN when you create a CSR to apply a certificate to the Tanzu supervisor cluster. 

Resolution:

Currently the only “easy” fix for this would be to redeploy vSphere with Tanzu on the cluster assuming you are early on in the deployment. 

However, if you have already deployed workloads this will be destructive.  Your only other option is to open a ticket with VMware GSS, and they will need to add the missing entry to the database on the vCenter.

I wouldn’t expect there to be a public KB article on this as we do not want customers editing the vCenter database without GSS guidance.  There is currently no way to add the missing API Server endpoint FQDN in the UI.  As of 6/16/2021, I heard an unconfirmed rumor that a feature request has been added for this, and there will be an option to edit this in the UI. However, there’s currently no ETA on when it will be added. 

A big shout-out to Nicholas M. in GSS for helping me to resolve. Even though I cannot share the full resolution here, I can at least help others troubleshoot.

Advanced Troubleshooting:

  1. If you really need to confirm that this is your issue, we can open a putty session to the vCenter.

2. Next we can check the database to to find what MasterDNSName was entered during the time of deployment.  In my test, I only have a single compute cluster that has vSphere with Tanzu enabled.  Your mileage may vary if you have more than one cluster enabled.  Enter the following command to view the table.  We are not making changes here. 

# PGPASSFILE=/etc/vmware/wcp/.pgpass psql -U wcpuser -d VCDB -h localhost -x -c "select desired_config from cluster_db_configs where cluster like 'domain-c%';" | less

3. Initially you will see a bunch of lines on your console. If you hit the “page down” key once or twice to get past these lines (if needed, lowercase g to go back to the top).  Look for MasterDNSNames. This would be the API Server endpoint FQDN.  If the value = null, the API Server endpoint FQDN was left blank during the setup. 

You cannot edit this. This however will confirm that the api server endpoint FQDN was not entered during the initial deployment.

5. hit q to quit

As stated previously, if this is a fresh deployment, the easiest path forward would be to re-deploy vSphere with Tanzu on the compute cluster in vSphere.

If you already have running workloads, your only other option at this point would be to open a support request with VMware GSS.