NSX SSL Certificate Failure on ESXi: SSL handshake failed

Some time ago I was having an issue putting a host back into service in an NSX environment.  In Log Insight, and in the /var/log/netcpa.log I was seeing errors similar to the following:

2018-05-26T11:07:50.486Z [FFD53B70 error 'Default'] SSL handshake failed on 172.15.4.100:0 : error = SSL Exception: error:140000DB:SSL routines:SSL routines:short read
2018-05-26T11:07:55.545Z [FFD12B70 error 'Default'] SSL handshake failed on 172.15.4.100:0 : error = SSL Exception: error:140000DB:SSL routines:SSL routines:short read
2018-05-26T11:08:00.600Z [FFD12B70 error 'Default'] SSL handshake failed on 172.15.4.100:0 : error = SSL Exception: error:140000DB:SSL routines:SSL routines:short read

Browsing through VMware’s archive, I came across KB2151089, very similar to the issue I was having, however upgrading to NSX 6.3.5 was not an option at the time.  I remembered a similar issue at my previous workplace, and dug through my evernote archive to find my notes.

Before we continue, this should go without saying, but your milage may very, and I’d recommend opening a ticket with VMware’s GSS.  At the very least you should test this process out in a lab.

These steps outlined here will resolve the issue.  Keep in mind at this point, the host is not in production, and currently is in maintenance mode:

  • Determine if the NSX controllers are connected by logging into the ESXi host, and running the following commands:
# esxcli network ip connection list |grep 1234

— and —

# esxcli network ip connection list |grep 5671

 

  • Next, log into the NSX appliance and backup the config.  While the config backup is taking place, get the ESXi host mob id from the vCenter mob page https://<vcsa-fqdn>/mob
  1. select the link for the ‘root folder‘, eg. group-d1
  2. select the link for the ‘child entity‘ eg. datacenter-2
  3. select the link for the ‘host folder‘ eg. group-h4
  4. select the link for the ‘child entity‘ eg. domain-c7
  5. Now locate the ‘host‘ and find the host-xxxx value. eg: host-1234 
  • After the NSX backup is complete, ssh into the NSX manager.  Root access to the appliance will be needed, so at the command prompt:
  1. Enter ‘en‘ and the enter the ‘admin’ password
  2. Enter ‘st en‘ and enter the following password: IAmOnThePhoneWithTechSupport 
  • Log into the sql prompt
# psql -U secureall
secureall=#
  • Issue the following command to verify that there is a record associated with the host mob ID.  Below is an example using host-1234
# select host_uuid,node_uuid,thumbprint from vnvp_hot_key where host_uuid='host-1234';

Example output:

host_uuid  |              node_uuid               | thumbprint                      
-----------+--------------------------------------+------------------
host-1234  | a2a68660-515e-4f87-811d-306c54b0b2e8 |AD:58:C0:84:FF:DF: 5E:95:50:B7:63:2E:3F:B2:67:22:56:F7:DC:9B

(1 row)

  • Next, in vCenter move the host to an isolation cluster.  We will need to validate the NSX vibs installed by running the following command on the host:
# esxcli software vib list |grep -E 'esx-dvfilter-switch-security|vsip|vxlan'

 

Example output:

esx-dvfilter-switch-security   6.3.1-0.0.5124716  VMware  VMwareCertified 2017-02-28
esx-vsip                       6.3.1-0.0.5124716  VMware                VMwareCertified 2017-02-28

esx-vxlan                      6.3.1-0.0.5124716  VMware VMwareCertified 2017-02-28

 

  • Remove the NSX vibs with the following commands:
# esxcli software vib remove -n esx-vxlan
# esxcli software vib remove -n esx-vsip
# esxcli software vib remove -n esx-dvfilter-switch-security

 

  • Returning to the NSX terminal window, now delete the record using the secureall=# prompt. Using ‘host-1234’ as an example.
# delete from vnvp_host_key where host_uuid='host-1234';
DELETE 1

 

  • Reboot the ESXi host.  Once the host has rebooted, put the host back into the proper cluster.  To be safe, I would temporarily turn down DRS (move slider left), and exit maintenance mode.
  • We can validate that the host looks proper in vSphere web UI: ‘Network & Security -> Installation -> Host Preparation Tab‘ .
  • Click the ‘Resolve‘ link next to the cluster name

Validation

  • Once the tasks are all completed you can run the ‘esxcli software vib list….‘ command again to see that the three vibs have been installed.
  • Test that the vxlan network is functioning on the host.
  • Verify that the SSL Exception is no longer showing in the /var/log/netcpa.log.
  • If there are no errors, then the host is all set to be put back into service.

 

 

 

Failure Installing NSX VIB Module On ESXi Host: VIB Module For Agent Is Not Installed On Host

Now admittedly I did this to myself as I was tracking down a root cause on how operations engineers were putting hosts back into production clusters without a properly functioning vxlan.  Apparently the easiest way to get a host into this state is to repeatedly move a host in and out of a production cluster to an isolation cluster where the NSX VIB module is uninstalled.  This is a bug that is resolved in vCenter 6 u3, so at least there’s that little nugget of good news.

Current production setup:

  • NSX: 6.2.8
  • ESXi:  6.0.0 build-4600944 (Update 2)
  • VCSA: 6 Update 2
  • VCD: 8.20

So for this particular error, I was seeing the following in vCenter events: “VIB Module For Agent Is Not Installed On Host“.  After searching the KB articles I came across this one KB2053782 “Agent VIB module not installed” when installing EAM/VXLAN Agent using VUM”.  Following the KB, I made sure my update manager was in working order, and even tried following steps in the KB, but I still had the same issue.

  • Investigating the EAM.log, and found the following:
 1-12T17:48:27.785Z | ERROR | host-7361-0 | VibJob.java | 761 | Unhandled response code: 99 
 2018-01-12T17:48:27.785Z | ERROR | host-7361-0 | VibJob.java | 767 | PatchManager operation failed with error code: 99 
 With VibUrl: https://172.20.4.1/bin/vdn/vibs-6.2.8/6.0-5747501/vxlan.zip 
 2018-01-12T17:48:27.785Z | INFO | host-7361-0 | IssueHandler.java | 121 | Updating issues: 

 eam.issue.VibNotInstalled { 
 time = 2018-01-12 17:48:27,785, 
 description = 'XXX uninitialized', 
 key = 175, 
 agency = 'Agency:7c3aa096-ded7-4694-9979-053b21297a0f:669df433-b993-4766-8102-b1d993192273', 
 solutionId = 'com.vmware.vShieldManager', 
 agencyName = '_VCNS_159_anqa-1-zone001_VMware Network Fabri', 
 solutionName = 'com.vmware.vShieldManager', 
 agent = 'Agent:f509aa08-22ee-4b60-b3b7-f01c80f555df:669df433-b993-4766-8102-b1d993192273', 
 agentName = 'VMware Network Fabric (89)',
  • Investigating the esxupdate.log file and found the following:
 bba9c75116d1:669df433-b993-4766-8102-b1d993192273')), com.vmware.eam.EamException: VibInstallationFailed 
 2018-01-12T17:48:25.416Z | ERROR | agent-3 | AuditedJob.java | 75 | JOB FAILED: [#212229717] 
 EnableDisableAgentJob(AgentImpl(ID:'Agent:c446cd84-f54c-4103-a9e6-fde86056a876:669df433-b993-4766-8102-b1d993192273')), 
 com.vmware.eam.EamException: VibInstallationFailed 
 2018-01-12T17:48:27.821Z | ERROR | agent-2 | AuditedJob.java | 75 | JOB FAILED: [#1294923784] 
 EnableDisableAgentJob(AgentImpl(ID:'Agent:f509aa08-22ee-4b60-
  • Restarting the VUM services didn’t work, as the VIB installation would still fail.
  • Restarting the host didn’t work.
  • On the ESXi host I ran the following command to determine if any VIBS were installed, but it didn’t show any information:  esxcli software vib list 

Starting to suspect that the ESXi host may have corrupted files.  Digging around a little more, I found the following KB2122392 Troubleshooting vSphere ESX Agent Manager (EAM) with NSX“, and KB2075500 Installing VIB fails with the error: Unknown command or namespace software vib install

Decided to manually install the NSX VIB package on the host following KB2122392 above.  Did the manuel extract the downloaded “vxlan.zip”. Below are contents of the vxlan.zip. It Contains the 3 VIB files:
  • esx-vxlan
  • esx-vsip
  • esx-dvfilter-switch-security

Tried install them manually, but got errors indicating corrupted files on the esxi host.  Had to run the following commands first to restore the corrupted files.  **CAUTION – NEEDED TO REBOOT HOST AFTER THESE TWO COMMANDS**:

  • # mv /bootbank/imgdb.tgz /bootbank/imgdb.gz.bkp
  • # cp /altbootbank/imgdb.tgz /bootbank/imgdb.tgz
  • # reboot

Once the host came back up, I attempted to continue with the manual VIB installation.  All three NSX VIBS successfully installed.  Host now showing a healthy status in NSX preparation.  Guest introspection (GI) successfully installed.

 

Upgrading NSX from 6.2.4 to 6.2.8 In a vCloud Director 8.10.1 Environment

We use NSX to serve up the edges in vCloud Director environment currently running on 8.10.1.  One of the important caveats to note here, that when you do upgrade an NSX 6.2.4 appliance in this configuration, you will no longer be able to redeploy the edges in vCD until you upgrade and redeploy the edge first in NSX.  Then and only then will the subsequent redeploys in vCD work.  The cool thing about that though, is VMware finally has a decent error message that displays in vCD if you do try to redeploy an edge before upgrading it in NSX, you’d see an error message similar to:

—————————————————————————————————————–

“[ 5109dc83-4e64-4c1b-940b-35888affeb23] Cannot redeploy edge gateway (urn:uuid:abd0ae80) com.vmware.vcloud.fabric.nsm.error.VsmException: VSM response error (10220): Appliance has to be upgraded before performing any configuration change.”

—————————————————————————————————————–

Now we get to the fun part – The Upgrade…

A little prep work goes a long way:

  • If you have a support contract with VMware, I HIGHLY RECOMMEND opening a support request with VMware, and detail with GSS your upgrade plans, along with the date of the upgrade.  This allows VMware to have a resource available in case the upgrade goes sideways.
  • Make a clone of the appliance in case you need to revert (keep powered off)
  • Set host clusters DRS where vCloud Director environment/cloud VMs are to manual (keeps VMs/edges stationed in place during upgrade)
  • Disable HA
  • Do a manual backup of NSX manager in the appliance UI

Shutdown the vCloud Director Cell service

  • It is highly advisable to stop the vcd service on each of the cells in order to prevent clients in vCloud Director from making changes during the scheduled outage/maintenance.  SSH to each vcd cell and run the following in each console session:
# service vmware-vcd stop
  • A good rule of thumb is to now check the status of each cell to make sure the service has been disabled.  Run this command in each cell console session:
# service vmware-vcd status
  • For more information on these commands, please visit the following VMware KB article: KB1026310

Upgrading the NSX appliance to 6.2.8

  1. Log into NSX manager and the vCenter client
  2. Navigate to Manage→ Upgrade

nsx4

  1. Click ‘upgrade’ button
  2. Click the ‘Choose File’ button
  3. Browse to upgrade bundle and click open
  4. Click the ‘continue button’, the install bundle will be uploaded and installed.

 

nsx1

nsx2

  1. You will be prompted if you would like to enable SSH and join the customer improvement program
  2. Verify the upgrade version, and click the upgrade button.

nsx5

  1. The upgrade process will automatically reboot the NSX manager vm in the background. Having the console up will show this.  Don’t trust the ‘uptime’ displayed in the vCenter for the VM.
  2. Once the reboot has completed the GUI will come up quick but it will take a while for the NSX management services to change to the running state. Give the appliance 10 minutes or so to come back up, and take the time now to verify the NSX version. If using guest introspection, you should wait until the red flags/alerts clear on the hosts before proceeding.
  3. In the vSphere web client, make sure you see ‘Networking & Security’ on the left side.  If it does not show up, you may need to ssh into the vCenter appliance and restart the web service.  Otherwise continue to step 12.
# service vsphere-client restart

12. In the vsphere web client, go to Networking and Security -> Installation and select the Management Tab.  You have the option to select your controllers and download a controller snapshot.  Otherwise click the “Upgrade Available” link.

nsx8

13. Click ‘Yes’ to upgrade the controllers.  Sit back and relax.  This part can take up to 30 minutes.  You can click the page refresh in order to monitor progress of the upgrades on each controller.

nsx9

14.  Once the upgrade of the controllers has completed, ssh into each controller and run the following in the console to verify it indeed has connection back to the appliance

# show control-cluster status

15. On the ESXi hosts/blades in each chassis, I would run this command just as a sanity check to spot any NSX controller connection issues.

 esxcli network ip connection list | grep 1234
  • If all controllers are connected you should see something similar in your output

nsx10

  • If controllers are not in a healthy state, you may get something similar to this next image in your output.  If this is the case, you can first try to reboot the controller.  If that doesn’t work try a reboot.  If that doesn’t work…..weep in silence. Then call VMware using the SR I strongly suggested creating before the upgrade, and GSS or your TAM can get you squared away.

nsx11

16.  Now in the vSphere web client, if you go back to Network & Security -> Installation -> Host Preparation,  you will see that there in an upgrade available for the clusters.  Depending on the size of your environment, you may choose to do the upgrade now or at a later time outside of the planned outage.  Either way you would click on the target cluster ‘Upgrade Available’ link and select yes.  Reboot one host at a time that way the vibs are installed in a controlled fashion. If you simply click resolve, the host will attempt to go into maintenance mode and reboot.

17. After the new vibs have been installed on each host, run the following command to be sure they have the new vib version:

# esxcli software vib list | grep -E 'esx-dvfiler|vsip|vxlan'

Start the vCloud Director Cell service

  • On each cell run the following commands

To start:

# service vmware-vcd start

Check the status after :

# service vmware-vcd status
  • Log into VCD and by now the inventory service should be syncing with the underlining vCenter.  I would advise waiting for it to complete, then run some sanity checks (provision orgs, edges, upgrade edges, etc)