Failure Installing NSX VIB Module On ESXi Host: VIB Module For Agent Is Not Installed On Host

Now admittedly I did this to myself as I was tracking down a root cause on how operations engineers were putting hosts back into production clusters without a properly functioning vxlan.  Apparently the easiest way to get a host into this state is to repeatedly move a host in and out of a production cluster to an isolation cluster where the NSX VIB module is uninstalled.  This is a bug that is resolved in vCenter 6 u3, so at least there’s that little nugget of good news.

Current production setup:

  • NSX: 6.2.8
  • ESXi:  6.0.0 build-4600944 (Update 2)
  • VCSA: 6 Update 2
  • VCD: 8.20

So for this particular error, I was seeing the following in vCenter events: “VIB Module For Agent Is Not Installed On Host“.  After searching the KB articles I came across this one KB2053782 “Agent VIB module not installed” when installing EAM/VXLAN Agent using VUM”.  Following the KB, I made sure my update manager was in working order, and even tried following steps in the KB, but I still had the same issue.

  • Investigating the EAM.log, and found the following:
 1-12T17:48:27.785Z | ERROR | host-7361-0 | VibJob.java | 761 | Unhandled response code: 99 
 2018-01-12T17:48:27.785Z | ERROR | host-7361-0 | VibJob.java | 767 | PatchManager operation failed with error code: 99 
 With VibUrl: https://172.20.4.1/bin/vdn/vibs-6.2.8/6.0-5747501/vxlan.zip 
 2018-01-12T17:48:27.785Z | INFO | host-7361-0 | IssueHandler.java | 121 | Updating issues: 

 eam.issue.VibNotInstalled { 
 time = 2018-01-12 17:48:27,785, 
 description = 'XXX uninitialized', 
 key = 175, 
 agency = 'Agency:7c3aa096-ded7-4694-9979-053b21297a0f:669df433-b993-4766-8102-b1d993192273', 
 solutionId = 'com.vmware.vShieldManager', 
 agencyName = '_VCNS_159_anqa-1-zone001_VMware Network Fabri', 
 solutionName = 'com.vmware.vShieldManager', 
 agent = 'Agent:f509aa08-22ee-4b60-b3b7-f01c80f555df:669df433-b993-4766-8102-b1d993192273', 
 agentName = 'VMware Network Fabric (89)',
  • Investigating the esxupdate.log file and found the following:
 bba9c75116d1:669df433-b993-4766-8102-b1d993192273')), com.vmware.eam.EamException: VibInstallationFailed 
 2018-01-12T17:48:25.416Z | ERROR | agent-3 | AuditedJob.java | 75 | JOB FAILED: [#212229717] 
 EnableDisableAgentJob(AgentImpl(ID:'Agent:c446cd84-f54c-4103-a9e6-fde86056a876:669df433-b993-4766-8102-b1d993192273')), 
 com.vmware.eam.EamException: VibInstallationFailed 
 2018-01-12T17:48:27.821Z | ERROR | agent-2 | AuditedJob.java | 75 | JOB FAILED: [#1294923784] 
 EnableDisableAgentJob(AgentImpl(ID:'Agent:f509aa08-22ee-4b60-
  • Restarting the VUM services didn’t work, as the VIB installation would still fail.
  • Restarting the host didn’t work.
  • On the ESXi host I ran the following command to determine if any VIBS were installed, but it didn’t show any information:  esxcli software vib list 

Starting to suspect that the ESXi host may have corrupted files.  Digging around a little more, I found the following KB2122392 Troubleshooting vSphere ESX Agent Manager (EAM) with NSX“, and KB2075500 Installing VIB fails with the error: Unknown command or namespace software vib install

Decided to manually install the NSX VIB package on the host following KB2122392 above.  Did the manuel extract the downloaded “vxlan.zip”. Below are contents of the vxlan.zip. It Contains the 3 VIB files:
  • esx-vxlan
  • esx-vsip
  • esx-dvfilter-switch-security

Tried install them manually, but got errors indicating corrupted files on the esxi host.  Had to run the following commands first to restore the corrupted files.  **CAUTION – NEEDED TO REBOOT HOST AFTER THESE TWO COMMANDS**:

  • # mv /bootbank/imgdb.tgz /bootbank/imgdb.gz.bkp
  • # cp /altbootbank/imgdb.tgz /bootbank/imgdb.tgz
  • # reboot

Once the host came back up, I attempted to continue with the manual VIB installation.  All three NSX VIBS successfully installed.  Host now showing a healthy status in NSX preparation.  Guest introspection (GI) successfully installed.

 

NSX Host VIB Upgrade From 6.1.X to 6.2.4

There is a known issue when upgrading the NSX host VIB from 6.1.X to 6.2.4, where once the host is upgraded to VIB 6.2.4, and the virtual machines are moved to it, if they should somehow find their way back to a 6.1.X host, the VM’s NIC will become disconnected causing an outage. This has been outlined in KB2146171

Resolution

We found the following steps to be the best solution in getting to the 6.2.4 NSX VIB version on ESXi 6u2, without causing any interruptions in regards to the network connectivity of the virtual machines.

  1. Log into the vSphere web client, go to Networking & Security, select Installation on the navigation menu, and then select the Host preparation tab.
  2. Select the desired cluster, and click the “Upgrade Available” message next to it.  This will start the upgrade process of all the hosts, and once completed, all hosts will display “Reboot Required”.
  3. Mark the first host for maintenance mode as you normally would, and once all virtual machines have evacuated off, and the host marked as in maintenance mode, restart it as you normally would.
  4. While we wait for the host to reboot, right click on the host cluster being upgraded and select Edit Settings.  Select vSphere DRS, and set the automation level to Manual.  This will give you control over host evacuations and where the virtual machines go.
  5. Once the host has restarted, monitor the Recent Tasks window and wait for the NSX vib installation to complete.
  6. Bring the host out of maintenance mode.  Now migrate a test VM over to the new host and test network connectivity.  Ping to another VM on a different host, and then make sure you can ping out to something like 8.8.8.8.
  7.  Verify the VIB has been upgraded to 6.2.4 from the vSphere web Networking & Security host preparation section.
  8. Open PowerCLI and connect to the vCenter where this maintenance activity is being performed.  In order to safely control the migration of virtual machines from hosts containing the NSX VIV 6.1.X to the host that has been upgraded to 6.2.4, we will use the following command to evacuate the next host’s virtual machines onto the one that was just upgraded.
Get-VM -Location "<sourcehost>" | Move-VM -Destination (Get-Vmhost "<destinationhost>")
  • “sourcehost” being the next host you wish to upgrade, and the “destinationhost” being the one that was just upgraded.

9.  Once the host is fully evacuated, place the host in maintenance mode, and reboot it.

10. VMware provided us with a script that should ONLY be executed against NSX vib 6.2.4 hosts, and does the following:

  • Verifies the VIB version running on the host.
    For example: If the VIB version is between VIB_VERSION_LOW=3960641, VIB_VERSION_HIGH=4259819 then it is considered to be a host with VIB 6.2.3 and above. Any other VIB version the script will fail with a warningCustomer needs to make sure that the script is executed against ALL virtual machines that have been upgraded since 6.1.x.
  • Once the script sets the export_version to 4, the version is persistent across reboots.
  • There is no harm if customer executes the script multiple times on the same host as only VMs that need modification will be modified.
  • Script should only be executed NSX-v 6.2.4 hosts

I have attached a ZIP file containg the script here:  fix_exportversion.zip

Script Usage

  • Copy the script to a common datastore accessible to all hosts and run the script on each host.
  • Log in to the 6.2.4 ESXi host via ssh or CONSOLE, where you intend to execute the script.
  • chmod u+x the files
  • Execute the script:
./vmfs/volumes/<Shared_Datastore>/fix_exportversion.sh /vmfs/volumes/<Shared_Datastore>/vsipioctl

 

Example output:

~ # /vmfs/volumes/NFS-101/fix_exportversion.sh /vmfs/volumes/NFS-101/vsipioctl
Fixed filter nic-39377-eth0-vmware-sfw.2 export version to 4.
Fixed filter nic-48385-eth0-vmware-sfw.2 export version to 4.
Filter nic-50077-eth0-vmware-sfw.2 already has export version 4.
Filter nic-52913-eth0-vmware-sfw.2 already has export version 4.
Filter nic-53498-eth0-vmware-sfw.2 has export version 3, no changes required.

Note: If the export version for any VM vNIC shows up as ‘2’, the script will modify the version to ‘4’ and does not modify other VMs where export version is not ‘2’.

11.  Repeat steps 5 – 10 on all hosts in the cluster until completion.  This script appears to be necessary as we have seen cases where a VM may still lose its NIC even if it is vmotioned from one NSX vib 6.2.4 host to another 6.2.4 host.

12. Once 6.2.4 host VIB installation is complete, and the script has been run against the hosts and virtual machines running on them, DRS can be set back to your desired setting like Fully automated for instance.

13.  Virtual machines should now be able to vmotion between hosts without losing their NICs.

  • This process was thoroughly tested in a vCloud Director cloud environment containing over 20,000 virtual machines, and on roughly 240 ESXi hosts without issue. vCenter environment was vCSA version 6u2, and ESXi version 6u2.