Failure Installing NSX VIB Module On ESXi Host: VIB Module For Agent Is Not Installed On Host

March 6, 2018 CaptainvOPs

Admittedly, I did this to myself while tracking down the root cause of how operations engineers were putting hosts back into production clusters without a properly functioning VXLAN. Apparently the easiest way to get a host into this state is to repeatedly move it between a production cluster and an isolation cluster where the NSX VIB module is uninstalled. This is a bug that is resolved in vCenter 6 U3, so at least there’s that little nugget of good news.

Current production setup:

NSX: 6.2.8
ESXi: 6.0.0 build-4600944 (Update 2)
VCSA: 6 Update 2
VCD: 8.20

For this particular error, I was seeing the following in vCenter events: “VIB Module For Agent Is Not Installed On Host”. After searching the KB articles, I came across KB2053782, “‘Agent VIB module not installed’ when installing EAM/VXLAN Agent using VUM”. Following the KB, I made sure my Update Manager was in working order and even tried following the steps in the KB, but I still had the same issue.

Investigating the eam.log, I found the following:

 1-12T17:48:27.785Z | ERROR | host-7361-0 | VibJob.java | 761 | Unhandled response code: 99 
 2018-01-12T17:48:27.785Z | ERROR | host-7361-0 | VibJob.java | 767 | PatchManager operation failed with error code: 99 
 With VibUrl: https://172.20.4.1/bin/vdn/vibs-6.2.8/6.0-5747501/vxlan.zip 
 2018-01-12T17:48:27.785Z | INFO | host-7361-0 | IssueHandler.java | 121 | Updating issues: 

 eam.issue.VibNotInstalled { 
 time = 2018-01-12 17:48:27,785, 
 description = 'XXX uninitialized', 
 key = 175, 
 agency = 'Agency:7c3aa096-ded7-4694-9979-053b21297a0f:669df433-b993-4766-8102-b1d993192273', 
 solutionId = 'com.vmware.vShieldManager', 
 agencyName = '_VCNS_159_anqa-1-zone001_VMware Network Fabri', 
 solutionName = 'com.vmware.vShieldManager', 
 agent = 'Agent:f509aa08-22ee-4b60-b3b7-f01c80f555df:669df433-b993-4766-8102-b1d993192273', 
 agentName = 'VMware Network Fabric (89)',

Investigating the esxupdate.log file, I found the following:

 bba9c75116d1:669df433-b993-4766-8102-b1d993192273')), com.vmware.eam.EamException: VibInstallationFailed 
 2018-01-12T17:48:25.416Z | ERROR | agent-3 | AuditedJob.java | 75 | JOB FAILED: [#212229717] 
 EnableDisableAgentJob(AgentImpl(ID:'Agent:c446cd84-f54c-4103-a9e6-fde86056a876:669df433-b993-4766-8102-b1d993192273')), 
 com.vmware.eam.EamException: VibInstallationFailed 
 2018-01-12T17:48:27.821Z | ERROR | agent-2 | AuditedJob.java | 75 | JOB FAILED: [#1294923784] 
 EnableDisableAgentJob(AgentImpl(ID:'Agent:f509aa08-22ee-4b60-
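
When the log runs to thousands of lines, it can help to filter it down to the ERROR entries. The following is a minimal Python sketch, not part of the original troubleshooting. It assumes every entry follows the pipe-delimited layout seen in the excerpts above (timestamp | level | thread | source file | line number | message). The function names are hypothetical, and the sample lines are copied from the excerpt.

```python
import re

# Layout assumed from the excerpts above:
# timestamp | LEVEL | thread | source file | line number | message
LINE_RE = re.compile(
    r"^\s*(?P<timestamp>\S+)\s*\|\s*(?P<level>[A-Z]+)\s*\|\s*(?P<thread>[^|]+?)\s*\|"
    r"\s*(?P<source>[^|]+?)\s*\|\s*(?P<line>\d+)\s*\|\s*(?P<message>.*?)\s*$"
)


def parse_eam_line(line):
    """Return the fields of one log entry as a dict, or None for continuation lines."""
    match = LINE_RE.match(line)
    return match.groupdict() if match else None


def errors_only(lines):
    """Keep only the entries whose level is ERROR."""
    parsed = (parse_eam_line(line) for line in lines)
    return [entry for entry in parsed if entry and entry["level"] == "ERROR"]


# Sample lines taken from the eam.log excerpt in this post.
sample = [
    " 2018-01-12T17:48:27.785Z | ERROR | host-7361-0 | VibJob.java | 767 | PatchManager operation failed with error code: 99 ",
    " With VibUrl: https://172.20.4.1/bin/vdn/vibs-6.2.8/6.0-5747501/vxlan.zip ",
    " 2018-01-12T17:48:27.785Z | INFO | host-7361-0 | IssueHandler.java | 121 | Updating issues: ",
]

for entry in errors_only(sample):
    print(entry["timestamp"], entry["source"], entry["message"])
```

Continuation lines such as the “With VibUrl” line do not match the pattern and are skipped, so a real script would need extra handling if those lines matter.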

Restarting the VUM services didn’t work, as the VIB installation would still fail.
Restarting the host didn’t work.
On the ESXi host I ran the following command to determine whether any VIBs were installed, but it didn’t return any information: esxcli software vib list

I started to suspect that the ESXi host had corrupted files. Digging around a little more, I found KB2122392, “Troubleshooting vSphere ESX Agent Manager (EAM) with NSX”, and KB2075500, “Installing VIB fails with the error: Unknown command or namespace software vib install”.

I decided to manually install the NSX VIB package on the host following KB2122392 above, and manually extracted the downloaded “vxlan.zip”. The vxlan.zip contains three VIB files:

esx-vxlan
esx-vsip
esx-dvfilter-switch-security

I tried to install them manually, but got errors indicating corrupted files on the ESXi host. I had to run the following commands first to restore the corrupted files. **CAUTION – NEEDED TO REBOOT HOST AFTER THESE TWO COMMANDS**:

# mv /bootbank/imgdb.tgz /bootbank/imgdb.gz.bkp
# cp /altbootbank/imgdb.tgz /bootbank/imgdb.tgz
# reboot

Once the host came back up, I continued with the manual VIB installation. All three NSX VIBs installed successfully. The host now showed a healthy status in NSX host preparation, and Guest Introspection (GI) installed successfully.

Enable TLS v1 In vCloud Director 8.20 and vCloud Availability 1.0

December 18, 2017 CaptainvOPs

VMware’s vCloud Director (vCD) and vCloud Availability (vCAV) only come with TLS v1.1 and 1.2 enabled out of the box. This process shows how to enable TLS v1. If more information is needed, see VMware’s documentation on vCloud Director 8.20, or KB2145796. This work should be completed after hours, as you will inevitably be moving the VCD proxy service from one cell to another, and this could cause a brief outage for customers. The process requires taking the cell offline, so do the cells one at a time, starting with a cell that is not running the inventory service.

Open an SSH session to a VCD cell, or vCAV cloud proxy cell, and su to root.
Change to the ‘/opt/vmware/vcloud-director/bin/’ directory.
Use the Cell Management Tool to quiesce the cell. This will move active jobs over to another cell, and cleanly shut down the cell. You should make a note of which VCD cell has the proxy service enabled, and leave that cell until last.

# ./cell-management-tool -u administrator cell --quiesce true

Get the status of any running jobs on each cell. ** Verify Job count = 0 | Is Active = false | In Maintenance Mode = false

# ./cell-management-tool -u administrator cell --status

Example Output:

Job count = 0
Is Active = false
In Maintenance Mode = false
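
If you script this check across several cells, the status text can be parsed rather than eyeballed. This is a minimal Python sketch under the assumption that the tool prints the three “name = value” fields shown in the example output. The function names are hypothetical, and how the text is captured (for example over SSH) is left out.

```python
import re

# The three fields named in the verification step above.
FIELD_RE = re.compile(r"(Job count|Is Active|In Maintenance Mode)\s*=\s*(\S+)")


def parse_cell_status(text):
    """Turn the status text into a dict such as {'Job count': '0', ...}."""
    return {name: value for name, value in FIELD_RE.findall(text)}


def ready_for_shutdown(status):
    """True when all three values match the ones this post says to verify."""
    return (
        status.get("Job count") == "0"
        and status.get("Is Active") == "false"
        and status.get("In Maintenance Mode") == "false"
    )


# Sample text mirroring the example output above.
sample = """Job count = 0
Is Active = false
In Maintenance Mode = false
"""

status = parse_cell_status(sample)
print(status, ready_for_shutdown(status))
```

The pattern searches anywhere in the text, so it copes with the fields arriving on separate lines or run together on one line.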

Shut the cell down to prevent any other jobs from becoming active on the cell.

# ./cell-management-tool -u administrator cell --shutdown

Example Output:

Cell successfully deactivated and all tasks cleared in preparation for shutdown
Stopping vmware-vcd-watchdog:                              [  OK  ]
Stopping vmware-vcd-cell:                                  [  OK  ]

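Run the following command on the vCD cell in /opt/vmware/vcloud-director/bin/ to enable TLS v1. The -d option sets the list of disabled protocols, so leaving TLSv1 off that list is what enables it: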

# ./cell-management-tool ssl-protocols -d SSLv3,SSLv2Hello

Start the cell service, and validate that a vCD cell has the listener service running from the UI, and that vCenter is connected to one of the cells.

# service vmware-vcd start

To validate that TLS v1 has been enabled on the vCD cell, or vCAV cloud proxy cell, run the following command

# ./cell-management-tool ssl-protocols -l

Example output

Allowed SSL protocols:
* TLSv1.2
* TLSv1.1
* TLSv1

If you have additional VCD cells, or vCAV cloud proxy cells, repeat this process one at a time.
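
When repeating this on many cells, the listing from ssl-protocols -l can be checked programmatically. The sketch below is a minimal Python example that assumes the listing looks like the example output above, with one protocol per line after a leading asterisk. The function name is hypothetical.

```python
def allowed_protocols(listing):
    """Collect the protocol names from lines that start with an asterisk."""
    protocols = []
    for line in listing.splitlines():
        line = line.strip()
        if line.startswith("*"):
            protocols.append(line.lstrip("*").strip())
    return protocols


# Sample text copied from the example output above.
sample = """Allowed SSL protocols:
* TLSv1.2
* TLSv1.1
* TLSv1
"""

found = allowed_protocols(sample)
print("TLSv1 enabled:", "TLSv1" in found)
```

Comparing whole names is deliberate: a plain text search for “TLSv1” would also match “TLSv1.1” and “TLSv1.2” and report success on a cell where TLS v1 is still disabled.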

Upgrade Existing vRealize Operations Manager Add-on/Solution Paks

October 23, 2017 CaptainvOPs

The following was recorded using a vRealize Operations Manager (vROps) 6.6 cluster; however, older versions of vROps can be upgraded the same way.

Log into the vROps environment, go to the Administration tab, and select Solutions in the left column.
Here you can see all of the add-on/solution paks that I have installed in this environment. To upgrade an existing solution, simply click the green plus button.

Browse for the new pak. In this example I have selected the “Reset Default Content” option. As the statement suggests, this can override policies, customized alerts, symptoms, etc. that may have been customized by your organization, forcing that work to be re-created. However, I like using this option because I get the new changes and can adjust my monitoring accordingly. Use at your own discretion.

Click ‘upload’

Click ‘Next’
Read and accept the EULA if you so desire
Click ‘Next’

Now the installation process will begin. This shouldn’t take longer than 5 minutes.


Click Finish


Now the latest version of the Add-on/solutions pak is installed and ready for use. In most cases it will just pick up the config from older versions.

Collecting Java Heap dump from vCloud Director Cells

September 27, 2017 (updated December 19, 2017) CaptainvOPs

You just need to generate the Java heap dump from one of the cells. What you’ll need to succeed:

JConsole
iptables disabled on the cell you are connecting to.
Disk space available on the cell to accommodate the dump – I believe these can be between 8 and 10 GB in size
Unless it is an emergency, do this operation outside of normal business hours, as it will be CPU intensive for up to 3 minutes, can impact API call performance, and can potentially cause the VCD cell inventory service to hang.

Step #1: Disable iptables on the cell

ssh to the desired cell and run the following command:

# service iptables stop

Step #2: Connect with jconsole (java console)

Domain credentials should work here, depending on your environment.
Connect to port: 8999
Connect to the desired cell.

If you get the message “Secure connection failed. Retry insecurely?”, just click the ‘Insecure’ button to continue.

Step #3: Generate the heap dump

On the MBeans tab, in the com.sun.management → HotSpotDiagnostic object, select the Operations section.
In the dumpHeap parameters, enter the following information:

p0: [heap-output-path]
p1: true – do a garbage collection before the heap dump

For example:

p0: /opt/vmware/vcloud-director/vcd_cell_name_heap-dump-file.hprof
p1: true

Click the dumpHeap button.


There will be no indication that the heap dump has completed. I just watch the size of the file on the cell until it stops growing. This process typically takes less than two minutes.
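
Watching the file size can also be automated. The following Python sketch is only an illustration of that polling idea: it reports the size once it has stayed unchanged for a few consecutive checks. The function name, the default interval, and the number of checks are my own assumptions, and it also assumes a Python interpreter is available wherever the dump file can be read. The demonstration at the bottom uses a throwaway temporary file rather than a real heap dump.

```python
import os
import tempfile
import time


def wait_until_stable(path, interval=5.0, stable_checks=3, timeout=600.0):
    """Poll a file's size and return it once it stops changing.

    The size must be identical on `stable_checks` consecutive polls after the
    first reading. Raises TimeoutError if that never happens within `timeout`.
    """
    deadline = time.monotonic() + timeout
    last_size = None
    unchanged = 0
    while time.monotonic() < deadline:
        size = os.path.getsize(path) if os.path.exists(path) else None
        if size is not None and size == last_size:
            unchanged += 1
            if unchanged >= stable_checks:
                return size
        else:
            # Size changed (or the file is not there yet): start counting again.
            unchanged = 0
            last_size = size
        time.sleep(interval)
    raise TimeoutError(f"{path} was still changing after {timeout} seconds")


# Demonstration against a throwaway file, with short intervals so it runs quickly.
with tempfile.NamedTemporaryFile(delete=False) as handle:
    handle.write(b"x" * 1024)
    demo_path = handle.name

final_size = wait_until_stable(demo_path, interval=0.01, stable_checks=2, timeout=5.0)
os.remove(demo_path)
print(final_size)
```

For a real dump you would pass the same path entered for p0. A size that has stopped changing is only a hint that the dump has finished, so keep the interval generous.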

Step #4: Cleanup and send-off

Locate the heap dump in /opt/vmware/vcloud-director/ and move it to a location where you can compress it and upload it to the VMware FTP site, as you would for logs.
Start iptables on the cell again: # service iptables start

Upgrading VMware vCloud Director to 8.20

September 25, 2017 (updated September 28, 2017) CaptainvOPs

This document was created while upgrading an existing vCloud Director 8.10.1 environment with an Oracle database, and multiple cloud cells.

After downloading the latest version of vCloud Director 8.20 for service providers, SCP the upgrade to all VCD cells. You can review the release notes here.

What you’ll need to do before getting started:

SSH into each cell and ‘sudo su -’ to root
Move the .bin file to the root directory
chmod +x vmware-vcloud-director-distribution-8.20.0-5515092.bin
I strongly advise opening a support request with VMware before proceeding with the upgrade. You may not need it, but it comes in handy having one logged beforehand.

Maintenance – Shutdown the cells

1. Open an SSH session into each VCD cell

2. Sudo to root using the following command:

# sudo su -

3. Change to the vcloud-director/bin/ directory

# cd /opt/vmware/vcloud-director/bin/

4. Use the Cell Management Tool to quiesce the cell. This will move active jobs over to another cell.

# ./cell-management-tool -u administrator cell --quiesce true

5. Get the status of any running jobs on each cell. ** Verify Job count = 0 | Is Active = false | In Maintenance Mode = false

# ./cell-management-tool -u administrator cell --status


6. Shut the cell down to prevent any other jobs from becoming active on the cell. This command will also allow active jobs to cleanly finish

# ./cell-management-tool -u administrator cell --shutdown


7. Get a status on the cells to be sure everything is down

# service vmware-vcd status

8. Now complete steps 4 – 7 on the remaining cells to cleanly shut down the vCD service on all cells.

9. Here is where I would shut down the VCD cell virtual machines and the database to get a clean snapshot while the environment is powered off.

10. After taking the snapshots, power the database virtual machine back on. Once it is fully up, power on the VCD cell virtual machines.

11. Log back into the vCloud Director environment to verify functionality before the upgrade.

12. SSH to all VCD cell virtual machines and use the following command to stop the service again on each cell. Here there is an assumption made that we are now well within a maintenance window.

# service vmware-vcd stop

Starting The vCloud Director Upgrade

1. Start with the first cell, and run the first half of the upgrade. DO NOT upgrade the database yet.

# ./vmware-vcloud-director-distribution-8.20.0-5515092.bin


2. Respond with: y


3. Stop. Now you need to run steps one and two on the rest of the vCloud Director Cells, and install the upgrade. Do them one at a time. DO NOT upgrade the database yet.

4. Now that all cells have been upgraded, go back to the first cell and run the database upgrade.

# /opt/vmware/vcloud-director/bin/upgrade


5. Respond with: y


6. Start the first cell by responding with ‘y’


7. Manually start the VCD service on the remaining cells

# service vmware-vcd start

8. Get the VCD status of all cells by running the following command on each

# service vmware-vcd status

9. Log into the cell, and watch/wait for vCenter to sync with vCD under the Manage & Monitor section → vCenters. This normally takes 30 minutes or so. Once done the status will change from a spinning circle to a green check mark.

10. Run some environment validation tests to be sure everything is working properly, and then delete the snapshots taken earlier.

Upgrading NSX from 6.2.4 to 6.2.8 In a vCloud Director 8.10.1 Environment

August 24, 2017 (updated September 24, 2017) CaptainvOPs

We use NSX to serve up the edges in a vCloud Director environment currently running on 8.10.1. One important caveat to note here is that once you upgrade an NSX 6.2.4 appliance in this configuration, you will no longer be able to redeploy the edges in vCD until you upgrade and redeploy the edge in NSX first. Then, and only then, will subsequent redeploys in vCD work. The cool thing, though, is that VMware finally has a decent error message that displays in vCD if you do try to redeploy an edge before upgrading it in NSX. You’d see an error message similar to:

—————————————————————————————————————–

“[ 5109dc83-4e64-4c1b-940b-35888affeb23] Cannot redeploy edge gateway (urn:uuid:abd0ae80) com.vmware.vcloud.fabric.nsm.error.VsmException: VSM response error (10220): Appliance has to be upgraded before performing any configuration change.”

—————————————————————————————————————–

Now we get to the fun part – The Upgrade…

A little prep work goes a long way:

If you have a support contract with VMware, I HIGHLY RECOMMEND opening a support request with VMware, and detail with GSS your upgrade plans, along with the date of the upgrade. This allows VMware to have a resource available in case the upgrade goes sideways.
Make a clone of the appliance in case you need to revert (keep powered off)
Set DRS to manual on the host clusters where the vCloud Director environment/cloud VMs run (keeps VMs/edges stationed in place during the upgrade)
Disable HA
Do a manual backup of NSX manager in the appliance UI

Shutdown the vCloud Director Cell service

It is highly advisable to stop the vcd service on each of the cells in order to prevent clients in vCloud Director from making changes during the scheduled outage/maintenance. SSH to each vcd cell and run the following in each console session:

# service vmware-vcd stop

A good rule of thumb is to now check the status of each cell to make sure the service has stopped. Run this command in each cell console session:

# service vmware-vcd status

For more information on these commands, please visit the following VMware KB article: KB1026310

Upgrading the NSX appliance to 6.2.8

1. Log into NSX Manager and the vCenter client

2. Navigate to Manage → Upgrade

3. Click the ‘Upgrade’ button

4. Click the ‘Choose File’ button

5. Browse to the upgrade bundle and click Open

6. Click the ‘Continue’ button; the install bundle will be uploaded and installed.

7. You will be asked whether you would like to enable SSH and join the customer improvement program

8. Verify the upgrade version, and click the Upgrade button.

9. The upgrade process will automatically reboot the NSX Manager VM in the background. Having the console up will show this. Don’t trust the ‘uptime’ displayed in vCenter for the VM.

10. Once the reboot has completed, the GUI will come up quickly, but it will take a while for the NSX management services to change to the running state. Give the appliance 10 minutes or so to come back up, and take the time now to verify the NSX version. If using Guest Introspection, you should wait until the red flags/alerts clear on the hosts before proceeding.

11. In the vSphere Web Client, make sure you see ‘Networking & Security’ on the left side. If it does not show up, you may need to SSH into the vCenter appliance and restart the web service. Otherwise continue to step 12.

# service vsphere-client restart

12. In the vsphere web client, go to Networking and Security -> Installation and select the Management Tab. You have the option to select your controllers and download a controller snapshot. Otherwise click the “Upgrade Available” link.

13. Click ‘Yes’ to upgrade the controllers. Sit back and relax. This part can take up to 30 minutes. You can click the page refresh in order to monitor progress of the upgrades on each controller.

14. Once the upgrade of the controllers has completed, ssh into each controller and run the following in the console to verify it indeed has connection back to the appliance

# show control-cluster status

15. On the ESXi hosts/blades in each chassis, I would run this command just as a sanity check to spot any NSX controller connection issues.

 esxcli network ip connection list | grep 1234

If all controllers are connected you should see something similar in your output

If controllers are not in a healthy state, you may get something similar to this next image in your output. If this is the case, you can first try to reboot the controller. If that doesn’t work try a reboot. If that doesn’t work…..weep in silence. Then call VMware using the SR I strongly suggested creating before the upgrade, and GSS or your TAM can get you squared away.

16. Now in the vSphere Web Client, if you go back to Networking & Security -> Installation -> Host Preparation, you will see that there is an upgrade available for the clusters. Depending on the size of your environment, you may choose to do the upgrade now or at a later time outside of the planned outage. Either way, click the target cluster’s ‘Upgrade Available’ link and select Yes. Reboot one host at a time so that the VIBs are installed in a controlled fashion. If you simply click Resolve, the host will attempt to go into maintenance mode and reboot.

17. After the new vibs have been installed on each host, run the following command to be sure they have the new vib version:

# esxcli software vib list | grep -E 'esx-dvfilter|vsip|vxlan'
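
If the VIB listing is collected from many hosts, the check for the three NSX VIBs can be scripted. This is a minimal Python sketch that assumes the usual column layout of esxcli software vib list (Name, Version, Vendor, Acceptance Level, Install Date, with columns separated by two or more spaces). The function names are hypothetical, and the version strings in the sample are placeholders, not real build numbers.

```python
import re

# The three VIBs shipped in vxlan.zip for NSX 6.2.x, as listed earlier on this page.
NSX_VIBS = ("esx-vxlan", "esx-vsip", "esx-dvfilter-switch-security")


def parse_vib_list(text):
    """Map VIB name to version, skipping the header and the dashed separator."""
    versions = {}
    for line in text.splitlines():
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) < 2 or fields[0] == "Name" or set(fields[0]) == {"-"}:
            continue
        versions[fields[0]] = fields[1]
    return versions


def missing_nsx_vibs(versions):
    """Return the NSX VIB names that are absent from the parsed listing."""
    return [name for name in NSX_VIBS if name not in versions]


# Illustrative sample only: the version values are placeholders.
sample = """Name                          Version        Vendor  Acceptance Level  Install Date
----------------------------  -------------  ------  ----------------  ------------
esx-dvfilter-switch-security  0.0.0-example  VMware  VMwareCertified   2018-01-12
esx-vsip                      0.0.0-example  VMware  VMwareCertified   2018-01-12
"""

versions = parse_vib_list(sample)
missing = missing_nsx_vibs(versions)
print(missing)
```

Comparing the parsed version strings across hosts is then a simple dictionary comparison, which is easier to trust than reading grep output host by host.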

Start the vCloud Director Cell service

On each cell run the following commands

To start:

# service vmware-vcd start

Check the status after :

# service vmware-vcd status

Log into VCD; by now the inventory service should be syncing with the underlying vCenter. I would advise waiting for it to complete, then running some sanity checks (provision orgs, edges, upgrade edges, etc.).

Get VM Tools Version with VMware’s PowerCLI

May 16, 2017 CaptainvOPs

I had an engineer visit me the other day asking if there was an automated way to get the current version of VMtools running for a set of virtual machines, and in this case, it was for a particular customer running in our vCenter. I said there most certainly was using PowerCLI.

Depending on the size of the environment, the first option here may be sufficient, although it can be an “expensive” query, as I’ve noticed it takes longer to return results. Using PowerCLI, you can connect to the desired vCenter and run the following one-liner to return output on the console. Here I was looking for a specific customer in vCloud Director, so in the vCenter I located the customer’s folder containing the VMs. Replace the ‘foldername’ between the asterisks with the desired folder of VMs. This command also works in a normal vCenter.

Get-Folder -name *foldername* | get-vm | get-vmguest | select VMName, ToolsVersion | FT -autosize

In this example, the folder holds a mix of virtual machines: some running, some not (no ToolsVersion value returned), and a mix of VMtools versions.

What if you just wanted a list of all virtual machines in the vCenter, the whole jungle?

 Get-Datacenter -Name "datacentername" | get-vm | get-vmguest | select VMName, ToolsVersion | FT -autosize

In either case, if you want to redirect output to a CSV add the following to the end of the line

 | export-csv -path "\path\to\file\filename.csv" -NoTypeInformation -UseCulture

Example:

Get-Folder -name *foldername* | get-vm | get-vmguest | select VMName, ToolsVersion | export-csv -path "\path\to\file\filename.csv" -NoTypeInformation -UseCulture
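
Once the CSV exists, a quick tally of how many VMs run each tools version is often what the requester actually wants. The sketch below is a minimal Python example, assuming the file has the VMName and ToolsVersion columns selected above. The function name and the sample rows are hypothetical. Because -UseCulture writes the list separator of the local culture, the delimiter is a parameter rather than hard-coded.

```python
import csv
import io
from collections import Counter


def tools_version_counts(csv_text, delimiter=","):
    """Count VMs per ToolsVersion; blank values are grouped under one label."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    return Counter(row["ToolsVersion"] or "(none reported)" for row in reader)


# Hypothetical rows in the shape of the exported file.
sample = (
    '"VMName","ToolsVersion"\n'
    '"vm-a","10.0.9"\n'
    '"vm-b",""\n'
    '"vm-c","10.0.9"\n'
)

counts = tools_version_counts(sample)
for version, total in counts.most_common():
    print(version, total)
```

For a real export, read the file contents with open() and pass a semicolon as the delimiter if that is what your culture setting produced.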

Another method of getting the tools version, and probably the fastest, is using ‘Get-View’. It is a much longer string of cmdlets, but this would be the ideal method for large environments if a quick return of data is needed, let’s say for a nightly script that needs to be least impactful to the vCenter.

 Get-Folder -name *foldername* | Get-VM | % { get-view $_.id } | select name, @{Name="ToolsVersion"; Expression={$_.config.tools.toolsversion}}, @{Name="ToolStatus"; Expression={$_.Guest.ToolsVersionStatus}}


If you are after a list of all virtual machines running in the vCenter, a command similar to this can be used:

 Get-VM | % { get-view $_.id } | select name, @{Name="ToolsVersion"; Expression={$_.config.tools.toolsversion}}, @{Name="ToolStatus"; Expression={$_.Guest.ToolsVersionStatus}}

VMware has put together a nice introductory blog on using get-view HERE

Just like last time, if you want to redirect output to a CSV file, just tack the following on to the end of the line for either method, i.e. a specific folder or the entire vCenter:

 | export-csv -path "\path\to\file\filename.csv" -NoTypeInformation -UseCulture

ESXi host fails to upgrade from 5.5 Update 3 to 6 Update 2

November 15, 2016 (updated December 12, 2016) CaptainvOPs

This happened to me today, and I thought it was worth sharing. Most of the hosts in this particular cluster upgraded fine to ESXi 6u2 from ESXi 5.5u3, with the exception of this one host. Update Manager kept giving me the error “Cannot run upgrade script on host” in the vCenter Recent Tasks pane.


A quick Google search brought me to KB article 2007163, but after following the KB I wasn’t able to find the referenced error “Remediation failed due to non mode failure” on the Update Manager server (Win2008) in the C:\AppData\VMware\Update Manager\Logs\vmware-vum-server-log4cpp.log file.

I started an SSH session to the ESXi host, but wasn’t able to find an entry similar to the error “OSError: [Errno 39] Directory not empty:” in the /var/log/vua.log file.

I instead found this error:

—————————————————————-

[FFD0D8C0 error 'Default'] Alert:WARNING: This application is not using QuickExit().
The exit code will be set to 0.@ bora/vim/lib/vmacore/main/service.cpp:147
--> Backtrace:
--> backtrace[00] rip 1bc228c3 Vmacore::System::Stacktrace::CaptureFullWork(unsigned int)

—————————————————————–

By chance, I happened to check the free space on the ESXi host with df -h and found a partition that was 100% full.

So I changed directory to it (# cd /storage/core/……./), where I found two more directories, /var/core/. Using ls to list the directory, I found two zdumps.

I deleted the two zdumps, and then checked the space again with df -h.

Seeing that the partition was now 68% utilized instead of 100%, I attempted the ESXi 6u2 upgrade again, this time with success.
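
Since a full partition was the real culprit, a pre-upgrade check for nearly full filesystems is cheap insurance. Below is a minimal Python sketch that parses df -h style text and flags anything at or above a threshold. It assumes the usual df layout with a “Use%” column followed by the mount point. The function names, the 90% default, and the mount paths in the sample are hypothetical.

```python
def parse_df(text):
    """Return (mount point, percent used) pairs from df -h style output."""
    usage = []
    for line in text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        for index, field in enumerate(fields):
            # The usage column is the one that looks like "68%".
            if field.endswith("%") and field[:-1].isdigit():
                usage.append((" ".join(fields[index + 1:]), int(field[:-1])))
                break
    return usage


def over_threshold(usage, threshold=90):
    """Keep only the filesystems at or above the threshold."""
    return [(mount, pct) for mount, pct in usage if pct >= threshold]


# Illustrative sample: the mount paths are placeholders, not real volume names.
sample = """Filesystem   Size   Used Available Use% Mounted on
vfat       285.8M 285.8M      0.0B 100% /vmfs/volumes/example-core
vfat       249.7M 170.0M     79.7M  68% /vmfs/volumes/example-bootbank
"""

usage = parse_df(sample)
print(over_threshold(usage))
```

Running a check like this against every host before kicking off remediation would have pointed at the full partition long before the upgrade script failed.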

How to evacuate virtual machines from one host to another with PowerCLI

October 28, 2016 (updated October 30, 2016) CaptainvOPs

If you ever find yourself needing to evacuate an ESXi host, there is a handy PowerCLI command that can do just that, and it maintains the resource pools for the virtual machines too. This was used with vCenter 6u2, PowerCLI 6.3 R1, and ESXi 5.5.

– – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – –

Connect-VIServer <vcenter-name/ip>

Get-VM -Location "<sourcehost>" | Move-VM -Destination (Get-VMHost "<destinationhost>")

– – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – –

I recently went through an outage that affected our ability to put hosts into maintenance mode, as the vMotion operation would get stuck at the vCenter level at 13%, with no indication at the host level that something was happening. This PowerCLI command allowed me to evacuate one host’s virtual machines onto another getting me through all 18 hosts in the cluster.
