Tag: vCSA
Failure Installing NSX VIB Module On ESXi Host: VIB Module For Agent Is Not Installed On Host
Now admittedly, I did this to myself while tracking down the root cause of how operations engineers were putting hosts back into production clusters without properly functioning VXLAN. Apparently the easiest way to get a host into this state is to repeatedly move it between a production cluster and an isolation cluster where the NSX VIB module is uninstalled. This is a bug that is resolved in vCenter 6 Update 3, so at least there’s that little nugget of good news.
Current production setup:
- NSX: 6.2.8
- ESXi: 6.0.0 build-4600944 (Update 2)
- VCSA: 6 Update 2
- VCD: 8.20
So for this particular error, I was seeing the following in vCenter events: “VIB Module For Agent Is Not Installed On Host”. Searching the KB articles, I came across KB2053782, “‘Agent VIB module not installed’ when installing EAM/VXLAN Agent using VUM”. Following the KB, I made sure my Update Manager was in working order and worked through its steps, but I still had the same issue.
- Investigated the EAM.log and found the following:
2018-01-12T17:48:27.785Z | ERROR | host-7361-0 | VibJob.java | 761 | Unhandled response code: 99
2018-01-12T17:48:27.785Z | ERROR | host-7361-0 | VibJob.java | 767 | PatchManager operation failed with error code: 99 With VibUrl: https://172.20.4.1/bin/vdn/vibs-6.2.8/6.0-5747501/vxlan.zip
2018-01-12T17:48:27.785Z | INFO | host-7361-0 | IssueHandler.java | 121 | Updating issues: eam.issue.VibNotInstalled { time = 2018-01-12 17:48:27,785, description = 'XXX uninitialized', key = 175, agency = 'Agency:7c3aa096-ded7-4694-9979-053b21297a0f:669df433-b993-4766-8102-b1d993192273', solutionId = 'com.vmware.vShieldManager', agencyName = '_VCNS_159_anqa-1-zone001_VMware Network Fabri', solutionName = 'com.vmware.vShieldManager', agent = 'Agent:f509aa08-22ee-4b60-b3b7-f01c80f555df:669df433-b993-4766-8102-b1d993192273', agentName = 'VMware Network Fabric (89)',
- Investigated the esxupdate.log file and found the following:
bba9c75116d1:669df433-b993-4766-8102-b1d993192273')), com.vmware.eam.EamException: VibInstallationFailed
2018-01-12T17:48:25.416Z | ERROR | agent-3 | AuditedJob.java | 75 | JOB FAILED: [#212229717] EnableDisableAgentJob(AgentImpl(ID:'Agent:c446cd84-f54c-4103-a9e6-fde86056a876:669df433-b993-4766-8102-b1d993192273')), com.vmware.eam.EamException: VibInstallationFailed
2018-01-12T17:48:27.821Z | ERROR | agent-2 | AuditedJob.java | 75 | JOB FAILED: [#1294923784] EnableDisableAgentJob(AgentImpl(ID:'Agent:f509aa08-22ee-4b60-
- Restarting the VUM services didn’t work, as the VIB installation would still fail.
- Restarting the host didn’t work.
- On the ESXi host I ran the following command to determine if any VIBs were installed, but it didn’t return any information:
# esxcli software vib list
I started to suspect that the ESXi host might have corrupted files. Digging around a little more, I found KB2122392, “Troubleshooting vSphere ESX Agent Manager (EAM) with NSX”, and KB2075500, “Installing VIB fails with the error: Unknown command or namespace software vib install”. These pointed me toward manually installing the three NSX VIBs:
- esx-vxlan
- esx-vsip
- esx-dvfilter-switch-security
I tried installing them manually, but got errors indicating corrupted files on the ESXi host. I had to run the following commands first to restore the corrupted image database. **CAUTION: THE HOST NEEDS TO BE REBOOTED AFTER THESE TWO COMMANDS**:
- # mv /bootbank/imgdb.tgz /bootbank/imgdb.gz.bkp
- # cp /altbootbank/imgdb.tgz /bootbank/imgdb.tgz
- # reboot
Once the host came back up, I continued with the manual VIB installation. All three NSX VIBs installed successfully, the host now shows a healthy status in NSX host preparation, and Guest Introspection (GI) installed successfully.
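For reference, the manual install went roughly along these lines; this is a minimal sketch, the bundle URL and version string come from the EAM log above, and your NSX Manager IP and version path will differ:
- # esxcli software vib install --no-sig-check -d https://172.20.4.1/bin/vdn/vibs-6.2.8/6.0-5747501/vxlan.zip
- # esxcli software vib list | grep -E 'esx-vxlan|esx-vsip|esx-dvfilter-switch-security'
The first command points esxcli directly at the offline bundle EAM was trying to push; the second confirms all three NSX VIBs are now present on the host.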

Upgrading NSX from 6.2.4 to 6.2.8 In a vCloud Director 8.10.1 Environment
We use NSX to serve up the edges in a vCloud Director environment currently running on 8.10.1. One important caveat to note here is that once you upgrade an NSX 6.2.4 appliance in this configuration, you will no longer be able to redeploy the edges in vCD until you first upgrade and redeploy each edge in NSX. Then and only then will subsequent redeploys in vCD work. The cool thing, though, is that VMware finally has a decent error message in vCD: if you try to redeploy an edge before upgrading it in NSX, you’ll see an error message similar to:
—————————————————————————————————————–
“[ 5109dc83-4e64-4c1b-940b-35888affeb23] Cannot redeploy edge gateway (urn:uuid:abd0ae80) com.vmware.vcloud.fabric.nsm.error.VsmException: VSM response error (10220): Appliance has to be upgraded before performing any configuration change.”
—————————————————————————————————————–
Now we get to the fun part – The Upgrade…
A little prep work goes a long way:
- If you have a support contract with VMware, I HIGHLY RECOMMEND opening a support request and detailing your upgrade plans with GSS, along with the date of the upgrade. This allows VMware to have a resource available in case the upgrade goes sideways.
- Make a clone of the appliance in case you need to revert (keep powered off)
- Set DRS to Manual on the host clusters where the vCloud Director environment/cloud VMs run (keeps VMs/edges stationed in place during the upgrade)
- Disable HA
- Do a manual backup of NSX manager in the appliance UI
Shut down the vCloud Director Cell service
- It is highly advisable to stop the vCD service on each of the cells in order to prevent clients in vCloud Director from making changes during the scheduled outage/maintenance. SSH to each vCD cell and run the following in each console session:
# service vmware-vcd stop
- A good rule of thumb is to now check the status of each cell to make sure the service has been disabled. Run this command in each cell console session:
# service vmware-vcd status
- For more information on these commands, please visit the following VMware KB article: KB1026310
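- If you want a more graceful stop, the cell management tool can quiesce a cell (let active jobs finish and refuse new ones) before shutting it down. A minimal sketch, assuming the default vCloud Director install path, run on each cell:
# /opt/vmware/vcloud-director/bin/cell-management-tool -u administrator cell --quiesce true
# /opt/vmware/vcloud-director/bin/cell-management-tool -u administrator cell --status
# /opt/vmware/vcloud-director/bin/cell-management-tool -u administrator cell --shutdown
The --status call lets you confirm there are no active jobs left before issuing the shutdown.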
Upgrading the NSX appliance to 6.2.8
- Log into NSX manager and the vCenter client
- Navigate to Manage→ Upgrade
- Click ‘upgrade’ button
- Click the ‘Choose File’ button
- Browse to upgrade bundle and click open
- Click the ‘Continue’ button; the upgrade bundle will be uploaded and installed.
- You will be prompted whether you would like to enable SSH and join the Customer Experience Improvement Program
- Verify the upgrade version, and click the upgrade button.
- The upgrade process will automatically reboot the NSX Manager VM in the background; having the console open will show this. Don’t trust the ‘uptime’ displayed in vCenter for the VM.
- Once the reboot has completed, the GUI will come up quickly, but it will take a while for the NSX management services to change to the running state. Give the appliance 10 minutes or so to come back up, and take the time now to verify the NSX version. If using Guest Introspection, wait until the red flags/alerts clear on the hosts before proceeding.
- In the vSphere web client, make sure you see ‘Networking & Security’ on the left side. If it does not show up, you may need to SSH into the vCenter appliance and restart the web client service; otherwise continue to step 12.
# service vsphere-client restart
12. In the vSphere web client, go to Networking & Security -> Installation and select the Management tab. You have the option to select your controllers and download a controller snapshot. Otherwise, click the “Upgrade Available” link.
13. Click ‘Yes’ to upgrade the controllers. Sit back and relax. This part can take up to 30 minutes. You can click the page refresh in order to monitor progress of the upgrades on each controller.
14. Once the upgrade of the controllers has completed, SSH into each controller and run the following in the console to verify it indeed has a connection back to the appliance:
# show control-cluster status
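On a healthy controller the output looks roughly like the sketch below; the status strings are what matter, and timestamps/UUIDs will obviously differ:
Join status: Join complete
Majority status: Connected to cluster majority
Restart status: This controller can be safely restarted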
15. On the ESXi hosts/blades in each chassis, I would run this command just as a sanity check to spot any NSX controller connection issues.
esxcli network ip connection list | grep 1234
- If all controllers are connected, each one should show up in your output as an ESTABLISHED connection on port 1234, similar to the illustrative example below.
- If the controllers are not in a healthy state, those connections will be missing or stuck in a non-established state. If that is the case, you can first try rebooting the affected controller. If that doesn’t work…..weep in silence. Then call VMware using the SR I strongly suggested creating before the upgrade, and GSS or your TAM can get you squared away.
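Purely as an illustration (addresses, local ports, and world IDs here are made up), healthy output from the command above looks something like this, with one netcpa connection per controller:
tcp   0   0   192.168.1.21:23488   192.168.2.11:1234   ESTABLISHED   36754   newreno   netcpa-worker
tcp   0   0   192.168.1.21:23489   192.168.2.12:1234   ESTABLISHED   36754   newreno   netcpa-worker
tcp   0   0   192.168.1.21:23490   192.168.2.13:1234   ESTABLISHED   36754   newreno   netcpa-worker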
16. Now in the vSphere web client, if you go back to Networking & Security -> Installation -> Host Preparation, you will see that there is an upgrade available for the clusters. Depending on the size of your environment, you may choose to do the upgrade now or at a later time outside of the planned outage. Either way, click the target cluster’s ‘Upgrade Available’ link and select Yes. Reboot one host at a time so the VIBs are installed in a controlled fashion; if you simply click Resolve, each host will attempt to go into maintenance mode and reboot.
17. After the new VIBs have been installed on each host, run the following command to be sure they are at the new VIB version:
# esxcli software vib list | grep -E 'esx-dvfilter|vsip|vxlan'
Start the vCloud Director Cell service
- On each cell run the following commands
To start:
# service vmware-vcd start
Check the status afterwards:
# service vmware-vcd status
- Log into vCD; by now the inventory service should be syncing with the underlying vCenter. I would advise waiting for it to complete, then running some sanity checks (provision orgs, edges, upgrade edges, etc.)
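- If you want to confirm each cell has fully initialized before logging back in, you can also tail the cell log (path assumes a default install) and watch for the application initialization messages to report completion:
# tail -f /opt/vmware/vcloud-director/logs/cell.log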
Creating, Listing and Removing VM Snapshots with PowerCLI and PowerShell
PowerCLI + PowerShell Method
-=Creating snapshots=-
Let’s say you are doing maintenance and need a quick way to snapshot certain VMs in vCenter. The create_snapshot.ps1 PowerShell script does just that, and it can be called from PowerCLI.
- Open PowerCLI and connect to the desired vCenter
- From the directory that you have placed the create_snapshot.ps1 script, run the command and watch for output.
> .\create_snapshot.ps1 -vm <vm-name>,<vm-name> -name snapshot_name
In the vCenter recent tasks window, you’ll see the snapshot creation tasks kick off for each VM you passed in.
-=Removing snapshots=-
Once you are ready to remove the snapshots, the remove_snapshot.ps1 PowerShell script does just that.
- Once you are logged into the vCenter through PowerCLI as before, run the command from the directory where you have placed the remove_snapshot.ps1 script and watch for output.
> .\remove_snapshot.ps1 -vm xx01-vmname,xx01-vmname -name snapshot_name
In the vCenter recent tasks window, you’ll see the corresponding snapshot removal tasks for each VM.
Those two PowerShell scripts can be found here:
create_snapshot.ps1 and remove_snapshot.ps1
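If you’d rather not download them, here is a minimal sketch of what create_snapshot.ps1 might look like; the parameter names match the calls above, the actual script at the link may differ, and remove_snapshot.ps1 is the same shape with Get-Snapshot piped to Remove-Snapshot inside the loop:
param(
    [Parameter(Mandatory=$true)][string[]]$vm,
    [Parameter(Mandatory=$true)][string]$name
)
foreach ($v in $vm) {
    # Create the named snapshot on each VM passed in via -vm
    Get-VM -Name $v | New-Snapshot -Name $name -Confirm:$false
}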
_________________________________________________________________
PowerCLI Method
-=Creating snapshots=-
The PowerCLI New-Snapshot cmdlet allows the creation of snapshots in similar fashion, and there’s no need to call a PowerShell script; however, it can be slower.
> get-vm an01-jump-win1,an01-1-automate | new-snapshot -Name "cbtest" -Description "testing" -Quiesce -Memory
- If the VM is running and it has VMware Tools installed, you can opt for a quiesced snapshot with the -Quiesce parameter. This has the effect of saving the virtual disks in a consistent state.
- If the virtual machine is running, you can also elect to save the memory state as well with the -Memory parameter
- You can also add the -RunAsync parameter so the command returns immediately instead of waiting for each snapshot to finish
Keep in mind using these options increases the time required to take the snapshot, but it should put the virtual machine back in the exact state if you need to restore back to it.
-=Listing Snapshots=-
If you need to check the vCenter for any VMs that have snapshots, the Get-Snapshot cmdlet lets you do that. You can also pipe to cmdlets like Format-List to make the output easier to read.
> Get-vm | get-snapshot | format-list vm,name,created
Other properties you can include are: Description, Created, Quiesced, PowerState, VM, VMId, Parent, ParentSnapshotId, ParentSnapshot, Children, SizeMB, IsCurrent, IsReplaySupported, ExtensionData, Id, Name, and Uid. See the size report example below.
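For example, a quick report that uses a few of those properties to surface the largest snapshots first:
> Get-VM | Get-Snapshot | Sort-Object SizeMB -Descending | Select-Object VM,Name,Created,SizeMB | Format-Table -AutoSize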
-=Removing snapshots=-
The PowerCLI Remove-Snapshot cmdlet does just that, and used in combination with the Get-Snapshot cmdlet it looks something like this.
> get-snapshot -name cbtest -VM an01-jump-win1,an01-1-automate | remove-snapshot -RunAsync -confirm:$false
- If you don’t want to be prompted, include -Confirm:$false.
- Removing a snapshot can be a long process, so you might want to take advantage of the -RunAsync parameter again.
- Some snapshots may have child snapshots if you are taking many during a maintenance, so you can also use -RemoveChildren to clean those up as well, as in the example below.
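Combining those options (using the same VM and snapshot names as above) looks like this:
> get-snapshot -name cbtest -VM an01-jump-win1,an01-1-automate | remove-snapshot -RemoveChildren -RunAsync -confirm:$false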
NSX Host VIB Upgrade From 6.1.X to 6.2.4
There is a known issue when upgrading the NSX host VIBs from 6.1.x to 6.2.4: once a host is upgraded to VIB 6.2.4 and virtual machines are moved to it, if they should somehow find their way back to a 6.1.x host, the VMs’ NICs will become disconnected, causing an outage. This is outlined in KB2146171.
Resolution
We found the following steps to be the best solution in getting to the 6.2.4 NSX VIB version on ESXi 6u2, without causing any interruptions in regards to the network connectivity of the virtual machines.
- Log into the vSphere web client, go to Networking & Security, select Installation on the navigation menu, and then select the Host preparation tab.
- Select the desired cluster, and click the “Upgrade Available” message next to it. This will start the upgrade process of all the hosts, and once completed, all hosts will display “Reboot Required”.
- Mark the first host for maintenance mode as you normally would, and once all virtual machines have evacuated and the host is in maintenance mode, restart it as you normally would.
- While we wait for the host to reboot, right-click the host cluster being upgraded and select Edit Settings. Select vSphere DRS and set the automation level to Manual. This gives you control over host evacuations and where the virtual machines go.
- Once the host has restarted, monitor the Recent Tasks window and wait for the NSX vib installation to complete.
- Bring the host out of maintenance mode. Now migrate a test VM over to the new host and test network connectivity. Ping to another VM on a different host, and then make sure you can ping out to something like 8.8.8.8.
- Verify the VIB has been upgraded to 6.2.4 from the vSphere web Networking & Security host preparation section.
- Open PowerCLI and connect to the vCenter where this maintenance activity is being performed. In order to safely control the migration of virtual machines from hosts running NSX VIB 6.1.x to the host that has been upgraded to 6.2.4, we will use the following command to evacuate the next host’s virtual machines onto the one that was just upgraded.
Get-VM -Location "<sourcehost>" | Move-VM -Destination (Get-Vmhost "<destinationhost>")
- “sourcehost” being the next host you wish to upgrade, and the “destinationhost” being the one that was just upgraded.
9. Once the host is fully evacuated, place the host in maintenance mode, and reboot it.
10. VMware provided us with a script that should ONLY be executed against NSX VIB 6.2.4 hosts, and it does the following:
- Verifies the VIB version running on the host.
For example, if the VIB version is between VIB_VERSION_LOW=3960641 and VIB_VERSION_HIGH=4259819, it is considered a host with VIB 6.2.3 or above; for any other VIB version the script will fail with a warning.
- The customer needs to make sure that the script is executed against ALL virtual machines that have been upgraded since 6.1.x.
- Once the script sets the export_version to 4, the version is persistent across reboots.
- There is no harm if customer executes the script multiple times on the same host as only VMs that need modification will be modified.
- The script should only be executed on NSX-v 6.2.4 hosts.
I have attached a ZIP file containing the script here: fix_exportversion.zip
Script Usage
- Copy the script to a common datastore accessible to all hosts and run the script on each host.
- Log in to the 6.2.4 ESXi host via ssh or CONSOLE, where you intend to execute the script.
- Make both files executable: chmod u+x fix_exportversion.sh vsipioctl
- Execute the script:
# /vmfs/volumes/<Shared_Datastore>/fix_exportversion.sh /vmfs/volumes/<Shared_Datastore>/vsipioctl
Example output:
~ # /vmfs/volumes/NFS-101/fix_exportversion.sh /vmfs/volumes/NFS-101/vsipioctl
Fixed filter nic-39377-eth0-vmware-sfw.2 export version to 4.
Fixed filter nic-48385-eth0-vmware-sfw.2 export version to 4.
Filter nic-50077-eth0-vmware-sfw.2 already has export version 4.
Filter nic-52913-eth0-vmware-sfw.2 already has export version 4.
Filter nic-53498-eth0-vmware-sfw.2 has export version 3, no changes required.
Note: If the export version for any VM vNIC shows up as ‘2’, the script will modify the version to ‘4’ and does not modify other VMs where export version is not ‘2’.
11. Repeat steps 5 through 10 on all hosts in the cluster until completion. This script appears to be necessary, as we have seen cases where a VM may still lose its NIC even if it is vMotioned from one NSX VIB 6.2.4 host to another 6.2.4 host.
12. Once the 6.2.4 host VIB installation is complete and the script has been run against the hosts and the virtual machines running on them, DRS can be set back to your desired setting, Fully Automated for instance.
13. Virtual machines should now be able to vmotion between hosts without losing their NICs.
- This process was thoroughly tested in a vCloud Director cloud environment containing over 20,000 virtual machines, and on roughly 240 ESXi hosts without issue. vCenter environment was vCSA version 6u2, and ESXi version 6u2.
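- As an extra per-host sanity check after the VIB upgrade, you can also verify VTEP-to-VTEP connectivity directly from the host. A minimal sketch, assuming the standard NSX VXLAN netstack and a 1600-byte VTEP MTU (replace the remote VTEP IP with one from another upgraded host):
# vmkping ++netstack=vxlan -d -s 1572 <remote-VTEP-IP>
The -d flag sets don't-fragment and the 1572-byte payload exercises the full VXLAN MTU, so a clean reply means both the control plane fix and the transport network are healthy.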
vMotion Fails at 67% on ESXi 6 in vCenter 6
I came across an interesting error the other night while on call: a host in a cluster that VMs could not vMotion off of, either manually or through DRS. I was seeing the following error message in vSphere:
The source detected that the destination failed to resume.
vMotion migration [-1062731518:1481069780557682] failed: remote host <192.168.1.2> failed with status Failure.
vMotion migration [-1062731518:1481069780557682] failed to asynchronously receive and apply state from the remote host: Failure.
Failed waiting for data. Error 195887105. Failure.
- While tailing the hostd.log on the source host, I was seeing the following error:
2016-12-09T19:44:40.373Z warning hostd[2B3E0B70] [Originator@6876 sub=Libs] ResourceGroup: failed to instantiate group with id: -591724635. Sysinfo error on operation returned status : Not found. Please see the VMkernel log for detailed error information
- While tailing the hostd.log on the destination host, I was seeing the following error:
2016-12-09T19:44:34.330Z info hostd[33481B70] [Originator@6876 sub=Snmpsvc] ReportVMs: processing vm 223
2016-12-09T19:44:34.330Z info hostd[33481B70] [Originator@6876 sub=Snmpsvc] ReportVMs: serialized 36 out of 36 vms
2016-12-09T19:44:34.330Z info hostd[33481B70] [Originator@6876 sub=Snmpsvc] GenerateFullReport: report file /tmp/.vm-report.xml generated, size 915 bytes.
2016-12-09T19:44:34.330Z info hostd[33481B70] [Originator@6876 sub=Snmpsvc] PublishReport: file /tmp/.vm-report.xml published as /tmp/vm-report.xml
2016-12-09T19:44:34.330Z info hostd[33481B70] [Originator@6876 sub=Snmpsvc] NotifyAgent: write(33, /var/run/snmp.ctl, V) 1 bytes to snmpd
2016-12-09T19:44:34.330Z info hostd[33481B70] [Originator@6876 sub=Snmpsvc] GenerateFullReport: notified snmpd to update vm cache
2016-12-09T19:44:34.330Z info hostd[33481B70] [Originator@6876 sub=Snmpsvc] DoReport: VM Poll State cache - report completed ok
2016-12-09T19:44:40.317Z warning hostd[33081B70] [Originator@6876 sub=Libs] ResourceGroup: failed to instantiate group with id: 727017570. Sysinfo error on operation returned status : Not found. Please see the VMkernel log for detailed error information
- While tailing the vmkernel.log on the destination host, I was seeing the following error:
2016-12-09T19:44:22.000Z cpu21:35086 opID=b5686da8)World: 15516: VC opID AA8C46D5-0001C9C0-81-91-cb-a544 maps to vmkernel opID b5686da8
2016-12-09T19:44:22.000Z cpu21:35086 opID=b5686da8)Config: 681: "SIOControlFlag2" = 1, Old Value: 0, (Status: 0x0)
2016-12-09T19:44:22.261Z cpu21:579860616)World: vm 579827968: 1647: Starting world vmm0:oats-agent-2_(e00c5327-1d72-4aac-bc5e-81a10120a68b) of type 8
2016-12-09T19:44:22.261Z cpu21:579860616)Sched: vm 579827968: 6500: Adding world 'vmm0:oats-agent-2_(e00c5327-1d72-4aac-bc5e-81a10120a68b)', group 'host/user/pool34', cpu: shares=-3 min=0 minLimit=-1 max=4000, mem: shares=-3 min=0 minLimit=-1 max=1048576
2016-12-09T19:44:22.261Z cpu21:579860616)Sched: vm 579827968: 6515: renamed group 5022126293 to vm.579860616
2016-12-09T19:44:22.261Z cpu21:579860616)Sched: vm 579827968: 6532: group 5022126293 is located under group 5022124087
2016-12-09T19:44:22.264Z cpu21:579860616)MemSched: vm 579860616: 8112: extended swap to 46883 pgs
2016-12-09T19:44:22.290Z cpu20:579860616)Migrate: vm 579827968: 3385: Setting VMOTION info: Dest ts = 1481312661276391, src ip = <192.168.1.2> dest ip = <192.168.1.17> Dest wid = 0 using SHARED swap
2016-12-09T19:44:22.293Z cpu20:579860616)Hbr: 3394: Migration start received (worldID=579827968) (migrateType=1) (event=0) (isSource=0) (sharedConfig=1)
2016-12-09T19:44:22.332Z cpu0:33670)MigrateNet: vm 33670: 2997: Accepted connection from <::ffff:192.168.1.2>
2016-12-09T19:44:22.332Z cpu0:33670)MigrateNet: vm 33670: 3049: data socket size 0 is less than config option 562140
2016-12-09T19:44:22.332Z cpu0:33670)MigrateNet: vm 33670: 3085: dataSocket 0x430610ecaba0 receive buffer size is 562140
2016-12-09T19:44:22.332Z cpu0:33670)MigrateNet: vm 33670: 2997: Accepted connection from <::ffff:192.168.1.2>
2016-12-09T19:44:22.332Z cpu0:33670)MigrateNet: vm 33670: 3049: data socket size 0 is less than config option 562140
2016-12-09T19:44:22.332Z cpu0:33670)MigrateNet: vm 33670: 3085: dataSocket 0x4306110fab70 receive buffer size is 562140
2016-12-09T19:44:22.332Z cpu0:33670)VMotionUtil: 3995: 1481312661276391 D: Stream connection 1 added.
2016-12-09T19:44:24.416Z cpu1:32854)elxnet: elxnet_allocQueueWithAttr:4255: [vmnic0] RxQ, QueueIDVal:2
2016-12-09T19:44:24.416Z cpu1:32854)elxnet: elxnet_startQueue:4383: [vmnic0] RxQ, QueueIDVal:2
2016-12-09T19:44:24.985Z cpu12:579860756)VMotionRecv: 658: 1481312661276391 D: Estimated network bandwidth 471.341 MB/s during pre-copy
2016-12-09T19:44:24.994Z cpu4:579860755)VMotionSend: 4953: 1481312661276391 D: Failed to receive state for message type 1: Failure
2016-12-09T19:44:24.994Z cpu4:579860755)WARNING: VMotionSend: 3979: 1481312661276391 D: failed to asynchronously receive and apply state from the remote host: Failure.
2016-12-09T19:44:24.994Z cpu4:579860755)WARNING: Migrate: 270: 1481312661276391 D: Failed: Failure (0xbad0001) @0x4180324c6786
2016-12-09T19:44:24.994Z cpu4:579860755)WARNING: VMotionUtil: 6267: 1481312661276391 D: timed out waiting 0 ms to transmit data.
2016-12-09T19:44:24.994Z cpu4:579860755)WARNING: VMotionSend: 688: 1481312661276391 D: (9-0x43ba40001a98) failed to receive 72/72 bytes from the remote host <192.168.1.2>: Timeout
2016-12-09T19:44:24.998Z cpu4:579860616)WARNING: Migrate: 5454: 1481312661276391 D: Migration considered a failure by the VMX. It is most likely a timeout, but check the VMX log for the true error.
We are using the vSphere Distributed Switch in our environment, and each host has a vmk dedicated to vMotion traffic only, so this was my first test: I verified the IP and subnet of the vMotion vmk on the source and destination hosts, successfully pinged the destination host using vmkping, and tested the connection both ways.
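For reference, that check is just a vmkping sourced from the vMotion vmk on each side; the vmk number and destination IP below are placeholders for whatever your hosts use:
# vmkping -I vmk2 <destination-vMotion-IP>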
The second test was to power off a VM and test its ability to vMotion off of the host; this worked. When I powered the VM back on, it immediately went back to the source host. I then tried to vMotion the VM again while it was powered on, from the affected source host to the destination host like before, and to my surprise it worked this time. I tested this process with a few other VMs for consistency. I also tried restarting a VM on the affected host and then moving it off to another host, but this did not work.
My final test was to vMotion a VM from a different host to the affected host. This worked as well, and I was even able to vMotion off from that affected host again.
In our environment we have a Trend Micro agent VM and a GI VM running on each host. I logged into the vSphere web client to look at the status of the Trend Micro VM and there was no indication of an error; I found the same when checking the GI VM.
Knowing we have had issues with Trend-micro in the past, I powered down the Trend-micro VM running on the host, and attempted a manual vMotion of a test VM I knew couldn’t move before – IT WORKED. Tried another with the same result. Put the host into maintenance mode to try and evacuate the rest of the customer VMs off from it with success!
To wrap all of this up: the Trend Micro agent VM running on the ESXi 6 host was preventing other VMs from vMotioning off, either manually or through DRS. Once the Trend Micro agent VM was powered off, I was able to evacuate the host.