Failure Adding an Additional Node to vRealize Operations Manager Due to Expired Certificate

The Issue:

Unable to add additional nodes to cluster.  This error happened while adding an additional data and remote collector.  The cause ended up being a expired customer certificate, and surprisingly there was no noticeable mechanism such as a yellow warning banner in vROps UI to warn that a certificate had expired, or is about to expire.

Troubleshooting:

Log into the the new node being added, and tail the vcopsConfigureRoles.log

# tail -f /storage/vcops/log/vcopsConfigureRoles.log

You would see entries similar to:

2016-08-10 00:11:56,254 [22575] - root - WARNING - vc_ops_utilities - runHttpRequest - Open URL: 'https://localhost/casa/deployment/cluster/join?admin=172.22.3.14' returned reason: 
[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:581), exception: 
2016-08-10 00:11:56,254 [22575] - root - DEBUG - vcopsConfigureRoles - joinSliceToCasaCluster - Add slice to CaSA cluster response code: 9000
2016-08-10 00:11:56,254 [22575] - root - DEBUG - vcopsConfigureRoles - joinSliceToCasaCluster - Expected response code not found. Sleep and retry. 0 
2016-08-10 00:12:01,259 [22575] - root - INFO - vcopsConfigureRoles - joinSliceToCasaCluster - Add Cluster to slice response code: 9000 
2016-08-10 00:12:01,259 [22575] - root - INFO - vc_ops_logging - logInfo - Remove lock file: /usr/lib/vmware-vcopssuite/utilities/sliceConfiguration/conf/vcops-configureRoles.lck
2016-08-10 00:12:01,259 [22575] - root - DEBUG - vcopsPlatformCommon - programExit - Role State File to Update: '/usr/lib/vmware-vcopssuite/utilities/sliceConfiguration/data/roleState.properties'
2016-08-10 00:12:01,260 [22575] - root - DEBUG - vcopsPlatformCommon - UpdateDictionaryValue - Update section: "generalSettings" key: "failureDetected" with value: "true" file: "/usr/lib/vmware-vcopssuite/utilities/sliceConfiguration/data/roleState.properties"
2016-08-10 00:12:01,260 [22575] - root - DEBUG - vcopsPlatformCommon - loadConfigFile - Loading config file "/usr/lib/vmware-vcopssuite/utilities/sliceConfiguration/data/roleState.properties"
2016-08-10 00:12:01,261 [22575] - root - DEBUG - vcopsPlatformCommon - copyPermissionsAndOwner - Updating file permissions of '/usr/lib/vmware-vcopssuite/utilities/sliceConfiguration/data/roleState.properties.new' from 100644 to 100660
2016-08-10 00:12:01,261 [22575] - root - DEBUG - vcopsPlatformCommon - copyPermissionsAndOwner - Updating file ownership of '/usr/lib/vmware-vcopssuite/utilities/sliceConfiguration/data/roleState.properties.new' from 1000/1003 to 1000/1003
2016-08-10 00:12:01,261 [22575] - root - DEBUG - vcopsPlatformCommon - UpdateDictionaryValue - The key: failureDetected was updated 
2016-08-10 00:12:01,261 [22575] - root - DEBUG - vcopsPlatformCommon - programExit - Updated failure detected to true 
2016-08-10 00:12:01,261 [22575] - root - INFO - vcopsPlatformCommon - programExit - Exiting with exit code: 1, Add slice to CaSA Cluster failed. Response code: 9000.  Expected: 200

Resolution:

Step #1

Take snapshot of all vROps nodes

Step #2

Revert back to VMware’s default certificate on all nodes using the following kb article. KB2144949

Step #3

The custom cert files that need to be renamed on the nodes are located at /storage/vcops/user/conf/ssl.  This should be completed on all nodes.   Alternatively, you can remove them, but renaming them is sufficient.

# mv customCert.pem customCert.pem.BAK
# mv customChain.pem customChain.pem.BAK
# mv customKey.pem customKey.pem.BAK
# mv uploaded_cert.pem uploaded_cert.pem.BAK   

Step #4

Now attempt to add the new node again.  From the master node, you can watch the installation of the new node by tailing the casa.log

# tail -f /storage/vcops/log/casa/casa.log

Delete the snapshots as soon as possible.

  • To add a new custom certificate to the vRealize Operations Manager, follow this KB article: KB2046591

_________________________________________________________________

Alternative Solutions

  • There could be an old management pak installed that was meant for an older version of vROps.  This has been know to cause failures.  Follow this KB for more information: KB2119769

 

  • If you are attempting to add a node to the cluster using an IP address previously used, the operation may fail.  Follow this KB for more information: KB2147076

 

 

NSX Host VIB Upgrade From 6.1.X to 6.2.4

There is a known issue when upgrading the NSX host VIB from 6.1.X to 6.2.4, where once the host is upgraded to VIB 6.2.4, and the virtual machines are moved to it, if they should somehow find their way back to a 6.1.X host, the VM’s NIC will become disconnected causing an outage. This has been outlined in KB2146171

Resolution

We found the following steps to be the best solution in getting to the 6.2.4 NSX VIB version on ESXi 6u2, without causing any interruptions in regards to the network connectivity of the virtual machines.

  1. Log into the vSphere web client, go to Networking & Security, select Installation on the navigation menu, and then select the Host preparation tab.
  2. Select the desired cluster, and click the “Upgrade Available” message next to it.  This will start the upgrade process of all the hosts, and once completed, all hosts will display “Reboot Required”.
  3. Mark the first host for maintenance mode as you normally would, and once all virtual machines have evacuated off, and the host marked as in maintenance mode, restart it as you normally would.
  4. While we wait for the host to reboot, right click on the host cluster being upgraded and select Edit Settings.  Select vSphere DRS, and set the automation level to Manual.  This will give you control over host evacuations and where the virtual machines go.
  5. Once the host has restarted, monitor the Recent Tasks window and wait for the NSX vib installation to complete.
  6. Bring the host out of maintenance mode.  Now migrate a test VM over to the new host and test network connectivity.  Ping to another VM on a different host, and then make sure you can ping out to something like 8.8.8.8.
  7.  Verify the VIB has been upgraded to 6.2.4 from the vSphere web Networking & Security host preparation section.
  8. Open PowerCLI and connect to the vCenter where this maintenance activity is being performed.  In order to safely control the migration of virtual machines from hosts containing the NSX VIV 6.1.X to the host that has been upgraded to 6.2.4, we will use the following command to evacuate the next host’s virtual machines onto the one that was just upgraded.
Get-VM -Location "<sourcehost>" | Move-VM -Destination (Get-Vmhost "<destinationhost>")
  • “sourcehost” being the next host you wish to upgrade, and the “destinationhost” being the one that was just upgraded.

9.  Once the host is fully evacuated, place the host in maintenance mode, and reboot it.

10. VMware provided us with a script that should ONLY be executed against NSX vib 6.2.4 hosts, and does the following:

  • Verifies the VIB version running on the host.
    For example: If the VIB version is between VIB_VERSION_LOW=3960641, VIB_VERSION_HIGH=4259819 then it is considered to be a host with VIB 6.2.3 and above. Any other VIB version the script will fail with a warningCustomer needs to make sure that the script is executed against ALL virtual machines that have been upgraded since 6.1.x.
  • Once the script sets the export_version to 4, the version is persistent across reboots.
  • There is no harm if customer executes the script multiple times on the same host as only VMs that need modification will be modified.
  • Script should only be executed NSX-v 6.2.4 hosts

I have attached a ZIP file containg the script here:  fix_exportversion.zip

Script Usage

  • Copy the script to a common datastore accessible to all hosts and run the script on each host.
  • Log in to the 6.2.4 ESXi host via ssh or CONSOLE, where you intend to execute the script.
  • chmod u+x the files
  • Execute the script:
./vmfs/volumes/<Shared_Datastore>/fix_exportversion.sh /vmfs/volumes/<Shared_Datastore>/vsipioctl

 

Example output:

~ # /vmfs/volumes/NFS-101/fix_exportversion.sh /vmfs/volumes/NFS-101/vsipioctl
Fixed filter nic-39377-eth0-vmware-sfw.2 export version to 4.
Fixed filter nic-48385-eth0-vmware-sfw.2 export version to 4.
Filter nic-50077-eth0-vmware-sfw.2 already has export version 4.
Filter nic-52913-eth0-vmware-sfw.2 already has export version 4.
Filter nic-53498-eth0-vmware-sfw.2 has export version 3, no changes required.

Note: If the export version for any VM vNIC shows up as ‘2’, the script will modify the version to ‘4’ and does not modify other VMs where export version is not ‘2’.

11.  Repeat steps 5 – 10 on all hosts in the cluster until completion.  This script appears to be necessary as we have seen cases where a VM may still lose its NIC even if it is vmotioned from one NSX vib 6.2.4 host to another 6.2.4 host.

12. Once 6.2.4 host VIB installation is complete, and the script has been run against the hosts and virtual machines running on them, DRS can be set back to your desired setting like Fully automated for instance.

13.  Virtual machines should now be able to vmotion between hosts without losing their NICs.

  • This process was thoroughly tested in a vCloud Director cloud environment containing over 20,000 virtual machines, and on roughly 240 ESXi hosts without issue. vCenter environment was vCSA version 6u2, and ESXi version 6u2.

Upgrading A Large vRealize Operations Manager (vROps) Appliance Cluster

Upgrading a multi-node vROps cluster can bring significant downtime to the monitoring/data collection abilities of the cluster.  The largest production cluster I am responsible for consists of nine data nodes, including the master and master replica, and four remote collectors for our remote data centers.  If you recall my previous post Sizing and Installing The vROps Appliance, I discussed the various sizing options of a vROps cluster based on the data collected, and in my case this cluster is configured as LARGE due to the size of our vROps cluster.  One of the biggest challenges of maintaining a large cluster, that has remote collectors collecting from data centers in different geographical locations, is the ability to upgrade the cluster with minimal downtime. As it stands now, if I were to upgrade this cluster with the traditional methods VMware provided, I would be looking at a minimal downtime of eight hours.  VMware does offer a useful work around: How to reduce update time by pre-copying software update PAK files KB2127895, and we will be using that here.

But first, I wanted to introduce you to a script developed by a Jeremy McCoy, and his repository over at github called nakedhitman.  In there you will find this awesome script called vROps Cluster Repair that I have personally used many times, and was recommended to me by VMware’s GSS. This script is intended to bring the vROps cluster back to a known healthy state, and I like to run it before upgrading my Production vROps clusters.  You will want to familiarize yourself with that script, download and get it setup with your environment details.

Preparing for the Upgrade

  • First – Run the nakedhitman’s – vROps Cluster Repair script.  This will cause a brief outage (max 30 minutes) as services are stopped on each vROps node for cleanup.  *I recommend taking a snapshot of all vROps nodes beforehand just in case.  Once the cluster comes online and starts collecting data, delete those snapshots.
  • Second – Insure you have enough free space on the appliances to support the upgrade.
  • Third – Complete a basic heath-check of the appliance outlined in my post: vRealize Operations Manager (vROps) Health-Check.  While not necessary, I have personally had upgrades fail due to the issues found in this health-check.
  • Fourth – Complete Part 1 of VMware KB2127895 article to get the upgrade paks pre-staged on all nodes except the master.  No downtime required as this can be done live.  The benefit of using this KB us that you are essentially removing the time it takes for the cluster to copy the two pak files around during the upgrade process, which could take hours depending on the size of the environment.

 Upgrading The Appliance Cluster

  1. Snapshot the cluster and remote collectors.  Take the cluster offline from the master’s admin page https://<vrops>/admin.  The the cluster is offline, shutdown the vrops appliance nodes in order of remote collector, data nodes, master replica and lastly the master.  Snapshot the VMs, and then boot the master first, wait for it to fully come up to the appliance login screen, and then boot the master replica, datanodes and remote collectors last.
  2. Log back into the master appliance Admin page, but do not bring the the cluster online.
  3. On the left pane select the Software Update tab, and then click the Install a Software Update… button.
  4. Browse for the PAK file and select it.
  5. Installation options……
    1. DO NOT select the option “Install the PAK file even if it is already installed.” – Think of this as a force install. This is used if the original software update failed and you are attempting to try again. This option will ignore the pre-staged PAK files you placed earlier, and severely delay the upgrade as the cluster will now have to copy the PAK files around to each of the nodes.
    2. You have the option to “Reset out-of-the-box content, overwriting to a newer version provided by this update. Note, user modifications to out-of-the-box alerts, Symptoms, Recommendations and Policies will be overwritten.”
  6. Click Upload.
  7. Accept the license agreement.
  8. Click Next.
  9. The upgrade will now start. Sit back, and Relax! The upgrade can take hours to complete. There are 9 steps to this.

vrops45

  • Eventually you will need to log back into the admin page to monitor the progress of the upgrade.  Since  6.2, you can check the status of the upgrade by clicking the little notebook next to each node.  If there’s an issue detected like in the screen capture below, it may not stop the upgrade from progressing, but you should take notice.  VMware has even started included KB article links to help troubleshoot.

vrops46

  • There are two places to watch the upgrade on the master at the log level if you’d like:
     # tail -f /storage/vcops/log/pakManager/vcopsPakManager.root.apply_system_update.log

    –and–

     # tail -f /storage/vcops/log/pakManager/vcopsPakManager.root.query.log
  • Once the installation is complete and at steps 9 of 9, go back to the system status tab and verify the system state is online with the little green check. VMware engineers have said that at this point the upgrade has completed successfully, and it is safe to remove the snapshots.

vrops47

  • Should the upgrade fail, open a severity 1 SR with VMware asap.
  • If the sun is shining and the upgrade finishes, delete those snapshots, and enjoy all the upgrades/bug fixes the new release brings.

 

As a side note…

I have submitted a couple feature requests to VMware in order to ease the upgrade process of large vROps installations.

  1. For multi-data center environments: The ability to have smaller appliances in each data center, with a single search head appliance connected to the multiple data center vROps deployments.  The idea here would be a “single pane of glass” to see all data centers like you get if there is a single large muli-node vROps cluster, with multiple remote collectors. Having smaller deployments accessible by a single search head would allow for the ability to take one deployment down per data center at a time to upgrade it, dramatically reducing  the data outage, and upgrade time.
  2. The ability to deploy the latest vROps appliance, and import the data from the old like VMware does with the vCSA.  The idea here is that this would be another way to reduce the upgrade time, and reduce the outage occurred by upgrading the appliance.
  3. Tying #1 and #2 together, the ability to stand up a new appliance in said remote data center, and then export that data centers specific data from the main large cluster to the smaller deployment, or the ability to just stand up a new appliance and import the data from the old one.

vMotion fails at 67% on esxi 6 in vCenter 6.

Came across an interesting error the other night while on call, as I had a host in a cluster that VM’s could not vMotion off of either manually or through DRS.  I was seeing the following error message in vSphere:


The source detected that the destination failed to resume.

vMotion migration [-1062731518:1481069780557682] failed: remote host <192.168.1.2> failed with status Failure.
vMotion migration [-1062731518:1481069780557682] failed to asynchronously receive and apply state from the remote host: Failure.
Failed waiting for data. Error 195887105. Failure.


 

  • While tailing the host.d log on the source host I was seeing the following error:
2016-12-09T19:44:40.373Z warning hostd[2B3E0B70] [Originator@6876 sub=Libs] ResourceGroup: failed to instantiate group with id: -591724635. Sysinfo error on operation return ed status : Not found. Please see the VMkernel log for detailed error information

 

  • While tailing the host.d log on the destination host, I was seeing the following error:
2016-12-09T19:44:34.330Z info hostd[33481B70] [Originator@6876 sub=Snmpsvc] ReportVMs: processing vm 223
2016-12-09T19:44:34.330Z info hostd[33481B70] [Originator@6876 sub=Snmpsvc] ReportVMs: serialized 36 out of 36 vms
2016-12-09T19:44:34.330Z info hostd[33481B70] [Originator@6876 sub=Snmpsvc] GenerateFullReport: report file /tmp/.vm-report.xml generated, size 915 bytes.
2016-12-09T19:44:34.330Z info hostd[33481B70] [Originator@6876 sub=Snmpsvc] PublishReport: file /tmp/.vm-report.xml published as /tmp/vm-report.xml
2016-12-09T19:44:34.330Z info hostd[33481B70] [Originator@6876 sub=Snmpsvc] NotifyAgent: write(33, /var/run/snmp.ctl, V) 1 bytes to snmpd
2016-12-09T19:44:34.330Z info hostd[33481B70] [Originator@6876 sub=Snmpsvc] GenerateFullReport: notified snmpd to update vm cache
2016-12-09T19:44:34.330Z info hostd[33481B70] [Originator@6876 sub=Snmpsvc] DoReport: VM Poll State cache - report completed ok
2016-12-09T19:44:40.317Z warning hostd[33081B70] [Originator@6876 sub=Libs] ResourceGroup: failed to instantiate group with id: 727017570. Sysinfo error on operation returne d status : Not found. Please see the VMkernel log for detailed error information

 

  • While tailing the destination vmkernel.log host, I was seeing the following error:
2016-12-09T19:44:22.000Z cpu21:35086 opID=b5686da8)World: 15516: VC opID AA8C46D5-0001C9C0-81-91-cb-a544 maps to vmkernel opID b5686da8
2016-12-09T19:44:22.000Z cpu21:35086 opID=b5686da8)Config: 681: "SIOControlFlag2" = 1, Old Value: 0, (Status: 0x0)
2016-12-09T19:44:22.261Z cpu21:579860616)World: vm 579827968: 1647: Starting world vmm0:oats-agent-2_(e00c5327-1d72-4aac-bc5e-81a10120a68b) of type 8
2016-12-09T19:44:22.261Z cpu21:579860616)Sched: vm 579827968: 6500: Adding world 'vmm0:oats-agent-2_(e00c5327-1d72-4aac-bc5e-81a10120a68b)', group 'host/user/pool34', cpu: shares=-3 min=0 minLimit=-1 max=4000, mem: shares=-3 min=0 minLimit=-1 max=1048576
2016-12-09T19:44:22.261Z cpu21:579860616)Sched: vm 579827968: 6515: renamed group 5022126293 to vm.579860616
2016-12-09T19:44:22.261Z cpu21:579860616)Sched: vm 579827968: 6532: group 5022126293 is located under group 5022124087
2016-12-09T19:44:22.264Z cpu21:579860616)MemSched: vm 579860616: 8112: extended swap to 46883 pgs
2016-12-09T19:44:22.290Z cpu20:579860616)Migrate: vm 579827968: 3385: Setting VMOTION info: Dest ts = 1481312661276391, src ip = <192.168.1.2> dest ip = <192.168.1.17> Dest wid = 0 using SHARED swap
2016-12-09T19:44:22.293Z cpu20:579860616)Hbr: 3394: Migration start received (worldID=579827968) (migrateType=1) (event=0) (isSource=0) (sharedConfig=1)
2016-12-09T19:44:22.332Z cpu0:33670)MigrateNet: vm 33670: 2997: Accepted connection from <::ffff:192.168.1.2>
2016-12-09T19:44:22.332Z cpu0:33670)MigrateNet: vm 33670: 3049: data socket size 0 is less than config option 562140
2016-12-09T19:44:22.332Z cpu0:33670)MigrateNet: vm 33670: 3085: dataSocket 0x430610ecaba0 receive buffer size is 562140
2016-12-09T19:44:22.332Z cpu0:33670)MigrateNet: vm 33670: 2997: Accepted connection from <::ffff:192.168.1.2>
2016-12-09T19:44:22.332Z cpu0:33670)MigrateNet: vm 33670: 3049: data socket size 0 is less than config option 562140
2016-12-09T19:44:22.332Z cpu0:33670)MigrateNet: vm 33670: 3085: dataSocket 0x4306110fab70 receive buffer size is 562140
2016-12-09T19:44:22.332Z cpu0:33670)VMotionUtil: 3995: 1481312661276391 D: Stream connection 1 added.
2016-12-09T19:44:24.416Z cpu1:32854)elxnet: elxnet_allocQueueWithAttr:4255: [vmnic0] RxQ, QueueIDVal:2
2016-12-09T19:44:24.416Z cpu1:32854)elxnet: elxnet_startQueue:4383: [vmnic0] RxQ, QueueIDVal:2
2016-12-09T19:44:24.985Z cpu12:579860756)VMotionRecv: 658: 1481312661276391 D: Estimated network bandwidth 471.341 MB/s during pre-copy
2016-12-09T19:44:24.994Z cpu4:579860755)VMotionSend: 4953: 1481312661276391 D: Failed to receive state for message type 1: Failure
2016-12-09T19:44:24.994Z cpu4:579860755)WARNING: VMotionSend: 3979: 1481312661276391 D: failed to asynchronously receive and apply state from the remote host: Failure.
2016-12-09T19:44:24.994Z cpu4:579860755)WARNING: Migrate: 270: 1481312661276391 D: Failed: Failure (0xbad0001) @0x4180324c6786
2016-12-09T19:44:24.994Z cpu4:579860755)WARNING: VMotionUtil: 6267: 1481312661276391 D: timed out waiting 0 ms to transmit data.
2016-12-09T19:44:24.994Z cpu4:579860755)WARNING: VMotionSend: 688: 1481312661276391 D: (9-0x43ba40001a98) failed to receive 72/72 bytes from the remote host <192.168.1.2>: Timeout
2016-12-09T19:44:24.998Z cpu4:579860616)WARNING: Migrate: 5454: 1481312661276391 D: Migration considered a failure by the VMX. It is most likely a timeout, but check the VMX log for the true error.

 

We are using the vSphere distributed switch in our environment, and each host has a vmk dedicated to vMotion traffic only, so this was my first test, verified the IP and subnet for the vmk on the source/destination hosts, and I was successfully able to ping using vmkping to the destination host, and tested the connection both ways.

The second test completed was to power off a VM, and test its ability to vMotion off of the host – this worked.  When I powered the VM back on it immediately went back to the source host.  I then tried to vMotion the VM again while it was powered on from the affected source host, and move it to the destination host like I had before, and to my surprise it worked now.  Tested this process with a few other VMs for consistency.  I tried to restart a VM on the affected host, and then move it off to another host, but this did not work.

My final test was to vMotion a VM from a different host to the affected host.  This worked as well, and I was even able to vMotion off from that affected host again.

 

In our environment we have a Trend-micro agent VM and a GI VM running on each host.  I logged into the vSphere web-client to look at the status of the Trend-micro VM and there was no indication of an error, I found the same status checking the GI vm.

Knowing we have had issues with Trend-micro in the past, I powered down the Trend-micro VM running on the host, and attempted a manual vMotion of a test VM I knew couldn’t move before – IT WORKED.  Tried another with the same result.  Put the host into maintenance mode to try and evacuate the rest of the customer VMs off from it with success!

To wrap all of this up, the Trend-micro agent VM running on the esxi6 host was preventing other VMs from vMotioning off either manually or through DRS.  Once the trend-micro agent VM was powered off, I was able to evacuate the host.

 

vRealize Operations Manager (vROps) Health-Check.

This post is not intended to be the traditional front-end health check on the appliance, and instead will focus on the back-end, specifically the Cassandra database on the data nodes.  I decided to write this post due to the various issues I have encountered managing two large production deployments, with the largest containing 9 data nodes, and 3 remote collectors collecting and processing metrics north of 3,829,804.

The first check we can do is on the database sync between the data nodes including the master and master replica.  This can also be useful in determining unusual disk growth on one or more of the data nodes. Open a SSH session to the master appliance and issue the following command:

# $VMWARE_PYTHON_BIN /usr/lib/vmware-vcops/tools/vrops-platform-cli/vrops-platform-cli.py getShardStateMappingInfo

 

The sample output to be concerned with looks similar to the following example:


{
   "stateMappings": {
    "vrops1": {
      "vRealize Ops Shard-0724c812-9def-4391-9efa-2395d701d43e": {
        "state": "SYNCHING"
      },
      "vRealize Ops Shard-77839361-986c-4817-bbb3-e7f4f1827921": {
        "state": "SYNCHING"
      },
      "vRealize Ops Shard-8469fdff-55f0-49f7-a0e7-18cd6cc288c0": {
        "state": "RUNNING"
      },
      "vRealize Ops Shard-8c8d1ce4-36a5-4f23-b77d-29b839156383": {
        "state": "RUNNING"
      },
      "vRealize Ops Shard-ab79572e-6372-48d2-990d-d21c884b46fb": {
        "state": "RUNNING"
      },
      "vRealize Ops Shard-bfa03b9e-bac9-4040-b1a8-1fd8c2797a6a": {
        "state""OUT_OF_SYNC"
      }
    }
  },

The “vRealize Ops Shard” refers to the data nodes including the master and master replica nodes. The available status’ are RUNNING, SYNCHING, BALANCING, OUT_OF_BALANCE, and OUT_OF_SYNC.

  • States of RUNNING, SYNCHING and BALANCING are normal healthy states.
  • OUT_OF_BALANCE and OUT_OF_SYNC status is cause for concern, and is enough to open an SR with VMware to have them take a look.  But lets look a little deeper to see if there’s more going on here.  It may be beneficial information to give to VMware’s GSS.

The vRealize Operations Manager appliance uses Apache Cassandra database, so with this next command, we will be looking at the database load using a Cassandra utility called node tool. This command is only gathering operational statistics from the database, so it is safe to run as we are not making any system changes here.

  • A good time to use this utility is when you start getting alerts from various datanodes stating high load, or that cassandra DB service has crashed and has now recovered, or that the data node has disconnected and reconnected.
  • I’ve also noticed this to be a cause for failed upgrades, or failed vrops cluster expansions.
# $VCOPS_BASE/cassandra/apache-cassandra-2.1.8/bin/nodetool --port 9008 status

This will return output similar to:

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--   Address    Load    Tokens  Owns  Host ID                               Rack
UN   192.2.3.6  80.08 GB  256     ?   232f646b-3fbc-4388-8962-34000c1bb48b  rack1
UN   192.2.3.7  80.53 GB  256     ?   1bfec59b-3ba8-4ca0-942f-5bb2f97b7319  rack1
UN   192.2.3.4  80.11 GB  256     ?   da6c672c-cc69-4284-a8f5-2775441bb113  rack1
UN   192.2.3.5  79.33 GB  256     ?   ee4a3c3f-3f0f-46ac-b331-a816e8eb37c5  rack1
DN   192.2.3.3  75.13 GB  256     ?   19e80237-6f2c-4ff6-881e-ce94870e0ca5  rack1

Note: Non-system keyspaces don't have the same replication settings, 
effective ownership information is meaningless
———————————————————————–
  • The output to be concerned with here, is the load column.  Under ideal operational conditions, I have been told this should be under 5GB of load per node.  This command does not return data on the remote collectors because they do not contain a database.
  • If database load is over 5GB, you will need to open an SR VMware GSS with this information, along with sending them the usual appliance log bundle.  In this example, my data nodes had over 70 GB of load.
  • If nodetool returns with an error: nodetool: Failed to connect to ‘127.0.0.1:9008’ – ConnectException: ‘Connection refused’, checkout this KB2144358 article.  You may be able to get that node back online before calling GSS.

Concerning the database load, In most cases from my experience GSS would need to truncate the activity, results, and queueid tables, and then run a parallel nodetool repair command on all data nodes starting with the master in order to get the appliance’s feet back under it.  I will detail those steps here as these are the steps usually performed:

  1. Take a snapshot of the nodes: master, master replica, data nodes (Remote Collectors can be skipped) to ensure no issues arise.
  2. Leave the cluster ONLINE
  3. Take analytics offline on the master and all data nodes:
  • Perform this step in parallel. That is, execute this command on each node one-after-another without waiting for it to complete in a single terminal firstA simple for-loop calling ssh to the nodes to do this isn’t sufficient. It is best to ensure the master and data nodes are all going offline together.
# service vmware-vcops stop analytics

4.  Repair the RESORUCE_STATE_DELETE flags for non-existing resources that are to be deleted:

  • On the master node only execute:
# su - postgres -c "/opt/vmware/vpostgres/current/bin/psql -d vcopsdb -p 5433 -c \"update resource set resource_flag='RESOURCE_STATE_NORMAL' where resource_flag='RESOURCE_STATE_DELETE';\""

5.  Perform Cassandra maintenance on the master node only.  Afterword you will be running cassandra repair on the rest of the nodes that will sync up their databases with the master.  There are a total of four commands here, so run them in order:

# $VCOPS_BASE/cassandra/apache-cassandra-2.1.8/bin/cqlsh --ssl --cqlshrc $VCOPS_BASE/user/conf/cassandra/cqlshrc -e "consistency quorum; truncate globalpersistence.activity_tbl"
# $VCOPS_BASE/cassandra/apache-cassandra-2.1.8/bin/cqlsh --ssl --cqlshrc $VCOPS_BASE/user/conf/cassandra/cqlshrc -e "consistency quorum; truncate globalpersistence.activityresults_tbl"
# $VCOPS_BASE/cassandra/apache-cassandra-2.1.8/bin/cqlsh --ssl --cqlshrc $VCOPS_BASE/user/conf/cassandra/cqlshrc -e "consistency quorum; truncate globalpersistence.queueid_tbl"
# $VCOPS_BASE/cassandra/apache-cassandra-2.1.8/bin/nodetool -p 9008 clearsnapshot 

6.  Perform Cassandra maintenance on all nodes:

  • This is critical. The -pr option to the repair tool causes a subset of the token range to be coordinated from the node calling the nodetool. This improves performance however it is critical that ALL nodes in the Cassandra cluster perform this operation to get a complete and consistent repair.  See <www.datastax.com_dev_blog_repair-cassandra >
  • Execute this on the master and all data nodes simultaneously:
# $VCOPS_BASE/cassandra/apache-cassandra-2.1.8/bin/nodetool -p 9008 repair -par -pr

7. To monitor the repair progress, you can start another SSH session to the master node and tail the following log:

# tail -f /storage/log/vcops/log/cassandra/system.log
  • The repair operation has two distinct phases. First it calculates the differences between the nodes (repair work to be done), and then it acts on those differences by streaming data to the appropriate nodes.

Generally speaking, you can also monitor the nodetool repair operation with these two nodetool commands, but this is not necessary:

  • netstats: This monitors the repair streams to the nodes:
# $VCOPS_BASE/cassandra/apache-cassandra-2.1.8/bin/nodetool --port 9008 netstats
  • compactionstats: This checks on the active Merkle Tree calculations:
# $VCOPS_BASE/cassandra/apache-cassandra-2.1.8/bin/nodetool --port 9008 compactionstats

8. Perform the instance metric key id clean-up on all nodes.  Perform this step in parallel on the master and data nodes.  This cleans up the disk on the nodes, as this cleans up the snapshots of Cassandra on each node:

# $VMWARE_PYTHON_BIN $VCOPS_BASE/tools/persistenceTool.py RemoveMatchedMetricKeys --outputFile /tmp/deleted_report.txt --regex "\"^diskspace-waste:.+?snapshot:.*\"" --remove true # $VMWARE_PYTHON_BIN $VCOPS_BASE/tools/persistenceTool.py RemoveMatchedMetricKeys --outputFile /tmp/deleted_report2.txt --regex "\"^diskspace:.+?snapshot:.*(accessTime|used)$\"" --remove true

9.  Clean up the alarms & alerts on all nodes.   Perform this step in parallel on the master and data nodes:

# su - postgres -c "/opt/vmware/vpostgres/9.3/bin/psql -p 5432 -U vcops -d vcopsdb -c 'truncate table alert cascade;'"
# su - postgres -c "/opt/vmware/vpostgres/9.3/bin/psql -p 5432 -U vcops -d vcopsdb -c 'truncate table alarm cascade;'"

10.  Bring the analytics processes back online.   Execute this step on the master, master replica and data nodes. You may use a ssh for-loop and execute these commands sequentially:

# service vmware-vcops start analytics
  • I have seen the cluster to take 20 to 30 minutes to come back online (from my experience with a 9+ node large cluster).
  • If you log into the https://<vrops>/admin page, you will see that the HA status is degraded, or needs to be re-enabled.  Give the appliance time as it will reset itself back to a healthy green state, once fully online.

11. Once the cluster is fully online and you can confirm the data is being collected, delete the snapshots you took earlier.

12. On the master node, if you again run the command:

# $VMWARE_PYTHON_BIN /usr/lib/vmware-vcops/tools/vrops-platform-cli/vrops-platform-cli.py getShardStateMappingInfo

You should see something similar to:

{
   "stateMappings": {
    "vrops1": {
      "vRealize Ops Shard-0724c812-9def-4391-9efa-2395d701d43e": {
        "state": "SYNCHING"
      },
      "vRealize Ops Shard-77839361-986c-4817-bbb3-e7f4f1827921": {
        "state": "SYNCHING"
      },
      "vRealize Ops Shard-8469fdff-55f0-49f7-a0e7-18cd6cc288c0": {
        "state": "RUNNING"
      },
      "vRealize Ops Shard-8c8d1ce4-36a5-4f23-b77d-29b839156383": {
        "state": "SYNCHING"
      },
      "vRealize Ops Shard-ab79572e-6372-48d2-990d-d21c884b46fb": {
        "state": "SYNCHING"
      },
      "vRealize Ops Shard-bfa03b9e-bac9-4040-b1a8-1fd8c2797a6a": {
        "state": "SYNCHING"
      }
    }
  },

13. On the master node, if you again run the nodetool status command

# $VCOPS_BASE/cassandra/apache-cassandra-2.1.8/bin/nodetool --port 9008 status

You should see something similar to:

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--   Address    Load    Tokens  Owns  Host ID                               Rack
UN   192.2.3.6  120.20 MB  256     ?   232f646b-3fbc-4388-8962-34000c1bb48b  rack1
UN   192.2.3.7  128.20 MB  256     ?   1bfec59b-3ba8-4ca0-942f-5bb2f97b7319  rack1
UN   192.2.3.4  120.11 MB  256     ?   da6c672c-cc69-4284-a8f5-2775441bb113  rack1
UN   192.2.3.5  115.33 MB  256     ?   ee4a3c3f-3f0f-46ac-b331-a816e8eb37c5  rack1
DN   192.2.3.3  128.13 MB  256     ?   19e80237-6f2c-4ff6-881e-ce94870e0ca5  rack1

Note: Non-system keyspaces don't have the same replication settings, 
effective ownership information is meaningless
  • The ideal situation here is that now the Cassandra DB load should be around 1GB or less

14.   Log into the regular web interface and edit the policy to stop collections on snapshot metrics. This will help in overall performance going forward.

 

Creating vROps Policies and How To Apply Them To Object Groups.

Creating policies in VMware’s vRealize Operations Appliance can be strait forward, if there is a decent understanding of every platform it’s monitoring.  In my last post of this series, I covered the creation of object groups, and that post is important here because policies can be created and assigned to those object groups, allowing the tuning of alerts received for those groups.

Once logged in to the vROps appliance, go into the administration section, and there you will find the policies.

vrops37.png

  • VMware has included many base policies in the policy library, which in most cases will be fine for the initial configuration for the appliance, but you may want to create additional policies to suite your specific environment needs.
  • Also take note of the blue film strip in the upper right corner.  This will take you to VMware’s video repository of policies explanation and a brief how-to video.  These video links can be found throughout the configuration of the appliance, and more are added with each release.

To create a new policy click on the green plus sign to get started.  Give the policy a unique name, and it would be good practice to give a description of what the policy is intended to do.  When creating a policy, you have the ability to “start with” a VMware pre-defined policy, and I recommend taking advantage of that until there is a firm understanding of what these policies do.

vrops38

On the Select Base Policy tab, you can use the drop down menu on the left to get a policy overview of what is being monitored.  In this example, Host system was selected.

vrops39

Policy Overrides can also be incorporated into this policy.  In other words, if there are certain alerts that you do not want, one of the pre-defined policies may already have those alerts turned off, so those policies can be added to the new policy being created here.  Work smarter, not harder right?

vrops40

Moving along to the Analysis Settings tab, here you can see how vROps analyses the alerts, determines thresholds, and assigned system badges.  These can be left at their current settings per the policy you are building off of, or you can click on the padlock to the right and make individual changes.  Keep in mind under the “Show changes for” drop down menu, you will have many objects to select to change the analysis settings on.

vrops41

The Alert/Systems Definitions tab is probably where the majority of time will be spent.  The “Alert Definitions” box at the top is where alerts can be turned on or off based on the base policy used to create this one, or the override policies used.

  • Each management pack installed will have it’s own category for object type.  In other words, “host system” is listed under the vCenter category, but if vCloud Director management pack was installed, it would also have a “host system” under its category.  Each management pack has the ability to add additional alerts for objects referenced in other management packs.  Take time going through each category to see what alerts may need configuring.
  • The State of each alert will either be local with a green check-mark: meaning you enabled it, inherited with a grey check-mark: meaning it is enabled via another policy that was used to create this one, Local with the red crossed out circle: meaning you disabled the alert for the policy, or inherited with a grayed out crossed out circle: meaning it is disabled via another policy that was used to create this one.  Disabling alerts here will still allow the metrics to be collected for the object, you just wont get the alarm for it.
  • The System Definitions section has the same “object type” drop down menu, and you can select the object type here to configure system thresholds for how the symptoms are triggered for the alert selected in the top Alert Definition box above.  I typically do these in tandem.

vrops43

Finally, you can apply the policy to the custom groups you created before in the Apply Policy to Groups tab.

vrops42

Once you click save, and go back to the Active Policies tab, you will be able to see the new policy created, and within five minutes, you should see the Affected Objects count rise.  You can see here that I have a policy marked with “D” meaning it is the default appliance policy.  You can set your own policy as default by clicking the blue circle icon with the arrow on the upper left side.  It may take up to 24 hours before the active alert page reflects the settings of the new policy.  Otherwise you can manually clear those alerts.

vrops44

Previous post to this series: Configuring VMware vRealize Operations Manager Object Groups

ESXi host fails to upgrade from 5.5 Update 3 to 6 Update 2

This happened to me today and thought it was worth sharing.  Most of the hosts in this particular cluster upgraded fine to ESXi 6u2 from ESXi 5.5u3 with the exception of this one host.  Update manager kept giving me this error “Cannot run upgrade script on host” in the vCenter Recent Tasks pane.

esxi01

A quick google search brought me to this KB article 2007163, but after following the KB I wasn’t able to find the referenced error “Remediation failed due to non mode failure “on the update manager server (Win2008) under C:\AppData\VMware\Update Manager\Logs\vmware-vum-server-log4cpp.log file.

I started an SSH session to the ESXi host, but wasn’t able to find and entry similar to the error “OSError: [Errno 39] Directory not empty:” in the /var/log/vua.log file

I instead found this error:

—————————————————————-

[FFD0D8C0 error ‘Default’] Alert:WARNING: This application is not using QuickExit().

The exit code will be set to 0.@ bora/vim/lib/vmacore/main/service.cpp:147

–> Backtrace:

–> backtrace[00] rip 1bc228c3 Vmacore::System::Stacktrace::CaptureFullWork(unsigned int)

—————————————————————–

By chance, I happened to check space on the ESXi host #df -h and found I had a partition that was 100% full.

esxi02

So I changed directory to it # cd /storage/core/……./ where I found two more directories  /var/core/ .  Using the command #ls to list the directory, I found two zdumps.

esxi03

I deleted the two zdumps, and then checked the space again #df -h

esxi04

Seeing now that my directory is now 68% utilized instead of 100%, I attempted the ESXi 6u2 upgrade again this time with success.

 

 

How to evacuate virtual machines from one host to another with PowerCLI

If you ever find yourself in need of evacuating an esxi host there is a handy PowerCLI command that can do just that, and it maintains the resource pools for the virtual machines too.  This was used in vCenter 6u2, PowerCLI 6.3 R1, esxi 5.5

–  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –

Connect-VIserver <vcenter-name/ip>

Get-VM -Location “<sourcehost>” | Move-VM -Destination (Get-Vmhost “<destinationhost>”).

–  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –

I recently went through an outage that affected our ability to put hosts into maintenance mode, as the vMotion operation would get stuck at the vCenter level at 13%, with no indication at the host level that something was happening.  This PowerCLI command allowed me to evacuate one host’s virtual machines onto another getting me through all 18 hosts in the cluster.

 

 

Configuring VMware vRealize Operations Manager Object Groups

There are two sections I will cover in this post: ‘Group Types’ and ‘Object groups’.  An example of when you might want to consider creating a group type …lets say you have multiple data centers, a group type could be used as a way to group all objects types of that data center into one folder.  In other worlds: The group type for the data center in Texas would be used as a sorting container for the group objects such as data stores, vCenters, hosts, virtual machines, etc.. and keep them separated from the data center in New York.

The way you can do this is by clicking on the Content icon, selecting group types and then clicking the green plus sign to add a new group type.

vrops28

Next you can click on the environment icon (blue globe), Environment Overview, click the green plus icon to create a new object group, and then in the group type drop down, you can select the group type you just created.  As far as policy selection goes, the built in VMware policies are a great place to start.  You can easily update this selection later when you create a custom policy.  I would recommend checking ‘Keep group membership up to date’.

vrops29

Define membership criteria section.  This is where the water can get muddy as you will have more than one way to target your desired environment objects.  In the drop down menu ‘Select the Object Type that matches all of the following criteria’, the object types selection can grow in number depending how many additional adapter Management PAKs are installed on the vROps appliance.  This selection will also be important because of the way vROps alerts off from the management packs.

– An example would be alerts on Host Systems.  One would assume you would select the vCenter Adapter and then select Host Systems, however  if you have the vCloud Director Adapter Management PAK installed for example, that PAK also has metrics for Host Systems that vROps will alert from,  you would also need to select Host systems under that solution to target those alerts and systems as well.

vrops30

For this example, we will use the Host System under the vCenter Adapter.  There are different ways to target the systems, for this example I will showcase using Object name, contains option.  This option allows us to target several systems IF they have a common name like so:

vrops31

You also have the option to target systems based on their Relationship.  In this example we have clusters of hosts group under the name of MGMT, so I chose Relationship, Child of, contains, MGMT – to target all systems in that cluster like so:

vrops32

There is a Preview button in the lower left corner which you can use to see if your desired systems are picked up by the membership criteria you selected.

vrops33

You can also target multiple system names by using the Add another criteria set option like so:

vrops34

Depending on the size of the environment, I’ve noticed that it helps to make a selection in the ‘in navigation tree’ drop down as well

When you have the desired systems targeted in the group, click OK.  Groups are subject to the metrics interval collection, so the groups page will show grey question marks next to the groups until the next interval has passed.  Any policies that were applied to these custom groups will not change the alerts page until that metrics collection has occurred.

The added benefit to having custom groups is that vROps will also show you the group health, and if you click on the group name you will be brought to a custom interface with alerts only for that group of targeted systems.

vrops35

In my next post I will go over the creating of policies and how to apply them to object groups.

 

Next Post: Creating vROps Policies and How To Apply Them To Object Groups.

Previous Post: Add The vROps License, Configuring LDAP, and The SMTP For vRealize Operations Manager (vROps)

 

Nightly Automated Script To Gather VM and Host Information, and Output To A CSV

Admittedly this was my first attempt at creating a Powershell script, but thought I would share my journey.  We wanted a way to track the location of customer VMs running on our ESXi hosts, and their power state in the event of a host failure.  Ideally this would be automated to run multiple times during the day, and the output would be saved to a csv on a network share.

We had discovered a bug in vCenter 5.5 where if the ESXi 5.5 host was in a disconnected state, and an attempt was made to reconnect it using the vSphere client without knowing that the host was actually network isolated, HA would not restart the VMs on another host as expected.  We would later test this in a lab to find that if we had not used the reconnect option, HA would restart the VMs as expected on other hosts.  We again tested this scenario in vCenter 6 update 2, and the bug was not present.

So the first powercli one-liner I came up with was the following:

> get-vm | select VMhost, Name, @{N="IP Address";E={@($_.guest.IPaddress[0])}}, PowerState | where {$_.PowerState -ne "PoweredOff"} | sort VMhost

get-vm1

I wanted a list of powered on VMs, their IPs, what host they were running on, and I wanted to sort the output by the host name.  Knowing I was on the right track, now I wanted to be able to connect to multiple data centers, and have each data center’s output saved to a CSV on a network share.  I didn’t really need to hang on to the CSVs in that network share for more than seven days, so I also wanted to build in logic so that it would essentially cleanup after itself.

That script looks something like this:

#Initial variables
$vCenter = @()
$sites = @("vcenter01","vcenter02","vcenter03")

#get array of sites and establishes connections to vCenters
   foreach ($site in $sites) {
   $vCenter = $site + "domain.net"

   Connect-VIServer $vCenter

#get list of not equal to powered off VMs, there IP and which hosts they're running on
 get-vm | select VMhost, Name, @{N="IP Address";E={@($_.guest.IPaddress[0])}}, PowerState | where {$_.PowerState -ne "PoweredOff"} | sort VMhost | Export-Csv -Path "c:\path\to\output\$site $((Get-Date).ToString('MM-dd-yyyy_hhmm')).csv" -NotypeInformation -useculture

#Disconnect from vCenters 
 Disconnect-VIServer -Force -Confirm:$false -Server $vCenter

}

#Cleanup old csv after 7 days.
$limit = (Get-Date).AddDays(-7)
$path = "c:\path\to\output\"

Get-ChildItem -Path $path -Recurse -Force | Where-Object { !$_.PSIsContainer -and $_.CreationTime -lt $limit } | Remove-Item -Force

Something I did not know until after running this in a large production environment, is that the get-vm call is heavy and not very efficient.  When I ran this script in my lab it took less than 15 seconds to run.  However in a production environment, connecting to data centers all over the globe, it took over 40 minutes to run.

A colleague of mine who had automation experience, exposed me to another cmdlet called get-view, and said it would be much faster to run as it was more efficient gathering the data needed.  So I rewrote my script and now it looks like:

_________________________________________________________________

get-vm4-5

 

_________________________________________________________________

The new code took less than a couple of minutes to run in my production environment.  I have a Windows VM deployed that’s running the VMware poweractions fling, and it also runs some scheduled scripts.  This script would be running from that server, so I added an additional function to the script that creates a WIN event entry so it could be tracked from a syslog server.

So the final script can be downloaded here.  *Disclaimer – test this in a lab first as the code will need to be updated to suit your needs.