Performing A Database Health Check On vRealize Operation Manager (vROPS) 6.5

In a previous post I showed how you could perform a healthcheck, and possibly resolve database load issues in vROPs versions from 6.3 and older. When VMware released the vROPS 6.5, they changed the way you would access the nodetool utility that is available with cassandra database.

 $VCOPS_BASE/cassandra/apache-cassandra-2.1.8/bin/nodetool --port 9008 status

For the 6.5 release and newer, they added the requirement of using a ‘maintenanceAdmin’ user along with a password file.  The new command to check the load status of the activity tables in a vROPS 6.5+ is as follows:

  $VCOPS_BASE/cassandra/apache-cassandra-2.1.8/bin/nodetool -p 9008 --ssl -u maintenanceAdmin --password-file /usr/lib/vmware-vcops/user/conf/jmxremote.password status

Example output would be something similar to this if your cluster is in a healthy state:


If any of the nodes have over 600 MB of load, you should consult with VMware GSS or a TAM on the next steps to take, and how to elevate the load issues.

Next we can check the syncing status of the cluster to determine overall health.  The command is as follows:

$VMWARE_PYTHON_BIN /usr/lib/vmware-vcops/tools/vrops-platform-cli/ getShardStateMappingInfo

Example output:


The “vRealize Ops Shard” refers to the data nodes, and the Master and Master Replica nodes in the main cluster. The available status’ are RUNNING, SYNCING, BALANCING, OUT_OF_BALANCE, and OUT_OF_SYNC.

  • Out of Balance and Out of Sync should be enough to open an SR and have VMware take a look.

Lastly, we can take a look at the size of the activity table.  You can do this by running the following command:

 du -sh /storage/db/vcops/cassandra/data/globalpersistence/activity_tbl-*

Example Output:


If there are two listed here, you should consult with VMware GSS as to which one can safely be removed, as one would be left over from a previous upgrade.


Removing Old vROps Adapter Certificates

I’ve come across this issue in previous versions of vRealize Operations Manager prior to the 6.5 release, where you delete an adapter for data collection like vSphere, NSX or VCD, and immediately try to re-create it.  Whether it was a timing issue, or vROps just didn’t successfully complete the deletion process, I’d typically get an error that the new adapter instance could not be created because a previous one exists with the same name.  Now there are two ways around this.  You can connect the adapter to whatever instance (VCD, NSX, vSphere) you are trying to collect data from using the IP address, instead of the FQDN (or vice-versa), or you can cleanup the certificate that was left behind manually as I will outline the steps below.

To resolve the issue, delete the existing certificate from Cassandra DB and accept the new certificate re-creating adapter instance.

1. Take snapshots of the cluster

2.  SSH to the master node.  Access the Cassandra DB by running the following command:

 $VMWARE_PYTHON_BIN $VCOPS_BASE/cassandra/apache-cassandra-2.1.8/bin/cqlsh --ssl --cqlshrc $VCOPS_BASE/user/conf/cassandra/cqlshrc

3. Access the database by running the following command:

use globalpersistence;

4.  We will need to look at the entries in the global persistence certificate store.  To do this, first list all the entries in globalpersistence.certificate store by running the following command:

SELECT * from globalpersistence.certificate;

5. From the list, find the desired certificate.  Now select that specific certificate with the following command:

 SELECT * from globalpersistence.certificate where key = 'Certificate.<ThumbprintOfVCCert>' and classtype = 'certificate' ALLOW FILTERING;

For example:

 SELECT * from globalpersistence.certificate where key = 'Certificate.e88b13c9e346633f94e46d2a74182d219a3c98b2' and classtype = 'certificate' ALLOW FILTERING;

6.   The tables which contains the information:

namespace | classtype | key | blobvalue | strvalue | valuetype | version

7.  Select the Key which matches the Thumbprint of the Certificate you wish to remove and run the following command:

 DELETE FROM globalpersistence.certificate where key = 'Certificate.<ThumbprintOfVCCert>' and classtype = 'certificate' and namespace = 'certificate';

For example:

 DELETE FROM globalpersistence.certificate where key = 'Certificate.e88b13c9e346633f94e46d2a74182d219a3c98b2' and classtype = 'certificate' and namespace = 'certificate';

8.  Verify that the Certificate has been removed from the VMware vRealize Operations Manager UI by navigating to:

Administration > Certificates

9.  Click the Gear icon on the vSphere Solution to Configure.

10.  Click the icon to create an new instance. Do not remove the existing instance unless the data can be lost.  If the old instance has already been deleted prior to this operation, then this warning can be ignored.

11.  Click Test Connection and the new certificate will be imported.

12.   Upon clicking Save there will be an error stating the Resource Key already exists. Ignore this and click Close and the UI will show Discard Changes?. Click Yes.

13.   Upon clicking Certificates Tab the Certificate is shown for an existing VC Instance.  Now you should have a new adapter configured and collecting.  If you kept the old adapter for the data, it can safely be removed after the data retention period has expired.

Get VM Tools Version with VMware’s PowerCLI

I had an engineer visit me the other day asking if there was an automated way to get the current version of VMtools running for a set of virtual machines, and in this case, it was for a particular customer running in our vCenter.   I said there most certainly was using PowerCLI.

Depending on the size of the environment, the first option here may be sufficient, although it can be an “expensive” query as I’ve noticed it takes longer to return results.  Using PowerCLI, you can connect to the desired vCenter and run the following one-liner to get return output on the console.  Here I was looking for a specific customer in vCloud Director, so in the vCenter I located the customers folder containing the VMs.   Replace the ‘foldername’ inside the asterisks with the desired folder of VMs.  This command would also work in a normal vCenter as well.

Get-Folder -name *foldername* | get-vm | get-vmguest | select VMName, ToolsVersion | FT -autosize

Example output:


You can see that this example that folder has a mix of virtual machines running, some not (no ToolsVersion value returned), and has a mix of VMtools versions running.

What if you just wanted a list of all virtual machines in the vCenter, the whole jungle?

 Get-Datacenter -Name "datacentername" | get-vm | get-vmguest | select VMName, ToolsVersion | FT -autosize

In either case, if you want to redirect output to a CSV add the following to the end of the line

 | export-csv -path "\path\to\file\filename.csv" -NoTypeInformation -UseCulture


Get-Folder -name *foldername* | get-vm | get-vmguest | select VMName, ToolsVersion | export-csv -path "\path\to\file\filename.csv" -NoTypeInformation -UseCulture


Another method/example of getting the tools version, and probably the fastest is using ‘Get-view’. A much longer string of command-lets, but this would be the ideal method for large environments if a quick return of data was needed, lets say for a nightly script that was least impactful to the vCenter.

 Get-Folder -name *foldername* | Get-VM | % { get-view $ } | select name, @{Name=“ToolsVersion”; Expression={$}}, @{ Name=“ToolStatus”; Expression={$_.Guest.ToolsVersionStatus}}

Example Output:


If you are after a list of all virtual machines running in the vCenter, a command similar to this can be used:

 Get-VM | % { get-view $ } | select name, @{Name=“ToolsVersion”; Expression={$}}, @{ Name=“ToolStatus”; Expression={$_.Guest.ToolsVersionStatus}}

VMware has put together a nice introductory blog on using get-view HERE

Just like last time, if you want to redirect output to a CSV file just take the following on to the end of the line for either method ie specific folder or entire vCenter:

 | export-csv -path "\path\to\file\filename.csv" -NoTypeInformation -UseCulture




VMware Certified Professional 6 – Data Center Virtualization



I do apologies for being MIA these past couple of weeks.  Anyone who has taken the VCP exam knows, it can be a brutal test to study for.  I thought it best to keep my head down, and study hard so I can pass the VCP6-DCV exam on the first go around.

As I wait for VMware Education to finalize my records, I will be readying new material to share with my fellow virtualization geeks in the coming weeks ahead.

All the Best,

Cory B.


Shutdown and Startup Sequence for a vRealize Operations Manager Cluster

You ever hear the phrase “first one in, last one out”?  That is the methodology you should use when the need arises to shutdown or startup a vRealize Operations Manager (vROps) cluster.  The vROps master should always be the last node to be brought offline in vCenter, and the first node VM to be started in vCenter.

The proper shutdown sequence is as follows:

  • FIRST: The data nodes
  • SECOND: The master replica
  • LAST: The master

The remote collectors can be brought down at any time.  When shutting down the cluster, it is important to “bring the cluster offline”.  Thing of this as a graceful shutdown of all the services in a controlled manor.  You do this from the appliance admin page

1. Log into the admin ui…. https://<vrops-master>/admin/


2. Once logged into the admin UI, click the “Take Offline” button at the top.  This will start the graceful shutdown of services running in the cluster.  Depending on the cluster size, this can take some time.


3. Once the cluster reads offline, log into the vCenter where the cluster resides and begin shutting down the nodes, starting with the datanodes, master replica, and lastly the master.  The remote collectors can be shutdown at any time.

4. When ready, open a VM console to the master VM and power it on.  Watch the master power up until it reaches the following splash page example.  It may take some time, and SUSE may be running a disk check on the VM.  Don’t touch it if it is, just go get a coffee as this may take an hour to complete.

The proper startup sequence is as follows:

  • FIRST: The master
  • SECOND: The master replica
  • LAST: The data nodes, remote collectors


5. Power on the master replica, and again wait for it to fully boot-up to the splash page example above.  Then you can power on all remaining data nodes altogether.

6. Log into the admin ui…. https://<vrops-master>/admin/

7. Once logged in, all the nodes should have a status of offline and in a state of Not running before proceeding.  If there are nodes with a status of not available, the node has not fully booted up.


8. Once all nodes are in the preferred state, bring the cluster online through the admin UI.


If there was a need to shutdown the cluster from the back-end using the same sequence, but you should always use the Admin UI when possible:

Proper shutdown:

  • FIRST: The data nodes
  • SECOND: The master replica
  • LAST: The master

You would need to perform the following command to bring the slice offline.  Each node is considered to be a slice.  You would do this on each node.

# service vmware-vcops-web stop; service vmware-vcops-watchdog stop; service vmware-vcops stop; service vmware-casa stop
$VMWARE_PYTHON_BIN /usr/lib/vmware-vcopssuite/utilities/sliceConfiguration/bin/ --action=bringSliceOffline --offlineReason=troubleshooting

If there was a need to startup the cluster from the back-end using the same sequence, but you should always use the Admin UI when possible:

Proper startup:

  • FIRST: The master
  • SECOND: The master replica
  • LAST: The data nodes, remote collectors

You would need to perform the following command to bring the slice online.  Each node is considered to be a slice.  You would do this on each node.

# $VMWARE_PYTHON_BIN $VCOPS_BASE/../vmware-vcopssuite/utilities/sliceConfiguration/bin/ --action bringSliceOnline
# service vmware-vcops-web start; service vmware-vcops-watchdog start; service vmware-vcops start; service vmware-casa start

If there is a need to check the status of the running services on vROps nodes, the following command can be used.

# service vmware-vcops-web status; service vmware-vcops-watchdog status; service vmware-vcops status; service vmware-casa status

Restarting Syslog Service on ESXi

Syslogs, we all use them in some form or another, and most places have their syslogs going to a collection server like Splunk or VMware’s own vRealize Log insight.  In the event you have an alert configured that notifies you when an ESXi host has stopped sending syslogs to the logging server, or you get a “General System Error” when attempting to change the configuration option on the ESXi host itself, you should open a secure shell to the ESXi server and investigate further.

1. Once a secure shell has been established with the ESXi host, check the config of the vmsyslogd service, and that the process is running by using the following command:

# esxcli system syslog config get
  • If the process is running and configured, output received would be something similar to:
Default Network Retry Timeout: 180
Local Log Output: /vmfs/volumes/559dae9e-675318ea-b724-901b0e223e18/logs
Local Log Output Is Configured: true
Local Log Output Is Persistent: true
Local Logging Default Rotation Size: 1024
Local Logging Default Rotations: 8
Log To Unique Subdirectory: true
Remote Host: udp://

2. If the process is up, look for the current syslog process with the following command:

# ps -Cc | grep vmsyslogd

3. If the service is running, the output received would be similar to the example below.  If there is no output, then the  vmsyslogd service is dead and needs to be started.  Skip ahead to step 5 if this is the case.

132798531 132798531 vmsyslogd            /bin/python -OO /usr/lib/vmware/vmsyslog/bin/vmsyslogd.pyo
132798530 132798530 wdog-132798531       /bin/python -OO /usr/lib/vmware/vmsyslog/bin/vmsyslogd.pyo

4. In this example, we would need to kill the vmsyslogd and wdog processes before we can restart the syslog daemon on the host.

# kill -9 132798530
# kill -9 132798531

5. To start the process issue the following command:

# /usr/lib/vmware/vmsyslog/bin/vmsyslogd

6. Verify that the process is correctly configured and running again.

# esxcli system syslog config get

Default Network Retry Timeout: 180
Local Log Output: /vmfs/volumes/559dae9e-675318ea-b724-901b0e223e18/logs
Local Log Output Is Configured: true
Local Log Output Is Persistent: true
Local Logging Default Rotation Size: 1024
Local Logging Default Rotations: 8
Log To Unique Subdirectory: true
Remote Host: udp://

7. Log into the syslog collection server and verify the ESXi host is now properly sending logs.

Creating, Listing and Removing VM Snapshots with PowerCLi and PowerShell

PowerCLi + PowerShell Method

-=Creating snapshots=-

Let’s say you are doing a maintenance, and need a quick way to snapshot certain VMs in the vCenter.  The create_snapshot.ps1 PowerShell does just that, and it can be called from PowerCli.


  •  Open PowerCLi and connect to the desired vCenter


  • From the directory that you have placed the create_snapshot.ps1 script, run the command and watch for output.
> .\create_snapshot.ps1 -vm <vm-name>,<vm-name> -name snapshot_name

Like so:


In vCenter recent tasks window, you’ll see something similar to:



-=Removing snapshots=-

Once you are ready to remove the snapshots, the remove_snapshot.ps1 PowerShell script does just that.


  • Once you are logged into the vCenter through PowerCli like before, from the directory that you have placed the remove_snapshot.ps1 script, run the command and watch for output.
> .\remove_snapshot.ps1 -vm xx01-vmname,xx01-vmname -name snapshot_name 

Like so:


In vCenter recent tasks window, you’ll see something similar to:


Those two PowerShell scripts can be found here:

create_snapshot.ps1 and remove_snapshot.ps1


PowerCLi Method

-=Creating snapshots=-

The PowerCLi New-Snapshot cmdlet allows the creation of snapshots in similar fashion, and there’s no need to call on a PowerShell script.  However can be slower

> get-vm an01-jump-win1,an01-1-automate | new-snapshot -Name "cbtest" -Description "testing" -Quiesce -Memory


  • If the VM is running and it has virtual tools installed, you can opt for a quiescent snapshot withQuiesce parameter.  This has the effect of saving the virtual disk in a consistent state.
  • If the virtual machine is running, you can also elect to save the memory state as well with the –Memory parameter
  • You can also

Keep in mind using these options increases the time required to take the snapshot, but it should put the virtual machine back in the exact state if you need to restore back to it.

-=Listing Snapshots=-

If you need to check the vCenter for any VM that contains snapshots,  the get-snapshot cmdlet allows you to do that.  You can also use cmdlets like format-list to make it easier to read.

> Get-vm | get-snapshot | format-list vm,name,created


Other options:


-=Removing snapshots=-

The PowerCLi remove-snapshot cmdlet does just that, and used in combination with the get-snapshot cmdlet looks something like this.

> get-snapshot -name cbtest -VM an01-jump-win1,an01-1-automate | remove-snapshot -RunAsync -confirm:$false


  • If you don’t want to be prompted, include –confirm:$False.
  • Removing a snapshot can be a long process so you might want to take advantage of the –RunAsync parameter again.
  • Some snapshots may have child snapshots if you are taking many during a maintenance, so you can also use –RemoveChildren to clean those up as well.










Failure Adding an Additional Node to vRealize Operations Manager Due to Expired Certificate

The Issue:

Unable to add additional nodes to cluster.  This error happened while adding an additional data and remote collector.  The cause ended up being a expired customer certificate, and surprisingly there was no noticeable mechanism such as a yellow warning banner in vROps UI to warn that a certificate had expired, or is about to expire.


Log into the the new node being added, and tail the vcopsConfigureRoles.log

# tail -f /storage/vcops/log/vcopsConfigureRoles.log

You would see entries similar to:

2016-08-10 00:11:56,254 [22575] - root - WARNING - vc_ops_utilities - runHttpRequest - Open URL: 'https://localhost/casa/deployment/cluster/join?admin=' returned reason: 
[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:581), exception: 
2016-08-10 00:11:56,254 [22575] - root - DEBUG - vcopsConfigureRoles - joinSliceToCasaCluster - Add slice to CaSA cluster response code: 9000
2016-08-10 00:11:56,254 [22575] - root - DEBUG - vcopsConfigureRoles - joinSliceToCasaCluster - Expected response code not found. Sleep and retry. 0 
2016-08-10 00:12:01,259 [22575] - root - INFO - vcopsConfigureRoles - joinSliceToCasaCluster - Add Cluster to slice response code: 9000 
2016-08-10 00:12:01,259 [22575] - root - INFO - vc_ops_logging - logInfo - Remove lock file: /usr/lib/vmware-vcopssuite/utilities/sliceConfiguration/conf/vcops-configureRoles.lck
2016-08-10 00:12:01,259 [22575] - root - DEBUG - vcopsPlatformCommon - programExit - Role State File to Update: '/usr/lib/vmware-vcopssuite/utilities/sliceConfiguration/data/'
2016-08-10 00:12:01,260 [22575] - root - DEBUG - vcopsPlatformCommon - UpdateDictionaryValue - Update section: "generalSettings" key: "failureDetected" with value: "true" file: "/usr/lib/vmware-vcopssuite/utilities/sliceConfiguration/data/"
2016-08-10 00:12:01,260 [22575] - root - DEBUG - vcopsPlatformCommon - loadConfigFile - Loading config file "/usr/lib/vmware-vcopssuite/utilities/sliceConfiguration/data/"
2016-08-10 00:12:01,261 [22575] - root - DEBUG - vcopsPlatformCommon - copyPermissionsAndOwner - Updating file permissions of '/usr/lib/vmware-vcopssuite/utilities/sliceConfiguration/data/' from 100644 to 100660
2016-08-10 00:12:01,261 [22575] - root - DEBUG - vcopsPlatformCommon - copyPermissionsAndOwner - Updating file ownership of '/usr/lib/vmware-vcopssuite/utilities/sliceConfiguration/data/' from 1000/1003 to 1000/1003
2016-08-10 00:12:01,261 [22575] - root - DEBUG - vcopsPlatformCommon - UpdateDictionaryValue - The key: failureDetected was updated 
2016-08-10 00:12:01,261 [22575] - root - DEBUG - vcopsPlatformCommon - programExit - Updated failure detected to true 
2016-08-10 00:12:01,261 [22575] - root - INFO - vcopsPlatformCommon - programExit - Exiting with exit code: 1, Add slice to CaSA Cluster failed. Response code: 9000.  Expected: 200


Step #1

Take snapshot of all vROps nodes

Step #2

Revert back to VMware’s default certificate on all nodes using the following kb article. KB2144949

Step #3

The custom cert files that need to be renamed on the nodes are located at /storage/vcops/user/conf/ssl.  This should be completed on all nodes.   Alternatively, you can remove them, but renaming them is sufficient.

# mv customCert.pem customCert.pem.BAK
# mv customChain.pem customChain.pem.BAK
# mv customKey.pem customKey.pem.BAK
# mv uploaded_cert.pem uploaded_cert.pem.BAK   

Step #4

Now attempt to add the new node again.  From the master node, you can watch the installation of the new node by tailing the casa.log

# tail -f /storage/vcops/log/casa/casa.log

Delete the snapshots as soon as possible.

  • To add a new custom certificate to the vRealize Operations Manager, follow this KB article: KB2046591


Alternative Solutions

  • There could be an old management pak installed that was meant for an older version of vROps.  This has been know to cause failures.  Follow this KB for more information: KB2119769


  • If you are attempting to add a node to the cluster using an IP address previously used, the operation may fail.  Follow this KB for more information: KB2147076



NSX Host VIB Upgrade From 6.1.X to 6.2.4

There is a known issue when upgrading the NSX host VIB from 6.1.X to 6.2.4, where once the host is upgraded to VIB 6.2.4, and the virtual machines are moved to it, if they should somehow find their way back to a 6.1.X host, the VM’s NIC will become disconnected causing an outage. This has been outlined in KB2146171


We found the following steps to be the best solution in getting to the 6.2.4 NSX VIB version on ESXi 6u2, without causing any interruptions in regards to the network connectivity of the virtual machines.

  1. Log into the vSphere web client, go to Networking & Security, select Installation on the navigation menu, and then select the Host preparation tab.
  2. Select the desired cluster, and click the “Upgrade Available” message next to it.  This will start the upgrade process of all the hosts, and once completed, all hosts will display “Reboot Required”.
  3. Mark the first host for maintenance mode as you normally would, and once all virtual machines have evacuated off, and the host marked as in maintenance mode, restart it as you normally would.
  4. While we wait for the host to reboot, right click on the host cluster being upgraded and select Edit Settings.  Select vSphere DRS, and set the automation level to Manual.  This will give you control over host evacuations and where the virtual machines go.
  5. Once the host has restarted, monitor the Recent Tasks window and wait for the NSX vib installation to complete.
  6. Bring the host out of maintenance mode.  Now migrate a test VM over to the new host and test network connectivity.  Ping to another VM on a different host, and then make sure you can ping out to something like
  7.  Verify the VIB has been upgraded to 6.2.4 from the vSphere web Networking & Security host preparation section.
  8. Open PowerCLI and connect to the vCenter where this maintenance activity is being performed.  In order to safely control the migration of virtual machines from hosts containing the NSX VIV 6.1.X to the host that has been upgraded to 6.2.4, we will use the following command to evacuate the next host’s virtual machines onto the one that was just upgraded.
Get-VM -Location "<sourcehost>" | Move-VM -Destination (Get-Vmhost "<destinationhost>")
  • “sourcehost” being the next host you wish to upgrade, and the “destinationhost” being the one that was just upgraded.

9.  Once the host is fully evacuated, place the host in maintenance mode, and reboot it.

10. VMware provided us with a script that should ONLY be executed against NSX vib 6.2.4 hosts, and does the following:

  • Verifies the VIB version running on the host.
    For example: If the VIB version is between VIB_VERSION_LOW=3960641, VIB_VERSION_HIGH=4259819 then it is considered to be a host with VIB 6.2.3 and above. Any other VIB version the script will fail with a warningCustomer needs to make sure that the script is executed against ALL virtual machines that have been upgraded since 6.1.x.
  • Once the script sets the export_version to 4, the version is persistent across reboots.
  • There is no harm if customer executes the script multiple times on the same host as only VMs that need modification will be modified.
  • Script should only be executed NSX-v 6.2.4 hosts

I have attached a ZIP file containg the script here:

Script Usage

  • Copy the script to a common datastore accessible to all hosts and run the script on each host.
  • Log in to the 6.2.4 ESXi host via ssh or CONSOLE, where you intend to execute the script.
  • chmod u+x the files
  • Execute the script:
./vmfs/volumes/<Shared_Datastore>/ /vmfs/volumes/<Shared_Datastore>/vsipioctl


Example output:

~ # /vmfs/volumes/NFS-101/ /vmfs/volumes/NFS-101/vsipioctl
Fixed filter nic-39377-eth0-vmware-sfw.2 export version to 4.
Fixed filter nic-48385-eth0-vmware-sfw.2 export version to 4.
Filter nic-50077-eth0-vmware-sfw.2 already has export version 4.
Filter nic-52913-eth0-vmware-sfw.2 already has export version 4.
Filter nic-53498-eth0-vmware-sfw.2 has export version 3, no changes required.

Note: If the export version for any VM vNIC shows up as ‘2’, the script will modify the version to ‘4’ and does not modify other VMs where export version is not ‘2’.

11.  Repeat steps 5 – 10 on all hosts in the cluster until completion.  This script appears to be necessary as we have seen cases where a VM may still lose its NIC even if it is vmotioned from one NSX vib 6.2.4 host to another 6.2.4 host.

12. Once 6.2.4 host VIB installation is complete, and the script has been run against the hosts and virtual machines running on them, DRS can be set back to your desired setting like Fully automated for instance.

13.  Virtual machines should now be able to vmotion between hosts without losing their NICs.

  • This process was thoroughly tested in a vCloud Director cloud environment containing over 20,000 virtual machines, and on roughly 240 ESXi hosts without issue. vCenter environment was vCSA version 6u2, and ESXi version 6u2.

Upgrading A Large vRealize Operations Manager (vROps) Appliance Cluster

Upgrading a multi-node vROps cluster can bring significant downtime to the monitoring/data collection abilities of the cluster.  The largest production cluster I am responsible for consists of nine data nodes, including the master and master replica, and four remote collectors for our remote data centers.  If you recall my previous post Sizing and Installing The vROps Appliance, I discussed the various sizing options of a vROps cluster based on the data collected, and in my case this cluster is configured as LARGE due to the size of our vROps cluster.  One of the biggest challenges of maintaining a large cluster, that has remote collectors collecting from data centers in different geographical locations, is the ability to upgrade the cluster with minimal downtime. As it stands now, if I were to upgrade this cluster with the traditional methods VMware provided, I would be looking at a minimal downtime of eight hours.  VMware does offer a useful work around: How to reduce update time by pre-copying software update PAK files KB2127895, and we will be using that here.

But first, I wanted to introduce you to a script developed by a Jeremy McCoy, and his repository over at github called nakedhitman.  In there you will find this awesome script called vROps Cluster Repair that I have personally used many times, and was recommended to me by VMware’s GSS. This script is intended to bring the vROps cluster back to a known healthy state, and I like to run it before upgrading my Production vROps clusters.  You will want to familiarize yourself with that script, download and get it setup with your environment details.

Preparing for the Upgrade

  • First – Run the nakedhitman’s – vROps Cluster Repair script.  This will cause a brief outage (max 30 minutes) as services are stopped on each vROps node for cleanup.  *I recommend taking a snapshot of all vROps nodes beforehand just in case.  Once the cluster comes online and starts collecting data, delete those snapshots.
  • Second – Insure you have enough free space on the appliances to support the upgrade.
  • Third – Complete a basic heath-check of the appliance outlined in my post: vRealize Operations Manager (vROps) Health-Check.  While not necessary, I have personally had upgrades fail due to the issues found in this health-check.
  • Fourth – Complete Part 1 of VMware KB2127895 article to get the upgrade paks pre-staged on all nodes except the master.  No downtime required as this can be done live.  The benefit of using this KB us that you are essentially removing the time it takes for the cluster to copy the two pak files around during the upgrade process, which could take hours depending on the size of the environment.

 Upgrading The Appliance Cluster

  1. Snapshot the cluster and remote collectors.  Take the cluster offline from the master’s admin page https://<vrops>/admin.  The the cluster is offline, shutdown the vrops appliance nodes in order of remote collector, data nodes, master replica and lastly the master.  Snapshot the VMs, and then boot the master first, wait for it to fully come up to the appliance login screen, and then boot the master replica, datanodes and remote collectors last.
  2. Log back into the master appliance Admin page, but do not bring the the cluster online.
  3. On the left pane select the Software Update tab, and then click the Install a Software Update… button.
  4. Browse for the PAK file and select it.
  5. Installation options……
    1. DO NOT select the option “Install the PAK file even if it is already installed.” – Think of this as a force install. This is used if the original software update failed and you are attempting to try again. This option will ignore the pre-staged PAK files you placed earlier, and severely delay the upgrade as the cluster will now have to copy the PAK files around to each of the nodes.
    2. You have the option to “Reset out-of-the-box content, overwriting to a newer version provided by this update. Note, user modifications to out-of-the-box alerts, Symptoms, Recommendations and Policies will be overwritten.”
  6. Click Upload.
  7. Accept the license agreement.
  8. Click Next.
  9. The upgrade will now start. Sit back, and Relax! The upgrade can take hours to complete. There are 9 steps to this.


  • Eventually you will need to log back into the admin page to monitor the progress of the upgrade.  Since  6.2, you can check the status of the upgrade by clicking the little notebook next to each node.  If there’s an issue detected like in the screen capture below, it may not stop the upgrade from progressing, but you should take notice.  VMware has even started included KB article links to help troubleshoot.


  • There are two places to watch the upgrade on the master at the log level if you’d like:
     # tail -f /storage/vcops/log/pakManager/vcopsPakManager.root.apply_system_update.log


     # tail -f /storage/vcops/log/pakManager/vcopsPakManager.root.query.log
  • Once the installation is complete and at steps 9 of 9, go back to the system status tab and verify the system state is online with the little green check. VMware engineers have said that at this point the upgrade has completed successfully, and it is safe to remove the snapshots.


  • Should the upgrade fail, open a severity 1 SR with VMware asap.
  • If the sun is shining and the upgrade finishes, delete those snapshots, and enjoy all the upgrades/bug fixes the new release brings.


As a side note…

I have submitted a couple feature requests to VMware in order to ease the upgrade process of large vROps installations.

  1. For multi-data center environments: The ability to have smaller appliances in each data center, with a single search head appliance connected to the multiple data center vROps deployments.  The idea here would be a “single pane of glass” to see all data centers like you get if there is a single large muli-node vROps cluster, with multiple remote collectors. Having smaller deployments accessible by a single search head would allow for the ability to take one deployment down per data center at a time to upgrade it, dramatically reducing  the data outage, and upgrade time.
  2. The ability to deploy the latest vROps appliance, and import the data from the old like VMware does with the vCSA.  The idea here is that this would be another way to reduce the upgrade time, and reduce the outage occurred by upgrading the appliance.
  3. Tying #1 and #2 together, the ability to stand up a new appliance in said remote data center, and then export that data centers specific data from the main large cluster to the smaller deployment, or the ability to just stand up a new appliance and import the data from the old one.