vRealize Operations Manager (vROps) Health-Check for versions prior to 6.5.

This post is not intended to be the traditional front-end health check on the appliance; instead it will focus on the back end, specifically the Cassandra database on the data nodes.  I decided to write this post due to the various issues I have encountered managing two large production deployments, the largest containing 9 data nodes and 3 remote collectors, collecting and processing north of 3,829,804 metrics.

The first check we can do is on the database sync between the data nodes, including the master and master replica.  This can also be useful in determining unusual disk growth on one or more of the data nodes. Open an SSH session to the master appliance and issue the following command:

# $VMWARE_PYTHON_BIN /usr/lib/vmware-vcops/tools/vrops-platform-cli/vrops-platform-cli.py getShardStateMappingInfo

 

The sample output to be concerned with looks similar to the following example:


{
   "stateMappings": {
    "vrops1": {
      "vRealize Ops Shard-0724c812-9def-4391-9efa-2395d701d43e": {
        "state": "SYNCHING"
      },
      "vRealize Ops Shard-77839361-986c-4817-bbb3-e7f4f1827921": {
        "state": "SYNCHING"
      },
      "vRealize Ops Shard-8469fdff-55f0-49f7-a0e7-18cd6cc288c0": {
        "state": "RUNNING"
      },
      "vRealize Ops Shard-8c8d1ce4-36a5-4f23-b77d-29b839156383": {
        "state": "RUNNING"
      },
      "vRealize Ops Shard-ab79572e-6372-48d2-990d-d21c884b46fb": {
        "state": "RUNNING"
      },
      "vRealize Ops Shard-bfa03b9e-bac9-4040-b1a8-1fd8c2797a6a": {
        "state""OUT_OF_SYNC"
      }
    }
  },

The “vRealize Ops Shard” entries refer to the data nodes, including the master and master replica nodes. The available states are RUNNING, SYNCHING, BALANCING, OUT_OF_BALANCE, and OUT_OF_SYNC.

  • States of RUNNING, SYNCHING, and BALANCING are normal, healthy states.
  • OUT_OF_BALANCE and OUT_OF_SYNC states are cause for concern, and are enough to open an SR with VMware to have them take a look.  But let's look a little deeper to see if there's more going on here; it may be beneficial information to give to VMware's GSS.  (A quick way to filter for these states is shown below.)
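
If you want a quick way to spot trouble without eyeballing the whole JSON dump, you can simply filter the CLI output for the two unhealthy states. This is just a convenience wrapper around the same command:

# $VMWARE_PYTHON_BIN /usr/lib/vmware-vcops/tools/vrops-platform-cli/vrops-platform-cli.py getShardStateMappingInfo | grep -E "OUT_OF_SYNC|OUT_OF_BALANCE"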

The vRealize Operations Manager appliance uses the Apache Cassandra database, so with this next command we will look at the database load using a Cassandra utility called nodetool. This command only gathers operational statistics from the database, so it is safe to run; we are not making any system changes here.

  • A good time to use this utility is when you start getting alerts from various data nodes reporting high load, that the Cassandra DB service has crashed and recovered, or that a data node has disconnected and reconnected.
  • I have also seen this be a cause of failed upgrades and failed vROps cluster expansions.
# $VCOPS_BASE/cassandra/apache-cassandra-2.1.8/bin/nodetool --port 9008 status

This will return output similar to:

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--   Address    Load    Tokens  Owns  Host ID                               Rack
UN   192.2.3.6  80.08 GB  256     ?   232f646b-3fbc-4388-8962-34000c1bb48b  rack1
UN   192.2.3.7  80.53 GB  256     ?   1bfec59b-3ba8-4ca0-942f-5bb2f97b7319  rack1
UN   192.2.3.4  80.11 GB  256     ?   da6c672c-cc69-4284-a8f5-2775441bb113  rack1
UN   192.2.3.5  79.33 GB  256     ?   ee4a3c3f-3f0f-46ac-b331-a816e8eb37c5  rack1
DN   192.2.3.3  75.13 GB  256     ?   19e80237-6f2c-4ff6-881e-ce94870e0ca5  rack1

Note: Non-system keyspaces don't have the same replication settings, 
effective ownership information is meaningless


  • The output to be concerned with here is the Load column.  Under ideal operational conditions, I have been told this should be under 5 GB of load per node.  This command does not return data for the remote collectors because they do not contain a database.
  • If database load is over 5 GB, you will need to open an SR with VMware GSS with this information, along with sending them the usual appliance log bundle.  In this example, my data nodes had over 70 GB of load.
  • If nodetool returns the error “nodetool: Failed to connect to ‘127.0.0.1:9008’ – ConnectException: ‘Connection refused’”, check out KB article 2144358.  You may be able to get that node back online before calling GSS; a quick port check is shown below.
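
Before calling GSS, it can also be worth a quick sanity check that Cassandra is actually listening on the nodetool port (9008). This is a generic check rather than an official VMware step, and assumes netstat is available on your appliance build:

# netstat -tlnp | grep 9008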

Concerning the database load: in most cases, from my experience, GSS will need to truncate the activity, results, and queueid tables, and then run a parallel nodetool repair command on all data nodes, starting with the master, in order to get the appliance's feet back under it.  I will detail those steps here, as these are the steps usually performed:

  1. Take a snapshot of the nodes: master, master replica, and data nodes (remote collectors can be skipped) to ensure no issues arise.
  2. Leave the cluster ONLINE.
  3. Take analytics offline on the master and all data nodes:
  • Perform this step in parallel. That is, execute this command on each node without waiting for it to complete on the previous one, rather than one after another in a single terminal. A simple sequential for-loop calling ssh to each node is not sufficient; it is best to ensure the master and data nodes all go offline together. (A backgrounded ssh loop that does run in parallel is sketched after the command below.)
# service vmware-vcops stop analytics
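
If you would rather drive this from a single session than open one terminal per node, backgrounding each ssh call with & runs the stops concurrently, which avoids the sequential-loop problem described above. A minimal sketch; the node IPs are placeholders for your own nodes, and it assumes root SSH between the nodes is permitted:

# for node in 192.2.3.4 192.2.3.5 192.2.3.6 192.2.3.7; do ssh root@$node 'service vmware-vcops stop analytics' & done; wait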

4.  Repair the RESOURCE_STATE_DELETE flags for non-existing resources that are to be deleted:

  • On the master node only, execute:
# su - postgres -c "/opt/vmware/vpostgres/current/bin/psql -d vcopsdb -p 5433 -c \"update resource set resource_flag='RESOURCE_STATE_NORMAL' where resource_flag='RESOURCE_STATE_DELETE';\""
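
Optionally, you can count the flagged rows first so you know how many resources the update will touch (and re-run the count afterwards to confirm it returns 0). This uses the same psql connection as the command above:

# su - postgres -c "/opt/vmware/vpostgres/current/bin/psql -d vcopsdb -p 5433 -c \"select count(*) from resource where resource_flag='RESOURCE_STATE_DELETE';\""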

5.  Perform Cassandra maintenance on the master node only.  Afterward you will run a Cassandra repair on the rest of the nodes, which will sync their databases with the master.  There are a total of four commands here, so run them in order:

# $VCOPS_BASE/cassandra/apache-cassandra-2.1.8/bin/cqlsh --ssl --cqlshrc $VCOPS_BASE/user/conf/cassandra/cqlshrc -e "consistency quorum; truncate globalpersistence.activity_tbl"
# $VCOPS_BASE/cassandra/apache-cassandra-2.1.8/bin/cqlsh --ssl --cqlshrc $VCOPS_BASE/user/conf/cassandra/cqlshrc -e "consistency quorum; truncate globalpersistence.activityresults_tbl"
# $VCOPS_BASE/cassandra/apache-cassandra-2.1.8/bin/cqlsh --ssl --cqlshrc $VCOPS_BASE/user/conf/cassandra/cqlshrc -e "consistency quorum; truncate globalpersistence.queueid_tbl"
# $VCOPS_BASE/cassandra/apache-cassandra-2.1.8/bin/nodetool -p 9008 clearsnapshot 

6.  Perform Cassandra maintenance on all nodes:

  • This is critical. The -pr option to the repair tool causes a subset of the token range to be coordinated from the node calling nodetool. This improves performance; however, it is critical that ALL nodes in the Cassandra cluster perform this operation to get a complete and consistent repair.  See www.datastax.com/dev/blog/repair-cassandra
  • Execute this on the master and all data nodes simultaneously:
# $VCOPS_BASE/cassandra/apache-cassandra-2.1.8/bin/nodetool -p 9008 repair -par -pr

7. To monitor the repair progress, you can start another SSH session to the master node and tail the following log:

# tail -f /storage/log/vcops/log/cassandra/system.log
  • The repair operation has two distinct phases: first it calculates the differences between the nodes (the repair work to be done), and then it acts on those differences by streaming data to the appropriate nodes.
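
If you only want the repair-related lines rather than the full log stream, a grep against the same log works too, since Cassandra tags its repair session messages:

# grep -i repair /storage/log/vcops/log/cassandra/system.log | tail -20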

Generally speaking, you can also monitor the nodetool repair operation with these two nodetool commands, but this is not necessary:

  • netstats: This monitors the repair streams to the nodes:
# $VCOPS_BASE/cassandra/apache-cassandra-2.1.8/bin/nodetool --port 9008 netstats
  • compactionstats: This checks on the active Merkle Tree calculations:
# $VCOPS_BASE/cassandra/apache-cassandra-2.1.8/bin/nodetool --port 9008 compactionstats

8. Perform the instance metric key ID clean-up on all nodes.  Perform this step in parallel on the master and data nodes.  This cleans up disk on the nodes by removing snapshot-related metric keys (note these are two separate commands):

# $VMWARE_PYTHON_BIN $VCOPS_BASE/tools/persistenceTool.py RemoveMatchedMetricKeys --outputFile /tmp/deleted_report.txt --regex "\"^diskspace-waste:.+?snapshot:.*\"" --remove true
# $VMWARE_PYTHON_BIN $VCOPS_BASE/tools/persistenceTool.py RemoveMatchedMetricKeys --outputFile /tmp/deleted_report2.txt --regex "\"^diskspace:.+?snapshot:.*(accessTime|used)$\"" --remove true
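
The --outputFile reports should list the metric keys that were matched and removed, so a quick line count on each node gives you a feel for how much was cleaned up:

# wc -l /tmp/deleted_report.txt /tmp/deleted_report2.txt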

9.  Clean up the alarms & alerts on all nodes.   Perform this step in parallel on the master and data nodes:

# su - postgres -c "/opt/vmware/vpostgres/9.3/bin/psql -p 5432 -U vcops -d vcopsdb -c 'truncate table alert cascade;'"
# su - postgres -c "/opt/vmware/vpostgres/9.3/bin/psql -p 5432 -U vcops -d vcopsdb -c 'truncate table alarm cascade;'"
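
If you are curious how many rows you are about to clear, you can count them first using the same psql connection (repeat for the alarm table):

# su - postgres -c "/opt/vmware/vpostgres/9.3/bin/psql -p 5432 -U vcops -d vcopsdb -c 'select count(*) from alert;'"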

10.  Bring the analytics processes back online.  Execute this step on the master, master replica, and data nodes. You may use an ssh for-loop and execute these commands sequentially (a sketch follows the command below):

# service vmware-vcops start analytics
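
Since sequential execution is fine here, a plain ssh for-loop from the master does the job. Again, the node IPs are placeholders for your own nodes:

# for node in 192.2.3.4 192.2.3.5 192.2.3.6 192.2.3.7; do ssh root@$node 'service vmware-vcops start analytics'; done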
  • I have seen the cluster take 20 to 30 minutes to come back online (from my experience with a 9+ node large cluster).
  • If you log into the https://<vrops>/admin page, you will see that the HA status is degraded or needs to be re-enabled.  Give the appliance time; it will reset itself back to a healthy green state once fully online.

11. Once the cluster is fully online and you can confirm the data is being collected, delete the snapshots you took earlier.

12. On the master node, if you again run the command:

# $VMWARE_PYTHON_BIN /usr/lib/vmware-vcops/tools/vrops-platform-cli/vrops-platform-cli.py getShardStateMappingInfo

You should see something similar to:

{
   "stateMappings": {
    "vrops1": {
      "vRealize Ops Shard-0724c812-9def-4391-9efa-2395d701d43e": {
        "state": "SYNCHING"
      },
      "vRealize Ops Shard-77839361-986c-4817-bbb3-e7f4f1827921": {
        "state": "SYNCHING"
      },
      "vRealize Ops Shard-8469fdff-55f0-49f7-a0e7-18cd6cc288c0": {
        "state": "RUNNING"
      },
      "vRealize Ops Shard-8c8d1ce4-36a5-4f23-b77d-29b839156383": {
        "state": "SYNCHING"
      },
      "vRealize Ops Shard-ab79572e-6372-48d2-990d-d21c884b46fb": {
        "state": "SYNCHING"
      },
      "vRealize Ops Shard-bfa03b9e-bac9-4040-b1a8-1fd8c2797a6a": {
        "state": "SYNCHING"
      }
    }
  },

13. On the master node, if you again run the nodetool status command:

# $VCOPS_BASE/cassandra/apache-cassandra-2.1.8/bin/nodetool --port 9008 status

You should see something similar to:

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--   Address    Load    Tokens  Owns  Host ID                               Rack
UN   192.2.3.6  120.20 MB  256     ?   232f646b-3fbc-4388-8962-34000c1bb48b  rack1
UN   192.2.3.7  128.20 MB  256     ?   1bfec59b-3ba8-4ca0-942f-5bb2f97b7319  rack1
UN   192.2.3.4  120.11 MB  256     ?   da6c672c-cc69-4284-a8f5-2775441bb113  rack1
UN   192.2.3.5  115.33 MB  256     ?   ee4a3c3f-3f0f-46ac-b331-a816e8eb37c5  rack1
DN   192.2.3.3  128.13 MB  256     ?   19e80237-6f2c-4ff6-881e-ce94870e0ca5  rack1

Note: Non-system keyspaces don't have the same replication settings, 
effective ownership information is meaningless
  • The ideal situation here is that the Cassandra DB load should now be around 1 GB or less.

14.   Log into the regular web interface and edit the policy to stop collecting snapshot metrics. This will help overall performance going forward.

 

Creating vROps Policies and How To Apply Them To Object Groups.

Creating policies in VMware's vRealize Operations appliance can be straightforward, provided there is a decent understanding of every platform it is monitoring.  In my last post of this series, I covered the creation of object groups, and that post is important here because policies can be created and assigned to those object groups, allowing you to tune the alerts received for those groups.

Once logged in to the vROps appliance, go into the Administration section, where you will find Policies.

vrops37.png

  • VMware has included many base policies in the policy library, which in most cases will be fine for the initial configuration of the appliance, but you may want to create additional policies to suit your specific environment's needs.
  • Also take note of the blue film strip in the upper right corner.  This will take you to VMware's video repository for an explanation of policies and a brief how-to video.  These video links can be found throughout the configuration of the appliance, and more are added with each release.

To create a new policy, click on the green plus sign to get started.  Give the policy a unique name, and it is good practice to add a description of what the policy is intended to do.  When creating a policy, you have the ability to “start with” a VMware pre-defined policy, and I recommend taking advantage of that until there is a firm understanding of what these policies do.

vrops38

On the Select Base Policy tab, you can use the drop-down menu on the left to get a policy overview of what is being monitored.  In this example, Host System was selected.

vrops39

Policy Overrides can also be incorporated into this policy.  In other words, if there are certain alerts that you do not want, one of the pre-defined policies may already have those alerts turned off, so that policy can be added to the new one being created here.  Work smarter, not harder, right?

vrops40

Moving along to the Analysis Settings tab, here you can see how vROps analyzes the alerts, determines thresholds, and assigns system badges.  These can be left at their current settings, per the policy you are building from, or you can click on the padlock to the right and make individual changes.  Keep in mind that under the “Show changes for” drop-down menu, you will have many objects to select from when changing the analysis settings.

vrops41

The Alert/Symptom Definitions tab is probably where the majority of time will be spent.  The “Alert Definitions” box at the top is where alerts can be turned on or off, based on the base policy used to create this one or on the override policies used.

  • Each management pack installed will have its own category for object type.  In other words, “host system” is listed under the vCenter category, but if the vCloud Director management pack were installed, it would also have a “host system” under its category.  Each management pack has the ability to add additional alerts for objects referenced in other management packs, so take time going through each category to see what alerts may need configuring.
  • The state of each alert will be one of: Local with a green check mark, meaning you enabled it; Inherited with a grey check mark, meaning it is enabled via another policy used to create this one; Local with a red crossed-out circle, meaning you disabled the alert for this policy; or Inherited with a greyed-out crossed-out circle, meaning it is disabled via another policy used to create this one.  Disabling alerts here will still allow the metrics to be collected for the object; you just won't get the alarm for it.
  • The Symptom Definitions section has the same “object type” drop-down menu, and you can select the object type here to configure the thresholds that trigger the symptoms for the alert selected in the Alert Definitions box above.  I typically do these in tandem.

vrops43

Finally, on the Apply Policy to Groups tab, you can apply the policy to the custom groups you created before.

vrops42

Once you click Save and go back to the Active Policies tab, you will be able to see the new policy, and within five minutes you should see the Affected Objects count rise.  You can see here that I have a policy marked with “D”, meaning it is the default appliance policy.  You can set your own policy as the default by clicking the blue circle icon with the arrow on the upper left side.  It may take up to 24 hours before the active alerts page reflects the settings of the new policy; otherwise, you can manually clear those alerts.

vrops44

Previous post to this series: Configuring VMware vRealize Operations Manager Object Groups

ESXi host fails to upgrade from 5.5 Update 3 to 6 Update 2

This happened to me today, and I thought it was worth sharing.  Most of the hosts in this particular cluster upgraded fine to ESXi 6u2 from ESXi 5.5u3, with the exception of this one host.  Update Manager kept giving me the error “Cannot run upgrade script on host” in the vCenter Recent Tasks pane.

esxi01

A quick Google search brought me to KB article 2007163, but after following the KB I wasn't able to find the referenced error “Remediation failed due to non mode failure” on the Update Manager server (Win2008) in the C:\AppData\VMware\Update Manager\Logs\vmware-vum-server-log4cpp.log file.

I started an SSH session to the ESXi host, but wasn't able to find an entry similar to the error “OSError: [Errno 39] Directory not empty:” in the /var/log/vua.log file.
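
As an aside, rather than paging through the whole file, a quick grep for likely error strings surfaces candidates faster (the pattern here is just an example):

# grep -iE "error|fail" /var/log/vua.log | tail -20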

I instead found this error:

—————————————————————-

[FFD0D8C0 error ‘Default’] Alert:WARNING: This application is not using QuickExit().

The exit code will be set to 0.@ bora/vim/lib/vmacore/main/service.cpp:147

–> Backtrace:

–> backtrace[00] rip 1bc228c3 Vmacore::System::Stacktrace::CaptureFullWork(unsigned int)

—————————————————————–

By chance, I happened to check space on the ESXi host with # df -h and found a partition that was 100% full.

esxi02

So I changed directory to it with # cd /storage/core/……./, where I found two more directories ending in /var/core/.  Using # ls to list the directory, I found two zdumps.

esxi03
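
If you are not sure what is eating a partition, a find for the dump files saves walking the directories by hand. The zdump name pattern matches what I found in my case, but may differ in yours:

# find /storage/core -name '*zdump*' -exec ls -lh {} \;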

I deleted the two zdumps and then checked the space again with # df -h.

esxi04

Seeing that the partition was now 68% utilized instead of 100%, I attempted the ESXi 6u2 upgrade again, this time with success.