Failure Adding an Additional Node to vRealize Operations Manager Due to Expired Certificate

The Issue:

Unable to add additional nodes to cluster.  This error happened while adding an additional data and remote collector.  The cause ended up being a expired customer certificate, and surprisingly there was no noticeable mechanism such as a yellow warning banner in vROps UI to warn that a certificate had expired, or is about to expire.

Troubleshooting:

Log into the the new node being added, and tail the vcopsConfigureRoles.log

# tail -f /storage/vcops/log/vcopsConfigureRoles.log

You would see entries similar to:

2016-08-10 00:11:56,254 [22575] - root - WARNING - vc_ops_utilities - runHttpRequest - Open URL: 'https://localhost/casa/deployment/cluster/join?admin=172.22.3.14' returned reason: 
[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:581), exception: 
2016-08-10 00:11:56,254 [22575] - root - DEBUG - vcopsConfigureRoles - joinSliceToCasaCluster - Add slice to CaSA cluster response code: 9000
2016-08-10 00:11:56,254 [22575] - root - DEBUG - vcopsConfigureRoles - joinSliceToCasaCluster - Expected response code not found. Sleep and retry. 0 
2016-08-10 00:12:01,259 [22575] - root - INFO - vcopsConfigureRoles - joinSliceToCasaCluster - Add Cluster to slice response code: 9000 
2016-08-10 00:12:01,259 [22575] - root - INFO - vc_ops_logging - logInfo - Remove lock file: /usr/lib/vmware-vcopssuite/utilities/sliceConfiguration/conf/vcops-configureRoles.lck
2016-08-10 00:12:01,259 [22575] - root - DEBUG - vcopsPlatformCommon - programExit - Role State File to Update: '/usr/lib/vmware-vcopssuite/utilities/sliceConfiguration/data/roleState.properties'
2016-08-10 00:12:01,260 [22575] - root - DEBUG - vcopsPlatformCommon - UpdateDictionaryValue - Update section: "generalSettings" key: "failureDetected" with value: "true" file: "/usr/lib/vmware-vcopssuite/utilities/sliceConfiguration/data/roleState.properties"
2016-08-10 00:12:01,260 [22575] - root - DEBUG - vcopsPlatformCommon - loadConfigFile - Loading config file "/usr/lib/vmware-vcopssuite/utilities/sliceConfiguration/data/roleState.properties"
2016-08-10 00:12:01,261 [22575] - root - DEBUG - vcopsPlatformCommon - copyPermissionsAndOwner - Updating file permissions of '/usr/lib/vmware-vcopssuite/utilities/sliceConfiguration/data/roleState.properties.new' from 100644 to 100660
2016-08-10 00:12:01,261 [22575] - root - DEBUG - vcopsPlatformCommon - copyPermissionsAndOwner - Updating file ownership of '/usr/lib/vmware-vcopssuite/utilities/sliceConfiguration/data/roleState.properties.new' from 1000/1003 to 1000/1003
2016-08-10 00:12:01,261 [22575] - root - DEBUG - vcopsPlatformCommon - UpdateDictionaryValue - The key: failureDetected was updated 
2016-08-10 00:12:01,261 [22575] - root - DEBUG - vcopsPlatformCommon - programExit - Updated failure detected to true 
2016-08-10 00:12:01,261 [22575] - root - INFO - vcopsPlatformCommon - programExit - Exiting with exit code: 1, Add slice to CaSA Cluster failed. Response code: 9000.  Expected: 200

Resolution:

Step #1

Take snapshot of all vROps nodes

Step #2

Revert back to VMware’s default certificate on all nodes using the following kb article. KB2144949

Step #3

The custom cert files that need to be renamed on the nodes are located at /storage/vcops/user/conf/ssl.  This should be completed on all nodes.   Alternatively, you can remove them, but renaming them is sufficient.

# mv customCert.pem customCert.pem.BAK
# mv customChain.pem customChain.pem.BAK
# mv customKey.pem customKey.pem.BAK
# mv uploaded_cert.pem uploaded_cert.pem.BAK   

Step #4

Now attempt to add the new node again.  From the master node, you can watch the installation of the new node by tailing the casa.log

# tail -f /storage/vcops/log/casa/casa.log

Delete the snapshots as soon as possible.

  • To add a new custom certificate to the vRealize Operations Manager, follow this KB article: KB2046591

_________________________________________________________________

Alternative Solutions

  • There could be an old management pak installed that was meant for an older version of vROps.  This has been know to cause failures.  Follow this KB for more information: KB2119769

 

  • If you are attempting to add a node to the cluster using an IP address previously used, the operation may fail.  Follow this KB for more information: KB2147076