The Issue:
Unable to add additional nodes to cluster. This error happened while adding an additional data and remote collector. The cause ended up being a expired customer certificate, and surprisingly there was no noticeable mechanism such as a yellow warning banner in vROps UI to warn that a certificate had expired, or is about to expire.
Troubleshooting:
Log into the the new node being added, and tail the vcopsConfigureRoles.log
# tail -f /storage/vcops/log/vcopsConfigureRoles.log
You would see entries similar to:
2016-08-10 00:11:56,254 [22575] - root - WARNING - vc_ops_utilities - runHttpRequest - Open URL: 'https://localhost/casa/deployment/cluster/join?admin=172.22.3.14' returned reason: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:581), exception: 2016-08-10 00:11:56,254 [22575] - root - DEBUG - vcopsConfigureRoles - joinSliceToCasaCluster - Add slice to CaSA cluster response code: 9000 2016-08-10 00:11:56,254 [22575] - root - DEBUG - vcopsConfigureRoles - joinSliceToCasaCluster - Expected response code not found. Sleep and retry. 0 2016-08-10 00:12:01,259 [22575] - root - INFO - vcopsConfigureRoles - joinSliceToCasaCluster - Add Cluster to slice response code: 9000 2016-08-10 00:12:01,259 [22575] - root - INFO - vc_ops_logging - logInfo - Remove lock file: /usr/lib/vmware-vcopssuite/utilities/sliceConfiguration/conf/vcops-configureRoles.lck 2016-08-10 00:12:01,259 [22575] - root - DEBUG - vcopsPlatformCommon - programExit - Role State File to Update: '/usr/lib/vmware-vcopssuite/utilities/sliceConfiguration/data/roleState.properties' 2016-08-10 00:12:01,260 [22575] - root - DEBUG - vcopsPlatformCommon - UpdateDictionaryValue - Update section: "generalSettings" key: "failureDetected" with value: "true" file: "/usr/lib/vmware-vcopssuite/utilities/sliceConfiguration/data/roleState.properties" 2016-08-10 00:12:01,260 [22575] - root - DEBUG - vcopsPlatformCommon - loadConfigFile - Loading config file "/usr/lib/vmware-vcopssuite/utilities/sliceConfiguration/data/roleState.properties" 2016-08-10 00:12:01,261 [22575] - root - DEBUG - vcopsPlatformCommon - copyPermissionsAndOwner - Updating file permissions of '/usr/lib/vmware-vcopssuite/utilities/sliceConfiguration/data/roleState.properties.new' from 100644 to 100660 2016-08-10 00:12:01,261 [22575] - root - DEBUG - vcopsPlatformCommon - copyPermissionsAndOwner - Updating file ownership of '/usr/lib/vmware-vcopssuite/utilities/sliceConfiguration/data/roleState.properties.new' from 1000/1003 to 1000/1003 2016-08-10 00:12:01,261 [22575] - root - DEBUG - vcopsPlatformCommon - UpdateDictionaryValue - The key: failureDetected was updated 2016-08-10 00:12:01,261 [22575] - root - DEBUG - vcopsPlatformCommon - programExit - Updated failure detected to true 2016-08-10 00:12:01,261 [22575] - root - INFO - vcopsPlatformCommon - programExit - Exiting with exit code: 1, Add slice to CaSA Cluster failed. Response code: 9000. Expected: 200
Resolution:
Step #1
Take snapshot of all vROps nodes
Step #2
Revert back to VMware’s default certificate on all nodes using the following kb article. KB2144949
Step #3
The custom cert files that need to be renamed on the nodes are located at /storage/vcops/user/conf/ssl. This should be completed on all nodes. Alternatively, you can remove them, but renaming them is sufficient.
# mv customCert.pem customCert.pem.BAK
# mv customChain.pem customChain.pem.BAK
# mv customKey.pem customKey.pem.BAK
# mv uploaded_cert.pem uploaded_cert.pem.BAK
Step #4
Now attempt to add the new node again. From the master node, you can watch the installation of the new node by tailing the casa.log
# tail -f /storage/vcops/log/casa/casa.log
Delete the snapshots as soon as possible.
- To add a new custom certificate to the vRealize Operations Manager, follow this KB article: KB2046591
_________________________________________________________________
Alternative Solutions
- There could be an old management pak installed that was meant for an older version of vROps. This has been know to cause failures. Follow this KB for more information: KB2119769
- If you are attempting to add a node to the cluster using an IP address previously used, the operation may fail. Follow this KB for more information: KB2147076