Unable to add additional nodes to the cluster. This error occurred while adding an additional data node and remote collector. The cause turned out to be an expired custom certificate. Surprisingly, there was no noticeable mechanism, such as a yellow warning banner in the vROps UI, to warn that a certificate had expired or was about to expire.
Log into the new node being added, and tail vcopsConfigureRoles.log:
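Since the UI gave no warning, a manual check of the certificate's expiry date can help confirm this kind of problem. The following is a minimal sketch, not a supported VMware tool. It assumes that openssl is available on the node, that the custom certificate is a PEM-encoded X.509 file named customCert.pem, and that it lives in /storage/vcops/user/conf/ssl (the directory holding the custom cert files). The function name cert_expires_within and the 30-day window are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not part of vROps).
# Returns 0 if the certificate in $1 has expired or expires within $2 seconds,
# 1 if it stays valid beyond that window, 2 if the file cannot be parsed.
cert_expires_within() {
    cert=$1
    window=$2
    # Make sure the file really is a readable X.509 certificate first.
    openssl x509 -noout -in "$cert" >/dev/null 2>&1 || return 2
    # -checkend exits 0 when the certificate will NOT expire in the window.
    if openssl x509 -checkend "$window" -noout -in "$cert" >/dev/null 2>&1; then
        return 1
    fi
    return 0
}

# Assumed location and file name; override SSL_DIR if your node differs.
SSL_DIR=${SSL_DIR:-/storage/vcops/user/conf/ssl}
CERT="$SSL_DIR/customCert.pem"
WINDOW=2592000   # 30 days in seconds (illustrative threshold)

if [ -f "$CERT" ]; then
    # Print the notAfter date so it can be read directly.
    openssl x509 -enddate -noout -in "$CERT"
    if cert_expires_within "$CERT" "$WINDOW"; then
        echo "WARNING: $CERT has expired or expires within 30 days"
    else
        status=$?
        if [ "$status" -eq 1 ]; then
            echo "OK: $CERT is valid for more than 30 days"
        else
            echo "Could not parse $CERT as a certificate"
        fi
    fi
fi
```

Passing 0 as the window reports only certificates that have already expired.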
# tail -f /storage/vcops/log/vcopsConfigureRoles.log
You would see entries similar to:
2016-08-10 00:11:56,254  - root - WARNING - vc_ops_utilities - runHttpRequest - Open URL: 'https://localhost/casa/deployment/cluster/join?admin=172.22.3.14' returned reason: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:581), exception:
2016-08-10 00:11:56,254  - root - DEBUG - vcopsConfigureRoles - joinSliceToCasaCluster - Add slice to CaSA cluster response code: 9000
2016-08-10 00:11:56,254  - root - DEBUG - vcopsConfigureRoles - joinSliceToCasaCluster - Expected response code not found. Sleep and retry. 0
2016-08-10 00:12:01,259  - root - INFO - vcopsConfigureRoles - joinSliceToCasaCluster - Add Cluster to slice response code: 9000
2016-08-10 00:12:01,259  - root - INFO - vc_ops_logging - logInfo - Remove lock file: /usr/lib/vmware-vcopssuite/utilities/sliceConfiguration/conf/vcops-configureRoles.lck
2016-08-10 00:12:01,259  - root - DEBUG - vcopsPlatformCommon - programExit - Role State File to Update: '/usr/lib/vmware-vcopssuite/utilities/sliceConfiguration/data/roleState.properties'
2016-08-10 00:12:01,260  - root - DEBUG - vcopsPlatformCommon - UpdateDictionaryValue - Update section: "generalSettings" key: "failureDetected" with value: "true" file: "/usr/lib/vmware-vcopssuite/utilities/sliceConfiguration/data/roleState.properties"
2016-08-10 00:12:01,260  - root - DEBUG - vcopsPlatformCommon - loadConfigFile - Loading config file "/usr/lib/vmware-vcopssuite/utilities/sliceConfiguration/data/roleState.properties"
2016-08-10 00:12:01,261  - root - DEBUG - vcopsPlatformCommon - copyPermissionsAndOwner - Updating file permissions of '/usr/lib/vmware-vcopssuite/utilities/sliceConfiguration/data/roleState.properties.new' from 100644 to 100660
2016-08-10 00:12:01,261  - root - DEBUG - vcopsPlatformCommon - copyPermissionsAndOwner - Updating file ownership of '/usr/lib/vmware-vcopssuite/utilities/sliceConfiguration/data/roleState.properties.new' from 1000/1003 to 1000/1003
2016-08-10 00:12:01,261  - root - DEBUG - vcopsPlatformCommon - UpdateDictionaryValue - The key: failureDetected was updated
2016-08-10 00:12:01,261  - root - DEBUG - vcopsPlatformCommon - programExit - Updated failure detected to true
2016-08-10 00:12:01,261  - root - INFO - vcopsPlatformCommon - programExit - Exiting with exit code: 1, Add slice to CaSA Cluster failed. Response code: 9000. Expected: 200
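The telling entry in that output is the CERTIFICATE_VERIFY_FAILED warning. Rather than watching the tail, you could search the log for it directly. This is a rough sketch that assumes the same log path as above; the function name count_cert_failures is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print how many lines of log file $1 contain an
# SSL certificate verification failure.
count_cert_failures() {
    # grep -c prints the count; it exits 1 when the count is 0, so ignore that.
    grep -c 'CERTIFICATE_VERIFY_FAILED' "$1" || true
}

# Log path taken from the tail command above; override LOG if needed.
LOG=${LOG:-/storage/vcops/log/vcopsConfigureRoles.log}

if [ -f "$LOG" ]; then
    n=$(count_cert_failures "$LOG")
    echo "$n certificate verification failure(s) found in $LOG"
    # Show the most recent matching entries for context.
    grep 'CERTIFICATE_VERIFY_FAILED' "$LOG" | tail -n 5 || true
fi
```

A non-zero count points at a certificate problem, though it does not say which certificate is at fault.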
Take a snapshot of all vROps nodes.
Revert to VMware’s default certificate on all nodes by following KB article KB2144949.
The custom cert files that need to be renamed on the nodes are located at /storage/vcops/user/conf/ssl. This should be completed on all nodes. Alternatively, you can remove them, but renaming them is sufficient.
# mv customCert.pem customCert.pem.BAK
# mv customChain.pem customChain.pem.BAK
# mv customKey.pem customKey.pem.BAK
# mv uploaded_cert.pem uploaded_cert.pem.BAK
Now attempt to add the new node again. From the master node, you can watch the installation of the new node by tailing casa.log:
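If you have several nodes to work through, the renames can be wrapped in a small loop that skips files that are not present. This is only a sketch of the same mv commands, assuming the four file names and the directory given above; the function name backup_custom_certs is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: rename the custom certificate files in directory $1
# by appending .BAK, skipping any file that is not there.
backup_custom_certs() {
    dir=$1
    for f in customCert.pem customChain.pem customKey.pem uploaded_cert.pem; do
        if [ -f "$dir/$f" ]; then
            mv "$dir/$f" "$dir/$f.BAK" && echo "renamed $f to $f.BAK"
        else
            echo "skipped $f (not found)"
        fi
    done
}

# Directory from the step above; override SSL_DIR if your node differs.
SSL_DIR=${SSL_DIR:-/storage/vcops/user/conf/ssl}

if [ -d "$SSL_DIR" ]; then
    backup_custom_certs "$SSL_DIR"
fi
```

Run it on every node, and only after the snapshots have been taken.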
# tail -f /storage/vcops/log/casa/casa.log
Delete the snapshots as soon as possible.
- To add a new custom certificate to vRealize Operations Manager, follow this KB article: KB2046591
- There could be an old management pack installed that was meant for an older version of vROps. This has been known to cause failures. Follow this KB for more information: KB2119769
- If you are attempting to add a node to the cluster using an IP address that was previously used, the operation may fail. Follow this KB for more information: KB2147076