Upgrading To vSphere 6.7 Update 1, and Using The vCenter Converge Tool: Part 2

In this second part of the blog series “Upgrading to vSphere 6.7 Update 1, and Using the vCenter Converge Tool”, I will go over my experience using the Converge Tool. Let’s get started.

The Basics of the vCenter Converge Tool

David Stamen (@davidstamen) put together an excellent blog on Understanding the vCenter Server Converge Tool at VMware’s official blog site, which I found very useful. Shout-out to Nigel Hickey (@vCenterNerd) for answering some questions I had.

The Converge Tool basically takes the external PSC and embeds it into the vCenter appliance, like so:

Photo credit: @davidstamen

For this customer, I had three vCSAs and three PSCs that I needed to converge. Most of the blogs I found didn’t cover PSCs that were joined to a domain, environments with multiple vCenters, or environments with multiple PSCs, so I thought I would write this up in a blog.

Planning the Convergence

The first thing I had to do was take note of any services registered with the SSO domain. I used VMware’s KB2043509 to identify these services, and it turned out I had none to worry about. VMware specifically calls out NSX and Site Recovery Manager (SRM), but those were not in use at this customer, so the only solutions I had to check were Horizon, vROps, vRLI and Zerto. Each of these registers directly with the vCenters, so there was nothing to re-register there. If any services had been registered with the SSO domain, I would simply need to re-register them once the Converge Tool was run. Since this didn’t apply, I could move forward with configuring the scripts for the Converge Tool.

I also needed to understand the replication topology of the existing SSO domain. VMware KB2127057 was an excellent resource for gathering that information. Opening a PuTTY session to a vCenter and running the ‘vdcrepadmin’ command against each of the external PSCs, I was able to see the following:

# cd /usr/lib/vmware-vmdir/bin

./vdcrepadmin -f showpartners -h external_psc-a.domain.com -u administrator -w kjdshfsdkjfhskjdhf

ldap://external_psc-b.domain.com
ldap://external_psc-c.domain.com

-----------------------------------------------------------------

./vdcrepadmin -f showpartners -h external_psc-b.domain.com -u administrator -w kjdshfsdkjfhskjdhf

ldap://external_psc-a.domain.com
ldap://external_psc-c.domain.com

-----------------------------------------------------------------

./vdcrepadmin -f showpartners -h external_psc-c.domain.com -u administrator -w kjdshfsdkjfhskjdhf

ldap://external_psc-a.domain.com
ldap://external_psc-b.domain.com

I could see they already had a ring topology, which is the desired architecture. If I were to draw the SSO topology out, it would look something like this:

Setting Up the JSON Templates for the Convergence Tool

The converge.json template that the Converge Tool uses can be found on the VMware VCSA ISO that was used for the 6.7 Update 1 upgrade, under the following path: DVD Drive (#):\VMware VCSA\vcsa-converge-cli\templates\

To make my life easier, I copied the contents of the entire ISO to a folder on the root of my C drive. I then made a separate folder on the root of C called converge, and created a folder for each of the three vCenters I’d be working with: vCenter-A, vCenter-B, vCenter-C. I made a copy of the converge.json and placed it into each folder.
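
So the working folder structure looked something like this:

C:\converge\vCenter-A\converge.json
C:\converge\vCenter-B\converge.json
C:\converge\vCenter-C\converge.json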

Taking a look at the converge.json for vCenter-A, the template tells you what data needs to be filled in, so pay close attention. Lines 10 – 15 need entries for the ESXi host where the vCenter resides, or for the managing vCenter appliance. Here I chose the option to use the managing ESXi host; all I needed to do was look in vSphere to see which host the vCSA appliance VM resided on. While there, I also set the cluster DRS setting to manual, to prevent the VMs from moving during the upgrade. Once I had the information I needed, I completed that portion of the JSON. (I’ve redacted environment-specific information.)

Lines 16 – 21 need data entries for the first vCenter appliance (vCenter-A) to be converged. Here I needed the FQDN for vCenter-A; for the username, I needed the administrator@vsphere.local account, its password, and the root password of the appliance.

Lines 22 – 33 would be filled out IF the Platform Services Controller (PSC) appliance is joined to the domain. My customer’s PSCs were joined to the domain, so I needed to fill this section out. Otherwise, you can remove this section from the JSON.

Now, because this is the first of three vCenters in the same SSO domain, I did not need the replication partner section for the first convergence, because the first vCenter does not have a partner yet. It will be needed, however, for the second (vCenter-B) and third (vCenter-C) convergences.

Next, I needed to fill out a second and third converge.json for the second and third convergences, saving each in its respective folder. For vCenter-B and vCenter-C, for the partner hostname on line 32, I used the FQDN of the first converged vCenter (vCenter-A), as that is the first partner in the SSO domain.

For vCenter-A, the first to be converged, the completed converge.json looks like this (take note of the commas, brackets and lines removed):
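
What follows is only a sketch based on the standard 6.7 Update 1 template layout; every hostname, account and password is a placeholder, so double-check the key names and the __version value against the template shipped on your ISO. The replication partner section is removed because vCenter-A is the first node:

{
    "__version": "<keep the value from the template on your ISO>",
    "__comments": "Converge template for vCenter-A - first node, so no replication partner section",
    "vcenter": {
        "managing_esxi_or_vc": {
            "hostname": "esxi-host-a.domain.com",
            "username": "root",
            "password": "**********"
        },
        "vc_appliance": {
            "hostname": "vcenter-a.domain.com",
            "username": "administrator@vsphere.local",
            "password": "**********",
            "root_password": "**********"
        },
        "ad_domain_info": {
            "domain_name": "domain.com",
            "username": "domain-join-account",
            "password": "**********",
            "dns_ip": "192.168.1.10"
        }
    }
}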

For the second convergence (vCenter-B), and third convergence (vCenter-C), the completed converge.json looks like this:
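
It is essentially the same file; the differences are that the appliance entries point at vCenter-B (or vCenter-C) and the replication partner entry stays in, pointing at the first converged node. A sketch of the relevant lines, with the same caveats as above:

        "vc_appliance": {
            "hostname": "vcenter-b.domain.com",
            "username": "administrator@vsphere.local",
            "password": "**********",
            "root_password": "**********"
        },
        ...
        "replication_partner_hostname": "vcenter-a.domain.com"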

Now that the converge.json is done for each of the vCenters, we can work on the decommission.json.

Here is the template VMware provides in the same directory:

Lines 11 – 15 require input for the managing vCenter or ESXi host of the external PSC. Again, just like with the vCenter, I used the ESXi host that the PSC is running on.

Lines 16 – 21 need data for the Platform Services Controller that will be decommissioned.

Lines 30 – 34 require information for the managing vCenter or ESXi host of the vCenter the PSC was paired with. Again, I just used the ESXi host that the vCenter is currently running on.

Lines 35 – 39 require the information for the vCenter the PSC is paired with.

With the decommission.json filled out for the first vCenter (vCenter-A), I had to repeat the process for the second and third vCenters (vCenter-B, vCenter-C). The full decommission.json should look something like this:
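
Again, this is only a sketch based on the standard 6.7 Update 1 decommission template layout; verify the key names against the template on your ISO and treat every value as a placeholder. This example is for the external PSC paired with vCenter-A:

{
    "__version": "<keep the value from the template on your ISO>",
    "__comments": "Decommission the external PSC that was paired with vCenter-A",
    "psc": {
        "managing_esxi_or_vc": {
            "hostname": "esxi-host-b.domain.com",
            "username": "root",
            "password": "**********"
        },
        "psc_appliance": {
            "hostname": "external_psc-a.domain.com",
            "username": "administrator@vsphere.local",
            "password": "**********",
            "root_password": "**********"
        }
    },
    "vcenter": {
        "managing_esxi_or_vc": {
            "hostname": "esxi-host-a.domain.com",
            "username": "root",
            "password": "**********"
        },
        "vc_appliance": {
            "hostname": "vcenter-a.domain.com",
            "username": "administrator@vsphere.local",
            "password": "**********"
        }
    }
}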

Now that both the converge.json and decommission.json have been filled out for each of my three environments and stored in the same directory on the root of C, I could move forward with the convergence process.

Prerequisites and Considerations Before Starting the Convergence Process

  • The Converge Tool only supports VCSA and PSC 6.7 Update 1. All nodes must be on 6.7 Update 1 before converging.
  • If you are currently running a Windows vCenter Server or PSC, you must migrate to the appliance first.
  • Before converging, take backups of your VCSA(s) and PSCs in the vSphere SSO domain (VM snapshots and DB backups).
  • Know all other solutions using the PSC for authentication in the environment. They will need to be re-registered after the convergence completes and before decommissioning.
  • A machine on a routable network which can communicate with the VCSA and PSC will be used to run the convergence and decommission process.
  • Set the DRS Automation Level to manual, and the Migration Threshold to conservative. There will be issues if the VCSA being converged is moved during the process (see the PowerCLI sketch after this list).
  • If VCHA is enabled, it must be disabled prior to running the convergence process.
  • The converge process will handle PSC HA load balancers. Make sure you point to the VIP in the JSON template if you have them.
  • All vSphere SSO data is migrated with the exception of local OS users.
  • Best to take snapshots of the vCSA and external PSC VMs before continuing. We’ve already backed up the database, but it doesn’t hurt to have snapshots as well.
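
To cover that DRS item without clicking through each cluster, here is a rough PowerCLI sketch. It assumes PowerCLI is installed, and the vCenter and cluster names below are placeholders; the migration threshold itself is easiest to adjust in the UI:

# Connect to the vCenter that manages the appliance VMs (placeholder name)
Connect-VIServer -Server vcenter-a.domain.com

# Note the current DRS settings so they can be restored after the convergence
Get-Cluster | Select-Object Name, DrsEnabled, DrsAutomationLevel

# Set DRS to manual on the cluster hosting the vCSA/PSC appliances
Set-Cluster -Cluster "Cluster-01" -DrsAutomationLevel Manual -Confirm:$false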

Executing the Converge Tool

Now that the converge.json template for each vCenter (vCenter-A, vCenter-B, vCenter-C) is filled out properly, we can execute. We will run the Converge Tool against the first vCenter (vCenter-A). Note: we can only run the Converge Tool against one vCSA at a time.

In PowerShell, we can first run the following command before proceeding with the upgrade, to see what options and parameters are available with the Converge Tool.

.\vcsa-converge-cli\win32\vcsa-util.exe converge --help 

To execute the convergence tool against the first vCenter (vCenter-A), I ran the following command:

.\vcsa-converge-cli\win32\vcsa-util.exe converge --no-ssl-certificate-verification --backup-taken C:\pathtofile.json

The output in PowerShell should look something like this:

It will then ask you to reboot the first vCenter before continuing.

Once the first vCenter (vCenter-A) came back up, I executed the Converge Tool against the second vCenter (vCenter-B). Once it completed, I restarted the appliance.

Finally, the last vCenter (vCenter-C) was on deck. I executed the Converge Tool with its converge.json against that vCenter, and once it completed, I restarted it.

Here is where you would need to re-point any systems that were registered with the old SSO domain, but since I didn’t have any, I could move forward with the decommissioning steps.

Decommissioning the Old External Platform Services Controllers (PSCs)

Next, I used the Converge Tool with the decommission option to remove the external PSCs. Just like before, this needs to be done one PSC at a time. The command looks something like this:

 .\vcsa-converge-cli\win32\vcsa-util.exe decommission --no-ssl-certificate-verification C:\pathtofile.json 

Once the process successfully completes, move on to the next PSC. Repeat the process until all PSCs have been decommissioned.

Validate the SSO Replication Topology After the Converge Process

If you’ll remember, when I set up the converge.json files, I set the replication partner for the second vCenter (vCenter-B) and the third vCenter (vCenter-C) to the first converged vCenter (vCenter-A). My replication topology currently looks like this:

I needed to close the loop between vCenter-B and vCenter-C. Using VMware’s KB2127057, I used the ‘createagreement’ parameter. I opened a PuTTY session to vCenter-B and ran the following command:

# cd /usr/lib/vmware-vmdir/bin

./vdcrepadmin -f createagreement -2 -h vcenter-b.domain.com -H vcenter-c.domain.com -u Administrator -w VMw@re123

Now that the SSO replication agreement has been made between vCenter-B and vCenter-C, my replication topology looks like this:
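
To confirm the loop actually closed, the same showpartners check from earlier in this post can be re-run against each converged vCenter; every node should now list the other two as ldap:// partners. For example (the password is a placeholder):

# cd /usr/lib/vmware-vmdir/bin

./vdcrepadmin -f showpartners -h vcenter-b.domain.com -u administrator -w <password>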

I’m not going to lie: the hardest part of using the Converge Tool was just getting started. I’ve been through enough fires in my day to know how bad a time I would have had if something went wrong and I lost either a vCenter or an external PSC before the convergence successfully completed. Once I got past that mental hurdle, the process was actually quite easy and smooth.

I know I’ve left this customer’s environment in much better shape than I found it, and having embedded PSCs will make future vCenter upgrades a breeze. For a VMware PSO consultant, this was a huge value-add for the customer.

Blog Date: April 16, 2019

Upgrading To vSphere 6.7 Update 1, and Using The vCenter Converge Tool: Part 1

I recently wrapped up a vSphere 6.7 U1 upgrade project while on a VMware Professional Services (PSO) engagement with a customer in Denver, Colorado. On this project, I had to upgrade their three VMware environments from 6.5 to 6.7 Update 1. This customer also had three external Platform Services Controllers (PSCs), a configuration that is now deprecated in VMware architecture.

Check the VMware Interoperability and Compatibility Matrices

The first thing I needed to do was take inventory of the customer’s environment. I needed to know how many vCenters they had, whether they had external Platform Services Controllers, how many hosts and vSphere Distributed Switches (VDS) there were, and what versions everything was running.

  • From my investigation, this customer had three vCenters and three external Platform Services Controllers (PSCs), all part of the same SSO domain.
  • I also made note of which vCenter was paired with which external PSC. This information is critical not only for the vSphere 6.7 U1 upgrade, but also for the convergence process that I will be doing in part two of this blog series.
  • Looking at the customer’s ESXi hosts, the majority were running the same ESXi 6.5 build, but I did find a few Nutanix clusters and six ESXi hosts still on version 5.5.
  • The customer had multiple vSphere Distributed Switches (VDS) that needed to be upgraded to 6.5 before the 6.7 upgrade.

The second thing I needed to do was look at the model of each ESXi host and determine whether it was supported for the vSphere 6.7 U1 upgrade. I also needed to validate the firmware and BIOS each host was running, to see if I needed to have the customer upgrade them. I plugged this information into the VMware Compatibility Guide (see the command sketch after the notes below for a quick way to pull this from the hosts).

  • From my investigation, the six ESXi hosts running ESXi 5.5 were not compatible with 6.7 U1; however, they were compatible with the current build of ESXi 6.5 the customer was running on their other hosts. I would need to upgrade these hosts to ESXi 6.5 before starting the vSphere 6.7 U1 upgrade.
  • This customer had a mix of Dell and Cisco UCS hosts, and almost all needed to have their firmware and BIOS upgraded to be compatible with ESXi 6.7 U1.
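
Here is a quick way to pull the host model and ESXi build information that feeds the compatibility guide, run over SSH to a host. These are standard esxcli commands; the BIOS and firmware versions still come from the vendor tools (for this customer, UCS Manager and iDRAC):

# Hardware vendor and model of the host
esxcli hardware platform get

# ESXi version and build number
esxcli system version get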

The third thing I needed to check was what other platforms, either from VMware or bolted-on third parties, I needed to worry about for this upgrade.

  • The customer is using a later version of VMware’s Horizon solution, and luckily for me, it is compatible with vSphere 6.7 U1, so no worries there.
  • The customer has Zerto 6.0 deployed, and unfortunately it needed to be upgraded to a compatible version.
  • The customer has the Actifio backup solution, but it is also running a compatible version, so again no need to update it.

Let’s Get Those ESXi 5.5 Hosts Upgraded to 6.5

I needed to schedule an outage with the customer, as they had three offsite locations with two ESXi 5.5 hosts each. These hosts were using local storage to house and run the VMs, so even though they were in a host cluster, HA was not an option, and the VMs would need to be powered off.

Once I had the outage secured, I was able to move forward with upgrading these six hosts to ESXi 6.5.

Time to Upgrade the vSphere Distributed Switch (VDS)

For this portion of the upgrade, I only needed to upgrade the customer’s distributed switches to 6.5. This portion of the upgrade was fast, and I was able to do it mid-day without the customer experiencing an outage. We did submit a formal maintenance request for visibility, and CYA. The total time to upgrade all of their switches was less than 15 minutes; each switch took roughly a minute.
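
For anyone who would rather script this than click through each switch, here is a rough PowerCLI sketch, assuming a connected session; the switch name and target version are placeholders for whatever your environment needs:

# List all distributed switches and their current versions
Get-VDSwitch | Select-Object Name, Version, Mtu

# Upgrade a single switch (repeat per switch, or pipe the list)
Set-VDSwitch -VDSwitch "DSwitch-Prod" -Version "6.5.0"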

Upgrade the External Platform Services Controllers Before the vCenter Appliances

Now that I had all hosts on a compatible ESXi 6.5 version, I could move forward with the upgrade. I was able to do this during the day, as the customer would only lose the ability to manage their VMs through the vCenters. I made backups of the PSC and vCSA databases, and created snapshots of the VMs just in case.

I first needed to upgrade the three external PSCs to 6.7 U1, so I simply attached the vCSA ISO to my jump VM and launched the installer .exe. I did this one PSC at a time until they were all upgraded to 6.7 U1.

Upgrade the vCenter Appliances to 6.7 Update 1

Now that the external Platform Services Controllers were on 6.7 U1, it was time to upgrade the vCenters. The process is the same with the installer, so I just did one vCenter at a time. Both the external PSCs and the vCSAs upgraded without issue, and within a couple of hours they had all finished the vSphere 6.7 Update 1 upgrade.

Upgrade Compatible ESXi Hosts to 6.7 Update 1

I really wanted to use the now-embedded VMware Update Manager (VUM), but I kept running into users who re-attached ISOs to their Horizon VMs, or administrators who were upgrading or installing VMware Tools. In one cluster I even found a host with networking configured improperly compared to its peers. Once I got all of that out of the way, I was able to schedule VUM to work its way down through each cluster and upgrade the ESXi hosts to 6.7 Update 1. There were still some fringe cases where VUM wouldn’t work as intended, and I needed to do one host at a time.

Conclusion for the Upgrade

In the end, upgrading the customer’s three environments (vCSA, PSC and ESXi) to 6.7 Update 1 took me about two weeks on my own. Not too shabby, considering I finished ahead of schedule even with all of the issues I faced. After the upgrade, the customer’s Cisco UCS blades started purple-screening at random. We opened a case with GSS, and that same week Cisco came out with an emergency patch for the fnic driver on the customer’s UCS blades, for the very issue they were facing. The customer was able to quickly patch the blades. Talk about good timing.

Part 2 Incoming

Part 2 of this series will focus on using the vCenter Converge Tool. Stay tuned.

Blog Date: 4/15/2019

2019 VMware vExpert Announcement

It’s that time of year again. I’m honored and humbled to continue to be a part of the VMware vExpert program. This program challenges me every day to continue to learn and contribute to the #vCommunity. For me, this isn’t just some title. This is a family of community warriors where we learn from and help each other grow. Everyone in some way gives back to the community. This year, I am also excited to try my hand at public speaking, and give back to the VMUG community as a community session speaker. I don’t think I would have had the courage to apply to be a speaker if it wasn’t for my fellow vExperts encouraging me to do so.

Congrats to all the new and returning vExperts! https://blogs.vmware.com/vexpert/2019/03/07/vexpert-2019-award-announcement/

vRealize Operations Manager Dashboard: vSphere DRS Cluster Health. Part 2

This blog series assumes that the reader has some understanding of how to create a vRealize Operations Manager (vROps) dashboard.

vROps dashboards are made up of what are called widgets. These widgets can either be configured as “self providers”, or be populated with data by a “parent” widget. Self-provider widgets are configured to individually show specific data; in other words, one widget shows hosts, another shows datastores, and another shows virtual machines, but the widgets do not interact, nor are they dependent on each other. Parent widgets are configured to pull data from a specific source and then feed it into other child widgets on the page. This is useful when you want the same data displayed in different formats. The dashboard I configured, called “vSphere DRS Cluster Health”, does just that. I will break the widgets down into different sections as I walk through the configuration.

Widget #1 – This widget is known as an “object list”, and it will be the parent widget of this dashboard. In other words, widgets #2 through #6 rely on the data presented by widget #1. In this case I have the object list widget configured to list the different host clusters in my home lab.

I have given it a title, set Refresh Content to ON, set the mode to PARENT, and set it to auto-select the first row. In the lower-left section, “Select which tags to filter”, I have created an environment group in vROps called “Cluster Compute Resource” where I have specified my host clusters. In the lower-right box, I have selected a few metrics that I would also like this object list widget to show.

This is just a single-ESXi home lab, so this won’t look as grand as it would if it were configured for a production environment. But each object in this list is selectable, and the cool thing is that selecting an object changes the other widgets.

Widgets #2 and #3 are called “health charts”. One is configured with the custom metric Cluster CPU Workload %, and the other with the custom metric Cluster Memory Workload %; otherwise the two configurations are identical. I have both configured to show data for the past 24hrs.

Important: For these two widgets, under “widget interactions“, set both to the first widget: DRS Cluster Settings (Select a cluster to populate charts)

Widgets #4 and #5 are called “view widgets”. One is configured to show the current cluster CPU demand, and the other the current cluster memory demand. These are also configured to forecast out 30 days, so that we can see whether the clusters will run short of capacity in the near future, giving us the ability to add more compute to the cluster preemptively.

These are two custom “views” I created. I will go over how to create custom views in a future post, but for those who already know how, I have one of these views configured in each widget.

Important: For these two widgets, under “widget interactions“, set both to the first widget: “DRS Cluster Settings (Select a cluster to populate charts)” like we did above.

Widget #6 is another “object list” widget, and I have it configured to show only the host systems of the cluster selected in widget #1. Widget #6 will be used to provide data to widgets #7 and #8.

I also have certain Host System metrics selected here so that I can get high level information of the hosts in the cluster.

Important: For this widget, under “widget interactions“, set it to the first widget: “DRS Cluster Settings (Select a cluster to populate charts)” like we did above.

The final two widgets, #7 and #8, are also called “health chart” widgets. One is configured using the metric host system CPU workload %, and the other is configured using the metric host system Memory workload %. I have both configured to show data for the past 24hrs.

Important: For these two widgets, under “widget interactions“, set both to widget #6, in this example: Host Workload (select a host to populate charts to the right).

vRealize Operations Manager Dashboard: vSphere DRS Cluster Health. Part 1

A few weeks ago, I had a customer ask me about creating a custom vROps dashboard for them, so that they could monitor the health of their clusters. For those of you who are unaware, VMware packages vROps with a widget called “DRS Cluster Settings” that does something similar, and it looks like this:

The idea behind this widget is that it lists all clusters attached to the vCenter, giving you high-level information such as the DRS setting and the memory and CPU workload of the cluster. With a cluster selected, the lower window shows all of the ESXi hosts that are part of that cluster, with their CPU and memory workloads as well. If you are interested in this widget, it can be added when creating a new custom dashboard; you will find it at the bottom of the available widget list.

While this widget gave me some high level detail, it wasn’t exactly what I wanted, so I decided to create my own to give a deeper level of detail. I used the widget above as a template, and went from there.

This dashboard gives me the current memory and CPU workloads for each cluster in the upper-left box, and once a cluster is selected, it populates the right and two middle boxes with data. The top-right boxes give me the memory and CPU workload for the past 24hrs, and the two middle boxes give me the CPU demand and memory demand forecasts for the next 30 days.

Much like the widget mentioned above, selecting a cluster on the upper left populates the box on the lower left with all hosts attached to that cluster. Once a host is selected, the lower-right box shows the memory and CPU workload for the past 24hrs for that host. This dashboard is slightly larger than a page will allow, so unfortunately users need to scroll down to see all of the data, but I believe it gives an outstanding bird’s-eye view of the clusters’ DRS capabilities.

In my next blog post, I’ll break down what’s involved in creating this dashboard.

The Home Lab Part 2

This is the very long overdue follow-up post to my The Home Lab entry made earlier this year. I recently purchased another 64GB (2x 32GB) of Black Diamond DDR4 memory to bring my server up to 128GB. I had some old 1TB spinning disks that I installed in the box for some extra storage as well, although I will phase them out with more SSDs in the future. So, as a recap, this is my setup now:


Motherboard

SUPERMICRO MBD-X10SDV-TLN4F-O Mini ITX Server Motherboard Xeon processor D-1541 FCBGA 1667

Newegg

 

Memory


(x2) Black Diamond Memory 64GB (2 x 32GB) 288-Pin DDR4 SDRAM ECC Registered DDR4 2133 (PC4 17000) Server Memory Model BD32GX22133MQR26

Newegg

M.2 SSD


WD Blue M.2 250GB Internal SSD Solid State Drive – SATA 6Gb/s – WDS250G1B0B

Newegg

SSD


(x 2) SAMSUNG 850 PRO 2.5″ 512GB SATA III 3D NAND Internal Solid State Drive (SSD) MZ-7KE512BW

Newegg

 

Case


SUPERMICRO CSE-721TQ-250B Black Mini-Tower Server Case 250W Flex ATX Multi-output Bronze Power Supply

Newegg

 

Additional Storage

x2 1TB Western Digital Black spinning disks

 

Initially, when I built the lab, I decided to use VMware Workstation, but I recently rebuilt it with ESXi 6.7 as the base, largely for better performance and reliability. For the time being this will be a single-host environment, but keeping with the versioning, vCSA and vROps are 6.7 as well. Can an HTML5 interface be sexy? This has come a long way from the flash client days.


I decided against fully configuring this host as a single vSAN node, just so that I can have the extra disk.  However, when I do decide to purchase more hardware and build a second or third box, this setup will allow me to grow my environment, and reconfigure it for vSAN use.  Although I am tempted to ingest the SSDs into my NAS, carve out datastores from it and not use vSAN, at least for the base storage.


Networking is flat for now, so there’s nothing really to show here. As I expand and add a second host, I will be looking at some networking hardware, and will have my lab in its own isolated space.

Now that I am in the professional services space, working with VMware customers, I needed a lab that was more production-like. I’m still building out the lab, so I’ll have more content to come.

Hard cut-over to a new vCenter Appliance

I went through this a couple of years ago, found it in my notes, and thought I would share. We experienced a SAN outage that corrupted the internal database of our vCSA 5.5 appliance.
The symptoms telling us something bad was happening in the vCenter were the following: the thick client wouldn’t always connect, and if it did, you could only stay connected for a maximum of five minutes before getting kicked back to the login screen. The web client was acting very similarly. We opened a support request, and after looking at the logs we could see that there was corruption in one of the tables. Given that we were already going to upgrade this appliance anyway, VMware suggested a hard cut-over: we would back up the DVSwitch config, disconnect the hosts from the old 5.5 vCSA with the virtual machines still running, power down the old vCSA appliance, power on the new 6.0u1 vCSA, and re-attach the hosts to it. Sounds easy enough, right?
The following is a high-level view of the steps required to cut over to a new vCenter. This process assumes that traditional methods of upgrading to a new vCenter version cannot be trusted, and that standing up a new vCenter and reconnecting the hosts to it is the only viable option.
If the vCenter Appliance is in a bad state, it is always recommended to contact VMware GSS first and open an SR to properly determine what is wrong and what the best recovery options are. These steps were recorded on a 5.5 vCSA and a 6.0u1 vCSA. Your mileage may vary.
 
Step-by-step guide
 
-=Process on the old vCenter Appliance=-
  – Log in as the local Administrator
  – Export DVSwitch config (see the PowerCLI sketch after this list)
  – Create a standard switch mimicking distributed switch on first host
  – Migrate one physical host nic (pnic) to the standard switch
  – Update networking on all virtual machines on host over to the standard switch
  – Migrate other host pnics to standard switch
  – Disable HA and DRS for the cluster
  – Disconnect the host from the vCenter
**Rinse wash repeat on remaining hosts until all are disconnected**
  – Shutdown old vCenter Appliance.
-=Process on the new vCenter Appliance=-
  – Startup the new vCenter Appliance and configure it.
  – Log in as local Administrator
  – Setup the data center and host clusters
  – Add all hosts to the new vCenter
  – Import DVSwitch config
  – Add the DVSwitch to the host
  – Migrate one pnic on the host to the DVSwitch
  – Update all VM networking on the host to the DVSwitch
  – Migrate the other pnic to the DVSwitch
**Rinse wash repeat on remaining hosts and VMs until they are on the DVSwitch**
  – Disconnect standard switch from hosts
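
The DVSwitch export and import steps above can also be scripted with PowerCLI. A rough sketch, with the vCenter, switch, backup path and datacenter names all as placeholders:

# On the old vCenter: back up the distributed switch configuration
Connect-VIServer -Server old-vcsa.domain.com
Export-VDSwitch -VDSwitch "dvSwitch01" -Destination "C:\backup\dvSwitch01.zip"

# On the new vCenter: recreate the switch from the backup file
Connect-VIServer -Server new-vcsa.domain.com
New-VDSwitch -BackupPath "C:\backup\dvSwitch01.zip" -Location (Get-Datacenter "Datacenter01")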

High CPU utilization on NSX Appliance 6.2.4

I realize that writing up this blog post now may be irrelevant, considering most if not all VMware customers are well beyond NSX appliance 6.2.4, but some folks may still find the information shared here relevant. At the very least, the instructions for restarting the bluelane-manager service on the NSX appliance are still handy to keep in your Rolodex of commands.

There’s an interesting bug in versions of the NSX appliance ranging from 6.2.4 to 6.2.8, where the CPU utilization slowly climbs, eventually maxing out at 100% after a few hours. In my environment, we had vSphere version 6 and roughly 60 hosts that were also on ESXi 6. We were also using traditional SAN storage over FCoE, in this case a combination of IBM XIV and INFINIDAT arrays. In most cases we could just restart the NSX appliance, which would resolve the CPU utilization issue; however, sometimes the CPU utilization would climb back up to 100% within two hours. When the appliance CPU maxed out, after a few seconds the NSX Manager user interface would typically crash.

The Cause: (copied from KB2145934)

“This issue occurs when the PurgeTask process is not purging the proper amount of job tasks in the NSX database causing job entries to accumulate. When the number of job entries increase, the PurgeTask process attempts to purge these job entries resulting in higher CPU utilization which triggers (GC) Garbage Collection. The GC adds more CPU utilization.”

The only problem with the KB was that our environment was already on 6.2.4, so clearly the problem had not been resolved there.

In order to buy ourselves some time without needing to restart the NSX appliance, we found that simply restarting a service on the NSX appliance called ‘bluelane-manager‘ had the same effect, but this was only a workaround.

You can take the following steps to restart the bluelane-manager service:

 

  • SSH to the NSX Manager using the ‘admin’ account
  • Type: en
  • Type: st en
  • When prompted for the password, type: IAmOnThePhoneWithTechSupport
  • To get the status of the bluelane-manager service, type: /etc/rc.d/init.d/bluelane-manager status
  • To restart the bluelane-manager service, type: /etc/rc.d/init.d/bluelane-manager restart

Now, after a few seconds, you should notice that the NSX appliance user interface has returned to normal functionality, and you can log in and validate that CPU utilization has fallen back to normal.

What made the issue worse was the fact that we had hosts going into the purple diagnostic screen. I’m not talking one or two here; imagine having over 20 ESXi hosts drop at the same time, during production hours, and keep in mind that all of these hosts were running customer workloads. If you’ll excuse the vulgarity, that certainly has a pucker factor exceeding 10. At the time, I was working for a service provider running vCloud Director, and the customers were basically sharing the ESXi host resources. We were also utilizing VMware’s Guest Introspection (GI) service, as we also had Trend Micro deployed, and as a result most customers were sitting in the default security group.

Through extensive troubleshooting with VMware developers, at a high level we determined the following: with all customer VMs in the default NSX security group, every time a customer VM was powered on or off, created or destroyed, vMotioned, or replicated in or out of the environment, the change had to be synced back to the NSX appliance, which then synced with the ESXi hosts. Looking at specific logs on the ESXi hosts that only VMware had access to, we saw a backlog of sync instructions that the hosts would never have time to process, which was contributing to the NSX appliance CPU issue. This was also causing the hosts to eventually purple-screen. A fun fact: by restarting the hosts we could buy ourselves close to two weeks before the issue would reoccur; however, performing many simultaneous vMotions would also drive the NSX appliance to 100% CPU, which would put us into a bad state again.

Thankfully, VMware was already working on a bug-fix release at the time, NSX 6.2.8, and our issue served to spur the development team along in finalizing the release, along with adding a few more fixes for bugs they had originally thought were resolved in the 6.2.4 release.

NSX 6.2.8 release notes

Most relevant to our issues that we faced were the following fixes:

  • Fixed Issue 1849037: NSX Manager API threads get exhausted when communication link with NSX Edge is broken
  • Fixed Issue 1704940: You may encounter the purple diagnostic screen on the ESXi host if the pCPU count exceeds 256
  • Fixed Issue 1760940: NSX Manager High CPU triggered by many simultaneous vMotion tasks
  • Fixed Issue 1813363: Multiple IP addresses on same vNIC causes delays in firewall publish operation
  • Fixed Issue 1798537: DFW controller process on ESXi (vsfwd) may run out of memory

Upgrading to the NSX 6.2.8 release and rethinking our security groups brought stability back to our environment, although, as we later found out, not all of the above issues were completely resolved. In short, most “fixes” were really just process improvements under the hood. Specifically, we could still cause 100% CPU utilization on the NSX appliance by putting too many hosts into maintenance mode consecutively; however, at the very least the CPU utilization was more likely to recover on its own, without us needing to restart the service or appliance. Now why is that important, you might ask? Being a service provider, you want to roll through your hosts quickly and efficiently while doing upgrades, and an inefficiency like this in the NSX code base can drastically extend maintenance windows. Unfortunately for us, VMware released the 6.2.8 maintenance patch after 6.3.x, so the fixes were not part of the 6.3.x release yet. KB2150668

As stated above, the instructions for restarting the bluelane-manager service on the NSX appliance are still very handy to have.

 

 

 

VMware Education Services has updated the naming conventions of VMware’s professional certifications

FYI – VMware is making some major changes to their certification naming conventions. The changes take effect in August 2018 for the newly released certifications listed below, and are not retroactive. This will not affect the naming of existing certifications, however.

  • VMware Certified Professional – Desktop and Mobility 2018 (VCP-DTM 2018)
  • VMware Certified Advanced Professional – Data Center Virtualization Deployment (VCAP-DCV 2018 Deployment)
  • VMware Certified Advanced Professional – Cloud Management and Automation Deployment (VCAP-CMA 2018 Deployment)

Read more on their official blog post here:

https://blogs.vmware.com/education/2018/08/22/we-are-changing-the-way-we-name-vmware-certifications-the-year-makes-it-clear/

NSX SSL Certificate Failure on ESXi: SSL handshake failed

Some time ago I was having an issue putting a host back into service in an NSX environment. In Log Insight, and in /var/log/netcpa.log, I was seeing errors similar to the following:

2018-05-26T11:07:50.486Z [FFD53B70 error 'Default'] SSL handshake failed on 172.15.4.100:0 : error = SSL Exception: error:140000DB:SSL routines:SSL routines:short read
2018-05-26T11:07:55.545Z [FFD12B70 error 'Default'] SSL handshake failed on 172.15.4.100:0 : error = SSL Exception: error:140000DB:SSL routines:SSL routines:short read
2018-05-26T11:08:00.600Z [FFD12B70 error 'Default'] SSL handshake failed on 172.15.4.100:0 : error = SSL Exception: error:140000DB:SSL routines:SSL routines:short read

Browsing through VMware’s archive, I came across KB2151089, which was very similar to the issue I was having; however, upgrading to NSX 6.3.5 was not an option at the time. I remembered a similar issue at my previous workplace, and dug through my Evernote archive to find my notes.

Before we continue, this should go without saying, but your mileage may vary, and I’d recommend opening a ticket with VMware’s GSS. At the very least, you should test this process out in a lab.

The steps outlined here resolved the issue. Keep in mind that at this point the host is not in production, and is currently in maintenance mode:

  • Determine if the NSX controllers are connected by logging into the ESXi host, and running the following commands:
# esxcli network ip connection list |grep 1234

— and —

# esxcli network ip connection list |grep 5671

 

  • Next, log into the NSX appliance and back up the config. While the config backup is taking place, get the ESXi host MOB ID from the vCenter MOB page https://<vcsa-fqdn>/mob
  1. select the link for the ‘root folder‘, eg. group-d1
  2. select the link for the ‘child entity‘ eg. datacenter-2
  3. select the link for the ‘host folder‘ eg. group-h4
  4. select the link for the ‘child entity‘ eg. domain-c7
  5. Now locate the ‘host‘ and find the host-xxxx value. eg: host-1234 
  • After the NSX backup is complete, SSH into the NSX Manager. Root access to the appliance will be needed, so at the command prompt:
  1. Enter ‘en‘ and the enter the ‘admin’ password
  2. Enter ‘st en‘ and enter the following password: IAmOnThePhoneWithTechSupport 
  • Log into the sql prompt
# psql -U secureall
secureall=#
  • Issue the following command to verify that there is a record associated with the host MOB ID. Below is an example using host-1234:
# select host_uuid,node_uuid,thumbprint from vnvp_host_key where host_uuid='host-1234';

Example output:

host_uuid  |              node_uuid               | thumbprint                      
-----------+--------------------------------------+------------------
host-1234  | a2a68660-515e-4f87-811d-306c54b0b2e8 | AD:58:C0:84:FF:DF:5E:95:50:B7:63:2E:3F:B2:67:22:56:F7:DC:9B

(1 row)

  • Next, in vCenter, move the host to an isolation cluster. We will need to validate the installed NSX VIBs by running the following command on the host:
# esxcli software vib list |grep -E 'esx-dvfilter-switch-security|vsip|vxlan'

 

Example output:

esx-dvfilter-switch-security   6.3.1-0.0.5124716  VMware  VMwareCertified 2017-02-28
esx-vsip                       6.3.1-0.0.5124716  VMware                VMwareCertified 2017-02-28

esx-vxlan                      6.3.1-0.0.5124716  VMware VMwareCertified 2017-02-28

 

  • Remove the NSX vibs with the following commands:
# esxcli software vib remove -n esx-vxlan
# esxcli software vib remove -n esx-vsip
# esxcli software vib remove -n esx-dvfilter-switch-security

 

  • Returning to the NSX terminal window, delete the record at the secureall=# prompt, again using ‘host-1234’ as an example:
# delete from vnvp_host_key where host_uuid='host-1234';
DELETE 1

 

  • Reboot the ESXi host. Once the host has rebooted, put it back into the proper cluster. To be safe, I would temporarily turn down DRS (move the slider to the left) and exit maintenance mode.
  • We can validate that the host looks correct in the vSphere web UI: ‘Networking & Security -> Installation -> Host Preparation‘ tab.
  • Click the ‘Resolve‘ link next to the cluster name.

Validation

  • Once the tasks have all completed, you can run the ‘esxcli software vib list….‘ command again to confirm that the three VIBs have been installed.
  • Test that the vxlan network is functioning on the host.
  • Verify that the SSL Exception is no longer showing in the /var/log/netcpa.log.
  • If there are no errors, then the host is all set to be put back into service.