vRealize Operations Manager Dashboard: vSphere DRS Cluster Health. Part 2

This blog series assumes that the reader has some understanding of how to create a vRealize Operations Manager (vROps) dashboard.

vROps dashboards are made up of widgets. A widget can either be configured as a “self provider”, or be populated with data by a “parent” widget. Self provider widgets are configured to show specific data individually. In other words, one widget shows hosts, another shows datastores, and another shows virtual machines, but the widgets do not interact, nor are they dependent on each other. Parent widgets are configured to provide data from a specific source and feed it into child widgets on the page. This is useful when you want the same data displayed in different formats. The dashboard I configured, called “vSphere DRS Cluster Health”, does just that. I will break the widgets down into sections as I walk through the configuration.

Widget #1 – This widget is known as an “object list”, and will be the parent widget of this dashboard. In other words, widgets #2 through #6 rely on the data presented by widget #1. In this case I have the object list widget configured to list the different host clusters in my homelab.

I have given it a title, set refresh content to ON, set the mode to PARENT, and have it set to auto select the first row. In the lower left section “Select which tags to filter”, I have created an environment group in vROps called “Cluster Compute Resource” where I have specified my host clusters. In the lower right box, I have a few metrics selected which I would also like this “object list” widget to show.

This is just a single-host ESXi homelab, so this won’t look as grand as it would in a production environment. But each object in this list is selectable, and the cool thing is that selecting an object changes what the other widgets show.

Widgets #2 and #3 are called “health charts”. Both are configured the same way, except that one uses the custom metric Cluster CPU Workload %, and the other uses the custom metric Cluster Memory Workload %. I have both configured to show data for the past 24hrs.

Important: For these two widgets, under “widget interactions”, set both to the first widget: “DRS Cluster Settings (Select a cluster to populate charts)”.

Widgets #4 and #5 are called “View widgets”. One is configured to show the current Cluster CPU Demand, and the other is configured to show the current Cluster Memory Demand. These are also configured to forecast out for 30 days, so that we can see whether the clusters are likely to run short of capacity in the near future, and add more compute to the cluster preemptively.
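
To give a feel for what a 30 day forecast is doing, here is a minimal sketch that fits a straight line to daily demand samples and projects it forward. vROps uses its own capacity analytics for the view widgets, so this is not how vROps calculates the forecast. The function names, the sample values, and the simple least-squares trend are all assumptions made for illustration.

```python
def linear_forecast(history, days_ahead):
    """Fit y = intercept + slope * day by least squares and extrapolate.

    history: one demand sample per day (for example, % of cluster capacity).
    Returns the projected values for the next `days_ahead` days.
    """
    n = len(history)
    if n < 2:
        raise ValueError("need at least two daily samples")
    days = range(n)
    mean_x = sum(days) / n
    mean_y = sum(history) / n
    sxx = sum((x - mean_x) ** 2 for x in days)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, history))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # The last observed day is n - 1, so the first forecast day is n.
    return [intercept + slope * (n - 1 + d) for d in range(1, days_ahead + 1)]


def days_until_exceeds(history, capacity, days_ahead=30):
    """Return the first forecast day that exceeds `capacity`, or None."""
    forecast = linear_forecast(history, days_ahead)
    for day, value in enumerate(forecast, start=1):
        if value > capacity:
            return day
    return None


if __name__ == "__main__":
    # Hypothetical samples: cluster CPU demand growing two points per day.
    cpu_demand = [50, 52, 54, 56, 58]
    print(days_until_exceeds(cpu_demand, capacity=80))
```

If the function returns a day number inside the 30 day window, that is the kind of early warning the forecast views on the dashboard are there to give.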

These are two custom “views” I created. I will go over how to create custom views in a future post, but for those who already know how, I have one “widget view” configured with each.

Important: For these two widgets, under “widget interactions”, set both to the first widget: “DRS Cluster Settings (Select a cluster to populate charts)”, like we did above.

Widget #6 is another “Object List” widget, and I have this configured to show only host systems, of the selected cluster in Widget #1. Widget #6 will be used to provide data to Widgets #7 and #8.

I also have certain Host System metrics selected here so that I can get high level information of the hosts in the cluster.

Important: For this widget, under “widget interactions”, set it to the first widget: “DRS Cluster Settings (Select a cluster to populate charts)”, like we did above.

The final two widgets, #7 and #8, are also called “health chart” widgets. One is configured using the metric host system CPU workload %, and the other is configured using the metric host system Memory workload %. I have both configured to show data for the past 24hrs.

Important: For these two widgets, under “widget interactions“, set both to widget #6, in this example: Host Workload (select a host to populate charts to the right).
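
As a rough illustration of the interactions described in this post, the wiring of the dashboard can be written down as a small parent-to-child map. This is only a sketch in Python. The widget numbers come from this post, but the dictionary and the `downstream_widgets` helper are hypothetical names of mine, not anything vROps exposes.

```python
# Hypothetical model of the dashboard's widget interactions.
# Keys are provider widgets, values are the widgets they feed.
INTERACTIONS = {
    1: [2, 3, 4, 5, 6],  # cluster object list feeds the charts, views, and host list
    6: [7, 8],           # host object list feeds the host health charts
}


def downstream_widgets(widget, interactions=INTERACTIONS):
    """Return every widget fed, directly or indirectly, by `widget`."""
    found = []
    queue = list(interactions.get(widget, []))
    while queue:
        current = queue.pop(0)
        if current in found:
            continue
        found.append(current)
        # Follow the chain, e.g. widget 1 -> widget 6 -> widgets 7 and 8.
        queue.extend(interactions.get(current, []))
    return sorted(found)


if __name__ == "__main__":
    for provider in (1, 6, 2):
        print(provider, downstream_widgets(provider))
```

Writing the map out like this is a quick way to check that every child widget has exactly the provider you intended before you start clicking through the widget interactions screen.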

vRealize Operations Manager Dashboard: vSphere DRS Cluster Health. Part 1

A few weeks ago, a customer asked me about creating a custom vROps dashboard for them, so that they could monitor the health of their clusters. For those of you who were unaware, VMware has packaged vROps with a widget called “DRS Cluster Settings” that does something similar, and looks like this:

The idea behind this widget is that it lists all clusters attached to the vCenter, giving you high level information such as the DRS setting, and the memory and CPU workload of the cluster. With a cluster selected, the lower window shows all of the ESXi hosts that are a part of that cluster, with their CPU and memory workloads as well. If you are interested in this widget, it can be added when creating a new custom dashboard, and you will find it at the bottom of the available widget list.

While this widget gave me some high level detail, it wasn’t exactly what I wanted, so I decided to create my own to give a deeper level of detail. I used the widget above as a template, and went from there.

This dashboard gives me the current memory and CPU workloads for each cluster in the upper left box, and once a cluster is selected, it populates the right and two middle boxes with data. The top right boxes give me the memory and CPU workload for the past 24hrs, and the two middle boxes give me the CPU demand and memory demand forecasts for the next 30 days.

Much like the widget mentioned above, selecting a cluster in the upper left side populates a box in the lower left side with all hosts attached to that cluster. Once a host is selected, the lower right box shows the memory and CPU workload for the past 24hrs for the selected host. This dashboard is slightly larger than a page will allow, so unfortunately users would need to scroll down to see all of the data, but I believe it gives an outstanding bird’s-eye view of the clusters’ DRS capabilities.

In my next blog post, I’ll break down what’s involved in creating this dashboard.

The Home Lab Part 2

This is the very long overdue follow-up post to my The Home Lab entry made earlier this year.  I recently purchased another 64GB (2x 32GB) of Black Diamond DDR4 memory to bring my server up to 128GB.  I also installed some old 1TB spinning disks in the box for some extra storage, although I will phase them out with more SSDs in the future.  So as a recap, this is my setup now:

Motherboard

SUPERMICRO MBD-X10SDV-TLN4F-O Mini ITX Server Motherboard Xeon processor D-1541 FCBGA 1667 (Newegg)

Memory

(x2) Black Diamond Memory 64GB (2 x 32GB) 288-Pin DDR4 SDRAM ECC Registered DDR4 2133 (PC4 17000) Server Memory Model BD32GX22133MQR26 (Newegg)

M.2 SSD

WD Blue M.2 250GB Internal SSD Solid State Drive – SATA 6Gb/s – WDS250G1B0B (Newegg)

SSD

(x2) SAMSUNG 850 PRO 2.5″ 512GB SATA III 3D NAND Internal Solid State Drive (SSD) MZ-7KE512BW (Newegg)

Case

SUPERMICRO CSE-721TQ-250B Black Mini-Tower Server Case 250W Flex ATX Multi-output Bronze Power Supply (Newegg)

Additional Storage

(x2) 1TB Western Digital Black spinning disks

Initially when I built the lab, I decided to use VMware Workstation, but I recently rebuilt it, installing ESXi 6.7 as the base, largely for better performance and reliability.  For the time being this will be a single host environment, but keeping with the versioning, vCSA and vROps are 6.7 as well.  Can an HTML5 interface be sexy?  This has come a long way from the Flash client days.


I decided against fully configuring this host as a single vSAN node, just so that I can have the extra disk.  However, when I do decide to purchase more hardware and build a second or third box, this setup will allow me to grow my environment, and reconfigure it for vSAN use.  Although I am tempted to ingest the SSDs into my NAS, carve out datastores from it and not use vSAN, at least for the base storage.


Networking is flat for now, so there’s nothing really to show here.  As I expand and add a second host, I will be looking at some networking hardware, and will put my lab in its own isolated space.

Now that I am in the professional services space, working with VMware customers, I needed a lab that was more production-like. I’m still building out the lab, so I’ll have more content to come.

The Journey Continues

I’d be lying if I said this year hasn’t been full of unexpected twists and turns, but it’s in those moments of great difficulty and uncertainty, I believe, that we truly find ourselves.  Seven months ago I was referred to a VMware Product Engineer role at a cloud provider and hosting company in San Antonio.  I successfully made it through the interviews, and was offered a position with the company.  For this role, the company and I had agreed for me to be onsite for six months, and then be a full time remote employee after.  From May until late October, I spent my time working in and exploring San Antonio, Texas.

Roles and expectations can change, and having it in writing doesn’t always give you solid ground to stand on.  But I pushed forth on my new journey, excited for the challenges ahead, knowing that I am checking off each requirement for the role, as I work through various projects.  I got to deploy a new SDDC environment, for a customer’s new private cloud, using vCloud Foundation for Service Providers, worked various research tasks, and even studied for and passed my VCP 6.5 – DCV delta.  Not necessarily in that order.

Reaching that six month mark, and feeling proud of the work that I accomplished, I received the regrettable news that I wouldn’t be able to go remote as originally agreed to.  With family and relationship requirements outside of work playing a factor, along with my own personal restrictions and requirements for this role, I had to make the hard decision to walk away.

I couldn’t have asked for a better team in San Antonio, many of whom I was able to get to know outside of work, and who invited me into their homes for after work gatherings, and team lunches around San Antonio.  If you look hard enough in San Antonio, you can find really good barbecue, authentic Mexican, Vietnamese, Taiwanese, Greek and Italian.  The freshman twenty is a real thing, but I’m grateful these guys shared their favorite spots around the city with me.  I didn’t get a chance to really get to know my remote team members out of the UK, but enjoyed the time spent on projects with them.

So what’s next for me?  This is just another fork in the road, leading me down a path of new challenges.  I’ll be taking on new projects working with VMware Professional Services (PSO), through a 3rd party agency.  This role will allow me to live where I want in Colorado, and also allow me to work remotely and travel.  Working for VMware has been a goal of mine for several years, and I’m hopeful that this will eventually turn into a full time role with them.

With all of that out of the way, I thought I would leave you with some pictures I took from the places I visited while in San Antonio.

San Antonio River Walk

The Alamo

I certainly wouldn’t consider myself religious, but around San Antonio you can find a lot of historic missions, many of which are still considered to be active places of worship.  I personally find the old architecture and buildings fascinating.

Mission Concepcion

Mission San Jose

Mission San Juan

Mission Espada

Hard cut-over to a new vCenter Appliance

I went through this a couple of years ago, found it in my notes, and thought I would share.  We experienced a SAN outage that corrupted the vCSA 5.5 appliance internal database.
The symptoms that something bad was happening in the vCenter were the following:  The thick client wouldn’t always connect, and if it did you could only stay connected for a maximum of 5 minutes before getting kicked back to the login screen.  The web client was acting very similarly.  We opened a support request, and after looking at the logs we could see that there was corruption in one of the tables.  Given that we were already going to upgrade this appliance anyway, VMware suggested a hard cut-over, where we would back up the DVSwitch config, disconnect the hosts from the old 5.5 vCSA with the virtual machines still running, power down the old vCSA appliance, power on the new 6.0u1 vCSA, and re-attach the hosts to it.  Sounds easy enough, right?
The following is a high level view of the steps required to cut over to a new vCenter.  This process assumes that traditional methods of upgrading to a new vCenter version cannot be trusted, and that standing up a new vCenter, and reconnecting the hosts to it, is the only viable option. 
If the vCenter Appliance is in a bad state, it is always recommended to contact VMware GSS first and open an SR, to properly determine what is wrong, and what the best recovery options are.  These steps were recorded on a 5.5 vCSA and 6.0u1 vCSA.  Your mileage may vary.
 
Step-by-step guide
 
-=Process on the old vCenter Appliance=-
  – Log in as the local Administrator
  – Export DVSwitch config
  – Create a standard switch mimicking the distributed switch on the first host
  – Migrate one physical host nic (pnic) to the standard switch
  – Update networking on all virtual machines on the host over to the standard switch
  – Migrate the other host pnics to the standard switch
  – Disable HA and DRS for the cluster
  – Disconnect the host from the vCenter
**Rinse, wash, repeat on remaining hosts until all are disconnected**
  – Shut down the old vCenter Appliance
-=Process on the new vCenter Appliance=-
  – Start up the new vCenter Appliance and configure it
  – Log in as the local Administrator
  – Set up the data center and host clusters
  – Add all hosts to the new vCenter
  – Import DVSwitch config
  – Add DVSwitch to hosts
  – Migrate one pnic on the host to DVSwitch
  – Update all VMs’ networking to DVSwitch
  – Migrate the other pnic to DVSwitch
**Rinse, wash, repeat on remaining hosts and VMs until they are on the DVSwitch**
  – Disconnect the standard switch from hosts
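
With more than a couple of hosts, it helps to expand the steps above into a per-host checklist so nothing gets skipped mid cut-over. The following is a sketch only: the host names and the `build_runbook` helper are hypothetical, and I am assuming HA and DRS only need to be disabled once per cluster. It just builds a list of strings and does not talk to vCenter.

```python
# Per-host network steps on the old vCenter (order matters).
OLD_HOST_STEPS = [
    "Create a standard switch mimicking the distributed switch",
    "Migrate one pnic to the standard switch",
    "Update VM networking over to the standard switch",
    "Migrate the other pnics to the standard switch",
]

# Per-host network steps on the new vCenter (order matters).
NEW_HOST_STEPS = [
    "Add the DVSwitch to the host",
    "Migrate one pnic to the DVSwitch",
    "Update VM networking over to the DVSwitch",
    "Migrate the other pnic to the DVSwitch",
]


def build_runbook(hosts):
    """Expand the high level cut-over steps into an ordered checklist."""
    steps = [
        "Old vCenter: log in as the local Administrator",
        "Old vCenter: export the DVSwitch config",
    ]
    for index, host in enumerate(hosts):
        steps += [f"{host}: {step}" for step in OLD_HOST_STEPS]
        if index == 0:
            # Assumption: HA and DRS are cluster settings, so disable them once.
            steps.append("Old vCenter: disable HA and DRS for the cluster")
        steps.append(f"{host}: Disconnect from the old vCenter")
    steps.append("Old vCenter: shut down the appliance")
    steps += [
        "New vCenter: start up and configure the appliance",
        "New vCenter: log in as the local Administrator",
        "New vCenter: set up the data center and host clusters",
        "New vCenter: add all hosts",
        "New vCenter: import the DVSwitch config",
    ]
    for host in hosts:
        steps += [f"{host}: {step}" for step in NEW_HOST_STEPS]
    steps.append("All hosts: disconnect the standard switch")
    return steps


if __name__ == "__main__":
    # Hypothetical host names, used only to show the output shape.
    for number, step in enumerate(build_runbook(["esx01", "esx02"]), start=1):
        print(f"{number:2d}. {step}")
```

Printing the list and ticking items off as you go is low tech, but it keeps the old-side and new-side work clearly separated.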

High CPU utilization on NSX Appliance 6.2.4

I realize that writing up this blog post now may be irrelevant, considering most if not all VMware customers are well beyond NSX appliance 6.2.4.  But some folks may still find the information shared here relevant.  At the very least, the instructions for restarting the bluelane-manager service on the NSX appliance are still something handy to keep in your Rolodex of commands.

There’s an interesting bug in NSX appliance versions 6.2.4 through 6.2.8, where the CPU utilization slowly climbs, eventually maxing out at 100% after a few hours.  For my environment, we had vSphere version 6, and roughly 60 hosts that were also on ESXi 6.  We were also using traditional SAN storage on FCoE, in this case a combination of IBM XIV and INFINIDATs.  In most cases, we could just restart the NSX appliance, which would resolve the CPU utilization issue, however sometimes within two hours, the CPU utilization would climb back up to 100% again. When the appliance CPU maxed out, after a few seconds the NSX manager user interface would typically crash.

The Cause: (copied from KB2145934)

“This issue occurs when the PurgeTask process is not purging the proper amount of job tasks in the NSX database causing job entries to accumulate. When the number of job entries increase, the PurgeTask process attempts to purge these job entries resulting in higher CPU utilization which triggers (GC) Garbage Collection. The GC adds more CPU utilization.”

The only problem with the KB is that it treats the issue as resolved in 6.2.4, and our environment was already on 6.2.4, so clearly the problem was not resolved.

In order to buy ourselves some time, without needing to restart the NSX appliance, we found that simply restarting a service on the NSX appliance called ‘bluelane-manager’ had the same effect, but this was only a workaround.

You can take the following steps to restart the bluelane-manager service:

 

  • SSH to the NSX Manager using the ‘admin’ account
  • Type:
en
  • Type:
st en
  • When prompted for the password, type:
IAmOnThePhoneWithTechSupport
  • To get the status of the bluelane-manager service, type:
/etc/rc.d/init.d/bluelane-manager status
  • To restart the bluelane-manager service, type:
/etc/rc.d/init.d/bluelane-manager restart

Now after a few seconds, you should notice that the NSX appliance user interface has been restored to normal functionality, and you can log in and validate that the CPU has fallen to normal usage.

What made the issue worse was the fact that we had hosts going into the purple diagnostic screen.  I’m not talking one or two here.  Imagine having over 20 ESXi hosts drop at the same time, during production hours, and keep in mind that all of these hosts were running customer workloads... If you’ll excuse the vulgarity, that certainly has a pucker factor exceeding 10.  At the time, I was working for a service provider running vCloud Director.  The customers were basically sharing the ESXi host resources.  We were also utilizing VMware’s Guest Introspection (GI) service, as we also had Trend Micro deployed, and as a result most customers were sitting in the default security group.

Through extensive troubleshooting with VMware developers, at a high level we determined the following:  With all customer VMs in the default NSX security group, every time a customer VM was powered on or off, created or destroyed, vMotioned, or replicated in or out of the environment, the change had to be synced back to the NSX appliance, which then synced with the ESXi hosts.  Looking at specific logs on the ESXi hosts that only VMware had access to, we saw a backlog of sync instructions that the hosts would never have time to process, which was contributing to the NSX appliance CPU issue.  This was also causing the hosts to eventually purple screen.  Fun fact: by restarting the hosts we could buy ourselves close to two weeks before the issue would reoccur. However, performing many simultaneous vMotions would also cause 100% CPU on the NSX appliance, which would put us into a bad state again.
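
To picture why a backlog like that never clears, here is a toy queue model. It is purely illustrative and says nothing about how NSX or the ESXi hosts actually process sync instructions: the function name and the numbers are made up, and the only point is that when more work arrives per interval than can be processed, the backlog grows without bound.

```python
def backlog_over_time(arrivals, capacity_per_tick):
    """Track the number of unprocessed items after each interval.

    arrivals: number of new items arriving in each interval.
    capacity_per_tick: how many items can be processed per interval.
    """
    backlog = 0
    history = []
    for arriving in arrivals:
        # Whatever cannot be processed this interval carries over.
        backlog = max(0, backlog + arriving - capacity_per_tick)
        history.append(backlog)
    return history


if __name__ == "__main__":
    # Sustained overload: the backlog keeps climbing.
    print(backlog_over_time([5, 5, 5, 5], capacity_per_tick=3))
    # A single burst: the backlog drains once arrivals stop.
    print(backlog_over_time([10, 0, 0, 0], capacity_per_tick=3))
```

The second case is roughly what a restart bought us: the queue starts from zero again, but if the arrival rate stays above what can be processed, it is only a matter of time before it fills back up.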

Thankfully, VMware was working on a bug fix release at the time, NSX 6.2.8, and our issue served to spur the development team along in finalizing the release, along with adding a few more fixes for bugs they had originally thought were resolved in the 6.2.4 release.

NSX 6.2.8 release notes

The fixes most relevant to the issues we faced were the following:

  • Fixed Issue 1849037: NSX Manager API threads get exhausted when communication link with NSX Edge is broken
  • Fixed Issue 1704940: You may encounter the purple diagnostic screen on the ESXi host if the pCPU count exceeds 256
  • Fixed Issue 1760940: NSX Manager High CPU triggered by many simultaneous vMotion tasks
  • Fixed Issue 1813363: Multiple IP addresses on same vNIC causes delays in firewall publish operation
  • Fixed Issue 1798537: DFW controller process on ESXi (vsfwd) may run out of memory

Upgrading to the NSX 6.2.8 release, and rethinking our security groups, brought stability back to our environment, although not all of the above issues were completely resolved, as we later found out.  In short, most “fixes” were really just process improvements under the hood.  Specifically, we could still cause 100% CPU utilization on the NSX appliance by putting too many hosts into maintenance mode consecutively, however at the very least the CPU utilization was more likely to recover on its own, without us needing to restart the service or appliance. Now why is that important, you might ask?  Being a service provider, you want to quickly and efficiently roll through your hosts while doing upgrades, and having something like this inefficiency in the NSX code base can drastically extend maintenance windows.  Unfortunately for us at the time, VMware came out with the 6.2.8 maintenance patch after 6.3.x, so the fixes were not a part of the 6.3.x release yet.  KB2150668

As stated above, the instructions for restarting the bluelane-manager service on the NSX appliance are still very handy to have.

 

 

 

VMworld-US 2018 Thoughts, A Week Later

A week ago, VMworld-US for 2018 wrapped up, and I have been slowly collecting my thoughts throughout the week to try and put out a meaningful blog post about my experience this year. While I was there I passed the VCP6.5-DCV delta exam. I’ve heard other people say that the delta exams are tougher, and they certainly were not wrong. Now I will be able to focus on getting the latest 6.5 VCAPs for data center virtualization. I was also able to meet several other vExperts in the community, along with seeing old friends from years past. I was able to go out and celebrate with my new team at the end of the week, which is always nice.

VMworld-US this year was a bit of a mixed bag, and it felt like it had lost its swagger. The half baked point system the events team dreamed up was just that. It certainly wasn’t thought out enough, as only a few vendors participated. I guess you don’t know until you try, but hopefully it won’t make a return next year. If you did participate however, the swag that the VMworld team handed out was well worth it. The socks, laptop sleeve, and insulated bottle were all top notch. I was excited to get the VR headset, but disappointed that it doesn’t fit larger phones like the Nexus 6P. The solutions exchange was much more reserved compared to previous years. I am happy to see the continued support for authors in the vCommunity, through book signings and meet and greets with the authors.

VMware did have some good announcements this year, however: ESXi 64-bit support on Arm processors, Amazon Relational Database Service on VMware, and vSphere Platinum, to name a few. The keynotes themselves were great, but I much preferred last year’s opening act, when VMware entertained the crowd with virtual reality. It felt more edgy and futuristic. This year it was rather slow, and they just seemed to jump right into the keynote.

The guest speaker on Wednesday’s keynote was Malala Yousafzai, who was there to speak about her own struggles in her home country Pakistan, and the attempt on her life because she shares the beliefs of the modern world, where women have equal opportunities, both in career and education. But with so many other women actually in the tech industry, and the push for getting younger girls interested in technology, was she really the best choice? I’m not discrediting the hardships she went through in her own country, but what was the point in bringing all of that to a technology conference? To me it felt a little weird having her interviewed by someone whose home country of India, to this day, still allows the practice of marriage arrangements. Maybe I was the only one who cared to look at the finer details of the exchange. The additional security and screening to have Malala there caused too much congestion for attendees to get in. Most ended up skipping out on going to the main stage for the keynote. The event itself felt like the popularity of the speaker outweighed the value to attendees.

I’m still on the fence about the VMworld fest this year, and I’m certainly not alone according to this Reddit thread. Royal Machines put on a good show for the most part, but with retirees taking the stage, it wasn’t the best show unless you just wanted to relive the 90’s.

The Blink-182 and Fall Out Boy shows of previous years were, I felt, far better, as were their venues and locations. On the other side of the screen in the picture below was the main stage, which most people couldn’t get to, so this was their only viewing option. It looks packed, and certainly was at the beginning, but when I left halfway through the set, this side of the screen was almost cleared out. Maybe VMworld should have taken the picture at the end?

Screen Shot 2018-09-09 at 12.34.48 PM

VMworld events team still failed at organizing enough food for attendees again this year, and lines were ridiculously long for the food and beer that was available. I’ve heard unconfirmed reports of attendees leaving hungry again. The location itself was awkward, and didn’t provide adequate room for the main show. The decor was cool, but that’s about it.

VMworld finally provided hot breakfast to the US crowd this year for attendees, which was much appreciated. I personally love bacon and eggs for breakfast, just not every single day. Would it have really cost that much more to provide some hot oatmeal, waffles or pancakes one or two days out of the week to break it up a bit?

There still seems to be a big push for moving to the cloud, and that certainly was the message that was being echoed at VMworld. Being a vExpert, and a member of the vCommunity, I was able to talk to many at the conference, and I was hearing a different message. The cloud is too expensive, and organizations have begun migrating away from the cloud, to have control of their infrastructure, and to keep costs down. Having worked for two very different cloud providers in my IT career, my current employer, who gives customers their own private clouds, and my former employer, who uses the shared cloud approach, I have an interesting perspective on cloud provider technologies and architecture. The market has many different cloud offerings for customers to choose from, but customers have yet to fully understand what they need to be successful in the cloud, given their size and expectations. But who will lead them down the correct path? Perhaps we are on the cusp of another industry shift.

Heading off to VMworld US 2018

As I’m readying myself to board the plane that will take me to Las Vegas, I can’t help but feel excited to be heading back to VMworld US. Las Vegas is not everyone’s cup of tea, and maybe it’s just the geek in me, but Vegas feels like it has a certain electricity to it, that energizes me. VMworld has always been special for me, and I have been fortunate to attend these past three years. I’m always excited to meet new additions to the #vCommunity, along with seeing friends I’ve made in the community already. This is certainly a great bunch of geeks that continue to inspire me. I can’t wait to see what VMware has in store for attendees, with product releases and announcements.

VMware Education Services has updated the naming conventions of VMware’s professional certifications

FYI – VMware is making some major changes to their certification naming conventions. Changes take effect August 2018 for the newly released certifications listed below, and are not retroactive.  This will not affect the naming of existing certifications.

  • VMware Certified Professional – Desktop and Mobility 2018 (VCP-DTM 2018)
  • VMware Certified Advanced Professional – Data Center Virtualization Deployment (VCAP-DCV 2018 Deployment)
  • VMware Certified Advanced Professional – Cloud Management and Automation Deployment (VCAP-CMA 2018 Deployment)

Read more on their official blog post here:

https://blogs.vmware.com/education/2018/08/22/we-are-changing-the-way-we-name-vmware-certifications-the-year-makes-it-clear/