High CPU utilization on NSX Appliance 6.2.4

I realize that this blog post may be irrelevant now, considering most if not all VMware customers are well beyond NSX appliance 6.2.4.  But some folks may still find the information shared here useful.  At the very least, the instructions for restarting the bluelane-manager service on the NSX appliance are still handy to keep in your Rolodex of commands.

There’s an interesting bug in NSX appliance versions 6.2.4 through 6.2.8, where CPU utilization slowly climbs, eventually maxing out at 100% after a few hours.  My environment was running vSphere 6, with roughly 60 hosts that were also on ESXi 6.  We were also using traditional SAN storage on FCoE, in this case a combination of IBM XIV and INFINIDAT arrays.  In most cases, we could just restart the NSX appliance, which would resolve the CPU utilization issue.  Sometimes, however, the CPU utilization would climb back up to 100% within two hours.  When the appliance CPU maxed out, the NSX Manager user interface would typically crash a few seconds later.

The Cause: (copied from KB2145934)

“This issue occurs when the PurgeTask process is not purging the proper amount of job tasks in the NSX database causing job entries to accumulate. When the number of job entries increase, the PurgeTask process attempts to purge these job entries resulting in higher CPU utilization which triggers (GC) Garbage Collection. The GC adds more CPU utilization.”

The only problem with the KB is that our environment was already on 6.2.4, so clearly the problem was not resolved.

To buy ourselves some time without needing to restart the NSX appliance, we found that simply restarting a service on the NSX appliance called ‘bluelane-manager‘ had the same effect, but this was only a workaround.

You can take the following steps to restart the bluelane-manager service:

 

  • SSH to the NSX Manager using the ‘admin’ account
  • Type:
en
  • Type:
st en
  • When prompted for the password, type:
IAmOnThePhoneWithTechSupport
  • To get the status of the bluelane manager service type:
/etc/rc.d/init.d/bluelane-manager status
  • To restart the bluelane-manager service, type:
/etc/rc.d/init.d/bluelane-manager restart
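As a rough sketch, the status-or-restart decision above could be wrapped in a small helper.  The init script path comes from the steps above; the function names and the 95% default threshold are my own assumptions, and the helper only prints the command it would suggest rather than running anything.

```shell
#!/bin/sh
# Hypothetical helper, not a VMware tool. The path is taken from the steps above.
BLUELANE_INIT="/etc/rc.d/init.d/bluelane-manager"

# should_restart CPU_PERCENT [THRESHOLD]
# Succeeds when the CPU percentage (decimals are truncated) is at or above
# the threshold. The default of 95 is an assumed value, not from VMware.
should_restart() {
    cpu="${1%.*}"
    threshold="${2:-95}"
    [ "$cpu" -ge "$threshold" ]
}

# plan_action CPU_PERCENT [THRESHOLD]
# Prints the command to run by hand; it never restarts anything itself.
plan_action() {
    if should_restart "$1" "${2:-95}"; then
        echo "$BLUELANE_INIT restart"
    else
        echo "$BLUELANE_INIT status"
    fi
}
```

For example, plan_action 100 prints the restart command, while plan_action 40 prints the status command.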

After a few seconds, you should notice that the NSX appliance user interface has returned to normal functionality, and you can log in and validate that the CPU has fallen back to normal usage.

What made the issue worse was the fact that we had hosts going into the purple diagnostic screen.  I’m not talking one or two here.  Imagine having over 20 ESXi hosts drop at the same time, during production hours, and keep in mind that all of these hosts were running customer workloads... If you’ll excuse the vulgarity, that certainly has a pucker factor exceeding 10.  At the time, I was working for a service provider running vCloud Director.  The customers were basically sharing the ESXi host resources.  We were also utilizing VMware’s Guest Introspection (GI) service, as we had Trend Micro deployed, and as a result most customers were sitting in the default security group.

Through extensive troubleshooting with VMware developers, at a high level we determined the following:  with all customer VMs in the default NSX security group, every time a customer VM was powered on or off, created or destroyed, vMotioned, or replicated in or out of the environment, the change had to be synced back to the NSX appliance, which then synced with the ESXi hosts.  Looking at specific logs on the ESXi hosts that only VMware had access to, we saw a backlog of sync instructions that the hosts would never have time to process, which was contributing to the NSX appliance CPU issue.  This was also causing the hosts to eventually purple screen.  Fun fact: by restarting the hosts we could buy ourselves close to two weeks before the issue would recur.  However, performing many simultaneous vMotions would also cause 100% CPU on the NSX appliance, which would put us into a bad state again.

Thankfully, VMware was working on a bug fix release at the time, NSX 6.2.8, and our issue served to spur the development team along in finalizing the release, along with adding a few more fixes for bugs they had originally thought were resolved in the 6.2.4 release.

NSX 6.2.8 release notes

The fixes most relevant to the issues we faced were the following:

  • Fixed Issue 1849037: NSX Manager API threads get exhausted when communication link with NSX Edge is broken
  • Fixed Issue 1704940: You may encounter the purple diagnostic screen on the ESXi host if the pCPU count exceeds 256
  • Fixed Issue 1760940: NSX Manager High CPU triggered by many simultaneous vMotion tasks
  • Fixed Issue 1813363: Multiple IP addresses on same vNIC causes delays in firewall publish operation
  • Fixed Issue 1798537: DFW controller process on ESXi (vsfwd) may run out of memory

Upgrading to the NSX 6.2.8 release, and rethinking our security groups, brought stability back to our environment, although not all of the above issues were completely resolved, as we later found out.  In short, most “fixes” were really just process improvements under the hood.  Specifically, we could still cause 100% CPU utilization on the NSX appliance by putting too many hosts into maintenance mode consecutively.  At the very least, however, the CPU utilization was more likely to recover on its own, without us needing to restart the service or appliance.  Now why is that important, you might ask?  Being a service provider, you want to quickly and efficiently roll through your hosts while doing upgrades, and having something like this inefficiency in the NSX code base can drastically extend maintenance windows.  Unfortunately for us at the time, VMware released the 6.2.8 maintenance patch after 6.3.x, so the fixes were not yet a part of the 6.3.x release.  KB2150668

As stated above, the instructions for restarting the bluelane-manager service on the NSX appliance are still very handy to have.

 

 

 

VMworld-US 2018 Thoughts, A Week Later

A week ago, VMworld-US for 2018 wrapped up, and I have been slowly collecting my thoughts throughout the week to try and put out a meaningful blog post about my experience this year. While I was there I passed the VCP6.5-DCV delta exam. I’ve heard other people say that the delta exams are tougher, and they certainly were not wrong. Now I will be able to focus on getting the latest 6.5 VCAPs for data center virtualization. I was also able to meet several other vExperts in the community, along with seeing old friends from years past. I was able to go out and celebrate with my new team at the end of the week, which is always nice.

VMworld-US this year was a bit of a mixed bag, and it felt like it had lost its swagger. The half-baked point system the events team dreamed up was just that. It certainly wasn’t thought out enough, as only a few vendors participated. I guess you don’t know until you try, but hopefully it won’t make a return next year. If you did participate, however, the swag that the VMworld team handed out was well worth it. The socks, laptop sleeve, and insulated bottle were all top notch. I was excited to get the VR headset, but disappointed that it doesn’t fit larger phones like the Nexus 6P. The solutions exchange was much more reserved compared to previous years. I am happy to see the continued support for authors in the vCommunity through book signings and meet and greets with the authors.

VMware did have some good announcements this year, however: ESXi 64-bit support on Arm processors, Amazon Relational Database Service on VMware, and vSphere Platinum, to name a few. The keynotes themselves were great, but I enjoyed last year’s opening act much more, when VMware entertained the crowd with virtual reality. It felt more edgy and futuristic. This year it was rather slow, and they just seemed to jump right into the keynote.

The guest speaker at Wednesday’s keynote was Malala Yousafzai, who was there to speak about her own struggles in her home country, Pakistan, and the attempt on her life because she shares the beliefs of the modern world, where women have equal opportunities, both in career and education. But with so many other women actually in the tech industry, and the push for getting younger girls interested in technology, was she really the best choice? I’m not discrediting the hardships she went through in her own country, but what was the point in bringing all of that to a technology conference? To me it felt a little weird having her interviewed by someone whose home country of India, to this day, still allows the practice of marriage arrangements. Maybe I was the only one who cared to look at the finer details of the exchange. The additional security and screening needed to have Malala there caused too much congestion for attendees trying to get in. Most ended up skipping the main stage for the keynote. It felt like the popularity of the speaker outweighed the value to attendees.

I’m still on the fence about the VMworld fest this year, and I’m certainly not alone according to this Reddit thread. Royal Machines put on a good show for the most part, but with retirees taking the stage, it wasn’t the best show unless you just wanted to re-live the ’90s.

I felt the Blink-182 and Fall Out Boy shows of previous years had far better venues and locations. On the other side of the screen in the picture below was the main stage, which most people couldn’t get to, so this was their only viewing option. It looks packed, and certainly was at the beginning, but when I left halfway through the set, this side of the screen was almost cleared out. Maybe VMworld should have taken the picture at the end?

Screen Shot 2018-09-09 at 12.34.48 PM

The VMworld events team still failed to organize enough food for attendees again this year, and lines were ridiculously long for the food and beer that was available. I’ve heard unconfirmed reports of attendees leaving hungry again. The location itself was awkward, and didn’t provide adequate room for the main show. The decor was cool, but that’s about it.

VMworld finally provided hot breakfast to the US crowd this year for attendees, which was much appreciated. I personally love bacon and eggs for breakfast, just not every single day. Would it have really cost that much more to provide some hot oatmeal, waffles or pancakes one or two days out of the week to break it up a bit?

There still seems to be a big push for moving to the cloud, and that certainly was the message being echoed at VMworld. Being a vExpert, and a member of the vCommunity, I was able to talk to many people at the conference, and I was hearing a different message. The cloud is too expensive, and organizations have begun migrating away from the cloud to have control of their infrastructure and to keep costs down. Having worked for two very different cloud providers in my IT career (my current employer, who gives customers their own private clouds, and my former employer, who uses the shared cloud approach), I have an interesting perspective on cloud provider technologies and architecture. The market has many different cloud offerings for customers to choose from, but customers have yet to fully understand what they need to be successful in the cloud, given their size and expectations. But who will lead them down the correct path? Perhaps we are on the cusp of another industry shift.

Heading off to VMworld US 2018

As I’m readying myself to board the plane that will take me to Las Vegas, I can’t help but feel excited to be heading back to VMworld US. Las Vegas is not everyone’s cup of tea, and maybe it’s just the geek in me, but Vegas feels like it has a certain electricity to it, that energizes me. VMworld has always been special for me, and I have been fortunate to attend these past three years. I’m always excited to meet new additions to the #vCommunity, along with seeing friends I’ve made in the community already. This is certainly a great bunch of geeks that continue to inspire me. I can’t wait to see what VMware has in store for attendees, with product releases and announcements.

VMware Education Services has updated the naming conventions of VMware’s professional certifications

FYI: VMware is making some major changes to its certification naming conventions.  The changes take effect in August 2018 for the newly released certifications listed below, and are not retroactive, so the naming of existing certifications will not be affected.

  • VMware Certified Professional – Desktop and Mobility 2018 (VCP-DTM 2018)
  • VMware Certified Advanced Professional – Data Center Virtualization Deployment (VCAP-DCV 2018 Deployment)
  • VMware Certified Advanced Professional – Cloud Management and Automation Deployment (VCAP-CMA 2018 Deployment)

Read more on their official blog post here:

https://blogs.vmware.com/education/2018/08/22/we-are-changing-the-way-we-name-vmware-certifications-the-year-makes-it-clear/

NSX SSL Certificate Failure on ESXi: SSL handshake failed

Some time ago I was having an issue putting a host back into service in an NSX environment.  In Log Insight, and in the /var/log/netcpa.log I was seeing errors similar to the following:

2018-05-26T11:07:50.486Z [FFD53B70 error 'Default'] SSL handshake failed on 172.15.4.100:0 : error = SSL Exception: error:140000DB:SSL routines:SSL routines:short read
2018-05-26T11:07:55.545Z [FFD12B70 error 'Default'] SSL handshake failed on 172.15.4.100:0 : error = SSL Exception: error:140000DB:SSL routines:SSL routines:short read
2018-05-26T11:08:00.600Z [FFD12B70 error 'Default'] SSL handshake failed on 172.15.4.100:0 : error = SSL Exception: error:140000DB:SSL routines:SSL routines:short read
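To get a feel for how often this is happening and against which peer, a small filter along these lines can summarize the log.  This is a minimal sketch: the function name is my own, and it assumes the log lines follow the format shown above.

```shell
#!/bin/sh
# count_ssl_failures: read netcpa.log lines on stdin and print one
# "<count> <peer address>" line per peer that had SSL handshake failures.
# Assumes the "SSL handshake failed on <address> :" wording shown above.
count_ssl_failures() {
    grep 'SSL handshake failed on' |
        sed 's/.*SSL handshake failed on \([^ ]*\) .*/\1/' |
        sort | uniq -c | awk '{print $1, $2}'
}
```

On the host it would be fed the log directly, for example: count_ssl_failures < /var/log/netcpa.log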

Browsing through VMware’s archive, I came across KB2151089, which was very similar to the issue I was having; however, upgrading to NSX 6.3.5 was not an option at the time.  I remembered a similar issue at my previous workplace, and dug through my Evernote archive to find my notes.

Before we continue, this should go without saying, but your mileage may vary, and I’d recommend opening a ticket with VMware’s GSS.  At the very least you should test this process out in a lab.

The steps outlined here resolved the issue for me.  Keep in mind that at this point the host is not in production, and is currently in maintenance mode:

  • Determine if the host is connected to the NSX controllers (port 1234) and to the NSX Manager message bus (port 5671) by logging into the ESXi host, and running the following commands:
# esxcli network ip connection list |grep 1234

— and —

# esxcli network ip connection list |grep 5671

 

  • Next, log into the NSX appliance and back up the config.  While the config backup is taking place, get the ESXi host MOB ID from the vCenter MOB page https://<vcsa-fqdn>/mob
  1. Select the link for the ‘root folder‘, e.g. group-d1
  2. Select the link for the ‘child entity‘, e.g. datacenter-2
  3. Select the link for the ‘host folder‘, e.g. group-h4
  4. Select the link for the ‘child entity‘, e.g. domain-c7
  5. Now locate the ‘host‘ and find the host-xxxx value, e.g. host-1234
  • After the NSX backup is complete, SSH into the NSX Manager.  Root access to the appliance will be needed, so at the command prompt:
  1. Enter ‘en‘ and then enter the ‘admin’ password
  2. Enter ‘st en‘ and enter the following password: IAmOnThePhoneWithTechSupport
  • Log into the SQL prompt
# psql -U secureall
secureall=#
  • Issue the following command to verify that there is a record associated with the host mob ID.  Below is an example using host-1234
# select host_uuid,node_uuid,thumbprint from vnvp_host_key where host_uuid='host-1234';

Example output:

 host_uuid |              node_uuid               |                         thumbprint
-----------+--------------------------------------+-------------------------------------------------------------
 host-1234 | a2a68660-515e-4f87-811d-306c54b0b2e8 | AD:58:C0:84:FF:DF:5E:95:50:B7:63:2E:3F:B2:67:22:56:F7:DC:9B
(1 row)

  • Next, in vCenter move the host to an isolation cluster.  We will need to validate the NSX vibs installed by running the following command on the host:
# esxcli software vib list |grep -E 'esx-dvfilter-switch-security|vsip|vxlan'

 

Example output:

esx-dvfilter-switch-security   6.3.1-0.0.5124716  VMware  VMwareCertified  2017-02-28
esx-vsip                       6.3.1-0.0.5124716  VMware  VMwareCertified  2017-02-28
esx-vxlan                      6.3.1-0.0.5124716  VMware  VMwareCertified  2017-02-28

 

  • Remove the NSX vibs with the following commands:
# esxcli software vib remove -n esx-vxlan
# esxcli software vib remove -n esx-vsip
# esxcli software vib remove -n esx-dvfilter-switch-security

 

  • Returning to the NSX terminal window, now delete the record using the secureall=# prompt. Using ‘host-1234’ as an example.
# delete from vnvp_host_key where host_uuid='host-1234';
DELETE 1

 

  • Reboot the ESXi host.  Once the host has rebooted, put the host back into the proper cluster.  To be safe, I would temporarily turn down DRS (move slider left), and exit maintenance mode.
  • We can validate that the host looks proper in vSphere web UI: ‘Network & Security -> Installation -> Host Preparation Tab‘ .
  • Click the ‘Resolve‘ link next to the cluster name
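The controller connection check from the first step can be made a little stricter than a plain grep, which would also match a local port that happens to contain 1234.  The sketch below assumes the usual column layout of ‘esxcli network ip connection list‘ (foreign address in the fifth column, state in the sixth); the function name is my own.

```shell
#!/bin/sh
# count_established PORT: read `esxcli network ip connection list` output on
# stdin and print how many ESTABLISHED connections have a foreign address
# ending in :PORT. Assumes foreign address is column 5 and state is column 6.
count_established() {
    awk -v port="$1" '
        $6 == "ESTABLISHED" && $5 ~ (":" port "$") { n++ }
        END { print n + 0 }
    '
}
```

On the host this would be used as: esxcli network ip connection list | count_established 1234 (and again with 5671).  A result of 0 means no established connection was found on that port.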

Validation

  • Once the tasks are all completed you can run the ‘esxcli software vib list….‘ command again to see that the three vibs have been installed.
  • Test that the VXLAN network is functioning on the host.
  • Verify that the SSL Exception is no longer showing in the /var/log/netcpa.log.
  • If there are no errors, then the host is all set to be put back into service.
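The VIB check in the first validation step can be scripted as well.  This is a minimal sketch with a function name of my own; it assumes the three VIB names from the example output above, and other NSX versions may package the VIBs differently.

```shell
#!/bin/sh
# missing_nsx_vibs: read `esxcli software vib list` output on stdin and print
# the name of each expected NSX VIB that is absent from the first column.
# The expected names are taken from the 6.3.1 example output in this post.
missing_nsx_vibs() {
    installed="$(awk '{print $1}')"
    for vib in esx-dvfilter-switch-security esx-vsip esx-vxlan; do
        printf '%s\n' "$installed" | grep -qx "$vib" || echo "$vib"
    done
}
```

On the host this would be used as: esxcli software vib list | missing_nsx_vibs.  No output means all three VIBs were found.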

 

 

 

VMworld 2018 is right around the corner! Where will you be?

It’s almost that time of year again... some might even call it that special time of year, when VMware geeks from across the globe converge on VMworld.  One might even consider this summer camp, and like any who have experienced this before, you meet new people in the vCommunity, make friends, and part ways after the week of technical sessions, social gatherings, and just the straight-up shop talking, war story sharing, and the sharing of ideas.  Personally, this will be my third year attending, and I am super excited to be going.  This conference means enough to me that, due to other circumstances that happened early this year, I purchased my own pass to ensure that I wouldn’t miss out.

Now is the perfect time to cash in on those early bird discounts on conference passes, good until June 15.  Why wouldn’t you want to save a couple hundred dollars on one of the best IT conferences of the year?  For an individual, it’s $1,795 vs $2,095.  That’s before other discounts that may be applied, like VMUG memberships, or the discount for VMware Certified Professionals who hold an active VCP.

So, why go to VMworld?

I think for many first timers, there’s a certain electricity, and excitement about going.  Let me be the first to tell you, that feeling…. never really goes away.  Like the past couple of years, VMworld in the US will be held once again in Las Vegas.

I personally love coming to VMworld and have looked forward to it every year.  There’s always good energy here; from the minute you get off the plane, the mood is upbeat.  Every experience I’ve had here has been fun, and people are genuinely in a good mood.  This conference gives attendees the chance to attend VMware-led and partner-led sessions on platforms you may have thought about using or are currently using.  These sessions are meant to share best practices with the community, transfer knowledge on ways to use VMware platforms, and also give you a chance to ask the experts, many of whom work for VMware, and in some cases are very involved with the development of the platforms you use.

VMworld is not just about attending sessions, however.  This conference gives you the unique opportunity to network with other IT professionals from across the globe and establish relationships that you would otherwise never be able to form.  Like it did for me, this conference may also inspire you to join the vCommunity, a thriving community of professionals who not only share their knowledge with others, but who also need help themselves.  I think we can all agree that no two environments/businesses are alike, and we have all used VMware’s platforms in ways that were intended, and in ways that even VMware might not have ever considered.  Members of the vCommunity take it upon themselves to share their experiences through blogs, social media, and support forums to help others.  This conference gives us a chance to get together and share war stories from our time in the trenches, and many times you will find attendees getting together to engineer and develop something cool.

The VMware {code} group has even put together a hackathon, where members of the vCommunity can get together while at VMworld to develop some amazing things, and sometimes there are even prizes to be had for the coolest of the cool ideas.  But don’t let those words “code” or “hackathon” scare you.  These sessions are not just for developers!  Sure, being one will certainly help, but the power of the community enables you to participate in these teams anyway.  You may not be able to contribute code, but you can still contribute ideas to the team, and you might even pick up a few coding skills along the way.  Let’s face it; some pretty cool ideas are cooked up during hackathons.  VMware’s internal hackathon cooked up the idea to bring VR into the datacenter, and allow you to virtually move your workloads from on-premises data centers into the cloud.  It’s freakin VR man!  How cool is that?

The VMworld conference also affords you the opportunity to attend instructor-led labs, along with VMware’s hands-on labs that you can also experience from home.  While at the conference, there will be many vendors out on the floor where you can experience new products and ask questions about products that you already use.  And let’s not forget the vendor hall crawl, where there will be free adult beverages, snacks, and cool swag that vendors are giving out.  All can be found in the solutions exchange area.

I’m not going to lie, the parties at VMworld are pretty wild too.  I’m not saying that should be the only reason you go, but it is a good way to mingle with other conference attendees, jam out to some good music, and of course escape the Las Vegas heat.  VMworld of course wraps up with its own party, before the last day of the conference.

So what are you waiting for?  I can’t think of any reason not to attend the US 2018 VMworld in Las Vegas, August 26th – 30th, or the Europe 2018 VMworld in Barcelona, November 5th – 8th.  Follow this link here, and I will see you at the conference in Las Vegas!  Remember to take advantage of those early bird rates, good until June 15th!  REGISTER HERE FOR VMWORLD 2018

 

Simple, Efficient, and Modern: VMware NSX introduces new HTML5 UI

Along with the advancements in context-aware micro-segmentation and network virtualization, we are also continually raising the bar on making VMware NSX simple to deploy, manage, and operationalize at scale – and that, of course, involves a responsive and easy-to-use HTML-based UI to access VMware NSX functionality. With VMware NSX for vSphere 6.4.1, you can now…

The post Simple, Efficient, and Modern: VMware NSX introduces new HTML5 UI appeared first on Network Virtualization.


VMware Social Media Advocacy

Add Custom Recommendation to vROps alert definition for versions prior to 6.6

  • This is useful when a new SOP document is created: we can link to it directly in the alert email that is sent.

Step-By-Step

  1. Log into the main vRealize Operations Manager page.
  2. Click Content and then Recommendations.

  3. On this page you can create, edit and delete custom recommendations.  Click the green plus sign to create a new custom recommendation.
  4. Here you can enter the text for the custom recommendation.  Paste the link to the SOP, highlight it, and then click the hyperlink icon.  Now paste the link again and click OK.  The “actions” section allows the use of automated functions when you are looking at the triggered alert in vROps.  For now, just click save.
  5. Now you can add the custom recommendation to an alert definition.  Click Content and then Alert Definitions.
  6. Search for the alert that you would like to add the SOP to, select it and click the edit button.
  7. Click on section 5: Add Recommendations and then click on the plus sign
  8. Now you will need to search for the new SOP recommendation you just created, so search for SOP, find it in the list on the left, click and drag to position under the Recommendations section.
  9. Finally click save.  Now when this alert is triggered, and an email is sent, there will be a clickable link in the email to the SOP document.