vMotion fails at 67% on ESXi 6 in vCenter 6.

I came across an interesting error the other night while on call: VMs could not vMotion off of one host in a cluster, either manually or through DRS.  I was seeing the following error message in vSphere:


The source detected that the destination failed to resume.

vMotion migration [-1062731518:1481069780557682] failed: remote host <192.168.1.2> failed with status Failure.
vMotion migration [-1062731518:1481069780557682] failed to asynchronously receive and apply state from the remote host: Failure.
Failed waiting for data. Error 195887105. Failure.


 

  • While tailing the hostd log on the source host, I saw the following error:
2016-12-09T19:44:40.373Z warning hostd[2B3E0B70] [Originator@6876 sub=Libs] ResourceGroup: failed to instantiate group with id: -591724635. Sysinfo error on operation returned status : Not found. Please see the VMkernel log for detailed error information

 

  • While tailing the hostd log on the destination host, I saw the following error:
2016-12-09T19:44:34.330Z info hostd[33481B70] [Originator@6876 sub=Snmpsvc] ReportVMs: processing vm 223
2016-12-09T19:44:34.330Z info hostd[33481B70] [Originator@6876 sub=Snmpsvc] ReportVMs: serialized 36 out of 36 vms
2016-12-09T19:44:34.330Z info hostd[33481B70] [Originator@6876 sub=Snmpsvc] GenerateFullReport: report file /tmp/.vm-report.xml generated, size 915 bytes.
2016-12-09T19:44:34.330Z info hostd[33481B70] [Originator@6876 sub=Snmpsvc] PublishReport: file /tmp/.vm-report.xml published as /tmp/vm-report.xml
2016-12-09T19:44:34.330Z info hostd[33481B70] [Originator@6876 sub=Snmpsvc] NotifyAgent: write(33, /var/run/snmp.ctl, V) 1 bytes to snmpd
2016-12-09T19:44:34.330Z info hostd[33481B70] [Originator@6876 sub=Snmpsvc] GenerateFullReport: notified snmpd to update vm cache
2016-12-09T19:44:34.330Z info hostd[33481B70] [Originator@6876 sub=Snmpsvc] DoReport: VM Poll State cache - report completed ok
2016-12-09T19:44:40.317Z warning hostd[33081B70] [Originator@6876 sub=Libs] ResourceGroup: failed to instantiate group with id: 727017570. Sysinfo error on operation returned status : Not found. Please see the VMkernel log for detailed error information

 

  • While tailing the vmkernel.log on the destination host, I saw the following error:
2016-12-09T19:44:22.000Z cpu21:35086 opID=b5686da8)World: 15516: VC opID AA8C46D5-0001C9C0-81-91-cb-a544 maps to vmkernel opID b5686da8
2016-12-09T19:44:22.000Z cpu21:35086 opID=b5686da8)Config: 681: "SIOControlFlag2" = 1, Old Value: 0, (Status: 0x0)
2016-12-09T19:44:22.261Z cpu21:579860616)World: vm 579827968: 1647: Starting world vmm0:oats-agent-2_(e00c5327-1d72-4aac-bc5e-81a10120a68b) of type 8
2016-12-09T19:44:22.261Z cpu21:579860616)Sched: vm 579827968: 6500: Adding world 'vmm0:oats-agent-2_(e00c5327-1d72-4aac-bc5e-81a10120a68b)', group 'host/user/pool34', cpu: shares=-3 min=0 minLimit=-1 max=4000, mem: shares=-3 min=0 minLimit=-1 max=1048576
2016-12-09T19:44:22.261Z cpu21:579860616)Sched: vm 579827968: 6515: renamed group 5022126293 to vm.579860616
2016-12-09T19:44:22.261Z cpu21:579860616)Sched: vm 579827968: 6532: group 5022126293 is located under group 5022124087
2016-12-09T19:44:22.264Z cpu21:579860616)MemSched: vm 579860616: 8112: extended swap to 46883 pgs
2016-12-09T19:44:22.290Z cpu20:579860616)Migrate: vm 579827968: 3385: Setting VMOTION info: Dest ts = 1481312661276391, src ip = <192.168.1.2> dest ip = <192.168.1.17> Dest wid = 0 using SHARED swap
2016-12-09T19:44:22.293Z cpu20:579860616)Hbr: 3394: Migration start received (worldID=579827968) (migrateType=1) (event=0) (isSource=0) (sharedConfig=1)
2016-12-09T19:44:22.332Z cpu0:33670)MigrateNet: vm 33670: 2997: Accepted connection from <::ffff:192.168.1.2>
2016-12-09T19:44:22.332Z cpu0:33670)MigrateNet: vm 33670: 3049: data socket size 0 is less than config option 562140
2016-12-09T19:44:22.332Z cpu0:33670)MigrateNet: vm 33670: 3085: dataSocket 0x430610ecaba0 receive buffer size is 562140
2016-12-09T19:44:22.332Z cpu0:33670)MigrateNet: vm 33670: 2997: Accepted connection from <::ffff:192.168.1.2>
2016-12-09T19:44:22.332Z cpu0:33670)MigrateNet: vm 33670: 3049: data socket size 0 is less than config option 562140
2016-12-09T19:44:22.332Z cpu0:33670)MigrateNet: vm 33670: 3085: dataSocket 0x4306110fab70 receive buffer size is 562140
2016-12-09T19:44:22.332Z cpu0:33670)VMotionUtil: 3995: 1481312661276391 D: Stream connection 1 added.
2016-12-09T19:44:24.416Z cpu1:32854)elxnet: elxnet_allocQueueWithAttr:4255: [vmnic0] RxQ, QueueIDVal:2
2016-12-09T19:44:24.416Z cpu1:32854)elxnet: elxnet_startQueue:4383: [vmnic0] RxQ, QueueIDVal:2
2016-12-09T19:44:24.985Z cpu12:579860756)VMotionRecv: 658: 1481312661276391 D: Estimated network bandwidth 471.341 MB/s during pre-copy
2016-12-09T19:44:24.994Z cpu4:579860755)VMotionSend: 4953: 1481312661276391 D: Failed to receive state for message type 1: Failure
2016-12-09T19:44:24.994Z cpu4:579860755)WARNING: VMotionSend: 3979: 1481312661276391 D: failed to asynchronously receive and apply state from the remote host: Failure.
2016-12-09T19:44:24.994Z cpu4:579860755)WARNING: Migrate: 270: 1481312661276391 D: Failed: Failure (0xbad0001) @0x4180324c6786
2016-12-09T19:44:24.994Z cpu4:579860755)WARNING: VMotionUtil: 6267: 1481312661276391 D: timed out waiting 0 ms to transmit data.
2016-12-09T19:44:24.994Z cpu4:579860755)WARNING: VMotionSend: 688: 1481312661276391 D: (9-0x43ba40001a98) failed to receive 72/72 bytes from the remote host <192.168.1.2>: Timeout
2016-12-09T19:44:24.998Z cpu4:579860616)WARNING: Migrate: 5454: 1481312661276391 D: Migration considered a failure by the VMX. It is most likely a timeout, but check the VMX log for the true error.

 

We are using the vSphere Distributed Switch in our environment, and each host has a vmk dedicated to vMotion traffic only, so this was my first test: I verified the IP and subnet of the vMotion vmk on the source and destination hosts, successfully pinged the destination host using vmkping, and tested the connection both ways.
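
For reference, the vmk configuration side of that check can also be pulled quickly with PowerCLI (host names are placeholders); the vmkping test itself is run from an SSH session on the source host:

#List the vMotion-enabled VMkernel adapters and their IP/subnet/MTU on both hosts
Get-VMHostNetworkAdapter -VMHost "<sourcehost>","<destinationhost>" -VMKernel |
    Where-Object { $_.VMotionEnabled } |
    Select-Object VMHost, Name, IP, SubnetMask, Mtu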

My second test was to power off a VM and try to vMotion it off the host – this worked.  When I powered the VM back on, it immediately moved back to the source host.  I then tried to vMotion the same VM, now powered on, from the affected source host to the destination host as before, and to my surprise it worked this time.  I tested this process with a few other VMs for consistency.  I also tried restarting a VM on the affected host and then moving it off to another host, but that did not work.

My final test was to vMotion a VM from a different host to the affected host.  This worked as well, and I was even able to vMotion it off of the affected host again.

 

In our environment we have a Trend Micro agent VM and a Guest Introspection (GI) VM running on each host.  I logged into the vSphere Web Client to check the status of the Trend Micro VM and found no indication of an error; the GI VM showed the same healthy status.

Knowing we had had issues with Trend Micro in the past, I powered down the Trend Micro agent VM running on the host and attempted a manual vMotion of a test VM I knew couldn’t move before – IT WORKED.  I tried another with the same result, then put the host into maintenance mode and successfully evacuated the rest of the customer VMs from it!
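
For what it’s worth, the same workaround could be scripted with PowerCLI; a rough sketch, where the host name is a placeholder and the agent VM naming pattern is an assumption about our environment:

#Power off the Trend Micro agent VM running on the affected host
Get-VMHost "<affectedhost>" | Get-VM | Where-Object { $_.Name -like "Trend Micro*" } | Stop-VM -Confirm:$false

#Then enter maintenance mode; DRS migrates the powered-on VMs, and -Evacuate asks vCenter to move the powered-off and suspended ones as well
Set-VMHost -VMHost "<affectedhost>" -State Maintenance -Evacuate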

To wrap all of this up, the Trend Micro agent VM running on the ESXi 6 host was preventing other VMs from vMotioning off, either manually or through DRS.  Once the Trend Micro agent VM was powered off, I was able to evacuate the host.

 

Creating vROps Policies and How To Apply Them To Object Groups.

Creating policies in VMware’s vRealize Operations appliance can be straightforward if you have a decent understanding of each platform it’s monitoring.  In my last post of this series, I covered the creation of object groups; that post is relevant here because policies can be created and assigned to those object groups, allowing you to tune the alerts received for them.

Once logged in to the vROps appliance, go into the Administration section, where you will find Policies.

vrops37.png

  • VMware has included many base policies in the policy library, which in most cases will be fine for the initial configuration of the appliance, but you may want to create additional policies to suit your specific environment’s needs.
  • Also take note of the blue film strip in the upper right corner.  This takes you to VMware’s video repository, with a brief how-to video explaining policies.  These video links can be found throughout the configuration of the appliance, and more are added with each release.

To create a new policy, click the green plus sign to get started.  Give the policy a unique name; it is good practice to also give a description of what the policy is intended to do.  When creating a policy, you have the ability to “start with” a VMware pre-defined policy, and I recommend taking advantage of that until you have a firm understanding of what these policies do.

vrops38

On the Select Base Policy tab, you can use the drop-down menu on the left to get a policy overview of what is being monitored.  In this example, Host System was selected.

vrops39

Policy Overrides can also be incorporated into this policy.  In other words, if there are certain alerts that you do not want, one of the pre-defined policies may already have those alerts turned off, so those policies can be added to the new policy being created here.  Work smarter, not harder, right?

vrops40

Moving along to the Analysis Settings tab, here you can see how vROps analyzes alerts, determines thresholds, and assigns system badges.  These can be left at the settings inherited from the policy you are building off of, or you can click the padlock to the right and make individual changes.  Keep in mind that under the “Show changes for” drop-down menu there are many objects you can select to change the analysis settings on.

vrops41

The Alert/Systems Definitions tab is probably where the majority of your time will be spent.  The “Alert Definitions” box at the top is where alerts can be turned on or off, based on the base policy used to create this one or the override policies used.

  • Each management pack installed will have its own category of object types.  In other words, “host system” is listed under the vCenter category, but if the vCloud Director management pack is installed, it will also have a “host system” under its category.  Each management pack has the ability to add additional alerts for objects referenced in other management packs.  Take time going through each category to see what alerts may need configuring.
  • The State of each alert will be one of the following: Local with a green check mark, meaning you enabled it; Inherited with a grey check mark, meaning it is enabled via another policy used to create this one; Local with a red crossed-out circle, meaning you disabled the alert for this policy; or Inherited with a greyed-out crossed-out circle, meaning it is disabled via another policy used to create this one.  Disabling alerts here still allows metrics to be collected for the object; you just won’t get the alarm for it.
  • The System Definitions section has the same “object type” drop-down menu; select the object type here to configure system thresholds for how symptoms are triggered for the alert selected in the Alert Definitions box above.  I typically do these in tandem.

vrops43

Finally, on the Apply Policy to Groups tab, you can apply the policy to the custom groups you created earlier.

vrops42

Once you click Save and go back to the Active Policies tab, you will see the new policy, and within five minutes you should see its Affected Objects count rise.  You can see here that I have a policy marked with a “D”, meaning it is the default appliance policy.  You can set your own policy as the default by clicking the blue circle icon with the arrow on the upper left side.  It may take up to 24 hours before the active alerts page reflects the settings of the new policy; otherwise, you can manually clear those alerts.

vrops44

Previous post to this series: Configuring VMware vRealize Operations Manager Object Groups

ESXi host fails to upgrade from 5.5 Update 3 to 6 Update 2

This happened to me today and I thought it was worth sharing.  Most of the hosts in this particular cluster upgraded fine from ESXi 5.5u3 to ESXi 6u2, with the exception of this one host.  Update Manager kept giving me the error “Cannot run upgrade script on host” in the vCenter Recent Tasks pane.

esxi01

A quick Google search brought me to KB article 2007163, but after following the KB I wasn’t able to find the referenced error “Remediation failed due to non mode failure” on the Update Manager server (Win2008) in the C:\AppData\VMware\Update Manager\Logs\vmware-vum-server-log4cpp.log file.

I started an SSH session to the ESXi host, but wasn’t able to find an entry similar to the error “OSError: [Errno 39] Directory not empty:” in the /var/log/vua.log file.

Instead, I found this error:

—————————————————————-

[FFD0D8C0 error 'Default'] Alert:WARNING: This application is not using QuickExit().

The exit code will be set to 0.@ bora/vim/lib/vmacore/main/service.cpp:147

--> Backtrace:

--> backtrace[00] rip 1bc228c3 Vmacore::System::Stacktrace::CaptureFullWork(unsigned int)

—————————————————————–

By chance, I happened to check free space on the ESXi host with # df -h and found a partition that was 100% full.

esxi02

So I changed directory to it with # cd /storage/core/……./, where I found two more directories, /var/core/.  Using # ls to list the directory, I found two zdumps.

esxi03

I deleted the two zdumps and then checked the space again with # df -h.

esxi04

Seeing that the partition was now 68% utilized instead of 100%, I attempted the ESXi 6u2 upgrade again, this time with success.

 

 

How to evacuate virtual machines from one host to another with PowerCLI

If you ever find yourself needing to evacuate an ESXi host, there is a handy PowerCLI one-liner that can do just that, and it maintains the resource pools for the virtual machines too.  This was used with vCenter 6u2, PowerCLI 6.3 R1, and ESXi 5.5.

–  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –

Connect-VIserver <vcenter-name/ip>

Get-VM -Location "<sourcehost>" | Move-VM -Destination (Get-VMHost "<destinationhost>")

–  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –

I recently went through an outage that affected our ability to put hosts into maintenance mode, as the vMotion operation would get stuck at 13% at the vCenter level, with no indication at the host level that anything was happening.  This PowerCLI command allowed me to evacuate one host’s virtual machines onto another, getting me through all 18 hosts in the cluster.
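
If you only want to touch the powered-on VMs, or would rather queue the migrations as tasks instead of waiting on each one, a small variation of the same one-liner (same placeholders as above) works too:

#Move only the powered-on VMs, queuing each vMotion as an asynchronous task
Get-VM -Location "<sourcehost>" | Where-Object { $_.PowerState -eq "PoweredOn" } | Move-VM -Destination (Get-VMHost "<destinationhost>") -RunAsync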

 

 

Nightly Automated Script To Gather VM and Host Information, and Output To A CSV

Admittedly this was my first attempt at creating a PowerShell script, but I thought I would share my journey.  We wanted a way to track the location and power state of customer VMs running on our ESXi hosts in the event of a host failure.  Ideally this would be automated to run multiple times a day, with the output saved to a CSV on a network share.

We had discovered a bug in vCenter 5.5: if an ESXi 5.5 host was in a disconnected state and an attempt was made to reconnect it using the vSphere Client without knowing that the host was actually network isolated, HA would not restart the VMs on another host as expected.  We later tested this in a lab and found that if we had not used the reconnect option, HA would restart the VMs on other hosts as expected.  We tested the scenario again in vCenter 6 Update 2, and the bug was not present.

So the first PowerCLI one-liner I came up with was the following:

> get-vm | select VMhost, Name, @{N="IP Address";E={@($_.guest.IPaddress[0])}}, PowerState | where {$_.PowerState -ne "PoweredOff"} | sort VMhost

get-vm1

I wanted a list of powered-on VMs, their IPs, and which host they were running on, sorted by host name.  Knowing I was on the right track, I next wanted to connect to multiple data centers and save each data center’s output to a CSV on a network share.  I didn’t need to keep the CSVs on that network share for more than seven days, so I also wanted to build in logic so the script would essentially clean up after itself.

That script looks something like this:

#Initial variables
$vCenter = @()
$sites = @("vcenter01","vcenter02","vcenter03")

#Loop through the array of sites and establish a connection to each vCenter
foreach ($site in $sites) {
    $vCenter = $site + ".domain.net"

    Connect-VIServer $vCenter

    #Get the VMs that are not powered off, their IPs, and which hosts they're running on
    get-vm | select VMhost, Name, @{N="IP Address";E={@($_.guest.IPaddress[0])}}, PowerState | where {$_.PowerState -ne "PoweredOff"} | sort VMhost | Export-Csv -Path "c:\path\to\output\$site $((Get-Date).ToString('MM-dd-yyyy_hhmm')).csv" -NoTypeInformation -UseCulture

    #Disconnect from the current vCenter
    Disconnect-VIServer -Force -Confirm:$false -Server $vCenter

}

#Cleanup old csv after 7 days.
$limit = (Get-Date).AddDays(-7)
$path = "c:\path\to\output\"

Get-ChildItem -Path $path -Recurse -Force | Where-Object { !$_.PSIsContainer -and $_.CreationTime -lt $limit } | Remove-Item -Force
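
Since the goal was to have this run automatically multiple times a day, one option is a scheduled task on the script server (Server 2012 or later; older servers can use schtasks.exe instead).  A minimal sketch, assuming the script is saved as C:\Scripts\vm-report.ps1 and run as a service account with read access to the vCenters – both hypothetical:

#Run the report nightly at 11 PM; adjust or add triggers for multiple runs per day
$action  = New-ScheduledTaskAction -Execute "powershell.exe" -Argument "-NoProfile -ExecutionPolicy Bypass -File C:\Scripts\vm-report.ps1"
$trigger = New-ScheduledTaskTrigger -Daily -At 11:00PM
Register-ScheduledTask -TaskName "Nightly VM Report" -Action $action -Trigger $trigger -User "DOMAIN\svc-vmreport" -Password "<password>"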

Something I did not know until after running this in a large production environment is that the Get-VM call is heavy and not very efficient.  When I ran this script in my lab, it took less than 15 seconds.  In production, however, connecting to data centers all over the globe, it took over 40 minutes to run.

A colleague of mine with automation experience pointed me to another cmdlet called Get-View, and said it would be much faster since it is more efficient at gathering only the data needed.  So I rewrote my script, and now it looks like this:

_________________________________________________________________

get-vm4-5

 

_________________________________________________________________
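
The rewritten script itself is shown in the screenshot above, but as a rough sketch, the Get-View version of the collection step (dropped into the same foreach loop over $sites as before) looks something like this; note that the property paths come from the vSphere API rather than the PowerCLI VM object:

#Build a lookup of host MoRef -> host name once per vCenter
$hostLookup = @{}
Get-View -ViewType HostSystem -Property Name | ForEach-Object { $hostLookup[$_.MoRef.ToString()] = $_.Name }

#Get-View only retrieves the properties asked for, which is far lighter than Get-VM
Get-View -ViewType VirtualMachine -Property Name,"Runtime.PowerState","Runtime.Host","Guest.IpAddress" |
    Where-Object { $_.Runtime.PowerState -ne "poweredOff" } |
    Select-Object @{N="VMhost";E={ $hostLookup[$_.Runtime.Host.ToString()] }},
                  Name,
                  @{N="IP Address";E={ $_.Guest.IpAddress }},
                  @{N="PowerState";E={ $_.Runtime.PowerState }} |
    Sort-Object VMhost |
    Export-Csv -Path "c:\path\to\output\$site $((Get-Date).ToString('MM-dd-yyyy_hhmm')).csv" -NoTypeInformation -UseCulture

One difference from Get-VM is that Get-View also returns VM templates, so filter those out if that matters in your environment.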

The new code took less than a couple of minutes to run in my production environment.  I have a Windows VM deployed that runs the VMware PowerActions fling along with some scheduled scripts.  This script would be running from that server, so I added an additional function that writes a Windows event log entry so the run could be tracked from a syslog server.
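
The event log piece can be as simple as the two cmdlets below; the source name is a hypothetical one, and New-EventLog only needs to be run once on the script server:

#One-time registration of the event source on the script server
New-EventLog -LogName Application -Source "VMReportScript"

#Written at the end of each run so the entry can be picked up and forwarded to the syslog server
Write-EventLog -LogName Application -Source "VMReportScript" -EventId 1000 -EntryType Information -Message "VM location report completed for: $($sites -join ', ')"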

So the final script can be downloaded here.  *Disclaimer – test this in a lab first, as the code will need to be updated to suit your needs.

 

 

Add The vROps License, Configuring LDAP, and The SMTP For vRealize Operations Manager (vROps)

If you’ve been following my previous posts, I discussed what vRealize Operations Manager is, how to get the appliance deployed, how to get the master replica, data nodes and remote collectors attached to the cluster, and finally how to get data collection started.

Now it’s time to license vRealize Operations Manager.  This can be achieved by logging into the appliance via <https://vrops-appliance/ui/login.action>.  Next, go into the Administration section, where you’ll see Licensing.  Click the green plus sign to add the vROps license.

vrops23

About seven items down from Licensing in the left-hand column, you will see Authentication Sources.  This is where you configure LDAP.

vrops22

Again click the green plus sign to configure the LDAP source.

vrops24

Once the credentials have been added, test the connection, and if everything checks out, click OK.

Lastly, let’s get the SMTP service configured.  About three items down from Authentication Sources you’ll find Outbound Settings.  Click the green plus sign to add a new SMTP instance.

vrops25

Once you have the SMTP information added, test the connection, and if everything checks out click save.

vrops26

So now you should have a functioning, licensed vROps instance.  In future posts I will cover creating object groups and policies, and configuring alert emails.

Next Post: Configuring VMware vRealize Operations Manager Object Groups

Last Post: Configuring VMware vRealize Operations Manager Adapters For Data Collection

Configuring VMware vRealize Operations Manager Adapters For Data Collection

If you’ve followed my recent blog post on Installing vRealize Operations Manager (vROps) Appliance, you are now ready to configure the built-in vSphere adapter to start data collection.

Depending on how big your environment is, and if you have remote collectors deployed, you may want to consider configuring collector groups.  A collector group lets you group multiple remote collectors within the same data center to provide resiliency: with the vROps adapters pointed at the collector group instead of an individual remote collector, if one remote collector goes down, the others pick up the slack and continue collecting from that data center, so there is no data loss.  You can also create a collector group for a single remote collector, which makes it easier to expand later if you want to add that data-collection resiliency.

Go ahead and log into the appliance using the regular UI <https://vrops-appliance/ui/login.action>.  From here, click Administration.  If you just need to configure the vSphere adapter for data collection, you can skip ahead to Section 2.  Otherwise, let’s continue with Section 1 and configure the collector groups.

Section 1

Click on Collector Groups

vrops11

You can see that I already have collector groups created for my remote data centers, but to create a new one, just click the green plus sign.

vrops12

Give the collector group a name, and then in the lower window select the corresponding remote collector.  Then rinse-wash-and-repeat until you have the collector groups configured.  Click Save when finished.  Now let’s move on to Section 2.

Section 2

From the Administration area, click on Solutions

vrops10

Because this is a new deployment, you will only have Operating Systems / Remote Service Monitoring and VMware vSphere.  For the purposes of this post, I will only cover configuring the VMware vSphere adapter.  Click it to select it, and then click the gears icon to the right of the green plus sign.

vrops13

Here, fill out the display name, the vCenter Server it will be connecting to, and the credentials.  If you click the drop-down arrow next to Advanced Settings, you will see the Collectors/Groups drop-down menu.  Expand that if you created custom collector groups in Section 1 and select the desired group.  Otherwise vROps will use the Default collector group, which is fine if you only have one data center; if you don’t have a collector group configured, I recommend at least selecting a remote collector here.  This puts the data-collection load onto the remote collectors and allows the cluster to focus on processing all of that lovely data.  Click Test Connection to verify connectivity, and then click Save.  Rinse-wash-and-repeat until you have all vCenters collecting, and click Close when finished.

It is important to note that vROps by default collects data every five minutes, and currently that is the lowest setting possible.  You can monitor the status of your solutions or adapters here; once they start collecting, their statuses will change to green.

vrops17
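
If you have PowerCLI handy, the vROps cmdlets in the VMware.VimAutomation.vROps module give another quick way to spot-check collection from the command line; a sketch, with the appliance name, credentials, resource name, and metric key as placeholders or assumptions:

#Connect to the vROps appliance and check that a known object is present
Connect-OMServer -Server "vrops-appliance" -User admin -Password "<password>"
Get-OMResource -Name "<host-or-vm-name>"

#Pull the last hour of a common metric as a sanity check that data is flowing
Get-OMStat -Resource (Get-OMResource -Name "<host-or-vm-name>") -Key "cpu|usage_average" -From (Get-Date).AddHours(-1)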

If you’d like to add additional solutions, otherwise known as “Management PAKs”, head over to VMware’s Solution Exchange.  I currently work for a cloud provider running NSX, so I also have the NSX and vCloud Director Management PAKs installed.  From the same Solutions page, instead of clicking the gears icon, click the green plus sign and add the additional solutions to your environment.  The same process is used when updating solutions to newer versions; currently there is no built-in notification when a newer version is available.

vrops14

Go to Global Settings on the Administration page, where you can configure the object history, or data retention policy, along with a few other settings.

vrops15

Finally, go back to Home by clicking the house icon.  By now the Health, Risk, and Efficiency badges should all be colored (ideally green, but your results may vary).  This is the final indication that vROps is collecting.

vrops16

 

Next Post: Add The vROps License, Configuring LDAP, and The SMTP For vRealize Operations Manager (vROps)

Recent Post: Sizing and Installing The VMware vRealize Operations (vROps) Appliance

Sizing and Installing The VMware vRealize Operations (vROps) Appliance

VMware has a sizing guide that will help you determine how many appliances you need to deploy.  If you have multiple data centers, somewhere north of 200 hosts, and more than 5,000 VMs, I’d recommend starting out with at least two nodes configured as Large deployments.  Once you get the built-in vSphere adapter collecting for each environment, you can run an audit using vROps to get the raw numbers and expand the cluster accordingly.  Come prepared: walk through your environments and get a list of how many hosts, datastores, and vCenters you have, plus a rough count of the virtual machines deployed.
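
A quick PowerCLI pass against each vCenter will give you those rough counts to plug into the sizing worksheet:

#Rough per-vCenter inventory counts for the vROps sizing worksheet
Connect-VIServer "<vcenter-name>"
"Hosts:      {0}" -f (Get-VMHost).Count
"Datastores: {0}" -f (Get-Datastore).Count
"VMs:        {0}" -f (Get-VM).Count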

KB 2093783 has more details on sizing, and I strongly urge you to visit the KB, as it links to the latest releases of vROps, and each KB has a sizing guide attachment at the bottom where you can enter the information you collected from your environment to get a more accurate size.

_________________________________________________________________

Appliance Manual Installation

________________________________________________________________

Architectural Note

  • Before proceeding be sure you have:
    • The appropriate host resources
    • The appropriate storage
    • IP addresses assigned and entered into DNS
    • A “read-only” account configured in AD and vCenter
    • The appropriate ports opened between data centers, as listed in VMware’s documentation

_________________________________________________________________

Once you have the latest edition of the vROps appliance OVF downloaded, and after consulting the documentation, use either the vSphere client or the web client to deploy the OVF template.  I’ll skip through browsing for, verifying the details of, accepting the license agreement for, and naming the appliance.

So now you’ve come to the OVF deployment step where you must select the size of your appliance.  No matter the size, the remainder of the deployment is the same, but for this example I will deploy an appliance as Large.

You can deploy the appliance in several sizing configurations depending on the size of your environment: Extra Small, Small, Medium, and Large.

  • Extra Small = 2 vCPUs and 8GB of memory
  • Small = 4 vCPUs and 16GB of memory
  • Medium = 8 vCPUs and 32GB of memory
  • Large = 16 vCPUs and 48GB of memory

You can also choose to deploy a remote collector, which comes in two sizes:

  • Standard = 2 vCPUs and 4GB of memory
  • Large = 4 vCPUs and 16GB of memory

vrops1

You will notice that with each selection, VMware has given a definition of what it entails.  Choose the one that best suits your needs, then click Next.

Storage dialog

  • Depending on the size of your environment, vROps VMs can each grow to over a terabyte in size
  • Once you’ve made your selection click next
  • Architectural Note – If adding a master replica node to your vROps cluster, I’d recommend keeping the master and master replica on separate XIVs, or whatever you use to serve up storage to your environment.

Disk Format dialog

  • The default is Lazy Zeroed, and that’s how my environments have been deployed.  I’d strongly advise against using thin provisioning for this appliance.
  • Once you’ve made your selection click next.

Network Mapping dialog

  • Select the appropriate destination network, such as a management network, from which it can reach your hosts, VMs, vCenters, and datastores.
  • Once you’ve made your selection click next.

Properties dialog

  • Here you can set the Timezone for the appliance, and choose whether to use IPv6
  • Once you’ve filled out the network information, click next

Configuration Verification dialog

  • Read it carefully to be sure there were no fat fingers at play.  Click finish when ready.

_________________________________________________________________

Before you power on the appliance, you may want to take the opportunity to expand its disk.  This can be done a couple of ways.  You can expand the existing Hard Disk 2, but keep in mind that the current file system can only see disks under 2TB; any disk space allocated over 2TB the appliance won’t be able to see.  For my production environment, I increased disk 2 to 1TB in size and then added 500GB disks as more storage was needed.  Also keep in mind the amount of data you are going to retain.  My appliances are configured for six months of retention, but this can be changed as needs change; we’ll go over this in another post.  The cool thing about this appliance is that as you increase the size of disk 2, or add additional disks, it automatically expands its data partition during the power-on process.
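
Either change can also be made from PowerCLI before the first power-on; a quick sketch, assuming the appliance VM is named vrops-01 (a hypothetical name) and keeping each disk under the 2TB limit mentioned above:

#Grow the existing data disk (Hard disk 2) to 1TB
Get-VM "vrops-01" | Get-HardDisk -Name "Hard disk 2" | Set-HardDisk -CapacityGB 1024 -Confirm:$false

#Or add another 500GB disk later when more retention space is needed
New-HardDisk -VM (Get-VM "vrops-01") -CapacityGB 500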

Power up the appliance and open a console to it in vCenter to watch it boot and go through some scripted configuration.

vrops4

  • To log in, press ALT + F1.  Enter root for the user, leave the password blank, and hit Enter.  You will then be prompted for the current password, so leave it blank and hit Enter again.  Now enter a new password, hit Enter, and enter the new password once more for verification.
  • Depending on how locked down your environment is you may not be able to, but I always ping out to 8.8.8.8, along with a few internal servers, to verify the network settings.
  • Also, unless you really enjoy VMware’s console, I’d recommend running a couple of commands to turn on SSH so any future administrative tasks can be performed over a PuTTY session.
    • The first command is:  # chkconfig sshd on
      • This enables the sshd service at system boot
    • The second command is: # service sshd start
      • This turns on the sshd service so you can connect to the box with a PuTTY session.

_________________________________________________________________

Using Microsoft Edge, Firefox, or Chrome, browse to <https://vrops-appliance-name/>.  This will redirect you to the Getting Started page, where you can choose from three options:

Express Installation, where you can set the admin password and that’s pretty much it.

vrops8

New Installation gives you a few more options to configure, like which NTP server(s) you want to use, and a TLS/SSL certificate you’ve created specifically for this system (or just use the built-in one).

vrops5

Expand An Existing Installation – this option is used for additional data nodes or remote collectors, which you’ll pick under “node type”.

vrops6

For this installation we will select New Installation.  As a rule of thumb, and for better appliance performance, use the NTP servers on your network that vCenter and the ESXi hosts are already using to keep time in check.  Once you’ve made it through the wizard, click Finish.

vrops9

It shouldn’t take too long for the master appliance to set up and take you to a login screen.

You’re not done yet, however.  You still have to configure your cluster if you have additional data nodes and remote collectors to add.  If you have a master replica, data nodes, or remote collectors, get them connected to the master.  Each has its own web UI at <https://vrops-appliance-name/>, only this time you use the Expand An Existing Installation option.  You will also need to log into the admin section for some of this: <https://vrops-appliance-name/admin/login.action>.

Let’s get the master replica added first.  When you use the Expand An Existing Installation option, you’ll need to add it as a data node, and then wait for the cluster to expand to it.

vrops18

Then click the Finish Adding New Nodes button.

vrops19

To enable HA, you’ll notice in the center of the screen that there is a High Availability option, but it is disabled.  Go ahead and click Enable.

vrops20

Now select the data node that will become the master replica, make sure Enable High Availability is checked, and click OK.  This part will take a little while, and the cluster services will be restarted.  Afterward, the High Availability status will show as Enabled.

vrops21

Add any remaining data nodes and remote collectors using the Expand An Existing Installation Option.

_________________________________________________________________

Architectural Note

  • I’d recommend going into vCenter and adding an anti-affinity rule to keep the master and master replica on separate hosts (see the PowerCLI sketch after this note)
  • If you’ve deployed vROps to its own host cluster, I’d recommend turning vSphere DRS down to conservative.  The appliances are usually pretty busy in an active environment, and having one vMotion on you can cause cluster performance degradation and throw some interesting alarms within vROps.  It will recover on its own, but it’s better to avoid this when possible.
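
A minimal sketch of that anti-affinity rule in PowerCLI, with the cluster and node VM names as assumptions:

#Keep the vROps master and master replica nodes on separate hosts
New-DrsRule -Cluster (Get-Cluster "<vrops-cluster>") -Name "Separate vROps master and replica" -KeepTogether:$false -VM (Get-VM "vrops-master","vrops-replica")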

_________________________________________________________________

Next up – you’ll need to configure the built-in vSphere adapter so you can start collecting data.  I’ll have more on that in my next post.

Next Post: Configuring VMware vRealize Operations Manager Adapters For Data Collection

Recent Post: What Is VMware’s vRealize Operations Manager?