Introduction

If you have worked or are working with NSX for vSphere (NSX-V) like me, you are probably aware of a known bug that is present in some older versions of NSX-V. Even the VMware Interoperability Matrix has a remark on this for all versions. The bug I am talking about is described in KB85070. It shows up as intermittent connectivity issues for virtual machines on North/South and/or East/West traffic on specific ESXi hosts.

Troubleshooting

And you guessed it: a while back we were having issues in one of our environments that was running NSX-V. It turns out that this bug is only present in NSX-V versions 6.4.8 and 6.4.10. Fortunately there is an easy fix, and there are two scenarios that can occur. Usually this bug manifests itself during the upgrade from vSphere ESXi 6.7 to 7.x, but in our case we had already been running 7.x for a long time before it came up. In this blog post I will show you how to find the affected hosts and how to fix them quickly.

Scenario 1:

The vCenter Server database and/or the ESXi hosts in your environment return the wrong output for the following PostgreSQL query and ESXi host command:

Enter the database on the vCenter Server:
/opt/vmware/vpostgres/current/bin/psql -d VCDB -U postgres

Execute the following query:
select opaque_data from vpx_dvs_opaque_data where opaque_data=decode('ff', 'hex');

ESXi host command:
net-dvs -l | grep 'vxlan.vmknic'

The query on the vCenter Server should return zero rows. If it does return rows, please check in with VMware GSS and mention the KB; they need to help you fix this. The ESXi host command should give you the following result:

[root@esx01:~] net-dvs -l | grep 'vxlan.vmknic'
                com.vmware.net.vxlan.vmknic = 0x 1
                com.vmware.net.vxlan.vmknic = 0x 1

If the result is “0x 1” you are OK. If the result displays “0x ff” you are not, and you are hitting the bug mentioned above. Continue reading for the fix!
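The check described above can be sketched as a small shell helper. This is only an illustration: the function name check_vxlan_vmknic is hypothetical, and it assumes the output looks exactly like the sample shown earlier ("0x 1" for healthy, "0x ff" for affected). It reads the command output from stdin, so it can be tried against saved output as well as on a live host.

```shell
# Hypothetical helper (not part of NSX-V or ESXi): classify `net-dvs -l`
# output read from stdin, based on the vxlan.vmknic property lines.
#   prints OK       and returns 0 when no line reads "0x ff"
#   prints AFFECTED and returns 1 when at least one line reads "0x ff"
#   prints UNKNOWN  and returns 2 when no vxlan.vmknic line is present
check_vxlan_vmknic() {
    # Keep only the vxlan.vmknic property lines from the input.
    lines=$(grep 'com.vmware.net.vxlan.vmknic' || true)
    if [ -z "$lines" ]; then
        echo "UNKNOWN"
        return 2
    fi
    # Any "0x ff" value marks the host as hitting the bug.
    if printf '%s\n' "$lines" | grep -qi '= *0x *ff'; then
        echo "AFFECTED"
        return 1
    fi
    echo "OK"
    return 0
}

# Usage on an ESXi host (commented out because net-dvs only exists there):
# net-dvs -l | check_vxlan_vmknic
```

The non-zero return code for an affected host makes it easy to use the helper in a loop or a monitoring check, but the exact codes are just a choice made for this sketch.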

Scenario 2:

Only the ESXi host is displaying the wrong result from the ESXi host command mentioned in the previous step. If this scenario applies to you, you can fix it by rebooting the ESXi host. After the reboot the command should return “0x 1” again.

Since checking this by hand is not doable for hundreds of hosts, I made a little script that checks it for you.

function Get-NSX-V-KB85070-Status {
	#Author: Bryan van Eeden
	#Version: v1
	#This function checks the status of the NSX-V VMKNics with regard to KB85070
	#This provides the function with Debug/Verbose/Confirm/WhatIf parameters.
	[CmdletBinding(SupportsShouldProcess=$True)]
	#Define input parameters
	Param(
		[parameter(Mandatory=$true)]
		$Username,
		[parameter(Mandatory=$true)]
		$Password,
		[parameter(Mandatory=$true)]
		$Cluster
	)

	#Setting variables inside function
	$pswdSec = ConvertTo-SecureString -String $Password -AsPlainText -Force
	$cred = New-Object System.Management.Automation.PSCredential($Username,$pswdSec)
	#The ESXi host command from Scenario 1, executed on every host
	$cmd = "net-dvs -l | grep 'vxlan.vmknic'"
	$NSXVReport = @()

	Get-Cluster -Name $Cluster | Get-VMHost | ForEach-Object {
		#Connect by host name; -AcceptKey accepts the host's SSH key automatically
		$SSHSession = New-SSHSession -ComputerName $_.Name -Credential $cred -AcceptKey
		$result = Invoke-SSHCommand -SSHSession $SSHSession -Command $cmd
		Remove-SSHSession -SSHSession $SSHSession | Out-Null

		#Store the host name and the command output with the tabs removed
		$NSXVReport += [pscustomobject]@{
			ESXiHost = $result.Host
			NSXVStatus = $result.Output.replace("`t", "")
		}
	}
	$NSXVReport | Out-GridView
}

For this script to work you will need the Posh-SSH module installed in your PowerShell library. Because it uses Get-Cluster and Get-VMHost, it also needs VMware PowerCLI and an active connection to your vCenter Server, and SSH has to be enabled on the ESXi hosts. The script creates an SSH session to each of the hosts in the given cluster, executes the command and returns the values we need. After that it shows you a nice list in which you can easily see whether there are hosts that are hitting this bug. So instead of logging into each host manually, you can do it with one script in just a couple of seconds. This is how you can use the script:

Get-NSX-V-KB85070-Status -Cluster "clustername" -Username "username" -Password "password"

Once you reboot the ESXi host, the connectivity issues disappear.

Conclusion

This nasty NSX-V bug causes intermittent network issues for virtual machines on North/South and/or East/West traffic flows, as long as you are using NSX-V modules (Isolated Networks, Routed Networks, NSX Edges etc.). To fix it you can reboot an affected ESXi host. To get an easy overview of which hosts are affected you can use my script above. I hope you found this helpful.


Bryan van Eeden

Bryan is an ambitious and seasoned IT professional with almost a decade of experience in designing, building and operating complex (virtual) IT environments. In his current role he tackles customers, complex issues and design questions on a daily basis. Bryan holds several certifications such as VCIX-DCV, VCAP-DCA, VCAP-DCD, V(T)SP and vSAN and vCloud Specialist badges.
