Introduction

The other day I was upgrading NSX(T) from 3.2.3.1 to 4.1.1.0 on one of our environments. Everything went fine until the first real stage of the infrastructure upgrade. As you know, the first step in upgrading your NSX fabric is upgrading the NSX Edge Transport Nodes. This stage is normally uneventful, but on this specific environment it was not. Let’s dig in and fix the issue! The error message that I received was the following:

Pnic status of the edge transport node XXXXX-a028-4378-b1f4-XXXXXXXX is DOWN.,Overall status of the edge transport node XXXXX-a028-4378-b1f4-XXXXXXXX is DOWN.,Edge node XXXXX-a028-4378-b1f4-XXXXXXXX , has errors Errors = [{"moduleName":"upgrade-coordinator","errorCode":30201,"errorMessage":"Pnic status of the edge transport node XXXXX-a028-4378-b1f4-XXXXXXXX is DOWN."}, {"moduleName":"upgrade-coordinator","errorCode":30212,"errorMessage":"Overall status of the edge transport node XXXXX-a028-4378-b1f4-XXXXXXXX is DOWN."}, ] after state sync wait.
NSX upgrade 4.1.1.0 visual error
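
The error above packs several entries into one line. A minimal sketch of pulling the individual error codes out of it, assuming the message keeps the shape shown here (the function name `parse_upgrade_errors` is hypothetical, not part of any NSX tooling). A regular expression is used instead of a JSON parser because the list ends in `, ]`, which is not valid JSON:

```python
import re

# One match per {"moduleName": ..., "errorCode": ..., "errorMessage": ...} entry.
# Written against the single message quoted in this post; other NSX versions
# may format the message differently (assumption).
ENTRY_RE = re.compile(
    r'"moduleName":"(?P<module>[^"]+)",'
    r'"errorCode":(?P<code>\d+),'
    r'"errorMessage":"(?P<message>[^"]*)"'
)


def parse_upgrade_errors(raw: str) -> list[dict]:
    """Return one dict per error entry found in the raw message."""
    return [
        {"module": m["module"], "code": int(m["code"]), "message": m["message"]}
        for m in ENTRY_RE.finditer(raw)
    ]


raw = (
    'Edge node XXXXX-a028-4378-b1f4-XXXXXXXX , has errors Errors = ['
    '{"moduleName":"upgrade-coordinator","errorCode":30201,'
    '"errorMessage":"Pnic status of the edge transport node '
    'XXXXX-a028-4378-b1f4-XXXXXXXX is DOWN."}, '
    '{"moduleName":"upgrade-coordinator","errorCode":30212,'
    '"errorMessage":"Overall status of the edge transport node '
    'XXXXX-a028-4378-b1f4-XXXXXXXX is DOWN."}, '
    '] after state sync wait.'
)

errors = parse_upgrade_errors(raw)
for entry in errors:
    print(entry["code"], entry["message"])
```

This makes it easier to see that the upgrade coordinator reports two separate conditions: the Pnic status (30201) and the overall node status (30212).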

Troubleshooting

Looking at the Edge Transport Node from the UI we can actually see the following message:

Host configuration: Caught MessagingException during host config stage. [TN=TransportNode/XXXXXX-a028-4378-b1f4-XXXXXXX]. Reason: MessagingException
NSX upgrade 4.1.1.0 Edge visual error

That is not a lot to go on, unfortunately. Before changing anything I checked the Edge VM in vCenter Server, and it did appear to be running the 4.1.1.0 OS version. As a first step I put the NSX Edge back into Maintenance Mode under Actions -> Enter NSX Maintenance Mode and rebooted the virtual machine. Since this did nothing, I also ran Actions -> Sync Edge Configuration, which essentially resyncs the configuration present on the NSX Manager to the NSX Edge Transport Node.

At this point I usually log in to the Edge VM itself over SSH. Once logged in, you can get the upgrade status with the following command, which yielded the following result on my Edge:

edge-tn-03> get upgrade progress-status 
****************************************************************************
Node Upgrade has been started. Please do not make any changes, until 
the upgrade operation is complete. Run "get upgrade progress-status"
to show the progress of last upgrade step.
****************************************************************************

Tue Oct 10 2023 UTC 19:09:36.008
Upgrade info:
From-version: 3.2.3.1.0.22104642
To-version: 4.1.1.0.0.22224325

Upgrade steps:
download_os [2023-10-10 15:36:20 - 2023-10-10 15:38:16] SUCCESS
11-preinstall-enter_maintenance_mode [2023-10-10 15:38:21 - 2023-10-10 15:39:31] SUCCESS
install_os [2023-10-10 15:40:20 - 2023-10-10 15:44:54] SUCCESS
switch_os [2023-10-10 15:44:57 - 2023-10-10 15:45:26] SUCCESS
post_power_on [2023-10-10 15:48:15 - 2023-10-10 15:48:27] SUCCESS
migrate_users [2023-10-10 15:48:31 - 2023-10-10 15:48:36] SUCCESS
41-postboot-exit_maintenance_mode [2023-10-10 15:54:55 - 2023-10-10 16:00:53] SUCCESS
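
If you save this output to a file, the step lines are easy to check in bulk. A minimal sketch, assuming the `name [start - end] STATUS` layout shown above stays the same (the helper name `parse_steps` is hypothetical):

```python
import re
from datetime import datetime

TS = r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}"
# Layout assumed from the output in this post: name [start - end] STATUS
STEP_RE = re.compile(
    rf"^(?P<name>\S+) \[(?P<start>{TS}) - (?P<end>{TS})\] (?P<status>\S+)\s*$"
)
FMT = "%Y-%m-%d %H:%M:%S"


def parse_steps(text: str) -> list[tuple[str, int, str]]:
    """Return (step name, duration in seconds, status) for each step line."""
    steps = []
    for line in text.splitlines():
        m = STEP_RE.match(line.strip())
        if not m:
            continue  # banner, timestamp and header lines are skipped
        start = datetime.strptime(m["start"], FMT)
        end = datetime.strptime(m["end"], FMT)
        steps.append((m["name"], int((end - start).total_seconds()), m["status"]))
    return steps


output = """\
Upgrade steps:
download_os [2023-10-10 15:36:20 - 2023-10-10 15:38:16] SUCCESS
11-preinstall-enter_maintenance_mode [2023-10-10 15:38:21 - 2023-10-10 15:39:31] SUCCESS
install_os [2023-10-10 15:40:20 - 2023-10-10 15:44:54] SUCCESS
switch_os [2023-10-10 15:44:57 - 2023-10-10 15:45:26] SUCCESS
post_power_on [2023-10-10 15:48:15 - 2023-10-10 15:48:27] SUCCESS
migrate_users [2023-10-10 15:48:31 - 2023-10-10 15:48:36] SUCCESS
41-postboot-exit_maintenance_mode [2023-10-10 15:54:55 - 2023-10-10 16:00:53] SUCCESS
"""

steps = parse_steps(output)
not_ok = [name for name, _, status in steps if status != "SUCCESS"]
for name, seconds, status in steps:
    print(f"{name:40s} {seconds:4d}s {status}")
print("steps not in SUCCESS:", not_ok)
```

For this Edge the list of non-successful steps is empty, which is exactly why the upgrade status alone does not explain the failure.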

This was actually looking good. Since the errors in the previous screens mention that the PNICs are down, the next step is to have a look at them with the following command:

edge-tn-03> get physical-port
****************************************************************************
Node Upgrade has been started. Please do not make any changes, until 
the upgrade operation is complete. Run "get upgrade progress-status"
to show the progress of last upgrade step.
****************************************************************************

Tue Oct 10 2023 UTC 19:43:17.464
% An unexpected error occurred: The dataplane service is in error state, has failed or is disabled

Alright, now we are getting somewhere. It seems the ‘dataplane service’ is down. Let’s check the state of this service.

edge-tn-03> get service dataplane
****************************************************************************
Node Upgrade has been started. Please do not make any changes, until 
the upgrade operation is complete. Run "get upgrade progress-status"
to show the progress of last upgrade step.
****************************************************************************

Tue Oct 10 2023 UTC 19:44:43.709
Service name:      dataplane
Service state:     stopped

Well hello there, why was this so hard to mention in the GUI? Before we go and try to start it manually, let’s dig through the logs with the following command:

get log-file syslog

I found a couple of lines that were worth noting:

2023-10-10T16:07:43.723Z edge-tn-03.XXXX.local dispatcher-systemd-helper 47737 - -  service_dispatcher start fail with error_code 132
2023-10-10T16:07:43.224Z edge-tn-03.XXXX.local dispatcher-systemd-helper 47853 - -  service_dispatcher
2023-10-10T16:07:43.434Z edge-tn-03.XXXX.local dispatcher-systemd-helper 47859 - -  Error response from daemon: No such container: service_dispatcher
2023-10-10T16:07:43.048Z edge-tn-03.XXXX.local systemd 1 - -  nsx-edge-dispatcher.service: Control process exited, code=exited, status=1/FAILURE

2023-10-10T19:50:35.655Z edge-tn-03.XXXX.local datapath-systemd-helper 173966 - -  service_datapath start fail with error_code 0
2023-10-10T19:50:35.584Z edge-tn-03.XXXX.local datapath-systemd-helper 176168 - -  service_datapath

2023-10-10T19:50:12.343Z edge-tn-03.XXXX.local NSX 3267 - [nsx@6876 comp="nsx-edge" subcomp="opsagent" s2comp="edge-service" tid="3304" level="ERROR" errorCode="MPA"EDG0300004""] The operation failed because failed to connect to socket: /var/run/vmware/edge/dpd.ctl
2023-10-10T19:50:12.344Z edge-tn-03.XXXX.local NSX 3267 - [nsx@6876 comp="nsx-edge" subcomp="opsagent" s2comp="edge-service" tid="3304" level="INFO"] empty pnic from physical_port/show
2023-10-10T19:50:12.344Z edge-tn-03.XXXX.local NSX 3267 - [nsx@6876 comp="nsx-edge" subcomp="opsagent" s2comp="edge-service" tid="3304" level="INFO"] Could not get EdgeSystemInfo. Retry in 5 seconds.
2023-10-10T19:50:12.391Z edge-tn-03.XXXX.local nsx-opsagent - - -  2023-10-10T19:50:12Z|00915|unixctl|WARN|failed to connect to /var/run/vmware/edge/dpd.ctl
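
One possible reading of the `error_code` values in the lines above, which the post itself does not confirm: by common shell convention an exit status above 128 means "terminated by signal (status - 128)", and 132 - 128 = 4 is SIGILL (illegal instruction). Whether the NSX helper scripts follow that convention is an assumption. A minimal sketch that extracts the codes and decodes them this way (the names `FAIL_RE` and `describe_exit_code` are hypothetical):

```python
import re
import signal

# Matches the "<service> start fail with error_code <n>" lines quoted above.
FAIL_RE = re.compile(r"(?P<service>\S+) start fail with error_code (?P<code>\d+)")


def describe_exit_code(code: int) -> str:
    """Interpret an exit status using the shell convention 128 + signal number."""
    if code > 128:
        try:
            return f"killed by {signal.Signals(code - 128).name}"
        except ValueError:
            return f"exit status {code} (unknown signal {code - 128})"
    return f"exit status {code}"


syslog_lines = [
    "2023-10-10T16:07:43.723Z edge-tn-03.XXXX.local dispatcher-systemd-helper "
    "47737 - -  service_dispatcher start fail with error_code 132",
    "2023-10-10T19:50:35.655Z edge-tn-03.XXXX.local datapath-systemd-helper "
    "173966 - -  service_datapath start fail with error_code 0",
]

findings = []
for line in syslog_lines:
    m = FAIL_RE.search(line)
    if m:
        code = int(m["code"])
        findings.append((m["service"], code, describe_exit_code(code)))

for service, code, meaning in findings:
    print(service, code, meaning)
```

If that reading holds, a process dying on an illegal instruction would be consistent with a binary built for a newer CPU than the one it runs on.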

edge-tn-03> start service dataplane
****************************************************************************
Node Upgrade has been started. Please do not make any changes, until 
the upgrade operation is complete. Run "get upgrade progress-status"
to show the progress of last upgrade step.
****************************************************************************

edge-tn-03> get service dataplane
****************************************************************************
Node Upgrade has been started. Please do not make any changes, until 
the upgrade operation is complete. Run "get upgrade progress-status"
to show the progress of last upgrade step.
****************************************************************************

Tue Oct 10 2023 UTC 20:01:22.259
Service name:      dataplane
Service state:     running

edge-tn-03> get phy fp-eth0
% Invalid value for argument <physical-port-name>: fp-eth0
<physical-port-name>: String representation.

edge-tn-03> get physical-port
****************************************************************************
Node Upgrade has been started. Please do not make any changes, until 
the upgrade operation is complete. Run "get upgrade progress-status"
to show the progress of last upgrade step.
****************************************************************************

Tue Oct 10 2023 UTC 20:01:52.060
% An unexpected error occurred: The dataplane service is in error state, has failed or is disabled
edge-tn-03> get service dataplane
****************************************************************************
Node Upgrade has been started. Please do not make any changes, until 
the upgrade operation is complete. Run "get upgrade progress-status"
to show the progress of last upgrade step.
****************************************************************************

Tue Oct 10 2023 UTC 20:02:15.332
Service name:      dataplane
Service state:     stopped

edge-tn-03> get services
****************************************************************************
Node Upgrade has been started. Please do not make any changes, until 
the upgrade operation is complete. Run "get upgrade progress-status"
to show the progress of last upgrade step.
****************************************************************************

Tue Oct 10 2023 UTC 20:07:29.838
Service name:                            dataplane           
Service state:                           stopped             

Service name:                            dhcp                
Service state:                           stopped             

Service name:                            dispatcher          
Service state:                           running             

Service name:                            docker              
Service state:                           running             

Service name:                            ipsecvpn            
Service state:                           stopped             

Service name:                            liagent             
Service state:                           stopped             

Service name:                            local-controller    
Service state:                           running             

Service name:                            metadata-proxy      
Service state:                           stopped             

Service name:                            nestdb              
Service state:                           running             

Service name:                            node-mgmt           
Service state:                           running             

Service name:                            node-stats          
Service state:                           running             

Service name:                            nsx-control-plane-agent
Service state:                           running             

Service name:                            nsx-message-bus     
Service state:                           stopped             

Service name:                            nsx-opsagent        
Service state:                           running             

Service name:                            nsx-platform-client 
Service state:                           running             

Service name:                            nsx-upgrade-agent   
Service state:                           running             

Service name:                            ntp                 
Service state:                           running             
Start on boot:                           True                

Service name:                            router              
Service state:                           stopped             

Service name:                            router-config       
Service state:                           running             

Service name:                            security-hub        
Service state:                           stopped             

Service name:                            snmp                
Service state:                           stopped             
Start on boot:                           False               

Service name:                            ssh                 
Service state:                           running             
Start on boot:                           True                
Root login:                              enabled             

Service name:                            ssh-in-lr           
Service state:                           stopped             

Service name:                            syslog              
Service state:                           running
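
The `get services` output is long, so it can help to reduce it to just the stopped services. A minimal sketch, assuming the `key: value` layout shown above and using a shortened excerpt of it (the function name `parse_services` is hypothetical):

```python
def parse_services(text: str) -> dict[str, dict[str, str]]:
    """Group the 'key: value' lines of `get services` output per service name."""
    services: dict[str, dict[str, str]] = {}
    current = None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank lines and banner lines without a colon
        key, value = key.strip(), value.strip()
        if key == "Service name":
            current = value
            services[current] = {}
        elif current is not None:
            # Attributes such as "Service state" or "Start on boot"
            services[current][key] = value
    return services


# Shortened excerpt of the output above
excerpt = """\
Service name:                            dataplane
Service state:                           stopped

Service name:                            dispatcher
Service state:                           running

Service name:                            nsx-message-bus
Service state:                           stopped

Service name:                            ntp
Service state:                           running
Start on boot:                           True
"""

services = parse_services(excerpt)
stopped = [name for name, attrs in services.items()
           if attrs.get("Service state") == "stopped"]
print("stopped:", stopped)
```

Not every stopped service is a problem (features such as dhcp or ipsecvpn may simply be unused), so the result is a starting point for investigation rather than a verdict.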

As you can see, it didn’t really matter what we did: even when the dataplane service was briefly running, the Edge would simply not work. At this point I looked around a bit more and found a KB article that mentions a bug in this specific NSX(T) version. It took me a while to find, since nothing in the logging really pointed towards it. Now that we know what is happening, we need to change the target upgrade version to 4.1.2.x; I changed it to 4.1.2.3.0.

The issue itself is a fault that slipped into the NSX code in the 4.1.1.0 release. You would never notice it if you are running on newer CPUs, but this specific environment runs on older CPUs.

VMware explained the bug as follows:

In NSX 4.1.1 due to the DPDK upgrade on Edges, the machine type for building the DPDK was inadvertently set to ‘native’ rather than the previously used ‘corei7’. This restricted the supported CPU types for the Edge VMs and BMEs to newer CPUs only.

I suggest reading up on the DPDK machine types that can be defined if you want to know more. As I understand it, the machine type determines which CPU instruction sets the build may use. The native machine type targets the instruction set of the machine the code was built on, which in this case was something other than the Sandy Bridge, Ivy Bridge, and Westmere CPUs. VMware also mentions this in the release notes for 4.1.2!
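
To make the idea of a "missing instruction set" concrete, here is a minimal sketch that compares the CPU flags in Linux `/proc/cpuinfo` text against a set of required flags. The source does not say which instructions the 4.1.1 build actually needs; `avx2` is used purely as an illustrative assumption, chosen because Sandy Bridge and Ivy Bridge CPUs do not support it. The sample flags line and the function names are hypothetical as well:

```python
def cpu_flags(cpuinfo_text: str) -> set[str]:
    """Return the flags of the first CPU listed in /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def missing_flags(cpuinfo_text: str, required: set[str]) -> list[str]:
    """Return the required flags this CPU does not advertise, sorted."""
    return sorted(set(required) - cpu_flags(cpuinfo_text))


# ASSUMPTION: illustrative only, not taken from VMware documentation.
REQUIRED = {"avx2"}

# Made-up, shortened flags line resembling an older CPU (AVX but no AVX2).
SAMPLE_CPUINFO = """\
processor : 0
model name : Example older CPU
flags : fpu sse sse2 ssse3 sse4_1 sse4_2 avx
"""

print("missing:", missing_flags(SAMPLE_CPUINFO, REQUIRED))
```

On a real Linux machine you would pass the contents of `/proc/cpuinfo` instead of the sample text.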

Solution

You can redo your NSX upgrade by uploading a new NSX(T) .mub file. The Upgrade Coordinator will swap the bits and restart the upgrade from the beginning, as can be seen below:

NSX Upgrade re-do with new .mub file

In theory, you should be able to go through all of the upgrade steps once more to complete your upgrade. However, as you might have guessed, this did not work for us! Since the first Edge Transport Node was already upgraded to 4.1.1.0, it would not let me upgrade it once more to 4.1.2.3. At this point there are several things that can be done.

  1. Phase out the specific Edge Transport Node by replacing it in the Edge Cluster with another Edge Transport Node, if you have one available. If not, you can deploy a fresh one and use that.
Replace Edge Transport node in an Edge Cluster
  2. Remove the Edge Transport Node from the Edge Cluster altogether. This cannot be done if you have logical routers active on the Edge node, as you can see below:
NSX Remove Edge Transport node from Edge Cluster
  3. Redeploy the Edge Node. This can only be done from the API. My co-blogger wrote a fine blog post about that if you wish to do this.

In my situation I tried option 1 (deploying a new Edge Transport Node) and option 3, both of which worked fine. Once this was done, I could re-upload the .mub file and this time it went through. Once the pre-checks were done, the upgrade stages all completed without any issues, resulting in a healthy NSX 4.1.2.3 environment.
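
For option 3, the post only says the redeploy goes through the API. As a rough sketch of what such a call could look like: to my understanding the NSX REST API exposes a redeploy action on the transport node resource (`POST /api/v1/transport-nodes/<node-id>?action=redeploy`) that takes the node's current configuration as its body, but treat the endpoint, the manager hostname, the node ID and the function name below as assumptions to verify against the API reference for your version. The sketch only builds the request; authentication and sending it are left out on purpose:

```python
import json
import urllib.request


def build_redeploy_request(manager: str, node_id: str,
                           node_config: dict) -> urllib.request.Request:
    """Build (but do not send) a redeploy request for an Edge Transport Node.

    ASSUMPTION: the endpoint path reflects my understanding of the NSX REST
    API; check it against the API reference before use.
    """
    url = f"https://{manager}/api/v1/transport-nodes/{node_id}?action=redeploy"
    body = json.dumps(node_config).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical placeholders; the real body would be the configuration returned
# by a GET on the same transport node.
MANAGER = "nsx-manager.example.local"
NODE_ID = "00000000-0000-0000-0000-000000000000"
node_config = {"id": NODE_ID, "display_name": "edge-tn-03"}

req = build_redeploy_request(MANAGER, NODE_ID, node_config)
print(req.get_method(), req.full_url)
```

Keeping the request construction separate from sending it makes it easy to review the exact URL and body before anything touches the NSX Manager.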

I hope you found this post useful!


Bryan van Eeden

Bryan is an ambitious and seasoned IT professional with almost a decade of experience in designing, building and operating complex (virtual) IT environments. In his current role he tackles customers, complex issues and design questions on a daily basis. Bryan holds several certifications such as VCIX-DCV, VCAP-DCA, VCAP-DCD, V(T)SP and vSAN and vCloud Specialist badges.
