Introduction
The other day I was doing an NSX(T) upgrade from 3.2.3.1 to 4.1.1.0 on one of our environments. Everything went fine until the first stage in which you actually start upgrading the NSX infrastructure: upgrading the NSX Edge Transport Nodes. This stage is normally not an issue, but this time it was on this specific environment. Let’s dig in and fix it! The error message that I received was the following:
Pnic status of the edge transport node XXXXX-a028-4378-b1f4-XXXXXXXX is DOWN.,Overall status of the edge transport node XXXXX-a028-4378-b1f4-XXXXXXXX is DOWN.,Edge node XXXXX-a028-4378-b1f4-XXXXXXXX , has errors Errors = [{"moduleName":"upgrade-coordinator","errorCode":30201,"errorMessage":"Pnic status of the edge transport node XXXXX-a028-4378-b1f4-XXXXXXXX is DOWN."}, {"moduleName":"upgrade-coordinator","errorCode":30212,"errorMessage":"Overall status of the edge transport node XXXXX-a028-4378-b1f4-XXXXXXXX is DOWN."}, ] after state sync wait.
Troubleshooting
Looking at the Edge Transport Node from the UI we can actually see the following message:
Host configuration: Caught MessagingException during host config stage. [TN=TransportNode/XXXXXX-a028-4378-b1f4-XXXXXXX]. Reason: MessagingException
That is unfortunately not a lot to go on. Before doing anything I checked the Edge VM in vCenter Server, and the 4.1.1.0 OS version did seem to be running on the VM. As a first attempt I put the NSX Edge back into Maintenance Mode under Actions -> Enter NSX Maintenance Mode and rebooted the virtual machine. Since this did nothing, I also did a configuration sync with Actions -> Sync Edge Configuration, which essentially resyncs the configuration that is present on the NSX Manager to the NSX Edge Transport Node.
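If you prefer the API, the same configuration resync can be triggered from there as well. A minimal sketch, assuming the resync_host_config action on the transport-nodes endpoint is available in your release (check the API guide for your version); the manager FQDN and node UUID below are placeholders:

# Placeholders: replace with your NSX Manager FQDN and the Edge Transport Node UUID.
NSX_MGR="nsx-manager.lab.local"
NODE_ID="XXXXX-a028-4378-b1f4-XXXXXXXX"

# Resync the configuration held by the NSX Manager to the Edge Transport Node.
# Assumption: the resync_host_config action exists in this NSX version.
curl -k -u admin -X POST \
  "https://${NSX_MGR}/api/v1/transport-nodes/${NODE_ID}?action=resync_host_config"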
At this point I usually log in to the Edge VM itself with SSH. Once logged in you can get the upgrade status with the following command, which yielded this result on my Edge:
edge-tn-03> get upgrade progress-status
****************************************************************************
Node Upgrade has been started. Please do not make any changes, until the upgrade operation is complete. Run "get upgrade progress-status" to show the progress of last upgrade step.
****************************************************************************
Tue Oct 10 2023 UTC 19:09:36.008
Upgrade info:
From-version: 3.2.3.1.0.22104642
To-version: 4.1.1.0.0.22224325
Upgrade steps:
download_os [2023-10-10 15:36:20 - 2023-10-10 15:38:16] SUCCESS
11-preinstall-enter_maintenance_mode [2023-10-10 15:38:21 - 2023-10-10 15:39:31] SUCCESS
install_os [2023-10-10 15:40:20 - 2023-10-10 15:44:54] SUCCESS
switch_os [2023-10-10 15:44:57 - 2023-10-10 15:45:26] SUCCESS
post_power_on [2023-10-10 15:48:15 - 2023-10-10 15:48:27] SUCCESS
migrate_users [2023-10-10 15:48:31 - 2023-10-10 15:48:36] SUCCESS
41-postboot-exit_maintenance_mode [2023-10-10 15:54:55 - 2023-10-10 16:00:53] SUCCESS
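As a side note: the admin login shell on an Edge is the NSX CLI itself, so (assuming you can authenticate) you can usually run these status checks non-interactively over SSH, which is handy when you have a whole cluster of Edges to verify. A rough sketch with hypothetical hostnames:

# Hypothetical Edge hostnames; adjust to your environment.
for edge in edge-tn-01 edge-tn-02 edge-tn-03; do
  echo "=== ${edge} ==="
  # The admin login shell is the NSX CLI, so the command runs directly on the Edge.
  ssh admin@"${edge}" "get upgrade progress-status"
done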
The upgrade steps themselves were actually looking good. Since the error messages above mention that all PNICs are down, the next step is to have a look at them with the following command:
edge-tn-03> get physical-port
****************************************************************************
Node Upgrade has been started. Please do not make any changes, until the upgrade operation is complete. Run "get upgrade progress-status" to show the progress of last upgrade step.
****************************************************************************
Tue Oct 10 2023 UTC 19:43:17.464
% An unexpected error occurred: The dataplane service is in error state, has failed or is disabled
Alright, now we are getting somewhere. It seems the dataplane service is down. Let’s check the state of this service.
edge-tn-03> get service dataplane
****************************************************************************
Node Upgrade has been started. Please do not make any changes, until the upgrade operation is complete. Run "get upgrade progress-status" to show the progress of last upgrade step.
****************************************************************************
Tue Oct 10 2023 UTC 19:44:43.709
Service name: dataplane
Service state: stopped
Well hello there, why was this so hard to mention in the GUI? Before we go and try to start it manually, let’s dig through the logs with the following command:
get log-file syslog
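The syslog on an Edge is rather long. If output filtering is available in your CLI version (I believe the NSX CLI supports piping to find, but treat this as an assumption), you can narrow the output down, for example:

get log-file syslog | find datapath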
Scrolling through the output, I found a couple of lines that were useful:
2023-10-10T16:07:43.723Z edge-tn-03.XXXX.local dispatcher-systemd-helper 47737 - - service_dispatcher start fail with error_code 132
2023-10-10T16:07:43.224Z edge-tn-03.XXXX.local dispatcher-systemd-helper 47853 - - service_dispatcher
2023-10-10T16:07:43.434Z edge-tn-03.XXXX.local dispatcher-systemd-helper 47859 - - Error response from daemon: No such container: service_dispatcher
2023-10-10T16:07:43.048Z edge-tn-03.XXXX.local systemd 1 - - nsx-edge-dispatcher.service: Control process exited, code=exited, status=1/FAILURE
2023-10-10T19:50:35.655Z edge-tn-03.XXXX.local datapath-systemd-helper 173966 - - service_datapath start fail with error_code 0
2023-10-10T19:50:35.584Z edge-tn-03.XXXX.local datapath-systemd-helper 176168 - - service_datapath
2023-10-10T19:50:12.343Z edge-tn-03.XXXX.local NSX 3267 - [nsx@6876 comp="nsx-edge" subcomp="opsagent" s2comp="edge-service" tid="3304" level="ERROR" errorCode="MPAEDG0300004"] The operation failed because failed to connect to socket: /var/run/vmware/edge/dpd.ctl
2023-10-10T19:50:12.344Z edge-tn-03.XXXX.local NSX 3267 - [nsx@6876 comp="nsx-edge" subcomp="opsagent" s2comp="edge-service" tid="3304" level="INFO"] empty pnic from physical_port/show
2023-10-10T19:50:12.344Z edge-tn-03.XXXX.local NSX 3267 - [nsx@6876 comp="nsx-edge" subcomp="opsagent" s2comp="edge-service" tid="3304" level="INFO"] Could not get EdgeSystemInfo. Retry in 5 seconds.
2023-10-10T19:50:12.391Z edge-tn-03.XXXX.local nsx-opsagent - - - 2023-10-10T19:50:12Z|00915|unixctl|WARN|failed to connect to /var/run/vmware/edge/dpd.ctl
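These messages point at the underlying systemd units. Since root SSH login is enabled on this Edge (you can see Root login: enabled in the service overview further down), you can also dig a little deeper from the root shell. A minimal sketch; the exact unit names are an assumption on my part based on the nsx-edge-dispatcher.service name in the syslog above, so list the units first:

# List the NSX Edge related units to find the exact dispatcher/datapath unit names
# (assumption: they follow the nsx-edge-* naming seen in the syslog above).
systemctl list-units 'nsx-edge-*'

# Then check the state and recent journal entries of the failing unit, for example:
systemctl status nsx-edge-dispatcher.service
journalctl -u nsx-edge-dispatcher.service --since "1 hour ago"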
Back in the NSX CLI, let’s try to start the dataplane service manually and see what happens:

edge-tn-03> start service dataplane
****************************************************************************
Node Upgrade has been started. Please do not make any changes, until the upgrade operation is complete. Run "get upgrade progress-status" to show the progress of last upgrade step.
****************************************************************************

edge-tn-03> get service dataplane
****************************************************************************
Node Upgrade has been started. Please do not make any changes, until the upgrade operation is complete. Run "get upgrade progress-status" to show the progress of last upgrade step.
****************************************************************************
Tue Oct 10 2023 UTC 20:01:22.259
Service name: dataplane
Service state: running

edge-tn-03> get phy fp-eth0
% Invalid value for argument <physical-port-name>: fp-eth0
<physical-port-name>: String representation.

edge-tn-03> get physical-port
****************************************************************************
Node Upgrade has been started. Please do not make any changes, until the upgrade operation is complete. Run "get upgrade progress-status" to show the progress of last upgrade step.
****************************************************************************
Tue Oct 10 2023 UTC 20:01:52.060
% An unexpected error occurred: The dataplane service is in error state, has failed or is disabled

edge-tn-03> get service dataplane
****************************************************************************
Node Upgrade has been started. Please do not make any changes, until the upgrade operation is complete. Run "get upgrade progress-status" to show the progress of last upgrade step.
****************************************************************************
Tue Oct 10 2023 UTC 20:02:15.332
Service name: dataplane
Service state: stopped

edge-tn-03> get services
****************************************************************************
Node Upgrade has been started. Please do not make any changes, until the upgrade operation is complete. Run "get upgrade progress-status" to show the progress of last upgrade step.
****************************************************************************
Tue Oct 10 2023 UTC 20:07:29.838
Service name: dataplane
Service state: stopped
Service name: dhcp
Service state: stopped
Service name: dispatcher
Service state: running
Service name: docker
Service state: running
Service name: ipsecvpn
Service state: stopped
Service name: liagent
Service state: stopped
Service name: local-controller
Service state: running
Service name: metadata-proxy
Service state: stopped
Service name: nestdb
Service state: running
Service name: node-mgmt
Service state: running
Service name: node-stats
Service state: running
Service name: nsx-control-plane-agent
Service state: running
Service name: nsx-message-bus
Service state: stopped
Service name: nsx-opsagent
Service state: running
Service name: nsx-platform-client
Service state: running
Service name: nsx-upgrade-agent
Service state: running
Service name: ntp
Service state: running
Start on boot: True
Service name: router
Service state: stopped
Service name: router-config
Service state: running
Service name: security-hub
Service state: stopped
Service name: snmp
Service state: stopped
Start on boot: False
Service name: ssh
Service state: running
Start on boot: True
Root login: enabled
Service name: ssh-in-lr
Service state: stopped
Service name: syslog
Service state: running
As you can see, it didn’t really matter what we did: even when the dataplane service briefly reported running, it fell back to stopped and the Edge simply would not work. At this point I searched around a bit more and found a KB article describing a bug in this specific NSX(T) version. It took a while to find, since nothing in the logging really pointed towards it. Now that we know what is happening, we need to change the upgrade version to 4.1.2.x; I went with 4.1.2.3.0.
The issue itself is a fault that slipped into the NSX code in the 4.1.1.0 release. You would never notice it if you are running on newer CPUs, but this specific environment runs on older CPUs.
VMware explained the bug as such:
In NSX 4.1.1 due to the DPDK upgrade on Edges, the machine type for building the DPDK was inadvertently set to ‘native’ rather than the previously used ‘corei7’. This restricted the supported CPU types for the Edge VMs and BMEs to newer CPUs only.
I suggest reading the following page if you want to know more about the DPDK machine types that can be defined. As I understand it, the machine type determines which CPU instruction sets the build may use. The native machine type targets the instruction set of the machine the code was built on, which in this case was a CPU newer than the Sandy Bridge, Ivy Bridge, and Westmere generations, so Edges on those older CPUs could no longer run the dataplane. VMware also mentions this in the release notes for 4.1.2!
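Whether you are affected therefore comes down to the CPU generation underneath your Edge VMs. As a rough heuristic (not an exact check of what the DPDK build requires), you can look at the CPU model and instruction-set flags from the Edge root shell; Westmere, Sandy Bridge, and Ivy Bridge CPUs, for example, do not expose AVX2:

# Show the CPU model the Edge VM sees (run from the Edge root shell).
grep -m1 'model name' /proc/cpuinfo

# Rough heuristic: no avx2 flag means a pre-Haswell CPU (Westmere/Sandy Bridge/Ivy Bridge era).
grep -o -w -m1 'avx2' /proc/cpuinfo || echo "no avx2 flag found - likely an older CPU generation"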
Solution
You can redo your NSX upgrade by uploading a new NSX(T) .mub file. The Upgrade Coordinator will swap out the bits and restart the upgrade from the beginning, as can be seen below:
In theory, you should be able to go through all of the upgrade steps once more to complete your upgrade. However, as you might have guessed, this did not work for us! Since the first Edge Transport Node was already upgraded to 4.1.1.0, the Upgrade Coordinator would not let me upgrade it once more to 4.1.2.3. At this point there are several things that can be done:
- Phase out the specific Edge Transport Node by replacing it in the Edge Cluster with another Edge Transport Node (if you have one available). If not, you can deploy a fresh one and use that one.
- Remove the Edge Transport Node from the Edge Cluster altogether. This cannot be done if you have logical routers active on the Edge node as you can see below:
- Redeploy the Edge Node. This can only be done from the API. My co-blogger wrote a fine blogpost for that if you wish to do this.
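For completeness, here is a rough sketch of what option 3 looks like against the API. The usual pattern is to GET the transport node, keep the returned body (including its _revision), and POST it back with the redeploy action; treat this as an outline and follow the API guide or the blogpost mentioned above for the exact procedure in your release. The manager FQDN and node UUID are placeholders:

# Placeholders: replace with your NSX Manager FQDN and the Edge Transport Node UUID.
NSX_MGR="nsx-manager.lab.local"
NODE_ID="XXXXX-a028-4378-b1f4-XXXXXXXX"

# 1. Fetch the current transport node configuration (including _revision).
curl -k -u admin -o node.json \
  "https://${NSX_MGR}/api/v1/transport-nodes/${NODE_ID}"

# 2. Post the same body back with the redeploy action
#    (assumption: the redeploy action is available for Edge nodes in this release).
curl -k -u admin -X POST -H "Content-Type: application/json" -d @node.json \
  "https://${NSX_MGR}/api/v1/transport-nodes/${NODE_ID}?action=redeploy"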
In my situation I tried option 1 (deploying a new Edge Transport Node) as well as option 3, both of which worked fine. Once this was done, I could re-upload the .mub file and the upload went through. After the pre-checks, all the upgrade stages completed without any issues, resulting in a healthy NSX 4.1.2.3 environment.
I hope you found this post useful!