Introduction

The other week I upgraded both our production and test VMware Cloud Director (VCD) environments from 10.0.x to 10.2.1, the latest version available at the time of writing. The production upgrade went through without any issues; the test environment suffered a failed node, but more on that in a later blog post. After the upgrade everything seemed to be OK. After a couple of days, however, I noticed that our HAProxy, which sits in front of the VCD Cells (it’s a multi-cell VCD environment), started suffering from outages on the console pool in the test environment. A couple of days later the same issue started appearing on the production HAProxy.

Troubleshooting

Once both environments showed the same issue, it was clear that this needed further investigation. The quick fix was to simply restart the “vmware-vcd” service by running “service vmware-vcd restart” on each of the affected VCD Cells.

This did mean that the cell was taken out of the HAProxy back-end pool for a couple of seconds, so sessions that had been established on that specific Cell were of course dropped and had to be restarted or handed over to the remaining VCD Cells. Once we had executed the command, the VCD Cell’s Console Proxy pool in HAProxy was available again.

When we looked back through the logs on the affected VCD Cells, I noticed the following exception in “console-proxy.log”:

Once the above exception was logged, the Cell no longer had a working Console Proxy service. After another couple of days I noticed that the issue was occurring more and more frequently, to the point where we had to restart the VCD service more or less every hour. In the beginning the primary VCD Cell seemed to be spared from the crash, but after a while even the primary cell started crashing.

Since this wasn’t sustainable for us, I decided to ask for help on the VMware Slack channel. I quickly got talking with a lot of other people who had the same issue, and I started troubleshooting with one of the developers to try and tackle it. After a while we found out that the Java thread that holds the Console Proxy service crashes once the above exception is logged in the “console-proxy.log” file. You can check this yourself by looking for the Console Proxy service in the list of network connections with:

If you have a working environment, this should show something along the lines of the following:

Correctly working Console Proxy Service in VMware Cloud Director

In my situation, though, there was no output at all. This means that no service is listening on the Console Proxy port 8443, which tells me that the Java thread has definitely crashed.
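If you want to automate this check, the same question ("is anything accepting connections on port 8443?") can be asked from a small script. The following is only a minimal sketch and not the command used above: the function name `port_is_listening` is made up for this example, and using "localhost" assumes that you run it on the VCD Cell itself and that the Console Proxy is reachable on that address. If your Console Proxy is bound to a specific IP, pass that address instead.

```python
import socket


def port_is_listening(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    # Assumption: run on the VCD Cell itself; 8443 is the Console Proxy port.
    state = "listening" if port_is_listening("localhost", 8443) else "NOT listening"
    print(f"Console Proxy port 8443 is {state}")
```

A check like this only tells you whether the port accepts connections. It says nothing about why the listener is gone, so the “console-proxy.log” file is still the place to look for the exception.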

After some more discussion, the developer quickly found out what was happening. The Console Proxy service in a VCD Cell works as follows:

  1. The VCD Cell(s) listen for IO on port TCP/8443 for any incoming console session requests.
  2. Once IO comes in and the listener accepts the console session, the process dispatches a thread to handle the incoming IO. While the dispatching takes place, the Console Proxy service briefly quiesces the first step, which means the server socket is temporarily closed. Once the dispatch is done, it re-enables the listener.
  3. Rinse and repeat. This way the Console Proxy service is always listening and able to dispatch threads to handle the incoming sessions.

It turned out that in some cases with VCD 10.2.1 the Console Proxy service doesn’t go back to Step 1 when an exception occurs. Since the server socket is then closed permanently, the cell no longer has a working Console Proxy service, and the Java thread doesn’t restart by itself. Another thing to note is that at this time it is not possible to restart only the Console Proxy service, so the aforementioned “service vmware-vcd restart” is needed and is more or less the only quick fix.
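To make the failure mode a bit more tangible, here is a toy model of the accept/dispatch cycle described above. This is not VCD code and makes no claim about how the Console Proxy is actually implemented: the class `ToyConsoleProxy`, the `reenable_on_error` switch and the simulated exception are all invented for illustration. It only shows why a listener that is not re-enabled after an exception stays dead for every later session.

```python
class ToyConsoleProxy:
    """Toy model of the accept/dispatch cycle; NOT actual VCD code."""

    def __init__(self, reenable_on_error: bool):
        self.reenable_on_error = reenable_on_error
        self.listening = True  # Step 1: the server socket is open
        self.handled = 0

    def accept(self, session_ok: bool) -> bool:
        """Try to accept one session; return True if it was handled."""
        if not self.listening:
            return False  # nothing is listening on the port anymore
        self.listening = False  # Step 2: listener is quiesced during dispatch
        try:
            if not session_ok:
                raise RuntimeError("simulated exception during dispatch")
            self.handled += 1
            self.listening = True  # normal path: listener is re-enabled
            return True
        except RuntimeError:
            if self.reenable_on_error:
                self.listening = True  # re-enable despite the exception
            return False


def run(proxy: ToyConsoleProxy, sessions: list) -> list:
    """Feed a sequence of sessions (True = healthy, False = faulty) to the proxy."""
    return [proxy.accept(ok) for ok in sessions]


sessions = [True, False, True, True]
print(run(ToyConsoleProxy(reenable_on_error=False), sessions))  # [True, False, False, False]
print(run(ToyConsoleProxy(reenable_on_error=True), sessions))   # [True, False, True, True]
```

With `reenable_on_error=False`, a single faulty session takes the listener down for good, which matches what we saw on the cells. With `reenable_on_error=True`, only the faulty session is lost and later sessions are accepted again.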

The fix

Fortunately there is a workaround available. What we want is for Step 2 to finish by re-enabling the Console Proxy service listener on port TCP/8443, so that future sessions get accepted again even though there was an exception somewhere in the process. By applying the workaround we can bypass the current default behaviour (not re-enabling the listener). You can apply the workaround by going into each VCD Cell and appending the following to the “global.properties” file, which is located at /opt/vmware/vcloud-director/etc:

Save the file and then restart the “vmware-vcd” service again, as described earlier in this post. This can also be done in an easier way by executing the following:

And again, once you have done this you will have to restart the “vmware-vcd” service on all VCD Cells.
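If you have several cells, the append step can be scripted so that it is safe to run more than once. The helper below is only a sketch under a few assumptions: `append_property` is a name made up for this example, the “.bak” backup copy is my own addition, and the property line itself is deliberately not included here, so you pass in the exact line from this post yourself.

```python
import shutil
from pathlib import Path


def append_property(properties_file: str, line: str) -> bool:
    """Append `line` to the file unless it is already present.

    Returns True if the file was changed, False if the line already existed.
    """
    path = Path(properties_file)
    text = path.read_text()
    if line in text.splitlines():
        return False
    # Keep a copy so the change is easy to revert before a later upgrade.
    shutil.copy2(path, str(path) + ".bak")
    # Make sure the new property starts on its own line.
    separator = "" if text == "" or text.endswith("\n") else "\n"
    with path.open("a") as handle:
        handle.write(f"{separator}{line}\n")
    return True


# Hypothetical usage on a VCD Cell (replace the placeholder with the real line):
# append_property("/opt/vmware/vcloud-director/etc/global.properties",
#                 "<property line from this post>")
```

The helper only edits the file. The “vmware-vcd” service still has to be restarted afterwards for the change to take effect.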

There are a couple of things to consider before applying the workaround. Do NOT use this workaround if you don’t have any issues. This is an undocumented and unsupported workaround. The underlying issue should be fixed in a future Hot Fix/Patch, probably in VCD 10.2.2 and above. When you want to upgrade to that version, you should remove the workaround again before you do! The last thing to consider is that the workaround changes how many console sessions a VCD Cell can handle: this drops to ~400 instead of ~2,000 connections per VCD Cell.

Since applying the workaround, we have been running with three working VCD Cells and Console Proxy services in both of my environments. As mentioned, this should get fixed in a future Hot Fix/Patch. I hope this helps anybody who is having the same issue.
