This week I have been troubleshooting a fresh issue in our environment. We recently upgraded one of our VMware Cloud Director (VCD) environments to version 10.4.0. This version brings loads of new enhancements that we loved and wanted to use after the upgrade. I’ve covered those enhancements before, for example the new Console Proxy mechanics, in my other blog post.
The issue was the following. If we edited a Virtual Machine (VM) to have more vCPUs and then powered it on, the VM Power On task kept running in VCD. In the back end, on vCenter Server, we could see that the VM was being relocated to another physical vSphere cluster. Once that relocate finished, the VM was relocated to yet another cluster, and so on. All the while the task in VCD kept running, so as an end user the only thing you can think is that it’s taking its time. Eventually the task fails, once VCD has relocated the VM across all the vSphere clusters available in the environment.
A couple of things might need some clarification here. Since this is a larger environment, we are using multiple vSphere clusters as Resource Pools in VCD, joined together in a single Provider VDC. A logical overview looks like this:
In this overview you can see that the “pVDC Cloud Platform” is connected to multiple tenants, Organizations and Organization VDCs. The green “Compute” block is in our case one or more physical vSphere clusters, all with the same hardware and with separated underlying network and storage. In VCD terminology this is called an Elastic pVDC.
OK, so back to the issue. Let’s illustrate the problem in VCD 10.4.0. We have a VM that has 20 vCPUs.
Now we edit this VM to have 21 vCPUs. The task completes without a problem.
Now we try to power on the VM from the VCD UI. You can see the task start, and it will run for a while. Checking from vCenter Server, you can see the following behaviour on the VM:
You can see the first reconfigure, which was the reconfigure to 21 vCPUs. The Relocate tasks that follow each move the VM to another physical vSphere cluster in the environment. How many times this happens depends on your environment and the number of physical clusters you have. We have loads, so you can imagine that for a large VM it takes forever to reach the point where the task fails. Once the task fails in VCD you get the following error:
com.vmware.ssdc.library.exceptions.MultipleLMException: One or more exceptions have occurred - Multiple Exceptions follow: [com.vmware.ssdc.library.exceptions.InsufficientResourceException: The operation could not be performed, because there are no more CPU resources, com.vmware.pbm.placement.PlacementException: PlacementException NO_HUBS_AVAILABLE_FOR_PLACEMENT]
Now if we look closely at the vcloud-container-debug.log file in the /opt/vmware/vcloud-director/logs folder, you can see the same error caught in the logs:
2023-03-18 13:50:36,770 | ERROR | task-service-activity-pool-97 | FutureUtil | Failed to deploy VM | requestId=xxxxxxx-1979-4da1-ab56-xxxxxxx,request=POST https://xxx.xxxxxx.nl/api/vApp/vm-xxxxxxxxxxxxxxx/power/action/powerOn,requestTime=1679143752011,remoteAddress=xxx.xxx.xxx.xxx:60354,userAgent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ...,accept=application/*+xml;version 38.0.0-alpha vcd=7474ce19-48aa-xxxx-a6c5-xxxxxxxxxx,task=xxxxxxxx-bc38-438d-964f-xxxxxxxxx activity=(com.vmware.vcloud.backendbase.management.system.TaskActivity,urn:uuid:xxxxxxxx-bc38-438d-xxxxf-xxxxxxxxxxx)
com.vmware.ssdc.library.exceptions.MultipleLMException: One or more exceptions have occurred - Multiple Exceptions follow: [com.vmware.ssdc.library.exceptions.InsufficientResourceException: The operation could not be performed, because there are no more CPU resources, com.vmware.pbm.placement.PlacementException: PlacementException NO_HUBS_AVAILABLE_FOR_PLACEMENT]
    at com.vmware.ssdc.library.ExceptionFactory.createFromMultiple(ExceptionFactory.java:35)
    at com.vmware.ssdc.backend.PowerOnVmActivity$RelocateVmPhase.invoke(PowerOnVmActivity.java:301)
    at com.vmware.vcloud.activity.executors.ActivityRunner.runPhase(ActivityRunner.java:175)
    at com.vmware.vcloud.activity.executors.ActivityRunner.run(ActivityRunner.java:112)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
    ............................
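If you want to check quickly whether your own cells are hitting the same failure, a small script can pull the relevant entries out of the debug log. The following is only a rough sketch in Python. The field layout is inferred from the one log entry shown above, and the names `PowerOnFailure`, `split_records` and `find_power_on_failures` are my own, not anything shipped with VCD.

```python
import re
from typing import Iterator, List, NamedTuple


class PowerOnFailure(NamedTuple):
    timestamp: str
    request_id: str
    placement_code: str


# A log record is assumed to start with a timestamp like "2023-03-18 13:50:36,770 | ".
RECORD_START = re.compile(
    r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} \| ", re.MULTILINE
)
REQUEST_ID = re.compile(r"requestId=([^,\s]+)")
# Matches e.g. "PlacementException NO_HUBS_AVAILABLE_FOR_PLACEMENT".
PLACEMENT_CODE = re.compile(r"PlacementException ([A-Z_]+)")


def split_records(text: str) -> Iterator[str]:
    """Split raw log text into records, keeping stack trace lines with their header."""
    starts: List[int] = [m.start() for m in RECORD_START.finditer(text)]
    for begin, end in zip(starts, starts[1:] + [len(text)]):
        yield text[begin:end]


def find_power_on_failures(text: str) -> Iterator[PowerOnFailure]:
    """Yield every 'Failed to deploy VM' record that carries a placement error code."""
    for record in split_records(text):
        if "Failed to deploy VM" not in record:
            continue
        request = REQUEST_ID.search(record)
        code = PLACEMENT_CODE.search(record)
        if request and code:
            # The timestamp is the first 23 characters of the record.
            yield PowerOnFailure(record[:23], request.group(1), code.group(1))
```

To use it, read the log file into a string and loop over `find_power_on_failures(text)`. Each hit gives you the timestamp and request ID to match against the failed task in the VCD UI.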
OK, great. Fortunately I’ve had my fair share of VCD experience, so I had an idea where to look. I tested the same procedure on my test environment, which was running 10.4.1.1 (the latest version while writing this blog post). Unfortunately the behaviour was the same there:
- Have a VM with 20 vCPUs -> Check
- Edit the VM to 21 vCPUs -> Check
- Power On the VM -> Many relocates to other clusters before the task fails.
The error message at that point was also more or less the same:
064d695e-1f39-4e4d-xxxx-xxxxxxxxxxx] The operation failed because no suitable resource was found. Out of 2 candidate hubs: 2 hubs eliminated because: Compute requirement not met: [type:NumCpu, value:32, mandatory:false, checkMinHostCapacity:false]. Rejected hubs: resgroup-xxxxx, resgroup-xxx - PlacementException NO_FEASIBLE_PLACEMENT_SOLUTION
So the behaviour in VCD 10.4.1.1 is no different from the behaviour in VCD 10.4.0. Like I mentioned, I had an idea where to look. We are using limits in our VCD environment. Not limits enforced by VM Sizing Policies (previously called Compute Policies), but limits set with a couple of hidden VCD cell-management-tool commands. I found them years ago in this blog post. They are the following:
/opt/vmware/vcloud-director/bin/cell-management-tool manage-config -n vmlimits.cpu.numcpus
/opt/vmware/vcloud-director/bin/cell-management-tool manage-config -n vmlimits.disk.capacity.maxMb
/opt/vmware/vcloud-director/bin/cell-management-tool manage-config -n vmlimits.memory.limit
These commands enforce the limits across the entire VCD environment, for all tenants at once. They can be edited, removed or added while VCD is running, and there is no need to restart anything. Since this environment uses 20 vCPUs as the limit and we were trying to edit the VM to 21 vCPUs (which is over the limit), you receive the error message shown earlier in VCD 10.4.x. It’s a bit misleading, but it means that you have crossed the limit and that the selected number of vCPUs, amount of RAM or disk size is not allowed. Before VCD 10.4.x the reconfigure didn’t go through at all: it failed before expanding the vCPUs/RAM/disk above the configured limit.
According to VMware support, it looks like VCD simply no longer validates these configured limits correctly. The reconfigure goes through, and then on power on you get the relocates mentioned before.
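To make clear what kind of check seems to be missing, here is a minimal sketch in Python of an up-front limit validation: compare what is requested against the configured limits and reject the edit before anything is reconfigured. This is purely my own illustration and not VCD code. The function name and the dictionary layout are hypothetical, and only the 20 vCPU limit comes from our environment.

```python
from typing import Dict, List

# Hypothetical representation of the configured limits, keyed by the
# property names used with cell-management-tool. Only the vCPU value
# (20) is taken from this post.
CONFIGURED_LIMITS: Dict[str, int] = {
    "vmlimits.cpu.numcpus": 20,
}


def find_limit_violations(
    requested: Dict[str, int], limits: Dict[str, int]
) -> List[str]:
    """Return a readable message for every requested value above its limit."""
    violations: List[str] = []
    for key, limit in limits.items():
        value = requested.get(key)
        # Properties that are not part of the request are simply skipped.
        if value is not None and value > limit:
            violations.append(f"{key}: requested {value}, limit {limit}")
    return violations
```

With the values from this post, `find_limit_violations({"vmlimits.cpu.numcpus": 21}, CONFIGURED_LIMITS)` returns one violation, while a request for 20 vCPUs returns an empty list. A check like this, run before the reconfigure, would give the tenant a clear message instead of a long series of relocates.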
Solution / Workaround
Fortunately the resolution for now, if you are on VCD 10.4.x, is simple. Remove the previously mentioned limits with the cell-management-tool and you are good to go. The VMs will no longer be relocated across all the vSphere clusters. Customers can then go over the maximum desired number of resources, because it’s no longer enforced, so you will have to keep an eye on this and correct it later on.
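If you have several cells or environments to go through, it can help to generate the removal commands rather than type them. Below is a small dry-run sketch in Python that only prints the commands and does not execute anything. The `-d` (delete) option on `manage-config` is an assumption on my side, so please confirm it against the tool’s built-in help on your own cell before running anything. The helper name `removal_commands` is hypothetical.

```python
import shlex
from typing import List, Sequence

CMT = "/opt/vmware/vcloud-director/bin/cell-management-tool"

# The three limit properties mentioned in this post.
LIMIT_KEYS = (
    "vmlimits.cpu.numcpus",
    "vmlimits.disk.capacity.maxMb",
    "vmlimits.memory.limit",
)


def removal_commands(keys: Sequence[str] = LIMIT_KEYS) -> List[List[str]]:
    """Build one manage-config command per limit property.

    Assumption: "-d" deletes the property named with "-n".
    Verify this on your cell before use.
    """
    return [[CMT, "manage-config", "-n", key, "-d"] for key in keys]


if __name__ == "__main__":
    # Dry run: print the commands so they can be reviewed first.
    for command in removal_commands():
        print(shlex.join(command))
```

Printing instead of executing is deliberate: you can review the output, note down the current values somewhere, and only then run the commands by hand.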
As of writing this blog post there is no additional workaround or solution besides removing the values. There is another option, but if you are not using it already it might be a lot of work: you could use VM Sizing Policies to have your tenants comply with limits. The issue itself is a bug that is known internally at VMware. Unfortunately it is not covered in the VCD 10.4.0 Known Issues section, so I figured I’d write a post on this.