Introduction
Every now and then you run into a problem that you end up taking for granted and simply living with. The issue I describe in this post is one of those.
Since VMware Cloud Director (VCD) 10.5.1.1, changing a VM policy such as a VM Placement Policy or VM Storage Policy might trigger a Storage vMotion to another datastore and/or cluster (Resource Pool, in an Elastic pVDC). The Storage vMotion itself was and is not a problem, of course. The problem is that the task that manages this from the VCD side fails after a while with a timeout.
Troubleshooting
I had actually started to live with this issue. We run into it quite often, because the same timeout also comes up when you move Virtual Machines (VMs) from one cluster to another in an Elastic pVDC. That is also done from the VCD UI (inside the pVDC -> Resource Pools -> Select Resource Pool -> Migrate option), and it fails for the same reason: it also triggers a Storage vMotion.
Executing this, or changing the aforementioned VM policies, might result in the following messages:

[ XXXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXX ] Internal Server Error - Expected completed future, but received future which is still in progress
com.vmware.ssdc.util.LMException: Internal Server Error
    at com.vmware.ssdc.util.LMException.wrap(LMException.java:135)
    at com.vmware.ssdc.library.ExceptionFactory.createFromMultiple(ExceptionFactory.java:32)
    at com.vmware.vcloud.common.future.FutureUtil.waitForFutures(FutureUtil.java:96)
    at com.vmware.vcloud.common.future.FutureUtil.waitForFutures(FutureUtil.java:105)
    at com.vmware.vcloud.vapp.impl.VAppServiceImpl.migrateVmsTask(VAppServiceImpl.java:3963)
    at com.vmware.vcloud.vapp.impl.VAppServiceImpl.executeTask(VAppServiceImpl.java:807)
    at com.vmware.vcloud.backendbase.management.system.TaskActivity$ExecutePhase$1.doInSecurityContext(TaskActivity.java:865)
    at com.vmware.vcloud.backendbase.management.system.TaskActivity$ExecutePhase$1.doInSecurityContext(TaskActivity.java:860)
    at com.vmware.vcloud.backendbase.management.system.SecurityContextTemplate.executeForOrgAndUser(SecurityContextTemplate.java:49)
    at com.vmware.vcloud.backendbase.management.system.TaskActivity$ExecutePhase.execute(TaskActivity.java:867)
    at com.vmware.vcloud.backendbase.management.system.TaskActivity$ExecutePhase.invokeInner(TaskActivity.java:763)
    at com.vmware.vcloud.backendbase.management.system.TaskActivity$TaskActivityBasePhase.invokeCancelableOperation(TaskActivity.java:378)
    at com.vmware.vcloud.common.activity.toolkit.VcdAbstractActivity$CancelablePhase.invoke(VcdAbstractActivity.java:591)
    at com.vmware.vcloud.activity.executors.ActivityRunner.runPhase(ActivityRunner.java:175)
    at com.vmware.vcloud.activity.executors.ActivityRunner.run(ActivityRunner.java:112)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: com.vmware.vcloud.api.presentation.service.InternalServerErrorException: Internal Server Error
    at com.vmware.ssdc.backend.util.ComputeFabricUtil.translateException(ComputeFabricUtil.java:221)
    at com.vmware.ssdc.backend.util.ComputeFabricUtil.waitForFuture(ComputeFabricUtil.java:135)
    at com.vmware.vcloud.activities.vm.MigrateVmActivity$MigratePhase.invoke(MigrateVmActivity.java:117)
    ... 7 more
Caused by: java.lang.AssertionError: Expected completed future, but received future which is still in progress
    at com.vmware.vcloud.common.future.FutureUtil.checkCompletedFuture(FutureUtil.java:184)
    at com.vmware.ssdc.backend.services.impl.RelocateVmActivity$UpdateComputeVmModelIfNeeded.invoke(RelocateVmActivity.java:1007)
    ... 7 more
Even though the task fails from the VCD perspective, it always succeeds from the vSphere perspective, so the relocate does eventually complete. VCD picks this up once it’s done, and the new location is reflected in the GUI as well.
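If you want to tell this specific timeout apart from other failed tasks, a quick check for the message quoted above is enough. This is only a minimal sketch: the function name is my own, and it assumes you have saved the task's error details to a text file yourself.

```shell
# Hypothetical helper: succeeds (exit 0) when the text on stdin contains the
# timeout message quoted above, and fails (exit 1) otherwise.
is_relocate_timeout() {
    grep -q 'Expected completed future, but received future which is still in progress'
}
```

For example, with the error saved in a file called task-error.txt (a made-up name), `is_relocate_timeout < task-error.txt && echo "relocate timeout"` prints the message only for this failure.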

I do think that in some rare scenarios this task failure leaves residual errors in the VCD database regarding VM locations. If this happens, you will have to do one of two things: 1. Edit the database to reflect the location where the VM is actually placed, or 2. Relocate the VM back so that it matches the current database entry and then redo the Relocate. I will write a blog post on this later on.
So, on to the issue at hand. My colleague pointed me to a KB article that was published only 2 weeks ago! The exact reason for this issue is mentioned in that KB:
vCenter will trigger a storage vMotion when the policy is updated in Cloud Director. This can take some time to complete. In Cloud Director 10.5.1.1, there is new timeout setting introduced for which when a VM policy change or a Storage vMotion task takes longer than default 5 minutes, the task would fail at VCD level as the timeout is hit.
Luckily, there is an easy fix. This VCD version introduced a new, undocumented configuration parameter. The only result on Google at the moment is this specific KB that mentions this value. So let’s go ahead and edit it with the following command:
/opt/vmware/vcloud-director/bin/cell-management-tool manage-config -n relocate.vm.workflow.timeout.minutes -v 10
Updating property: Property "relocate.vm.workflow.timeout.minutes" has value "10"
This sets the Relocate timeout to 10 minutes instead of the default 5 minutes. This value can be tweaked according to your own needs. There is no maximum. However, there are some side effects that you might want to consider before you push this value into hours:
- Once the value is an hour, for example, the Relocate task in VCD will time out after an hour, which means that for that hour there is nothing else to do except wait. That is already true because the relocate is running on the back-end, but it is still something to remember.
- Each Relocate can take anywhere from seconds to days, depending on your environment. Performance is determined by the ESXi hosts and storage arrays, so there will never be a ‘correct’ value.
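One way to pick a value is to look at how long your own relocates have taken in the past and choose a timeout that covers most of them. The sketch below assumes you already have a list of relocate durations in seconds, one per line. How you collect that list is up to you and not covered here, and the function name is my own. It uses a simple nearest-rank percentile and rounds up to whole minutes.

```shell
# Hypothetical helper: print the smallest whole number of minutes that covers
# the given percentile of relocate durations.
# Input: one duration in seconds per line on stdin.
# Usage: suggest_timeout_minutes PERCENTILE < durations.txt
suggest_timeout_minutes() {
    sort -n | awk -v pct="$1" '
        { d[NR] = $1 + 0 }
        END {
            if (NR == 0) exit 1          # no data, nothing to suggest
            # Nearest-rank percentile: rank = ceil(pct * N / 100)
            idx = pct * NR / 100
            rank = int(idx)
            if (rank < idx) rank++
            if (rank < 1) rank = 1
            secs = d[rank]
            # Round up to whole minutes
            mins = int(secs / 60)
            if (mins * 60 < secs) mins++
            print mins
        }'
}
```

With five sample durations of 30, 90, 240, 600 and 4000 seconds, `suggest_timeout_minutes 80` prints 10, while `suggest_timeout_minutes 100` prints 67 because of the single slow relocate. That is exactly the trade-off described above: covering every outlier quickly pushes the value up.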
We went ahead and set it to 120 minutes. We see that 99.9% of the Relocates finish within this timeframe in our environments. If you want to see which value is configured, run the following command:
/opt/vmware/vcloud-director/bin/cell-management-tool manage-config -n relocate.vm.workflow.timeout.minutes -l
Property "relocate.vm.workflow.timeout.minutes" has value "10"
Now, whenever you change a VM Placement Policy or VM Storage Policy, or trigger a Storage vMotion on the back-end, the task will time out after the configured value.
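If you have to do this more than once, the set and check commands can be wrapped in a small function. This is a rough sketch and not an official script: it only uses the two commands shown above, the function and variable names are my own, and the check assumes the output keeps the `has value "N"` format shown in this post.

```shell
# Path to the tool as used above; can be overridden through the CMT variable.
CMT="${CMT:-/opt/vmware/vcloud-director/bin/cell-management-tool}"
PROP="relocate.vm.workflow.timeout.minutes"

# Hypothetical helper: set the relocate timeout and read it back.
# Usage: set_relocate_timeout MINUTES
set_relocate_timeout() {
    minutes="$1"
    # Only accept a whole number of minutes
    case "$minutes" in
        ''|*[!0-9]*)
            echo "timeout must be a whole number of minutes" >&2
            return 2
            ;;
    esac
    # Set the value (same command as shown earlier in this post)
    "$CMT" manage-config -n "$PROP" -v "$minutes" || return 1
    # Read it back; assumes the output contains: has value "<minutes>"
    "$CMT" manage-config -n "$PROP" -l | grep -q "has value \"$minutes\""
}
```

Running `set_relocate_timeout 120` on a cell then returns success only when the value that is read back matches what was requested.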
Happy relocating! Hope this helped.