Introduction
We are back at it again with another blogpost for VMware Cloud Director (VCD)! This week we were working on a new issue in our environment, that was again a new not yet discovered bug in the VCD 10.4.x branch. This time we found out that if editing existing Virtual Machine Harddrives it would crash the VMware VCD cell (specifically the cell that received the task). You can imagine that this does not scale well in an environment with loads of Virtual Machines and users. Thus I hope this blog can help you guys figure out if you are seeing the same thing and are also affected.
Troubleshooting
So first things first. Like I said in the introduction. This issue only comes up (and not in all VM’s or all cases) while editing Virtual Machine harddrives, and only existing harddrives. So adding a new harddrive to a Virtual Machine, or deploying a new Virtual Machine does not cause the same issue. Atleast not on my environment. So let’s have a look at how you can see if this is happening to you, and how we troubleshooted this issue. First up, the VCD GUI showed us the following error on the task once it failed (editing the disk that is):

Looking a bit deeper into the cells, we were checking the cell-runtime.log
file located in the /opt/vmware/vcloud-director/log/
folder:
2023-03-08 16:07:06,415 | FATAL | task-service-activity-pool-5 | UncaughtExceptionHandlerStartupAction | Uncaught Exception. Originating thread: Thread[task-service-activity-pool-5,5,main]. Message: null | requestId=xxxxxxxx-00c5-4ebd-9c94-xxxxxxxxxxxxx,request=POST https://vcd/api/vApp/vm-xxxxxx-391f-4a2e-xxxx-xxxxxxxxxxxxxx/action/reconfigureVm,requestTime=1678288011924,remoteAddress=xxx.xxx.xxx.xxx:50712,userAgent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ...,accept=application/*+xml;version 38.0.0-alpha vcd=xxxxxxxxx-48aa-4b7a-a6c5-xxxxxxxxxxx,task=1b501817-2fdb-4c40-88bc-xxxxxxxxxxx activity=(com.vmware.vcloud.backendbase.management.system.TaskActivity,urn:uuid:xxxxxxxxx-2fdb-4c40-88bc-xxxxxxxxxxxx) java.lang.StackOverflowError at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:298) at org.postgresql.jdbc.PgConnection.executeTransactionCommand(PgConnection.java:755) at org.postgresql.jdbc.PgConnection.commit(PgConnection.java:777) at org.apache.tomcat.jdbc.pool.ConnectionPool.terminateTransaction(ConnectionPool.java:853) at org.apache.tomcat.jdbc.pool.ConnectionPool.shouldClose(ConnectionPool.java:875) at org.apache.tomcat.jdbc.pool.ConnectionPool.returnConnection(ConnectionPool.java:904) at org.apache.tomcat.jdbc.pool.ProxyConnection.invoke(ProxyConnection.java:100) at org.apache.tomcat.jdbc.pool.JdbcInterceptor.invoke(JdbcInterceptor.java:109) at org.apache.tomcat.jdbc.pool.interceptor.AbstractCreateStatementInterceptor.invoke(AbstractCreateStatementInterceptor.java:61) at org.apache.tomcat.jdbc.pool.JdbcInterceptor.invoke(JdbcInterceptor.java:109) at org.apache.tomcat.jdbc.pool.interceptor.AbstractCreateStatementInterceptor.invoke(AbstractCreateStatementInterceptor.java:61) at org.apache.tomcat.jdbc.pool.JdbcInterceptor.invoke(JdbcInterceptor.java:109) at org.apache.tomcat.jdbc.pool.interceptor.ConnectionState.invoke(ConnectionState.java:153) at org.apache.tomcat.jdbc.pool.JdbcInterceptor.invoke(JdbcInterceptor.java:109) at com.vmware.vcloud.common.datasource.CustomJdbcInterceptor.invoke(CustomJdbcInterceptor.java:34) at org.apache.tomcat.jdbc.pool.JdbcInterceptor.invoke(JdbcInterceptor.java:109) at org.apache.tomcat.jdbc.pool.TrapException.invoke(TrapException.java:41) at org.apache.tomcat.jdbc.pool.JdbcInterceptor.invoke(JdbcInterceptor.java:109) at org.apache.tomcat.jdbc.pool.DisposableConnectionFacade.invoke(DisposableConnectionFacade.java:80) at com.sun.proxy.$Proxy42.close(Unknown Source) at com.vmware.vcloud.common.db.SessionFactoryInitializer$InternalConnectionProvider.closeConnection(SessionFactoryInitializer.java:105) at org.hibernate.jdbc.ConnectionManager.closeConnection(ConnectionManager.java:474) at org.hibernate.jdbc.ConnectionManager.aggressiveRelease(ConnectionManager.java:429) at org.hibernate.jdbc.ConnectionManager.afterTransaction(ConnectionManager.java:316) at org.hibernate.jdbc.JDBCContext.afterTransactionCompletion(JDBCContext.java:248) at org.hibernate.transaction.JDBCTransaction.rollback(JDBCTransaction.java:208) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.base/java.lang.reflect.Method.invoke(Method.java:566) at com.vmware.vcloud.common.db.HibernateClassLoaderHelperBeanPostProcessor$SessionInvocationHandler.invoke(HibernateClassLoaderHelperBeanPostProcessor.java:280) at com.sun.proxy.$Proxy158.rollback(Unknown Source) at com.vmware.vcloud.common.db.Hibernate3TransactionManager.doRollback(Hibernate3TransactionManager.java:297) at org.springframework.transaction.support.AbstractPlatformTransactionManager.processRollback(AbstractPlatformTransactionManager.java:835) at org.springframework.transaction.support.AbstractPlatformTransactionManager.rollback(AbstractPlatformTransactionManager.java:809) at org.springframework.transaction.interceptor.TransactionAspectSupport.completeTransactionAfterThrowing(TransactionAspectSupport.java:672) at org.springframework.transaction.interceptor.TransactionAspectSupport.invokeWithinTransaction(TransactionAspectSupport.java:392) at org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:119) at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186) at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:763) at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:708) at com.vmware.vcloud.common.reservation.dao.impl.SdrsReservationsDaoImpl$$EnhancerBySpringCGLIB$$3bf7d319.fetchReservations(<generated>) at com.vmware.vcloud.fabric.storage.placement.impl.FabricReservationManagerImpl.fetchSdrsReservations(FabricReservationManagerImpl.java:189) at com.vmware.vcloud.fabric.storage.placement.sdrs.impl.SdrsPlacementManagerImpl.getSdrsRecommendations(SdrsPlacementManagerImpl.java:172) at com.vmware.vcloud.fabric.storage.placement.sdrs.impl.SdrsPlacementManagerImpl.getSdrsRecommendationsForSpodPlacement(SdrsPlacementManagerImpl.java:141) at com.vmware.vcloud.fabric.storage.placement.impl.ValidDatastoresPerItemSolverUtils$VmValidDsPerItemSolver.getDiskDatastoreRecommendations(ValidDatastoresPerItemSolverUtils.java:795) at com.vmware.vcloud.fabric.storage.placement.impl.ValidDatastoresPerItemSolverUtils$VmValidDsPerItemSolver.getExistingVmSdrsRecommendations(ValidDatastoresPerItemSolverUtils.java:738) at com.vmware.vcloud.fabric.storage.placement.impl.ValidDatastoresPerItemSolverUtils$VmValidDsPerItemSolver.getExistingSubjectSdrsRecommendations(ValidDatastoresPerItemSolverUtils.java:699) OR 2023-03-09 17:44:10,584 | FATAL | task-service-activity-pool-66 | UncaughtExceptionHandlerStartupAction | Uncaught Exception. Originating thread: Thread[task-service-activity-pool-66,5,main]. Message: null | requestId=xxxxxxxxxxxx-b93b-4499-943a-xxxxxxxxxxxxx,request=POST https://vcd/api/vApp/vm-xxxxxxxx-ce4f-42fa-863e-xxxxxxxxxxx/action/reconfigureVm,requestTime=1678380233778,remoteAddress=xxx.xxx.xxx.xxx:58484,userAgent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ...,accept=application/*+xml;version 38.0.0-alpha vcd=xxxxxxx-a9e5-4b42-a7b4-xxxxxxxxxx,task=9e9f0d28-782e-49c9-b127-xxxxxxxxxxactivity=(com.vmware.vcloud.backendbase.management.system.TaskActivity,urn:uuid:xxxxxxxxxxxx-782e-49c9-b127-xxxxxxxxxxxx) java.lang.StackOverflowError at java.base/java.net.URI$Parser.parseServer(URI.java:3295) at java.base/java.net.URI$Parser.parseAuthority(URI.java:3216) at java.base/java.net.URI$Parser.parseHierarchical(URI.java:3158) at java.base/java.net.URI$Parser.parse(URI.java:3114) at java.base/java.net.URI.<init>(URI.java:600) at java.base/java.net.URI.create(URI.java:881) at com.vmware.vcloud.common.model.UriType.nullSafeGet(UriType.java:87) at org.hibernate.type.CustomType.nullSafeGet(CustomType.java:128) at org.hibernate.type.AbstractType.hydrate(AbstractType.java:105) at org.hibernate.persister.entity.AbstractEntityPersister.hydrate(AbstractEntityPersister.java:2124) at org.hibernate.loader.Loader.loadFromResultSet(Loader.java:1404) at org.hibernate.loader.Loader.instanceNotYetLoaded(Loader.java:1332) at org.hibernate.loader.Loader.getRow(Loader.java:1230) at org.hibernate.loader.Loader.getRowFromResultSet(Loader.java:603) at org.hibernate.loader.Loader.doQuery(Loader.java:724) at org.hibernate.loader.Loader.doQueryAndInitializeNonLazyCollections(Loader.java:259) at org.hibernate.loader.Loader.doList(Loader.java:2228) at org.hibernate.loader.Loader.listIgnoreQueryCache(Loader.java:2125) at org.hibernate.loader.Loader.list(Loader.java:2120) at org.hibernate.loader.criteria.CriteriaLoader.list(CriteriaLoader.java:118) at org.hibernate.impl.SessionImpl.list(SessionImpl.java:1596) at jdk.internal.reflect.GeneratedMethodAccessor125.invoke(Unknown Source) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.base/java.lang.reflect.Method.invoke(Method.java:566) at com.vmware.vcloud.common.db.HibernateClassLoaderHelperBeanPostProcessor$SessionInvocationHandler.invoke(HibernateClassLoaderHelperBeanPostProcessor.java:280) at com.sun.proxy.$Proxy106.list(Unknown Source) at org.hibernate.impl.CriteriaImpl.list(CriteriaImpl.java:306) at com.vmware.vcloud.common.repository.impl.SessionFactoryRepositoryTemplate.findByCriteria(SessionFactoryRepositoryTemplate.java:137) at com.vmware.vcloud.common.repository.impl.SessionFactoryRepositoryTemplate$$FastClassBySpringCGLIB$$f7adee23.invoke(<generated>) at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:218) at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint(CglibAopProxy.java:793) at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163) at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:763) at org.springframework.transaction.interceptor.TransactionInterceptor$1.proceedWithInvocation(TransactionInterceptor.java:123) at org.springframework.transaction.interceptor.TransactionAspectSupport.invokeWithinTransaction(TransactionAspectSupport.java:388) at org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:119) at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186) at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:763) at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:708) at com.vmware.vcloud.common.repository.impl.SessionFactoryRepositoryTemplate$$EnhancerBySpringCGLIB$$fc510c3.findByCriteria(<generated>) at com.vmware.vcloud.common.reservation.dao.impl.SdrsReservationsDaoImpl.fetchReservations(SdrsReservationsDaoImpl.java:44) at com.vmware.vcloud.common.reservation.dao.impl.SdrsReservationsDaoImpl$$FastClassBySpringCGLIB$$874b9d4a.invoke(<generated>) at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:218) at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint(CglibAopProxy.java:793) at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163)
If we have a close look at the above logging output, we figured it might have something to do with the postgres configuration. But having a look at that we didn’t notice anything out of the ordinary. The entire postgres cluster was healthy and all three nodes were participating like they should. The end of the first and second log snippet suggested it might also be a Storage DRS (SDRS) issue in combination with VCD. But at first glance we didn’t see anything wrong with that either. Looking in the vcloud-container-debug.log
file located in the same folder we did find another error message, but they seem a like alot:
2023-03-08 16:07:06,411 | FATAL | task-service-activity-pool-5 | TaskActivity | [Activity Execution] Encountered unrecoverable error in com.vmware.vcloud.backendbase.management.system.TaskActivity/urn:uuid:xxxxxxxxx-2fdb-4c40-88bc-xxxxxxxxxx- Handle: urn:uuid:xxxxxxxxxxx-2fdb-4c40-88bc-xxxxxxxxxxxx, Current Phase: TaskActivity$ExecutePhase | requestId=xxxxxxxxx-00c5-4ebd-9c94-xxxxxxxxxxx,request=POST https://vcd/api/vApp/vm-xxxxxxxxx-391f-4a2e-ae44-xxxxxxxxxxxx/action/reconfigureVm,requestTime=1678288011924,remoteAddress=xxx.xxx.xxx.xxx:50712,userAgent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ...,accept=application/*+xml;version 38.0.0-alpha vcd=xxxxxxxx-48aa-4b7a-a6c5-xxxxxxxxxx,task=xxxxxxxxxxxx-2fdb-4c40-88bc-xxxxxxxxxxx activity=(com.vmware.vcloud.backendbase.management.system.TaskActivity,urn:uuid:xxxxxxxxxxxx-2fdb-4c40-88bc-xxxxxxxxxxxx) java.lang.StackOverflowError at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:298) at org.postgresql.jdbc.PgConnection.executeTransactionCommand(PgConnection.java:755) at org.postgresql.jdbc.PgConnection.commit(PgConnection.java:777) at org.apache.tomcat.jdbc.pool.ConnectionPool.terminateTransaction(ConnectionPool.java:853) at org.apache.tomcat.jdbc.pool.ConnectionPool.shouldClose(ConnectionPool.java:875) at org.apache.tomcat.jdbc.pool.ConnectionPool.returnConnection(ConnectionPool.java:904) at org.apache.tomcat.jdbc.pool.ProxyConnection.invoke(ProxyConnection.java:100) at org.apache.tomcat.jdbc.pool.JdbcInterceptor.invoke(JdbcInterceptor.java:109) at org.apache.tomcat.jdbc.pool.interceptor.AbstractCreateStatementInterceptor.invoke(AbstractCreateStatementInterceptor.java:61) at org.apache.tomcat.jdbc.pool.JdbcInterceptor.invoke(JdbcInterceptor.java:109) at org.apache.tomcat.jdbc.pool.interceptor.AbstractCreateStatementInterceptor.invoke(AbstractCreateStatementInterceptor.java:61) at org.apache.tomcat.jdbc.pool.JdbcInterceptor.invoke(JdbcInterceptor.java:109) at org.apache.tomcat.jdbc.pool.interceptor.ConnectionState.invoke(ConnectionState.java:153) at org.apache.tomcat.jdbc.pool.JdbcInterceptor.invoke(JdbcInterceptor.java:109) at com.vmware.vcloud.common.datasource.CustomJdbcInterceptor.invoke(CustomJdbcInterceptor.java:34) at org.apache.tomcat.jdbc.pool.JdbcInterceptor.invoke(JdbcInterceptor.java:109) at org.apache.tomcat.jdbc.pool.TrapException.invoke(TrapException.java:41) at org.apache.tomcat.jdbc.pool.JdbcInterceptor.invoke(JdbcInterceptor.java:109) at org.apache.tomcat.jdbc.pool.DisposableConnectionFacade.invoke(DisposableConnectionFacade.java:80) at com.sun.proxy.$Proxy42.close(Unknown Source) at com.vmware.vcloud.common.db.SessionFactoryInitializer$InternalConnectionProvider.closeConnection(SessionFactoryInitializer.java:105) at org.hibernate.jdbc.ConnectionManager.closeConnection(ConnectionManager.java:474) at org.hibernate.jdbc.ConnectionManager.aggressiveRelease(ConnectionManager.java:429) at org.hibernate.jdbc.ConnectionManager.afterTransaction(ConnectionManager.java:316) at org.hibernate.jdbc.JDBCContext.afterTransactionCompletion(JDBCContext.java:248) at org.hibernate.transaction.JDBCTransaction.rollback(JDBCTransaction.java:208) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.base/java.lang.reflect.Method.invoke(Method.java:566) at com.vmware.vcloud.common.db.HibernateClassLoaderHelperBeanPostProcessor$SessionInvocationHandler.invoke(HibernateClassLoaderHelperBeanPostProcessor.java:280) at com.sun.proxy.$Proxy158.rollback(Unknown Source) at com.vmware.vcloud.common.db.Hibernate3TransactionManager.doRollback(Hibernate3TransactionManager.java:297) at org.springframework.transaction.support.AbstractPlatformTransactionManager.processRollback(AbstractPlatformTransactionManager.java:835) at org.springframework.transaction.support.AbstractPlatformTransactionManager.rollback(AbstractPlatformTransactionManager.java:809) at org.springframework.transaction.interceptor.TransactionAspectSupport.completeTransactionAfterThrowing(TransactionAspectSupport.java:672) at org.springframework.transaction.interceptor.TransactionAspectSupport.invokeWithinTransaction(TransactionAspectSupport.java:392) at org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:119) at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186) 2023-03-09 17:44:10,580 | FATAL | task-service-activity-pool-66 | TaskActivity | [Activity Execution] Encountered unrecoverable error in com.vmware.vcloud.backendbase.management.system.TaskActivity/urn:uuid:xxxxxxxx-782e-49c9-b127-xxxxxxxxxxx- Handle: urn:uuid:xxxxxxxx-782e-49c9-b127-xxxxxx, Current Phase: TaskActivity$ExecutePhase | requestId=xxxxxxx-b93b-4499-943a-xxxxxxxxx,request=POST https://vcd/api/vApp/vm-xxxxxxx-ce4f-42fa-863e-xxxxxxxxxx/action/reconfigureVm,requestTime=1678380233778,remoteAddress=xxx.xxx.xxx.xxx:58484,userAgent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ...,accept=application/*+xml;version 38.0.0-alpha vcd=xxxxxxx-a9e5-4b42-a7b4-xxxxxxxxxx,task=xxxxxxxxxxx-782e-49c9-b127-xxxxxxxxxxx activity=(com.vmware.vcloud.backendbase.management.system.TaskActivity,urn:uuid:xxxxxxxxx-782e-49c9-b127-xxxxxxxxxxx) java.lang.StackOverflowError at java.base/java.net.URI$Parser.parseServer(URI.java:3295) at java.base/java.net.URI$Parser.parseAuthority(URI.java:3216) at java.base/java.net.URI$Parser.parseHierarchical(URI.java:3158) at java.base/java.net.URI$Parser.parse(URI.java:3114) at java.base/java.net.URI.<init>(URI.java:600) at java.base/java.net.URI.create(URI.java:881) at com.vmware.vcloud.common.model.UriType.nullSafeGet(UriType.java:87) at org.hibernate.type.CustomType.nullSafeGet(CustomType.java:128) at org.hibernate.type.AbstractType.hydrate(AbstractType.java:105) at org.hibernate.persister.entity.AbstractEntityPersister.hydrate(AbstractEntityPersister.java:2124) at org.hibernate.loader.Loader.loadFromResultSet(Loader.java:1404) at org.hibernate.loader.Loader.instanceNotYetLoaded(Loader.java:1332) at org.hibernate.loader.Loader.getRow(Loader.java:1230) at org.hibernate.loader.Loader.getRowFromResultSet(Loader.java:603) at org.hibernate.loader.Loader.doQuery(Loader.java:724) at org.hibernate.loader.Loader.doQueryAndInitializeNonLazyCollections(Loader.java:259) at org.hibernate.loader.Loader.doList(Loader.java:2228) at org.hibernate.loader.Loader.listIgnoreQueryCache(Loader.java:2125) at org.hibernate.loader.Loader.list(Loader.java:2120) at org.hibernate.loader.criteria.CriteriaLoader.list(CriteriaLoader.java:118) at org.hibernate.impl.SessionImpl.list(SessionImpl.java:1596) at jdk.internal.reflect.GeneratedMethodAccessor125.invoke(Unknown Source) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.base/java.lang.reflect.Method.invoke(Method.java:566) at com.vmware.vcloud.common.db.HibernateClassLoaderHelperBeanPostProcessor$SessionInvocationHandler.invoke(HibernateClassLoaderHelperBeanPostProcessor.java:280) at com.sun.proxy.$Proxy106.list(Unknown Source) at org.hibernate.impl.CriteriaImpl.list(CriteriaImpl.java:306) at com.vmware.vcloud.common.repository.impl.SessionFactoryRepositoryTemplate.findByCriteria(SessionFactoryRepositoryTemplate.java:137) at com.vmware.vcloud.common.repository.impl.SessionFactoryRepositoryTemplate$$FastClassBySpringCGLIB$$f7adee23.invoke(<generated>) at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:218) at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint(CglibAopProxy.java:793) at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163) at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:763) at org.springframework.transaction.interceptor.TransactionInterceptor$1.proceedWithInvocation(TransactionInterceptor.java:123) at org.springframework.transaction.interceptor.TransactionAspectSupport.invokeWithinTransaction(TransactionAspectSupport.java:388) at org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:119) at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186) at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:763) at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:708) at com.vmware.vcloud.common.repository.impl.SessionFactoryRepositoryTemplate$$EnhancerBySpringCGLIB$$fc510c3.findByCriteria(<generated>
At this point actually wanted to restart the vmware-vcd service to try if this helped. But then we found out the the vmware-vcd cell services were only up for a couple of minutes. This means that it was already restarted. Finding this interesting, we started doing multiple tests to see what was happening. Once we did, we figured out the entire vmware-vcd service crashes once the above error messages come around. After talking to GSS we learned that this is the default behaviour for the vmware-vcd service. It does this to make sure there is no harm done to the environment once a fatal exception occurs in the code. Fortunately the restart is something that you probably won’t notice in a quiet environment. But if you have an environment like we have with thousands of customers, you will notice quick enough.
Obviously being out of our league since we cannot debug specific VCD code (yet ;)) we started an investigation with VMware GSS to find a fix for the issue at hand. After a while we figured it out together with the VCD engineering team that the following was happening:
Initially when a VM and it’s disks are created. They start out residing on a specific datastore and targeted Storage POD (SPOD or Storage Cluster in vCenter). When you grow existing disks on a VM a SDRS recommendation gets created in the VCD database table called sdrs_reservation
. Before this happens you can investigate the VCD logs and see that VCD is invoking a SDRS recommendation, and mentioning that is is done calculating once. The calculated SDRS recommendation get’s placed into the aforementioned table. So you can imagine that this table contains all of the SDRS recommendations. Example logging:
Invoking the SDRS: 2023-03-08 19:51:59,973 | DEBUG | task-service-activity-pool-1 | SdrsPlacementManagerImpl | Invoking SDRS recommendDatastores for placement result: VirtualMachineInitialPlacementResultData [vmHomePlacementResult=StorageContainerPlacementResultData [storagePodVALRef=[vcId=xxxxx-286d-4202-8961-xxxxx, moref=group-xxxx]], vmDiskPlacementResults={VmDiskRef [diskObjectId=2000]=SimplePlacementResultData [datastoreVALRef=[vcId=xxxxx-286d-4202-8961-xxxxxx, moref=datastore-xxxx]]}], Placement type: class com.vmware.vcloud.fabric.storage.placement.api.VirtualMachineStorageSelector$ForRelocateVm Invoke SDRS result: 2023-03-08 19:52:01,950 | DEBUG | task-service-activity-pool-1 | SdrsPlacementManagerImpl | Calculated SDRS recommendations for subject moref://xxxxxx-286d-4202-8961-xxxxxxxxxx/VirtualMachine#vm-xxxxxxxx with initial params VirtualMachineInitialPlacementResultData [vmHomePlacementResult=StorageContainerPlacementResultData [storagePodVALRef=[vcId=xxxxx-286d-4202-8961-xxxxxxx, moref=group-xxxxxx]], vmDiskPlacementResults={VmDiskRef [diskObjectId=2000]=SimplePlacementResultData [datastoreVALRef=[vcId=xxxxxxxx-286d-4202-8961-xxxxxxxx, moref=datastore-xxxx]]}]. Result: {moref://xxxxxxxx-286d-4202-8961-xxxxxxxxx/VirtualMachine#vm-xxxxxxx=[[vcId=xxxxxxx-286d-4202-8961-xxxxxxxxxx, moref=datastore-xxxxxxxxxx]]}
Once this is done when you extend a disk there was a check in place in the code that checked for any existing SDRS recommendations based on SubjectURI, however the SubjectURI for the VM home is the same as the SubjectURI for the disk on the VM which was already present in the table. Therefor the SDRS call for the disk was never called and this results in a URI parser error which you can see in the log output. Because of this the SDRS recommendation for the VM home is returned for the VM disk. But the VM disk does not have any datastore mapping present. At this point the code runs in loops and never fetches the actual SDRS recommendation for the disk and therefor crashes the vmware-vcd service.
Now I do have to say that this might be an edge case, we were actually the first globally to experience this bug on this level. It might be due to the fact we are using multi-cluster setup which I briefly mentioned in an earlier post together with a shared storage policy across all environments. The VCD placement engine considers all clusters and all storage pods (storage clusters in vCenter) in this case and this is why SDRS reservations are done on multiple levels.
VMware engineering fixed this by filtering on multiple levels. In a hotfix we received we are now filtering on SubjectURI’s but also ItemURI’s. This way the code can really filter on unique entries (including seperate disks) instead of ‘global’ VM entries.
Conclusion
It seems that VCD release branch 10.4 is more buggy than release branch 10.3 (atleast for us). This is unfortunate for us since we are using this on a massive scale! Atleast the issue I’ve talked about can be fixed, albeit by a hotfix that is not yet publicly available. This hotfix does fix some of the placement logic in combination with the Storage DRS placement engine in vCenter Server. You should ask your VMware GSS representative for this patch. If you need to you can point it to my blogpost. Important to know, the hotfix is only for VCD 10.4.1.1 (at the time I am writing this blogpost), so if you are not yet on this version, make sure you upgrade to it first. The fix is not compatible with VCD 10.4.0. Applying the hotfix is as simple as upgrading VCD through the regular procedure mentioned here. I hope you found this helpful.
Yesterday VCD 10.4.2 also released, yet there is no mention of this in the Known Issues or Resolved Items section. That is unfortunate!
0 Comments