Recently I was troubleshooting an environment that had some weird issues with the centralized VMware Tools repository we had configured. I will not bore you with those details; they will be explained in a future blogpost. But I also had some issues with vMotion in this environment. At first I thought this was part of the VMware Tools problem, but after some troubleshooting it turned out it was not.
The issue
The issue was as follows. vMotion was working, but only partly, and we only noticed it when we wanted to put a host in maintenance mode. Doing so would vMotion several dozen Virtual Machines to another available host. But the ESXi host decided it only wanted to do 4, and always 4, vMotions at a time. After these 4, all other vMotions would fail. Strangely, all subsequent vMotions would fail, whichever host in the cluster we picked as the source or destination. The VMware ESXi build we used on this cluster was VMware ESXi 6.7.0, 15018017 (ESXi 6.7 EP 13).
Once we tried the vMotion manually, we received errors that the Virtual Machine "was already running a task" (it was not) and we could not proceed. Once we waited a few minutes and tried again, the Virtual Machine would vMotion without any problems. Just never more than 4 at a time.
The vMotions would fail with the following error messages, found in several logs and in the vSphere Client:
vpxd.log:
2020-05-19T13:24:15.226Z error vpxd[39555] [Originator@6876 sub=VmProv opID=lro-1443200-6d7b9eae-07-01] Failed to track task vim.Task:task-165852 on host vim.HostSystem:host-15: Fault cause: vmodl.fault.SystemError
2020-05-19T13:24:15.226Z error vpxd[39555] [Originator@6876 sub=VmProv opID=lro-1443200-6d7b9eae-07-01] Aborting task tracking since task vim.Task:task-165852 failed
-->
2020-05-19T13:24:15.226Z error vpxd[39555] [Originator@6876 sub=VmProv opID=lro-1443200-6d7b9eae-07-01] [VMOTION_RECOVER] Failed to prepare VMotion on destination: A general system error occurred: . Aborting vMotion xaction
vmware.log for the VM that is failing:
2020-05-19T13:25:44.914Z| vmx| I125: MigrateSetStateFinished: type=1 new state=6
2020-05-19T13:25:44.914Z| vmx| I125: MigrateSetState: Transitioning from state 1 to 6.
2020-05-19T13:25:44.914Z| vmx| A100: ConfigDB: Setting config.readOnly = "FALSE"
2020-05-19T13:25:44.914Z| vmx| I125: Migrate_SetFailureMsgList: switching to new log file.
2020-05-19T13:25:44.914Z| vmx| I125: FILE: FileCreateDirectoryEx: Failed to create /tmp. Error = 17
2020-05-19T13:25:44.915Z| vmx| I125: FILE: FileCreateDirectoryEx: Failed to create /tmp/vmware-root. Error = 17
2020-05-19T13:25:44.915Z| vmx| I125: Migrate_SetFailureMsgList: Now in new log file.
2020-05-19T13:25:44.963Z| vmx| I125: Migrate: Caching migration error message list:
2020-05-19T13:25:44.963Z| vmx| I125: [msg.migrate.expired] Timed out waiting for migration start request.
2020-05-19T13:25:44.963Z| vmx| I125: Migrate: cleaning up migration state.
It took us some time to figure this out, but in the end we noticed that 2 of the 9 hosts in the cluster had a full RAMdisk. The reason it took us some time is that these two hosts were never used as source or destination for the vMotions, so we didn't think they would matter. I guess we were wrong. The full RAMdisk was found on the ESXi host in the following log snippet:
hostd.log
2020-05-19T13:24:14.406Z warning hostd[2099675] [Originator@6876 sub=journal opID=lro-1443200-6d7b9eae-03-01-61-a4a6 user=vpxuser] [JrnlFilePersistenceProvider] Error while writing to file. There is no space left on the device
2020-05-19T13:24:14.406Z info hostd[10630738] [Originator@6876 sub=Hostsvc.VmkVprobSource] VmkVprobSource::Post event: (vim.event.EventEx) {
--> key = 255,
--> chainId = 82,
--> createdTime = "1970-01-01T00:00:00Z",
--> userName = "",
--> datacenter = (vim.event.DatacenterEventArgument) null,
--> computeResource = (vim.event.ComputeResourceEventArgument) null,
--> host = (vim.event.HostEventArgument) {
--> name = "esx04",
--> host = 'vim.HostSystem:ha-host'
--> },
--> vm = (vim.event.VmEventArgument) null,
--> ds = (vim.event.DatastoreEventArgument) null,
--> net = (vim.event.NetworkEventArgument) null,
--> dvs = (vim.event.DvsEventArgument) null,
--> fullFormattedMessage = <unset>,
--> changeTag = <unset>,
--> eventTypeId = "esx.problem.visorfs.ramdisk.full",
--> severity = <unset>,
--> message = <unset>,
--> arguments = (vmodl.KeyAnyValue) [
--> (vmodl.KeyAnyValue) {
--> key = "1",
--> value = "var"
--> },
--> (vmodl.KeyAnyValue) {
--> key = "2",
--> value = "/var/lib/vmware/hostd/journal/1589894654.384"
--> }
--> ],
--> objectId = "ha-host",
--> objectType = "vim.HostSystem",
--> objectName = <unset>,
--> fault = (vmodl.MethodFault) null
--> }
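If you want to check other hosts for the same condition without digging through the logs by hand, a quick grep on the ESXi shell does the job. This is just a minimal sketch, assuming the default log locations; the ramdisk-full event is normally also recorded in vobd.log:
# Search the hostd and VOB logs for the ramdisk-full event
grep -i "ramdisk.full" /var/log/hostd.log /var/log/vobd.log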
The resolution
Since we noticed that the RAMdisk was full on at least two of the hosts, we examined all the ESXi hosts in the cluster. You can easily do this by executing the following command on the ESXi CLI:
vdf -h
This will provide output like the following, in which you can see that the "var" RAMdisk is full:
Ramdisk Size Used Available Use% Mounted on
root 32M 4M 27M 13% --
etc 28M 580K 27M 2% --
opt 32M 120K 31M 0% --
var 48M 48M 0B 100% --
tmp 256M 192K 255M 0% --
iofilters 32M 0B 32M 0% --
shm 1024M 0B 1024M 0% --
hostdstats 803M 8M 794M 1% --
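To drill down into what is actually consuming the space, the busybox versions of du and sort are available on the ESXi shell. A minimal sketch, assuming the default /var layout:
# List the largest files and directories under /var (sizes in KB)
du -ak /var 2>/dev/null | sort -n | tail -n 5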
Continuing to troubleshoot why the "/var" partition is getting full, we can see that it is because of the following file:
[esx04:/var/log/EMU/mili] ls -lsah
total 46004
4 drw-r--r-- 1 root root 512 Feb 24 13:00 .
4 drwxr-xr-x 1 root root 512 Feb 24 13:00 ..
45996 -rw-r--r-- 1 root root 44.9M Mar 17 20:33 mili2d.log
OK, great, we found the culprit. Now let's see why this file is getting so big. After some searches through the VMware KBs, it seems this happens because certain Emulex drivers write their logs to "/var/log/" (on the RAMdisk) instead of "/scratch/log/". This behaviour should be fixed in vSphere ESXi 6.7 U2, but apparently it is not, since we are running a higher version.
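Since the root cause is the driver logging to "/var/log/" instead of "/scratch/log/", it can also be worth verifying where "/scratch" actually points on the affected hosts. On ESXi, "/scratch" is a symlink to the configured scratch location, so a simple listing shows whether it lands on persistent storage (a quick sanity check, not specific to this driver):
# Show where the /scratch symlink points; ideally a persistent datastore path
ls -ld /scratch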
When we look at the content of the log file "mili2d.log", it is filled with the same messages over and over again:
Tue Mar 17 20:32:19 2020,3678490, ERROR:MILI_enumerate_elxiscsi:Failed to initialize User Init with status = 19
Tue Mar 17 20:32:19 2020,3678490, ERROR:MILI_enumerate_elx_nics:Failed to initialize USer Init with status = 19
Tue Mar 17 20:32:19 2020,3678490, ERROR:could not open device node /vmfs/devices/char/vmkdriver/be_esx_nic
Tue Mar 17 20:32:19 2020,3678490, CRITICAL:backend_init:OneConnect Adapter Not Found.
Tue Mar 17 20:32:19 2020,3678492, ERROR:MILI_enumerate_elxiscsi:Failed to initialize User Init with status = 19
Tue Mar 17 20:32:19 2020,3678492, ERROR:MILI_enumerate_elx_nics:Failed to initialize USer Init with status = 19
Tue Mar 17 20:32:19 2020,3678492, ERROR:could not open device node /vmfs/devices/char/vmkdriver/be_esx_nic
Tue Mar 17 20:32:19 2020,3678492, CRITICAL:backend_init:OneConnect Adapter Not Found.
Tue Mar 17 20:33:19 2020,3678528, ERROR:MILI_enumerate_elxiscsi:Failed to initialize User Init with status = 19
Tue Mar 17 20:33:19 2020,3678528, ERROR:MILI_enumerate_elx_nics:Failed to initialize USer Init with status = 19
Tue Mar 17 20:33:19 2020,3678528, ERROR:could not open device node /vmfs/devices/char/vmkdriver/be_esx_nic
Tue Mar 17 20:33:19 2020,3678528, CRITICAL:backend_init:OneConnect Adapter Not Found.
Tue Mar 17 20:33:19 2020,3678529, ERROR:MILI_enumerate_elxiscsi:Failed to initialize User Init with status = 19
Tue Mar 17 20:33:19 2020,3678529, ERROR:MILI_enumerate_elx_nics:Failed to initialize USer Init with status = 19
Tue Mar 17 20:33:19 2020,3678529, ERROR:could not open device node /vmfs/devices/char/vmkdriver/be_esx_nic
Tue Mar 17 20:33:19 2020,3678529, CRITICAL:backend_init:OneConnect Adapter Not Found.
Fortunately there are four easy ways to fix this issue:
- Update your ESXi hosts to a build above ESXi 6.7 U2, where this behaviour should be fixed. As we saw, this is not a guaranteed fix, since we were already running a later build.
- Empty the mili2d.log file:
echo > /var/log/EMU/mili/mili2d.log
- Remove the Emulex driver altogether:
- Put the ESXi Host in Maintenance Mode.
- Execute the following command to remove one of the Emulex drivers in question:
esxcli software vib remove --vibname=elxnet
- Reboot the host.
- If you want to remove ALL Emulex drivers, execute the following command to find them all:
esxcli software vib list | grep EMU
[esx04:/vmfs/volumes/datastore/testbve] esxcli software vib list | grep EMU
brcmfcoe 12.0.1278.0-1OEM.670.0.0.8169922 EMU VMwareCertified 2020-05-26
elxiscsi 12.0.1188.0-1OEM.670.0.0.8169922 EMU VMwareCertified 2020-05-26
elxnet 12.0.1216.4-1OEM.670.0.0.8169922 EMU VMwareCertified 2020-05-26
lpfc 12.4.270.6-1OEM.670.0.0.8169922 EMU VMwareCertified 2020-05-26
- Edit the loglevel for the Emulex drivers. You can do this by editing the following file with vi (use value 0 for disabled/no logging); a one-liner alternative is sketched below the list:
vi /etc/config/EMU/mili/libmili.conf
libmili.conf:
#MILILogLevel: Control the number of messages logged to the mili2d.log file.
#Possible values are 0 - Disabled, 1 - FATAL, 2 - CRITICAL, 3 - ERROR, 4 - WARNING, 5 - INFORMATION, 6 - DEBUG.
MILILogLevel 0
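If you prefer not to edit the file interactively with vi, the same change can be made in one line. This is just a sketch, but busybox sed on ESXi supports in-place editing; depending on the driver version, a reboot may be needed before the new log level takes effect.
# Set the Emulex MILI log level to 0 (disabled) without opening vi
sed -i 's/^MILILogLevel.*/MILILogLevel 0/' /etc/config/EMU/mili/libmili.conf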
You should consult your hardware vendor if you are not sure whether the Emulex drivers are required for your ESXi host to work at all. If you are not using any Emulex hardware, you can safely remove the VIBs.
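A quick way to double-check whether any Emulex hardware is present at all is to look at the PCI device listing on the host. A sketch, assuming the adapters are reported under their Emulex branding:
# List PCI devices and filter for Emulex adapters
esxcli hardware pci list | grep -i emulex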
The conclusion
To conclude this post: in the beginning we had several issues that seemed to be related to another case, but in the end were not. The vMotions were not working for the most part, and doing the vMotions manually was not working either. After troubleshooting, it turned out the issues came from a couple of full RAMdisks, caused by a faulty Emulex driver, on a couple of ESXi hosts in the cluster. Once we cleared those logs, reconfigured the Emulex driver logging, or removed the VIBs and rebooted the ESXi hosts, the vMotions were working again, and putting a host in maintenance mode worked flawlessly once more.
Thanks for reading and until the next blogpost!
1 Comment
Johannes · February 2, 2021 at 4:02 pm
Had a similar problem: vMotion failed from and to a certain host with “A general system error occurred” while “Preparing Virtual Machine for live migration on source host”.
Turned out to also be a full /var, due to a previous test with trivia logging, which was of course reverted, but a large log file remained. Since /var is only 48 or 50 MB small, this was a problem.
File deleted and everything works again. vdf -h is a good step in troubleshooting…!