Recently I was troubleshooting an environment that had some weird issues with the centralized VMware Tools repository we had configured. I will not bore you with those details; they will be explained in a future blogpost. But I also had some issues with vMotion in this environment. At first I thought this was part of the VMware Tools problem, but after some troubleshooting it turned out it was not.
The issue
The issue was as follows. vMotion was working, but only partly, and we only noticed it when we wanted to put a host in maintenance mode. Doing so would vMotion several dozen Virtual Machines to another available host. But the ESXi host decided it only wanted to do 4, and always 4, vMotions at a time. After these 4, all other vMotions would fail. Strangely, all subsequent vMotions would fail, whichever host in the cluster we picked as the source or destination. The VMware ESXi build we used on this cluster was VMware ESXi 6.7.0, 15018017 (ESXi 6.7 EP 13).
Once we tried the vMotion manually, we received errors that the Virtual Machine "was already running a task" (it was not) and we could not proceed. Once we waited a few minutes and tried again, the Virtual Machine would vMotion without any problems. Just never more than 4 at a time.
The vMotions would fail with the following error messages, found in several logs and in the vSphere Client:
vpxd.log:
2020-05-19T13:24:15.226Z error vpxd[39555] [Originator@6876 sub=VmProv opID=lro-1443200-6d7b9eae-07-01] Failed to track task vim.Task:task-165852 on host vim.HostSystem:host-15: Fault cause: vmodl.fault.SystemError
2020-05-19T13:24:15.226Z error vpxd[39555] [Originator@6876 sub=VmProv opID=lro-1443200-6d7b9eae-07-01] Aborting task tracking since task vim.Task:task-165852 failed
-->
2020-05-19T13:24:15.226Z error vpxd[39555] [Originator@6876 sub=VmProv opID=lro-1443200-6d7b9eae-07-01] [VMOTION_RECOVER] Failed to prepare VMotion on destination: A general system error occurred: . Aborting vMotion xaction
vmware.log for the VM that is failing:
2020-05-19T13:25:44.914Z| vmx| I125: MigrateSetStateFinished: type=1 new state=6
2020-05-19T13:25:44.914Z| vmx| I125: MigrateSetState: Transitioning from state 1 to 6.
2020-05-19T13:25:44.914Z| vmx| A100: ConfigDB: Setting config.readOnly = "FALSE"
2020-05-19T13:25:44.914Z| vmx| I125: Migrate_SetFailureMsgList: switching to new log file.
2020-05-19T13:25:44.914Z| vmx| I125: FILE: FileCreateDirectoryEx: Failed to create /tmp. Error = 17
2020-05-19T13:25:44.915Z| vmx| I125: FILE: FileCreateDirectoryEx: Failed to create /tmp/vmware-root. Error = 17
2020-05-19T13:25:44.915Z| vmx| I125: Migrate_SetFailureMsgList: Now in new log file.
2020-05-19T13:25:44.963Z| vmx| I125: Migrate: Caching migration error message list:
2020-05-19T13:25:44.963Z| vmx| I125: [msg.migrate.expired] Timed out waiting for migration start request.
2020-05-19T13:25:44.963Z| vmx| I125: Migrate: cleaning up migration state.
It took us some time to figure this out, but in the end we noticed that 2 of the 9 hosts in the cluster had a full RAMdisk. The reason it took us some time is that these two hosts were never used as source or destination for the vMotions, so we didn't think they would matter. I guess we were wrong. The full RAMdisk was found on the ESXi host in the following log snippet:
hostd.log
2020-05-19T13:24:14.406Z warning hostd[2099675] [Originator@6876 sub=journal opID=lro-1443200-6d7b9eae-03-01-61-a4a6 user=vpxuser] [JrnlFilePersistenceProvider] Error while writing to file. There is no space left on the device
2020-05-19T13:24:14.406Z info hostd[10630738] [Originator@6876 sub=Hostsvc.VmkVprobSource] VmkVprobSource::Post event: (vim.event.EventEx) {
--> key = 255,
--> chainId = 82,
--> createdTime = "1970-01-01T00:00:00Z",
--> userName = "",
--> datacenter = (vim.event.DatacenterEventArgument) null,
--> computeResource = (vim.event.ComputeResourceEventArgument) null,
--> host = (vim.event.HostEventArgument) {
--> name = "esx04",
--> host = 'vim.HostSystem:ha-host'
--> },
--> vm = (vim.event.VmEventArgument) null,
--> ds = (vim.event.DatastoreEventArgument) null,
--> net = (vim.event.NetworkEventArgument) null,
--> dvs = (vim.event.DvsEventArgument) null,
--> fullFormattedMessage = <unset>,
--> changeTag = <unset>,
--> eventTypeId = "esx.problem.visorfs.ramdisk.full",
--> severity = <unset>,
--> message = <unset>,
--> arguments = (vmodl.KeyAnyValue) [
--> (vmodl.KeyAnyValue) {
--> key = "1",
--> value = "var"
--> },
--> (vmodl.KeyAnyValue) {
--> key = "2",
--> value = "/var/lib/vmware/hostd/journal/1589894654.384"
--> }
--> ],
--> objectId = "ha-host",
--> objectType = "vim.HostSystem",
--> objectName = <unset>,
--> fault = (vmodl.MethodFault) null
--> }
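If you want to check other hosts for the same condition without digging through the logs by hand, a quick grep on the ESXi shell does the job. This is just a minimal sketch, assuming the default log locations; the ramdisk-full event is normally also recorded in vobd.log:
# Search the hostd and VOB logs for the ramdisk-full event
grep -i "ramdisk.full" /var/log/hostd.log /var/log/vobd.log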
The resolution
Since we noticed that the RAMdisk was full on at least two of the hosts, we examined all the ESXi hosts in the cluster. You can easily do this by executing the following command on the ESXi CLI:
vdf -h
This will provide output like the following, in which you can see that the "var" RAMdisk is full:
Ramdisk Size Used Available Use% Mounted on
root 32M 4M 27M 13% --
etc 28M 580K 27M 2% --
opt 32M 120K 31M 0% --
var 48M 48M 0B 100% --
tmp 256M 192K 255M 0% --
iofilters 32M 0B 32M 0% --
shm 1024M 0B 1024M 0% --
hostdstats 803M 8M 794M 1% --
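To drill down into what is actually consuming the space, the busybox versions of du and sort are available on the ESXi shell. A minimal sketch, assuming the default /var layout:
# List the largest files and directories under /var (sizes in KB)
du -ak /var 2>/dev/null | sort -n | tail -n 5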
Continuing to troubleshoot why the "/var" partition is getting full, we can see that it is because of the following file:
[esx04:/var/log/EMU/mili] ls -lsah
total 46004
4 drw-r--r-- 1 root root 512 Feb 24 13:00 .
4 drwxr-xr-x 1 root root 512 Feb 24 13:00 ..
45996 -rw-r--r-- 1 root root 44.9M Mar 17 20:33 mili2d.log
OK, great, we found the culprit. Now let's see why this file is getting so big. After some searches through the VMware KBs, it seems this happens because certain Emulex drivers write their logs to "/var/log/" (on the RAMdisk) instead of "/scratch/log/". This behaviour should be fixed in vSphere ESXi 6.7 U2, but apparently it is not, since we are running a higher version.
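Since the root cause is the driver logging to "/var/log/" instead of "/scratch/log/", it can also be worth verifying where "/scratch" actually points on the affected hosts. On ESXi, "/scratch" is a symlink to the configured scratch location, so a simple listing shows whether it lands on persistent storage (a quick sanity check, not specific to this driver):
# Show where the /scratch symlink points; ideally a persistent datastore path
ls -ld /scratch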
When we look at the content of the log file "mili2d.log", it is filled with the same messages over and over again:
Tue Mar 17 20:32:19 2020,3678490, ERROR:MILI_enumerate_elxiscsi:Failed to initialize User Init with status = 19
Tue Mar 17 20:32:19 2020,3678490, ERROR:MILI_enumerate_elx_nics:Failed to initialize USer Init with status = 19
Tue Mar 17 20:32:19 2020,3678490, ERROR:could not open device node /vmfs/devices/char/vmkdriver/be_esx_nic
Tue Mar 17 20:32:19 2020,3678490, CRITICAL:backend_init:OneConnect Adapter Not Found.
Tue Mar 17 20:32:19 2020,3678492, ERROR:MILI_enumerate_elxiscsi:Failed to initialize User Init with status = 19
Tue Mar 17 20:32:19 2020,3678492, ERROR:MILI_enumerate_elx_nics:Failed to initialize USer Init with status = 19
Tue Mar 17 20:32:19 2020,3678492, ERROR:could not open device node /vmfs/devices/char/vmkdriver/be_esx_nic
Tue Mar 17 20:32:19 2020,3678492, CRITICAL:backend_init:OneConnect Adapter Not Found.
Tue Mar 17 20:33:19 2020,3678528, ERROR:MILI_enumerate_elxiscsi:Failed to initialize User Init with status = 19
Tue Mar 17 20:33:19 2020,3678528, ERROR:MILI_enumerate_elx_nics:Failed to initialize USer Init with status = 19
Tue Mar 17 20:33:19 2020,3678528, ERROR:could not open device node /vmfs/devices/char/vmkdriver/be_esx_nic
Tue Mar 17 20:33:19 2020,3678528, CRITICAL:backend_init:OneConnect Adapter Not Found.
Tue Mar 17 20:33:19 2020,3678529, ERROR:MILI_enumerate_elxiscsi:Failed to initialize User Init with status = 19
Tue Mar 17 20:33:19 2020,3678529, ERROR:MILI_enumerate_elx_nics:Failed to initialize USer Init with status = 19
Tue Mar 17 20:33:19 2020,3678529, ERROR:could not open device node /vmfs/devices/char/vmkdriver/be_esx_nic
Tue Mar 17 20:33:19 2020,3678529, CRITICAL:backend_init:OneConnect Adapter Not Found.
Fortunately there are four easy ways to fix this issue:
- Update your ESXi hosts to a build above ESXi 6.7 U2, where this behaviour should be fixed. As we saw, this is not a guaranteed fix, since we were already running a later build.
- Empty the mili2d.log file:
echo > /var/log/EMU/mili/mili2d.log
- Remove the Emulex driver altogether:
- Put the ESXi Host in Maintenance Mode.
- Execute the following command to remove one of the Emulex drivers in question:
esxcli software vib remove --vibname=elxnet
- Reboot the host.
- If you want to remove ALL Emulex drivers, execute the following command to find them all:
esxcli software vib list | grep EMU
[esx04:/vmfs/volumes/datastore/testbve] esxcli software vib list | grep EMU
brcmfcoe 12.0.1278.0-1OEM.670.0.0.8169922 EMU VMwareCertified 2020-05-26
elxiscsi 12.0.1188.0-1OEM.670.0.0.8169922 EMU VMwareCertified 2020-05-26
elxnet 12.0.1216.4-1OEM.670.0.0.8169922 EMU VMwareCertified 2020-05-26
lpfc 12.4.270.6-1OEM.670.0.0.8169922 EMU VMwareCertified 2020-05-26
- Edit the loglevel for the Emulex drivers. You can do this by editing the following file with vi (use value 0 for disabled/no logging); a one-liner alternative is sketched below the list:
vi /etc/config/EMU/mili/libmili.conf
libmili.conf:
#MILILogLevel: Control the number of messages logged to the mili2d.log file.
#Possible values are 0 - Disabled, 1 - FATAL, 2 - CRITICAL, 3 - ERROR, 4 - WARNING, 5 - INFORMATION, 6 - DEBUG.
MILILogLevel 0
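If you prefer not to edit the file interactively with vi, the same change can be made in one line. This is just a sketch, but busybox sed on ESXi supports in-place editing; depending on the driver version, a reboot may be needed before the new log level takes effect.
# Set the Emulex MILI log level to 0 (disabled) without opening vi
sed -i 's/^MILILogLevel.*/MILILogLevel 0/' /etc/config/EMU/mili/libmili.conf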
You should consult your hardware vendor if you are not sure whether the Emulex drivers are required for your ESXi host to work at all. If you are not using any Emulex hardware, you can safely remove the VIBs.
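A quick way to double-check whether any Emulex hardware is present at all is to look at the PCI device listing on the host. A sketch, assuming the adapters are reported under their Emulex branding:
# List PCI devices and filter for Emulex adapters
esxcli hardware pci list | grep -i emulex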
The conclusion
To conclude this post: in the beginning we had several issues that seemed to be related to another case, but in the end were not. The vMotions were not working for the most part, and doing the vMotions manually was not working either. After troubleshooting, it turned out the issues came from a couple of full RAMdisks, caused by a faulty Emulex driver, on a couple of ESXi hosts in the cluster. Once we cleared those logs, reconfigured the Emulex driver logging, or removed the VIBs and rebooted the ESXi hosts, the vMotions were working again, and putting a host in maintenance mode worked flawlessly once more.
Thanks for reading and until the next blogpost!
1 Comment
Johannes · February 2, 2021 at 4:02 pm
Had a similar problem: vMotion failed from and to a certain host with “A general system error occurred” while “Preparing Virtual Machine for live migration on source host”.
Turned out to also be a full /var, due to a previous test with trivia logging, which was of course reverted, but a large log file remained. Since /var is only 48 or 50 MB small, this was a problem.
File deleted and everything works again. vdf -h is a good step in troubleshooting…!