Recently I was troubleshooting an environment that had some weird issues with the centralized VMware Tools repository we had configured. I will not bore you with those details; they will be explained in a future blogpost. But I also ran into some issues with vMotion in this environment. At first I thought it was part of the VMware Tools problem, but after some troubleshooting it turned out it was not.

The issue

The issue was as follows. vMotion was working, but only partly, and we only noticed it when we wanted to put a host in maintenance mode. Doing this would vMotion several dozen Virtual Machines to the other available hosts, but the ESXi host would only complete 4 vMotions at a time, always exactly 4. After those 4, every other vMotion would fail. Strangely, all subsequent vMotions failed as well, no matter which host in the cluster we picked as the source or destination. The build we used on this cluster was VMware ESXi 6.7.0, 15018017 (ESXi 6.7 EP 13).

When we tried the vMotion manually, we received errors that the Virtual Machine “was already running a task” (it was not) and we could not proceed. When we waited a few minutes and tried again, the Virtual Machine would vMotion without any problems. Just never more than 4 at a time.

They would fail with the following error messages found in several logs and within the vCenter vSphere Client:

vCenter vSphere Client vMotion error message

It took us some time to figure this out, but in the end we noticed that 2 of the 9 hosts in the cluster had a full RAMdisk. The reason it took us a while is that these two hosts were never used as the source or destination for the vMotions, so we did not think they would matter. I guess we were wrong. The full RAMdisk showed up in the following log snippet on the ESXi host:

The resolution

Since we noticed that the RAMdisk was full on at least two of the hosts, we examined all the ESXi hosts. You can easily do this by executing the following command on the ESXi CLI:
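Most likely this was the vdf tool, which the commenter below also recommends:

    # Show tardisk and RAMdisk usage; a full RAMdisk shows up as 100% in the Use% column
    vdf -h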

This lists every tardisk and RAMdisk with its size and usage, and in our case it showed that the RAMdisk behind /var was completely full.

If we continue to troubleshoot and check why the “/var” partition is filling up, we can see that it is because of a single file: /var/log/mili2d.log.
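If you want to track down the offending file yourself, a rough way to do it from the ESXi shell (using nothing more than the built-in BusyBox tools) is:

    # Show the size of everything under /var, biggest entries last
    du -a /var | sort -n | tail -n 10
    # Or just look at the log directory and check for unusually large files
    ls -lh /var/log/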

Ok, great, we found the culprit. Now let's see why this file is getting so big. After some searching through the VMware KBs, it seems this happens because certain Emulex drivers write to the “/var/log/” location instead of “/scratch/log/”. This behaviour was supposed to be fixed in vSphere ESXi 6.7 U2, but apparently it is not, since we are running a later build.
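As a quick sanity check (assuming the host has a persistent scratch location configured), you can compare where the log files actually live: most of them are symlinks into /scratch/log, while mili2d.log is written straight onto the RAMdisk:

    # Most ESXi logs are symlinks pointing at the persistent scratch location...
    ls -l /var/log/vmkernel.log
    # ...while mili2d.log is a regular file on the RAMdisk, so it consumes memory instead of scratch space
    ls -l /var/log/mili2d.log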

When we look at the content of the log file “mili2d.log”, it is filled with the following couple of snippets over and over again:

Fortunately there are four easy ways to fix this issue:

  1. Update your ESXi hosts to a build above ESXi 6.7 U2. This is not a guaranteed fix (we were already on a later build), but the issue is supposed to be fixed from that release onwards.
  2. Empty the mili2d.log file with echo > /var/log/mili2d.log.
  3. Remove the Emulex driver altogether (see the sketch after this list):
    1. Put the ESXi host in Maintenance Mode.
    2. Execute the following command to remove the Emulex driver in question: esxcli software vib remove --vibname=elxnet
    3. Reboot the host.
    4. If you want to remove ALL Emulex drivers, execute the following command to find them all: esxcli software vib list | grep EMU
  4. Edit the loglevel for the Emulex drivers. You can do this by executing the following couple of commands (use value 0 for disabled/no logging):
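For option 3, a rough sketch of the whole sequence on the ESXi shell looks like this (the VIB names will differ per environment, so check the list output before removing anything):

    # Enter maintenance mode (this can also be done from the vSphere Client)
    esxcli system maintenanceMode set --enable true
    # List the Emulex VIBs installed on this host
    esxcli software vib list | grep -i emu
    # Remove the network driver VIB; repeat for any other Emulex VIBs you want gone
    esxcli software vib remove --vibname=elxnet
    # Reboot the host to complete the removal
    esxcli system shutdown reboot --reason "Removing Emulex VIBs"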

You should consult your hardware vendor if you are not sure whether the Emulex drivers are required for your ESXi host to work at all. If you are not using any Emulex hardware, you can safely remove the VIBs.
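One way to get a feel for this yourself (a quick check, not a replacement for the vendor's advice) is to see whether any NICs or HBAs on the host are actually claimed by an Emulex driver:

    # NICs and the driver that claims them; Emulex NICs show up with the elxnet driver
    esxcli network nic list
    # Storage adapters and their drivers; Emulex FC HBAs typically use the lpfc driver
    esxcli storage core adapter list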

The conclusion

So, to conclude this post: in the beginning we had several issues that seemed to be related to another case, but in the end were not. The vMotions were mostly failing, and performing them manually often failed as well. After troubleshooting, it turned out the issues came from a couple of full RAMdisks on a couple of ESXi hosts in the cluster, caused by a faulty Emulex driver. Once we cleared the log file, reconfigured the Emulex driver logging, or removed the VIBs and rebooted the ESXi host, the vMotions were working again, and putting a host in maintenance mode worked flawlessly once more.

Thanks for reading and until the next blogpost!

One Comment

  1. Had a similar problem: vMotion failed from and to a certain host with “A general system error occurred” while “Preparing Virtual Machine for live migration on source host”.

    It also turned out to be a full /var, due to an earlier test with trivia logging, which was of course reverted, but a large log file remained. Since /var is only 48 or 50 MB in size, this was a problem.

    File deleted and everything works again. vdf -h is a good step in troubleshooting…!

    Johannes
