Recently I found myself having to recover a VM that had most of its files deleted. They were removed by a script that was intended to clean up a staging datastore. Unfortunately the VM in question had been left on that staging datastore, and the script did its magic.
Because the VM was powered on at the time, the ESXi host running it held locks on a VMDK -flat file and a log file, so those two files were preserved. I also noticed a .hlog file, which had me worried because it is used by vCenter to track VM files that are to be removed. In other words: power off the VM and the remaining files will be removed as well. To make things worse, there was no recent backup of this particular VM, so returning it to a healthy state would be quite the challenge. I hoped I could simply SSH into ESXi and copy the files. Unfortunately, all attempts to copy the files from within ESXi failed. I did catch a break when I learned the staging datastore was backed by an NFS export hosted on a NetApp which had snapshots enabled. That gave me a copy of the files to tinker with.
Pretty straightforward troubleshooting up until this point, right? But this is where it gets interesting. How does one recreate all relevant files for a VM when all you have is a VMDK -flat file and a log file? Well, as it turns out, VMware GSS had encountered this problem before: there were two knowledge base articles that were essential in recovering this VM to a healthy state.
First I recreated the VMDK descriptor file. I followed the procedure in this KB article to create a new descriptor file using vmkfstools: Recreating a missing VMware virtual machine disk descriptor file (.vmdk) (1002511)
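To give an idea of what that procedure boils down to, here is a minimal Python sketch. It is not the KB's own script. It only works out, from the size of the -flat file in bytes, the kind of vmkfstools command used to create a temporary disk of matching size and the extent line the finished descriptor should end up with. The function name, the file name myvm-flat.vmdk, the temp.vmdk name and the lsilogic adapter type are all assumptions for illustration, so check the KB article for the exact steps on your ESXi version.

```python
SECTOR_SIZE = 512  # VMDK descriptors express extent sizes in 512-byte sectors


def descriptor_hints(flat_name, size_bytes, adapter="lsilogic"):
    """Return a vmkfstools command and extent line for a given -flat file.

    flat_name, adapter and the temp.vmdk name are illustrative assumptions;
    size_bytes is the size of the -flat file as reported by `ls -l`.
    """
    if size_bytes <= 0 or size_bytes % SECTOR_SIZE != 0:
        raise ValueError("flat file size must be a positive multiple of 512")

    sectors = size_bytes // SECTOR_SIZE
    return {
        # Creates a temporary disk of the same size; its descriptor is the
        # part worth keeping, the temporary -flat file gets thrown away.
        "create_cmd": f"vmkfstools -c {size_bytes} -d thin -a {adapter} temp.vmdk",
        # The extent line in the descriptor must point at the original -flat file.
        "extent_line": f'RW {sectors} VMFS "{flat_name}"',
    }


if __name__ == "__main__":
    hints = descriptor_hints("myvm-flat.vmdk", 4 * 1024**3)
    print(hints["create_cmd"])
    print(hints["extent_line"])
```

The sector count is the part that has to be right: if it does not match the size of the surviving -flat file, the descriptor describes a different disk than the one you actually have.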
Next I was able to generate a new VMX file using the information in the log file. This was also a very easy process that was described in detail by a KB article. Rebuilding the virtual machine’s .vmx file from vmware.log (1023880)
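The idea behind that KB is that vmware.log contains a dump of the VM's configuration at power-on, so the settings can be read back out of it. The Python sketch below illustrates the principle. It is not the script from the KB, and it assumes the log marks the configuration section with a line containing "DICT --- CONFIGURATION", ends it at the next "DICT ---" marker, and writes each setting as "DICT key = value". The function name and the sample log lines are made up for illustration.

```python
import re

# Assumed shape of a configuration line in vmware.log: "... DICT  key = value"
DICT_LINE = re.compile(r"DICT\s+(\S+)\s+=\s+(.*)$")


def vmx_lines_from_log(log_text):
    """Turn the CONFIGURATION section of a vmware.log into .vmx-style lines."""
    lines = ['.encoding = "UTF-8"']
    in_config = False
    for line in log_text.splitlines():
        if "DICT --- CONFIGURATION" in line:
            in_config = True          # start of the section we care about
            continue
        if in_config and "DICT ---" in line:
            break                     # next section marker: stop reading
        if in_config:
            match = DICT_LINE.search(line)
            if match:
                key = match.group(1)
                value = match.group(2).strip().strip('"')
                lines.append(f'{key} = "{value}"')
    return lines


if __name__ == "__main__":
    # Hypothetical log excerpt, only to show the transformation.
    sample = "\n".join([
        "2013-01-01T10:00:00.000Z| vmx| DICT --- CONFIGURATION",
        "2013-01-01T10:00:00.000Z| vmx| DICT          displayName = myvm",
        "2013-01-01T10:00:00.000Z| vmx| DICT              memsize = 4096",
        "2013-01-01T10:00:00.000Z| vmx| DICT --- USER DEFAULTS",
    ])
    print("\n".join(vmx_lines_from_log(sample)))
```

Work on a copy of the log and review the result before registering the VM. Paths in the log, such as the disk file name, may need adjusting if files were renamed or moved during the recovery.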
After I completed these two steps, I was able to add the VM to vCenter without any issues, and it booted like a charm.
After recovering from this situation I learned a few things:
- Don’t run VMs from a staging datastore (duh!)
- Snapshots are not backups, but can sometimes save your ass
- Never underestimate the power of VMware’s knowledge base