Introduction

For the last two weeks I’ve been working on a big issue in our environment. In this environment a multitude of products are used, with Veeam, VMware Cloud Director (VCD) and vSAN as the main products. The issue we had was that the backup performance wasn’t at the level you would expect it to be. Even worse, the single stream IO performance looked like it was bottlenecked or capped at around 50MB/s. That is pretty slow if you consider that the following hardware is used in this specific environment:

  • A vSAN environment consisting of a dozen or more HPE Gen 10 DL380 servers with NVMe cache drives and very quick SSD capacity drives.
  • A HPE Apollo XL450 Gen 9 with a 256TB logical volume in a RAID 60 configuration.
  • The latest v10 Veeam software.

The customers that we have in VCD are able to use the Veeam Backup plugin for VCD to create and edit backup jobs and restore their VMs with it.

Troubleshooting

When looking at the past Veeam backup job history I noticed a certain trend, which is how I discovered this issue in the first place. Have a look at the Veeam backup job history below for one of the jobs on the platform:

Veeam Backup job performance insight

As you can see, the Processing Rate was around 78MB/s for this particular job, which is pretty low, especially for this environment. If we zoom in on the Processing Rate and look at the Read speed (green line) and the Transfer speed (red line) in the Throughput (All Time) graph behind it, there is an easy-to-spot trend. The Read speed changes from time to time, which makes sense. The Transfer speed however is always the same: the red line doesn’t fluctuate like the green one does. I also had a look at some other jobs that had more VMs configured in them. These actually had a significantly higher transfer rate, and I quickly learned that this was simply because they had more concurrent IO streams running. When looking at a single IO stream for a particular VM inside such a job, I noticed the same 50MB/s-ish transfer speed.

So at this point I knew something was bottlenecking the performance somewhere in the platform. If you ask me, the following pieces of the environment could potentially be the bottleneck:

  • The vSAN environment or vSAN specific Veeam configurations.
  • The hardware itself and the ReFS configuration.
  • The network.

In the following paragraphs I will walk you through these pieces one by one. You might find them interesting for your own environment if you want or need to test any specific performance related issues. If you’ve paid attention to the screenshot above, you might have noticed that the bottleneck is the “Target”, which is the repository. Continue reading the blogpost if you want to find out if this is true!

vSAN / Veeam configuration

In a vSAN environment, the configured Storage Policy might affect the backup performance a VM proxy can deliver. Within our environment we have the VM proxies configured with a RAID 1 FTT 1 policy, which yields the most performance you can get, next to RAID 0, while still having a copy of the data. The performance with this SPBM policy is more than sufficient; I’ve benchmarked it plenty of times in the past.

Within Veeam you should also be aware that you need to configure the proxies with the Virtual Appliance transport mode. Veeam also has a deep integration with vSAN in which Veeam will automatically choose the most suitable VM proxy to back up the data, based upon data locality. Since we don’t have as many VM proxies in the environment as hosts, this is not really applicable in my situation, but it is still something to remember.

The VM proxies themselves were also not the bottleneck in this situation, since the proxies were configured with plenty of CPUs and concurrent tasks, just like the repository for that matter. So no configured limits from the Veeam perspective were present.

Hardware and ReFS

So like I said before, the hardware is pretty new and really top notch if you ask me. There are a couple of places where the configuration on the HPE Apollo server might be incorrect though, so to be sure I double-checked them all. The RAID controller used in the Apollo was configured with a 90% write cache, 10% read cache and 256KB block size, as mentioned in the Veeam Best Practices Guides. Also mentioned in these guides are the Windows ReFS partition settings, such as the 64KB cluster size and the enablement of Block Cloning, which Fast Clone in Veeam relies on. You should also be using ReFS version 3.1 or higher. You can check these settings with the following couple of commands executed on the repository:

Windows ReFS configuration volumeinfo
Windows ReFS configuration ReFS volume
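As a rough sketch, the checks shown in the screenshots can be reproduced with fsutil from an elevated prompt (D: is an assumed drive letter here, adjust it for your repository volume):

```shell
# Show general volume information, including the file system type:
fsutil fsinfo volumeinfo D:

# Show ReFS-specific details, such as the ReFS version and bytes per cluster:
fsutil fsinfo refsinfo D:
```

The refsinfo output is where you can verify the ReFS version (3.1 or higher) and the 64KB cluster size.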

Because I wanted to know if the local storage was the issue, or maybe even the ReFS partition, I ran the following two tests:

  1. Atto Disk Benchmark test.

This is my go-to test if I want to do a very quick, basic storage performance test. The results looked pretty sufficient to me. You can even bypass the controller write cache if you want, which is more or less the way Veeam works too.

Atto Disk Benchmark quick test
  2. Diskspd disk benchmark test

This is the second storage load test I run most of the time. Diskspd is a storage load generator and performance test tool created by Microsoft’s Windows engineering teams. If you want to know which specific test you should run, have a look at the following Veeam KB. This KB explains which diskspd commands you need to issue to simulate a specific Veeam Backup & Replication disk action. I used the “Active full or forward incremental” test like below:
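The authoritative flags are in the Veeam KB; a command along these lines matches the parameters described below (the test file path is a placeholder, and the exact flag combination is my reconstruction, so verify it against the KB):

```shell
# 25 GB test file (-c25G), 512 KB blocks (-b512K), 100% writes (-w100),
# hardware and software caching disabled (-Sh), 600 second duration (-d600):
diskspd.exe -c25G -b512K -w100 -Sh -d600 D:\testfile.dat
```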

This does a load test for 600 seconds on a 25GB file with a 100% write, 0% read ratio. It gave me the following result:

Diskspd test results

I think 2.7 million IOPS and a couple of GB/s of throughput should be sufficient for my backups, right? So this part was checked off my list!

Network

The Veeam Backup and Replication management server is located on the same site as the backup proxy virtual machines. The repositories however are located at another geographical site, which means the network between these two might be the bottleneck in this situation. There is a pretty easy test you can do to check if you hit a limit somewhere in the network: an iperf test. You run iperf in server mode (iperf3 -s) on a server at the first site, and run iperf in client mode at the target site (iperf3 -c sourceserver). So that is exactly what I tried, and the result was pretty convincing:

iperf3 network speed test performance

You can try and spice up the performance by using more streams in the iperf test with the “-P X” switch, where X is the number of parallel streams. On my 10Gb interface I got over 8.5Gb/s of throughput. So this part was checked off my list too.
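Put together, the test from both sites looks like this (the hostname is a placeholder, and 8 streams is just an example value for -P):

```shell
# On a server at the source site, start iperf3 in server mode:
iperf3 -s

# On the target site, run the client against it, here with 8 parallel streams:
iperf3 -c sourceserver -P 8
```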

The fix

After vigorously going through the log files and every piece of this environment I didn’t find anything useful. Going through the log files one more time, I did however see something interesting in the backup job agent logs located on the repository:

And on the log file for the job located on the Veeam Backup and Replication server I also found something similar:

Both of these log files were spammed with thousands and thousands of entries like the above. Correlating these log entries, I figured this might be because we are using Veeam Enterprise Manager and Tenant quotas on ORG VDCs in this environment. But after looking at the storage quotas it seemed like these were fine, and none of the quotas were actually full or nearing full.

So I created a Veeam support case and had a couple of great sessions with them. After looking at the log files with some escalation engineers we found out that the fix is something that’s not that well known, or even documented for that matter.

It turns out that when you are using Veeam software in conjunction with a Tenant configuration, Veeam uses a logic called “quota allocation”. This logic is mostly used in Veeam Cloud Connect environments, in which you have multiple tenants that can either replicate or back up to your Service Provider Veeam Cloud Connect environment. It didn’t really make sense to me, because we don’t have the Veeam Cloud Connect Service Provider side in this environment. It turns out that this logic however also applies to environments in which any Tenant-like structure with quotas is used, such as Veeam Enterprise Manager environments that have vSphere/vCloud Tenant quotas inside a VCD configuration. I would never have guessed this without the help of Veeam support!

So what does this “quota allocation” logic do? Well, it uses a mechanism that allocates a certain amount of data to the backup file the IO stream is writing to. Think of it this way: when the backup starts, there is an empty backup file. This file grows by 512MB per 15 seconds (if needed, and by default). This means that the throughput can reach a maximum of roughly 34MB per second per allocation increment. After 15 seconds the quota allocation for this backup file (and with it the stream) is increased by another 512MB, providing the backup agent with another 512MB of writable backup data. This goes on and on until the backup is finished and the IO stream closes.
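The resulting ceiling is simple arithmetic: 512MB granted per 15-second interval caps what a single stream can sustain:

```shell
# 512 MB allocated every 15 seconds caps a single IO stream at:
echo "$(( 512 / 15 )) MB/s per stream"   # prints "34 MB/s per stream"
```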

Most Veeam Cloud Connect environments never get above 34MB/s per tenant, especially when there are hundreds of tenants. In our situation however, one can already imagine the bottleneck. Our environment is a lot faster than the maximum throughput of 34MB per second. Because of this, the 512MB allocation quota is already full within less than a second, which means that our Veeam backup environment waits out the remaining 14 seconds (about 94% of the interval) until the next 512MB increment has been allocated to the quota. This is what you are seeing in the log file snippets I posted above. It more or less means that our backup environment is idling most of the time.

Talking with support, I got the suggestion to change this internal logic to another value so that the environment doesn’t need to wait anymore. If you want to change the internal logic, you will have to create a registry key such as the one below:
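The value name comes straight from support; the hive path below is the usual Veeam Backup & Replication registry location and the service name is the standard one, but double-check both with support before applying this in your own environment (value 1 here selects the “Veeam Agent” mode explained below):

```shell
# Create the DWORD value under the Veeam B&R registry hive (run elevated):
reg add "HKLM\SOFTWARE\Veeam\Veeam Backup and Replication" /v CloudConnectQuotaAllocationMode /t REG_DWORD /d 1 /f

# Restart the Veeam Backup Service so the new value is picked up:
net stop VeeamBackupSvc
net start VeeamBackupSvc
```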

This registry key called “CloudConnectQuotaAllocationMode” has a couple of values which you can choose from:

  1. Default – With the value “0” the quota allocation logic operates like I explained above: it increases the allocated quota by 512MB every 15 seconds.
  2. Veeam Agent – With the value “1” the Veeam backup agent asks the server for the required quota and gets it immediately, without waiting a certain amount of seconds.
  3. Hybrid Mode – With the value “2” both of the above happen: the quota is increased by 512MB every 15 seconds, and the Veeam Agent asks for, and gets, the required amount.

Once you’ve created the registry key you will have to restart the Veeam Backup Service. Once this is done, the logic is changed to whatever value you set.

After changing the logic, I tried one of the larger backup jobs again. Only this time, the performance was enhanced “just a little bit”:

Veeam Backup job performance insight after applying fix

Not only has the transfer speed increased by ~12 times, read speeds have also increased about 7 times. And this is only one example: other jobs have seen an increase of up to 50 or 100 times. The backup environment is so much quicker now after applying the registry key. Jobs that took over 40 hours are now done in less than 30 minutes! Now that the environment doesn’t need to wait for a larger quota on the data file, it can keep pumping data into the file without any hesitation.

Another interesting thing to notice here is that the target is no longer the bottleneck; rather the proxy is. This was simply because the proxy needed more CPU resources.

Conclusion

We started off with a performance issue in our Veeam Backup environment, used together with vSAN and VCD. As with any performance issue, most of the time the best approach is to go through a list of suspects to try and pinpoint the issue at hand. In the beginning, looking at the Veeam backup job report, one might suspect the backup target repository to be the bottleneck. But we quickly learned that this was not the case, since the repository delivered a couple of million IOPS and a significant amount of throughput. Once we went through the logs we found some interesting entries which helped us figure out the issue.

The issue was related to the fact that, because we are using a Tenant configuration set up through Veeam Enterprise Manager in combination with VMware Cloud Director, a certain Veeam logic called “quota allocation” was being used. This mechanism prevented the environment from delivering the performance it could, because it had to wait 15 seconds before receiving another piece of quota it could write to.

Once we added the suggested Veeam registry key to the environment, the Veeam logic was changed, which resulted in a significant increase in transfer speeds. In general this improved read and transfer speeds for the jobs by at least 10x, in some situations even 50-100 times! Since this information is nowhere to be found on the internet, except for the Veeam v10 release notes I found somewhere (which do not explain the use case in combination with VCD), I figured I’d post this for everybody.

There is an internal feature request with Veeam to change this logic per Tenant or Repository, so that there is a better form of control. So if you ask Veeam Support about this, they can hit you up with the information you need.

I hope this blogpost helps you out, so that you don’t have to go through the pain I did to find the issue!
