I’ve been working with Rubrik for more than a year now, and so far it’s been quite a journey. All in all, everything has been very positive and it’s nice to work with a vendor who really listens. Lately, though, we’ve run into a few issues around scalability and session concurrency that I wanted to share.

Sessions

So, first off: Rubrik has a limit of 10 active sessions per user account. This might not seem like a big issue, but since I do pretty much everything via the Rubrik API, it actually became one for me. Let me explain. I have a lot of vCenters (think 50+ and growing rapidly), and some of the backend jobs I’ve written run on a per-vCenter basis. As an example, I have a cleanup script that is responsible for monitoring all jobs for a given vCenter, and it was configured to run against all vCenters within a 10-minute window. I had only created a single service account for all of my vCenters, so naturally, when the jobs ran, the first 10 would log in just fine and start processing. When the 11th began, Rubrik would kill session number 1, which could very well cause the job that session 1 was handling to fail.

This limit isn’t documented anywhere, but if you write to support, they can increase it.
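
In the meantime, the easiest mitigation on my side was to give each vCenter its own service account and to make every job release its session the moment it finishes, instead of leaving it for Rubrik to evict. Below is a minimal sketch of that pattern. The endpoints and response fields are my recollection of the v1 API (POST /api/v1/session to log in, DELETE /api/v1/session/{id} to log out, a Bearer token in the Authorization header), so double-check them in the API explorer for your CDM version; the cluster address, credentials and the GET call inside the job are placeholders.

import requests
from requests.auth import HTTPBasicAuth

# Placeholders only: swap in your cluster address and a per-vCenter service account.
RUBRIK = "https://rubrik.example.com"
USER, PASSWORD = "svc-vcenter-01", "not-a-real-password"


def run_cleanup_job(token: str) -> None:
    """Stand-in for the actual per-vCenter cleanup/monitoring logic."""
    headers = {"Authorization": f"Bearer {token}"}
    # Assumed endpoint; only here to show an authenticated call being made.
    requests.get(f"{RUBRIK}/api/v1/vmware/vcenter", headers=headers, verify=False)


# Log in once at the start of the job (assumed: POST /api/v1/session returns
# a JSON body containing at least "id" and "token").
session = requests.post(
    f"{RUBRIK}/api/v1/session",
    auth=HTTPBasicAuth(USER, PASSWORD),
    verify=False,
).json()

try:
    run_cleanup_job(session["token"])
finally:
    # Explicitly log out so the slot is freed immediately, instead of waiting for
    # Rubrik to evict the oldest session when connection number 11 shows up.
    requests.delete(
        f"{RUBRIK}/api/v1/session/{session['id']}",
        headers={"Authorization": f"Bearer {session['token']}"},
        verify=False,
    )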

Concurrent backups per ESXi host / concurrent tasks per Rubrik node

Our current installation handles somewhere around 2,000 backups a day. Recently, though, we started seeing a lot of missed backups every day. When researching this, it quickly became evident that the cluster was limited to 3 concurrent tasks per Rubrik node: we have 20 nodes, and the cluster sat flat at 60 running tasks every night, for the entire backup window.

After talking with their support, I was told that they have the following default settings:

  • Concurrent backups per ESXi host: 3
  • Concurrent tasks per Rubrik node: 3

They can tweak both values if asked nicely. In my case, they ran some analysis on the cluster’s activity and then raised the concurrent tasks per node to 4. This number will probably be tweaked a bit more for our installation, because it’s quite worrisome that a cluster with 20 nodes, 1.2 TB of memory and 160 CPU cores can only handle 60 concurrent backups by default.
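
To put that 60 in perspective, here’s the back-of-the-envelope maths I did. The node count, per-node default and nightly backup count are from above; the 8-hour backup window and the resulting per-job figure are purely illustrative assumptions, not measurements from our cluster.

# Why 60 concurrent task slots gets tight with ~2000 backups a night.
nodes = 20
tasks_per_node = 3                   # Rubrik's default concurrent tasks per node
backups_per_night = 2000
window_minutes = 8 * 60              # assumed nightly backup window, for illustration

slots = nodes * tasks_per_node       # 60 cluster-wide task slots
# Longest average job duration that still lets every backup finish inside the window.
max_avg_job_minutes = slots * window_minutes / backups_per_night

print(f"Concurrent slots: {slots}")
print(f"Average backup must finish in <= {max_avg_job_minutes:.1f} minutes")
# With these numbers that's roughly 14 minutes per backup; a handful of large,
# slow VMs holding on to slots is enough to push the tail past the window and
# show up as missed backups.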

Proxies

I’ve sat through a few VMUG sessions, WebEx presentations and sales pitches from newer, “cooler” backup vendors (you know who you are) who have made the argument that proxies are the enemy: they’re up to no good and provide nothing but complexity and annoyance for anyone who manages them. This point usually came up around the time they showed a PowerPoint comparing their solution to the other vendors out there, like IBM (with Spectrum Protect for Virtual Environments) or Veeam.

And yes, I get that for the average enterprise, the requirement to deploy a few proxies (or many) can be annoying, but my guess is that’s mostly down to the cost and the fact that getting them deployed by someone else can take anywhere from days to weeks (depending on how siloed up the organisation is). Most of that is not really the proxies’ fault.

Personally, it takes me around two hours to deploy a Veeam setup from start to finish. That includes creating the networks, three VMs (Console, Repository and, if needed, Proxy) and mapping some storage to the repository VM. So hey, not that big of a deal.

But why are they not the enemy? All I’ve said so far is that they’re not that bad. Well, for me the proxies provide a lot of flexibility, purely networking-wise: they enable restores directly to the VMs from my repository. Why is that? To explain it, you have to understand how our products are built.

I won’t go into too much detail here, but the short and sweet of it is that my customers’ VMs cannot reach or talk to the infrastructure they are running on top of. In short, a VM cannot reach the Veeam infrastructure directly, and the Veeam infrastructure cannot reach the VMs. This is not due to firewall limitations; it’s due to them being on separate routed networks where we don’t decide what IP space the customers use, which means a lot of overlapping address space. All of this is by design. So as you can see, restores would be difficult if the proxies weren’t around.

What the proxies give me is a machine that I can dual-home (i.e. two NICs): the default gateway sits on the customer’s routed network, and a few (three) static routes point at the things the proxy needs to reach on my network (the VMware and Veeam infrastructure). With that in place, the customer can do his restores as he pleases. It does require the customer to forfeit the use of three /24 subnets, but that has never been an issue.
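
For the curious, the proxy-side configuration really does boil down to a handful of persistent routes. Here’s a rough sketch of what that could look like on a Windows proxy; the gateway and subnets are made-up placeholder ranges (not my real address space), and the only real assumption is the classic “route -p ADD” syntax.

import subprocess

# Placeholders only: documentation subnets, not my real address space.
# The proxy's default gateway lives on the customer's routed network; these
# persistent routes send traffic for my backend out of the second NIC instead.
INFRA_GATEWAY = "10.10.10.1"        # gateway on the backend-facing NIC (assumed)
INFRA_ROUTES = [
    "192.0.2.0/24",                 # VMware infrastructure (placeholder)
    "198.51.100.0/24",              # Veeam infrastructure (placeholder)
    "203.0.113.0/24",               # repository / other backend services (placeholder)
]


def add_persistent_route(cidr: str, gateway: str) -> None:
    network, prefix = cidr.split("/")
    # Turn the prefix length into a dotted netmask for the classic `route` tool.
    bits = (0xFFFFFFFF << (32 - int(prefix))) & 0xFFFFFFFF
    mask = ".".join(str((bits >> shift) & 0xFF) for shift in (24, 16, 8, 0))
    # Windows: route -p ADD <network> MASK <mask> <gateway>  (persists across reboots)
    subprocess.run(["route", "-p", "ADD", network, "MASK", mask, gateway], check=True)


for cidr in INFRA_ROUTES:
    add_persistent_route(cidr, INFRA_GATEWAY)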

Rubrik has asked a few times whether I had plans to start using their agents for in-guest backups. I’ve said no every time and given them a detailed explanation as to why. For the customers I have who use Veeam to back up application-level items (or just file-level), the proxies are a critical component, and as such, they cannot be moved to a competing product until that product offers the same functionality.

So that’s why I think that proxies are great. They provide a lot of flexibility. A lot more than annoyance and complexity if you ask me 🙂