Pro Tip: Rubrik, part 1

So, I’ve been working with Rubrik for more than a year and so far it’s been quite a journey. All in all, everything has been very positive and it’s nice to work with a vendor who really listens. Lately though, we’ve been having a few issues regarding scalability and session concurrency that I wanted to share.

Sessions

So, first off: Rubrik has a limit of 10 active sessions per user account. This might not seem like a big issue, but since I do pretty much everything via the Rubrik API, it became a real problem for me. Let me explain. I have a lot of vCenters (think 50+ and growing rapidly). Some of the backend jobs that I’ve written run on a per-vCenter basis. As an example, I have a cleanup script that is responsible for monitoring all jobs for a given vCenter. That job was configured to run against all vCenters within a 10-minute window. I had only created a single service account for all of my vCenters, so naturally, when the jobs ran, the first 10 would log in just fine and start processing. When the 11th began, it would make Rubrik kill session number 1, and that could very well cause the job that session 1 was handling to fail.
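
To make the failure mode concrete, here is a toy model of what I observed. It does not talk to Rubrik at all; `SessionTable` and the job names are made up for illustration, and the only assumption is the behaviour described above: a limit of 10 sessions per account, with the oldest session being killed when a new login arrives.

```python
from collections import OrderedDict


class SessionTable:
    """Toy model of a per-account session table that evicts the oldest session."""

    def __init__(self, limit=10):
        self.limit = limit
        self.sessions = OrderedDict()  # session id -> job name, oldest first
        self.next_id = 1

    def login(self, job_name):
        """Open a session; return (session_id, name of the evicted job or None)."""
        evicted = None
        if len(self.sessions) >= self.limit:
            # Table is full: the oldest session is dropped to make room.
            _, evicted = self.sessions.popitem(last=False)
        session_id = self.next_id
        self.next_id += 1
        self.sessions[session_id] = job_name
        return session_id, evicted


table = SessionTable(limit=10)
evictions = []
for n in range(1, 12):  # 11 jobs sharing one service account
    _, evicted = table.login(f"cleanup-vcenter-{n:02d}")
    if evicted is not None:
        evictions.append(evicted)

print(evictions)  # the job that lost its session
```

With 11 logins on one account, the job for the first vCenter is the one that loses its session, which matches the failures I was seeing.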

I haven’t seen this limit mentioned anywhere, but if you write to support, they can increase it.
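
Until the limit is raised, one possible stopgap is to throttle on the client side so that one account never has more than 10 jobs logged in at once. The sketch below is not what I run in production: it assumes all jobs are started from a single Python process, and `run_job` is a placeholder where the real login, work and logout would go.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

SESSION_LIMIT = 10  # the per-account limit; pick a lower value to leave headroom
session_slots = threading.BoundedSemaphore(SESSION_LIMIT)

active = 0  # jobs currently holding a slot
peak = 0    # highest number of simultaneous jobs seen
counter_lock = threading.Lock()


def run_job(vcenter):
    """Placeholder job: hold one session slot for the duration of the work."""
    global active, peak
    with session_slots:  # blocks while SESSION_LIMIT jobs are already running
        with counter_lock:
            active += 1
            peak = max(peak, active)
        try:
            time.sleep(0.01)  # stands in for login, API calls and logout
        finally:
            with counter_lock:
                active -= 1
    return vcenter


vcenters = [f"vcenter-{n:02d}" for n in range(1, 51)]
with ThreadPoolExecutor(max_workers=len(vcenters)) as pool:
    finished = list(pool.map(run_job, vcenters))

print(f"{len(finished)} jobs finished, at most {peak} at the same time")
```

If the jobs are separate scheduled scripts rather than threads in one process, the same idea needs a shared lock of some kind, or simply one service account per group of vCenters.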

Concurrent backups per ESXi host / concurrent tasks per Rubrik node

Our current installation handles somewhere around 2000 backups a day. Recently though, we started seeing a lot of missed backups every day. When researching this, it quickly became evident that the Rubrik cluster had a limit of 3 concurrent backups per Rubrik node: we have 20 nodes, and the cluster was constantly peaking at 60 running tasks every night. This number stayed constant throughout the backup window.

After talking with their support, I was told that they have the following default settings:

  • Concurrent backups per ESXi host: 3
  • Concurrent tasks per Rubrik node: 3

They can tweak both values if asked nicely. In my case, they ran some analysis on the cluster to see the activity and then changed the concurrent tasks per node to 4. This number will probably be tweaked a bit more for our installation, because it is quite worrisome that a cluster with 20 nodes, 1.2 TB of memory and 160 CPU cores is only able to handle 60 concurrent backups by default.
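
For a rough feel of what the per-node setting means for a cluster like ours, here is some back-of-the-envelope arithmetic. It only uses the numbers from this post (treating 1.2 TB as 1200 GB); the function name and the "waves" figure are my own simplification, and it says nothing about how long a backup takes or how Rubrik actually schedules tasks.

```python
import math

NODES = 20
CPU_CORES = 160
MEMORY_GB = 1200       # 1.2 TB, counted as 1200 GB
DAILY_BACKUPS = 2000   # roughly what we run per day


def cluster_capacity(tasks_per_node):
    """Return the concurrency ceiling and the resources behind each task slot."""
    slots = NODES * tasks_per_node
    return {
        "slots": slots,
        "cores_per_slot": round(CPU_CORES / slots, 2),
        "memory_gb_per_slot": round(MEMORY_GB / slots, 1),
        # Minimum number of full rounds needed if every slot is always busy.
        "min_waves": math.ceil(DAILY_BACKUPS / slots),
    }


for tasks_per_node in (3, 4):
    print(tasks_per_node, cluster_capacity(tasks_per_node))
```

Going from 3 to 4 tasks per node raises the ceiling from 60 to 80 concurrent tasks, and even then each task slot still has 2 cores and 15 GB of memory behind it.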
