Performance enhancements on distributed asynchronous optimization
Q-CTRL develops quantum control software to solve the hardest problems facing quantum technology, improving hardware performance and accelerating pathways to useful quantum computers and other technologies. We do this by leveraging the power of cloud computing - including parallelization and GPU acceleration. However, with every new development, the platform requires revision to ensure the highest quality and performance standards are maintained.
Q-CTRL recently released a new version of its optimization engine which introduced significant performance improvements. However, when deploying the upgraded version to the Q-CTRL cloud platform, the expected performance gains were not realized and, in some cases, performance was worse than the previous version. This led us to investigate the issues, with some simple but powerful outcomes.
The strategy
- Run experimental optimizations for five to eight qubits with multiple segment sizes.
- Time five local optimizations on a MacBook Pro for each set of parameters.
- Time five cloud optimizations on the Q-CTRL cloud platform for each set of parameters.
With good reference numbers and comparable calculations, we profiled the application and started digging into anything that stood out - either because of the time it took or the number of times it happened. The new execution time would then become the reference for the next enhancement.
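As a rough illustration, a timing harness along the following lines is enough to produce comparable numbers. The qubit counts match the experiment described above, but the segment counts and the `run_optimization` callable are placeholders, not Q-CTRL internals:

```python
import statistics
import time

# Illustrative harness only: qubit counts follow the experiment above, while
# the segment counts and run_optimization callable are placeholders.
QUBIT_COUNTS = [5, 6, 7, 8]
SEGMENT_COUNTS = [64, 128, 256, 512]

def benchmark(run_optimization, repeats=5):
    """Time `repeats` runs per parameter set and return the mean wall time."""
    results = {}
    for qubits in QUBIT_COUNTS:
        for segments in SEGMENT_COUNTS:
            timings = []
            for _ in range(repeats):
                start = time.perf_counter()
                run_optimization(qubits, segments)  # local call or cloud request
                timings.append(time.perf_counter() - start)
            results[(qubits, segments)] = statistics.mean(timings)
    return results
```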
As a starting point, we had the following average time to complete the 16 optimizations sequentially:
- Local execution on MacBook Pro: ~18 min.
- Using the initial cloud architecture/design: ~97 min.
Do not underestimate the number of HTTP requests
The first thing we noticed was the vast number of HTTP requests being made to complete the tasks. On the first iteration, over 4.5k HTTP requests were made, most of them between services. By contrast, running the calculations locally required no HTTP requests at all.
Analysing the requests, it was clear that a large portion of them were being used only for low-value status updates, not for higher-value calculations or data management. We could easily avoid these requests without impacting the actual results.
Q-CTRL uses Django for the backend APIs and Celery for task management. Extending some basic status management from Celery allowed us to reduce the number of HTTP requests between our services by 43%, with a good impact on the overall performance.
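The change is conceptually simple. As a minimal sketch (not the exact Q-CTRL implementation - task names, broker URLs and metadata below are placeholders), progress can be written to Celery's result backend and read from there, instead of POSTing status updates between services:

```python
from celery import Celery
from celery.result import AsyncResult

app = Celery(
    "optimizations",
    broker="redis://localhost:6379/0",
    backend="redis://localhost:6379/1",
)

@app.task(bind=True)
def run_optimization(self, qubit_count, segment_count):
    for step in range(segment_count):
        # ... perform one slice of the optimization ...
        # Record progress in the result backend; no HTTP request needed.
        self.update_state(
            state="PROGRESS", meta={"step": step, "total": segment_count}
        )
    return {"qubits": qubit_count, "segments": segment_count}

def check_status(task_id):
    # Any service with access to the result backend can read the state
    # directly, so status traffic never touches the HTTP layer.
    result = AsyncResult(task_id, app=app)
    return result.state, result.info
```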
After the change, the results were:
- Previous time/HTTP requests: ~97 min / ~4.5k.
- New time/HTTP requests: ~28 min / ~2.5k.
Microservices the proper way
With the request count down to 2.5k, we noticed that some specific request latencies between services were higher than expected. The cloud architecture involved in this calculation can be summarised as follows:
- Fully managed database services.
- Fully managed API hosting and load balancing.
- Fully managed task worker instances.
- Kubernetes cluster for GPU instances and other specialized services.
This design worked very well for some time but, with increasing calculation sizes and more specialized needs, the performance of the fully managed services started to degrade. To address those weak spots, we moved everything but the database to the Kubernetes cluster, allowing us finer control over load balancing and reducing latency for requests within the cluster.
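To give a flavour of what that finer control means in practice, here is a hedged sketch of an in-cluster call (the service and namespace names are invented): once both services live in the same Kubernetes cluster, they can talk over internal service DNS instead of going back out through a public load balancer on every request.

```python
import os

import requests

# Hypothetical in-cluster URL: service.namespace.svc.cluster.local resolves
# inside the cluster, so the call never leaves the cluster network.
WORKER_URL = os.environ.get(
    "WORKER_URL", "http://optimization-worker.compute.svc.cluster.local"
)

def submit(payload):
    # No external load-balancer hop on every request between services.
    response = requests.post(f"{WORKER_URL}/tasks", json=payload, timeout=30)
    response.raise_for_status()
    return response.json()
```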
This produced the following results:
- Previous time/HTTP requests: ~28 min / ~2.5k
- New time/HTTP requests: ~23 min / ~2.5k
From REST to GraphQL
Looking for ways to decrease the number of HTTP requests (still at 2.5k), we faced a design choice. Our API was a RESTful service conforming with the JSON:API and OpenAPI specifications. Although this was a great specification and design choice, setting up an optimization demanded the creation of several separate, interrelated stateful resources.
After some research, GraphQL emerged as a good successor for our needs thanks to its dynamic nature, sparse queries, and nested type management. Its well-defined schema and strong documentation philosophy were excellent substitutes for the JSON:API and OpenAPI bundle.
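As a hedged sketch of what this buys us (the endpoint, mutation and field names below are illustrative, not Boulder Opal's actual schema), a single GraphQL request can create the nested, interrelated setup that previously required several JSON:API calls:

```python
import requests

# Hypothetical mutation: one request carries the whole nested setup.
START_OPTIMIZATION = """
mutation StartOptimization($input: OptimizationInput!) {
  startOptimization(input: $input) {
    optimization {
      id
      status
    }
  }
}
"""

def start_optimization(api_url, token, qubit_count, segment_count):
    variables = {"input": {"qubitCount": qubit_count, "segmentCount": segment_count}}
    response = requests.post(
        api_url,
        json={"query": START_OPTIMIZATION, "variables": variables},
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["data"]["startOptimization"]["optimization"]
```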
With that decision made and a version of the same function deployed using GraphQL, the new results became:
- Previous time/HTTP requests: ~23 min / ~2.5k
- New time/HTTP requests: ~19 min / ~900
There is a limit to parallelization
A core value of cloud architectures is the idea that, if a given computational task can be broken into smaller pieces, auto-scaled instances can handle parallelism with virtually no limits. But that is just part of the equation. Sometimes, the time required to split a task into subtasks, or to set up the calculations individually, can be greater than the actual processing time. With that in mind, we executed the same optimizations with parallelization disabled, running the whole calculation on our cloud instances but in a single process, similar to what happens when we run it locally on a laptop.
As expected, that change made some processes slower, but (unexpectedly) gave us better overall speed when running all the optimizations sequentially. And, more importantly, for the first time running the optimizations on the distributed cloud architecture produced faster results than running them locally.
- Previous time/HTTP requests: ~19 min / ~900
- Local execution on MacBook Pro: ~18 min
- New time/HTTP requests: ~12 min / ~720
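A toy model helps explain the result. If every subtask pays a fixed setup cost (dispatch, serialization, result collection) before doing its share of the work, splitting too finely means paying that cost many times over. The numbers below are invented purely for illustration - they are not measurements from our platform:

```python
import math

# Toy model, not Q-CTRL's scheduler: each subtask pays a fixed setup cost
# before doing total_compute_s / n_subtasks seconds of useful work, and
# subtasks run in waves of at most n_workers at a time.
def estimated_wall_time_s(total_compute_s, n_subtasks, setup_s, n_workers):
    waves = math.ceil(n_subtasks / n_workers)
    return waves * (setup_s + total_compute_s / n_subtasks)

# 60 s of compute, 20 s of setup per subtask, 4 workers:
print(estimated_wall_time_s(60, 1, 20, 4))   # 80.0  - no parallelization
print(estimated_wall_time_s(60, 4, 20, 4))   # 35.0  - intermediate split
print(estimated_wall_time_s(60, 32, 20, 4))  # 175.0 - over-split
```

In this toy example the over-split run is the slowest of the three, while an intermediate split is the fastest - which is exactly the balance explored in the next section.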
Finding the balance: Parallelization advantages combined with GPU allocation
Close analysis of the profiles generated while benchmarking the previous executions allowed us to calculate optimised parallelization strategies. Parallelization would still happen, but instead of parallelizing every small part of the calculation, we divided the entire process into intermediate-sized pieces. That way, we reduced the setup time compared to using many small subtasks, while still benefiting from running multiple pieces in parallel.
Another insight from the previously collected data concerned the use of GPUs. Not everything runs better on a GPU than on a CPU, and our threshold for GPU allocation had been six-qubit system optimizations: any system with up to five qubits would be optimized on CPU instances, and six qubits or more would use GPU instances. With the optimized parallelization, we raised the minimum system size for engaging GPU instances to seven qubits.
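As a hedged sketch of how such a dispatch rule might look with Celery (the queue names, batch size and task name here are invented, not our production code), work is grouped into intermediate-sized batches and each batch is routed to a CPU or GPU queue depending on system size:

```python
from celery import Celery

app = Celery("optimizations", broker="redis://localhost:6379/0")

GPU_QUBIT_THRESHOLD = 7  # below this, CPU instances win

def submit_optimization(qubit_count, segments, batch_size=4):
    queue = "gpu" if qubit_count >= GPU_QUBIT_THRESHOLD else "cpu"
    # Intermediate-sized batches rather than one task per segment, so the
    # setup overhead is paid a few times instead of hundreds of times.
    batches = [
        segments[i:i + batch_size] for i in range(0, len(segments), batch_size)
    ]
    return [
        app.send_task("optimize_batch", args=[qubit_count, batch], queue=queue)
        for batch in batches
    ]
```

With the intermediate batch sizes and the new GPU threshold in place, the results were: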
- Previous time/HTTP requests: ~12 min / ~720
- New time/HTTP requests: ~7 min / ~720
Tweaking resources on Kubernetes pods
The average optimization time after all the enhancements across the layers mentioned above was already satisfactory: almost three times faster than a MacBook Pro and almost twice as fast as a custom-built machine with an attached GPU running the same algorithms locally. But there was still room for improvement, so we increased the number of CPU and GPU instances, aiming to avoid tasks being queued and ensuring that any parallel process would start immediately without relying on auto-scaling. This doubled our performance:
- Previous time: ~7 min
- New time: ~3.5 min
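For illustration only, a change of this kind could be applied with the official Kubernetes Python client roughly as follows; the Deployment name, namespace, replica count and resource figures are all placeholders, not our production values:

```python
from kubernetes import client, config

# Hypothetical sketch: pin the worker Deployment at a fixed replica count
# with guaranteed CPU/GPU resources so parallel tasks start immediately
# instead of waiting for auto-scaling.
config.load_kube_config()
apps = client.AppsV1Api()

patch = {
    "spec": {
        "replicas": 8,  # keep enough workers warm to avoid queueing
        "template": {
            "spec": {
                "containers": [{
                    "name": "optimization-worker",
                    "resources": {
                        "requests": {"cpu": "4", "memory": "16Gi"},
                        "limits": {"cpu": "4", "memory": "16Gi", "nvidia.com/gpu": "1"},
                    },
                }]
            }
        },
    }
}

apps.patch_namespaced_deployment(
    name="optimization-worker", namespace="compute", body=patch
)
```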
The results
We used this exploration as a proof of concept to see how much faster we can run heavy optimizations when taking advantage of our dynamic cloud infrastructure - and this doesn't even explore all the hardware options available for further improvements.
- Boulder Opal (original): ~97 min
- MacBook Pro (local): ~18 min
- Custom GPU instance (local): ~11 min
- Boulder Opal (enhanced): ~3.5 min
In conclusion, the research and experimentation that allowed us to reduce a set of calculations from 97 minutes down to 3.5 minutes not only shows that Boulder Opal is faster than most local environments, but also that it is capable of taking advantage of new technologies as they become available.