Q-CTRL digest

Accelerating quantum advantage by scaling error suppression with NVIDIA and OQC

March 20, 2025
Written by
Esteban Ginez
Yulun Wang
Rowen Wu

  • The challenge: Complex computing tasks involved in executing quantum applications on real hardware can create a bottleneck, hindering our progress toward achieving quantum advantage.
  • The outcome: In partnership with NVIDIA and OQC, we leveraged advanced accelerated computing technology to solve key computational bottlenecks in error suppression that we foresee on the path to quantum advantage.
  • Impact: 300,000X reduction in classical compute cost by leveraging NVIDIA GPUs and accelerated libraries for layout ranking.

Quantum computing is sometimes suggested as a substitute for classical computing. However, the reality of how quantum computing will be used in production is quite different.

Delivering useful quantum computing requires continued co-innovation with classical computing in order to address the many tasks inherent to executing workloads on quantum hardware.

Looking ahead, it appears that some of these tasks—including mapping abstract quantum algorithms into hardware-specific machine instructions—will rapidly become bottlenecks if we do not develop new cutting-edge classical computing solutions. If we want to accelerate the path to quantum advantage, we need to get ahead of these problems now.

The emergence of GPUs in the information processing landscape has powered the AI industry and is now poised to advance the state of the art in quantum computing by solving many of its biggest scaling challenges.

Fire Opal is Q-CTRL’s error-suppressing performance-management software for quantum computers. It leverages a deterministic pipeline of interconnected techniques to reduce errors and optimize instructions in every single step of executing a job on quantum hardware. It’s been shown to totally transform the capabilities of today’s machines, unlocking latent performance entirely through software. Users can experience over 1000X improvement in computational accuracy and 1000X reduction in execution time with a single command.

Under the hood, a crucial step in Fire Opal’s pipeline involves selecting among different possible ways of mapping an abstract quantum circuit onto physical quantum hardware, an area where we've developed new AI-driven technology using an approach similar to the Netflix ranking algorithm. 

This process rapidly becomes a computational bottleneck due to the enormous space of possible variations of each circuit and the many factors that must be accounted for in order to perform accurate ranking. And as you’d imagine, the problem becomes more challenging as the number of qubits available on a quantum processor grows.

In order to truly accelerate the path to quantum advantage, we set up a new collaboration with Oxford Quantum Circuits (OQC) to explore how the use of the NVIDIA platform, including AI and data science libraries like RAPIDS and cuDF, can dramatically speed up layout ranking for the quantum processing units (QPUs) of today and tomorrow, reducing computational time and cost for this critical task.

“Many different approaches are needed to overcome the challenge of noise in quantum hardware,” said Sam Stanwyck, Group Product Manager for quantum computing at NVIDIA. “This work is a great example of how NVIDIA’s accelerated computing is key to driving these approaches forward.”

As a leading quantum hardware provider, OQC is focused on making real-world quantum computing scalable and commercially viable. Efficient layout selection directly benefits hardware providers like OQC by improving hardware utilization and unlocking better performance for end users.

“As quantum computers scale, more users will be able to tackle increasingly complex algorithmic tasks representing more meaningful industrial, utility-scale applications,” said Alex Shih, VP of Product at Q-CTRL. “We’re delighted to be working with NVIDIA and OQC to efficiently orchestrate computationally intensive workflows through optimized hybrid classical-quantum resources in order to save users time and money.”

Together, we’re building the innovative tools needed to realize the Software-Defined Quantum Data Center, capable of combining CPUs, GPUs, and QPUs to deliver real value to end users. Using GPUs not only to accelerate key parts of a computational workload in parallel with the other available resources but also to accelerate the supporting tasks of operating QPUs is critical to achieving true quantum advantage.

Challenges scaling quantum circuit layout ranking

To run a quantum algorithm on real hardware, we must map an abstract quantum circuit onto a specific selection of qubits on a quantum processor. This layout selection process can substantially reduce execution errors by minimizing the need for additional operations, for instance by selecting neighboring qubits and avoiding qubits known to be more error-prone. Our published research on the topic showed that choosing the “right” layout can deliver over 10X higher circuit fidelity than the median case. Therefore, getting this step correct is essential whenever we run a circuit on a QPU.

In practice, the first step in this process involves generating potential mappings from abstract logical qubits to physical qubits on the device. These mappings have to account not only for the number of qubits relative to the size of the overall QPU but also for the connectivity between those qubits.
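To make the generation step concrete, here is a minimal, self-contained sketch. The toy circuit, device topology, and brute-force enumeration are all illustrative stand-ins; production tooling uses subgraph-isomorphism algorithms such as VF2 (as in the Qiskit VF2PostLayout pass discussed below) rather than exhaustive search.

```python
# Enumerate every injective mapping of a circuit's logical qubits onto
# physical qubits that preserves the required connectivity. Brute force is
# fine at this toy scale; real devices demand smarter search.
from itertools import permutations

# Toy 3-qubit circuit: 2-qubit gates act on pairs (0, 1) and (1, 2).
circuit_edges = [(0, 1), (1, 2)]

# Toy 6-qubit device with linear connectivity: 0-1-2-3-4-5.
device_edges = {(i, i + 1) for i in range(5)}
device_edges |= {(b, a) for a, b in device_edges}  # undirected couplings

def valid_layouts(n_circuit_qubits, n_device_qubits):
    """Yield layouts (tuples: circuit qubit i -> physical qubit layout[i])
    in which every 2-qubit gate lands on physically coupled qubits."""
    for layout in permutations(range(n_device_qubits), n_circuit_qubits):
        if all((layout[a], layout[b]) in device_edges for a, b in circuit_edges):
            yield layout

layouts = list(valid_layouts(3, 6))
print(f"{len(layouts)} candidate layouts")  # 8 for this toy example
```

Even in this tiny example there are eight valid candidates; the combinatorial growth with qubit count is what makes the downstream ranking step so expensive.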

Figure 1. Circuit transpilation and the task of layout selection. (a) A representative initial circuit. Such circuits are unconstrained by hardware considerations, such as matching the device topology or consisting only of native gates. (b) The qubit connectivity graph of the initial circuit. Nodes represent the qubits appearing in (a) and edges represent 2-qubit gates. (c) The hardware-compatible circuit. Hardware transpilation transforms the initial circuit into a unitarily equivalent, hardware-compatible circuit. Shown here is the hardware-transpiled circuit corresponding to the initial circuit on the ibmq_guadalupe device (only a portion is shown for brevity). (d) The qubit connectivity graph of the hardware-transpiled circuit. Unlike the initial connectivity graph, the transpiled connectivity graph is a subgraph of the device topology (shown in (e)). The node labels correspond to the qubit labels in (c). (e) The layout is a map between circuit qubits and physical qubits. There are typically many such mappings; shown here is a subset of four layouts. The goal of layout selection is then to choose the best-performing layout for execution on the device.

Next, all of the possible layouts have to be ranked in a way that captures which layout is most likely to deliver a high-quality outcome when the actual circuit is run. This step accounts for qubit connectivity relative to the target circuit's structure, and data about the performance of every qubit and pairwise interaction on the QPU. That is, the ranking procedure becomes a complex calculation with many interdependencies between the device and the target circuit (see our technical manuscript for more information on the procedure we invented).
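As an illustration of what such a ranking calculation involves, the sketch below scores layouts with a simple product-of-fidelities heuristic over hypothetical calibration data. This is a common textbook-style heuristic, not the proprietary ranking procedure described in our manuscript, and the error rates are invented for the example.

```python
# Score each candidate layout by an estimated success probability built
# from (hypothetical) device calibration data: the product of the
# fidelities of the physical 2-qubit gates the layout would use.

# Hypothetical calibration data: 2-qubit gate error per coupled pair.
pair_error = {(0, 1): 0.010, (1, 2): 0.008, (2, 3): 0.030,
              (3, 4): 0.012, (4, 5): 0.009}

circuit_edges = [(0, 1), (1, 2)]  # 2-qubit gates in the target circuit

def score(layout):
    """Estimated success probability of a layout (higher is better)."""
    p = 1.0
    for a, b in circuit_edges:
        pa, pb = sorted((layout[a], layout[b]))  # canonical pair order
        p *= 1.0 - pair_error[(pa, pb)]
    return p

candidates = [(0, 1, 2), (1, 2, 3), (2, 3, 4), (3, 4, 5)]
best = max(candidates, key=score)
print(best, round(score(best), 4))
```

A real ranking must additionally fold in readout errors, coherence times, and circuit structure, which is exactly why the calculation develops so many interdependencies between device and circuit.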

Unfortunately, as the number of qubits increases, the number of possible layouts expands rapidly. For practical applications requiring thousands or even millions of qubits, efficiently ranking the huge space of possible layouts becomes infeasible with existing CPU-based computational engines. Computational cost and time become major obstacles, slowing down progress in quantum computing and potentially even inhibiting our aim of achieving quantum advantage.

Optimizing layout ranking with parallelism and GPUs

We explored using GPUs along with cuDF and RAPIDS to speed up the ranking step in Fire Opal’s error suppression pipeline. As you’ll see below, our collaborative R&D with OQC using NVIDIA hardware not only improves ranking performance but also enables future AI-driven layout selection techniques, opening doors for even more advanced optimizations.

As a baseline benchmark, our implementations branch from and compare to Qiskit’s VF2PostLayout pass. Proceeding in this way allows us to carefully isolate the computational engine and understand the role of accelerated computing in the process of layout selection. Currently, Qiskit processes these rankings sequentially in Python, evaluating one mapping at a time. Since each layout ranking consists of independent operations, this process is an ideal candidate for efficient parallel implementation.

Our collaboration developed multiple improved implementations using both CPU and GPU parallelism:

  • CPU Multithreading: We first optimized the ranking process to distribute calculations across multiple CPU cores, improving efficiency compared to a serial implementation.
  • GPU Acceleration: We created two GPU implementations that take advantage of different levels of parallelization:
    • Layout-level parallelism: Multiple layouts are evaluated simultaneously, each assigned to a separate GPU thread.
    • Qubit-level parallelism: The ranking computations within each layout are further parallelized, distributing work across many smaller GPU operations.
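To illustrate the layout-level parallelism idea, the sketch below scores a large batch of layouts in a single vectorized pass instead of a Python loop. It uses NumPy so it runs anywhere; the error table and its shape are hypothetical, and in an actual GPU implementation the same batched reduction would be expressed on-device, for example with CuPy arrays or RAPIDS cuDF dataframes.

```python
# Batch-score 500k candidate layouts in one vectorized reduction.
import numpy as np

rng = np.random.default_rng(0)
n_layouts, n_gates = 500_000, 20

# Hypothetical input: error_table[i, g] is the error rate of the physical
# gate that layout i assigns to circuit gate g (gathered from calibration
# data in a preprocessing step).
error_table = rng.uniform(1e-3, 5e-2, size=(n_layouts, n_gates))

# Score all layouts at once: summing log-fidelities avoids underflow and
# turns the per-layout product into a single reduction over the gate axis.
log_fidelity = np.log1p(-error_table).sum(axis=1)
ranking = np.argsort(log_fidelity)[::-1]  # best layout first

best = ranking[0]
print(f"best layout index: {best}, est. fidelity {np.exp(log_fidelity[best]):.3f}")
```

Because every row is independent, this computation maps naturally onto thousands of GPU threads, which is the essence of the layout-level strategy above; the qubit-level strategy further parallelizes the inner reduction.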

By developing these different implementations, we could fairly assess the best methodology and computational resources for optimizing layout ranking.

Benchmarking performance across time and cost

We conducted two sets of experiments to measure the impact of GPU acceleration: one testing real quantum circuits and one testing randomly generated layouts. To directly understand the impact on our path to quantum advantage, we measured both compute time and cost, two metrics that matter to end users.

1. Full-Pipeline Benchmark (real quantum circuits)

We first focused on real-world performance using a standard quantum circuit benchmark—Bernstein-Vazirani circuits of increasing size—and a realistic synthetic QPU backend with twice as many qubits as the circuit requires.

For these benchmarks, we compared our GPU-accelerated ranking step against the default Qiskit 1.2.4 VF2PostLayout pass. Each benchmark run evaluated and ranked up to 500,000 candidate layouts, with the number of layouts scaling with qubit count. For each ranking procedure, we evaluated the time per layout by measuring the wall-clock time of the whole ranking operation and dividing it by the number of generated layouts.
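The time-per-layout metric can be sketched as a small harness; the ranking function below is a trivial stand-in, not either of the benchmarked implementations.

```python
# Measure wall-clock time of a whole ranking call, then normalize by the
# number of layouts ranked to get the time-per-layout metric.
import time

def time_per_layout(rank_fn, layouts):
    start = time.perf_counter()
    rank_fn(layouts)
    elapsed = time.perf_counter() - start
    return elapsed / len(layouts)

# Example with a stand-in ranking function (sorting by a dummy score).
layouts = list(range(10_000))
tpl = time_per_layout(lambda ls: sorted(ls, key=lambda x: -x), layouts)
print(f"{tpl * 1e6:.2f} µs per layout")
```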

We conducted this benchmark using a single host machine equipped with an AMD Ryzen Threadripper PRO (128 logical cores) for the default implementation and an NVIDIA RTX A5000 GPU, enabling a direct comparison of CPU and GPU performance. Evaluating solutions using distributed CPU computing was outside the scope of these experiments.

Results showed that the GPU-enhanced ranking method consistently achieved nearly 10× speedup compared to the default implementation.

Figure 2. Comparison of ranking time per layout of the GPU-based parallel ranking approach and the “default” VF2PostLayout pass Python implementation.

At the scale of 200 qubits, it’s possible to have hundreds of thousands or even millions of layout options from which to choose. If there are 1 million possible layouts, then ranking the layouts would require 11.7 minutes per circuit using the conventional CPU approach, but just 1.2 minutes with a GPU. When considering real applications that may involve the execution of 100 or more different circuits—each requiring a layout selection procedure—a 10X speedup can mean the difference between an application that runs over the course of a few hours versus a few days.

Cost efficiency was another important factor in the benchmarking process. Cost benchmarks showed that running tasks on high-end CPUs can be far more expensive than conducting the same tasks on GPUs. For instance, the cost-per-layout at 200 qubits decreases from $1 to just $0.01 when using GPUs. For those working within a fixed budget and capable of scaling resources accordingly, utilizing GPUs can result in a performance increase of up to 100 times.

Figure 3. Cost data is calculated from the compute time per layout, where Cost per Layout = Time * Device Cost per Hour / Number of Layouts. Cost calculations were based on AWS's c6a.32xlarge instance (x86 Processor), which is $4.896/hour, while the GPU cost, using AWS's g5.xlarge instance (NVIDIA A10G), is $1.007/hour.

2. Modular Benchmark (large-scale, randomized layout data)

The second benchmark we implemented focused on scaling even further, to 1000 qubits and 500,000 layouts per circuit. By running the ranking procedure on randomly generated layout data, it could be decoupled from the layout-generation algorithm, which also suffers from runtime scaling challenges.

For this benchmark, we executed the default ranking algorithm on a single host machine equipped with an AMD Ryzen Threadripper PRO (128 logical cores), while GPU computations were performed on an NVIDIA GH200. Beyond the 200-qubit scale, the default VF2PostLayout ranking implementation struggles to efficiently handle layout ranking. To estimate its runtime, we executed the default implementation on a small subset of layouts and extrapolated the results.
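The extrapolation step can be sketched as follows. The subset timing numbers here are purely illustrative, and a linear extrapolation is reasonable in this setting because each layout is ranked independently, so total runtime grows proportionally with the number of layouts.

```python
# Estimate a full-scale runtime from a timed subset, assuming per-layout
# cost is constant (valid when layouts are ranked independently).
def extrapolate_runtime(subset_seconds, subset_layouts, total_layouts):
    """Linear extrapolation: per-layout cost times the full workload."""
    return subset_seconds / subset_layouts * total_layouts

# Illustrative numbers: if ranking 1,000 layouts takes 40 s, the full
# 500,000-layout ranking is projected at 20,000 s (~5.6 hours).
projected = extrapolate_runtime(40.0, 1_000, 500_000)
print(f"projected: {projected / 3600:.1f} h")
```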

Figure 4. Comparison of ranking time per layout of the GPU-based parallel ranking approach and the “default” VF2PostLayout pass approach.

Results showed that the GPU-based parallel ranking approach was 100,000 to 300,000 times faster than the default approach for large-scale datasets! Benchmarking also demonstrated that GPUs are significantly more cost-efficient than high-end CPUs, even as the number of qubits scales into the thousands.

Figure 5. Cost data is calculated from the compute time of ranking 500k layouts per circuit, where Cost per Task = Time * Device Cost per Hour. Cost calculations were based on AWS's c6a.32xlarge instance (AMD EPYC 7R13 Processor), which is $4.896/hour, while the GPU cost, using Lambda Labs’s GH200 pricing, is $3.19/hour.

Enabling efficient and practical quantum computing

This work highlights the immense potential of accelerated classical computing as a powerful tool for executing quantum applications at scale. Our results show that GPU-accelerated layout ranking is not only faster but also significantly more cost-effective than CPU-based methods.

For quantum hardware providers like OQC, these advancements aren’t just technical improvements—they are critical investments in the future of practical quantum computing.

By improving circuit layout selection and ranking (both of which are steps in hardware-aware compilation), we directly enhance the efficiency, usability, and performance of quantum hardware. Faster compilation means:

  • Faster time to solution – Users can run more experiments in less time, ensuring greater throughput for hardware vendors.
  • Better algorithm performance – Optimized layouts lead to lower error rates and higher-quality results, helping drive up user adoption of QPUs in their most valuable workloads.
  • Scalability for future devices – As qubit counts grow, hybrid solutions incorporating GPU acceleration for these critical operational tasks will be essential to delivering useful outcomes to customers on the timescales they expect.

Accelerated computing enables opportunities in AI for Quantum

The team at Q-CTRL has been a pioneer in the application of advanced AI tools to drive the improvement of quantum hardware. Simply put, intelligent machines are often better at teasing out maximum performance from fragile quantum devices than humans are. That’s why most steps in Fire Opal’s error-suppression pipeline involve a form of custom AI augmentation in order to deliver the best possible performance from QPU hardware.

Looking ahead, we see multiple exciting opportunities for further GPU-accelerated enhancements:

  • AI-powered layout ranking: Accelerated computing provides the computational power needed for advanced AI-driven layout selection methods—such as our previously published “Learning to rank” method—which can also drastically improve the performance of algorithms on quantum hardware.
  • Faster layout generation: The initial step of layout selection—graph generation—can also benefit from GPU acceleration, reducing compilation times even further.
  • AI augmentation beyond layout selection: Similar computationally challenging problems appear in other quantum computing tasks, such as circuit knitting, where GPU acceleration could drive further breakthroughs. Measurement error mitigation and even gate design involving machine learning techniques can also achieve enhanced performance at scale with GPU acceleration. 

As quantum hardware grows in size and complexity, GPU acceleration will play an increasingly important role in addressing the practical bottlenecks in achieving useful results in real-world applications. By combining the best in quantum, classical, and AI-driven approaches at scale, we’re paving the way for true quantum advantage, one breakthrough at a time.

Get started

Interested in trying out these cutting-edge error suppression techniques on real quantum hardware? Sign up for Fire Opal today to get started!