Technical blog

Avoiding an unexpected roadblock in quantum computing - compilation

March 25, 2024
Written by
Yulun Wang
Adam Winick
Pranav Mundada
Yuval Baum
Michael J. Biercuk

With major acceleration in industry roadmaps for hardware development, we’re now operating in the era of quantum utility: machines now have over 100 qubits and can tackle problems that are at least challenging with conventional approaches.

Right now, you can run any algorithm on the public cloud using 127 qubits with IBM Quantum Services and Q-CTRL’s natively embedded performance management, and it works!

But lurking in the background is an underappreciated challenge. Yes, hardware execution is very tough - that’s why our performance management software has been so important! But even the classical pre-processing required to execute a quantum algorithm can become a bottleneck at these scales if you aren’t careful.

The step known as compilation can present a huge roadblock to the successful execution of real applications. So what is it? How do you evaluate compilers? And which compilers are best?

You won’t be surprised to hear that at Q-CTRL we’ve put a lot of effort into optimizing this critical preprocessing step. In this short piece, we’ll share new insights into the bottleneck presented by compilation at utility scales and showcase benchmarking results revealing just how much benefit you can gain from the compiler we offer.

What do compilers do?

Computers have become an integral part of our modern lives, transforming the way we work, communicate, and access information. We communicate with them in different ways, all the way from writing code snippets in our favorite programming language or using a GUI to just sliding a finger on our phone’s screen. It is quite a miracle how a simple human interpretable instruction is instantly translated into an orchestrated sequence of electrical signals that execute our orders.

The translation between human language and machine language is handled by “compilers”. While most of us take this process for granted, developing good compilers was and remains a huge effort pursued by top computer scientists and engineers around the world.

And the same is true in the field of quantum computing.

Quantum hardware comes in many flavors, each with different properties, constraints, and means of communicating instructions. Quantum compilers are an essential part of the quantum computing stack: they take an abstract set of human interpretable instructions and translate those into a hardware-compatible instruction set.  

For instance, most algorithm designers will happily write a quantum circuit in a language like Qiskit or QASM that says “CNOT(1,5)” - a CNOT logical operation between qubits 1 and 5. However, most hardware systems don’t actually implement a CNOT gate - instead, it’s decomposed into instructions that the hardware can actually perform like cross-resonance interactions and single-qubit operations. But even more, the ability to perform a gate between qubits 1 and 5 will be limited by the connectivity of the device. Accordingly, there may also be a need to move information around a quantum computer using SWAP gates.

Turning these high-level abstract notions into instructions that can be performed on quantum hardware is the responsibility of the quantum compiler.

Are all quantum compilers the same?

The process by which high-level abstractions are turned into machine instructions is much more complex than it may seem, primarily because there isn’t a simple one-to-one “dictionary” that enables that translation.  

A compiler designer might want to use mathematical tricks to reduce the number of operations required, choose the best mapping between abstract “virtual” qubits and their physical counterparts to minimize SWAPS or avoid faulty qubits, and even add error-suppression strategies like dynamical decoupling.  Many of these tasks require challenging mathematical optimization routines that can dramatically impact the quality of a compiler.

You can see for yourself how much difference this makes. Compare the outputs of an “unoptimized” compiler (close to a 1:1 implementation of the abstract algorithm) and an “optimized” compiler which leverages those mathematical tricks to improve the circuit. Optimization reduces the number of quantum logic gates required by about 80%.  And this is for a pretty trivial four-qubit test circuit - imagine if it was something complex!

Figure 1: Comparing an original abstract four-qubit circuit (upper), an unoptimized compilation to the native gateset of hardware(middle), and an optimized compilation (lower) circuit for the same four-qubit Bernstein Vazirani circuit.

How do you even measure compiler quality? We can boil this down to just a handful of key metrics:

  1. Resulting circuit depth: A good compiler will always minimize the number of operations that must be executed, to reduce overall circuit runtime (hence minimizing decoherence and execution costs) as well as the number of error-prone two-qubit gates.
  2. Output variability: A good compiler will deliver consistent results, despite the fact that many of the underlying mathematical optimization tasks have an element of randomness.  You need to know your system will work reliably every time.
  3. Compiler runtime: A good compiler will execute the relevant mathematical optimizations quickly in order to deliver a solution rapidly. Large classical computing overheads can incur unnecessary additional costs, inconvenience, or even runtime errors.
  4. Hardware performance:  A good compiler will integrate additional tactics to address known sources of error like crosstalk suppression via dynamical decoupling, gate-level optimization, or awareness of error-prone hardware in layout selection in order to ensure resulting circuits work well in execution.

These four metrics are likely reasonably intuitive to anyone working with quantum computers.

Still, point number 3 may come as a surprise. Historically, almost no one has thought about compiler speed as a relevant quantity about which to be concerned, mostly because we’ve only had to deal with tiny circuits. Things like QPU processor execution time over many shots would dwarf the entirely classical compilation step when dealing with circuits of just a few qubits.  

If we consider a “race” between compilers, the experience to date has been a bit like running a 5 meter sprint. It doesn’t matter much whether you’re out of shape or an elite athlete - the race is so short that all times will be pretty similar.

Amazingly, the compilation of complex circuits at the scale of just 100 qubits can blow out to hours with an underperforming compiler!

At the utility scale, the race has changed from a 5m sprint to a 5k. In this regime, the true athletes stand out.

These problems can be particularly severe in the applications people now care about. In hybrid or iterative algorithms it’s essential to run many circuits with small changes as part of a classical optimization loop, each time requiring some form of recompilation for the circuit. For utility-scale problems in logistics, transport, or materials compiler bottlenecks can even break execution on some systems.

There are ways to mitigate the impact of compiler delays, like “parametric compilation”, but these approaches generally trade quality against the simplification enabling parameterization (we will discuss the compiler runtime vs performance tradeoffs for hybrid algorithms in a subsequent blog - stay tuned).  Still, it’s now obvious that beyond convenience, compiler runtime is a critical metric to consider in the utility era!

Benchmarking quantum compilers

Measuring performance under real and varying race conditions is the only way to determine the best compilers on the market. It’s a lot of work, so we’ve done it for you.

In general, the quality of output can be traded against compiler speed. But is this practical tradeoff even necessary? Let’s find out.

We compare four compilers: Q-CTRL, Qiskit, Pytket, and an Alternative commercial compiler.

Qiskit is an open-source software development kit for working with quantum computers at the level of circuits, pulses and algorithms. The Qiskit compiler offers four levels of optimization ranging from minimal circuit optimization (i.e. L0) to maximum optimization (i.e. L3). Pytket is a tool for manipulating and optimizing platform-agnostic quantum circuits produced by Cambridge Quantum Computing/Quantinuum. The Alternative commercial compiler (Alt compiler) is a proprietary software package developed by another quantum company. The Q-CTRL compiler is the complete compilation chain available in our performance-management software; it’s available in both our native integration with IBM and our cloud service Fire Opal. A full outline of versions, key features, and relevant computational resourcing for these tests is provided in the table at the end of this article.

In our tests we measure the quality and speed of these compilers across a range of metrics. To make these results concrete, we explore compilation for a 127 qubit IBM quantum processor, IBM Brisbane, and validate the performance of the output circuits on real hardware.

To address the high variability between algorithms in the number of qubits used (a.k.a. circuit width) and the number of two-qubit-gate layers (a.k.a. circuit depth) we provide results across three representative algorithms - Bernstein Vazirani (BV), Quantum Fourier Transform (QFT) and Grover’s Search.  To keep things fair we ensure that we employ the same instance of the problem across each compiler in every matchup.

👉 TLDR: In every case and across every benchmark Q-CTRL’s compiler outperforms its peers at scale.

Resulting circuit depth

We’ll first explore the quality of Resulting circuit depth across the various compilers, and present measurements in terms of both two-qubit gate count and circuit duration.  In these plots “smaller is better” as it corresponds to simpler, shorter circuits.

In every case, and for every circuit width up to the full 127 qubits, Q-CTRL successfully returns solutions with the best performance across both metrics.

Figure 2: Comparing resulting circuit depth for three test algorithms as a function of circuit width. Each row corresponds to a single algorithm and each column addresses a different performance metric.  Markers indicate the different compilers under test.  In these plots, low values indicate “good” compiler performance.

Below we provide additional technical discussion of the observations for each algorithm.

  • Bernstein Vazirani: At the scale of 127 qubits, Q-CTRL performs 22% better than Qiskit L3, with Qiskit L3 47% better than Qiskit L1 (lower optimization).  Overall we observe a close clustering in Resulting circuit depth for all compilers other than Pytket, which shows decidedly worse performance than the others in the case of compiling shallow circuits.
  • Quantum Fourier Transform: In this algorithm, Q-CTRL’s compiler dramatically outperforms all other options and gives the largest divergence from its competitors.  Q-CTRL provides a 93% reduction in two-qubit-gate counts and a 94% reduction in circuit duration compared to Pytket for a 127 qubit implementation. Besides Q-CTRL, only Pytket is successful in compiling circuits at the scale of 127 qubits within a reasonable runtime. Pytket performs better than Qiskit L3 in the reduction of 2Q gate counts however it is unable to reduce the circuit duration better than Qiskit L3. Qiskit L1 and L3 compilation was terminated for >63Q QFT because it takes >15 min for each circuit compilation (see below for more on compiler runtime).
  • Grover’s search: Q-CTRL performs best, providing a 84% reduction in two-qubit gate counts over Pytket as the next-best-performing compiler at the scale of 30-qubit Grover’s search. We terminate here due to the extreme depth of Grover’s search circuits. In this benchmark both measures of circuit depth grow rapidly for Qiskit’s compilers and as such we present results on a semilog plot.  

We note that the publicly accessible version of the Alt compiler doesn’t allow compiling circuits with more than approximately 1000 two-qubit gates. Hence, we were not able to test the Alt compiler beyond 16 qubit QFT and 7 qubit Grover. Even at these small scales Q-CTRL delivers consistently shorter circuits across all algorithms - almost 4X shorter for QFT at just 16 qubits.

Output variability

We take the reliability of classical computers for granted. Imagine your calculator returning answers stochastically, e.g., sometimes 2+2=4.1 and sometimes 2+2=3.95: this would lead to a cascade of errors for any accountant because the error in the output of one calculation would add additional errors in the subsequent calculation. A key feature of any reliable computing technology is the ability to provide a consistent experience to the end user.

It’s broadly under-appreciated that quantum compilers tend to exhibit substantial variability in their outputs.  Again in an era of short sprints on faulty hardware, we hardly notice.  But in the utility era, this variability can lead to huge performance fluctuations from run to run.

These fluctuations can be particularly problematic in the context of hybrid algorithm execution where the output of a circuit is meant to be used as part of a classical optimization loop.  The failure chain of compiler variability → performance variability → classical-loop can’t converge → hybrid-algorithm failure is a real concern (there are additional tricks to mitigate this that we won’t cover here).

To test the reliability of different compilers, we compile a series of test circuits three times each and observe the variation in the resulting two-qubit gate counts and compiled circuit duration. In these plots the lower left corner corresponds to the best performance.

Figure 3: Illustrating the variability of outputs over multiple runs using the five compilers under test. Resulting circuit depth is plotted for four different test cases where each compiler is run three independent times. All markers with the same color/shape represent the same compiler.

Once again Q-CTRL’s compiler performs best. It returns both the most compact (in gates) and shortest (in time) circuits across these tests, in line with the results above. But the results are also consistent from run to run. All Q-CTRL outputs over the three tests are indistinguishable within numerical precision. By contrast other compilers can show up to 10% variation in output parameters from run to run.

Compiler runtime

We don’t typically expect the strongest athlete in a competition to also be the fastest on the track.  But in the case of compilers, they’re one-and-the-same for Q-CTRL.

In a test of compiler runtime Q-CTRL’s compiler also outperforms its competitors in time-to-solution at scale.  And as you’ll see in the table later, we’ve loaded up the Q-CTRL compiler with a range of additional features to improve hardware performance (such as optimized dynamical decoupling embedding and layout selection) which are not required of the other compilers.  Even with this added burden, Q-CTRL executes quickly.

Starting with Bernstein Vazirani, Q-CTRL provides the best performance at 127 qubits relative to the “optimized compilers”, but time can be saved with the unoptimized Qiskit L1 (as anticipated - this compiler does little beyond accounting for connectivity and the native gate set).  For this simple algorithm, absolute compiler time is only about 15 seconds across the best compilers, while here the Alt compiler is roughly 10X slower.

In the case of both QFT and Grover (more complex and high-depth circuits) we see a very different dynamic, with Q-CTRL’s speed shining through as a major advantage.  At the largest circuit widths, Q-CTRL delivers superior performance and scaling, with pytket close behind.  For QFT, Pytket and Q-CTRL are very similar up to ~60 qubits, but then diverge.  As mentioned above the low-optimization Qiskit L1 performs well in this metric at small scales but is simultaneously delivering lower-quality outputs and runtime is increasing more rapidly than its counterparts.

It is also important to consider absolute runtimes.  Focusing on QFT, in all cases up to 127 qubits Q-CTRL delivers high-quality output in <100 seconds of compiler runtime, over 5X faster than Pytket at maximum circuit size.  The Qiskit L3 compiler’s runtime explodes to ~800 seconds at just 63 qubits.  The Alt compiler’s scaling seems to approximate Qiskit L3, but is unable to exceed ~16 qubits as mentioned above. Total runtime for QFT using the Alt compiler at 16 qubits is actually larger than Q-CTRL at full device scale.

Figure 4: Measurement of compiler runtime for three different test algorithms as a function of circuit width. In these plots, low values represent good performance.

Hardware performance

Ultimately the proof is in the pudding.  Other than compiler runtime, the metrics we’ve treated above are mostly proxies that help us anticipate the performance you’re likely to achieve using the compiler’s output circuit when run on hardware.  So how do the different benchmarks perform?

If you’re familiar with our core focus on delivering maximum hardware performance, it won’t surprise you that Q-CTRL performs the best. By a lot.

We execute the circuits obtained from the above compilers on the 127-qubit IBM Brisbane QPU. As representative examples of the achievable performance, we present experimental measurements for both Bernstein Vazirani and QFT. Since these algorithms have a known correct answer, we present the “success probability” of achieving the right result (see our documentation for further discussion).

In the plots below, high is good.

We’ve tried hard to make this a fair competition. Q-CTRL and the Alt compiler both include different forms of interleaved dynamical decoupling for error suppression in their compiler pipelines, so we have turned on the relevant option in Qiskit L3. Pytket does not include such an option.

Figure 5: Comparing hardware performance for the outputs of the five different compilers and across two different test circuits. In these plots high values indicate successful circuit execution.

Even at modest circuit widths and depths, Q-CTRL outperforms its peers by a large margin.  For the 25-qubit Bernstein-Vazirani algorithm, most of the other protocols do not return a measurable result, and execution of the Alt compiler’s circuit performs more than 2X worse than Q-CTRL. This divergence in measured performance even for a reasonably simple circuit makes clear the strength of Q-CTRL’s error suppression pipeline; we know for Bernstein Vazirani that both the Q-CTRL and the Alt compiler’s circuits are similar in length and duration.

As soon as a larger and more complex circuit is encountered, all competitors fall away. At the smallest QFT circuit width - 8 qubits - Q-CTRL achieves ~40% success probability while all other compilers are under 5%. As you can see for yourself, the Q-CTRL advantage grows more extreme for larger circuits, and beyond 12 qubits Q-CTRL provides the only measurable output.

Conclusion

Compilers are an oft-overlooked component of any quantum computing service.  The compiler determines the actual circuit to be run on hardware and as such carries an enormous responsibility to deliver performant outputs.

Here we’ve endeavored to shine a light on the important role played by compilers and identify potential bottlenecks emerging at the utility scale that were previously immaterial.

In this work, we benchmarked the performance of a range of compilers available in the market and showed how well Q-CTRL’s compiler performs relative to its peers.  From resulting circuit depth and compiler runtime to actual circuit performance on hardware, Q-CTRL delivers the best combination of efficiency, speed, and performance.

This is of course not the end of the story.  Our partners at IBM have publicly announced major upgrades to the Qiskit compilers with a focus on performance coming this year.  And we have no doubt others in the community will push forward as well.  We’re excited to see these new capabilities and know that there will be an “upwards ratchet” in performance based on testing like this across many offerings.

You can learn more about the Q-CTRL compiler and start leveraging it immediately via our technical documentation for Q-CTRL Embedded (in IBM) and our independent cloud service Fire Opal.

And don’t take our word for it - try it yourself today!

Compiler Packages and versions Hardware compatibility and depth reduction Post compilation layout selection DD Scheduling
Q-CTRL - Yes Yes Yes Yes
Q-CTRL Qiskit 0.46.0
Qiskit-terra 0.46.0
Qiskit-ibm-runtime 0.19.1
Yes Yes Yes* Yes
Pyket Pyket 1.24.0
Pyket-qiskit 0.48.0
Yes No No Yes
Alt compiler Contact us for details Yes Unknown Yes Yes

*Qiskit offers embedding of dynamical decoupling (DD) as separate methods that may be included by users. Including these would add additional compilation runtime but would not impact the achieved circuit depth. Qiskit IBMQ Sampler and Estimator primitives incorporate DD by default.

** The benchmark tests of Q-CTRL, Qiskit, and Pytket are performed on a cluster platform (AMD Ryzen Threadripper PRO 3995WX, using 1 Core with 2 threads). The tests of Alt compiler are performed through their cloud service which does not give visibility into the relevant compute resources available.