Workflows and Quantum Go Together like Peanut Butter & Jelly
Conflicting dependencies in your project
Inability to share the data you want to share
Difficulty reproducing old data and old versions
Inability to re-purpose code across multiple projects
Hello friends,
My name is Ethan Hansen and I am a summer intern working with the marketing and product teams here at Zapata. I’ve been working here for four weeks, and in that time I’ve been trying to get to the bottom of one question: “Why workflows?”
At first, I wondered why it was beneficial to use workflows at all. I was perfectly happy coding up little tutorials from different SDKs and libraries in Jupyter notebooks and running them a couple of times. Couldn’t all of quantum computing be done like this?
I already knew from experience that this was unsustainable. There’s no single language or framework that does everything in quantum computing. Because of that, as I was learning different frameworks and going through tutorials, not only did I have to make sure different versions of dependencies were compatible, but whenever I wanted to switch from one framework to another, it meant switching WAY more than just the framework. One time I even had to completely re-do my file directory structure! The tutorials were also all in Jupyter notebooks or individual scripts, so if I wanted to re-run a set of them, I had to go through and re-start each one individually, often freezing my laptop if I was running too many at the same time.
This is just a small-scale example of challenges that people and companies are already running into with quantum computing. Think about doing what I went through but for quantum research! You can imagine how managing dependencies, sharing data, reproducing data, visualizing data, and re-purposing code can quickly become significant burdens on the people and teams doing the work. In fact, you don’t have to imagine! I’ve collected examples of issues from Zapata team members who have been doing quantum computing for years to demonstrate to you just how much of a headache these problems can quickly become. I’ll also explain why and how they built Orquestra to mitigate those problems. I will be recounting the tales of tragedy and comedy, of headache and heartbreak that these Zapatistas experienced before they had Orquestra and workflows.
This post is not focused on teaching “What is a workflow?” If you’d like to know more about workflows, please see other resources, like our summary in the docs or this definition of Workflow Management Software. Nevertheless, I’ll give a brief overview, just to make sure everyone is on the same page:
Quantum researchers face a number of problems: manual execution of scripts makes results difficult to reproduce, managing data across steps is a pain, experimenting with new libraries is time-consuming, and accessing compute or quantum hardware to run those experiments is a whole project in and of itself. Workflows abstract away these complexities by codifying each step of your work into containerized tasks that can be run in parallel or serially. In other words, a workflow is a specification of what happens when, so that data and execution management can be handled automatically.
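To make that a bit more concrete, here’s a minimal sketch of the idea in plain Python. This is not Orquestra’s actual syntax, and every name below is made up for illustration; the point is just that each step is a self-contained task, and the workflow itself is nothing more than a declaration of what runs when and where its inputs come from.

```python
def generate_circuit(n_qubits):
    """Pretend task: build a description of a quantum circuit."""
    return {"n_qubits": n_qubits, "gates": ["H"] * n_qubits}

def simulate(circuit):
    """Pretend task: 'run' the circuit and return measurement counts."""
    zeros = "0" * circuit["n_qubits"]
    ones = "1" * circuit["n_qubits"]
    return {"counts": {zeros: 512, ones: 512}}

def analyze(results):
    """Pretend task: post-process the raw counts into probabilities."""
    total = sum(results["counts"].values())
    return {"probabilities": {k: v / total for k, v in results["counts"].items()}}

# The "workflow" is just a specification of what happens when; a real engine
# (like Orquestra) would also containerize each step and store every output.
workflow = [generate_circuit, simulate, analyze]

data = 4  # initial input: number of qubits
for step in workflow:
    data = step(data)
print(data)
```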
Allow me to add just one more note: Workflows don’t replace SDKs and languages; they automate the management of them. Workflows aren’t an exclusive platform. They are highly inclusive, allowing users to switch between libraries and backends with ease.
Speaking of switching between libraries, let’s dive into the tragic tale of William Simon, who worked so hard to simulate molecules, but the forces of Dependencies and Packages stymied his advance:
“There were a couple of different packages that we wanted to use for the project, notably Psi4 and qHiPSTER, Intel’s Quantum Simulator, … and basically I could run my psi4 calculations on the Tufts cluster, which is what I was using, and then it turned out that I couldn’t install the Intel Simulator on that cluster because of a lack of packages and dependencies.”
It’s time for a round of Who Wants to Be a Millionaire. Did William…
A: Quickly find a clever workaround to allow him to use qHiPSTER on his university cluster
B: Ask Tufts cluster admins to install the needed packages and dependencies, waiting weeks for it to be completed
C: Give up on running qHiPSTER altogether, building out his own solution that would work on the Tufts cluster
D: Build his OWN CLUSTER IN HIS BACK YARD so he could install and use what he needed
Ready for the answer? Let’s find out what William did. In his own words:
“The admins at Tufts wouldn’t let me install those base things,” so we know that B can’t be the right answer.
Alright, if you chose B, here’s a second chance. Go ahead and pick a second answer. Turns out the correct answer is, drumroll please…
None of them! It was a trick question!
William actually was, “able to find another supercomputing cluster at Pittsburgh” that he could use qHiPSTER on. Which is great! Now he could just go ahead and use that cluster to do all his calcula… “BUT it wouldn’t let me install the psi4 stuff.”
Oh.
Eventually, William ended up doing half of the calculation on the Tufts cluster, then downloading all of the files onto his laptop, then uploading them to the Pittsburgh cluster. And he had to do that manually for weeks on end. Do you know what the worst part was? He couldn’t watch Netflix while all this was happening! (My words, not William’s.) With workflows on Orquestra, on the other hand, William ran the same experiment in a couple of hours. And, if he wanted to, he could have watched some Netflix at the same time. The moral of this story:
Workflows let you ensure you have all of your needed dependencies. And if you have conflicting dependencies, workflows can automatically pass data between two steps that run in separate environments.
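To illustrate the pattern (this is a generic sketch, not how Orquestra implements it, and the file and function names are hypothetical): each step runs in its own environment and only ever exchanges serialized artifacts, so the two environments’ package lists never have to agree.

```python
import json

def chemistry_step(molecule, artifact_path="hamiltonian.json"):
    # Imagine this runs in an environment/container that has Psi4 installed.
    hamiltonian = {"molecule": molecule, "terms": [[1.0, "ZZ"], [0.5, "XX"]]}  # placeholder
    with open(artifact_path, "w") as f:
        json.dump(hamiltonian, f)
    return artifact_path

def simulation_step(artifact_path):
    # Imagine this runs in a *different* environment that has qHiPSTER installed.
    with open(artifact_path) as f:
        hamiltonian = json.load(f)
    # ...hand the Hamiltonian to the simulator here...
    return {"energy": -1.137, "n_terms": len(hamiltonian["terms"])}  # placeholder

# A workflow engine schedules these steps and shuttles the artifact between
# them automatically; doing it by hand is the weeks-of-uploading version.
print(simulation_step(chemistry_step("H2")))
```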
The next story comes courtesy of Max Radin, who wanted to send data, but first had to overcome the two-headed monster of Data Versioning and Data Sharing.
There was a person in Max’s old postdoc group who wanted to reproduce some of the graphs from Max’s project and needed the original data to do so. Max made a point to back up all his data, so thankfully he still had it!
That alone was a lucky break for the postdoc group because not everyone is so organized. However, because there were different iterations of the code as Max refined his project, there were different versions of the data set.
Max got to work, finding the right version of the data set. It didn’t take him very long because it was well organized, but that is unusual. Typically, when one generates datasets, they get thrown into different folders until they’re needed.
Unfortunately, Max wasn’t done yet. If he sent the group the correct version of the full dataset, they’d be getting far more data than they needed to re-do the specific plots they wanted. It was also simply too much to send: “many many gigabytes” worth. So, he had to comb through the dataset to find the specific outputs he needed, then download just those. Next, he had to open his old MATLAB code, edit it to output CSV instead of plotting, and email the CSV file to the people in the group.
The real kicker is that he didn’t have to do this just once; he had to repeat the whole process multiple times for multiple plots. Now imagine this was an industry use case: suppose a report needed to be made comparing the old data to new data, but doing so required getting in contact with someone who had left the company just to find out how to parse their data. That would dramatically slow down the path from idea to usable end product.
With workflows and Orquestra, all of this data management is automated and sharing scientific results is significantly easier. As your data is processed through your workflow, it is stored for you in the database, with an easy way to see which version is which. If you want to share the results with someone, you can just give them a link to the repo where your workflow lives and point them to the data. They can then see all of your data in a JSON file, making it simple to parse out exactly the pieces they need.
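As a rough illustration, suppose the shared results arrive as a single JSON file with one record per workflow step. The field names below are invented, not Orquestra’s actual schema; the point is that re-doing a plot becomes a few lines of parsing instead of a round trip through old MATLAB code.

```python
import json

# Load the shared results file; the structure is invented for illustration.
with open("results.json") as f:
    results = json.load(f)

# Pull out just the values needed to re-create one plot, instead of combing
# through "many many gigabytes" of raw output.
energies = [step["energy"] for step in results["steps"] if "energy" in step]
print(f"Loaded {len(energies)} energy values for re-plotting")
```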
This third short story is all about mystery, intrigue, and Data Reproducibility. In it, Jerome Gonthier works to understand the terrain, even with shifting data structures.
During Jerome’s postdoc work, one of the big pain points was structuring and organizing the data. He started his project organized one way, but as he learned more and wanted to try different things, the structure changed a bit with each revision. Because it’s hard to remember exactly what the structure and the code looked like a few months ago, going back to communicate results became a real pain.
Orquestra, on the other hand, makes it easier to track versioning with Git. You can also automatically record the commit hash with the workflow; this makes going back to check out previous versions of the code, your analysis tools, and the data a fairly simple process. And because every step and output of the workflow is recorded, there is a story of the data from start to finish. This means going back to re-run a specific version of your workflow is possible, and it guarantees that you can hand all of the environment’s parameters to another person (or to your future self). That ensures almost perfect data reproducibility.
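Here’s a minimal sketch of that underlying idea in plain Python, assuming you just want to stamp each run with the commit and parameters that produced it. It’s not Orquestra’s internal mechanism, and the metadata fields are illustrative.

```python
import json
import subprocess
from datetime import datetime, timezone

def run_metadata(params):
    # Record the exact commit of the code that produced this run.
    commit = subprocess.run(
        ["git", "rev-parse", "HEAD"], capture_output=True, text=True, check=True
    ).stdout.strip()
    return {
        "git_commit": commit,
        "parameters": params,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }

# Store the metadata next to the results so the run can be reproduced later.
metadata = run_metadata({"bond_length": 0.74, "basis": "sto-3g"})
with open("run_metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)
```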
Marta Mauri plots a lot. Plotting to take over the world! Well, maybe not the world… and the plots might actually be for experiments… okay, fine, she’s creating graphs and plots, not villainy. But in many cases, Data Visualization is the real villain! While working at Zapata, Marta needs to run a workflow and then quickly visualize the results.
In an ideal setting, that would be a quick, almost automatic step: run the workflow, pull the results, and look at a plot. In the real world, without Orquestra, it turns into yet another round of writing scripts just to get the data out before you can even start plotting.
The process with Orquestra and workflows looks so different! Here’s what Marta had to say about how it works now:
“The fact that you have the DCS [Data Correlation Service] and you can connect to Tableau and just drag things until you have the plot that you want, for me, this is super convenient. Just to have a basic idea of what is going on. Maybe to have fancy stuff, it requires time as always because graphing things is always a big deal, but just to have an idea of how your simulations are doing and the scaling of your quantities, that can be done easily. It could be easier to see if there’s some giant bug that you didn’t notice before because you don’t have to write another script to get your data”
This is just another part of the process of running quantum experiments that Orquestra makes faster, easier, and more user-friendly. Making it easy to sanity-check your data with quick plots means less time wasted creating polished plots of bad data, or running more experiments that produce it. It also means that quick plots to show to customers or reviewers, or to drop into a presentation, are far easier to create than they would be by hand.
Another point to note is that Marta almost exactly echoed what Jerome said about Data Reproducibility. In her own words, “So you can retrieve what you ran and how long it took and all the information that you need without like having to struggle with combining, ‘okay so I ran this on that day and that day I was using these parameters because I wanted to check this…’”
Almost like this is a common problem in quantum that Orquestra eliminates?
Our final tale is from Jhonathan Romero Fontalvo and Hannah Sim. They had different stories, but similar themes. When Jhonathan was completing his PhD, he also had to deal with Code Reusability (or the Lack Thereof). Hannah is in grad school, and before she had Orquestra, she had to deal with many of the same problems, especially surrounding changing code bases and libraries.
In academia, Jhonathan had different ideas for algorithms that he wanted to publish. However, in order to publish them, he needed proof that they actually worked. In other words, he needed to show proof-of-principle code that produced numerical results.
That made sense, but it became a pain point: every time he developed a new algorithm, he had to go back and re-build the proof-of-principle code from scratch. Some very basic components were re-usable, but most of each project had to be completely re-done. As Jhonathan says, “
You don’t pay too much attention to making code reproducible and making it easy to use for other people and unfortunately, that actually backfires because you end up spending more time building stuff from scratch every time and you spend a lot of time if you are collaborating with someone trying to explain how your code works for someone else to use it
“
This can be avoided if your code is re-usable, but it’s often hard enough to re-use your own code, let alone understand the format someone else is using in theirs. There’s also an issue of compatibility when sharing code: if one person in a group is using one framework and another person is using a different framework, interfacing between them can be virtually impossible. Orquestra makes this easy because the intermediate data between steps is stored in a consistent, standardized format. That means each step in Orquestra is modular and can easily be switched out for a different module. If you want to compare one optimizer to another, the change is just one line in a workflow, rather than many lines and many hours of coding individual scripts.
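Here’s a rough sketch of what that modularity buys you, using SciPy’s optimizers as a stand-in (the config key is hypothetical, and this isn’t Orquestra’s interface): when every optimizer consumes and produces data in the same format, comparing them is a one-value change rather than a rewrite.

```python
from scipy.optimize import minimize

def objective(x):
    # Stand-in for whatever cost function the experiment minimizes.
    return (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2

config = {"optimizer": "Nelder-Mead"}  # swap in "Powell", "BFGS", ... to compare

result = minimize(objective, x0=[0.0, 0.0], method=config["optimizer"])
print(config["optimizer"], result.x)
```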
Hannah Sim’s stories were similar to Jhonathan’s, with the added complexity that the quantum computing libraries we take for granted today were just starting up when she began grad school. That meant a lot of her code base was shifting even as she tried to run experiments. Plus, she had to keep track of all the different changes within a program that make an experiment successful. It’s too much to track efficiently without a program to help manage, or orquestrate, all the moving parts. Because of this experience, and because she’s a current grad student herself, it means a lot when Hannah says about Orquestra, “
This is basically a grad student’s dream. Take, for example, a VQE calculation. There are so many moving parts to VQE, and often when you’re writing a paper, it’s usually an improvement to an algorithm and you want to test that against different types of molecules, over different bond lengths and geometries. To do that, even with a script it’s a lot of things to keep track of and submit batches at a time. But using Orquestra you can do that using one script, or YAML file
”
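To give a flavor of what “one script for the whole sweep” might look like, here’s a hypothetical sketch (not an actual Orquestra workflow): enumerate every combination of molecule, bond length, and ansatz once, and let the tooling fan the list out as separate tasks.

```python
from itertools import product

molecules = ["H2", "LiH", "H2O"]
bond_lengths = [0.6, 0.8, 1.0, 1.2]            # angstroms
ansatzes = ["UCCSD", "hardware-efficient"]

# One flat list of task specifications; a workflow engine could run each entry
# as its own (possibly parallel) step and store every output automatically.
tasks = [
    {"molecule": m, "bond_length": r, "ansatz": a}
    for m, r, a in product(molecules, bond_lengths, ansatzes)
]
print(f"{len(tasks)} VQE runs specified in one place")
```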
For most users, the biggest speedup from Orquestra won’t come from running parallel steps simultaneously (although you can do that, too). The speedup will come from Orquestra’s ability to automate tasks that humans would otherwise have to do manually. Without workflows, changing the input or setup for an experiment can be a tedious, time-consuming task; it can take months just to adjust the parameters you need. After using Orquestra, Jhonathan says, “With Orquestra, it is just so much easier. If I want to re-run an example, I just tweak a couple of things in a workflow and launch it. It runs and I don’t have to struggle creating code to analyze the data … So that definitely speeds up the whole process and leaves you more time for doing the fun stuff which is coming up with the algorithms and with interesting applications.”
All of these are great reasons to use workflows—automating manual tasks, data management, easy data reproducibility, and dependency management. However, there are two things to note here:
Just like peanut butter and jelly, the individual components of Orquestra are good, but when they come together, they can create something incredible. In fact, that’s what a lot of people say about Orquestra. (Not the part about PB&J, although I think they’d agree.) When I asked Zapatistas to answer the question, “Why Workflows?” in just one or two sentences, here’s what they said:
I’d summarize it this way: