This post was originally written for zapatacomputing.com, but we thought it would be useful to have here too. Enjoy!
This post is going to be long and thorough. Because of that, here’s a TL;DR:
- Orquestra is based on workflows
- Workflows mitigate problems such as
  - Conflicting dependencies for your project
  - Inability to share the data that you want to share
  - Difficulty of reproducing old data and old versions
  - Inability to re-purpose code for use in multiple projects
- They also allow you to plug-and-play pieces of software and hardware backends as they become available
- Workflows can automate a lot of the laborious tasks of doing quantum computing at large scale
My name is Ethan Hansen and I am a summer intern working with the marketing and product teams here at Zapata. I’ve been working here for four weeks, and in that time I’ve been trying to get to the bottom of the question,
“What is it about using workflows that makes Orquestra so purpose-built for quantum computing?”
At first, I wondered why it was beneficial to use workflows at all. I was perfectly happy coding up little tutorials from different SDKs and libraries in Jupyter notebooks and running them a couple of times. Couldn’t all of quantum be done like this?
I already knew that was unsustainable from experience. There’s no single language or framework that does everything in quantum computing. Because of that, as I was learning different frameworks and going through tutorials, not only did I have to deal with making sure different versions of dependencies were compatible, but also whenever I wanted to switch from one framework to another, it meant switching WAY more than just the framework. One time I even had to completely re-do my file directory structure! The tutorials were also all in Jupyter notebooks or individual scripts, so if I wanted to re-run a set of them, I had to go through and individually re-start each one, often making my laptop freeze if I was running too many at the same time.
This is just a small-scale example of challenges that people and companies are already running into with quantum computing. Think about doing what I went through but for quantum research! You can imagine how managing dependencies, sharing data, reproducing data, visualizing data, and re-purposing code can quickly become significant burdens on the people and teams doing the work. In fact, you don’t have to imagine! I’ve collected examples of issues from Zapata team members who have been doing quantum computing for years to demonstrate to you just how much of a headache these problems can quickly become. I’ll also explain why and how they built Orquestra to mitigate those problems. I will be recounting the tales of tragedy and comedy, of headache and heartbreak that these Zapatistas experienced before they had Orquestra and workflows.
A bit of background
This post is not focused on teaching, “What is a workflow?” If you’d like to know more about them, please see other resources for that, like our summary in the docs or this definition of Workflow Management Software. Nevertheless, I’ll give a brief overview, just to make sure everyone is on the same page:
Quantum researchers who run their scripts manually face a number of problems: results are difficult to reproduce, managing data across steps is a pain, experimenting with new libraries is time-consuming, and accessing compute or quantum hardware to run those experiments is a whole project in and of itself. Workflows abstract away these complexities by codifying each step of your work into containerized tasks that can be run in parallel or in series. In other words, a workflow is a specification of what happens when, so that data and execution management can be done automatically.
Allow me to add just one more note: Workflows don’t replace SDKs and languages; they automate the management of them. Workflows aren’t an exclusive platform. They are highly inclusive, allowing users to switch between libraries and backends with ease.
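To make "a specification of what happens when" a bit more concrete, here is a minimal sketch in plain Python. It is not how Orquestra is implemented and it does not use Orquestra's workflow format. The step names, the `run_workflow` helper, and the numbers are all made up for illustration. It needs Python 3.9 or newer for the standard-library `graphlib` module.

```python
from graphlib import TopologicalSorter


def run_workflow(steps):
    """Run steps in dependency order and collect each step's output.

    `steps` maps a step name to (function, list of upstream step names).
    Each function receives the outputs of its upstream steps as keyword
    arguments named after those steps.
    """
    # TopologicalSorter expects {node: set of predecessors}
    graph = {name: set(deps) for name, (_, deps) in steps.items()}
    order, outputs = [], {}
    for name in TopologicalSorter(graph).static_order():
        func, deps = steps[name]
        outputs[name] = func(**{dep: outputs[dep] for dep in deps})
        order.append(name)
    return order, outputs


# Hypothetical steps standing in for real tasks
def generate_angles():
    return [0.0, 0.5, 1.0]


def evaluate(generate_angles):
    return [round(1.0 - a * a, 3) for a in generate_angles]


def summarize(evaluate):
    return {"min": min(evaluate), "count": len(evaluate)}


# The specification: which steps exist and what each one depends on.
# The order written here does not matter; the runner works it out.
workflow = {
    "summarize": (summarize, ["evaluate"]),
    "generate_angles": (generate_angles, []),
    "evaluate": (evaluate, ["generate_angles"]),
}

order, outputs = run_workflow(workflow)
print(order)
print(outputs["summarize"])
```

The point of the sketch is that you only declare the steps and their dependencies. Deciding the execution order and moving outputs from one step to the next is the runner's job, not yours.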
Depending on what you want
Speaking of switching between libraries, let’s dive into the tragic tale of William Simon, who worked so hard to simulate molecules, but the forces of Dependencies and Packages stymied his advance:
“There were a couple of different packages that we wanted to use for the project, notably Psi4 and qHiPSTER, Intel’s Quantum Simulator, … and basically I could run my psi4 calculations on the Tufts cluster, which is what I was using, and then it turned out that I couldn’t install the Intel Simulator on that cluster because of a lack of packages and dependencies.”
It’s time to find out who wants to be a millionaire. Did William…
A: Quickly find a clever workaround to allow him to use qHiPSTER on his university cluster
B: Ask Tufts cluster admins to install the needed packages and dependencies, waiting weeks for it to be completed
C: Give up on running qHiPSTER altogether, building out his own solution that would work on the Tufts cluster
D: Build his OWN CLUSTER IN HIS BACK YARD so he could install and use what he needed
Ready for the answer? Let’s find out what William did. In his own words:
“The admins at Tufts wouldn’t let me install those base things,” so we know that B can’t be the right answer.
Alright, if you chose B, here’s a second chance. Go ahead and pick a second answer. Turns out the correct answer is, drumroll please…
None of them! It was a trick question!
William was actually “able to find another supercomputing cluster at Pittsburgh” that he could use qHiPSTER on. Which is great! Now he could just go ahead and use that cluster to do all his calcula… “BUT it wouldn’t let me install the psi4 stuff.”
Eventually, William ended up doing half of the calculation on the Tufts cluster, then downloading all of the files onto his laptop, then uploading them to the Pittsburgh cluster. And he had to do that manually for weeks on end. Do you know what the worst part was? He couldn’t watch Netflix while all this was happening! (My words, not William’s.) With workflows on Orquestra, on the other hand, William ran the same experiment in a couple of hours. And if he wanted to, he could have watched some Netflix at the same time. The moral of this story:
Workflows allow you to ensure you have all your needed dependencies. If you have conflicting dependencies, workflows allow you to automatically pass the data between two steps that run in separate environments.
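Here is a rough sketch of that hand-off idea, assuming the two steps only ever talk to each other through a file. In a real workflow each step would run in its own environment with its own dependencies. In this toy version both are ordinary Python functions in one process, and the function names, file names, and numbers are placeholders I made up.

```python
import json
import tempfile
from pathlib import Path


def chemistry_step(out_path):
    """Stand-in for a step that needs one environment (say, a chemistry package)."""
    # Placeholder values, not real results
    data = {"molecule": "H2", "integrals": [0.1, 0.2, 0.3]}
    Path(out_path).write_text(json.dumps(data))


def simulation_step(in_path, out_path):
    """Stand-in for a step that needs a different environment (say, a simulator)."""
    data = json.loads(Path(in_path).read_text())
    result = {"molecule": data["molecule"], "n_integrals": len(data["integrals"])}
    Path(out_path).write_text(json.dumps(result))


with tempfile.TemporaryDirectory() as workdir:
    handoff = Path(workdir) / "integrals.json"
    final = Path(workdir) / "result.json"

    chemistry_step(handoff)          # step 1 writes its output
    simulation_step(handoff, final)  # step 2 reads it, no laptop in between

    result = json.loads(final.read_text())

print(result)
```

Because neither step imports anything from the other, the only thing they have to agree on is the file format. That is what makes it possible to run them in environments that could never be installed side by side.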
Sharing is Caring
The next story comes courtesy of Max Radin, who wanted to send data, but first had to overcome the two-headed monster of Data Versioning and Data Sharing.
There was a person in Max’s old postdoc group who wanted to reproduce some of the graphs from Max’s project and needed the original data to do so. Max made a point to back up all his data, so thankfully he still had it!
That alone was a lucky break for the postdoc group because not everyone is so organized. However, because there were different iterations of the code as Max refined his project, there were different versions of the dataset.
Max got to work, finding the right version of the dataset. It didn’t take him very long because it was well organized, but that is unusual. Typically, generated datasets get thrown into different folders until they’re needed.
Unfortunately, Max wasn’t done yet. If he gave the people in the group the whole dataset, even the correct version, they’d be getting a lot of data they didn’t really need to re-do the specific plots they wanted. It was also just too much data, “many many gigabytes” worth. So, he had to comb through the dataset to find the specific outputs he needed to send over, then download just those. Next, he had to open his old MATLAB code, edit it to output to CSV instead of plotting, and email the CSV file to the people in the group.
The real kicker is that he didn’t have to do this just once; he had to repeat the whole process multiple times for multiple plots. Now imagine this was an industry use case: suppose a report needed to be made comparing the old data to new data, but doing so required getting in contact with someone who’d left the company in order to find out how to parse their data. That would have dramatically lengthened the time from idea to usable end product.
With workflows and Orquestra, all this data management is automated and sharing scientific results is significantly easier. As your data is processed through your workflow, it is stored for you in the database, with an easy way to see which version is which. If you want to share the results with someone, you can just give them a link to the repo where your workflow is and point to the data. Then they can see all your data in a JSON file, which they can easily parse to pull out the data they need.
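To give a feel for what "parse the JSON and use the data you need" can look like, here is a small sketch that does the same job Max did by hand: pull out only the values needed for a plot and write them to CSV. The JSON layout, the field names, and the numbers are all hypothetical. The actual structure of a workflow result may look different.

```python
import csv
import io
import json

# Hypothetical result layout with placeholder values; the real schema may differ.
raw = json.dumps({
    "workflow_version": "v2",
    "steps": [
        {"name": "run-a", "bond_length": 0.7, "energy": -1.10, "log": "verbose output"},
        {"name": "run-b", "bond_length": 0.9, "energy": -1.12, "log": "verbose output"},
    ],
})


def extract_rows(json_text, fields):
    """Keep only the requested fields from each step's record."""
    steps = json.loads(json_text)["steps"]
    return [{field: step[field] for field in fields} for step in steps]


def to_csv(rows, fields):
    """Serialize the selected rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


fields = ["bond_length", "energy"]
rows = extract_rows(raw, fields)
csv_text = to_csv(rows, fields)
print(csv_text)
```

The person receiving the data chooses the fields they care about, so nobody has to dig up old plotting code or email files back and forth.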
Can you do that again?
This third short story is all about mystery, intrigue, and Data Reproducibility. In it, Jerome Gonthier works to understand the terrain, even with shifting data structures.
During Jerome’s postdoc work, one of the big pain points was structuring and organizing the data. He started his project organized in one way, but after learning more and wanting to try different things, he changed the structure a bit in later revisions. Because it’s hard to remember exactly what the structure and the code were a few months ago, going back to communicate results became a real pain.
Orquestra, on the other hand, makes it easier to track versioning with Git. You can also automatically record the commit hash with the workflow; this makes going back to check out previous versions of the code, your analytic tools, and the data a fairly simple process. And because every step and output from the workflow is recorded, there is a story of the data from start to finish. This means that going back to re-do a specific version of your workflow is possible. It also guarantees that you can give all the parameters for the environment to another person (or just yourself in the future). That ensures almost perfect data reproducibility.
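If you are curious what recording the commit hash alongside a run might look like, here is a minimal sketch of the idea in plain Python. It is only an illustration, not Orquestra's mechanism. The helper names, the parameter names, and the output value are all made up. The only real external call is `git rev-parse HEAD`, which prints the hash of the current commit.

```python
import json
import subprocess


def current_commit():
    """Return the current Git commit hash, or None if it cannot be determined."""
    try:
        completed = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # Git is not installed, or we are not inside a repository with commits
        return None
    return completed.stdout.strip()


def build_run_record(parameters, outputs, commit):
    """Bundle what was run, with which code version, and what came out."""
    return {"commit": commit, "parameters": parameters, "outputs": outputs}


record = build_run_record(
    parameters={"optimizer": "hypothetical-optimizer", "max_iterations": 100},
    outputs={"energy": -1.1},  # placeholder value
    commit=current_commit(),
)
record_json = json.dumps(record, sort_keys=True)
print(record_json)
```

With a record like this stored next to every result, "which version of the code made this number?" stops being a memory exercise.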
Plans are of little importance, but Plotting is essential
Marta Mauri plots a lot. Plotting to take over the world! Well, maybe not the world… and the plots might actually be for experiments… okay fine she’s creating graphs and plots, not villainy. But in many cases, Data Visualization is the real villain! While working at Zapata, Marta needs to run a workflow, then quickly visualize the results.
In an ideal setting, that would mean:
- Get the data
- Parse out the data that’s needed
- Create a script to plot the data
In the real world, without Orquestra, it actually turns into:
- Get the data
- Clean up the data
- Parse out the data that’s needed
- Create a script to plot the data
- Realize the script doesn’t work
- Realize you need to use more / different data
The process with Orquestra and workflows looks so different! Here’s what Marta had to say about how it works now:
“The fact that you have the DCS [Data Correlation Service] and you can connect to Tableau and just drag things until you have the plot that you want, for me, this is super convenient. Just to have a basic idea of what is going on. Maybe to have fancy stuff, it requires time as always because graphing things is always a big deal, but just to have an idea of how your simulations are doing and the scaling of your quantities, that can be done easily. It could be easier to see if there’s some giant bug that you didn’t notice before because you don’t have to write another script to get your data”
This is just another part of the process of running quantum experiments that Orquestra makes much faster, easier and more user-friendly. Making it easier to do a sanity check on your data with easy plotting means less time is wasted creating plots for bad data and running experiments that give bad data. Making it faster also means that quick plots to show to customers, reviewers, or to put into a presentation are exceedingly easy to create, compared to doing it manually.
Another point to note is that Marta almost exactly echoed what Jerome said about Data Reproducibility. In her own words, “So you can retrieve what you ran and how long it took and all the information that you need without like having to struggle with combining, ‘okay so I ran this on that day and that day I was using these parameters because I wanted to check this…’”
Almost like this is a common problem in quantum that Orquestra eliminates?
Starting from scratch, again
Our final tale is from Jhonathan Romero Fontalvo and Hannah Sim. They had different stories, but similar themes. When Jhonathan was completing his PhD he also had to deal with Code Reusability (Or the Lack Thereof). Hannah is in grad school, and before she had Orquestra she had to deal with many of the same problems, especially surrounding changing code bases and libraries.
In academia, Jhonathan had different ideas for algorithms that he wanted to publish. However, in order to publish them, he needed proof that they actually worked. In other words, he needed to show proof-of-principle codes that produced numerical results.
That made sense, but it did become a pain point when every time he developed a new algorithm, he had to go back and re-build the proof-of-principle code from scratch. Some very basic components were re-usable, but most of the projects had to be completely re-done. As Jhonathan says,
“You don’t pay too much attention to making code reproducible and making it easy to use for other people and unfortunately, that actually backfires because you end up spending more time building stuff from scratch every time and you spend a lot of time if you are collaborating with someone trying to explain how your code works for someone else to use it”
This can be avoided if your code is re-usable, but often it’s hard enough to re-use your own code (let alone try to understand the format someone else is using in theirs). There’s also an issue of compatibility when sharing code. If one person in a group is using one framework, and another person is using a different framework, interfacing between them can be virtually impossible. Orquestra makes it easy, however, because the intermediate data between steps is stored in a consistent, standardized format. That means each step in Orquestra is modular and can be easily switched out for a different module. If you want to compare one optimizer to another, the change is just one line in a workflow, rather than many lines and many hours of coding individual scripts.
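Here is a toy version of that "change one line" idea, written in plain Python rather than in an actual workflow file. The two optimizers, the `OPTIMIZERS` registry, and the `config` dictionary are all invented for this sketch. Because both optimizers accept the same inputs and return the same kind of output, swapping one for the other only means changing a name.

```python
def gradient_descent(f, x0, steps=200, lr=0.1, eps=1e-6):
    """Minimize f by following a finite-difference estimate of its slope."""
    x = x0
    for _ in range(steps):
        grad = (f(x + eps) - f(x - eps)) / (2 * eps)
        x -= lr * grad
    return x


def grid_search(f, x0, steps=200, width=5.0):
    """Minimize f by trying evenly spaced points around the initial guess."""
    candidates = [x0 - width + 2 * width * i / steps for i in range(steps + 1)]
    return min(candidates, key=f)


# Both optimizers share one interface: (function, initial guess) -> best x
OPTIMIZERS = {"gradient_descent": gradient_descent, "grid_search": grid_search}

# Hypothetical configuration; swapping optimizers means editing one value here
config = {"optimizer": "gradient_descent", "initial_guess": 0.0}


def run(config, cost):
    """Look up the configured optimizer and apply it to the cost function."""
    optimize = OPTIMIZERS[config["optimizer"]]
    return optimize(cost, config["initial_guess"])


def cost(x):
    """Toy cost function with its minimum at x = 2."""
    return (x - 2.0) ** 2


result_gd = run(config, cost)
result_grid = run({**config, "optimizer": "grid_search"}, cost)
print(result_gd, result_grid)
```

Nothing downstream of `run` has to know which optimizer was used, which is the same reason a standardized format between steps makes modules interchangeable.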
Hannah Sim’s stories were similar to Jhonathan’s, with the added complexity that the quantum computing libraries we take for granted today were just starting up when she started grad school. That meant a lot of her code base was shifting, even as she tried to run experiments. Plus, she had to consider all the different changes within a program that make an experiment successful. It’s too much to efficiently track without a program to help manage, or orquestrate, all the moving parts. Given that experience, and since she is a current grad student herself, it means a lot when Hannah says about Orquestra,
“This is basically a grad student’s dream. Take, for example, a VQE calculation. There are so many moving parts to VQE, and often when you’re writing a paper, it’s usually an improvement to an algorithm and you want to test that against different types of molecules, over different bonds lengths and geometries. To do that, even with a script it’s a lot of things to keep track of and submit batches at a time. But using Orquestra you can do that using one script, or YAML file”
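The bookkeeping Hannah describes (every molecule, at every bond length, for every variant of the algorithm) is essentially a grid of jobs. Here is a small sketch of generating that grid in plain Python. The molecule names, bond lengths, variant labels, and the `build_jobs` helper are placeholders of mine, and this is not Orquestra's YAML format.

```python
from itertools import product

# Hypothetical sweep values, chosen only for illustration
molecules = ["H2", "LiH"]
bond_lengths = [0.6, 0.8, 1.0]
ansatzes = ["baseline", "improved"]


def build_jobs(molecules, bond_lengths, ansatzes):
    """Return one job description per combination of the sweep values."""
    return [
        {"job_id": i, "molecule": m, "bond_length": b, "ansatz": a}
        for i, (m, b, a) in enumerate(product(molecules, bond_lengths, ansatzes))
    ]


jobs = build_jobs(molecules, bond_lengths, ansatzes)
print(len(jobs))
print(jobs[0])
print(jobs[-1])
```

Two molecules, three bond lengths, and two variants already make twelve jobs. Add one more list and the count multiplies again, which is exactly why nobody wants to submit and track these by hand.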
For most users, the biggest speedup from Orquestra won’t come from running parallel steps simultaneously (although you can do that, too). The speedup will come from Orquestra’s ability to automate tasks that humans would otherwise have to do manually. If you want to change the input or setup for an experiment without workflows, it can be a tedious, time-consuming task taking months to adjust the parameters you need. After using Orquestra, Jhonathan says, “With Orquestra, it is just so much easier. If I want to re-run an example, I just tweak a couple of things in a workflow and launch it. It runs and I don’t have to struggle creating code to analyze the data … So that definitely speeds up the whole process and leaves you more time for doing the fun stuff which is coming up with the algorithms and with interesting applications.”
Wrapping it all up
All of these are great reasons to use workflows—automating manual tasks, data management, easy data reproducibility, and dependency management. However, there are two things to note here:
- Workflows confer many more advantages than just those listed here. Other benefits include modularity, parallelization, and cloud computing compatibility. For instance, modularity means that as new libraries and hardware backends become available, you can plug-and-play them into your workflow seamlessly. All of these speed up different parts of the path to creating and testing a real application.
- The true power of Orquestra doesn’t come from just one of these advantages, but rather from the fact that this one product does all of it. You get a single location for everything you need, from data management to libraries, from backends to containerization. The goal of Orquestra is to make your life easier with all the stuff that you have to do, but no one wants to do.
Just like peanut butter and jelly, the individual components of Orquestra are good, but when they come together, they can create something incredible. In fact, that’s what a lot of people say about Orquestra. (Not the part about PB&J, although I think they’d agree.) When I asked Zapatistas to answer the question, “Why Workflows?” in just one or two sentences, here’s what they said:
- Jerome: To get reproducible results, to be able to share a project easily, basically you can just transfer what you have to them and they can run it right away. They don’t have to understand a bunch of obscure python scripts you’ve written.
- Alex: Reproducibility is a good thing, having a unified format for output. There are a lot of different packages and they all output data differently, so having a standardized output is very nice.
- William: It’s a smarter and more natural way to run experiments where a lot of the things you wish you had but didn’t know you needed are automatically taken care of.
- Max: It automates the tasks that become really cumbersome when you’re working on a complicated project that involves a lot of 1. Calculations, 2. different pieces of software, or 3. people. Automating a lot of these very laborious things that need to get done.
- Jhonathan: In science, it’s important to reproduce experiments. That is sort of the principle behind, one of the main principles behind, experimental science…. In order to make things reproducible in software, workflows is the answer. It’s essentially the perfect way of expressing a numerical experiment.
- Michał: Workflows are a framework that allows you to develop research in a way that is inherently scalable and inherently reusable, rather than writing scripts for your own purpose and then making it hard for others to use it because everyone uses different standards.
- Hannah: In my mind, how I’ve been thinking about submitting a collection of jobs and managing them is in terms of workflows. It’s almost a natural way of thinking about doing that process. It’s just now it’s formalized and there is a structure to it. Often it’s like I think of them as workflows, but then how I execute it is like manually going in and doing the workflow manager’s job for them.
- Marta: Because their structure helps you as a scientist to design your experiments carefully all the way from the code to the deliverables! The workflow approach forces you to ask yourself questions like what kind of data do I want to produce?, which intermediate step do I need to get to the final results?, what are the dirty details of the code that I can sweep under the rug and don’t need to see again after I’ve written and tested them?, and instead what parameters shall I be able to tweak to do interesting science? Workflows help you by shaping your scientific experiment in a highly compact, modular and reproducible way.
I’d summarize it this way:
Workflows make life much easier by specifying the steps to run an experiment, all while making it faster and easier to run what you want (both now and in the future) and to use the resulting data effectively.