Workflow Orchestration vs. Data Orchestration — Are Those Different?

Published
May 2, 2022
Tags
Data Engineering, Architecture & Design, Prefect, Data Quality, Observability & SRE

Photo by Artem Podrez from Pexels

Let’s disambiguate the terms to understand workflow orchestration better — with a real-life analogy!

With the rise of the Modern Data Stack, many tools in the industry started positioning themselves as “data orchestrators” rather than “workflow orchestrators.” This article attempts to disambiguate the terms. I’d argue that the data orchestration moniker is a confusing shorthand term and that workflow orchestration and data flow automation better represent what orchestration for the Modern Data Stack is about.

What is workflow orchestration?

Workflow orchestration means governing your data flow in a way that respects the orchestration rules and your business logic. A workflow orchestration tool allows you to turn any code into a workflow that you can schedule, run, and observe.

A good workflow orchestration tool will provide you with building blocks to connect to your existing data stack and will allow you to:

  • pass data between tasks,
  • trigger ad-hoc parametrized runs,
  • assign custom (complex) schedules,
  • get alerted when something fails,
  • retry and recover from failures,
  • avoid expensive recomputation via caching.

It will also save you from writing defensive code whose only purpose is to ensure that your workflow steps run in the right order and that you find out when a failure takes place (visibility).
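To make a few of those building blocks concrete, here is a minimal sketch in plain Python. The `task` decorator and its `retries`, `cache`, and `on_failure` parameters are hypothetical names invented for illustration; this is not the API of Prefect or any other tool, only a toy version of the defensive code an orchestrator writes for you.

```python
import functools
import time


def task(retries=0, retry_delay=0.0, cache=False, on_failure=None):
    """Hypothetical decorator bundling retries, caching, and failure alerts."""
    def decorator(fn):
        results = {}  # cache keyed by the positional arguments

        @functools.wraps(fn)
        def wrapper(*args):
            if cache and args in results:
                return results[args]  # skip expensive recomputation
            for attempt in range(retries + 1):
                try:
                    value = fn(*args)
                    break
                except Exception as exc:
                    if attempt == retries:
                        if on_failure is not None:
                            on_failure(fn.__name__, exc)  # alert, then give up
                        raise
                    time.sleep(retry_delay)
            if cache:
                results[args] = value
            return value
        return wrapper
    return decorator


calls = {"extract": 0}
alerts = []


@task(retries=2, cache=True, on_failure=lambda name, exc: alerts.append(name))
def extract(source):
    calls["extract"] += 1
    if calls["extract"] == 1:
        raise ConnectionError("transient network error")  # fails once only
    return [1, 2, 3]


@task()
def transform(rows):
    return [r * 10 for r in rows]


rows = extract("orders")  # the first attempt fails, the retry succeeds
result = transform(rows)  # data is passed from one task to the next
```

Calling `extract("orders")` a second time returns the cached rows without running the function body again, and `alerts` stays empty because the retry recovered from the failure.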

In short, good workflow software eliminates negative engineering.

When you think about good workflow tools, you should think about how they work when things go wrong. Workflows are only interesting when things fail; they’re kind of like insurance or risk management for code. Assuming things will go wrong — and they will — the right workflow tool should make it quick and simple to handle those “wrong things” and direct you towards how to fix them. — Jeremiah Lowin, May 2020

What do people mean when they say “data orchestration”?

They usually mean the orchestration of workflow nodes that touch data: any node that produces or consumes data falls into this category.

Following this definition, “data orchestration” is a shorthand term for orchestrating data (or data warehousing) workflows, but it still describes workflow orchestration or data flow automation.

Workflow orchestration — Explain Like I’m 5

Imagine that the workflow orchestration tool is your personal delivery service:

  • Each order (or shopping cart) reflects your workflow,
  • Each delivery is a workflow run,
  • It’s extremely easy and convenient to put things into a shopping cart — you just add a couple of decorators, and you’re off to the races,
  • Within each order (or shopping cart), you may have many products that get packaged into boxes — your tasks
  • Products within the boxes may have various flavors, forms, and shapes, and they reflect what you put into your shopping cart — what you wished to be orchestrated and how
  • Flavors may reflect your data replication jobs (e.g. Airbyte), data transformations (e.g. dbt), data cleaning (e.g. pandas), your ML use cases (e.g. scikit-learn), and so much more.
  • Your boxes may be as small or as big as you wish — it’s your order in the end (your workflow design),
  • Products inside of your boxes may come from various vendors, i.e. your data tools, e.g. dbt, Fivetran, your favorite ML frameworks, your custom data cleaning libraries,
  • The delivery address may either be your home address (your data warehouse), your holiday address (your data lakehouse), or an address of a friend (some external database, data processing service, microservice, or application).
Photo by Norma Mortenson from Pexels

Sequential, concurrent, and distributed workflow execution

Following the delivery service analogy, when orchestrating your workflows you may:

  • decide whether you want your order delivered all at once or piece by piece, i.e. the order of execution of your tasks,
  • choose your delivery type: standard delivery (sequential execution), express delivery using distributed services such as Dask or Ray (imagine multiple delivery trucks), or concurrency with async within a single thread (a single but more efficient deliverer who is better at context switching),
  • determine how your order should be (gift) wrapped: you may package it into a subprocess, a Docker container, a Kubernetes job, or an ECS task.

The workflow orchestrator takes care of the delivery, i.e. the execution. It ensures that your products get packaged as desired and shipped on the right schedule with the right delivery type for each box: some packages need to be delivered quickly with express delivery, while others can wait and get executed sequentially.
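The delivery types can be sketched with nothing but the standard library. This is only an illustration: the `deliver` functions and package names are made up, and a local thread pool stands in for the "multiple trucks" that Dask or Ray would provide across machines.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def deliver(package):
    time.sleep(0.1)  # stand-in for I/O-bound work
    return f"{package} delivered"


async def deliver_async(package):
    await asyncio.sleep(0.1)  # yields control while waiting
    return f"{package} delivered"


packages = ["box-1", "box-2", "box-3"]

# Standard delivery: one package after another.
start = time.perf_counter()
sequential = [deliver(p) for p in packages]
sequential_seconds = time.perf_counter() - start

# Multiple trucks: a thread pool as a local stand-in for distributed workers.
with ThreadPoolExecutor(max_workers=3) as pool:
    threaded = list(pool.map(deliver, packages))


# One efficient deliverer: a single thread that switches between packages.
async def main():
    return await asyncio.gather(*(deliver_async(p) for p in packages))


start = time.perf_counter()
async_results = asyncio.run(main())
async_seconds = time.perf_counter() - start
```

All three variants deliver the same packages in the same order. The async run finishes in roughly the time of one delivery, while the sequential run takes about three times as long.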

Scale and graceful failure handling

A good delivery (orchestration) service scales extremely well. You can have multiple deliveries scheduled for the same time with potentially thousands of trucks (or even cargo ship fleets) and millions of packages, and it will still give you fine-grained visibility into the delivery state of every package.

This scheduling service is asynchronous and highly available to guarantee that your order will get shipped even when some suppliers get sick, or some trucks break down. And many things can go wrong during the delivery:

  • some boxes may get damaged and need to be returned, i.e. retried or restarted,
  • the entire delivery may need to be rescheduled because you weren’t at home at that time.
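Here is a small sketch of what per-package visibility and rescheduling could look like. The `Delivery` class and the state names "Completed" and "Failed" are assumptions made for this example, not the state model of any specific tool.

```python
from dataclasses import dataclass, field


@dataclass
class Delivery:
    """Hypothetical run record that keeps one state per package (task)."""
    states: dict = field(default_factory=dict)

    def run(self, packages, ship):
        for package in packages:
            if self.states.get(package) == "Completed":
                continue  # already delivered, do not ship it twice
            try:
                ship(package)
                self.states[package] = "Completed"
            except Exception:
                self.states[package] = "Failed"  # one failure does not stop the rest
        return self.states


attempts = {}


def ship(package):
    attempts[package] = attempts.get(package, 0) + 1
    if package == "box-2" and attempts[package] == 1:
        raise RuntimeError("box damaged in transit")  # fails on the first try only


delivery = Delivery()
first = dict(delivery.run(["box-1", "box-2", "box-3"], ship))
second = dict(delivery.run(["box-1", "box-2", "box-3"], ship))
```

After the first run, only `box-2` is marked as failed. Rescheduling the same delivery retries just that box, and the packages that already arrived are left alone.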

Hybrid execution model

A good delivery service also respects your privacy and operates purely on metadata, such as your shipping address, the delivery type, and the packaging form. It is responsible for the transport and execution (the data flow), but it cannot and should not open the box to check what’s inside (your data).
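One way to picture this separation is the rough sketch below. The `run_locally` helper and the `reported` list are hypothetical stand-ins: the task runs where your data lives, and only a small metadata record would ever be sent to the orchestration service.

```python
import time


def run_locally(task_name, fn, payload):
    """Run fn on the payload in your own environment and describe the run."""
    start = time.perf_counter()
    try:
        result = fn(payload)
        state = "Completed"
    except Exception as exc:
        result = None
        # Report only the error type, since messages may contain data.
        state = f"Failed: {type(exc).__name__}"
    metadata = {
        "task": task_name,
        "state": state,
        "duration_seconds": round(time.perf_counter() - start, 3),
    }
    return result, metadata


reported = []  # hypothetical stand-in for the orchestrator's API

customers = [{"name": "Ada", "email": "ada@example.com"}]


def mask_emails(rows):
    return [{**row, "email": "***"} for row in rows]


result, metadata = run_locally("mask_emails", mask_emails, customers)
reported.append(metadata)  # the box itself (the data) never leaves
```

The record in `reported` contains the task name, its final state, and its duration, which is enough to track the delivery without ever seeing the customer rows.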

Why “workflow orchestration” is a less confusing term than “data orchestration”

A good workflow orchestration tool (your delivery service) allows you to pick and choose your products, put them into boxes, and customize your order as you wish. In the end, though, it is responsible for the data movement (the delivery, transport, data flow, and execution), not for the actual products within those boxes (your data). The term “data orchestration” is therefore a confusing shorthand: it drops the word that describes the actual movement of data as it flows through your system. Data flow orchestration, workflow orchestration, data movement orchestration, data processing orchestration, and data transport orchestration are all much clearer than the shorthand term “data orchestration”.

Workflow orchestration is about the data flow and ensuring that you can rely on its execution through various failure handling mechanisms. It can give you visibility into how long the delivery took. It can provide you with all shipment updates (your workflow execution logs). It can tell you whether a box was successfully shipped to the end recipient, but it cannot directly open it to check the brand, quality, and origin of the products inside.

Borrowing the analogy from this blog post, orchestration is about the arrows indicating transitions between the boxes (tasks) as they get executed, not about what is inside the boxes (your data). It should guarantee that your flows and tasks run as intended: in the right order, at the right time, and with the right parallelism. It should guard against errors and failures, help you recover from them, and help you correctly interpret the execution states.
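The "arrows" idea can be sketched with Python's standard `graphlib` module (Python 3.9+). The task names below are invented for illustration; the point is that the execution order is derived from the dependencies alone, without looking at any data.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of upstream tasks it depends on (the arrows).
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "train_model": {"transform"},
    "load": {"transform"},
    "report": {"train_model", "load"},
}

executed = []
sorter = TopologicalSorter(dependencies)
sorter.prepare()
while sorter.is_active():
    ready = sorted(sorter.get_ready())  # tasks whose upstream tasks are done
    for name in ready:  # tasks in the same batch could run in parallel
        executed.append(name)  # a real orchestrator would run the task here
        sorter.done(name)
```

Here `load` and `train_model` become ready at the same time once `transform` has finished, so an orchestrator would be free to run them in parallel before `report`.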

Conclusion

This article is intended to disambiguate the terms and give you some insight into what workflow orchestration is. If you still have questions or want to discuss this topic further, you can join the Prefect Community Slack: prefect.io/slack. There is a dedicated channel, #best-practices-orchestration, where you can ask me anything about it.

Thanks for reading!