Containers have become the de facto standard for moving data projects to production. No more dependency management nightmares: projects developed on a local machine can be “shipped” to a staging or production cluster, typically with no surprises. Data pipelines and ML models become reproducible and can run anywhere in the same fashion.
However, with an ever-growing number of containerized data workloads, orchestration platforms are becoming increasingly important.
- 1. Orchestrating Containers
- 2. Communicating Between Teams
- 3. Declarative Definition
- 4. Documentation
- 5. Versioning
- 6. Maintaining the Health of Your Execution Layer
- 7. Seamlessly Scale Your Execution Platform As Your Data Grows
- 8. Development and Production Environment Parity
- 9. Iterating Faster
- 10. Installing Almost Any Software With Helm Charts and Further Features
- The Common Theme
1. Orchestrating Containers
If you want to run reproducible data pipelines and ML models that can run anywhere, you probably know that a Docker image is the way to go. Everybody loves container images, from software engineers to DevOps and SREs.
In the past, the challenge was to run containers at scale. With Kubernetes — especially when deployed to elastic cloud services — managing the execution of containers at scale has become considerably less painful.
2. Communicating Between Teams
As much as many of us love building end-to-end data products, in reality, many enterprises split the responsibilities between who builds a data product and who runs it. After building data pipelines, data scientists and engineers often have to hand those over to DevOps and SREs for deployment and monitoring. Container images make this handover process much easier.
At the same time, containers facilitate sharing code and collaboration between data engineers, data scientists, and analysts who can all work in the same reproducible environment.
3. Declarative Definition
Data engineering workloads are dynamic: data volumes and resource requirements change over time. Imagine that you deployed your data pipeline to a Kubernetes cluster and it’s now failing due to an out-of-memory error. Fixing the issue may be a matter of increasing the memory limit in your declarative deployment file and applying the change to the cluster.
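As a rough illustration of the out-of-memory fix described above, the sketch below builds a declarative definition as a plain Python dict and raises its memory setting. The pipeline name, image, schedule, and memory values are hypothetical, and a CronJob is just one assumed way to run a pipeline. Since `kubectl` accepts JSON as well as YAML manifests, the printed output could be saved to a file and applied to a cluster.

```python
import copy
import json


def pipeline_manifest(name, image, memory):
    """Build a batch/v1 CronJob manifest as a plain dict."""
    return {
        "apiVersion": "batch/v1",
        "kind": "CronJob",
        "metadata": {"name": name},
        "spec": {
            "schedule": "0 * * * *",  # hypothetical: run once per hour
            "jobTemplate": {"spec": {"template": {"spec": {
                "restartPolicy": "Never",
                "containers": [{
                    "name": name,
                    "image": image,
                    "resources": {
                        "requests": {"memory": memory},
                        "limits": {"memory": memory},
                    },
                }],
            }}}},
        },
    }


def with_memory(manifest, memory):
    """Return a copy of the manifest with a new memory request and limit."""
    updated = copy.deepcopy(manifest)
    pod_spec = updated["spec"]["jobTemplate"]["spec"]["template"]["spec"]
    resources = pod_spec["containers"][0]["resources"]
    resources["requests"]["memory"] = memory
    resources["limits"]["memory"] = memory
    return updated


# Hypothetical name, image, and sizes, used only for illustration.
original = pipeline_manifest("etl-pipeline", "example.registry/etl:1.0", "512Mi")
fixed = with_memory(original, "2Gi")

print(json.dumps(fixed, indent=2))
```

The fix is a one-value change to the definition rather than a manual change on a server, which is the point of the declarative approach.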
4. Documentation
A nice “side effect” of workflow and configuration as code is that everything is documented: not only the business logic but also execution details such as where things are running and which resources they consume. When other engineers need to work on your code, there is no guessing and back-and-forth communication to get them up to speed with your data workloads.
5. Versioning
Another benefit of a declarative workflow and environment definition is that everything can be version-controlled. If something goes wrong, you can revert to the previous version. You can track changes made to your environment over time and provide an audit log for compliance. GitOps and MLOps made this approach popular, but containers and orchestration platforms effectively made it possible.
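To make the idea of tracking changes concrete, here is a minimal sketch that compares two versions of a declarative definition, much like a version control system shows a diff between commits. The configuration fragment and its values are hypothetical. In practice the files would live in a Git repository and the diff would come from Git itself.

```python
import difflib
import json


def render(definition):
    """Serialize deterministically so the diff only shows real changes."""
    return json.dumps(definition, indent=2, sort_keys=True).splitlines()


def definition_diff(old, new):
    """Return a unified diff between two versions of a definition."""
    return list(difflib.unified_diff(
        render(old), render(new),
        fromfile="previous", tofile="current", lineterm="",
    ))


# Hypothetical fragment of a workload definition in two versions.
v1 = {"image": "example.registry/etl:1.0",
      "resources": {"limits": {"memory": "512Mi"}}}
v2 = {"image": "example.registry/etl:1.0",
      "resources": {"limits": {"memory": "2Gi"}}}

changes = definition_diff(v1, v2)
print("\n".join(changes))
```

Because the whole environment is plain text, reverting amounts to applying the previous version of the file again.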
6. Maintaining the Health of Your Execution Layer
One of the most prominent benefits of Kubernetes is that it will always attempt to maintain the desired state and restart or recreate resources once they die. Of course, the self-healing “powers” of Kubernetes won’t fix all your problems (such as broken business logic), but at least you don’t need to intervene when an error occurs due to a temporary network issue that can get resolved with a restart or redeployment to a new pod.
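The self-healing behavior can be pictured as a control loop that keeps comparing the desired state with what is actually running. The following is a toy model of that idea only, not the actual Kubernetes controller code. The function name, pod names, and replica count are made up for illustration.

```python
def reconcile(desired_replicas, running_pods, next_id):
    """One pass of a toy control loop.

    running_pods is a list of names of pods that are currently healthy.
    Returns the new pod list and the next unused pod id.
    """
    pods = list(running_pods)
    # Start replacements until the desired count is reached.
    while len(pods) < desired_replicas:
        pods.append(f"worker-{next_id}")
        next_id += 1
    # Remove surplus pods if more are running than desired.
    del pods[desired_replicas:]
    return pods, next_id


pods, next_id = reconcile(3, [], 0)          # initial rollout
pods.remove("worker-1")                      # simulate a crashed pod
pods, next_id = reconcile(3, pods, next_id)  # the loop restores the desired state
print(pods)  # ['worker-0', 'worker-2', 'worker-3']
```

Note that the crashed pod is replaced by a new one rather than repaired. That is why this mechanism helps with transient failures but not with broken business logic, which would fail again in the replacement pod.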
7. Seamlessly Scale Your Execution Platform As Your Data Grows
Running your data pipelines on a single server probably won’t cut it for you these days. With growing amounts of data, it becomes difficult to manage data processing — regardless of whether we package our code as container images or run it in a single local process.
If you need to scale your workloads across multiple nodes, Kubernetes (especially in combination with Helm charts) makes it much easier to install Dask or Spark on a compute cluster and thus distribute data processing across multiple nodes. Most cloud providers offer autoscaling services or even provide a completely serverless data plane (for example, AWS EKS on Fargate and GCP GKE Autopilot). Those cloud vendors take care of scaling out worker nodes when needed, which largely removes the need to guess the required capacity up front.
Kubernetes by itself won’t give you immediate visibility into what happens in your data workloads, but it does make it easier to install a distributed compute engine and scale it out to multiple nodes.
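Node autoscaling is handled by the cloud provider, but the same principle can be shown at the pod level. As an illustration, the Kubernetes documentation for the Horizontal Pod Autoscaler describes the core rule as `desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue)`. The sketch below implements only that rule plus a min/max clamp. The function name and the example numbers are assumptions, and the real autoscaler applies further details (such as a tolerance band) that are left out here.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas, max_replicas):
    """Simplified Horizontal Pod Autoscaler rule with min/max bounds."""
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


# Hypothetical example: 4 workers averaging 90% CPU against a 50% target.
print(desired_replicas(4, 90, 50, min_replicas=1, max_replicas=10))  # 8
```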
8. Development and Production Environment Parity
A common challenge when building data products is that a development environment is often vastly different from how things are supposed to run in production. Containerized workloads make this transition much easier. You may have different clusters or namespaces for staging and production environments. Switching between them should be seamless.
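One way to keep environments in parity is to render every environment from the same definition and vary only the target namespace. The sketch below assumes hypothetical `staging` and `production` namespaces and a made-up image name. The `differing_keys` helper is illustrative and simply lists where two rendered definitions differ.

```python
def render_for(namespace, image_tag):
    """Render one workload definition for the given namespace."""
    return {
        "apiVersion": "batch/v1",
        "kind": "CronJob",
        "metadata": {"name": "etl-pipeline", "namespace": namespace},
        "spec": {
            "schedule": "0 * * * *",
            "jobTemplate": {"spec": {"template": {"spec": {
                "restartPolicy": "Never",
                "containers": [{
                    "name": "etl",
                    "image": f"example.registry/etl:{image_tag}",
                }],
            }}}},
        },
    }


def differing_keys(a, b, prefix=""):
    """List the dotted paths at which two nested dicts differ."""
    paths = []
    for key in sorted(set(a) | set(b)):
        path = f"{prefix}{key}"
        left, right = a.get(key), b.get(key)
        if isinstance(left, dict) and isinstance(right, dict):
            paths.extend(differing_keys(left, right, prefix=path + "."))
        elif left != right:
            paths.append(path)
    return paths


staging = render_for("staging", "1.0")
production = render_for("production", "1.0")
print(differing_keys(staging, production))  # ['metadata.namespace']
```

If the only difference is the namespace, whatever was tested in staging is the same image and configuration that runs in production.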
9. Iterating Faster
Iterations are crucial for building data workloads. Most data products are bad at first. You may need several cycles of cleaning and transforming data, testing various classifiers and hyperparameters, or enriching models with new data. Kubernetes deployments allow you to implement A/B testing or to run multiple instances of the same ML training job, but with different hyperparameters.
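Running the same training job with different hyperparameters can be sketched as generating one Job definition per combination. The image name, the hyperparameter names, and the `--name=value` argument convention below are hypothetical. A real training image would define its own command-line interface.

```python
import itertools


def training_jobs(image, grid):
    """Build one batch/v1 Job manifest per hyperparameter combination."""
    names = list(grid)
    jobs = []
    for index, values in enumerate(itertools.product(*grid.values())):
        # Pass each hyperparameter to the container as a command-line argument.
        args = [f"--{name}={value}" for name, value in zip(names, values)]
        jobs.append({
            "apiVersion": "batch/v1",
            "kind": "Job",
            "metadata": {"name": f"train-{index}"},
            "spec": {"template": {"spec": {
                "restartPolicy": "Never",
                "containers": [{"name": "train", "image": image, "args": args}],
            }}},
        })
    return jobs


# Hypothetical search grid: 2 x 2 = 4 training jobs.
grid = {"learning-rate": [0.01, 0.1], "max-depth": [4, 8]}
jobs = training_jobs("example.registry/train:1.0", grid)

for job in jobs:
    container = job["spec"]["template"]["spec"]["containers"][0]
    print(job["metadata"]["name"], container["args"])
```

Each Job uses the same image, so the runs differ only in the arguments they receive.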
The real proof that leveraging Kubernetes allows you to iterate faster is when you combine Kubernetes with tools that abstract away low-level details. For instance:
- To build and scale out your data flows, you can leverage Prefect with Dask on Kubernetes.
- To serve ML models or to perform A/B testing, you could leverage Seldon.
- To build visualizations, insights, metrics, and KPIs, you can use GoodData.CN.
10. Installing Almost Any Software With Helm Charts and Further Features
Kubernetes has become so universally popular that plenty of tools have been built or redesigned to work on K8s clusters, and you can typically install them using Helm chart repositories. In the list above, we just scratched the surface. It is only meant to show that you can use this container orchestration platform for all kinds of data workloads: executing ETL jobs, training and serving ML models, and even building and hosting visualizations.
The Common Theme
The commonality between all the points above is the fact that containers (and container orchestration platforms) allow us to:
- Build reproducible code that can run the same way anywhere at any scale.
- Reduce friction in the handoff between different teams.
- Apply modern software engineering practices to data workloads, including GitOps, DevOps, and MLOps.
If you want to follow this paradigm for your data visualizations, GoodData has recently launched its cloud-native platform GoodData.CN that you can install on your Kubernetes cluster. You can start by running a single-image Docker container for local development of dashboards, metrics, and KPIs (in a DRY way!):
docker run --name gooddata -p 3000:3000 -p 5432:5432 \
  -e LICENSE_AND_PRIVACY_POLICY_ACCEPTED=YES gooddata/gooddata-cn-ce:latest
Once you have developed your dashboards, metrics, and insights locally, you can export your declarative definitions and deploy them to a production Kubernetes cluster. The GoodData team provided detailed documentation showing how you can install their software for various scenarios:
- AWS EKS with RDS Postgres database and ElastiCache for Redis
- GCP GKE with Cloud SQL for Postgres and MemoryStore for Redis
- Azure AKS with Azure Database for Postgres and Azure Cache for Redis
- On-premises deployment
- And many helpful tips on how to manage organizations, workspaces, and data sources.
In this article, we looked at the features of Kubernetes (and container orchestration platforms in general) that have made containers so universally popular among data teams. The sheer number of tools integrating with K8s and its presence in every cloud platform make it a comprehensive execution layer for reproducible and scalable data workloads.
Thank you for reading!