Can Data Lakes Accelerate Building ML Data Pipelines?
A common challenge in data engineering is combining traditional data warehousing and BI reporting with experiment-driven machine learning projects. Many data scientists work more with Python and ML frameworks than with SQL, so their data needs often differ from those of data analysts. In this article, we’ll explore why a data lake is often a great help for data science use cases. We’ll finish up with a fun computer-vision demo that extracts text from images stored in an S3 data lake.
1. Data lakes are data agnostic
A purely relational data warehouse limits the variety of data formats your data platform can support. Many data warehouse solutions let you analyze nested JSON-like structures, but that is still only a fraction of the formats a data lake can hold.
While nothing beats relational table structure for analytics, it’s still beneficial to have an additional platform that allows you to do more than that. Data lakes are data agnostic. They support a large variety of data crucial for data science:
- different file types (csv, tsv, json, txt, parquet, orc), data encryption, and compression formats (snappy, gzip, zlib, lzo),
- images, audio, video files enabling deep learning use cases with Computer Vision algorithms,
- model checkpoints created by ML training jobs,
- joins across relational and non-relational data, both server-side (e.g. Presto or Athena) and client-side (e.g. your Python script),
- data from the web: clickstream, shopping cart data, social media (e.g. tweets, Reddit, blog posts, and news articles scraped for Natural Language Processing analytics),
- time-series data: IoT and sensor data, weather data, financial data.
2. Increased development efficiency
Manually ingesting raw data into a data warehouse is a tedious and slow process. You need to define a schema and specify all data types in advance, create a table, and open a JDBC connection in your script or ETL tool before you can finally start loading your data. In contrast, the load step in a data lake is often as simple as a single command. For instance, ingesting a Pandas dataframe into an S3-based data lake with the AWS Glue catalog can be accomplished in a single line of Python code (the syntax in PySpark and Dask is quite similar):
Ingestion into an S3 data lake — Image by author
If you are mainly using Python for analytics and data engineering, you will likely find it much easier to write and read data using a data lake rather than using a data warehouse.
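As a rough illustration of the single-command load described in this section, here is a minimal sketch that assumes the awswrangler library (AWS SDK for pandas). The article does not say which library it used, so treat that choice as an assumption. The bucket, database, and table names are hypothetical, and the Glue database is assumed to exist already.

```python
def build_write_args(bucket, database, table):
    """Collect the keyword arguments passed to awswrangler's wr.s3.to_parquet.

    All names passed in here are hypothetical placeholders.
    """
    return {
        "path": f"s3://{bucket}/{table}/",  # S3 prefix that will hold the files
        "dataset": True,                    # dataset mode enables Glue catalog registration
        "database": database,               # Glue catalog database (assumed to exist)
        "table": table,                     # Glue catalog table to create or update
        "mode": "append",                   # add new files, keep existing ones
    }


def ingest(df, bucket, database, table):
    """Write a pandas DataFrame to S3 as parquet and register it in Glue."""
    # Imported lazily so the sketch can be read and tested without AWS access.
    import awswrangler as wr

    # This is the actual "single line" load step.
    return wr.s3.to_parquet(df=df, **build_write_args(bucket, database, table))
```

With valid AWS credentials, a call such as `ingest(df, "my-lake", "analytics", "orders")` would write the dataframe and make it queryable through the Glue catalog. No schema has to be declared up front because column types are inferred from the dataframe.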
3. Support for a wider range of data processing tools
We discussed a variety of data supported by data lakes, but they also support a variety of processing frameworks. While a data warehouse encourages processing data in memory using primarily SQL and UDFs, data lakes make it easy to retrieve data in a programming language or platform of your choice. Because no proprietary format is enforced, you have more freedom. This way, you can leverage the power of a Spark or Dask cluster and a wide range of extremely useful libraries that are built on top of them simply using Python. See the example below demonstrating reading a parquet file in Dask and Spark:
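A minimal sketch of such reads, assuming Dask (with s3fs) and PySpark (with the Hadoop S3 connector) are installed and AWS credentials are configured. The bucket and prefix are hypothetical. Note that Dask addresses S3 with the `s3://` scheme, while open-source Spark usually expects `s3a://`.

```python
def lake_path(bucket, prefix, scheme="s3"):
    """Build the URI of a parquet dataset in the lake (hypothetical names)."""
    return f"{scheme}://{bucket}/{prefix.strip('/')}/"


def read_with_dask(bucket, prefix):
    """Lazily read a parquet dataset into a Dask dataframe."""
    import dask.dataframe as dd  # imported lazily; needs dask and s3fs

    return dd.read_parquet(lake_path(bucket, prefix, scheme="s3"))


def read_with_spark(bucket, prefix, app_name="data-lake-demo"):
    """Read the same parquet dataset into a Spark dataframe."""
    from pyspark.sql import SparkSession  # imported lazily; needs pyspark

    spark = SparkSession.builder.appName(app_name).getOrCreate()
    return spark.read.parquet(lake_path(bucket, prefix, scheme="s3a"))
```

Both functions point at the same files. Only the engine that processes them changes, which is the freedom an open file format gives you.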
Note that it’s not an either-or decision: a good data engineer can handle both (DWH and data lake), but data lakes make it easier to use data with those distributed processing frameworks.
4. Failures in your data pipelines become easier to fix
Traditional ETL pipelines are difficult and time-consuming to fix: even when only the “Load” step failed, you have to restart the full pipeline from scratch. Data lakes encourage and enable the ELT approach. You can load extracted data in its raw format straight into a data lake and transform it later, either in the same pipeline or in an entirely separate one. Many tools let you sync raw data into your data warehouse or data lake and run the transformations later when needed. Decoupling raw data ingestion from transformation leads to more resilient data workloads.
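To make the decoupling concrete, here is a small sketch in which a temporary local directory stands in for the data lake. The `raw` and `curated` folder names, the record fields, and the cleaning rules are all illustrative assumptions. The point is only that the transform step reads what the load step persisted, so it can be rerun on its own.

```python
import json
import tempfile
from pathlib import Path


def load_raw(records, raw_dir, batch_id):
    """'L' step: persist extracted records untouched, one JSON document per line."""
    raw_dir.mkdir(parents=True, exist_ok=True)
    target = raw_dir / f"{batch_id}.jsonl"
    target.write_text("\n".join(json.dumps(r) for r in records))
    return target


def transform(raw_file, curated_dir):
    """'T' step: runs separately and can be retried without re-extracting."""
    curated_dir.mkdir(parents=True, exist_ok=True)
    rows = [json.loads(line) for line in raw_file.read_text().splitlines()]
    cleaned = [
        {"user": r["user"].strip().lower(), "amount": float(r["amount"])}
        for r in rows
    ]
    target = curated_dir / raw_file.with_suffix(".json").name
    target.write_text(json.dumps(cleaned))
    return cleaned


# A temporary directory plays the role of the lake in this sketch.
lake = Path(tempfile.mkdtemp())
raw_file = load_raw(
    [{"user": " Alice ", "amount": "10.5"}, {"user": "BOB", "amount": "3"}],
    lake / "raw",
    "batch-001",
)
curated = transform(raw_file, lake / "curated")
```

If `transform` fails because of a bug, you fix the bug and call it again on the same raw file. The extraction and load do not have to be repeated.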
Demo: using a data lake to provide ML as a service
To illustrate the benefits of data lakes for data science projects, we’ll do a simple demo of the AWS Rekognition service to extract text from images.
What’s our use case? We upload an image to an S3 bucket that stores raw data. This triggers a Lambda function that extracts text from that image. Finally, we store the extracted text in a DynamoDB table and inspect the results using SQL.
How can you use it in your architecture? Instead of DynamoDB, you might as well use a data warehouse table or another S3 bucket location that could be queried using Athena. Also, instead of using the detect_text method (line 18 in the code snippet below) of AWS Rekognition, you can modify the code to:
- …and many more.
How to implement this? First, we created an S3 bucket and a DynamoDB table. The table is configured with img_filename, i.e. the file name of an uploaded image, as a partition key so that rerunning our function will not cause any duplicates (idempotency).
Create DynamoDB table for our demo — image by the author
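The article creates the table in the AWS console. An equivalent definition with boto3 might look like the sketch below. The `img_filename` partition key comes from the text above, while the table name and the on-demand billing mode are assumptions made for illustration.

```python
def table_definition(table_name):
    """Arguments for DynamoDB's create_table with img_filename as partition key."""
    return {
        "TableName": table_name,  # hypothetical name
        "KeySchema": [{"AttributeName": "img_filename", "KeyType": "HASH"}],
        "AttributeDefinitions": [
            {"AttributeName": "img_filename", "AttributeType": "S"}
        ],
        "BillingMode": "PAY_PER_REQUEST",  # assumption: on-demand capacity
    }


def create_table(table_name, client=None):
    """Create the table; a client can be injected to test without AWS access."""
    if client is None:
        import boto3  # imported lazily; requires AWS credentials

        client = boto3.client("dynamodb")
    return client.create_table(**table_definition(table_name))
```

Because `img_filename` is the only key attribute, writing an item for a file name that already exists replaces the previous item instead of adding a second one. This is what makes reruns of the function idempotent.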
We already have an S3 bucket with a folder called
We also need to create a Lambda function with an IAM role to which we attached the IAM policies for S3, Rekognition, and DynamoDB. The function shown below has a Lambda handler called lambda_function.lambda_handler and runtime Python 3.8. It also has an S3 trigger attached, which calls the function upon any PUT object operation, i.e. on any file upload to the folder
Lambda function to detect text in images uploaded to data lake — image by author
The code of the function:
- creates a client for Rekognition and corresponding S3 and DynamoDB resource objects,
- extracts the S3 bucket name and key (filename) from the event trigger,
- reads the image object and passes it to the Rekognition client,
- finally, it retrieves the detected text and uploads it to our DynamoDB table.
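The steps above can be sketched roughly as follows. This is not the author’s original function, which is not reproduced in the text. The table name and the `detected_text` attribute are hypothetical. As a simplification, the sketch passes Rekognition a reference to the S3 object instead of reading the image bytes first, so no S3 resource object is needed. The optional `rekognition` and `table` parameters exist only so the handler can be exercised without AWS access.

```python
import urllib.parse

TABLE_NAME = "demo_image_text"  # hypothetical table name


def parse_s3_event(event):
    """Extract the bucket name and object key from an S3 PUT event."""
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    # Keys arrive URL-encoded in S3 event notifications (a space becomes '+').
    key = urllib.parse.unquote_plus(record["object"]["key"])
    return bucket, key


def lines_from_response(response):
    """Keep whole detected lines; Rekognition also returns each word separately."""
    return [
        d["DetectedText"]
        for d in response["TextDetections"]
        if d["Type"] == "LINE"
    ]


def lambda_handler(event, context, rekognition=None, table=None):
    if rekognition is None:
        import boto3  # available by default in the Lambda Python runtime

        rekognition = boto3.client("rekognition")
    if table is None:
        import boto3

        table = boto3.resource("dynamodb").Table(TABLE_NAME)

    bucket, key = parse_s3_event(event)
    response = rekognition.detect_text(
        Image={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    item = {
        "img_filename": key,  # partition key, so a rerun overwrites the item
        "detected_text": " ".join(lines_from_response(response)),
    }
    table.put_item(Item=item)
    return item
```

Only strings are stored here. If you also want to keep Rekognition’s confidence scores, remember that the boto3 DynamoDB resource rejects Python floats, so they would need to be converted to `Decimal` first.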
Let’s test it with some images:
After uploading those images to the S3 bucket that we defined in our Lambda trigger, the Lambda should be invoked once for each image upload. Once finished, we can inspect the results in DynamoDB.
PartiQL query editor in DynamoDB (2) — image by the author
It looks like the first two images were recognized pretty well, but the difficult image on the right (“Is this just fantasy”) was not.
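The same inspection can also be done from Python instead of the console’s PartiQL editor. The sketch below assumes boto3 and reuses a hypothetical table name and a hypothetical `detected_text` attribute. It ignores pagination, which is fine for a handful of demo items but not for a large table.

```python
def fetch_detected_text(table_name, client=None):
    """Run a PartiQL SELECT and return {img_filename: detected_text}."""
    if client is None:
        import boto3  # imported lazily; requires AWS credentials

        client = boto3.client("dynamodb")
    response = client.execute_statement(
        Statement=f'SELECT img_filename, detected_text FROM "{table_name}"'
    )
    # The low-level client returns typed attribute values, e.g. {"S": "text"}.
    return {
        item["img_filename"]["S"]: item["detected_text"]["S"]
        for item in response["Items"]
    }
```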
To answer the question from the title: yes, a data lake can definitely speed up the development of data pipelines, especially those related to data science use cases. The ability to deal with a wide range of data formats and easily integrate this data with distributed processing and ML frameworks makes data lakes particularly useful for teams that do data science at scale. Still, if using raw data from a data lake requires too much cleaning, data scientists and data engineers may prefer to use already preprocessed and historized data from a DWH. As always, consider carefully what works best for your use case.