During re:Invent 2017, Amazon’s VP & CTO, Werner Vogels, made a bold statement: he claimed that all the code we will ever write in the future is business logic.
Back then, many of us were skeptical, but looking at the current developments, especially in the data engineering and analytics space, this quote might hold true.
As long as you are not a technology company, the chances are that maintaining internally developed tools not directly tied to a concrete business objective (expressed by business logic) may no longer be necessary.
In fact, it may even be detrimental in the long run. Let’s discuss the underlying reasons and implications of this phenomenon.
Table of contents
- How It Typically Begins
  - Example scenario
- Reasons for In-House Solutions
  - Lack of understanding of the problem domain
  - The engineering ego
  - Failing to leverage cloud resources and containerized workloads
- Homegrown Solutions Don’t Scale
  - Data ingestion
  - Workflow orchestration
  - Data transformation
- When “Make” Still Trumps “Buy”
- How Will You Build Your Modern Data Stack?
How It Typically Begins
Almost any internal system begins when we encounter some business problem, and there seem to be no tools on the market that would allow us to adequately solve it. This means that one of the following is true:
- even when some products exist, they don't do exactly what we need
- existing products don’t integrate well with our specific technology stack.
Imagine a use case where engineers try to store vast amounts of time series data for analytics in a relational OLTP database.
Since this type of database is not designed for this purpose (it would be slow and expensive), they serialize and compress each dataset so that it can be stored as a single BLOB object in a relational database table.
The method of serialization they use is Python-specific. Therefore, to provide a programming language agnostic access layer, they additionally create a REST API that deserializes compressed BLOBs at request time and serializes this data again — this time as JSON.
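To make the round trip concrete, here is a minimal sketch of such a system using only the Python standard library. It is an illustration under assumptions, not the engineers' actual code: an in-memory SQLite database stands in for the relational OLTP database, a plain function stands in for the REST endpoint, and the table name, dataset name, and sample values are hypothetical.

```python
import json
import pickle
import sqlite3
import zlib

# Hypothetical time series: (timestamp, value) pairs.
series = [("2021-01-01T00:00:00", 1.5), ("2021-01-01T00:01:00", 1.7)]

# Assumption: in-memory SQLite stands in for the relational OLTP database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE datasets (name TEXT PRIMARY KEY, payload BLOB)")

# Write path: Python-specific serialization (pickle), then compression,
# stored as a single BLOB per dataset.
blob = zlib.compress(pickle.dumps(series))
conn.execute("INSERT INTO datasets VALUES (?, ?)", ("sensor_a", blob))


def get_dataset_as_json(name: str) -> str:
    """Stand-in for the REST endpoint: runs on every single request."""
    row = conn.execute(
        "SELECT payload FROM datasets WHERE name = ?", (name,)
    ).fetchone()
    if row is None:
        raise KeyError(name)
    # Only Python can read the stored payload...
    records = pickle.loads(zlib.decompress(row[0]))
    # ...so it is serialized a second time for non-Python clients.
    return json.dumps(records)


response = get_dataset_as_json("sensor_a")
print(response)
```

Every read pays for decompression, deserialization, and a second serialization, and the stored bytes are unreadable to anything but Python code with compatible package versions.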
If we step back and analyze the actual problem that the above in-house system is trying to solve, we realize that all we really need is to:
- store data as compressed objects (e.g., snappy compressed parquet files)
- store additional metadata about each dataset
- retrieve it in a simple way by object name, ideally by using SQL or Python.
If the engineers from the above example had spent more time on market research before building the internal system, they would have realized that many off-the-shelf data stores address those exact needs:
- open-source Trino (previously named Presto) provides a fast SQL engine to query data stored, e.g., as compressed objects in an S3 data lake
- Dremio provides an entire lakehouse platform to efficiently query data from object storage and to connect it to BI tools for visualizations
- the open-source awswrangler Python package makes it easy to store compressed parquet files in S3, attach metadata using the AWS Glue catalog, and query this data using the Athena SQL engine
- cloud data-warehouses such as Snowflake, Redshift (Spectrum), and BigQuery allow reading data from compressed files stored in object storage
- …and many more.
All of the above products provide a flexible programming-language agnostic access layer so that we wouldn’t have to build any serialization or decompression APIs.
We also wouldn’t need to worry about scale, or about the chosen serialization method breaking when specific packages are upgraded. And managed solutions reduce the risk of operational outages.
In short, we can choose one of the available options and start implementing business logic that can provide real value, rather than spending time on maintaining in-house developed data storage systems.
Reasons for In-House Solutions
Lack of understanding of the problem domain
The most likely reason for implementing superfluous in-house systems, as shown in the scenario above, is not thinking enough about the problem that needs to be solved and failing to properly evaluate existing tools on the market.
The engineers from the example seem to have decided on using an OLTP database and storing datasets as BLOBs prematurely, before:
- understanding what access patterns they need to support — in this case, choosing a Python-specific serialization method seems to be a suboptimal decision if the goal is to provide a programming-language agnostic interface to this data.
- understanding the intended usage of this data — the use case was described as analytical, rather than transactional; thus, an OLTP database seems to be a bad choice in the first place.
- understanding the type and amount of data that needs to be stored there — we mentioned that, in this scenario, the goal was to store vast amounts of time-series data. Historically, OLTP data stores proved to be highly inefficient as a storage mechanism for this type of data (with some notable exceptions such as TimescaleDB and CrateDB). A simple Google search of database solutions for time-series data would provide more information on how others approached this problem in the past.
Sometimes it also depends on how people express their issues and requirements. If the problem is specified in a way that already implies a specific solution, we may fail to recognize more general patterns and falsely believe that our problem is unique to our business, company, or strategy. Thus, we may erroneously conclude that a homegrown system is the only option.
The engineering ego
Another reason for superfluous in-house tools is the software engineering ego. Sometimes engineers want to prove to others that they can build anything themselves.
But they forget that any self-built system needs to be maintained in the long run. It doesn’t only have to work now, but also in the future, when the world around us changes and the packages it depends on get upgraded, redesigned, or abandoned.
The same ego often prevents senior engineers from asking for feedback. It’s a good practice to ask several people for advice (ideally, also external consultants) before building an in-house solution.
Others can help us find our blind spots and point us in the right direction if we fail to realize the actual problem or when a solution to it already exists on the market.
Failing to leverage cloud resources and containerized workloads
What did Amazon’s CTO mean by the quote from the title? Most of the building blocks we typically need for building applications and data workloads are already out there.
If they are not provided by cloud vendors and open-source platforms, then they are provided by third-party technologies built around those, serving as glue between them and (often legacy) on-prem systems.
Engineers need to define the required business logic and then deploy it using services for storage, compute, networking, monitoring, and security.
This means that issues such as scaling databases and servers, building custom storage or execution systems, and any similar undifferentiated heavy lifting shouldn’t be their concerns.
In particular, containerized workloads and orchestration platforms serve as enablers that make this future of writing nothing but your core business logic a reality.
Homegrown Solutions Don’t Scale
So far, we have suggested that custom in-house solutions are increasingly becoming a source of technical debt rather than something that provides a competitive edge. To test this hypothesis, let’s look at the tools in the data analytics space.
Historically, every company building a data warehouse would develop its own ETL processes to extract data from operational systems such as ERP, CRM, PIM, etc.
Over time, engineers realized that it’s quite redundant if every company builds its own version of the same boilerplate code to copy data from A to B. In the end, it’s not that different to sync data from a source system like Salesforce to Redshift, Snowflake, or any other data warehouse. Some companies (among others Stitch, Fivetran, and Airbyte) realized the potential to make things better.
They started building flexible sets of connectors that let us intelligently sync source systems with a data warehouse of our choice. This automates the ingestion, lets us skip the boilerplate code that moves data from A to B, and leaves us to focus only on writing business logic using the ELT paradigm.
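The ELT split can be sketched in a few lines. This is only an illustration under assumptions: an in-memory SQLite database stands in for the data warehouse, the insert stands in for what a managed connector would load, and the table, view, and column names are hypothetical.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the data warehouse.
warehouse = sqlite3.connect(":memory:")

# "EL": raw records land unchanged. With a managed connector, this step
# is configuration rather than code we write and maintain.
warehouse.execute(
    "CREATE TABLE raw_opportunities (id TEXT, stage TEXT, amount REAL)"
)
warehouse.executemany(
    "INSERT INTO raw_opportunities VALUES (?, ?, ?)",
    [
        ("opp-1", "Closed Won", 1200.0),
        ("opp-2", "Closed Lost", 800.0),
        ("opp-3", "Closed Won", 300.0),
    ],
)

# "T": the part we still write ourselves is business logic, expressed
# as SQL that runs inside the warehouse.
warehouse.execute(
    """
    CREATE VIEW won_revenue AS
    SELECT SUM(amount) AS total_won
    FROM raw_opportunities
    WHERE stage = 'Closed Won'
    """
)

total_won = warehouse.execute("SELECT total_won FROM won_revenue").fetchone()[0]
print(total_won)
```

The transformation is defined on top of the raw data, so changing the business rule means editing one SQL statement rather than the extraction code.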
A similar story can be told about workflow orchestration systems. In the past, almost every data-driven company had its own custom tool to manage dependencies in their data pipelines, deploy them to a specific compute cluster, and schedule them for execution.
After adding more and more features over time, engineers usually came to realize how challenging it is to maintain a homegrown platform and make it flexible enough to support all data-related problems.
These days, thanks to tools such as Prefect, we can focus on building business logic, i.e., solving the actual data science and analytical problems required by our business rather than on maintaining the underlying system.
The platform takes care of tracking data and state dependencies, executing flows on-demand and on schedule across various agents, providing highly granular visibility into the health of your system, and communicating with distributed Dask clusters for fast execution regardless of the size of your data.
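To give a sense of just one of those responsibilities, dependency tracking, here is a minimal sketch using `graphlib` from the Python standard library (Python 3.9+). It is not Prefect's API; the task names and the `run_pipeline` helper are hypothetical, and a real orchestrator adds scheduling, retries, state tracking, and distributed execution on top of this.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dependencies = {
    "extract_orders": set(),
    "extract_customers": set(),
    "join_orders_customers": {"extract_orders", "extract_customers"},
    "build_report": {"join_orders_customers"},
}


def run_pipeline(deps):
    """Visit tasks so that every task comes after all of its dependencies."""
    executed = []
    for task in TopologicalSorter(deps).static_order():
        # A real system would execute the task here and record its state.
        executed.append(task)
    return executed


order = run_pipeline(dependencies)
print(order)
```

Even this small piece grows quickly once failures, partial reruns, and parallelism enter the picture, which is where a homegrown scheduler starts to demand ongoing maintenance.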
So far, we’ve discussed data ingestion and workflow orchestration. But there are many more areas that keep confirming the hypothesis from the title.
Take data transformations for data warehousing. In the past, data engineers kept executing the same tedious tasks of writing DDL to create tables, handcrafting merge queries for incremental loads, building Slowly Changing Dimension scripts, and figuring out in which order to trigger all those interdependent ETL jobs.
Then, dbt completely changed the way we approach this problem. It automated those tedious tasks to the point that building large-scale in-database transformations became accessible to data analysts and domain-knowledge experts.
It created the new role of the analytics engineer, who can finally focus on writing business logic in SQL and deploying it using dbt. Again, all that is left to do is to write business logic rather than maintain homegrown tools and boilerplate code that don’t add any real business value.
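As an illustration of the handcrafted boilerplate mentioned above, here is a minimal sketch of a merge for an incremental load. It is not what dbt generates; it assumes SQLite (3.24 or newer for the `ON CONFLICT` upsert syntax) as a stand-in for the warehouse, and the table, columns, and sample rows are hypothetical.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the data warehouse.
conn = sqlite3.connect(":memory:")

# Handwritten DDL for the target table.
conn.execute(
    "CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, city TEXT)"
)
conn.execute("INSERT INTO dim_customer VALUES (1, 'Berlin')")

# Incremental batch: customer 1 changed, customer 2 is new.
batch = [(1, "Hamburg"), (2, "Munich")]

# Handwritten merge: insert new keys, overwrite existing ones.
conn.executemany(
    """
    INSERT INTO dim_customer (customer_id, city)
    VALUES (?, ?)
    ON CONFLICT (customer_id) DO UPDATE SET city = excluded.city
    """,
    batch,
)

rows = conn.execute(
    "SELECT customer_id, city FROM dim_customer ORDER BY customer_id"
).fetchall()
print(rows)
```

Multiply this by every table, add history tracking for Slowly Changing Dimensions and the ordering of dependent jobs, and the amount of code that carries no business logic becomes clear.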
When “Make” Still Trumps “Buy”
Reading all those arguments, you can start thinking that all internal systems are inherently “bad” and that we should always use off-the-shelf tools and fully managed solutions. However, there are some circumstances when “Make” can provide a significant advantage over “Buy.”
First, if there are no viable options on the market that are capable of solving your problem, then you have no choice but to build it yourself.
But the most important argument for “Make” in the “make-or-buy” dilemma applies when you implement something that relates to the core competency of your business: the product or service that differentiates you from your competitors.
Imagine that you lead a data science startup that automatically generates summaries of long articles (such as this one).
Your core competency is your Natural Language Processing engine that can generate high-quality summaries. Internally, you could use all the tools we mentioned so far, e.g., using Snowflake to store data, Fivetran to sync data from source systems, Prefect to orchestrate your data science flows, and dbt for in-warehouse transformations.
But when it comes to the NLP algorithm generating summaries, you most likely wouldn’t want to outsource it (e.g., by using some off-the-shelf NLP algorithms from AWS or GCP) since this is your core product.
Anything that constitutes an essential part of your business or provides a competitive advantage is where a custom in-house solution pays off.
How Will You Build Your Modern Data Stack?
One of the topics that are often considered too late is how to visualize data, build KPIs and metrics, and embed analytics into the existing front-end applications.
The cloud-native GoodData BI platform can help with that so that you can focus solely on your business logic. You can start experimenting with the Community Edition of the platform by starting a single Docker container:
docker run --name gooddata -p 3000:3000 -p 5432:5432 -e LICENSE_AND_PRIVACY_POLICY_ACCEPTED=YES gooddata/gooddata-cn-ce:latest
Then, you can view the UI in your browser at http://localhost:3000/ and log in using the email firstname.lastname@example.org and the password demo123. After that, you can start building insights, KPIs, and dashboards directly from your browser.
Thanks to the platform, you can create a semantic model and a shared definition of metrics that serve as a single source of truth across the company and all consumers, be it BI tools, ML models, or client-facing applications.
Additional features such as intelligent caching, connectors to almost any data warehouse, and single-sign-on reinforce the fact that you only need to build your business logic, and the platform will support you in everything else.
GoodData also wrote an article discussing the same topic of when to build or buy tools for analytics.
This article discussed the hypothesis that all the code we will ever write in the future will be business logic.
We examined an example situation that led to developing a suboptimal homegrown storage system and investigated potential reasons for such scenarios.
We then talked about why homegrown solutions don’t scale and under what circumstances building custom in-house systems still makes sense.
Thank you for reading!