Table of contents:
Have you ever encountered the same queries being applied over and over again in various dashboards? Or the same KPIs being calculated in nearly every single report? If your answer is yes, you are not alone. It’s common among business users to simply copy-paste the same queries, data definitions, and KPI calculations. But there is a better way.
The dangers of knowledge duplication
Most software engineers are taught from day one the “DRY” principle: Don’t Repeat Yourself. This principle states that:
“Every piece of knowledge must have a single, unambiguous, authoritative representation within a system”. — The Pragmatic Programmer
Even though most data and software engineers know how to write DRY code, many data analysts and business users are unaware of the dangers of copy-pasting their queries and data definitions. As a result, we often end up in Business Intelligence reports which have different representations of KPIs and business logic. To make the argument more concrete, here are some of the consequences of code duplication in analytical workloads:
- If something in your code needs to change, you need to make those changes in X different places.
- You risk conflicting definitions if one “copy” gets modified while another one does not.
- In the end, nobody knows what is the single source of truth because any time you use the same KPI, it’s calculated or defined differently.
- If you add documentation to a duplicated code, query, code, or data definition in one place, this documentation is lacking in another place leading to further inconsistencies.
I remember when my team had a discussion about what is the best place for a data dictionary. One of the suggestions was to use Confluence (or a shared spreadsheet) for that purpose. What’s wrong with this suggestion? It violates the DRY principle as it would inherently cause duplication: one definition in the database and one in the external documentation. Manually created data dictionaries never stay up to date — don’t believe anyone who claims otherwise.
A potentially better solution is to apply all data definitions directly within a database (or data warehouse) by using code comments. Here is how it could look like:
Example applying data definitions directly within a database — image by author
Many SQL clients, these days, make it easy to view data dictionaries implemented as code comments. For instance, when using Datagrip, we can simply use Ctrl + Q (or command + Q on Mac) to see a full table definition, including table and column comments.
Display data dictionary in Datagrip — image by author
Of course, you may have more than a single data warehouse. If you use a data lake, you most likely leverage some centralized metadata repository such as AWS Glue, which provides data dictionary capabilities. Regardless of your chosen implementation, it’s important to think about the right approach for your use case while following the DRY principle.
Side note: If you are interested in how to leverage AWS Glue metadata for data discovery, I built a simple MVP of a data discovery API.
How can we apply software engineering principles to mitigate the problem of duplication?
One possible solution to this problem is to leverage a version-control system as a single source of truth for all queries and KPI definitions used for reporting. You could push the code to a Git repository and ensure that you follow Git flow, code reviews, and all good engineering practices before applying this code in your BI tools. However, this way, we still violate the DRY principle since we are copy-pasting the code from version control to the BI tool. This approach would additionally be prone to human error: anytime we manually change something in the BI tool, but not in the Git repository, we end up in the same problem we were trying to solve using Git. And what if some queries or KPIs never get committed?
In this article, we investigated how to avoid duplication in analytical reports. We looked at why duplicated KPI definitions are dangerous and how we can prevent them. Thank you for reading!