Designing a data model for analytics is not the same as doing it for transactional processing. You optimize for access patterns that are very different from row-level data retrieval used in OLTP systems. In this article, we’ll look at the most common pitfalls when designing schemas and tables for analytics.
- 1. Treating Schema Design as a One-Off Project
- 2. Building Tables and ETL Processes That Are Too Big
- 3. Choosing (And Sticking To) The Wrong Modeling Approach
- 4. Lack of Empathy for the End Users
- 5. Not Tracking Changes Made to Your Data
- 6. Mixing Data With Different Granularity
- 7. Using Poor Naming Conventions
- 8. Too Many Views
- 9. Thinking Short Term
- 10. Insufficient Communication Between Data Consumers
1. Treating Schema Design as a One-Off Project
Building data assets is an ongoing process. As your analytical needs change over time, the schema will have to be adjusted as well. Treating data modeling as a one-off activity is unrealistic. Just think about all those times when you had to change a column name, data type, or even rebuild the entire table because the data structure in one of the source systems has changed.
In the same way, you may build a schema that perfectly fits the use case at that particular time. But when data continues to grow or source systems change (a new ERP, CRM, PIM systems), the design work continues due to necessary changes in the downstream workloads or performance optimizations.
2. Building Tables and ETL Processes That Are Too Big
Imagine that you are building data pipelines for marketing costs attribution. Rather than implementing it as one giant ETL process resulting in a single table, it makes sense to have an initial process that first combines all raw (not yet attributed) costs from different channels into an intermediate table. This table can then be used by some downstream process(es) that will perform the attribution.
Splitting and decoupling data transformations will make the entire process easier to build, debug and maintain in the long run. Additionally, the intermediate tables are extremely valuable to the end users interested in raw data (before any calculations and transformations were made). If something in the final table looks unreasonable, you can dive deeper into data from a previous step to investigate what happened and why.
3. Choosing (And Sticking To) The Wrong Modeling Approach
There are many modeling techniques such as star schema, snowflake schema, or data vault. If someone in the past decided to follow a snowflake schema, having a dimension “product” would not adhere to the third normal form, and building it would violate the chosen modeling approach. Instead, you would have to split it into additional dimensions such as “product category” or “brand” to normalize the data regardless of whether it makes sense for your use case or not.
Similarly, when you adhere to the data vault modeling, you would split every table into hubs, satellites, and links, even when there is no logical reason to do so for a specific use case. While rigor is desirable in engineering, all data modeling techniques are a set of guidelines rather than hard rules that should be blindly followed.
Data design that requires an extensive number of joins, such as data vault, may provide faster data ingestion due to parallel loads, but it often hurts query performance and, for many, is quite unpleasant to work with. This problem can be to some extent mitigated by creating SQL views on top of data spread across hubs, satellites, and links. Such views let us avoid repetitive multi-join queries, thereby saving a lot of time for everyone working with this data.
There is no “best” data modeling approach, everything is a trade-off. But keep in mind that many data warehouse modeling techniques were designed during times when storage was expensive. These days, both storage and compute are cheap compared to engineering time. Therefore, denormalized analytical schemas often provide a significantly better user experience (fewer joins, less time spent on searching for the right data) with no negative impact on performance.
4. Lack of Empathy for the End Users
Any time you build new tables or processes, try to develop empathy for the end users. Who will be querying this table in the end? Will this be a manager who is interested in looking at aggregated values? If so, perhaps it makes sense to build an additional view or table where those aggregations are already applied. Or will this be a business analyst who prefers a more intuitive denormalized design with fewer joins in their day-to-day work with data?
5. Not Tracking Changes Made to Your Data
You’ve probably heard about slowly changing dimensions that help you track how dimension tables are changing over time. For instance, if one customer modified his or her email address, your table will get an additional row for this customer — each will contain
VALID_TO timestamps showing how the respective dimension data changed over time.
In the past, building the SCD2 logic was involved and often quite repetitive — you needed to build your own merge logic based on a hash difference, or maybe you had to rely on some (terrible) drag-and-drop ETL products to do that. But these days, you have tools like dbt at your disposal — dbt snapshots make the implementation of SCD2 a piece of cake. Having those
VALID_TO fields will save you a lot of time when investigating inconsistencies in your data.
6. Mixing Data With Different Granularity
It’s considered a best practice to define a primary key for your table so that already by looking at the primary key, users can tell what is the granularity of your data.
For instance, imagine that your table is tracking daily user sign-ins. This means your granularity is daily and your primary key may be composed of
DATE. Now, the business wants you to start differentiating between the number of sign-ins during peak time and during less popular hours. Your granularity changes to two rows per day per user: one for peak and one for off-peak time.
If you implement this change within the same table, it will result in mixed granularity which in turn would likely become a source of confusion to everyone working with this data. It would, therefore, make sense to create an entirely new table with a changed granularity and primary key that reflects that, ex.
PEAK_TIME (true or false).
7. Using Poor Naming Conventions
Naming is one of the most underestimated activities in engineering. Good or bad naming conventions can make a difference between having a great or just good (or even mediocre) data product. One of my colleagues often repeats this quote, which nicely summarizes it:
“There are only two hard things in Computer Science: cache invalidation and naming things.” — Phil Karlton
Using obscure column names and abbreviations only very few people understand, and not documenting it in a data dictionary will likely cause a lot of trouble down the road.
8. Too Many Views
Views are easy to build but they cannot offer the same performance as normal tables do. If you start building downstream data transformations based on views, things may start becoming slow over time. Here you can find the pros and cons of different types of materializations.
Overall, views are best suited for simple transformations like changing column names. For other use cases, you can choose between full table refresh (table is rebuilt on each run) or incremental models (upsert of new/changed data) depending on how frequently the underlying data changes.
9. Thinking Short Term
“The Only Constant in Life Is Change.” — Heraclitus
It’s unlikely that you will ever be “done” with your data modeling. Things are always changing. At the same time, once you integrate new data assets into your system, it’s hard to ever get them out. They will be used in downstream processes, introducing dependencies and making later changes more challenging. Anytime you build new data models, it’s good to keep in mind:
- How data will be updated — can new data be easily reloaded or backfilled?
- Do we need to add a merge logic and columns such as
UPDATED_ATto prevent duplicates when performing incremental loads?
- Which tables and processes rely on that data? Can we even keep track of it?
- Is your data warehouse or data lake(house) equipped to grow together with your data? How does storage and compute scale?
- Do you have a data dictionary and documentation mechanism that can be kept up-to-date over time?
10. Insufficient Communication Between Data Consumers
Sometimes we end up with suboptimal data design purely due to communication issues. There are always some conflicts of interest:
- Data (platform) engineers operating data pipelines want to build robust engineering processes and minimize operational risks
- Data scientists want to get as much raw data as they can and decide on their own how they can build features and infer signals from this data for ML use cases
- Software engineers building front-end data applications want seamless integration with their technology stack
- Data analysts want immediate, flexible, easy to use, and performant access to data (and not being blocked by engineers)
- Managers strive for standardization, security, minimizing risks, and keeping costs under control
There are many tools that help to mitigate friction in the handoff between different user groups. One of them is GoodData. The cloud-native GoodData platform allows:
- Data engineers to build semantic data models
- Separation of concerns between data and front-end application design by leveraging the API in between
- Building small components (insights, metrics, KPIs, visualizations) that can be reused across dashboards and analytical applications
- A central source of metrics — GoodData API exposes the semantic model as a service and helps to ensure that all data consumers use a consistent set of metrics and insights across the organization
docker run --name gooddata -p 3000:3000 -p 5432:5432 \ -e LICENSE_AND_PRIVACY_POLICY_ACCEPTED=YES gooddata/gooddata-cn-ce:latest
For a deep dive into the cloud-native BI platform and all its benefits, you can have a look at my article about decoupling your BI stack. And if you want to try it directly in a Kubernetes cluster, you can use the following Helm chart.
Thank you for reading!