Data is the new oil. We rely on it not only to make decisions but also to operate as a business in general. Data loss can lead to significant financial consequences and loss of reputation.
In this article, you will find ten actionable methods to protect your most valuable resources.
- 1. Backup, Backup, Backup
- RPO vs. RTO
- 2. Test Your Recovery Scenario
- 3. Document Processes That Rely on That Data(base)
- 4. Apply the Least-Privilege Security Principle
- 5. Name Your Production Database As Such
- 6. Don’t Trust Any Manually Configured Resources
- 7. Don’t Allow a Single Person To Manage the Entire Infrastructure
- 8. Educate Your Employees About Any Resource Before Giving Them Access to It
- 9. Use Serverless and Monitor Your Resources
- 10. Separate Your Storage From Compute if Possible
- References and Resources
1. Backup, Backup, Backup
This goes without saying, and we all know it. We need a backup strategy and an automated way of taking regular snapshots of our databases.
However, with today’s large amounts of data, implementing a reliable backup plan that can quickly recover your databases becomes challenging. Therefore, it is crucial to define a Recovery Time Objective and a Recovery Point Objective and to implement a solution that satisfies your Business Continuity Plan.
RPO vs. RTO
Recovery Point Objective (RPO) describes how much data loss, measured in time, we can tolerate. An RPO of 10 would entail that your business can afford to lose no more than ten hours of data according to your Business Continuity Plan. You could think of RPO in terms of the “staleness” of your backup. With RPO=10, we allow our data to be ten hours stale after restoration (i.e. not containing changes made within the last ten hours).
In contrast, Recovery Time Objective (RTO) describes the time within which the database must be up again. An RTO of 3 would mean that regardless of the backup freshness, the database must be up and running within three hours after the downtime occurred.
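To make the difference concrete, here is a minimal sketch that checks a single incident against both objectives. The function names and the timestamps are made up for illustration; only the RPO of 10 and the RTO of 3 come from the examples above.

```python
from datetime import datetime, timedelta, timezone


def meets_rpo(last_backup_at, failure_at, rpo_hours):
    """Changes made after the last backup are lost, so that gap must fit in the RPO."""
    return failure_at - last_backup_at <= timedelta(hours=rpo_hours)


def meets_rto(failure_at, restored_at, rto_hours):
    """The time between the failure and the database being up again must fit in the RTO."""
    return restored_at - failure_at <= timedelta(hours=rto_hours)


# Hypothetical incident: the last backup is 8 hours old, the restore takes 4 hours.
failure = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
last_backup = failure - timedelta(hours=8)
restored = failure + timedelta(hours=4)

rpo_ok = meets_rpo(last_backup, failure, rpo_hours=10)  # 8 h of lost data fits in 10 h
rto_ok = meets_rto(failure, restored, rto_hours=3)      # 4 h of downtime exceeds 3 h
```

In this made-up incident the backup was fresh enough, but the restore was too slow. The two objectives have to be checked separately.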
2. Test Your Recovery Scenario
The following is probably the worst-case scenario: You developed a backup strategy and are regularly taking snapshots, but when the failure happens, you notice that those backups aren’t working as intended or that you can’t find them. It’s critical to test the recovery scenario.
Netflix pioneered “chaos engineering” — a discipline of testing failure scenarios on production systems to be sure that your infrastructure is truly resilient.
Don’t count on backups and recovery plans that have never been tested. Otherwise, you risk ending up in the “cross your fingers and hope for the best” strategy.
Note that if you rely on backups taken by some fully managed service where you don’t have actual access to the snapshot, you risk that restoring your database takes longer than your RTO and RPO strategy allows. Timezone differences and a large volume of data that has to be transferred over a long distance can make the recovery take longer than you expect.
Therefore, it might help to take regular snapshots yourself rather than solely relying on backups from a specific provider.
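A recovery test can be as simple as taking a snapshot, opening it as if it were the restored database, and checking that the data is really there. The sketch below uses SQLite from the Python standard library as a stand-in for a real database. The table name, the file name, and both helper functions are assumptions made for this example.

```python
import os
import sqlite3
import tempfile


def take_snapshot(source, snapshot_path):
    """Copy a live SQLite database into a snapshot file."""
    target = sqlite3.connect(snapshot_path)
    try:
        source.backup(target)
    finally:
        target.close()


def verify_snapshot(snapshot_path, table, expected_rows):
    """Open the snapshot like a restored database and run basic sanity checks."""
    restored = sqlite3.connect(snapshot_path)
    try:
        intact = restored.execute("PRAGMA integrity_check").fetchone()[0] == "ok"
        # The table name is trusted input here; it comes from our own test setup.
        count = restored.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    finally:
        restored.close()
    return intact and count == expected_rows


# Hypothetical "production" database with three rows.
live = sqlite3.connect(":memory:")
live.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
live.executemany("INSERT INTO orders (total) VALUES (?)", [(9.5,), (20.0,), (3.25,)])
live.commit()

snapshot_file = os.path.join(tempfile.mkdtemp(), "orders_snapshot.db")
take_snapshot(live, snapshot_file)
restore_ok = verify_snapshot(snapshot_file, "orders", expected_rows=3)
```

For a real system, the same idea applies at a larger scale: restore into a separate environment on a schedule, compare row counts or checksums, and measure how long the restore takes.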
3. Document Processes That Rely on That Data(base)
If your database goes down, which processes are affected? It’s valuable to have this information documented so that you can recover quickly by restarting the corresponding processes and limit the impact of the downtime.
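Such documentation does not have to be elaborate. As a rough sketch, even a small machine-readable map from data stores to the processes that depend on them answers the question quickly. All store and process names below are hypothetical.

```python
# Hypothetical map: data store -> processes that depend on it.
DEPENDENCIES = {
    "orders_db": ["checkout_api", "nightly_invoice_job"],
    "analytics_db": ["sales_dashboard", "nightly_invoice_job"],
}


def affected_processes(failed_stores):
    """Return every process that depends on at least one failed data store."""
    affected = set()
    for store in failed_stores:
        affected.update(DEPENDENCIES.get(store, []))
    return sorted(affected)


to_restart = affected_processes(["orders_db"])
```

Keeping this map in version control next to the infrastructure code makes it more likely that it stays up to date.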
4. Apply the Least-Privilege Security Principle
We all want to trust people, but allowing too much access to developers without educating them on how to use those production resources may backfire. Only a few trusted people (likely DevOps or senior engineers) should have direct access to modify or terminate production resources. When building any IT solutions, it’s best to work on a development database and have read-only permissions to production resources.
On top of that, it’s advisable to check those permissions regularly. Perhaps somebody who has left the company still has access to production resources.
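A regular permission review can be partly automated. The following sketch assumes you can export the current grants and the list of active employees from your own systems. The user names, role labels, and function names are invented for illustration.

```python
def stale_grants(grants, active_employees):
    """Return grants held by people who are no longer active employees."""
    return {user: role for user, role in grants.items() if user not in active_employees}


def write_access_outside(grants, trusted):
    """Return users with more than read-only access who are not on the trusted list."""
    return sorted(
        user for user, role in grants.items()
        if role != "read_only" and user not in trusted
    )


# Hypothetical export of production database grants.
grants = {"anna": "admin", "ben": "read_only", "carl": "read_write", "dana": "read_write"}
active = {"anna", "ben", "carl"}
trusted = {"anna"}

leftover = stale_grants(grants, active)             # dana has left but still has access
too_broad = write_access_outside(grants, trusted)   # carl and dana can modify production
```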
5. Name Your Production Database As Such
What if your production database is not named as a prod resource and somebody confuses it for something else? It’s best practice to ensure that production resources are named properly so that people know by looking at a resource that it must be treated with great care.
It may seem obvious to you, but without proper communication and educating users, somebody could confuse a poorly named production database for some temporary resource (e.g. a playground cluster) that can be shut down.
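A naming convention also lets tooling protect production resources. The sketch below assumes a hypothetical convention where every resource name starts with its environment, such as `prod-` or `dev-`. Neither the convention nor the function names come from any specific tool.

```python
def is_production(resource_name):
    """Assumed convention: names start with the environment, e.g. 'prod-orders-db'."""
    return resource_name.split("-")[0] == "prod"


def safe_to_terminate(resource_name, confirmed_by_owner=False):
    """Non-production resources can go; production needs an explicit confirmation."""
    return not is_production(resource_name) or confirmed_by_owner


playground = safe_to_terminate("dev-playground-cluster")
prod_unconfirmed = safe_to_terminate("prod-orders-db")
prod_confirmed = safe_to_terminate("prod-orders-db", confirmed_by_owner=True)
```

A guard like this only works if the convention is applied consistently, which is one more reason to name production resources as such.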
6. Don’t Trust Any Manually Configured Resources
If your resources are configured manually, it becomes more difficult to reproduce the configuration in a failure scenario. Modern DevOps and GitOps culture introduced a highly useful paradigm of Infrastructure as Code, which can significantly help to build an exact copy of a specific resource for development or recovery scenarios.
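The core idea can be illustrated without any particular tool: the desired configuration lives in code, and anything that differs in the running resource counts as drift. The sketch below is a simplified illustration, not how any real Infrastructure as Code tool is implemented. The setting names and values are hypothetical.

```python
# Desired state, kept in version control (hypothetical settings).
DECLARED = {
    "instance_class": "db.m5.large",
    "multi_az": True,
    "backup_retention_days": 7,
}


def drift(declared, actual):
    """Return {setting: (declared_value, actual_value)} for every mismatch."""
    keys = declared.keys() | actual.keys()
    return {
        key: (declared.get(key), actual.get(key))
        for key in sorted(keys)
        if declared.get(key) != actual.get(key)
    }


# Somebody disabled backups by hand in the console.
actual_state = {"instance_class": "db.m5.large", "multi_az": True, "backup_retention_days": 0}
changes = drift(DECLARED, actual_state)
```

Here the manual change shows up as a single mismatch on `backup_retention_days`, which is exactly the kind of silent change that is hard to spot in manually configured resources.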
7. Don’t Allow a Single Person To Manage the Entire Infrastructure
It can be challenging to recover any specific system if the only person who knows how to configure and use it is not available when the failure happens. Knowledge silos are particularly dangerous in such use cases. It’s beneficial to have at least one additional person who can take over this responsibility.
Often, even a timezone difference between employees can help: somebody is more likely to be available when a failure happens, so production downtime gets fixed faster and you are more likely to meet your RTO.
8. Educate Your Employees About Any Resource Before Giving Them Access to It
This point is related to preventing knowledge silos but more directed towards educating developers. Anytime we give somebody more than just read-only access to production resources, we should educate them on using this resource properly and what impact potential downtime of a single table may have.
As always, effective communication is our best friend.
9. Use Serverless and Monitor Your Resources
Using data stores such as AWS RDS is great, but it has a downside. In the end, we are still responsible for ensuring that our database remains healthy. When using serverless data stores such as DynamoDB, we can rely on AWS DevOps experts to monitor and keep the underlying servers healthy.
If you leverage an observability platform, such as Dashbird, you can quickly identify misconfigured resources or failures within your serverless infrastructure. Dashbird has recently released a feature called Well-Architected-Lens that continuously scans your resources for anomalies. For instance, it will alert you about any DynamoDB table that doesn’t have a continuous backup and Point-In-Time-Recovery enabled. This is one of the easiest ways of ensuring that your data store remains healthy and resilient because:
- AWS takes care of serverless compute and storage behind the service, ensuring high availability and fault tolerance.
- Dashbird will alert you if your architecture deviates from standards defined within the Well-Architected framework, such as when your resources are not properly configured or lack backup.
In the image below, you can see that Dashbird automatically detected that backup is not enabled:
Well-Architected Lens ensures that your DynamoDB tables have a continuous backup enabled for a quick point-in-time recovery. Image courtesy of Dashbird.
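A similar check can also be scripted against the AWS API. The sketch below is not how Dashbird implements its check. It assumes a boto3 DynamoDB client is passed in and uses its `describe_continuous_backups` call. The fake client and the table names exist only so the example runs without an AWS account.

```python
def tables_without_pitr(dynamodb_client, table_names):
    """Return the tables whose Point-In-Time-Recovery is not enabled."""
    missing = []
    for name in table_names:
        response = dynamodb_client.describe_continuous_backups(TableName=name)
        description = response["ContinuousBackupsDescription"]
        pitr = description.get("PointInTimeRecoveryDescription", {})
        if pitr.get("PointInTimeRecoveryStatus") != "ENABLED":
            missing.append(name)
    return missing


class FakeDynamoDBClient:
    """Stand-in for boto3.client('dynamodb'); returns canned responses."""

    def __init__(self, pitr_status_by_table):
        self._status = pitr_status_by_table

    def describe_continuous_backups(self, TableName):
        return {
            "ContinuousBackupsDescription": {
                "ContinuousBackupsStatus": "ENABLED",
                "PointInTimeRecoveryDescription": {
                    "PointInTimeRecoveryStatus": self._status[TableName]
                },
            }
        }


client = FakeDynamoDBClient({"orders": "ENABLED", "sessions": "DISABLED"})
unprotected = tables_without_pitr(client, ["orders", "sessions"])
```

With real credentials, you would pass `boto3.client("dynamodb")` instead of the fake client.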
In addition to recovery information, you can discover many more insights about your serverless resources, as demonstrated in the image below. For instance, you will be informed any time your real-time data streams have write-throttles. In the end, you are presented with a score of how well your architecture adheres to the Well-Architected framework.
Well-Architected Lens. Image courtesy of Dashbird.
And if the only reason holding you back from using DynamoDB is that you still want to use SQL, you may want to have a look at PartiQL. This query language, developed by AWS, allows you to query your DynamoDB tables (and many other data stores) directly from the AWS management console, as demonstrated in the image below:
Using DynamoDB with PartiQL. Image by the author.
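PartiQL statements can also be sent from code. The sketch below assumes a boto3 DynamoDB client and its `execute_statement` call. The `orders` table, the `customer_id` attribute, and the fake client are hypothetical, and pagination is ignored for brevity.

```python
def orders_for_customer(dynamodb_client, customer_id):
    """Run a parameterized PartiQL SELECT and return the raw DynamoDB items."""
    response = dynamodb_client.execute_statement(
        Statement='SELECT * FROM "orders" WHERE customer_id = ?',
        Parameters=[{"S": customer_id}],  # values use DynamoDB's attribute-value format
    )
    return response["Items"]


class FakePartiQLClient:
    """Stand-in for boto3.client('dynamodb'); filters a small in-memory table."""

    def __init__(self, items):
        self._items = items

    def execute_statement(self, Statement, Parameters):
        wanted = Parameters[0]
        return {"Items": [item for item in self._items if item["customer_id"] == wanted]}


fake_table = [
    {"customer_id": {"S": "c-1"}, "total": {"N": "20"}},
    {"customer_id": {"S": "c-2"}, "total": {"N": "5"}},
]
items = orders_for_customer(FakePartiQLClient(fake_table), "c-1")
```

Passing the value as a parameter rather than formatting it into the statement keeps the query safe from injection.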
10. Separate Your Storage From Compute if Possible
This point is related to analytical databases. It’s a good practice in analytical data stores for your compute and storage to be independent of each other. Imagine that your data is durably stored in object storage such as S3, and you can query it with a serverless engine such as AWS Athena or Presto. The separation of how your data is stored and how it’s queried makes it easier to ensure the resilience of your analytical infrastructure.
You can establish automatic replication between S3 buckets, enable versioning (allowing you to restore deleted resources), or even prevent anyone from overwriting or deleting anything from S3 by leveraging object locks. Then, even if your Athena table definition is deleted, your data persists and can easily be queried upon a definition of a schema in AWS Glue.
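Enabling versioning is a small change that can be scripted. The sketch below assumes a boto3 S3 client and its `get_bucket_versioning` and `put_bucket_versioning` calls. The bucket name and the fake client are made up so that the example runs without AWS access.

```python
def ensure_versioning(s3_client, bucket):
    """Enable versioning if it is not already on. Returns True if a change was made."""
    status = s3_client.get_bucket_versioning(Bucket=bucket).get("Status")
    if status == "Enabled":
        return False
    s3_client.put_bucket_versioning(
        Bucket=bucket,
        VersioningConfiguration={"Status": "Enabled"},
    )
    return True


class FakeS3Client:
    """Stand-in for boto3.client('s3'); remembers the versioning status per bucket."""

    def __init__(self):
        self._status = {}

    def get_bucket_versioning(self, Bucket):
        # Like the real API, the 'Status' key is absent if versioning was never set.
        status = self._status.get(Bucket)
        return {"Status": status} if status else {}

    def put_bucket_versioning(self, Bucket, VersioningConfiguration):
        self._status[Bucket] = VersioningConfiguration["Status"]


s3 = FakeS3Client()
first_run = ensure_versioning(s3, "my-data-lake-raw")   # versioning gets enabled
second_run = ensure_versioning(s3, "my-data-lake-raw")  # nothing left to do
```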
I’m a big fan of storing raw extracted data for ETL purposes into object storage before loading it to any database. This allows you to use it as a staging area or data lake and allows for more resiliency in analytical pipelines. Relational database connections are fragile. Imagine that you are loading large amounts of data from some source system directly into a data warehouse. Then, shortly before the ETL job is finished, it fails because the connection was forcibly closed by a remote host due to some network issues. Having to redo the extraction step can introduce an additional burden on the source system or may even be impossible due to API request limits.
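The pattern can be sketched as two separate steps: extract once into a staging file, then load from that file with retries. A local directory stands in for object storage here, and the function names, the record layout, and the flaky loader are all assumptions for illustration.

```python
import json
import os
import tempfile


def stage_raw(records, staging_dir, batch_id):
    """Write the raw extract to the staging area as JSON lines, one record per line."""
    path = os.path.join(staging_dir, f"{batch_id}.jsonl")
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")
    return path


def load_from_stage(path, load_fn, max_attempts=3):
    """Load staged records, retrying on connection errors. Returns the attempts used."""
    with open(path, encoding="utf-8") as handle:
        records = [json.loads(line) for line in handle]
    for attempt in range(1, max_attempts + 1):
        try:
            load_fn(records)
            return attempt
        except ConnectionError:
            if attempt == max_attempts:
                raise


extract_calls = 0

def extract_from_source():
    """Hypothetical expensive or rate-limited extraction step."""
    global extract_calls
    extract_calls += 1
    return [{"id": 1, "total": 9.5}, {"id": 2, "total": 20.0}]


warehouse = []
failures_left = 1

def flaky_load(records):
    """Hypothetical warehouse loader whose connection drops on the first attempt."""
    global failures_left
    if failures_left > 0:
        failures_left -= 1
        raise ConnectionError("connection was forcibly closed by the remote host")
    warehouse.extend(records)


staged_path = stage_raw(extract_from_source(), tempfile.mkdtemp(), "batch-001")
attempts = load_from_stage(staged_path, flaky_load)
```

The load needed two attempts, but the source system was queried only once, because the retry reads from the staged file rather than repeating the extraction.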
In this article, we examined ten ways to protect your mission-critical data store. These days, data is such a critical resource that downtime can cause significant financial and reputation losses. Make sure to approach it strategically and test your recovery scenario.
References and Resources
- Clive Humby
- Disaster recovery — Wikipedia
- Chaos engineering — Wikipedia
- PartiQL — DynamoDB documentation
- Well-Architected Insights — Dashbird