One of the most common concerns when moving to the cloud is cost. Given that the cloud allows you to turn IT costs from CAPEX (long-term investments, e.g., in hardware equipment and software licenses) into OPEX (day-to-day operating expenses), it’s crucial to choose the right service and size it properly. In this article, we’ll look at the common pitfalls and discuss how you can avoid them to truly benefit from the cloud’s elasticity.
- #1 Following the lift and shift approach
- #2 Not tagging your resources
- #3 Failing to monitor resource usage over time
- #4 Always doing everything yourself from scratch
- #5 Using only tools you are familiar with
- #6 Not making use of serverless and container orchestration platforms
- #7 Not taking TCO into account
- #8 Thinking short term
- #9 Overprovisioning everything “just-in-case”
- #10 Choosing the wrong datastore
- How to mitigate the right-sizing problem?
#1 Following the lift and shift approach
The lift and shift approach means moving an exact copy of your workload to the cloud with as few changes as possible. Even though this pattern may be useful if you want to move to the cloud quickly, it may lead to suboptimal usage of your resources. AWS acknowledged that this is a difficult problem by creating services to make this migration easier (CloudEndure Migration and AWS Server Migration Service). Still, for the best possible resource utilization, consider rearchitecting your solution for the cloud.
With lift and shift, you are potentially leaving a lot of money on the table in the long term. You would also miss out on many benefits your cloud provider can offer. For instance, when choosing fully managed Amazon Aurora over a traditional Postgres instance, you can gain (among others) up to 3x the throughput according to AWS, storage autoscaling, and low-latency read replicas. This may be why Aurora is currently one of the most popular and fastest-growing services on AWS.
#2 Not tagging your resources
It’s challenging to improve something if you don’t have enough data to make an informed decision. If you have no way of tracking how your cloud resources perform and how much cost they incur, it’s hard to optimize their utilization.
It’s considered a best practice to tag your resources based on projects or organizational units to allocate costs to the corresponding services correctly.
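As a minimal sketch of what tag-based cost allocation can look like, the snippet below sums cost line items per value of a `project` tag and collects anything untagged into its own bucket. The record layout, the tag key, and the `allocate_costs` helper are hypothetical and chosen for illustration; a real billing export from your cloud provider will look different.

```python
from collections import defaultdict

UNTAGGED = "untagged"  # hypothetical bucket name for resources missing the tag


def allocate_costs(line_items, tag_key="project"):
    """Sum cost per value of `tag_key`; items without the tag go to UNTAGGED."""
    totals = defaultdict(float)
    for item in line_items:
        owner = item.get("tags", {}).get(tag_key, UNTAGGED)
        totals[owner] += item["cost_usd"]
    return dict(totals)


# Illustrative data only, not real billing figures.
line_items = [
    {"resource_id": "i-001", "cost_usd": 120.0, "tags": {"project": "checkout"}},
    {"resource_id": "i-002", "cost_usd": 80.0, "tags": {"project": "search"}},
    {"resource_id": "db-01", "cost_usd": 200.0, "tags": {"project": "checkout"}},
    {"resource_id": "i-003", "cost_usd": 45.0, "tags": {}},
]

costs_by_project = allocate_costs(line_items)
print(costs_by_project)
```

A large `untagged` bucket is a useful signal in itself: it shows how much spend cannot yet be attributed to any project or organizational unit.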
#3 Failing to monitor resource usage over time
Managing cloud architecture is not a one-off process. It’s a continuous practice of monitoring and evaluating what you use, how you use it, and why. Perhaps your original assumptions about the growth of a specific application turned out to be not entirely right, and making a change could significantly lower costs.
For instance, consider an overprovisioned Kubernetes cluster with many more nodes than needed. Perhaps moving to a serverless version (EKS on Fargate) makes more sense in such a scenario. For more about serverless Kubernetes clusters, see my previous article.
Leaving “zombie” resources running unmonitored is not as uncommon as you may think. Some projects may get abandoned, and the corresponding resources remain active due to incomplete handover processes.
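One simple way to surface possible “zombie” resources is to flag anything whose average utilization stays near zero over the observation window. The sketch below assumes you have already exported CPU samples per resource; the `find_zombie_candidates` helper, the 2% threshold, and the sample data are hypothetical, and a flagged resource is only a candidate for review, not proof that it is abandoned.

```python
from statistics import mean


def find_zombie_candidates(cpu_samples_by_resource, max_avg_cpu=2.0):
    """Return sorted IDs whose average CPU utilization (%) is at or below the threshold."""
    return sorted(
        resource_id
        for resource_id, samples in cpu_samples_by_resource.items()
        if samples and mean(samples) <= max_avg_cpu
    )


# Illustrative samples (CPU %), not real measurements.
cpu_samples = {
    "api-server": [35.0, 42.5, 38.0],
    "old-batch-worker": [0.5, 0.4, 0.6],
    "abandoned-demo": [1.0, 1.2, 0.9],
}

zombies = find_zombie_candidates(cpu_samples)
print(zombies)
```

Resources with no samples at all are skipped here on purpose; in practice you may want to report them separately, since missing metrics can also indicate an unmonitored resource.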
#4 Always doing everything yourself from scratch
As software engineers, we may sometimes be tempted to build our own custom solutions and services for everything. A potentially better approach is first to do proper research of what’s already available. Examples:
- Perhaps you don’t need this self-hosted database on EC2, and you can instead use fully managed RDS to help you scale and operate the instance much more easily?
- Or maybe you don’t need this self-managed RabbitMQ instance, and you can instead adopt the battle-tested serverless message queue SQS?
In general, if there is a serverless or fully-managed solution, it makes sense to at least consider it before investing too much time and effort into your own solution that you would have to maintain entirely by yourself.
#5 Using only tools you are familiar with
When reading Reddit or blog posts, I see many engineers who are reluctant to use serverless or container orchestration platforms simply because they know more about EC2 and manually administered servers. They assume that these are just new technologies that will “come and go” and that there is therefore no need to change their ways. This dismisses any merit of container orchestration platforms, serverless, and other cloud services without examining them, which seems to be a close-minded approach.
It’s better to challenge our assumptions and judge new technologies with clear facts, costs, and performance benchmarks rather than skepticism towards what’s new.
#6 Not making use of serverless and container orchestration platforms
If you created an EC2 instance for every service and tool you manage, you would likely end up in a maintenance nightmare. But if you instead deploy each of your services as a container to a Kubernetes (EKS) or ECS cluster, you can pack many more services onto a single server instance thanks to dynamic port mapping and the more compact resource utilization of containers (e.g., shared layers).
A container orchestration platform helps you balance the load between the instances and keep your workloads healthy. It takes the capacity guesswork, to some extent, out of the picture. You can specify how many container instances should be running at all times, and the control plane will ensure that it happens, just as you defined it.
If you can easily load balance your workload across many containers or serverless resources, then you no longer have to guess which EC2 or RDS instance size will be appropriate for your use case.
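The “desired count” behavior described above can be pictured as a reconciliation step: compare what should be running with what is actually running and healthy, then act on the difference. The snippet below is a toy model of that idea only; the `reconcile` function and the task records are hypothetical and do not reflect how ECS or Kubernetes implement their control planes.

```python
def reconcile(desired_count, running):
    """Return (to_start, to_stop): number of tasks to start and IDs of tasks to stop."""
    healthy = [task for task in running if task["healthy"]]
    unhealthy_ids = [task["id"] for task in running if not task["healthy"]]
    # Start enough tasks to reach the desired number of healthy ones.
    to_start = max(desired_count - len(healthy), 0)
    # Stop unhealthy tasks, plus any healthy tasks beyond the desired count.
    surplus_ids = [task["id"] for task in healthy[desired_count:]]
    return to_start, unhealthy_ids + surplus_ids


# Illustrative cluster state.
running_tasks = [
    {"id": "a", "healthy": True},
    {"id": "b", "healthy": False},
    {"id": "c", "healthy": True},
]

scale_up = reconcile(3, running_tasks)
scale_down = reconcile(1, running_tasks)
print(scale_up, scale_down)
```

A real control plane runs this kind of comparison continuously, which is why you declare the target state once instead of guessing capacity per server.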
#7 Not taking TCO into account
If you only take into account hardware or service costs, you may end up thinking that many resources can be more cost-effective on-prem. But the total cost of ownership (TCO) also includes additional maintenance, upgrades, and the employees managing those servers, and once you add those up, that’s an entirely different story.
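To make the comparison concrete, here is a minimal sketch that adds up the cost categories named above for two options. Every figure is a made-up placeholder for illustration and the `total_cost_of_ownership` helper is hypothetical; plug in your own yearly numbers, since the outcome depends entirely on them.

```python
def total_cost_of_ownership(hardware_or_service, maintenance, upgrades, staff):
    """Sum the yearly cost categories into a single TCO figure."""
    return hardware_or_service + maintenance + upgrades + staff


# Placeholder yearly figures in USD, invented purely for illustration.
on_prem = total_cost_of_ownership(
    hardware_or_service=50_000, maintenance=10_000, upgrades=8_000, staff=60_000
)
cloud = total_cost_of_ownership(
    hardware_or_service=70_000, maintenance=0, upgrades=0, staff=20_000
)

print(f"on-prem: {on_prem}, cloud: {cloud}")
```

In this invented example the cloud option has the higher service bill yet the lower total, which is exactly the effect that a hardware-only comparison hides.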
#8 Thinking short term
If you scale your resources purely based on your current situation, you may fail to consider how your needs may change in the future. What if your business and data grow much faster? What if it turns out to be the opposite? Is your application still easy to change and adapt to unknown future scenarios? And finally, will you be able to find and retain enough employees who can operate it in the long run?
#9 Overprovisioning everything “just-in-case”
At the other extreme, if you want to be cautious, you may be tempted to overprovision everything to make sure you are ready for usage spikes. It’s a good strategy, provided that you can justify the spikes based on past usage patterns. But it can be a bad strategy if you are doing it out of gut feeling.
The cloud allows elasticity in the sense that you can add nodes to your clusters, load balance the workload across more containers, or increase the number of vCPUs or memory size when you see the need for it. If configured and monitored properly, there is no need to overprovision anything. I’m not saying that right-sizing is easy (far from it), but with good processes and automation in place, it’s doable. It can significantly reduce costs, especially when operating numerous resources at scale.
Overprovisioned prod resources — image courtesy of Dashbird
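One way to base provisioning on past usage rather than gut feeling is to size for a high percentile of observed demand plus some headroom. The sketch below uses a nearest-rank percentile; the `percentile` and `recommend_vcpus` helpers, the 90th percentile, the 20% headroom, and the sample values are all assumptions for illustration, not a recommendation for any specific workload.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list; `pct` is an integer from 1 to 100."""
    ordered = sorted(samples)
    rank = max(math.ceil(pct * len(ordered) / 100), 1)
    return ordered[rank - 1]


def recommend_vcpus(vcpu_usage_samples, pct=90, headroom=0.2):
    """Recommend a vCPU count: the chosen percentile of usage plus headroom, rounded up."""
    typical_peak = percentile(vcpu_usage_samples, pct)
    return math.ceil(typical_peak * (1 + headroom))


# Illustrative vCPU usage samples with a single spike.
usage = [1.0, 1.2, 1.1, 1.4, 1.3, 1.2, 1.5, 1.1, 1.0, 3.9]

recommended = recommend_vcpus(usage)
print(recommended)
```

With these samples the single spike to 3.9 vCPUs is deliberately not covered by the recommendation. Whether such a spike should be absorbed by autoscaling or by a larger baseline is a judgment call that depends on how costly a slowdown is for the workload.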
#10 Choosing the wrong datastore
Sometimes the bottlenecks are not the compute resources but rather a poorly chosen data store. It’s good to consider:
- whether you need a rich query language (SQL) or perhaps your application can do just fine with a simple key-value store (e.g., DynamoDB),
- whether you need a database in the first place; perhaps a simple S3 data dump is enough.
It’s naturally use-case dependent, but the database is often the main bottleneck of an otherwise scalable architecture.
How to mitigate the right-sizing problem?
One possible way to optimize your cloud resource utilization is to leverage automation. For instance, with Dashbird, you can keep track of your under- and overprovisioned resources and get notified about them. When using the well-architected lens dashboard, we can find out, for example, that our ECS cluster running on EC2 instances (a non-serverless data plane) had a CPU utilization of over 90% within the last hour.
Well-architected lens dashboard — image courtesy of Dashbird
Then, we can drill down into specific time intervals and inspect further why this spike occurred.
Underprovisioned ECS cluster reaching the CPU capacity limits — image courtesy of Dashbird
At the same time, another containerized service may be overprovisioned, potentially leaving money on the table. Having this information allows you to optimize your resource configuration based on the actual usage patterns.
Overprovisioned ECS service — image courtesy of Dashbird
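The kind of check described above can be approximated with a simple threshold rule over average utilization. The snippet below is a generic sketch and not Dashbird’s implementation or API; the `classify` helper, the 20% and 80% thresholds, and the service names are hypothetical.

```python
def classify(avg_cpu_percent, low=20.0, high=80.0):
    """Label a service by average CPU utilization (%) against two thresholds."""
    if avg_cpu_percent >= high:
        return "underprovisioned"
    if avg_cpu_percent <= low:
        return "overprovisioned"
    return "ok"


# Illustrative hourly averages (CPU %), not real measurements.
hourly_avg_cpu = {
    "orders-service": 92.0,
    "reports-service": 6.5,
    "auth-service": 48.0,
}

findings = {name: classify(cpu) for name, cpu in hourly_avg_cpu.items()}
alerts = {name: label for name, label in findings.items() if label != "ok"}
print(alerts)
```

Only the services outside the thresholds end up in `alerts`, so a notification channel would stay quiet for workloads that are already sized reasonably.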
This article investigated common pitfalls when sizing your cloud resources and discussed how to avoid them to truly benefit from the cloud’s elasticity. By making use of container orchestration platforms, serverless and fully-managed solutions, and by continuously monitoring your usage patterns over time, you can optimize your architecture for performance and costs.