When choosing the right storage system for a cloud native workload, it’s important to consider not only how well the storage system will perform at the time of deployment, but also how the storage system can be operationalized. As you’re evaluating storage systems and how they will perform on Day 2, in production, here are some of the issues you need to consider.
Monitoring for both errors and performance issues is one of the most important Day 2 characteristics for any part of an application, including the storage system it relies on. Monitoring is essential both for debugging errors as they arise and for tracking performance over time.
Legacy systems often lack the instrumentation to monitor storage-related performance. A developer might say, "My database is slow today." But that isn't an actionable metric, and without real-time and historical monitoring there's no way to uncover what "slow" means, let alone get to the bottom of what might be causing the performance problems or what changed in the environment.
With the right storage system and monitoring capabilities, you can track performance changes over time and immediately put numbers to a drop in performance; instead of just saying a database is "slow," you'll be able to quantify the problem. With the right instrumentation, you can also get to the root cause: Is it a networking problem? Is there a new project or new load on the server? Has a component failed somewhere?
Being able to monitor for errors and performance degradation is critical, as is the instrumentation to get visibility into how your storage system is working and how it connects to the rest of the application. With this in place, when something does go wrong you can quickly find and fix the problem.
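As a concrete illustration of putting numbers to "slow," here is a minimal Prometheus alerting-rule sketch. It assumes node_exporter metrics are being scraped; the group name, alert name and 100ms threshold are illustrative assumptions, not recommendations for any particular storage system.

```yaml
# Illustrative Prometheus alerting rule: alert when average disk read
# latency per I/O stays above 100ms for 10 minutes.
# Assumes node_exporter metrics; names and threshold are examples only.
groups:
  - name: storage-latency          # hypothetical group name
    rules:
      - alert: HighDiskReadLatency # hypothetical alert name
        # Average seconds spent per read over the last 5 minutes
        expr: |
          rate(node_disk_read_time_seconds_total[5m])
            / rate(node_disk_reads_completed_total[5m]) > 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Disk read latency above 100ms on {{ $labels.instance }}"
```

A rule like this turns "the database feels slow" into a measurable, historical signal you can correlate with deployments and infrastructure changes.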
Upgrades are an inevitable part of any application lifecycle. You will need to roll out new versions of Kubernetes, new data versions and new application versions. Kubernetes is very effective at orchestrating and managing upgrades, but the storage system needs to be flexible enough to keep data available while Kubernetes performs rolling upgrades. Some storage systems provide a Kubernetes Operator, software that automates deployment and lifecycle management for the system.
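One built-in Kubernetes mechanism for controlled upgrades of stateful workloads is the partitioned rolling update on a StatefulSet. The fragment below is a sketch, not a complete manifest (the selector, serviceName and pod template are omitted), and the name `my-datastore` is a hypothetical example.

```yaml
# Illustrative StatefulSet fragment: a partitioned RollingUpdate upgrades
# one replica at a time and lets you pause to verify data health before
# continuing. Only the fields relevant to the update strategy are shown.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: my-datastore        # hypothetical workload name
spec:
  replicas: 3
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      # Only pods with ordinal >= 2 are updated; lower the partition
      # step by step as you confirm each replica is healthy.
      partition: 2
```

Lowering `partition` to 0 completes the rollout; leaving it raised lets you canary the new version on a single replica first.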
All applications need to assume that something will go wrong at some point—there will be failures, nodes and disks will disappear, networks will fail. Your storage system has to be built with high availability in mind to provide redundancy and protect your data in the event of infrastructure failures. This ensures that if there are any component failures (such as a disk, network or server node), you can automatically fail over your application and data to other nodes in a cluster.
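One way Kubernetes itself can help with this redundancy is pod anti-affinity, which keeps replicas of a storage workload off the same node so a single node failure cannot take out every copy of the data. This is a sketch of a pod-spec fragment; the `app: my-datastore` label is a hypothetical example.

```yaml
# Illustrative pod spec fragment: required anti-affinity forces replicas
# carrying the same app label onto separate nodes, preserving redundancy
# when a node fails.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: my-datastore          # hypothetical label
        topologyKey: kubernetes.io/hostname
```

Swapping the `topologyKey` to `topology.kubernetes.io/zone` spreads replicas across availability zones instead of just nodes.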
Using a storage system that understands best practices for where to place data is also important to proactively address potential Day 2 problems.
There are two parts to the data placement equation. You want data to be placed as close to the application using that data as possible, to maximize performance. However, you also need to ensure that data is replicated across failure domains, which in a cloud-based system means replicating data across availability zones, so if one zone goes down you'll still have access to a replica in another availability zone.
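Kubernetes exposes both halves of this equation through the StorageClass API. The sketch below assumes a CSI driver; the class name, provisioner string and zone names are placeholders you would replace with your own.

```yaml
# Illustrative StorageClass: WaitForFirstConsumer delays volume creation
# until the pod is scheduled, so the volume is provisioned in the same
# topology as the workload; allowedTopologies restricts provisioning to
# the named availability zones.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: topology-aware               # hypothetical class name
provisioner: example.com/csi-driver  # placeholder for your CSI driver
volumeBindingMode: WaitForFirstConsumer
allowedTopologies:
  - matchLabelExpressions:
      - key: topology.kubernetes.io/zone
        values:
          - zone-a                   # placeholder zone names
          - zone-b
```

Cross-zone replication itself is handled by the storage system (often via vendor-specific StorageClass parameters), but topology-aware binding like this keeps volumes close to the pods that consume them.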
Consider storage early
Storage problems are one of the primary causes of application downtime and application performance issues. The ability to proactively monitor and log your storage infrastructure is the only way to make storage-related problems easy to find and fix. Effectively managing the entire application lifecycle requires considering how storage plays into the operations story during the development process.
Author: Alex Chircop
Experienced CTO with a focus on infrastructure engineering, architecture and strategy definition. Expert in designing innovative solutions based on a broad and deep understanding of a wide range of technology areas. Previously Global Head of Storage Platform Engineering at Goldman Sachs and Head of Infrastructure Platform Engineering at Nomura International.