Why Containers Miss a Major Mark: Solving Persistent Data in Docker

6 common problems with persistent container storage

We know how to develop faster. We know how to release faster. Now we need to move applications to the new world of microservices and containers where meaningful and persistent data exists. How are we going to achieve this?

Developers need a simple, rapid development environment for deploying to the cloud and getting their work done, so we’ve turned to containers to help solve that problem – only to open up some new ones, like data persistence (and others we won’t cover in this blog, including authorization and support for a wide range of network technologies).

But how can we manage applications and persistent container storage together as a single unit, just as we can with ephemeral storage in containers?

Containers have changed the world of development

Containers have changed how we deploy applications. Despite its efficiency gains, virtualization isn’t keeping pace with continuous development and the demand to scale applications more efficiently. Applications have become loosely coupled, stateless, and designed to scale and tolerate failure – it is no longer economical to remediate the state of a failed instance, so developers are turning to containers and microservices to address these needs.

Containers have also changed how we consume infrastructure. With the low cost of bringing a new instance up or down, resources can be allocated more aggressively than before. Stateless applications have introduced a new problem: stateful data now needs to be externalized and persisted somewhere. At the same time, with the adoption of commodity compute, networking and storage, hyper-convergence is gaining prevalence, helping to address these problems, drive down costs and increase efficiency – reinforcing the desire to move away from dedicated storage infrastructure.

Containers offer a lot to the development environment

Docker and containers bring a number of important capabilities to the development environment.

Containers introduce simplicity:

  • Docker allows you to run any application with its own configuration, without the overhead of a full-blown OS
  • Users can take configuration, put it into code and redeploy – you no longer need to manage application state across numerous VMs – it’s a different paradigm
  • You are no longer tied to the environment you deploy to either – develop on Mac, Windows or Linux (for example, create a Dockerfile on your Mac, then build and deploy it to Linux – a minimal sketch follows this list)
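To make that workflow concrete, here is a minimal sketch – the image name ‘myapp’ and the alpine base image are illustrative assumptions, not from a real project. The same Dockerfile builds and runs identically whichever OS you develop on:

# Dockerfile – a hypothetical two-line image
FROM alpine
CMD ["echo", "hello from a container"]

$ docker build -t myapp .
$ docker run --rm myapp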

Containers increase productivity:

  • Containers are fast and interactive, offering timely feedback throughout the development cycle
  • Docker makes development easy to automate
  • Dockerized services have a smaller footprint – you can run more with less compared to VMs

Containers promote rapid deployment:

  • Bringing up new hardware used to take months. VMs reduced this to hours; Docker reduces it to seconds – and redeploying or upgrading can take fractions of a second
  • Containers are repeatable, so applications can be created and destroyed without concern
  • Resources can be allocated more aggressively due to the low cost of bringing containers up and down

Containers are missing a major mark

Containers however miss a major mark. There is no data persistence.

And it is holding us back. So we turn to our existing storage investments, only to realize they are overly complex and lack the integration needed to work with a modern container ecosystem.

DevOps can’t rely on storage admins anymore – applications need storage provisioned at runtime, not next week, not even tomorrow. And while you can mount NFS file storage into containers, it’s not managed, integrated container storage – it’s an NFS mount that needs to be managed externally by admins, as the sketch below shows. Where does this leave us? We are being held back by traditional, often monolithic, storage infrastructure solutions.
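To make the NFS case concrete, Docker’s built-in ‘local’ volume driver can mount an NFS export – but the server address and export path below are illustrative assumptions, and the export itself still has to be provisioned and maintained outside of Docker:

$ docker volume create --driver local \
--opt type=nfs --opt o=addr=192.168.1.10,rw \
--opt device=:/exports/data nfsvol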

How persistent storage works with Docker as a container platform

Now that developers are empowered to provision their own storage, how can Docker help with managing persistent container storage? To begin with, there are two key technologies behind Docker’s image and application container management:

  1. Stackable image layers
  2. Copy-on-write ‘container layer’

Docker images are made from stackable layers, which means developers only ever deploy changes. The stackable image layers plus a writable top ‘container layer’ are what constitute a container. Removing a container simply deletes that top read-write layer, leaving the underlying image layers untouched.

Consider this in the context of continuously delivered containers. As applications are versioned, new layers are applied on top of, or removed from, the previous version – only the changes need to be built and deployed. When you launch a container, a copy-on-write read-write ‘container layer’ is applied to the top of the read-only image stack, making it writable.
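You can see these stacked layers for yourself. Assuming the ‘myapp:v2’ image used in the next example exists locally, ‘docker history’ lists each read-only layer along with the instruction that created it:

$ docker history myapp:v2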

In this example, I’ve created a container ‘mycontainer’ using the application image ‘myapp’ where the various layers are downloaded from the Docker registry.

$ docker run -d --name mycontainer myapp:v2
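While the container is running, the copy-on-write layer can be inspected. ‘docker diff’ lists the files added (A), changed (C) or deleted (D) in the container layer relative to the underlying image:

$ docker diff mycontainer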

Once the application has been stopped and the container removed, however…

$ docker stop mycontainer

$ docker rm mycontainer

your data is gone.

Docker’s solution

There is no getting away from it: data needs to be externalized and persisted outside the container, not maintained in a ‘container layer’. To address this problem, Docker offers directory mounts, named volumes and volume plugins.

1. Local directory mounts

A local directory mount is a directory, file or NFS mount in the Docker host’s filesystem that is mounted into a container. Directory mounts can be shared with other containers, and data persists after the container has been stopped and removed.

To illustrate a local directory mount, we can write some data to ‘~/tmp’ and terminate:

$ docker run --rm -v ~/tmp:/data alpine ash -c \
"echo hello world > /data/myfile"

Looking at the file after the container has been removed reveals the data is still there:

$ sudo cat ~/tmp/myfile
hello world

Because a local directory mount can be mounted by multiple containers, think carefully. Docker does not coordinate concurrent access, so multiple containers writing to a single shared volume can cause data corruption unless the applications implement their own locking. One option here is to use NFS, which supports file locking; the other is to mount the directory read-only by appending the ‘:ro’ flag to the end of the container path.
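For example, a second container can safely consume the same directory read-only – reads succeed, while any attempt to write to ‘/data’ from inside this container will fail:

$ docker run --rm -v ~/tmp:/data:ro alpine cat /data/myfile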

2. Local named volumes

Docker local named volumes place data into Docker’s data storage area ‘/var/lib/docker/volumes’. Local named volumes can be shared between containers but, as highlighted above for local directory mounts, careful consideration is required here to avoid data corruption.

In this example, we will create a persistent volume under ‘/var/lib/docker/volumes’:

$ docker volume create --name mydata

Next, we can write some data to the volume and terminate our container:

$ docker run --rm -v mydata:/data:rw alpine ash -c \
"echo hello world > /data/myfile"

The container has gone, but the data, myfile, persists in our named volume, which we can verify by reading the file:

$ sudo cat /var/lib/docker/volumes/mydata/_data/myfile
hello world

Once a container has been terminated, its data persists in the volume that was mounted into it. With named volumes, however, data is vulnerable to housekeeping: the ‘docker volume prune’ command, for example, will remove any volume not in use by a container, along with all the data it contains.
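You can check which volumes exist before cleaning up – ‘docker volume prune’ prompts for confirmation before deleting anything, but once confirmed the data is gone:

$ docker volume ls
$ docker volume prune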

As before, there is no file locking, but you can enforce read-only access using the ‘:ro’ flag.

3. Volume plugins

Volume plugins take container storage to another level. They extend the storage capabilities of Docker and other orchestration platforms and provide the capability to manage applications and persistent container storage within the same ecosystem.

Volume plugins deliver native storage services to the container platform as opposed to the underlying server infrastructure.

The illustration above shows how the orchestration engine or container runtime (Docker/Kubernetes) makes a request to the plugin to provision a volume. The control plane provisions storage from the available infrastructure; this is then served up through the data plane as a virtual volume and presented into the container.

Creating a plugin-backed volume is similar to creating a named volume, but this time we specify the driver to use – in this case, the StorageOS volume driver.

$ docker volume create --driver storageos --opt size=1 myvol

If the volume has already been created, we can omit ‘--volume-driver=storageos’ at run time, as Docker already knows which driver the volume belongs to. Alternatively, as shown below, we can create the volume and start the container at the same time, without needing to pre-create the volume first.

$ docker run --rm -v myvol:/data \
--volume-driver=storageos \
alpine ash -c "echo hello world > /data/myfile"
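Because ‘myvol’ now exists and belongs to the storageos driver, a later container can mount it without the driver flag and read the data back – a quick check that persistence works end to end:

$ docker run --rm -v myvol:/data alpine cat /data/myfile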

And that’s it – very simple. In addition, you can pass optional parameters such as volume size and filesystem type, as well as apply other properties such as labels with key/value tags, which can be used as the basis of a data rules engine. For example: if the label ‘environment’ has the value ‘production’, place the data into a performance pool and replicate the volume.
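As a sketch of that idea – note that ‘--label’ is Docker’s generic volume-label flag, and whether a given plugin acts on labels this way (or expects its own ‘--opt’ keys instead) is an assumption that varies by plugin:

$ docker volume create --driver storageos \
--opt size=5 --label environment=production prodvol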

These features will vary between plugins and, depending on their native integration capabilities, will give you varying degrees of control over data policy – all from the Docker command line or an orchestration API.

StorageOS Volume Plugin

The StorageOS volume plugin illustrates a real-world example of this architecture and of how volume plugins can deliver persistent storage for Docker. As illustrated above, the StorageOS plugin comprises a data plane and a control plane.

The StorageOS data plane consumes storage from existing infrastructure, whether cloud, virtualized or bare-metal server storage. This is presented as feature-rich virtual volumes that can be consumed by containers via the appropriate presentation driver – file, block or object.

Management is handled via the control plane’s storage API, covering config, health, scheduling, policy, provisioning and recovery. Integration is handled directly through the API via plugins, the CLI, REST or a GUI, and configuration is managed through a key/value store.

The control plane and data plane communicate with one another through a message bus. Everything runs as a container – a storage platform that can be brought up instantly, served out of a container.

[Figure: Architecture overview diagram]

Why a new storage paradigm for Docker is needed

The way we deploy applications and consume infrastructure has changed, and so must storage. Not all plugins are equal, however – unless they have been built from the ground up (like StorageOS), some may simply be a thin wrapper around IaaS that extends legacy platform capabilities, or may not integrate properly with other platforms such as Kubernetes or Mesos.

What we do know is that container storage needs to be delivered natively to the platform, not to the infrastructure; that it should address a variety of SLA, performance, access and cost constraints; and that it needs to deliver repeatable processes into CI/CD workflows. Here, features like clones and snapshots become very powerful tools in your storage toolbox.

Containers are still evolving

Docker has been around for four years, and the journey of extending the ecosystem with plugins to fulfil the demands of the enterprise has really only just begun. Containers started out as simple ephemeral applications; in very little time they have become highly evolved and complex, with a growing need for data persistence.

As we gain more confidence in the platform, more demands will be placed on features that can be delegated to partners with deep expertise – and plugins provide the means to accomplish this.


Author: Chris Brandon
