Docker Insides

Objectives

You’ll learn how Docker implements containers, and how Linux kernel features are used to do so.

Prerequisites

You need basic knowledge of Docker and container technology. You also need a basic understanding of Linux, mainly running commands in a terminal, plus some knowledge of services and processes.

The Docker architecture

Understanding how containers are implemented requires, first of all, understanding what a container is. I love Mark Church’s description of containers¹:

Docker containers wrap a piece of software in a complete filesystem that contains everything needed to run: code, runtime, system tools, system libraries – anything that can be installed on a server. This guarantees that the software will always run the same, regardless of its environment.

In my own words, a container provides all the pieces of software needed to execute another piece of software: the operating system, the file system, libraries, dependencies, the network stack, etc. Not the hardware itself, but all software dependencies in the widest sense.

Docker is a solution for running containers. But “running containers” is a very short description of the actual objective: running software in a repeatable, isolated and secure fashion, while keeping it simple and available.

To do so, the Docker architecture splits those requirements across three components, each one focusing on one need: the Docker Client, the Docker Host and the Docker Registry.

The first thing you should know is that, when you run `docker run` or `docker ps`, you are not actually running the Docker Engine. You are just reaching the so-called Docker CLI (the Docker client, or command-line interface). The Docker CLI is a thin client that mainly parses the command line parameters. It then uses a REST API² to send commands to, and get results from, the Docker Host. Both the CLI and the REST API are designed with simplicity in mind, so they can be easily used by any user (whether a human or another program).
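You can even talk to that REST API directly, bypassing the CLI. A minimal sketch, assuming a local daemon listening on the default Unix socket and a reasonably recent curl (the API version in the path may differ on your installation):

$ # Equivalent to `docker ps`: list running containers via the REST API
$ curl --unix-socket /var/run/docker.sock http://localhost/v1.41/containers/json
$ # Equivalent to `docker version`
$ curl --unix-socket /var/run/docker.sock http://localhost/version

Any program able to speak HTTP can drive the Docker Host the same way the CLI does.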

The Docker Host is the component that actually creates and runs the containers. Responding to the REST API, it is responsible for providing everything containers need to run, as well as ensuring containers have the required (and only the allowed) resources available. The Docker Host is a complex component, including parts like the Docker Daemon (the process actually doing the container work), an internal image registry, etc.

Containers need images as the software to be run. Images are stored in a Docker Registry. While the Docker Host contains an internal registry (where it keeps the images for containers in its scope), external registries may also be available. Those registries, public or private, allow Docker Hosts to easily locate and retrieve images. Probably the most popular public Docker Registry is Docker Hub (https://hub.docker.com/), hosting both public and private repositories.

Communication between Docker Hosts and Docker Registries is also standardized, via an HTTP API. If you are interested in the details of this API, its description and specs can be found on the Docker docs website (https://docs.docker.com/registry/spec/api/).
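As a hedged sketch of that Registry v2 protocol against Docker Hub (the repository name is an arbitrary choice, and jq is assumed to be installed for extracting the token):

$ # Get a pull token for the library/alpine repository on Docker Hub
$ TOKEN=$(curl -s "https://auth.docker.io/token?service=registry.docker.io&scope=repository:library/alpine:pull" | jq -r .token)
$ # List the tags available for that image (output truncated)
$ curl -s -H "Authorization: Bearer $TOKEN" https://registry-1.docker.io/v2/library/alpine/tags/list
{"name":"library/alpine","tags":["2.6","2.7","3.1", ...]}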

[Figure: Docker Architecture]

This three-component architecture makes Docker easy to use, enabling containers to be run safely almost everywhere.

Docker Containers are Linux

It is no secret that the most complex part of this architecture is, indeed, securely running containers. And to do so, Docker relies deeply on Linux kernel features.

Before getting deep into those features, you need to understand what “running a container” means. You should realize that, as with any other piece of software, “running” means creating a process that the operating system (Linux, here) manages and schedules on a processor. So, by running a container, we actually mean creating a process that behaves as the container.
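You can verify this yourself. A quick sketch (the container name and command are arbitrary, and the ps output will vary on your host):

$ # Start a container that just sleeps
$ docker run -d --name just-a-process alpine sleep 1000
$ # From the host, the "container" shows up as a regular Linux process
$ ps -ef | grep 'sleep 1000'
root      2451  2430  0 10:12 ?        00:00:00 sleep 1000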

The difference between a process and a container is, mainly, isolation. Processes can communicate with each other, share resources, etc. But containers are isolated. They run independently from each other, with their own resources and their own limits. And for the sake of this isolation, Docker uses different strategies, depending on the resources to isolate.

The first and simplest strategy is the creation of namespaces. Namespaces are like prefixes, or groups, of resource identifiers. All resources attached to a container are related to the same namespace, and any container can only reach resources related to its namespace. Different kinds of resources have different namespaces (see the sketch after this list):

  • All processes in a container are grouped into the pid namespace.
  • All network interfaces are grouped into the net namespace.
  • IPC resources (pipes, process shared memory, etc.) share the same ipc namespace.
  • Filesystem mount points in a container refer to the same mnt namespace.
  • The uts namespace allows containers to isolate UTS resources, like the hostname or domain.
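The pid namespace is the easiest one to see in action. A minimal sketch (the image is an arbitrary choice):

$ # Inside the container, the pid namespace starts from scratch:
$ # the only visible process is `ps` itself, running as PID 1
$ docker run --rm alpine ps
PID   USER     TIME  COMMAND
    1 root      0:00 ps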

In fact, those namespaces are not exclusive to Docker or containers, but a core feature of Linux systems. If you are really interested in knowing more about Linux namespaces, take a look at Wikipedia’s page (https://en.wikipedia.org/wiki/Linux_namespaces), and review its references and external links.
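You can even play with namespaces without Docker. A sketch using util-linux’s `unshare`, which requires root (the hostnames shown are placeholders):

$ # Create a new uts namespace and change the hostname inside it
$ sudo unshare --uts sh -c 'hostname isolated; hostname'
isolated
$ # The host's hostname is untouched: the change stayed inside the namespace
$ hostname
myhost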

Another strategy (optionally) used by Docker to avoid resources being used without authorization is SELinux. SELinux (Security-Enhanced Linux) is a module, introduced in Linux kernel 2.6.0, that allows the definition of mandatory resource access policies. In short, SELinux allows creating “labels” or “policies”, and restricting the actions those policies allow (a whitelist). Those labels, once assigned to resources, allow the kernel to restrict how the resources are used.
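On an SELinux-enabled host (Fedora, RHEL, CentOS...), you can see those labels at work. A sketch, assuming a recent container-selinux policy (the process details and category pair are illustrative):

$ # Container processes run confined under the container_t type
$ ps -eZ | grep container_t
system_u:system_r:container_t:s0:c126,c556 2443 ? 00:00:12 java
$ # The label (here, its MLS level) can be tuned per container
$ docker run --security-opt label=level:s0:c100,c200 --rm -it alpine sh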

While the SELinux and namespaces strategies address the isolation problem, there is another resource-related issue containers must face: usage limits. Docker should be able to keep containers from using too much of a limited resource (like memory, CPU time, etc.), which would harm the correct behavior of other containers.

The kernel feature Docker uses for resource limiting is cgroups. cgroups (control groups) allow the creation of restriction groups. Those groups define the amount of CPU or memory that can be used, and can be assigned to processes. Docker assigns each container’s processes to a specially created cgroup, ensuring the container will not exceed its limits. If a container process causes a breach of those limits, it will be OOM killed³.
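A quick sketch of how Docker exposes those limits (the cgroup path below assumes cgroup v2 with the systemd driver, and may differ on your system; `<container-id>` stands for the real id):

$ # Limit a container to 256 MB of memory and half a CPU
$ docker run -d --memory=256m --cpus=0.5 --name limited alpine sleep 1000
$ # The limit is materialized as a cgroup file the kernel enforces
$ cat /sys/fs/cgroup/system.slice/docker-<container-id>.scope/memory.max
268435456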

By using those strategies, Docker keeps “containerized” processes from reaching unauthorized resources, or from consuming too much of them. There are many other strategies (e.g. kernel capabilities restriction, or seccomp) that are nicely introduced in other documents, like this one (https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux_atomic_host/7/html/container_security_guide/docker_selinux_security_policy).

Fully understanding Docker and Linux containers is not an easy task. This was just a quick introduction, and I encourage interested readers to follow links and go deeper into strategies, technologies and features that make containers an exceptional technology.

Keep learning!

Bonus:

For those really interested in investigating Docker processes and their resource utilization, I recommend using the systemd tools. `systemd-cgls` can show you the cgroups associated with specific container processes, while `systemd-cgtop` will display resource utilization:

$ systemd-cgls
...
├─docker
│ └─50993dd8689e561467739465a5491f0cf42a671783d4ad24870278118ee88149
│   ├─2399 /usr/bin/python3 -u /sbin/my_init -- /start_java.sh -c -jar /data/app.jar server /data/app-config.yml
│   ├─2426 /usr/bin/runsvdir -P /etc/service
│   ├─2431 /bin/bash /start_java.sh -c -jar /data/app.jar server /data/app-config.yml
│   └─2443 java -XX:NativeMemoryTracking=summary ...
...
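And a sketch of what `systemd-cgtop` shows (the column layout comes from systemd; the figures and group names here are illustrative):

$ systemd-cgtop
Control Group                 Tasks   %CPU   Memory  Input/s Output/s
/                               231   12.3     3.2G        -        -
/docker                           4    8.1   512.3M        -        -
/docker/50993dd8689e...           4    8.1   512.3M        -        -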

Footnotes and references:

¹ Docker Reference Architecture: Designing Scalable, Portable Docker Container Networks: https://success.docker.com/article/networking

² Details about the REST API can be found in Docker Engine API docs: https://docs.docker.com/engine/api/latest/.

³ Out Of Memory. To be precise, this will only happen when the container’s processes ask for more memory than they are granted. Other resources, like CPU or network, will not kill the container; overuse is simply prevented.