Containers are chroot with a Marketing Budget

Every explanation is a simplification.

There are many ways to understand how containers work, but most useful explanations are actually simplifications.

Many people have settled on explaining containers by calling them ‘light-weight VMs’ and they are light-weight because they ‘share the kernel with the host’. This is useful, but it simplifies a lot away. What is a ‘light-weight VM’? What does sharing the kernel mean?

Others will tell you containers are about namespaces and specific kernel visibility tweaks. This is also a helpful explanation because namespaces partition visibility, so that running containers can’t see other things on the same machine.

But for me, containers are just chrooted processes. Sure, they are more than that: Containers have a nice developer experience, an open-source foundation, and a whole ecosystem of cloud-native companies pushing them forward. But, let me show you why I think chroot¹ is the key.

So, let’s build a container runtime using only the chroot system call. Doing so, we can learn a little about chroot, a little about container runtimes, and it will also be fun!

The Goal

By the end, I’ll have something that looks like docker run, called chrun, where you can pull docker images:

> chrun pull redis

Pulling image redis
export image 16b87aa63c8f3a1e14a50feb94cba39eaa5d19bec64d90ff76c3ded058ad09c8

And then run them:

> chrun run redis "/usr/local/bin/redis-server"

Running /usr/local/bin/redis-server in /tmp/_assets_redis_tar_gz4234401501
4360:C 31 Oct 2022 16:07:57.253 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
4360:C 31 Oct 2022 16:07:57.253 # Redis version=7.0.5, bits=64, 
4360:C 31 Oct 2022 16:07:57.253 # Warning: no config file specified, using the 
4360:M 31 Oct 2022 16:07:57.256 * Increased maximum number of open files to 
4360:M 31 Oct 2022 16:07:57.256 * monotonic clock: POSIX clock_gettime
                _._                                                  
           _.-``__ ''-._                                             
      _.-``    `.  `_.  ''-._           Redis 7.0.5 (00000000/0) 64 bit
  .-`` .-` `.  ` `\/    _.,_ ''-._                                  
 (    '      ,       .-`  | `,    )     Running in standalone mode
 |`-._`-...-` __...-.``-._|'` _.-'|     Port: 6379
 |    `-._   `._    /     _.-'    |     PID: 4360
  `-._    `-._  `-./  _.-'    _.-'                                   
 |`-._`-._    `-.__.-'    _.-'_.-'|                                  
 |    `-._`-._        _.-'_.-'    |           https://redis.io       
  `-._    `-._`-.__.-'_.-'    _.-'                                   
 |`-._`-._    `-.__.-'    _.-'_.-'|                                  
 |    `-._`-._        _.-'_.-'    |                                  
  `-._    `-._`-.__.-'_.-'    _.-'                                   
      `-._    `-.__.-'    _.-'                                       
          `-._        _.-'                                           
              `-.__.-'                                               

4360:M 31 Oct 2022 16:07:57.260 # Server initialized
4360:M 31 Oct 2022 16:07:57.265 * Ready to accept connections

And it will do this using chroot. But first, some background.

History of `chroot`

To the ~~Observatory~~ Unix Source

chroot probably doesn’t get a lot of mention now that containers exist, but it’s a Unix system call. This means it’s a way to request something from the operating system kernel. It is also a utility program, so it’s easy to call from the shell.

> chroot /bin /bash

All it does is change the root directory (/) to a new value. That’s all chrooting does. It just changes what / means. That sounds simple, but file paths are at the heart of how Unix works, so you can do a lot with this call.

Chroot is a much older system call than the ones modern container runtimes use, which means, in theory, the chrun shown above could run on a much older linux kernel. But how far back into Linux history could we go?

Actually, we can go back to way before the creation of Linux. chroot first appeared in 1979 for Unix v7.

(I know this because Diomidis Spinellis put together this excellent github repository that recreates the history of Unix from the earliest available source to today’s modern variations². The history recreated in this repo stretches back to 1970 and includes the original PDP-7 assembly code of the first iteration of Unix.)

It came along with chdir ( the system call equivalent of cd ) and looked like this:

chdir()
{
    chdirec(&u.u_cdir);
}

chroot()
{
    if (suser())
        chdirec(&u.u_rdir);
}

chroot and chdir - Source: Research-V7 tag of Unix History Repo

struct user
{
    ...
    struct inode *u_cdir;        /* pointer to inode of current directory */
    struct inode *u_rdir;        /* root directory of current process */
    ...
}

&u is a reference to the current users struct, which holds u_rdir and u_cdir.

So, a user on a Unix system has a current directory and root directory and chroot is a way to change the root value (u_rdir) in the same way cd changes the current working directory (u_cdir). In Unix V7 that’s basically all the chroot code I see, except for the syscall list and some userland code so that you can call chroot from your shell:

/ C library -- chroot

/ error = chroot(string);

.globl    _chroot
.globl    cerror
.chroot = 61.

_chroot:
    mov    r5,-(sp)
    mov    sp,r5
    mov    4(r5),0f
    sys    0; 9f
    bec    1f
    jmp    cerror
1:
    clr    r0
    mov    (sp)+,r5
    rts    pc
.data
9:
    sys    .chroot; 0:..

chroot.s in Unix V7 - Source: Research-V7 tag of Unix History Repo

So chroot goes way back, back into the 70s, and while the implementation has probably changed over the years, semantically it still matches the description found in the UNIX V7 Manual:

Chroot sets the root directory, the starting point for path names beginning with /. The call is restricted to the super-user.

Ok, history lesson over. Let’s start building things.

Using `chroot` Directly

Let’s start with the command-line and work towards our docker run clone.

The most straightforward docker run is hello-world:

> docker run hello-world
Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
 1. The Docker client contacted the Docker daemon.
 2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
    (amd64)
 3. The Docker daemon created a new container from that image which runs the
    executable that produces the output you are currently reading.
 4. The Docker daemon streamed that output to the Docker client, which sent it
    to your terminal.
...

To recreate run this hello-world in chroot jail³ is relatively straightforward.

`chroot` Hello World

At the command line, I can setup the hello-world in a changed root like so:

> mkdir /testroot
> cp hello /testroot

Then run it:

> chroot  /testroot /hello

Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
 1. The Docker client contacted the Docker daemon.
 2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
    (amd64)
 3. The Docker daemon created a new container from that image which runs the
    executable that produces the output you are currently reading.
 4. The Docker daemon streamed that output to the Docker client, which sent it
    to your terminal.

To try something more ambitious, you can run an Ubuntu container with:
 $ docker run -it ubuntu bash

Share images, automate workflows, and more with a free Docker ID:
 https://hub.docker.com/

For more examples and ideas, visit:
 https://docs.docker.com/get-started/

The Root of the Matter

chroot only works as a root user, so assume from here on out everything is being done as root on a Linux machine.

If you try as a non-root user, you will get something like this:


> chroot /testroot /hello
chroot: cannot change root directory to '/testroot': Operation not permitted

We can also do this from go, making the system call directly:

package main

import (
    "os"
    "os/exec"
    "syscall"
)

func main() {
    cmd := exec.Command("/hello")
    syscall.Chroot("/testroot")
    cmd.Stdin = os.Stdin
    cmd.Stdout = os.Stdout
    cmd.Stderr = os.Stderr
    cmd.Run()
}

And the output is the same:

> go run go-change-root.go 

Hello from Docker!
This message shows that your installation appears to be working correctly.
...

That hello process runs with a filesystem rooted to /testroot. So there is nothing in the filesystem it can see besides itself.

I could verify that by running a shell inside it and poking around. However, when you change the root, the command passed is relative to the new root, so running /bin/sh will fail.

> chroot /testroot /bin/sh
chroot: /bin/sh: No such file or directory

I could cp /bin/sh into /testroot, but sh dynamically links in libc and probably other stuff. Those file pointers won’t point to anything in our new root so it won’t work. This is also why you can’t shell into the hello-world image with docker run:

> docker run -it hello-world /bin/sh
exec: "/bin/sh": stat /bin/sh: no such file or directory: unknown.

Predictably, you can only shell into an image with a shell (and supporting userspace dependencies) inside it. So I’m going to be using redis:latest today:

> docker run -it redis /bin/sh
> cd /
> ls
bin  boot  data  dev  etc  home  lib  lib64  media  mnt  opt  proc  root  run
sbin  srv  sys  tmp  usr  var

Now let’s try to chroot into this redis image. But to do that, I first need to get the file-system out of the image so I can pass it to chroot.

Extracting FileSystems From Container

To extract file-system from the image, the first thing I’ll try is to grab the image and extract it:

docker save redis -o redisImage.tar
mkdir redis
cd redis && tar -mxvf ../redisImage.tar

I can then look inside it:

redis
├── 131d224a301217b1d881f2464837d310dc8e0bf701d049fc30fb9eabddd98cbc
│   ├── VERSION
│   ├── json
│   └── layer.tar
├── 2279e9cb00a8a268cb01a1ccd1b7c0a01dc6b9ec619a7877dda2ca81e7409428
│   ├── VERSION
│   ├── json
│   └── layer.tar
├── 2d0405b8f23157bc9f45cadc12b8b7ff23446dfe968bfa7473cb78ec2444d198
│   ├── VERSION
│   ├── json
│   └── layer.tar
├── 770413d3495f9ba555e345d5c5397580a61cc64d9a945135b4b2235eed19d07b
│   ├── VERSION
│   ├── json
│   └── layer.tar
├── beb3916dbc72988060eaa0ba9ba119c76eb1c07db1c18ea53d3ca4f40a03c436
│   ├── VERSION
│   ├── json
│   └── layer.tar
├── c2342258f8ca7ab5af86e82df6e9ade908a949216679667b0f39b59bcd38c4e9.json
├── f7b46deebf614151dce2888bcb81e312da2ac791230b02688a5dbab1dee7ea91
│   ├── VERSION
│   ├── json
│   └── layer.tar
├── manifest.json
└── repositories

I’m on the right track, but this is not exactly what I wanted. Each layer.tar is the union file-system changes for that image layer. To build the completed file structure I would need to extract each of these and combine them in the right order.

Thankfully, I can just ask docker to do that for us with docker export.

> docker export $(docker create redis) -o redis.tar.gz
> mkdir redis && cd redis
> tar --no-same-owner --no-same-permissions --owner=0 --group=0 \
  -mxf ../redis.tar.gz

Then I end up with the extracted redis file structure:

./redis
├── bin
├── boot
├── data
├── dev
├── etc
├── home
├── lib
├── lib64
├── media
├── mnt
├── opt
├── proc
├── root
├── run
├── sbin
├── sys
├── tmp
├── usr
└── var

And so if I wrap that docker export up into a bash script, I can grab the file system for any image on docker hub, turning any Linux container image into a tar file.

./pull "redis"                
Pulling image redis
export image c20f5ecac2f9c49521b32433ffc6abeade950e77592805b0fc61fea00d6e32f5

Pull the file-system from an image. ( source)

From there, my trusty rusty chroot command works much like my docker run -it redis /bin/sh from a couple of steps ago:

> chroot ./redis /bin/sh
> ls
bin  boot  data  dev  etc  home  lib  lib64  media  mnt  opt  proc  root  run 
 sbin  srv  sys  tmp  usr  var

It’s Just a Process

The container was a process all along

Here is why this is interesting from a learning perspective:

When I run docker run .. something happens – an image is turned into a container and started up. It’s not really a VM, but if you shell inside and look around, it seems like one. But now, with chroot at hand, you can see what ’not really a VM means: It’s just a process!

Namespaces mean when you start a container, you can’t see it in your process list, and cgroups mean that the process can have CPU and memory limits placed on it, but really, at a conceptual level, it’s just a process running with a different file-system root. Really containers are just a fancier way to chroot something!

Ok, let’s keep going.

ChRun Time

Another thing you may have noticed about containers is that they are ephemeral and relatively isolated. I can run N containers from one image and they will each be unique. Modern container runtimes use a union file-system ( like overlayfs ) for this but I get close to that with just temp directories.

Here’s my plan. When chrun pull <imagename> is called, I grab a tar of the image and store it somewhere. Then each time chrun run <imagename> is called, I’ll do the following:

Create a temporary directory
Extract <imagename>.tar.gz into it
Change root into that directory
On exit, delete the directory

It’s looks like this:

func main() {
    tar := fmt.Sprintf("./assets/%s.tar.gz", os.Args[2])
    cmd := os.Args[3]
    dir := createTempDir(tar)
    defer os.RemoveAll(dir)
    must(unTar(tar, dir))
    chroot(dir, cmd)
}

First, I create a temp directory:

func createTempDir(name string) string {
    var nonAlphanumericRegex = regexp.MustCompile(`[^a-zA-Z0-9 ]+`)
    prefix := nonAlphanumericRegex.ReplaceAllString(name, "_")
    dir, err := ioutil.TempDir("", prefix)
    if err != nil {
        log.Fatal(err)
    }
    return dir
}

Then I untar things:

func unTar(source string, dst string) error {
    r, err := os.Open(source)
    if err != nil {
        return err
    }
    defer r.Close()

    ctx := context.Background()
    return extract.Archive(ctx, r, dst, nil)
}

And then chroot, and we’re off:

func chroot(root string, call string) {
    fmt.Printf("Running %s in %s\n", call, root)
    cmd := exec.Command(call)
    must(syscall.Chroot(root))
    cmd.Stdin = os.Stdin
    cmd.Stdout = os.Stdout
    cmd.Stderr = os.Stderr
    must(cmd.Run())
}

( There are actually a couple more bits to it, but they are uninteresting. Entire file in this repo. )

And with that, I can do things like start up a redis client and server and have them talk to each other:

> ./chrun pull redis
Pulling image redis
export image 16b87aa63c8f3a1e14a50feb94cba39eaa5d19bec64d90ff76c3ded058ad09c8

chrun pulls an image from docker hub and builds a tar archive it. (docker export does the heavy lifting)

> chrun run redis "/usr/local/bin/redis-server"

Running /usr/local/bin/redis-server in /tmp/_assets_redis_tar_gz4234401501
4360:C 31 Oct 2022 16:07:57.253 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
4360:C 31 Oct 2022 16:07:57.253 # Redis version=7.0.5, bits=64, 
4360:C 31 Oct 2022 16:07:57.253 # Warning: no config file specified, using the 
4360:M 31 Oct 2022 16:07:57.256 * Increased maximum number of open files to 10032 (it was originally set to 1024).
4360:M 31 Oct 2022 16:07:57.256 * monotonic clock: POSIX clock_gettime
                _._                                                  
           _.-``__ ''-._                                             
      _.-``    `.  `_.  ''-._           Redis 7.0.5 (00000000/0) 64 bit
  .-`` .-` `.  ` `\/    _.,_ ''-._                                  
 (    '      ,       .-`  | `,    )     Running in standalone mode
 |`-._`-...-` __...-.``-._|'` _.-'|     Port: 6379
 |    `-._   `._    /     _.-'    |     PID: 4360
  `-._    `-._  `-./  _.-'    _.-'                                   
 |`-._`-._    `-.__.-'    _.-'_.-'|                                  
 |    `-._`-._        _.-'_.-'    |           https://redis.io       
  `-._    `-._`-.__.-'_.-'    _.-'                                   
 |`-._`-._    `-.__.-'    _.-'_.-'|                                  
 |    `-._`-._        _.-'_.-'    |                                  
  `-._    `-._`-.__.-'_.-'    _.-'                                   
      `-._    `-.__.-'    _.-'                                       
          `-._        _.-'                                           
              `-.__.-'                                               

4360:M 31 Oct 2022 16:07:57.260 # Server initialized
4360:M 31 Oct 2022 16:07:57.265 * Ready to accept connections

chrun run extracts the tar to a temp dir, changes root to it, and then starts the passed command. Afterward, it cleans up the temp dir.

> chrun run redis "/usr/local/bin/redis-cli"

Running /usr/local/bin/redis-cli in /tmp/_assets_redis_tar_gz1366317376
127.0.0.1:6379> SET mykey "Hello\nWorld"
OK
127.0.0.1:6379> GET mykey
"Hello\nWorld"
127.0.0.1:6379> 
127.0.0.1:6379> exit

Because of the temp dirs, you can start many processes from one image, and each gets its own file-system. They can talk to each other over the network.

And after I stop them, the temp dir is removed, and they disappear. So there you go, ‘containers’ using only chroot.

The source is on github.

Who Cares?

So who cares? I mean, many container runtimes already exist (runC, containerd, gVisor, StarStruck) and they’re all better than this one in almost every way⁴.

Well, it could just be me, but understanding that a container is very similar to a process that has been chrooted – so it’s running against the same operating system but with a different root – that understanding helps ground my knowledge of what containers are. It makes them seem less magical⁵ and lets me think about new possibilities⁶.

And so containers are great. Namespaces, cgroups v2, runC, overlayfs, the OCI image format, and everything else in this space is impressive engineering. It’s incredible forward progress we can all take advantage of. But it’s not magic. It’s just a long series of progressive refinements ( and a bit of marketing ) on top of a feature that has been in Unix since … let me check:

> git log usr/src/libc/sys/chroot.s | head -5

commit a0b0c390d5f37060bf64b63bba8e9f0a1dceb337
Author: Dennis Ritchie <dmr@research.uucp>
Date:   Wed Jan 10 14:59:44 1979 -0500

    Research V7 development

Turn your engineering standards into automated guardrails that provide feedback directly in pull requests, with 100+ guardrails included out of the box and support for the tools and CI/CD systems you already have.

Learn More

I pronounce chroot as change root. I asked around and some say ‘see haitch root’ and some ‘chur-oot’ (which I kind of hate).↩︎
Thanks to Diomidis Spinellis for creating the unix history repo and pointing me towards the V7 manual and thanks to Alex Couture-Beil for pushing me to understand the code.↩︎
The term chroot jail used to be used quite a bit on the internet, and so I’m using it here, but it’s a bit problematic as a term. A program running in a restricted root lacks access to files outside the root, but breaking out of this jail is often possible, especially in the way will be using them today.

Containers provide more restrictions than chroot. For example, cgroups provide resource isolation, and namespaces limit visibility, but even containers are not proper security barriers.

I’m not a security expert, but it’s probably better to think of both chroot and containers as offering an abstraction barrier, rather than providing a security barrier.↩︎
gVisor is really cool. It’s an application kernel for containers, which limits some of the security surface-area of a container runtime. StarStruck is just there to check if you are paying attention. It’s not a container runtime at all, but Joe Exotic’s 2nd studio album.↩︎
Here’s one example where chroot made things seem less magical.

When I see people complaining that docker takes up too much disk space, I know that it’s more about having lots of docker images sitting around on disk than anything innate to how containers work. Also, its probably a bit about how many containers include full linux distros in them. No one is complaining about the size of hello-world:latest or any from scratch image.

Nothing about containers means they need a full linux distro in them, that’s just a convenience.↩︎
One new possibility that seems exciting to me is building native OS X containers based on chroot.

When you run a Linux container on a mac, it runs them inside a Linux VM. But OS X has the chroot system call, so it should be possible to build native mac containers, which contain an OS X build of redis and a base layer including the OS X dependencies that redis needs to run. These containers would lack some of the isolation of Linux containers, but they would have almost no overhead and could have much faster volume mounts (a current source of frustration for many devs who are using OS X).↩︎

Adam Gordon Bell

Spreading the word about Earthly. Host of CoRecursive podcast. Physical Embodiment of Cunningham's Law.
@adamgordonbell
✉Email Adam✉

Get notified about new articles!

We won't send you spam. Unsubscribe at any time.