Series: Understanding OCI from the Ground Up (Part 3 of 5)

In Part 1 we built an OCI image. In Part 2 we pushed and pulled it with raw HTTP. Now we run it — without runc, without Docker, without any container runtime. Just chroot, unshare, mount, and hostname. Four commands that are already on every Linux system.

What is the OCI Runtime Spec?

The OCI Runtime Specification defines how to take a container image and turn it into a running process with isolation. It answers: "Given a root filesystem and some configuration, how do I run an isolated process?"

The spec describes:

  1. A runtime bundle — a directory containing:
    • rootfs/ — the container's root filesystem (extracted from image layers)
    • config.json — how to run it (process args, env, namespaces, mounts, hostname)
  2. A container lifecycle — create → start → run → stop → delete
  3. Linux primitives that provide isolation:
    • chroot — filesystem isolation
    • unshare — namespace isolation (PID, UTS, mount, network, user)
    • mount — controlled filesystem views
    • hostname — UTS namespace (container identity)

Key insight: A container is not a VM. It's a regular Linux process that uses kernel features to limit what it can see and do. runc is just a program that calls these same Linux primitives in the right order. We'll do it by hand.


Prerequisites

We work inside a Docker container on Docker Desktop for Mac. We need --privileged so that unshare can create new namespaces:

docker run --rm -d --name oci-lab --dns 8.8.8.8 --privileged \
  ubuntu:22.04 sleep 3600

docker exec oci-lab bash -c \
  "apt-get update -qq && apt-get install -y -qq skopeo jq wget file xz-utils procps iproute2 > /dev/null 2>&1"

Why --privileged? Creating new PID, UTS, and mount namespaces with unshare requires CAP_SYS_ADMIN. Docker's default security profile blocks this. The --privileged flag lifts that restriction. This only affects the lab container — your Mac is untouched.


Step 1: Assemble the Root Filesystem with OverlayFS

In Part 1, we built an OCI image with two layers. A container runtime's first job is to assemble those layers into a single root filesystem. Docker and containerd use overlayfs for this — and so will we.

What is OverlayFS?

OverlayFS is a Linux filesystem that merges multiple directories into a single unified view:

┌────────────────────────────────────────────────────────────┐
│                    merged/ (unified view)                   │
│  The container sees this as its root filesystem             │
│  Reads come from the first layer that has the file          │
│  Writes go to the upper layer (copy-on-write)              │
├────────────────────────────────────────────────────────────┤
│  upper/   (writable)   │  All container writes land here   │
├────────────────────────────────────────────────────────────┤
│  lower2/  (read-only)  │  curl layer: /usr/local/bin/curl  │
├────────────────────────────────────────────────────────────┤
│  lower1/  (read-only)  │  base layer: Ubuntu 22.04 rootfs  │
└────────────────────────────────────────────────────────────┘

This is how Docker stores containers. In fact, if you look at the host mount table of our lab container itself:

overlay / overlay rw,relatime,
  lowerdir=/var/lib/desktop-containerd/...snapshots/444/fs:...419/fs,
  upperdir=...snapshots/445/fs,
  workdir=...snapshots/445/work

Docker Desktop used overlayfs to create our lab container. We'll now do the same thing by hand.

Extract each layer into its own directory

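# Run everything from here on inside the lab container (docker exec -it oci-lab bash)
mkdir -p /work && cd /work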

# Pull the base image
skopeo copy docker://ubuntu:22.04 oci:ubuntu-base:22.04

# Find the layer blob
MANIFEST_PATH="ubuntu-base/blobs/$(jq -r '.manifests[0].digest' ubuntu-base/index.json | tr ':' '/')"
LAYER_PATH="ubuntu-base/blobs/$(jq -r '.layers[0].digest' "$MANIFEST_PATH" | tr ':' '/')"

# Extract base layer into its own directory
mkdir -p layers/base
tar -xzf "$LAYER_PATH" -C layers/base

# Download and extract curl binary into its own layer directory
wget -q -O /tmp/curl.tar.xz \
  "https://github.com/stunnel/static-curl/releases/download/8.19.0/curl-linux-aarch64-musl-8.19.0.tar.xz"
tar -xf /tmp/curl.tar.xz -C /tmp/
mkdir -p layers/curl/usr/local/bin
cp /tmp/curl layers/curl/usr/local/bin/curl
chmod +x layers/curl/usr/local/bin/curl
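
The tr ':' '/' step works because an OCI image layout stores every blob at blobs/<algorithm>/<hex>. As a minimal sketch, the same mapping can be written with plain parameter expansion. The function name digest_to_blob_path and the digest value are hypothetical, chosen only for illustration:

```shell
# Map an OCI digest ("sha256:<hex>") to its blob path inside an image layout.
# digest_to_blob_path is a hypothetical helper, not part of skopeo or jq.
digest_to_blob_path() {
  local layout="$1" digest="$2"
  local algo="${digest%%:*}"   # text before the first colon, e.g. sha256
  local hex="${digest#*:}"     # text after the first colon
  printf '%s/blobs/%s/%s\n' "$layout" "$algo" "$hex"
}

# Placeholder digest, not a real layer
digest_to_blob_path ubuntu-base sha256:0123abcd
# prints ubuntu-base/blobs/sha256/0123abcd
```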

Two separate layers, each in its own directory:

Layer 1 (base) — Ubuntu root filesystem:
  bin  boot  dev  etc  home  lib  media  mnt  opt  proc ...
  18 entries total, 76 MB

Layer 2 (curl) — just the curl binary:
  layers/curl/usr/local/bin/curl
  1 file, 9.3 MB

Mount overlayfs to merge the layers

Note: OverlayFS requires the upper and work directories to be on the same filesystem, one that supports d_type (like ext4 or tmpfs) and is not itself overlayfs. Since our container root is already overlayfs, we use tmpfs for the upper/work dirs.

# Create directories on tmpfs (upperdir/workdir cannot live on the container's overlayfs root)
mount -t tmpfs tmpfs /mnt
mkdir -p /mnt/lower-base /mnt/lower-curl /mnt/upper /mnt/work /mnt/merged

# Copy layers to tmpfs (so overlayfs can use them)
cp -a layers/base/. /mnt/lower-base/
cp -a layers/curl/. /mnt/lower-curl/

# Mount overlayfs — this is the key command
mount -t overlay overlay \
  -o lowerdir=/mnt/lower-curl:/mnt/lower-base,upperdir=/mnt/upper,workdir=/mnt/work \
  /mnt/merged

The mount command explained:

Option                                   | Purpose
-t overlay                               | Filesystem type: overlayfs
lowerdir=/mnt/lower-curl:/mnt/lower-base | Read-only layers, colon-separated. Order matters: first listed = highest priority
upperdir=/mnt/upper                      | Writable layer: all modifications go here
workdir=/mnt/work                        | Internal scratch space for overlayfs atomics
/mnt/merged                              | The mount point: the unified view
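
Because image manifests list layers bottom-to-top while lowerdir wants the top-most layer first, a runtime has to reverse the order when it builds the option string. Here is a small sketch of that step. The helper name build_lowerdir is an assumption for illustration, not something Docker or containerd exposes:

```shell
# Build the lowerdir= value from layer directories given bottom-to-top
# (image order). overlayfs expects the highest-priority layer first,
# so each new directory is prepended to the list.
build_lowerdir() {
  local out="" dir
  for dir in "$@"; do
    out="$dir${out:+:$out}"
  done
  printf '%s\n' "$out"
}

build_lowerdir /mnt/lower-base /mnt/lower-curl
# prints /mnt/lower-curl:/mnt/lower-base
```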

Verify the merged view

ls /mnt/merged/
bin   dev  home  media  opt   root  sbin  sys  usr
boot  etc  lib   mnt    proc  run   srv   tmp  var
# Base layer content visible through merged view
head -2 /mnt/merged/etc/os-release
PRETTY_NAME="Ubuntu 22.04.5 LTS"
NAME="Ubuntu"
# Curl layer content visible through merged view
ls -la /mnt/merged/usr/local/bin/curl
file /mnt/merged/usr/local/bin/curl
/mnt/merged/usr/local/bin/curl --version | head -1
-rwxr-xr-x 1 root root 9671880 Apr 25 13:18 /mnt/merged/usr/local/bin/curl
/mnt/merged/usr/local/bin/curl: ELF 64-bit LSB pie executable, ARM aarch64, static-pie linked, stripped
curl 8.19.0 (aarch64-pc-linux-gnu) libcurl/8.19.0 OpenSSL/3.6.1 zlib/1.3.2 brotli/1.2.0 zstd/1.5.7

Both layers merged into a single unified view. The container sees one filesystem — Ubuntu root with curl at /usr/local/bin/curl — even though they came from separate layers.
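
The lookup rule behind this merged view ("reads come from the first layer that has the file") can be modeled in a few lines without mounting anything. This is only a toy model under simplifying assumptions: real overlayfs also merges directories and honors whiteouts, and the function name overlay_resolve and the throwaway directory layout are made up for the example:

```shell
# Toy model of overlayfs read resolution: print the first layer that holds
# the path. Layers are passed highest priority first (upper, then the
# lowerdir entries in their listed order).
overlay_resolve() {
  local path="$1"; shift
  local layer
  for layer in "$@"; do
    if [ -e "$layer/$path" ]; then
      printf '%s\n' "$layer/$path"
      return 0
    fi
  done
  return 1   # not present in any layer
}

# Demo in a scratch directory that mirrors upper / lower-curl / lower-base
demo=$(mktemp -d)
mkdir -p "$demo/upper/etc" "$demo/lower-curl/usr/local/bin" "$demo/lower-base/etc"
echo "from-base"  > "$demo/lower-base/etc/hostname"
echo "from-upper" > "$demo/upper/etc/hostname"
: > "$demo/lower-curl/usr/local/bin/curl"

# etc/hostname exists in two layers; the upper copy wins
overlay_resolve etc/hostname "$demo/upper" "$demo/lower-curl" "$demo/lower-base"
# curl exists only in the curl layer
overlay_resolve usr/local/bin/curl "$demo/upper" "$demo/lower-curl" "$demo/lower-base"
```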

Copy-on-Write in action

The upper layer starts empty. Let's see what happens when the container modifies files:

# Upper layer — currently empty
ls -la /mnt/upper/
total 0
drwxr-xr-x 2 root root  40 Apr 25 13:19 .
drwxrwxrwt 7 root root 140 Apr 25 13:19 ..

Modify a file from the base layer:

echo "my-overlay-container" > /mnt/merged/etc/hostname
cat /mnt/merged/etc/hostname
my-overlay-container
# The modified file was COPIED to the upper layer (copy-on-write)
find /mnt/upper -type f
cat /mnt/upper/etc/hostname
/mnt/upper/etc/hostname
my-overlay-container
# Base layer is UNTOUCHED
head -1 /mnt/lower-base/etc/hostname
localhost.localdomain

The original file in the base layer wasn't modified. OverlayFS copied it to the upper layer first, then applied the change. This is copy-on-write (COW).

Create a new file:

echo "hello from overlay" > /mnt/merged/tmp/overlay-test.txt
cat /mnt/upper/tmp/overlay-test.txt
hello from overlay

New files go directly to the upper layer. The lower layers remain untouched.

Delete a file (whiteout):

# Before delete
ls -la /mnt/merged/etc/legal
-rw-r--r-- 1 root root 267 Oct 15  2021 /mnt/merged/etc/legal
rm /mnt/merged/etc/legal
ls /mnt/merged/etc/legal 2>&1
ls: cannot access '/mnt/merged/etc/legal': No such file or directory
# How does overlayfs hide a file that still exists in the base layer?
# It creates a "whiteout" — a character device with major:minor 0:0
ls -la /mnt/upper/etc/legal
c--------- 2 root root 0, 0 Apr 25 13:19 legal
# The base layer still has the original
ls -la /mnt/lower-base/etc/legal
-rw-r--r-- 1 root root 267 Oct 15  2021 /mnt/lower-base/etc/legal

Whiteouts are how the OCI Image Spec handles deletions in layers. In Part 1, we mentioned that layers are filesystem diffs. Now you can see how: a new layer is essentially the contents of the upper directory — modified files, new files, and whiteout markers for deletions.

Summary of the upper layer after all changes:

/mnt/upper/etc/hostname            — modified (copy-on-write)
/mnt/upper/tmp/overlay-test.txt    — created (new file)
/mnt/upper/etc/legal               — deleted (whiteout: char device 0,0)

This is essentially how docker commit works. It takes the upper layer, creates a tar archive from it (with whiteouts written as .wh.<filename> entries), and that archive becomes a new image layer.
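
The translation from an overlayfs whiteout (a 0,0 character device) to an OCI layer entry is essentially a rename inside the same directory. A minimal sketch of that naming rule follows. The helper whiteout_name is hypothetical, and a real tool would also have to handle opaque directories, which we skip here:

```shell
# Map a deleted path (relative to the layer root) to the name of its
# OCI whiteout entry: the basename gets a ".wh." prefix.
whiteout_name() {
  local path="$1" dir base
  base="${path##*/}"
  dir="${path%/*}"
  if [ "$dir" = "$path" ]; then
    # No slash in the path: the file sits at the layer root
    printf '.wh.%s\n' "$base"
  else
    printf '%s/.wh.%s\n' "$dir" "$base"
  fi
}

whiteout_name etc/legal
# prints etc/.wh.legal
```

In our lab, find /mnt/upper -type c would list the character devices that are candidates for this renaming.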

This /mnt/merged directory is the bundle's rootfs — what the OCI Runtime Spec calls the "root filesystem" of the container.


Step 2: Linux Isolation Primitives — One at a Time

Before we combine everything, let's understand each primitive individually.

Primitive 1: chroot — Filesystem Isolation

chroot changes the apparent root directory for a process. Everything outside the new root becomes invisible.

# Outside chroot (host view)
echo "Root entries: $(ls / | wc -l)"
echo "Hostname: $(hostname)"
Root entries: 19
Hostname: 92921a41b725
# Inside chroot (container view)
chroot /mnt/merged /bin/bash -c '
  echo "Root entries: $(ls / | wc -l)"
  echo "os-release: $(head -1 /etc/os-release)"
  echo "curl: $(/usr/local/bin/curl --version | head -1)"
  echo "Can see host /work? $(ls /work 2>&1)"
'
Root entries: 18
os-release: PRETTY_NAME="Ubuntu 22.04.5 LTS"
curl: curl 8.19.0 (aarch64-pc-linux-gnu) libcurl/8.19.0 OpenSSL/3.6.1 zlib/1.3.2 brotli/1.2.0 zstd/1.5.7
Can see host /work? ls: cannot access '/work': No such file or directory

What happened: The process inside chroot sees /mnt/merged (our overlayfs mount) as /. It can't reach the host's /work, the host's /etc, or anything else outside the new root. The filesystem is isolated.

What it doesn't do: The process still shares the host's PID space, hostname, and mount table. We fix that next.

Primitive 2: unshare --pid — PID Namespace

A PID namespace gives the container its own process ID space. Inside the namespace, the first process gets PID 1.

# Host PID namespace
echo "My PID: $$"
echo "Process count: $(ps -e --no-headers | wc -l)"
ps -eo pid,comm | head -6
My PID: 3392
Process count: 5
  PID COMMAND
    1 sleep
 3392 bash
 3401 ps
 3402 head
# New PID namespace
unshare --pid --fork --mount chroot /mnt/merged /bin/bash -c '
  mount -t proc proc /proc
  echo "My PID: $$"
  echo "Process count: $(ps -e --no-headers | wc -l)"
  ps -eo pid,comm
  umount /proc
'
My PID: 1
Process count: 4
  PID COMMAND
    1 bash
    6 ps

What happened:

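  • Host has 5 processes. The container sees only its own bash (PID 1) and ps (PID 6). Both counts include the short-lived processes of the counting pipeline itself, which is why they are higher than the listings.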
  • The shell got PID 1 — it's the init process of its own PID namespace.
  • --fork is required because the calling process can't enter a new PID namespace itself; only its children can.
  • --mount gives us a private mount table so mount -t proc doesn't affect the host.
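
One way to confirm that two processes really sit in different PID namespaces is to compare the inode numbers behind /proc/<pid>/ns/pid: equal numbers mean the same namespace. The sketch below only parses the link text, so it works unprivileged. The helper name ns_inode and the sample inode are assumptions for illustration:

```shell
# Extract the inode number from a namespace link such as "pid:[4026531836]".
ns_inode() {
  local link="$1"
  link="${link#*\[}"           # drop everything up to and including "["
  printf '%s\n' "${link%\]}"   # drop the trailing "]"
}

# Sample value; real inode numbers differ per system
ns_inode 'pid:[4026531836]'
# prints 4026531836

# On Linux, show the PID namespace of the current process
if [ -e /proc/self/ns/pid ]; then
  ns_inode "$(readlink /proc/self/ns/pid)"
fi
```

Running the readlink part on the host and again inside the unshare'd shell should give two different numbers.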

Primitive 3: unshare --uts — Hostname Isolation

UTS (Unix Timesharing System) namespace isolates the hostname. The container can set its own hostname without affecting the host.

# Host
echo "Hostname: $(hostname)"
Hostname: 92921a41b725
# New UTS namespace
unshare --uts chroot /mnt/merged /bin/bash -c '
  hostname my-container
  echo "Hostname: $(hostname)"
'
Hostname: my-container
# Back on host
echo "Hostname still: $(hostname)"
Hostname still: 92921a41b725

What happened: The container set its hostname to my-container, but the host's hostname is unchanged. Each UTS namespace gets its own copy of the hostname.

Primitive 4: unshare --mount + mount — Mount Isolation

A mount namespace gives the container its own mount table. Mounts inside the container don't appear on the host.

# Host mount count
echo "$(cat /proc/mounts | wc -l) mount points"
11 mount points
# New mount namespace
unshare --mount chroot /mnt/merged /bin/bash -c '
  mount -t proc proc /proc
  mount -t tmpfs tmpfs /tmp

  echo "Mount points inside: $(cat /proc/mounts | wc -l)"
  cat /proc/mounts

  echo "Created a file in tmpfs:"
  echo hello-container > /tmp/test.txt
  cat /tmp/test.txt

  umount /tmp
  umount /proc
'
Mount points inside: 2
proc /proc proc rw,relatime 0 0
tmpfs /tmp tmpfs rw,relatime 0 0

Created a file in tmpfs:
hello-container
# Host: file does NOT exist
ls /tmp/test.txt 2>&1
ls: cannot access '/tmp/test.txt': No such file or directory

What happened:

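  • The container lists only 2 mount points (its own proc and tmpfs), while the host has 11. A new mount namespace starts as a copy of the parent's table; the listing is short because /proc/mounts only shows mounts reachable under the process's root, which chroot moved to /mnt/merged.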
  • A file created in the container's /tmp (tmpfs) doesn't exist on the host.
  • proc gives the container its own view of the process table (needed for ps to work).
  • tmpfs gives the container scratch space that disappears when the container stops.

Step 3: Combine Everything — Run the Container

Now we combine all four primitives into one command:

unshare --pid --fork --mount --uts chroot /mnt/merged /bin/bash

This single line:

  1. Creates new PID, mount, and UTS namespaces (unshare)
  2. Forks a child process into the new PID namespace (--fork)
  3. Changes root to /mnt/merged — our overlayfs mount (chroot)
  4. Executes /bin/bash as PID 1

Here's the full run with mounts and our application:

unshare --pid --fork --mount --uts chroot /mnt/merged /bin/bash -c '
  export PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

  # Set hostname
  hostname oci-container

  # Mount essential filesystems
  mount -t proc proc /proc
  mount -t tmpfs tmpfs /tmp

  echo "=== Container Environment ==="
  echo "Hostname : $(hostname)"
  echo "PID      : $$"
  echo "User     : $(whoami)"
  echo "Root dir : $(ls / | tr "\n" " ")"

  echo ""
  echo "=== Process Table ==="
  ps -eo pid,ppid,comm

  echo ""
  echo "=== Filesystem ==="
  echo "Mount points: $(cat /proc/mounts | wc -l)"
  cat /proc/mounts

  echo ""
  echo "=== Run our application (curl) ==="
  /usr/local/bin/curl --version | head -1

  echo ""
  echo "=== /etc/os-release ==="
  head -3 /etc/os-release

  echo ""
  echo "=== Proof of isolation ==="
  echo "Can see host /work? $(ls /work 2>&1)"

  umount /tmp
  umount /proc
'

Output:

=== Container Environment ===
Hostname : oci-container
PID      : 1
User     : root
Root dir : bin boot dev etc home lib media mnt opt proc root run sbin srv sys tmp usr var

=== Process Table ===
  PID  PPID COMMAND
    1     0 bash
    8     1 ps

=== Filesystem ===
Mount points: 3
overlay / overlay rw,relatime,lowerdir=/mnt/lower-curl:/mnt/lower-base,upperdir=/mnt/upper,workdir=/mnt/work 0 0
proc /proc proc rw,relatime 0 0
tmpfs /tmp tmpfs rw,relatime 0 0

=== Run our application (curl) ===
curl 8.19.0 (aarch64-pc-linux-gnu) libcurl/8.19.0 OpenSSL/3.6.1 zlib/1.3.2 brotli/1.2.0 zstd/1.5.7

=== /etc/os-release ===
PRETTY_NAME="Ubuntu 22.04.5 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"

=== Proof of isolation ===
Can see host /work? ls: cannot access '/work': No such file or directory
# Back on host — everything is unchanged
echo "Hostname: $(hostname)"
echo "PID: $$"
Hostname: b6e4d5977a9b
PID: 3503

This is a container. An isolated process with:

  • Its own root filesystem via overlayfs (chroot → can’t see /work)
  • Its own PID space (PID 1 = bash, only 2 processes visible)
  • Its own hostname (oci-container, not b6e4d5977a9b)
  • Its own mounts (3 mount points: overlay root + proc + tmpfs)

Step 4: The OCI Runtime Spec config.json

What we just did manually is the core of what runc does, except that runc reads its instructions from a config.json file. That file is the heart of the OCI Runtime Spec.

Here's a config.json that describes our container:

{
    "ociVersion": "1.0.2",
    "process": {
        "terminal": false,
        "user": { "uid": 0, "gid": 0 },
        "args": ["/usr/local/bin/curl", "--version"],
        "env": [
            "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
            "TERM=xterm"
        ],
        "cwd": "/"
    },
    "root": {
        "path": "/mnt/merged",
        "readonly": false
    },
    "hostname": "oci-container",
    "mounts": [
        {
            "destination": "/proc",
            "type": "proc",
            "source": "proc"
        },
        {
            "destination": "/tmp",
            "type": "tmpfs",
            "source": "tmpfs",
            "options": ["nosuid", "nodev"]
        }
    ],
    "linux": {
        "namespaces": [
            { "type": "pid" },
            { "type": "mount" },
            { "type": "uts" }
        ]
    }
}

How config.json maps to our manual commands

config.json field | What runc does | Our manual equivalent
root.path: "/mnt/merged" | Set root filesystem | chroot /mnt/merged (overlayfs mount point)
linux.namespaces: [pid, mount, uts] | Create namespaces | unshare --pid --mount --uts
hostname: "oci-container" | Set hostname in UTS ns | hostname oci-container
mounts: [{dest: "/proc", type: "proc"}] | Mount proc | mount -t proc proc /proc
mounts: [{dest: "/tmp", type: "tmpfs"}] | Mount tmpfs | mount -t tmpfs tmpfs /tmp
process.args: ["/usr/local/bin/curl", "--version"] | Execute the process | /usr/local/bin/curl --version
process.env: ["PATH..."] | Set environment | export PATH=...
process.user: {uid: 0, gid: 0} | Run as root | (default in our setup)
process.cwd: "/" | Working directory | (default in chroot)

Parsing config.json → unshare command

We can even parse config.json with jq and build the equivalent command:

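# join(" ") is fine for display, but it loses quoting for args that contain spaces
ARGS=$(jq -r '.process.args | join(" ")' config.json)
# CT_HOSTNAME rather than HOSTNAME, which bash already sets for the current host
CT_HOSTNAME=$(jq -r '.hostname' config.json)
ROOTFS=$(jq -r '.root.path' config.json)
NS_FLAGS=""
# Works for pid, mount, and uts, whose type names match the unshare flags
for ns in $(jq -r '.linux.namespaces[].type' config.json); do
  NS_FLAGS="$NS_FLAGS --$ns"
done

echo "Parsed from config.json:"
echo "  args     = $ARGS"
echo "  hostname = $CT_HOSTNAME"
echo "  rootfs   = $ROOTFS"
echo "  ns_flags = $NS_FLAGS"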
Parsed from config.json:
  args     = /usr/local/bin/curl --version
  hostname = oci-container
  rootfs   = /mnt/merged
  ns_flags =  --pid --mount --uts
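
Prefixing the namespace type with "--" happens to work for pid, mount, and uts, but not for every type: the spec calls the network namespace "network", while unshare's flag is --net. A slightly more careful sketch uses an explicit mapping. The function name ns_to_unshare_flag is hypothetical:

```shell
# Translate an OCI namespace type (linux.namespaces[].type) into the
# matching unshare(1) flag. Unknown types are reported as an error.
ns_to_unshare_flag() {
  case "$1" in
    pid)     echo "--pid" ;;
    mount)   echo "--mount" ;;
    uts)     echo "--uts" ;;
    ipc)     echo "--ipc" ;;
    network) echo "--net" ;;
    user)    echo "--user" ;;
    cgroup)  echo "--cgroup" ;;
    time)    echo "--time" ;;
    *)       echo "unknown namespace type: $1" >&2; return 1 ;;
  esac
}

flags=""
for ns in pid mount uts network; do
  flags="$flags $(ns_to_unshare_flag "$ns")"
done
echo "unshare$flags --fork"
# prints unshare --pid --mount --uts --net --fork
```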

The equivalent command:

unshare --pid --mount --uts --fork chroot /mnt/merged /bin/bash -c \
  'hostname oci-container; mount -t proc proc /proc; mount -t tmpfs tmpfs /tmp;
   echo "Hostname: $(hostname)"; echo "PID: $$"; echo "Running: /usr/local/bin/curl --version"; echo;
   /usr/local/bin/curl --version'

Output:

Hostname: oci-container
PID: 1
Running: /usr/local/bin/curl --version

curl 8.19.0 (aarch64-pc-linux-gnu) libcurl/8.19.0 OpenSSL/3.6.1 zlib/1.3.2 brotli/1.2.0 zstd/1.5.7
Release-Date: 2026-03-11
Protocols: dict file ftp ftps gopher gophers http https imap imaps ipfs ipns mqtt pop3 pop3s rtsp scp sftp smb smbs smtp smtps telnet tftp ws wss
Features: alt-svc asyn-rr AsynchDNS brotli HSTS HTTP2 HTTP3 HTTPS-proxy HTTPSRR IDN IPv6 Largefile libz NTLM PSL SSL threadsafe TLS-SRP UnixSockets zstd

From Image Config to Runtime Config

One thing the OCI Runtime Spec does not define is how to go from an image config to a config.json. That's the job of a higher-level tool (like Docker or containerd). But the mapping is straightforward:

Image Config (Part 1) | Runtime config.json | Notes
config.Cmd: ["/bin/bash"] | process.args: ["/usr/local/bin/curl", "--version"] | We overrode Cmd with our curl binary
config.Env: ["PATH..."] | process.env: ["PATH...", "TERM=xterm"] | Carried over, can add more
architecture: "arm64" | (implicit) | Must match the host
os: "linux" | (implicit) | Must be Linux for Linux namespaces
rootfs.diff_ids: [...] | root.path: "/mnt/merged" | Layers assembled via overlayfs into merged view

The image config describes what to run. The runtime config describes how to run it (with what isolation, mounts, and constraints).


What runc Adds Beyond Our Manual Approach

Our unshare + chroot approach works, but a real OCI runtime like runc adds:

Feature             | Our approach       | runc
PID namespace       | unshare --pid      | clone(CLONE_NEWPID)
Mount namespace     | unshare --mount    | clone(CLONE_NEWNS)
UTS namespace       | unshare --uts      | clone(CLONE_NEWUTS)
Network namespace   | Not used           | clone(CLONE_NEWNET) + veth pairs
User namespace      | Not used           | clone(CLONE_NEWUSER) + uid mapping
Cgroups             | Not used           | CPU/memory/IO limits
Seccomp             | Not used           | Syscall filtering
Capabilities        | Full (root)        | Dropped to minimum
Filesystem assembly | overlayfs + chroot | rootfs prepared by the caller + pivot_root (more secure)
Lifecycle hooks     | None               | prestart, poststart, poststop
State management    | None               | /run/runc/<container-id>/state.json

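pivot_root vs chroot: chroot only changes the root used for path lookups. A process that keeps CAP_SYS_CHROOT can break out by calling chroot() again on a subdirectory while its working directory stays outside that new root, then walking upward with chdir(".."). pivot_root swaps the root mount itself, so once the old root is unmounted there is nothing left to walk back to inside the mount namespace. runc uses pivot_root by default.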
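
For comparison with our chroot call, here is a rough sketch of the pivot_root sequence using the pivot_root(8) command. It is an illustration under assumptions, not runc's actual code: it must run as root inside a fresh mount namespace (for example under unshare --mount --pid --fork), the helper names run and pivot_into and the .old_root directory are made up, and with DRY_RUN=1 it only prints the steps:

```shell
# Execute a command, or just print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = 1 ]; then printf '%s\n' "$*"; else "$@"; fi
}

# pivot_into ROOTFS CMD...: make ROOTFS the root mount, then exec CMD.
pivot_into() {
  local rootfs="$1"; shift
  # 1. keep mount changes from propagating back to the parent namespace
  # 2. bind-mount rootfs onto itself: pivot_root needs new_root to be a mount point
  # 3. swap roots: the old root reappears under /.old_root
  # 4. lazily unmount the old root so the host tree is no longer reachable
  run mount --make-rprivate / &&
  run mount --bind "$rootfs" "$rootfs" &&
  run cd "$rootfs" &&
  run mkdir -p .old_root &&
  run pivot_root . .old_root &&
  run cd / &&
  run umount -l /.old_root &&
  run rmdir /.old_root &&
  run exec "$@"
}

# Print the plan for our lab rootfs without touching any mounts
DRY_RUN=1 pivot_into /mnt/merged /bin/bash
```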


The Container Lifecycle

The OCI Runtime Spec defines a strict lifecycle:

┌─────────┐     ┌─────────┐     ┌─────────┐     ┌─────────┐     ┌─────────┐
│ creating │────►│ created │────►│ running │────►│ stopped │────►│ deleted │
└─────────┘     └─────────┘     └─────────┘     └─────────┘     └─────────┘
                                                       │
                                                       ▼
                                            (exit code captured)

State    | What happens                        | Our manual equivalent
creating | Set up namespaces, mounts, cgroups  | unshare --pid --mount --uts
created  | Container ready, process not started | (chroot done, bash not yet running)
running  | Process is executing                | curl --version is running
stopped  | Process exited                      | curl finished, bash exited
deleted  | All resources cleaned up            | Namespaces destroyed, mounts removed

In our demo, all these states happen in rapid succession because we combined them into one command. runc exposes them as separate operations (runc create, runc start, runc delete).
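
The lifecycle is a simple linear state machine, which can be sketched as a lookup function. This models only the diagram above (it ignores error paths and runtime-specific states), and the name next_state is our own:

```shell
# Return the state that follows the given one, or fail at the end of the line.
next_state() {
  case "$1" in
    creating) echo created ;;
    created)  echo running ;;
    running)  echo stopped ;;
    stopped)  echo deleted ;;
    *)        return 1 ;;
  esac
}

# Walk the whole lifecycle starting from "creating"
state=creating
trail=$state
while state=$(next_state "$state"); do
  trail="$trail -> $state"
done
echo "$trail"
# prints creating -> created -> running -> stopped -> deleted
```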


Cleanup

docker rm -f oci-lab

Recap

What We Did | Linux Primitive | OCI Runtime Spec Concept
Extracted layers separately | tar -xzf into layers/base, layers/curl | Runtime bundle layers
Merged layers with overlayfs | mount -t overlay | Snapshotter (overlayfs driver)
Isolated filesystem | chroot /mnt/merged | root.path
Isolated PID space | unshare --pid | linux.namespaces: [{type: "pid"}]
Isolated hostname | unshare --uts + hostname | hostname + linux.namespaces: [{type: "uts"}]
Isolated mounts | unshare --mount + mount | mounts array + linux.namespaces: [{type: "mount"}]
Ran the application | /usr/local/bin/curl --version | process.args
Defined it all in JSON | config.json | OCI Runtime Spec config

The big takeaway

A container is a Linux process with restricted visibility. The OCI Runtime Spec is a JSON file (config.json) that tells a runtime which restrictions to apply. We applied them by hand:

  • mount -t overlay: "Merge these layers into a single filesystem"
  • chroot: "You can only see this directory"
  • unshare --pid: "You can only see your own processes"
  • unshare --uts: "You have your own hostname"
  • unshare --mount: "You have your own mount table"

No magic. No VMs. Just a handful of standard Linux commands, some of which have existed for decades.


Deep Dive: The OCI Runtime Spec in Detail

We've run a container by hand. Now let's understand the full spec — every namespace, every security mechanism, and the architecture of a real container runtime.

All Seven Linux Namespaces

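We used three namespaces (PID, UTS, mount). Linux has seven long-established namespace types (an eighth, the time namespace, arrived in kernel 5.6 and is not covered here), and a real container uses most of them: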

Namespace | Flag            | What it isolates                              | We used it?
PID       | CLONE_NEWPID    | Process IDs (container sees PID 1)            | Yes
UTS       | CLONE_NEWUTS    | Hostname and domain name                      | Yes
Mount     | CLONE_NEWNS     | Mount table (filesystem topology)             | Yes
Network   | CLONE_NEWNET    | Network interfaces, IPs, routes, ports        | No
User      | CLONE_NEWUSER   | UID/GID mappings (root inside, non-root outside) | No
IPC       | CLONE_NEWIPC    | System V IPC, POSIX message queues            | No
Cgroup    | CLONE_NEWCGROUP | Cgroup root view                              | No

Network Namespace — The Most Complex

A new network namespace starts with only a loopback interface. To give a container network access, the runtime must:

1. Create a veth pair (virtual Ethernet cable)
2. Move one end into the container's network namespace
3. Assign an IP to the container's end
4. Set up routing (default gateway)
5. Connect the host end to a bridge (docker0)
6. Set up NAT/masquerading for outbound traffic

This is why Docker networking is complex — it's not just one syscall. The runtime creates virtual networking plumbing for every container. We skipped it because our container doesn't need network access (curl just prints version info).
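
To make the six steps concrete, the sketch below prints (but does not run) one plausible set of iproute2 and iptables commands for them. Everything passed in is a hypothetical placeholder (the PID, interface names, addresses, and bridge), and Docker's real implementation does this through netlink rather than by shelling out:

```shell
# Print the commands for wiring a container's network namespace to a bridge.
# Arguments: container PID, host-side interface, container-side interface,
# container IP/prefix, gateway IP, bridge name, subnet for NAT.
veth_plan() {
  local pid="$1" host_if="$2" ct_if="$3" ct_ip="$4" gw="$5" bridge="$6" subnet="$7"
  cat <<EOF
ip link add $host_if type veth peer name $ct_if
ip link set $ct_if netns $pid
nsenter -t $pid -n ip addr add $ct_ip dev $ct_if
nsenter -t $pid -n ip link set $ct_if up
nsenter -t $pid -n ip route add default via $gw
ip link set $host_if master $bridge
ip link set $host_if up
iptables -t nat -A POSTROUTING -s $subnet ! -o $bridge -j MASQUERADE
EOF
}

# Placeholder values only; running the printed commands needs root
veth_plan 12345 veth-host veth-ct 172.18.0.2/24 172.18.0.1 docker0 172.18.0.0/24
```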

User Namespace — Rootless Containers

A user namespace maps UIDs inside the container to different UIDs outside:

Container view:    uid 0 (root)
                      ↓ mapped to
Host view:         uid 100000 (unprivileged user)

This enables rootless containers: the process thinks it's root (and can do root things inside its namespace), but from the host's perspective it's an unprivileged user. If it escapes the container, it has no host privileges.

// config.json for rootless
"linux": {
    "uidMappings": [{"containerID": 0, "hostID": 100000, "size": 65536}],
    "gidMappings": [{"containerID": 0, "hostID": 100000, "size": 65536}]
}

We ran as real root inside a privileged container. In production, rootless is preferred.
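
The arithmetic behind a uidMappings entry is a range check plus an offset. A minimal sketch, using the mapping from the config above (containerID 0, hostID 100000, size 65536); the function name map_id is hypothetical:

```shell
# map_id ID CONTAINER_START HOST_START SIZE
# Prints the host ID for a container ID, or fails if the ID is outside the range.
map_id() {
  local id="$1" cstart="$2" hstart="$3" size="$4"
  if [ "$id" -lt "$cstart" ] || [ "$id" -ge $((cstart + size)) ]; then
    echo "id $id is not mapped" >&2
    return 1
  fi
  echo $((hstart + id - cstart))
}

map_id 0 0 100000 65536      # container root; prints 100000
map_id 1000 0 100000 65536   # prints 101000
```

An ID outside the mapped range (65536 here) has no host counterpart, which the function reports as an error.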

Cgroups — Resource Limits

Namespaces control visibility. Cgroups control resource usage. They answer: "How much CPU, memory, and IO can this container use?"

// config.json cgroups section
"linux": {
    "resources": {
        "memory": {
            "limit": 536870912,
            "reservation": 268435456
        },
        "cpu": {
            "shares": 1024,
            "quota": 100000,
            "period": 100000
        },
        "pids": {
            "limit": 100
        },
        "blockIO": {
            "weight": 500
        }
    }
}

Resource         | What it controls        | Example limit
memory.limit     | Max RSS + cache         | 512 MB; the OOM killer fires beyond this
cpu.quota/period | CPU time per period     | 100000/100000 = 1 full core
cpu.shares       | Relative CPU weight     | 1024 = default; 512 = half as much time under contention
pids.limit       | Max number of processes | 100; prevents fork bombs
blockIO.weight   | Disk IO priority        | 10-1000, relative to other containers

Cgroups v1 vs v2:

Aspect     | Cgroups v1 | Cgroups v2
Structure  | Multiple hierarchies at /sys/fs/cgroup/<controller>/ | Single unified hierarchy at /sys/fs/cgroup/
Control    | Each resource controller is a separate filesystem | All controllers in one tree
Delegation | Fragmented, error-prone | Clean delegation to non-root processes
Adoption   | Legacy, still default on older systems | Default on Ubuntu 22.04+ and Fedora 31+; supported by Docker since 20.10

Our demo had no resource limits — the container could use all available CPU/memory. In production, cgroups prevent a single container from starving others.
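
The numbers in the resources block are easier to read after a little arithmetic. The sketch below converts the quota/period pair and the memory limit from the example config, and classifies the cgroup version from the filesystem type of /sys/fs/cgroup. The helper names are our own, and the v1/v2 detection is a common heuristic rather than anything the spec defines:

```shell
# CPU quota and period (microseconds) -> millicores (1000 = one full core)
cpu_millicores() { echo $(( $1 * 1000 / $2 )); }

# Bytes -> MiB
bytes_to_mib() { echo $(( $1 / 1024 / 1024 )); }

# Filesystem type of /sys/fs/cgroup -> cgroup version (heuristic)
cgroup_version() {
  case "$1" in
    cgroup2fs) echo v2 ;;
    tmpfs)     echo v1 ;;
    *)         echo unknown ;;
  esac
}

cpu_millicores 100000 100000   # prints 1000
bytes_to_mib 536870912         # prints 512

# On a Linux host, report which hierarchy is mounted
if [ -d /sys/fs/cgroup ]; then
  cgroup_version "$(stat -fc %T /sys/fs/cgroup)"
fi
```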

Seccomp — Syscall Filtering

Seccomp (Secure Computing) filters which Linux syscalls a container can make. It's a BPF-based firewall for the kernel interface.

// config.json seccomp section
"linux": {
    "seccomp": {
        "defaultAction": "SCMP_ACT_ERRNO",
        "architectures": ["SCMP_ARCH_AARCH64"],
        "syscalls": [
            {
                "names": ["read", "write", "open", "close", "stat",
                          "fstat", "mmap", "mprotect", "brk", "execve", ...],
                "action": "SCMP_ACT_ALLOW"
            }
        ]
    }
}

Default action: DENY. Only explicitly listed syscalls are allowed. Docker's default seccomp profile:

  • Allows ~300 common syscalls (read, write, open, execve, etc.)
  • Blocks ~50 dangerous ones (reboot, kexec_load, mount, ptrace, etc.)

Blocked syscall                | Why
reboot                         | Container shouldn't reboot the host
kexec_load                     | Loading a new kernel
mount (without CAP_SYS_ADMIN)  | Modifying host filesystem
ptrace (on kernels before 4.8) | Debugging/injecting into other processes
clone (with user ns flag)      | Creating new user namespaces (privilege escalation)

We used --privileged which disables seccomp entirely. In production, the default profile provides a significant security boundary.
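
You can check whether a process is running under a seccomp filter by reading the Seccomp field in /proc/<pid>/status, where 0 means disabled, 1 strict mode, and 2 filter mode. A small sketch follows; the helper name seccomp_mode_name is ours:

```shell
# Translate the numeric Seccomp field from /proc/<pid>/status into a name.
seccomp_mode_name() {
  case "$1" in
    0) echo disabled ;;
    1) echo strict ;;
    2) echo filter ;;
    *) echo unknown ;;
  esac
}

# Report the mode of the current shell (Linux only)
if [ -r /proc/self/status ]; then
  mode=$(awk '/^Seccomp:/ {print $2}' /proc/self/status)
  echo "seccomp mode: $(seccomp_mode_name "$mode")"
fi
```

Inside our privileged lab container this should report "disabled", while a default docker run would be expected to report "filter".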

Linux Capabilities — Fine-Grained Privileges

Traditional Unix has two privilege levels: root (can do everything) and non-root (restricted). Linux capabilities split root privileges into ~40 individual capabilities:

// config.json capabilities section
"process": {
    "capabilities": {
        "bounding": ["CAP_AUDIT_WRITE", "CAP_CHOWN", "CAP_FOWNER",
                      "CAP_KILL", "CAP_NET_BIND_SERVICE", "CAP_SETGID",
                      "CAP_SETUID", "CAP_SYS_CHROOT"],
        "effective": ["CAP_AUDIT_WRITE", "CAP_CHOWN", "CAP_FOWNER",
                       "CAP_KILL", "CAP_NET_BIND_SERVICE"],
        "permitted": ["CAP_AUDIT_WRITE", "CAP_CHOWN", "CAP_FOWNER",
                       "CAP_KILL", "CAP_NET_BIND_SERVICE"]
    }
}

Capability           | What it allows
CAP_NET_BIND_SERVICE | Bind to ports < 1024
CAP_CHOWN            | Change file ownership
CAP_SYS_ADMIN        | Mount filesystems, create namespaces, etc. (the "god" capability)
CAP_NET_RAW          | Use raw sockets (ping)
CAP_SYS_PTRACE       | Trace/debug other processes
CAP_SYS_CHROOT       | Use chroot

Docker containers run with ~14 capabilities by default — enough to function but far less than full root. CAP_SYS_ADMIN is notably absent, which is why unshare doesn't work without --privileged.
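
Capabilities are stored as a bitmask, visible as the CapEff line in /proc/<pid>/status, so checking for one is a shift and a mask. In the sketch below, the bit numbers come from the kernel's capability list (CAP_SYS_CHROOT is 18, CAP_SYS_ADMIN is 21). The mask a80425fb is the value commonly seen in a default Docker container and is used here as an assumption; compare it with grep CapEff /proc/self/status on your own system:

```shell
# has_cap HEXMASK BIT: succeed if the given capability bit is set in the mask.
has_cap() {
  local mask="$1" bit="$2"
  [ $(( (0x$mask >> $bit) & 1 )) -eq 1 ]
}

# Assumed typical CapEff of a default (non-privileged) Docker container
docker_default=00000000a80425fb

for entry in "CAP_SYS_CHROOT 18" "CAP_SYS_ADMIN 21"; do
  name="${entry% *}"
  bit="${entry#* }"
  if has_cap "$docker_default" "$bit"; then
    echo "$name: present"
  else
    echo "$name: absent"
  fi
done
# prints CAP_SYS_CHROOT: present
# prints CAP_SYS_ADMIN: absent
```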

The config.json — Full Structure

Here's the complete structure of a runtime config, with sections we used and didn't use:

config.json
├── ociVersion          ← "1.0.2"
├── process             ← USED: args, env, cwd, user
│   ├── args            ← ["/usr/local/bin/curl", "--version"]
│   ├── env             ← ["PATH=...", "TERM=xterm"]
│   ├── cwd             ← "/"
│   ├── user            ← {uid: 0, gid: 0}
│   ├── capabilities    ← NOT USED: cap_net_bind, cap_chown, ...
│   ├── rlimits         ← NOT USED: RLIMIT_NOFILE, etc.
│   └── terminal        ← false
├── root                ← USED: path, readonly
│   ├── path            ← "/mnt/merged"
│   └── readonly        ← false
├── hostname            ← USED: "oci-container"
├── mounts              ← USED: proc, tmpfs
│   ├── /proc           ← type: proc
│   ├── /tmp            ← type: tmpfs
│   ├── /dev            ← NOT USED: devtmpfs
│   ├── /dev/pts        ← NOT USED: devpts
│   └── /sys            ← NOT USED: sysfs
├── linux
│   ├── namespaces      ← USED: pid, mount, uts
│   ├── resources       ← NOT USED: cgroups limits
│   ├── seccomp         ← NOT USED: syscall filtering
│   ├── devices         ← NOT USED: /dev/null, /dev/zero, etc.
│   ├── maskedPaths     ← NOT USED: /proc/kcore, /proc/keys, etc.
│   └── readonlyPaths   ← NOT USED: /proc/sys, /proc/irq, etc.
└── hooks               ← NOT USED
    ├── prestart        ← Run before container starts (deprecated in the spec)
    ├── createRuntime   ← Run after runtime creates container
    ├── poststart       ← Run after container process starts
    └── poststop        ← Run after container process exits

Hooks — Lifecycle Extension Points

Hooks let you run custom programs at specific points in the container lifecycle:

"hooks": {
    "prestart": [{
        "path": "/usr/bin/setup-network",
        "args": ["setup-network", "--container-id", "abc123"]
    }],
    "poststart": [{
        "path": "/usr/bin/notify-orchestrator",
        "args": ["notify", "--status", "running"]
    }],
    "poststop": [{
        "path": "/usr/bin/cleanup-network",
        "args": ["cleanup-network", "--container-id", "abc123"]
    }]
}

Use cases:

  • prestart: Set up network interfaces, mount volumes, configure logging
  • poststart: Notify the orchestrator, register with service discovery
  • poststop: Clean up network, release IPs, remove temp files

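Kubernetes does not typically drive these hooks itself: pod networking goes through CNI plugins invoked by the CRI (Container Runtime Interface) runtime. Hooks are more often injected by add-ons such as the NVIDIA Container Toolkit, which uses a prestart hook to expose GPUs to the container.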

Masked and Readonly Paths — Hiding Sensitive Kernel Interfaces

Docker/runc hide sensitive kernel information from containers:

"linux": {
    "maskedPaths": [
        "/proc/acpi",
        "/proc/kcore",
        "/proc/keys",
        "/proc/latency_stats",
        "/proc/timer_list",
        "/proc/timer_stats",
        "/proc/sched_debug",
        "/sys/firmware"
    ],
    "readonlyPaths": [
        "/proc/asound",
        "/proc/bus",
        "/proc/fs",
        "/proc/irq",
        "/proc/sys",
        "/proc/sysrq-trigger"
    ]
}

  • Masked paths are hidden from the container: files are bind-mounted from /dev/null (reads return empty, writes disappear) and directories are covered by an empty read-only tmpfs. This hides kernel cryptographic keys (/proc/keys), ACPI tables, and timing information.
  • Readonly paths can be read but not written. This prevents containers from tuning kernel parameters (/proc/sys) or triggering sysrq.

We mounted a raw /proc with no restrictions — a security risk in production.

How Docker and containerd Use the Runtime Spec

The OCI Runtime Spec is the bottom layer of the container stack:

┌─────────────────────────────────────┐
│  User: docker run -it ubuntu bash   │
├─────────────────────────────────────┤
│  Docker CLI                         │  Parses flags, calls API
├─────────────────────────────────────┤
│  dockerd (Docker daemon)            │  Image management, networking, volumes
├─────────────────────────────────────┤
│  containerd                         │  Manages container lifecycle
│  ├── Image pull & unpack            │  Uses overlayfs snapshotter
│  ├── Generate config.json           │  Maps image config → runtime config
│  └── Call runc                      │  Passes the runtime bundle
├─────────────────────────────────────┤
│  runc (OCI runtime)                 │  Reads config.json
│  ├── Set up namespaces              │  clone() with NS flags
│  ├── Configure cgroups              │  Write to /sys/fs/cgroup/
│  ├── Apply seccomp + capabilities   │  prctl() + BPF
│  ├── pivot_root to rootfs           │  swap the root mount
│  └── exec the process               │  execve("/bin/bash")
├─────────────────────────────────────┤
│  Linux Kernel                       │  namespaces, cgroups, overlayfs
└─────────────────────────────────────┘

What each layer does:

  • Docker CLI: User-facing. Translates docker run flags into API calls.
  • dockerd: Manages images, networks, volumes. Does not set up namespaces itself.
  • containerd: Pulls images, unpacks layers (overlayfs), generates config.json, delegates to runc.
  • runc: The component that makes the isolation syscalls (clone/unshare, mount, pivot_root). Reads config.json, sets up isolation, execs the process.

The OCI Runtime Spec is the contract between containerd (or any higher-level tool) and runc (or any OCI-compliant runtime). This is why you can swap runtimes:

Runtime         | What's different
runc            | Reference implementation, written in Go
crun            | Written in C, faster startup, lower memory
youki           | Written in Rust, for security-focused deployments
gVisor (runsc)  | Intercepts syscalls with a user-space kernel
Kata Containers | Runs each container in a lightweight VM

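All of them read the same config.json, so the interface is identical and only the implementation differs. The isolation mechanism can differ too: gVisor and Kata Containers put a user-space kernel or a VM between the container and the host kernel instead of relying on namespaces alone.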

Security Layers — Defense in Depth

A production container has multiple overlapping security boundaries:

Layer 1: Namespaces        ← Visibility isolation
Layer 2: Capabilities      ← Privilege restriction
Layer 3: Seccomp           ← Syscall filtering
Layer 4: AppArmor/SELinux  ← Mandatory access control
Layer 5: Cgroups           ← Resource limits (prevent DoS)
Layer 6: Read-only rootfs  ← Prevent filesystem tampering
Layer 7: User namespace    ← Non-root on host even if root in container

Our demo used only Layer 1. Each additional layer reduces the attack surface. This is why container escapes typically require chaining multiple weaknesses: an attacker who gets past one layer still faces the others.


Up Next

In Part 4, we leave the runtime behind and head back to the registry — this time to sign our image with Notation and discover how the OCI 1.1 subject + Referrers mechanism attaches signatures (and, in Part 5, SBOMs) without modifying the image itself.


Every command output, PID, hostname, and process table in this post was captured from an actual run inside an Ubuntu 22.04 container (privileged mode) on Docker Desktop for Mac on April 25, 2026.