Part 3: The OCI Runtime Spec — Run a Container with Bare Linux Tools

Series · Understanding OCI from the Ground Up · Part 3 of 5

In Part 1 we built an OCI image. In Part 2 we pushed and pulled it with raw HTTP. Now we run it — without runc, without Docker, without any container runtime. Just chroot, unshare, mount, and hostname. Four commands that are already on every Linux system.

What is the OCI Runtime Spec?

The OCI Runtime Specification defines how to take a container image and turn it into a running process with isolation. It answers: "Given a root filesystem and some configuration, how do I run an isolated process?"

The spec describes:

A runtime bundle — a directory containing:
- rootfs/ — the container's root filesystem (extracted from image layers)
- config.json — how to run it (process args, env, namespaces, mounts, hostname)
A container lifecycle — create → start → run → stop → delete
Linux primitives that provide isolation:
- chroot — filesystem isolation
- unshare — namespace isolation (PID, UTS, mount, network, user)
- mount — controlled filesystem views
- hostname — UTS namespace (container identity)
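
As a rough sketch of the bundle layout described above: the helper name `make_bundle` and the directory name `mybundle` are made up for illustration, and the config.json it writes is a stripped-down placeholder rather than a complete runtime config.

```shell
#!/usr/bin/env bash
# make_bundle DIR: hypothetical helper that creates the two things a bundle holds.
make_bundle() {
  local dir=$1
  mkdir -p "$dir/rootfs"                 # the container's root filesystem goes here
  cat > "$dir/config.json" <<'EOF'
{
  "ociVersion": "1.0.2",
  "process": {
    "user": { "uid": 0, "gid": 0 },
    "args": ["/bin/sh"],
    "cwd": "/"
  },
  "root": { "path": "rootfs" }
}
EOF
}

bundle="$(mktemp -d)/mybundle"           # throwaway location for the demo
make_bundle "$bundle"
ls "$bundle"                             # lists config.json and rootfs
```

With runc installed, `runc spec` generates a much fuller default config.json inside a bundle directory; the hand-written file here only shows where the pieces live.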

Key insight: A container is not a VM. It's a regular Linux process that uses kernel features to limit what it can see and do. runc is just a program that calls these same Linux primitives in the right order. We'll do it by hand.

Prerequisites

We work inside a Docker container on Docker Desktop for Mac. We need --privileged so that unshare can create new namespaces:

docker run --rm -d --name oci-lab --dns 8.8.8.8 --privileged \
  ubuntu:22.04 sleep 3600

docker exec oci-lab bash -c \
  "apt-get update -qq && apt-get install -y -qq skopeo jq wget file xz-utils procps iproute2 > /dev/null 2>&1"

Why --privileged? Creating new PID, UTS, and mount namespaces with unshare requires CAP_SYS_ADMIN. Docker's default security profile blocks this. The --privileged flag lifts that restriction. This only affects the lab container — your Mac is untouched.

Step 1: Assemble the Root Filesystem with OverlayFS

In Part 1, we built an OCI image with two layers. A container runtime's first job is to assemble those layers into a single root filesystem. Docker and containerd use overlayfs for this — and so will we.

What is OverlayFS?

OverlayFS is a Linux filesystem that merges multiple directories into a single unified view:

┌────────────────────────────────────────────────────────────┐
│                    merged/ (unified view)                   │
│  The container sees this as its root filesystem             │
│  Reads come from the first layer that has the file          │
│  Writes go to the upper layer (copy-on-write)              │
├────────────────────────────────────────────────────────────┤
│  upper/   (writable)   │  All container writes land here   │
├────────────────────────────────────────────────────────────┤
│  lower2/  (read-only)  │  curl layer: /usr/local/bin/curl  │
├────────────────────────────────────────────────────────────┤
│  lower1/  (read-only)  │  base layer: Ubuntu 22.04 rootfs  │
└────────────────────────────────────────────────────────────┘

This is how Docker stores containers. In fact, the root mount of our lab container itself (as listed in /proc/mounts) looks like this:

overlay / overlay rw,relatime,
  lowerdir=/var/lib/desktop-containerd/...snapshots/444/fs:...419/fs,
  upperdir=...snapshots/445/fs,
  workdir=...snapshots/445/work

Docker Desktop used overlayfs to create our lab container. We'll now do the same thing by hand.
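
The read rule in the diagram ("reads come from the first layer that has the file") can be imitated without mounting anything. The sketch below is a simulation, not overlayfs itself: `overlay_lookup` is a hypothetical helper, and it ignores whiteouts and directory merging.

```shell
#!/usr/bin/env bash
# overlay_lookup REL_PATH LAYER...   (layers listed top to bottom, upper first)
# Prints the first layer directory that contains REL_PATH; fails if none does.
overlay_lookup() {
  local rel=$1; shift
  local layer
  for layer in "$@"; do
    if [ -e "$layer/$rel" ]; then
      printf '%s\n' "$layer"
      return 0
    fi
  done
  return 1
}

# Throwaway directories standing in for upper/, lower2/ and lower1/.
demo=$(mktemp -d)
mkdir -p "$demo/upper" "$demo/lower2/usr/local/bin" "$demo/lower1/etc"
echo base > "$demo/lower1/etc/hostname"
touch "$demo/lower2/usr/local/bin/curl"

# Before any write, the file is served from the base layer.
overlay_lookup etc/hostname "$demo/upper" "$demo/lower2" "$demo/lower1"

# Simulate a copy-up: once a copy exists in upper/, it wins.
mkdir -p "$demo/upper/etc" && echo changed > "$demo/upper/etc/hostname"
overlay_lookup etc/hostname "$demo/upper" "$demo/lower2" "$demo/lower1"
```

The first call prints the lower1 directory and the second prints the upper directory, which is the same precedence the real mount applies.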

Extract each layer into its own directory

# Run everything below inside the lab container: docker exec -it oci-lab bash
mkdir -p /work && cd /work

# Pull the base image
skopeo copy docker://ubuntu:22.04 oci:ubuntu-base:22.04

# Find the layer blob
MANIFEST_PATH="ubuntu-base/blobs/$(jq -r '.manifests[0].digest' ubuntu-base/index.json | tr ':' '/')"
LAYER_PATH="ubuntu-base/blobs/$(jq -r '.layers[0].digest' "$MANIFEST_PATH" | tr ':' '/')"

# Extract base layer into its own directory
mkdir -p layers/base
tar -xzf "$LAYER_PATH" -C layers/base

# Download and extract curl binary into its own layer directory
wget -q -O /tmp/curl.tar.xz \
  "https://github.com/stunnel/static-curl/releases/download/8.19.0/curl-linux-aarch64-musl-8.19.0.tar.xz"
tar -xf /tmp/curl.tar.xz -C /tmp/
mkdir -p layers/curl/usr/local/bin
cp /tmp/curl layers/curl/usr/local/bin/curl
chmod +x layers/curl/usr/local/bin/curl

Two separate layers, each in its own directory:

Layer 1 (base) — Ubuntu root filesystem:
  bin  boot  dev  etc  home  lib  media  mnt  opt  proc ...
  18 entries total, 76 MB

Layer 2 (curl) — just the curl binary:
  layers/curl/usr/local/bin/curl
  1 file, 9.3 MB

Mount overlayfs to merge the layers

Note: OverlayFS cannot use an upperdir that itself lives on overlayfs, and upperdir and workdir must sit on the same filesystem, one that supports d_type (such as ext4 or tmpfs). Since our container root is already overlayfs, we put the upper and work directories on a tmpfs.

# Create directories on tmpfs (an overlayfs directory cannot serve as upperdir)
mount -t tmpfs tmpfs /mnt
mkdir -p /mnt/lower-base /mnt/lower-curl /mnt/upper /mnt/work /mnt/merged

# Copy layers to tmpfs (so overlayfs can use them)
cp -a layers/base/. /mnt/lower-base/
cp -a layers/curl/. /mnt/lower-curl/

# Mount overlayfs — this is the key command
mount -t overlay overlay \
  -o lowerdir=/mnt/lower-curl:/mnt/lower-base,upperdir=/mnt/upper,workdir=/mnt/work \
  /mnt/merged

The mount command explained:

Option	Purpose
`-t overlay`	Filesystem type: overlayfs
`lowerdir=/mnt/lower-curl:/mnt/lower-base`	Read-only layers, colon-separated. Order matters: first listed = highest priority
`upperdir=/mnt/upper`	Writable layer — all modifications go here
`workdir=/mnt/work`	Internal scratch space for overlayfs atomics
`/mnt/merged`	The mount point — the unified view
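
Because lowerdir lists the highest-priority layer first while an image manifest lists layers base-first, the order has to be reversed when building the option string. A minimal sketch, assuming a hypothetical helper named `lowerdir_opt`:

```shell
#!/usr/bin/env bash
# lowerdir_opt LAYER...   (layers in image order: base first, newest last)
# Prints the colon-separated value for the lowerdir= mount option.
lowerdir_opt() {
  local out="" layer
  for layer in "$@"; do
    out="$layer${out:+:$out}"     # prepend, so the newest layer ends up first
  done
  printf '%s\n' "$out"
}

lowerdir_opt /mnt/lower-base /mnt/lower-curl
# prints /mnt/lower-curl:/mnt/lower-base
```

The result matches the lowerdir value used in the mount command above.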

Verify the merged view

ls /mnt/merged/

bin   dev  home  media  opt   root  sbin  sys  usr
boot  etc  lib   mnt    proc  run   srv   tmp  var

# Base layer content visible through merged view
head -2 /mnt/merged/etc/os-release

PRETTY_NAME="Ubuntu 22.04.5 LTS"
NAME="Ubuntu"

# Curl layer content visible through merged view
ls -la /mnt/merged/usr/local/bin/curl
file /mnt/merged/usr/local/bin/curl
/mnt/merged/usr/local/bin/curl --version | head -1

-rwxr-xr-x 1 root root 9671880 Apr 25 13:18 /mnt/merged/usr/local/bin/curl
/mnt/merged/usr/local/bin/curl: ELF 64-bit LSB pie executable, ARM aarch64, static-pie linked, stripped
curl 8.19.0 (aarch64-pc-linux-gnu) libcurl/8.19.0 OpenSSL/3.6.1 zlib/1.3.2 brotli/1.2.0 zstd/1.5.7

Both layers merged into a single unified view. The container sees one filesystem — Ubuntu root with curl at /usr/local/bin/curl — even though they came from separate layers.

Copy-on-Write in action

The upper layer starts empty. Let's see what happens when the container modifies files:

# Upper layer — currently empty
ls -la /mnt/upper/

total 0
drwxr-xr-x 2 root root  40 Apr 25 13:19 .
drwxrwxrwt 7 root root 140 Apr 25 13:19 ..

Modify a file from the base layer:

echo "my-overlay-container" > /mnt/merged/etc/hostname
cat /mnt/merged/etc/hostname

my-overlay-container

# The modified file was COPIED to the upper layer (copy-on-write)
find /mnt/upper -type f
cat /mnt/upper/etc/hostname

/mnt/upper/etc/hostname
my-overlay-container

# Base layer is UNTOUCHED
head -1 /mnt/lower-base/etc/hostname

localhost.localdomain

The original file in the base layer wasn't modified. OverlayFS copied it to the upper layer first, then applied the change. This is copy-on-write (COW).

Create a new file:

echo "hello from overlay" > /mnt/merged/tmp/overlay-test.txt
cat /mnt/upper/tmp/overlay-test.txt

hello from overlay

New files go directly to the upper layer. The lower layers remain untouched.

Delete a file (whiteout):

# Before delete
ls -la /mnt/merged/etc/legal

-rw-r--r-- 1 root root 267 Oct 15  2021 /mnt/merged/etc/legal

rm /mnt/merged/etc/legal
ls /mnt/merged/etc/legal 2>&1

ls: cannot access '/mnt/merged/etc/legal': No such file or directory

# How does overlayfs hide a file that still exists in the base layer?
# It creates a "whiteout" — a character device with major:minor 0:0
ls -la /mnt/upper/etc/legal

c--------- 2 root root 0, 0 Apr 25 13:19 legal

# The base layer still has the original
ls -la /mnt/lower-base/etc/legal

-rw-r--r-- 1 root root 267 Oct 15  2021 /mnt/lower-base/etc/legal

Whiteouts are how the OCI Image Spec handles deletions in layers. In Part 1, we mentioned that layers are filesystem diffs. Now you can see how: a new layer is essentially the contents of the upper directory — modified files, new files, and whiteout markers for deletions.

Summary of the upper layer after all changes:

/mnt/upper/etc/hostname            — modified (copy-on-write)
/mnt/upper/tmp/overlay-test.txt    — created (new file)
/mnt/upper/etc/legal               — deleted (whiteout: char device 0,0)

This is essentially how docker commit works. It takes the upper layer, creates a tar archive from it (turning each overlayfs whiteout device into an empty .wh.<filename> entry, the form the OCI Image Spec uses for deletions), and that becomes a new image layer.
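
To make the whiteout translation concrete, here is a small sketch that lists the entries a layer built from an upper directory would contain. The helper names (`is_whiteout`, `wh_name`, `layer_entries`) are invented for this example, it relies on GNU stat, and it is not how Docker implements the step: real tools also handle opaque directories, ownership, and extended attributes.

```shell
#!/usr/bin/env bash
# is_whiteout PATH: true if PATH is an overlayfs whiteout (char device 0:0).
is_whiteout() {
  [ -c "$1" ] && [ "$(stat -c '%t:%T' "$1")" = "0:0" ]
}

# wh_name REL_PATH: the entry name a layer tarball uses to record a deletion.
wh_name() {
  local dir base
  dir=$(dirname "$1"); base=$(basename "$1")
  if [ "$dir" = "." ]; then
    printf '.wh.%s\n' "$base"
  else
    printf '%s/.wh.%s\n' "$dir" "$base"
  fi
}

# layer_entries UPPER_DIR: print one line per file or whiteout in UPPER_DIR,
# with whiteouts rewritten to their .wh. names.
layer_entries() {
  local upper=$1 p rel
  find "$upper" -mindepth 1 \( -type f -o -type c \) | sort | while read -r p; do
    rel=${p#"$upper"/}
    if is_whiteout "$p"; then wh_name "$rel"; else printf '%s\n' "$rel"; fi
  done
}

wh_name etc/legal      # prints etc/.wh.legal
```

Running `layer_entries /mnt/upper` in the lab container should list the modified and created files as they are and the deleted file under its .wh. name.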

This /mnt/merged directory is the bundle's rootfs — what the OCI Runtime Spec calls the "root filesystem" of the container.

Step 2: Linux Isolation Primitives — One at a Time

Before we combine everything, let's understand each primitive individually.

Primitive 1: `chroot` — Filesystem Isolation

chroot changes the apparent root directory for a process. Everything outside the new root becomes invisible.

# Outside chroot (host view)
echo "Root entries: $(ls / | wc -l)"
echo "Hostname: $(hostname)"

Root entries: 19
Hostname: 92921a41b725

# Inside chroot (container view)
chroot /mnt/merged /bin/bash -c '
  echo "Root entries: $(ls / | wc -l)"
  echo "os-release: $(head -1 /etc/os-release)"
  echo "curl: $(/usr/local/bin/curl --version | head -1)"
  echo "Can see host /work? $(ls /work 2>&1)"
'

Root entries: 18
os-release: PRETTY_NAME="Ubuntu 22.04.5 LTS"
curl: curl 8.19.0 (aarch64-pc-linux-gnu) libcurl/8.19.0 OpenSSL/3.6.1 zlib/1.3.2 brotli/1.2.0 zstd/1.5.7
Can see host /work? ls: cannot access '/work': No such file or directory

What happened: The process inside chroot sees /mnt/merged (our overlayfs mount) as /. It can't reach /work, the host's /etc, or anything else outside the new root. The filesystem view is isolated.

What it doesn't do: The process still shares the host's PID space, hostname, and mount table. We fix that next.

Primitive 2: `unshare --pid` — PID Namespace

A PID namespace gives the container its own process ID space. Inside the namespace, the first process gets PID 1.

# Host PID namespace
echo "My PID: $$"
echo "Process count: $(ps -e --no-headers | wc -l)"
ps -eo pid,comm | head -6

My PID: 3392
Process count: 5
  PID COMMAND
    1 sleep
 3392 bash
 3401 ps
 3402 head

# New PID namespace
unshare --pid --fork --mount chroot /mnt/merged /bin/bash -c '
  mount -t proc proc /proc
  echo "My PID: $$"
  echo "Process count: $(ps -e --no-headers | wc -l)"
  ps -eo pid,comm
  umount /proc
'

My PID: 1
Process count: 4
  PID COMMAND
    1 bash
    6 ps

What happened:

Host has 5 processes. The container sees only its own — bash (PID 1) and ps (PID 6).
The shell got PID 1 — it's the init process of its own PID namespace.
--fork is required because the calling process can't enter a new PID namespace itself; only its children can.
--mount gives us a private mount table so mount -t proc doesn't affect the host.
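
One way to confirm that a process really sits in a different namespace is to compare the symlinks under /proc/<pid>/ns. The sketch below assumes a Linux /proc; `ns_id` and `same_ns` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# ns_id PID TYPE: print the namespace identifier, in the form "pid:[<inode>]".
ns_id() { readlink "/proc/$1/ns/$2"; }

# same_ns PID_A PID_B TYPE: true if both processes share that namespace.
same_ns() { [ "$(ns_id "$1" "$3" 2>/dev/null)" = "$(ns_id "$2" "$3" 2>/dev/null)" ]; }

# Show the current shell's PID, UTS and mount namespaces.
for type in pid uts mnt; do
  ns_id $$ "$type"
done
```

Running `ns_id $$ pid` once on the host and once inside `unshare --pid --fork --mount ...` (after mounting /proc there) should print two different inode numbers.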

Primitive 3: `unshare --uts` — Hostname Isolation

UTS (Unix Timesharing System) namespace isolates the hostname. The container can set its own hostname without affecting the host.

# Host
echo "Hostname: $(hostname)"

Hostname: 92921a41b725

# New UTS namespace
unshare --uts chroot /mnt/merged /bin/bash -c '
  hostname my-container
  echo "Hostname: $(hostname)"
'

Hostname: my-container

# Back on host
echo "Hostname still: $(hostname)"

Hostname still: 92921a41b725

What happened: The container set its hostname to my-container, but the host's hostname is unchanged. Each UTS namespace gets its own copy of the hostname.

Primitive 4: `unshare --mount` + `mount` — Mount Isolation

A mount namespace gives the container its own mount table. Mounts inside the container don't appear on the host.

# Host mount count
echo "$(wc -l < /proc/mounts) mount points"

11 mount points

# New mount namespace
unshare --mount chroot /mnt/merged /bin/bash -c '
  mount -t proc proc /proc
  mount -t tmpfs tmpfs /tmp

  echo "Mount points inside: $(cat /proc/mounts | wc -l)"
  cat /proc/mounts

  echo "Created a file in tmpfs:"
  echo hello-container > /tmp/test.txt
  cat /tmp/test.txt

  umount /tmp
  umount /proc
'

Mount points inside: 2
proc /proc proc rw,relatime 0 0
tmpfs /tmp tmpfs rw,relatime 0 0

Created a file in tmpfs:
hello-container

# Host: file does NOT exist
ls /tmp/test.txt 2>&1

ls: cannot access '/tmp/test.txt': No such file or directory

What happened:

The container sees only 2 mount points (its own proc and tmpfs), while the host has 11.
A file created in the container's /tmp (tmpfs) doesn't exist on the host.
proc gives the container its own view of the process table (needed for ps to work).
tmpfs gives the container scratch space that disappears when the container stops.

Step 3: Combine Everything — Run the Container

Now we combine all four primitives into one command:

unshare --pid --fork --mount --uts chroot /mnt/merged /bin/bash

This single line:

Creates new PID, mount, and UTS namespaces (unshare)
Forks a child process into the new PID namespace (--fork)
Changes root to /mnt/merged — our overlayfs mount (chroot)
Executes /bin/bash as PID 1

Here's the full run with mounts and our application:

unshare --pid --fork --mount --uts chroot /mnt/merged /bin/bash -c '
  export PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

  # Set hostname
  hostname oci-container

  # Mount essential filesystems
  mount -t proc proc /proc
  mount -t tmpfs tmpfs /tmp

  echo "=== Container Environment ==="
  echo "Hostname : $(hostname)"
  echo "PID      : $$"
  echo "User     : $(whoami)"
  echo "Root dir : $(ls / | tr "\n" " ")"

  echo ""
  echo "=== Process Table ==="
  ps -eo pid,ppid,comm

  echo ""
  echo "=== Filesystem ==="
  echo "Mount points: $(cat /proc/mounts | wc -l)"
  cat /proc/mounts

  echo ""
  echo "=== Run our application (curl) ==="
  /usr/local/bin/curl --version | head -1

  echo ""
  echo "=== /etc/os-release ==="
  head -3 /etc/os-release

  echo ""
  echo "=== Proof of isolation ==="
  echo "Can see host /work? $(ls /work 2>&1)"

  umount /tmp
  umount /proc
'

Output:

=== Container Environment ===
Hostname : oci-container
PID      : 1
User     : root
Root dir : bin boot dev etc home lib media mnt opt proc root run sbin srv sys tmp usr var

=== Process Table ===
  PID  PPID COMMAND
    1     0 bash
    8     1 ps

=== Filesystem ===
Mount points: 3
overlay / overlay rw,relatime,lowerdir=/mnt/lower-curl:/mnt/lower-base,upperdir=/mnt/upper,workdir=/mnt/work 0 0
proc /proc proc rw,relatime 0 0
tmpfs /tmp tmpfs rw,relatime 0 0

=== Run our application (curl) ===
curl 8.19.0 (aarch64-pc-linux-gnu) libcurl/8.19.0 OpenSSL/3.6.1 zlib/1.3.2 brotli/1.2.0 zstd/1.5.7

=== /etc/os-release ===
PRETTY_NAME="Ubuntu 22.04.5 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"

=== Proof of isolation ===
Can see host /work? ls: cannot access '/work': No such file or directory

# Back on host — everything is unchanged
echo "Hostname: $(hostname)"
echo "PID: $$"

Hostname: b6e4d5977a9b
PID: 3503

This is a container. An isolated process with:

Its own root filesystem via overlayfs (chroot → can’t see /work)
Its own PID space (PID 1 = bash, only 2 processes visible)
Its own hostname (oci-container, not b6e4d5977a9b)
Its own mounts (3 mount points: overlay root + proc + tmpfs)

Step 4: The OCI Runtime Spec `config.json`

What we just did manually is the core of what runc does, but runc reads its instructions from a config.json file. That file is the heart of the OCI Runtime Spec.

Here's a config.json that describes our container:

{
    "ociVersion": "1.0.2",
    "process": {
        "terminal": false,
        "user": { "uid": 0, "gid": 0 },
        "args": ["/usr/local/bin/curl", "--version"],
        "env": [
            "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
            "TERM=xterm"
        ],
        "cwd": "/"
    },
    "root": {
        "path": "/mnt/merged",
        "readonly": false
    },
    "hostname": "oci-container",
    "mounts": [
        {
            "destination": "/proc",
            "type": "proc",
            "source": "proc"
        },
        {
            "destination": "/tmp",
            "type": "tmpfs",
            "source": "tmpfs",
            "options": ["nosuid", "nodev"]
        }
    ],
    "linux": {
        "namespaces": [
            { "type": "pid" },
            { "type": "mount" },
            { "type": "uts" }
        ]
    }
}

How `config.json` maps to our manual commands

config.json field	What runc does	Our manual equivalent
`root.path: "/mnt/merged"`	Set root filesystem	`chroot /mnt/merged` (overlayfs mount point)
`linux.namespaces: [pid, mount, uts]`	Create namespaces	`unshare --pid --mount --uts`
`hostname: "oci-container"`	Set hostname in UTS ns	`hostname oci-container`
`mounts: [{dest: "/proc", type: "proc"}]`	Mount proc	`mount -t proc proc /proc`
`mounts: [{dest: "/tmp", type: "tmpfs"}]`	Mount tmpfs	`mount -t tmpfs tmpfs /tmp`
`process.args: ["/usr/local/bin/curl", "--version"]`	Execute the process	`/usr/local/bin/curl --version`
`process.env: ["PATH..."]`	Set environment	`export PATH=...`
`process.user: {uid: 0, gid: 0}`	Run as root	(default in our setup)
`process.cwd: "/"`	Working directory	(default in chroot)

Parsing config.json → unshare command

We can even parse config.json with jq and build the equivalent command:

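# Flatten the argv array for display (joining on spaces loses quoting, fine for this demo)
ARGS=$(jq -r '.process.args | join(" ")' config.json)
# Avoid the name HOSTNAME: bash already maintains a variable with that name
CT_HOSTNAME=$(jq -r '.hostname' config.json)
ROOTFS=$(jq -r '.root.path' config.json)
# Turn each namespace type into an unshare flag (works for pid, mount, uts)
NS_FLAGS=""
for ns in $(jq -r '.linux.namespaces[].type' config.json); do
  NS_FLAGS="$NS_FLAGS --$ns"
done

echo "Parsed from config.json:"
echo "  args     = $ARGS"
echo "  hostname = $CT_HOSTNAME"
echo "  rootfs   = $ROOTFS"
echo "  ns_flags = $NS_FLAGS"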

Parsed from config.json:
  args     = /usr/local/bin/curl --version
  hostname = oci-container
  rootfs   = /mnt/merged
  ns_flags =  --pid --mount --uts
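
Prefixing the namespace type with `--` happens to work for pid, mount, and uts, but the spec's type names and unshare's flags do not line up in every case: the spec says "network" where unshare expects `--net`. A slightly more careful mapping could look like this (`ns_flag` is a hypothetical helper):

```shell
#!/usr/bin/env bash
# ns_flag TYPE: map an OCI namespace type to the matching unshare(1) long option.
ns_flag() {
  case $1 in
    pid|mount|uts|ipc|user|cgroup|time) printf -- '--%s\n' "$1" ;;
    network) printf -- '--net\n' ;;     # spec name differs from the unshare flag
    *) echo "unknown namespace type: $1" >&2; return 1 ;;
  esac
}

flags=()
for ns in pid mount uts network; do
  flags+=("$(ns_flag "$ns")")
done
echo "${flags[@]}"     # prints --pid --mount --uts --net
```

In the jq loop, `ns_flag "$ns"` would take the place of the literal `--$ns`.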

The equivalent command:

unshare --pid --mount --uts --fork chroot /mnt/merged /bin/bash -c \
  'hostname oci-container; mount -t proc proc /proc; mount -t tmpfs tmpfs /tmp;
   echo "Hostname: $(hostname)"; echo "PID: $$"; echo "Running: /usr/local/bin/curl --version"; echo; /usr/local/bin/curl --version'

Output:

Hostname: oci-container
PID: 1
Running: /usr/local/bin/curl --version

curl 8.19.0 (aarch64-pc-linux-gnu) libcurl/8.19.0 OpenSSL/3.6.1 zlib/1.3.2 brotli/1.2.0 zstd/1.5.7
Release-Date: 2026-03-11
Protocols: dict file ftp ftps gopher gophers http https imap imaps ipfs ipns mqtt pop3 pop3s rtsp scp sftp smb smbs smtp smtps telnet tftp ws wss
Features: alt-svc asyn-rr AsynchDNS brotli HSTS HTTP2 HTTP3 HTTPS-proxy HTTPSRR IDN IPv6 Largefile libz NTLM PSL SSL threadsafe TLS-SRP UnixSockets zstd

From Image Config to Runtime Config

One thing the OCI Runtime Spec does not define is how to go from an image config to a config.json. That's the job of a higher-level tool (like Docker or containerd). But the mapping is straightforward:

Image Config (Part 1)	Runtime config.json	Notes
`config.Cmd: ["/bin/bash"]`	`process.args: ["/usr/local/bin/curl", "--version"]`	We overrode Cmd with our curl binary
`config.Env: ["PATH..."]`	`process.env: ["PATH...", "TERM=xterm"]`	Carried over, can add more
`architecture: "arm64"`	(implicit)	Must match the host
`os: "linux"`	(implicit)	Must be Linux for Linux namespaces
`rootfs.diff_ids: [...]`	`root.path: "/mnt/merged"`	Layers assembled via overlayfs into merged view

The image config describes what to run. The runtime config describes how to run it (with what isolation, mounts, and constraints).

What runc Adds Beyond Our Manual Approach

Our unshare + chroot approach works, but a real OCI runtime like runc adds:

Feature	Our approach	runc
PID namespace	`unshare --pid`	`clone(CLONE_NEWPID)`
Mount namespace	`unshare --mount`	`clone(CLONE_NEWNS)`
UTS namespace	`unshare --uts`	`clone(CLONE_NEWUTS)`
Network namespace	Not used	`clone(CLONE_NEWNET)` + veth pairs
User namespace	Not used	`clone(CLONE_NEWUSER)` + uid mapping
Cgroups	Not used	CPU/memory/IO limits
Seccomp	Not used	Syscall filtering
Capabilities	Full (root)	Dropped to minimum
Filesystem assembly	`overlayfs` + `chroot`	`overlayfs` + `pivot_root` (more secure)
Lifecycle hooks	None	prestart, poststart, poststop
State management	None	`/run/runc/<container-id>/state.json`

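pivot_root vs chroot: chroot only changes the root used for path lookups. A process that still holds root privileges (CAP_SYS_CHROOT) can escape, classically by calling chroot() on a subdirectory while its working directory stays outside it and then walking up with chdir(".."). pivot_root instead swaps the root mount of the mount namespace, and once the old root is unmounted there is nothing outside left to walk back to. runc uses pivot_root by default and falls back to a chroot-style setup only when explicitly told to (for example with --no-pivot).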
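
For comparison with our chroot one-liner, the usual pivot_root sequence can be sketched as a short script. The function below only prints the commands; it executes nothing. `pivot_script` and the `.oldroot` directory name are assumptions made for this example, and the sequence has not been run as part of this series.

```shell
#!/usr/bin/env bash
# pivot_script ROOTFS: emit the commands for a pivot_root-based entry into ROOTFS.
pivot_script() {
  local rootfs=$1
  cat <<EOF
mount --make-rprivate /                 # keep mount changes out of the parent namespace
mount --bind "$rootfs" "$rootfs"        # pivot_root needs the new root to be a mount point
cd "$rootfs"
mkdir -p .oldroot
pivot_root . .oldroot                   # new root becomes /, old root lands in /.oldroot
umount -l /.oldroot && rmdir /.oldroot  # detach the old root so it is no longer reachable
exec /bin/bash
EOF
}

pivot_script /mnt/merged
```

To try it in the privileged lab container, the printed script can be fed to a shell in fresh namespaces: `unshare --mount --pid --fork bash -c "$(pivot_script /mnt/merged)"`.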

The Container Lifecycle

The OCI Runtime Spec defines a strict lifecycle:

┌─────────┐     ┌─────────┐     ┌─────────┐     ┌─────────┐     ┌─────────┐
│ creating │────►│ created │────►│ running │────►│ stopped │────►│ deleted │
└─────────┘     └─────────┘     └─────────┘     └─────────┘     └─────────┘
                                                       │
                                                       ▼
                                            (exit code captured)

State	What happens	Our manual equivalent
creating	Set up namespaces, mounts, cgroups	`unshare --pid --mount --uts`
created	Container ready, process not started	(chroot done, bash not yet running)
running	Process is executing	`curl --version` is running
stopped	Process exited	Curl finished, bash exited
deleted	All resources cleaned up	Namespaces destroyed, mounts removed

In our demo, all these states happen in rapid succession because we combined them into one command. runc exposes them as separate operations (runc create, runc start, runc delete).
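
The spec also constrains which operation is valid in which state: start only applies to a created container, kill to a created or running one, and delete to a stopped one. A minimal sketch of that rule as a lookup (`op_allowed` is a hypothetical helper, not a runc command):

```shell
#!/usr/bin/env bash
# op_allowed STATE OP: succeed if OP is valid for a container in STATE.
op_allowed() {
  case "$1:$2" in
    created:start)              return 0 ;;
    created:kill|running:kill)  return 0 ;;
    stopped:delete)             return 0 ;;
    *)                          return 1 ;;
  esac
}

for op in start kill delete; do
  if op_allowed running "$op"; then
    echo "running: $op allowed"
  else
    echo "running: $op rejected"
  fi
done
```

For a running container this reports that only kill is allowed, which is why a container has to be stopped before it can be deleted.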

Cleanup

docker rm -f oci-lab

Recap

What We Did	Linux Primitive	OCI Runtime Spec Concept
Extracted layers separately	`tar -xzf` into `layers/base`, `layers/curl`	Image layers (Image Spec; input to the bundle's rootfs)
Merged layers with overlayfs	`mount -t overlay`	Bundle `rootfs/` (in containerd, the overlayfs snapshotter's job)
Isolated filesystem	`chroot /mnt/merged`	`root.path`
Isolated PID space	`unshare --pid`	`linux.namespaces: [{type: "pid"}]`
Isolated hostname	`unshare --uts` + `hostname`	`hostname` + `linux.namespaces: [{type: "uts"}]`
Isolated mounts	`unshare --mount` + `mount`	`mounts` array + `linux.namespaces: [{type: "mount"}]`
Ran the application	`/usr/local/bin/curl --version`	`process.args`
Defined it all in JSON	`config.json`	OCI Runtime Spec config

The big takeaway

A container is a Linux process with restricted visibility. The OCI Runtime Spec defines a JSON file (config.json) that tells a runtime which restrictions to apply. We applied them by hand:

mount -t overlay → "Merge these layers into a single filesystem"
chroot → "You can only see this directory"
unshare --pid → "You can only see your own processes"
unshare --uts → "You have your own hostname"
unshare --mount → "You have your own mount table"

No magic. No VMs. Just a handful of standard Linux commands (mount, chroot, unshare, hostname) driving kernel features that any distribution ships.

Deep Dive: The OCI Runtime Spec in Detail

We've run a container by hand. Now let's understand the full spec — every namespace, every security mechanism, and the architecture of a real container runtime.

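The Linux Namespaces

We used three namespaces (PID, UTS, mount). Linux has eight; the table lists the seven that container runtimes commonly use (the eighth, the time namespace, arrived in Linux 5.6), and a real container uses most of them: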

Namespace	Flag	What it isolates	We used it?
PID	`CLONE_NEWPID`	Process IDs — container sees PID 1	✓
UTS	`CLONE_NEWUTS`	Hostname and domain name	✓
Mount	`CLONE_NEWNS`	Mount table — filesystem topology	✓
Network	`CLONE_NEWNET`	Network interfaces, IPs, routes, ports	✗
User	`CLONE_NEWUSER`	UID/GID mappings — root inside, non-root outside	✗
IPC	`CLONE_NEWIPC`	System V IPC, POSIX message queues	✗
Cgroup	`CLONE_NEWCGROUP`	Cgroup root view	✗

Network Namespace — The Most Complex

A new network namespace starts with only a loopback interface. To give a container network access, the runtime must:

1. Create a veth pair (virtual Ethernet cable)
2. Move one end into the container's network namespace
3. Assign an IP to the container's end
4. Set up routing (default gateway)
5. Connect the host end to a bridge (docker0)
6. Set up NAT/masquerading for outbound traffic

This is why Docker networking is complex: it's not just one syscall. The runtime creates virtual networking plumbing for every container. We skipped it, so our process simply shared the lab container's network stack, which is fine because curl --version never touches the network.
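
The six steps map onto ordinary ip and iptables invocations. The sketch below prints a plan rather than running it; the interface names, the bridge name `br-lab`, and the 172.18.0.0/24 addresses are all invented for illustration, and the plan assumes the bridge already exists with 172.18.0.1 assigned.

```shell
#!/usr/bin/env bash
# veth_plan PID: print (not run) commands that wire the network namespace of
# process PID to a host bridge. Every name and address below is a placeholder.
veth_plan() {
  local pid=$1
  cat <<EOF
ip link add veth-host type veth peer name veth-ct
ip link set veth-ct netns $pid
nsenter -t $pid -n ip addr add 172.18.0.2/24 dev veth-ct
nsenter -t $pid -n ip link set veth-ct up
nsenter -t $pid -n ip link set lo up
nsenter -t $pid -n ip route add default via 172.18.0.1
ip link set veth-host master br-lab
ip link set veth-host up
iptables -t nat -A POSTROUTING -s 172.18.0.0/24 ! -o br-lab -j MASQUERADE
EOF
}

veth_plan 4242     # 4242 stands in for the container's init PID on the host
```

Nine commands for one container's connectivity gives a sense of how much plumbing a runtime or network plugin performs on every start.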

User Namespace — Rootless Containers

A user namespace maps UIDs inside the container to different UIDs outside:

Container view:    uid 0 (root)
                      ↓ mapped to
Host view:         uid 100000 (unprivileged user)

This enables rootless containers: the process thinks it's root (and can do root things inside its namespace), but from the host's perspective it's an unprivileged user. If it escapes the container, it has no host privileges.

// config.json for rootless (mappings only apply inside a new user namespace)
"linux": {
    "namespaces": [{"type": "user"}],
    "uidMappings": [{"containerID": 0, "hostID": 100000, "size": 65536}],
    "gidMappings": [{"containerID": 0, "hostID": 100000, "size": 65536}]
}

We ran as real root inside a privileged container. In production, rootless is preferred.
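
The mapping itself is plain arithmetic: a container UID inside the mapped range is shifted by the difference between hostID and containerID. A small sketch with a hypothetical `map_uid` helper:

```shell
#!/usr/bin/env bash
# map_uid CONTAINER_UID CONTAINER_ID HOST_ID SIZE
# Prints the host UID for CONTAINER_UID, or fails if it is outside the range.
map_uid() {
  local uid=$1 cid=$2 hid=$3 size=$4
  if (( uid < cid || uid >= cid + size )); then
    echo "uid $uid is not mapped" >&2
    return 1
  fi
  echo $(( hid + uid - cid ))
}

map_uid 0    0 100000 65536     # prints 100000
map_uid 1000 0 100000 65536     # prints 101000
```

UIDs outside the range have no host identity at all, which is why files owned by unmapped users tend to show up as nobody inside a rootless container.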

Cgroups — Resource Limits

Namespaces control visibility. Cgroups control resource usage. They answer: "How much CPU, memory, and IO can this container use?"

// config.json cgroups section
"linux": {
    "resources": {
        "memory": {
            "limit": 536870912,
            "reservation": 268435456
        },
        "cpu": {
            "shares": 1024,
            "quota": 100000,
            "period": 100000
        },
        "pids": {
            "limit": 100
        },
        "blockIO": {
            "weight": 500
        }
    }
}

Resource	What it controls	Example limit
`memory.limit`	Max RSS + cache	512 MB — OOM killer fires beyond this
`cpu.quota/period`	CPU time per period	100000/100000 = 1 full core
`cpu.shares`	Relative CPU weight	1024 = default; 512 = half as much time under contention
`pids.limit`	Max number of processes	100 — prevents fork bombs
`blockIO.weight`	Disk IO priority	10-1000, relative to other containers

Cgroups v1 vs v2:

	Cgroups v1	Cgroups v2
Structure	Multiple hierarchies at `/sys/fs/cgroup/<controller>/`	Single unified hierarchy at `/sys/fs/cgroup/`
Control	Each resource controller is a separate filesystem	All controllers in one tree
Delegation	Fragmented, error-prone	Clean delegation to non-root processes
Adoption	Legacy, still default on older systems	Default on Ubuntu 21.10+ (including 22.04 LTS) and Fedora 31+; supported by Docker since 20.10

Our demo had no resource limits — the container could use all available CPU/memory. In production, cgroups prevent a single container from starving others.
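
To connect the JSON values to what ends up in the kernel, here is a sketch that converts the quota/period pair into millicores and prints the cgroup v2 interface files a runtime would write for the memory, CPU, and pids limits. The helper names are hypothetical and nothing is written to /sys/fs/cgroup.

```shell
#!/usr/bin/env bash
# cpu_millicores QUOTA PERIOD: CPU allowance in thousandths of a core.
cpu_millicores() { echo $(( $1 * 1000 / $2 )); }

# cgroup_v2_writes MEM_LIMIT CPU_QUOTA CPU_PERIOD PIDS_LIMIT
# Prints "<file> <value>" pairs using cgroup v2 interface file names.
cgroup_v2_writes() {
  printf 'memory.max %s\n' "$1"
  printf 'cpu.max %s %s\n' "$2" "$3"
  printf 'pids.max %s\n' "$4"
}

cpu_millicores 100000 100000     # prints 1000 (one full core)
cpu_millicores 50000 100000      # prints 500 (half a core)
cgroup_v2_writes 536870912 100000 100000 100
```

On a cgroup v2 host, each printed pair corresponds to writing the value into the file of that name inside the container's cgroup directory.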

Seccomp — Syscall Filtering

Seccomp (Secure Computing) filters which Linux syscalls a container can make. It's a BPF-based firewall for the kernel interface.

// config.json seccomp section
"linux": {
    "seccomp": {
        "defaultAction": "SCMP_ACT_ERRNO",
        "architectures": ["SCMP_ARCH_AARCH64"],
        "syscalls": [
            {
                "names": ["read", "write", "open", "close", "stat",
                          "fstat", "mmap", "mprotect", "brk", "execve", ...],
                "action": "SCMP_ACT_ALLOW"
            }
        ]
    }
}

Default action: DENY. Only explicitly listed syscalls are allowed. Docker's default seccomp profile:

Allows ~300 common syscalls (read, write, open, execve, etc.)
Blocks ~50 dangerous ones (reboot, kexec_load, mount, etc.)

Blocked syscall	Why
`reboot`	Container shouldn't reboot the host
`kexec_load`	Loading a new kernel
`mount`	Changing the filesystem view (also gated by CAP_SYS_ADMIN)
`ptrace`	Debugging/injecting into other processes (blocked only on kernels older than 4.8)
`clone` (with namespace flags)	Creating new namespaces, including user namespaces (privilege escalation)

We used --privileged which disables seccomp entirely. In production, the default profile provides a significant security boundary.
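
Whether a filter is active for a process can be checked from the Seccomp field in /proc/<pid>/status, where 0 means disabled, 1 strict mode, and 2 filter mode. A short sketch, with `seccomp_mode_name` as a hypothetical helper:

```shell
#!/usr/bin/env bash
# seccomp_mode_name N: translate the numeric Seccomp field into a label.
seccomp_mode_name() {
  case $1 in
    0) echo disabled ;;
    1) echo strict ;;
    2) echo filter ;;
    *) echo unknown; return 1 ;;
  esac
}

# Read the field for the current process tree (empty if /proc is unavailable).
mode=$(awk '/^Seccomp:/ {print $2}' /proc/self/status 2>/dev/null)
echo "seccomp mode: $(seccomp_mode_name "${mode:-x}")"
```

In the privileged lab container this should report disabled, while a container started with Docker's default profile should report filter.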

Linux Capabilities — Fine-Grained Privileges

Traditional Unix has two privilege levels: root (can do everything) and non-root (restricted). Linux capabilities split root privileges into ~40 individual capabilities:

// config.json capabilities section
"process": {
    "capabilities": {
        "bounding": ["CAP_AUDIT_WRITE", "CAP_CHOWN", "CAP_FOWNER",
                      "CAP_KILL", "CAP_NET_BIND_SERVICE", "CAP_SETGID",
                      "CAP_SETUID", "CAP_SYS_CHROOT"],
        "effective": ["CAP_AUDIT_WRITE", "CAP_CHOWN", "CAP_FOWNER",
                       "CAP_KILL", "CAP_NET_BIND_SERVICE"],
        "permitted": ["CAP_AUDIT_WRITE", "CAP_CHOWN", "CAP_FOWNER",
                       "CAP_KILL", "CAP_NET_BIND_SERVICE"]
    }
}

Capability	What it allows
`CAP_NET_BIND_SERVICE`	Bind to ports < 1024
`CAP_CHOWN`	Change file ownership
`CAP_SYS_ADMIN`	Mount filesystems, create namespaces, etc. (the "god" capability)
`CAP_NET_RAW`	Use raw sockets (ping)
`CAP_SYS_PTRACE`	Trace/debug other processes
`CAP_SYS_CHROOT`	Use chroot

Docker containers run with ~14 capabilities by default — enough to function but far less than full root. CAP_SYS_ADMIN is notably absent, which is why unshare doesn't work without --privileged.
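
Capability sets appear in /proc/<pid>/status as hexadecimal bitmasks (CapEff, CapPrm, CapBnd), one bit per capability. The sketch below tests single bits; `has_cap` is a hypothetical helper, and the mask is the value commonly seen for a default Docker container, used here as an assumption rather than read from a live system. CAP_SYS_ADMIN is bit 21 and CAP_SYS_CHROOT is bit 18.

```shell
#!/usr/bin/env bash
# has_cap HEXMASK BIT: true if capability number BIT is set in HEXMASK.
has_cap() { (( (16#$1 >> $2) & 1 )); }

docker_default=00000000a80425fb   # assumed CapEff of a default Docker container

if has_cap "$docker_default" 21; then
  echo "CAP_SYS_ADMIN present"
else
  echo "CAP_SYS_ADMIN absent"
fi

if has_cap "$docker_default" 18; then
  echo "CAP_SYS_CHROOT present"
else
  echo "CAP_SYS_CHROOT absent"
fi
```

To check a live process instead, take the mask from `grep CapEff /proc/self/status` and pass it as the first argument.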

The `config.json` — Full Structure

Here's the complete structure of a runtime config, with sections we used and didn't use:

config.json
├── ociVersion          ← "1.0.2"
├── process             ← USED: args, env, cwd, user
│   ├── args            ← ["/usr/local/bin/curl", "--version"]
│   ├── env             ← ["PATH=...", "TERM=xterm"]
│   ├── cwd             ← "/"
│   ├── user            ← {uid: 0, gid: 0}
│   ├── capabilities    ← NOT USED: cap_net_bind, cap_chown, ...
│   ├── rlimits         ← NOT USED: RLIMIT_NOFILE, etc.
│   └── terminal        ← false
├── root                ← USED: path, readonly
│   ├── path            ← "/mnt/merged"
│   └── readonly        ← false
├── hostname            ← USED: "oci-container"
├── mounts              ← USED: proc, tmpfs
│   ├── /proc           ← type: proc
│   ├── /tmp            ← type: tmpfs
│   ├── /dev            ← NOT USED: tmpfs
│   ├── /dev/pts        ← NOT USED: devpts
│   └── /sys            ← NOT USED: sysfs
├── linux
│   ├── namespaces      ← USED: pid, mount, uts
│   ├── resources       ← NOT USED: cgroups limits
│   ├── seccomp         ← NOT USED: syscall filtering
│   ├── devices         ← NOT USED: /dev/null, /dev/zero, etc.
│   ├── maskedPaths     ← NOT USED: /proc/kcore, /proc/keys, etc.
│   └── readonlyPaths   ← NOT USED: /proc/sys, /proc/irq, etc.
└── hooks               ← NOT USED
    ├── prestart        ← Run before container starts (deprecated in newer spec versions)
    ├── createRuntime   ← Run during create, before pivot_root
    ├── poststart       ← Run after container process starts
    └── poststop        ← Run after container process exits

Hooks — Lifecycle Extension Points

Hooks let you run custom programs at specific points in the container lifecycle:

"hooks": {
    "prestart": [{
        "path": "/usr/bin/setup-network",
        "args": ["setup-network", "--container-id", "abc123"]
    }],
    "poststart": [{
        "path": "/usr/bin/notify-orchestrator",
        "args": ["notify", "--status", "running"]
    }],
    "poststop": [{
        "path": "/usr/bin/cleanup-network",
        "args": ["cleanup-network", "--container-id", "abc123"]
    }]
}

Use cases:

prestart: Set up network interfaces, mount volumes, configure logging
poststart: Notify the orchestrator, register with service discovery
poststop: Clean up network, release IPs, remove temp files

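Kubernetes does not drive these hooks directly. Pod networking is set up by CNI plugins that the CRI (Container Runtime Interface) implementation, such as containerd or CRI-O, invokes, and volumes reach the container as entries in the mounts array. Hooks are more often injected at the runtime layer, for example by the NVIDIA container toolkit to expose GPUs to a container.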

Masked and Readonly Paths — Hiding Sensitive Kernel Interfaces

Docker/runc hide sensitive kernel information from containers:

"linux": {
    "maskedPaths": [
        "/proc/acpi",
        "/proc/kcore",
        "/proc/keys",
        "/proc/latency_stats",
        "/proc/timer_list",
        "/proc/timer_stats",
        "/proc/sched_debug",
        "/sys/firmware"
    ],
    "readonlyPaths": [
        "/proc/asound",
        "/proc/bus",
        "/proc/fs",
        "/proc/irq",
        "/proc/sys",
        "/proc/sysrq-trigger"
    ]
}

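Masked paths are hidden from the container. runc bind-mounts /dev/null over masked files, so reads return nothing and writes are discarded, and it covers masked directories with an empty read-only tmpfs. This hides the kernel keyring listing (/proc/keys), ACPI tables, and timing information.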
Readonly paths can be read but not written. This prevents containers from tuning kernel parameters (/proc/sys) or triggering sysrq.

We mounted a raw /proc with no restrictions — a security risk in production.
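The same restrictions can be approximated by hand with mount. The sketch below is an illustration under assumptions, not runc's actual code: ROOTFS and DRY_RUN are names chosen for this example, and DRY_RUN defaults to 1 so the script only prints the mount commands it would run. In a real setup these commands would run as root inside the container's mount namespace, after /proc and /sys are mounted and before the chroot.

```shell
#!/usr/bin/env bash
# Sketch: apply maskedPaths / readonlyPaths by hand to a prepared rootfs.
# ROOTFS and DRY_RUN are names chosen for this example.
ROOTFS="${ROOTFS:-/tmp/container/rootfs}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands, anything else = execute them

run() {
    if [ "$DRY_RUN" = 1 ]; then echo "$*"; else "$@"; fi
}

# Hide a path: files get /dev/null bind-mounted over them,
# directories get an empty read-only tmpfs. Missing paths are skipped.
mask_path() {
    local target="$ROOTFS$1"
    if [ -d "$target" ]; then
        run mount -t tmpfs -o ro tmpfs "$target"
    elif [ -e "$target" ]; then
        run mount --bind /dev/null "$target"
    fi
}

# Make a path read-only: bind it onto itself, then remount that bind read-only.
readonly_path() {
    local target="$ROOTFS$1"
    [ -e "$target" ] || return 0
    run mount --bind "$target" "$target"
    run mount -o remount,bind,ro "$target"
}

for p in /proc/kcore /proc/keys /proc/acpi /sys/firmware; do mask_path "$p"; done
for p in /proc/sys /proc/irq /proc/sysrq-trigger; do readonly_path "$p"; done
```

Running it with DRY_RUN=0 applies the mounts for real. The two-step bind and remount in readonly_path is needed because a bind mount cannot be made read-only in a single mount call on older kernels and util-linux versions.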

How Docker and containerd Use the Runtime Spec

The OCI Runtime Spec is the bottom layer of the container stack:

┌─────────────────────────────────────┐
│  User: docker run -it ubuntu bash   │
├─────────────────────────────────────┤
│  Docker CLI                         │  Parses flags, calls API
├─────────────────────────────────────┤
│  dockerd (Docker daemon)            │  Image management, networking, volumes
├─────────────────────────────────────┤
│  containerd                         │  Manages container lifecycle
│  ├── Image pull & unpack            │  Uses overlayfs snapshotter
│  ├── Generate config.json           │  Maps image config → runtime config
│  └── Call runc                      │  Passes the runtime bundle
├─────────────────────────────────────┤
│  runc (OCI runtime)                 │  Reads config.json
│  ├── Set up namespaces              │  clone() with NS flags
│  ├── Configure cgroups              │  Write to /sys/fs/cgroup/
│  ├── Apply seccomp + capabilities   │  prctl() + BPF
│  ├── pivot_root to rootfs           │  swap the root mount
│  └── exec the process               │  execve("/bin/bash")
├─────────────────────────────────────┤
│  Linux Kernel                       │  namespaces, cgroups, overlayfs
└─────────────────────────────────────┘

What each layer does:

Docker CLI: User-facing. Translates docker run flags into API calls.
dockerd: Manages images, networks, volumes. Leaves process isolation to the layers below.
containerd: Pulls images, unpacks layers (overlayfs), generates config.json, delegates to runc.
runc: The component that actually creates the container. Reads config.json, makes the namespace, cgroup, and mount calls, then execs the process.

The OCI Runtime Spec is the contract between containerd (or any higher-level tool) and runc (or any OCI-compliant runtime). This is why you can swap runtimes:

Runtime	What's different
runc	Reference implementation, written in Go
crun	Written in C, faster startup, lower memory
youki	Written in Rust, for security-focused deployments
gVisor (runsc)	Intercepts syscalls with a user-space kernel
Kata Containers	Runs each container in a lightweight VM

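All of them consume the same bundle and the same config.json, so the interface is identical. What differs is how isolation is enforced: runc, crun, and youki set up namespaces and cgroups directly, gVisor places a user-space kernel between the container and the host, and Kata Containers wraps the container in a lightweight VM.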
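Because the bundle is the contract, swapping runtimes amounts to swapping the binary name. The sketch below assumes a bundle directory at /tmp/container (a placeholder, as is the container ID demo) and only prints the command it would run. It relies on runc and crun both accepting run --bundle; other runtimes may use different flags.

```shell
#!/usr/bin/env bash
# Sketch: launch the same bundle with whichever OCI runtime is installed.
# BUNDLE and the container ID "demo" are placeholders for this example.

# Print the first runtime from the argument list that is found on PATH.
pick_runtime() {
    local rt
    for rt in "$@"; do
        if command -v "$rt" >/dev/null 2>&1; then
            echo "$rt"
            return 0
        fi
    done
    return 1
}

# Build the command line for running a bundle under a given runtime.
runtime_cmd() {
    local runtime="$1" bundle="$2" id="$3"
    echo "$runtime run --bundle $bundle $id"
}

if rt=$(pick_runtime crun runc); then
    runtime_cmd "$rt" "${BUNDLE:-/tmp/container}" demo
else
    echo "no OCI runtime found on PATH" >&2
fi
```

With Docker, the equivalent switch is the --runtime flag on docker run, provided the alternative runtime has been registered in the daemon configuration.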

Security Layers — Defense in Depth

A production container has multiple overlapping security boundaries:

Layer 1: Namespaces        ← Visibility isolation
Layer 2: Capabilities      ← Privilege restriction
Layer 3: Seccomp           ← Syscall filtering
Layer 4: AppArmor/SELinux  ← Mandatory access control
Layer 5: Cgroups           ← Resource limits (prevent DoS)
Layer 6: Read-only rootfs  ← Prevent filesystem tampering
Layer 7: User namespace    ← Non-root on host even if root in container

Our demo used only Layer 1. Each additional layer reduces the attack surface, and a flaw in one layer is often contained by another. This is why many container escapes require chaining several weaknesses rather than exploiting a single one.
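Several of these layers can be observed from inside a process by reading /proc. The sketch below is a small inspection script, assuming a Linux kernel that exposes the CapEff, Seccomp, and NoNewPrivs fields in /proc/self/status. The function names are made up for this example.

```shell
#!/usr/bin/env bash
# Sketch: report which security layers apply to the current process.

# Print the value of one field from a /proc/<pid>/status-style file.
status_field() {
    local field="$1" file="${2:-/proc/self/status}"
    awk -v f="$field:" '$1 == f { print $2 }' "$file"
}

# Translate the Seccomp field into words (0 = off, 1 = strict, 2 = filter).
seccomp_mode() {
    case "$1" in
        0) echo "disabled" ;;
        1) echo "strict" ;;
        2) echo "filter" ;;
        *) echo "unknown" ;;
    esac
}

if [ -r /proc/self/status ]; then
    echo "CapEff:     $(status_field CapEff)"
    echo "Seccomp:    $(seccomp_mode "$(status_field Seccomp)")"
    echo "NoNewPrivs: $(status_field NoNewPrivs)"
    echo "PID ns:     $(readlink /proc/self/ns/pid)"
fi
```

Run it once in the privileged lab container and once in a default docker run container to compare. The privileged container should show a much larger capability mask and seccomp disabled, while a default container normally reports a reduced mask and seccomp in filter mode.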

Up Next

In Part 4, we leave the runtime behind and head back to the registry — this time to sign our image with Notation and discover how the OCI 1.1 subject + Referrers mechanism attaches signatures (and, in Part 5, SBOMs) without modifying the image itself.

Every command output, PID, hostname, and process table in this post was captured from an actual run inside an Ubuntu 22.04 container (privileged mode) on Docker Desktop for Mac on April 25, 2026.
