Series: Understanding OCI from the Ground Up (Part 3 of 5)
In Part 1 we built an OCI image. In Part 2 we pushed and pulled it with raw HTTP. Now we run it: without runc, without Docker, without any container runtime. Just chroot, unshare, mount, and hostname. Four commands that are already on every Linux system.
What is the OCI Runtime Spec?
The OCI Runtime Specification defines how to take a container image and turn it into a running process with isolation. It answers: "Given a root filesystem and some configuration, how do I run an isolated process?"
The spec describes:
- A runtime bundle, a directory containing:
  - rootfs/: the container's root filesystem (extracted from image layers)
  - config.json: how to run it (process args, env, namespaces, mounts, hostname)
- A container lifecycle: create → start → run → stop → delete
- Linux primitives that provide isolation:
  - chroot: filesystem isolation
  - unshare: namespace isolation (PID, UTS, mount, network, user)
  - mount: controlled filesystem views
  - hostname: UTS namespace (container identity)
Key insight: A container is not a VM. It's a regular Linux process that uses kernel features to limit what it can see and do. runc is just a program that calls these same Linux primitives in the right order. We'll do it by hand.
Prerequisites
We work inside a Docker container on Docker Desktop for Mac. We need --privileged so that unshare can create new namespaces:
docker run --rm -d --name oci-lab --dns 8.8.8.8 --privileged \
  ubuntu:22.04 sleep 3600

docker exec oci-lab bash -c \
  "apt-get update -qq && apt-get install -y -qq skopeo jq wget file xz-utils procps iproute2 > /dev/null 2>&1"
Why --privileged? Creating new PID, UTS, and mount namespaces with unshare requires CAP_SYS_ADMIN. Docker's default security profile blocks this. The --privileged flag lifts that restriction. This only affects the lab container — your Mac is untouched.
Step 1: Assemble the Root Filesystem with OverlayFS
In Part 1, we built an OCI image with two layers. A container runtime's first job is to assemble those layers into a single root filesystem. Docker and containerd use overlayfs for this — and so will we.
What is OverlayFS?
OverlayFS is a Linux filesystem that merges multiple directories into a single unified view:
┌────────────────────────────────────────────────────────────┐
│ merged/ (unified view)                                     │
│ The container sees this as its root filesystem             │
│ Reads come from the first layer that has the file          │
│ Writes go to the upper layer (copy-on-write)               │
├────────────────────────────────────────────────────────────┤
│ upper/ (writable)    │ All container writes land here      │
├────────────────────────────────────────────────────────────┤
│ lower2/ (read-only)  │ curl layer: /usr/local/bin/curl     │
├────────────────────────────────────────────────────────────┤
│ lower1/ (read-only)  │ base layer: Ubuntu 22.04 rootfs     │
└────────────────────────────────────────────────────────────┘
This is how Docker stores container filesystems. In fact, the mount table of our lab container shows that its own root is an overlay mount:

overlay / overlay rw,relatime,
  lowerdir=/var/lib/desktop-containerd/...snapshots/444/fs:...419/fs,
  upperdir=...snapshots/445/fs,
  workdir=...snapshots/445/work
Docker Desktop used overlayfs to create our lab container. We'll now do the same thing by hand.
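The lookup rule from the diagram ("reads come from the first layer that has the file") can be tried out without mounting anything. The following is a minimal sketch that only imitates the rule with ordinary directories; the function name overlay_lookup and the scratch layout are made up for illustration, and deletion markers are not modeled.

```shell
# Simulate overlayfs lookup order with plain directories (no mount, no root needed).
# Layers are passed highest priority first: the upper dir, then each lower dir.
overlay_lookup() {
  local path=$1; shift
  local layer
  for layer in "$@"; do
    if [ -e "$layer/$path" ]; then
      printf '%s\n' "$layer/$path"
      return 0
    fi
  done
  return 1   # not present in any layer
}

# Hypothetical scratch layout, created only for this demonstration.
demo=$(mktemp -d)
mkdir -p "$demo/upper/etc" "$demo/lower-curl/usr/local/bin" "$demo/lower-base/etc"
echo "from-base"  > "$demo/lower-base/etc/hostname"
echo "from-upper" > "$demo/upper/etc/hostname"
touch "$demo/lower-curl/usr/local/bin/curl"

# The upper copy shadows the base copy; curl is found in its own layer.
overlay_lookup etc/hostname "$demo/upper" "$demo/lower-curl" "$demo/lower-base"
overlay_lookup usr/local/bin/curl "$demo/upper" "$demo/lower-curl" "$demo/lower-base"
```

The real filesystem does this resolution in the kernel on every path lookup; the sketch only shows why layer order decides which copy of a file wins.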
Extract each layer into its own directory
# The lab container starts empty, so create the working directory first
mkdir -p /work && cd /work

# Pull the base image
skopeo copy docker://ubuntu:22.04 oci:ubuntu-base:22.04

# Find the layer blob
MANIFEST_PATH="ubuntu-base/blobs/$(jq -r '.manifests[0].digest' ubuntu-base/index.json | tr ':' '/')"
LAYER_PATH="ubuntu-base/blobs/$(jq -r '.layers[0].digest' "$MANIFEST_PATH" | tr ':' '/')"

# Extract base layer into its own directory
mkdir -p layers/base
tar -xzf "$LAYER_PATH" -C layers/base

# Download and extract curl binary into its own layer directory
wget -q -O /tmp/curl.tar.xz \
  "https://github.com/stunnel/static-curl/releases/download/8.19.0/curl-linux-aarch64-musl-8.19.0.tar.xz"
tar -xf /tmp/curl.tar.xz -C /tmp/
mkdir -p layers/curl/usr/local/bin
cp /tmp/curl layers/curl/usr/local/bin/curl
chmod +x layers/curl/usr/local/bin/curl
Two separate layers, each in its own directory:
Layer 1 (base): Ubuntu root filesystem
  bin boot dev etc home lib media mnt opt proc ...
  18 entries total, 76 MB

Layer 2 (curl): just the curl binary
  layers/curl/usr/local/bin/curl
  1 file, 9.3 MB
Mount overlayfs to merge the layers
Note: OverlayFS requires the upper and work directories to be on a filesystem it accepts as an upper layer, such as ext4 or tmpfs (both support d_type). Another overlayfs mount does not qualify. Since our container root is itself overlayfs, we use tmpfs for the upper/work dirs.

# Create directories on tmpfs (overlayfs cannot use an overlayfs directory as its upper layer)
mount -t tmpfs tmpfs /mnt
mkdir -p /mnt/lower-base /mnt/lower-curl /mnt/upper /mnt/work /mnt/merged

# Copy layers to tmpfs (the trailing /. also copies dotfiles, which a * glob would skip)
cp -a layers/base/. /mnt/lower-base/
cp -a layers/curl/. /mnt/lower-curl/

# Mount overlayfs: this is the key command
mount -t overlay overlay \
  -o lowerdir=/mnt/lower-curl:/mnt/lower-base,upperdir=/mnt/upper,workdir=/mnt/work \
  /mnt/merged
The mount command explained:
| Option | Purpose |
|---|---|
| -t overlay | Filesystem type: overlayfs |
| lowerdir=/mnt/lower-curl:/mnt/lower-base | Read-only layers, colon-separated. Order matters: first listed = highest priority |
| upperdir=/mnt/upper | Writable layer: all modifications go here |
| workdir=/mnt/work | Internal scratch space overlayfs uses to make its operations atomic |
| /mnt/merged | The mount point: the unified view |
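Because image layers are usually listed bottom-first while lowerdir wants the topmost layer first, the order has to be reversed when building the option string. A small sketch of that reversal follows; build_lowerdir is a hypothetical helper name, and the mount command is only printed, not executed.

```shell
# Build the lowerdir= value from layers given in image order (bottom layer first).
# overlayfs expects the opposite order: the topmost lower layer comes first.
build_lowerdir() {
  local result="" layer
  for layer in "$@"; do
    if [ -z "$result" ]; then
      result=$layer
    else
      result="$layer:$result"   # prepend, so later (higher) layers take priority
    fi
  done
  printf '%s\n' "$result"
}

# Layer paths from this walkthrough, base first and curl on top.
lower=$(build_lowerdir /mnt/lower-base /mnt/lower-curl)
echo "mount -t overlay overlay -o lowerdir=$lower,upperdir=/mnt/upper,workdir=/mnt/work /mnt/merged"
```

With more layers the same rule applies: the last layer of the image ends up leftmost in lowerdir.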
Verify the merged view
ls /mnt/merged/
bin   dev  home  media  opt   root  sbin  sys  usr
boot  etc  lib   mnt    proc  run   srv   tmp  var

# Base layer content visible through merged view
head -2 /mnt/merged/etc/os-release

PRETTY_NAME="Ubuntu 22.04.5 LTS"
NAME="Ubuntu"

# Curl layer content visible through merged view
ls -la /mnt/merged/usr/local/bin/curl
file /mnt/merged/usr/local/bin/curl
/mnt/merged/usr/local/bin/curl --version | head -1

-rwxr-xr-x 1 root root 9671880 Apr 25 13:18 /mnt/merged/usr/local/bin/curl
/mnt/merged/usr/local/bin/curl: ELF 64-bit LSB pie executable, ARM aarch64, static-pie linked, stripped
curl 8.19.0 (aarch64-pc-linux-gnu) libcurl/8.19.0 OpenSSL/3.6.1 zlib/1.3.2 brotli/1.2.0 zstd/1.5.7
Both layers merged into a single unified view. The container sees one filesystem — Ubuntu root with curl at /usr/local/bin/curl — even though they came from separate layers.
Copy-on-Write in action
The upper layer starts empty. Let's see what happens when the container modifies files:
# Upper layer: currently empty
ls -la /mnt/upper/

total 0
drwxr-xr-x 2 root root  40 Apr 25 13:19 .
drwxrwxrwt 7 root root 140 Apr 25 13:19 ..

Modify a file from the base layer:

echo "my-overlay-container" > /mnt/merged/etc/hostname
cat /mnt/merged/etc/hostname

my-overlay-container

# The modified file was COPIED to the upper layer (copy-on-write)
find /mnt/upper -type f
cat /mnt/upper/etc/hostname

/mnt/upper/etc/hostname
my-overlay-container

# Base layer is UNTOUCHED
head -1 /mnt/lower-base/etc/hostname

localhost.localdomain
The original file in the base layer wasn't modified. OverlayFS copied it to the upper layer first, then applied the change. This is copy-on-write (COW).
Create a new file:
echo "hello from overlay" > /mnt/merged/tmp/overlay-test.txt
cat /mnt/upper/tmp/overlay-test.txt
hello from overlay
New files go directly to the upper layer. The lower layers remain untouched.
Delete a file (whiteout):
# Before delete
ls -la /mnt/merged/etc/legal

-rw-r--r-- 1 root root 267 Oct 15 2021 /mnt/merged/etc/legal

rm /mnt/merged/etc/legal
ls /mnt/merged/etc/legal 2>&1

ls: cannot access '/mnt/merged/etc/legal': No such file or directory

# How does overlayfs hide a file that still exists in the base layer?
# It creates a "whiteout": a character device with major:minor 0:0
ls -la /mnt/upper/etc/legal

c--------- 2 root root 0, 0 Apr 25 13:19 legal

# The base layer still has the original
ls -la /mnt/lower-base/etc/legal

-rw-r--r-- 1 root root 267 Oct 15 2021 /mnt/lower-base/etc/legal
Whiteouts are also how the OCI Image Spec handles deletions in layers, although inside a layer tarball the marker is an empty file named .wh.<filename> rather than a character device. In Part 1, we mentioned that layers are filesystem diffs. Now you can see how: a new layer is essentially the contents of the upper directory, that is, modified files, new files, and whiteout markers for deletions.
Summary of the upper layer after all changes:
/mnt/upper/etc/hostname          modified (copy-on-write)
/mnt/upper/tmp/overlay-test.txt  created (new file)
/mnt/upper/etc/legal             deleted (whiteout: char device 0,0)
This is essentially how docker commit works. It takes the upper layer, creates a tar archive from it (with each whiteout device written as a .wh.<filename> entry), and that becomes a new image layer.
This /mnt/merged directory is the bundle's rootfs — what the OCI Runtime Spec calls the "root filesystem" of the container.
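To make the whiteout translation concrete, here is a rough sketch of the renaming step that turns an overlayfs whiteout into its tar-layer name. The helper names oci_whiteout_name and list_whiteouts are invented for this example, GNU find and stat are assumed, and this is not how any particular tool implements it.

```shell
# Map a path relative to the upper directory to its OCI whiteout name:
# the basename gets a ".wh." prefix, e.g. etc/legal becomes etc/.wh.legal
oci_whiteout_name() {
  local rel=$1 dir base
  base=${rel##*/}
  case $rel in
    */*) dir=${rel%/*}; printf '%s/.wh.%s\n' "$dir" "$base" ;;
    *)   printf '.wh.%s\n' "$base" ;;
  esac
}

# Print the OCI whiteout name for every overlayfs whiteout (char device 0:0)
# found under an upper directory. Prints nothing if there are none.
list_whiteouts() {
  local upper=$1 f
  find "$upper" -type c 2>/dev/null | while IFS= read -r f; do
    # %t and %T are the device major and minor numbers in hex
    if [ "$(stat -c '%t:%T' "$f")" = "0:0" ]; then
      oci_whiteout_name "${f#"$upper"/}"
    fi
  done
}

oci_whiteout_name etc/legal   # prints etc/.wh.legal
```

Inside the lab container, running list_whiteouts /mnt/upper after the rm above would be the way to list the deletions a new layer has to record.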
Step 2: Linux Isolation Primitives — One at a Time
Before we combine everything, let's understand each primitive individually.
Primitive 1: chroot — Filesystem Isolation
chroot changes the apparent root directory for a process. Everything outside the new root becomes invisible.
# Outside chroot (host view)
echo "Root entries: $(ls / | wc -l)"
echo "Hostname: $(hostname)"

Root entries: 19
Hostname: 92921a41b725

# Inside chroot (container view)
chroot /mnt/merged /bin/bash -c '
echo "Root entries: $(ls / | wc -l)"
echo "os-release: $(head -1 /etc/os-release)"
echo "curl: $(/usr/local/bin/curl --version | head -1)"
echo "Can see host /work? $(ls /work 2>&1)"
'

Root entries: 18
os-release: PRETTY_NAME="Ubuntu 22.04.5 LTS"
curl: curl 8.19.0 (aarch64-pc-linux-gnu) libcurl/8.19.0 OpenSSL/3.6.1 zlib/1.3.2 brotli/1.2.0 zstd/1.5.7
Can see host /work? ls: cannot access '/work': No such file or directory
What happened: The process inside chroot sees /mnt/merged (our overlayfs mount) as /. It can't reach the host's /work, the host's /etc, or anything else outside the new root. The filesystem is isolated.
What it doesn't do: The process still shares the host's PID space, hostname, and mount table. We fix that next.
Primitive 2: unshare --pid — PID Namespace
A PID namespace gives the container its own process ID space. Inside the namespace, the first process gets PID 1.
# Host PID namespace
echo "My PID: $$"
echo "Process count: $(ps -e --no-headers | wc -l)"
ps -eo pid,comm | head -6
My PID: 3392
Process count: 5
PID COMMAND
1 sleep
3392 bash
3401 ps
3402 head
# New PID namespace
unshare --pid --fork --mount chroot /mnt/merged /bin/bash -c '
mount -t proc proc /proc
echo "My PID: $$"
echo "Process count: $(ps -e --no-headers | wc -l)"
ps -eo pid,comm
umount /proc
'
My PID: 1
Process count: 4
PID COMMAND
1 bash
6 ps
What happened:
- Host has 5 processes. The container sees only its own: bash (PID 1) and ps (PID 6).
- The shell got PID 1: it is the init process of its own PID namespace.
- --fork is required because the calling process can't enter a new PID namespace itself; only its children can.
- --mount gives us a private mount table so mount -t proc doesn't affect the host.
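One way to confirm that a process really sits in a different namespace is to compare the namespace links under /proc. The sketch below assumes a Linux /proc; ns_id is a hypothetical helper, and the unshare comparison is skipped automatically where it is not permitted.

```shell
# Print the namespace identifier of a process, e.g. "pid:[4026531836]".
# Two processes share a namespace exactly when these strings are equal.
ns_id() {
  local pid=$1 type=$2
  readlink "/proc/$pid/ns/$type"
}

host_pid_ns=$(ns_id self pid)
echo "this shell:    $host_pid_ns"

# Only attempt the comparison when unshare is allowed (it needs CAP_SYS_ADMIN).
if unshare --pid --fork true 2>/dev/null; then
  child_pid_ns=$(unshare --pid --fork readlink /proc/self/ns/pid)
  echo "after unshare: $child_pid_ns"
fi
```

In the privileged lab container the two lines should show different numbers in the brackets; the same check works for uts and mnt by changing the type argument.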
Primitive 3: unshare --uts — Hostname Isolation
UTS (Unix Timesharing System) namespace isolates the hostname. The container can set its own hostname without affecting the host.
# Host
echo "Hostname: $(hostname)"

Hostname: 92921a41b725

# New UTS namespace
unshare --uts chroot /mnt/merged /bin/bash -c '
hostname my-container
echo "Hostname: $(hostname)"
'

Hostname: my-container

# Back on host
echo "Hostname still: $(hostname)"

Hostname still: 92921a41b725
What happened: The container set its hostname to my-container, but the host's hostname is unchanged. Each UTS namespace gets its own copy of the hostname.
Primitive 4: unshare --mount + mount — Mount Isolation
A mount namespace gives the container its own mount table. Mounts inside the container don't appear on the host.
# Host mount count
cat /proc/mounts | wc -l

11 mount points

# New mount namespace
unshare --mount chroot /mnt/merged /bin/bash -c '
mount -t proc proc /proc
mount -t tmpfs tmpfs /tmp
echo "Mount points inside: $(cat /proc/mounts | wc -l)"
cat /proc/mounts
echo "Created a file in tmpfs:"
echo hello-container > /tmp/test.txt
cat /tmp/test.txt
umount /tmp
umount /proc
'

Mount points inside: 2
proc /proc proc rw,relatime 0 0
tmpfs /tmp tmpfs rw,relatime 0 0
Created a file in tmpfs:
hello-container

# Host: file does NOT exist
ls /tmp/test.txt 2>&1

ls: cannot access '/tmp/test.txt': No such file or directory
What happened:
- The container sees only 2 mount points (its own proc and tmpfs), while the host has 11.
- A file created in the container's /tmp (tmpfs) doesn't exist on the host.
- proc gives the container its own view of the process table (needed for ps to work).
- tmpfs gives the container scratch space that disappears when the container stops.
Step 3: Combine Everything — Run the Container
Now we combine all four primitives into one command:
unshare --pid --fork --mount --uts chroot /mnt/merged /bin/bash
This single line:
- Creates new PID, mount, and UTS namespaces (unshare)
- Forks a child process into the new PID namespace (--fork)
- Changes root to /mnt/merged, our overlayfs mount (chroot)
- Executes /bin/bash as PID 1
Here's the full run with mounts and our application:
unshare --pid --fork --mount --uts chroot /mnt/merged /bin/bash -c '
export PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

# Set hostname
hostname oci-container

# Mount essential filesystems
mount -t proc proc /proc
mount -t tmpfs tmpfs /tmp

echo "=== Container Environment ==="
echo "Hostname : $(hostname)"
echo "PID : $$"
echo "User : $(whoami)"
# List the root directory on one line (this produces the "Root dir" line in the output)
echo "Root dir : $(ls / | tr "\n" " ")"
echo ""
echo "=== Process Table ==="
ps -eo pid,ppid,comm
echo ""
echo "=== Filesystem ==="
echo "Mount points: $(cat /proc/mounts | wc -l)"
cat /proc/mounts
echo ""
echo "=== Run our application (curl) ==="
/usr/local/bin/curl --version | head -1
echo ""
echo "=== /etc/os-release ==="
head -3 /etc/os-release
echo ""
echo "=== Proof of isolation ==="
echo "Can see host /work? $(ls /work 2>&1)"

umount /tmp
umount /proc
'
Output:
=== Container Environment ===
Hostname : oci-container
PID : 1
User : root
Root dir : bin boot dev etc home lib media mnt opt proc root run sbin srv sys tmp usr var
=== Process Table ===
PID PPID COMMAND
1 0 bash
8 1 ps
=== Filesystem ===
Mount points: 3
overlay / overlay rw,relatime,lowerdir=/mnt/lower-curl:/mnt/lower-base,upperdir=/mnt/upper,workdir=/mnt/work 0 0
proc /proc proc rw,relatime 0 0
tmpfs /tmp tmpfs rw,relatime 0 0
=== Run our application (curl) ===
curl 8.19.0 (aarch64-pc-linux-gnu) libcurl/8.19.0 OpenSSL/3.6.1 zlib/1.3.2 brotli/1.2.0 zstd/1.5.7
=== /etc/os-release ===
PRETTY_NAME="Ubuntu 22.04.5 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
=== Proof of isolation ===
Can see host /work? ls: cannot access '/work': No such file or directory
# Back on host: everything is unchanged
echo "Hostname: $(hostname)"
echo "PID: $$"

Hostname: b6e4d5977a9b
PID: 3503
This is a container. An isolated process with:
- Its own root filesystem via overlayfs (chroot → can't see /work)
- Its own PID space (PID 1 = bash, only 2 processes visible)
- Its own hostname (oci-container, not b6e4d5977a9b)
- Its own mounts (3 mount points: overlay root + proc + tmpfs)
Step 4: The OCI Runtime Spec config.json
What we just did manually is, in essence, what runc does, except that runc reads its instructions from a config.json file. This is the core of the OCI Runtime Spec.
Here's a config.json that describes our container:
{
"ociVersion": "1.0.2",
"process": {
"terminal": false,
"user": { "uid": 0, "gid": 0 },
"args": ["/usr/local/bin/curl", "--version"],
"env": [
"PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
"TERM=xterm"
],
"cwd": "/"
},
"root": {
"path": "/mnt/merged",
"readonly": false
},
"hostname": "oci-container",
"mounts": [
{
"destination": "/proc",
"type": "proc",
"source": "proc"
},
{
"destination": "/tmp",
"type": "tmpfs",
"source": "tmpfs",
"options": ["nosuid", "nodev"]
}
],
"linux": {
"namespaces": [
{ "type": "pid" },
{ "type": "mount" },
{ "type": "uts" }
]
}
}
How config.json maps to our manual commands
| config.json field | What runc does | Our manual equivalent |
|---|---|---|
| root.path: "/mnt/merged" | Set root filesystem | chroot /mnt/merged (overlayfs mount point) |
| linux.namespaces: [pid, mount, uts] | Create namespaces | unshare --pid --mount --uts |
| hostname: "oci-container" | Set hostname in UTS ns | hostname oci-container |
| mounts: [{dest: "/proc", type: "proc"}] | Mount proc | mount -t proc proc /proc |
| mounts: [{dest: "/tmp", type: "tmpfs"}] | Mount tmpfs | mount -t tmpfs tmpfs /tmp |
| process.args: ["/usr/local/bin/curl", "--version"] | Execute the process | /usr/local/bin/curl --version |
| process.env: ["PATH..."] | Set environment | export PATH=... |
| process.user: {uid: 0, gid: 0} | Run as root | (default in our setup) |
| process.cwd: "/" | Working directory | (default in chroot) |
Parsing config.json → unshare command
We can even parse config.json with jq and build the equivalent command:
ARGS=$(jq -r '.process.args | join(" ")' config.json)
HOSTNAME=$(jq -r '.hostname' config.json)
ROOTFS=$(jq -r '.root.path' config.json)
NS_FLAGS=""
for ns in $(jq -r '.linux.namespaces[].type' config.json); do
NS_FLAGS="$NS_FLAGS --$ns"
done
echo "Parsed from config.json:"
echo " args = $ARGS"
echo " hostname = $HOSTNAME"
echo " rootfs = $ROOTFS"
echo " ns_flags = $NS_FLAGS"
Parsed from config.json:
  args = /usr/local/bin/curl --version
  hostname = oci-container
  rootfs = /mnt/merged
  ns_flags = --pid --mount --uts
The equivalent command:
unshare --pid --mount --uts --fork chroot /mnt/merged /bin/bash -c \
  "hostname oci-container; mount -t proc proc /proc; mount -t tmpfs tmpfs /tmp; /usr/local/bin/curl --version"
Output:
Hostname: oci-container
PID: 1
Running: /usr/local/bin/curl --version
curl 8.19.0 (aarch64-pc-linux-gnu) libcurl/8.19.0 OpenSSL/3.6.1 zlib/1.3.2 brotli/1.2.0 zstd/1.5.7
Release-Date: 2026-03-11
Protocols: dict file ftp ftps gopher gophers http https imap imaps ipfs ipns mqtt pop3 pop3s rtsp scp sftp smb smbs smtp smtps telnet tftp ws wss
Features: alt-svc asyn-rr AsynchDNS brotli HSTS HTTP2 HTTP3 HTTPS-proxy HTTPSRR IDN IPv6 Largefile libz NTLM PSL SSL threadsafe TLS-SRP UnixSockets zstd
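The loop that builds NS_FLAGS by prefixing each namespace type with "--" works for pid, mount, and uts because those names happen to match unshare's flags. The OCI type "network" does not: unshare calls it --net. A slightly more careful mapping might look like the sketch below, where ns_to_unshare_flag is a hypothetical helper and the list of types stands in for the jq output.

```shell
# Translate an OCI namespace type into the matching unshare(1) long option.
ns_to_unshare_flag() {
  case $1 in
    pid|mount|uts|ipc|user|cgroup|time) printf '%s\n' "--$1" ;;
    network) printf '%s\n' "--net" ;;
    *) echo "unknown namespace type: $1" >&2; return 1 ;;
  esac
}

NS_FLAGS=""
for ns in pid network mount uts; do   # sample list, as jq -r would print it
  NS_FLAGS="$NS_FLAGS $(ns_to_unshare_flag "$ns")"
done
echo "ns_flags =$NS_FLAGS"
```

Rejecting unknown types instead of passing them through keeps a typo in config.json from turning into a confusing unshare error.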
From Image Config to Runtime Config
One thing the OCI Runtime Spec does not define is how to go from an image config to a config.json. That's the job of a higher-level tool (like Docker or containerd). But the mapping is straightforward:
| Image Config (Part 1) | Runtime config.json | Notes |
|---|---|---|
| config.Cmd: ["/bin/bash"] | process.args: ["/usr/local/bin/curl", "--version"] | We overrode Cmd with our curl binary |
| config.Env: ["PATH..."] | process.env: ["PATH...", "TERM=xterm"] | Carried over, can add more |
| architecture: "arm64" | (implicit) | Must match the host |
| os: "linux" | (implicit) | Must be Linux for Linux namespaces |
| rootfs.diff_ids: [...] | root.path: "/mnt/merged" | Layers assembled via overlayfs into merged view |
The image config describes what to run. The runtime config describes how to run it (with what isolation, mounts, and constraints).
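As an illustration of that mapping, the following sketch derives the process fields from an image config with jq. It assumes the common Docker convention that the command line is Entrypoint followed by Cmd; the function name image_config_to_process and the sample document are made up for this example, and real tools handle many more fields.

```shell
# Derive the "process" fields of config.json from an image config read on stdin.
# Assumes the Docker convention: args = Entrypoint + Cmd, cwd defaults to "/".
image_config_to_process() {
  jq -c '{
    args: ((.config.Entrypoint // []) + (.config.Cmd // [])),
    env:  (.config.Env // []),
    cwd:  (if (.config.WorkingDir // "") == "" then "/" else .config.WorkingDir end)
  }'
}

# Hypothetical, trimmed-down image config used only for this demonstration.
sample='{"architecture":"arm64","os":"linux","config":{"Cmd":["/bin/bash"],"Env":["PATH=/usr/bin:/bin"]}}'

# jq was installed in the lab container; skip the demo where it is missing.
if command -v jq >/dev/null 2>&1; then
  printf '%s\n' "$sample" | image_config_to_process
fi
```

Overriding the command, as we did with curl, then amounts to replacing the args array before writing config.json.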
What runc Adds Beyond Our Manual Approach
Our unshare + chroot approach works, but a real OCI runtime like runc adds:
| Feature | Our approach | runc |
|---|---|---|
| PID namespace | unshare --pid | clone(CLONE_NEWPID) |
| Mount namespace | unshare --mount | clone(CLONE_NEWNS) |
| UTS namespace | unshare --uts | clone(CLONE_NEWUTS) |
| Network namespace | Not used | clone(CLONE_NEWNET) + veth pairs |
| User namespace | Not used | clone(CLONE_NEWUSER) + uid mapping |
| Cgroups | Not used | CPU/memory/IO limits |
| Seccomp | Not used | Syscall filtering |
| Capabilities | Full (root) | Dropped to minimum |
| Filesystem assembly | overlayfs + chroot | rootfs prepared by the caller (e.g. overlayfs) + pivot_root (more secure) |
| Lifecycle hooks | None | prestart, poststart, poststop |
| State management | None | /run/runc/<container-id>/state.json |
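pivot_root vs chroot: chroot only changes the root used for path lookups. A process that still has root privileges (CAP_SYS_CHROOT) can escape, for example by calling chroot() on a subdirectory while its working directory stays outside it and then walking upward with chdir(".."). pivot_root swaps the root mount itself and lets the runtime unmount the old root, so nothing of the host filesystem remains reachable inside the mount namespace. runc uses pivot_root by default.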
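For comparison with our chroot call, the pivot_root sequence can be sketched with the pivot_root(8) utility from util-linux. This is an untested outline under assumptions, not runc's actual code path: enter_rootfs is a hypothetical function, it must run as root inside a fresh mount namespace (for example under unshare --mount --pid --fork), and the new root must contain umount and rmdir.

```shell
# Sketch: enter a root filesystem with pivot_root instead of chroot.
# Intended to be called from a shell started with: unshare --mount --pid --fork
enter_rootfs() {
  local new_root=$1
  mount --make-rprivate /                 # keep mount changes from propagating to the host
  mount --bind "$new_root" "$new_root"    # pivot_root requires new_root to be a mount point
  cd "$new_root" || return 1
  mkdir -p .old_root
  pivot_root . .old_root                  # new_root becomes /, the old root moves to /.old_root
  cd / || return 1
  umount -l /.old_root                    # detach the old root so it can no longer be reached
  rmdir /.old_root
}
```

After enter_rootfs /mnt/merged returns, the shell's root is the overlay mount and the host filesystem has been unmounted from this namespace, which is the property chroot alone does not give.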
The Container Lifecycle
The OCI Runtime Spec defines a strict lifecycle:
┌──────────┐     ┌─────────┐     ┌─────────┐     ┌─────────┐     ┌─────────┐
│ creating │────►│ created │────►│ running │────►│ stopped │────►│ deleted │
└──────────┘     └─────────┘     └─────────┘     └─────────┘     └─────────┘
                                                      │
                                                      ▼
                                            (exit code captured)
| State | What happens | Our manual equivalent |
|---|---|---|
| creating | Set up namespaces, mounts, cgroups | unshare --pid --mount --uts |
| created | Container ready, process not started | (chroot done, bash not yet running) |
| running | Process is executing | curl --version is running |
| stopped | Process exited | Curl finished, bash exited |
| deleted | All resources cleaned up | Namespaces destroyed, mounts removed |
In our demo, all these states happen in rapid succession because we combined them into one command. runc exposes them as separate operations (runc create, runc start, runc delete).
Cleanup
docker rm -f oci-lab
Recap
| What We Did | Linux Primitive | OCI Runtime Spec Concept |
|---|---|---|
| Extracted layers separately | tar -xzf into layers/base, layers/curl | Runtime bundle layers |
| Merged layers with overlayfs | mount -t overlay | Snapshotter (overlayfs driver) |
| Isolated filesystem | chroot /mnt/merged | root.path |
| Isolated PID space | unshare --pid | linux.namespaces: [{type: "pid"}] |
| Isolated hostname | unshare --uts + hostname | hostname + linux.namespaces: [{type: "uts"}] |
| Isolated mounts | unshare --mount + mount | mounts array + linux.namespaces: [{type: "mount"}] |
| Ran the application | /usr/local/bin/curl --version | process.args |
| Defined it all in JSON | config.json | OCI Runtime Spec config |
The big takeaway
A container is a Linux process with restricted visibility. The OCI Runtime Spec is a JSON file (config.json) that tells a runtime which restrictions to apply. We applied them by hand:
- mount -t overlay → "Merge these layers into a single filesystem"
- chroot → "You can only see this directory"
- unshare --pid → "You can only see your own processes"
- unshare --uts → "You have your own hostname"
- unshare --mount → "You have your own mount table"
No magic. No VMs. Just four Linux commands (mount, chroot, unshare, hostname), applied in the five ways listed above.
Deep Dive: The OCI Runtime Spec in Detail
We've run a container by hand. Now let's understand the full spec — every namespace, every security mechanism, and the architecture of a real container runtime.
All Eight Linux Namespaces
We used three namespaces (PID, UTS, mount). Linux has eight (the time namespace was added in kernel 5.6), and a real container uses most of them:
| Namespace | Flag | What it isolates | We used it? |
|---|---|---|---|
| PID | CLONE_NEWPID | Process IDs: the container's first process is PID 1 | ✓ |
| UTS | CLONE_NEWUTS | Hostname and domain name | ✓ |
| Mount | CLONE_NEWNS | Mount table: filesystem topology | ✓ |
| Network | CLONE_NEWNET | Network interfaces, IPs, routes, ports | ✗ |
| User | CLONE_NEWUSER | UID/GID mappings: root inside, non-root outside | ✗ |
| IPC | CLONE_NEWIPC | System V IPC, POSIX message queues | ✗ |
| Cgroup | CLONE_NEWCGROUP | Cgroup root view | ✗ |
| Time | CLONE_NEWTIME | Offsets for the boot-time and monotonic clocks | ✗ |
Network Namespace — The Most Complex
A new network namespace starts with only a loopback interface. To give a container network access, the runtime must:
1. Create a veth pair (virtual Ethernet cable)
2. Move one end into the container's network namespace
3. Assign an IP to the container's end
4. Set up routing (default gateway)
5. Connect the host end to a bridge (docker0)
6. Set up NAT/masquerading for outbound traffic
This is why Docker networking is complex — it's not just one syscall. The runtime creates virtual networking plumbing for every container. We skipped it because our container doesn't need network access (curl just prints version info).
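The steps above can be outlined with iproute2, which we installed in the lab container. This is a simplified sketch rather than what Docker does: the namespace name, interface names, and the 10.200.0.0/24 subnet are arbitrary choices for the example, the bridge (step 5) is skipped by addressing the host end directly, and iptables is assumed to be available even though it was not part of our package list.

```shell
# Sketch: connect a network namespace to the host with a veth pair (run as root).
setup_container_net() {
  local ns=$1
  ip netns add "$ns"                                     # new namespace, loopback only
  ip link add veth-host type veth peer name veth-ctr     # 1. create the veth pair
  ip link set veth-ctr netns "$ns"                       # 2. move one end into the namespace
  ip -n "$ns" addr add 10.200.0.2/24 dev veth-ctr        # 3. address the container end
  ip -n "$ns" link set lo up
  ip -n "$ns" link set veth-ctr up
  ip addr add 10.200.0.1/24 dev veth-host                # host end doubles as the gateway
  ip link set veth-host up
  ip -n "$ns" route add default via 10.200.0.1           # 4. default route
  sysctl -w net.ipv4.ip_forward=1                        # allow the host to forward packets
  iptables -t nat -A POSTROUTING -s 10.200.0.0/24 -j MASQUERADE   # 6. NAT outbound traffic
}
```

A call such as setup_container_net demo followed by ip netns exec demo ping -c1 10.200.0.1 would be one way to check the link; a runtime repeats this kind of plumbing for every container.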
User Namespace — Rootless Containers
A user namespace maps UIDs inside the container to different UIDs outside:
Container view: uid 0 (root)
↓ mapped to
Host view: uid 100000 (unprivileged user)
This enables rootless containers: the process thinks it's root (and can do root things inside its namespace), but from the host's perspective it's an unprivileged user. If it escapes the container, it has no host privileges.
// config.json for rootless
"linux": {
"uidMappings": [{"containerID": 0, "hostID": 100000, "size": 65536}],
"gidMappings": [{"containerID": 0, "hostID": 100000, "size": 65536}]
}
We ran as real root inside a privileged container. In production, rootless is preferred.
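The arithmetic behind a single mapping entry is simple enough to check by hand. In the sketch below, map_uid is a hypothetical helper that takes a container UID plus the three fields of one uidMappings entry and prints the host UID it corresponds to.

```shell
# Translate a container UID to the host UID under one mapping entry.
# Arguments: uid containerID hostID size
map_uid() {
  local uid=$1 container_id=$2 host_id=$3 size=$4
  if [ "$uid" -lt "$container_id" ] || [ "$uid" -ge $((container_id + size)) ]; then
    echo "unmapped"
    return 1
  fi
  echo $((host_id + uid - container_id))
}

map_uid 0     0 100000 65536            # prints 100000
map_uid 1000  0 100000 65536            # prints 101000
map_uid 70000 0 100000 65536 || true    # prints unmapped (outside the 65536-wide range)
```

UIDs outside every mapped range have no host identity at all, which is why files owned by them typically show up as the overflow user (nobody) inside the namespace.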
Cgroups — Resource Limits
Namespaces control visibility. Cgroups control resource usage. They answer: "How much CPU, memory, and IO can this container use?"
// config.json cgroups section
"linux": {
"resources": {
"memory": {
"limit": 536870912,
"reservation": 268435456
},
"cpu": {
"shares": 1024,
"quota": 100000,
"period": 100000
},
"pids": {
"limit": 100
},
"blockIO": {
"weight": 500
}
}
}
| Resource | What it controls | Example limit |
|---|---|---|
| memory.limit | Max RSS + cache | 512 MB: the OOM killer fires beyond this |
| cpu.quota/period | CPU time per period | 100000/100000 = 1 full core |
| cpu.shares | Relative CPU weight | 1024 = default; 512 = half as much time under contention |
| pids.limit | Max number of processes | 100: prevents fork bombs |
| blockIO.weight | Disk IO priority | 10-1000, relative to other containers |
Cgroups v1 vs v2:
| | Cgroups v1 | Cgroups v2 |
|---|---|---|
| Structure | Multiple hierarchies at /sys/fs/cgroup/<controller>/ | Single unified hierarchy at /sys/fs/cgroup/ |
| Control | Each resource controller is a separate filesystem | All controllers in one tree |
| Delegation | Fragmented, error-prone | Clean delegation to non-root processes |
| Adoption | Legacy, still default on older systems | Default on Ubuntu 21.10+ and Fedora 31+; supported by Docker since 20.10 |
Our demo had no resource limits — the container could use all available CPU/memory. In production, cgroups prevent a single container from starving others.
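To connect the JSON numbers to something tangible, the sketch below converts the quota/period pair into cores and prints the cgroup v2 writes that would express the same limits. It is a dry run: nothing is written, the helper names are invented, and /sys/fs/cgroup/demo is a hypothetical cgroup directory.

```shell
# Convert a CPU quota/period pair (microseconds) to cores, with two decimals.
cpu_cores() {
  local quota=$1 period=$2
  local hundredths=$(( quota * 100 / period ))
  printf '%d.%02d\n' $(( hundredths / 100 )) $(( hundredths % 100 ))
}

# Print the cgroup v2 writes that correspond to the limits (dry run only).
# Arguments: cgroup-dir memory-bytes cpu-quota cpu-period pids-limit
print_cgroup_v2_writes() {
  local cg=$1 mem=$2 quota=$3 period=$4 pids=$5
  echo "echo $mem > $cg/memory.max"
  echo "echo \"$quota $period\" > $cg/cpu.max"
  echo "echo $pids > $cg/pids.max"
}

cpu_cores 100000 100000                      # prints 1.00
cpu_cores 50000 100000                       # prints 0.50
echo "$(( 536870912 / 1024 / 1024 )) MiB"    # prints 512 MiB
print_cgroup_v2_writes /sys/fs/cgroup/demo 536870912 100000 100000 100
```

A runtime performs writes like these after creating the container's cgroup directory and before moving the container process into it.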
Seccomp — Syscall Filtering
Seccomp (Secure Computing) filters which Linux syscalls a container can make. It's a BPF-based firewall for the kernel interface.
// config.json seccomp section
"linux": {
"seccomp": {
"defaultAction": "SCMP_ACT_ERRNO",
"architectures": ["SCMP_ARCH_AARCH64"],
"syscalls": [
{
"names": ["read", "write", "open", "close", "stat",
"fstat", "mmap", "mprotect", "brk", "execve", ...],
"action": "SCMP_ACT_ALLOW"
}
]
}
}
Default action: DENY. Only explicitly listed syscalls are allowed. Docker's default seccomp profile:
- Allows ~300 common syscalls (read, write, open, execve, etc.)
- Blocks ~50 dangerous ones (reboot, kexec_load, mount, ptrace, etc.)
| Blocked syscall | Why |
|---|---|
| reboot | Container shouldn't reboot the host |
| kexec_load | Loading a new kernel |
| mount | Changing mounts (also gated by CAP_SYS_ADMIN) |
| ptrace | Debugging/injecting into other processes (recent profiles block it only on kernels older than 4.8) |
| clone (with user ns flag) | Creating new user namespaces (privilege escalation) |
We used --privileged which disables seccomp entirely. In production, the default profile provides a significant security boundary.
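Whether a filter is active for a given process can be read from the Seccomp field in /proc/<pid>/status. The following sketch decodes that field; seccomp_mode_name is a hypothetical helper name.

```shell
# Decode the Seccomp field of /proc/<pid>/status.
seccomp_mode_name() {
  case $1 in
    0) echo "disabled" ;;
    1) echo "strict" ;;
    2) echo "filter" ;;
    *) echo "unknown" ;;
  esac
}

# Report the mode inherited by this shell's children.
# The field is missing on kernels built without seccomp, hence the fallback.
mode=$(awk '/^Seccomp:/ {print $2}' /proc/self/status 2>/dev/null)
echo "seccomp mode: $(seccomp_mode_name "${mode:-unknown}")"
```

In our privileged lab container this should report disabled, while a container started with Docker's default profile should report filter.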
Linux Capabilities — Fine-Grained Privileges
Traditional Unix has two privilege levels: root (can do everything) and non-root (restricted). Linux capabilities split root privileges into ~40 individual capabilities:
// config.json capabilities section
"process": {
"capabilities": {
"bounding": ["CAP_AUDIT_WRITE", "CAP_CHOWN", "CAP_FOWNER",
"CAP_KILL", "CAP_NET_BIND_SERVICE", "CAP_SETGID",
"CAP_SETUID", "CAP_SYS_CHROOT"],
"effective": ["CAP_AUDIT_WRITE", "CAP_CHOWN", "CAP_FOWNER",
"CAP_KILL", "CAP_NET_BIND_SERVICE"],
"permitted": ["CAP_AUDIT_WRITE", "CAP_CHOWN", "CAP_FOWNER",
"CAP_KILL", "CAP_NET_BIND_SERVICE"]
}
}
| Capability | What it allows |
|---|---|
| CAP_NET_BIND_SERVICE | Bind to ports < 1024 |
| CAP_CHOWN | Change file ownership |
| CAP_SYS_ADMIN | Mount filesystems, create namespaces, etc. (the "god" capability) |
| CAP_NET_RAW | Use raw sockets (ping) |
| CAP_SYS_PTRACE | Trace/debug other processes |
| CAP_SYS_CHROOT | Use chroot |
Docker containers run with ~14 capabilities by default — enough to function but far less than full root. CAP_SYS_ADMIN is notably absent, which is why unshare doesn't work without --privileged.
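Capability sets are stored as bitmasks, visible as the CapEff, CapPrm, and CapBnd fields in /proc/<pid>/status. The sketch below tests individual bits in such a mask. The helper names are made up, the bit numbers (10 and 21) come from linux/capability.h, and the sample mask is the effective set commonly reported inside a default Docker container.

```shell
# Test whether a capability bit is set in a hex mask such as CapEff.
has_cap() {
  local mask=$1 bit=$2
  [ $(( (0x$mask >> bit) & 1 )) -eq 1 ]
}

# Print "<name>: present" or "<name>: absent" for one capability.
# Arguments: hex-mask capability-name bit-number
report_cap() {
  if has_cap "$1" "$3"; then
    echo "$2: present"
  else
    echo "$2: absent"
  fi
}

default_mask=00000000a80425fb   # assumed typical value for a default Docker container
report_cap "$default_mask" CAP_NET_BIND_SERVICE 10   # prints CAP_NET_BIND_SERVICE: present
report_cap "$default_mask" CAP_SYS_ADMIN 21          # prints CAP_SYS_ADMIN: absent
```

Running grep CapEff /proc/self/status inside the privileged lab container and feeding the value to report_cap should show CAP_SYS_ADMIN as present, which is what makes unshare work there.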
The config.json — Full Structure
Here's the complete structure of a runtime config, with sections we used and didn't use:
config.json
├── ociVersion ← "1.0.2"
├── process ← USED: args, env, cwd, user
│ ├── args ← ["/usr/local/bin/curl", "--version"]
│ ├── env ← ["PATH=...", "TERM=xterm"]
│ ├── cwd ← "/"
│ ├── user ← {uid: 0, gid: 0}
│ ├── capabilities ← NOT USED: cap_net_bind, cap_chown, ...
│ ├── rlimits ← NOT USED: RLIMIT_NOFILE, etc.
│ └── terminal ← false
├── root ← USED: path, readonly
│ ├── path ← "/mnt/merged"
│ └── readonly ← false
├── hostname ← USED: "oci-container"
├── mounts ← USED: proc, tmpfs
│ ├── /proc ← type: proc
│ ├── /tmp ← type: tmpfs
│ ├── /dev ← NOT USED: tmpfs
│ ├── /dev/pts ← NOT USED: devpts
│ └── /sys ← NOT USED: sysfs
├── linux
│ ├── namespaces ← USED: pid, mount, uts
│ ├── resources ← NOT USED: cgroups limits
│ ├── seccomp ← NOT USED: syscall filtering
│ ├── devices ← NOT USED: /dev/null, /dev/zero, etc.
│ ├── maskedPaths ← NOT USED: /proc/kcore, /proc/keys, etc.
│ └── readonlyPaths ← NOT USED: /proc/sys, /proc/irq, etc.
└── hooks ← NOT USED
├── prestart ← Run before container starts
├── createRuntime ← Run after runtime creates container
├── poststart ← Run after container process starts
└── poststop ← Run after container process exits
Hooks — Lifecycle Extension Points
Hooks let you run custom programs at specific points in the container lifecycle:
"hooks": {
"prestart": [{
"path": "/usr/bin/setup-network",
"args": ["setup-network", "--container-id", "abc123"]
}],
"poststart": [{
"path": "/usr/bin/notify-orchestrator",
"args": ["notify", "--status", "running"]
}],
"poststop": [{
"path": "/usr/bin/cleanup-network",
"args": ["cleanup-network", "--container-id", "abc123"]
}]
}
Use cases:
- prestart: Set up network interfaces, mount volumes, configure logging
- poststart: Notify the orchestrator, register with service discovery
- poststop: Clean up network, release IPs, remove temp files
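Hooks are used mostly by lower-level tooling, for example to inject GPU devices or to configure networking for standalone runc containers. Kubernetes does not drive them directly: it talks to the runtime through the CRI (Container Runtime Interface), and pod networking is set up by CNI plugins that the CRI runtime invokes.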
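A hook learns which container it is being run for from the container state, which the runtime passes as JSON on the hook's standard input. The sketch below shows the receiving side in its simplest form. hook_container_id is a hypothetical name, the sample state is invented, and the sed expression assumes the state arrives on one line; it is not a general JSON parser.

```shell
# Minimal hook body: read the container state JSON from stdin and print its "id".
hook_container_id() {
  sed -n 's/.*"id"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Hypothetical state document, shaped like the runtime spec's state object.
state='{"ociVersion":"1.0.2","id":"abc123","status":"created","pid":4422,"bundle":"/bundle"}'
printf '%s\n' "$state" | hook_container_id   # prints abc123
```

A real hook would normally parse the state with jq or a JSON library and then use the pid field to reach the container's namespaces.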
Masked and Readonly Paths — Hiding Sensitive Kernel Interfaces
Docker/runc hide sensitive kernel information from containers:
"linux": {
"maskedPaths": [
"/proc/acpi",
"/proc/kcore",
"/proc/keys",
"/proc/latency_stats",
"/proc/timer_list",
"/proc/timer_stats",
"/proc/sched_debug",
"/sys/firmware"
],
"readonlyPaths": [
"/proc/asound",
"/proc/bus",
"/proc/fs",
"/proc/irq",
"/proc/sys",
"/proc/sysrq-trigger"
]
}
- Masked paths are bind-mounted from /dev/null: reads return empty, writes disappear. This hides kernel cryptographic keys (/proc/keys), ACPI tables, and timing information.
- Readonly paths can be read but not written. This prevents containers from tuning kernel parameters (/proc/sys) or triggering sysrq.
We mounted a raw /proc with no restrictions — a security risk in production.
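Done by hand, masking and read-only protection come down to a few extra mounts after /proc is mounted. The outline below is a sketch under assumptions: mask_path and readonly_path are hypothetical helpers, they must run as root inside the container's mount namespace, and /dev/null must exist in the root filesystem.

```shell
# Hide a path: files are covered by /dev/null, directories by an empty read-only tmpfs.
mask_path() {
  local p=$1
  if [ -d "$p" ]; then
    mount -t tmpfs -o ro tmpfs "$p"
  elif [ -e "$p" ]; then
    mount --bind /dev/null "$p"
  fi
}

# Make a path read-only: bind it onto itself, then remount that bind read-only.
readonly_path() {
  local p=$1
  [ -e "$p" ] || return 0
  mount --bind "$p" "$p"
  mount -o remount,bind,ro "$p"
}
```

In our manual container, calls such as mask_path /proc/kcore and readonly_path /proc/sys right after mount -t proc proc /proc would approximate the two lists above.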
How Docker and containerd Use the Runtime Spec
The OCI Runtime Spec is the bottom layer of the container stack:
┌─────────────────────────────────────┐
│ User: docker run -it ubuntu bash │
├─────────────────────────────────────┤
│ Docker CLI │ Parses flags, calls API
├─────────────────────────────────────┤
│ dockerd (Docker daemon) │ Image management, networking, volumes
├─────────────────────────────────────┤
│ containerd │ Manages container lifecycle
│ ├── Image pull & unpack │ Uses overlayfs snapshotter
│ ├── Generate config.json │ Maps image config → runtime config
│ └── Call runc │ Passes the runtime bundle
├─────────────────────────────────────┤
│ runc (OCI runtime) │ Reads config.json
│ ├── Set up namespaces │ clone() with NS flags
│ ├── Configure cgroups │ Write to /sys/fs/cgroup/
│ ├── Apply seccomp + capabilities │ prctl() + BPF
│ ├── pivot_root to rootfs │ swap the root mount
│ └── exec the process │ execve("/bin/bash")
├─────────────────────────────────────┤
│ Linux Kernel │ namespaces, cgroups, overlayfs
└─────────────────────────────────────┘
What each layer does:
- Docker CLI: User-facing. Translates docker run flags into API calls.
- dockerd: Manages images, networks, volumes. Leaves process isolation to the layers below.
- containerd: Pulls images, unpacks layers (overlayfs), generates config.json, delegates to runc.
- runc: The component that actually sets up the kernel isolation primitives. Reads config.json, sets up isolation, execs the process.
The OCI Runtime Spec is the contract between containerd (or any higher-level tool) and runc (or any OCI-compliant runtime). This is why you can swap runtimes:
| Runtime | What's different |
|---|---|
| runc | Reference implementation, written in Go |
| crun | Written in C, faster startup, lower memory |
| youki | Written in Rust, for security-focused deployments |
| gVisor (runsc) | Intercepts syscalls with a user-space kernel |
| Kata Containers | Runs each container in a lightweight VM |
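All of them read the same config.json and expose the same interface. Only the implementation differs: runc, crun, and youki use the namespace primitives shown here, while gVisor and Kata Containers put a user-space kernel or a lightweight VM behind the same contract.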
Security Layers — Defense in Depth
A production container has multiple overlapping security boundaries:
Layer 1: Namespaces        ← Visibility isolation
Layer 2: Capabilities      ← Privilege restriction
Layer 3: Seccomp           ← Syscall filtering
Layer 4: AppArmor/SELinux  ← Mandatory access control
Layer 5: Cgroups           ← Resource limits (prevent DoS)
Layer 6: Read-only rootfs  ← Prevent filesystem tampering
Layer 7: User namespace    ← Non-root on host even if root in container
Our demo used only Layer 1. Each additional layer reduces the attack surface, which is why practical container escapes typically have to chain several weaknesses rather than defeat a single mechanism.
Up Next
In Part 4, we leave the runtime behind and head back to the registry — this time to sign our image with Notation and discover how the OCI 1.1 subject + Referrers mechanism attaches signatures (and, in Part 5, SBOMs) without modifying the image itself.
Every command output, PID, hostname, and process table in this post was captured from an actual run inside an Ubuntu 22.04 container (privileged mode) on Docker Desktop for Mac on April 25, 2026.