Runc Source Code Walkthrough (2): How runc Initializes the Container Environment

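Hi everyone, I'm 费益洲. In the previous article we walked through the overall flow runc follows to create a container. During creation, runc internally invokes runc init to initialize the container environment. This article walks through the overall flow of runc init.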

Prerequisites

runc source version: v1.1.12

runc repository: https://github.com/opencontainers/runc

Execution order

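From the outside, runc init looks like a single command, but inside the process it runs as two consecutive execution paths:

C side: creates the namespaces the container needs and switches the process into them
Go side: inside the already isolated namespaces, configures the current process as the container's target process and then executes the user program

The C side runs first; once it returns, the Go runtime boots and the Go side takes over. init.go itself states this order explicitly: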

// line 15
// This is the golang entry point for runc init, executed
// before main() but after libcontainer/nsenter's nsexec().

🔔 Note: the C code runs inside the runc init process itself, before that process's Go runtime has started.

Why

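Why create and switch all namespaces in plain C before the Go runtime starts, rather than letting Go do it? There are three main reasons:

The Go runtime is multithreaded: as soon as the scheduler starts, it creates several OS threads (M). Linux syscalls such as unshare(CLONE_NEWUSER), however, require the calling process to be single-threaded; in a multithreaded process they fail with EINVAL.
CLONE_NEWPID only takes effect for children forked after the call: the process therefore has to fork once more, and forking a child while carrying a Go runtime has very complicated semantics (goroutine stacks, futexes, signal handling).
setns has similar restrictions: a multithreaded process cannot join a user namespace, and joining a PID namespace again only affects children created afterwards.

This leads to the current execution order: before the Go runtime starts, runc does all the namespace work in plain C, and finally lets the stage-2 child return "bare" into the Go entry point. Only then does the Go runtime see this process for the first time, and by that point the process is already inside the new namespaces and single-threaded, so the runtime can initialize safely.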

C side

Let's look at the main logic on the C side first. The main program triggers a C constructor for runc init through a blank import of nsenter:

init.go

import (
    ...
    // line 9
    _ "github.com/opencontainers/runc/libcontainer/nsenter"
    ...
)

nsenter is implemented as follows:

libcontainer/nsenter/nsenter.go

/*
extern void nsexec();
void __attribute__((constructor)) init(void) {
    nsexec();
}
*/
import "C"

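This code wraps nsexec() in a function marked __attribute__((constructor)), using cgo to embed a C constructor that runs when the binary is loaded. The stage-0/1/2 logic in nsexec.c is therefore the earliest container initialization code to execute.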

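Because runc init is the runc binary re-executing itself, the constructor fires on every runc invocation, so nsexec() first checks an environment variable to decide whether this process really is an init process: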

libcontainer/nsenter/nsexec.c

// line 854
void nsexec(void)
{
    ...
    // line 872
    pipenum = getenv_int("_LIBCONTAINER_INITPIPE");
    if (pipenum < 0) {
        /* We are not a runc init. Just return to go runtime. */
        return;
    }
    ...
}

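The environment variable read at line 872 is set by the parent runc process (during runc create) when it spawns runc init. Its value is the number of a file descriptor, the init pipe, that the child inherits. getenv_int works as follows: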

libcontainer/nsenter/nsexec.c

// line 407
static int getenv_int(const char *name)
{
    char *val, *endptr;
    int ret;

    val = getenv(name);
    /* Treat empty value as unset variable. */
    if (val == NULL || *val == '\0')
        return -ENOENT;

    ret = strtol(val, &endptr, 10);
    if (val == endptr || *endptr != '\0')
        bail("unable to parse %s=%s", name, val);
    /*
     * Sanity check: this must be a non-negative number.
     */
    if (ret < 0)
        bail("bad value for %s=%s (%d)", name, val, ret);

    return ret;
}

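The logic of getenv_int is simple. If _LIBCONTAINER_INITPIPE is unset or empty, it returns -ENOENT; nsexec() sees the negative value, concludes that this is not an init process, and returns so that the Go program continues normally. If the variable is set to a valid non-negative number, the process is treated as runc init and goes on to run stage-0.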

stage-0

stage-0, the parent coordinator, starts first. Once running, it spawns stage-1 and then enters a synchronization loop with the child stages:

libcontainer/nsenter/nsexec.c

// line 1001-1003
write_log(DEBUG, "spawn stage-1");
stage1_pid = clone_parent(&env, STAGE_CHILD);

// line 1014-1016
write_log(DEBUG, "-> stage-1 synchronisation loop");
stage1_complete = false;
while (!stage1_complete) {
    ...
}

The key messages stage-0 handles in this loop are:

SYNC_USERMAP_PLS: write uid_map/gid_map, then reply with SYNC_USERMAP_ACK
SYNC_RECVPID_PLS: receive the stage-2 pid and forward it to the outer runc
SYNC_MOUNTSOURCES_PLS: send down the mount source fds
SYNC_CHILD_FINISH: stage-1 is done.

stage-1

stage-1 can be thought of as the namespace preparer. It is spawned by stage-0, and the order in which it does things matters a great deal:

1️⃣ Join existing namespaces: if the config specifies namespaces (for example runc exec, or sharing the namespaces of an existing container), first call join_namespaces() → setns() to enter them

The key code is as follows (line 1169~1170):

libcontainer/nsenter/nsexec.c

// line 1169
if (config.namespaces)
    join_namespaces(config.namespaces);

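2️⃣ If CLONE_NEWUSER is set, first unshare(CLONE_NEWUSER), then ask stage-0 to write the uid_map/gid_map mappings, and then call setresuid(0, 0, 0). The user namespace is handled separately because, according to the comment in nsexec.c, unsharing everything in one go can leave namespace objects with incorrect ownership; the example given there is the SELinux label of the internal mount used by mqueue, which ends up wrong if the UTS namespace is unshared before the user namespace is mapped.

The key code is as follows (line 1191~1232):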

libcontainer/nsenter/nsexec.c

// line 1191
if (config.cloneflags & CLONE_NEWUSER) {
    try_unshare(CLONE_NEWUSER, "user namespace");
    config.cloneflags &= ~CLONE_NEWUSER;

    /*
     * We need to set ourselves as dumpable temporarily so that the
     * parent process can write to our procfs files.
     */
    if (config.namespaces) {
        write_log(DEBUG, "temporarily set process as dumpable");
        if (prctl(PR_SET_DUMPABLE, 1, 0, 0, 0) < 0)
            bail("failed to temporarily set process as dumpable");
    }

    /*
     * We don't have the privileges to do any mapping here (see the
     * clone_parent rant). So signal stage-0 to do the mapping for
     * us.
     */
    write_log(DEBUG, "request stage-0 to map user namespace");
    s = SYNC_USERMAP_PLS;
    if (write(syncfd, &s, sizeof(s)) != sizeof(s))
        bail("failed to sync with parent: write(SYNC_USERMAP_PLS)");

    /* ... wait for mapping ... */
    write_log(DEBUG, "request stage-0 to map user namespace");
    if (read(syncfd, &s, sizeof(s)) != sizeof(s))
        bail("failed to sync with parent: read(SYNC_USERMAP_ACK)");
    if (s != SYNC_USERMAP_ACK)
        bail("failed to sync with parent: SYNC_USERMAP_ACK: got %u", s);

    /* Revert temporary re-dumpable setting. */
    if (config.namespaces) {
        write_log(DEBUG, "re-set process as non-dumpable");
        if (prctl(PR_SET_DUMPABLE, 0, 0, 0, 0) < 0)
            bail("failed to re-set process as non-dumpable");
    }

    /* Become root in the namespace proper. */
    if (setresuid(0, 0, 0) < 0)
        bail("failed to become root in user namespace");
}

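3️⃣ Unshare the remaining namespaces (mount/IPC/PID/NET/UTS) in a single call; the cgroup namespace is explicitly masked out of this call

The key code is as follows (line 1244):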

libcontainer/nsenter/nsexec.c

try_unshare(config.cloneflags & ~CLONE_NEWCGROUP, "remaining namespaces (except cgroupns)");

4️⃣ Clone stage-2: unshare(CLONE_NEWPID) only places children forked afterwards into the new PID namespace, so one more fork is required before stage-2 really is pid 1 inside the new PID namespace.

The key code is as follows (line 1278):

libcontainer/nsenter/nsexec.c

stage2_pid = clone_parent(&env, STAGE_INIT);

stage-2

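stage-2 can be thought of as the final transition process (the last hop before Go). After it receives the go-ahead signal from stage-0, it finishes switching and returns into the Go runtime.

The key code is as follows: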

libcontainer/nsenter/nsexec.c

// line 1342-1345
read(syncfd, &s, sizeof(s));
if (s != SYNC_GRANDCHILD) ...

// line 1347-1354
setsid();
setuid(0);
setgid(0);

// line 1365-1367
s = SYNC_CHILD_FINISH;
write(syncfd, &s, sizeof(s));

// line 1377-1380
write_log(DEBUG, "<= nsexec container setup");
write_log(DEBUG, "booting up go runtime ...");
return;

Only after this return does execution enter the Go runtime and continue with the Go-side init.go.

Go side

Once the pre-Go phase is finished, the Go-side init() runs:

init.go

// line 13
func init() {
    if len(os.Args) > 1 && os.Args[1] == "init" {
        // line 19-31
        level, _ := strconv.Atoi(os.Getenv("_LIBCONTAINER_LOGLEVEL"))
        logPipeFd, _ := strconv.Atoi(os.Getenv("_LIBCONTAINER_LOGPIPE"))
        logrus.SetLevel(logrus.Level(level))
        logrus.SetOutput(os.NewFile(uintptr(logPipeFd), "logpipe"))

        // line 34-35
        factory, _ := libcontainer.New("")
        if err := factory.StartInitialization(); err != nil {
            os.Exit(1)
        }
    }
}

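This part mainly sets up the log pipe and then calls StartInitialization(). StartInitialization() reads the information the parent passed in through environment variables and the pipe, and then constructs the concrete initer: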

libcontainer/factory_linux.go

// line 262
func (l *LinuxFactory) StartInitialization() (err error) {
    // line 264-272: read _LIBCONTAINER_INITPIPE
    envInitPipe := os.Getenv("_LIBCONTAINER_INITPIPE")
    pipefd, _ := strconv.Atoi(envInitPipe)
    pipe := os.NewFile(uintptr(pipefd), "pipe")

    // line 289-296: read _LIBCONTAINER_INITTYPE / _LIBCONTAINER_FIFOFD
    envInitType := os.Getenv("_LIBCONTAINER_INITTYPE")
    it := initType(envInitType)

    // line 308-312: read _LIBCONTAINER_LOGPIPE
    logPipeFdStr := os.Getenv("_LIBCONTAINER_LOGPIPE")
    logPipeFd, _ := strconv.Atoi(logPipeFdStr)

    // line 334-340: construct the initer and run i.Init()
    i, err := newContainerInit(it, pipe, consoleSocket, fifofd, logPipeFd, mountFds)
    return i.Init()
}

libcontainer/init_linux.go

// line 79
func newContainerInit(t initType, pipe *os.File, consoleSocket *os.File, fifoFd, logFd int, mountFds []int) (initer, error) {
    switch t {
    case initSetns:
        return &linuxSetnsInit{...}, nil
    case initStandard:
        return &linuxStandardInit{...}, nil
    }
}
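The dispatch in newContainerInit is a plain "string tag selects an interface implementation" pattern. A stripped-down sketch follows, with placeholder types instead of runc's real structs. The string values "setns" and "standard" are my understanding of what _LIBCONTAINER_INITTYPE carries and should be treated as an assumption; newInit, setnsInit, and standardInit are hypothetical names.

```go
package main

import "fmt"

type initType string

// Assumed tag values; treat them as illustrative.
const (
	initSetns    initType = "setns"
	initStandard initType = "standard"
)

// initer is the common interface both init flavours satisfy.
type initer interface {
	Init() error
}

type setnsInit struct{}

func (setnsInit) Init() error {
	fmt.Println("setns path: enter an existing container's namespaces")
	return nil
}

type standardInit struct{}

func (standardInit) Init() error {
	fmt.Println("standard path: set up a brand-new container")
	return nil
}

// newInit (hypothetical) picks the implementation for a given tag and
// rejects anything it does not recognise.
func newInit(t initType) (initer, error) {
	switch t {
	case initSetns:
		return setnsInit{}, nil
	case initStandard:
		return standardInit{}, nil
	}
	return nil, fmt.Errorf("unknown init type %q", t)
}

func main() {
	for _, t := range []initType{initStandard, initSetns, "bogus"} {
		i, err := newInit(t)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		_ = i.Init()
	}
}
```

Keeping the choice behind one interface means the caller only ever needs `i.Init()`, regardless of which path was selected.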

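For the runc create -> runc init path, the type constructed is initStandard. Inside linuxStandardInit.Init(), the core sequence is roughly:

1️⃣ Set up network/route, rootfs, console, hostname, AppArmor, and sysctl

2️⃣ Call syncParentReady(l.pipe) to tell the parent process that init is ready

3️⃣ Apply seccomp and finalize the namespace

4️⃣ Close the pipe and the log fd

5️⃣ Open exec.fifo for writing (which blocks) and write to it

6️⃣ Finally call system.Exec(...) to execute the user process

The key code is as follows: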

libcontainer/standard_init_linux.go

// line 49
func (l *linuxStandardInit) Init() error {
    // line 80-123: network/rootfs/console/finalizeRootfs
    if err := setupNetwork(l.config); err != nil {
        return err
    }
    ...
    // Finish the rootfs setup.
    if l.config.Config.Namespaces.Contains(configs.NEWNS) {
        if err := finalizeRootfs(l.config.Config); err != nil {
            return err
        }
    }
    ...
    // line 161
    if err := syncParentReady(l.pipe); err != nil {
        return fmt.Errorf("sync ready: %w", err)
    }
    ...
    // line 171~188
    if l.config.Config.Seccomp != nil && !l.config.NoNewPrivileges {
        seccompFd, err := seccomp.InitSeccomp(l.config.Config.Seccomp)
        if err != nil {
            return err
        }
        if err := syncParentSeccomp(l.pipe, seccompFd); err != nil {
            return err
        }
    }
    if err := finalizeNamespace(l.config); err != nil {
        return err
    }
    // finalizeNamespace can change user/group which clears the parent death
    // signal, so we restore it here.
    if err := pdeath.Restore(); err != nil {
        return fmt.Errorf("can't restore pdeath signal: %w", err)
    }
    ...
    // line 227~232
    _ = l.pipe.Close()
    // Close the log pipe fd so the parent's ForwardLogs can exit.
    if err := unix.Close(l.logFd); err != nil {
        return &os.PathError{Op: "close log pipe", Path: "fd " + strconv.Itoa(l.logFd), Err: err}
    }
    ...
    // line 238~253
    fifoPath := "/proc/self/fd/" + strconv.Itoa(l.fifoFd)
    fd, err := unix.Open(fifoPath, unix.O_WRONLY|unix.O_CLOEXEC, 0)
    if err != nil {
        return &os.PathError{Op: "open exec fifo", Path: fifoPath, Err: err}
    }
    if _, err := unix.Write(fd, []byte("0")); err != nil {
        return &os.PathError{Op: "write exec fifo", Path: fifoPath, Err: err}
    }
    // Close the O_PATH fifofd fd before exec because the kernel resets
    // dumpable in the wrong order. This has been fixed in newer kernels, but
    // we keep this to ensure CVE-2016-9962 doesn't re-emerge on older kernels.
    // N.B. the core issue itself (passing dirfds to the host filesystem) has
    // since been resolved.
    // https://github.com/torvalds/linux/blob/v4.9/fs/exec.c#L1290-L1318
    _ = unix.Close(l.fifoFd)
    ...
    // line 280
    return system.Exec(name, l.config.Args[0:], os.Environ())
}

🔔 Note that runc blocks before it executes the user program:
libcontainer/standard_init_linux.go
// line 238
fifoPath := "/proc/self/fd/" + strconv.Itoa(l.fifoFd)
fd, err := unix.Open(fifoPath, unix.O_WRONLY|unix.O_CLOEXEC, 0)
if err != nil {
    return &os.PathError{Op: "open exec fifo", Path: fifoPath, Err: err}
}
if _, err := unix.Write(fd, []byte("0")); err != nil {
    return &os.PathError{Op: "write exec fifo", Path: fifoPath, Err: err}
}
open(O_WRONLY) blocks until some process opens the read end. After runc create, the init process is parked on exactly this line; when the user runs runc start <id>, that side opens the fifo with O_RDONLY, and only then does the open here return, write "0", and continue.

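To sum up, runc init = "C code isolates the namespaces first → Go code then fits out the inside of the container → wait on the fifo for the go-ahead → execve hands over to the user process". This staged design cleanly separates creating a container from starting it: it works around the limitations of the Go runtime and also satisfies the decoupled create/start semantics of the OCI specification.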