16 Jan 2020

Cross-Arch Reproducibility using Containers

Introduction

Part of my work for Igalia is on the 32bit support on MIPS (little endian) and ARM for JSC (JavaScriptCore - the JavaScript compiler in WebKit) and one of the problems we face is that of reproducibility of failures.

We have boards to test these, namely Raspberry Pi 3 Model B+ boards, running a 32bits ARM kernel and Imagination CI20 boards running a mipsel kernel - all built with buildroot which provides images for these boards out-of-the-box. However, we don't work on these - most of us have higher performance x86_64 machines and whenever a failure occurs upstream it's generally time consuming to reproduce.

I have therefore set out to create an environment where we can easily reproduce cross-architectural failures on JSC. I had worked on similar issues for Racket, when building the cross-architectural Racket chroot environment on GitLab CI.

Starting with chroot

Let's start the discussion by talking about chroot. In any linux system you'll find a chroot binary.

$ which chroot
/usr/sbin/chroot
$ chroot --help
Usage: chroot [OPTION] NEWROOT [COMMAND [ARG]...]
  or:  chroot OPTION
Run COMMAND with root directory set to NEWROOT.

  --groups=G_LIST        specify supplementary groups as g1,g2,..,gN
  --userspec=USER:GROUP  specify user and group (ID or name) to use
  --skip-chdir           do not change working directory to '/'
      --help     display this help and exit
      --version  output version information and exit

If no command is given, run '"$SHELL" -i' (default: '/bin/sh -i').

GNU coreutils online help: <https://www.gnu.org/software/coreutils/>
Full documentation at: <https://www.gnu.org/software/coreutils/chroot>
or available locally via: info '(coreutils) chroot invocation'

However, at a slightly lower level, chroot is a system call.

$ man -s2 chroot
CHROOT(2)

NAME
       chroot - change root directory

SYNOPSIS
       #include <unistd.h>

       int chroot(const char *path);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
...

As mentioned in the man page, it changes the root of the file system for a process to the path passed as argument. This new environment created for the process is known as a chroot jail, to which we will refer simply as a jail.

The jail allows us to trap a process inside a filesystem, i.e. the process cannot access anything outside the filesystem it is in. So, if it tried to access the root of the filesystem, inside the jail it will only see the root of the new file system which is the path we passed on to chroot.

As an example throughout the post I will be using the factorial function. Whenever I refer to factorial.c, I refer to a file that you might have to create as needed and consists of the following C source code.

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {

  uint64_t result = 1;
  uint32_t arg;

  if (argc != 2)
    return 1;

  arg = strtoul(argv[1], NULL, 0);
  while(arg) result *= arg--;

  printf ("Result: %" PRIu64 "\n", result);
  return 0;
}

Let's compile it and run an example.

$ gcc -Wall -Wextra -o factorial factorial.c
$ ./factorial 20
Result: 2432902008176640000

Let's create a jail for this binary using chroot and run it.

$ mkdir jail
$ cp factorial jail/
$ chroot jail/ /factorial
chroot: cannot change root directory to 'jail/': Operation not permitted

First lesson here: only root can chroot. For security reasons, a normal user cannot chroot. Being able to do so, would allow the user privilege escalation (further details on breaking out of the chroot jail and privilege escalation).

So we sudo:

$ sudo chroot jail/ /factorial
chroot: failed to run command ‘/factorial’: No such file or directory

Now this really starts to annoy you and you start wondering if the path to factorial is correct. It is correct, the path is relative to the jail root. Once the root becomes jail/, the factorial binary is in the root of the filesystem so /factorial is correct. The problem is subtle but will teach you an important lesson: dependencies. This is, after all, a dynamically linked executable 💡.

$ ldd factorial
        linux-vdso.so.1 (0x00007ffe1f7f9000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f65bf62e000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f65bf844000)

Those files exist in your file system, but not in the jail, so you get an error (although a pretty terrible error message at that). Let's try a static executable instead.

$ gcc -Wall -Wextra -static -o factorial factorial.c
$ ldd factorial
        not a dynamic executable
$ sudo chroot jail/ /factorial 20
Result: 2432902008176640000

This is exactly what we wanted from the beginning - to show that the binary is now jailed in this root and cannot escape without explicitly trying to, something no benign binary is likely to do (see my comment above on escaping jail).

In chroot jails, the filesystem root changes but it's still running on the same kernel. This allows us to create a new userspace inside this jail, separate from the host's, and possibly based on a different linux distribution. For reproducibility, you can potentially tarball this jail and send it to someone else, who could themselves chroot into it and reproduce a specific problem.

QEMU and binfmt

I will now introduce two other essential components to achieve our goal of cross architecture reproducibility: QEMU and binfmt.

Up until now our jail has contained binaries compiled for our host architecture (in my case x86_64), however this need not be the case. QEMU is an open-source hardware emulator and virtualizer. It implements two execution modes: system mode, and user mode;

In system mode, QEMU works like VirtualBox it emulates a whole system - from the applications, interrupts, and kernel, all the way to the hardware devices. The one I am interested in is user mode. In user mode, QEMU emulates a single binary leaving the rest of the system untouched.

To demo how it works, lets use a cross-toolchain to compile a program to be run on a different architecture. Given I am on a x86_64, I will use an armhf toolchain. You can either download one provided from your distro or compile one with crosstool-ng.

For the sake of reproducibility, lets compile a pre-configured one with crosstool-ng. Download it and install it in $PATH - I used version 1.24.0. Then build the cross toolchain for armv7-rpi2-linux-gnueabihf.

$ mkdir build
$ cd build
build/ $ ct-ng armv7-rpi2-linux-gnueabihf
build/ $ ct-ng build

Let's once again compile the factorial.c example, but this time with our new toolchain. This toolchain will be in $HOME/x-tools/armv7-rpi2-linux-gnueabihf by default.

$ export PATH=$HOME/x-tools/armv7-rpi2-linux-gnueabihf/bin:$PATH
$ armv7-rpi2-linux-gnueabihf-gcc -static -o factorial factorial.c
$ ./factorial
zsh: exec format error: ./factorial
$ file factorial
factorial: ELF 32-bit LSB executable, ARM, EABI5 version 1 (SYSV), \
dynamically linked, interpreter /lib/ld-linux-armhf.so.3, for GNU/Linux \
4.19.21, with debug_info, not stripped

We were expecting this error, right? After all, we cannot just execute an arm binary in a x86_64 system. However, we can if we use QEMU in user mode which is what I want to show:

$ qemu-arm ./factorial 20 
Result: 2432902008176640000

Now we have it working - however, there's a last tool we need to discuss and that's binfmt. binfmt_misc is a linux kernel capability that allows arbitrary executable file formats to be recognized and passed on to an interpreter. This means that we can transparently recognize arm (or any other architecture) executables and request them to be passed to QEMU transparently when we try to execute them.

Before proceeding, if you wish to follow the examples, verify that your kernel has binfmt enabled.

$ zcat /proc/config.gz| grep BINFMT
CONFIG_BINFMT_ELF=y
CONFIG_COMPAT_BINFMT_ELF=y
CONFIG_BINFMT_SCRIPT=y
CONFIG_BINFMT_MISC=y

If you don't have /proc/config.gz try instead grep BINFMT /boot/config-$(uname -r).

Let's go back to our factorial on arm example, but this time install a binfmt record to call qemu-arm on the executable whenever we try to execute it directly. A binfmt record looks like :name:type:offset:magic:mask:interpreter:flags and needs to be installed by echoing the correct string to /proc/sys/fs/binfmt_misc/register. For details on this consult the kernel documentation. For armv7 we can register our interpreter and run our binary transparently like this:

$ sudo bash -c 'echo ":qemu-arm:M:0:\\x7f\\x45\\x4c\\x46\\x01\\x01\\x01\\x00\\x00\\x00\\x00\\x00\\x00\
\\x00\\x00\\x00\\x02\\x00\\x28\\x00:\\xff\\xff\\xff\\xff\\xff\\xff\\xff\\x00\\xff\\xff\\xff\\xff\\xff\
\\xff\\xff\\xff\\xfe\\xff\\xff\\xff:/home/pmatos/installs/bin/qemu-arm:OCF" > \
/proc/sys/fs/binfmt_misc/register'
$ ./factorial 20
Result: 2432902008176640000

The write to /proc/sys/fs/binfmt_misc/register may fail if you already have a record for qemu-arm setup. To remove the record run echo -1 > /proc/sys/fs/binfmt_misc/qemu-arm. Note that the file, as shown above is an ARM binary and yet we transparently run it through QEMU thanks to the binfmt magic we have initially set up.

I should note, that while interesting to understand how this works behind the scenes, several people have created images to do the binfmt registration for you. One of those projects is docker/binfmt which you can try to run using the latest tag: docker run --rm --privileged docker/binfmt:66f9012c56a8316f9244ffd7622d7c21c1f6f28d (another example of a similar project is multiarch/qemu-user-static). If, for some reason, this does not work you should know enough by now to proceed with the registration manually.

Creating container base images

We have gone through creating chroot jails and transparently executing cross-architecture binaries with QEMU. A base image for a container is based pretty much on what we have just learned. To create a container base image we will create a jail with a foreign root filesystem and we will use QEMU to execute binaries inside the jail. Once all is working, we import it into docker.

To help us create a base system, we will use debootstrap in two stages. We split this into two stages so we can setup QEMU in between.

$ mkdir rootfs
$ debootstrap --foreign --no-check-gpg --arch=arm buster ./rootfs http://httpredir.debian.org/debian/
$ cp -v /usr/bin/qemu-arm-static ./rootfs/usr/bin/
$ chroot ./rootfs ./debootstrap/debootstrap --second-stage --verbose
$ mount -t devpts devpts ./rootfs/dev/pts
$ mount -t proc proc ./rootfs/proc
$ mount -t sysfs sysfs ./rootfs/sys

At this point our system is a debian base system with root at ./rootfs. Note that the qemu you need to copy into ./rootfs/usr/bin needs to be static in order to avoid dynamic loading issues inside the jail when it is invoked.

Another important aspect to consider is that depending on your binfmt setup, you need to put the binary in the proper place inside the jail. For the above to work, your binfmt interpreter registration has to point to an interpreter at /usr/bin/qemu-arm-static, which is the absolute path as seen from inside the jail. Now we can install all dependencies in the system at will. All commands will be using qemu transparently to execute if binfmt setup worked correctly. Remember that in the case of chroot jails (and also of containers), the kernel in use is the kernel of your host. Therefore that is the only kernel that needs to be set up with binfmt. You set it up outside your chroot jail and when the kernel needs to execute a foreign executable, it will look at the current binfmt setup for the interpreter. However, the kernel cannot access a filesystem outside the jail, so that interpreter needs to exist inside the it.

$ chroot ./rootfs apt-get update
$ chroot ./rootfs apt-get -y upgrade
$ chroot ./rootfs apt-get install -y g++ cmake libicu-dev git ruby-highline ruby-json python
chroot ./rootfs apt-get -y autoremove
chroot ./rootfs apt-get clean
chroot ./rootfs find /var/lib/apt/lists -type f -delete

These are the steps to install the dependencies to build and test JSC. Let's unmount the filesystems in order to create the docker base image.

umount ./rootfs/dev/pts
umount ./rootfs/proc
umount ./rootfs/sys

Let's tar our jail and import it into a docker image.

$ tar --numeric-owner -cvf buster-arm.tar -C ./rootfs .
$ docker import buster-arm.tar jsc-base:arm-raw

The image jsc-base:arm-raw now contains our raw system. To be able to add metadata or build on top of this image before releasing we can build a docker image that extends the raw image version.

$ cat <<EOF > jsc32-base.Dockerfile
FROM jsc32-base:arm-raw

LABEL description="Minimal Debian image to reproduce JSC dev"
LABEL maintainer="Paulo Matos <pmatos@igalia.com>"
 
CMD ["/bin/bash"]
EOF
$ docker build -t pmatos/jsc32-base:arm -f jsc32-base.Dockerfile
$ docker push pmatos/jsc32-base:arm

Once we have pushed the image into the repository, we can run it from anywhere as long as binfmt is setup properly.

$ docker run pmatos/jsc32-base:arm file /bin/bash
/bin/bash: ELF 32-bit LSB pie executable, ARM, EABI5 version 1 (SYSV), dynamically linked, \
interpreter /lib/ld-linux-armhf.so.3, for GNU/Linux 3.2.0, \
BuildID[sha1]=40e50d160a6c70d1a4e961200202cf853b4a2145, stripped

Or to get an interactive shell inside the container, use -it: docker run -it pmatos/jsc32-base:arm /bin/bash.

Going rootless with podman

While docker is a product of Docker Inc with various components and pricing tiers, podman is a free and open source daemonless engine for developing, managing, and running containers on Linux. One of the main internal differences between docker and podman is that podman uses cgroups v2, which docker doesn't yet support. Since Fedora 31 ships with cgroups v2, for many users podman is the only container engine available. Fortunately, podman is largely docker compatible meaning that you can alias docker=podman without any issues.

So you can do something like this.

$ podman run pmatos/jsc32-base:arm file /bin/bash
/bin/bash: ELF 32-bit LSB pie executable, ARM, EABI5 version 1 (SYSV), dynamically linked, \
interpreter /lib/ld-linux-armhf.so.3, for GNU/Linux 3.2.0, \
BuildID[sha1]=40e50d160a6c70d1a4e961200202cf853b4a2145, stripped

As long your binfmt is properly setup, it will run as well as it did with docker.

Application to JSC

The initial code developed for reproducible JSC32 builds can be found in the WebKit-misc repository and was initially pushed under commit 94bf25e.

The main script is in containers/reprojsc.sh and follows the plan laid out in this blog post.

An image is created using containers/jsc32-base/build-image.sh and pushed into docker hub pmatos/jsc32-base. This only needs to be done when the image is changed.
The containers/reprojsc.sh script is run each time one wants to trigger a build and/or test of JSC. This sets up QEMU for the desired architecture, starts the image and issues the necessary commands for the actions laid out in the command line.

To run it yourself, checkout WebKit-misc and WebKit, and run reprojsc.sh.

$ git clone --depth=1 git://git.webkit.org/WebKit.git
$ export WEBKIT_PATH=$PWD/WebKit
$ git clone --depth=1 git@github.com:pmatos/WebKit-misc.git
$ cd WebKit-misc/containers
$ ./reprojsc.sh -a arm -b -t -i $WEBKIT_PATH

This will build, test and send you into an interactive session so you can debug some possibly failed tests. Inside the container, you'll be in an ARM Cortex-A7 (32bits) environment. The other available architecture at the moment is MIPS which can be chosen with -a mips.

If you have any issues or requests please open an issue in GitHub.

What's missing?

There are many things that deserved discussion but I won't elaborate further. One of them is the difference between the container execution engine and the image builder. When using docker it's easy to think that they are the same but it is not the case. For example, docker has a new image builder, called buildx whichhas in-built support for the creation of multi-architecture images. On the other hand, podman is a container execution engine but an image creator is, for example, buildah, which doesn't have multi-architecture image support yet. Don't worry if you are confused - things are moving fast in this area and it's easy to lose track of all the tools out there.

With the ability to transparently emulate other architectures QEMU user mode will see more usage patterns. Bugs in QEMU can look like application bugs and debugging is not straightforward, so bear in this in mind when using this technology.

Acknowledgments

Thanks to my fellow Igalians Angelos Oikonomopoulos and Philip Chimento for providing suggestions and corrections for this blog post.