Podman Solutions: User-Based Privilege Separation

Motivation

There is an ancient practice in the Unix world where each service gets its own “user.” The practice is so old that when it was a new idea, these “system users” got intermixed with the real human kind, resulting in different system user IDs on each box depending on the order in which the users were created. Because this then caused problems with NFS, OSes began reserving some number of the low-numbered IDs¹ for themselves, with fixed assignments below that limit and the real human users starting above it.

Podman obsoletes all of this.

Do not combine the two.

Why? Read on.

User Namespaces

“Linux container” is a wrapper term for a bunch of disconnected technologies which tools like Podman combine into a useful whole. While it aids initial understanding to think of this assortment of underlying features as if there had been a single concerted effort to add Containerization™ to the Linux kernel, the actual fact is that the pieces were each added separately over a span of many years, in some cases for purposes quite separate from what we now think of as the modern Linux containers infrastructure.

I bring this up because it can be important to understand the elements on their own merits. In this article’s specific case, Linux’s namespaces feature functionally obsoletes the old “system users” practice, specifically via user namespaces — userns for short — which isolate privilege based on user ID, same as occurs with single-purpose system users.

Default Behavior

Consider this:

$ id=$(podman run --rm -d alpine sleep 60)
$ podman top $id user huser
USER        HUSER
root        501

The first command merely starts a dummy container for us to examine, which will disappear a minute after we started it.

The second then tells Podman to report the user IDs involved, which shows this rootless container running under my host-side user ID even as it appears to be running as root inside.
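The kernel records this kind of translation as a table of UID ranges, exposed per process as /proc/PID/uid_map, where each line gives a starting UID inside the namespace, the matching starting UID outside it, and the length of the range. As a minimal sketch, the helper below translates a container-side UID through such a table. The name map_uid is hypothetical, not a Podman command, and the sample table is illustrative rather than copied from a real system.

```shell
# map_uid CONTAINER_UID < uid_map
# Hypothetical helper: reads uid_map-style lines ("inside outside length")
# on stdin and prints the outside UID that CONTAINER_UID maps to.
# Exits non-zero when the UID falls outside every mapped range.
map_uid() {
    awk -v uid="$1" '
        uid >= $1 && uid < $1 + $3 {
            print $2 + (uid - $1)
            found = 1
            exit
        }
        END { if (!found) exit 1 }
    '
}

# Illustrative table (an assumption): container root maps to host UID 501,
# and container UIDs 1 and up map into a subuid range starting at 100000.
sample_map='0 501 1
1 100000 65536'

printf '%s\n' "$sample_map" | map_uid 0      # prints 501
printf '%s\n' "$sample_map" | map_uid 1000   # prints 100999
```

With a table shaped like this, only the in-container root corresponds to the account that started the container; every other in-container user lands in a range that belongs to no regular host account.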

Because this is a fake root user with less privilege than our host-side user, we’ve already gained most of the protection afforded by the ancient “system users” concept. The assortment of technologies brought to bear by Podman under the label “containerization” ensures that this sleep 60 container cannot…

…access the host-side user’s home directory

To allow that, we would have had to pass something like --volume $HOME:/home/host --workdir /home/host, as tools like Distrobox go out of their way to do, on purpose.

…send signals to host-side processes

Unless you tell it otherwise, Podman puts each container into a separate pidns, which you can see with:

$ podman run --rm -it alpine ps -eaf
PID   USER     TIME  COMMAND
    1 root      0:00 ps -eaf

Although we gave ps the “show me everything” flags, the only process we see is the one for the ps instance running inside that pidns. The kernel is not fooled by our fake “root” user into exceeding that limitation’s bounds.

Also note that this lone process was assigned PID 1 within this namespace, whereas the real PID 1 is the /usr/lib/systemd/systemd instance owned by the CoreOS-based podman machine this ephemeral container briefly ran under.²
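Each process's namespace memberships appear as symlinks under /proc/PID/ns, and two processes share a namespace exactly when the corresponding links have the same target. The sketch below compares two such directories and lists the namespaces that differ. The name ns_diff is hypothetical, and the usage comment assumes a Linux host where you are permitted to read the container process's /proc entry.

```shell
# ns_diff DIR_A DIR_B
# Hypothetical helper: given two directories of namespace symlinks in the
# style of /proc/PID/ns, print the name of each namespace whose link
# target differs, i.e. each namespace the two processes do not share.
ns_diff() {
    for link in "$1"/*; do
        name=${link##*/}
        a=$(readlink "$link")
        b=$(readlink "$2/$name")
        [ "$a" = "$b" ] || printf '%s\n' "$name"
    done
}

# Possible use, where $id is the ID of a running container:
#   pid=$(podman inspect --format '{{.State.Pid}}' "$id")
#   ns_diff /proc/self/ns "/proc/$pid/ns"
```

If pid shows up in the output, the container cannot address your shell's processes by PID at all, which is the property the ps listing above demonstrates from the inside.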

…communicate with background processes on the host

This one isn’t hard-and-fast. Rootless Podman’s default configuration blocks some of the common IPC methods:

old-school System V IPC is blocked by running each container in a separate ipcns by default
Unix domain sockets appear in the filesystem, so the prior point applies: sockets not mapped through with --volume are invisible to the container
localhost sockets are inaccessible by virtue of the default --network=pasta on rootless containers; you must give --network=host to override that

But it does not block it all! There are two other major IPC options a Linux background process might employ:

listening on 0.0.0.0/::0 opens access to containers via the host’s public IP, which a container may discover via the host.containers.internal entry that Podman puts in /etc/hosts³
abstract sockets bypass the filesystem namespace but not the network namespace, so they may be visible or not, depending on how you set up your container⁴

If you are looking at the above list and thinking “Aha, Podman isn’t so great after all!” please do realize that the ancient “system users” concept doesn’t block these IPC channels, either.
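To see which host-side services fall into the wildcard-listener category, one option is to filter the host's list of listening sockets. The sketch below assumes the column layout printed by iproute2's ss with the -ltnH flags, where the fourth column is the local address and port. The name wildcard_listeners is hypothetical.

```shell
# wildcard_listeners < ss-output
# Hypothetical helper: reads lines in the layout of `ss -ltnH` (TCP
# listeners, numeric, no header) and prints the local address of every
# socket bound to a wildcard address rather than to loopback.
wildcard_listeners() {
    awk '$4 ~ /^(0\.0\.0\.0|\[::\]|\*):/ { print $4 }'
}

# Possible use on the host:
#   ss -ltnH | wildcard_listeners
```

Anything this prints is a service a container may be able to reach over the network, whichever user the service runs as.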

Automatic Unique User Namespace

Everything above applies to the straight-line default --userns=host case.⁵ When our goal is to gain isolation akin to the ancient system users concept, the flag’s other possible values are useful.

Most directly on-point for the purposes of this article is --userns=auto. This is a Podman-specific extension⁶ which makes use of preexisting features in Linux, subordinate UIDs and GIDs. Essentially, this manufactures a per-container user on the fly, one which has no connection to the host-side user.

Boom! 💥 System users are fully obsolete now.

There is one common problem that results from using this option. The standard advice is to reserve roughly 64k subuids for each user, but up at the top of this article we pointed out that regular user IDs typically start at 1000 on Linux. Combined, these facts mean each container using this feature chews through ~1k of your subuid allotment because of the way the mechanism works. The end result is that if your system has sixty-some containers running with --userns=auto, you can run out of subuids.
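The arithmetic behind that limit is simple division. The sketch below uses a hypothetical helper, max_auto_containers, and treats the per-container size of 1024 as an assumption standing in for the “~1k” figure above.

```shell
# max_auto_containers RANGE [SIZE]
# Hypothetical helper: how many containers fit in a subuid range of RANGE
# IDs if each one takes SIZE of them. SIZE defaults to 1024, an assumed
# stand-in for the "~1k" per-container figure.
max_auto_containers() {
    echo $(( $1 / ${2:-1024} ))
}

max_auto_containers 65536        # the usual 64k allotment: prints 64
max_auto_containers 65536 4096   # larger per-container ranges: prints 16
```

Podman also accepts an explicit size, as in --userns=auto:size=4096, so a container that needs more in-container UIDs uses up the allotment proportionally faster.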

One workaround is to start the container as root, which causes Podman to look for a “containers” entry in /etc/subuid. The docs recommend giving that entry 2³¹ subuids, as close to “infinite” as matters here.⁷ Because this scheme gives each container a separate UID range, the containers are effectively rootless even though they were started as root.
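Entries in /etc/subuid take the form name:start:count. The sketch below totals the count for one name and shows what a 2³¹-sized entry works out to. The helper name subuid_count is hypothetical, the start value in the sample line is illustrative, and the divisor of 1024 is the same assumed per-container size as before.

```shell
# subuid_count NAME < subuid-file
# Hypothetical helper: reads lines laid out like /etc/subuid
# (name:start:count) on stdin and prints the total count granted to NAME.
subuid_count() {
    total=0
    while IFS=: read -r name _start count; do
        if [ "$name" = "$1" ]; then
            total=$(( total + count ))
        fi
    done
    echo "$total"
}

# Sample entry granting 2^31 subuids; the start value is illustrative.
sample='containers:2147483647:2147483648'

n=$(printf '%s\n' "$sample" | subuid_count containers)
echo "$n"                 # prints 2147483648
echo $(( n / 1024 ))      # prints 2097152, assuming ~1k IDs per container
```

On a real system you would feed it the actual file, as in subuid_count containers < /etc/subuid.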

Life in a World Without System Users

If you happen to be a macOS or Windows user, Podman sets up a background “machine” for you, a hidden VM running their customized version of Fedora CoreOS.⁸ On such a host, try this:

$ podman machine ssh
core@localhost:~$ wc -l /etc/passwd
       3 /etc/passwd
core@localhost:~$ exit
$ grep -v '^#' /etc/passwd | wc -l
     130

CoreOS has only 3 users defined: root, unbound, and core, and the only reason there are even three is that unbound is a classic “system user,” isolating privilege within the podman machine by the old ways, doubtless to avoid needing to set up multiple containers with access gated using more modern mechanisms.

My macOS host is designed on more…let us be charitable and say “classic” lines. Despite being a single-user box, it has 130 users defined! All but a few are system users, owing to the fact that macOS has a development history traceable back to BSD Unix in the early 1980s. Thankfully, it is missing classic system users like sendmail, and most of the system users it does define are named with a leading underscore to distinguish them, yet one cannot help but draw a valuable distinction here.

Podman CoreOS doesn’t need these throwbacks. It has namespaces and all the rest of the elements that make up Linux containerization.

Rootless by Default

One huge reason for the longstanding popularity of the system users concept is that classic Unix (and then Linux) servers started all daemons as root as part of the boot process, even if the service had no need for root privileges. When there was good cause to start as root — as with Apache binding to port 80 — a well-designed daemon would drop root privilege as soon as it could.

And what would it drop to? The system user you configured it to use, of course.

While all of that can still be done in the modern Podman world, the pressures that made it the primary path no longer exist.

First systemd came along with the concept of user services, allowing service startup to be delayed until the system reached multi-user stage, or even until after the user logged in. Second, because such services run under a regular user account, the damage they can do is inherently limited. Finally, Podman came along and added all of what we’re discussing in this article.

Podman’s Quadlet feature lets us combine both capabilities: start a service in the background on boot as a normal user, but under an isolated userns. This not only provides every bit of the security the ancient system user concept was meant to provide, it provides more. Containerized processes…

…start with SELinux labels applied that normal user processes do not;
…have a default seccomp profile applied which denies access to syscalls deemed unnecessary; and
…have most of the capabilities stripped from the faux “root” user before the image’s entrypoint even launches.
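As a rough illustration of the Quadlet arrangement, the sketch below writes a minimal .container unit that asks for an automatic user namespace. The helper name write_quadlet, the unit name sleeper, the image, and the command are all placeholders, and the commented steps assume a systemd-based Linux host with Quadlet's usual per-user search directory.

```shell
# write_quadlet DIR
# Hypothetical helper: writes a minimal Quadlet unit, sleeper.container,
# into DIR. The unit name, image, and command are placeholders.
write_quadlet() {
    mkdir -p "$1" &&
    cat > "$1/sleeper.container" <<'EOF'
[Unit]
Description=Placeholder service in its own automatic user namespace

[Container]
Image=docker.io/library/alpine:latest
Exec=sleep 3600
UserNS=auto

[Install]
WantedBy=default.target
EOF
}

# Possible use for a rootless service that starts at boot:
#   write_quadlet "$HOME/.config/containers/systemd"
#   systemctl --user daemon-reload
#   systemctl --user start sleeper.service
#   loginctl enable-linger "$USER"   # let user services run without a login
```

The UserNS=auto line is the Quadlet counterpart of the --userns=auto flag discussed earlier; everything else is ordinary systemd user-service plumbing.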

Secure by Default, Porous by Configuration

Linux namespaces are not an all-or-nothing proposition.

We have already seen one major aspect of this: there are multiple namespaces, allowing one to erect certain barriers while leaving others down. Podman makes good use of this itself in its eponymous “pod” feature. By default, containers in a pod share a network namespace while having different pidns and userns, allowing them to communicate via TCP and UDP but not interfere with each other otherwise.

Another aspect is that each namespace is configurable through Podman, giving you a measure of control over the “dimensions” of each barrier. This is not the place to get into details; suffice it to say that your choice is generally not between having the barrier or not. Search the docs for “namespace” to get an idea of the level of control Podman gives you over this aspect of its internal operation.

¹ Originally, the ceiling was set as low as 99, but the popularity of this concept resulted in enough demand for these reserved user IDs that this limit was eventually raised in mainstream Linux distros first to 499, then 999. macOS stopped at 500, resulting in the 501 you can see for my host-side user ID in command output elsewhere in this article.
² Said container certainly cannot see PID 1 on the true host in my case, macOS’s /sbin/launchd.
³ …which might not exist, as with the podman machine case.
⁴ This is a particular worry with containers since old versions of containerd used an abstract socket, as does DBus to this day. This can allow powerful effects which are off-topic for this article, so let me simply say that you should avoid use of --network=host if a primary goal of your use of containerization is improved security.
⁵ Full details of the default are more complicated.
⁶ Docker has a vaguely similar feature, being the --userns-remap flag on the background container engine. The primary negative consequence of this design is that it affects all containers on that system. Podman's daemonless nature allows every container to arrange UID remapping separately, per each container's needs.
⁷ It allows for roughly 2 million containers running simultaneously. You are likely to run out of system resources before you can hit that limit.
⁸ One may say podman machine init on a Linux host as well, if one would like to follow along without moving over to a macOS or Windows box.