Podman Solutions

User-Based Privilege Separation

Motivation

There is an ancient practice in the Unix world where each service gets its own “user.” The practice is so old that when it was a new idea, these “system users” got intermixed with the real human kind, resulting in different system user IDs on each box depending on the order in which the users were created. Because this caused problems with NFS, OSes began reserving some number of the low-numbered IDs[1] for themselves, with fixed assignments below that limit and the real human users starting above it.

Podman obsoletes all of this.

Do not combine the two.

Why? Read on.

User Namespaces

“Linux container” is a wrapper term for a bunch of disconnected technologies which tools like Podman combine into a useful whole. While it aids initial understanding to think of this assortment of underlying features as if there had been a single concerted effort to add Containerization™ to the Linux kernel, the actual fact is that the pieces were each added separately over a span of many years, in some cases for purposes quite separate from what we now think of as the modern Linux containers infrastructure.

I bring this up because it can be important to understand the elements on their own merits. In this article’s specific case, Linux’s namespaces feature functionally obsoletes the old “system users” practice, specifically via user namespaces — userns for short — which isolate privilege based on user ID, same as occurs with single-purpose system users.

Default Behavior

Consider this:

$ id=$(podman run --rm -d alpine sleep 60)
$ podman top $id user huser
USER        HUSER
root        501

The first command merely starts a dummy container for us to examine, which will disappear a minute after we started it.

The second then tells Podman to report the user IDs involved, which shows this rootless container running under my host-side user ID even as it appears to be running as root inside.

Because this is a fake root user with less privilege than our host-side user, we’ve already gained most of the protection afforded by the ancient “system users” concept. The assortment of technologies brought to bear by Podman under the label “containerization” ensures that this sleep 60 container cannot…

…access the host-side user’s home directory

To allow that, we would have had to pass something like --volume $HOME:/home/host --workdir /home/host, as tools like Distrobox go out of their way to do, on purpose.

…send signals to host-side processes

Unless you tell it otherwise, Podman puts each container into a separate pidns, which you can see with:

$ podman run --rm -it alpine ps -eaf
PID   USER     TIME  COMMAND
    1 root      0:00 ps -eaf

Although we gave ps the “show me everything” flags, the only process we see is the one for the ps instance running inside that pidns. The kernel is not fooled by our fake “root” user into exceeding that limitation’s bounds.

Also note that this lone process was assigned PID 1 within this namespace, whereas the real PID 1 is the /usr/lib/systemd/systemd instance owned by the CoreOS-based podman machine this ephemeral container briefly ran under.[2]

…communicate with background processes on the host

This one isn’t hard-and-fast. Rootless Podman’s default configuration blocks some of the common IPC methods:

But it does not block it all! There are two other major IPC options a Linux background process might employ:

If you are looking at the above list and thinking “Aha, Podman isn’t so great after all!” please do realize that the ancient “system users” concept doesn’t block these IPC channels, either.

Automatic Unique User Namespace

Everything above applies to the straight-line default --userns=host case.[5] When our goal is to gain isolation akin to the ancient system users concept, the flag’s other possible values are useful.

Most directly on-point for the purposes of this article is --userns=auto. This is a Podman-specific extension[6] which makes use of preexisting features in Linux, subordinate UIDs and GIDs. Essentially, this manufactures a per-container user on the fly, one which has no connection to the host-side user.

Boom! 💥 System users are fully obsolete now.

There are a few common problems that result from using this option:

Running out of IDs

The standard advice for the number of subuids to reserve for each user is roughly 64k, but up at the top of this article we pointed out that regular user IDs typically start at 1000 on Linux. Combined, these facts mean each container using this feature chews through ~1k of your subuid allotment because of the way the mechanism works. The end result is that if your system has sixty-some containers running with --userns=auto, you can run out of subuids.
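The arithmetic behind that “sixty-some” figure can be sketched in a few lines of shell. The max_auto_containers helper is a hypothetical name of my own, and the 1024 IDs per container is the rough figure from the paragraph above, not a value Podman guarantees:

```shell
# Hypothetical helper: how many --userns=auto containers fit in an allotment.
#   $1 = number of subordinate IDs available
#   $2 = number of IDs each container consumes
max_auto_containers() {
    echo $(( $1 / $2 ))
}

# The standard 64k allotment, at roughly 1k IDs per container:
max_auto_containers 65536 1024   # prints 64
```

To plug in your own numbers, the third colon-separated field of your line in /etc/subuid is the size of your allotment.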

It can even happen with one container. UIDs were once 16-bit values on Linux, and some systems used small negative UIDs for special cases. UID -2 is the nobody user, which becomes 65534 when cast from a two’s complement signed integer to an unsigned 16-bit integer, which means you can chew through the standard 64k allotment in a single gulp when you map nobody into a container.
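If the cast is not obvious, shell arithmetic can reproduce it. This is only an illustration of the integer conversion, with uid16 being a made-up helper name; it does not query Podman or the kernel:

```shell
# Reinterpret a (possibly negative) UID as an unsigned 16-bit value
# by keeping only its low 16 bits.
uid16() {
    echo $(( $1 & 0xFFFF ))
}

nobody=$(uid16 -2)
echo "$nobody"            # prints 65534

# Mapping IDs 0 through 65534 takes one more ID than the highest value:
echo $(( nobody + 1 ))    # prints 65535
```

That second number is why a single container that needs nobody can eat nearly the whole 65536-ID allotment by itself.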

This problem goes away when you are able to start the container as root, because doing so causes Podman to look for a “containers” entry in /etc/subuid, which the docs recommend that you give 2^31 subuids, as close to “infinite” as matters here:[7]

containers:2147483647:2147483648

Because this scheme gives each container a separate UID range, they are effectively rootless even though they were started as root.

Better, each container gets an independent userns this way, unlike running multiple containers as rootless under a single user account. This prevents independent containers from being able to attack each other via any resource path protected under the userns umbrella.

Ever-Changing IDs

The second problem people commonly run into with --userns=auto is that the IDs each container gets are contingent on the start-up order, which isn’t strictly predictable even for rootful services started at boot by Quadlets. Even if you have arranged matters to ensure that your startup services come up in the same order 100% of the time, that has a chance of breaking if you ever have to restart a service after boot, because that opens a chance that the restarted service will get a different block of IDs.

Because the internal IDs remain the same under all conditions, we have to ask why it matters which external IDs they map to. It doesn’t always. If your container is providing a simple compute service (a so-called “function”) or is acting as a gateway to a back-end service that has its own identity management, such as a client-server DBMS, the external user IDs are unlikely to matter. The most common case where it does matter is when your container has a persistent --volume mapped in, because the UID/GID stored for the files will vary depending on which sub[ug]id blocks the container started with.

There are a few simple solutions to this.

If you have only one ID involved — typically set as the USER value in the Containerfile — you can set the :U flag on the --volume option to relabel the files with the per-instance UID/GIDs at startup time. For a small container with few files mapped in via that volume, you won’t even notice the overhead.

If you have many thousands of files on that volume or multiple user IDs in use inside the container, a better solution might be to give the idmap flag on the volume mount:

$ mkdir -p tmp
$ sudo podman run --userns=auto:size=1024 --user daemon --rm -it -v ~/tmp:/tmp:z,idmap \
  alpine touch /tmp/x ; \
  ls -l tmp/x ; \
  rm -f tmp/x
-rw-r--r--. 1 daemon daemon 0 Aug 18 01:40 tmp/x

Without the idmap flag, that last line would show two large integers instead of the daemon user and group name, being the starting subuid value configured for “containers” in /etc/subuid + $(id -u daemon).
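As a rough sketch of that sum, here is the calculation in shell. The host_uid helper is hypothetical, and the example rests on two assumptions of mine: that the container was handed the very first block of the “containers” range, and that daemon is UID 2 inside the Alpine image. Podman may well hand out a later block, in which case the number will be larger.

```shell
# host_uid SUBUID_ENTRY CONTAINER_UID
# SUBUID_ENTRY is one line in /etc/subuid format: name:start:count
# Prints the host-side UID that CONTAINER_UID maps to, assuming the
# container's ID block begins at the start of the entry's range.
host_uid() {
    start=$(echo "$1" | cut -d: -f2)
    echo $(( start + $2 ))
}

host_uid "containers:2147483647:2147483648" 2   # prints 2147483649
```

On a real system you would substitute the output of id -u daemon from inside the container for the literal 2.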

The primary downside of this solution is that it only works for containers started as root. If it were otherwise, this mechanism would give normal users the ability to create files with arbitrary ownership, a huge security hole, which is why this limitation is enforced by the Linux kernel; the limitation is not specific to Podman.

This same property can allow a leak of internal root permissions to the outside. If the container can create a setuid root executable on a mapped-in volume and then trick a user on the host side into executing it, it will run as the host-side root user. An unqualified idmap should therefore be used with due caution. You may wish to give uids and gids options to map through only “safe” ID values.

The size=1024 bit may need to be adjusted in your particular use case. Podman is supposed to determine a good value for this, but when it cannot, it takes over the entire ID space, preventing any other containers from running, with this symptom:

Error: creating container storage: not enough unused IDs in user namespace

The “nobody” problem covered above can crop up here as well. The expedient solution is size=65535, which is more palatable when you’ve got 2.1 billion IDs to play with. See also Podman troubleshooting item #43.

Life in a World Without System Users

If you happen to be a macOS or Windows user, Podman sets up a background “machine” for you, a hidden VM running their customized version of Fedora CoreOS.[8] On such a host, try this:

$ podman machine ssh
core@localhost:~$ wc -l /etc/passwd
3 /etc/passwd
core@localhost:~$ exit
$ grep -v '^#' /etc/passwd | wc -l
     130

CoreOS has only 3 users defined: root, unbound, and core, and the only reason there are even three is that unbound is a classic “system user,” isolating privilege within the podman machine by the old ways, doubtless to avoid needing to set up multiple containers with access gated using more modern mechanisms.

My macOS host is designed on more…let us be charitable and say “classic” lines. Despite being a single-user box, it has 130 users defined! All but a few are system users, owing to the fact that macOS has a development history traceable back to BSD Unix in the early 1980s. Thankfully, it is missing classic system users like sendmail, and most of the system users it does define are named with a leading underscore to distinguish them, yet one cannot help but draw a valuable distinction here.

Podman CoreOS doesn’t need these throwbacks. It has namespaces and all the rest of the elements that make up Linux containerization.

Rootless by Default

One huge reason for the longstanding popularity of the system users concept is that classic Unix (and then Linux) servers started all daemons as root as part of the boot process, even if the service had no need for root privileges. When there was good cause to start as root — as with Apache binding to port 80 — a well-designed daemon would drop root privilege as soon as it could.

And what would it drop to? The system user you configured it to use, of course.

While all of that can still be done in the modern Podman world, the pressures that made it the primary path no longer exist.

First, systemd came along with the concept of user services, allowing service startup to be delayed until the system reached multi-user stage, or even until after the user logged in. Second, because such services run under a regular user account, the damage they can do is inherently limited. Finally, Podman came along and added all of what we’re discussing in this article.

Podman’s Quadlet feature lets us combine both capabilities: start a service in the background on boot as a normal user, but under an isolated userns. This not only provides every bit of the security the ancient system user concept was meant to provide, it provides more. Containerized processes…
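To make the Quadlet idea concrete, here is a minimal sketch that writes out a unit file for a rootless service running under its own userns. The write_quadlet helper, the sleeper.container file name, and the image and command are all placeholders of my own; Image=, Exec=, and UserNS= are standard Quadlet keys, but check them against the documentation for your Podman version.

```shell
# write_quadlet DIR
# Writes a hypothetical sleeper.container Quadlet unit into DIR.
write_quadlet() {
    mkdir -p "$1" &&
    cat > "$1/sleeper.container" <<'EOF'
[Unit]
Description=Example service isolated in its own user namespace

[Container]
Image=docker.io/library/alpine:latest
Exec=sleep infinity
UserNS=auto

[Install]
WantedBy=default.target
EOF
}

# For a rootless user service, DIR is normally ~/.config/containers/systemd:
#   write_quadlet ~/.config/containers/systemd
#   systemctl --user daemon-reload
#   systemctl --user start sleeper.service
```

For the service to come up at boot rather than at login, the owning account generally also needs lingering enabled via loginctl enable-linger.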

Secure by Default, Porous by Configuration

Linux namespaces are not an all-or-nothing proposition.

We have already seen one major aspect of this: there are multiple namespaces, allowing one to erect certain barriers while leaving others down. Podman makes good use of this itself in its eponymous “pod” feature. By default, containers in a pod share a network namespace while having different pidns and userns, allowing them to communicate via TCP and UDP but not interfere with each other otherwise.

Another aspect is that each namespace is configurable through Podman, giving you a measure of control over the “dimensions” of each barrier. This is not the place to get into details; suffice it to say that your choice is generally not between having the barrier or not. Search the docs for “namespace” to get an idea of the level of control Podman gives you over this aspect of its internal operation.

License

This work is © 2025 by Warren Young and is licensed under CC BY-NC-SA 4.0


  1. Originally, the ceiling was set as low as 99, but the popularity of this concept resulted in enough demand for these reserved user IDs that this limit was eventually raised in mainstream Linux distros first to 499, then 999. macOS stopped at 500, resulting in the 501 you can see for my host-side user ID in command output elsewhere in this article.
  2. Said container certainly cannot see PID 1 on the true host in my case, macOS’s /sbin/launchd.
  3. …which might not exist, as with the podman machine case.
  4. This is a particular worry with containers since old versions of containerd used an abstract socket, as does DBus to this day. This can allow powerful effects which are off-topic for this article, so let me simply say that you should avoid use of --network=host if a primary goal of your use of containerization is improved security.
  5. Full details of the default are more complicated.
  6. Docker has a vaguely similar feature, being the --userns-remap flag on the background container engine. The primary negative consequence of this design is that it affects all containers on that system. Podman’s daemonless nature allows every container to arrange UID remapping separately, per each container’s needs.
  7. At roughly 1k IDs per container, this allotment allows for about 2 million containers running simultaneously. You are likely to run out of system resources before you can hit that limit.
  8. One may say podman machine init on a Linux host as well, if one would like to follow along without moving over to a macOS or Windows box.