full diff: https://github.com/opencontainers/runc/compare/v1.0.0-rc92...v1.0.0-rc93
release notes: https://github.com/opencontainers/runc/releases/tag/v1.0.0-rc93
Release notes for runc v1.0.0-rc93
-------------------------------------------------
This is the last feature-rich RC release and we are in a feature-freeze until
1.0. 1.0.0~rc94 will be released in a few weeks with minimal bug fixes only,
and 1.0.0 will be released soon afterwards.
- runc's cgroupv2 support is no longer considered experimental. It is now
believed to be fully ready for production deployments. In addition, runc's
cgroup code has been improved:
- The systemd cgroup driver has been improved to be more resilient and
handle more systemd properties correctly.
- We now make use of openat2(2) when possible to improve the security of
cgroup operations (in future runc will be wholesale ported to libpathrs to
get this protection in all codepaths).
- runc's mountinfo parsing code has been reworked significantly, making
container startup times significantly faster and less wasteful in general.
- runc now has special handling for seccomp profiles to avoid making new
syscalls unusable for glibc. This is done by installing a custom prefix to
all seccomp filters which returns -ENOSYS for syscalls that are newer than
any syscall in the profile (meaning they have a larger syscall number).
This should not cause any regressions (because previously users would simply
get -EPERM rather than -ENOSYS, and the rule applied above is the most
conservative rule possible) but please report any regressions you find as a
result of this change -- in particular, programs which have special fallback
code that is only run in the case of -EPERM.
- runc now supports the following new runtime-spec features:
- The umask of a container can now be specified.
- The new Linux 5.9 capabilities (CAP_PERFMON, CAP_BPF, and
CAP_CHECKPOINT_RESTORE) are now supported.
- The "unified" cgroup configuration option, which allows users to explicitly
specify the limits based on the cgroup file names rather than abstracting
them through OCI configuration. This is currently limited in scope to
cgroupv2.
- Various rootless containers improvements:
- runc will no longer cause conflicts if a user specifies a custom device
which conflicts with a user-configured device -- the user device takes
precedence.
- runc no longer panics if /sys/fs/cgroup is missing in rootless mode.
- runc --root is now always treated as local to the current working directory.
- The --no-pivot-root hardening was improved to handle nested mounts properly
(please note that we still strongly recommend that users do not use
--no-pivot-root -- it is still an insecure option).
- A large number of code cleanliness and other various cleanups, including
fairly large changes to our tests and CI to make them all run more
efficiently.
For packagers the following changes have been made which will have impact on
your packaging of runc:
- The "selinux" and "apparmor" buildtags have been removed, and now all runc
builds will have SELinux and AppArmor support enabled. Note that "seccomp"
is still optional (though we very highly recommend you enable it).
- make install DESTDIR= now functions correctly.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
(cherry picked from commit 28e5a3c5a4)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Commit edb62a3ace fixed a bug in MkdirAllAndChown()
that caused the specified permissions to not be applied correctly. As a result
of that bug, the configured umask would be applied.
When extracting archives, Unpack() used 0777 permissions when creating missing
parent directories for files that were extracted.
Before edb62a3ace, this resulted in actual
permissions of those directories to be 0755 on most configurations (using a
default 022 umask).
Creating these directories should not depend on the host's umask configuration.
This patch changes the permissions to 0755 to match the previous behavior,
and to reflect the original intent of using 0755 as default.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
(cherry picked from commit 25ada76437)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
v0.13.1
- Refactor `ParsePortSpec` to handle IPv6 addresses, and improve validation
v0.13.0
- `rootlesskit --pidns`: fix propagating exit status
- Support cgroup2 evacuation, e.g., `systemd-run -p Delegate=yes --user -t rootlesskit --cgroupns --pidns --evacuate-cgroup2=evac --net=slirp4netns bash`
v0.12.0
- Port forwarding API now supports setting `ChildIP`
- The `vendor` directory is no longer included in this repo. Run `go mod vendor` if you need
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
(cherry picked from commit e32ae1973a)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Fix docker0 subnet missmatch when running from docker in docker (dind)
Signed-off-by: Alexis Ries <ries.alexis@gmail.com>
(cherry picked from commit 96e103feb1)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Check if the `docker build` completed successfully before continuing.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
(cherry picked from commit fa480403c7)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This currently doesn't make a difference, because load.FrozenImagesLinux()
currently loads all frozen images, not just the specified one, but in case
that is fixed/implemented at some point.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
(cherry picked from commit 26965fbfa0)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Thanks to Stefan Scherer for setting up the Jenkins nodes.
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
(cherry picked from commit c23b99f4db)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Use the image build from Dockerfile.simple to build docker binary failed
with not find <brtfs/ioctl.h>, we need to install libbtrfs-dev to fix this.
```
Building: bundles/dynbinary-daemon/dockerd-dev
GOOS="" GOARCH="" GOARM=""
.gopath/src/github.com/docker/docker/daemon/graphdriver/btrfs/btrfs.go:8:10: fatal error: btrfs/ioctl.h: No such file or directory
#include <btrfs/ioctl.h>
```
Signed-off-by: Lei Jitang <leijitang@outlook.com>
(cherry picked from commit dd7ee8ea3e)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
run docker-py integration tests of the latest release;
full diff: https://github.com/docker/docker-py/compare/4.3.0...4.4.1
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
(cherry picked from commit 14fb165085)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Commit f2f5106c92 added this test to verify loading
of images that were built with user-namespaces enabled.
However, because this test spins up a new daemon, not the daemon that's set up by
the test-suite's `TestMain()` (which loads the frozen images).
As a result, the `debian:bullseye` image was pulled from Docker Hub when running
the test;
Calling POST /v1.41/images/load?quiet=1
Applying tar in /go/src/github.com/docker/docker/bundles/test-integration/TestBuildUserNamespaceValidateCapabilitiesAreV2/d4d366b15997b/root/165536.165536/overlay2/3f7f9375197667acaf7bc810b34689c21f8fed9c52c6765c032497092ca023d6/diff" storage-driver=overlay
Applied tar sha256:845f0e5159140e9dbcad00c0326c2a506fbe375aa1c229c43f082867d283149c to 3f7f9375197667acaf7bc810b34689c21f8fed9c52c6765c032497092ca023d6, size: 5922359
Calling POST /v1.41/build?buildargs=null&cachefrom=null&cgroupparent=&cpuperiod=0&cpuquota=0&cpusetcpus=&cpusetmems=&cpushares=0&dockerfile=&labels=null&memory=0&memswap=0&networkmode=&rm=0&shmsize=0&t=capabilities%3A1.0&target=&ulimits=null&version=
Trying to pull debian from https://registry-1.docker.io v2
Fetching manifest from remote" digest="sha256:f169dbadc9021fc0b08e371d50a772809286a167f62a8b6ae86e4745878d283d" error="<nil>" remote="docker.io/library/debian:bullseye
Pulling ref from V2 registry: debian:bullseye
...
This patch updates `TestBuildUserNamespaceValidateCapabilitiesAreV2` to load the
frozen image. `StartWithBusybox` is also changed to `Start`, because the test
is not using the busybox image, so there's no need to load it.
In a followup, we should probably add some utilities to make this easier to set up
(and to allow passing the list frozen images that we want to load, without having
to "hard-code" the image name to load).
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
(cherry picked from commit 46dfc31342)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Using `d.Kill()` with rootless mode causes the restarted daemon to not
be able to start containerd (it times out).
Originally this was SIGKILLing the daemon because we were hoping to not
have to manipulate on disk state, but since we need to anyway we can
shut it down normally.
I also tested this to ensure the test fails correctly without the fix
that the test was added to check for.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
(cherry picked from commit e6591a9c7a)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
(cherry picked from commit 00225e220f)
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
The following warnings in `docker info` are now discarded,
because there is no action user can actually take.
On cgroup v1:
- "WARNING: No blkio weight support"
- "WARNING: No blkio weight_device support"
On cgroup v2:
- "WARNING: No kernel memory TCP limit support"
- "WARNING: No oom kill disable support"
`docker run` still prints warnings when the missing feature is being attempted to use.
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
(cherry picked from commit 8086443a44)
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
Fix#41803
Also attempt to mknod devices.
Mknodding devices are likely to fail, but still worth trying when
running with a seccomp user notification.
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
(cherry picked from commit d5d5cccb7e)
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
Now `systemctl --user stop docker` completes just with in 1 or 2 seconds.
Fix issue 41944 ("Docker rootless does not exit properly if containers are running")
See systemd.kill(5) https://www.freedesktop.org/software/systemd/man/systemd.kill.html
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
(cherry picked from commit 05566adf71)
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
Various dirs in /var/lib/docker contain data that needs to be mounted
into a container. For this reason, these dirs are set to be owned by the
remapped root user, otherwise there can be permissions issues.
However, this uneccessarily exposes these dirs to an unprivileged user
on the host.
Instead, set the ownership of these dirs to the real root (or rather the
UID/GID of dockerd) with 0701 permissions, which allows the remapped
root to enter the directories but not read/write to them.
The remapped root needs to enter these dirs so the container's rootfs
can be configured... e.g. to mount /etc/resolve.conf.
This prevents an unprivileged user from having read/write access to
these dirs on the host.
The flip side of this is now any user can enter these directories.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
The remapped root does not need access to this dir.
Having this owned by the remapped root opens the host up to an
uprivileged user on the host being able to escalate privileges.
While it would not be normal for the remapped UID to be used outside of
the container context, it could happen.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
Generally if we ever need to change perms of a dir, between versions,
this ensures the permissions actually change when we think it should
change without having to handle special cases if it already existed.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
Adds a test case for the case where dockerd gets stuck on startup due to
hanging `daemon.shutdownContainer`
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
full diff: https://github.com/opencontainers/selinux/compare/v1.6.0...v1.7.0
- Implement get_default_context_with_level() from libselinux
- Wrap some syscalls (lgetattr, lsetattr, fstatfs, statfs) to retry on EINTR.
- Improve code quality by turning fixing many problems found by linters
- Use bufio.Scanner for parsing labels and policy confilabelg
- Cache the value for SELinux policy directory
- test on ppc64le and go 1.15
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Signal systemd when we start shutting down to complement the "READY" notify
that was originally implemented in 97088ebef7
From [sd_notify(3)](https://www.freedesktop.org/software/systemd/man/sd_notify.html#STOPPING=1)
> STOPPING=1
> Tells the service manager that the service is beginning its shutdown. This is useful
> to allow the service manager to track the service's internal state, and present it to
> the user.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This is a fix for https://github.com/docker/for-linux/issues/1012.
The code was not considering that C strings are NULL-terminated so
we need to leave one extra byte.
Without this fix, the testcase in https://github.com/docker/for-linux/issues/1012
fails with
```
Step 61/1001 : RUN echo 60 > 60
---> Running in dde85ac3b1e3
Removing intermediate container dde85ac3b1e3
---> 80a12a18a241
Step 62/1001 : RUN echo 61 > 61
error creating overlay mount to /23456789112345678921234/overlay2/d368abcc97d6c6ebcf23fa71225e2011d095295d5d8c9b31d6810bea748bdf07-init/merged: no such file or directory
```
with the output of `dmesg -T` as:
```
[Sat Dec 19 02:35:40 2020] overlayfs: failed to resolve '/23456789112345678921234/overlay2/89e435a1b24583c463abb73e8abfad8bf8a88312ef8253455390c5fa0a765517-init/wor': -2
```
with this fix, you get the expected:
```
Step 126/1001 : RUN echo 125 > 125
---> Running in 2f2e56da89e0
max depth exceeded
```
Signed-off-by: Oscar Bonilla <6f6231@gmail.com>
Previous startup sequence used to call "containerStop" on containers that were persisted with a running state but are not alive when restarting (can happen on non-clean shutdown).
This call was made before fixing-up the RunningState of the container, and tricked the daemon to trying to kill a non-existing process and ultimately hang.
The fix is very simple - just add a condition on calling containerStop.
Signed-off-by: Simon Ferquel <simon.ferquel@docker.com>
Capabilities are serialised in VFS_CAP_REVISION_3 when an image is
built in a user-namespaced daemon, instead of VFS_CAP_REVISION_2.
This adds a test for this, though it's currently wired to fail if
the capabilities are serialised in VFS_CAP_REVISION_2 instead in this
situation, since this is unexpected.
Signed-off-by: Eric Mountain <eric.mountain@datadoghq.com>