We have upgraded runc to rc93 and added CI for cgroup 2.
So we can move cgroup v2 out of experimental.
Fix issue 41916
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
(cherry picked from commit 1d2a660093)
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
Fix issue 41762
Cherry-pick "drivers: btrfs: Allow unprivileged user to delete subvolumes" from containers/storage
831e32b6bd
> In btrfs, subvolume can be deleted by IOC_SNAP_DESTROY ioctl but there
> is one catch: unprivileged IOC_SNAP_DESTROY call is restricted by default.
>
> This is because IOC_SNAP_DESTROY only performs permission checks on
> the top directory(subvolume) and unprivileged user might delete dirs/files
> which cannot be deleted otherwise. This restriction can be relaxed if
> user_subvol_rm_allowed mount option is used.
>
> Although the above ioctl had been the only way to delete a subvolume,
> btrfs now allows deletion of subvolume just like regular directory
> (i.e. rmdir sycall) since kernel 4.18.
>
> So if we fail to cleanup subvolume in subvolDelete(), just fallback to
> system.EnsureRmoveall() to try to cleanup subvolumes again.
> (Note: quota needs privilege, so if quota is enabled we do not fallback)
>
> This fix will allow non-privileged container works with btrfs backend.
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
(cherry picked from commit 62b5194f62)
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
overlay2 no longer sets `archive.OverlayWhiteoutFormat` when
running in UserNS, so we can remove the complicated logic in the
archive package.
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
(cherry picked from commit 6322dfc217)
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
When running in userns, returns error (i.e. "use naive, not native")
immediately.
No substantial change to the logic.
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
(cherry picked from commit 67aa418df2)
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
Previously, `d.naiveDiff.ApplyDiff` was not used even when
`useNaiveDiff()==true`
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
(cherry picked from commit dd97134232)
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
The following was failing previously, because `getUnprivilegedMountFlags()` was not called:
```console
$ sudo mount -t tmpfs -o noexec none /tmp/foo
$ $ docker --context=rootless run -it --rm -v /tmp/foo:/mnt:ro alpine
docker: Error response from daemon: OCI runtime create failed: container_linux.go:367: starting container process caused: process_linux.go:520: container init caused: rootfs_linux.go:60: mounting "/tmp/foo" to rootfs at "/home/suda/.local/share/docker/overlay2/b8e7ea02f6ef51247f7f10c7fb26edbfb308d2af8a2c77915260408ed3b0a8ec/merged/mnt" caused: operation not permitted: unknown.
```
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
(cherry picked from commit 248f98ef5e)
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
Tonis mentioned that we can run into issues if there is more error
handling added here. This adds a custom reader implementation which is
like io.MultiReader except it does not cache EOF's.
What got us into trouble in the first place is `io.MultiReader` will
always return EOF once it has received an EOF, however the error
handling that we are going for is to recover from an EOF because the
underlying file is a file which can have more data added to it after
EOF.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
(cherry picked from commit 5a664dc87d)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
When the multireader hits EOF, we will always get EOF from it, so we
cannot store the multrireader fro later error handling, only for the
decoder.
Thanks @tobiasstadler for pointing this error out.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
(cherry picked from commit 4be98a38e7)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
(cherry picked from commit a8008f7313)
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
The "userxattr" option is needed for mounting overlayfs inside a user namespace with kernel >= 5.11.
The "userxattr" option is NOT needed for the initial user namespace (aka "the host").
Also, Ubuntu (since circa 2015) and Debian (since 10) with kernel < 5.11 can mount the overlayfs in a user namespace without the "userxattr" option.
The corresponding kernel commit: 2d2f2d7322ff43e0fe92bf8cccdc0b09449bf2e1
> **ovl: user xattr**
>
> Optionally allow using "user.overlay." namespace instead of "trusted.overlay."
> ...
> Disable redirect_dir and metacopy options, because these would allow privilege escalation through direct manipulation of the
> "user.overlay.redirect" or "user.overlay.metacopy" xattrs.
Fix issue 42055
Related to containerd/containerd PR 5076
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
(cherry picked from commit 11ef8d3ba9)
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
- Using "/go/" redirects for some topics, which allows us to
redirect to new locations if topics are moved around in the
documentation.
- Updated some old URLs to their new location.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
(cherry picked from commit 328de0b8d9)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Wrap platforms.Only and fallback to our ignore mismatches due to empty
CPU variants. This just cleans things up and makes the logic re-usable
in other places.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
(cherry picked from commit 50f39e7247)
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
In some cases, in fact many in the wild, an image may have the incorrect
platform on the image config.
This can lead to failures to run an image, particularly when a user
specifies a `--platform`.
Typically what we see in the wild is a manifest list with an an entry
for, as an example, linux/arm64 pointing to an image config that has
linux/amd64 on it.
This change falls back to looking up the manifest list for an image to
see if the manifest list shows the image as the correct one for that
platform.
In order to accomplish this we need to traverse the leases associated
with an image. Each image, if pulled with Docker 20.10, will have the
manifest list stored in the containerd content store with the resource
assigned to a lease keyed on the image ID.
So we look up the lease for the image, then look up the assocated
resources to find the manifest list, then check the manifest list for a
platform match, then ensure that manifest referes to our image config.
This is only used as a fallback when a user specified they want a
particular platform and the image config that we have does not match
that platform.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
(cherry picked from commit 4be5453215)
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
When pulling an image by platform, it is possible for the image's
configured platform to not match what was in the manifest list.
The image itself is buggy because either the manifest list is incorrect
or the image config is incorrect. In any case, this is preventing people
from upgrading because many times users do not have control over these
buggy images.
This was not a problem in 19.03 because we did not compare on platform
before. It just assumed if we had the image it was the one we wanted
regardless of platform, which has its own problems.
Example Dockerfile that has this problem:
```Dockerfile
FROM --platform=linux/arm64 k8s.gcr.io/build-image/debian-iptables:buster-v1.3.0
RUN echo hello
```
This fails the first time you try to build after it finishes pulling but
before performing the `RUN` command.
On the second attempt it works because the image is already there and
does not hit the code that errors out on platform mismatch (Actually it
ignores errors if an image is returned at all).
Must be run with the classic builder (DOCKER_BUILDKIT=0).
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
(cherry picked from commit 399695305c)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This fixes a panic when an admin specifies a custom default runtime,
when a plugin is started the shim config is nil.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
(cherry picked from commit 2903863a1d)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Loggers that implement BufSize() (e.g. awslogs) uses the method to
tell Copier about the maximum log line length. However loggerWithCache
and RingBuffer hide the method by wrapping loggers.
As a result, Copier uses its default 16KB limit which breaks log
lines > 16kB even the destinations can handle that.
This change implements BufSize() on loggerWithCache and RingBuffer to
make sure these logger wrappes don't hide the method on the underlying
loggers.
Fixes#41794.
Signed-off-by: Kazuyoshi Kato <katokazu@amazon.com>
(cherry picked from commit bb11365e96)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
(cherry picked from commit 00225e220f)
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
The following warnings in `docker info` are now discarded,
because there is no action user can actually take.
On cgroup v1:
- "WARNING: No blkio weight support"
- "WARNING: No blkio weight_device support"
On cgroup v2:
- "WARNING: No kernel memory TCP limit support"
- "WARNING: No oom kill disable support"
`docker run` still prints warnings when the missing feature is being attempted to use.
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
(cherry picked from commit 8086443a44)
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
Various dirs in /var/lib/docker contain data that needs to be mounted
into a container. For this reason, these dirs are set to be owned by the
remapped root user, otherwise there can be permissions issues.
However, this uneccessarily exposes these dirs to an unprivileged user
on the host.
Instead, set the ownership of these dirs to the real root (or rather the
UID/GID of dockerd) with 0701 permissions, which allows the remapped
root to enter the directories but not read/write to them.
The remapped root needs to enter these dirs so the container's rootfs
can be configured... e.g. to mount /etc/resolve.conf.
This prevents an unprivileged user from having read/write access to
these dirs on the host.
The flip side of this is now any user can enter these directories.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
The remapped root does not need access to this dir.
Having this owned by the remapped root opens the host up to an
uprivileged user on the host being able to escalate privileges.
While it would not be normal for the remapped UID to be used outside of
the container context, it could happen.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
Adds a test case for the case where dockerd gets stuck on startup due to
hanging `daemon.shutdownContainer`
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
This is a fix for https://github.com/docker/for-linux/issues/1012.
The code was not considering that C strings are NULL-terminated so
we need to leave one extra byte.
Without this fix, the testcase in https://github.com/docker/for-linux/issues/1012
fails with
```
Step 61/1001 : RUN echo 60 > 60
---> Running in dde85ac3b1e3
Removing intermediate container dde85ac3b1e3
---> 80a12a18a241
Step 62/1001 : RUN echo 61 > 61
error creating overlay mount to /23456789112345678921234/overlay2/d368abcc97d6c6ebcf23fa71225e2011d095295d5d8c9b31d6810bea748bdf07-init/merged: no such file or directory
```
with the output of `dmesg -T` as:
```
[Sat Dec 19 02:35:40 2020] overlayfs: failed to resolve '/23456789112345678921234/overlay2/89e435a1b24583c463abb73e8abfad8bf8a88312ef8253455390c5fa0a765517-init/wor': -2
```
with this fix, you get the expected:
```
Step 126/1001 : RUN echo 125 > 125
---> Running in 2f2e56da89e0
max depth exceeded
```
Signed-off-by: Oscar Bonilla <6f6231@gmail.com>
Previous startup sequence used to call "containerStop" on containers that were persisted with a running state but are not alive when restarting (can happen on non-clean shutdown).
This call was made before fixing-up the RunningState of the container, and tricked the daemon to trying to kill a non-existing process and ultimately hang.
The fix is very simple - just add a condition on calling containerStop.
Signed-off-by: Simon Ferquel <simon.ferquel@docker.com>
These tests fail when run by a non-root user
=== RUN TestTmpfsDevShmNoDupMount
oci_linux_test.go:29: assertion failed: error is not nil: mkdir /var/lib/docker: permission denied
--- FAIL: TestTmpfsDevShmNoDupMount (0.00s)
=== RUN TestIpcPrivateVsReadonly
oci_linux_test.go:29: assertion failed: error is not nil: mkdir /var/lib/docker: permission denied
--- FAIL: TestIpcPrivateVsReadonly (0.00s)
=== RUN TestSysctlOverride
oci_linux_test.go:29: assertion failed: error is not nil: mkdir /var/lib/docker: permission denied
--- FAIL: TestSysctlOverride (0.00s)
=== RUN TestSysctlOverrideHost
oci_linux_test.go:29: assertion failed: error is not nil: mkdir /var/lib/docker: permission denied
--- FAIL: TestSysctlOverrideHost (0.00s)
Signed-off-by: Arnaud Rebillout <elboulangero@gmail.com>
While working on some other code, noticed a bug in the jobs code. We're
adding job version after we're checking if there are port configs.
Before, if there were no port configs, the job version would be missing,
because we would return before trying to convert.
This moves the jobs version conversion above that code, so we don't
accidentally return before it.
Signed-off-by: Drew Erny <derny@mirantis.com>
Also move c.Lock() below containerd delete task, as it doesn't seem that
there is any necessity to hold the container lock while containerd is
killing the task.
This fixes a potential edge-case where containerd delete task hangs, and
thereafter all operations on the container would hang forever, as this
function is holding onto the container lock.
Signed-off-by: Cam <gh@sparr.email>