Kernel 5.11 introduced support for rootless overlayfs, but incompatible with SELinux.
On the other hand, fuse-overlayfs is compatible.
Close issue 42333
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
These tests would panic;
- in WithRLimits(), because HostConfig was not set;
470ae8422f/daemon/oci_linux.go (L46-L47)
- in daemon.mergeUlimits(), because daemon.configStore was not set;
470ae8422f/daemon/oci_linux.go (L1069)
This panic was not discovered because the current version of runc/libcontainer that we vendor
would not always return false for `apparmor.IsEnabled()` when running docker-in-docker or if
`apparmor_parser` is not found. Starting with v1.0.0-rc93 of libcontainer, this is no longer
the case (changed in bfb4ea1b1b)
This patch;
- changes the tests to initialize Daemon.configStore and Container.HostConfig
- Combines TestExecSetPlatformOpt and TestExecSetPlatformOptPrivileged into a new test
(TestExecSetPlatformOptAppArmor)
- Runs the test both if AppArmor is enabled and if not (in which case it tests
that the container's AppArmor profile is left empty).
- Adds a FIXME comment for a possible bug in execSetPlatformOpts, which currently
prefers custom profiles over "privileged".
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Whether or not the command path is in the error message is a an
implementation detail.
For example, on Windows the only reason this ever matched was because it
dumped the entire container config into the error message, but this had
nothing to do with the actual error.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
this refactors the Stop command to fix a few issues and behaviors that
dont seem completely correct:
1. first it fixes a situation where stop could hang forever (#41579)
2. fixes a behavior where if sending the
stop signal failed, then the code directly sends a -9 signal. If that
fails, it returns without waiting for the process to exit or going
through the full docker kill codepath.
3. fixes a behavior where if sending the stop signal failed, then the
code sends a -9 signal. If that succeeds, then we still go through the
same stop waiting process, and may even go through the docker kill path
again, even though we've already sent a -9.
4. fixes a behavior where the code would wait the full 30 seconds after
sending a stop signal, even if we already know the stop signal failed.
fixes#41579
Signed-off-by: Cam <gh@sparr.email>
Before this change, cleanup of the btrfs driver (occuring on each daemon
shutdown) resulted in disabling quotas. It was done with an assumption
that quotas can be enabled or disabled on a subvolume level, which is
not true - enabling or disabling quota is always done on a filesystem
level.
That was leading to disabling quota on btrfs filesystems on each daemon
shutdown.
This change fixes that behavior and removes misleading `subvol` prefix
from functions and methods which set up quota (on a filesystem level).
Fixes: #34593
Fixes: 401c8d1767 ("Add disk quota support for btrfs")
Signed-off-by: Michal Rostecki <mrostecki@opensuse.org>
The runc/libcontainer apparmor package on master no longer checks if apparmor_parser
is enabled, or if we are running docker-in-docker.
While those checks are not relevant to runc (as it doesn't load the profile), these
checks _are_ relevant to us (and containerd). So switching to use the containerd
apparmor package, which does include the needed checks.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Some tests were using domain names that were intended to be "fake", but are
actually registered domain names (such as domain.com, registry.com, mytest.com).
Even though we were not actually making connections to these domains, it's
better to use domains that are designated for testing/examples in RFC2606:
https://tools.ietf.org/html/rfc2606
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
The following was failing previously, because `getUnprivilegedMountFlags()` was not called:
```console
$ sudo mount -t tmpfs -o noexec none /tmp/foo
$ $ docker --context=rootless run -it --rm -v /tmp/foo:/mnt:ro alpine
docker: Error response from daemon: OCI runtime create failed: container_linux.go:367: starting container process caused: process_linux.go:520: container init caused: rootfs_linux.go:60: mounting "/tmp/foo" to rootfs at "/home/suda/.local/share/docker/overlay2/b8e7ea02f6ef51247f7f10c7fb26edbfb308d2af8a2c77915260408ed3b0a8ec/merged/mnt" caused: operation not permitted: unknown.
```
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
overlay2 no longer sets `archive.OverlayWhiteoutFormat` when
running in UserNS, so we can remove the complicated logic in the
archive package.
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
When running in userns, returns error (i.e. "use naive, not native")
immediately.
No substantial change to the logic.
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
Fix issue 41762
Cherry-pick "drivers: btrfs: Allow unprivileged user to delete subvolumes" from containers/storage
831e32b6bd
> In btrfs, subvolume can be deleted by IOC_SNAP_DESTROY ioctl but there
> is one catch: unprivileged IOC_SNAP_DESTROY call is restricted by default.
>
> This is because IOC_SNAP_DESTROY only performs permission checks on
> the top directory(subvolume) and unprivileged user might delete dirs/files
> which cannot be deleted otherwise. This restriction can be relaxed if
> user_subvol_rm_allowed mount option is used.
>
> Although the above ioctl had been the only way to delete a subvolume,
> btrfs now allows deletion of subvolume just like regular directory
> (i.e. rmdir sycall) since kernel 4.18.
>
> So if we fail to cleanup subvolume in subvolDelete(), just fallback to
> system.EnsureRmoveall() to try to cleanup subvolumes again.
> (Note: quota needs privilege, so if quota is enabled we do not fallback)
>
> This fix will allow non-privileged container works with btrfs backend.
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
daemon.getExecConfig() already returns typed errors; by wrapping those errors
we may loose the actual reason for failures. Changing the error-type was
originally added in 2d43d93410, but I think
it was not intentional to ignore already-typed errors. It was later refactored
in a793564b25, which added helper functions
to create these errors, but kept the same behavior.
Also adds error-handling to prevent a panic in situations where (although
unlikely) `daemon.containers.Get()` would not return a container.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This makes daemon.getExecConfig return a errdefs.Conflict() error if the
container is not running.
This was originally the case, but a refactor of this code changed the typed
error (`derr.ErrorCodeContainerNotRunning`) to a non-typed error;
a793564b25
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Tonis mentioned that we can run into issues if there is more error
handling added here. This adds a custom reader implementation which is
like io.MultiReader except it does not cache EOF's.
What got us into trouble in the first place is `io.MultiReader` will
always return EOF once it has received an EOF, however the error
handling that we are going for is to recover from an EOF because the
underlying file is a file which can have more data added to it after
EOF.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
When the multireader hits EOF, we will always get EOF from it, so we
cannot store the multrireader fro later error handling, only for the
decoder.
Thanks @tobiasstadler for pointing this error out.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
The "userxattr" option is needed for mounting overlayfs inside a user namespace with kernel >= 5.11.
The "userxattr" option is NOT needed for the initial user namespace (aka "the host").
Also, Ubuntu (since circa 2015) and Debian (since 10) with kernel < 5.11 can mount the overlayfs in a user namespace without the "userxattr" option.
The corresponding kernel commit: 2d2f2d7322ff43e0fe92bf8cccdc0b09449bf2e1
> **ovl: user xattr**
>
> Optionally allow using "user.overlay." namespace instead of "trusted.overlay."
> ...
> Disable redirect_dir and metacopy options, because these would allow privilege escalation through direct manipulation of the
> "user.overlay.redirect" or "user.overlay.metacopy" xattrs.
Fix issue 42055
Related to containerd/containerd PR 5076
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
Added an option `awslogs-create-stream` to allow skipping log stream
creation for awslogs log driver. The default value is still true to
keep the behavior be consistent with before.
Signed-off-by: Xia Wu <xwumzn@amazon.com>
- Using "/go/" redirects for some topics, which allows us to
redirect to new locations if topics are moved around in the
documentation.
- Updated some old URLs to their new location.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This code was added in 391441c28b, to fix
upgrades from docker 1.11 to 1.12 with existing containers.
Given that any container after 1.12 should have the correct configuration
already, it should be safe to assume this upgrade logic is no longer needed.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Wrap platforms.Only and fallback to our ignore mismatches due to empty
CPU variants. This just cleans things up and makes the logic re-usable
in other places.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
In some cases, in fact many in the wild, an image may have the incorrect
platform on the image config.
This can lead to failures to run an image, particularly when a user
specifies a `--platform`.
Typically what we see in the wild is a manifest list with an an entry
for, as an example, linux/arm64 pointing to an image config that has
linux/amd64 on it.
This change falls back to looking up the manifest list for an image to
see if the manifest list shows the image as the correct one for that
platform.
In order to accomplish this we need to traverse the leases associated
with an image. Each image, if pulled with Docker 20.10, will have the
manifest list stored in the containerd content store with the resource
assigned to a lease keyed on the image ID.
So we look up the lease for the image, then look up the assocated
resources to find the manifest list, then check the manifest list for a
platform match, then ensure that manifest referes to our image config.
This is only used as a fallback when a user specified they want a
particular platform and the image config that we have does not match
that platform.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
We have upgraded runc to rc93 and added CI for cgroup 2.
So we can move cgroup v2 out of experimental.
Fix issue 41916
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>