These HostConfig properties were not validated until the OCI spec for the container
was created, which meant that `container run` and `docker create` would accept
invalid values, and the invalid value would not be detected until `start` was
called, returning a 500 "internal server error", as well as errors from containerd
("cleanup: failed to delete container from containerd: no such container") in the
daemon logs.
As a result, a faulty container was created, and the container state remained
in the `created` state.
This patch:
- Updates `oci.WithNamespaces()` to return the correct `errdefs.InvalidParameter`
- Updates `verifyPlatformContainerSettings()` to validate these settings, so that
an error is returned when _creating_ the container.
Before this patch:
docker run -dit --ipc=shared --name foo busybox
2a00d74e9fbb7960c4718def8f6c74fa8ee754030eeb93ee26a516e27d4d029f
docker: Error response from daemon: Invalid IPC mode: shared.
docker ps -a --filter name=foo
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
2a00d74e9fbb busybox "sh" About a minute ago Created foo
After this patch:
docker run -dit --ipc=shared --name foo busybox
docker: Error response from daemon: invalid IPC mode: shared.
docker ps -a --filter name=foo
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
An integration test was added to verify the new validation, which can be run with:
make BIND_DIR=. TEST_FILTER=TestCreateInvalidHostConfig DOCKER_GRAPHDRIVER=vfs test-integration
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
On Linux the daemon was not respecting the HostConfig.ConsoleSize
property and relied on cli initializing the tty size after the container
was created. This caused a delay between container creation and
the tty actually being resized.
This is also a small change to the api description, because
HostConfig.ConsoleSize is no longer Windows-only.
Signed-off-by: Paweł Gronowski <pawel.gronowski@docker.com>
strings.ReplaceAll(s, old, new) is a wrapper function for
strings.Replace(s, old, new, -1). But strings.ReplaceAll is more
readable and removes the hardcoded -1.
Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
Finish the refactor which was partially completed with commit
34536c498d, passing around IdentityMapping structs instead of pairs of
[]IDMap slices.
Existing code which uses []IDMap relies on zero-valued fields to be
valid, empty mappings. So in order to successfully finish the
refactoring without introducing bugs, their replacement therefore also
needs to have a useful zero value which represents an empty mapping.
Change IdentityMapping to be a pass-by-value type so that there are no
nil pointers to worry about.
The functionality provided by the deprecated NewIDMappingsFromMaps
function is required by unit tests to to construct arbitrary
IdentityMapping values. And the daemon will always need to access the
mappings to pass them to the Linux kernel. Accommodate these use cases
by exporting the struct fields instead. BuildKit currently depends on
the UIDs and GIDs methods so we cannot get rid of them yet.
Signed-off-by: Cory Snider <csnider@mirantis.com>
Perform the validation when the daemon starts instead of performing these
validations for each individual container, so that we can fail early.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
I had to check what the actual size was, so added it to the const's documentation.
While at it, also made use of it in a test, so that we're testing against the expected
value, and changed one alias to be consistent with other places where we alias this
import.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Trying to reduce the use of libcontainer/devices, as it's considered
to be an "internal" package by runc.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Commit dae652e2e5 added support for non-privileged
containers to use ICMP_PROTO (used for `ping`). This option cannot be set for
containers that have user-namespaces enabled.
However, the detection looks to be incorrect; HostConfig.UsernsMode was added
in 6993e891d1 / ee2183881b,
and the property only has meaning if the daemon is running with user namespaces
enabled. In other situations, the property has no meaning.
As a result of the above, the sysctl would only be set for containers running
with UsernsMode=host on a daemon running with user-namespaces enabled.
This patch adds a check if the daemon has user-namespaces enabled (RemappedRoot
having a non-empty value), or if the daemon is running inside a user namespace
(e.g. rootless mode) to fix the detection.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
The io/ioutil package has been deprecated in Go 1.16. This commit
replaces the existing io/ioutil functions with their new definitions in
io and os packages.
Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
The "quiet" argument was only used in a single place (at daemon startup), and
every other use had to pass "false" to prevent this function from logging
warnings.
Now that SysInfo contains the warnings that occurred when collecting the
system information, we can make leave it up to the caller to use those
warnings (and log them if wanted).
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This changes mounts.NewParser() to create a parser for the current operatingsystem,
instead of one specific to a (possibly non-matching, in case of LCOW) OS.
With the OS-specific handling being removed, the "OS" parameter is also removed
from `daemon.verifyContainerSettings()`, and various other container-related
functions.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
- daemon.WithRootless(): make sure ROOTLESSKIT_PARENT_EUID is valid int
- daemon.RawSysInfo(): minor simplification, and rename variable that
clashed with imported package.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
The runc/libcontainer apparmor package on master no longer checks if apparmor_parser
is enabled, or if we are running docker-in-docker.
While those checks are not relevant to runc (as it doesn't load the profile), these
checks _are_ relevant to us (and containerd). So switching to use the containerd
apparmor package, which does include the needed checks.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
The following was failing previously, because `getUnprivilegedMountFlags()` was not called:
```console
$ sudo mount -t tmpfs -o noexec none /tmp/foo
$ $ docker --context=rootless run -it --rm -v /tmp/foo:/mnt:ro alpine
docker: Error response from daemon: OCI runtime create failed: container_linux.go:367: starting container process caused: process_linux.go:520: container init caused: rootfs_linux.go:60: mounting "/tmp/foo" to rootfs at "/home/suda/.local/share/docker/overlay2/b8e7ea02f6ef51247f7f10c7fb26edbfb308d2af8a2c77915260408ed3b0a8ec/merged/mnt" caused: operation not permitted: unknown.
```
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
libcontainer does not guarantee a stable API, and is not intended
for external consumers.
this patch replaces some uses of libcontainer/cgroups with
containerd/cgroups.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
A nil interface in Go is not the same as a nil pointer that satisfies
the interface. libcontainer/user has special handling for missing
/etc/{passwd,group} files but this is all based on nil interface checks,
which were broken by Docker's usage of the API.
When combined with some recent changes in runc that made read errors
actually be returned to the caller, this results in spurrious -EINVAL
errors when we should detect the situation as "there is no passwd file".
Signed-off-by: Aleksa Sarai <asarai@suse.de>
After dicussing with maintainers, it was decided putting the burden of
providing the full cap list on the client is not a good design.
Instead we decided to follow along with the container API and use cap
add/drop.
This brings in the changes already merged into swarmkit.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
Commit 56f77d5ade added code that is doing some very ugly things.
In partucular, calling cgroups.FindCgroupMountpointAndRoot() and
daemon.SysInfoRaw() inside a recursively-called initCgroupsPath()
not not a good thing to do.
This commit tries to partially untangle this by moving some expensive
checks and calls earlier, in a minimally invasive way (meaning I
tried hard to not break any logic, however weird it is).
This also removes double call to MkdirAll (not important, but it sticks
out) and renames the function to better reflect what it's doing.
Finally, this wraps some of the errors returned, and fixes the init
function to not ignore the error from itself.
This could be reworked more radically, but at least this this commit
we are calling expensive functions once, and only if necessary.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The implementation in libcontainer/system is quite complicated,
and we only use it to detect if user-namespaces are enabled.
In addition, the implementation in containerd uses a sync.Once,
so that detection (and reading/parsing `/proc/self/uid_map`) is
only performed once.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Currently default capability CAP_NET_RAW allows users to open ICMP echo
sockets, and CAP_NET_BIND_SERVICE allows binding to ports under 1024.
Both of these are safe operations, and Linux now provides ways that
these can be set, per container, to be allowed without any capabilties
for non root users. Enable these by default. Users can revert to the
previous behaviour by overriding the sysctl values explicitly.
Signed-off-by: Justin Cormack <justin.cormack@docker.com>
Switch to moby/sys/mount and mountinfo. Keep the pkg/mount for potential
outside users.
This commit was generated by the following bash script:
```
set -e -u -o pipefail
for file in $(git grep -l 'docker/docker/pkg/mount"' | grep -v ^pkg/mount); do
sed -i -e 's#/docker/docker/pkg/mount"#/moby/sys/mount"#' \
-e 's#mount\.\(GetMounts\|Mounted\|Info\|[A-Za-z]*Filter\)#mountinfo.\1#g' \
$file
goimports -w $file
done
```
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Support cgroup as in Rootless Podman.
Requires cgroup v2 host with crun.
Tested with Ubuntu 19.10 (kernel 5.3, systemd 242), crun v0.12.1.
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
When a container is started in privileged mode, the device mappings
provided by `--device` flag was ignored. Now the device mappings will
be considered even in privileged mode.
Signed-off-by: Akhil Mohan <akhil.mohan@mayadata.io>
Format the source according to latest goimports.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This adds both a daemon-wide flag and a container creation property:
- Set the `CgroupnsMode: "host|private"` HostConfig property at
container creation time to control what cgroup namespace the container
is created in
- Set the `--default-cgroupns-mode=host|private` daemon flag to control
what cgroup namespace containers are created in by default
- Set the default if the daemon flag is unset to "host", for backward
compatibility
- Default to CgroupnsMode: "host" for client versions < 1.40
Signed-off-by: Rob Gulewich <rgulewich@netflix.com>
This is enabled for all containers that are not run with --privileged,
if the kernel supports it.
Fixes#38332
Signed-off-by: Rob Gulewich <rgulewich@netflix.com>
This patch hard-codes support for NVIDIA GPUs.
In a future patch it should move out into its own Device Plugin.
Signed-off-by: Tibor Vass <tibor@docker.com>