The Linux kernel never sets the Inheritable capability flag to anything
other than empty. Moby should have the same behavior, and leave it to
userspace code within the container to set a non-empty value if desired.
Reported-by: Andrew G. Morgan <morgan@kernel.org>
Signed-off-by: Samuel Karp <skarp@amazon.com>
(cherry picked from commit 0d9a37d0c2)
Signed-off-by: Samuel Karp <skarp@amazon.com>
This fixes a panic when an invalid "device cgroup rule" is passed, resulting
in an "index out of range".
This bug was introduced in the original implementation in 1756af6faf,
but was not reproducible when using the CLI, because the same commit also added
client-side validation on the flag before making an API request. The following
example, uses an invalid rule (`c *:* rwm` - two spaces before the permissions);
```console
$ docker run --rm --network=host --device-cgroup-rule='c *:* rwm' busybox
invalid argument "c *:* rwm" for "--device-cgroup-rule" flag: invalid device cgroup format 'c *:* rwm'
```
Doing the same, but using the API results in a daemon panic when starting the container;
Create a container with an invalid device cgroup rule:
```console
curl -v \
--unix-socket /var/run/docker.sock \
"http://localhost/v1.41/containers/create?name=foobar" \
-H "Content-Type: application/json" \
-d '{"Image":"busybox:latest", "HostConfig":{"DeviceCgroupRules": ["c *:* rwm"]}}'
```
Start the container:
```console
curl -v \
--unix-socket /var/run/docker.sock \
-X POST \
"http://localhost/v1.41/containers/foobar/start"
```
Observe the daemon logs:
```
2021-01-22 12:53:03.313806 I | http: panic serving @: runtime error: index out of range [0] with length 0
goroutine 571 [running]:
net/http.(*conn).serve.func1(0xc000cb2d20)
/usr/local/go/src/net/http/server.go:1795 +0x13b
panic(0x2f32380, 0xc000aebfc0)
/usr/local/go/src/runtime/panic.go:679 +0x1b6
github.com/docker/docker/oci.AppendDevicePermissionsFromCgroupRules(0xc000175c00, 0x8, 0x8, 0xc0000bd380, 0x1, 0x4, 0x0, 0x0, 0xc0000e69c0, 0x0, ...)
/go/src/github.com/docker/docker/oci/oci.go:34 +0x64f
```
This patch:
- fixes the panic, allowing the daemon to return an error on container start
- adds a unit-test to validate various permutations
- adds a "todo" to verify the regular expression (and handling) of the "a" (all) value
We should also consider performing this validation when _creating_ the container,
so that an error is produced early.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
(cherry picked from commit 5cc1753f2c)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This prevents docker from setting CAP_PERFMON, CAP_BPF, and CAP_CHECKPOINT_RESTORE
capabilities on privileged (or CAP_ALL) containers on Kernel 5.8 and up.
While these kernels support these capabilities, the current release of
runc ships with an older version of /gocapability/capability, and does
not know about them, causing an error to be produced.
We can remove this restriction once 6dfbe9b807
is included in a runc release and once we stop supporting containerd 1.3.x
(which ships with runc v1.0.0-rc92).
Thanks to Anca Iordache for reporting.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
After dicussing with maintainers, it was decided putting the burden of
providing the full cap list on the client is not a good design.
Instead we decided to follow along with the container API and use cap
add/drop.
This brings in the changes already merged into swarmkit.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
```
oci/devices_linux.go:64:72: SA4009: argument e is overwritten before first use (staticcheck)
```
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Format the source according to latest goimports.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
- Add support for exact list of capabilities, support only OCI model
- Support OCI model on CapAdd and CapDrop but remain backward compatibility
- Create variable locally instead of declaring it at the top
- Use const for magic "ALL" value
- Rename `cap` variable as it overlaps with `cap()` built-in
- Normalize and validate capabilities before use
- Move validation for conflicting options to validateHostConfig()
- TweakCapabilities: simplify logic to calculate capabilities
Signed-off-by: Olli Janatuinen <olli.janatuinen@gmail.com>
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
@sw-pschmied originally post this in #38285
While looking through the Moby source code was found /proc/asound to be
shared with containers as read-only (as defined in
https://github.com/moby/moby/blob/master/oci/defaults.go#L128).
This can lead to two information leaks.
---
**Leak of media playback status of the host**
Steps to reproduce the issue:
- Listen to music/Play a YouTube video/Do anything else that involves
sound output
- Execute docker run --rm ubuntu:latest bash -c "sleep 7; cat
/proc/asound/card*/pcm*p/sub*/status | grep state | cut -d ' ' -f2 |
grep RUNNING || echo 'not running'"
- See that the containerized process is able to check whether someone
on the host is playing music as it prints RUNNING
- Stop the music output
- Execute the command again (The sleep is delaying the output because
information regarding playback status isn't propagated instantly)
- See that it outputs not running
**Describe the results you received:**
A containerized process is able to gather information on the playback
status of an audio device governed by the host. Therefore a process of a
container is able to check whether and what kind of user activity is
present on the host system. Also, this may indicate whether a container
runs on a desktop system or a server as media playback rarely happens on
server systems.
The description above is in regard to media playback - when examining
`/proc/asound/card*/pcm*c/sub*/status` (`pcm*c` instead of `pcm*p`) this
can also leak information regarding capturing sound, as in recording
audio or making calls on the host system.
Signed-off-by: Jonathan A. Schweder <jonathanschweder@gmail.com>
The deafult OCI linux spec in oci/defaults{_linux}.go in Docker/Moby
from 1.11 to current upstream master does not block /proc/acpi pathnames
allowing attackers to modify host's hardware like enabling/disabling
bluetooth or turning up/down keyboard brightness. SELinux prevents all
of this if enabled.
Signed-off-by: Antonio Murdaca <runcom@redhat.com>
This is writeable, and can be used to remove devices. Containers do
not need to know about scsi devices.
Signed-off-by: Justin Cormack <justin.cormack@docker.com>
This also update:
- runc to 3f2f8b84a77f73d38244dd690525642a72156c64
- runtime-specs to v1.0.0
Signed-off-by: Kenfe-Mickael Laventure <mickael.laventure@gmail.com>
Since the commit d88fe447df ("Add support for sharing /dev/shm/ and
/dev/mqueue between containers") container's /dev/shm is mounted on the
host first, then bind-mounted inside the container. This is done that
way in order to be able to share this container's IPC namespace
(and the /dev/shm mount point) with another container.
Unfortunately, this functionality breaks container checkpoint/restore
(even if IPC is not shared). Since /dev/shm is an external mount, its
contents is not saved by `criu checkpoint`, and so upon restore any
application that tries to access data under /dev/shm is severily
disappointed (which usually results in a fatal crash).
This commit solves the issue by introducing new IPC modes for containers
(in addition to 'host' and 'container:ID'). The new modes are:
- 'shareable': enables sharing this container's IPC with others
(this used to be the implicit default);
- 'private': disables sharing this container's IPC.
In 'private' mode, container's /dev/shm is truly mounted inside the
container, without any bind-mounting from the host, which solves the
issue.
While at it, let's also implement 'none' mode. The motivation, as
eloquently put by Justin Cormack, is:
> I wondered a while back about having a none shm mode, as currently it is
> not possible to have a totally unwriteable container as there is always
> a /dev/shm writeable mount. It is a bit of a niche case (and clearly
> should never be allowed to be daemon default) but it would be trivial to
> add now so maybe we should...
...so here's yet yet another mode:
- 'none': no /dev/shm mount inside the container (though it still
has its own private IPC namespace).
Now, to ultimately solve the abovementioned checkpoint/restore issue, we'd
need to make 'private' the default mode, but unfortunately it breaks the
backward compatibility. So, let's make the default container IPC mode
per-daemon configurable (with the built-in default set to 'shareable'
for now). The default can be changed either via a daemon CLI option
(--default-shm-mode) or a daemon.json configuration file parameter
of the same name.
Note one can only set either 'shareable' or 'private' IPC modes as a
daemon default (i.e. in this context 'host', 'container', or 'none'
do not make much sense).
Some other changes this patch introduces are:
1. A mount for /dev/shm is added to default OCI Linux spec.
2. IpcMode.Valid() is simplified to remove duplicated code that parsed
'container:ID' form. Note the old version used to check that ID does
not contain a semicolon -- this is no longer the case (tests are
modified accordingly). The motivation is we should either do a
proper check for container ID validity, or don't check it at all
(since it is checked in other places anyway). I chose the latter.
3. IpcMode.Container() is modified to not return container ID if the
mode value does not start with "container:", unifying the check to
be the same as in IpcMode.IsContainer().
3. IPC mode unit tests (runconfig/hostconfig_test.go) are modified
to add checks for newly added values.
[v2: addressed review at https://github.com/moby/moby/pull/34087#pullrequestreview-51345997]
[v3: addressed review at https://github.com/moby/moby/pull/34087#pullrequestreview-53902833]
[v4: addressed the case of upgrading from older daemon, in this case
container.HostConfig.IpcMode is unset and this is valid]
[v5: document old and new IpcMode values in api/swagger.yaml]
[v6: add the 'none' mode, changelog entry to docs/api/version-history.md]
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
There really is no reason why anyone should create content in /dev
other then device nodes. Limiting it size to the 64 k size limit.
Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
On typical x86_64 machines, /sys/firmware can contain SMBIOS and ACPI tables.
There is no need to expose the directory to containers.
Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
This adds a small C binary for fighting zombies. It is mounted under
`/dev/init` and is prepended to the args specified by the user. You
enable it via a daemon flag, `dockerd --init`, as it is disable by
default for backwards compat.
You can also override the daemon option or specify this on a per
container basis with `docker run --init=true|false`.
You can test this by running a process like this as the pid 1 in a
container and see the extra zombie that appears in the container as it
is running.
```c
int main(int argc, char ** argv) {
pid_t pid = fork();
if (pid == 0) {
pid = fork();
if (pid == 0) {
exit(0);
}
sleep(3);
exit(0);
}
printf("got pid %d and exited\n", pid);
sleep(20);
}
```
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
This device is not required by the OCI spec.
The rationale for this was linked to docker/docker#2393
So a non functional /dev/fuse was created, and actual fuse use still is
required to add the device explicitly. However even old versions of the JVM
on Ubuntu 12.04 no longer require the fuse package, and this is all not
needed.
See also https://github.com/opencontainers/runc/pull/983 although this
change alone stops the fuse device being created.
Tested and does not change actual ability to use fuse.
Signed-off-by: Justin Cormack <justin.cormack@docker.com>
/proc/timer_list seems to leak information about the host. Here is
an example from a busybox container running on docker+kubernetes.
# cat /proc/timer_list | grep -i -e kube
<ffff8800b8cc3db0>, hrtimer_wakeup, S:01, futex_wait_queue_me, kubelet/2497
<ffff880129ac3db0>, hrtimer_wakeup, S:01, futex_wait_queue_me, kube-proxy/3478
<ffff8800b1b77db0>, hrtimer_wakeup, S:01, futex_wait_queue_me, kube-proxy/3470
<ffff8800bb6abdb0>, hrtimer_wakeup, S:01, futex_wait_queue_me, kubelet/2499
Signed-Off-By: Davanum Srinivas <davanum@gmail.com>
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
This vendors in new spec/runc that supports
setting readonly and masked paths in the
configuration. Using this allows us to make an
exception for `—-privileged`.
Signed-off-by: Tonis Tiigi <tonistiigi@gmail.com>
Signed-off-by: John Howard <jhoward@microsoft.com>
Signed-off-by: John Starks <jostarks@microsoft.com>
Signed-off-by: Darren Stahl <darst@microsoft.com>
Signed-off-by: Tonis Tiigi <tonistiigi@gmail.com>