Currently moby drops ep sets before the entrypoint is executed.
This does mean that with combination of no-new-privileges the
file capabilities stops working with non-root containers.
This is undesired as the usability of such containers is harmed
comparing to running root containers.
This commit therefore sets the effective/permitted set in order
to allow use of file capabilities or libcap(3)/prctl(2) respectively
with combination of no-new-privileges and without respectively.
For no-new-privileges the container will be able to obtain capabilities
that are requested.
Signed-off-by: Luboslav Pivarc <lpivarc@redhat.com>
Signed-off-by: Bjorn Neergaard <bjorn.neergaard@docker.com>
(cherry picked from commit 3aef732e61)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Audit the OCI spec options used for Linux containers to ensure they are
less order-dependent. Ensure they don't assume that any pointer fields
are non-nil and that they don't unintentionally clobber mutations to the
spec applied by other options.
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit 8a094fe609)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This patch:
- Deprecates pkg/system.DefaultPathEnv
- Moves the implementation inside oci
- Adds TODOs to align the default in the Builder with the one used elsewhere
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
runconfig/config_test.go:23:46: empty-lines: extra empty line at the start of a block (revive)
runconfig/config_test.go:75:55: empty-lines: extra empty line at the start of a block (revive)
oci/devices_linux.go:57:34: empty-lines: extra empty line at the start of a block (revive)
oci/devices_linux.go:60:69: empty-lines: extra empty line at the start of a block (revive)
image/fs_test.go:53:38: empty-lines: extra empty line at the end of a block (revive)
image/tarexport/save.go:88:29: empty-lines: extra empty line at the end of a block (revive)
layer/layer_unix_test.go:21:34: empty-lines: extra empty line at the end of a block (revive)
distribution/xfer/download.go:302:9: empty-lines: extra empty line at the end of a block (revive)
distribution/manifest_test.go:154:99: empty-lines: extra empty line at the end of a block (revive)
distribution/manifest_test.go:329:52: empty-lines: extra empty line at the end of a block (revive)
distribution/manifest_test.go:354:59: empty-lines: extra empty line at the end of a block (revive)
registry/config_test.go:323:42: empty-lines: extra empty line at the end of a block (revive)
registry/config_test.go:350:33: empty-lines: extra empty line at the end of a block (revive)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Older versions of Go don't format comments, so committing this as
a separate commit, so that we can already make these changes before
we upgrade to Go 1.19.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
The Linux kernel never sets the Inheritable capability flag to anything
other than empty. Moby should have the same behavior, and leave it to
userspace code within the container to set a non-empty value if desired.
Reported-by: Andrew G. Morgan <morgan@kernel.org>
Signed-off-by: Samuel Karp <skarp@amazon.com>
In situations where docker runs in an environment where capabilities are limited,
sucn as docker-in-docker in a container created by older versions of docker, or
in a container where some capabilities have been disabled, starting a privileged
container may fail, because even though the _kernel_ supports a capability, the
capability is not available.
This patch attempts to address this problem by limiting the list of "known" capa-
bilities on the set of effective capabilties for the current process. This code
is based on the code in containerd's "caps" package.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
The io/ioutil package has been deprecated in Go 1.16. This commit
replaces the existing io/ioutil functions with their new definitions in
io and os packages.
Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
The `CapabilityMapping` and `Capabilities` types appeared to be only
used locally, and added unneeded complexity.
This patch removes those types, and simplifies the logic to use a
map that maps names to `capability.Cap`s
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
A capability can either be invalid, or not supported by the kernel
on which we're running. This patch changes the error message produced
to reflect if the capability is invalid/unknown, or a known capability,
but not supported by the kernel version.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Now that runc v1.0.0-rc93 is used, we can revert this temporary workaround
This reverts commit a38b96b8cd.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
The runtime spec expects the FileMode field to only hold file permissions,
however `unix.Stat_t.Mode` contains both file type and mode.
This patch strips file type so that only file mode is included in the Device.
Thanks to Iceber Gu, who noticed the same issue in containerd and runc.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This fixes a panic when an invalid "device cgroup rule" is passed, resulting
in an "index out of range".
This bug was introduced in the original implementation in 1756af6faf,
but was not reproducible when using the CLI, because the same commit also added
client-side validation on the flag before making an API request. The following
example, uses an invalid rule (`c *:* rwm` - two spaces before the permissions);
```console
$ docker run --rm --network=host --device-cgroup-rule='c *:* rwm' busybox
invalid argument "c *:* rwm" for "--device-cgroup-rule" flag: invalid device cgroup format 'c *:* rwm'
```
Doing the same, but using the API results in a daemon panic when starting the container;
Create a container with an invalid device cgroup rule:
```console
curl -v \
--unix-socket /var/run/docker.sock \
"http://localhost/v1.41/containers/create?name=foobar" \
-H "Content-Type: application/json" \
-d '{"Image":"busybox:latest", "HostConfig":{"DeviceCgroupRules": ["c *:* rwm"]}}'
```
Start the container:
```console
curl -v \
--unix-socket /var/run/docker.sock \
-X POST \
"http://localhost/v1.41/containers/foobar/start"
```
Observe the daemon logs:
```
2021-01-22 12:53:03.313806 I | http: panic serving @: runtime error: index out of range [0] with length 0
goroutine 571 [running]:
net/http.(*conn).serve.func1(0xc000cb2d20)
/usr/local/go/src/net/http/server.go:1795 +0x13b
panic(0x2f32380, 0xc000aebfc0)
/usr/local/go/src/runtime/panic.go:679 +0x1b6
github.com/docker/docker/oci.AppendDevicePermissionsFromCgroupRules(0xc000175c00, 0x8, 0x8, 0xc0000bd380, 0x1, 0x4, 0x0, 0x0, 0xc0000e69c0, 0x0, ...)
/go/src/github.com/docker/docker/oci/oci.go:34 +0x64f
```
This patch:
- fixes the panic, allowing the daemon to return an error on container start
- adds a unit-test to validate various permutations
- adds a "todo" to verify the regular expression (and handling) of the "a" (all) value
We should also consider performing this validation when _creating_ the container,
so that an error is produced early.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This prevents docker from setting CAP_PERFMON, CAP_BPF, and CAP_CHECKPOINT_RESTORE
capabilities on privileged (or CAP_ALL) containers on Kernel 5.8 and up.
While these kernels support these capabilities, the current release of
runc ships with an older version of /gocapability/capability, and does
not know about them, causing an error to be produced.
We can remove this restriction once 6dfbe9b807
is included in a runc release and once we stop supporting containerd 1.3.x
(which ships with runc v1.0.0-rc92).
Thanks to Anca Iordache for reporting.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
After dicussing with maintainers, it was decided putting the burden of
providing the full cap list on the client is not a good design.
Instead we decided to follow along with the container API and use cap
add/drop.
This brings in the changes already merged into swarmkit.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
```
oci/devices_linux.go:64:72: SA4009: argument e is overwritten before first use (staticcheck)
```
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Format the source according to latest goimports.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
- Add support for exact list of capabilities, support only OCI model
- Support OCI model on CapAdd and CapDrop but remain backward compatibility
- Create variable locally instead of declaring it at the top
- Use const for magic "ALL" value
- Rename `cap` variable as it overlaps with `cap()` built-in
- Normalize and validate capabilities before use
- Move validation for conflicting options to validateHostConfig()
- TweakCapabilities: simplify logic to calculate capabilities
Signed-off-by: Olli Janatuinen <olli.janatuinen@gmail.com>
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
@sw-pschmied originally post this in #38285
While looking through the Moby source code was found /proc/asound to be
shared with containers as read-only (as defined in
https://github.com/moby/moby/blob/master/oci/defaults.go#L128).
This can lead to two information leaks.
---
**Leak of media playback status of the host**
Steps to reproduce the issue:
- Listen to music/Play a YouTube video/Do anything else that involves
sound output
- Execute docker run --rm ubuntu:latest bash -c "sleep 7; cat
/proc/asound/card*/pcm*p/sub*/status | grep state | cut -d ' ' -f2 |
grep RUNNING || echo 'not running'"
- See that the containerized process is able to check whether someone
on the host is playing music as it prints RUNNING
- Stop the music output
- Execute the command again (The sleep is delaying the output because
information regarding playback status isn't propagated instantly)
- See that it outputs not running
**Describe the results you received:**
A containerized process is able to gather information on the playback
status of an audio device governed by the host. Therefore a process of a
container is able to check whether and what kind of user activity is
present on the host system. Also, this may indicate whether a container
runs on a desktop system or a server as media playback rarely happens on
server systems.
The description above is in regard to media playback - when examining
`/proc/asound/card*/pcm*c/sub*/status` (`pcm*c` instead of `pcm*p`) this
can also leak information regarding capturing sound, as in recording
audio or making calls on the host system.
Signed-off-by: Jonathan A. Schweder <jonathanschweder@gmail.com>
The deafult OCI linux spec in oci/defaults{_linux}.go in Docker/Moby
from 1.11 to current upstream master does not block /proc/acpi pathnames
allowing attackers to modify host's hardware like enabling/disabling
bluetooth or turning up/down keyboard brightness. SELinux prevents all
of this if enabled.
Signed-off-by: Antonio Murdaca <runcom@redhat.com>
This is writeable, and can be used to remove devices. Containers do
not need to know about scsi devices.
Signed-off-by: Justin Cormack <justin.cormack@docker.com>