Commit graph

66 commits

Author SHA1 Message Date
Sebastiaan van Stijn
52c1a2fae8
gofmt GoDoc comments with go1.19
Older versions of Go don't format comments, so committing this as
a separate commit, so that we can already make these changes before
we upgrade to Go 1.19.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2022-07-08 19:56:23 +02:00
Samuel Karp
0d9a37d0c2
oci: inheritable capability set should be empty
The Linux kernel never sets the Inheritable capability flag to anything
other than empty.  Moby should have the same behavior, and leave it to
userspace code within the container to set a non-empty value if desired.

Reported-by: Andrew G. Morgan <morgan@kernel.org>
Signed-off-by: Samuel Karp <skarp@amazon.com>
2022-02-08 14:33:44 -08:00
Sebastiaan van Stijn
485cf38d48
oci/caps: limit available capabilities to current environment
In situations where docker runs in an environment where capabilities are limited,
such as docker-in-docker in a container created by older versions of docker, or
in a container where some capabilities have been disabled, starting a privileged
container may fail, because even though the _kernel_ supports a capability, the
capability is not available.

This patch attempts to address this problem by limiting the list of "known"
capabilities to the set of effective capabilities for the current process. This code
is based on the code in containerd's "caps" package.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2021-10-15 16:12:26 +02:00
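
The filtering described in the commit above can be sketched roughly as follows. This is a minimal illustration, not the code from Moby or containerd: `knownCaps` and `availableCaps` are hypothetical names, the capability table is a tiny subset, and the effective set is assumed to arrive as a hexadecimal bitmask in the format of the `CapEff` field in `/proc/self/status`.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
)

// knownCaps is a deliberately tiny, illustrative subset of Linux capability
// names mapped to their bit numbers (hypothetical table for this sketch).
var knownCaps = map[string]uint{
	"CAP_CHOWN":     0,
	"CAP_NET_ADMIN": 12,
	"CAP_SYS_ADMIN": 21,
}

// availableCaps is a hypothetical helper: it keeps only the known
// capabilities whose bit is set in the effective-capability mask.
// The mask is assumed to be a hex string, as in the CapEff field.
func availableCaps(capEffHex string) ([]string, error) {
	mask, err := strconv.ParseUint(capEffHex, 16, 64)
	if err != nil {
		return nil, fmt.Errorf("parsing effective capability mask %q: %w", capEffHex, err)
	}
	var out []string
	for name, bit := range knownCaps {
		if mask&(1<<bit) != 0 {
			out = append(out, name)
		}
	}
	sort.Strings(out) // map iteration order is random; sort for stable output
	return out, nil
}

func main() {
	// Bits 0 and 12 are set: CAP_CHOWN and CAP_NET_ADMIN, but not CAP_SYS_ADMIN.
	caps, err := availableCaps("0000000000001001")
	fmt.Println(caps, err)
}
```

With a restricted mask, a "privileged" container in this sketch would only be offered the capabilities the current process actually holds, which is the behavior the commit message aims for.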
Eng Zer Jun
c55a4ac779
refactor: move from io/ioutil to io and os package
The io/ioutil package has been deprecated in Go 1.16. This commit
replaces the existing io/ioutil functions with their new definitions in
io and os packages.

Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
2021-08-27 14:56:57 +08:00
Sebastiaan van Stijn
686be57d0a
Update to Go 1.17.0, and gofmt with Go 1.17
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2021-08-24 23:33:27 +02:00
Sebastiaan van Stijn
58c4c120a8
oci/caps: simplify, and remove types that were not needed
The `CapabilityMapping` and `Capabilities` types appeared to be only
used locally, and added unneeded complexity.

This patch removes those types, and simplifies the logic to use a
map that maps names to `capability.Cap`s

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2021-08-04 11:25:55 +02:00
Sebastiaan van Stijn
fc3f98848a
oci/caps: improve error message for unsupported capabilities
A capability can either be invalid, or not supported by the kernel
on which we're running. This patch changes the error message produced
to reflect if the capability is invalid/unknown, or a known capability,
but not supported by the kernel version.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2021-08-04 11:25:53 +02:00
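
The distinction drawn in the commit above (invalid/unknown versus known but unsupported) could look roughly like the sketch below. The names `capabilityList` and `checkCapability`, the error wording, and the idea of passing the highest supported capability number as a parameter are all assumptions for illustration; only the capability numbers themselves come from the Linux headers.

```go
package main

import "fmt"

// capabilityList is an illustrative subset of capability names mapped to
// their numbers (hypothetical table; the real list is much longer).
var capabilityList = map[string]int{
	"CAP_CHOWN":     0,
	"CAP_SYS_ADMIN": 21,
	"CAP_BPF":       39,
}

// checkCapability is a hypothetical helper that reports two different
// errors: one for names that are not capabilities at all, and one for
// capabilities the running kernel is too old to support.
func checkCapability(name string, lastSupported int) error {
	id, ok := capabilityList[name]
	if !ok {
		return fmt.Errorf("invalid capability: %q", name)
	}
	if id > lastSupported {
		return fmt.Errorf("capability %q is known, but not supported by this kernel", name)
	}
	return nil
}

func main() {
	// Assumed: a kernel whose highest capability is number 37,
	// so it predates CAP_BPF.
	const lastSupported = 37
	for _, name := range []string{"CAP_CHOWN", "CAP_BPF", "CAP_FOO"} {
		fmt.Printf("%s: %v\n", name, checkCapability(name, lastSupported))
	}
}
```

Using a map keeps the lookup to a single step, and the "not found" case falls out of the lookup for free.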
Sebastiaan van Stijn
72b1fb59fe
oci/caps: use map for capabilities to simplify lookup
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2021-08-04 11:25:51 +02:00
Sebastiaan van Stijn
d786a52364
oci/caps: generate list of all capabilities on "init"
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2021-08-04 11:25:48 +02:00
Sebastiaan van Stijn
0ec6f7ea23
oci/caps: minor optimization in init
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2021-08-04 11:25:44 +02:00
Sebastiaan van Stijn
b00b21b93c
oci/caps: rename some vars that conflicted with imports / built-ins
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2021-08-04 11:24:40 +02:00
Sebastiaan van Stijn
94334153b5
oci/caps: remove hack for RHEL6 kernels
We no longer support these kernels, so we can remove the workaround

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2021-08-04 11:23:56 +02:00
Sebastiaan van Stijn
c1c973e81b
Revert "Temporarily disable CAP_PERFMON, CAP_BPF, and CAP_CHECKPOINT_RESTORE"
Now that runc v1.0.0-rc93 is used, we can revert this temporary workaround

This reverts commit a38b96b8cd.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2021-08-03 16:12:31 +02:00
Sebastiaan van Stijn
0c84c322ae
daemon, oci: remove LCOW bits
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2021-07-27 13:35:59 +02:00
Sebastiaan van Stijn
bb17074119
reformat "nolint" comments
Unlike regular comments, nolint comments should not have a leading space.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2021-06-10 13:03:42 +02:00
Sebastiaan van Stijn
d414c0c1e8
replace uses of deprecated libcontainer/configs.Device
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2021-06-02 17:55:51 +02:00
Sebastiaan van Stijn
a40197328e
oci/caps: remove unused GetCapability() and ValidateCapabilities()
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2021-05-06 15:59:26 +02:00
Sebastiaan van Stijn
1cd1925acd
oci.Device() fix FileMode to match runtime spec
The runtime spec expects the FileMode field to only hold file permissions,
however `unix.Stat_t.Mode` contains both file type and mode.

This patch strips file type so that only file mode is included in the Device.

Thanks to Iceber Gu, who noticed the same issue in containerd and runc.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2021-02-18 10:48:24 +01:00
Sebastiaan van Stijn
5cc1753f2c
Fix daemon panic when starting container with invalid device cgroup rule
This fixes a panic when an invalid "device cgroup rule" is passed, resulting
in an "index out of range".

This bug was introduced in the original implementation in 1756af6faf,
but was not reproducible when using the CLI, because the same commit also added
client-side validation on the flag before making an API request. The following
example uses an invalid rule (`c *:*  rwm`, with two spaces before the permissions):

```console
$ docker run --rm --network=host --device-cgroup-rule='c *:*  rwm' busybox
invalid argument "c *:*  rwm" for "--device-cgroup-rule" flag: invalid device cgroup format 'c *:*  rwm'
```

Doing the same, but using the API results in a daemon panic when starting the container;

Create a container with an invalid device cgroup rule:

```console
curl -v \
  --unix-socket /var/run/docker.sock \
  "http://localhost/v1.41/containers/create?name=foobar" \
  -H "Content-Type: application/json" \
  -d '{"Image":"busybox:latest", "HostConfig":{"DeviceCgroupRules": ["c *:*  rwm"]}}'
```

Start the container:

```console
curl -v \
  --unix-socket /var/run/docker.sock \
  -X POST \
  "http://localhost/v1.41/containers/foobar/start"
```

Observe the daemon logs:

```
2021-01-22 12:53:03.313806 I | http: panic serving @: runtime error: index out of range [0] with length 0
goroutine 571 [running]:
net/http.(*conn).serve.func1(0xc000cb2d20)
	/usr/local/go/src/net/http/server.go:1795 +0x13b
panic(0x2f32380, 0xc000aebfc0)
	/usr/local/go/src/runtime/panic.go:679 +0x1b6
github.com/docker/docker/oci.AppendDevicePermissionsFromCgroupRules(0xc000175c00, 0x8, 0x8, 0xc0000bd380, 0x1, 0x4, 0x0, 0x0, 0xc0000e69c0, 0x0, ...)
	/go/src/github.com/docker/docker/oci/oci.go:34 +0x64f
```

This patch:

- fixes the panic, allowing the daemon to return an error on container start
- adds a unit-test to validate various permutations
- adds a "todo" to verify the regular expression (and handling) of the "a" (all) value

We should also consider performing this validation when _creating_ the container,
so that an error is produced early.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2021-01-22 16:02:19 +01:00
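
The kind of guard that avoids the "index out of range" panic described above can be sketched as follows. The regular expression, the `deviceRule` type, and `parseDeviceCgroupRule` are assumptions made for this illustration; the point is only that the match result is checked before any submatch is indexed, so a malformed rule yields an error instead of a panic.

```go
package main

import (
	"fmt"
	"regexp"
)

// ruleRe is an assumed pattern for "<type> <major>:<minor> <permissions>",
// with exactly one space between the fields.
var ruleRe = regexp.MustCompile(`^([acb]) ([0-9]+|\*):([0-9]+|\*) ([rwm]{1,3})$`)

// deviceRule is a hypothetical, simplified representation of a parsed rule.
type deviceRule struct {
	Type, Major, Minor, Access string
}

// parseDeviceCgroupRule returns an error for rules that do not match,
// rather than indexing into an empty match result.
func parseDeviceCgroupRule(rule string) (deviceRule, error) {
	m := ruleRe.FindStringSubmatch(rule)
	if m == nil {
		return deviceRule{}, fmt.Errorf("invalid device cgroup rule format: %q", rule)
	}
	return deviceRule{Type: m[1], Major: m[2], Minor: m[3], Access: m[4]}, nil
}

func main() {
	// The second rule has two spaces before the permissions and is rejected.
	for _, r := range []string{"c *:* rwm", "c *:*  rwm"} {
		parsed, err := parseDeviceCgroupRule(r)
		fmt.Printf("%q -> %+v, err=%v\n", r, parsed, err)
	}
}
```

Running the same check when the container is created, as the commit message suggests, would surface the error before start time.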
Arnaud Rebillout
f9b2989e97 Fix permissions on oci fixtures files
These two json files were executable, they are now 0644.

Signed-off-by: Arnaud Rebillout <elboulangero@gmail.com>
2020-11-27 10:29:47 +07:00
Sebastiaan van Stijn
a38b96b8cd
Temporarily disable CAP_PERFMON, CAP_BPF, and CAP_CHECKPOINT_RESTORE
This prevents docker from setting CAP_PERFMON, CAP_BPF, and CAP_CHECKPOINT_RESTORE
capabilities on privileged (or CAP_ALL) containers on Kernel 5.8 and up.

While these kernels support these capabilities, the current release of
runc ships with an older version of /gocapability/capability, and does
not know about them, causing an error to be produced.

We can remove this restriction once 6dfbe9b807
is included in a runc release and once we stop supporting containerd 1.3.x
(which ships with runc v1.0.0-rc92).

Thanks to Anca Iordache for reporting.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-10-16 17:52:27 +02:00
Sebastiaan van Stijn
c9c7756301
oci: add tests for loading seccomp profiles
Verify that we're able to test seccomp profiles with our
default Spec.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-09-29 20:15:43 +02:00
Jintao Zhang
9ad35b7e69 vendor runc 67169a9d43456ff0d5ae12b967acb8e366e2f181
v1.0.0-rc91-48-g67169a9d

Signed-off-by: Jintao Zhang <zhangjintao9020@gmail.com>
2020-07-30 16:16:11 +00:00
Sebastiaan van Stijn
bd0c2b3581
oci/deviceCgroup(): remove redundant variable
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-07-29 14:50:56 +02:00
Brian Goff
24f173a003 Replace service "Capabilities" w/ add/drop API
After discussing with maintainers, it was decided that putting the burden of
providing the full cap list on the client is not a good design.
Instead we decided to follow along with the container API and use cap
add/drop.

This brings in the changes already merged into swarmkit.

Signed-off-by: Brian Goff <cpuguy83@gmail.com>
2020-07-27 10:09:42 -07:00
Olli Janatuinen
1308a3a99f Move DefaultCapabilities() to caps package
Signed-off-by: Olli Janatuinen <olli.janatuinen@gmail.com>
2019-11-14 21:13:16 +02:00
Sebastiaan van Stijn
4a3ee04351
oci: fix SA4009: argument e is overwritten before first use (staticcheck)
```
oci/devices_linux.go:64:72: SA4009: argument e is overwritten before first use (staticcheck)
```

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2019-09-18 12:57:50 +02:00
Sebastiaan van Stijn
07ff4f1de8
goimports: fix imports
Format the source according to latest goimports.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2019-09-18 12:56:54 +02:00
Olli Janatuinen
80d7bfd54d Capabilities refactor
- Add support for exact list of capabilities, support only OCI model
- Support OCI model on CapAdd and CapDrop but retain backward compatibility
- Create variable locally instead of declaring it at the top
- Use const for magic "ALL" value
- Rename `cap` variable as it overlaps with `cap()` built-in
- Normalize and validate capabilities before use
- Move validation for conflicting options to validateHostConfig()
- TweakCapabilities: simplify logic to calculate capabilities

Signed-off-by: Olli Janatuinen <olli.janatuinen@gmail.com>
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2019-01-22 21:50:41 +02:00
Michael Crosby
b940cc5cff Move caps and device spec utils to oci pkg
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2018-12-11 10:20:25 -05:00
Jonathan A. Schweder
64e52ff3db Masked /proc/asound
@sw-pschmied originally posted this in #38285

While looking through the Moby source code, /proc/asound was found to be
shared with containers as read-only (as defined in
https://github.com/moby/moby/blob/master/oci/defaults.go#L128).

This can lead to two information leaks.

---

**Leak of media playback status of the host**

Steps to reproduce the issue:

 - Listen to music/Play a YouTube video/Do anything else that involves
sound output
 - Execute docker run --rm ubuntu:latest bash -c "sleep 7; cat
/proc/asound/card*/pcm*p/sub*/status | grep state | cut -d ' ' -f2 |
grep RUNNING || echo 'not running'"
 - See that the containerized process is able to check whether someone
on the host is playing music as it prints RUNNING
 - Stop the music output
 - Execute the command again (The sleep is delaying the output because
information regarding playback status isn't propagated instantly)
 - See that it outputs not running

**Describe the results you received:**

A containerized process is able to gather information on the playback
status of an audio device governed by the host. Therefore a process of a
container is able to check whether and what kind of user activity is
present on the host system. Also, this may indicate whether a container
runs on a desktop system or a server as media playback rarely happens on
server systems.

The description above is in regard to media playback - when examining
`/proc/asound/card*/pcm*c/sub*/status` (`pcm*c` instead of `pcm*p`) this
can also leak information regarding capturing sound, as in recording
audio or making calls on the host system.

Signed-off-by: Jonathan A. Schweder <jonathanschweder@gmail.com>
2018-11-30 10:03:10 -02:00
Antonio Murdaca
569b9702a5
Add /proc/acpi to masked paths
The default OCI linux spec in oci/defaults{_linux}.go in Docker/Moby
from 1.11 to current upstream master does not block /proc/acpi pathnames,
allowing attackers to modify the host's hardware, for example enabling/disabling
bluetooth or turning up/down keyboard brightness. SELinux prevents all
of this if enabled.

Signed-off-by: Antonio Murdaca <runcom@redhat.com>
2018-07-05 17:39:52 +02:00
Sebastiaan van Stijn
f23c00d870
Various code-cleanup
Remove unnecessary import aliases, brackets, and so on.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2018-05-23 17:50:54 +02:00
Justin Cormack
de23cb9398 Add /proc/keys to masked paths
This leaks information about keyrings on the host. Keyrings are
not namespaced.

Signed-off-by: Justin Cormack <justin.cormack@docker.com>
2018-02-21 16:23:34 +00:00
Daniel Nephin
4f0d95fa6e Add canonical import comment
Signed-off-by: Daniel Nephin <dnephin@docker.com>
2018-02-05 16:51:57 -05:00
John Howard
b023a46a07 Don't special case /sys/firmware in masked paths
Signed-off-by: John Howard <jhoward@microsoft.com>
2017-11-08 12:10:42 -08:00
Justin Cormack
a21ecdf3c8 Add /proc/scsi to masked paths
This is writeable, and can be used to remove devices. Containers do
not need to know about scsi devices.

Signed-off-by: Justin Cormack <justin.cormack@docker.com>
2017-11-03 15:12:22 +00:00
John Howard
71651e0b80 Fixes LCOW after containerd 1.0 introduced regressions
Signed-off-by: John Howard <jhoward@microsoft.com>
2017-10-27 09:55:43 -07:00
Michael Crosby
5a9b5f10cf Remove solaris files
For the obvious reason that it is not really supported now.

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2017-10-24 15:39:34 -04:00
Kenfe-Mickael Laventure
ddae20c032
Update libcontainerd to use containerd 1.0
Signed-off-by: Kenfe-Mickael Laventure <mickael.laventure@gmail.com>
2017-10-20 07:11:37 -07:00
John Howard
285bc99731 Merge pull request #34356 from mlaventure/update-containerd
Update containerd to 06b9cb35161009dcb7123345749fef02f7cea8e0
2017-08-24 14:25:44 -07:00
Darren Stahl
7c29103ad9
Update Windows and LCOW to use v1.0.0 runtime-spec
Signed-off-by: Darren Stahl <darst@microsoft.com>
Signed-off-by: Kenfe-Mickael Laventure <mickael.laventure@gmail.com>
2017-08-21 15:19:31 -07:00
Kenfe-Mickael Laventure
45d85c9913
Update containerd to 06b9cb35161009dcb7123345749fef02f7cea8e0
This also updates:
 - runc to 3f2f8b84a77f73d38244dd690525642a72156c64
 - runtime-specs to v1.0.0

Signed-off-by: Kenfe-Mickael Laventure <mickael.laventure@gmail.com>
2017-08-21 12:04:07 -07:00
Christophe Vidal
dffa5d6df2 Dropped hyphen in bind mount where appropriate
Signed-off-by: Christophe Vidal <kriss@krizalys.com>
2017-08-19 21:25:07 +07:00
Kir Kolyshkin
7120976d74 Implement none, private, and shareable ipc modes
Since the commit d88fe447df ("Add support for sharing /dev/shm/ and
/dev/mqueue between containers") container's /dev/shm is mounted on the
host first, then bind-mounted inside the container. This is done that
way in order to be able to share this container's IPC namespace
(and the /dev/shm mount point) with another container.

Unfortunately, this functionality breaks container checkpoint/restore
(even if IPC is not shared). Since /dev/shm is an external mount, its
contents are not saved by `criu checkpoint`, and so upon restore any
application that tries to access data under /dev/shm is severely
disappointed (which usually results in a fatal crash).

This commit solves the issue by introducing new IPC modes for containers
(in addition to 'host' and 'container:ID'). The new modes are:

 - 'shareable':	enables sharing this container's IPC with others
		(this used to be the implicit default);

 - 'private':	disables sharing this container's IPC.

In 'private' mode, container's /dev/shm is truly mounted inside the
container, without any bind-mounting from the host, which solves the
issue.

While at it, let's also implement 'none' mode. The motivation, as
eloquently put by Justin Cormack, is:

> I wondered a while back about having a none shm mode, as currently it is
> not possible to have a totally unwriteable container as there is always
> a /dev/shm writeable mount. It is a bit of a niche case (and clearly
> should never be allowed to be daemon default) but it would be trivial to
> add now so maybe we should...

...so here's yet another mode:

 - 'none':	no /dev/shm mount inside the container (though it still
		has its own private IPC namespace).

Now, to ultimately solve the abovementioned checkpoint/restore issue, we'd
need to make 'private' the default mode, but unfortunately it breaks the
backward compatibility. So, let's make the default container IPC mode
per-daemon configurable (with the built-in default set to 'shareable'
for now). The default can be changed either via a daemon CLI option
(--default-shm-mode) or a daemon.json configuration file parameter
of the same name.

Note one can only set either 'shareable' or 'private' IPC modes as a
daemon default (i.e. in this context 'host', 'container', or 'none'
do not make much sense).

Some other changes this patch introduces are:

1. A mount for /dev/shm is added to default OCI Linux spec.

2. IpcMode.Valid() is simplified to remove duplicated code that parsed
   'container:ID' form. Note the old version used to check that ID does
   not contain a semicolon -- this is no longer the case (tests are
   modified accordingly). The motivation is we should either do a
   proper check for container ID validity, or don't check it at all
   (since it is checked in other places anyway). I chose the latter.

3. IpcMode.Container() is modified to not return container ID if the
   mode value does not start with "container:", unifying the check to
   be the same as in IpcMode.IsContainer().

4. IPC mode unit tests (runconfig/hostconfig_test.go) are modified
   to add checks for newly added values.

[v2: addressed review at https://github.com/moby/moby/pull/34087#pullrequestreview-51345997]
[v3: addressed review at https://github.com/moby/moby/pull/34087#pullrequestreview-53902833]
[v4: addressed the case of upgrading from older daemon, in this case
     container.HostConfig.IpcMode is unset and this is valid]
[v5: document old and new IpcMode values in api/swagger.yaml]
[v6: add the 'none' mode, changelog entry to docs/api/version-history.md]

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2017-08-14 10:50:39 +03:00
Daniel J Walsh
bfdb0f3cb8 /dev should be constrained in size
There really is no reason why anyone should create content in /dev
other than device nodes. Limiting its size to 64k.

Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
2017-07-20 08:59:56 -04:00
John Howard
f154588226 LCOW: OCI Spec and Environment for container start
Signed-off-by: John Howard <jhoward@microsoft.com>
2017-06-20 19:50:11 -07:00
Michael Crosby
005506d36c Update moby to runc and oci 1.0 runtime final rc
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2017-05-05 13:45:45 -07:00
Ma Shimiao
730e0994c8 oci/namespace: remove unnecessary variable idx
Signed-off-by: Ma Shimiao <mashimiao.fnst@cn.fujitsu.com>
2016-12-22 09:08:43 +08:00
Tibor Vass
6547609870 plugins: misc fixes
Rename variable to reflect manifest -> config renaming
Populate Description fields when computing privileges.
Refactor/reuse code from daemon/oci_linux.go

Signed-off-by: Tibor Vass <tibor@docker.com>
2016-11-22 14:32:07 -08:00