Commit graph

23 commits

Author SHA1 Message Date
Tianon Gravi
567c01f6d1 seccomp: add support for "clone3" syscall in default policy
This is a backport of 9f6b562dd1, adapted to avoid the refactoring that happened in d92739713c.

Original commit message is as follows:

> If no seccomp policy is requested, then the built-in default policy in
> dockerd applies. This has no rule for "clone3" defined, nor any default
> errno defined. So when runc receives the config it attempts to determine
> a default errno, using logic defined in its commit:
>
>   opencontainers/runc@7a8d716
>
> As explained in the above commit message, runc uses a heuristic to
> decide which errno to return by default:
>
> [quote]
>   The solution applied here is to prepend a "stub" filter which returns
>   -ENOSYS if the requested syscall has a larger syscall number than any
>   syscall mentioned in the filter. The reason for this specific rule is
>   that syscall numbers are (roughly) allocated sequentially and thus newer
>   syscalls will (usually) have a larger syscall number -- thus causing our
>   filters to produce -ENOSYS if the filter was written before the syscall
>   existed.
> [/quote]
>
> Unfortunately clone3 appears to one of the edge cases that does not
> result in use of ENOSYS, instead ending up with the historical EPERM
> errno.
>
> Latest glibc (2.33.9000, in Fedora 35 rawhide) will attempt to use
> clone3 by default. If it sees ENOSYS then it will automatically
> fallback to using clone. Any other errno is treated as a fatal
> error. Thus when docker seccomp policy triggers EPERM from clone3,
> no fallback occurs and programs are thus unable to spawn threads.
>
> The clone3 syscall is much more complicated than clone, most notably its
> flags are not exposed as a directly argument any more. Instead they are
> hidden inside a struct. This means that seccomp filters are unable to
> apply policy based on values seen in flags. Thus we can't directly
> replicate the current "clone" filtering for "clone3". We can at least
> ensure "clone3" returns ENOSYS errno, to trigger fallback to "clone"
> at which point we can filter on flags.

Signed-off-by: Tianon Gravi <admwiggin@gmail.com>
Co-authored-by: Daniel P. Berrangé <berrange@redhat.com>
2021-09-13 08:56:21 -07:00
Sebastiaan van Stijn
4539e7f0eb
seccomp: implement marshal/unmarshall for MinVersion
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-10-07 17:48:25 +02:00
Sebastiaan van Stijn
97535c6c2b
seccomp: remove dependency on pkg/parsers/kernel
This removes the dependency on the `pkg/parsers/kernel` package, because secomp
only needs to consider Linux (and no parsing is needed for Windows or Darwin kernel
versions).

This patch implements the minimum requirements for this implementation:

- only `kernel` and `major` versions are considered
- `minor` version, `flavor`, and `-rcXX` suffixes are ignored

So, for example:

- `3.4.54.longterm-1` => `kernel: 3`, `major: 4`
- `3.8.0-19-generic` => `kernel: 3`, `major: 8`
- `3.10.0-862.2.3.el7.x86_64` => `kernel: 3`, `major: 10`

Some systems also omit the `minor` and/or have odd-formatted versions. In context
of generating seccomp profiles, both versions below are considered equal;

- `3.12.25-gentoo` => `kernel: 3`, `major: 12`
- `3.12-1-amd64` => `kernel: 3`, `major: 12`

Note that `-rcX` suffixes are also not considered, and thus (e.g.) kernel `5.9-rc1`,
`5.9-rc6` and `5.9` are all considered equal.

The motivation for ignoring "minor" versions and "flavors" is that;

- The upstream kernel only does "kernel.major" releases
- While release-candidates exists for kernel (e.g. 5.9-rc5), we don't expect users
  to write profiles that target a specific release-candidate, and therefore consider
  (e.g.) kernel `5.9-rc1`, `5.9-rc6` and `5.9` to be equal.
- Generally, a seccomp-profile should either be portable, or written for a specific
  infrastructure (in which case the writer of the profile would know if the kernel-flavors
  used does/does not support certain things.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-10-02 18:15:37 +02:00
Sebastiaan van Stijn
0d75b63987
seccomp: replace types with runtime-spec types
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-09-18 19:33:58 +02:00
Sebastiaan van Stijn
0efee50b95
seccomp: move seccomp types from api into seccomp profile
These types were not used in the API, so could not come up with
a reason why they were in that package.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-09-18 18:14:16 +02:00
Brian Goff
ccbb00c815 Remove dependency in dockerd on libseccomp
This was just using libseccomp to get the right arch, but we can use
GOARCH to get this.
The nativeToSeccomp map needed to be adjusted a bit for mipsle vs mipsel
since that's go how refers to it. Also added some other arches to it.

Signed-off-by: Brian Goff <cpuguy83@gmail.com>
2020-09-11 22:48:42 +00:00
Kir Kolyshkin
0d496e3d71 profiles/seccomp: improve profile conversion
When translating seccomp profile to opencontainers format, a single
group with multiple syscalls is converted to individual syscall rules.
I am not sure why it is done that way, but suspect it might have
performance implications as the number of rules grows.

Change this to pass a groups of syscalls as a group.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2019-06-18 17:58:51 -07:00
Tonis Tiigi
e76380b67b seccomp: review update
Signed-off-by: Tonis Tiigi <tonistiigi@gmail.com>
2019-02-05 12:02:41 -08:00
Tonis Tiigi
1124543ca8 seccomp: allow ptrace for 4.8+ kernels
4.8+ kernels have fixed the ptrace security issues
so we can allow ptrace(2) on the default seccomp
profile if we do the kernel version check.

93e35efb8d

Signed-off-by: Tonis Tiigi <tonistiigi@gmail.com>
2018-11-04 13:06:43 -08:00
Justin Cormack
15ff09395c
If container will run as non root user, drop permitted, effective caps early
As soon as the initial executable in the container is executed as a non root user,
permitted and effective capabilities are dropped. Drop them earlier than this, so
that they are dropped before executing the file. The main effect of this is that
if `CAP_DAC_OVERRIDE` is set (the default) the user will not be able to execute
files they do not have permission to execute, which previously they could.

The old behaviour was somewhat surprising and the new one is definitely correct,
but it is not in any meaningful way exploitable, and I do not think it is
necessary to backport this fix. It is unlikely to have any negative effects as
almost all executables have world execute permission anyway.

Use the bounding set not the effective set as the canonical set of capabilities, as
effective will now vary.

Signed-off-by: Justin Cormack <justin.cormack@docker.com>
2018-03-19 14:45:27 -07:00
Daniel Nephin
4f0d95fa6e Add canonical import comment
Signed-off-by: Daniel Nephin <dnephin@docker.com>
2018-02-05 16:51:57 -05:00
Chao Wang
5c154cfac8 Copy Inslice() to those parts that use it
Signed-off-by: Chao Wang <wangchao.fnst@cn.fujitsu.com>
2017-11-10 13:42:38 +08:00
Michael Crosby
005506d36c Update moby to runc and oci 1.0 runtime final rc
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2017-05-05 13:45:45 -07:00
Antonio Murdaca
197f3ee687
profiles/seccomp: fix comment
Signed-off-by: Antonio Murdaca <runcom@redhat.com>
2016-11-25 11:40:54 +01:00
Michael Crosby
91e197d614 Add engine-api types to docker
This moves the types for the `engine-api` repo to the existing types
package.

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2016-09-07 11:05:58 -07:00
Antonio Murdaca
5ff21add06
New seccomp format
Signed-off-by: Antonio Murdaca <runcom@redhat.com>
2016-09-01 11:53:07 +02:00
Michael Crosby
041e5a21dc Replace old oci specs import with runtime-specs
Fixes #25804

The upstream repo changed the import paths.

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2016-08-17 09:38:34 -07:00
Justin Cormack
a01c4dc8f8 Align default seccomp profile with selected capabilities
Currently the default seccomp profile is fixed. This changes it
so that it varies depending on the Linux capabilities selected with
the --cap-add and --cap-drop options. Without this, if a user adds
privileges, eg to allow ptrace with --cap-add sys_ptrace then still
cannot actually use ptrace as it is still blocked by seccomp, so
they will probably disable seccomp or use --privileged. With this
change the syscalls that are needed for the capability are also
allowed by the seccomp profile based on the selected capabilities.

While this patch makes it easier to do things with for example
cap_sys_admin enabled, as it will now allow creating new namespaces
and use of mount, it still allows less than --cap-add cap_sys_admin
--security-opt seccomp:unconfined would have previously. It is not
recommended that users run containers with cap_sys_admin as this does
give full access to the host machine.

It also cleans up some architecture specific system calls to be
only selected when needed.

Signed-off-by: Justin Cormack <justin.cormack@docker.com>
2016-05-11 09:30:23 +01:00
Tonis Tiigi
99b16b3523 Reuse profiles/seccomp package
Signed-off-by: Tonis Tiigi <tonistiigi@gmail.com>
2016-03-19 14:15:39 -07:00
allencloud
34b82a69b9 fix some typos.
Signed-off-by: allencloud <allen.sun@daocloud.io>
2016-03-10 10:09:27 +08:00
Jessica Frazelle
ad600239bc
generate seccomp profile convert type
Signed-off-by: Jessica Frazelle <acidburn@docker.com>
2016-02-19 13:32:54 -08:00
Jessica Frazelle
d57816de02
add default seccomp profile as json
profile is created by go generate

Signed-off-by: Jessica Frazelle <acidburn@docker.com>
2016-02-08 08:19:21 -08:00
Jessica Frazelle
bed0bb7d01
move default seccomp profile into package
Signed-off-by: Jessica Frazelle <acidburn@docker.com>
2016-01-21 16:55:29 -08:00
Renamed from daemon/execdriver/native/seccomp.go (Browse further)