Commit graph

118 commits

Author SHA1 Message Date
Sören Tempel
85eaf23bf4 seccomp: add support for "swapcontext" syscall in default policy
This system call is only available on the 32- and 64-bit PowerPC, it is
used by modern programming language implementations (such as gcc-go) to
implement coroutine features through userspace context switches.

Other container environment, such as Systemd nspawn already whitelist
this system call in their seccomp profile [1] [2]. As such, it would be
nice to also whitelist it in moby.

This issue was encountered on Alpine Linux GitLab CI system, which uses
moby, when attempting to execute gcc-go compiled software on ppc64le.

[1]: https://github.com/systemd/systemd/pull/9487
[2]: https://github.com/systemd/systemd/issues/9485

Signed-off-by: Sören Tempel <soeren+git@soeren-tempel.net>
2021-12-18 14:06:07 +01:00
Eng Zer Jun
c55a4ac779
refactor: move from io/ioutil to io and os package
The io/ioutil package has been deprecated in Go 1.16. This commit
replaces the existing io/ioutil functions with their new definitions in
io and os packages.

Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
2021-08-27 14:56:57 +08:00
Sebastiaan van Stijn
686be57d0a
Update to Go 1.17.0, and gofmt with Go 1.17
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2021-08-24 23:33:27 +02:00
Sebastiaan van Stijn
2480bebf59
Merge pull request #42649 from kinvolk/rata/seccomp-default-errno
seccomp: Use explicit DefaultErrnoRet
2021-08-03 15:13:42 +02:00
Rodrigo Campos
fb794166d9 seccomp: Use explicit DefaultErrnoRet
Since commit "seccomp: Sync fields with runtime-spec fields"
(5d244675bd) we support to specify the
DefaultErrnoRet to be used.

Before that commit it was not specified and EPERM was used by default.
This commit keeps the same behaviour but just makes it explicit that the
default is EPERM.

Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
2021-07-30 19:13:21 +02:00
Daniel P. Berrangé
9f6b562dd1 seccomp: add support for "clone3" syscall in default policy
If no seccomp policy is requested, then the built-in default policy in
dockerd applies. This has no rule for "clone3" defined, nor any default
errno defined. So when runc receives the config it attempts to determine
a default errno, using logic defined in its commit:

  7a8d7162f9

As explained in the above commit message, runc uses a heuristic to
decide which errno to return by default:

[quote]
  The solution applied here is to prepend a "stub" filter which returns
  -ENOSYS if the requested syscall has a larger syscall number than any
  syscall mentioned in the filter. The reason for this specific rule is
  that syscall numbers are (roughly) allocated sequentially and thus newer
  syscalls will (usually) have a larger syscall number -- thus causing our
  filters to produce -ENOSYS if the filter was written before the syscall
  existed.
[/quote]

Unfortunately clone3 appears to one of the edge cases that does not
result in use of ENOSYS, instead ending up with the historical EPERM
errno.

Latest glibc (2.33.9000, in Fedora 35 rawhide) will attempt to use
clone3 by default. If it sees ENOSYS then it will automatically
fallback to using clone. Any other errno is treated as a fatal
error. Thus when docker seccomp policy triggers EPERM from clone3,
no fallback occurs and programs are thus unable to spawn threads.

The clone3 syscall is much more complicated than clone, most notably its
flags are not exposed as a directly argument any more. Instead they are
hidden inside a struct. This means that seccomp filters are unable to
apply policy based on values seen in flags. Thus we can't directly
replicate the current "clone" filtering for "clone3". We can at least
ensure "clone3" returns ENOSYS errno, to trigger fallback to "clone"
at which point we can filter on flags.

Fixes: https://github.com/moby/moby/issues/42680
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
2021-07-27 10:56:07 +01:00
Sebastiaan van Stijn
0ef7e727d2
seccomp: Seccomp: embed oci-spec LinuxSeccomp, add support for seccomp flags
This patch, similar to d92739713c, embeds the
`LinuxSeccomp` type of the runtime-spec, so that we can support all options
provided by the spec, and decorates it with our own fields.

With this, profiles can make use of the recently added "Flags" field, to
specify flags that must be passed to seccomp(2) when installing the filter.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2021-07-17 15:57:54 +02:00
Sebastiaan van Stijn
bfd4b64600
seccomp: setupSeccomp(): update errors and remove redundant check
Make the error message slightly more informative, and remove the redundant
`len(config.ArchMap) != 0` check, as iterating over an empty, or 'nil' slice
is a no-op already. This allows to use a slightly more idiomatic "if ok := xx; ok"
condition.

Also move validation to the start of the loop (early return), and explicitly create
a new slice for "names" if the legacy "Name" field is used.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2021-07-17 15:57:41 +02:00
Sebastiaan van Stijn
c815b86f40
seccomp: add additional unit-tests
Add test to verify profile validation, and to verify that the legacy
format actually loads the profile as expected (instead of only verifying
it doesn't produce an error).

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2021-07-16 18:01:25 +02:00
Sebastiaan van Stijn
c1ced23544
seccomp: use oci-spec consts in tests
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2021-07-16 18:01:23 +02:00
Sebastiaan van Stijn
b309e96b11
seccomp: improve GoDoc for Seccomp fields
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2021-07-16 18:01:12 +02:00
Rodrigo Campos
5d244675bd seccomp: Sync fields with runtime-spec fields
The runtime spec we are using has support for these 3 fields[1], but
moby doesn't have them in its seccomp struct. This patch just adds and
copies them when they are in the profile.

DefaultErrnoRet is implemented in the runc version moby is using (it is
implemented since runc-rc95[2]) but if we create a container without
this moby patch, we don't see an error nor the expected behavior. This
is not clear for the user (the profile they specify is valid, the syntax
is ok, but the wrong behavior is seen).

This is because the DefaultErrnoRet field is not copied to the config
passed ultimately to runc (i.e. is like the field was not specified).
With this patch, we see the expected behavior.

The other two fileds are in the runtime-spec but not yet in runc (a PR
is open and targets 1.1.0 milestone). However, I took the liberty to
copy them now too for two reasons:

1. If we don't add them now and end up using a runc version that
supports them, then the error that the user will see is not clear at
all:

	docker: Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: listenerPath is not set: unknown.

And it is not obvious to debug for the user, as the field _is_ set in
the profile they specify (just not copied by moby to the profile moby
specifies ultimately to runc).

2. When using a runc without seccomp notify support (like today), the
error we see is the same with and without this moby patch (when using a
seccomp profile with the new fields):

	docker: Error response from daemon: OCI runtime create failed: string SCMP_ACT_NOTIFY is not a valid action for seccomp: unknown.

Then, it seems like a clear win to add them now: we don't have to do it
later (that implies not clear errors to the user if we forget, like we
did with DefaultErrnoRet) and the user sees the exact same error when
using a runc version that doesn't support these fields.

[1]: Note we are vendoring version 1c3f411f041711bbeecf35ff7e93461ea6789220 and this version has these 3 fields 1c3f411f04/config-linux.md (seccomp)
[2]: https://github.com/opencontainers/runc/pull/2954/
[3]: https://github.com/opencontainers/runc/pull/2682

Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
2021-07-08 17:11:53 +02:00
Sebastiaan van Stijn
c7cd1b9436
profiles/seccomp.Syscall: use pointers and omitempty
These fields are optional, and this makes the JSON representation
slightly less verbose.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2021-06-17 21:25:09 +02:00
Sebastiaan van Stijn
d92739713c
seccomp.Syscall: embed runtime-spec Syscall type
This makes the type better reflect the difference with the "runtime" profile;
our local type is used to generate a runtime-spec seccomp profile and extends
the runtime-spec type with additional fields; adding a "Name" field for backward
compatibility with older JSON representations, additional "Comment" metadata,
and conditional rules ("Includes", "Excludes") used during generation to adjust
the profile based on the container (capabilities) and host's (architecture, kernel)
configuration.

This change introduces one change in the type; the "runtime-spec" type uses a
`[]LinuxSeccompArg` for the `Args` field, whereas the local type used pointers;
`[]*LinuxSeccompArg`.

In addition, the runtime-spec Syscall type brings a new `ErrnoRet` field, allowing
the profile to specify the errno code returned for the syscall, which allows
changing the default EPERM for specific syscalls.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2021-06-17 21:25:06 +02:00
clubby789
d39b075302 Enable process_vm_readv and process_vm_writev for kernel > 4.8
These syscalls were disabled in #18971
due to them requiring CAP_PTRACE. CAP_PTRACE was blocked by default due
to a ptrace related exploit. This has been patched in the Linux kernel
(version 4.8) and thus `ptrace` has been re-enabled. However, these
associated syscalls seem to have been left behind. This commit brings
them in line with `ptrace`, and re-enables it for kernel > 4.8.

Signed-off-by: clubby789 <jamie@hill-daniel.co.uk>
2021-03-04 17:12:01 +00:00
Aleksa Sarai
54eff4354b
profiles: seccomp: update to Linux 5.11 syscall list
These syscalls (some of which have been in Linux for a while but were
missing from the profile) fall into a few buckets:

 * close_range(2), epoll_pwait2(2) are just extensions of existing "safe
   for everyone" syscalls.

 * The mountv2 API syscalls (fs*(2), move_mount(2), open_tree(2)) are
   all equivalent to aspects of mount(2) and thus go into the
   CAP_SYS_ADMIN category.

 * process_madvise(2) is similar to the other process_*(2) syscalls and
   thus goes in the CAP_SYS_PTRACE category.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2021-01-27 13:25:49 +11:00
Mark Vainomaa
f7bcb02f67
seccomp: Add pidfd_getfd syscall
Signed-off-by: Mark Vainomaa <mikroskeem@mikroskeem.eu>
2020-11-12 15:31:07 +02:00
Mark Vainomaa
5e3ffe6464
seccomp: Add pidfd_open and pidfd_send_signal
Signed-off-by: Mark Vainomaa <mikroskeem@mikroskeem.eu>
2020-11-11 15:20:34 +02:00
Sebastiaan van Stijn
4539e7f0eb
seccomp: implement marshal/unmarshall for MinVersion
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-10-07 17:48:25 +02:00
Sebastiaan van Stijn
a692823413
seccomp: add test for unmarshal default profile
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-10-02 18:15:46 +02:00
Sebastiaan van Stijn
97535c6c2b
seccomp: remove dependency on pkg/parsers/kernel
This removes the dependency on the `pkg/parsers/kernel` package, because secomp
only needs to consider Linux (and no parsing is needed for Windows or Darwin kernel
versions).

This patch implements the minimum requirements for this implementation:

- only `kernel` and `major` versions are considered
- `minor` version, `flavor`, and `-rcXX` suffixes are ignored

So, for example:

- `3.4.54.longterm-1` => `kernel: 3`, `major: 4`
- `3.8.0-19-generic` => `kernel: 3`, `major: 8`
- `3.10.0-862.2.3.el7.x86_64` => `kernel: 3`, `major: 10`

Some systems also omit the `minor` and/or have odd-formatted versions. In context
of generating seccomp profiles, both versions below are considered equal;

- `3.12.25-gentoo` => `kernel: 3`, `major: 12`
- `3.12-1-amd64` => `kernel: 3`, `major: 12`

Note that `-rcX` suffixes are also not considered, and thus (e.g.) kernel `5.9-rc1`,
`5.9-rc6` and `5.9` are all considered equal.

The motivation for ignoring "minor" versions and "flavors" is that;

- The upstream kernel only does "kernel.major" releases
- While release-candidates exists for kernel (e.g. 5.9-rc5), we don't expect users
  to write profiles that target a specific release-candidate, and therefore consider
  (e.g.) kernel `5.9-rc1`, `5.9-rc6` and `5.9` to be equal.
- Generally, a seccomp-profile should either be portable, or written for a specific
  infrastructure (in which case the writer of the profile would know if the kernel-flavors
  used does/does not support certain things.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-10-02 18:15:37 +02:00
Sebastiaan van Stijn
56e7bc4b78
seccomp: remove dependency on oci package
rewrite the tests to use a minimal runtime-spec Spec instead

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-09-29 19:39:15 +02:00
Sebastiaan van Stijn
b8aec34680
seccomp: add test for loading old JSON format
Commit 5ff21add06 changed the (JSON) format that's
used for seccomp profiles, but keeping the code backward compatible to allow both
the old or new format.

This patch adds a new test, which loads the old format. It takes the default seccomp
profile before the format was changed.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-09-28 09:50:03 +02:00
Sebastiaan van Stijn
0d75b63987
seccomp: replace types with runtime-spec types
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-09-18 19:33:58 +02:00
Sebastiaan van Stijn
0efee50b95
seccomp: move seccomp types from api into seccomp profile
These types were not used in the API, so could not come up with
a reason why they were in that package.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-09-18 18:14:16 +02:00
Sebastiaan van Stijn
1155b6bc7a
Merge pull request #41395 from cpuguy83/no_libseccomp
Remove dependency in dockerd on libseccomp
2020-09-15 17:37:04 +02:00
Brian Goff
ccbb00c815 Remove dependency in dockerd on libseccomp
This was just using libseccomp to get the right arch, but we can use
GOARCH to get this.
The nativeToSeccomp map needed to be adjusted a bit for mipsle vs mipsel
since that's go how refers to it. Also added some other arches to it.

Signed-off-by: Brian Goff <cpuguy83@gmail.com>
2020-09-11 22:48:42 +00:00
Justin Cormack
7ca355652f
Merge pull request #41337 from cyphar/apparmor-update-profile
apparmor: permit signals from unconfined programs
2020-09-11 12:05:40 +01:00
Jintao Zhang
a18139111d Add faccessat2 to default seccomp profile.
Signed-off-by: Jintao Zhang <zhangjintao9020@gmail.com>
2020-08-17 21:13:03 +08:00
Jintao Zhang
b8988c8475 Add openat2 to default seccomp profile.
follow up to https://github.com/moby/moby/pull/41344#discussion_r469919978

Signed-off-by: Jintao Zhang <zhangjintao9020@gmail.com>
2020-08-16 15:58:57 +08:00
Aleksa Sarai
725eced4e0
apparmor: permit signals from unconfined programs
Otherwise if you try to kill a container process from the host directly,
you get EACCES. Also add a comment to make sure that the profile code
(which has been replicated by several projects) doesn't get out of sync.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2020-08-11 18:18:58 +10:00
Sebastiaan van Stijn
3895dd585f
Replace uses of blacklist/whitelist
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-07-14 10:41:34 +02:00
Florian Schmaus
d0d99b04cf seccomp: allow 'rseq' syscall in default seccomp profile
Restartable Sequences (rseq) are a kernel-based mechanism for fast
update operations on per-core data in user-space. Some libraries, like
the newest version of Google's TCMalloc, depend on it [1].

This also makes dockers default seccomp profile on par with systemd's,
which enabled 'rseq' in early 2019 [2].

1: https://google.github.io/tcmalloc/design.html
2: 6fee3be0b4

Signed-off-by: Florian Schmaus <flo@geekplace.eu>
2020-06-26 16:06:26 +02:00
Justin Cormack
1aafcbb47a
Merge pull request #40995 from KentaTada/remove-unused-syscall
seccomp: remove the unused query_module(2)
2020-05-28 11:25:59 +01:00
Akihiro Suda
b2917efb1a
Merge pull request #40731 from sqreen/fix/seccomp-profile
seccomp: allow syscall membarrier
2020-05-20 00:31:32 +09:00
Kenta Tada
1192c7aee4 seccomp: remove the unused query_module(2)
query_module(2) is only in kernels before Linux 2.6.

Signed-off-by: Kenta Tada <Kenta.Tada@sony.com>
2020-05-19 10:30:54 +09:00
Stanislav Levin
5d3a9e4319
seccomp: Whitelist clock_adjtime
This only allows making the syscall. CAP_SYS_TIME is still required
for time adjustment (enforced by the kernel):

```
kernel/time/posix-timers.c:

1112 SYSCALL_DEFINE2(clock_adjtime, const clockid_t, which_clock,
1113                 struct __kernel_timex __user *, utx)
...
1121         err = do_clock_adjtime(which_clock, &ktx);

1100 int do_clock_adjtime(const clockid_t which_clock, struct __kernel_timex * ktx)
1101 {
...
1109         return kc->clock_adj(which_clock, ktx);

1299 static const struct k_clock clock_realtime = {
...
1304         .clock_adj              = posix_clock_realtime_adj,

188 static int posix_clock_realtime_adj(const clockid_t which_clock,
189                                     struct __kernel_timex *t)
190 {
191         return do_adjtimex(t);

kernel/time/timekeeping.c:

2312 int do_adjtimex(struct __kernel_timex *txc)
2313 {
...
2321         /* Validate the data before disabling interrupts */
2322         ret = timekeeping_validate_timex(txc);

2246 static int timekeeping_validate_timex(const struct __kernel_timex *txc)
2247 {
2248         if (txc->modes & ADJ_ADJTIME) {
...
2252                 if (!(txc->modes & ADJ_OFFSET_READONLY) &&
2253                     !capable(CAP_SYS_TIME))
2254                         return -EPERM;
2255         } else {
2256                 /* In order to modify anything, you gotta be super-user! */
2257                 if (txc->modes && !capable(CAP_SYS_TIME))
2258                         return -EPERM;

```

Fixes: https://github.com/moby/moby/issues/40919
Signed-off-by: Stanislav Levin <slev@altlinux.org>
2020-05-08 12:33:25 +03:00
Julio Guerra
1026f873a4
seccomp: allow syscall membarrier
Add the membarrier syscall to the default seccomp profile.
It is for example used in the implementation of dlopen() in
the musl libc of Alpine images.

Signed-off-by: Julio Guerra <julio@sqreen.com>
2020-04-07 16:24:17 +02:00
Sebastiaan van Stijn
89fabf0f24
seccomp: add 64-bit time_t syscalls
Relates to https://patchwork.kernel.org/patch/10756415/

Added to whitelist:

- `clock_getres_time64` (equivalent of `clock_getres`, which was whitelisted)
- `clock_gettime64` (equivalent of `clock_gettime`, which was whitelisted)
- `clock_nanosleep_time64` (equivalent of `clock_nanosleep`, which was whitelisted)
- `futex_time64` (equivalent of `futex`, which was whitelisted)
- `io_pgetevents_time64` (equivalent of `io_pgetevents`, which was whitelisted)
- `mq_timedreceive_time64` (equivalent of `mq_timedreceive`, which was whitelisted)
- `mq_timedsend_time64 ` (equivalent of `mq_timedsend`, which was whitelisted)
- `ppoll_time64` (equivalent of `ppoll`, which was whitelisted)
- `pselect6_time64` (equivalent of `pselect6`, which was whitelisted)
- `recvmmsg_time64` (equivalent of `recvmmsg`, which was whitelisted)
- `rt_sigtimedwait_time64` (equivalent of `rt_sigtimedwait`, which was whitelisted)
- `sched_rr_get_interval_time64` (equivalent of `sched_rr_get_interval`, which was whitelisted)
- `semtimedop_time64` (equivalent of `semtimedop`, which was whitelisted)
- `timer_gettime64` (equivalent of `timer_gettime`, which was whitelisted)
- `timer_settime64` (equivalent of `timer_settime`, which was whitelisted)
- `timerfd_gettime64` (equivalent of `timerfd_gettime`, which was whitelisted)
- `timerfd_settime64` (equivalent of `timerfd_settime`, which was whitelisted)
- `utimensat_time64` (equivalent of `utimensat`, which was whitelisted)

Not added to whitelist:

- `clock_adjtime64` (equivalent of `clock_adjtime`, which was not whitelisted)
- `clock_settime64` (equivalent of `clock_settime`, which was not whitelisted)

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-03-25 13:49:49 +01:00
Arnaud Rebillout
667c87ef4f profiles: Fix file permissions on json files
json files should not be executable I think.

Signed-off-by: Arnaud Rebillout <arnaud.rebillout@collabora.com>
2019-09-16 11:15:37 +07:00
youcai
f4d41f1dfa seccomp: whitelist io-uring related system calls
Signed-off-by: youcai <omegacoleman@gmail.com>
2019-09-07 07:35:23 +00:00
Michael Crosby
e4605cc2a5 Add sigprocmask to default seccomp profile
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2019-08-29 13:52:45 -04:00
Kir Kolyshkin
0d496e3d71 profiles/seccomp: improve profile conversion
When translating seccomp profile to opencontainers format, a single
group with multiple syscalls is converted to individual syscall rules.
I am not sure why it is done that way, but suspect it might have
performance implications as the number of rules grows.

Change this to pass a groups of syscalls as a group.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2019-06-18 17:58:51 -07:00
Sebastiaan van Stijn
9e763de6ad
Merge pull request #39121 from goldwynr/master
apparmor: allow readby and tracedby
2019-06-11 18:25:47 +02:00
Sebastiaan van Stijn
a1ec8551ab
Fix seccomp profile for clone syscall
All clone flags for namespace should be denied.

Based-on-patch-by: Kenta Tada <Kenta.Tada@sony.com>
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2019-06-04 15:28:12 +02:00
Goldwyn Rodrigues
b36455258f apparmor: allow readby and tracedby
Fixes audit errors such as:

type=AVC msg=audit(1550236803.810:143):
apparmor="DENIED" operation="ptrace" profile="docker-default"
pid=3181 comm="ps" requested_mask="readby" denied_mask="readby"
peer="docker-default"

audit(1550236375.918:3): apparmor="DENIED" operation="ptrace"
profile="docker-default" pid=2267 comm="ps"
requested_mask="tracedby" denied_mask="tracedby"
peer="docker-default"

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
2019-04-22 09:11:50 -05:00
Avi Kivity
665741510a seccomp: whitelist io_pgetevents()
io_pgetevents() is a new Linux system call. It is similar to io_getevents()
that is already whitelisted, and adds no special abilities over that system call.

Allow that system call to enable applications that use it.

Fixes #38894.

Signed-off-by: Avi Kivity <avi@scylladb.com>
2019-03-18 20:46:16 +02:00
Tonis Tiigi
e76380b67b seccomp: review update
Signed-off-by: Tonis Tiigi <tonistiigi@gmail.com>
2019-02-05 12:02:41 -08:00
Justin Cormack
1603af9689
Merge pull request #38137 from tonistiigi/seccomp-ptrace
seccomp: allow ptrace(2) for 4.8+ kernels
2019-02-05 13:47:43 +00:00
Vincent Demeester
f11b87bfca
Merge pull request #37831 from cyphar/apparmor-external-templates
apparmor: allow receiving of signals from 'docker kill'
2018-11-19 09:12:15 +01:00