Commit graph

93 commits

Author SHA1 Message Date
Sebastiaan van Stijn
1155b6bc7a
Merge pull request #41395 from cpuguy83/no_libseccomp
Remove dependency in dockerd on libseccomp
2020-09-15 17:37:04 +02:00
Brian Goff
ccbb00c815 Remove dependency in dockerd on libseccomp
This was just using libseccomp to get the right arch, but we can use
GOARCH to get this.
The nativeToSeccomp map needed to be adjusted a bit for mipsle vs mipsel
since that's go how refers to it. Also added some other arches to it.

Signed-off-by: Brian Goff <cpuguy83@gmail.com>
2020-09-11 22:48:42 +00:00
Justin Cormack
7ca355652f
Merge pull request #41337 from cyphar/apparmor-update-profile
apparmor: permit signals from unconfined programs
2020-09-11 12:05:40 +01:00
Jintao Zhang
a18139111d Add faccessat2 to default seccomp profile.
Signed-off-by: Jintao Zhang <zhangjintao9020@gmail.com>
2020-08-17 21:13:03 +08:00
Jintao Zhang
b8988c8475 Add openat2 to default seccomp profile.
follow up to https://github.com/moby/moby/pull/41344#discussion_r469919978

Signed-off-by: Jintao Zhang <zhangjintao9020@gmail.com>
2020-08-16 15:58:57 +08:00
Aleksa Sarai
725eced4e0
apparmor: permit signals from unconfined programs
Otherwise if you try to kill a container process from the host directly,
you get EACCES. Also add a comment to make sure that the profile code
(which has been replicated by several projects) doesn't get out of sync.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2020-08-11 18:18:58 +10:00
Sebastiaan van Stijn
3895dd585f
Replace uses of blacklist/whitelist
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-07-14 10:41:34 +02:00
Florian Schmaus
d0d99b04cf seccomp: allow 'rseq' syscall in default seccomp profile
Restartable Sequences (rseq) are a kernel-based mechanism for fast
update operations on per-core data in user-space. Some libraries, like
the newest version of Google's TCMalloc, depend on it [1].

This also makes dockers default seccomp profile on par with systemd's,
which enabled 'rseq' in early 2019 [2].

1: https://google.github.io/tcmalloc/design.html
2: 6fee3be0b4

Signed-off-by: Florian Schmaus <flo@geekplace.eu>
2020-06-26 16:06:26 +02:00
Justin Cormack
1aafcbb47a
Merge pull request #40995 from KentaTada/remove-unused-syscall
seccomp: remove the unused query_module(2)
2020-05-28 11:25:59 +01:00
Akihiro Suda
b2917efb1a
Merge pull request #40731 from sqreen/fix/seccomp-profile
seccomp: allow syscall membarrier
2020-05-20 00:31:32 +09:00
Kenta Tada
1192c7aee4 seccomp: remove the unused query_module(2)
query_module(2) is only in kernels before Linux 2.6.

Signed-off-by: Kenta Tada <Kenta.Tada@sony.com>
2020-05-19 10:30:54 +09:00
Stanislav Levin
5d3a9e4319
seccomp: Whitelist clock_adjtime
This only allows making the syscall. CAP_SYS_TIME is still required
for time adjustment (enforced by the kernel):

```
kernel/time/posix-timers.c:

1112 SYSCALL_DEFINE2(clock_adjtime, const clockid_t, which_clock,
1113                 struct __kernel_timex __user *, utx)
...
1121         err = do_clock_adjtime(which_clock, &ktx);

1100 int do_clock_adjtime(const clockid_t which_clock, struct __kernel_timex * ktx)
1101 {
...
1109         return kc->clock_adj(which_clock, ktx);

1299 static const struct k_clock clock_realtime = {
...
1304         .clock_adj              = posix_clock_realtime_adj,

188 static int posix_clock_realtime_adj(const clockid_t which_clock,
189                                     struct __kernel_timex *t)
190 {
191         return do_adjtimex(t);

kernel/time/timekeeping.c:

2312 int do_adjtimex(struct __kernel_timex *txc)
2313 {
...
2321         /* Validate the data before disabling interrupts */
2322         ret = timekeeping_validate_timex(txc);

2246 static int timekeeping_validate_timex(const struct __kernel_timex *txc)
2247 {
2248         if (txc->modes & ADJ_ADJTIME) {
...
2252                 if (!(txc->modes & ADJ_OFFSET_READONLY) &&
2253                     !capable(CAP_SYS_TIME))
2254                         return -EPERM;
2255         } else {
2256                 /* In order to modify anything, you gotta be super-user! */
2257                 if (txc->modes && !capable(CAP_SYS_TIME))
2258                         return -EPERM;

```

Fixes: https://github.com/moby/moby/issues/40919
Signed-off-by: Stanislav Levin <slev@altlinux.org>
2020-05-08 12:33:25 +03:00
Julio Guerra
1026f873a4
seccomp: allow syscall membarrier
Add the membarrier syscall to the default seccomp profile.
It is for example used in the implementation of dlopen() in
the musl libc of Alpine images.

Signed-off-by: Julio Guerra <julio@sqreen.com>
2020-04-07 16:24:17 +02:00
Sebastiaan van Stijn
89fabf0f24
seccomp: add 64-bit time_t syscalls
Relates to https://patchwork.kernel.org/patch/10756415/

Added to whitelist:

- `clock_getres_time64` (equivalent of `clock_getres`, which was whitelisted)
- `clock_gettime64` (equivalent of `clock_gettime`, which was whitelisted)
- `clock_nanosleep_time64` (equivalent of `clock_nanosleep`, which was whitelisted)
- `futex_time64` (equivalent of `futex`, which was whitelisted)
- `io_pgetevents_time64` (equivalent of `io_pgetevents`, which was whitelisted)
- `mq_timedreceive_time64` (equivalent of `mq_timedreceive`, which was whitelisted)
- `mq_timedsend_time64 ` (equivalent of `mq_timedsend`, which was whitelisted)
- `ppoll_time64` (equivalent of `ppoll`, which was whitelisted)
- `pselect6_time64` (equivalent of `pselect6`, which was whitelisted)
- `recvmmsg_time64` (equivalent of `recvmmsg`, which was whitelisted)
- `rt_sigtimedwait_time64` (equivalent of `rt_sigtimedwait`, which was whitelisted)
- `sched_rr_get_interval_time64` (equivalent of `sched_rr_get_interval`, which was whitelisted)
- `semtimedop_time64` (equivalent of `semtimedop`, which was whitelisted)
- `timer_gettime64` (equivalent of `timer_gettime`, which was whitelisted)
- `timer_settime64` (equivalent of `timer_settime`, which was whitelisted)
- `timerfd_gettime64` (equivalent of `timerfd_gettime`, which was whitelisted)
- `timerfd_settime64` (equivalent of `timerfd_settime`, which was whitelisted)
- `utimensat_time64` (equivalent of `utimensat`, which was whitelisted)

Not added to whitelist:

- `clock_adjtime64` (equivalent of `clock_adjtime`, which was not whitelisted)
- `clock_settime64` (equivalent of `clock_settime`, which was not whitelisted)

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-03-25 13:49:49 +01:00
Arnaud Rebillout
667c87ef4f profiles: Fix file permissions on json files
json files should not be executable I think.

Signed-off-by: Arnaud Rebillout <arnaud.rebillout@collabora.com>
2019-09-16 11:15:37 +07:00
youcai
f4d41f1dfa seccomp: whitelist io-uring related system calls
Signed-off-by: youcai <omegacoleman@gmail.com>
2019-09-07 07:35:23 +00:00
Michael Crosby
e4605cc2a5 Add sigprocmask to default seccomp profile
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2019-08-29 13:52:45 -04:00
Kir Kolyshkin
0d496e3d71 profiles/seccomp: improve profile conversion
When translating seccomp profile to opencontainers format, a single
group with multiple syscalls is converted to individual syscall rules.
I am not sure why it is done that way, but suspect it might have
performance implications as the number of rules grows.

Change this to pass a groups of syscalls as a group.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2019-06-18 17:58:51 -07:00
Sebastiaan van Stijn
9e763de6ad
Merge pull request #39121 from goldwynr/master
apparmor: allow readby and tracedby
2019-06-11 18:25:47 +02:00
Sebastiaan van Stijn
a1ec8551ab
Fix seccomp profile for clone syscall
All clone flags for namespace should be denied.

Based-on-patch-by: Kenta Tada <Kenta.Tada@sony.com>
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2019-06-04 15:28:12 +02:00
Goldwyn Rodrigues
b36455258f apparmor: allow readby and tracedby
Fixes audit errors such as:

type=AVC msg=audit(1550236803.810:143):
apparmor="DENIED" operation="ptrace" profile="docker-default"
pid=3181 comm="ps" requested_mask="readby" denied_mask="readby"
peer="docker-default"

audit(1550236375.918:3): apparmor="DENIED" operation="ptrace"
profile="docker-default" pid=2267 comm="ps"
requested_mask="tracedby" denied_mask="tracedby"
peer="docker-default"

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
2019-04-22 09:11:50 -05:00
Avi Kivity
665741510a seccomp: whitelist io_pgetevents()
io_pgetevents() is a new Linux system call. It is similar to io_getevents()
that is already whitelisted, and adds no special abilities over that system call.

Allow that system call to enable applications that use it.

Fixes #38894.

Signed-off-by: Avi Kivity <avi@scylladb.com>
2019-03-18 20:46:16 +02:00
Tonis Tiigi
e76380b67b seccomp: review update
Signed-off-by: Tonis Tiigi <tonistiigi@gmail.com>
2019-02-05 12:02:41 -08:00
Justin Cormack
1603af9689
Merge pull request #38137 from tonistiigi/seccomp-ptrace
seccomp: allow ptrace(2) for 4.8+ kernels
2019-02-05 13:47:43 +00:00
Vincent Demeester
f11b87bfca
Merge pull request #37831 from cyphar/apparmor-external-templates
apparmor: allow receiving of signals from 'docker kill'
2018-11-19 09:12:15 +01:00
Tonis Tiigi
1124543ca8 seccomp: allow ptrace for 4.8+ kernels
4.8+ kernels have fixed the ptrace security issues
so we can allow ptrace(2) on the default seccomp
profile if we do the kernel version check.

93e35efb8d

Signed-off-by: Tonis Tiigi <tonistiigi@gmail.com>
2018-11-04 13:06:43 -08:00
Justin Cormack
ccd22ffcc8
Move the syslog syscall to be gated by CAP_SYS_ADMIN or CAP_SYSLOG
This call is what is used to implement `dmesg` to get kernel messages
about the host. This can leak substantial information about the host.
It is normally available to unprivileged users on the host, unless
the sysctl `kernel.dmesg_restrict = 1` is set, but this is not set
by standard on the majority of distributions. Blocking this to restrict
leaks about the configuration seems correct.

Fix #37897

See also https://googleprojectzero.blogspot.com/2018/09/a-cache-invalidation-bug-in-linux.html

Signed-off-by: Justin Cormack <justin.cormack@docker.com>
2018-09-27 14:27:05 -07:00
Aleksa Sarai
4822fb1e24
apparmor: allow receiving of signals from 'docker kill'
In newer kernels, AppArmor will reject attempts to send signals to a
container because the signal originated from outside of that AppArmor
profile. Correct this by allowing all unconfined signals to be received.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Aleksa Sarai <asarai@suse.de>
2018-09-13 02:06:56 +10:00
Nicolas V Castet
47dfff68e4 Whitelist syscalls linked to CAP_SYS_NICE in default seccomp profile
* Update profile to match docker documentation at
  https://docs.docker.com/engine/security/seccomp/

Signed-off-by: Nicolas V Castet <nvcastet@us.ibm.com>
2018-06-20 07:32:08 -05:00
Justin Cormack
15ff09395c
If container will run as non root user, drop permitted, effective caps early
As soon as the initial executable in the container is executed as a non root user,
permitted and effective capabilities are dropped. Drop them earlier than this, so
that they are dropped before executing the file. The main effect of this is that
if `CAP_DAC_OVERRIDE` is set (the default) the user will not be able to execute
files they do not have permission to execute, which previously they could.

The old behaviour was somewhat surprising and the new one is definitely correct,
but it is not in any meaningful way exploitable, and I do not think it is
necessary to backport this fix. It is unlikely to have any negative effects as
almost all executables have world execute permission anyway.

Use the bounding set not the effective set as the canonical set of capabilities, as
effective will now vary.

Signed-off-by: Justin Cormack <justin.cormack@docker.com>
2018-03-19 14:45:27 -07:00
NobodyOnSE
b2a907c8ca Whitelist statx syscall for libseccomp-2.3.3 onward
Older seccomp versions will ignore this.

Signed-off-by: NobodyOnSE <ich@sektor.selfip.com>
2018-03-06 08:42:12 +01:00
Daniel Nephin
4f0d95fa6e Add canonical import comment
Signed-off-by: Daniel Nephin <dnephin@docker.com>
2018-02-05 16:51:57 -05:00
Chao Wang
5c154cfac8 Copy Inslice() to those parts that use it
Signed-off-by: Chao Wang <wangchao.fnst@cn.fujitsu.com>
2017-11-10 13:42:38 +08:00
Tycho Andersen
b4a6ccbc5f drop useless apparmor denies
These files don't exist under proc so this rule does nothing.

They are protected against by docker's default cgroup devices since they're
both character devices and not explicitly allowed.

Signed-off-by: Tycho Andersen <tycho@docker.com>
2017-10-06 09:11:59 -06:00
Simon Vikstrom
d7bf5e3b4d Remove double defined alarm
Signed-off-by: Simon Vikstrom <pullreq@devsn.se>
2017-08-19 09:55:03 +02:00
Yong Tang
bbb401de87 Merge pull request #34445 from pmoust/f-seccomp-quotacl
seccomp: whitelist quotactl with CAP_SYS_ADMIN
2017-08-09 11:53:13 -07:00
Panagiotis Moustafellos
cf6e1c5dfd
seccomp: whitelist quotactl with CAP_SYS_ADMIN
The quotactl syscall is being whitelisted in default seccomp profile,
gated by CAP_SYS_ADMIN.

Signed-off-by: Panagiotis Moustafellos <pmoust@elastic.co>
2017-08-09 18:52:15 +03:00
Vincent Demeester
9ef3b53597
Move pkg/templates away
- Remove unused function and variables from the package
- Remove usage of it from `profiles/apparmor` where it wasn't required
- Move the package to `daemon/logger/templates` where it's only used

Signed-off-by: Vincent Demeester <vincent@sbr.pm>
2017-08-08 18:16:41 +02:00
Florin Patan
52d4716843 Remove unused import
This commit removes an unused import.

Signed-off-by: Florin Patan <florinpatan@gmail.com>
2017-07-29 22:21:53 +01:00
Christopher Jones
069fdc8a08
[project] change syscall to /x/sys/unix|windows
Changes most references of syscall to golang.org/x/sys/
Ones aren't changes include, Errno, Signal and SysProcAttr
as they haven't been implemented in /x/sys/.

Signed-off-by: Christopher Jones <tophj@linux.vnet.ibm.com>

[s390x] switch utsname from unsigned to signed

per 33267e036f
char in s390x in the /x/sys/unix package is now signed, so
change the buildtags

Signed-off-by: Christopher Jones <tophj@linux.vnet.ibm.com>
2017-07-11 08:00:32 -04:00
Miklos Szegedi
2db05316d0 Whitelist adjtimex get operation. Adjustment operations are gated by CAP_SYS_TIME
Signed-off-by: Miklos Szegedi <miklos.szegedi@cloudera.com>
2017-06-02 18:48:16 +00:00
Justin Cormack
dcf2632945 Revert "Block obsolete socket families in the default seccomp profile"
This reverts commit 7e3a596a63.

Unfortunately, it was pointed out in https://github.com/moby/moby/pull/29076#commitcomment-21831387
that the `socketcall` syscall takes a pointer to a struct so it is not possible to
use seccomp profiles to filter it. This means these cannot be blocked as you can
use `socketcall` to call them regardless, as we currently allow 32 bit syscalls.

Users who wish to block these should use a seccomp profile that blocks all
32 bit syscalls and then just block the non socketcall versions.

Signed-off-by: Justin Cormack <justin.cormack@docker.com>
2017-05-09 14:26:00 +01:00
Michael Crosby
005506d36c Update moby to runc and oci 1.0 runtime final rc
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2017-05-05 13:45:45 -07:00
Ian Campbell
cd456433ea seccomp: Allow personality with UNAME26 bit set.
From personality(2):

    Have uname(2) report a 2.6.40+ version number rather than a 3.x version
    number.  Added as a stopgap measure to support broken applications that
    could not handle the  kernel  version-numbering  switch  from 2.6.x to 3.x.

This allows both "UNAME26|PER_LINUX" and "UNAME26|PER_LINUX32".

Fixes: #32839

Signed-off-by: Ian Campbell <ian.campbell@docker.com>
2017-05-02 15:05:01 +01:00
Antonio Murdaca
3ab4961032
profiles: seccomp: allow clock_settime when CAP_SYS_TIME is added
Signed-off-by: Antonio Murdaca <runcom@redhat.com>
2017-03-20 11:05:23 +01:00
Justin Cormack
9067ef0e32 Seccomp Update
- Update libseccomp-golang to 0.9.0 release
- Update libseccomp to 2.3.2 release
- add preadv2 and pwritev2 syscalls to whitelist

Signed-off-by: Justin Cormack <justin.cormack@docker.com>
2017-03-07 22:19:46 +00:00
Aleksa Sarai
a3155743ad
profiles: seccomp: fix !seccomp build
Previously building with seccomp disabled would cause build failures
because of a mismatch in the type signatures of DefaultProfile().

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2017-03-02 21:13:17 +11:00
Gabriel Linder
52d8f582c3 Allow sync_file_range2 on supported architectures.
Signed-off-by: Gabriel Linder <linder.gabriel@gmail.com>
2017-02-14 21:29:33 +01:00
Justin Cormack
d6adcd6a82 Add two arm specific syscalls to seccomp profile
These are arm variants with different argument ordering because of
register alignment requirements.

fix #30516

Signed-off-by: Justin Cormack <justin.cormack@docker.com>
2017-01-29 14:59:45 +00:00
Justin Cormack
7e3a596a63 Block obsolete socket families in the default seccomp profile
Linux supports many obsolete address families, which are usually available in
common distro kernels, but they are less likely to be properly audited and
may have security issues

This blocks all socket families in the socket (and socketcall where applicable) syscall
except
- AF_UNIX - Unix domain sockets
- AF_INET - IPv4
- AF_INET6 - IPv6
- AF_NETLINK - Netlink sockets for communicating with the ekrnel
- AF_PACKET - raw sockets, which are only allowed with CAP_NET_RAW

All other socket families are blocked, including Appletalk (native, not
over IP), IPX (remember that!), VSOCK and HVSOCK, which should not generally
be used in containers, etc.

Note that users can of course provide a profile per container or in the daemon
config if they have unusual use cases that require these.

Signed-off-by: Justin Cormack <justin.cormack@docker.com>
2017-01-17 17:50:44 +00:00