Commit graph

62 commits

Author SHA1 Message Date
Bjorn Neergaard
b173b9e739
seccomp: add name_to_handle_at to allowlist
Based on the analysis on [the previous PR][1].

  [1]: https://github.com/moby/moby/pull/45766#pullrequestreview-1493908145

Signed-off-by: Bjorn Neergaard <bjorn.neergaard@docker.com>
(cherry picked from commit b335e3d305)
Signed-off-by: Bjorn Neergaard <bjorn.neergaard@docker.com>
2023-06-28 05:47:10 -06:00
Vitor Anjos
7f2729ff2c
remove name_to_handle_at(2) from filtered syscalls
Signed-off-by: Vitor Anjos <bartier@users.noreply.github.com>
(cherry picked from commit fdc9b7cceb)
Signed-off-by: Bjorn Neergaard <bjorn.neergaard@docker.com>
2023-06-27 13:20:44 -06:00
Sebastiaan van Stijn
6875e7f1be
seccomp: block socket calls to AF_VSOCK in default profile
This syncs the seccomp-profile with the latest changes in containerd's
profile, applying the same changes as 17a9324035

Some background from the associated ticket:

> We want to use vsock for guest-host communication on KubeVirt
> (https://github.com/kubevirt/kubevirt). In KubeVirt we run VMs in pods.
>
> However since anyone can just connect from any pod to any VM with the
> default seccomp settings, we cannot limit connection attempts to our
> privileged node-agent.
>
> ### Describe the solution you'd like
> We want to deny the `socket` syscall for the `AF_VSOCK` family by default.
>
> I see in [1] and [2] that AF_VSOCK was actually already blocked for some
> time, but that got reverted since some architectures support the `socketcall`
> syscall which can't be restricted properly. However we are mostly interested
> in `arm64` and `amd64` where limiting `socket` would probably be enough.
>
> ### Additional context
> I know that in theory we could use our own seccomp profiles, but we would want
> to provide security for as many users as possible which use KubeVirt, and there
> it would be very helpful if this protection could be added by being part of the
> DefaultRuntime profile to easily ensure that it is active for all pods [3].
>
> Impact on existing workloads: It is unlikely that this will disturb any existing
> workload, becuase VSOCK is almost exclusively used for host-guest commmunication.
> However if someone would still use it: Privileged pods would still be able to
> use `socket` for `AF_VSOCK`, custom seccomp policies could be applied too.
> Further it was already blocked for quite some time and the blockade got lifted
> due to reasons not related to AF_VSOCK.
>
> The PR in KubeVirt which adds VSOCK support for additional context: [4]
>
> [1]: https://github.com/moby/moby/pull/29076#commitcomment-21831387
> [2]: dcf2632945
> [3]: https://kubernetes.io/docs/tutorials/security/seccomp/#enable-the-use-of-runtimedefault-as-the-default-seccomp-profile-for-all-workloads
> [4]: https://github.com/kubevirt/kubevirt/pull/8546

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
(cherry picked from commit 57b229012a)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2022-12-01 14:09:46 +01:00
Sebastiaan van Stijn
8912c1fade
seccomp: allow "bpf", "perf_event_open", gated by CAP_BPF, CAP_PERFMON
Update the profile to make use of CAP_BPF and CAP_PERFMON capabilities. Prior to
kernel 5.8, bpf and perf_event_open required CAP_SYS_ADMIN. This change enables
finer control of the privilege setting, thus allowing us to run certain system
tracing tools with minimal privileges.

Based on the original patch from Henry Wang in the containerd repository.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
(cherry picked from commit 7b7d1132e8)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2022-08-18 18:36:49 +02:00
zhubojun
edcc51cbee
profiles: seccomp: add syscalls related to PKU in default policy
Add pkey_alloc(2), pkey_free(2) and pkey_mprotect(2) in seccomp default profile.
pkey_alloc(2), pkey_free(2) and pkey_mprotect(2) can only configure
the calling process's own memory, so they are existing "safe for everyone" syscalls.

close issue: #43481

Signed-off-by: zhubojun <bojun.zhu@foxmail.com>
(cherry picked from commit e258d66f17)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2022-07-15 09:19:57 +02:00
Bastien Pascard
420142a886 profiles: seccomp: allow clock_settime64 when CAP_SYS_TIME is added
Signed-off-by: Bastien Pascard <bpascard@hotmail.com>
2022-07-06 23:45:13 +02:00
Djordje Lukic
7de9f4f82d Allow different syscalls from kernels 5.12 -> 5.16
Kernel 5.12:

    mount_setattr: needs CAP_SYS_ADMIN

Kernel 5.14:

    quotactl_fd: needs CAP_SYS_ADMIN
    memfd_secret: always allowed

Kernel 5.15:

    process_mrelease: always allowed

Kernel 5.16:

    futex_waitv: always allowed

Signed-off-by: Djordje Lukic <djordje.lukic@docker.com>
2022-05-13 12:35:08 +02:00
Justin Cormack
f1dd6bf84e
Merge pull request #43553 from AkihiroSuda/riscv64
seccomp: support riscv64
2022-05-13 10:41:53 +01:00
Akihiro Suda
4c2f18f6cc
seccomp: support riscv64
Corresponds to containerd PR 6882

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2022-05-02 17:41:43 +09:00
Tudor Brindus
af819bf623 seccomp: add support for Landlock syscalls in default policy
This commit allows the Landlock[0] system calls in the default seccomp
policy.

Landlock was introduced in kernel 5.13, to fill the gap that inspecting
filepaths passed as arguments to filesystem system calls is not really
possible with pure `seccomp` (unless involving `ptrace`).

Allowing Landlock by default fits in with allowing `seccomp` for
containerized applications to voluntarily restrict their access rights
to files within the container.

[0]: https://www.kernel.org/doc/html/latest/userspace-api/landlock.html

Signed-off-by: Tudor Brindus <me@tbrindus.ca>
2022-01-31 08:44:04 -05:00
Sören Tempel
85eaf23bf4 seccomp: add support for "swapcontext" syscall in default policy
This system call is only available on the 32- and 64-bit PowerPC, it is
used by modern programming language implementations (such as gcc-go) to
implement coroutine features through userspace context switches.

Other container environment, such as Systemd nspawn already whitelist
this system call in their seccomp profile [1] [2]. As such, it would be
nice to also whitelist it in moby.

This issue was encountered on Alpine Linux GitLab CI system, which uses
moby, when attempting to execute gcc-go compiled software on ppc64le.

[1]: https://github.com/systemd/systemd/pull/9487
[2]: https://github.com/systemd/systemd/issues/9485

Signed-off-by: Sören Tempel <soeren+git@soeren-tempel.net>
2021-12-18 14:06:07 +01:00
Sebastiaan van Stijn
2480bebf59
Merge pull request #42649 from kinvolk/rata/seccomp-default-errno
seccomp: Use explicit DefaultErrnoRet
2021-08-03 15:13:42 +02:00
Rodrigo Campos
fb794166d9 seccomp: Use explicit DefaultErrnoRet
Since commit "seccomp: Sync fields with runtime-spec fields"
(5d244675bd) we support to specify the
DefaultErrnoRet to be used.

Before that commit it was not specified and EPERM was used by default.
This commit keeps the same behaviour but just makes it explicit that the
default is EPERM.

Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
2021-07-30 19:13:21 +02:00
Daniel P. Berrangé
9f6b562dd1 seccomp: add support for "clone3" syscall in default policy
If no seccomp policy is requested, then the built-in default policy in
dockerd applies. This has no rule for "clone3" defined, nor any default
errno defined. So when runc receives the config it attempts to determine
a default errno, using logic defined in its commit:

  7a8d7162f9

As explained in the above commit message, runc uses a heuristic to
decide which errno to return by default:

[quote]
  The solution applied here is to prepend a "stub" filter which returns
  -ENOSYS if the requested syscall has a larger syscall number than any
  syscall mentioned in the filter. The reason for this specific rule is
  that syscall numbers are (roughly) allocated sequentially and thus newer
  syscalls will (usually) have a larger syscall number -- thus causing our
  filters to produce -ENOSYS if the filter was written before the syscall
  existed.
[/quote]

Unfortunately clone3 appears to one of the edge cases that does not
result in use of ENOSYS, instead ending up with the historical EPERM
errno.

Latest glibc (2.33.9000, in Fedora 35 rawhide) will attempt to use
clone3 by default. If it sees ENOSYS then it will automatically
fallback to using clone. Any other errno is treated as a fatal
error. Thus when docker seccomp policy triggers EPERM from clone3,
no fallback occurs and programs are thus unable to spawn threads.

The clone3 syscall is much more complicated than clone, most notably its
flags are not exposed as a directly argument any more. Instead they are
hidden inside a struct. This means that seccomp filters are unable to
apply policy based on values seen in flags. Thus we can't directly
replicate the current "clone" filtering for "clone3". We can at least
ensure "clone3" returns ENOSYS errno, to trigger fallback to "clone"
at which point we can filter on flags.

Fixes: https://github.com/moby/moby/issues/42680
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
2021-07-27 10:56:07 +01:00
Sebastiaan van Stijn
c7cd1b9436
profiles/seccomp.Syscall: use pointers and omitempty
These fields are optional, and this makes the JSON representation
slightly less verbose.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2021-06-17 21:25:09 +02:00
Sebastiaan van Stijn
d92739713c
seccomp.Syscall: embed runtime-spec Syscall type
This makes the type better reflect the difference with the "runtime" profile;
our local type is used to generate a runtime-spec seccomp profile and extends
the runtime-spec type with additional fields; adding a "Name" field for backward
compatibility with older JSON representations, additional "Comment" metadata,
and conditional rules ("Includes", "Excludes") used during generation to adjust
the profile based on the container (capabilities) and host's (architecture, kernel)
configuration.

This change introduces one change in the type; the "runtime-spec" type uses a
`[]LinuxSeccompArg` for the `Args` field, whereas the local type used pointers;
`[]*LinuxSeccompArg`.

In addition, the runtime-spec Syscall type brings a new `ErrnoRet` field, allowing
the profile to specify the errno code returned for the syscall, which allows
changing the default EPERM for specific syscalls.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2021-06-17 21:25:06 +02:00
clubby789
d39b075302 Enable process_vm_readv and process_vm_writev for kernel > 4.8
These syscalls were disabled in #18971
due to them requiring CAP_PTRACE. CAP_PTRACE was blocked by default due
to a ptrace related exploit. This has been patched in the Linux kernel
(version 4.8) and thus `ptrace` has been re-enabled. However, these
associated syscalls seem to have been left behind. This commit brings
them in line with `ptrace`, and re-enables it for kernel > 4.8.

Signed-off-by: clubby789 <jamie@hill-daniel.co.uk>
2021-03-04 17:12:01 +00:00
Aleksa Sarai
54eff4354b
profiles: seccomp: update to Linux 5.11 syscall list
These syscalls (some of which have been in Linux for a while but were
missing from the profile) fall into a few buckets:

 * close_range(2), epoll_pwait2(2) are just extensions of existing "safe
   for everyone" syscalls.

 * The mountv2 API syscalls (fs*(2), move_mount(2), open_tree(2)) are
   all equivalent to aspects of mount(2) and thus go into the
   CAP_SYS_ADMIN category.

 * process_madvise(2) is similar to the other process_*(2) syscalls and
   thus goes in the CAP_SYS_PTRACE category.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2021-01-27 13:25:49 +11:00
Mark Vainomaa
f7bcb02f67
seccomp: Add pidfd_getfd syscall
Signed-off-by: Mark Vainomaa <mikroskeem@mikroskeem.eu>
2020-11-12 15:31:07 +02:00
Mark Vainomaa
5e3ffe6464
seccomp: Add pidfd_open and pidfd_send_signal
Signed-off-by: Mark Vainomaa <mikroskeem@mikroskeem.eu>
2020-11-11 15:20:34 +02:00
Sebastiaan van Stijn
0d75b63987
seccomp: replace types with runtime-spec types
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-09-18 19:33:58 +02:00
Jintao Zhang
a18139111d Add faccessat2 to default seccomp profile.
Signed-off-by: Jintao Zhang <zhangjintao9020@gmail.com>
2020-08-17 21:13:03 +08:00
Jintao Zhang
b8988c8475 Add openat2 to default seccomp profile.
follow up to https://github.com/moby/moby/pull/41344#discussion_r469919978

Signed-off-by: Jintao Zhang <zhangjintao9020@gmail.com>
2020-08-16 15:58:57 +08:00
Florian Schmaus
d0d99b04cf seccomp: allow 'rseq' syscall in default seccomp profile
Restartable Sequences (rseq) are a kernel-based mechanism for fast
update operations on per-core data in user-space. Some libraries, like
the newest version of Google's TCMalloc, depend on it [1].

This also makes dockers default seccomp profile on par with systemd's,
which enabled 'rseq' in early 2019 [2].

1: https://google.github.io/tcmalloc/design.html
2: 6fee3be0b4

Signed-off-by: Florian Schmaus <flo@geekplace.eu>
2020-06-26 16:06:26 +02:00
Justin Cormack
1aafcbb47a
Merge pull request #40995 from KentaTada/remove-unused-syscall
seccomp: remove the unused query_module(2)
2020-05-28 11:25:59 +01:00
Akihiro Suda
b2917efb1a
Merge pull request #40731 from sqreen/fix/seccomp-profile
seccomp: allow syscall membarrier
2020-05-20 00:31:32 +09:00
Kenta Tada
1192c7aee4 seccomp: remove the unused query_module(2)
query_module(2) is only in kernels before Linux 2.6.

Signed-off-by: Kenta Tada <Kenta.Tada@sony.com>
2020-05-19 10:30:54 +09:00
Stanislav Levin
5d3a9e4319
seccomp: Whitelist clock_adjtime
This only allows making the syscall. CAP_SYS_TIME is still required
for time adjustment (enforced by the kernel):

```
kernel/time/posix-timers.c:

1112 SYSCALL_DEFINE2(clock_adjtime, const clockid_t, which_clock,
1113                 struct __kernel_timex __user *, utx)
...
1121         err = do_clock_adjtime(which_clock, &ktx);

1100 int do_clock_adjtime(const clockid_t which_clock, struct __kernel_timex * ktx)
1101 {
...
1109         return kc->clock_adj(which_clock, ktx);

1299 static const struct k_clock clock_realtime = {
...
1304         .clock_adj              = posix_clock_realtime_adj,

188 static int posix_clock_realtime_adj(const clockid_t which_clock,
189                                     struct __kernel_timex *t)
190 {
191         return do_adjtimex(t);

kernel/time/timekeeping.c:

2312 int do_adjtimex(struct __kernel_timex *txc)
2313 {
...
2321         /* Validate the data before disabling interrupts */
2322         ret = timekeeping_validate_timex(txc);

2246 static int timekeeping_validate_timex(const struct __kernel_timex *txc)
2247 {
2248         if (txc->modes & ADJ_ADJTIME) {
...
2252                 if (!(txc->modes & ADJ_OFFSET_READONLY) &&
2253                     !capable(CAP_SYS_TIME))
2254                         return -EPERM;
2255         } else {
2256                 /* In order to modify anything, you gotta be super-user! */
2257                 if (txc->modes && !capable(CAP_SYS_TIME))
2258                         return -EPERM;

```

Fixes: https://github.com/moby/moby/issues/40919
Signed-off-by: Stanislav Levin <slev@altlinux.org>
2020-05-08 12:33:25 +03:00
Julio Guerra
1026f873a4
seccomp: allow syscall membarrier
Add the membarrier syscall to the default seccomp profile.
It is for example used in the implementation of dlopen() in
the musl libc of Alpine images.

Signed-off-by: Julio Guerra <julio@sqreen.com>
2020-04-07 16:24:17 +02:00
Sebastiaan van Stijn
89fabf0f24
seccomp: add 64-bit time_t syscalls
Relates to https://patchwork.kernel.org/patch/10756415/

Added to whitelist:

- `clock_getres_time64` (equivalent of `clock_getres`, which was whitelisted)
- `clock_gettime64` (equivalent of `clock_gettime`, which was whitelisted)
- `clock_nanosleep_time64` (equivalent of `clock_nanosleep`, which was whitelisted)
- `futex_time64` (equivalent of `futex`, which was whitelisted)
- `io_pgetevents_time64` (equivalent of `io_pgetevents`, which was whitelisted)
- `mq_timedreceive_time64` (equivalent of `mq_timedreceive`, which was whitelisted)
- `mq_timedsend_time64 ` (equivalent of `mq_timedsend`, which was whitelisted)
- `ppoll_time64` (equivalent of `ppoll`, which was whitelisted)
- `pselect6_time64` (equivalent of `pselect6`, which was whitelisted)
- `recvmmsg_time64` (equivalent of `recvmmsg`, which was whitelisted)
- `rt_sigtimedwait_time64` (equivalent of `rt_sigtimedwait`, which was whitelisted)
- `sched_rr_get_interval_time64` (equivalent of `sched_rr_get_interval`, which was whitelisted)
- `semtimedop_time64` (equivalent of `semtimedop`, which was whitelisted)
- `timer_gettime64` (equivalent of `timer_gettime`, which was whitelisted)
- `timer_settime64` (equivalent of `timer_settime`, which was whitelisted)
- `timerfd_gettime64` (equivalent of `timerfd_gettime`, which was whitelisted)
- `timerfd_settime64` (equivalent of `timerfd_settime`, which was whitelisted)
- `utimensat_time64` (equivalent of `utimensat`, which was whitelisted)

Not added to whitelist:

- `clock_adjtime64` (equivalent of `clock_adjtime`, which was not whitelisted)
- `clock_settime64` (equivalent of `clock_settime`, which was not whitelisted)

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-03-25 13:49:49 +01:00
Arnaud Rebillout
667c87ef4f profiles: Fix file permissions on json files
json files should not be executable I think.

Signed-off-by: Arnaud Rebillout <arnaud.rebillout@collabora.com>
2019-09-16 11:15:37 +07:00
youcai
f4d41f1dfa seccomp: whitelist io-uring related system calls
Signed-off-by: youcai <omegacoleman@gmail.com>
2019-09-07 07:35:23 +00:00
Michael Crosby
e4605cc2a5 Add sigprocmask to default seccomp profile
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2019-08-29 13:52:45 -04:00
Sebastiaan van Stijn
a1ec8551ab
Fix seccomp profile for clone syscall
All clone flags for namespace should be denied.

Based-on-patch-by: Kenta Tada <Kenta.Tada@sony.com>
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2019-06-04 15:28:12 +02:00
Avi Kivity
665741510a seccomp: whitelist io_pgetevents()
io_pgetevents() is a new Linux system call. It is similar to io_getevents()
that is already whitelisted, and adds no special abilities over that system call.

Allow that system call to enable applications that use it.

Fixes #38894.

Signed-off-by: Avi Kivity <avi@scylladb.com>
2019-03-18 20:46:16 +02:00
Tonis Tiigi
e76380b67b seccomp: review update
Signed-off-by: Tonis Tiigi <tonistiigi@gmail.com>
2019-02-05 12:02:41 -08:00
Tonis Tiigi
1124543ca8 seccomp: allow ptrace for 4.8+ kernels
4.8+ kernels have fixed the ptrace security issues
so we can allow ptrace(2) on the default seccomp
profile if we do the kernel version check.

93e35efb8d

Signed-off-by: Tonis Tiigi <tonistiigi@gmail.com>
2018-11-04 13:06:43 -08:00
Justin Cormack
ccd22ffcc8
Move the syslog syscall to be gated by CAP_SYS_ADMIN or CAP_SYSLOG
This call is what is used to implement `dmesg` to get kernel messages
about the host. This can leak substantial information about the host.
It is normally available to unprivileged users on the host, unless
the sysctl `kernel.dmesg_restrict = 1` is set, but this is not set
by standard on the majority of distributions. Blocking this to restrict
leaks about the configuration seems correct.

Fix #37897

See also https://googleprojectzero.blogspot.com/2018/09/a-cache-invalidation-bug-in-linux.html

Signed-off-by: Justin Cormack <justin.cormack@docker.com>
2018-09-27 14:27:05 -07:00
Nicolas V Castet
47dfff68e4 Whitelist syscalls linked to CAP_SYS_NICE in default seccomp profile
* Update profile to match docker documentation at
  https://docs.docker.com/engine/security/seccomp/

Signed-off-by: Nicolas V Castet <nvcastet@us.ibm.com>
2018-06-20 07:32:08 -05:00
NobodyOnSE
b2a907c8ca Whitelist statx syscall for libseccomp-2.3.3 onward
Older seccomp versions will ignore this.

Signed-off-by: NobodyOnSE <ich@sektor.selfip.com>
2018-03-06 08:42:12 +01:00
Simon Vikstrom
d7bf5e3b4d Remove double defined alarm
Signed-off-by: Simon Vikstrom <pullreq@devsn.se>
2017-08-19 09:55:03 +02:00
Panagiotis Moustafellos
cf6e1c5dfd
seccomp: whitelist quotactl with CAP_SYS_ADMIN
The quotactl syscall is being whitelisted in default seccomp profile,
gated by CAP_SYS_ADMIN.

Signed-off-by: Panagiotis Moustafellos <pmoust@elastic.co>
2017-08-09 18:52:15 +03:00
Miklos Szegedi
2db05316d0 Whitelist adjtimex get operation. Adjustment operations are gated by CAP_SYS_TIME
Signed-off-by: Miklos Szegedi <miklos.szegedi@cloudera.com>
2017-06-02 18:48:16 +00:00
Justin Cormack
dcf2632945 Revert "Block obsolete socket families in the default seccomp profile"
This reverts commit 7e3a596a63.

Unfortunately, it was pointed out in https://github.com/moby/moby/pull/29076#commitcomment-21831387
that the `socketcall` syscall takes a pointer to a struct so it is not possible to
use seccomp profiles to filter it. This means these cannot be blocked as you can
use `socketcall` to call them regardless, as we currently allow 32 bit syscalls.

Users who wish to block these should use a seccomp profile that blocks all
32 bit syscalls and then just block the non socketcall versions.

Signed-off-by: Justin Cormack <justin.cormack@docker.com>
2017-05-09 14:26:00 +01:00
Ian Campbell
cd456433ea seccomp: Allow personality with UNAME26 bit set.
From personality(2):

    Have uname(2) report a 2.6.40+ version number rather than a 3.x version
    number.  Added as a stopgap measure to support broken applications that
    could not handle the  kernel  version-numbering  switch  from 2.6.x to 3.x.

This allows both "UNAME26|PER_LINUX" and "UNAME26|PER_LINUX32".

Fixes: #32839

Signed-off-by: Ian Campbell <ian.campbell@docker.com>
2017-05-02 15:05:01 +01:00
Antonio Murdaca
3ab4961032
profiles: seccomp: allow clock_settime when CAP_SYS_TIME is added
Signed-off-by: Antonio Murdaca <runcom@redhat.com>
2017-03-20 11:05:23 +01:00
Justin Cormack
9067ef0e32 Seccomp Update
- Update libseccomp-golang to 0.9.0 release
- Update libseccomp to 2.3.2 release
- add preadv2 and pwritev2 syscalls to whitelist

Signed-off-by: Justin Cormack <justin.cormack@docker.com>
2017-03-07 22:19:46 +00:00
Gabriel Linder
52d8f582c3 Allow sync_file_range2 on supported architectures.
Signed-off-by: Gabriel Linder <linder.gabriel@gmail.com>
2017-02-14 21:29:33 +01:00
Justin Cormack
d6adcd6a82 Add two arm specific syscalls to seccomp profile
These are arm variants with different argument ordering because of
register alignment requirements.

fix #30516

Signed-off-by: Justin Cormack <justin.cormack@docker.com>
2017-01-29 14:59:45 +00:00
Justin Cormack
7e3a596a63 Block obsolete socket families in the default seccomp profile
Linux supports many obsolete address families, which are usually available in
common distro kernels, but they are less likely to be properly audited and
may have security issues

This blocks all socket families in the socket (and socketcall where applicable) syscall
except
- AF_UNIX - Unix domain sockets
- AF_INET - IPv4
- AF_INET6 - IPv6
- AF_NETLINK - Netlink sockets for communicating with the ekrnel
- AF_PACKET - raw sockets, which are only allowed with CAP_NET_RAW

All other socket families are blocked, including Appletalk (native, not
over IP), IPX (remember that!), VSOCK and HVSOCK, which should not generally
be used in containers, etc.

Note that users can of course provide a profile per container or in the daemon
config if they have unusual use cases that require these.

Signed-off-by: Justin Cormack <justin.cormack@docker.com>
2017-01-17 17:50:44 +00:00