This reverts commit 57b229012a.
This change, while favorable from a security standpoint, caused a
regression for users of the 20.10 branch of Moby. As such, we are
reverting it to ensure stability and compatibility for the affected
users.
However, users of AF_VSOCK in containers should recognize that this
(special) address family is not currently namespaced in any version of
the Linux kernel, and may result in unexpected behavior, like VMs
communicating directly with host hypervisors.
Future branches, including the 23.0 branch, will continue to filter
AF_VSOCK. Users who need to allow containers to communicate over the
unnamespaced AF_VSOCK will need to turn off seccomp confinement or set a
custom seccomp profile.
It is our hope that future mechanisms will make this more
ergonomic/maintainable for end users, and that future kernels will
support namespacing of AF_VSOCK.
Signed-off-by: Bjorn Neergaard <bneergaard@mirantis.com>
This syncs the seccomp-profile with the latest changes in containerd's
profile, applying the same changes as 17a9324035
Some background from the associated ticket:
> We want to use vsock for guest-host communication on KubeVirt
> (https://github.com/kubevirt/kubevirt). In KubeVirt we run VMs in pods.
>
> However since anyone can just connect from any pod to any VM with the
> default seccomp settings, we cannot limit connection attempts to our
> privileged node-agent.
>
> ### Describe the solution you'd like
> We want to deny the `socket` syscall for the `AF_VSOCK` family by default.
>
> I see in [1] and [2] that AF_VSOCK was actually already blocked for some
> time, but that got reverted since some architectures support the `socketcall`
> syscall which can't be restricted properly. However we are mostly interested
> in `arm64` and `amd64` where limiting `socket` would probably be enough.
>
> ### Additional context
> I know that in theory we could use our own seccomp profiles, but we would want
> to provide security for as many users as possible which use KubeVirt, and there
> it would be very helpful if this protection could be added by being part of the
> DefaultRuntime profile to easily ensure that it is active for all pods [3].
>
> Impact on existing workloads: It is unlikely that this will disturb any existing
> workload, becuase VSOCK is almost exclusively used for host-guest commmunication.
> However if someone would still use it: Privileged pods would still be able to
> use `socket` for `AF_VSOCK`, custom seccomp policies could be applied too.
> Further it was already blocked for quite some time and the blockade got lifted
> due to reasons not related to AF_VSOCK.
>
> The PR in KubeVirt which adds VSOCK support for additional context: [4]
>
> [1]: https://github.com/moby/moby/pull/29076#commitcomment-21831387
> [2]: dcf2632945
> [3]: https://kubernetes.io/docs/tutorials/security/seccomp/#enable-the-use-of-runtimedefault-as-the-default-seccomp-profile-for-all-workloads
> [4]: https://github.com/kubevirt/kubevirt/pull/8546
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
(cherry picked from commit 57b229012a)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This commit allows the Landlock[0] system calls in the default seccomp
policy.
Landlock was introduced in kernel 5.13, to fill the gap that inspecting
filepaths passed as arguments to filesystem system calls is not really
possible with pure `seccomp` (unless involving `ptrace`).
Allowing Landlock by default fits in with allowing `seccomp` for
containerized applications to voluntarily restrict their access rights
to files within the container.
[0]: https://www.kernel.org/doc/html/latest/userspace-api/landlock.html
Signed-off-by: Tudor Brindus <me@tbrindus.ca>
(cherry picked from commit af819bf623)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This is a backport of 9f6b562dd1, adapted to avoid the refactoring that happened in d92739713c.
Original commit message is as follows:
> If no seccomp policy is requested, then the built-in default policy in
> dockerd applies. This has no rule for "clone3" defined, nor any default
> errno defined. So when runc receives the config it attempts to determine
> a default errno, using logic defined in its commit:
>
> opencontainers/runc@7a8d716
>
> As explained in the above commit message, runc uses a heuristic to
> decide which errno to return by default:
>
> [quote]
> The solution applied here is to prepend a "stub" filter which returns
> -ENOSYS if the requested syscall has a larger syscall number than any
> syscall mentioned in the filter. The reason for this specific rule is
> that syscall numbers are (roughly) allocated sequentially and thus newer
> syscalls will (usually) have a larger syscall number -- thus causing our
> filters to produce -ENOSYS if the filter was written before the syscall
> existed.
> [/quote]
>
> Unfortunately clone3 appears to one of the edge cases that does not
> result in use of ENOSYS, instead ending up with the historical EPERM
> errno.
>
> Latest glibc (2.33.9000, in Fedora 35 rawhide) will attempt to use
> clone3 by default. If it sees ENOSYS then it will automatically
> fallback to using clone. Any other errno is treated as a fatal
> error. Thus when docker seccomp policy triggers EPERM from clone3,
> no fallback occurs and programs are thus unable to spawn threads.
>
> The clone3 syscall is much more complicated than clone, most notably its
> flags are not exposed as a directly argument any more. Instead they are
> hidden inside a struct. This means that seccomp filters are unable to
> apply policy based on values seen in flags. Thus we can't directly
> replicate the current "clone" filtering for "clone3". We can at least
> ensure "clone3" returns ENOSYS errno, to trigger fallback to "clone"
> at which point we can filter on flags.
Signed-off-by: Tianon Gravi <admwiggin@gmail.com>
Co-authored-by: Daniel P. Berrangé <berrange@redhat.com>
These syscalls (some of which have been in Linux for a while but were
missing from the profile) fall into a few buckets:
* close_range(2), epoll_pwait2(2) are just extensions of existing "safe
for everyone" syscalls.
* The mountv2 API syscalls (fs*(2), move_mount(2), open_tree(2)) are
all equivalent to aspects of mount(2) and thus go into the
CAP_SYS_ADMIN category.
* process_madvise(2) is similar to the other process_*(2) syscalls and
thus goes in the CAP_SYS_PTRACE category.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
(cherry picked from commit 54eff4354b)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Restartable Sequences (rseq) are a kernel-based mechanism for fast
update operations on per-core data in user-space. Some libraries, like
the newest version of Google's TCMalloc, depend on it [1].
This also makes dockers default seccomp profile on par with systemd's,
which enabled 'rseq' in early 2019 [2].
1: https://google.github.io/tcmalloc/design.html
2: 6fee3be0b4
Signed-off-by: Florian Schmaus <flo@geekplace.eu>
Add the membarrier syscall to the default seccomp profile.
It is for example used in the implementation of dlopen() in
the musl libc of Alpine images.
Signed-off-by: Julio Guerra <julio@sqreen.com>
Relates to https://patchwork.kernel.org/patch/10756415/
Added to whitelist:
- `clock_getres_time64` (equivalent of `clock_getres`, which was whitelisted)
- `clock_gettime64` (equivalent of `clock_gettime`, which was whitelisted)
- `clock_nanosleep_time64` (equivalent of `clock_nanosleep`, which was whitelisted)
- `futex_time64` (equivalent of `futex`, which was whitelisted)
- `io_pgetevents_time64` (equivalent of `io_pgetevents`, which was whitelisted)
- `mq_timedreceive_time64` (equivalent of `mq_timedreceive`, which was whitelisted)
- `mq_timedsend_time64 ` (equivalent of `mq_timedsend`, which was whitelisted)
- `ppoll_time64` (equivalent of `ppoll`, which was whitelisted)
- `pselect6_time64` (equivalent of `pselect6`, which was whitelisted)
- `recvmmsg_time64` (equivalent of `recvmmsg`, which was whitelisted)
- `rt_sigtimedwait_time64` (equivalent of `rt_sigtimedwait`, which was whitelisted)
- `sched_rr_get_interval_time64` (equivalent of `sched_rr_get_interval`, which was whitelisted)
- `semtimedop_time64` (equivalent of `semtimedop`, which was whitelisted)
- `timer_gettime64` (equivalent of `timer_gettime`, which was whitelisted)
- `timer_settime64` (equivalent of `timer_settime`, which was whitelisted)
- `timerfd_gettime64` (equivalent of `timerfd_gettime`, which was whitelisted)
- `timerfd_settime64` (equivalent of `timerfd_settime`, which was whitelisted)
- `utimensat_time64` (equivalent of `utimensat`, which was whitelisted)
Not added to whitelist:
- `clock_adjtime64` (equivalent of `clock_adjtime`, which was not whitelisted)
- `clock_settime64` (equivalent of `clock_settime`, which was not whitelisted)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
All clone flags for namespace should be denied.
Based-on-patch-by: Kenta Tada <Kenta.Tada@sony.com>
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
io_pgetevents() is a new Linux system call. It is similar to io_getevents()
that is already whitelisted, and adds no special abilities over that system call.
Allow that system call to enable applications that use it.
Fixes#38894.
Signed-off-by: Avi Kivity <avi@scylladb.com>
4.8+ kernels have fixed the ptrace security issues
so we can allow ptrace(2) on the default seccomp
profile if we do the kernel version check.
93e35efb8d
Signed-off-by: Tonis Tiigi <tonistiigi@gmail.com>
This call is what is used to implement `dmesg` to get kernel messages
about the host. This can leak substantial information about the host.
It is normally available to unprivileged users on the host, unless
the sysctl `kernel.dmesg_restrict = 1` is set, but this is not set
by standard on the majority of distributions. Blocking this to restrict
leaks about the configuration seems correct.
Fix#37897
See also https://googleprojectzero.blogspot.com/2018/09/a-cache-invalidation-bug-in-linux.html
Signed-off-by: Justin Cormack <justin.cormack@docker.com>
The quotactl syscall is being whitelisted in default seccomp profile,
gated by CAP_SYS_ADMIN.
Signed-off-by: Panagiotis Moustafellos <pmoust@elastic.co>
This reverts commit 7e3a596a63.
Unfortunately, it was pointed out in https://github.com/moby/moby/pull/29076#commitcomment-21831387
that the `socketcall` syscall takes a pointer to a struct so it is not possible to
use seccomp profiles to filter it. This means these cannot be blocked as you can
use `socketcall` to call them regardless, as we currently allow 32 bit syscalls.
Users who wish to block these should use a seccomp profile that blocks all
32 bit syscalls and then just block the non socketcall versions.
Signed-off-by: Justin Cormack <justin.cormack@docker.com>
From personality(2):
Have uname(2) report a 2.6.40+ version number rather than a 3.x version
number. Added as a stopgap measure to support broken applications that
could not handle the kernel version-numbering switch from 2.6.x to 3.x.
This allows both "UNAME26|PER_LINUX" and "UNAME26|PER_LINUX32".
Fixes: #32839
Signed-off-by: Ian Campbell <ian.campbell@docker.com>
These are arm variants with different argument ordering because of
register alignment requirements.
fix#30516
Signed-off-by: Justin Cormack <justin.cormack@docker.com>
Linux supports many obsolete address families, which are usually available in
common distro kernels, but they are less likely to be properly audited and
may have security issues
This blocks all socket families in the socket (and socketcall where applicable) syscall
except
- AF_UNIX - Unix domain sockets
- AF_INET - IPv4
- AF_INET6 - IPv6
- AF_NETLINK - Netlink sockets for communicating with the ekrnel
- AF_PACKET - raw sockets, which are only allowed with CAP_NET_RAW
All other socket families are blocked, including Appletalk (native, not
over IP), IPX (remember that!), VSOCK and HVSOCK, which should not generally
be used in containers, etc.
Note that users can of course provide a profile per container or in the daemon
config if they have unusual use cases that require these.
Signed-off-by: Justin Cormack <justin.cormack@docker.com>
Do not gate with CAP_IPC_LOCK as unprivileged use is now
allowed in Linux. This returns it to how it was in 1.11.
Fixes#23587
Signed-off-by: Justin Cormack <justin.cormack@docker.com>
In #22554 I aligned seccomp and capabilities, however the case of
the chown calls and CAP_CHOWN was less clearcut, as these are
simple calls that the capabilities will block if they are not
allowed. They are needed when no new privileges is not set in
order to allow docker to call chown before the container is
started, so there was a workaround but this did not include
all the chown syscalls, and Arm was failing on some seccomp
tests because it was using a different syscall from just the
fchown that was allowed in this case. It is simpler to just
allow all the chown calls in the default seccomp profile and
let the capabilities subsystem block them.
Signed-off-by: Justin Cormack <justin.cormack@docker.com>
In order to do this, allow the socketcall syscall in the default
seccomp profile. This is a multiplexing syscall for the socket
operations, which is becoming obsolete gradually, but it is used
in some architectures. libseccomp has special handling for it for
x86 where it is common, so we did not need it in the profile,
but does not have any handling for ppc64le. It turns out that the
Debian images we use for tests do use the socketcall, while the
newer images such as Ubuntu 16.04 do not. Enabling this does no
harm as we allow all the socket operations anyway, and we allow
the similar ipc call for similar reasons already.
Signed-off-by: Justin Cormack <justin.cormack@docker.com>
Currently the default seccomp profile is fixed. This changes it
so that it varies depending on the Linux capabilities selected with
the --cap-add and --cap-drop options. Without this, if a user adds
privileges, eg to allow ptrace with --cap-add sys_ptrace then still
cannot actually use ptrace as it is still blocked by seccomp, so
they will probably disable seccomp or use --privileged. With this
change the syscalls that are needed for the capability are also
allowed by the seccomp profile based on the selected capabilities.
While this patch makes it easier to do things with for example
cap_sys_admin enabled, as it will now allow creating new namespaces
and use of mount, it still allows less than --cap-add cap_sys_admin
--security-opt seccomp:unconfined would have previously. It is not
recommended that users run containers with cap_sys_admin as this does
give full access to the host machine.
It also cleans up some architecture specific system calls to be
only selected when needed.
Signed-off-by: Justin Cormack <justin.cormack@docker.com>
These syscalls are already blocked by the default capabilities:
mlock mlock2 mlockall require CAP_IPC_LOCK
vhangup requires CAP_SYS_TTY_CONFIG
There is therefore no reason to allow them in the default profile
as they cannot be used anyway.
Signed-off-by: Justin Cormack <justin.cormack@docker.com>
This adds the following new syscalls that are supported in libseccomp 2.3.0,
including calls added up to kernel 4.5-rc4:
mlock2 - same as mlock but with a flag
copy_file_range - copy file contents, like splice but with reflink support.
The following are not added, and mentioned in docs:
userfaultfd - userspace page fault handling, mainly designed for process migration
The following are not added, only apply to less common architectures:
switch_endian
membarrier
breakpoint
set_tls
I plan to review the other architectures, some of which can now have seccomp
enabled in the build as they are now supported.
Signed-off-by: Justin Cormack <justin.cormack@docker.com>
Fixes#20818
This syscall was blocked as there was some concern that it could be
used to bypass filtering of other syscall arguments. However none of the
potential syscalls where this could be an issue (poll, nanosleep,
clock_nanosleep, futex) are blocked in the default profile anyway.
Signed-off-by: Justin Cormack <justin.cormack@docker.com>
On 32 bit x86 this is a multiplexing syscall for the system V
ipc syscalls such as shmget, and so needs to be allowed for
shared memory access for 32 bit binaries.
Fixes#20733
Signed-off-by: Justin Cormack <justin.cormack@docker.com>
We generally want to filter the personality(2) syscall, as it
allows disabling ASLR, and turning on some poorly supported
emulations that have been the target of CVEs. However the use
cases for reading the current value, setting the default
PER_LINUX personality, and setting PER_LINUX32 for 32 bit
emulation are fine.
See issue #20634
Signed-off-by: Justin Cormack <justin.cormack@docker.com>