Add this syscall to match the profile in containerd
containerd: a6e52c74fa
libseccomp: 53267af3fb
kernel: 9f6c532f59
futex: Add sys_futex_wake()
To complement sys_futex_waitv() add sys_futex_wake(). This syscall
implements what was previously known as FUTEX_WAKE_BITSET except it
uses 'unsigned long' for the bitmask and takes FUTEX2 flags.
The 'unsigned long' allows FUTEX2_SIZE_U64 on 64bit platforms.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Add this syscall to match the profile in containerd
containerd: a6e52c74fa
libseccomp: 53267af3fb
kernel: cb8c4312af
futex: Add sys_futex_wait()
To complement sys_futex_waitv()/wake(), add sys_futex_wait(). This
syscall implements what was previously known as FUTEX_WAIT_BITSET
except it uses 'unsigned long' for the value and bitmask arguments,
takes timespec and clockid_t arguments for the absolute timeout and
uses FUTEX2 flags.
The 'unsigned long' allows FUTEX2_SIZE_U64 on 64bit platforms.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Add this syscall to match the profile in containerd
containerd: a6e52c74fa
libseccomp: 53267af3fb
kernel: 0f4b5f9722
futex: Add sys_futex_requeue()
Finish off the 'simple' futex2 syscall group by adding
sys_futex_requeue(). Unlike sys_futex_{wait,wake}() its arguments are
too numerous to fit into a regular syscall. As such, use struct
futex_waitv to pass the 'source' and 'destination' futexes to the
syscall.
This syscall implements what was previously known as FUTEX_CMP_REQUEUE
and uses {val, uaddr, flags} for source and {uaddr, flags} for
destination.
This design explicitly allows requeueing between different types of
futex by having a different flags word per uaddr.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Add this syscall to match the profile in containerd
containerd: a6e52c74fa
libseccomp: 53267af3fb
kernel: c35559f94e
x86/shstk: Introduce map_shadow_stack syscall
When operating with shadow stacks enabled, the kernel will automatically
allocate shadow stacks for new threads, however in some cases userspace
will need additional shadow stacks. The main example of this is the
ucontext family of functions, which require userspace allocating and
pivoting to userspace managed stacks.
Unlike most other user memory permissions, shadow stacks need to be
provisioned with special data in order to be useful. They need to be setup
with a restore token so that userspace can pivot to them via the RSTORSSP
instruction. But, the security design of shadow stacks is that they
should not be written to except in limited circumstances. This presents a
problem for userspace, as to how userspace can provision this special
data, without allowing for the shadow stack to be generally writable.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Add this syscall to match the profile in containerd
containerd: a6e52c74fa
libseccomp: 53267af3fb
kernel: 09da082b07
fs: Add fchmodat2()
On the userspace side fchmodat(3) is implemented as a wrapper
function which implements the POSIX-specified interface. This
interface differs from the underlying kernel system call, which does not
have a flags argument. Most implementations require procfs [1][2].
There doesn't appear to be a good userspace workaround for this issue
but the implementation in the kernel is pretty straight-forward.
The new fchmodat2() syscall allows to pass the AT_SYMLINK_NOFOLLOW flag,
unlike existing fchmodat.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Add this syscall to match the profile in containerd
containerd: a6e52c74fa
libseccomp: 53267af3fb
kernel: cf264e1329
NAME
cachestat - query the page cache statistics of a file.
SYNOPSIS
#include <sys/mman.h>
struct cachestat_range {
__u64 off;
__u64 len;
};
struct cachestat {
__u64 nr_cache;
__u64 nr_dirty;
__u64 nr_writeback;
__u64 nr_evicted;
__u64 nr_recently_evicted;
};
int cachestat(unsigned int fd, struct cachestat_range *cstat_range,
struct cachestat *cstat, unsigned int flags);
DESCRIPTION
cachestat() queries the number of cached pages, number of dirty
pages, number of pages marked for writeback, number of evicted
pages, number of recently evicted pages, in the bytes range given by
`off` and `len`.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This syscall is gated by CAP_SYS_NICE, matching the profile in containerd.
containerd: a6e52c74fa
libseccomp: d83cb7ac25
kernel: c6018b4b25
mm/mempolicy: add set_mempolicy_home_node syscall
This syscall can be used to set a home node for the MPOL_BIND and
MPOL_PREFERRED_MANY memory policy. Users should use this syscall after
setting up a memory policy for the specified range as shown below.
mbind(p, nr_pages * page_size, MPOL_BIND, new_nodes->maskp,
new_nodes->size + 1, 0);
sys_set_mempolicy_home_node((unsigned long)p, nr_pages * page_size,
home_node, 0);
The syscall allows specifying a home node/preferred node from which
kernel will fulfill memory allocation requests first.
...
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This package provided utilities to obtain the apparmor_parser version, as well
as loading a profile.
Commit e3e715666f (included in v24.0.0 through
bfffb0974e) deprecated GetVersion, as it was no
longer used, which made LoadProfile the only utility remaining in this package.
LoadProfile appears to have no external consumers, and the only use in our code
is "profiles/apparmor".
This patch moves the remaining code (LoadProfile) to profiles/apparmor as a
non-exported function, and deletes the package.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This syncs the seccomp profile with changes made to containerd's default
profile in [1].
The original containerd issue and PR mention:
> Security experts generally believe io_uring to be unsafe. In fact
> Google ChromeOS and Android have turned it off, plus all Google
> production servers turn it off. Based on the blog published by Google
> below it seems like a bunch of vulnerabilities related to io_uring can
> be exploited to breakout of the container.
>
> [2]
>
> Other security reaserchers also hold this opinion: see [3] for a
> blackhat presentation on io_uring exploits.
For the record, these syscalls were added to the allowlist in [4].
[1]: a48ddf4a20
[2]: https://security.googleblog.com/2023/06/learnings-from-kctf-vrps-42-linux.html
[3]: https://i.blackhat.com/BH-US-23/Presentations/US-23-Lin-bad_io_uring.pdf
[4]: https://github.com/moby/moby/pull/39415
Signed-off-by: Albin Kerouanton <albinker@gmail.com>
While this is not strictly necessary as the default OCI config masks this
path, it is possible that the user disabled path masking, passed their
own list, or is using a forked (or future) daemon version that has a
modified default config/allows changing the default config.
Add some defense-in-depth by also masking out this problematic hardware
device with the AppArmor LSM.
Signed-off-by: Bjorn Neergaard <bjorn.neergaard@docker.com>
commit ab35df454d removed most of the pre-go1.17
build-tags, but for some reason, "go fix" doesn't remove these, so removing
the remaining ones manually
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
commit 7008a51449 removed version-conditional
rules from the template, so we no longer need the apparmor_parser Version.
This patch removes the call to `aaparser.GetVersion()`
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
These conditions were added in 8cf89245f5
to account for old versions of debian/ubuntu (apparmor_parser < 2.8.96)
that lacked some options;
> This allows us to use the apparmor profile we have in contrib/apparmor/
> and solves the problems where certain functions are not apparent on older
> versions of apparmor_parser on debian/ubuntu.
Those patches were from 2015/2016, and all currently supported distro
versions should now have more current versions than that. Looking at the
oldest supported versions;
Ubuntu 18.04 "Bionic":
apparmor_parser --version
AppArmor parser version 2.12
Copyright (C) 1999-2008 Novell Inc.
Copyright 2009-2012 Canonical Ltd.
Debian 10 "Buster"
apparmor_parser --version
AppArmor parser version 2.13.2
Copyright (C) 1999-2008 Novell Inc.
Copyright 2009-2018 Canonical Ltd.
This patch removes the conditionals.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This syncs the seccomp-profile with the latest changes in containerd's
profile, applying the same changes as 17a9324035
Some background from the associated ticket:
> We want to use vsock for guest-host communication on KubeVirt
> (https://github.com/kubevirt/kubevirt). In KubeVirt we run VMs in pods.
>
> However since anyone can just connect from any pod to any VM with the
> default seccomp settings, we cannot limit connection attempts to our
> privileged node-agent.
>
> ### Describe the solution you'd like
> We want to deny the `socket` syscall for the `AF_VSOCK` family by default.
>
> I see in [1] and [2] that AF_VSOCK was actually already blocked for some
> time, but that got reverted since some architectures support the `socketcall`
> syscall which can't be restricted properly. However we are mostly interested
> in `arm64` and `amd64` where limiting `socket` would probably be enough.
>
> ### Additional context
> I know that in theory we could use our own seccomp profiles, but we would want
> to provide security for as many users as possible which use KubeVirt, and there
> it would be very helpful if this protection could be added by being part of the
> DefaultRuntime profile to easily ensure that it is active for all pods [3].
>
> Impact on existing workloads: It is unlikely that this will disturb any existing
> workload, becuase VSOCK is almost exclusively used for host-guest commmunication.
> However if someone would still use it: Privileged pods would still be able to
> use `socket` for `AF_VSOCK`, custom seccomp policies could be applied too.
> Further it was already blocked for quite some time and the blockade got lifted
> due to reasons not related to AF_VSOCK.
>
> The PR in KubeVirt which adds VSOCK support for additional context: [4]
>
> [1]: https://github.com/moby/moby/pull/29076#commitcomment-21831387
> [2]: dcf2632945
> [3]: https://kubernetes.io/docs/tutorials/security/seccomp/#enable-the-use-of-runtimedefault-as-the-default-seccomp-profile-for-all-workloads
> [4]: https://github.com/kubevirt/kubevirt/pull/8546
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Update the profile to make use of CAP_BPF and CAP_PERFMON capabilities. Prior to
kernel 5.8, bpf and perf_event_open required CAP_SYS_ADMIN. This change enables
finer control of the privilege setting, thus allowing us to run certain system
tracing tools with minimal privileges.
Based on the original patch from Henry Wang in the containerd repository.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Add pkey_alloc(2), pkey_free(2) and pkey_mprotect(2) in seccomp default profile.
pkey_alloc(2), pkey_free(2) and pkey_mprotect(2) can only configure
the calling process's own memory, so they are existing "safe for everyone" syscalls.
close issue: #43481
Signed-off-by: zhubojun <bojun.zhu@foxmail.com>
The current docker-default AppArmor profile intends to block write
access to everything in `/proc`, except for `/proc/<pid>` and
`/proc/sys/kernel/shm*`.
Currently the rules block access to everything in `/proc/sys`, and do
not successfully allow access to `/proc/sys/kernel/shm*`. Specifically,
a path like /proc/sys/kernel/shmmax matches this part of the pattern:
deny @{PROC}/{[^1-9][^0-9][^0-9][^0-9]* }/** w,
/proc / s y s / kernel /shmmax
This patch updates the rule so that it works as intended.
Closes#39791
Signed-off-by: Phil Sphicas <phil.sphicas@att.com>
This also fixes the GetOperatingSystem function in
pkg/parsers/operatingsystem which mistakenly truncated utsname.Machine
to the index of \0 in utsname.Sysname.
Fixes: 7aeb3efcb4
Cc: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Similar to the (now removed) `apparmor` build tag, this build-time toggle existed for users who needed to build without the `libseccomp` library. That's no longer necessary, and given the importance of seccomp to the overall default security profile of Docker containers, it makes sense that any binary built for Linux should support (and use by default) seccomp if the underlying host does.
Signed-off-by: Tianon Gravi <admwiggin@gmail.com>
This commit allows the Landlock[0] system calls in the default seccomp
policy.
Landlock was introduced in kernel 5.13, to fill the gap that inspecting
filepaths passed as arguments to filesystem system calls is not really
possible with pure `seccomp` (unless involving `ptrace`).
Allowing Landlock by default fits in with allowing `seccomp` for
containerized applications to voluntarily restrict their access rights
to files within the container.
[0]: https://www.kernel.org/doc/html/latest/userspace-api/landlock.html
Signed-off-by: Tudor Brindus <me@tbrindus.ca>
This system call is only available on the 32- and 64-bit PowerPC, it is
used by modern programming language implementations (such as gcc-go) to
implement coroutine features through userspace context switches.
Other container environment, such as Systemd nspawn already whitelist
this system call in their seccomp profile [1] [2]. As such, it would be
nice to also whitelist it in moby.
This issue was encountered on Alpine Linux GitLab CI system, which uses
moby, when attempting to execute gcc-go compiled software on ppc64le.
[1]: https://github.com/systemd/systemd/pull/9487
[2]: https://github.com/systemd/systemd/issues/9485
Signed-off-by: Sören Tempel <soeren+git@soeren-tempel.net>
The io/ioutil package has been deprecated in Go 1.16. This commit
replaces the existing io/ioutil functions with their new definitions in
io and os packages.
Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
Since commit "seccomp: Sync fields with runtime-spec fields"
(5d244675bd) we support to specify the
DefaultErrnoRet to be used.
Before that commit it was not specified and EPERM was used by default.
This commit keeps the same behaviour but just makes it explicit that the
default is EPERM.
Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
If no seccomp policy is requested, then the built-in default policy in
dockerd applies. This has no rule for "clone3" defined, nor any default
errno defined. So when runc receives the config it attempts to determine
a default errno, using logic defined in its commit:
7a8d7162f9
As explained in the above commit message, runc uses a heuristic to
decide which errno to return by default:
[quote]
The solution applied here is to prepend a "stub" filter which returns
-ENOSYS if the requested syscall has a larger syscall number than any
syscall mentioned in the filter. The reason for this specific rule is
that syscall numbers are (roughly) allocated sequentially and thus newer
syscalls will (usually) have a larger syscall number -- thus causing our
filters to produce -ENOSYS if the filter was written before the syscall
existed.
[/quote]
Unfortunately clone3 appears to one of the edge cases that does not
result in use of ENOSYS, instead ending up with the historical EPERM
errno.
Latest glibc (2.33.9000, in Fedora 35 rawhide) will attempt to use
clone3 by default. If it sees ENOSYS then it will automatically
fallback to using clone. Any other errno is treated as a fatal
error. Thus when docker seccomp policy triggers EPERM from clone3,
no fallback occurs and programs are thus unable to spawn threads.
The clone3 syscall is much more complicated than clone, most notably its
flags are not exposed as a directly argument any more. Instead they are
hidden inside a struct. This means that seccomp filters are unable to
apply policy based on values seen in flags. Thus we can't directly
replicate the current "clone" filtering for "clone3". We can at least
ensure "clone3" returns ENOSYS errno, to trigger fallback to "clone"
at which point we can filter on flags.
Fixes: https://github.com/moby/moby/issues/42680
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
This patch, similar to d92739713c, embeds the
`LinuxSeccomp` type of the runtime-spec, so that we can support all options
provided by the spec, and decorates it with our own fields.
With this, profiles can make use of the recently added "Flags" field, to
specify flags that must be passed to seccomp(2) when installing the filter.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Make the error message slightly more informative, and remove the redundant
`len(config.ArchMap) != 0` check, as iterating over an empty, or 'nil' slice
is a no-op already. This allows to use a slightly more idiomatic "if ok := xx; ok"
condition.
Also move validation to the start of the loop (early return), and explicitly create
a new slice for "names" if the legacy "Name" field is used.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Add test to verify profile validation, and to verify that the legacy
format actually loads the profile as expected (instead of only verifying
it doesn't produce an error).
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
The runtime spec we are using has support for these 3 fields[1], but
moby doesn't have them in its seccomp struct. This patch just adds and
copies them when they are in the profile.
DefaultErrnoRet is implemented in the runc version moby is using (it is
implemented since runc-rc95[2]) but if we create a container without
this moby patch, we don't see an error nor the expected behavior. This
is not clear for the user (the profile they specify is valid, the syntax
is ok, but the wrong behavior is seen).
This is because the DefaultErrnoRet field is not copied to the config
passed ultimately to runc (i.e. is like the field was not specified).
With this patch, we see the expected behavior.
The other two fileds are in the runtime-spec but not yet in runc (a PR
is open and targets 1.1.0 milestone). However, I took the liberty to
copy them now too for two reasons:
1. If we don't add them now and end up using a runc version that
supports them, then the error that the user will see is not clear at
all:
docker: Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: listenerPath is not set: unknown.
And it is not obvious to debug for the user, as the field _is_ set in
the profile they specify (just not copied by moby to the profile moby
specifies ultimately to runc).
2. When using a runc without seccomp notify support (like today), the
error we see is the same with and without this moby patch (when using a
seccomp profile with the new fields):
docker: Error response from daemon: OCI runtime create failed: string SCMP_ACT_NOTIFY is not a valid action for seccomp: unknown.
Then, it seems like a clear win to add them now: we don't have to do it
later (that implies not clear errors to the user if we forget, like we
did with DefaultErrnoRet) and the user sees the exact same error when
using a runc version that doesn't support these fields.
[1]: Note we are vendoring version 1c3f411f041711bbeecf35ff7e93461ea6789220 and this version has these 3 fields 1c3f411f04/config-linux.md (seccomp)
[2]: https://github.com/opencontainers/runc/pull/2954/
[3]: https://github.com/opencontainers/runc/pull/2682
Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
This makes the type better reflect the difference with the "runtime" profile;
our local type is used to generate a runtime-spec seccomp profile and extends
the runtime-spec type with additional fields; adding a "Name" field for backward
compatibility with older JSON representations, additional "Comment" metadata,
and conditional rules ("Includes", "Excludes") used during generation to adjust
the profile based on the container (capabilities) and host's (architecture, kernel)
configuration.
This change introduces one change in the type; the "runtime-spec" type uses a
`[]LinuxSeccompArg` for the `Args` field, whereas the local type used pointers;
`[]*LinuxSeccompArg`.
In addition, the runtime-spec Syscall type brings a new `ErrnoRet` field, allowing
the profile to specify the errno code returned for the syscall, which allows
changing the default EPERM for specific syscalls.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
These syscalls were disabled in #18971
due to them requiring CAP_PTRACE. CAP_PTRACE was blocked by default due
to a ptrace related exploit. This has been patched in the Linux kernel
(version 4.8) and thus `ptrace` has been re-enabled. However, these
associated syscalls seem to have been left behind. This commit brings
them in line with `ptrace`, and re-enables it for kernel > 4.8.
Signed-off-by: clubby789 <jamie@hill-daniel.co.uk>
These syscalls (some of which have been in Linux for a while but were
missing from the profile) fall into a few buckets:
* close_range(2), epoll_pwait2(2) are just extensions of existing "safe
for everyone" syscalls.
* The mountv2 API syscalls (fs*(2), move_mount(2), open_tree(2)) are
all equivalent to aspects of mount(2) and thus go into the
CAP_SYS_ADMIN category.
* process_madvise(2) is similar to the other process_*(2) syscalls and
thus goes in the CAP_SYS_PTRACE category.
Signed-off-by: Aleksa Sarai <asarai@suse.de>