Movified from 686be57d0a, and re-ran
gofmt again to address for files not present in 20.10 and vice-versa.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
(cherry picked from commit 686be57d0a)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This is a backport of 9f6b562dd1, adapted to avoid the refactoring that happened in d92739713c.
Original commit message is as follows:
> If no seccomp policy is requested, then the built-in default policy in
> dockerd applies. This has no rule for "clone3" defined, nor any default
> errno defined. So when runc receives the config it attempts to determine
> a default errno, using logic defined in its commit:
>
> opencontainers/runc@7a8d716
>
> As explained in the above commit message, runc uses a heuristic to
> decide which errno to return by default:
>
> [quote]
> The solution applied here is to prepend a "stub" filter which returns
> -ENOSYS if the requested syscall has a larger syscall number than any
> syscall mentioned in the filter. The reason for this specific rule is
> that syscall numbers are (roughly) allocated sequentially and thus newer
> syscalls will (usually) have a larger syscall number -- thus causing our
> filters to produce -ENOSYS if the filter was written before the syscall
> existed.
> [/quote]
>
> Unfortunately clone3 appears to one of the edge cases that does not
> result in use of ENOSYS, instead ending up with the historical EPERM
> errno.
>
> Latest glibc (2.33.9000, in Fedora 35 rawhide) will attempt to use
> clone3 by default. If it sees ENOSYS then it will automatically
> fallback to using clone. Any other errno is treated as a fatal
> error. Thus when docker seccomp policy triggers EPERM from clone3,
> no fallback occurs and programs are thus unable to spawn threads.
>
> The clone3 syscall is much more complicated than clone, most notably its
> flags are not exposed as a directly argument any more. Instead they are
> hidden inside a struct. This means that seccomp filters are unable to
> apply policy based on values seen in flags. Thus we can't directly
> replicate the current "clone" filtering for "clone3". We can at least
> ensure "clone3" returns ENOSYS errno, to trigger fallback to "clone"
> at which point we can filter on flags.
Signed-off-by: Tianon Gravi <admwiggin@gmail.com>
Co-authored-by: Daniel P. Berrangé <berrange@redhat.com>
These syscalls (some of which have been in Linux for a while but were
missing from the profile) fall into a few buckets:
* close_range(2), epoll_pwait2(2) are just extensions of existing "safe
for everyone" syscalls.
* The mountv2 API syscalls (fs*(2), move_mount(2), open_tree(2)) are
all equivalent to aspects of mount(2) and thus go into the
CAP_SYS_ADMIN category.
* process_madvise(2) is similar to the other process_*(2) syscalls and
thus goes in the CAP_SYS_PTRACE category.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
(cherry picked from commit 54eff4354b)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This removes the dependency on the `pkg/parsers/kernel` package, because secomp
only needs to consider Linux (and no parsing is needed for Windows or Darwin kernel
versions).
This patch implements the minimum requirements for this implementation:
- only `kernel` and `major` versions are considered
- `minor` version, `flavor`, and `-rcXX` suffixes are ignored
So, for example:
- `3.4.54.longterm-1` => `kernel: 3`, `major: 4`
- `3.8.0-19-generic` => `kernel: 3`, `major: 8`
- `3.10.0-862.2.3.el7.x86_64` => `kernel: 3`, `major: 10`
Some systems also omit the `minor` and/or have odd-formatted versions. In context
of generating seccomp profiles, both versions below are considered equal;
- `3.12.25-gentoo` => `kernel: 3`, `major: 12`
- `3.12-1-amd64` => `kernel: 3`, `major: 12`
Note that `-rcX` suffixes are also not considered, and thus (e.g.) kernel `5.9-rc1`,
`5.9-rc6` and `5.9` are all considered equal.
The motivation for ignoring "minor" versions and "flavors" is that;
- The upstream kernel only does "kernel.major" releases
- While release-candidates exists for kernel (e.g. 5.9-rc5), we don't expect users
to write profiles that target a specific release-candidate, and therefore consider
(e.g.) kernel `5.9-rc1`, `5.9-rc6` and `5.9` to be equal.
- Generally, a seccomp-profile should either be portable, or written for a specific
infrastructure (in which case the writer of the profile would know if the kernel-flavors
used does/does not support certain things.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Commit 5ff21add06 changed the (JSON) format that's
used for seccomp profiles, but keeping the code backward compatible to allow both
the old or new format.
This patch adds a new test, which loads the old format. It takes the default seccomp
profile before the format was changed.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
These types were not used in the API, so could not come up with
a reason why they were in that package.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This was just using libseccomp to get the right arch, but we can use
GOARCH to get this.
The nativeToSeccomp map needed to be adjusted a bit for mipsle vs mipsel
since that's go how refers to it. Also added some other arches to it.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
Restartable Sequences (rseq) are a kernel-based mechanism for fast
update operations on per-core data in user-space. Some libraries, like
the newest version of Google's TCMalloc, depend on it [1].
This also makes dockers default seccomp profile on par with systemd's,
which enabled 'rseq' in early 2019 [2].
1: https://google.github.io/tcmalloc/design.html
2: 6fee3be0b4
Signed-off-by: Florian Schmaus <flo@geekplace.eu>
Add the membarrier syscall to the default seccomp profile.
It is for example used in the implementation of dlopen() in
the musl libc of Alpine images.
Signed-off-by: Julio Guerra <julio@sqreen.com>
Relates to https://patchwork.kernel.org/patch/10756415/
Added to whitelist:
- `clock_getres_time64` (equivalent of `clock_getres`, which was whitelisted)
- `clock_gettime64` (equivalent of `clock_gettime`, which was whitelisted)
- `clock_nanosleep_time64` (equivalent of `clock_nanosleep`, which was whitelisted)
- `futex_time64` (equivalent of `futex`, which was whitelisted)
- `io_pgetevents_time64` (equivalent of `io_pgetevents`, which was whitelisted)
- `mq_timedreceive_time64` (equivalent of `mq_timedreceive`, which was whitelisted)
- `mq_timedsend_time64 ` (equivalent of `mq_timedsend`, which was whitelisted)
- `ppoll_time64` (equivalent of `ppoll`, which was whitelisted)
- `pselect6_time64` (equivalent of `pselect6`, which was whitelisted)
- `recvmmsg_time64` (equivalent of `recvmmsg`, which was whitelisted)
- `rt_sigtimedwait_time64` (equivalent of `rt_sigtimedwait`, which was whitelisted)
- `sched_rr_get_interval_time64` (equivalent of `sched_rr_get_interval`, which was whitelisted)
- `semtimedop_time64` (equivalent of `semtimedop`, which was whitelisted)
- `timer_gettime64` (equivalent of `timer_gettime`, which was whitelisted)
- `timer_settime64` (equivalent of `timer_settime`, which was whitelisted)
- `timerfd_gettime64` (equivalent of `timerfd_gettime`, which was whitelisted)
- `timerfd_settime64` (equivalent of `timerfd_settime`, which was whitelisted)
- `utimensat_time64` (equivalent of `utimensat`, which was whitelisted)
Not added to whitelist:
- `clock_adjtime64` (equivalent of `clock_adjtime`, which was not whitelisted)
- `clock_settime64` (equivalent of `clock_settime`, which was not whitelisted)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
When translating seccomp profile to opencontainers format, a single
group with multiple syscalls is converted to individual syscall rules.
I am not sure why it is done that way, but suspect it might have
performance implications as the number of rules grows.
Change this to pass a groups of syscalls as a group.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
All clone flags for namespace should be denied.
Based-on-patch-by: Kenta Tada <Kenta.Tada@sony.com>
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
io_pgetevents() is a new Linux system call. It is similar to io_getevents()
that is already whitelisted, and adds no special abilities over that system call.
Allow that system call to enable applications that use it.
Fixes#38894.
Signed-off-by: Avi Kivity <avi@scylladb.com>
4.8+ kernels have fixed the ptrace security issues
so we can allow ptrace(2) on the default seccomp
profile if we do the kernel version check.
93e35efb8d
Signed-off-by: Tonis Tiigi <tonistiigi@gmail.com>
This call is what is used to implement `dmesg` to get kernel messages
about the host. This can leak substantial information about the host.
It is normally available to unprivileged users on the host, unless
the sysctl `kernel.dmesg_restrict = 1` is set, but this is not set
by standard on the majority of distributions. Blocking this to restrict
leaks about the configuration seems correct.
Fix#37897
See also https://googleprojectzero.blogspot.com/2018/09/a-cache-invalidation-bug-in-linux.html
Signed-off-by: Justin Cormack <justin.cormack@docker.com>
As soon as the initial executable in the container is executed as a non root user,
permitted and effective capabilities are dropped. Drop them earlier than this, so
that they are dropped before executing the file. The main effect of this is that
if `CAP_DAC_OVERRIDE` is set (the default) the user will not be able to execute
files they do not have permission to execute, which previously they could.
The old behaviour was somewhat surprising and the new one is definitely correct,
but it is not in any meaningful way exploitable, and I do not think it is
necessary to backport this fix. It is unlikely to have any negative effects as
almost all executables have world execute permission anyway.
Use the bounding set not the effective set as the canonical set of capabilities, as
effective will now vary.
Signed-off-by: Justin Cormack <justin.cormack@docker.com>
The quotactl syscall is being whitelisted in default seccomp profile,
gated by CAP_SYS_ADMIN.
Signed-off-by: Panagiotis Moustafellos <pmoust@elastic.co>
Changes most references of syscall to golang.org/x/sys/
Ones aren't changes include, Errno, Signal and SysProcAttr
as they haven't been implemented in /x/sys/.
Signed-off-by: Christopher Jones <tophj@linux.vnet.ibm.com>
[s390x] switch utsname from unsigned to signed
per 33267e036f
char in s390x in the /x/sys/unix package is now signed, so
change the buildtags
Signed-off-by: Christopher Jones <tophj@linux.vnet.ibm.com>
This reverts commit 7e3a596a63.
Unfortunately, it was pointed out in https://github.com/moby/moby/pull/29076#commitcomment-21831387
that the `socketcall` syscall takes a pointer to a struct so it is not possible to
use seccomp profiles to filter it. This means these cannot be blocked as you can
use `socketcall` to call them regardless, as we currently allow 32 bit syscalls.
Users who wish to block these should use a seccomp profile that blocks all
32 bit syscalls and then just block the non socketcall versions.
Signed-off-by: Justin Cormack <justin.cormack@docker.com>
From personality(2):
Have uname(2) report a 2.6.40+ version number rather than a 3.x version
number. Added as a stopgap measure to support broken applications that
could not handle the kernel version-numbering switch from 2.6.x to 3.x.
This allows both "UNAME26|PER_LINUX" and "UNAME26|PER_LINUX32".
Fixes: #32839
Signed-off-by: Ian Campbell <ian.campbell@docker.com>
Previously building with seccomp disabled would cause build failures
because of a mismatch in the type signatures of DefaultProfile().
Signed-off-by: Aleksa Sarai <asarai@suse.de>
These are arm variants with different argument ordering because of
register alignment requirements.
fix#30516
Signed-off-by: Justin Cormack <justin.cormack@docker.com>