io_pgetevents() is a new Linux system call. It is similar to io_getevents()
that is already whitelisted, and adds no special abilities over that system call.
Allow that system call to enable applications that use it.
Fixes#38894.
Signed-off-by: Avi Kivity <avi@scylladb.com>
4.8+ kernels have fixed the ptrace security issues
so we can allow ptrace(2) on the default seccomp
profile if we do the kernel version check.
93e35efb8d
Signed-off-by: Tonis Tiigi <tonistiigi@gmail.com>
This call is what is used to implement `dmesg` to get kernel messages
about the host. This can leak substantial information about the host.
It is normally available to unprivileged users on the host, unless
the sysctl `kernel.dmesg_restrict = 1` is set, but this is not set
by standard on the majority of distributions. Blocking this to restrict
leaks about the configuration seems correct.
Fix#37897
See also https://googleprojectzero.blogspot.com/2018/09/a-cache-invalidation-bug-in-linux.html
Signed-off-by: Justin Cormack <justin.cormack@docker.com>
In newer kernels, AppArmor will reject attempts to send signals to a
container because the signal originated from outside of that AppArmor
profile. Correct this by allowing all unconfined signals to be received.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Aleksa Sarai <asarai@suse.de>
As soon as the initial executable in the container is executed as a non root user,
permitted and effective capabilities are dropped. Drop them earlier than this, so
that they are dropped before executing the file. The main effect of this is that
if `CAP_DAC_OVERRIDE` is set (the default) the user will not be able to execute
files they do not have permission to execute, which previously they could.
The old behaviour was somewhat surprising and the new one is definitely correct,
but it is not in any meaningful way exploitable, and I do not think it is
necessary to backport this fix. It is unlikely to have any negative effects as
almost all executables have world execute permission anyway.
Use the bounding set not the effective set as the canonical set of capabilities, as
effective will now vary.
Signed-off-by: Justin Cormack <justin.cormack@docker.com>
These files don't exist under proc so this rule does nothing.
They are protected against by docker's default cgroup devices since they're
both character devices and not explicitly allowed.
Signed-off-by: Tycho Andersen <tycho@docker.com>
The quotactl syscall is being whitelisted in default seccomp profile,
gated by CAP_SYS_ADMIN.
Signed-off-by: Panagiotis Moustafellos <pmoust@elastic.co>
- Remove unused function and variables from the package
- Remove usage of it from `profiles/apparmor` where it wasn't required
- Move the package to `daemon/logger/templates` where it's only used
Signed-off-by: Vincent Demeester <vincent@sbr.pm>
Changes most references of syscall to golang.org/x/sys/
Ones aren't changes include, Errno, Signal and SysProcAttr
as they haven't been implemented in /x/sys/.
Signed-off-by: Christopher Jones <tophj@linux.vnet.ibm.com>
[s390x] switch utsname from unsigned to signed
per 33267e036f
char in s390x in the /x/sys/unix package is now signed, so
change the buildtags
Signed-off-by: Christopher Jones <tophj@linux.vnet.ibm.com>
This reverts commit 7e3a596a63.
Unfortunately, it was pointed out in https://github.com/moby/moby/pull/29076#commitcomment-21831387
that the `socketcall` syscall takes a pointer to a struct so it is not possible to
use seccomp profiles to filter it. This means these cannot be blocked as you can
use `socketcall` to call them regardless, as we currently allow 32 bit syscalls.
Users who wish to block these should use a seccomp profile that blocks all
32 bit syscalls and then just block the non socketcall versions.
Signed-off-by: Justin Cormack <justin.cormack@docker.com>
From personality(2):
Have uname(2) report a 2.6.40+ version number rather than a 3.x version
number. Added as a stopgap measure to support broken applications that
could not handle the kernel version-numbering switch from 2.6.x to 3.x.
This allows both "UNAME26|PER_LINUX" and "UNAME26|PER_LINUX32".
Fixes: #32839
Signed-off-by: Ian Campbell <ian.campbell@docker.com>
Previously building with seccomp disabled would cause build failures
because of a mismatch in the type signatures of DefaultProfile().
Signed-off-by: Aleksa Sarai <asarai@suse.de>
These are arm variants with different argument ordering because of
register alignment requirements.
fix#30516
Signed-off-by: Justin Cormack <justin.cormack@docker.com>
Linux supports many obsolete address families, which are usually available in
common distro kernels, but they are less likely to be properly audited and
may have security issues
This blocks all socket families in the socket (and socketcall where applicable) syscall
except
- AF_UNIX - Unix domain sockets
- AF_INET - IPv4
- AF_INET6 - IPv6
- AF_NETLINK - Netlink sockets for communicating with the ekrnel
- AF_PACKET - raw sockets, which are only allowed with CAP_NET_RAW
All other socket families are blocked, including Appletalk (native, not
over IP), IPX (remember that!), VSOCK and HVSOCK, which should not generally
be used in containers, etc.
Note that users can of course provide a profile per container or in the daemon
config if they have unusual use cases that require these.
Signed-off-by: Justin Cormack <justin.cormack@docker.com>
Fixes#26823
Fixes an issue where apparmor was not loaded into the kernel, because
apparmor_parser was being called incorrectly.
Signed-off-by: Christopher Jones <tophj@linux.vnet.ibm.com>
Writing the profile to /etc/apparmor.d, while also manually loading it
into the kernel results in quite a bit of confusion. In addition, it
means that people using apparmor but have /etc mounted read-only cannot
use apparmor at all on a Docker host.
Fix this by writing the profile to a temporary directory and deleting it
after it's been inserted.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
Only open_by_handle_at requires CAP_DAC_READ_SEARCH.
This allows systemd to run with only `--cap-add SYS_ADMIN`
rather than having to also add `--cap-add DAC_READ_SEARCH`
as well which it does not really need.
Signed-off-by: Justin Cormack <justin.cormack@docker.com>
Do not gate with CAP_IPC_LOCK as unprivileged use is now
allowed in Linux. This returns it to how it was in 1.11.
Fixes#23587
Signed-off-by: Justin Cormack <justin.cormack@docker.com>
To implement seccomp for s390x the following changes are required:
1) seccomp_default: Add s390 compat mode
On s390x (64 bit) we can run s390 (32 bit) programs in 32 bit
compat mode. Therefore add this information to arches().
2) seccomp_default: Use correct flags parameter for sys_clone on s390x
On s390x the second parameter for the clone system call is the flags
parameter. On all other architectures it is the first one.
See kernel code kernel/fork.c:
#elif defined(CONFIG_CLONE_BACKWARDS2)
SYSCALL_DEFINE5(clone, unsigned long, newsp, unsigned long, clone_flags,
int __user *, parent_tidptr,
So fix the docker default seccomp rule and check for the second
parameter on s390/s390x.
3) seccomp_default: Add s390 specific syscalls
For s390 we currently have three additional system calls that should
be added to the seccomp whitelist:
- Other architectures can read/write unprivileged from/to PCI MMIO memory.
On s390 the instructions are privileged and therefore we need system
calls for that purpose:
* s390_pci_mmio_write()
* s390_pci_mmio_read()
- Runtime instrumentation:
* s390_runtime_instr()
4) test_integration: Do not run seccomp default profile test on s390x
The generated profile that we check in is for amd64 and i386
architectures and does not work correctly on s390x.
See also: 75385dc216 ("Do not run the seccomp tests that use
default.json on non x86 architectures")
5) Dockerfile.s390x: Add "seccomp" to DOCKER_BUILDTAGS
Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
In #22554 I aligned seccomp and capabilities, however the case of
the chown calls and CAP_CHOWN was less clearcut, as these are
simple calls that the capabilities will block if they are not
allowed. They are needed when no new privileges is not set in
order to allow docker to call chown before the container is
started, so there was a workaround but this did not include
all the chown syscalls, and Arm was failing on some seccomp
tests because it was using a different syscall from just the
fchown that was allowed in this case. It is simpler to just
allow all the chown calls in the default seccomp profile and
let the capabilities subsystem block them.
Signed-off-by: Justin Cormack <justin.cormack@docker.com>