Commit graph

135 commits

Author SHA1 Message Date
Eng Zer Jun
a916414b0b refactor: move from io/ioutil to io and os package
The io/ioutil package has been deprecated in Go 1.16. This commit
replaces the existing io/ioutil functions with their new definitions in
io and os packages.

Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
(cherry picked from commit c55a4ac779)
Signed-off-by: Cory Snider <csnider@mirantis.com>
2023-02-24 16:11:55 -05:00
Nicolas De Loof
e44d7f735e
AdditionalGids must include effective group ID
otherwise this one won't be considered for permission checks

Signed-off-by: Nicolas De Loof <nicolas.deloof@gmail.com>
(cherry picked from commit 25345f2c04)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2022-09-06 12:25:37 +02:00
Sebastiaan van Stijn
660b9962e4
daemon.WithCommonOptions() fix detection of user-namespaces
Commit dae652e2e5 added support for non-privileged
containers to use ICMP_PROTO (used for `ping`). This option cannot be set for
containers that have user-namespaces enabled.

However, the detection looks to be incorrect; HostConfig.UsernsMode was added
in 6993e891d1 / ee2183881b,
and the property only has meaning if the daemon is running with user namespaces
enabled. In other situations, the property has no meaning.
As a result of the above, the sysctl would only be set for containers running
with UsernsMode=host on a daemon running with user-namespaces enabled.

This patch adds a check if the daemon has user-namespaces enabled (RemappedRoot
having a non-empty value), or if the daemon is running inside a user namespace
(e.g. rootless mode) to fix the detection.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
(cherry picked from commit a826ca3aef)

---
The cherry-pick was almost clean but `userns.RunningInUserNS()` -> `sys.RunningInUserNS()`.

Fix docker/buildx issue 561
---

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2021-12-15 18:20:07 +09:00
Akihiro Suda
e974cb638c
rootless: bind mount: fix "operation not permitted"
The following was failing previously, because `getUnprivilegedMountFlags()` was not called:
```console
$ sudo mount -t tmpfs -o noexec none /tmp/foo
$ $ docker --context=rootless run -it --rm -v /tmp/foo:/mnt:ro alpine
docker: Error response from daemon: OCI runtime create failed: container_linux.go:367: starting container process caused: process_linux.go:520: container init caused: rootfs_linux.go:60: mounting "/tmp/foo" to rootfs at "/home/suda/.local/share/docker/overlay2/b8e7ea02f6ef51247f7f10c7fb26edbfb308d2af8a2c77915260408ed3b0a8ec/merged/mnt" caused: operation not permitted: unknown.
```

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
(cherry picked from commit 248f98ef5e)
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2021-04-01 18:45:23 +09:00
Sebastiaan van Stijn
6458f750e1
use containerd/cgroups to detect cgroups v2
libcontainer does not guarantee a stable API, and is not intended
for external consumers.

this patch replaces some uses of libcontainer/cgroups with
containerd/cgroups.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-11-09 15:00:32 +01:00
Sebastiaan van Stijn
65a33d02f6
Simplify getUser() to use libcontainer built-in functionality
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-09-09 13:25:59 +02:00
Aleksa Sarai
3108ae6226
oci: correctly use user.GetExecUser interface
A nil interface in Go is not the same as a nil pointer that satisfies
the interface. libcontainer/user has special handling for missing
/etc/{passwd,group} files but this is all based on nil interface checks,
which were broken by Docker's usage of the API.

When combined with some recent changes in runc that made read errors
actually be returned to the caller, this results in spurrious -EINVAL
errors when we should detect the situation as "there is no passwd file".

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2020-07-29 14:04:47 +02:00
Brian Goff
24f173a003 Replace service "Capabilities" w/ add/drop API
After dicussing with maintainers, it was decided putting the burden of
providing the full cap list on the client is not a good design.
Instead we decided to follow along with the container API and use cap
add/drop.

This brings in the changes already merged into swarmkit.

Signed-off-by: Brian Goff <cpuguy83@gmail.com>
2020-07-27 10:09:42 -07:00
Kir Kolyshkin
e3cff19dd1 Untangle CPU RT controller init
Commit 56f77d5ade added code that is doing some very ugly things.
In partucular, calling cgroups.FindCgroupMountpointAndRoot() and
daemon.SysInfoRaw() inside a recursively-called initCgroupsPath()
not not a good thing to do.

This commit tries to partially untangle this by moving some expensive
checks and calls earlier, in a minimally invasive way (meaning I
tried hard to not break any logic, however weird it is).

This also removes double call to MkdirAll (not important, but it sticks
out) and renames the function to better reflect what it's doing.

Finally, this wraps some of the errors returned, and fixes the init
function to not ignore the error from itself.

This could be reworked more radically, but at least this this commit
we are calling expensive functions once, and only if necessary.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-06-26 16:19:52 -07:00
Sebastiaan van Stijn
4534a7afc3
daemon: use containerd/sys to detect UserNamespaces
The implementation in libcontainer/system is quite complicated,
and we only use it to detect if user-namespaces are enabled.

In addition, the implementation in containerd uses a sync.Once,
so that detection (and reading/parsing `/proc/self/uid_map`) is
only performed once.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-06-15 13:06:08 +02:00
Justin Cormack
dae652e2e5
Add default sysctls to allow ping sockets and privileged ports with no capabilities
Currently default capability CAP_NET_RAW allows users to open ICMP echo
sockets, and CAP_NET_BIND_SERVICE allows binding to ports under 1024.
Both of these are safe operations, and Linux now provides ways that
these can be set, per container, to be allowed without any capabilties
for non root users. Enable these by default. Users can revert to the
previous behaviour by overriding the sysctl values explicitly.

Signed-off-by: Justin Cormack <justin.cormack@docker.com>
2020-06-04 18:11:08 +01:00
Akihiro Suda
33ee7941d4 support --privileged --cgroupns=private on cgroup v1
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2020-04-21 23:11:32 +09:00
Sebastiaan van Stijn
5d040cbd16
daemon: fix capitalization of some functions
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-04-14 17:22:19 +02:00
Sebastiaan van Stijn
eeef12f469
daemon: address some minor linting issues and nits
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-04-14 17:22:17 +02:00
Kir Kolyshkin
39048cf656 Really switch to moby/sys/mount*
Switch to moby/sys/mount and mountinfo. Keep the pkg/mount for potential
outside users.

This commit was generated by the following bash script:

```
set -e -u -o pipefail

for file in $(git grep -l 'docker/docker/pkg/mount"' | grep -v ^pkg/mount); do
	sed -i -e 's#/docker/docker/pkg/mount"#/moby/sys/mount"#' \
		-e 's#mount\.\(GetMounts\|Mounted\|Info\|[A-Za-z]*Filter\)#mountinfo.\1#g' \
		$file
	goimports -w $file
done
```

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-03-20 09:46:25 -07:00
Akihiro Suda
ca4b51868a rootless: support --exec-opt native.cgroupdriver=systemd
Support cgroup as in Rootless Podman.

Requires cgroup v2 host with crun.
Tested with Ubuntu 19.10 (kernel 5.3, systemd 242), crun v0.12.1.

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2020-02-14 15:32:31 +09:00
Sebastiaan van Stijn
e6c1820ef5
Merge pull request #40174 from AkihiroSuda/cgroup2
support cgroup2
2020-01-09 20:09:11 +01:00
Akhil Mohan
86ebbe16de
remove host directory check
Signed-off-by: Akhil Mohan <akhil.mohan@mayadata.io>
2020-01-02 14:28:51 +05:30
Akihiro Suda
19baeaca26 cgroup2: enable cgroup namespace by default
For cgroup v1, we were unable to change the default because of
compatibility issue.

For cgroup v2, we should change the default right now because switching
to cgroup v2 is already breaking change.

See also containers/libpod#4363 containers/libpod#4374

Privileged containers also use cgroupns=private by default.
https://github.com/containers/libpod/pull/4374#issuecomment-549776387

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2020-01-01 02:58:40 +09:00
Akhil Mohan
35b9e6989f
Make --device flag work in privileged mode
When a container is started in privileged mode, the device mappings
provided by `--device` flag was ignored. Now the device mappings will
be considered even in privileged mode.

Signed-off-by: Akhil Mohan <akhil.mohan@mayadata.io>
2019-12-06 18:43:56 +05:30
wenlxie
03b3ec1dd5
make --device works at privileged mode
Signed-off-by: wenlxie <wenlxie@ebay.com>
2019-12-06 18:17:03 +05:30
Olli Janatuinen
1308a3a99f Move DefaultCapabilities() to caps package
Signed-off-by: Olli Janatuinen <olli.janatuinen@gmail.com>
2019-11-14 21:13:16 +02:00
Justin Cormack
dde030a6b1
Merge pull request #40083 from thaJeztah/daemon_consts
daemon: use constants for AppArmor and Seccomp
2019-10-17 11:12:37 -07:00
Grant Millar
df7b8f458a daemon: Use short libnetwork ID in exec-root & update libnetwork
Signed-off-by: Grant Millar <rid@cylo.io>
2019-10-15 11:40:24 +01:00
Sebastiaan van Stijn
a33cf495f2
daemon: use constants for AppArmor profiles
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2019-10-13 19:16:12 +02:00
Sebastiaan van Stijn
07ff4f1de8
goimports: fix imports
Format the source according to latest goimports.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2019-09-18 12:56:54 +02:00
Rob Gulewich
072400fc4b Make cgroup namespaces configurable
This adds both a daemon-wide flag and a container creation property:
- Set the `CgroupnsMode: "host|private"` HostConfig property at
  container creation time to control what cgroup namespace the container
  is created in
- Set the `--default-cgroupns-mode=host|private` daemon flag to control
  what cgroup namespace containers are created in by default
- Set the default if the daemon flag is unset to "host", for backward
  compatibility
- Default to CgroupnsMode: "host" for client versions < 1.40

Signed-off-by: Rob Gulewich <rgulewich@netflix.com>
2019-05-07 10:22:16 -07:00
Rob Gulewich
256eb04d69 Start containers in their own cgroup namespaces
This is enabled for all containers that are not run with --privileged,
if the kernel supports it.

Fixes #38332

Signed-off-by: Rob Gulewich <rgulewich@netflix.com>
2019-05-07 10:22:16 -07:00
Michael Crosby
c478553640 Export all spec generation opts
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2019-04-10 15:38:36 -04:00
Michael Crosby
cb902f4430 Refactor few spec generation ops
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2019-04-09 16:51:40 -04:00
John Howard
a3eda72f71
Merge pull request #38541 from Microsoft/jjh/containerd
Windows: Experimental: ContainerD runtime
2019-03-19 21:09:19 -07:00
Tibor Vass
8f936ae8cf Add DeviceRequests to HostConfig to support NVIDIA GPUs
This patch hard-codes support for NVIDIA GPUs.
In a future patch it should move out into its own Device Plugin.

Signed-off-by: Tibor Vass <tibor@docker.com>
2019-03-18 17:19:45 +00:00
John Howard
d4ceb61f2b LCOW:Reworking spec builder
Signed-off-by: John Howard <jhoward@microsoft.com>
2019-03-12 18:41:55 -07:00
Sebastiaan van Stijn
dd94555787
Merge pull request #32519 from darkowlzz/32443-docker-update-pids-limit
Add pids-limit support in docker update
2019-02-23 15:20:59 +01:00
Sunny Gogoi
74eb258ffb Add pids-limit support in docker update
- Adds updating PidsLimit in UpdateContainer().
- Adds setting PidsLimit in toContainerResources().

Signed-off-by: Sunny Gogoi <indiasuny000@gmail.com>
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
2019-02-21 14:17:38 -08:00
Akihiro Suda
ec87479b7e allow running dockerd in an unprivileged user namespace (rootless mode)
Please refer to `docs/rootless.md`.

TLDR:
 * Make sure `/etc/subuid` and `/etc/subgid` contain the entry for you
 * `dockerd-rootless.sh --experimental`
 * `docker -H unix://$XDG_RUNTIME_DIR/docker.sock run ...`

Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
2019-02-04 00:24:27 +09:00
Olli Janatuinen
80d7bfd54d Capabilities refactor
- Add support for exact list of capabilities, support only OCI model
- Support OCI model on CapAdd and CapDrop but remain backward compatibility
- Create variable locally instead of declaring it at the top
- Use const for magic "ALL" value
- Rename `cap` variable as it overlaps with `cap()` built-in
- Normalize and validate capabilities before use
- Move validation for conflicting options to validateHostConfig()
- TweakCapabilities: simplify logic to calculate capabilities

Signed-off-by: Olli Janatuinen <olli.janatuinen@gmail.com>
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2019-01-22 21:50:41 +02:00
Michael Crosby
b940cc5cff Move caps and device spec utils to oci pkg
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2018-12-11 10:20:25 -05:00
Aleksa Sarai
7417f50575
oci: include the domainname in "kernel.domainname"
The OCI doesn't have a specific field for an NIS domainname[1] (mainly
because FreeBSD and Solaris appear to have a similar concept but it is
configured entirely differently).

However, on Linux, the NIS domainname can be configured through both the
setdomainname(2) syscall but also through the "kernel.domainname"
sysctl. Since the OCI has a way of injecting sysctls this means we don't
need to have any OCI changes to support NIS domainnames (and we can
always switch if the OCI picks up such support in the future).

It should be noted that because we have to generate this each spec
creation we also have to make sure that it's not clobbered by the
HostConfig. I'm pretty sure making this change generic (so that
HostConfig will not clobber any pre-set sysctls) will not cause other
issues to crop up.

[1]: https://github.com/opencontainers/runtime-spec/issues/592

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2018-11-30 17:31:38 +11:00
Akihiro Suda
596cdffb9f mount: add BindOptions.NonRecursive (API v1.40)
This allows non-recursive bind-mount, i.e. mount(2) with "bind" rather than "rbind".

Swarm-mode will be supported in a separate PR because of mutual vendoring.

Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
2018-11-06 17:51:58 +09:00
Sebastiaan van Stijn
deac65c929
Merge pull request #37850 from AkihiroSuda/propagate-exec-root-to-libnetwork
daemon: propagate exec-root to libnetwork-setkey
2018-09-28 15:20:37 +02:00
Brian Goff
12d5eb8e22
Merge pull request #37703 from kolyshkin/rm-dead-code
daemon/setMounts(): remove dead code
2018-09-25 16:07:15 -07:00
Akihiro Suda
40385208cb daemon: propagate exec-root to libnetwork-setkey
Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
2018-09-15 13:49:30 +09:00
Kir Kolyshkin
ac8c3debdb daemon/setMounts(): remove dead code
Since PR 11353 (commit 7804cd36ee "Filter out default mounts that
are override by user") there can be no duplicated mounts in the list,
so the check is redundant.

This should speed up container start by a nanosecond or two.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2018-08-27 15:40:10 -07:00
Kir Kolyshkin
bcacbf523b Fix docker --init with /dev bind mount
In case a user wants to have a child reaper inside a container
(i.e. run "docker --init") AND a bind-mounted /dev, the following
error occurs:

> docker run -d -v /dev:/dev --init busybox top
> 088c96808c683077f04c4cc2711fddefe1f5970afc085d59e0baae779745a7cf
> docker: Error response from daemon: OCI runtime create failed: container_linux.go:296: starting container process caused "exec: "/dev/init": stat /dev/init: no such file or directory": unknown.

This happens because if a user-suppled /dev is provided, all the
built-in /dev/xxx mounts are filtered out.

To solve, let's move in-container init to /sbin, as the chance that
/sbin will be bind-mounted to a container is smaller than that for /dev.
While at it, let's give it more unique name (docker-init).

NOTE it still won't work for the case of bind-mounted /sbin.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2018-08-27 15:38:46 -07:00
Salahuddin Khan
763d839261 Add ADD/COPY --chown flag support to Windows
This implements chown support on Windows. Built-in accounts as well
as accounts included in the SAM database of the container are supported.

NOTE: IDPair is now named Identity and IDMappings is now named
IdentityMapping.

The following are valid examples:
ADD --chown=Guest . <some directory>
COPY --chown=Administrator . <some directory>
COPY --chown=Guests . <some directory>
COPY --chown=ContainerUser . <some directory>

On Windows an owner is only granted the permission to read the security
descriptor and read/write the discretionary access control list. This
fix also grants read/write and execute permissions to the owner.

Signed-off-by: Salahuddin Khan <salah@docker.com>
2018-08-13 21:59:11 -07:00
Kazuhiro Sera
1e49fdcafc Fix the several typos detected by github.com/client9/misspell
Signed-off-by: Kazuhiro Sera <seratch@gmail.com>
2018-08-09 00:45:00 +09:00
John Starks
e9268d9642 lcow: Allow the client to add device cgroup rules
Signed-off-by: John Starks <jostarks@microsoft.com>
2018-06-15 16:14:17 -07:00
John Starks
349aeeab7c lcow: Allow the client to add or remove capabilities
Signed-off-by: John Starks <jostarks@microsoft.com>
2018-06-15 16:03:33 -07:00
Jess Frazelle
3694c1e34e
api: add configurable MaskedPaths and ReadOnlyPaths to the API
This adds MaskedPaths and ReadOnlyPaths options to HostConfig for containers so
that a user can override the default values.

When the value sent through the API is nil the default is used.
Otherwise the default is overridden.

Adds integration tests for MaskedPaths and ReadonlyPaths.

Signed-off-by: Jess Frazelle <acidburn@microsoft.com>
2018-06-05 12:33:14 -04:00