libcontainer does not guarantee a stable API, and is not intended
for external consumers.
this patch replaces some uses of libcontainer/cgroups with
containerd/cgroups.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
A nil interface in Go is not the same as a nil pointer that satisfies
the interface. libcontainer/user has special handling for missing
/etc/{passwd,group} files but this is all based on nil interface checks,
which were broken by Docker's usage of the API.
When combined with some recent changes in runc that made read errors
actually be returned to the caller, this results in spurrious -EINVAL
errors when we should detect the situation as "there is no passwd file".
Signed-off-by: Aleksa Sarai <asarai@suse.de>
After dicussing with maintainers, it was decided putting the burden of
providing the full cap list on the client is not a good design.
Instead we decided to follow along with the container API and use cap
add/drop.
This brings in the changes already merged into swarmkit.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
Commit 56f77d5ade added code that is doing some very ugly things.
In partucular, calling cgroups.FindCgroupMountpointAndRoot() and
daemon.SysInfoRaw() inside a recursively-called initCgroupsPath()
not not a good thing to do.
This commit tries to partially untangle this by moving some expensive
checks and calls earlier, in a minimally invasive way (meaning I
tried hard to not break any logic, however weird it is).
This also removes double call to MkdirAll (not important, but it sticks
out) and renames the function to better reflect what it's doing.
Finally, this wraps some of the errors returned, and fixes the init
function to not ignore the error from itself.
This could be reworked more radically, but at least this this commit
we are calling expensive functions once, and only if necessary.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The implementation in libcontainer/system is quite complicated,
and we only use it to detect if user-namespaces are enabled.
In addition, the implementation in containerd uses a sync.Once,
so that detection (and reading/parsing `/proc/self/uid_map`) is
only performed once.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Currently default capability CAP_NET_RAW allows users to open ICMP echo
sockets, and CAP_NET_BIND_SERVICE allows binding to ports under 1024.
Both of these are safe operations, and Linux now provides ways that
these can be set, per container, to be allowed without any capabilties
for non root users. Enable these by default. Users can revert to the
previous behaviour by overriding the sysctl values explicitly.
Signed-off-by: Justin Cormack <justin.cormack@docker.com>
Switch to moby/sys/mount and mountinfo. Keep the pkg/mount for potential
outside users.
This commit was generated by the following bash script:
```
set -e -u -o pipefail
for file in $(git grep -l 'docker/docker/pkg/mount"' | grep -v ^pkg/mount); do
sed -i -e 's#/docker/docker/pkg/mount"#/moby/sys/mount"#' \
-e 's#mount\.\(GetMounts\|Mounted\|Info\|[A-Za-z]*Filter\)#mountinfo.\1#g' \
$file
goimports -w $file
done
```
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Support cgroup as in Rootless Podman.
Requires cgroup v2 host with crun.
Tested with Ubuntu 19.10 (kernel 5.3, systemd 242), crun v0.12.1.
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
When a container is started in privileged mode, the device mappings
provided by `--device` flag was ignored. Now the device mappings will
be considered even in privileged mode.
Signed-off-by: Akhil Mohan <akhil.mohan@mayadata.io>
Format the source according to latest goimports.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This adds both a daemon-wide flag and a container creation property:
- Set the `CgroupnsMode: "host|private"` HostConfig property at
container creation time to control what cgroup namespace the container
is created in
- Set the `--default-cgroupns-mode=host|private` daemon flag to control
what cgroup namespace containers are created in by default
- Set the default if the daemon flag is unset to "host", for backward
compatibility
- Default to CgroupnsMode: "host" for client versions < 1.40
Signed-off-by: Rob Gulewich <rgulewich@netflix.com>
This is enabled for all containers that are not run with --privileged,
if the kernel supports it.
Fixes#38332
Signed-off-by: Rob Gulewich <rgulewich@netflix.com>
This patch hard-codes support for NVIDIA GPUs.
In a future patch it should move out into its own Device Plugin.
Signed-off-by: Tibor Vass <tibor@docker.com>
Please refer to `docs/rootless.md`.
TLDR:
* Make sure `/etc/subuid` and `/etc/subgid` contain the entry for you
* `dockerd-rootless.sh --experimental`
* `docker -H unix://$XDG_RUNTIME_DIR/docker.sock run ...`
Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
- Add support for exact list of capabilities, support only OCI model
- Support OCI model on CapAdd and CapDrop but remain backward compatibility
- Create variable locally instead of declaring it at the top
- Use const for magic "ALL" value
- Rename `cap` variable as it overlaps with `cap()` built-in
- Normalize and validate capabilities before use
- Move validation for conflicting options to validateHostConfig()
- TweakCapabilities: simplify logic to calculate capabilities
Signed-off-by: Olli Janatuinen <olli.janatuinen@gmail.com>
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
The OCI doesn't have a specific field for an NIS domainname[1] (mainly
because FreeBSD and Solaris appear to have a similar concept but it is
configured entirely differently).
However, on Linux, the NIS domainname can be configured through both the
setdomainname(2) syscall but also through the "kernel.domainname"
sysctl. Since the OCI has a way of injecting sysctls this means we don't
need to have any OCI changes to support NIS domainnames (and we can
always switch if the OCI picks up such support in the future).
It should be noted that because we have to generate this each spec
creation we also have to make sure that it's not clobbered by the
HostConfig. I'm pretty sure making this change generic (so that
HostConfig will not clobber any pre-set sysctls) will not cause other
issues to crop up.
[1]: https://github.com/opencontainers/runtime-spec/issues/592
Signed-off-by: Aleksa Sarai <asarai@suse.de>
This allows non-recursive bind-mount, i.e. mount(2) with "bind" rather than "rbind".
Swarm-mode will be supported in a separate PR because of mutual vendoring.
Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
Since PR 11353 (commit 7804cd36ee "Filter out default mounts that
are override by user") there can be no duplicated mounts in the list,
so the check is redundant.
This should speed up container start by a nanosecond or two.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
In case a user wants to have a child reaper inside a container
(i.e. run "docker --init") AND a bind-mounted /dev, the following
error occurs:
> docker run -d -v /dev:/dev --init busybox top
> 088c96808c683077f04c4cc2711fddefe1f5970afc085d59e0baae779745a7cf
> docker: Error response from daemon: OCI runtime create failed: container_linux.go:296: starting container process caused "exec: "/dev/init": stat /dev/init: no such file or directory": unknown.
This happens because if a user-suppled /dev is provided, all the
built-in /dev/xxx mounts are filtered out.
To solve, let's move in-container init to /sbin, as the chance that
/sbin will be bind-mounted to a container is smaller than that for /dev.
While at it, let's give it more unique name (docker-init).
NOTE it still won't work for the case of bind-mounted /sbin.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This implements chown support on Windows. Built-in accounts as well
as accounts included in the SAM database of the container are supported.
NOTE: IDPair is now named Identity and IDMappings is now named
IdentityMapping.
The following are valid examples:
ADD --chown=Guest . <some directory>
COPY --chown=Administrator . <some directory>
COPY --chown=Guests . <some directory>
COPY --chown=ContainerUser . <some directory>
On Windows an owner is only granted the permission to read the security
descriptor and read/write the discretionary access control list. This
fix also grants read/write and execute permissions to the owner.
Signed-off-by: Salahuddin Khan <salah@docker.com>
This adds MaskedPaths and ReadOnlyPaths options to HostConfig for containers so
that a user can override the default values.
When the value sent through the API is nil the default is used.
Otherwise the default is overridden.
Adds integration tests for MaskedPaths and ReadonlyPaths.
Signed-off-by: Jess Frazelle <acidburn@microsoft.com>
A recent optimization in getSourceMount() made it return an error
in case when the found mount point is "/". This prevented bind-mounted
volumes from working in such cases.
A (rather trivial but adeqate) unit test case is added.
Fixes: 871c957242 ("getSourceMount(): simplify")
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
It does not make sense to copy a slice element by element, then discard
the source one. Let's do copy in place instead which is way more
efficient.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>