Commit graph

333 commits

Author SHA1 Message Date
Brian Goff
67de83e70b Use real root with 0701 perms
Various dirs in /var/lib/docker contain data that needs to be mounted
into a container. For this reason, these dirs are set to be owned by the
remapped root user, otherwise there can be permissions issues.
However, this uneccessarily exposes these dirs to an unprivileged user
on the host.

Instead, set the ownership of these dirs to the real root (or rather the
UID/GID of dockerd) with 0701 permissions, which allows the remapped
root to enter the directories but not read/write to them.
The remapped root needs to enter these dirs so the container's rootfs
can be configured... e.g. to mount /etc/resolve.conf.

This prevents an unprivileged user from having read/write access to
these dirs on the host.
The flip side of this is now any user can enter these directories.

Signed-off-by: Brian Goff <cpuguy83@gmail.com>
(cherry picked from commit e908cc3901)

Cherry-pick conflict with eb14d936bf:
Kept old `container` variable name.
Signed-off-by: Tibor Vass <tibor@docker.com>
2021-01-28 21:42:41 +00:00
Akihiro Suda
47a6d9b54f
Merge pull request #40565 from thaJeztah/19.03_backport_fix_bip_subnet_config
[19.03 backport] Set the bip network value as the subnet
2020-04-17 16:59:34 +09:00
Arko Dasgupta
911ecc3376
Set the bip network value as the subnet
Dont assign the --bip value directly to the subnet
for the default bridge. Instead use the network value
from the ParseCIDR output

Addresses: https://github.com/moby/moby/issues/40392

Signed-off-by: Arko Dasgupta <arko.dasgupta@docker.com>
(cherry picked from commit f800d5f786)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-02-22 16:04:43 +01:00
Yong Tang
dcae74c44a
Fix docker crash when creating namespaces with UID in /etc/subuid and /etc/subgid
This fix tries to address the issue raised in 39353 where
docker crash when creating namespaces with UID in /etc/subuid and /etc/subgid.

The issue was that, mapping to `/etc/sub[u,g]id` in docker does not
allow numeric ID.

This fix fixes the issue by probing other combinations (uid:groupname, username:gid, uid:gid)
when normal username:groupname fails.

This fix fixes 39353.

Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
(cherry picked from commit f09dc2f4fc)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-02-22 15:46:55 +01:00
Sebastiaan van Stijn
6949793bb1
Merge pull request #429 from thaJeztah/19.03_backport_windows_1903_fixes
[19.03 backport] bump hcsshim to fix docker build failing on Windows 1903
2020-01-23 20:48:16 +01:00
Dominic
16f503c048
cast Dev and Rdev of Stat_t to uint64 for mips
Signed-off-by: Dominic <yindongchao@inspur.com>
Signed-off-by: Dominic Yin <yindongchao@inspur.com>
(cherry picked from commit 5f0231bca1)
Signed-off-by: Dominic Yin <yindongchao@inspur.com>
2020-01-13 09:25:13 +08:00
Sebastiaan van Stijn
4d190af804
Rename "v1" to "statsV1"
follow-up to 27552ceb15, where this
was left as a review comment, but the PR was already merged.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
(cherry picked from commit 9a7e96b5b7)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2019-12-03 16:04:08 +01:00
Sebastiaan van Stijn
9ab162a73a
bump containerd/cgroups 5fbad35c2a7e855762d3c60f2e474ffcad0d470a
full diff: c4b9ac5c76...5fbad35c2a

- containerd/cgroups#82 Add go module support
- containerd/cgroups#96 Move metrics proto package to stats/v1
- containerd/cgroups#97 Allow overriding the default /proc folder in blkioController
- containerd/cgroups#98 Allows ignoring memory modules
- containerd/cgroups#99 Add Go 1.13 to Travis
- containerd/cgroups#100 stats/v1: export per-cgroup stats

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
(cherry picked from commit 27552ceb15)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2019-12-03 16:03:22 +01:00
Brian Goff
34418110ec
Add (hidden) flags to set containerd namespaces
This allows our tests, which all share a containerd instance, to be a
bit more isolated by setting the containerd namespaces to the generated
daemon ID's rather than the default namespaces.

This came about because I found in some cases we had test daemons
failing to start (really very slow to start) because it was (seemingly)
processing events from other tests.

Signed-off-by: Brian Goff <cpuguy83@gmail.com>
(cherry picked from commit 24ad2f486d)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2019-09-25 17:26:26 +02:00
Andrew Hsu
2aa5322638
Merge pull request #352 from thaJeztah/19.03_backport_detect_invalid_linked_container
[19.03 backport] Return "invalid parameter" when linking to non-existing container
2019-09-19 17:45:09 -07:00
Brian Goff
c67edc5d61
Ensure parent dir exists for mount cleanup file
While investigating a test failure, I found this in the logs:

```
time="2019-07-04T15:06:32.622506760Z" level=warning msg="Error while setting daemon root propagation, this is not generally critical but may cause some functionality to not work or fallback to less desirable behavior" dir=/go/src/github.com/docker/docker/bundles/test-integration/d1285b8250308/root error="error writing file to signal mount cleanup on shutdown: open /tmp/dxr/d1285b8250308/unmount-on-shutdown: no such file or directory"
```

This path is generated from the daemon's exec-root, which appears to not
exist yet. This change just makes sure it exists before we try to write
a file.

Signed-off-by: Brian Goff <cpuguy83@gmail.com>
(cherry picked from commit 7725b88edc)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2019-09-16 15:22:31 +02:00
Sebastiaan van Stijn
1e0234ddc6
Return "invalid parameter" when linking to non-existing container
Trying to link to a non-existing container is not valid, and should return an
"invalid parameter" (400) error. Returning a "not found" error in this situation
would make the client report the container's image could not be found.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
(cherry picked from commit 422067ba7b)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2019-09-10 23:57:45 +02:00
Akihiro Suda
57b59f876e
info: report cgroup driver as "none" when running rootless
Previously `docker info` had reported "cgroupfs" as the cgroup driver
but the driver wasn't actually used at all.

This PR reports "none" as the cgroup driver so as to avoid confusion.
e.g. kubeadm/kubelet will detect cgroupless-ness by checking this docker
info field. https://github.com/rootless-containers/usernetes/pull/97

Note that user still cannot specify `native.cgroupdriver=none` manually.

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
(cherry picked from commit 153466ba0a)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2019-06-03 22:48:36 +02:00
frankyang
750e0ace06
bugfix: fetch the right device number which great than 255
Signed-off-by: frankyang <yyb196@gmail.com>
(cherry picked from commit b9f31912de)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2019-05-21 12:06:26 +02:00
Sebastiaan van Stijn
ffa1728d4b
Normalize values for pids-limit
- Don't set `PidsLimit` when creating a container and
  no limit was set (or the limit was set to "unlimited")
- Don't set `PidsLimit` if the host does not have pids-limit
  support (previously "unlimited" was set).
- Do not generate a warning if the host does not have pids-limit
  support, but pids-limit was set to unlimited (having no
  limit set, or the limit set to "unlimited" is equivalent,
  so no warning is nescessary in that case).
- When updating a container, convert `0`, and `-1` to
  "unlimited" (`0`).

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2019-03-13 00:27:05 +01:00
Sebastiaan van Stijn
dd94555787
Merge pull request #32519 from darkowlzz/32443-docker-update-pids-limit
Add pids-limit support in docker update
2019-02-23 15:20:59 +01:00
Sunny Gogoi
74eb258ffb Add pids-limit support in docker update
- Adds updating PidsLimit in UpdateContainer().
- Adds setting PidsLimit in toContainerResources().

Signed-off-by: Sunny Gogoi <indiasuny000@gmail.com>
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
2019-02-21 14:17:38 -08:00
Akihiro Suda
ec87479b7e allow running dockerd in an unprivileged user namespace (rootless mode)
Please refer to `docs/rootless.md`.

TLDR:
 * Make sure `/etc/subuid` and `/etc/subgid` contain the entry for you
 * `dockerd-rootless.sh --experimental`
 * `docker -H unix://$XDG_RUNTIME_DIR/docker.sock run ...`

Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
2019-02-04 00:24:27 +09:00
Akihiro Suda
2cb26cfe9c
Merge pull request #38301 from cyphar/waitgroup-limits
daemon: switch to semaphore-gated WaitGroup for startup tasks
2018-12-22 00:07:55 +09:00
Aleksa Sarai
5a52917e4d
daemon: switch to semaphore-gated WaitGroup for startup tasks
Many startup tasks have to run for each container, and thus using a
WaitGroup (which doesn't have a limit to the number of parallel tasks)
can result in Docker exceeding the NOFILE limit quite trivially. A more
optimal solution is to have a parallelism limit by using a semaphore.

In addition, several startup tasks were not parallelised previously
which resulted in very long startup times. According to my testing, 20K
dead containers resulted in ~6 minute startup times (during which time
Docker is completely unusable).

This patch fixes both issues, and the parallelStartupTimes factor chosen
(128 * NumCPU) is based on my own significant testing of the 20K
container case. This patch (on my machines) reduces the startup time
from 6 minutes to less than a minute (ideally this could be further
reduced by removing the need to scan all dead containers on startup --
but that's beyond the scope of this patchset).

In order to avoid the NOFILE limit problem, we also detect this
on-startup and if NOFILE < 2*128*NumCPU we will reduce the parallelism
factor to avoid hitting NOFILE limits (but also emit a warning since
this is almost certainly a mis-configuration).

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2018-12-21 21:51:02 +11:00
Sebastiaan van Stijn
f6002117a4
Extract container-config and container-hostconfig validation
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2018-12-19 13:09:12 +01:00
Sebastiaan van Stijn
b6e373c525
Rename verifyContainerResources to verifyPlatformContainerResources
This validation function is platform-specific; rename it to be
more explicit.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2018-12-19 10:24:09 +01:00
Sebastiaan van Stijn
e278678705
Remove unused argument from verifyPlatformContainerSettings
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2018-12-19 09:23:09 +01:00
Sebastiaan van Stijn
10c97b9357
Unify logging container validation warnings
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2018-12-19 09:15:21 +01:00
Sebastiaan van Stijn
2e23ef5350
Move port-publishing check to linux platform-check
Windows does not have host-mode networking, so on Windows, this
check was a no-op

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2018-12-18 22:46:05 +01:00
Sebastiaan van Stijn
57f1305e74
Move "OOM Kill disable" warning to the daemon
Disabling the oom-killer for a container without setting a memory limit
is dangerous, as it can result in the container consuming unlimited memory,
without the kernel being able to kill it. A check for this situation is curently
done in the CLI, but other consumers of the API won't receive this warning.

This patch adds a check for this situation to the daemon, so that all consumers
of the API will receive this warning.

This patch will have one side-effect; docker cli's that also perform this check
client-side will print the warning twice; this can be addressed by disabling
the cli-side check for newer API versions, but will generate a bit of extra
noise when using an older CLI.

With this patch applied (and a cli that does not take the new warning into account);

```
docker create --oom-kill-disable busybox
WARNING: OOM killer is disabled for the container, but no memory limit is set, this can result in the system running out of resources.
669933b9b237fa27da699483b5cf15355a9027050825146587a0e5be0d848adf

docker run --rm --oom-kill-disable busybox
WARNING: Disabling the OOM killer on containers without setting a '-m/--memory' limit may be dangerous.
WARNING: OOM killer is disabled for the container, but no memory limit is set, this can result in the system running out of resources.
```

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2018-12-18 22:30:56 +01:00
Andrew Hsu
78045a5419
use empty string as cgroup path to grab first find
Signed-off-by: Andrew Hsu <andrewhsu@docker.com>
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2018-12-07 18:44:00 +01:00
Yong Tang
f023816608 Add memory.kernelTCP support for linux
This fix tries to address the issue raised in 37038 where
there were no memory.kernelTCP support for linux.

This fix add MemoryKernelTCP to HostConfig, and pass
the config to runtime-spec.

Additional test case has been added.

This fix fixes 37038.

Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
2018-11-26 21:03:08 +00:00
Justin Cormack
f8e876d761
Fix denial of service with large numbers in cpuset-cpus and cpuset-mems
Using a value such as `--cpuset-mems=1-9223372036854775807` would cause
`dockerd` to run out of memory allocating a map of the values in the
validation code. Set limits to the normal limit of the number of CPUs,
and improve the error handling.

Reported by Huawei PSIRT.

Signed-off-by: Justin Cormack <justin.cormack@docker.com>
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2018-10-05 15:09:02 +02:00
Tibor Vass
34eede0296 Remove 'docker-' prefix for containerd and runc binaries
This allows to run the daemon in environments that have upstream containerd installed.

Signed-off-by: Tibor Vass <tibor@docker.com>
2018-09-24 21:49:03 +00:00
Salahuddin Khan
763d839261 Add ADD/COPY --chown flag support to Windows
This implements chown support on Windows. Built-in accounts as well
as accounts included in the SAM database of the container are supported.

NOTE: IDPair is now named Identity and IDMappings is now named
IdentityMapping.

The following are valid examples:
ADD --chown=Guest . <some directory>
COPY --chown=Administrator . <some directory>
COPY --chown=Guests . <some directory>
COPY --chown=ContainerUser . <some directory>

On Windows an owner is only granted the permission to read the security
descriptor and read/write the discretionary access control list. This
fix also grants read/write and execute permissions to the owner.

Signed-off-by: Salahuddin Khan <salah@docker.com>
2018-08-13 21:59:11 -07:00
Sebastiaan van Stijn
3737194b9f
daemon/*.go: fix some Wrap[f]/Warn[f] errors
In particular, these two:
> daemon/daemon_unix.go:1129: Wrapf format %v reads arg #1, but call has 0 args
> daemon/kill.go:111: Warn call has possible formatting directive %s

and a few more.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2018-07-11 15:51:51 +02:00
Sebastiaan van Stijn
f23c00d870
Various code-cleanup
remove unnescessary import aliases, brackets, and so on.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2018-05-23 17:50:54 +02:00
Alessandro Boch
173b3c364e Allow user to control the default address pools
- Via daemon flag --default-address-pools base=<CIDR>,size=<int>

Signed-off-by: Elango Siva  <elango@docker.com>
2018-04-30 11:14:08 -04:00
Sebastiaan van Stijn
cf9c48bb3e
Merge pull request #36879 from cpuguy83/extra_unmount_check
Extra check before unmounting on shutdown
2018-04-20 17:08:11 -07:00
Brian Goff
6a70fd222b Move mount parsing to separate package.
This moves the platform specific stuff in a separate package and keeps
the `volume` package and the defined interfaces light to import.

Signed-off-by: Brian Goff <cpuguy83@gmail.com>
2018-04-19 06:35:54 -04:00
Brian Goff
c403f0036b Extra check before unmounting on shutdown
This makes sure that if the daemon root was already a self-binded mount
(thus meaning the daemonc only performed a remount) that the daemon does
not try to unmount.

Example:

```
$ sudo mount --bind /var/lib/docker /var/lib/docker
$ sudo dockerd &
```

Signed-off-by: Brian Goff <cpuguy83@gmail.com>
2018-04-18 20:43:42 -04:00
Daniel Nephin
4ceea53b5e Remove duplicate rootFSToAPIType
Signed-off-by: Daniel Nephin <dnephin@docker.com>
2018-02-14 11:59:18 -05:00
Daniel Nephin
c502bcff33 Remove unnecessary getLayerInit
Signed-off-by: Daniel Nephin <dnephin@docker.com>
2018-02-14 11:59:10 -05:00
Tianon Gravi
3a633a712c
Merge pull request #36194 from dnephin/add-canonical-import
Add canonical import path
2018-02-07 13:06:45 -08:00
Daniel Nephin
4f0d95fa6e Add canonical import comment
Signed-off-by: Daniel Nephin <dnephin@docker.com>
2018-02-05 16:51:57 -05:00
Dennis Chen
44b074d199 Daemon: passdown the --oom-kill-disable option to containerd
Current implementaion of docke daemon doesn't pass down the
`--oom-kill-disable` option specified by the end user to the containerd
when spawning a new docker instance with help from `runc` component, which
results in the `--oom-kill-disable` doesn't work no matter the flag is `true`
or `false`.

This PR will fix this issue reported by #36090

Signed-off-by: Dennis Chen <dennis.chen@arm.com>
Signed-off-by: Jianyong Wu <jianyong.wu@arm.com>
2018-02-05 03:25:59 +00:00
Brian Goff
a510192b86 Set daemon root to use shared propagation
This change sets an explicit mount propagation for the daemon root.
This is useful for people who need to bind mount the docker daemon root
into a container.

Since bind mounting the daemon root should only ever happen with at
least `rlsave` propagation (to prevent the container from holding
references to mounts making it impossible for the daemon to clean up its
resources), we should make sure the user is actually able to this.

Most modern systems have shared root (`/`) propagation by default
already, however there are some cases where this may not be so
(e.g. potentially docker-in-docker scenarios, but also other cases).
So this just gives the daemon a little more control here and provides
a more uniform experience across different systems.

Signed-off-by: Brian Goff <cpuguy83@gmail.com>
2018-01-23 14:17:08 -08:00
Sebastiaan van Stijn
6121a8429b
Move reload-related functions to reload.go
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2018-01-21 00:55:49 +01:00
John Howard
ce8e529e18 LCOW: Re-coalesce stores
Signed-off-by: John Howard <jhoward@microsoft.com>

The re-coalesces the daemon stores which were split as part of the
original LCOW implementation.

This is part of the work discussed in https://github.com/moby/moby/issues/34617,
in particular see the document linked to in that issue.
2018-01-18 08:29:19 -08:00
Sebastiaan van Stijn
b4a6313969
Golint: remove redundant ifs
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2018-01-15 00:42:25 +01:00
Sebastiaan van Stijn
16fe5a1289
Remove unused experimental code
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2017-12-18 17:07:48 +01:00
Kir Kolyshkin
516010e92d Simplify/fix MkdirAll usage
This subtle bug keeps lurking in because error checking for `Mkdir()`
and `MkdirAll()` is slightly different wrt to `EEXIST`/`IsExist`:

 - for `Mkdir()`, `IsExist` error should (usually) be ignored
   (unless you want to make sure directory was not there before)
   as it means "the destination directory was already there"

 - for `MkdirAll()`, `IsExist` error should NEVER be ignored.

Mostly, this commit just removes ignoring the IsExist error, as it
should not be ignored.

Also, there are a couple of cases then IsExist is handled as
"directory already exist" which is wrong. As a result, some code
that never worked as intended is now removed.

NOTE that `idtools.MkdirAndChown()` behaves like `os.MkdirAll()`
rather than `os.Mkdir()` -- so its description is amended accordingly,
and its usage is handled as such (i.e. IsExist error is not ignored).

For more details, a quote from my runc commit 6f82d4b (July 2015):

    TL;DR: check for IsExist(err) after a failed MkdirAll() is both
    redundant and wrong -- so two reasons to remove it.

    Quoting MkdirAll documentation:

    > MkdirAll creates a directory named path, along with any necessary
    > parents, and returns nil, or else returns an error. If path
    > is already a directory, MkdirAll does nothing and returns nil.

    This means two things:

    1. If a directory to be created already exists, no error is
    returned.

    2. If the error returned is IsExist (EEXIST), it means there exists
    a non-directory with the same name as MkdirAll need to use for
    directory. Example: we want to MkdirAll("a/b"), but file "a"
    (or "a/b") already exists, so MkdirAll fails.

    The above is a theory, based on quoted documentation and my UNIX
    knowledge.

    3. In practice, though, current MkdirAll implementation [1] returns
    ENOTDIR in most of cases described in #2, with the exception when
    there is a race between MkdirAll and someone else creating the
    last component of MkdirAll argument as a file. In this very case
    MkdirAll() will indeed return EEXIST.

    Because of #1, IsExist check after MkdirAll is not needed.

    Because of #2 and #3, ignoring IsExist error is just plain wrong,
    as directory we require is not created. It's cleaner to report
    the error now.

    Note this error is all over the tree, I guess due to copy-paste,
    or trying to follow the same usage pattern as for Mkdir(),
    or some not quite correct examples on the Internet.

    [1] https://github.com/golang/go/blob/f9ed2f75/src/os/path.go

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2017-11-27 17:32:12 -08:00
Kenfe-Mickael Laventure
ddae20c032
Update libcontainerd to use containerd 1.0
Signed-off-by: Kenfe-Mickael Laventure <mickael.laventure@gmail.com>
2017-10-20 07:11:37 -07:00
Brian Goff
c6a2044497 Don't abort when setting may_detach_mounts
83c2152de5 sets the kernel param for
fs.may_detach_mounts, but this is not neccessary for the daemon to
operate. Instead of erroring out (and thus aborting startup) just log
the error.

Signed-off-by: Brian Goff <cpuguy83@gmail.com>
2017-10-11 14:54:24 -04:00