Commit graph

315 commits

Author SHA1 Message Date
Akihiro Suda
2cb26cfe9c
Merge pull request #38301 from cyphar/waitgroup-limits
daemon: switch to semaphore-gated WaitGroup for startup tasks
2018-12-22 00:07:55 +09:00
Aleksa Sarai
5a52917e4d
daemon: switch to semaphore-gated WaitGroup for startup tasks
Many startup tasks have to run for each container, and thus using a
WaitGroup (which doesn't have a limit to the number of parallel tasks)
can result in Docker exceeding the NOFILE limit quite trivially. A more
optimal solution is to have a parallelism limit by using a semaphore.

In addition, several startup tasks were not parallelised previously
which resulted in very long startup times. According to my testing, 20K
dead containers resulted in ~6 minute startup times (during which time
Docker is completely unusable).

This patch fixes both issues, and the parallelStartupTimes factor chosen
(128 * NumCPU) is based on my own significant testing of the 20K
container case. This patch (on my machines) reduces the startup time
from 6 minutes to less than a minute (ideally this could be further
reduced by removing the need to scan all dead containers on startup --
but that's beyond the scope of this patchset).

In order to avoid the NOFILE limit problem, we also detect this
on-startup and if NOFILE < 2*128*NumCPU we will reduce the parallelism
factor to avoid hitting NOFILE limits (but also emit a warning since
this is almost certainly a mis-configuration).

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2018-12-21 21:51:02 +11:00
Sebastiaan van Stijn
f6002117a4
Extract container-config and container-hostconfig validation
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2018-12-19 13:09:12 +01:00
Sebastiaan van Stijn
b6e373c525
Rename verifyContainerResources to verifyPlatformContainerResources
This validation function is platform-specific; rename it to be
more explicit.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2018-12-19 10:24:09 +01:00
Sebastiaan van Stijn
e278678705
Remove unused argument from verifyPlatformContainerSettings
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2018-12-19 09:23:09 +01:00
Sebastiaan van Stijn
10c97b9357
Unify logging container validation warnings
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2018-12-19 09:15:21 +01:00
Sebastiaan van Stijn
2e23ef5350
Move port-publishing check to linux platform-check
Windows does not have host-mode networking, so on Windows, this
check was a no-op

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2018-12-18 22:46:05 +01:00
Sebastiaan van Stijn
57f1305e74
Move "OOM Kill disable" warning to the daemon
Disabling the oom-killer for a container without setting a memory limit
is dangerous, as it can result in the container consuming unlimited memory,
without the kernel being able to kill it. A check for this situation is curently
done in the CLI, but other consumers of the API won't receive this warning.

This patch adds a check for this situation to the daemon, so that all consumers
of the API will receive this warning.

This patch will have one side-effect; docker cli's that also perform this check
client-side will print the warning twice; this can be addressed by disabling
the cli-side check for newer API versions, but will generate a bit of extra
noise when using an older CLI.

With this patch applied (and a cli that does not take the new warning into account);

```
docker create --oom-kill-disable busybox
WARNING: OOM killer is disabled for the container, but no memory limit is set, this can result in the system running out of resources.
669933b9b237fa27da699483b5cf15355a9027050825146587a0e5be0d848adf

docker run --rm --oom-kill-disable busybox
WARNING: Disabling the OOM killer on containers without setting a '-m/--memory' limit may be dangerous.
WARNING: OOM killer is disabled for the container, but no memory limit is set, this can result in the system running out of resources.
```

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2018-12-18 22:30:56 +01:00
Andrew Hsu
78045a5419
use empty string as cgroup path to grab first find
Signed-off-by: Andrew Hsu <andrewhsu@docker.com>
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2018-12-07 18:44:00 +01:00
Yong Tang
f023816608 Add memory.kernelTCP support for linux
This fix tries to address the issue raised in 37038 where
there were no memory.kernelTCP support for linux.

This fix add MemoryKernelTCP to HostConfig, and pass
the config to runtime-spec.

Additional test case has been added.

This fix fixes 37038.

Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
2018-11-26 21:03:08 +00:00
Justin Cormack
f8e876d761
Fix denial of service with large numbers in cpuset-cpus and cpuset-mems
Using a value such as `--cpuset-mems=1-9223372036854775807` would cause
`dockerd` to run out of memory allocating a map of the values in the
validation code. Set limits to the normal limit of the number of CPUs,
and improve the error handling.

Reported by Huawei PSIRT.

Signed-off-by: Justin Cormack <justin.cormack@docker.com>
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2018-10-05 15:09:02 +02:00
Tibor Vass
34eede0296 Remove 'docker-' prefix for containerd and runc binaries
This allows to run the daemon in environments that have upstream containerd installed.

Signed-off-by: Tibor Vass <tibor@docker.com>
2018-09-24 21:49:03 +00:00
Salahuddin Khan
763d839261 Add ADD/COPY --chown flag support to Windows
This implements chown support on Windows. Built-in accounts as well
as accounts included in the SAM database of the container are supported.

NOTE: IDPair is now named Identity and IDMappings is now named
IdentityMapping.

The following are valid examples:
ADD --chown=Guest . <some directory>
COPY --chown=Administrator . <some directory>
COPY --chown=Guests . <some directory>
COPY --chown=ContainerUser . <some directory>

On Windows an owner is only granted the permission to read the security
descriptor and read/write the discretionary access control list. This
fix also grants read/write and execute permissions to the owner.

Signed-off-by: Salahuddin Khan <salah@docker.com>
2018-08-13 21:59:11 -07:00
Sebastiaan van Stijn
3737194b9f
daemon/*.go: fix some Wrap[f]/Warn[f] errors
In particular, these two:
> daemon/daemon_unix.go:1129: Wrapf format %v reads arg #1, but call has 0 args
> daemon/kill.go:111: Warn call has possible formatting directive %s

and a few more.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2018-07-11 15:51:51 +02:00
Sebastiaan van Stijn
f23c00d870
Various code-cleanup
remove unnescessary import aliases, brackets, and so on.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2018-05-23 17:50:54 +02:00
Alessandro Boch
173b3c364e Allow user to control the default address pools
- Via daemon flag --default-address-pools base=<CIDR>,size=<int>

Signed-off-by: Elango Siva  <elango@docker.com>
2018-04-30 11:14:08 -04:00
Sebastiaan van Stijn
cf9c48bb3e
Merge pull request #36879 from cpuguy83/extra_unmount_check
Extra check before unmounting on shutdown
2018-04-20 17:08:11 -07:00
Brian Goff
6a70fd222b Move mount parsing to separate package.
This moves the platform specific stuff in a separate package and keeps
the `volume` package and the defined interfaces light to import.

Signed-off-by: Brian Goff <cpuguy83@gmail.com>
2018-04-19 06:35:54 -04:00
Brian Goff
c403f0036b Extra check before unmounting on shutdown
This makes sure that if the daemon root was already a self-binded mount
(thus meaning the daemonc only performed a remount) that the daemon does
not try to unmount.

Example:

```
$ sudo mount --bind /var/lib/docker /var/lib/docker
$ sudo dockerd &
```

Signed-off-by: Brian Goff <cpuguy83@gmail.com>
2018-04-18 20:43:42 -04:00
Daniel Nephin
4ceea53b5e Remove duplicate rootFSToAPIType
Signed-off-by: Daniel Nephin <dnephin@docker.com>
2018-02-14 11:59:18 -05:00
Daniel Nephin
c502bcff33 Remove unnecessary getLayerInit
Signed-off-by: Daniel Nephin <dnephin@docker.com>
2018-02-14 11:59:10 -05:00
Tianon Gravi
3a633a712c
Merge pull request #36194 from dnephin/add-canonical-import
Add canonical import path
2018-02-07 13:06:45 -08:00
Daniel Nephin
4f0d95fa6e Add canonical import comment
Signed-off-by: Daniel Nephin <dnephin@docker.com>
2018-02-05 16:51:57 -05:00
Dennis Chen
44b074d199 Daemon: passdown the --oom-kill-disable option to containerd
Current implementaion of docke daemon doesn't pass down the
`--oom-kill-disable` option specified by the end user to the containerd
when spawning a new docker instance with help from `runc` component, which
results in the `--oom-kill-disable` doesn't work no matter the flag is `true`
or `false`.

This PR will fix this issue reported by #36090

Signed-off-by: Dennis Chen <dennis.chen@arm.com>
Signed-off-by: Jianyong Wu <jianyong.wu@arm.com>
2018-02-05 03:25:59 +00:00
Brian Goff
a510192b86 Set daemon root to use shared propagation
This change sets an explicit mount propagation for the daemon root.
This is useful for people who need to bind mount the docker daemon root
into a container.

Since bind mounting the daemon root should only ever happen with at
least `rlsave` propagation (to prevent the container from holding
references to mounts making it impossible for the daemon to clean up its
resources), we should make sure the user is actually able to this.

Most modern systems have shared root (`/`) propagation by default
already, however there are some cases where this may not be so
(e.g. potentially docker-in-docker scenarios, but also other cases).
So this just gives the daemon a little more control here and provides
a more uniform experience across different systems.

Signed-off-by: Brian Goff <cpuguy83@gmail.com>
2018-01-23 14:17:08 -08:00
Sebastiaan van Stijn
6121a8429b
Move reload-related functions to reload.go
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2018-01-21 00:55:49 +01:00
John Howard
ce8e529e18 LCOW: Re-coalesce stores
Signed-off-by: John Howard <jhoward@microsoft.com>

The re-coalesces the daemon stores which were split as part of the
original LCOW implementation.

This is part of the work discussed in https://github.com/moby/moby/issues/34617,
in particular see the document linked to in that issue.
2018-01-18 08:29:19 -08:00
Sebastiaan van Stijn
b4a6313969
Golint: remove redundant ifs
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2018-01-15 00:42:25 +01:00
Sebastiaan van Stijn
16fe5a1289
Remove unused experimental code
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2017-12-18 17:07:48 +01:00
Kir Kolyshkin
516010e92d Simplify/fix MkdirAll usage
This subtle bug keeps lurking in because error checking for `Mkdir()`
and `MkdirAll()` is slightly different wrt to `EEXIST`/`IsExist`:

 - for `Mkdir()`, `IsExist` error should (usually) be ignored
   (unless you want to make sure directory was not there before)
   as it means "the destination directory was already there"

 - for `MkdirAll()`, `IsExist` error should NEVER be ignored.

Mostly, this commit just removes ignoring the IsExist error, as it
should not be ignored.

Also, there are a couple of cases then IsExist is handled as
"directory already exist" which is wrong. As a result, some code
that never worked as intended is now removed.

NOTE that `idtools.MkdirAndChown()` behaves like `os.MkdirAll()`
rather than `os.Mkdir()` -- so its description is amended accordingly,
and its usage is handled as such (i.e. IsExist error is not ignored).

For more details, a quote from my runc commit 6f82d4b (July 2015):

    TL;DR: check for IsExist(err) after a failed MkdirAll() is both
    redundant and wrong -- so two reasons to remove it.

    Quoting MkdirAll documentation:

    > MkdirAll creates a directory named path, along with any necessary
    > parents, and returns nil, or else returns an error. If path
    > is already a directory, MkdirAll does nothing and returns nil.

    This means two things:

    1. If a directory to be created already exists, no error is
    returned.

    2. If the error returned is IsExist (EEXIST), it means there exists
    a non-directory with the same name as MkdirAll need to use for
    directory. Example: we want to MkdirAll("a/b"), but file "a"
    (or "a/b") already exists, so MkdirAll fails.

    The above is a theory, based on quoted documentation and my UNIX
    knowledge.

    3. In practice, though, current MkdirAll implementation [1] returns
    ENOTDIR in most of cases described in #2, with the exception when
    there is a race between MkdirAll and someone else creating the
    last component of MkdirAll argument as a file. In this very case
    MkdirAll() will indeed return EEXIST.

    Because of #1, IsExist check after MkdirAll is not needed.

    Because of #2 and #3, ignoring IsExist error is just plain wrong,
    as directory we require is not created. It's cleaner to report
    the error now.

    Note this error is all over the tree, I guess due to copy-paste,
    or trying to follow the same usage pattern as for Mkdir(),
    or some not quite correct examples on the Internet.

    [1] https://github.com/golang/go/blob/f9ed2f75/src/os/path.go

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2017-11-27 17:32:12 -08:00
Kenfe-Mickael Laventure
ddae20c032
Update libcontainerd to use containerd 1.0
Signed-off-by: Kenfe-Mickael Laventure <mickael.laventure@gmail.com>
2017-10-20 07:11:37 -07:00
Brian Goff
c6a2044497 Don't abort when setting may_detach_mounts
83c2152de5 sets the kernel param for
fs.may_detach_mounts, but this is not neccessary for the daemon to
operate. Instead of erroring out (and thus aborting startup) just log
the error.

Signed-off-by: Brian Goff <cpuguy83@gmail.com>
2017-10-11 14:54:24 -04:00
Brian Goff
83c2152de5 Automatically set may_detach_mounts=1 on startup
This is kernel config available in RHEL7.4 based kernels that enables
mountpoint removal where the mountpoint exists in other namespaces.
In particular this is important for making this pattern work:

```
umount -l /some/path
rm -r /some/path
```

Where `/some/path` exists in another mount namespace.
Setting this value will prevent `device or resource busy` errors when
attempting to the removal of `/some/path` in the example.

This setting is the default, and non-configurable, on upstream kernels
since 3.15.

Signed-off-by: Brian Goff <cpuguy83@gmail.com>
2017-09-20 09:57:25 -04:00
Victor Vieux
a5f9783c93 Merge pull request #34252 from Microsoft/akagup/lcow-remotefs-sandbox
LCOW: Support for docker cp, ADD/COPY on build
2017-09-15 16:49:48 -07:00
Simon Ferquel
e89b6e8c2d Volume refactoring for LCOW
Signed-off-by: Simon Ferquel <simon.ferquel@docker.com>
2017-09-14 12:33:31 -07:00
Akash Gupta
7a7357dae1 LCOW: Implemented support for docker cp + build
This enables docker cp and ADD/COPY docker build support for LCOW.
Originally, the graphdriver.Get() interface returned a local path
to the container root filesystem. This does not work for LCOW, so
the Get() method now returns an interface that LCOW implements to
support copying to and from the container.

Signed-off-by: Akash Gupta <akagup@microsoft.com>
2017-09-14 12:07:52 -07:00
Yong Tang
70214f95b2 Merge pull request #34352 from ChenMin46/fix_rename_shared_namespace
Use ID rather than Name to identify a container when sharing namespace
2017-08-25 09:39:58 -07:00
Chen Min
b6e5ea8e57 Use ID rather than Name to identify a container when sharing namespace
Fix: https://github.com/moby/moby/issues/34307

Signed-off-by: Chen Min <chenmin46@huawei.com>
2017-08-25 01:55:50 +08:00
Kenfe-Mickael Laventure
45d85c9913
Update containerd to 06b9cb35161009dcb7123345749fef02f7cea8e0
This also update:
 - runc to 3f2f8b84a77f73d38244dd690525642a72156c64
 - runtime-specs to v1.0.0

Signed-off-by: Kenfe-Mickael Laventure <mickael.laventure@gmail.com>
2017-08-21 12:04:07 -07:00
Daniel Nephin
9b47b7b151 Fix golint errors.
Signed-off-by: Daniel Nephin <dnephin@docker.com>
2017-08-18 14:23:44 -04:00
Brian Goff
ebcb7d6b40 Remove string checking in API error handling
Use strongly typed errors to set HTTP status codes.
Error interfaces are defined in the api/errors package and errors
returned from controllers are checked against these interfaces.

Errors can be wraeped in a pkg/errors.Causer, as long as somewhere in the
line of causes one of the interfaces is implemented. The special error
interfaces take precedence over Causer, meaning if both Causer and one
of the new error interfaces are implemented, the Causer is not
traversed.

Signed-off-by: Brian Goff <cpuguy83@gmail.com>
2017-08-15 16:01:11 -04:00
Kir Kolyshkin
7120976d74 Implement none, private, and shareable ipc modes
Since the commit d88fe447df ("Add support for sharing /dev/shm/ and
/dev/mqueue between containers") container's /dev/shm is mounted on the
host first, then bind-mounted inside the container. This is done that
way in order to be able to share this container's IPC namespace
(and the /dev/shm mount point) with another container.

Unfortunately, this functionality breaks container checkpoint/restore
(even if IPC is not shared). Since /dev/shm is an external mount, its
contents is not saved by `criu checkpoint`, and so upon restore any
application that tries to access data under /dev/shm is severily
disappointed (which usually results in a fatal crash).

This commit solves the issue by introducing new IPC modes for containers
(in addition to 'host' and 'container:ID'). The new modes are:

 - 'shareable':	enables sharing this container's IPC with others
		(this used to be the implicit default);

 - 'private':	disables sharing this container's IPC.

In 'private' mode, container's /dev/shm is truly mounted inside the
container, without any bind-mounting from the host, which solves the
issue.

While at it, let's also implement 'none' mode. The motivation, as
eloquently put by Justin Cormack, is:

> I wondered a while back about having a none shm mode, as currently it is
> not possible to have a totally unwriteable container as there is always
> a /dev/shm writeable mount. It is a bit of a niche case (and clearly
> should never be allowed to be daemon default) but it would be trivial to
> add now so maybe we should...

...so here's yet yet another mode:

 - 'none':	no /dev/shm mount inside the container (though it still
		has its own private IPC namespace).

Now, to ultimately solve the abovementioned checkpoint/restore issue, we'd
need to make 'private' the default mode, but unfortunately it breaks the
backward compatibility. So, let's make the default container IPC mode
per-daemon configurable (with the built-in default set to 'shareable'
for now). The default can be changed either via a daemon CLI option
(--default-shm-mode) or a daemon.json configuration file parameter
of the same name.

Note one can only set either 'shareable' or 'private' IPC modes as a
daemon default (i.e. in this context 'host', 'container', or 'none'
do not make much sense).

Some other changes this patch introduces are:

1. A mount for /dev/shm is added to default OCI Linux spec.

2. IpcMode.Valid() is simplified to remove duplicated code that parsed
   'container:ID' form. Note the old version used to check that ID does
   not contain a semicolon -- this is no longer the case (tests are
   modified accordingly). The motivation is we should either do a
   proper check for container ID validity, or don't check it at all
   (since it is checked in other places anyway). I chose the latter.

3. IpcMode.Container() is modified to not return container ID if the
   mode value does not start with "container:", unifying the check to
   be the same as in IpcMode.IsContainer().

3. IPC mode unit tests (runconfig/hostconfig_test.go) are modified
   to add checks for newly added values.

[v2: addressed review at https://github.com/moby/moby/pull/34087#pullrequestreview-51345997]
[v3: addressed review at https://github.com/moby/moby/pull/34087#pullrequestreview-53902833]
[v4: addressed the case of upgrading from older daemon, in this case
     container.HostConfig.IpcMode is unset and this is valid]
[v5: document old and new IpcMode values in api/swagger.yaml]
[v6: add the 'none' mode, changelog entry to docs/api/version-history.md]

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2017-08-14 10:50:39 +03:00
Yong Tang
7ccd8bda77 Merge pull request #33722 from TomSweeneyRedHat/tsweeney/privmessage
Add clarification to --privileged error message
2017-08-09 16:08:10 -07:00
Derek McGowan
1009e6a40b
Update logrus to v1.0.1
Fixes case sensitivity issue

Signed-off-by: Derek McGowan <derek@mcgstyle.net>
2017-07-31 13:16:46 -07:00
Tobias Klauser
01f70b028e Switch Stat syscalls to x/sys/unix
Switch some more usage of the Stat function and the Stat_t type from the
syscall package to golang.org/x/sys. Those were missing in PR #33399.

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
2017-07-27 10:09:02 +02:00
Yuanhong Peng
4a6cbf9bcb Return an empty stats if "container not found"
If we get "container not found" error from containerd, it's possibly
because that this container has already been stopped. It will be ok to
ignore this error and just return an empty stats.

Signed-off-by: Yuanhong Peng <pengyuanhong@huawei.com>
2017-07-10 16:30:48 +08:00
Michael Crosby
9d87e6e0fb Do not set -1 for swappiness
Do not set a default value for swappiness as the default value should be
`nil`

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2017-07-03 11:23:15 -07:00
TomSweeneyRedHat
38e26f0d8d Add clarification to --privileged error message
Signed-off-by: TomSweeneyRedHat <tsweeney@redhat.com>
2017-06-25 14:02:20 -04:00
Fabio Kung
a43be3431e avoid re-reading json files when copying containers
Signed-off-by: Fabio Kung <fabio.kung@gmail.com>
2017-06-23 07:52:34 -07:00
John Howard
3aa4a00715 LCOW: Move daemon stores to per platform
Signed-off-by: John Howard <jhoward@microsoft.com>
2017-06-20 19:49:52 -07:00