Commit graph

7184 commits

Author SHA1 Message Date
Sebastiaan van Stijn
04ff4a2ba4
Merge pull request #39137 from arkodg/attach-to-existing-network-error
Handle the error case when a container reattaches to the same network
2019-06-12 19:58:04 +02:00
Sebastiaan van Stijn
29829874d1
Merge pull request #39270 from kolyshkin/moar-aufs-fixes
aufs: retry umount on ebusy, ignore ENOENT in graphdriver.Mounted
2019-06-11 20:43:50 +02:00
Sebastiaan van Stijn
e511b3be89
Merge pull request #39336 from justincormack/entropy-cannot-be-saved
Entropy cannot be saved
2019-06-11 18:40:19 +02:00
Brian Goff
2b15825d9c
Merge pull request #39327 from tonistiigi/improve-non-cgo
allow dockerd builds without cgo
2019-06-07 10:07:44 -07:00
Sebastiaan van Stijn
28678f2226
Merge pull request #38349 from wk8/wk8/os_version
Adding OS version info to nodes' `Info` struct and to the system info's API
2019-06-07 14:54:51 +02:00
Sebastiaan van Stijn
c85fe2d224
Merge pull request #38522 from cpuguy83/fix_timers
Make sure timers are stopped after use.
2019-06-07 13:16:46 +02:00
Justin Cormack
2df693e533
Entropy cannot be saved
Remove non cryptographic randomness.

Signed-off-by: Justin Cormack <justin.cormack@docker.com>
2019-06-07 11:54:45 +01:00
Jean Rouge
d363a1881e Adding OS version info to the nodes' Info struct
This is needed so that we can add OS version constraints in Swarmkit, which
does require the engine to report its host's OS version (see
https://github.com/docker/swarmkit/issues/2770).

The OS version is parsed from the `os-release` file on Linux, and from the
`ReleaseId` string value of the `SOFTWARE\Microsoft\Windows NT\CurrentVersion`
registry key on Windows.

Added unit tests when possible, as well as Prometheus metrics.

Signed-off-by: Jean Rouge <rougej+github@gmail.com>
2019-06-06 22:40:10 +00:00
Kirill Kolyshkin
1d5748d975
Merge pull request #39173 from olljanat/25885-capabilities-swarm
Add support for capabilities options in services
2019-06-06 15:03:46 -07:00
Brian Goff
cf406eb359
Merge pull request #39307 from kolyshkin/aufs-reinstate-mntL
Revert "aufs: remove mntL"
2019-06-06 11:22:16 -07:00
Tonis Tiigi
cf104d85c3 stats: avoid cgo in collector
Signed-off-by: Tonis Tiigi <tonistiigi@gmail.com>
2019-06-05 22:21:11 -07:00
Tonis Tiigi
230a55d337 copy: allow non-cgo build
Signed-off-by: Tonis Tiigi <tonistiigi@gmail.com>
2019-06-05 22:21:11 -07:00
Tonis Tiigi
186cd7cf4a quota: add noncgo build tag
Signed-off-by: Tonis Tiigi <tonistiigi@gmail.com>
2019-06-05 22:21:06 -07:00
John Howard
293c74ba79 Windows: Don't attempt detach VHD for R/O layers
Signed-off-by: John Howard <jhoward@microsoft.com>
2019-06-04 13:38:52 -07:00
Sebastiaan van Stijn
a6e1502575
Merge pull request #39295 from tiborvass/buildkit-systemd-resolvconf
build: buildkit now also uses systemd's resolv.conf
2019-06-04 20:28:36 +02:00
Tibor Vass
18a4498c2d
Merge pull request #39306 from dperny/increase-swarmkit-grpc
Increase max recv gRPC message size for nodes and secrets
2019-06-04 09:32:24 -07:00
Akihiro Suda
364f9bce16
Merge pull request #39292 from cpuguy83/root_dir_on_copy
Pass root to chroot to for chroot Tar/Untar (CVE-2018-15664)
2019-06-05 01:05:29 +09:00
Tibor Vass
8ff4ec98cf build: buildkit now also uses systemd's resolv.conf
Signed-off-by: Tibor Vass <tibor@docker.com>
2019-06-04 16:04:10 +00:00
Sebastiaan van Stijn
3d21b86e0a
Merge pull request #39299 from AkihiroSuda/ro-none-cgroupdriver
info: report cgroup driver as "none" when running rootless
2019-06-03 22:46:08 +02:00
Kir Kolyshkin
5020edca76 Revert "aufs: remove mntL"
Commit e2989c4d48 says:

> With the suffix added, the possibility to hit the race is extremely
> low, and we don't have to do any locking.

Probability theory just laughed in my face this weekend, as this has
actually happened once in 6050000 containers created, on a high-end
hardware with 1000 parallel "docker create" running (took a few days).

One way to work around this is increase the randomness by adding more
characters, which will further decrease the probability, but won't
eliminate it entirely. Another is to fix it upstream (done, see the
link below, but the fix might not be packported to Ubuntu).

Overall, as much as I like this solution, I think we need to
revert it :-\

See-also: https://github.com/sfjro/aufs5-standalone/commit/abf61326f49535

This reverts commit e2989c4d48.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2019-06-03 10:42:45 -07:00
Brian Goff
3029e765e2 Add chroot for tar packing operations
Previously only unpack operations were supported with chroot.
This adds chroot support for packing operations.
This prevents potential breakouts when copying data from a container.

Signed-off-by: Brian Goff <cpuguy83@gmail.com>
2019-06-03 09:45:29 -07:00
Brian Goff
d089b63937 Pass root to chroot to for chroot Untar
This is useful for preventing CVE-2018-15664 where a malicious container
process can take advantage of a race on symlink resolution/sanitization.

Before this change chrootarchive would chroot to the destination
directory which is attacker controlled. With this patch we always chroot
to the container's root which is not attacker controlled.

Signed-off-by: Brian Goff <cpuguy83@gmail.com>
2019-06-03 09:45:21 -07:00
Drew Erny
a0903e1fa3 Increase max recv gRPC message size for nodes and secrets
Increases the max recieved gRPC message size for Node and Secret list
operations. This has already been done for the other swarm types, but
was not done for these.

Signed-off-by: Drew Erny <drew.erny@docker.com>
2019-06-03 11:42:31 -05:00
Yong Tang
acdbaaa3ed
Merge pull request #39204 from olljanat/fix-hostname-dns-resolution
Add alias for hostname if hostname != container name
2019-06-02 09:48:37 -07:00
Akihiro Suda
153466ba0a info: report cgroup driver as "none" when running rootless
Previously `docker info` had reported "cgroupfs" as the cgroup driver
but the driver wasn't actually used at all.

This PR reports "none" as the cgroup driver so as to avoid confusion.
e.g. kubeadm/kubelet will detect cgroupless-ness by checking this docker
info field. https://github.com/rootless-containers/usernetes/pull/97

Note that user still cannot specify `native.cgroupdriver=none` manually.

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2019-06-03 00:11:21 +09:00
Kir Kolyshkin
57f06409b1 aufs: retry unmount on EBUSY
For some reason, retrying to unmount in case of getting EBUSY error
was only performed in Remove(), but not Put().

I have done some testing on Ubuntu 16.04 and 18.04 with aufs,
performing massively parallel container creation using this script:

```
NUMCTS=5000
PARALLEL=100
IMAGE=busybox

docker pull $IMAGE >/dev/null
seq $NUMCTS | parallel -j$PARALLEL docker create $IMAGE true > /dev/null
docker ps -qa | shuf | tail -n $NUMCTS | parallel -j$PARALLEL docker rm -f '{}' > /dev/null
```

Sometimes (1 to 5 times per 10000 `docker create`), aufs.Put() fails on Unmount syscall
with EBUSY during container creation:

> Error response from daemon: device or resource busy

and in docker log, with debug turned on:

> level=debug msg="Failed to unmount ID-init aufs: device or resource busy"
> level=error msg="Handler for POST /v1.30/containers/create returned error: device or resource busy"

I did some debugging by running fuser -v -M -m $MOUNT_POINT but
that reveals nothing.

This commit:

 * implements retry on EBUSY in Unmount()
 * calls Unmount() from Remove()
 * increases the number of retries from 3 to 5

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2019-05-28 18:54:44 -07:00
Olli Janatuinen
f787b235de Add support capabilities list on services
Signed-off-by: Olli Janatuinen <olli.janatuinen@gmail.com>
2019-05-28 19:52:36 +03:00
Sebastiaan van Stijn
c7a0eaf004
Merge pull request #39242 from arkodg/lb-stale-force-leave
Network not deleted after stack is removed
2019-05-28 00:31:55 +03:00
Kir Kolyshkin
72ceac6a74 graphdriver.Mounted(): ignore ENOENT
In case statfs() returns ENOENT, do not return an error, but rather
treat this as "not mounted".

Related to commit d42dbdd3d4.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2019-05-24 12:56:00 -07:00
Arko Dasgupta
70fa7b6a3f Network not deleted after stack is removed
Make sure adapter.removeNetworks executes during task Remove
adapter.removeNetworks was being skipped for cases when
isUnknownContainer(err) was true after adapter.remove was executed

This fix eliminates the nil return case forcing the function
to continue executing unless there is a true error

Fixes https://github.com/moby/moby/issues/39225

Signed-off-by: Arko Dasgupta <arko.dasgupta@docker.com>
2019-05-23 12:37:17 -07:00
Kir Kolyshkin
e2989c4d48 aufs: remove mntL
Commit 5cd62852fa added a lock around call to unix.Mount() to
avoid the race in aufs kernel code related to xino file creation
and removal. While this is going to be fixed in the kernel, we still
need to support the current aufs, so some kind of fix is required.

A think a better fix (rather than a lock) is to add a random suffix
to the file name (note it is and was a separate file per mount,
never mind the same file name -- the file is created/opened and
removed instantly, so each mount deals with its own file).

With the suffix added, the possibility to hit the race is extremely
low, and we don't have to do any locking.

Note we don't add any more characters, instead we're replacing
`xino` with four random characters in the 0-9a-z range.

See also: https://sourceforge.net/p/aufs/mailman/message/36674769/

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2019-05-23 12:08:40 -07:00
Olli Janatuinen
a3fcd4b82a Add alias for hostname if hostname != container
name which happens if user manually specify hostname

Signed-off-by: Olli Janatuinen <olli.janatuinen@gmail.com>
2019-05-22 20:20:43 +03:00
Kir Kolyshkin
ae431b10a9 aufs: retry auplink flush
Running a bundled aufs benchmark sometimes results in this warning:

> WARN[0001] Couldn't run auplink before unmount /tmp/aufs-tests/aufs/mnt/XXXXX  error="exit status 22" storage-driver=aufs

If we take a look at what aulink utility produces on stderr, we'll see:

> auplink:proc_mnt.c:96: /tmp/aufs-tests/aufs/mnt/XXXXX: Invalid argument

and auplink exits with exit code of 22 (EINVAL).

Looking into auplink source code, what happens is it tries to find a
record in /proc/self/mounts corresponding to the mount point (by using
setmntent()/getmntent_r() glibc functions), and it fails.

Some manual testing, as well as runtime testing with lots of printf
added on mount/unmount, as well as calls to check the superblock fs
magic on mount point (as in graphdriver.Mounted(graphdriver.FsMagicAufs, target)
confirmed that this record is in fact there, but sometimes auplink
can't find it. I was also able to reproduce the same error (inability
to find a mount in /proc/self/mounts that should definitely be there)
using a small C program, mocking what `auplink` does:

```c
 #include <stdio.h>
 #include <err.h>
 #include <mntent.h>
 #include <string.h>
 #include <stdlib.h>

int main(int argc, char **argv)
{
	FILE *fp;
	struct mntent m, *p;
	char a[4096];
	char buf[4096 + 1024];
	int found =0, lines = 0;

	if (argc != 2) {
		fprintf(stderr, "Usage: %s <mountpoint>\n", argv[0]);
		exit(1);
	}

	fp = setmntent("/proc/self/mounts", "r");
	if (!fp) {
		err(1, "setmntent");
	}
	setvbuf(fp, a, _IOLBF, sizeof(a));
	while ((p = getmntent_r(fp, &m, buf, sizeof(buf)))) {
		lines++;
		if (!strcmp(p->mnt_dir, argv[1])) {
			found++;
		}
	}
	printf("found %d entries for %s (%d lines seen)\n", found, argv[1], lines);
	return !found;
}
```

I have also wrote a few other C proggies -- one that reads
/proc/self/mounts directly, one that reads /proc/self/mountinfo instead.
They are also prone to the same occasional error.

It is not perfectly clear why this happens, but so far my best theory
is when a lot of mounts/unmounts happen in parallel with reading
contents of /proc/self/mounts, sometimes the kernel fails to provide
continuity (i.e. it skips some part of file or mixes it up in some
other way). In other words, this is a kernel bug (which is probably
hard to fix unless some other interface to get a mount entry is added).

Now, there is no real fix, and a workaround I was able to come up
with is to retry when we got EINVAL. It usually works on the second
attempt, although I've once seen it took two attempts to go through.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2019-05-21 10:58:59 -07:00
Kir Kolyshkin
8fda12c607 aufs.Cleanup: optimize
Do not use filepath.Walk() as there's no requirement to recursively
go into every directory under mnt -- a (non-recursive) list of
directories in mnt is sufficient.

With filepath.Walk(), in case some container will fail to unmount,
it'll go through the whole container filesystem which is both
excessive and useless.

This is similar to commit f1a4592297 ("devmapper.shutdown:
optimize")

While at it, raise the priority of "unmount error" message from debug
to a warning. Note we don't have to explicitly add `m` as unmount error (from
pkg/mount) will have it.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2019-05-21 10:58:59 -07:00
Kir Kolyshkin
d58c434bff aufs: optimize lots of layers case
In case there are a big number of layers, so that mount data won't fit
into a single memory page (4096 bytes on most platforms, which is good
enough for about 40 layers, depending on how long graphdriver root path
is), we supply additional layers with O_REMOUNT, as described in aufs
documentation.

Problem is, the current implementation does that one layer at a time
(i.e. there is one mount syscall per each additional layer).

Optimize the code to supply as many layers as we can fit in one page
(basically reusing the same code as for the original mount).

Note, per aufs docs, "[a]t remount-time, the options are interpreted
in the given order, e.g. left to right" so we should be good.

Tested on an image with ~100 layers.

Before (35 syscalls):
> [pid 22756] 1556919088.686955 mount("none", "/mnt/volume_sfo2_09/docker-aufs/aufs/mnt/a86f8c9dd0ec2486293119c20b0ec026e19bbc4d51332c554f7cf05d777c9866", "aufs", 0, "br:/mnt/volume_sfo2_09/docker-au"...) = 0 <0.000504>
> [pid 22756] 1556919088.687643 mount("none", "/mnt/volume_sfo2_09/docker-aufs/aufs/mnt/a86f8c9dd0ec2486293119c20b0ec026e19bbc4d51332c554f7cf05d777c9866", 0xc000c451b0, MS_REMOUNT, "append:/mnt/volume_sfo2_09/docke"...) = 0 <0.000105>
> [pid 22756] 1556919088.687851 mount("none", "/mnt/volume_sfo2_09/docker-aufs/aufs/mnt/a86f8c9dd0ec2486293119c20b0ec026e19bbc4d51332c554f7cf05d777c9866", 0xc000c451ba, MS_REMOUNT, "append:/mnt/volume_sfo2_09/docke"...) = 0 <0.000098>
> ..... (~30 lines skipped for clarity)
> [pid 22756] 1556919088.696182 mount("none", "/mnt/volume_sfo2_09/docker-aufs/aufs/mnt/a86f8c9dd0ec2486293119c20b0ec026e19bbc4d51332c554f7cf05d777c9866", 0xc000c45310, MS_REMOUNT, "append:/mnt/volume_sfo2_09/docke"...) = 0 <0.000266>

After (2 syscalls):
> [pid 24352] 1556919361.799889 mount("none", "/mnt/volume_sfo2_09/docker-aufs/aufs/mnt/8e7ba189e347a834e99eea4ed568f95b86cec809c227516afdc7c70286ff9a20", "aufs", 0, "br:/mnt/volume_sfo2_09/docker-au"...) = 0 <0.001717>
> [pid 24352] 1556919361.801761 mount("none", "/mnt/volume_sfo2_09/docker-aufs/aufs/mnt/8e7ba189e347a834e99eea4ed568f95b86cec809c227516afdc7c70286ff9a20", 0xc000dbecb0, MS_REMOUNT, "append:/mnt/volume_sfo2_09/docke"...) = 0 <0.001358>

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2019-05-21 10:58:59 -07:00
Kir Kolyshkin
5cd62852fa aufs: add lock around mount
Apparently there is some kind of race in aufs kernel module code,
which leads to the errors like:

[98221.158606] aufs au_xino_create2:186:dockerd[25801]: aufs.xino create err -17
[98221.162128] aufs au_xino_set:1229:dockerd[25801]: I/O Error, failed creating xino(-17).
[98362.239085] aufs au_xino_create2:186:dockerd[6348]: aufs.xino create err -17
[98362.243860] aufs au_xino_set:1229:dockerd[6348]: I/O Error, failed creating xino(-17).
[98373.775380] aufs au_xino_create:767:dockerd[27435]: open /dev/shm/aufs.xino(-17)
[98389.015640] aufs au_xino_create2:186:dockerd[26753]: aufs.xino create err -17
[98389.018776] aufs au_xino_set:1229:dockerd[26753]: I/O Error, failed creating xino(-17).
[98424.117584] aufs au_xino_create:767:dockerd[27105]: open /dev/shm/aufs.xino(-17)

So, we have to have a lock around mount syscall.

While at it, don't call the whole Unmount() on an error path, as
it leads to bogus error from auplink flush.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2019-05-21 10:58:59 -07:00
Kir Kolyshkin
5873768dbe aufs: aufsMount: better errors for unix.Mount()
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2019-05-21 10:58:59 -07:00
Kir Kolyshkin
4beee98026 aufs: use mount.Unmount
1. Use mount.Unmount() which ignores EINVAL ("not mounted") error,
and provides better error diagnostics (so we don't have to explicitly
add target to error messages).

2. Since we're ignoring "not mounted" error, we can call
multiple unmounts without any locking -- but since "auplink flush"
is still involved and can produce an error in logs, let's keep
the check for fs being mounted (it's just a statfs so should be fast).

2. While at it, improve the "can't unmount" error message in Put().

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2019-05-21 10:58:59 -07:00
Kir Kolyshkin
f93750b2c4 aufs: remove extra locking
Both mount and unmount calls are already protected by fine-grained
(per id) locks in Get()/Put() introduced in commit fc1cf1911b
("Add more locking to storage drivers"), so there's no point in
having a global lock in mount/unmount.

The only place from which unmount is called without any locking
is Cleanup() -- this is to be addressed in the next patch.

This reverts commit 824c24e680.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2019-05-21 10:58:59 -07:00
Brian Goff
9808f6036a
Merge pull request #39212 from yyb196/devno
bugfix: fetch the right device number which great than 255
2019-05-20 10:43:03 -07:00
Brian Goff
03a03c6c32
Merge pull request #39190 from ollypom/swarmnanocpu
Switch Swarm Mode services to NanoCpu
2019-05-20 10:14:07 -07:00
Sebastiaan van Stijn
19008faf03
Merge pull request #38992 from kolyshkin/mnt
pkg/mount: optimizations
2019-05-20 14:12:42 +02:00
frankyang
b9f31912de bugfix: fetch the right device number which great than 255
Signed-off-by: frankyang <yyb196@gmail.com>
2019-05-16 15:32:59 +08:00
John Howard
20b11792e8 LCOW: Fix FROM scratch
Signed-off-by: John Howard <jhoward@microsoft.com>
2019-05-14 15:55:59 -07:00
Yong Tang
3042254a87
Merge pull request #38377 from rgulewich/38332-cgroup-ns
Start containers in their own cgroup namespaces
2019-05-11 20:18:31 -07:00
Olly Pomeroy
8a60a1e14a Switch swarmmode services to NanoCpu
Today `$ docker service create --limit-cpu` configures a containers
`CpuPeriod` and `CpuQuota` variables, this commit switches this to
configure a containers `NanoCpu` variable instead.

Signed-off-by: Olly Pomeroy <olly@docker.com>
2019-05-08 14:04:24 +00:00
Rob Gulewich
072400fc4b Make cgroup namespaces configurable
This adds both a daemon-wide flag and a container creation property:
- Set the `CgroupnsMode: "host|private"` HostConfig property at
  container creation time to control what cgroup namespace the container
  is created in
- Set the `--default-cgroupns-mode=host|private` daemon flag to control
  what cgroup namespace containers are created in by default
- Set the default if the daemon flag is unset to "host", for backward
  compatibility
- Default to CgroupnsMode: "host" for client versions < 1.40

Signed-off-by: Rob Gulewich <rgulewich@netflix.com>
2019-05-07 10:22:16 -07:00
Rob Gulewich
256eb04d69 Start containers in their own cgroup namespaces
This is enabled for all containers that are not run with --privileged,
if the kernel supports it.

Fixes #38332

Signed-off-by: Rob Gulewich <rgulewich@netflix.com>
2019-05-07 10:22:16 -07:00
Arko Dasgupta
680d0ba4ab Remove a network during task SHUTDOWN instead of REMOVE to
make sure the LB sandbox is removed when a service is updated
with a --network-rm option

Signed-off-by: Arko Dasgupta <arko.dasgupta@docker.com>
2019-05-06 20:26:59 -07:00
Arko Dasgupta
31e8fcc678 Change Forbidden Error (403) to Conflict(409)
Signed-off-by: Arko Dasgupta <arko.dasgupta@docker.com>
2019-05-03 15:59:20 -07:00