ext4 support d_type by default, but filetype feature is a tunable so
there is still a chance to disable it for some reasons. In this case,
print additional message to explicitly tell how to support d_type.
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
In cases where a logging plugin has crashed when the daemon tries to
copy the container stdio to the logging plugin it returns a broken pipe
error and any log entries that occurr while the plugin is down are lost.
Fix this by opening read+write in the daemon so logs are not lost while
the plugin is down.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
- Check errors.Cause(err) when comparing errors
- Fix bug where oldest log file is not actually removed. This in
particular causes issues when compression is enabled. On rotate it just
overwrites the data in the log file corrupting it.
- Use O_TRUNC to open new gzip files to ensure we don't corrupt log
files as happens without the above fix.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
Closing the log driver was in a defer meanwhile logs are
collected asyncronously, so the log driver was being closed before reads
were actually finished.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
The overlay storage driver currently does not support any option, but was silently
ignoring any option that was passed.
This patch verifies that no options are passed, and if they are passed will produce
an error.
Before this change:
dockerd --storage-driver=overlay --storage-opt dm.thinp_percent=95
INFO[2018-05-11T11:40:40.996597152Z] libcontainerd: started new docker-containerd process pid=256
....
INFO[2018-05-11T11:40:41.135392535Z] Daemon has completed initialization
INFO[2018-05-11T11:40:41.141035093Z] API listen on /var/run/docker.sock
After this change:
dockerd --storage-driver=overlay --storage-opt dm.thinp_percent=95
INFO[2018-05-11T11:39:21.632610319Z] libcontainerd: started new docker-containerd process pid=233
....
Error starting daemon: error initializing graphdriver: overlay: unknown option dm.thinp_percent
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
These network operations really don't have anything to do with the
container but rather are setting up the networking.
Ideally these wouldn't get shoved into the daemon package, but doing
something else (e.g. extract a network service into a new package) but
there's a lot more work to do in that regard.
In reality, this probably simplifies some of that work as it moves all
the network operations to the same place.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
A recent optimization in getSourceMount() made it return an error
in case when the found mount point is "/". This prevented bind-mounted
volumes from working in such cases.
A (rather trivial but adeqate) unit test case is added.
Fixes: 871c957242 ("getSourceMount(): simplify")
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The Partial property of the Logger message
was replaced by PLogMetaData, causing the build to fail.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
It does not make sense to copy a slice element by element, then discard
the source one. Let's do copy in place instead which is way more
efficient.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
1. As daemon.ContainerStop() documentation says,
> If a negative number of seconds is given, ContainerStop
> will wait for a graceful termination.
but since commit cfdf84d5d0 (PR #32237) this is no longer the case.
This happens because `context.WithTimeout(ctx, timeout)` is implemented
as `WithDeadline(ctx, time.Now().Add(timeout))`, resulting in a deadline
which is in the past.
To fix, don't use WithDeadline() if the timeout is negative.
2. Add a test case to validate the correct behavior and
as a means to prevent a similar regression in the future.
3. Fix/improve daemon.ContainerStop() and client.ContainerStop()
description for clarity and completeness.
4. Fix/improve DefaultStopTimeout description.
Fixes: cfdf84d5d0 ("Update Container Wait")
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
When then non-blocking mode is specified, awslogs will:
- No longer potentially block calls to logstream.Log(), instead will
return an error if the awslogs buffer is full. This has the effect of
dropping log messages sent to awslogs.Log() that are made while the
buffer is full.
- Wait to initialize the log stream until the first Log() call instead of in
New(). This has the effect of allowing the container to start in
the case where Cloudwatch Logs is unreachable.
Both of these changes require the --log-opt mode=non-blocking to be
explicitly set and do not modify the default behavior.
Signed-off-by: Cody Roseborough <crrosebo@amazon.com>
Since Go 1.7, context is a standard package. Since Go 1.9, everything
that is provided by "x/net/context" is a couple of type aliases to
types in "context".
Many vendored packages still use x/net/context, so vendor entry remains
for now.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
govet complains (when using standard "context" package):
> the cancel function returned by context.WithTimeout should be called,
> not discarded, to avoid a context leak (vet)
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The flow of getSourceMount was:
1 get all entries from /proc/self/mountinfo
2 do a linear search for the `source` directory
3 if found, return its data
4 get the parent directory of `source`, goto 2
The repeated linear search through the whole mountinfo (which can have
thousands of records) is inefficient. Instead, let's just
1 collect all the relevant records (only those mount points
that can be a parent of `source`)
2 find the record with the longest mountpath, return its data
This was tested manually with something like
```go
func TestGetSourceMount(t *testing.T) {
mnt, flags, err := getSourceMount("/sys/devices/msr/")
assert.NoError(t, err)
t.Logf("mnt: %v, flags: %v", mnt, flags)
}
```
...but it relies on having a specific mount points on the system
being used for testing.
[v2: add unit tests for ParentsFilter]
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Use mount.SingleEntryFilter as we're only interested in a single entry.
Test case data of TestShouldUnmountRoot is modified accordingly, as
from now on:
1. `info` can't be nil;
2. the mountpoint check is not performed (as SingleEntryFilter
guarantees it to be equal to daemon.root).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Functions `GetMounts()` and `parseMountTable()` return all the entries
as read and parsed from /proc/self/mountinfo. In many cases the caller
is only interested only one or a few entries, not all of them.
One good example is `Mounted()` function, which looks for a specific
entry only. Another example is `RecursiveUnmount()` which is only
interested in mount under a specific path.
This commit adds `filter` argument to `GetMounts()` to implement
two things:
1. filter out entries a caller is not interested in
2. stop processing if a caller is found what it wanted
`nil` can be passed to get a backward-compatible behavior, i.e. return
all the entries.
A few filters are implemented:
- `PrefixFilter`: filters out all entries not under `prefix`
- `SingleEntryFilter`: looks for a specific entry
Finally, `Mounted()` is modified to use `SingleEntryFilter()`, and
`RecursiveUnmount()` is using `PrefixFilter()`.
Unit tests are added to check filters are working.
[v2: ditch NoFilter, use nil]
[v3: ditch GetMountsFiltered()]
[v4: add unit test for filters]
[v5: switch to gotestyourself]
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This moves the platform specific stuff in a separate package and keeps
the `volume` package and the defined interfaces light to import.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
This makes sure that if the daemon root was already a self-binded mount
(thus meaning the daemonc only performed a remount) that the daemon does
not try to unmount.
Example:
```
$ sudo mount --bind /var/lib/docker /var/lib/docker
$ sudo dockerd &
```
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
Instead of using a global store for volume drivers, scope the driver
store to the caller (e.g. the volume store). This makes testing much
simpler.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
Since the volume store already provides this functionality, we should
just use it rather than duplicating it.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
Add synchronization around adding logs to a plugin
and reading those logs. Without the follow configuration,
a race occurs between go routines to add the logs into
the plugin and read the logs out of the plugin. This
adds a function to synchronize the action to avoid the
race.
Removes use of file for buffering, instead buffering whole
messages so log count can be checked discretely.
Signed-off-by: Derek McGowan <derek@mcgstyle.net>
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Docker daemon has a 16K buffer for log messages. If a message length
exceeds 16K, it should be split by the logger and merged at the
endpoint.
This change adds `PartialLogMetaData` struct for enhanced partial support
- LastPartial (bool) : indicates if this is the last of all partials.
- ID (string) : unique 32 bit ID. ID is same across all partials.
- Ordinal (int starts at 1) : indicates the position of msg in the series of partials.
Also, the timestamps across partials in the same.
Signed-off-by: Anusha Ragunathan <anusha.ragunathan@docker.com>
It does not make any sense to vary this based on whether the
rootfs is read only. We removed all the other mount dependencies
on read-only eg see #35344.
Signed-off-by: Justin Cormack <justin.cormack@docker.com>
We have seen a panic when re-joining a node to a swarm cluster. The
cause of the issue is unknown, so we just need to add a test for nil
objects and log when we get the condition. Hopefully this can prevent
the crash and we can recover the config at a later time.
Signed-off-by: Stephen J Day <stephen.day@docker.com>
commit 617c352e92 "Don't create devices if in a user namespace"
introduced check, which meant to skip mknod operation when run
in user namespace, but instread skipped FIFO and socket files
copy.
Signed-off-by: Maxim Ivanov <ivanov.maxim@gmail.com>
This call was added as part of commit a042e5a20 and at the time was
useful. sandbox.DisableService() basically calls
endpoint.deleteServiceInfoFromCluster() for every endpoint in the
sandbox. However, with the libnetwork change, endpoint.sbLeave()
invokes endpoint.deleteServiceInfoFromCluster(). The releaseNetwork()
call invokes sandbox.Delete() immediately after
sandbox.DisableService(). The sandbox.Delete() in turn ultimately
invokes endpoint.sbLeave() for every endpoint in the sandbox which thus
removes the endpoint's load balancing entry via
endpoint.deleteServiceInfoFromCluster(). So the call to
sandbox.DisableService() is now redundant.
It is noteworthy that, while redundant, the presence of the call would
not cause errors. It would just be sub-optimal. The DisableService()
call would cause libnetwork to down-weight the load balancing entries
while the call to sandbox.Delete() would cause it to remove the entries
immediately afterwards. Aside from the wasted computation, the extra
call would also propagate an extra state change in the networkDB gossip
messages. So, overall, it is much better to just avoid the extra
overhead.
Signed-off-by: Chris Telfer <ctelfer@docker.com>
Now all of the storage drivers use the field "storage-driver" in their log
messages, which is set to name of the respective driver.
Storage drivers changed:
- Aufs
- Btrfs
- Devicemapper
- Overlay
- Overlay 2
- Zfs
Signed-off-by: Alejandro GonzÃlez Hevia <alejandrgh11@gmail.com>
As soon as the initial executable in the container is executed as a non root user,
permitted and effective capabilities are dropped. Drop them earlier than this, so
that they are dropped before executing the file. The main effect of this is that
if `CAP_DAC_OVERRIDE` is set (the default) the user will not be able to execute
files they do not have permission to execute, which previously they could.
The old behaviour was somewhat surprising and the new one is definitely correct,
but it is not in any meaningful way exploitable, and I do not think it is
necessary to backport this fix. It is unlikely to have any negative effects as
almost all executables have world execute permission anyway.
Use the bounding set not the effective set as the canonical set of capabilities, as
effective will now vary.
Signed-off-by: Justin Cormack <justin.cormack@docker.com>
Commit fd0e24b718 changed
the stats collection loop to use a `sleep()` instead
of `time.Tick()` in the for-loop.
This change caused a regression in situations where
no stats are being collected, or an error is hit
in the loop (in which case the loop would `continue`,
and the `sleep()` is not hit).
This patch puts the sleep at the start of the loop
to guarantee it's always hit.
This will delay the sampling, which is similar to the
behavior before fd0e24b718.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This PR adds support for compressibility of log file.
I added a new option conpression for the jsonfile log driver,
this option allows the user to specify compression algorithm to
compress the log files. By default, the log files will be
not compressed. At present, only support 'gzip'.
Signed-off-by: Yanqiang Miao <miao.yanqiang@zte.com.cn>
'docker logs' can read from compressed files
Signed-off-by: Yanqiang Miao <miao.yanqiang@zte.com.cn>
Add Metadata to the gzip header, optmize 'readlog'
Signed-off-by: Yanqiang Miao <miao.yanqiang@zte.com.cn>
Commit 7a7357dae1 ("LCOW: Implemented support for docker cp + build")
changed `container.BaseFS` from being a string (that could be empty but
can't lead to nil pointer dereference) to containerfs.ContainerFS,
which could be be `nil` and so nil dereference is at least theoretically
possible, which leads to panic (i.e. engine crashes).
Such a panic can be avoided by carefully analysing the source code in all
the places that dereference a variable, to make the variable can't be nil.
Practically, this analisys are impossible as code is constantly
evolving.
Still, we need to avoid panics and crashes. A good way to do so is to
explicitly check that a variable is non-nil, returning an error
otherwise. Even in case such a check looks absolutely redundant,
further changes to the code might make it useful, and having an
extra check is not a big price to pay to avoid a panic.
This commit adds such checks for all the places where it is not obvious
that container.BaseFS is not nil (which in this case means we do not
call daemon.Mount() a few lines earlier).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
In case ContainerExport() is called for an unmounted container, it leads
to a daemon panic as container.BaseFS, which is dereferenced here, is
nil.
To fix, do not rely on container.BaseFS; use the one returned from
rwlayer.Mount().
Fixes: 7a7357dae1 ("LCOW: Implemented support for docker cp + build")
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
In info, we only need the number of images, but `CountImages` was
getting the whole map of images and then grabbing the length from that.
This causes a lot of unnecessary CPU usage and memory allocations, which
increases with O(n) on the number of images.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
Ingress networks will no longer automatically remove their
load-balancing endpoint (and sandbox) automatically when the network is
otherwise upopulated. This is to prevent automatic removal of the
ingress networks when all the containers leave them. Therefore
explicit removal of an ingress network also requires explicit removal
of its load-balancing endpoint.
Signed-off-by: Chris Telfer <ctelfer@docker.com>
It has been pointed out that if --read-only flag is given, /dev/shm
also becomes read-only in case of --ipc private.
This happens because in this case the mount comes from OCI spec
(since commit 7120976d74), and is a regression caused by that
commit.
The meaning of --read-only flag is to only have a "main" container
filesystem read-only, not the auxiliary stuff (that includes /dev/shm,
other mounts and volumes, --tmpfs, /proc, /dev and so on).
So, let's make sure /dev/shm that comes from OCI spec is not made
read-only.
Fixes: 7120976d74 ("Implement none, private, and shareable ipc modes")
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The test case checks that in case of IpcMode: private and
ReadonlyRootfs: true (as in "docker run --ipc private --read-only")
the resulting /dev/shm mount is NOT made read-only.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
To avoid noise in sampling CPU usage metrics, we now sample the system
usage closer to the actual response from the underlying runtime. Because
the response from the runtime may be delayed, this makes the sampling
more resilient in loaded conditions. In addition to this, we also
replace the tick with a sleep to avoid situations where ticks can backup
under loaded conditions.
The trade off here is slightly more load reading the system CPU usage
for each container. There may be an optimization required for large
amounts of containers but the cost is on the order of 15 ms per 1000
containers. If this becomes a problem, we can time slot the sampling,
but the complexity may not be worth it unless we can test further.
Unfortunately, there aren't really any good tests for this condition.
Triggering this behavior is highly system dependent. As a matter of
course, we should qualify the fix with the users that are affected.
Signed-off-by: Stephen J Day <stephen.day@docker.com>
With the inclusion of PR 30897, creating service for host network
fails in 18.02. Modified IsPreDefinedNetwork check and return
NetworkNameError instead of errdefs.Forbidden to address this issue
Signed-off-by: selansen <elango.siva@docker.com>
While a `types.go` file is handly when there are a lot of record types,
it is completely obnoxious when used for concrete, utility types with a
struct, new function and method set in the same file. This change
removes the `types.go` file in favor of the simpler approach.
Signed-off-by: Stephen J Day <stephen.day@docker.com>
Move the "unmount and deactivate" code into a separate method, and
optimize it a bit:
1. Do not use filepath.Walk() as there's no requirement to recursively
go into every directory under home/mnt; a list of directories in mnt
is sufficient. With filepath.Walk(), in case some container will fail
to unmount, it'll go through the whole container filesystem which is
excessive and useless.
2. Do not use GetMounts() and check if a directory is mounted; just
unmount it and ignore "not mounted" error. Note the same error
is returned in case of wrong flags set, but as flags are hardcoded
we can safely ignore such case.
While at it, promote "can't unmount" log level from debug to warning.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
1. Make sure it's clear the error is from unmount.
2. Simplify the code a bit to make it more readable.
[v2: use errors.Wrap]
[v3: use errors.Wrapf]
[v4: lowercase the error message]
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
1. Replace EnsureRemoveAll() with Rmdir(), as here we are removing
the container's mount point, which is already properly unmounted
and is therefore an empty directory.
2. Ignore the Rmdir() error (but log it unless it's ENOENT). This
is a mount point, currently unmounted (i.e. an empty directory),
and an older kernel can return EBUSY if e.g. the mount was
leaked to other mount namespaces.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Exec processes do not automatically inherit AppArmor
profiles from the container.
This patch sets the AppArmor profile for the exec
process.
Before this change:
apparmor_parser -q -r <<EOF
#include <tunables/global>
profile deny-write flags=(attach_disconnected) {
#include <abstractions/base>
file,
network,
deny /tmp/** w,
capability,
}
EOF
docker run -dit --security-opt "apparmor=deny-write" --name aa busybox
docker exec aa sh -c 'mkdir /tmp/test'
(no error)
With this change applied:
docker exec aa sh -c 'mkdir /tmp/test'
mkdir: can't create directory '/tmp/test': Permission denied
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Signed-off-by: John Howard <jhoward@microsoft.com>
While debugging #32838, it was found (https://github.com/moby/moby/issues/32838#issuecomment-356005845) that the utility VM in some circumstances was crashing. Unfortunately, this was silently thrown away, and as far as the build step (also applies to docker run) was concerned, the exit code was zero and the error was thrown away. Windows containers operate differently to containers on Linux, and there can be legitimate system errors during container shutdown after the init process exits. This PR handles this and passes the error all the way back to the client, and correctly causes a build step running a container which hits a system error to fail, rather than blindly trying to keep going, assuming all is good, and get a subsequent failure on a commit.
With this change, assuming an error occurs, here's an example of a failure which previous was reported as a commit error:
```
The command 'powershell -Command $ErrorActionPreference = 'Stop'; $ProgressPreference = 'SilentlyContinue'; Install-WindowsFeature -Name Web-App-Dev ; Install-WindowsFeature -Name ADLDS; Install-WindowsFeature -Name Web-Mgmt-Compat; Install-WindowsFeature -Name Web-Mgmt-Service; Install-WindowsFeature -Name Web-Metabase; Install-WindowsFeature -Name Web-Lgcy-Scripting; Install-WindowsFeature -Name Web-WMI; Install-WindowsFeature -Name Web-WHC; Install-WindowsFeature -Name Web-Scripting-Tools; Install-WindowsFeature -Name Web-Net-Ext45; Install-WindowsFeature -Name Web-ASP; Install-WindowsFeature -Name Web-ISAPI-Ext; Install-WindowsFeature -Name Web-ISAPI-Filter; Install-WindowsFeature -Name Web-Default-Doc; Install-WindowsFeature -Name Web-Dir-Browsing; Install-WindowsFeature -Name Web-Http-Errors; Install-WindowsFeature -Name Web-Static-Content; Install-WindowsFeature -Name Web-Http-Redirect; Install-WindowsFeature -Name Web-DAV-Publishing; Install-WindowsFeature -Name Web-Health; Install-WindowsFeature -Name Web-Http-Logging; Install-WindowsFeature -Name Web-Custom-Logging; Install-WindowsFeature -Name Web-Log-Libraries; Install-WindowsFeature -Name Web-Request-Monitor; Install-WindowsFeature -Name Web-Http-Tracing; Install-WindowsFeature -Name Web-Stat-Compression; Install-WindowsFeature -Name Web-Dyn-Compression; Install-WindowsFeature -Name Web-Security; Install-WindowsFeature -Name Web-Windows-Auth; Install-WindowsFeature -Name Web-Basic-Auth; Install-WindowsFeature -Name Web-Url-Auth; Install-WindowsFeature -Name Web-WebSockets; Install-WindowsFeature -Name Web-AppInit; Install-WindowsFeature -Name NET-WCF-HTTP-Activation45; Install-WindowsFeature -Name NET-WCF-Pipe-Activation45; Install-WindowsFeature -Name NET-WCF-TCP-Activation45;' returned a non-zero code: 4294967295: container shutdown failed: container ba9c65054d42d4830fb25ef55e4ab3287550345aa1a2bb265df4e5bfcd79c78a encountered an error during WaitTimeout: failure in a Windows system call: The compute system exited unexpectedly. (0xc0370106)
```
Without this change, it would be incorrectly reported such as in this comment: https://github.com/moby/moby/issues/32838#issuecomment-309621097
```
Step 3/8 : ADD buildtools C:/buildtools
re-exec error: exit status 1: output: time="2017-06-20T11:37:38+10:00" level=error msg="hcsshim::ImportLayer failed in Win32: The system cannot find the path specified. (0x3) layerId=\\\\?\\C:\\ProgramData\\docker\\windowsfilter\\b41d28c95f98368b73fc192cb9205700e21
6691495c1f9ac79b9b04ec4923ea2 flavour=1 folder=C:\\Windows\\TEMP\\hcs232661915"
hcsshim::ImportLayer failed in Win32: The system cannot find the path specified. (0x3) layerId=\\?\C:\ProgramData\docker\windowsfilter\b41d28c95f98368b73fc192cb9205700e216691495c1f9ac79b9b04ec4923ea2 flavour=1 folder=C:\Windows\TEMP\hcs232661915
```
This fixes an issue where the container LogPath was empty when the
non-blocking logging mode was enabled. This change sets the LogPath on
the container as soon as the path is generated, instead of setting the
LogPath on a logger struct and then attempting to pull it off that
logger at a later point. That attempt to pull the LogPath off the logger
was error prone since it assumed that the logger would only ever be a
single type.
Prior to this change docker inspect returned an empty string for
LogPath. This caused issues with tools that rely on docker inspect
output to discover container logs, e.g. Kubernetes.
This commit also removes some LogPath methods that are now unnecessary
and are never invoked.
Signed-off-by: junzhe and mnussbaum <code@getbraintree.com>
When a container is created if "--network" is set to "host" all the
ports in the container are bound to the host.
Thus, adding "-p" or "--publish" to the command-line is meaningless.
Unlike "docker run" and "docker create", "docker service create" sends
an error message when network mode is host and port bindings are given
This patch however suggests to send a warning message to the client when
such a case occurs.
The warning message is added to "warnings" which are returned from
"verifyPlatformContainerSettings".
Signed-off-by: Boaz Shuster <ripcurld.github@gmail.com>
On unix, merge secrets/configs handling. This is important because
configs can contain secrets (via templating) and potentially a config
could just simply have secret information "by accident" from the user.
This just make sure that configs are as secure as secrets and de-dups a
lot of code.
Generally this makes everything simpler and configs more secure.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
When tailing a container log, if the log file is empty it will cause the
log stream to abort with an unexpected `EOF`.
Note that this only applies to the "current" log file as rotated files
cannot be empty.
This fix just skips adding the "current" file the log tail if it is
empty.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
Since commit e9b9e4ace2 has landed, there is a chance that
container.RWLayer is nil (due to some half-removed container). Let's
check the pointer before use to avoid any potential nil pointer
dereferences, resulting in a daemon crash.
Note that even without the abovementioned commit, it's better to perform
an extra check (even it's totally redundant) rather than to have a
possibility of a daemon crash. In other words, better be safe than
sorry.
[v2: add a test case for daemon.getInspectData]
[v3: add a check for container.Dead and a special error for the case]
Fixes: e9b9e4ace2
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The imageRefs map was being popualted with containerID, and accessed
with an imageID which would never match.
Remove this broken code because: 1) it hasn't ever worked so isn't
necessary, and 2) because at best it would be racy
ImageDelete() should already handle preventing of removal of used
images.
Signed-off-by: Daniel Nephin <dnephin@docker.com>
Before this change, volume management was relying on the fact that
everything the plugin mounts is visible on the host within the plugin's
rootfs. In practice this caused some issues with mount leaks, so we
changed the behavior such that mounts are not visible on the plugin's
rootfs, but available outside of it, which breaks volume management.
To fix the issue, allow the plugin to scope the path correctly rather
than assuming that everything is visible in `p.Rootfs`.
In practice this is just scoping the `PropagatedMount` paths to the
correct host path.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
Setting up the mounts on the host increases chances of mount leakage and
makes for more cleanup after the plugin has stopped.
With this change all mounts for the plugin are performed by the
container runtime and automatically cleaned up when the container exits.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
Currently the metrics plugin uses a really hackish host mount with
propagated mounts to get the metrics socket into a plugin after the
plugin is alreay running.
This approach ends up leaking mounts which requires setting the plugin
manager root to private, which causes some other issues.
With this change, plugin subsystems can register a set of modifiers to
apply to the plugin's runtime spec before the plugin is ever started.
This will help to generalize some of the customization work that needs
to happen for various plugin subsystems (and future ones).
Specifically it lets the metrics plugin subsystem append a mount to the
runtime spec to mount the metrics socket in the plugin's mount namespace
rather than the host's and prevetns any leaking due to this mount.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
The goal of this refactor is to make it easier to integrate buildkit
and containerd snapshotters.
Commit is used from two places (api and build), each calls it
with distinct arguments. Refactored to pull out the common commit
logic and provide different interfaces for each consumer.
Signed-off-by: Daniel Nephin <dnephin@docker.com>
This was added in #36047 just as a way to make sure the tree is fully
unmounted on shutdown.
For ZFS this could be a breaking change since there was no unmount before.
Someone could have setup the zfs tree themselves. It would be better, if
we really do want the cleanup to actually the unpacked layers checking
for mounts rather than a blind recursive unmount of the root.
BTRFS does not use mounts and does not need to unmount anyway.
These was only an unmount to begin with because for some reason the
btrfs tree was being moutned with `private` propagation.
For the other graphdrivers that still have a recursive unmount here...
these were already being unmounted and performing the recursive unmount
shouldn't break anything. If anyone had anything mounted at the
graphdriver location it would have been unmounted on shutdown anyway.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
By default, if a user requests a bind mount it uses private propagation.
When the source path is a path within the daemon root this, along with
some other propagation values that the user can use, causes issues when
the daemon tries to remove a mountpoint because a container will then
have a private reference to that mount which prevents removal.
Unmouting with MNT_DETATCH can help this scenario on newer kernels, but
ultimately this is just covering up the problem and doesn't actually
free up the underlying resources until all references are destroyed.
This change does essentially 2 things:
1. Change the default propagation when unspecified to `rslave` when the
source path is within the daemon root path or a parent of the daemon
root (because everything is using rbinds).
2. Creates a validation error on create when the user tries to specify
an unacceptable propagation mode for these paths...
basically the only two acceptable modes are `rslave` and `rshared`.
In cases where we have used the new default propagation but the
underlying filesystem is not setup to handle it (fs must hvae at least
rshared propagation) instead of erroring out like we normally would,
this falls back to the old default mode of `private`, which preserves
backwards compatibility.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
Current implementaion of docke daemon doesn't pass down the
`--oom-kill-disable` option specified by the end user to the containerd
when spawning a new docker instance with help from `runc` component, which
results in the `--oom-kill-disable` doesn't work no matter the flag is `true`
or `false`.
This PR will fix this issue reported by #36090
Signed-off-by: Dennis Chen <dennis.chen@arm.com>
Signed-off-by: Jianyong Wu <jianyong.wu@arm.com>
Attachable networks are networks created on the cluster which can then
be attached to by non-swarm containers. These networks are lazily
created on the node that wants to attach to that network.
When no container is currently attached to one of these networks on a
node, and then multiple containers which want that network are started
concurrently, this can cause a race condition in the network attachment
where essentially we try to attach the same network to the node twice.
To easily reproduce this issue you must use a multi-node cluster with a
worker node that has lots of CPUs (I used a 36 CPU node).
Repro steps:
1. On manager, `docker network create -d overlay --attachable test`
2. On worker, `docker create --restart=always --network test busybox
top`, many times... 200 is a good number (but not much more due to
subnet size restrictions)
3. Restart the daemon
When the daemon restarts, it will attempt to start all those containers
simultaneously. Note that you could try to do this yourself over the API,
but it's harder to trigger due to the added latency from going over
the API.
The error produced happens when the daemon tries to start the container
upon allocating the network resources:
```
attaching to network failed, make sure your network options are correct and check manager logs: context deadline exceeded
```
What happens here is the worker makes a network attachment request to
the manager. This is an async call which in the happy case would cause a
task to be placed on the node, which the worker is waiting for to get
the network configuration.
In the case of this race, the error ocurrs on the manager like this:
```
task allocation failure" error="failed during network allocation for task n7bwwwbymj2o2h9asqkza8gom: failed to allocate network IP for task n7bwwwbymj2o2h9asqkza8gom network rj4szie2zfauqnpgh4eri1yue: could not find an available IP" module=node node.id=u3489c490fx1df8onlyfo1v6e
```
The task is not created and the worker times out waiting for the task.
---
The mitigation for this is to make sure that only one attachment reuest
is in flight for a given network at a time *when the network doesn't
already exist on the node*. If the network already exists on the node
there is no need for synchronization because the network is already
allocated and on the node so there is no need to request it from the
manager.
This basically comes down to a race with `Find(network) ||
Create(network)` without any sort of syncronization.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
This fix tries to address the issue raised in 36139 where
ExitCode and PID does not show up in Task.Status.ContainerStatus
The issue was caused by `json:",omitempty"` in PID and ExitCode
which interprate 0 as null.
This is confusion as ExitCode 0 does have a meaning.
This fix removes `json:",omitempty"` in ExitCode and PID,
but changes ContainerStatus to pointer so that ContainerStatus
does not show up at all if no content. If ContainerStatus
does have a content, then ExitCode and PID will show up (even if
they are 0).
This fix fixes 36139.
Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
ReleaseRWLayer can and should only be called once (unless it returns
an error), but might be called twice in case of a failure from
`system.EnsureRemoveAll(container.Root)`. This results in the
following error:
> Error response from daemon: driver "XXX" failed to remove root filesystem for YYY: layer not retained
The obvious fix is to set container.RWLayer to nil as soon as
ReleaseRWLayer() succeeds.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This fix tries to address the issue raised in 36042
where secret and config are not configured with the
specified file mode.
This fix update the file mode so that it is not impacted
with umask.
Additional tests have been added.
This fix fixes 36042.
Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
You don't need to resolve the symlink for the exec as long as the
process is to keep running during execution.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
This change sets an explicit mount propagation for the daemon root.
This is useful for people who need to bind mount the docker daemon root
into a container.
Since bind mounting the daemon root should only ever happen with at
least `rlsave` propagation (to prevent the container from holding
references to mounts making it impossible for the daemon to clean up its
resources), we should make sure the user is actually able to this.
Most modern systems have shared root (`/`) propagation by default
already, however there are some cases where this may not be so
(e.g. potentially docker-in-docker scenarios, but also other cases).
So this just gives the daemon a little more control here and provides
a more uniform experience across different systems.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
This fix tries to address the issue raised in 36083 where
`network inspect` does not show Created time if the network is
created in swarm scope.
The issue was that Created was not converted from swarm api.
This fix addresses the issue.
An unit test has been added.
This fix fixes 36083.
Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
This fix tries to address the issue raised in 33661 where
network alias does not work when connect to a network the second time.
This fix address the issue.
This fix fixes 33661.
Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
This fix tries to address the issue raised in 35752
where container start will trigger a crash if EndpointSettings is nil.
This fix adds the validation to make sure EndpointSettings != nil
This fix fixes 35752.
Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
When succesfully reloading the daemon configuration, print a message
in the logs with the active configuration:
INFO[2018-01-15T15:36:20.901688317Z] Got signal to reload configuration, reloading from: /etc/docker/daemon.json
INFO[2018-01-14T02:23:48.782769942Z] Reloaded configuration: {"mtu":1500,"pidfile":"/var/run/docker.pid","data-root":"/var/lib/docker","exec-root":"/var/run/docker","group":"docker","deprecated-key-path":"/etc/docker/key.json","max-concurrent-downloads":3,"max-concurrent-uploads":5,"shutdown-timeout":15,"debug":true,"hosts":["unix:///var/run/docker.sock"],"log-level":"info","swarm-default-advertise-addr":"","metrics-addr":"","log-driver":"json-file","ip":"0.0.0.0","icc":true,"iptables":true,"ip-forward":true,"ip-masq":true,"userland-proxy":true,"disable-legacy-registry":true,"experimental":false,"network-control-plane-mtu":1500,"runtimes":{"runc":{"path":"docker-runc"}},"default-runtime":"runc","oom-score-adjust":-500,"default-shm-size":67108864,"default-ipc-mode":"shareable"}
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This fix carries PR 34248: Added tag log option to json-logger
This fix changes to use RawAttrs based on review feedback.
This fix fixes 19803, this fix closes 34248.
Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
Fixes#19803
Updated the json-logger to utilize the common log option
'tag' that can define container/image information to include
as part of logging.
When the 'tag' log option is not included, there is no change
to the log content via the json-logger. When the 'tag' log option
is included, the tag will be parsed as a template and the result
will be stored within each log entry as the attribute 'tag'.
Update: Removing test added to integration_cli as those have been deprecated.
Update: Using proper test calls (require and assert) in jsonfilelog_test.go based on review.
Update: Added new unit test configs for logs with tag. Updated unit test error checking.
Update: Cleanup check in jsonlogbytes_test.go to match pending changes in PR #34946.
Update: Merging to correct conflicts from PR #34946.
Signed-off-by: bonczj <josh.bonczkowski@gmail.com>
Signed-off-by: John Howard <jhoward@microsoft.com>
The re-coalesces the daemon stores which were split as part of the
original LCOW implementation.
This is part of the work discussed in https://github.com/moby/moby/issues/34617,
in particular see the document linked to in that issue.
The idea behind making the graphdrivers private is to prevent leaking
mounts into other namespaces.
Unfortunately this is not really what happens.
There is one case where this does work, and that is when the namespace
was created before the daemon's namespace.
However with systemd each system servie winds up with it's own mount
namespace. This causes a race betwen daemon startup and other system
services as to if the mount is actually private.
This also means there is a negative impact when other system services
are started while the daemon is running.
Basically there are too many things that the daemon does not have
control over (nor should it) to be able to protect against these kinds
of leakages. One thing is certain, setting the graphdriver roots to
private disconnects the mount ns heirarchy preventing propagation of
unmounts... new mounts are of course not propagated either, but the
behavior is racey (or just bad in the case of restarting services)... so
it's better to just be able to keep mount propagation in tact.
It also does not protect situations like `-v
/var/lib/docker:/var/lib/docker` where all mounts are recursively bound
into the container anyway.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
This PR contains a fix for moby/moby#30321. There was a moby/moby#31142
PR intending to fix the issue by adding a delay between disabling the
service in the cluster and the shutdown of the tasks. However
disabling the service was not deleting the service info in the cluster.
Added a fix to delete service info from cluster and verified using siege
to ensure there is zero downtime on rolling update of a service.In order
to support it and ensure consitency of enabling and disable service knob
from the daemon, we need to ensure we disable service when we release
the network from the container. This helps in making the enable and
disable service less racy. The corresponding part of libnetwork fix is
part of docker/libnetwork#1824
Signed-off-by: abhi <abhi@docker.com>
A linter (vet) found the following bug in the code:
> daemon/metrics.go:124::error: range variable p captured by func literal (vet)
Here a variable p is used in an async fashion by goroutine, and most
probably by the time of use it is set to the last element of a range.
For example, the following code
```go
for _, c := range []string{"here ", "we ", "go"} {
go func() {
fmt.Print(c)
}()
}
```
will print `gogogo` rather than `here we go` as one would expect.
Fixes: 0e8e8f0f31 ("Add support for metrics plugins")
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>