Some snapshotters (like overlayfs or zfs) can't mount the same
directories twice. For example if the same directroy is used as an upper
directory in two mounts the kernel will output this warning:
overlayfs: upperdir is in-use as upperdir/workdir of another mount, accessing files from both mounts will result in undefined behavior.
And indeed accessing the files from both mounts will result in an "No
such file or directory" error.
This change introduces reference counts for the mounts, if a directory
is already mounted the mount interface will only increment the mount
counter and return the mount target effectively making sure that the
filesystem doesn't end up in an undefined behavior.
Signed-off-by: Djordje Lukic <djordje.lukic@docker.com>
(cherry picked from commit 32d58144fd)
Signed-off-by: Bjorn Neergaard <bjorn.neergaard@docker.com>
- Split these options to a separate struct, so that we can handle them in isolation.
- Change some tests to use subtests, and improve coverage
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Since cc19eba (backported to v23.0.4), the PreferredPool for docker0 is
set only when the user provides the bip config parameter or when the
default bridge already exist. That means, if a user provides the
fixed-cidr parameter on a fresh install or reboot their computer/server
without bip set, dockerd throw the following error when it starts:
> failed to start daemon: Error initializing network controller: Error
> creating default "bridge" network: failed to parse pool request for
> address space "LocalDefault" pool "" subpool "100.64.0.0/26": Invalid
> Address SubPool
See #45356.
Signed-off-by: Albin Kerouanton <albinker@gmail.com>
Kubernetes only permits RuntimeClass values which are valid lowercase
RFC 1123 labels, which disallows the period character. This prevents
cri-dockerd from being able to support configuring alternative shimv2
runtimes for a pod as shimv2 runtime names must contain at least one
period character. Add support for configuring named shimv2 runtimes in
daemon.json so that runtime names can be aliased to
Kubernetes-compatible names.
Allow options to be set on shimv2 runtimes in daemon.json.
The names of the new daemon runtime config fields have been selected to
correspond with the equivalent field names in cri-containerd's
configuration so that users can more easily follow documentation from
the runtime vendor written for cri-containerd and apply it to
daemon.json.
Signed-off-by: Cory Snider <csnider@mirantis.com>
Signed-off-by: Djordje Lukic <djordje.lukic@docker.com>
c8d/daemon: Mount root and fill BaseFS
This fixes things that were broken due to nil BaseFS like `docker cp`
and running a container with workdir override.
This is more of a temporary hack than a real solution.
The correct fix would be to refactor the code to make BaseFS and LayerRW
an implementation detail of the old image store implementation and use
the temporary mounts for the c8d implementation instead.
That requires more work though.
Signed-off-by: Paweł Gronowski <pawel.gronowski@docker.com>
daemon/images: Don't unset BaseFS
Signed-off-by: Paweł Gronowski <pawel.gronowski@docker.com>
The netutils.ElectInterfaceAddresses function is only used in one place
outside of tests: in the daemon, to configure the default bridge
network. The function is also messy to reason about as it references the
shared mutable state of ipamutils.PredefinedLocalScopeDefaultNetworks.
It uses the list of predefined default networks to always return an IPv4
address even if the named interface does not exist or does not have any
IPv4 addresses. This list happens to be the same as the one used to
initialize the address pool of the 'builtin' IPAM driver, though that is
far from obvious. (Start with "./libnetwork".initIPAMDrivers and trace
the dataflow of the addressPool value. Surprise! Global state is being
mutated using the value of other global mutable state.)
The daemon does not need the fallback behaviour of
ElectInterfaceAddresses. In fact, the daemon does not have to configure
an address pool for the network at all! libnetwork will acquire one of
the available address ranges from the network's IPAM driver when the
preferred-pool configuration is unset. It will do so using the same list
of address ranges and the exact same logic
(netutils.FindAvailableNetworks) as ElectInterfaceAddresses. So unless
the daemon needs to force the network to use a specific address range
because the bridge interface already exists, it can leave the details
up to libnetwork.
Signed-off-by: Cory Snider <csnider@mirantis.com>
(*Daemon).registerLinks() calling the WriteHostConfig() method of its
container argument is a vestigial behaviour. In the distant past,
registerLinks() would persist the container links in an SQLite database
and drop the link config from the container's persisted HostConfig. This
changed in Docker v1.10 (#16032) which migrated away from SQLite and
began using the link config in the container's HostConfig as the
persistent source of truth. registerLinks() no longer mutates the
HostConfig at all so persisting the HostConfig to disk falls outside of
its scope of responsibilities.
Signed-off-by: Cory Snider <csnider@mirantis.com>
The implementation of CanAccess() is very rudimentary, and should
not be used for anything other than a basic check (and maybe not
even for that). It's only used in a single location in the daemon,
so move it there, and un-export it to not encourage others to use
it out of context.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
daemon/network/filter_test.go:174:19: empty-lines: extra empty line at the end of a block (revive)
daemon/restart.go:17:116: empty-lines: extra empty line at the end of a block (revive)
daemon/daemon_linux_test.go:255:41: empty-lines: extra empty line at the end of a block (revive)
daemon/reload_test.go:340:58: empty-lines: extra empty line at the end of a block (revive)
daemon/oci_linux.go:495:101: empty-lines: extra empty line at the end of a block (revive)
daemon/seccomp_linux_test.go:17:36: empty-lines: extra empty line at the start of a block (revive)
daemon/container_operations.go:560:73: empty-lines: extra empty line at the end of a block (revive)
daemon/daemon_unix.go:558:76: empty-lines: extra empty line at the end of a block (revive)
daemon/daemon_unix.go:1092:64: empty-lines: extra empty line at the start of a block (revive)
daemon/container_operations.go:587:24: empty-lines: extra empty line at the end of a block (revive)
daemon/network.go:807:18: empty-lines: extra empty line at the end of a block (revive)
daemon/network.go:813:42: empty-lines: extra empty line at the end of a block (revive)
daemon/network.go:872:72: empty-lines: extra empty line at the end of a block (revive)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
The containerd client is very chatty at the best of times. Because the
libcontained API is stateless and references containers and processes by
string ID for every method call, the implementation is essentially
forced to use the containerd client in a way which amplifies the number
of redundant RPCs invoked to perform any operation. The libcontainerd
remote implementation has to reload the containerd container, task
and/or process metadata for nearly every operation. This in turn
amplifies the number of context switches between dockerd and containerd
to perform any container operation or handle a containerd event,
increasing the load on the system which could otherwise be allocated to
workloads.
Overhaul the libcontainerd interface to reduce the impedance mismatch
with the containerd client so that the containerd client can be used
more efficiently. Split the API out into container, task and process
interfaces which the consumer is expected to retain so that
libcontainerd can retain state---especially the analogous containerd
client objects---without having to manage any state-store inside the
libcontainerd client.
Signed-off-by: Cory Snider <csnider@mirantis.com>
Treat (storage/graph)Driver as snapshotter
Also moved some layerStore related initialization to the non-c8d case
because otherwise they get treated as a graphdriver plugins.
Co-authored-by: Sebastiaan van Stijn <github@gone.nl>
Signed-off-by: Djordje Lukic <djordje.lukic@docker.com>
Signed-off-by: Paweł Gronowski <pawel.gronowski@docker.com>
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Since runtimes can now just be containerd shims, we need to check if the
reference is possibly a containerd shim.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
Contrary to popular belief, the OCI Runtime specification does not
specify the command-line API for runtimes. Looking at containerd's
architecture from the lens of the OCI Runtime spec, the _shim_ is the
OCI Runtime and runC is "just" an implementation detail of the
io.containerd.runc.v2 runtime. When one configures a non-default runtime
in Docker, what they're really doing is instructing Docker to create
containers using the io.containerd.runc.v2 runtime with a configuration
option telling the runtime that the runC binary is at some non-default
path. Consequently, only OCI runtimes which are compatible with the
io.containerd.runc.v2 shim, such as crun, can be used in this manner.
Other OCI runtimes, including kata-containers v2, come with their own
containerd shim and are not compatible with io.containerd.runc.v2.
As Docker has not historically provided a way to select a non-default
runtime which requires its own shim, runtimes such as kata-containers v2
could not be used with Docker.
Allow other containerd shims to be used with Docker; no daemon
configuration required. If the daemon is instructed to create a
container with a runtime name which does not match any of the configured
or stock runtimes, it passes the name along to containerd verbatim. A
user can start a container with the kata-containers runtime, for
example, simply by calling
docker run --runtime io.containerd.kata.v2
Runtime names which containerd would interpret as a path to an arbitrary
binary are disallowed. While handy for development and testing it is not
strictly necessary and would allow anyone with Engine API access to
trivially execute any binary on the host as root, so we have decided it
would be safest for our users if it was not allowed.
It is not yet possible to set an alternative containerd shim as the
default runtime; it can only be configured per-container.
Signed-off-by: Cory Snider <csnider@mirantis.com>
Older versions of Go don't format comments, so committing this as
a separate commit, so that we can already make these changes before
we upgrade to Go 1.19.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Commit 483aa6294b introduced a regression, causing
spurious warnings to be shown when starting a daemon for the first time after
a fresh install:
docker info
...
WARNING: IPv4 forwarding is disabled
WARNING: bridge-nf-call-iptables is disabled
WARNING: bridge-nf-call-ip6tables is disabled
The information shown is incorrect, as checking the corresponding options on
the system, shows that these options are available:
cat /proc/sys/net/ipv4/ip_forward
1
cat /proc/sys/net/bridge/bridge-nf-call-iptables
1
cat /proc/sys/net/bridge/bridge-nf-call-ip6tables
1
The reason this is failing is because the daemon itself reconfigures those
options during networking initialization in `configureIPForwarding()`;
cf4595265e/libnetwork/drivers/bridge/setup_ip_forwarding.go (L14-L25)
Network initialization happens in the `daemon.restore()` function within `daemon.NewDaemon()`:
cf4595265e/daemon/daemon.go (L475-L478)
However, 483aa6294b moved detection of features
earlier in the `daemon.NewDaemon()` function, and collects the system information
(`d.RawSysInfo()`) before we enter `daemon.restore()`;
cf4595265e/daemon/daemon.go (L1008-L1011)
For optimization (collecting the system information comes at a cost), those
results are cached on the daemon, and will only be performed once (using a
`sync.Once`).
This patch:
- introduces a `getSysInfo()` utility, which collects system information without
caching the results
- uses `getSysInfo()` to collect the preliminary information needed at that
point in the daemon's lifecycle.
- moves printing warnings to the end of `daemon.NewDaemon()`, after all information
can be read correctly.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
These HostConfig properties were not validated until the OCI spec for the container
was created, which meant that `container run` and `docker create` would accept
invalid values, and the invalid value would not be detected until `start` was
called, returning a 500 "internal server error", as well as errors from containerd
("cleanup: failed to delete container from containerd: no such container") in the
daemon logs.
As a result, a faulty container was created, and the container state remained
in the `created` state.
This patch:
- Updates `oci.WithNamespaces()` to return the correct `errdefs.InvalidParameter`
- Updates `verifyPlatformContainerSettings()` to validate these settings, so that
an error is returned when _creating_ the container.
Before this patch:
docker run -dit --ipc=shared --name foo busybox
2a00d74e9fbb7960c4718def8f6c74fa8ee754030eeb93ee26a516e27d4d029f
docker: Error response from daemon: Invalid IPC mode: shared.
docker ps -a --filter name=foo
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
2a00d74e9fbb busybox "sh" About a minute ago Created foo
After this patch:
docker run -dit --ipc=shared --name foo busybox
docker: Error response from daemon: invalid IPC mode: shared.
docker ps -a --filter name=foo
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
An integration test was added to verify the new validation, which can be run with:
make BIND_DIR=. TEST_FILTER=TestCreateInvalidHostConfig DOCKER_GRAPHDRIVER=vfs test-integration
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This method returned the network controller, only to set it on the daemon.
While making this change, also;
- update some error messages to be in the correct format
- use errors.Wrap() where possible
- extract configuring networks into a separate function to make the flow
slightly easier to follow.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This is a method on the daemon, which itself holds the Config, so
there's no need to pass the same configuration as an argument.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
- Omit `KernelMemory` and `KernelMemoryTCP` fields in `/info` response if they're
not supported, or when using API v1.42 or up.
- Re-enable detection of `KernelMemory` (as it's still needed for older API versions)
- Remove warning about kernel memory TCP in daemon logs (a warning is still returned
by the `/info` endpoint, but we can consider removing that).
- Prevent incorrect "Minimum kernel memory limit allowed" error if the value was
reset because it's not supported by the host.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
- remove KernelMemory option from `v1.42` api docs
- remove KernelMemory warning on `/info`
- update changes for `v1.42`
- remove `KernelMemory` field from endpoints docs
Signed-off-by: aiordache <anca.iordache@docker.com>
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Finish the refactor which was partially completed with commit
34536c498d, passing around IdentityMapping structs instead of pairs of
[]IDMap slices.
Existing code which uses []IDMap relies on zero-valued fields to be
valid, empty mappings. So in order to successfully finish the
refactoring without introducing bugs, their replacement therefore also
needs to have a useful zero value which represents an empty mapping.
Change IdentityMapping to be a pass-by-value type so that there are no
nil pointers to worry about.
The functionality provided by the deprecated NewIDMappingsFromMaps
function is required by unit tests to to construct arbitrary
IdentityMapping values. And the daemon will always need to access the
mappings to pass them to the Linux kernel. Accommodate these use cases
by exporting the struct fields instead. BuildKit currently depends on
the UIDs and GIDs methods so we cannot get rid of them yet.
Signed-off-by: Cory Snider <csnider@mirantis.com>
All regular, non-EOL Linux distros now come with more recent kernels
out of the box. There may still be users trying to run on kernel 3.10
or older (some embedded systems, e.g.), but those should be a rare
exception, which we don't have to take into account.
This patch removes the kernel version check on Linux, and the corresponding
DOCKER_NOWARN_KERNEL_VERSION environment that was there to skip this
check.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
The `daemon.RawSysInfo()` function can be a heavy operation, as it collects
information about all cgroups on the host, networking, AppArmor, Seccomp, etc.
While looking at our code, I noticed that various parts in the code call this
function, potentially even _multiple times_ per container, for example, it is
called from:
- `verifyPlatformContainerSettings()`
- `oci.WithCgroups()` if the daemon has `cpu-rt-period` or `cpu-rt-runtime` configured
- in `ContainerDecoder.DecodeConfig()`, which is called on boith `container create` and `container commit`
Given that this information is not expected to change during the daemon's
lifecycle, and various information coming from this (such as seccomp and
apparmor status) was already cached, we may as well load it once, and cache
the results in the daemon instance.
This patch updates `daemon.RawSysInfo()` to use a `sync.Once()` so that
it's only executed once for the daemon's lifecycle.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Also includes review suggestions in daemon.initNetworkController():
- update godoc for setHostGatewayIP()
- change setHostGatewayIP() to get config, instead of daemon
- remove redundant nil check for controller
Signed-off-by: sanchayanghosh <sanchayanghosh@outlook.com>
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Do not use 0701 perms.
0701 dir perms allows anyone to traverse the docker dir.
It happens to allow any user to execute, as an example, suid binaries
from image rootfs dirs because it allows traversal AND critically
container users need to be able to do execute things.
0701 on lower directories also happens to allow any user to modify
things in, for instance, the overlay upper dir which neccessarily
has 0755 permissions.
This changes to use 0710 which allows users in the group to traverse.
In userns mode the UID owner is (real) root and the GID is the remapped
root's GID.
This prevents anyone but the remapped root to traverse our directories
(which is required for userns with runc).
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
(cherry picked from commit ef7237442147441a7cadcda0600be1186d81ac73)
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
(cherry picked from commit 93ac040bf0)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This adds support for 2 runtimes on Windows, one that uses the built-in
HCSv1 integration and another which uses containerd with the runhcs
shim.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
The io/ioutil package has been deprecated in Go 1.16. This commit
replaces the existing io/ioutil functions with their new definitions in
io and os packages.
Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
This makes sure that the value set in the daemon can be used as-is,
without having to replicate the normalization logic elsewhere.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This allows containers to use the embedded default profile if a different
default is set (e.g. "unconfined") in the daemon configuration. Without this
option, users would have to copy the default profile to a file in order to
use the default.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Commit b237189e6c implemented an option to
set the default seccomp profile in the daemon configuration. When that PR
was reviewed, it was discussed to have the option accept the path to a custom
profile JSON file; https://github.com/moby/moby/pull/26276#issuecomment-253546966
However, in the implementation, the special "unconfined" value was not taken into
account. The "unconfined" value is meant to disable seccomp (more factually:
run with an empty profile).
While it's likely possible to achieve this by creating a file with an an empty
(`{}`) profile, and passing the path to that file, it's inconsistent with the
`--security-opt seccomp=unconfined` option on `docker run` and `docker create`,
which is both confusing, and makes it harder to use (especially on Docker Desktop,
where there's no direct access to the VM's filesystem).
This patch adds the missing check for the special "unconfined" value.
Co-authored-by: Tianon Gravi <admwiggin@gmail.com>
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>