Since PR 11353 (commit 7804cd36ee "Filter out default mounts that
are override by user") there can be no duplicated mounts in the list,
so the check is redundant.
This should speed up container start by a nanosecond or two.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
In case a user wants to have a child reaper inside a container
(i.e. run "docker --init") AND a bind-mounted /dev, the following
error occurs:
> docker run -d -v /dev:/dev --init busybox top
> 088c96808c683077f04c4cc2711fddefe1f5970afc085d59e0baae779745a7cf
> docker: Error response from daemon: OCI runtime create failed: container_linux.go:296: starting container process caused "exec: "/dev/init": stat /dev/init: no such file or directory": unknown.
This happens because if a user-suppled /dev is provided, all the
built-in /dev/xxx mounts are filtered out.
To solve, let's move in-container init to /sbin, as the chance that
/sbin will be bind-mounted to a container is smaller than that for /dev.
While at it, let's give it more unique name (docker-init).
NOTE it still won't work for the case of bind-mounted /sbin.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This implements chown support on Windows. Built-in accounts as well
as accounts included in the SAM database of the container are supported.
NOTE: IDPair is now named Identity and IDMappings is now named
IdentityMapping.
The following are valid examples:
ADD --chown=Guest . <some directory>
COPY --chown=Administrator . <some directory>
COPY --chown=Guests . <some directory>
COPY --chown=ContainerUser . <some directory>
On Windows an owner is only granted the permission to read the security
descriptor and read/write the discretionary access control list. This
fix also grants read/write and execute permissions to the owner.
Signed-off-by: Salahuddin Khan <salah@docker.com>
This adds MaskedPaths and ReadOnlyPaths options to HostConfig for containers so
that a user can override the default values.
When the value sent through the API is nil the default is used.
Otherwise the default is overridden.
Adds integration tests for MaskedPaths and ReadonlyPaths.
Signed-off-by: Jess Frazelle <acidburn@microsoft.com>
A recent optimization in getSourceMount() made it return an error
in case when the found mount point is "/". This prevented bind-mounted
volumes from working in such cases.
A (rather trivial but adeqate) unit test case is added.
Fixes: 871c957242 ("getSourceMount(): simplify")
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
It does not make sense to copy a slice element by element, then discard
the source one. Let's do copy in place instead which is way more
efficient.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The flow of getSourceMount was:
1 get all entries from /proc/self/mountinfo
2 do a linear search for the `source` directory
3 if found, return its data
4 get the parent directory of `source`, goto 2
The repeated linear search through the whole mountinfo (which can have
thousands of records) is inefficient. Instead, let's just
1 collect all the relevant records (only those mount points
that can be a parent of `source`)
2 find the record with the longest mountpath, return its data
This was tested manually with something like
```go
func TestGetSourceMount(t *testing.T) {
mnt, flags, err := getSourceMount("/sys/devices/msr/")
assert.NoError(t, err)
t.Logf("mnt: %v, flags: %v", mnt, flags)
}
```
...but it relies on having a specific mount points on the system
being used for testing.
[v2: add unit tests for ParentsFilter]
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Functions `GetMounts()` and `parseMountTable()` return all the entries
as read and parsed from /proc/self/mountinfo. In many cases the caller
is only interested only one or a few entries, not all of them.
One good example is `Mounted()` function, which looks for a specific
entry only. Another example is `RecursiveUnmount()` which is only
interested in mount under a specific path.
This commit adds `filter` argument to `GetMounts()` to implement
two things:
1. filter out entries a caller is not interested in
2. stop processing if a caller is found what it wanted
`nil` can be passed to get a backward-compatible behavior, i.e. return
all the entries.
A few filters are implemented:
- `PrefixFilter`: filters out all entries not under `prefix`
- `SingleEntryFilter`: looks for a specific entry
Finally, `Mounted()` is modified to use `SingleEntryFilter()`, and
`RecursiveUnmount()` is using `PrefixFilter()`.
Unit tests are added to check filters are working.
[v2: ditch NoFilter, use nil]
[v3: ditch GetMountsFiltered()]
[v4: add unit test for filters]
[v5: switch to gotestyourself]
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This moves the platform specific stuff in a separate package and keeps
the `volume` package and the defined interfaces light to import.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
It does not make any sense to vary this based on whether the
rootfs is read only. We removed all the other mount dependencies
on read-only eg see #35344.
Signed-off-by: Justin Cormack <justin.cormack@docker.com>
As soon as the initial executable in the container is executed as a non root user,
permitted and effective capabilities are dropped. Drop them earlier than this, so
that they are dropped before executing the file. The main effect of this is that
if `CAP_DAC_OVERRIDE` is set (the default) the user will not be able to execute
files they do not have permission to execute, which previously they could.
The old behaviour was somewhat surprising and the new one is definitely correct,
but it is not in any meaningful way exploitable, and I do not think it is
necessary to backport this fix. It is unlikely to have any negative effects as
almost all executables have world execute permission anyway.
Use the bounding set not the effective set as the canonical set of capabilities, as
effective will now vary.
Signed-off-by: Justin Cormack <justin.cormack@docker.com>
Commit 7a7357dae1 ("LCOW: Implemented support for docker cp + build")
changed `container.BaseFS` from being a string (that could be empty but
can't lead to nil pointer dereference) to containerfs.ContainerFS,
which could be be `nil` and so nil dereference is at least theoretically
possible, which leads to panic (i.e. engine crashes).
Such a panic can be avoided by carefully analysing the source code in all
the places that dereference a variable, to make the variable can't be nil.
Practically, this analisys are impossible as code is constantly
evolving.
Still, we need to avoid panics and crashes. A good way to do so is to
explicitly check that a variable is non-nil, returning an error
otherwise. Even in case such a check looks absolutely redundant,
further changes to the code might make it useful, and having an
extra check is not a big price to pay to avoid a panic.
This commit adds such checks for all the places where it is not obvious
that container.BaseFS is not nil (which in this case means we do not
call daemon.Mount() a few lines earlier).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
It has been pointed out that if --read-only flag is given, /dev/shm
also becomes read-only in case of --ipc private.
This happens because in this case the mount comes from OCI spec
(since commit 7120976d74), and is a regression caused by that
commit.
The meaning of --read-only flag is to only have a "main" container
filesystem read-only, not the auxiliary stuff (that includes /dev/shm,
other mounts and volumes, --tmpfs, /proc, /dev and so on).
So, let's make sure /dev/shm that comes from OCI spec is not made
read-only.
Fixes: 7120976d74 ("Implement none, private, and shareable ipc modes")
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
On unix, merge secrets/configs handling. This is important because
configs can contain secrets (via templating) and potentially a config
could just simply have secret information "by accident" from the user.
This just make sure that configs are as secure as secrets and de-dups a
lot of code.
Generally this makes everything simpler and configs more secure.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
By default, if a user requests a bind mount it uses private propagation.
When the source path is a path within the daemon root this, along with
some other propagation values that the user can use, causes issues when
the daemon tries to remove a mountpoint because a container will then
have a private reference to that mount which prevents removal.
Unmouting with MNT_DETATCH can help this scenario on newer kernels, but
ultimately this is just covering up the problem and doesn't actually
free up the underlying resources until all references are destroyed.
This change does essentially 2 things:
1. Change the default propagation when unspecified to `rslave` when the
source path is within the daemon root path or a parent of the daemon
root (because everything is using rbinds).
2. Creates a validation error on create when the user tries to specify
an unacceptable propagation mode for these paths...
basically the only two acceptable modes are `rslave` and `rshared`.
In cases where we have used the new default propagation but the
underlying filesystem is not setup to handle it (fs must hvae at least
rshared propagation) instead of erroring out like we normally would,
this falls back to the old default mode of `private`, which preserves
backwards compatibility.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
You don't need to resolve the symlink for the exec as long as the
process is to keep running during execution.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
It's a common scenario for admins and/or monitoring applications to
mount in the daemon root dir into a container. When doing so all mounts
get coppied into the container, often with private references.
This can prevent removal of a container due to the various mounts that
must be configured before a container is started (for example, for
shared /dev/shm, or secrets) being leaked into another namespace,
usually with private references.
This is particularly problematic on older kernels (e.g. RHEL < 7.4)
where a mount may be active in another namespace and attempting to
remove a mountpoint which is active in another namespace fails.
This change moves all container resource mounts into a common directory
so that the directory can be made unbindable.
What this does is prevents sub-mounts of this new directory from leaking
into other namespaces when mounted with `rbind`... which is how all
binds are handled for containers.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
Commit 7120976d74 ("Implement none, private, and shareable ipc
modes") introduces a bug: if a user-specified mount for /dev/shm
is provided, its size is overriden by value of ShmSize.
A reproducer is simple:
docker run --rm
--mount type=tmpfs,dst=/dev/shm,tmpfs-size=100K \
alpine df /dev/shm
This commit is an attempt to fix the bug, as well as optimize things
a but and make the code easier to read.
https://github.com/moby/moby/issues/35271
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
/dev is mounted on a tmpfs inside of a container. Processes inside of containers
some times need to create devices nodes, or to setup a socket that listens on /dev/log
Allowing these containers to run with the --readonly flag makes sense. Making a tmpfs
readonly does not add any security to the container, since there is plenty of places
where the container can write tmpfs content.
I have no idea why /dev was excluded.
Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
When runc is bind-mounting a particular path "with options", it has to
do so by first creating a bind-mount and the modifying the options of
said bind-mount via remount. However, in a user namespace, there are
restrictions on which flags you can change with a remount (due to
CL_UNPRIVILEGED being set in this instance). Docker historically has
ignored this, and as a result, internal Docker mounts (such as secrets)
haven't worked with --userns-remap. Fix this by preserving
CL_UNPRIVILEGED mount flags when Docker is spawning containers with user
namespaces enabled.
Ref: https://github.com/opencontainers/runc/pull/1603
Signed-off-by: Aleksa Sarai <asarai@suse.de>
This enables docker cp and ADD/COPY docker build support for LCOW.
Originally, the graphdriver.Get() interface returned a local path
to the container root filesystem. This does not work for LCOW, so
the Get() method now returns an interface that LCOW implements to
support copying to and from the container.
Signed-off-by: Akash Gupta <akagup@microsoft.com>
This also update:
- runc to 3f2f8b84a77f73d38244dd690525642a72156c64
- runtime-specs to v1.0.0
Signed-off-by: Kenfe-Mickael Laventure <mickael.laventure@gmail.com>
Use strongly typed errors to set HTTP status codes.
Error interfaces are defined in the api/errors package and errors
returned from controllers are checked against these interfaces.
Errors can be wraeped in a pkg/errors.Causer, as long as somewhere in the
line of causes one of the interfaces is implemented. The special error
interfaces take precedence over Causer, meaning if both Causer and one
of the new error interfaces are implemented, the Causer is not
traversed.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
Since the commit d88fe447df ("Add support for sharing /dev/shm/ and
/dev/mqueue between containers") container's /dev/shm is mounted on the
host first, then bind-mounted inside the container. This is done that
way in order to be able to share this container's IPC namespace
(and the /dev/shm mount point) with another container.
Unfortunately, this functionality breaks container checkpoint/restore
(even if IPC is not shared). Since /dev/shm is an external mount, its
contents is not saved by `criu checkpoint`, and so upon restore any
application that tries to access data under /dev/shm is severily
disappointed (which usually results in a fatal crash).
This commit solves the issue by introducing new IPC modes for containers
(in addition to 'host' and 'container:ID'). The new modes are:
- 'shareable': enables sharing this container's IPC with others
(this used to be the implicit default);
- 'private': disables sharing this container's IPC.
In 'private' mode, container's /dev/shm is truly mounted inside the
container, without any bind-mounting from the host, which solves the
issue.
While at it, let's also implement 'none' mode. The motivation, as
eloquently put by Justin Cormack, is:
> I wondered a while back about having a none shm mode, as currently it is
> not possible to have a totally unwriteable container as there is always
> a /dev/shm writeable mount. It is a bit of a niche case (and clearly
> should never be allowed to be daemon default) but it would be trivial to
> add now so maybe we should...
...so here's yet yet another mode:
- 'none': no /dev/shm mount inside the container (though it still
has its own private IPC namespace).
Now, to ultimately solve the abovementioned checkpoint/restore issue, we'd
need to make 'private' the default mode, but unfortunately it breaks the
backward compatibility. So, let's make the default container IPC mode
per-daemon configurable (with the built-in default set to 'shareable'
for now). The default can be changed either via a daemon CLI option
(--default-shm-mode) or a daemon.json configuration file parameter
of the same name.
Note one can only set either 'shareable' or 'private' IPC modes as a
daemon default (i.e. in this context 'host', 'container', or 'none'
do not make much sense).
Some other changes this patch introduces are:
1. A mount for /dev/shm is added to default OCI Linux spec.
2. IpcMode.Valid() is simplified to remove duplicated code that parsed
'container:ID' form. Note the old version used to check that ID does
not contain a semicolon -- this is no longer the case (tests are
modified accordingly). The motivation is we should either do a
proper check for container ID validity, or don't check it at all
(since it is checked in other places anyway). I chose the latter.
3. IpcMode.Container() is modified to not return container ID if the
mode value does not start with "container:", unifying the check to
be the same as in IpcMode.IsContainer().
3. IPC mode unit tests (runconfig/hostconfig_test.go) are modified
to add checks for newly added values.
[v2: addressed review at https://github.com/moby/moby/pull/34087#pullrequestreview-51345997]
[v3: addressed review at https://github.com/moby/moby/pull/34087#pullrequestreview-53902833]
[v4: addressed the case of upgrading from older daemon, in this case
container.HostConfig.IpcMode is unset and this is valid]
[v5: document old and new IpcMode values in api/swagger.yaml]
[v6: add the 'none' mode, changelog entry to docs/api/version-history.md]
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
There is no case which would resolve in this error. The root user always exists, and if the id maps are empty, the default value of 0 is correct.
Signed-off-by: Daniel Nephin <dnephin@docker.com>
- Moved DefaultInitBinary from daemon/daemon.go to
daemon/config/config.go since it's a daemon config and is referred in
config package files.
- Added condition in GetInitPath to check for any explicitly configured
DefaultInitBinary. If not, the default value of DefaultInitBinary is
returned.
- Changed all references of DefaultInitBinary to refer to the variable
from new location.
- Added TestCommonUnixGetInitPath to test for the various values of
GetInitPath.
Fixes#32314
Signed-off-by: Sunny Gogoi <indiasuny000@gmail.com>
This introduce a new `--device-cgroup-rule` flag that allow a user to
add one or more entry to the container cgroup device `devices.allow`
Signed-off-by: Kenfe-Mickael Laventure <mickael.laventure@gmail.com>
In certain cases (unattended upgrades), system services can disable
loaded AppArmor profiles. However, since /etc being read-only is a
supported setup we cannot just write a copy of the profile to
/etc/apparmor.d.
Instead, dynamically load the docker-default AppArmor profile if a
container is started with that profile set. This code will short-cut if
the profile is already loaded.
Fixes: 2f7596aaef ("apparmor: do not save profile to /etc/apparmor.d")
Signed-off-by: Aleksa Sarai <asarai@suse.de>
- use /secrets for swarm secret create route
- do not specify omitempty for secret and secret reference
- simplify lookup for secret ids
- do not use pointer for secret grpc conversion
Signed-off-by: Evan Hazlett <ejhazlett@gmail.com>
- fix lint issues
- use errors pkg for wrapping errors
- cleanup on error when setting up secrets mount
- fix erroneous import
- remove unneeded switch for secret reference mode
- return single mount for secrets instead of slice
Signed-off-by: Evan Hazlett <ejhazlett@gmail.com>
containers may specify these cgroup values at runtime. This will allow
processes to change their priority to real-time within the container
when CONFIG_RT_GROUP_SCHED is enabled in the kernel. See #22380.
Also added sanity checks for the new --cpu-rt-runtime and --cpu-rt-period
flags to ensure that that the kernel supports these features and that
runtime is not greater than period.
Daemon will support a --cpu-rt-runtime flag to initialize the parent
cgroup on startup, this prevents the administrator from alotting runtime
to docker after each restart.
There are additional checks that could be added but maybe too far? Check
parent cgroups to ensure values are <= parent, inspecting rtprio ulimit
and issuing a warning.
Signed-off-by: Erik St. Martin <alakriti@gmail.com>
This adds a small C binary for fighting zombies. It is mounted under
`/dev/init` and is prepended to the args specified by the user. You
enable it via a daemon flag, `dockerd --init`, as it is disable by
default for backwards compat.
You can also override the daemon option or specify this on a per
container basis with `docker run --init=true|false`.
You can test this by running a process like this as the pid 1 in a
container and see the extra zombie that appears in the container as it
is running.
```c
int main(int argc, char ** argv) {
pid_t pid = fork();
if (pid == 0) {
pid = fork();
if (pid == 0) {
exit(0);
}
sleep(3);
exit(0);
}
printf("got pid %d and exited\n", pid);
sleep(20);
}
```
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
This fix tries to fix 26326 where `docker inspect` will not show
ulimit even when daemon default ulimit has been set.
This fix merge the HostConfig's ulimit with daemon default in
`docker inspect`, so that when daemon is started with `default-ulimit`
and HostConfig's ulimit is not set, `docker inspect` will output
the daemon default.
An integration test has been added to cover the changes.
This fix fixes 26326.
Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
This fix tries to address the issue raised in #22420. When
`--tmpfs` is specified with `/tmp`, the default value is
`rw,nosuid,nodev,noexec,relatime,size=65536k`. When `--tmpfs`
is specified with `/tmp:rw`, then the value changed to
`rw,nosuid,nodev,noexec,relatime`.
The reason for such an inconsistency is because docker tries
to add `size=65536k` option only when user provides no option.
This fix tries to address this issue by always pre-progating
`size=65536k` along with `rw,nosuid,nodev,noexec,relatime`.
If user provides a different value (e.g., `size=8192k`), it
will override the `size=65536k` anyway since the combined
options will be parsed and merged to remove any duplicates.
Additional test cases have been added to cover the changes
in this fix.
This fix fixes#22420.
Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
This patch will allow users to specify namespace specific "kernel parameters"
for running inside of a container.
Signed-off-by: Dan Walsh <dwalsh@redhat.com>
This vendors in new spec/runc that supports
setting readonly and masked paths in the
configuration. Using this allows us to make an
exception for `—-privileged`.
Signed-off-by: Tonis Tiigi <tonistiigi@gmail.com>
runc expects a systemd cgroupsPath to be in slice:scopePrefix:containerName
format and the "--systemd-cgroup" option to be set. Update docker accordingly.
Fixes 21475
Signed-off-by: Anusha Ragunathan <anusha@docker.com>
Now that the namespace sharing code via runc is vendored with the
containerd changes, we can disable the restrictions on container to
container net and IPC namespace sharing when the daemon has user
namespaces enabled.
Docker-DCO-1.1-Signed-off-by: Phil Estes <estesp@linux.vnet.ibm.com> (github: estesp)