Merge pull request #25573 from thaJeztah/improve-runmetrics-layout

docs: improve layout for runmetrics
(cherry picked from commit a19029a719)
Signed-off-by: Sven Dowideit <SvenDowideit@home.org.au>
Sebastiaan van Stijn 2016-08-11 10:43:15 +02:00 committed by Sven Dowideit
parent 6811254691
commit 83e47c15ee


@ -21,11 +21,13 @@ and network IO metrics.
The following is a sample output from the `docker stats` command
$ docker stats redis1 redis2
CONTAINER CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O
redis1 0.07% 796 KB / 64 MB 1.21% 788 B / 648 B 3.568 MB / 512 KB
redis2 0.07% 2.746 MB / 64 MB 4.29% 1.266 KB / 648 B 12.4 MB / 0 B
```bash
$ docker stats redis1 redis2
CONTAINER CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O
redis1 0.07% 796 KB / 64 MB 1.21% 788 B / 648 B 3.568 MB / 512 KB
redis2 0.07% 2.746 MB / 64 MB 4.29% 1.266 KB / 648 B 12.4 MB / 0 B
```
The [docker stats](../reference/commandline/stats.md) reference page has
more details about the `docker stats` command.
@ -52,7 +54,9 @@ corresponding to existing containers.
To figure out where your control groups are mounted, you can run:
$ grep cgroup /proc/mounts
```bash
$ grep cgroup /proc/mounts
```
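As a minimal sketch of how that output can be used, the following filters mounts-style lines and prints only the mount point of each cgroup hierarchy. The function name `cgroup_mount_points` and the sample lines are illustrative assumptions, not taken from a real host.

```bash
#!/usr/bin/env bash
# Print the mount point (2nd field) of every entry whose filesystem type
# (3rd field) is cgroup or cgroup2. Reads /proc/mounts-style lines on stdin.
# NOTE: the function name is hypothetical, chosen for this example.
cgroup_mount_points() {
    awk '$3 == "cgroup" || $3 == "cgroup2" { print $2 }'
}

# Illustrative sample in /proc/mounts format (made up for the example).
sample_mounts='proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
cgroup /sys/fs/cgroup/memory cgroup rw,relatime,memory 0 0
cgroup /sys/fs/cgroup/blkio cgroup rw,relatime,blkio 0 0
tmpfs /run tmpfs rw,nosuid,noexec 0 0'

printf '%s\n' "$sample_mounts" | cgroup_mount_points
```

On a real host, the same function would be fed the live file: `cgroup_mount_points < /proc/mounts`.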
## Enumerating cgroups
@ -138,86 +142,19 @@ they represent occurrences of a specific event (e.g., pgfault, which
indicates the number of page faults which happened since the creation of
the cgroup; this number can never decrease).
<style>table tr > td:first-child { white-space: nowrap;}</style>
- **cache:**
the amount of memory used by the processes of this control group
that can be associated precisely with a block on a block device.
When you read from and write to files on disk, this amount will
increase. This will be the case if you use "conventional" I/O
(`open`, `read`,
`write` syscalls) as well as mapped files (with
`mmap`). It also accounts for the memory used by
`tmpfs` mounts, though the reasons are unclear.
- **rss:**
the amount of memory that *doesn't* correspond to anything on disk:
stacks, heaps, and anonymous memory maps.
- **mapped_file:**
indicates the amount of memory mapped by the processes in the
control group. It doesn't give you information about *how much*
memory is used; it rather tells you *how* it is used.
- **pgfault and pgmajfault:**
indicate the number of times that a process of the cgroup triggered
a "page fault" and a "major fault", respectively. A page fault
happens when a process accesses a part of its virtual memory space
which is nonexistent or protected. The former can happen if the
process is buggy and tries to access an invalid address (it will
then be sent a `SIGSEGV` signal, typically
killing it with the famous `Segmentation fault`
message). The latter can happen when the process reads from a memory
zone which has been swapped out, or which corresponds to a mapped
file: in that case, the kernel will load the page from disk, and let
the CPU complete the memory access. It can also happen when the
process writes to a copy-on-write memory zone: likewise, the kernel
will preempt the process, duplicate the memory page, and resume the
write operation on the process's own copy of the page. "Major" faults
happen when the kernel actually has to read the data from disk. When
it just has to duplicate an existing page, or allocate an empty
page, it's a regular (or "minor") fault.
- **swap:**
the amount of swap currently used by the processes in this cgroup.
- **active_anon and inactive_anon:**
the amount of *anonymous* memory that has been identified as
respectively *active* and *inactive* by the kernel. "Anonymous"
memory is the memory that is *not* linked to disk pages. In other
words, that's the equivalent of the rss counter described above. In
fact, the very definition of the rss counter is **active_anon** +
**inactive_anon** - **tmpfs** (where tmpfs is the amount of memory
used up by `tmpfs` filesystems mounted by this
control group). Now, what's the difference between "active" and
"inactive"? Pages are initially "active"; and at regular intervals,
the kernel sweeps over the memory, and tags some pages as
"inactive". Whenever they are accessed again, they are immediately
retagged "active". When the kernel is almost out of memory, and time
comes to swap out to disk, the kernel will swap "inactive" pages.
- **active_file and inactive_file:**
cache memory, with *active* and *inactive* similar to the *anon*
memory above. The exact formula is cache = **active_file** +
**inactive_file** + **tmpfs**. The exact rules used by the kernel
to move memory pages between active and inactive sets are different
from the ones used for anonymous memory, but the general principle
is the same. Note that when the kernel needs to reclaim memory, it
is cheaper to reclaim a clean (=non modified) page from this pool,
since it can be reclaimed immediately (while anonymous pages and
dirty/modified pages have to be written to disk first).
- **unevictable:**
the amount of memory that cannot be reclaimed; generally, it will
account for memory that has been "locked" with `mlock`.
It is often used by crypto frameworks to make sure that
secret keys and other sensitive material never get swapped out to
disk.
- **memory and memsw limits:**
These are not really metrics, but a reminder of the limits applied
to this cgroup. The first one indicates the maximum amount of
physical memory that can be used by the processes of this control
group; the second one indicates the maximum amount of RAM+swap.
Metric | Description
--------------------------------------|-----------------------------------------------------------
**cache** | The amount of memory used by the processes of this control group that can be associated precisely with a block on a block device. When you read from and write to files on disk, this amount will increase. This will be the case if you use "conventional" I/O (`open`, `read`, `write` syscalls) as well as mapped files (with `mmap`). It also accounts for the memory used by `tmpfs` mounts, though the reasons are unclear.
**rss** | The amount of memory that *doesn't* correspond to anything on disk: stacks, heaps, and anonymous memory maps.
**mapped_file** | Indicates the amount of memory mapped by the processes in the control group. It doesn't give you information about *how much* memory is used; it rather tells you *how* it is used.
**pgfault**, **pgmajfault** | Indicate the number of times that a process of the cgroup triggered a "page fault" and a "major fault", respectively. A page fault happens when a process accesses a part of its virtual memory space which is nonexistent or protected. The former can happen if the process is buggy and tries to access an invalid address (it will then be sent a `SIGSEGV` signal, typically killing it with the famous `Segmentation fault` message). The latter can happen when the process reads from a memory zone which has been swapped out, or which corresponds to a mapped file: in that case, the kernel will load the page from disk, and let the CPU complete the memory access. It can also happen when the process writes to a copy-on-write memory zone: likewise, the kernel will preempt the process, duplicate the memory page, and resume the write operation on the process's own copy of the page. "Major" faults happen when the kernel actually has to read the data from disk. When it just has to duplicate an existing page, or allocate an empty page, it's a regular (or "minor") fault.
**swap** | The amount of swap currently used by the processes in this cgroup.
**active_anon**, **inactive_anon** | The amount of *anonymous* memory that has been identified as respectively *active* and *inactive* by the kernel. "Anonymous" memory is the memory that is *not* linked to disk pages. In other words, that's the equivalent of the rss counter described above. In fact, the very definition of the rss counter is **active_anon** + **inactive_anon** - **tmpfs** (where tmpfs is the amount of memory used up by `tmpfs` filesystems mounted by this control group). Now, what's the difference between "active" and "inactive"? Pages are initially "active"; and at regular intervals, the kernel sweeps over the memory, and tags some pages as "inactive". Whenever they are accessed again, they are immediately retagged "active". When the kernel is almost out of memory, and time comes to swap out to disk, the kernel will swap "inactive" pages.
**active_file**, **inactive_file** | Cache memory, with *active* and *inactive* similar to the *anon* memory above. The exact formula is **cache** = **active_file** + **inactive_file** + **tmpfs**. The exact rules used by the kernel to move memory pages between active and inactive sets are different from the ones used for anonymous memory, but the general principle is the same. Note that when the kernel needs to reclaim memory, it is cheaper to reclaim a clean (=non modified) page from this pool, since it can be reclaimed immediately (while anonymous pages and dirty/modified pages have to be written to disk first).
**unevictable** | The amount of memory that cannot be reclaimed; generally, it will account for memory that has been "locked" with `mlock`. It is often used by crypto frameworks to make sure that secret keys and other sensitive material never get swapped out to disk.
**memory_limit**, **memsw_limit** | These are not really metrics, but a reminder of the limits applied to this cgroup. The first one indicates the maximum amount of physical memory that can be used by the processes of this control group; the second one indicates the maximum amount of RAM+swap.
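
The metrics above can be pulled out of a stats file with a few lines of shell. This is only a sketch: it assumes the cgroup v1 layout, where the memory controller exposes one `name value` pair per line with values in bytes. The helper name `memstat_field` and the sample numbers are made up for illustration.

```bash
#!/usr/bin/env bash
# Print the value of one field from "name value" lines read on stdin.
# $1: field name, e.g. rss or cache. (Hypothetical helper for this example.)
memstat_field() {
    awk -v key="$1" '$1 == key { print $2 }'
}

# Made-up sample. It is chosen so that, with no tmpfs in use,
# rss = active_anon + inactive_anon and cache = active_file + inactive_file.
sample_stat='cache 1048576
rss 4194304
mapped_file 524288
pgfault 1200
pgmajfault 3
active_anon 3145728
inactive_anon 1048576
active_file 786432
inactive_file 262144'

rss=$(printf '%s\n' "$sample_stat" | memstat_field rss)
cache=$(printf '%s\n' "$sample_stat" | memstat_field cache)
echo "rss=$((rss / 1024)) KiB cache=$((cache / 1024)) KiB"
```

Against a real container, the input would come from the memory stats pseudo-file of that container's cgroup instead of the sample string.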
Accounting for memory in the page cache is very complex. If two
processes in different control groups both read the same file
@ -261,32 +198,12 @@ file in the kernel documentation, here is a short list of the most
relevant ones:
- **blkio.sectors:**
contains the number of 512-byte sectors read and written by the
processes that are members of the cgroup, device by device. Reads and
writes are merged in a single counter.
- **blkio.io_service_bytes:**
indicates the number of bytes read and written by the cgroup. It has
4 counters per device, because for each device, it differentiates
between synchronous vs. asynchronous I/O, and reads vs. writes.
- **blkio.io_serviced:**
the number of I/O operations performed, regardless of their size. It
also has 4 counters per device.
- **blkio.io_queued:**
indicates the number of I/O operations currently queued for this
cgroup. In other words, if the cgroup isn't doing any I/O, this will
be zero. Note that the opposite is not true. In other words, if
there is no I/O queued, it does not mean that the cgroup is idle
(I/O-wise). It could be doing purely synchronous reads on an
otherwise quiescent device, which is therefore able to handle them
immediately, without queuing. Also, while it is helpful to figure
out which cgroup is putting stress on the I/O subsystem, keep in
mind that it is a relative quantity. Even if a process group does
not perform more I/O, its queue size can increase just because the
device load increases because of other devices.
Metric | Description
----------------------------|-----------------------------------------------------------
**blkio.sectors** | Contains the number of 512-byte sectors read and written by the processes that are members of the cgroup, device by device. Reads and writes are merged in a single counter.
**blkio.io_service_bytes** | Indicates the number of bytes read and written by the cgroup. It has 4 counters per device, because for each device, it differentiates between synchronous vs. asynchronous I/O, and reads vs. writes.
**blkio.io_serviced** | The number of I/O operations performed, regardless of their size. It also has 4 counters per device.
**blkio.io_queued** | Indicates the number of I/O operations currently queued for this cgroup. In other words, if the cgroup isn't doing any I/O, this will be zero. Note that the opposite is not true. In other words, if there is no I/O queued, it does not mean that the cgroup is idle (I/O-wise). It could be doing purely synchronous reads on an otherwise quiescent device, which is therefore able to handle them immediately, without queuing. Also, while it is helpful to figure out which cgroup is putting stress on the I/O subsystem, keep in mind that it is a relative quantity. Even if a process group does not perform more I/O, its queue size can increase just because the device load increases because of other devices.
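
To illustrate the per-device counters of **blkio.io_service_bytes**, here is a rough sketch that sums bytes read and written across all devices. It assumes the cgroup v1 line format `major:minor Operation bytes`, followed by a final `Total` line. The function name `blkio_rw_totals` and the sample figures are invented for the example.

```bash
#!/usr/bin/env bash
# Sum the Read and Write byte counters over every device listed on stdin.
# Lines for Sync, Async and Total are ignored so nothing is counted twice.
# (Hypothetical helper; the input layout is an assumption, see above.)
blkio_rw_totals() {
    awk '$2 == "Read"  { r += $3 }
         $2 == "Write" { w += $3 }
         END { printf "read=%d write=%d\n", r, w }'
}

# Made-up sample covering two block devices (8:0 and 8:16).
sample_blkio='8:0 Read 4096
8:0 Write 8192
8:0 Sync 8192
8:0 Async 4096
8:0 Total 12288
8:16 Read 1024
8:16 Write 0
8:16 Sync 1024
8:16 Async 0
8:16 Total 1024
Total 13312'

printf '%s\n' "$sample_blkio" | blkio_rw_totals
```

Only the `Read` and `Write` rows are summed because the `Sync`/`Async` rows describe the same bytes along the other axis.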
## Network metrics
@ -313,7 +230,9 @@ an interface) can do some serious accounting.
For instance, you can set up a rule to account for the outbound HTTP
traffic on a web server:
$ iptables -I OUTPUT -p tcp --sport 80
```bash
$ iptables -I OUTPUT -p tcp --sport 80
```
There is no `-j` or `-g` flag,
so the rule will just count matched packets and go to the following
@ -321,7 +240,9 @@ rule.
Later, you can check the values of the counters, with:
$ iptables -nxvL OUTPUT
```bash
$ iptables -nxvL OUTPUT
```
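
As a rough sketch, the packet and byte counters of the accounting rule can be extracted from that listing with `awk`. The sample listing below is an assumption about the typical column layout (`pkts`, `bytes`, then the rule description ending in `spt:80`). The layout may differ between iptables versions, and the helper name `http_out_counters` is hypothetical.

```bash
#!/usr/bin/env bash
# Print the packet and byte counters of the rule matching source port 80.
# Assumes pkts and bytes are the first two columns and that the match
# description (spt:80) is the last field of the rule line.
http_out_counters() {
    awk '$NF == "spt:80" { printf "packets=%s bytes=%s\n", $1, $2 }'
}

# Illustrative listing, not captured from a real system.
sample_listing='Chain OUTPUT (policy ACCEPT 123 packets, 45678 bytes)
    pkts      bytes target     prot opt in     out     source               destination
      10     1200            tcp  --  *      *       0.0.0.0/0            0.0.0.0/0            tcp spt:80'

printf '%s\n' "$sample_listing" | http_out_counters
```

Matching on the exact last field avoids also picking up rules for other ports that merely start with `80`, such as `spt:8080`.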
Technically, `-n` is not required, but it will
prevent iptables from doing DNS reverse lookups, which are probably
@ -363,11 +284,15 @@ though.
The exact format of the command is:
$ ip netns exec <nsname> <command...>
```bash
$ ip netns exec <nsname> <command...>
```
For example:
$ ip netns exec mycontainer netstat -i
```bash
$ ip netns exec mycontainer netstat -i
```
`ip netns` finds the "mycontainer" container by
using namespaces pseudo-files. Each process belongs to one network
@ -388,7 +313,7 @@ container, we need to:
- Create a symlink from `/var/run/netns/<somename>` to `/proc/<thepid>/ns/net`
- Execute `ip netns exec <somename> ....`
Please review [*Enumerating Cgroups*](#enumerating-cgroups) to learn how to find
Please review [Enumerating Cgroups](#enumerating-cgroups) to learn how to find
the cgroup of a process running in the container of which you want to
measure network usage. From there, you can examine the pseudo-file named
`tasks`, which contains the PIDs that are in the
@ -397,11 +322,13 @@ control group (i.e., in the container). Pick any one of them.
Putting everything together, if the "short ID" of a container is held in
the environment variable `$CID`, then you can do this:
$ TASKS=/sys/fs/cgroup/devices/docker/$CID*/tasks
$ PID=$(head -n 1 $TASKS)
$ mkdir -p /var/run/netns
$ ln -sf /proc/$PID/ns/net /var/run/netns/$CID
$ ip netns exec $CID netstat -i
```bash
$ TASKS=/sys/fs/cgroup/devices/docker/$CID*/tasks
$ PID=$(head -n 1 $TASKS)
$ mkdir -p /var/run/netns
$ ln -sf /proc/$PID/ns/net /var/run/netns/$CID
$ ip netns exec $CID netstat -i
```
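
The PID lookup in the steps above fails silently if the `tasks` file is missing or empty, for example when the container has already exited. A slightly more defensive variant might look like the following sketch. The helper name `first_task_pid` is an assumption, and the demonstration uses a temporary file in place of a real cgroup `tasks` file.

```bash
#!/usr/bin/env bash
# Print the first PID listed in a cgroup tasks file.
# Returns non-zero if the file is unreadable or lists no tasks.
# $1: path to the tasks file. (Hypothetical helper for this example.)
first_task_pid() {
    local pid
    [ -r "$1" ] || { echo "cannot read $1" >&2; return 1; }
    pid=$(head -n 1 "$1")
    [ -n "$pid" ] || { echo "no tasks in $1" >&2; return 1; }
    printf '%s\n' "$pid"
}

# Demonstration with a temporary stand-in for the real tasks file.
demo_tasks=$(mktemp)
printf '4242\n4243\n' > "$demo_tasks"
first_task_pid "$demo_tasks"
rm -f "$demo_tasks"
```

In the sequence above, `PID=$(first_task_pid $TASKS) || exit 1` could then replace the plain `head -n 1` call. `$TASKS` is left unquoted there so that the `$CID*` glob still expands.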
## Tips for high-performance metric collection