:title: Runtime Metrics
:description: Measure the behavior of running containers
:keywords: docker, metrics, CPU, memory, disk, IO, run, runtime

.. _run_metrics:

Runtime Metrics
===============

Linux Containers rely on `control groups
<https://www.kernel.org/doc/Documentation/cgroups/cgroups.txt>`_, which
not only track groups of processes, but also expose metrics about CPU,
memory, and block I/O usage. You can access those metrics and obtain
network usage metrics as well. This is relevant for "pure" LXC
containers, as well as for Docker containers.

Control Groups
--------------

Control groups are exposed through a pseudo-filesystem. In recent
distros, you should find this filesystem under
``/sys/fs/cgroup``. Under that directory, you will see multiple
sub-directories, called devices, freezer, blkio, etc.; each
sub-directory actually corresponds to a different cgroup hierarchy.

On older systems, the control groups might be mounted on ``/cgroup``,
without distinct hierarchies. In that case, instead of seeing the
sub-directories, you will see a bunch of files in that directory, and
possibly some directories corresponding to existing containers.

To figure out where your control groups are mounted, you can run:

::

    grep cgroup /proc/mounts
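
If you would rather do the same lookup from a program, the sketch below
parses the text of ``/proc/mounts`` and keeps only the entries whose
filesystem type is ``cgroup``. The function name ``cgroup_mounts`` is made
up for this illustration, and the code assumes the usual six
whitespace-separated fields per line.

.. code-block:: python

    def cgroup_mounts(mounts_text):
        """Map each cgroup mountpoint to the list of its mount options."""
        mounts = {}
        for line in mounts_text.splitlines():
            fields = line.split()
            # Fields: device, mountpoint, fstype, options, dump, pass
            if len(fields) >= 4 and fields[2] == "cgroup":
                mounts[fields[1]] = fields[3].split(",")
        return mounts

To use it, read ``/proc/mounts`` and pass its contents to the function. On
systems with distinct hierarchies, the subsystem name (e.g. ``memory``)
should show up among the options of each mountpoint.
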

.. _run_findpid:

Enumerating Cgroups
-------------------

You can look into ``/proc/cgroups`` to see the different control group
subsystems known to the system, the hierarchy they belong to, and how
many groups they contain.

You can also look at ``/proc/<pid>/cgroup`` to see which control
groups a process belongs to. The control group will be shown as a path
relative to the root of the hierarchy mountpoint; e.g. ``/`` means
“this process has not been assigned to a particular group”, while
``/lxc/pumpkin`` means that the process is likely to be a member of a
container named ``pumpkin``.
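
As an illustration, here is a minimal sketch that turns the contents of
``/proc/<pid>/cgroup`` into a dictionary. It assumes each line has the form
``<hierarchy-id>:<subsystems>:<path>``, with subsystems separated by commas
when several share a hierarchy; the function name ``parse_proc_cgroup`` is
hypothetical.

.. code-block:: python

    def parse_proc_cgroup(text):
        """Map each subsystem name to the cgroup path of the process."""
        groups = {}
        for line in text.splitlines():
            if not line.strip():
                continue
            # Split only twice: the path itself may contain colons.
            _hierarchy_id, subsystems, path = line.split(":", 2)
            for subsystem in subsystems.split(","):
                groups[subsystem] = path
        return groups

With the example above, a process of the ``pumpkin`` container would yield
``/lxc/pumpkin`` for the ``memory`` key, while an unassigned process would
yield ``/``.
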

Finding the Cgroup for a Given Container
----------------------------------------

For each container, one cgroup will be created in each hierarchy. On
older systems with older versions of the LXC userland tools, the name
of the cgroup will be the name of the container. With more recent
versions of the LXC tools, the cgroup will be ``lxc/<container_name>``.

For Docker containers using cgroups, the container name will be the
full ID or long ID of the container. If a container shows up as
``ae836c95b4c3`` in ``docker ps``, its long ID might be something like
``ae836c95b4c3c9e9179e0e91015512da89fdec91612f63cebae57df9a5444c79``. You
can look it up with ``docker inspect`` or ``docker ps --no-trunc``.

Putting everything together: to look at the memory metrics for a Docker
container, take a look at ``/sys/fs/cgroup/memory/lxc/<longid>/``.
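
If you only have the short ID at hand, a glob can find the matching
directory. The sketch below assumes the
``<cgroup_root>/<subsystem>/lxc/<longid>`` layout described above; the
function name and its parameters are made up for this example, and the
layout may differ on your system.

.. code-block:: python

    import glob
    import os

    def container_cgroup_dirs(short_id, subsystem="memory",
                              cgroup_root="/sys/fs/cgroup"):
        """Return the cgroup directories whose name starts with short_id."""
        pattern = os.path.join(cgroup_root, subsystem, "lxc", short_id + "*")
        return sorted(p for p in glob.glob(pattern) if os.path.isdir(p))

An empty list means that no cgroup matched; more than one entry means that
the short ID is ambiguous and a longer prefix is needed.
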

Metrics from Cgroups: Memory, CPU, Block IO
-------------------------------------------

For each subsystem (memory, CPU, and block I/O), you will find one or
more pseudo-files containing statistics.

Memory Metrics: ``memory.stat``
...............................

Memory metrics are found in the "memory" cgroup. Note that the memory
control group adds a little overhead, because it does very
fine-grained accounting of the memory usage on your host. Therefore,
many distros chose not to enable it by default. Generally, to enable
it, all you have to do is add some kernel command-line parameters:
``cgroup_enable=memory swapaccount=1``.

The metrics are in the pseudo-file ``memory.stat``. Here is what it
will look like:

::

    cache 11492564992
    rss 1930993664
    mapped_file 306728960
    pgpgin 406632648
    pgpgout 403355412
    swap 0
    pgfault 728281223
    pgmajfault 1724
    inactive_anon 46608384
    active_anon 1884520448
    inactive_file 7003344896
    active_file 4489052160
    unevictable 32768
    hierarchical_memory_limit 9223372036854775807
    hierarchical_memsw_limit 9223372036854775807
    total_cache 11492564992
    total_rss 1930993664
    total_mapped_file 306728960
    total_pgpgin 406632648
    total_pgpgout 403355412
    total_swap 0
    total_pgfault 728281223
    total_pgmajfault 1724
    total_inactive_anon 46608384
    total_active_anon 1884520448
    total_inactive_file 7003344896
    total_active_file 4489052160
    total_unevictable 32768

The first half (without the ``total_`` prefix) contains statistics
relevant to the processes within the cgroup, excluding
sub-cgroups. The second half (with the ``total_`` prefix) includes
sub-cgroups as well.

Some metrics are "gauges", i.e. values that can increase or decrease
(e.g. swap, the amount of swap space used by the members of the
cgroup). Some others are "counters", i.e. values that can only go up,
because they represent occurrences of a specific event (e.g. pgfault,
which indicates the number of page faults which happened since the
creation of the cgroup; this number can never decrease).
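
The distinction matters when you process the numbers: a gauge can be
reported as-is, while a counter is usually more useful as a rate between
two samples. The following sketch parses ``memory.stat`` and computes such
a rate; both function names are made up for this illustration.

.. code-block:: python

    def parse_memory_stat(text):
        """Turn the 'name value' lines of memory.stat into a dict of ints."""
        stats = {}
        for line in text.splitlines():
            parts = line.split()
            if len(parts) == 2:
                stats[parts[0]] = int(parts[1])
        return stats

    def counter_rate(previous, current, interval_seconds):
        """Average number of events per second between two counter samples."""
        return (current - previous) / interval_seconds

For example, reading ``pgfault`` twice, ten seconds apart, and passing both
values to ``counter_rate`` gives the average number of page faults per
second over that interval.
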

cache
  the amount of memory used by the processes of this control group
  that can be associated precisely with a block on a block
  device. When you read from and write to files on disk, this amount
  will increase. This will be the case if you use "conventional" I/O
  (``open``, ``read``, ``write`` syscalls) as well as mapped files
  (with ``mmap``). It also accounts for the memory used by ``tmpfs``
  mounts, since ``tmpfs`` pages are kept in the page cache.

rss
  the amount of memory that *doesn't* correspond to anything on
  disk: stacks, heaps, and anonymous memory maps.

mapped_file
  indicates the amount of memory mapped by the processes in the
  control group. It doesn't give you information about *how much*
  memory is used; it rather tells you *how* it is used.

pgfault and pgmajfault
  indicate the number of times that a process of the cgroup triggered
  a "page fault" and a "major fault", respectively. A page fault
  happens when a process accesses a part of its virtual memory space
  which is nonexistent or protected. The former can happen if the
  process is buggy and tries to access an invalid address (it will
  then be sent a ``SIGSEGV`` signal, typically killing it with the
  famous ``Segmentation fault`` message). The latter can happen when
  the process reads from a memory zone which has been swapped out, or
  which corresponds to a mapped file: in that case, the kernel will
  load the page from disk, and let the CPU complete the memory
  access. It can also happen when the process writes to a
  copy-on-write memory zone: likewise, the kernel will preempt the
  process, duplicate the memory page, and resume the write operation
  on the process's own copy of the page. "Major" faults happen when the
  kernel actually has to read the data from disk. When it just has to
  duplicate an existing page, or allocate an empty page, it's a
  regular (or "minor") fault.

swap
  the amount of swap currently used by the processes in this cgroup.

active_anon and inactive_anon
  the amount of *anonymous* memory that has been identified as
  respectively *active* and *inactive* by the kernel. "Anonymous"
  memory is the memory that is *not* linked to disk pages. In other
  words, that's the equivalent of the rss counter described above. In
  fact, the very definition of the rss counter is **active_anon** +
  **inactive_anon** - **tmpfs** (where tmpfs is the amount of memory
  used up by ``tmpfs`` filesystems mounted by this control
  group). Now, what's the difference between "active" and "inactive"?
  Pages are initially "active"; and at regular intervals, the kernel
  sweeps over the memory, and tags some pages as "inactive". Whenever
  they are accessed again, they are immediately retagged
  "active". When the kernel is almost out of memory, and time comes to
  swap out to disk, the kernel will swap "inactive" pages.

active_file and inactive_file
  cache memory, with *active* and *inactive* similar to the *anon*
  memory above. The exact formula is cache = **active_file** +
  **inactive_file** + **tmpfs**. The exact rules used by the kernel to
  move memory pages between active and inactive sets are different
  from the ones used for anonymous memory, but the general principle
  is the same. Note that when the kernel needs to reclaim memory, it
  is cheaper to reclaim a clean (i.e. unmodified) page from this pool,
  since it can be reclaimed immediately (while anonymous pages and
  dirty/modified pages have to be written to disk first).

unevictable
  the amount of memory that cannot be reclaimed; generally, it will
  account for memory that has been "locked" with ``mlock``. It is
  often used by crypto frameworks to make sure that secret keys and
  other sensitive material never get swapped out to disk.

memory and memsw limits
  These are not really metrics, but a reminder of the limits applied
  to this cgroup. The first one indicates the maximum amount of
  physical memory that can be used by the processes of this control
  group; the second one indicates the maximum amount of RAM+swap.
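
The two formulas above can be checked against the sample ``memory.stat``
output shown earlier. ``memory.stat`` has no ``tmpfs`` field, so the sketch
below derives it from the cache formula and then plugs it into the rss
formula. The helper names are hypothetical, and the result should be read
as a sanity check rather than an exact accounting rule.

.. code-block:: python

    # Values copied from the sample memory.stat output in this document.
    sample = {
        "cache": 11492564992,
        "rss": 1930993664,
        "inactive_anon": 46608384,
        "active_anon": 1884520448,
        "inactive_file": 7003344896,
        "active_file": 4489052160,
        "unevictable": 32768,
    }

    def implied_tmpfs(stats):
        """tmpfs = cache - active_file - inactive_file (from the cache formula)."""
        return stats["cache"] - stats["active_file"] - stats["inactive_file"]

    def rss_residual(stats):
        """Difference between the reported rss and the rss formula."""
        expected = (stats["active_anon"] + stats["inactive_anon"]
                    - implied_tmpfs(stats))
        return stats["rss"] - expected

    print(implied_tmpfs(sample))  # 167936
    print(rss_residual(sample))   # 32768

In this sample the rss formula is off by 32768 bytes, which happens to be
the ``unevictable`` value. That may simply mean that locked anonymous pages
are counted in rss but in neither of the anon lists; treat this as a guess,
and the formulas as close approximations.
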

Accounting for memory in the page cache is very complex. If two
processes in different control groups both read the same file
(ultimately relying on the same blocks on disk), the corresponding
memory charge will be split between the control groups. It's nice, but
it also means that when a cgroup is terminated, it could increase the
memory usage of another cgroup, because they are not splitting the
cost anymore for those memory pages.

CPU metrics: ``cpuacct.stat``
.............................

Now that we've covered memory metrics, everything else will look very
simple in comparison. CPU metrics will be found in the ``cpuacct``
controller.

For each container, you will find a pseudo-file ``cpuacct.stat``,
containing the CPU usage accumulated by the processes of the
container, broken down between ``user`` and ``system`` time. If you're
not familiar with the distinction, ``user`` is the time during which
the processes were in direct control of the CPU (i.e. executing
process code), and ``system`` is the time during which the CPU was
executing system calls on behalf of those processes.

Those times are expressed in ticks of 1/100th of a second. Actually,
they are expressed in "user jiffies". There are ``USER_HZ``
*"jiffies"* per second, and on x86 systems, ``USER_HZ`` is 100. This
used to map exactly to the number of scheduler "ticks" per second; but
with the advent of higher frequency scheduling, as well as `tickless
kernels <http://lwn.net/Articles/549580/>`_, the number of kernel
ticks wasn't relevant anymore. It stuck around anyway, mainly for
legacy and compatibility reasons.
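
To convert those values to seconds, divide them by ``USER_HZ``. The sketch
below assumes that ``cpuacct.stat`` contains ``user <n>`` and ``system <n>``
lines, and that ``os.sysconf("SC_CLK_TCK")`` reports the ``USER_HZ`` value
of the running system; the function name is made up for this example.

.. code-block:: python

    import os

    def parse_cpuacct_stat(text, user_hz=None):
        """Return the user and system CPU times in seconds."""
        if user_hz is None:
            # Number of "user jiffies" per second on this system.
            user_hz = os.sysconf("SC_CLK_TCK")
        seconds = {}
        for line in text.splitlines():
            parts = line.split()
            if len(parts) == 2:
                seconds[parts[0]] = int(parts[1]) / user_hz
        return seconds

Both values are counters, so the CPU usage over a period is the difference
between two readings divided by the length of that period.
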

Block I/O metrics
.................

Block I/O is accounted in the ``blkio`` controller. Different metrics
are scattered across different files. While you can find in-depth
details in the `blkio-controller
<https://www.kernel.org/doc/Documentation/cgroups/blkio-controller.txt>`_
file in the kernel documentation, here is a short list of the most
relevant ones:

blkio.sectors
  contains the number of 512-byte sectors read and written by the
  processes that are members of the cgroup, device by device. Reads
  and writes are merged in a single counter.

blkio.io_service_bytes
  indicates the number of bytes read and written by the cgroup. It has
  4 counters per device, because for each device, it differentiates
  between synchronous vs. asynchronous I/O, and reads vs. writes.

blkio.io_serviced
  the number of I/O operations performed, regardless of their size. It
  also has 4 counters per device.

blkio.io_queued
  indicates the number of I/O operations currently queued for this
  cgroup. In other words, if the cgroup isn't doing any I/O, this will
  be zero. Note that the opposite is not true. In other words, if
  there is no I/O queued, it does not mean that the cgroup is idle
  (I/O-wise). It could be doing purely synchronous reads on an
  otherwise quiescent device, which is therefore able to handle them
  immediately, without queuing. Also, while it is helpful to figure
  out which cgroup is putting stress on the I/O subsystem, keep in
  mind that it is a relative quantity. Even if a process group does
  not perform more I/O, its queue size can increase just because the
  device load increases because of other devices.
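
As an illustration, the sketch below parses a file such as
``blkio.io_service_bytes``. It assumes lines of the form
``<major>:<minor> <operation> <value>`` (with operations like ``Read``,
``Write``, ``Sync`` and ``Async``), plus a final two-field ``Total`` line;
check the kernel documentation for the exact format on your kernel. The
function name is hypothetical.

.. code-block:: python

    def parse_blkio_per_op(text):
        """Map each device to a dict of {operation: value}."""
        per_device = {}
        for line in text.splitlines():
            parts = line.split()
            if len(parts) != 3:
                # Skips blank lines and the grand "Total <value>" line.
                continue
            device, operation, value = parts
            per_device.setdefault(device, {})[operation] = int(value)
        return per_device

The same function should work for ``blkio.io_serviced`` and
``blkio.io_queued`` if they follow the same layout.
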

Network Metrics
---------------

Network metrics are not exposed directly by control groups. There is a
good explanation for that: network interfaces exist within the context
of *network namespaces*. The kernel could probably accumulate metrics
about packets and bytes sent and received by a group of processes, but
those metrics wouldn't be very useful. You want per-interface metrics
(because traffic happening on the local ``lo`` interface doesn't
really count). But since processes in a single cgroup can belong to
multiple network namespaces, those metrics would be harder to
interpret: multiple network namespaces means multiple ``lo``
interfaces, potentially multiple ``eth0`` interfaces, etc.; so this is
why there is no easy way to gather network metrics with control
groups.

Instead we can gather network metrics from other sources:

IPtables
........

IPtables (or rather, the netfilter framework for which iptables is
just an interface) can do some serious accounting.

For instance, you can set up a rule to account for the outbound HTTP
traffic on a web server:

::

    iptables -I OUTPUT -p tcp --sport 80

There is no ``-j`` or ``-g`` flag, so the rule will just count matched
packets and go to the following rule.

Later, you can check the values of the counters, with:

::

    iptables -nxvL OUTPUT

Technically, ``-n`` is not required, but it will prevent iptables from
doing DNS reverse lookups, which are probably useless in this
scenario.

Counters include packets and bytes. If you want to set up metrics for
container traffic like this, you could execute a ``for`` loop to add
two ``iptables`` rules per container IP address (one in each
direction), in the ``FORWARD`` chain. This will only meter traffic
going through the NAT layer; you will also have to add traffic going
through the userland proxy.

Then, you will need to check those counters on a regular basis. If you
happen to use ``collectd``, there is a nice plugin to automate
iptables counters collection.
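
The ``for`` loop mentioned above could look like the following sketch. It
only builds the command lines (one rule matching the container IP address
as source, one matching it as destination) and does not run anything. The
function name and the example addresses are made up.

.. code-block:: python

    def accounting_rules(container_ips, chain="FORWARD"):
        """Build two counting-only iptables commands per container IP."""
        commands = []
        for ip in container_ips:
            # No -j target: the rules only count the packets they match.
            commands.append(["iptables", "-I", chain, "-s", ip])  # outbound
            commands.append(["iptables", "-I", chain, "-d", ip])  # inbound
        return commands

    for command in accounting_rules(["172.17.0.2", "172.17.0.3"]):
        print(" ".join(command))

Each list can then be handed to ``subprocess.check_call`` by a process
running as root.
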

Interface-level counters
........................

Since each container has a virtual Ethernet interface, you might want
to check directly the TX and RX counters of this interface. You will
notice that each container is associated with a virtual Ethernet
interface in your host, with a name like ``vethKk8Zqi``. Figuring out
which interface corresponds to which container is, unfortunately,
difficult.

For now, the best way is to check the metrics *from within the
containers*. To accomplish this, you can run an executable from the
host environment within the network namespace of a container using
**ip netns magic**.

The ``ip netns exec`` command will let you execute any program
(present in the host system) within any network namespace visible to
the current process. This means that your host will be able to enter
the network namespace of your containers, but your containers won't be
able to access the host, nor their sibling containers. Containers will
be able to “see” and affect their sub-containers, though.

The exact format of the command is::

    ip netns exec <nsname> <command...>

For example::

    ip netns exec mycontainer netstat -i

``ip netns`` finds the "mycontainer" container by using namespaces
pseudo-files. Each process belongs to one network namespace, one PID
namespace, one ``mnt`` namespace, etc., and those namespaces are
materialized under ``/proc/<pid>/ns/``. For example, the network
namespace of PID 42 is materialized by the pseudo-file
``/proc/42/ns/net``.

When you run ``ip netns exec mycontainer ...``, it expects
``/var/run/netns/mycontainer`` to be one of those
pseudo-files. (Symlinks are accepted.)

In other words, to execute a command within the network namespace of a
container, we need to:

* Find out the PID of any process within the container that we want to
  investigate;
* Create a symlink from ``/var/run/netns/<somename>`` to
  ``/proc/<thepid>/ns/net``;
* Execute ``ip netns exec <somename> ....``

Please review :ref:`run_findpid` to learn how to find the cgroup of a
process running in the container of which you want to measure network
usage. From there, you can examine the pseudo-file named ``tasks``,
which contains the PIDs that are in the control group (i.e. in the
container). Pick any one of them.

Putting everything together, if the "short ID" of a container is held
in the environment variable ``$CID``, then you can do this::

    # Glob for the container's tasks file (expanded when used unquoted below)
    TASKS=/sys/fs/cgroup/devices/$CID*/tasks
    # Pick the first PID listed in the container's cgroup
    PID=$(head -n 1 $TASKS)
    mkdir -p /var/run/netns
    # Expose that process's network namespace under a name ip netns can find
    ln -sf /proc/$PID/ns/net /var/run/netns/$CID
    ip netns exec $CID netstat -i

Tips for high-performance metric collection
-------------------------------------------

Note that running a new process each time you want to update metrics
is (relatively) expensive. If you want to collect metrics at high
resolutions, and/or over a large number of containers (think 1000
containers on a single host), you do not want to fork a new process
each time.

Here is how to collect metrics from a single process. You will have to
write your metric collector in C (or any language that lets you do
low-level system calls). You need to use a special system call,
``setns()``, which lets the current process enter any arbitrary
namespace. It requires, however, an open file descriptor to the
namespace pseudo-file (remember: that’s the pseudo-file in
``/proc/<pid>/ns/net``).

However, there is a catch: you must not keep this file descriptor
open. If you do, when the last process of the control group exits, the
namespace will not be destroyed, and its network resources (like the
virtual interface of the container) will stay around forever (or
until you close that file descriptor).

The right approach would be to keep track of the first PID of each
container, and re-open the namespace pseudo-file each time.
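
A minimal sketch of that approach, calling ``setns()`` through ``ctypes``,
could look like this. It assumes a Linux host whose C library exports
``setns``, and a process with enough privileges to change namespaces
(typically root). The function name and the ``proc_root`` parameter are
made up for this example; ``CLONE_NEWNET`` is the constant from
``<sched.h>``.

.. code-block:: python

    import ctypes
    import os

    CLONE_NEWNET = 0x40000000  # network namespace flag from <sched.h>

    def enter_net_namespace(pid, proc_root="/proc"):
        """Move the calling thread into the network namespace of pid."""
        path = os.path.join(proc_root, str(pid), "ns", "net")
        fd = os.open(path, os.O_RDONLY)
        try:
            libc = ctypes.CDLL(None, use_errno=True)
            if libc.setns(fd, CLONE_NEWNET) != 0:
                err = ctypes.get_errno()
                raise OSError(err, os.strerror(err))
        finally:
            # Never keep the descriptor open: it would pin the namespace.
            os.close(fd)

A collector would call this once per container on each collection round,
read the interface counters, then move on to the next container. Because
the descriptor is closed in the ``finally`` clause, the namespace can be
destroyed as soon as the container exits.
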

Collecting metrics when a container exits
-----------------------------------------

Sometimes, you do not care about real-time metric collection, but when
a container exits, you want to know how much CPU, memory, etc. it has
used.

Docker makes this difficult because it relies on ``lxc-start``, which
carefully cleans up after itself, but it is still possible. It is
usually easier to collect metrics at regular intervals (e.g. every
minute, with the collectd LXC plugin) and rely on that instead.

But, if you'd still like to gather the stats when a container stops,
here is how:

For each container, start a collection process, and move it to the
control groups that you want to monitor by writing its PID to the
tasks file of the cgroup. The collection process should periodically
re-read the tasks file to check if it's the last process of the
control group. (If you also want to collect network statistics as
explained in the previous section, you should also move the process to
the appropriate network namespace.)

When the container exits, ``lxc-start`` will try to delete the control
groups. It will fail, since the control group is still in use; but
that’s fine. Your process should now detect that it is the only one
remaining in the group. Now is the right time to collect all the
metrics you need!

Finally, your process should move itself back to the root control
group, and remove the container control group. To remove a control
group, just ``rmdir`` its directory. It's counter-intuitive to
``rmdir`` a directory while it still contains files; but remember that
this is a pseudo-filesystem, so usual rules don't apply. After the
cleanup is done, the collection process can exit safely.
  369. ``rmdir`` a directory as it still contains files; but remember that
  370. this is a pseudo-filesystem, so usual rules don't apply. After the
  371. cleanup is done, the collection process can exit safely.