daemon/stats: more resilient cpu sampling

To avoid noise in sampling CPU usage metrics, we now sample the system
usage closer to the actual response from the underlying runtime. Because
the response from the runtime may be delayed, this makes the sampling
more resilient under load. In addition, we replace the tick with a sleep
to avoid situations where ticks can back up under load.

The trade-off here is slightly more load from reading the system CPU
usage for each container. There may be an optimization required for
large numbers of containers, but the cost is on the order of 15 ms per
1000 containers. If this becomes a problem, we can time-slot the
sampling, but the complexity may not be worth it unless we can test
further.

Unfortunately, there aren't really any good tests for this condition.
Triggering this behavior is highly system dependent. As a matter of
course, we should qualify the fix with the users that are affected.

Signed-off-by: Stephen J Day <stephen.day@docker.com>
Stephen J Day 7 years ago
parent
commit
fd0e24b718
1 file changed, 11 insertions and 7 deletions
+ 11 - 7
daemon/stats/collector.go

@@ -57,7 +57,7 @@ func (s *Collector) Run() {
 	// it will grow enough in first iteration
 	var pairs []publishersPair
 
-	for range time.Tick(s.interval) {
+	for {
 		// it does not make sense in the first iteration,
 		// but saves allocations in further iterations
 		pairs = pairs[:0]
@@ -72,12 +72,6 @@ func (s *Collector) Run() {
 			continue
 		}
 
-		systemUsage, err := s.getSystemCPUUsage()
-		if err != nil {
-			logrus.Errorf("collecting system cpu usage: %v", err)
-			continue
-		}
-
 		onlineCPUs, err := s.getNumberOnlineCPUs()
 		if err != nil {
 			logrus.Errorf("collecting system online cpu count: %v", err)
@@ -89,6 +83,14 @@ func (s *Collector) Run() {
 
 			switch err.(type) {
 			case nil:
+				// Sample system CPU usage close to container usage to avoid
+				// noise in metric calculations.
+				systemUsage, err := s.getSystemCPUUsage()
+				if err != nil {
+					logrus.WithError(err).WithField("container_id", pair.container.ID).Errorf("collecting system cpu usage")
+					continue
+				}
+
 				// FIXME: move to containerd on Linux (not Windows)
 				stats.CPUStats.SystemUsage = systemUsage
 				stats.CPUStats.OnlineCPUs = onlineCPUs
@@ -106,6 +108,8 @@ func (s *Collector) Run() {
 				logrus.Errorf("collecting stats for %s: %v", pair.container.ID, err)
 			}
 		}
+
+		time.Sleep(s.interval)
 	}
 }