Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure containers are stopped on daemon startup #35805

Merged

Conversation

cpuguy83
Copy link
Member

When the containerd 1.0 runtime changes were made, we inadvertently
removed the functionality where any running containers are killed on
startup when not using live-restore.
This change restores that behavior.

ping @fcrisciani

@fcrisciani
Copy link
Contributor

LGTM

@andrewhsu
Copy link
Member

andrewhsu commented Dec 14, 2017

Seeing failures:

FAIL: docker_cli_nat_test.go:62: DockerSuite.TestNetworkLocalhostTCPNat
FAIL: docker_cli_nat_test.go:46: DockerSuite.TestNetworkNat
22:55:14 ----------------------------------------------------------------------
22:55:14 FAIL: docker_cli_nat_test.go:62: DockerSuite.TestNetworkLocalhostTCPNat
22:55:14 
22:55:14 docker_cli_nat_test.go:76:
22:55:14     c.Assert(final, checker.Equals, msg)
22:55:14 ... obtained string = ""
22:55:14 ... expected string = "hi yall"
22:55:14 
22:55:15 ----------------------------------------------------------------------
22:55:15 FAIL: docker_cli_nat_test.go:46: DockerSuite.TestNetworkNat
22:55:15 
22:55:15 docker_cli_nat_test.go:52:
22:55:15     c.Assert(err, check.IsNil)
22:55:15 ... value *net.OpError = &net.OpError{Op:"dial", Net:"tcp", Source:net.Addr(nil), Addr:(*net.TCPAddr)(0xc4203ecb70), Err:(*os.SyscallError)(0xc4204b3900)} ("dial tcp 172.17.0.3:8080: getsockopt: connection refused")
22:55:15 
22:55:16 

@thaJeztah thaJeztah added the status/failing-ci Indicates that the PR in its current state fails the test suite label Dec 15, 2017
Copy link
Member

@vdemeester vdemeester left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 🐯

@cpuguy83 cpuguy83 force-pushed the fix_kill_containers_on_restart branch from da15ad0 to 5d7041c Compare December 15, 2017 15:46
@anusha-ragunathan
Copy link
Contributor

Tested this and it doesnt work. After daemon restart, docker ps shows no live containers, but the process is still running. Probably needs a SIGTERM

@anusha-ragunathan
Copy link
Contributor

anusha-ragunathan commented Dec 15, 2017

Adding a SignalProcess call terminates the process cleanly. Needs addition of wait and check logic for process to be dead and SIGKILL after wait.

index 96b39e8..c18379b 100644
--- a/daemon/daemon.go
+++ b/daemon/daemon.go
@@ -44,6 +44,7 @@ import (
 	"github.com/docker/docker/pkg/containerfs"
 	"github.com/docker/docker/pkg/idtools"
 	"github.com/docker/docker/pkg/plugingetter"
+	"github.com/docker/docker/pkg/signal"
 	"github.com/docker/docker/pkg/sysinfo"
 	"github.com/docker/docker/pkg/system"
 	"github.com/docker/docker/pkg/truncindex"
@@ -241,12 +242,19 @@ func (daemon *Daemon) restore() error {
 				logrus.Errorf("Failed to restore container %s with containerd: %s", c.ID, err)
 				return
 			}
-			if !alive {
+
+			if !daemon.configStore.LiveRestoreEnabled {
+				err = daemon.containerd.SignalProcess(context.Background(), c.ID, libcontainerd.InitProcessName, int(signal.SignalMap["TERM"]))
+				logrus.WithError(err).Errorf("Failed to kill container %v due to %v", c.ID, err)
+			}
+
+			if !alive || !daemon.configStore.LiveRestoreEnabled {
 				ec, exitedAt, err = daemon.containerd.DeleteTask(context.Background(), c.ID)
 				if err != nil && !errdefs.IsNotFound(err) {
 					logrus.WithError(err).Errorf("Failed to delete container %s from containerd", c.ID)
 					return
 				}
+				alive = false
 			}
 
 			if c.IsRunning() || c.IsPaused() {

@cpuguy83 cpuguy83 force-pushed the fix_kill_containers_on_restart branch 2 times, most recently from ebd6a08 to c2f3b3b Compare December 15, 2017 21:31
daemon/daemon.go Outdated
@@ -247,6 +247,11 @@ func (daemon *Daemon) restore() error {
logrus.WithError(err).Errorf("Failed to delete container %s from containerd", c.ID)
return
}
} else if daemon.configStore.LiveRestoreEnabled {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if !daemon.configStore.LiveRestoreEnabled, am I missing something?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you are correct...

@andrewhsu
Copy link
Member

Just saw failure on experimental build job:

FAIL: docker_cli_network_unix_test.go:1135: DockerNetworkSuite.TestDockerNetworkHostModeUngracefulDaemonRestart

[d271672e27723] waiting for daemon to start
[d271672e27723] daemon started

Daemon d271672e27723 is not started
[d271672e27723] waiting for daemon to start
[d271672e27723] daemon started

docker_cli_network_unix_test.go:1157:
    c.Assert(err, checker.IsNil)
... value *errors.fundamental = condition ""false" == "true"" not true in time (10s) ("condition \"\"false\" == \"true\"\" not true in time (10s)")

@thaJeztah
Copy link
Member

https://jenkins.dockerproject.org/job/Docker-PRs-experimental/38431/console

23:51:15 ----------------------------------------------------------------------
23:51:15 FAIL: docker_cli_network_unix_test.go:1135: DockerNetworkSuite.TestDockerNetworkHostModeUngracefulDaemonRestart
23:51:15 
23:51:15 [d271672e27723] waiting for daemon to start
23:51:15 [d271672e27723] daemon started
23:51:15 
23:51:15 Daemon d271672e27723 is not started
23:51:15 [d271672e27723] waiting for daemon to start
23:51:15 [d271672e27723] daemon started
23:51:15 
23:51:15 docker_cli_network_unix_test.go:1157:
23:51:15     c.Assert(err, checker.IsNil)
23:51:15 ... value *errors.fundamental = condition ""false" == "true"" not true in time (10s) ("condition \"\"false\" == \"true\"\" not true in time (10s)")
23:51:15 
23:51:15 [d271672e27723] exiting daemon
23:51:30 

@cpuguy83 cpuguy83 force-pushed the fix_kill_containers_on_restart branch 5 times, most recently from ea89a1e to e0cdd4f Compare December 18, 2017 14:09
@cpuguy83
Copy link
Member Author

experimental CI failure is unrelated and is tracked in #35825
All green otherwise.

@cpuguy83 cpuguy83 added rebuild/experimental and removed status/failing-ci Indicates that the PR in its current state fails the test suite labels Dec 18, 2017
t.Parallel()

d := daemon.New(t, "", "dockerd", daemon.Config{})
defer d.Stop(t)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The daemon is not started yet. So remove defer d.Stop

When the containerd 1.0 runtime changes were made, we inadvertantly
removed the functionality where any running containers are killed on
startup when not using live-restore.
This change restores that behavior.

Signed-off-by: Brian Goff <cpuguy83@gmail.com>
@cpuguy83 cpuguy83 force-pushed the fix_kill_containers_on_restart branch from e0cdd4f to e69127b Compare December 18, 2017 19:34
@anusha-ragunathan
Copy link
Contributor

LGTM (Once CI is green)

Copy link
Contributor

@seemethere seemethere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM on green as well

@anusha-ragunathan anusha-ragunathan merged commit c3a9f5d into moby:master Dec 19, 2017
@cpuguy83 cpuguy83 deleted the fix_kill_containers_on_restart branch December 19, 2017 20:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants