
Ensure task reaper stopChan is closed no more than once #2491

Merged
1 commit merged into moby:master from the stop-task-reaper-once branch on Feb 1, 2018

Conversation

nishanttotla
Contributor

Signed-off-by: Nishant Totla <nishanttotla@gmail.com>

@stevvooe
Contributor

LGTM but add some assurance that run cannot be called twice.

// TODO(dperny) calling stop on the task reaper twice will cause a panic
// because we try to close a channel that will already have been closed.
close(tr.stopChan)
tr.closeOnce.Do(func() {
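
For reference, a minimal sketch of what the guarded Stop looks like with this change (a sketch, not the verbatim merged code): stopChan and closeOnce are the fields visible in the diff above, and the final wait on a doneChan is assumed from the existing doc comment that Stop "waits for the main loop to exit".

// Stop stops the TaskReaper and waits for the main loop to exit.
// Sketch: assumes stopChan/doneChan are chan struct{} fields and
// closeOnce is a sync.Once field on TaskReaper.
func (tr *TaskReaper) Stop() {
	// Close stopChan at most once, even if Stop is called concurrently.
	tr.closeOnce.Do(func() {
		close(tr.stopChan)
	})
	// Wait for the run loop to signal that it has exited.
	<-tr.doneChan
}
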
Contributor

What is the reason to have multiple invocations of this function?

Contributor

Kicked off a rebuild. Let's close this out soon? @nishanttotla

@codecov

codecov bot commented Jan 30, 2018

Codecov Report

Merging #2491 into master will decrease coverage by 0.55%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #2491      +/-   ##
==========================================
- Coverage   62.09%   61.53%   -0.56%     
==========================================
  Files          49      129      +80     
  Lines        6898    21315   +14417     
==========================================
+ Hits         4283    13117    +8834     
- Misses       2174     6796    +4622     
- Partials      441     1402     +961

@@ -282,8 +286,8 @@ func (tr *TaskReaper) tick() {

// Stop stops the TaskReaper and waits for the main loop to exit.
func (tr *TaskReaper) Stop() {
// TODO(dperny) calling stop on the task reaper twice will cause a panic
Contributor Author

@dperny do you remember the reason for this? I'm looking into it too.

Collaborator

It's never happened, IIRC; it's just possible for it to happen. Other components have Stop channels that are guarded against double close.
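
For illustration (this is not code from swarmkit), one guard sometimes seen elsewhere is a non-blocking select before the close. It prevents a double close from strictly sequential callers, but it is not safe when Stop can be reached from two goroutines at once, which is exactly the scenario described below; a sync.Once does not have that problem.

// Illustrative guard only, not the pattern adopted in this PR:
select {
case <-tr.stopChan:
	// Channel already closed; nothing to do.
default:
	// Two goroutines can both reach this branch concurrently and
	// both call close, which still panics, so this guard alone is
	// not enough for the race discussed in this thread.
	close(tr.stopChan)
}
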

@nishanttotla
Contributor Author

I noticed that taskreaper.Stop() is called both in func (m *Manager) Stop(ctx context.Context, clearData bool) and in func (m *Manager) becomeFollower() in manager/manager.go. So in a scenario where an existing leader was shutting down, and a leader election happened before it could finish shutting down (making it become a follower), taskreaper.Stop() could be called twice.

This is very much a corner case, but it seems realistic because Stop and becomeFollower run in separate goroutines. We should look out for this in all components shut down by becomeFollower:

// becomeFollower shuts down the subsystems that are only run by the leader.
func (m *Manager) becomeFollower() {
	m.dispatcher.Stop()
	m.logbroker.Stop()
	m.caserver.Stop()

	if m.allocator != nil {
		m.allocator.Stop()
		m.allocator = nil
	}

	m.constraintEnforcer.Stop()
	m.constraintEnforcer = nil

	m.replicatedOrchestrator.Stop()
	m.replicatedOrchestrator = nil

	m.globalOrchestrator.Stop()
	m.globalOrchestrator = nil

	m.taskReaper.Stop()
	m.taskReaper = nil

	m.scheduler.Stop()
	m.scheduler = nil

	m.roleManager.Stop()
	m.roleManager = nil

	if m.keyManager != nil {
		m.keyManager.Stop()
		m.keyManager = nil
	}
}

WDYT @anshulpundir @dperny @stevvooe?

@anshulpundir
Contributor

Makes sense. Let's add that comment to TaskReaper.Stop() and land this change @nishanttotla

@dperny
Collaborator

dperny commented Feb 1, 2018

Actually, I do have a change request: can we log an error or warning message if we try to stop the taskreaper twice? It's a corner case, and it's better to log an error than to panic the program, but it's an undesirable occurrence. Something like

log.L.Warn("Stop called twice on taskreaper")

@nishanttotla nishanttotla force-pushed the stop-task-reaper-once branch 2 times, most recently from 6751f66 to 4657b27 on February 1, 2018 18:07
Signed-off-by: Nishant Totla <nishanttotla@gmail.com>
// shutting down, and the other when the manager (the leader) is
// becoming a follower. Since these two instances could race with
// each other, we use closeOnce here to ensure that TaskReaper.Stop()
// is called only once to avoid a panic.
Contributor Author

@anshulpundir added a comment here.

@nishanttotla
Contributor Author

I think the test flakiness is related; I'm investigating.

@nishanttotla
Contributor Author

@dperny @anshulpundir have we seen this kind of flakiness in other PRs?

time="2018-02-01T18:21:57Z" level=warning msg="no certificate expiration specified, using default" cluster.id=tqntuxu76xeetoxmyjsl43zuq method="(*Server).UpdateRootCA" module=ca testHasExternalCA=false testname=TestLoadSecurityConfigLoadFromDisk 
time="2018-02-01T18:21:57Z" level=info msg="grpc: addrConn.resetTransport failed to create client transport: connection error: desc = \"transport: dial tcp 127.0.0.1:12: getsockopt: connection refused\"; Reconnecting to {127.0.0.1:12 <nil>}" module=grpc 
time="2018-02-01T18:21:57Z" level=error msg="failed to retrieve remote root CA certificate" error="rpc error: code = Unavailable desc = grpc: the connection is unavailable" 
time="2018-02-01T18:21:57Z" level=info msg="Failed to dial 127.0.0.1:12: context canceled; please retry." module=grpc 
time="2018-02-01T18:21:59Z" level=info msg="grpc: addrConn.resetTransport failed to create client transport: connection error: desc = \"transport: dial tcp 127.0.0.1:12: getsockopt: connection refused\"; Reconnecting to {127.0.0.1:12 <nil>}" module=grpc 
time="2018-02-01T18:21:59Z" level=error msg="failed to retrieve remote root CA certificate" error="rpc error: code = Unavailable desc = grpc: the connection is unavailable" 
time="2018-02-01T18:21:59Z" level=info msg="Failed to dial 127.0.0.1:12: context canceled; please retry." module=grpc 
time="2018-02-01T18:22:01Z" level=info msg="grpc: addrConn.resetTransport failed to create client transport: connection error: desc = \"transport: dial tcp 127.0.0.1:12: getsockopt: connection refused\"; Reconnecting to {127.0.0.1:12 <nil>}" module=grpc 
time="2018-02-01T18:22:01Z" level=error msg="failed to retrieve remote root CA certificate" error="rpc error: code = Unavailable desc = grpc: the connection is unavailable" 
time="2018-02-01T18:22:01Z" level=info msg="Failed to dial 127.0.0.1:12: context canceled; please retry." module=grpc 
time="2018-02-01T18:22:03Z" level=info msg="grpc: addrConn.resetTransport failed to create client transport: connection error: desc = \"transport: dial tcp 127.0.0.1:12: getsockopt: connection refused\"; Reconnecting to {127.0.0.1:12 <nil>}" module=grpc 
time="2018-02-01T18:22:03Z" level=error msg="failed to retrieve remote root CA certificate" error="rpc error: code = Unavailable desc = grpc: the connection is unavailable" 
time="2018-02-01T18:22:03Z" level=info msg="Failed to dial 127.0.0.1:12: context canceled; please retry." module=grpc 
time="2018-02-01T18:22:05Z" level=info msg="grpc: addrConn.resetTransport failed to create client transport: connection error: desc = \"transport: dial tcp 127.0.0.1:12: getsockopt: connection refused\"; Reconnecting to {127.0.0.1:12 <nil>}" module=grpc 
time="2018-02-01T18:22:05Z" level=info msg="Failed to dial 127.0.0.1:12: context canceled; please retry." module=grpc 
time="2018-02-01T18:22:05Z" level=error msg="failed to retrieve remote root CA certificate" error="rpc error: code = Unavailable desc = grpc: the connection is unavailable" 
time="2018-02-01T18:22:07Z" level=warning msg="no certificate expiration specified, using default" cluster.id=wsgzortbrionw33zytsshw288 method="(*Server).UpdateRootCA" module=ca testHasExternalCA=false testname=TestLoadSecurityConfigDownloadAllCerts 
time="2018-02-01T18:22:08Z" level=info msg="Listening for local connections" addr="/tmpfs/manager-root-ca-update106457598/control.sock" module=node node.id=ka7utsr50amzqxcmy1ent66ju proto=unix 
time="2018-02-01T18:22:08Z" level=info msg="4e1464b1d868fe2c became follower at term 0" module=raft node.id=ka7utsr50amzqxcmy1ent66ju 
time="2018-02-01T18:22:08Z" level=info msg="newRaft 4e1464b1d868fe2c [peers: [], term: 0, commit: 0, applied: 0, lastindex: 0, lastterm: 0]" module=raft node.id=ka7utsr50amzqxcmy1ent66ju 
time="2018-02-01T18:22:08Z" level=info msg="4e1464b1d868fe2c became follower at term 1" module=raft node.id=ka7utsr50amzqxcmy1ent66ju 
time="2018-02-01T18:22:08Z" level=info msg="4e1464b1d868fe2c is starting a new election at term 1" module=raft node.id=ka7utsr50amzqxcmy1ent66ju 
time="2018-02-01T18:22:08Z" level=info msg="4e1464b1d868fe2c became candidate at term 2" module=raft node.id=ka7utsr50amzqxcmy1ent66ju 
time="2018-02-01T18:22:08Z" level=info msg="4e1464b1d868fe2c received MsgVoteResp from 4e1464b1d868fe2c at term 2" module=raft node.id=ka7utsr50amzqxcmy1ent66ju 
time="2018-02-01T18:22:08Z" level=info msg="4e1464b1d868fe2c became leader at term 2" module=raft node.id=ka7utsr50amzqxcmy1ent66ju 
time="2018-02-01T18:22:08Z" level=info msg="raft.node: 4e1464b1d868fe2c elected leader 4e1464b1d868fe2c at term 2" module=raft node.id=ka7utsr50amzqxcmy1ent66ju 
time="2018-02-01T18:22:08Z" level=info msg="Creating default ingress network" module=node node.id=ka7utsr50amzqxcmy1ent66ju 
time="2018-02-01T18:22:08Z" level=info msg="dispatcher starting" module=dispatcher node.id=ka7utsr50amzqxcmy1ent66ju 
time="2018-02-01T18:22:08Z" level=info msg="manager selected by agent for new session: { }" module="node/agent" node.id=ka7utsr50amzqxcmy1ent66ju 
time="2018-02-01T18:22:08Z" level=info msg="waiting 0s before registering session" module="node/agent" node.id=ka7utsr50amzqxcmy1ent66ju 
time="2018-02-01T18:22:08Z" level=info msg="worker ka7utsr50amzqxcmy1ent66ju was successfully registered" method="(*Dispatcher).register" 
time="2018-02-01T18:22:08Z" level=info msg="Stopping manager" 
time="2018-02-01T18:22:08Z" level=info msg="dispatcher stopping" method="(*Dispatcher).Stop" module=dispatcher node.id=ka7utsr50amzqxcmy1ent66ju 
time="2018-02-01T18:22:08Z" level=info msg="dispatcher session dropped, marking node ka7utsr50amzqxcmy1ent66ju down" method="(*Dispatcher).Session" node.id=ka7utsr50amzqxcmy1ent66ju node.session=rb7x451ox253wcqcfu7mdabt8 
time="2018-02-01T18:22:08Z" level=error msg="failed to remove node" error="rpc error: code = Aborted desc = dispatcher is stopped" method="(*Dispatcher).Session" node.id=ka7utsr50amzqxcmy1ent66ju node.session=rb7x451ox253wcqcfu7mdabt8 
time="2018-02-01T18:22:08Z" level=info msg="transport: http2Server.HandleStreams failed to read frame: read unix /tmpfs/manager-root-ca-update106457598/control.sock->@: use of closed network connection" module=grpc 
time="2018-02-01T18:22:08Z" level=info msg="transport: http2Client.notifyError got notified that the client transport was broken EOF." module=grpc 
time="2018-02-01T18:22:08Z" level=info msg="grpc: addrConn.resetTransport failed to create client transport: connection error: desc = \"transport: dial unix /tmpfs/manager-root-ca-update106457598/control.sock: connect: no such file or directory\"; Reconnecting to {/tmpfs/manager-root-ca-update106457598/control.sock <nil>}" module=grpc 
time="2018-02-01T18:22:08Z" level=error msg="heartbeat to manager { } failed" error="rpc error: code = Internal desc = transport is closing" method="(*session).heartbeat" module="node/agent" node.id=ka7utsr50amzqxcmy1ent66ju session.id=rb7x451ox253wcqcfu7mdabt8 sessionID=rb7x451ox253wcqcfu7mdabt8 
time="2018-02-01T18:22:08Z" level=error msg="agent: session failed" backoff=100ms error="rpc error: code = Internal desc = transport is closing" module="node/agent" node.id=ka7utsr50amzqxcmy1ent66ju 
time="2018-02-01T18:22:08Z" level=info msg="manager selected by agent for new session: { }" module="node/agent" node.id=ka7utsr50amzqxcmy1ent66ju 
time="2018-02-01T18:22:08Z" level=info msg="waiting 288.042µs before registering session" module="node/agent" node.id=ka7utsr50amzqxcmy1ent66ju 
time="2018-02-01T18:22:08Z" level=error msg="agent: session failed" backoff=300ms error="rpc error: code = Unavailable desc = grpc: the connection is unavailable" module="node/agent" node.id=ka7utsr50amzqxcmy1ent66ju 
time="2018-02-01T18:22:08Z" level=info msg="manager selected by agent for new session: { }" module="node/agent" node.id=ka7utsr50amzqxcmy1ent66ju 
time="2018-02-01T18:22:08Z" level=info msg="waiting 66.168097ms before registering session" module="node/agent" node.id=ka7utsr50amzqxcmy1ent66ju 
time="2018-02-01T18:22:08Z" level=info msg="Manager shut down" 
time="2018-02-01T18:22:08Z" level=info msg="Stopping manager" module=node node.id=ka7utsr50amzqxcmy1ent66ju 
time="2018-02-01T18:22:08Z" level=info msg="Failed to dial /tmpfs/manager-root-ca-update106457598/control.sock: context canceled; please retry." module=grpc 
time="2018-02-01T18:22:08Z" level=error msg="agent: session failed" backoff=700ms error="rpc error: code = FailedPrecondition desc = grpc: the client connection is closing" module="node/agent" node.id=ka7utsr50amzqxcmy1ent66ju 
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x38 pc=0xe1b128]

goroutine 296 [running]:
github.com/docker/swarmkit/connectionbroker.(*Conn).Peer(...)
	/home/ubuntu/.go_workspace/src/github.com/docker/swarmkit/connectionbroker/broker.go:104
github.com/docker/swarmkit/agent.newSession(0x2b967133a250, 0xc4209a6270, 0xc4209a8000, 0x184395bd, 0xc420370820, 0x19, 0xc4209a62a0, 0xc42021e400)
	/home/ubuntu/.go_workspace/src/github.com/docker/swarmkit/agent/session.go:68 +0x588
github.com/docker/swarmkit/agent.(*Agent).run(0xc4209a8000, 0x2b967133a250, 0xc4209a6270)
	/home/ubuntu/.go_workspace/src/github.com/docker/swarmkit/agent/agent.go:365 +0x13f4
created by github.com/docker/swarmkit/agent.(*Agent).Start.func1
	/home/ubuntu/.go_workspace/src/github.com/docker/swarmkit/agent/agent.go:83 +0xb2
FAIL	github.com/docker/swarmkit/node	10.945s
make: *** [coverage] Error 1

@nishanttotla
Contributor Author

nishanttotla commented Feb 1, 2018

Okay, tests seem to pass now. The flakiness is strange, but I can't say it's related to this PR anymore.

As far as @dperny's comment about logging is concerned, it's not possible to do it with a sync.Once without using locking, which would be overkill.

Ping @dperny @anshulpundir

@anshulpundir
Contributor

The flakiness is fixed in 7d507f6

@anshulpundir anshulpundir merged commit 4641f9e into moby:master Feb 1, 2018
@nishanttotla nishanttotla deleted the stop-task-reaper-once branch February 1, 2018 21:15
@anshulpundir
Contributor

I merged it; not sure if you were planning to address #2491 (comment).

If so, I can add it to my next PR @nishanttotla

@nishanttotla
Contributor Author

nishanttotla commented Feb 1, 2018

@anshulpundir I responded to that comment briefly in my last one. It's not possible to add just a simple log message with sync.Once, because it only lets you ensure that the callback provided to it is called once. It doesn't exactly let you find out if a second call was made.
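
For illustration, one hypothetical way to get both behaviors (close exactly once and warn on later calls) would be to replace the bare sync.Once with an atomic flag. This is not what the PR does; the isStopped field is made up for this sketch, doneChan is assumed as in the sketch above, the snippet relies on sync/atomic, and log.L is the logger referenced in the earlier comment.

// Hypothetical alternative, not the merged code.
func (tr *TaskReaper) Stop() {
	// CompareAndSwapInt32 returns false if the flag was already set,
	// i.e. if Stop has been called before.
	if !atomic.CompareAndSwapInt32(&tr.isStopped, 0, 1) {
		log.L.Warn("Stop called twice on taskreaper")
		return
	}
	close(tr.stopChan)
	// Wait for the run loop to exit.
	<-tr.doneChan
}
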
