Unable to run Testground pex-convergence with 48 nodes #2

Open
aratz-lasa opened this issue Nov 5, 2021 · 2 comments
Labels
bug Something isn't working

Comments

@aratz-lasa
Contributor

aratz-lasa commented Nov 5, 2021

Description

When running the pex-convergence test case in Testground with 48 or more nodes, the execution fails: at times during the run, nodes are unable to dial other peers.

Testground command: testground run single --plan=casm --testcase="pex-convergence" --runner=local:docker --builder=docker:go --instances=48

Error output is:

failed to dial QmbwgXDBnCvDC1XYP7SzfckdScYWFpQNhPUJQmEPgcQwsK:
  * [/ip4/192.18.0.41/tcp/46245] dial tcp4 192.18.0.41:46245: connect: network is unreachable
  * [/ip4/16.0.0.41/tcp/46245] dial tcp4 0.0.0.0:34309->16.0.0.41:46245: i/o timeout
  * [/ip4/127.0.0.1/tcp/46245] dial tcp4 0.0.0.0:34309->127.0.0.1:46245: i/o timeout

Ideas

The main hypothesis is that the network and containers get overloaded. The convergence test case uses Redis barriers in every iteration to synchronize nodes, and Redis barriers are known to carry significant overhead.
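
To make the per-iteration synchronization pattern concrete, here is a minimal in-process sketch. It is not the test plan's code: it uses goroutines and a hypothetical `cyclicBarrier` type in place of Testground's Redis-backed barriers, so it only shows the shape of the pattern (every node blocks once per iteration until all nodes arrive), not its real network cost. The 10 iterations are an arbitrary choice.

```go
package main

import (
	"fmt"
	"sync"
)

// cyclicBarrier is a hypothetical in-process stand-in for a Redis-backed
// barrier: every participant blocks in Wait until all n have arrived.
type cyclicBarrier struct {
	mu         sync.Mutex
	cond       *sync.Cond
	n          int // number of participants
	waiting    int // participants that have arrived in the current round
	generation int // incremented each time the barrier releases
}

func newCyclicBarrier(n int) *cyclicBarrier {
	b := &cyclicBarrier{n: n}
	b.cond = sync.NewCond(&b.mu)
	return b
}

// Wait blocks until n participants have called Wait in the current round.
func (b *cyclicBarrier) Wait() {
	b.mu.Lock()
	defer b.mu.Unlock()
	gen := b.generation
	b.waiting++
	if b.waiting == b.n {
		// Last arrival: release everyone and start a new round.
		b.waiting = 0
		b.generation++
		b.cond.Broadcast()
		return
	}
	for gen == b.generation {
		b.cond.Wait()
	}
}

// runIterations starts `nodes` goroutines that each perform `iterations`
// rounds, synchronizing on the barrier after every round. It returns the
// total number of barrier crossings.
func runIterations(nodes, iterations int) int {
	barrier := newCyclicBarrier(nodes)
	var wg sync.WaitGroup
	var mu sync.Mutex
	crossings := 0
	for i := 0; i < nodes; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for it := 0; it < iterations; it++ {
				// One round of the protocol under test would run here.
				barrier.Wait()
				mu.Lock()
				crossings++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return crossings
}

func main() {
	// 48 nodes, as in the failing run; 10 iterations is an arbitrary choice.
	fmt.Println(runIterations(48, 10)) // prints 480
}
```

The number of barrier crossings grows as nodes × iterations, which is why a barrier in every iteration is a plausible source of load once each crossing involves a round trip to Redis.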

@aratz-lasa aratz-lasa added the bug Something isn't working label Nov 5, 2021
@aratz-lasa
Contributor Author

To test whether the barriers add too much overhead, a new branch without barriers was created: the no-barriers branch.

However, that branch also fails, this time due to name resolution:

failed while signalling entry to state finished: dial tcp: lookup testground-redis: Temporary failure in name resolution
goroutine 1 [running]:
runtime/debug.Stack(0xecfdd5, 0x2, 0xc0002479c0)
	/usr/local/go/src/runtime/debug/stack.go:24 +0x9f
github.com/testground/sdk-go/runtime.(*RunEnv).RecordCrash(0xc00021a000, 0xdb9420, 0xc0005045c0)
	/go/pkg/mod/github.com/testground/[email protected]/runtime/runenv_events.go:245 +0xa5
github.com/testground/sdk-go/run.invoke.func3(0xc00021a000)
	/go/pkg/mod/github.com/testground/[email protected]/run/invoker.go:135 +0x65
panic(0xdb9420, 0xc0005045c0)
	/usr/local/go/src/runtime/panic.go:965 +0x1b9
github.com/testground/sdk-go/run.invoke(0xc00021a000, 0xd92620, 0x102e0b0)
	/go/pkg/mod/github.com/testground/[email protected]/run/invoker.go:173 +0x505
github.com/testground/sdk-go/run.InvokeMap(0xc0000f9e38)
	/go/pkg/mod/github.com/testground/[email protected]/run/invoker.go:65 +0xa9
main.main()
	/plan/main.go:14 +0x145

@aratz-lasa
Contributor Author

aratz-lasa commented Nov 5, 2021

Another test case was created to check whether the name resolution problem arises from overloading the system. The test simply runs N nodes, and every node except one calls a Redis barrier. Preliminary tests show that with 200-500 nodes, Testground raises name resolution errors.

Error output:
{"ts":1636108935146165512,"msg":"failed while getting barriers; iteration skipped","group_id":"single","run_id":"c62ggeeqmkrnfmls1p40","process":"barriers","error":"dial tcp: lookup testground-redis: Temporary failure in name resolution"}

and also

failed to send batch to InfluxDB; attempt 1; err: Post "http://testground-influxdb:8086/write?consistency=&db=testground&precision=ns&rp=": dial tcp: lookup testground-influxdb: Temporary failure in name resolution

The command to run the new test case is: testground run single --plan=casm --testcase="dns-test" --runner=local:docker --builder=docker:go --instances=500
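
One way to probe whether these lookups fail only transiently under load would be to retry them with backoff and record how many attempts each one needs. The following is a minimal sketch using only the Go standard library; `lookupWithRetry`, `lookupFunc`, `systemLookup`, and `errTemporary` are hypothetical names introduced for illustration and are not part of the test plan or the Testground SDK. The demo in `main` uses a simulated flaky lookup, because `testground-redis` only resolves inside the Testground Docker network.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
	"time"
)

// lookupFunc has the shape of a host lookup so a fake can be swapped in.
type lookupFunc func(host string) ([]string, error)

// errTemporary is a hypothetical sentinel standing in for a real DNS failure.
var errTemporary = errors.New("temporary failure in name resolution")

// lookupWithRetry calls lookup up to `attempts` times (at least once),
// doubling the delay between tries. It returns the addresses, the number
// of attempts used, and the last error if every attempt failed.
func lookupWithRetry(lookup lookupFunc, host string, attempts int, base time.Duration) ([]string, int, error) {
	if attempts < 1 {
		attempts = 1
	}
	var lastErr error
	delay := base
	for i := 1; i <= attempts; i++ {
		addrs, err := lookup(host)
		if err == nil {
			return addrs, i, nil
		}
		lastErr = err
		if i < attempts {
			time.Sleep(delay)
			delay *= 2
		}
	}
	return nil, attempts, lastErr
}

// systemLookup resolves host with the system resolver and a 2s timeout.
// Inside a test plan it could be passed to lookupWithRetry directly.
func systemLookup(host string) ([]string, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	return net.DefaultResolver.LookupHost(ctx, host)
}

func main() {
	// Simulated lookup that fails twice before succeeding (made-up address).
	calls := 0
	flaky := func(host string) ([]string, error) {
		calls++
		if calls < 3 {
			return nil, errTemporary
		}
		return []string{"10.0.0.1"}, nil
	}
	addrs, tries, err := lookupWithRetry(flaky, "testground-redis", 5, 10*time.Millisecond)
	fmt.Println(addrs, tries, err)
}
```

If lookups succeed after a few retries, that would point toward resolver overload rather than a misconfigured network; if they never succeed, retries will not help and the cause lies elsewhere.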
