This repository was archived by the owner on Dec 13, 2021. It is now read-only.

Commit 938c58c

Merge pull request #129 from osrg/suda/wip
Major improvements (doc, CLI)
2 parents a7defa0 + cc050d4 commit 938c58c

File tree

15 files changed

+389
-197
lines changed


README.md

Lines changed: 117 additions & 16 deletions
@@ -5,12 +5,13 @@
55
[![GoDoc](https://godoc.org/github.com/osrg/earthquake/earthquake?status.svg)](https://godoc.org/github.com/osrg/earthquake/earthquake)
66
[![Build Status](https://travis-ci.org/osrg/earthquake.svg?branch=master)](https://travis-ci.org/osrg/earthquake)
77
[![Coverage Status](https://coveralls.io/repos/github/osrg/earthquake/badge.svg?branch=master)](https://coveralls.io/github/osrg/earthquake?branch=master)
8+
[![Go Report Card](https://goreportcard.com/badge/github.com/osrg/earthquake)](https://goreportcard.com/report/github.com/osrg/earthquake)
89

910
Earthquake is a programmable fuzzy scheduler for testing real implementations of distributed systems (such as ZooKeeper).
1011

1112
Blog: [http://osrg.github.io/earthquake/](http://osrg.github.io/earthquake/)
1213

13-
Earthquakes permutes C/Java function calls, Ethernet packets, Filesystem events, and injected faults in various orders so as to find implementation-level bugs of the distributed system.
14+
Earthquake permutes Java function calls, Ethernet packets, filesystem events, and injected faults in various orders so as to find implementation-level bugs in the distributed system.
1415
Earthquake can also control non-determinism of the thread interleaving (by calling `sched_setattr(2)` with randomized parameters).
1516
So Earthquake can also be used for testing standalone multi-threaded software.
1617

@@ -27,13 +28,13 @@ Basically, Earthquake permutes events in a random order, but you can write your
2728
* Found [YARN-4301](https://issues.apache.org/jira/browse/YARN-4301) (fault tolerance): ([repro code](example/yarn/4301-reproduce))
2829
* Reproduced flaky tests YARN-{[1978](https://issues.apache.org/jira/browse/YARN-1978), [4168](https://issues.apache.org/jira/browse/YARN-4168), [4543](https://issues.apache.org/jira/browse/YARN-4543), [4548](https://issues.apache.org/jira/browse/YARN-4548), [4556](https://issues.apache.org/jira/browse/YARN-4556)} ([repro instruction](http://www.slideshare.net/AkihiroSuda/tackling-nondeterminism-in-hadoop-testing-and-debugging-distributed-systems-with-earthquake-57866497/42))
2930

30-
## Quick Start
31-
The following instruction shows how you can start *Earthquake Container*, the simplified CLI for Earthquake.
31+
## Quick Start (Container mode)
32+
The following instructions show how to start *Earthquake Container*, the simplified, Docker-like CLI for Earthquake.
3233

3334

3435
$ sudo apt-get install libzmq3-dev libnetfilter-queue-dev
3536
$ go get github.com/osrg/earthquake/earthquake-container
36-
$ sudo earthquake-container run -it --rm ubuntu bash
37+
$ sudo earthquake-container run -it --rm -v /foo:/foo ubuntu bash
3738

3839

3940
In *Earthquake Container*, you can run an arbitrary command that might be *flaky*.
@@ -59,6 +60,11 @@ explorePolicy = "random"
5960
# Default: 0 and 0
6061
minInterval = "80ms"
6162
maxInterval = "3000ms"
63+
64+
# for Ethernet/Filesystem inspectors, you can specify fault-injection probability (0.0-1.0).
65+
# Default: 0.0
66+
faultActionProbability = 0.0
67+
6268
# for Process inspector, you can specify how to schedule processes
6369
# "mild": execute processes with randomly prioritized SCHED_NORMAL/SCHED_BATCH scheduler.
6470
# "extreme": pick up some processes and execute them with SCHED_RR scheduler. others are executed with SCHED_BATCH scheduler.
@@ -76,31 +82,125 @@ explorePolicy = "random"
7682
```
7783
For other parameters, please refer to [`config.go`](earthquake/util/config/config.go) and [`randompolicy.go`](earthquake/explorepolicy/random/randompolicy.go).
7884

79-
If you don't want to use containers, you can also use Earthquake (process inspector) with an arbitrary process tree.
8085

86+
## Quick Start (Non-container mode)
87+
If you don't want to use containers, please use the `earthquake` command directly.
88+
89+
$ sudo apt-get install libzmq3-dev libnetfilter-queue-dev
8190
$ go get github.com/osrg/earthquake/earthquake
82-
$ sudo earthquake inspectors proc -root-pid $TARGET_PID -watch-interval 1s -autopilot config.toml
8391

84-
For Ethernet inspector,
92+
### Process inspector
8593

86-
$ iptables -A OUTPUT -p tcp -m owner --uid-owner $(id -u johndoe) -j NFQUEUE --queue-num 42
87-
$ sudo earthquake inspectors ethernet -nfq-number 42 -autopilot config.toml
88-
$ sudo -u johndoe $TARGET_PROGRAM
89-
$ iptables -D OUTPUT -p tcp -m owner --uid-owner $(id -u johndoe) -j NFQUEUE --queue-num 42
94+
$ sudo earthquake inspectors proc -root-pid $TARGET_PID -watch-interval 1s
95+
96+
By default, all the processes and the threads under `$TARGET_PID` are randomly scheduled.
97+
98+
You can also specify a config file by running with `-autopilot config.toml`.
99+
100+
You can also set `-orchestrator-url` and `-entity-id` for distributed execution.
101+
102+
Note that the process inspector may not be effective for reproducing short-running flaky tests, but it is still effective for long-running tests: [issue #125](https://github.com/osrg/earthquake/issues/125).
103+
104+
105+
The guide for reproducing flaky Hadoop tests (please use `earthquake` instead of `microearthquake`): [FOSDEM slide 42](http://www.slideshare.net/AkihiroSuda/tackling-nondeterminism-in-hadoop-testing-and-debugging-distributed-systems-with-earthquake-57866497/42).
90106

91-
For Filesystem inspector,
107+
108+
### Filesystem inspector (FUSE)
92109

93110
$ mkdir /tmp/{eqfs-orig,eqfs}
94-
$ sudo earthquake inspectors fs -original-dir /tmp/eqfs-orig -mount-point /tmp/eqfs -autopilot config.toml
111+
$ sudo earthquake inspectors fs -original-dir /tmp/eqfs-orig -mount-point /tmp/eqfs
95112
$ $TARGET_PROGRAM_WHICH_ACCESSES_TMP_EQFS
96113
$ sudo fusermount -u /tmp/eqfs
97114

98-
For full-stack (fully-distributed) Earthquake environment, please refer to [doc/how-to-setup-env-full.md](doc/how-to-setup-env-full.md).
115+
By default, all the `read`, `mkdir`, and `rmdir` accesses to the files under `/tmp/eqfs` are randomly scheduled.
116+
`/tmp/eqfs-orig` is just used as the backing storage.
117+
118+
You can also inject faults (currently just injects `-EIO`) by setting `explorePolicyParam.faultActionProbability` in the config file.
119+
120+
### Ethernet inspector (Linux netfilter_queue)
121+
122+
$ iptables -A OUTPUT -p tcp -m owner --uid-owner $(id -u johndoe) -j NFQUEUE --queue-num 42
123+
$ sudo earthquake inspectors ethernet -nfq-number 42
124+
$ sudo -u johndoe $TARGET_PROGRAM
125+
$ iptables -D OUTPUT -p tcp -m owner --uid-owner $(id -u johndoe) -j NFQUEUE --queue-num 42
126+
127+
By default, all the packets for `johndoe` are randomly scheduled (with some optimization for TCP retransmission).
128+
129+
You can also inject faults (currently just drop packets) by setting `explorePolicyParam.faultActionProbability` in the config file.
130+
131+
### Ethernet inspector (OpenFlow 1.3)
132+
133+
You have to install [ryu](https://github.com/osrg/ryu) and [hookswitch](https://github.com/osrg/hookswitch) for this feature.
134+
135+
$ sudo pip install ryu hookswitch
136+
$ sudo hookswitch-of13 ipc:///tmp/hookswitch-socket --tcp-ports=4242,4243,4244
137+
$ sudo earthquake inspectors ethernet -hookswitch ipc:///tmp/hookswitch-socket
99138

100-
[The slides for the presentation at FOSDEM](http://www.slideshare.net/AkihiroSuda/tackling-nondeterminism-in-hadoop-testing-and-debugging-distributed-systems-with-earthquake-57866497/42) might be also helpful.
139+
Please also refer to [doc/how-to-setup-env-full.md](doc/how-to-setup-env-full.md) for this feature.
140+
141+
### Java inspector (AspectJ, byteman)
142+
143+
To be documented
144+
145+
### Distributed execution
146+
147+
Basically please follow these examples: [example/zk-found-2212.ryu](example/zk-found-2212.ryu), [example/zk-found-2212.nfqhook](example/zk-found-2212.nfqhook)
148+
149+
#### Step 1
150+
Prepare `config.toml` for distributed execution.
151+
Example:
152+
```toml
153+
# executed in `earthquake init`
154+
init = "init.sh"
155+
156+
# executed in `earthquake run`
157+
run = "run.sh"
158+
159+
# executed in `earthquake run` as the test oracle
160+
validate = "validate.sh"
161+
162+
# executed in `earthquake run` as the clean-up script
163+
clean = "clean.sh"
164+
165+
# REST port for the communication.
166+
# You can also set pbPort for ProtocolBuffers (Java inspector)
167+
restPort = 10080
168+
169+
# of course you can also set explorePolicy here as well
170+
```
171+
172+
#### Step 2
173+
Create `materials` directory, and put `*.sh` into it.
174+
175+
#### Step 3
176+
Run `earthquake init --force config.toml materials /tmp/x`.
177+
178+
This command executes `init.sh` for initializing the workspace `/tmp/x`.
179+
`init.sh` can access the `materials` directory as `${EQ_MATERIALS_DIR}`.
180+
181+
#### Step 4
182+
Run `for f in $(seq 1 100);do earthquake run /tmp/x; done`.
183+
184+
This command starts the orchestrator, and executes `run.sh`, `validate.sh`, and `clean.sh` for testing the system (100 times).
185+
186+
`run.sh` should invoke multiple Earthquake inspectors: `earthquake inspectors <proc|fs|ethernet> -entity-id _some_unique_string -orchestrator-url http://127.0.0.1:10080`
187+
188+
`*.sh` can access the `/tmp/x/{00000000, 00000001, 00000002, ..., 00000063}` directory as `${EQ_WORKING_DIR}`, which is intended for putting test results and some relevant information. (Note: 0x63==99)
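
The directory names listed above look like zero-padded, 8-digit hexadecimal run counters. The following Go sketch reproduces that naming under this assumption; `workingDir` is a hypothetical helper for this example, and the real implementation may build the path differently.

```go
package main

import (
	"fmt"
	"path"
)

// workingDir returns the per-run working directory under the storage directory,
// assuming the run counter is formatted as 8 zero-padded lowercase hex digits.
// Hypothetical helper for illustration only (not Earthquake's API).
func workingDir(storage string, run int) string {
	return path.Join(storage, fmt.Sprintf("%08x", run))
}

func main() {
	for _, run := range []int{0, 1, 2, 99} {
		fmt.Println(workingDir("/tmp/x", run))
	}
}
```

For example, `workingDir("/tmp/x", 99)` yields `/tmp/x/00000063`, which matches the note that 0x63 == 99.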
189+
190+
`validate.sh` should exit with zero for successful executions, and with non-zero status for failed executions.
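
The exit-status convention for the test oracle can be sketched in Go as follows. This is only an illustration of how a harness might interpret the status, not the actual orchestrator code; `runOracle` is a hypothetical helper, and the example uses `sh -c` commands as stand-ins for a real `validate.sh`.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

// runOracle runs a command as the test oracle.
// It returns true if the command exits with zero, false if it exits with a
// non-zero status, and an error if the command could not be run at all.
// Hypothetical helper for illustration only (not Earthquake's API).
func runOracle(name string, args ...string) (bool, error) {
	err := exec.Command(name, args...).Run()
	if err == nil {
		return true, nil
	}
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) {
		// The command ran but reported a failed execution.
		return false, nil
	}
	return false, err
}

func main() {
	// Stand-ins for validate.sh: one successful and one failed execution.
	for _, script := range []string{"exit 0", "exit 1"} {
		ok, err := runOracle("sh", "-c", script)
		if err != nil {
			fmt.Println("could not run oracle:", err)
			continue
		}
		fmt.Printf("%q: successful=%v\n", script, ok)
	}
}
```

Separating "the oracle failed" from "the oracle could not be started" keeps a missing or non-executable script from being counted as a reproduced bug.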
191+
192+
`clean.sh` is an optional clean-up script for each execution.
193+
194+
#### Step 5
195+
Run `earthquake summary /tmp/x` for summarizing the result.
196+
197+
If you have [JaCoCo](http://eclemma.org/jacoco/) coverage data, you can run `java -jar bin/earthquake-analyzer.jar --classes-path /somewhere/classes /tmp/x` for counting execution patterns as in [FOSDEM slide 18](http://www.slideshare.net/AkihiroSuda/tackling-nondeterminism-in-hadoop-testing-and-debugging-distributed-systems-with-earthquake-57866497/18).
198+
199+
![doc/img/exec-pattern.png](doc/img/exec-pattern.png)
101200

102201
## Talks
103202

203+
* [CoreOS Fest](http://sched.co/6Szb) (May 9-10, 2016, Berlin)
104204
* [ApacheCon Core North America](http://events.linuxfoundation.org/events/apachecon-north-america/program/schedule) (May 11-13, 2016, Vancouver)
105205
* [FOSDEM](https://fosdem.org/2016/schedule/event/nondeterminism_in_hadoop/) (January 30-31, 2016, Brussels)
106206
* The poster session of [ACM Symposium on Cloud Computing (SoCC)](http://acmsocc.github.io/2015/) (August 27-29, 2015, Hawaii)
@@ -116,7 +216,8 @@ Released under [Apache License 2.0](LICENSE).
116216

117217
---------------------------------------
118218

119-
## API Overview
219+
## API for your own exploration policy
220+
120221
```go
121222
// implements earthquake/explorepolicy/ExplorePolicy interface
122223
type MyPolicy struct {

doc/img/exec-pattern.png

42.3 KB

earthquake-container/cli/run/run.go

Lines changed: 23 additions & 0 deletions
@@ -54,11 +54,34 @@ func prepare(args []string) (dockerOpt *docker.CreateContainerOptions, removeOnE
 	return
 }
 
+func help() string {
+	// FIXME: why not use the strings in runflag.go?
+	s := `Usage: earthquake-container run [OPTIONS] IMAGE COMMAND
+
+Run a command in a new Earthquake Container
+
+Docker-compatible options:
+-d, --detach [NOT SUPPORTED] Run container in background and print container ID
+-i, --interactive Keep STDIN open even if not attached
+--name Assign a name to the container
+--rm Automatically remove the container when it exits
+-t, --tty Allocate a pseudo-TTY
+-v, --volume=[] Bind mount a volume
+
+Earthquake-specific options:
+-eq-config Earthquake configuration file
+
+NOTE: Unlike docker, COMMAND is mandatory at the moment.
+`
+	return s
+}
+
 func Run(args []string) int {
 	dockerOpt, removeOnExit, eqCfg, err := prepare(args)
 	if err != nil {
 		// do not panic here
 		fmt.Fprintf(os.Stderr, "%s\n", err)
+		fmt.Fprintf(os.Stderr, "\n%s\n", help())
 		return 1
 	}

earthquake/cli/init.go

Lines changed: 2 additions & 2 deletions
@@ -244,15 +244,15 @@ type initCmd struct {
 }
 
 func (cmd initCmd) Help() string {
-	return "init help (todo)"
+	return "Please run `earthquake --help run` instead"
 }
 
 func (cmd initCmd) Run(args []string) int {
 	return _init(args)
 }
 
 func (cmd initCmd) Synopsis() string {
-	return "Initialize storage directory"
+	return "Initialize the workspace for \"run\" command"
 }
 
 func initCommandFactory() (cli.Command, error) {

earthquake/cli/inspectors.go

Lines changed: 55 additions & 9 deletions
@@ -27,16 +27,57 @@ type inspectorsCmd struct {
 }
 
 func (cmd inspectorsCmd) Help() string {
+	// FIXME: much more helpful help string
 	return `
-Earthquake Inspectors
-- proc: Process inspector
-- fs: Filesystem inspector
-- ethernet: Ethernet inspector
-
-NOTE: this binary does NOT include following inspectors:
-- Java Inspector: (included in earthquake/inspector/java)
-- C Inspector: (included in earthquake/inspector/c)
-`
+The inspectors command starts an Earthquake inspector.
+
+If -orchestrator-url is set, the inspector connects to the external orchestrator.
+For how to start the external orchestrator, please refer to the help of the run command.
+(earthquake --help run)
+
+Note that you have to set -entity-id to a unique value if you connect multiple inspectors to the external orchestrator.
+
+If -orchestrator-url is not set, the inspector connects to the embedded orchestrator.
+You can specify the configuration file for the embedded orchestrator by setting -autopilot <config.toml>.
+
+
+Process inspector (proc)
+Inspects running Linux process information, and sets scheduling attributes.
+
+Typical usage: earthquake inspectors proc -root-pid 42 -watch-interval 1s
+
+Event signals: ProcSetEvent
+Action signals: ProcSetSchedAction
+
+
+Filesystem inspector (fs)
+Inspects file access information, and injects delays and faults.
+Implemented in FUSE.
+
+Typical usage: earthquake inspectors fs -original-dir /tmp/eqfs-orig -mount-point /tmp/eqfs
+
+Event signals: FilesystemEvent
+Action signals: EventAcceptanceAction, FilesystemFaultAction
+
+
+Ethernet inspector (ethernet)
+Inspects Ethernet packet information, and injects delays and faults.
+Implemented in Linux netfilter / OpenFlow.
+For the OpenFlow implementation, you have to install hookswitch: https://github.com/osrg/hookswitch
+
+Typical usage: earthquake inspectors ethernet -nfq-number 42
+
+Event signals: PacketEvent
+Action signals: EventAcceptanceAction, PacketFaultAction
+
+
+NOTE: this binary does NOT include the following inspectors:
+Java Inspector: (included in misc/inspector/java)
+C Inspector: (included in misc/inspector/c, NOT MAINTAINED)
+
+NOTE: Python implementation for Ethernet inspector is also available in misc/pyearthquake.
+You can also implement your own inspector in an arbitrary language.
+`
 }
 
 func (cmd inspectorsCmd) Run(args []string) int {
@@ -47,6 +88,11 @@ func (cmd inspectorsCmd) Run(args []string) int {
 		"fs":       inspectors.FsCommandFactory,
 		"ethernet": inspectors.EtherCommandFactory,
 	}
+	c.HelpFunc = func(commands map[string]mcli.CommandFactory) string {
+		s := (mcli.BasicHelpFunc("earthquake inspectors"))(commands)
+		s += cmd.Help()
+		return s
+	}
 
 	exitStatus, err := c.Run()
 	if err != nil {
