[CELEBORN-1492] Introduce Celeborn Chaos Testing Framework #3091
We also need to change the celeborn-class and make-distribution files for compiling and packaging.
To use this framework, we also need to add some shell scripts in sbin.
I added a few changes to your local branch (pr); please help review.
verifier/src/main/scala/org/apache/celeborn/verifier/conf/VerifierConf.scala
Ping @zaynt4606.
look at here
@zaynt4606, PTAL.
LGTM.
One nit: when the CLI thread finishes, the log always says "failed to launch..." in execute_command() of celeborn-daemon.sh.
execute_command() could be:
execute_command() {
  if [ -z "${CELEBORN_NO_DAEMONIZE+set}" ]; then
    exec nohup -- "$@" >> "$log" 2>&1 < /dev/null &
    newpid="$!"
    echo "$newpid" > "$pid"
    # Poll for up to 5 seconds for the java process to start
    for i in {1..10}; do
      if [[ $(ps -p "$newpid" -o comm=) =~ "java" ]] || [[ $(ps -p "$newpid" -o comm=) =~ "jboot" ]]; then
        break
      fi
      sleep 0.5
    done
    sleep 2
    # Check whether the process has exited; on failure, tail the log so the user can see why
    if ! ps -p "$newpid" > /dev/null; then
      # Process is no longer running, check its exit status
      wait "$newpid"
      exit_code=$?
      if [[ $exit_code -ne 0 ]]; then
        echo "failed to launch: $*"
        tail -10 "$log" | sed 's/^/ /'
        echo "full log in $log"
      else
        echo "process completed successfully with exit code 0"
      fi
    else
      # Process is still running; report a failure only if it is not a java/jboot process
      if [[ ! $(ps -p "$newpid" -o comm=) =~ "java" ]] && [[ ! $(ps -p "$newpid" -o comm=) =~ "jboot" ]]; then
        echo "failed to launch: $*"
        tail -10 "$log" | sed 's/^/ /'
        echo "full log in $log"
      fi
    fi
  else
    exec "$@"
  fi
}
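The core of the suggestion above is to tell a clean exit apart from a failed launch by checking the status returned by `wait`. A minimal standalone sketch of just that step follows; `report_exit` is a hypothetical helper introduced here for illustration and is not part of celeborn-daemon.sh, and it omits the logging, pid file, and polling that the real function performs.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of celeborn-daemon.sh): run a command in the
# background, wait for it to finish, and report based on its exit status.
report_exit() {
  "$@" &
  local child_pid=$!
  local exit_code=0
  # wait returns the exit status of the background child
  wait "$child_pid" || exit_code=$?
  if [[ $exit_code -ne 0 ]]; then
    echo "failed to launch: $* (exit code $exit_code)"
  else
    echo "process completed successfully with exit code 0"
  fi
}

report_exit true    # a command that exits with status 0
report_exit false   # a command that exits with a non-zero status
```

With this shape, a short-lived CLI command that finishes normally is reported as a success, and only a non-zero exit status produces the "failed to launch" message.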
What changes were proposed in this pull request?
Introduce Celeborn Chaos Testing Framework.
Why are the changes needed?
A chaos testing framework simulates unpredictable and adverse conditions in distributed systems to validate their robustness and resilience. CIP-10 (Introduce Celeborn Chaos Testing Framework) aims to simulate various anomalies and test the stability of Celeborn in distributed environments via chaos testing.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
GA.