Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

check pid exists when waiting for child #388

Merged
merged 3 commits into from
Jan 8, 2024

Conversation

TTianshun
Copy link
Contributor

@TTianshun TTianshun commented Dec 25, 2023

Added child check when waiting for children to finish, in case that child failed to delete viztmp file. The current pid_exists has a issue that if a process return 259 (STILL_ACTIVE) as exit code (which is a really rare case that is nearly impossible I guess).

@codecov-commenter
Copy link

codecov-commenter commented Dec 25, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (68b7784) 100.00% compared to head (072b4ee) 100.00%.
Report is 1 commits behind head on master.

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff            @@
##            master      #388   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           21        21           
  Lines         2249      2276   +27     
=========================================
+ Hits          2249      2276   +27     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Owner

@gaogaotiantian gaogaotiantian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this looks good, a couple of comments.

@@ -590,7 +590,16 @@ def wait_children_finish(self) -> None:
try:
if any((f.endswith(".viztmp") for f in os.listdir(self.multiprocess_output_dir))):
same_line_print("Wait for child processes to finish, Ctrl+C to skip")
while any((f.endswith(".viztmp") for f in os.listdir(self.multiprocess_output_dir))):
while True:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Logic here can be simplified. You don't need to check if remain_viztmp is empty if you need to iterate through it. Do the for loop directly and your remaining logic will take care of it.

int(f[7:-12]) is a horrible magic number. Yit would be better to use regular expression here because it's self-explainable. Also, if there happen to be a file that does not match the pattern, this implementation will raise an index error or a integer conversion error, which would be a huge confusion to users. For regular expression, if it does not match, you can issue a warning (might be something we change in other places but forgot to match here).

You probably do not need a remain_children variable, you can iterate through remain_viztmp directly.

elif err.errno == errno.EPERM:
# EPERM clearly means there's a process to deny access to
# UNIX
if sys.platform != "win32":
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's make the logic "positive" here. Do if sys.platform == "win32" then windows logic, and Unix in else with a comment stating this is Unix.

tests/test_regression.py Show resolved Hide resolved

class TestWaitForChild(CmdlineTmpl):
def test_child_process_exits_normally(self):
self.template(["viztracer", "-o", "result.json", "--dump_raw", "cmdline_test.py"],
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not think too much but is --dump_raw necessary here?

if pid_exists(pid):
break
else: # pragma: no cover
color_print("WARNING", f"Can't parse {viztmp_file}, skip")
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Warning messge is a bit confusing. Let's do something like "Unknown viztmp file {viztmp_file}"

# Usually it's impossible to run here in viztracer.
return True
return False # pragma: no cover
# UNIX
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment should be inside else case, not above it.

p.start()
# The main process will join the child in multiprocessing.process._children.
# This is a hack to make sure the main process won't join the child process,
# so we can test the wait_children_finish function
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For both comments, let's mention the class (I think it's VizUI.wait_children_finish?). Because we are discussing in the context of multiprocessing here and it's easy for people to mistaken this function as a multiprocessing function.

@TTianshun
Copy link
Contributor Author

Fixed it!

@gaogaotiantian gaogaotiantian merged commit 3dd4f3f into gaogaotiantian:master Jan 8, 2024
43 checks passed
@gaogaotiantian
Copy link
Owner

I believe this PR is making the nightly validation unstable - https://github.com/gaogaotiantian/viztracer/actions/runs/7539946132/job/20523513962 (not sure if you have access to it).

The issue is the child process only lives for 1 second, and the parent process might need more to reach the trace collection part. github actions is normally much slower than our PC. There are multiple tests that use different timeout on github actions (search for GITHUB_ACTIONS) and we should probably do the same here. (To not increase the local full-test time and stablize the remote one).

@TTianshun
Copy link
Contributor Author

I believe this PR is making the nightly validation unstable - https://github.com/gaogaotiantian/viztracer/actions/runs/7539946132/job/20523513962 (not sure if you have access to it).

The issue is the child process only lives for 1 second, and the parent process might need more to reach the trace collection part. github actions is normally much slower than our PC. There are multiple tests that use different timeout on github actions (search for GITHUB_ACTIONS) and we should probably do the same here. (To not increase the local full-test time and stablize the remote one).

I agree with that if using a much slower PC the test will fail. It takes about 0.4 seconds on my computer between _winapi.CreateProcess and the child starting. The time may be longer on github actions. So when the parent collects the data, the child might haven't created viztmp file.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants