Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Skip invalid json files in ReportBuilder #384

Merged
merged 13 commits into from
Dec 20, 2023

Conversation

TTianshun
Copy link
Contributor

  1. Now the ReporterBuilder will ignore the corrupted json and only combine those valid json. It will warn user if there's corrupted json file.
    image

  2. When waiting for child, the main process will find all subprocess recursively and check if the subproceess has terminated.

@codecov-commenter
Copy link

codecov-commenter commented Dec 14, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (56faa75) 100.00% compared to head (41c31b0) 100.00%.

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff            @@
##            master      #384   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           21        21           
  Lines         2233      2249   +16     
=========================================
+ Hits          2233      2249   +16     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@TTianshun
Copy link
Contributor Author

Seemd that there's not an easy way to get children on mac with pure python. I skipped find all subprocess on mac.

@gaogaotiantian
Copy link
Owner

Okay. I think the code to check existing pids on Windows is a bit too invasive. Bringing up another subprocess to run a cmd based command probably makes this more complicated than it is supposed to be, and the problem it solves is not as important.

We can deal with SIGTERM on Linux, so Linux actually has a much smaller issue compared to Windows. The Linux code is cleaner, but less useful.

The benefit to introduce such a large piece of fragile code is not significant enough - we still can't recover the data, we only know the process actually finished. I'm not saying it's not useful, but the code might be hard to maintain in a long term.

The skipping of corrupted json report on the other hand, is much useful - it keeps the data from finished processes. There are some stuff that I'd like to change on the UI side, but overall I think that makes sense.

So for now, could you remove the pid checking part, just leave the report piece? I think Ctrl+C to cancel is enough.

However, there might be a cleaner way to do this, without relying on the cmdlets and Linux path for processes so it's cross-platform. We can check the size of the json report. If it has not changed for a while, then the process must be dead (at least we can warn the users). Or, we can at least list out the processes that we are waiting for (requires some techniques to refresh the line), and the user might realize it's stuck.

For this PR, let's do one thing at a time. Do the json report skipping first, then let's do the feature to detect dead processes.

@TTianshun
Copy link
Contributor Author

TTianshun commented Dec 16, 2023

Okay. I think the code to check existing pids on Windows is a bit too invasive. Bringing up another subprocess to run a cmd based command probably makes this more complicated than it is supposed to be, and the problem it solves is not as important.

We can deal with SIGTERM on Linux, so Linux actually has a much smaller issue compared to Windows. The Linux code is cleaner, but less useful.

The benefit to introduce such a large piece of fragile code is not significant enough - we still can't recover the data, we only know the process actually finished. I'm not saying it's not useful, but the code might be hard to maintain in a long term.

The skipping of corrupted json report on the other hand, is much useful - it keeps the data from finished processes. There are some stuff that I'd like to change on the UI side, but overall I think that makes sense.

So for now, could you remove the pid checking part, just leave the report piece? I think Ctrl+C to cancel is enough.

However, there might be a cleaner way to do this, without relying on the cmdlets and Linux path for processes so it's cross-platform. We can check the size of the json report. If it has not changed for a while, then the process must be dead (at least we can warn the users). Or, we can at least list out the processes that we are waiting for (requires some techniques to refresh the line), and the user might realize it's stuck.

For this PR, let's do one thing at a time. Do the json report skipping first, then let's do the feature to detect dead processes.

Yes the current code to check existing pids is not suitable. For the children of multiprocessing. Maybe we can use multiprocessing.active_children() to get it's process info. For subprocess, there's not a similar method, so I use this to get all the children. Maybe we can add a similar method on subprocess and patch it.

If we use file size to check if the process is running,I'm worried that there's no json file if the child has not yet run to it's finalizer.

@gaogaotiantian
Copy link
Owner

Seems like you can use the API interface for WMI https://stackoverflow.com/a/568589. That's a much cleaner way to check the existing PIDs. Combining with the current pid_exists for POSIX, I think that's good enough.

So - get the pids to wait for from the viztmp file -> that's deterministic and we can control that part, less dependency. For each pid, check whether that pid exists by one function.

(The windows api is not the fastest, you may either need some cache mechanism, or have a new function pids_exist to batch process them)

However, do this in a separate PR. We want to keep each PR as single purpose as possible.

Copy link
Owner

@gaogaotiantian gaogaotiantian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like I mentioned, in combine_json(), if there's no valid json, but there are inputs (all of them invalid), we should generate a better message to the users.

@@ -67,10 +67,16 @@ def load_jsons(self) -> None:
self.jsons = [get_json(self.data)]
elif isinstance(self.data, (list, tuple)):
self.jsons = []
corrupted_jsons = []
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's call it self.invalid_json_paths as this does not store json data and it might not be "corrupted" - could be not json at all.

We need this in the future because if there's no valid json files, we will raise an exception and the final message won't be printed.

valid_json_file = shutil.copy(valid_json_path, os.path.join(tmpdir, "valid.json"))
rb = ReportBuilder([corrupted_json_file, valid_json_file], verbose=1)
with io.StringIO() as s:
rb.save(s)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should check for some printed messages here. This test only assures no exceptions raised.

@TTianshun
Copy link
Contributor Author

Like I mentioned, in combine_json(), if there's no valid json, but there are inputs (all of them invalid), we should generate a better message to the users.

Now it will still raise an exception if all of the json files are invalid. I've changed the error message and make it print warning before raise an exception. Just like this picture:
image

Comment on lines 87 to 92
if not self.jsons and not self.invalid_json_paths:
raise ValueError("Can't get report of nothing")
elif not self.jsons:
# print corrupted json files
self.print_messages()
raise ValueError("All json files are corrupted")
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if not self.jsons and not self.invalid_json_paths:
raise ValueError("Can't get report of nothing")
elif not self.jsons:
# print corrupted json files
self.print_messages()
raise ValueError("All json files are corrupted")
if not self.jsons:
if not self.invalid_json_paths:
raise ValueError("Can't get report of nothing")
else:
raise ValueError("No valid json files found")

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing the exception message should be enough - this is a rather rare case, just separate that from "no files".

except json.JSONDecodeError:
self.invalid_json_paths.append(j)
if len(self.invalid_json_paths) > 0:
self.final_messages.append(("corrupted_json", {"paths": self.invalid_json_paths}))
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do a full search on this PR and change all "corrupted" to "invalid".

raise ValueError("Can't get report of nothing")
elif not self.jsons:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you hate nested if so much :) . You are evaluating self.jsons twice (performance wise it's not a big deal, but logic wise it's not clean).

Just check self.jsons first, and raise different errors based on self.invalid_json_paths. Nested if works in many cases.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you hate nested if so much :) . You are evaluating self.jsons twice (performance wise it's not a big deal, but logic wise it's not clean).

Just check self.jsons first, and raise different errors based on self.invalid_json_paths. Nested if works in many cases.

Sorry, I didn't notice that. And thank you for correcting my bad habit.

@gaogaotiantian gaogaotiantian changed the title Optimize ReporterBuilder and multi-process based on issue 364 Skip invalid json files in ReportBuilder Dec 20, 2023
@gaogaotiantian gaogaotiantian merged commit 09b2489 into gaogaotiantian:master Dec 20, 2023
43 checks passed
@gaogaotiantian
Copy link
Owner

Good job on this, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants