-
Notifications
You must be signed in to change notification settings - Fork 131
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Issue #344: add multi-thread support to CheckstyleReportsParser #514
Conversation
@rnveach , please write a summary of benefit to have MT mode. Is it only for extremely large reports ? Or it is benefitial on real life examples of use that we have currently in our PRs as regression report. |
The benefit of MT is that the process will now split any work that was solely done in a single thread to be done simultaneously into separate threads, allowing the execution to be split among as many available threads open to the system for however big the reports and differences get. Parsing of the 2 XML files, base and patch as represented by O(B + P) time (basically O(2n) time), is now split into 2 different threads allowing parsing time decreased to the smallest of either O(B) and O(P), which basically amounts to now O(n) time. Finding differences on the violations in the XML files is where the most benefit is seen. Difference finding, which is comparing 2 lists TWICE and amounts to O(2 * B * P) time (basically an O(n^2) complexity), is now split into as small as 100 record chunks to each thread that is available on the system. This haves a dramatic effect on time proportional to how many threads are on the system, theoretically amounting to O(2 * B * P / T) time, where T is number of threads given to the process. The more threads available and the more records to process, the more time is saved with multi-threading. I think with all the savings, this may reduce the complexity to basically O(n log n) time. ( https://stackoverflow.com/questions/1592649/examples-of-algorithms-which-has-o1-on-log-n-and-olog-n-complexities )
No. Anything with over 100 violations in any XML file with more than 1 thread available will see improvements. The bigger the violations and the threads, the more improvements seen.
Mine was a real life example as presented in issue. The more violations a produced in the final XML, the more savings will be generated. Pitest regression ( https://github.com/checkstyle/contribution/tree/master/checkstyle-tester#checkstyle-pitest-regression ) multiplies the number of violations seen by 1 check and makes the number of violations bigger than normal. Indentation also produces possibly many violations just per file and will the be the single check to show the most improvement. Remember improvements are seen based on the more violations in the XML file. This search routine is O(n^2) complexity. |
I did not mean general ideas of why MT mode is better. I meant, what is reduction of execution time for CI, for example for checkstyle/checkstyle#8913 . |
That is explained as I go into O(n) time complexity. I explain how the routines changed will now behave. The O notation is used to describe how many comparisons a routine does. The less the time complexity, the less comparisons done, the faster the routine generally is. O(1) is less time than O(n) https://stackoverflow.com/questions/1592649/examples-of-algorithms-which-has-o1-on-log-n-and-olog-n-complexities which I linked to shows examples of code for each complexity. Main issue provides some numbers an actual run but I don't have the file to reproduce and provide more details from it, I can only explain the time reduction in terms of comparisons as there are many variables.
Can you post to exact comment because I am not seeing anything. |
I understand notation. I curious how much quicker generation of report be after merge if this update. So I provided link to PR with configs. |
patch-diff-report-tool/src/main/java/com/github/checkstyle/data/DiffReport.java
Show resolved
Hide resolved
final ExecutorService executor = | ||
Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors()); | ||
final List<Future<List<CheckstyleRecord>>> futures = new ArrayList<>(); | ||
final int size = list1.size(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this looks as hardcoded
private static List<CheckstyleRecord> produceDiff(
List<CheckstyleRecord> list1, List<CheckstyleRecord> list2) {
return Stream.concat(
StreamSupport.stream(list1.spliterator(), SPLIT_SIZE <= list1.size())
.filter(rec -> !isInList(list2, rec)),
StreamSupport.stream(list2.spliterator(), SPLIT_SIZE <= list2.size())
.filter(rec -> !isInList(list1, rec))
)
.collect(Collectors.toList());
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this support multi-threading? The whole issue is we need this multi-threaded to reduce processing time looping through the 2 lists. If everything is processed in 1 thread then there is no time saving.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
StreamSupport.stream(list1.spliterator(), SPLIT_SIZE <= list1.size())
is basically list1.stream()
if the size is less then SPLIT_SIZE
. Otherwise it is list1.parallelStream()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not familar with streams but I assume parallel is multi-threaded. I will compare and see how they stack against each other.
@rnveach what about such optimization: List<CheckstyleRecord> shortest;
List<CheckstyleRecord> longest;
if (list1.size() < list2.size()) {
shortest = list1;
longest = list2;
}
else {
shortest = list2;
longest = list1;
}
// O(log(N))
Set<CheckstyleRecord> recordSet =
StreamSupport.stream(shortest.spliterator(), SPLIT_SIZE <= shortest.size()) // Do we need threading here?
.collect(Collectors.toCollection(ConcurrentSkipListSet::new));
// O(M * log(N)), M > N
List<CheckstyleRecord> diff =
StreamSupport.stream(longest.spliterator(), SPLIT_SIZE <= longest.size())
.filter(rec -> !recordSet.remove(rec)) // Remove from both collections if matches
.collect(Collectors.toCollection(ArrayList::new));
diff.addAll(recordSet);
return diff; Could you run your benchmark on this code? |
I can add it to the list to benchmark but I don't understand the code to know what it is doing. Is this a replacement for
Imagine both lists are of infinite size, it doesn't matter who is smaller. This is how I read |
In this case, may be we should keep these lists sorted? It is easy to compare two sorted lists. Or even store them in a sorted collection, like private Map<String, ConcurrentSkipListSet<CheckstyleRecord>> records ? |
It looks like we sorted the results after the diff. contribution/patch-diff-report-tool/src/main/java/com/github/checkstyle/data/DiffReport.java Line 113 in 44f0290
If we are going to sort beforehand, wouldn't it be better to sort the records as they are added to the list? |
@rnveach , please share comparison of executions, you might use Indentation Check to make sure that reports are huge in diff. |
Please take a look to my PR #521 |
Issue #344
I feel I have been sitting on this long enough. I have not run into any issues in all the times I have used it.