Issue #344: add multi-thread support to CheckstyleReportsParser #514

rnveach · 2020-10-23T01:21:35Z

Issue #344

I feel I have been sitting on this long enough. I have not run into any issues in all the times I have used it.

romani · 2020-10-24T05:52:40Z

@rnveach, please write a summary of the benefit of having MT mode. Is it only for extremely large reports? Or is it beneficial on the real-life examples of use that we currently have in our PRs as regression reports?

rnveach · 2020-10-24T06:48:13Z

The benefit of MT is that work previously done solely in a single thread is now split across separate threads and run simultaneously, so execution can use as many threads as the system has available, however big the reports and differences get.

Parsing of the 2 XML files, base and patch, previously took O(B + P) time (basically O(2n) time). It is now split into 2 different threads, so parsing time drops to the larger of O(B) and O(P), which basically amounts to O(n) time.
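A minimal sketch of the two-thread parsing idea, not the code in this PR. The `parser` argument is a hypothetical stand-in for whatever reads one XML report, and `ParallelParseSketch` and `Parsed` are names invented for illustration.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

public final class ParallelParseSketch {
    /** Holds the parsed records of both reports. */
    public static final class Parsed<R> {
        public final List<R> base;
        public final List<R> patch;

        Parsed(List<R> base, List<R> patch) {
            this.base = base;
            this.patch = patch;
        }
    }

    /**
     * Parses both inputs concurrently. The parser is a hypothetical
     * placeholder for the real report parser.
     */
    public static <I, R> Parsed<R> parseBoth(I baseInput, I patchInput,
            Function<I, List<R>> parser) {
        CompletableFuture<List<R>> base =
            CompletableFuture.supplyAsync(() -> parser.apply(baseInput));
        CompletableFuture<List<R>> patch =
            CompletableFuture.supplyAsync(() -> parser.apply(patchInput));
        // join() waits for each task, so the wall-clock time is roughly
        // that of the slower parse rather than the sum of both.
        return new Parsed<>(base.join(), patch.join());
    }
}
```

With both parses running at once, the total time is bounded by the slower of the two files, which is why the saving is at most a factor of 2.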

Finding differences between the violations in the XML files is where the most benefit is seen. Difference finding compares the 2 lists TWICE, which amounts to O(2 * B * P) time (basically O(n^2) complexity). That work is now split into chunks as small as 100 records, each handed to a thread that is available on the system. This has a dramatic effect on time, proportional to how many threads are on the system, theoretically amounting to O(2 * B * P / T) time, where T is the number of threads given to the process. The more threads available and the more records to process, the more time is saved with multi-threading. Dividing by T is a constant-factor speedup, so the complexity class itself stays O(n^2), but I think that with all the savings the practical running time may come closer to that of an O(n log n) routine. ( https://stackoverflow.com/questions/1592649/examples-of-algorithms-which-has-o1-on-log-n-and-olog-n-complexities )
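The chunking described above can be sketched roughly as follows. This is an illustration under assumptions, not the PR's implementation: `ChunkedDiffSketch` and `missingFrom` are invented names, records are compared with `List.contains`, and the chunk size of 100 is taken from the description above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public final class ChunkedDiffSketch {
    /** Chunk size taken from the "100 records" mentioned above. */
    private static final int CHUNK_SIZE = 100;

    /** Returns the elements of list1 that do not occur in list2. */
    public static <T> List<T> missingFrom(List<T> list1, List<T> list2) {
        ExecutorService executor =
            Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        try {
            List<Future<List<T>>> futures = new ArrayList<>();
            for (int from = 0; from < list1.size(); from += CHUNK_SIZE) {
                int to = Math.min(from + CHUNK_SIZE, list1.size());
                List<T> chunk = list1.subList(from, to);
                // Each task scans list2 once per record of its own chunk.
                futures.add(executor.submit(() -> {
                    List<T> missing = new ArrayList<>();
                    for (T rec : chunk) {
                        if (!list2.contains(rec)) {
                            missing.add(rec);
                        }
                    }
                    return missing;
                }));
            }
            // Collect in submission order so the result order is stable.
            List<T> result = new ArrayList<>();
            for (Future<List<T>> future : futures) {
                try {
                    result.addAll(future.get());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException(e);
                } catch (ExecutionException e) {
                    throw new IllegalStateException(e.getCause());
                }
            }
            return result;
        } finally {
            executor.shutdown();
        }
    }
}
```

A full diff would call `missingFrom(base, patch)` and `missingFrom(patch, base)`, which is the "twice" in O(2 * B * P). Each call still performs B * P comparisons in total; the threads only share them out.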

Is it only for extremely large reports ?

No. Anything with over 100 violations in any XML file and more than 1 thread available will see improvements. The more violations and threads there are, the greater the improvement.

is it beneficial on the real-life examples of use that we currently have in our PRs as regression reports?

Mine was a real-life example, as presented in the issue. The more violations are produced in the final XML, the more savings will be generated. Pitest regression ( https://github.com/checkstyle/contribution/tree/master/checkstyle-tester#checkstyle-pitest-regression ) multiplies the number of violations seen by 1 check and makes the number of violations bigger than normal. Indentation can also produce many violations in a single file, and will be the single check to show the most improvement.

Remember that the improvement grows with the number of violations in the XML file, because this search routine has O(n^2) complexity.

romani · 2020-10-24T12:46:07Z

I did not mean the general idea of why MT mode is better.

I meant, what is the reduction of execution time for CI, for example for checkstyle/checkstyle#8913?

rnveach · 2020-10-24T20:50:01Z

I meant, what is the reduction of execution time for CI

That is explained where I go into O(n) time complexity. I explain how the changed routines will now behave. The O notation is used to describe how many comparisons a routine does. The lower the time complexity, the fewer comparisons are done, and the faster the routine generally is.

O(1) is less time than O(n)
O(n) is less time than O(A * n) where A is any constant greater than 1
O(n) is less time than O(n log n)
O(n log n) is less time than O(n^2)
etc...

https://stackoverflow.com/questions/1592649/examples-of-algorithms-which-has-o1-on-log-n-and-olog-n-complexities which I linked to shows examples of code for each complexity.

The main issue provides some numbers from an actual run, but I don't have the file to reproduce it and provide more details. I can only explain the time reduction in terms of comparisons, as there are many variables.

for example for checkstyle/checkstyle#8913 .

Can you link to the exact comment? I am not seeing anything.

romani · 2020-10-24T23:54:03Z

I understand the notation.
But if we optimize a process that takes 10 seconds down to 5 seconds, in O terms it is awesome, but in real life it is not that much, especially if the whole surrounding process takes 40 minutes.

I am curious how much quicker report generation will be after the merge of this update. That is why I provided a link to a PR with configs.
I need to weigh the maintainability cost of the code against the benefit of the time saved in getting the report ready.

pbludov · 2020-10-29T06:38:08Z

patch-diff-report-tool/src/main/java/com/github/checkstyle/data/DiffReport.java

+            final ExecutorService executor =
+                Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
+            final List<Future<List<CheckstyleRecord>>> futures = new ArrayList<>();
+            final int size = list1.size();


This looks hardcoded.

    private static List<CheckstyleRecord> produceDiff(
            List<CheckstyleRecord> list1, List<CheckstyleRecord> list2) {
        return Stream.concat(
                StreamSupport.stream(list1.spliterator(), SPLIT_SIZE <= list1.size())
                    .filter(rec -> !isInList(list2, rec)),
                StreamSupport.stream(list2.spliterator(), SPLIT_SIZE <= list2.size())
                    .filter(rec -> !isInList(list1, rec))
            )
            .collect(Collectors.toList());
    }

rnveach · 2020-10-30

Does this support multi-threading? The whole point of the issue is that we need this to be multi-threaded to reduce the processing time of looping through the 2 lists. If everything is processed in 1 thread, then there is no time saving.

pbludov · 2020-10-30

StreamSupport.stream(list1.spliterator(), SPLIT_SIZE <= list1.size()) is basically list1.stream() if the size is less than SPLIT_SIZE. Otherwise it is list1.parallelStream().

rnveach · 2020-10-30

I am not familiar with streams, but I assume parallel is multi-threaded. I will compare and see how they stack up against each other.
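For readers unfamiliar with the second argument of `StreamSupport.stream`, here is a small self-contained sketch of how that flag switches between a sequential and a parallel stream. The `SPLIT_SIZE` value of 100 is an assumption (the real constant is not shown in this thread), and `ParallelFlagSketch`, `missingFrom` and `isParallelFor` are names made up for illustration.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.StreamSupport;

public final class ParallelFlagSketch {
    /** Assumed threshold; the real SPLIT_SIZE value is not shown in this thread. */
    private static final int SPLIT_SIZE = 100;

    /** Elements of list1 absent from list2; parallel only when list1 is large enough. */
    public static <T> List<T> missingFrom(List<T> list1, List<T> list2) {
        boolean parallel = SPLIT_SIZE <= list1.size();
        return StreamSupport.stream(list1.spliterator(), parallel)
            .filter(rec -> !list2.contains(rec))
            .collect(Collectors.toList());
    }

    /** Reports whether a stream built with the same flag would be parallel. */
    public static boolean isParallelFor(List<?> list) {
        return StreamSupport.stream(list.spliterator(), SPLIT_SIZE <= list.size())
            .isParallel();
    }
}
```

A parallel stream runs its pipeline on the common fork/join pool by default, so the filter step can be spread over several threads without an explicit executor.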

pbludov · 2020-10-30T06:56:59Z

@rnveach, what about an optimization like this:

        List<CheckstyleRecord> shortest;
        List<CheckstyleRecord> longest;
        if (list1.size() < list2.size()) {
            shortest = list1;
            longest = list2;
        }
        else {
            shortest = list2;
            longest = list1;
        }

        // O(N * log(N)) to build the set, N = shortest.size()
        Set<CheckstyleRecord> recordSet =
                StreamSupport.stream(shortest.spliterator(), SPLIT_SIZE <= shortest.size()) // Do we need threading here?
            .collect(Collectors.toCollection(ConcurrentSkipListSet::new));

        // O(M * log(N)), M > N
        List<CheckstyleRecord> diff =
                StreamSupport.stream(longest.spliterator(), SPLIT_SIZE <= longest.size())
            .filter(rec -> !recordSet.remove(rec)) // Remove from both collections if matches
            .collect(Collectors.toCollection(ArrayList::new));

        diff.addAll(recordSet);
        return diff;

Could you run your benchmark on this code?

rnveach · 2020-10-30T08:17:49Z

what about an optimization like this:

I can add it to the list to benchmark, but I don't understand the code well enough to know what it is doing. Is this a replacement for produceDiffEx or produceDiff?

// Do we need threading here?

Imagine both lists are of infinite size; then it doesn't matter which one is smaller. This is how I read N in these complexities, since it isn't a concrete number. It only matters what the final complexity is and whether it can be reduced below O(N^2) or O(N log N).

pbludov · 2020-10-30T10:39:06Z

Imagine both lists are of infinite size; then it doesn't matter which one is smaller. This is how I read N in these complexities, since it isn't a concrete number. It only matters what the final complexity is and whether it can be reduced below O(N^2) or O(N log N).

In this case, maybe we should keep these lists sorted? It is easy to compare two sorted lists. Or even store them in a sorted collection, like

    private Map<String, ConcurrentSkipListSet<CheckstyleRecord>> records

?
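To illustrate why two sorted lists are easy to compare, here is a minimal two-pointer sketch. It assumes records are `Comparable` and that both lists are already sorted ascending by that same ordering; `SortedDiffSketch` is a name invented for illustration, not code from the tool.

```java
import java.util.ArrayList;
import java.util.List;

public final class SortedDiffSketch {
    /** Symmetric difference of two lists that are already sorted ascending. */
    public static <T extends Comparable<? super T>> List<T> diff(List<T> a, List<T> b) {
        List<T> result = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < a.size() && j < b.size()) {
            int cmp = a.get(i).compareTo(b.get(j));
            if (cmp == 0) {
                // Present in both lists: skip the pair.
                i++;
                j++;
            } else if (cmp < 0) {
                // a[i] is smaller than everything left in b, so it is unmatched.
                result.add(a.get(i++));
            } else {
                result.add(b.get(j++));
            }
        }
        // Whatever remains in either list has no counterpart.
        result.addAll(a.subList(i, a.size()));
        result.addAll(b.subList(j, b.size()));
        return result;
    }
}
```

Each step advances at least one index, so the walk needs at most B + P comparisons once the lists are sorted. The lists should support fast `get(int)`, as `ArrayList` does.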

rnveach · 2020-10-30T11:58:07Z

In this case, maybe we should keep these lists sorted?

It looks like we sort the results after the diff.

contribution/patch-diff-report-tool/src/main/java/com/github/checkstyle/data/DiffReport.java

Line 113 in 44f0290

Collections.sort(diff, new PositionOrderComparator());

If we are going to sort beforehand, wouldn't it be better to sort the records as they are added to the list?

romani · 2020-10-31T15:17:31Z

@rnveach, please share a comparison of executions. You might use the Indentation check to make sure that the reports have a huge diff.

pbludov · 2020-10-31T17:28:05Z

Please take a look at my PR #521

Issue #344: add multi-thread support to CheckstyleReportsParser

44f0290

rnveach requested a review from romani October 23, 2020 01:21

pbludov reviewed Oct 29, 2020

rnveach closed this Nov 6, 2020

rnveach deleted the issue_344 branch October 27, 2024 14:59
