Commit: changed `processTransposeBatch` to prune duplicate arcs
Showing 1 changed file with 46 additions and 18 deletions.
But pruning duplicate arcs cannot be done without a `LabelMergeStrategy`. How are you going to decide which label to keep otherwise? Also, why are you getting duplicates?
The problem of duplicate arcs arose when using a very small batch size, causing the same arc to end up in two separate batches; in that case `processTransposeBatch` cannot remove the duplicates, because it works on one batch at a time. If I understood correctly, inside `BatchGraph` this problem is solved when computing the `successorArray()`. In the case of `ArcLabelledBatchGraph`, which inherits from `ArcLabelledImmutableSequentialGraph` and therefore must implement `successor()`, the removal of duplicates was moved to the `nextInt()` function to avoid code duplication; `nextInt()` also computes the right value for the outdegree.

I was thinking about how to solve the problem of duplicate arcs with different labels. I forgot about the `LabelMergeStrategy`; I'm thinking of passing it to the `ScatteredLabelledArcsASCIIGraph` constructor and using it both when processing the batch and when computing the successors. Aren't we interested in equal arcs with different labels, though?

Edit: The labels are structured as key-value associations, which means we don't need multiple arcs with different labels. We can simply merge the two labels, and the process is invertible. This answers my question.
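The merge-instead-of-duplicate idea for key-value labels can be sketched as follows. Note this is only an illustration: the `Map`-based label and the `merge` helper are hypothetical stand-ins, not WebGraph's `Label` interface.

```java
import java.util.HashMap;
import java.util.Map;

public class KeyValueLabelMerge {
    // Hypothetical merge: if labels are key-value associations, two labels
    // attached to the same arc can be combined by taking the union of their
    // entries instead of keeping two parallel arcs.
    static Map<String, String> merge(Map<String, String> a, Map<String, String> b) {
        Map<String, String> merged = new HashMap<>(a);
        merged.putAll(b); // on a key clash, the second label's value wins
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> a = new HashMap<>();
        a.put("weight", "3");
        Map<String, String> b = new HashMap<>();
        b.put("timestamp", "42");
        System.out.println(merge(a, b)); // union of the two key sets
    }
}
```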
No, my question was: why is the ASCII class adding the same arc twice? That shouldn't happen (in theory).

OK, what happened here is the following (I think). Sorting successors and deduplicating them is expensive. When I implemented parallel compression, I realized that since parallel compression works by scanning the batch graph once, building a number of iterators, and then compressing (which implies a second pass), sorting and deduplicating in `NodeIterator.nextInt()` was foolish, because in the first pass there is no need for that. So I moved deduplication to `successorArray()`. That change was never ported to the labeled version because, up to a few weeks ago, we did not have parallel compression of labeled graphs.

In the labeled version there is no deduplication because there is no `mapOffline()` method, so a labeled batch cannot contain duplicates.

Having parallel arcs (e.g., the same successor many times) is a possible extension for the future, but a lot of code depends on the successor lists being made of distinct elements. In that case we suggest using a label class that contains a list of values.

So what I suggest is that, first of all, you move the sorting code out as in the non-labelled case. Then, we add a constructor to the batch graph with a `LabelMergeStrategy`. My idea is that if the code finds two parallel arcs and there is no `LabelMergeStrategy`, an exception is thrown. Otherwise, it uses the strategy to generate a single label. For existing code (transposition) throwing an exception is fine. For the new class, the user can pass a strategy.
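The suggested behavior (merge when a strategy is available, throw otherwise) might look roughly like this. The class, the int-valued labels, and the method name are hypothetical simplifications of the WebGraph types:

```java
public class DedupSketch {
    // Simplified stand-in for WebGraph's LabelMergeStrategy; real labels
    // are Label objects, not ints.
    interface LabelMergeStrategy { int merge(int first, int second); }

    // Deduplicate the first `length` entries of a sorted successor list
    // with parallel labels. On a duplicate: throw if no strategy was given,
    // otherwise merge the two labels into one.
    static int dedup(int[] succ, int[] labels, int length, LabelMergeStrategy strategy) {
        if (length == 0) return 0;
        int out = 0;
        for (int i = 1; i < length; i++) {
            if (succ[i] == succ[out]) {
                if (strategy == null)
                    throw new IllegalArgumentException("Parallel arcs, but no LabelMergeStrategy");
                labels[out] = strategy.merge(labels[out], labels[i]);
            } else {
                out++;
                succ[out] = succ[i];
                labels[out] = labels[i];
            }
        }
        return out + 1; // the corrected outdegree
    }

    public static void main(String[] args) {
        int[] succ = { 1, 2, 2, 5 };
        int[] labels = { 10, 20, 30, 40 };
        int d = dedup(succ, labels, 4, (a, b) -> a + b); // merge by summing
        System.out.println(d);         // 3
        System.out.println(labels[1]); // 50: the two labels of arc ->2 merged
    }
}
```

Passing `null` as the strategy reproduces the exception-throwing behavior proposed for existing code such as transposition.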
If I understand correctly, it makes more sense to dedup only when computing the successor array. But this causes an issue with the array of labels, which is created on the first call to `nextInt()` and sized according to the outdegree. If it's impossible to accurately compute the outdegree without sorting first, the solution is simple: we can override the `labelArray()` method and do the deduplication there (keeping in mind not to do it again when calling `successorArray()`, or vice versa).
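The "dedup once, whichever accessor is called first" idea can be sketched with a guard flag. This is a toy class with hypothetical names, not the actual iterator; for simplicity it keeps the first label of each duplicate run, whereas the real version would apply a `LabelMergeStrategy`:

```java
public class LazyDedupIterator {
    private final int[] successors;
    private final int[] labels;
    private int outdegree;
    private boolean deduplicated; // set after the first dedup, so the
                                  // other accessor does not repeat it

    LazyDedupIterator(int[] successors, int[] labels) {
        this.successors = successors;
        this.labels = labels;
        this.outdegree = successors.length;
    }

    private void dedupIfNeeded() {
        if (deduplicated) return;
        deduplicated = true;
        if (outdegree == 0) return;
        // The batch contents are assumed already sorted, so deduplication
        // is a single linear pass; labels move in lockstep with successors.
        int out = 0;
        for (int i = 1; i < outdegree; i++) {
            if (successors[i] != successors[out]) {
                out++;
                successors[out] = successors[i];
                labels[out] = labels[i];
            }
        }
        outdegree = out + 1; // the outdegree is only correct after dedup
    }

    int[] successorArray() { dedupIfNeeded(); return successors; }
    int[] labelArray()     { dedupIfNeeded(); return labels; }
    int outdegree()        { dedupIfNeeded(); return outdegree; }

    public static void main(String[] args) {
        LazyDedupIterator it =
            new LazyDedupIterator(new int[] { 3, 3, 7 }, new int[] { 1, 2, 3 });
        System.out.println(it.outdegree()); // 2
    }
}
```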
Maybe I'm missing something; can you please elaborate on exactly why there is no need for sorting and deduplication in the first pass? As far as I understand, the method `NodeIterator.nextInt()` reads from multiple batches, and each batch is already sorted and deduplicated, but where is the guarantee that combining their contents will yield arcs without duplicates?
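The concern can be made concrete with a toy example (the batch contents below are made up): two batches that are each sorted and internally duplicate-free can still share an arc, so simply concatenating and sorting their successor lists yields a duplicate.

```java
import java.util.Arrays;

public class CrossBatchDuplicates {
    public static void main(String[] args) {
        // Each batch is sorted and contains no internal duplicates...
        int[] batch1 = { 1, 4, 7 };
        int[] batch2 = { 4, 9 };
        // ...but the same successor (4) appears in both batches, so a plain
        // merge of their contents contains a duplicate that must be removed.
        int[] merged = new int[batch1.length + batch2.length];
        System.arraycopy(batch1, 0, merged, 0, batch1.length);
        System.arraycopy(batch2, 0, merged, batch1.length, batch2.length);
        Arrays.sort(merged);
        System.out.println(Arrays.toString(merged)); // [1, 4, 4, 7, 9]
    }
}
```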