Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deduplication iterator #351

Open
wants to merge 4 commits into
base: master
Choose a base branch
from
Open

Deduplication iterator #351

wants to merge 4 commits into from

Conversation

dspoja
Copy link
Collaborator

@dspoja dspoja commented Apr 24, 2019

Here is the row deduplication iterator. We should plan on testing it with real data too.

@sfeng88
Copy link
Contributor

sfeng88 commented May 2, 2019

@dspoja There are test failures, can you take a look at this PR?

@dspoja
Copy link
Collaborator Author

dspoja commented May 8, 2019

The test failures are occurring because the Travis job times out while waiting for the unit tests to complete. Before these changes, accumulo test were taking about 16 minutes to complete, and the entire test suite taking about 40 minutes. Now, with the addition of the FlushEvent and that fact that we call flush a lot in the unit tests and the flush events invoke the new iterator which attempts to deduplicate rows on each flush, the accumulo tests take about 1 hour to run. In my local environment, all tests complete successfully, but they take over 1 hour to run. Commenting out queueEvent() call on line 1479 in AccumuloGraph.java shortens the test time back to 16 minutes, but then the new iterator is not run. Extending the Travis timeout would allow the build to succeed. This PR needs to be tested with real data before it is merged.

@sfeng88
Copy link
Contributor

sfeng88 commented May 9, 2019

@dspoja Is it possible to split the test that is taking a really long time to run into it's own profile? That way we can get travis to run it in parallel as well

@dspoja
Copy link
Collaborator Author

dspoja commented May 10, 2019

@sfeng88 I don't think that is possible, since it is basically every test in GraphBaseTest that has a flush event, so each test takes a little longer to complete, and they all together add up to significant delay. If there were only a couple of tests that were taking too long, we could definitely separate them.

@sfeng88
Copy link
Contributor

sfeng88 commented May 20, 2019

@mwizeman @joeferner Thoughts about this PR?

@@ -27,7 +27,8 @@
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.12</version>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This version should be coming from the root directory pom


@Override
public SortedKeyValueIterator<Key, Value> deepCopy(IteratorEnvironment env) {
RowDeduplicationIterator newInstance;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably should just new RowDeduplicationIterator() here. Using reflection here probably not needed.

return newInstance;
}

List<KeyValue> sortedRow = new ArrayList<>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move variables to the top. Mark them as private.

import java.io.IOException;
import java.util.*;

public class RowDeduplicationIteratorTest {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you test interleaved properties.

In time order...

  • t1 - prop1: John
  • t2 - prop2: Doe
  • t3 - prop1: John
  • t4 - prop2: Smith

The two John values should be collapsed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants