-
Notifications
You must be signed in to change notification settings - Fork 34
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Deduplication iterator #351
base: master
Are you sure you want to change the base?
Conversation
@dspoja There are test failures, can you take a look at this PR? |
The test failures are occurring because the Travis job times out while waiting for the unit tests to complete. Before these changes, accumulo test were taking about 16 minutes to complete, and the entire test suite taking about 40 minutes. Now, with the addition of the FlushEvent and that fact that we call flush a lot in the unit tests and the flush events invoke the new iterator which attempts to deduplicate rows on each flush, the accumulo tests take about 1 hour to run. In my local environment, all tests complete successfully, but they take over 1 hour to run. Commenting out |
@dspoja Is it possible to split the test that is taking a really long time to run into it's own profile? That way we can get travis to run it in parallel as well |
@sfeng88 I don't think that is possible, since it is basically every test in GraphBaseTest that has a flush event, so each test takes a little longer to complete, and they all together add up to significant delay. If there were only a couple of tests that were taking too long, we could definitely separate them. |
@mwizeman @joeferner Thoughts about this PR? |
@@ -27,7 +27,8 @@ | |||
<dependency> | |||
<groupId>junit</groupId> | |||
<artifactId>junit</artifactId> | |||
<version>4.12</version> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This version should be coming from the root directory pom
|
||
@Override | ||
public SortedKeyValueIterator<Key, Value> deepCopy(IteratorEnvironment env) { | ||
RowDeduplicationIterator newInstance; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
probably should just new RowDeduplicationIterator()
here. Using reflection here probably not needed.
return newInstance; | ||
} | ||
|
||
List<KeyValue> sortedRow = new ArrayList<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move variables to the top. Mark them as private.
import java.io.IOException; | ||
import java.util.*; | ||
|
||
public class RowDeduplicationIteratorTest { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you test interleaved properties.
In time order...
- t1 - prop1: John
- t2 - prop2: Doe
- t3 - prop1: John
- t4 - prop2: Smith
The two John
values should be collapsed.
Here is the row deduplication iterator. We should plan on testing it with real data too.