[GOBBLIN-2167] Allow filtering of Hive datasets by underlying HDFS folder location #4069
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:
@@ Coverage Diff @@
##              master    #4069      +/-   ##
============================================
+ Coverage      45.12%   45.36%    +0.24%
+ Complexity      3199     3181       -18
============================================
  Files            705      695       -10
  Lines          26949    26587      -362
  Branches        2680     2655       -25
============================================
- Hits           12160    12061       -99
+ Misses         13781    13523      -258
+ Partials        1008     1003        -5

☔ View full report in Codecov by Sentry.
Assert.assertEquals(datasets.size(), 1);

properties.put(HiveDatasetFinder.HIVE_DATASET_PREFIX + "." + WhitelistBlacklist.WHITELIST, "");
// The table located at /tmp/test should be filtered
It would be helpful to call out that the dataset table is created at path /tmp/test on line 221, since there is another assertion above that uses a different regex which does not filter out the table.
finder = new TestHiveDatasetFinder(FileSystem.getLocal(new Configuration()), properties, pool);
datasets = Lists.newArrayList(finder.getDatasetsIterator());

Assert.assertEquals(datasets.size(), 0);
can also add a test for the case where the regex is empty or null
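A sketch of what such a test could look like, following the setup pattern of the surrounding test class (the pool field and base properties are assumed to be the same ones used there; TABLE_FOLDER_FILTER_KEY is a hypothetical stand-in for the config key this PR introduces, and the expected count assumes an empty value means "no location filtering"):

@Test
public void testHiveTableFolderFilterWithEmptyRegex() throws Exception {
  Properties properties = new Properties();
  // TABLE_FOLDER_FILTER_KEY is a hypothetical name for the new location-filter config key.
  // An empty value is assumed to mean "no location filtering", which is the behavior this test pins down.
  properties.put(TABLE_FOLDER_FILTER_KEY, "");

  HiveDatasetFinder finder = new TestHiveDatasetFinder(FileSystem.getLocal(new Configuration()), properties, pool);
  List<HiveDataset> datasets = Lists.newArrayList(finder.getDatasetsIterator());

  // With no effective location regex, the table at /tmp/test should not be filtered out.
  Assert.assertEquals(datasets.size(), 1);
}

A sibling test could omit the key entirely to cover the null/absent case.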
@@ -215,6 +215,32 @@ public void testDatasetConfig() throws Exception {
}

@Test
public void testHiveTableFolderFilter() throws Exception {
can be renamed to testHiveTableFolderAllowlistFilter
if (!regex.isPresent()) {
  return true;
}
return Pattern.compile(regex.get()).matcher(table.getSd().getLocation()).matches();
Pattern.compile(regex.get()) is called every time the method shouldAllowTableLocation is executed when the regex is present. Compiling a regex every time can be inefficient; we can compile the pattern once and store it, e.g.:
private static Pattern compiledAllowlistPattern = regex.map(Pattern::compile).orElse(null);
I compiled it as part of the class field instead, good callout
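A minimal sketch of that class-field approach, assuming the regex arrives as a java.util.Optional<String> at construction time (the class and field names here are illustrative, not the exact ones in the PR):

import java.util.Optional;
import java.util.regex.Pattern;

import org.apache.hadoop.hive.metastore.api.Table;

// Illustrative helper: the filter pattern is compiled once at construction time
// instead of on every shouldAllowTableLocation call.
public class TableLocationFilter {
  private final Optional<Pattern> tableLocationPattern;

  public TableLocationFilter(Optional<String> tableFolderRegex) {
    this.tableLocationPattern = tableFolderRegex.map(Pattern::compile);
  }

  public boolean shouldAllowTableLocation(Table table) {
    // No configured regex means every table location is allowed.
    if (!this.tableLocationPattern.isPresent()) {
      return true;
    }
    return this.tableLocationPattern.get().matcher(table.getSd().getLocation()).matches();
  }
}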
LGTM!
Dear Gobblin maintainers,
Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!
JIRA
Description
Hive tables can be located in different HDFS folders even if they belong to the same database.
This becomes tricky to manage within a single Gobblin job, especially when the underlying files require different permissions and handling for viewFS.
This PR adds a configuration that takes a regex to filter tables by their table location: tables whose paths match the filter are selected, and all others are ignored.
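As a sketch of the matching semantics (the regex value below is only an example, not a value from this PR): because the check uses Pattern.matches() against the full table location, the configured regex must cover the entire path, so a trailing .* is usually needed.

import java.util.regex.Pattern;

public class TableFolderFilterExample {
  public static void main(String[] args) {
    // Hypothetical value a job might configure for the new table-location filter.
    Pattern filter = Pattern.compile("/data/tracking/.*");

    // matches() requires the whole location string to match the regex.
    System.out.println(filter.matcher("/data/tracking/events").matches()); // true  -> table selected
    System.out.println(filter.matcher("/tmp/test").matches());             // false -> table ignored
  }
}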
Tests
Unit tests
Commits