Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HIVE-28409: Column lineage when creating view is missing if atlas HiveHook is set #5370

Merged
merged 8 commits into from
Aug 26, 2024

Conversation

kasakrisz
Copy link
Contributor

@kasakrisz kasakrisz commented Jul 26, 2024

What changes were proposed in this pull request?

  • Remove hardcoded Atlas HiveHook related filters from lineage information Generator and introduce an independent setting to filter based on statement/operation type.
  • Add isView property to QueryProperties to indicate that this statement is a logical view creation.

Why are the changes needed?

Lineage information was not generated in case of create view operation. See jira for details: https://issues.apache.org/jira/browse/HIVE-28409

Does this PR introduce any user-facing change?

No.

Is the change a dependency upgrade?

No.

How was this patch tested?

mvn test -Dtest.output.overwrite -DskipSparkTests -Dtest=TestMiniLlapLocalCliDriver -Dqfile=lineage6.q -pl itests/qtest -Pitests
mvn test -Dtest=TestGenerator.java -pl ql

@@ -3865,6 +3865,8 @@ public static enum ConfVars {
"get old behavior, if desired. See, test-case in patch for HIVE-6689."),
HIVE_LINEAGE_INFO("hive.lineage.hook.info.enabled", false,
"Whether Hive provides lineage information to hooks."),
HIVE_LINEAGE_STATEMENT_FILTER("hive.lineage.statement.filter", "All",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if we have a rule about it or not. But can the default value be lowercase?
I would love to avoid future confusion about if is that All, all or ALL.

Also, could you please put some description and and example values there that can help us to understand what that config value exactly does?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comparison is not case sensitive to make this more felxible

if (ALL.equalsIgnoreCase(valueText)) {
if (NONE.equalsIgnoreCase(valueText)) {
HiveOperation enumValue = EnumUtils.getEnumIgnoreCase(HiveOperation.class, valueText);

I added an upper case default value because other values coming from HiveOperation are in upper case. There are other configs with upper case values so I think it is not an exception.
I also added the missing description.

@@ -6,7 +6,7 @@
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
*f
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

accident typo

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

@@ -79,7 +79,7 @@ public void initialize(HiveConf hiveConf) {
|| postExecHooks.contains("org.apache.hadoop.hive.ql.hooks.PostExecutePrinter")
|| postExecHooks.contains("org.apache.hadoop.hive.ql.hooks.LineageLogger")
|| postExecHooks.contains("org.apache.atlas.hive.hook.HiveHook")) {
transformations.add(new Generator(postExecHooks));
transformations.add(Generator.fromConf(hiveConf));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume that before that change, we read the transformations from the postExecHooks that is as far I remember, calculated once. And from now, we read those again at every single query execution.
Am I correct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

postExecHooks also calculated at every query execution:

Set<String> postExecHooks = Sets.newHashSet(
Splitter.on(",").trimResults().omitEmptyStrings().split(
Strings.nullToEmpty(HiveConf.getVar(hiveConf, HiveConf.ConfVars.POST_EXEC_HOOKS))));

I think this should be addressed in a follow-up only if this is bottleneck.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thx


HiveOperation enumValue = EnumUtils.getEnumIgnoreCase(HiveOperation.class, valueText);
if (enumValue == null) {
throw new EnumConstantNotPresentException(HiveOperation.class, valueText);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not familiar with that part: what will be the user experience if they add this value to the conf: "Hello" ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exception is thrown at compile time.
Example:

java.lang.EnumConstantNotPresentException: org.apache.hadoop.hive.ql.plan.HiveOperation.Hello
	at org.apache.hadoop.hive.ql.optimizer.lineage.Generator.createFilterPredicateFromConf(Generator.java:106)
	at org.apache.hadoop.hive.ql.optimizer.lineage.Generator.fromConf(Generator.java:72)
	at org.apache.hadoop.hive.ql.optimizer.Optimizer.initialize(Optimizer.java:82)
	at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:13274)
...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thx

operations.stream().anyMatch(hiveOperation -> filterMap.get(hiveOperation).apply(parseContext));
}

private final Predicate<ParseContext> statementFilter;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you move that declaration and the constructor up so that the class can keep that pattern: fields -> constructors -> methods?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything before statementFilter is static. Where should those static fields and methods go?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would really not add method declarations before field ones.
What about static fields, instance fields, static initialization block, constructor, public methods, private methods?

Copy link
Contributor Author

@kasakrisz kasakrisz Aug 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I usually move static members to the beginning of the file but I don't have strong opinion about this.
I changed the order as you suggested.

LOG.debug("Not evaluating lineage");
return pctx;
}
if (!statementFilter.test(pctx)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a fear that this change will break existing configs: Previously, we said that if the Atlas HiveHook is set, we will run that transformation. Now we are saying that if hive.lineage.statement.filter is set, we do the transformation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it breaks existing configs but previously Atlas HiveHook-related filters were hardcoded and impossible to test. Now hive.lineage.statement.filter is independent of hooks and hence applies for any hook making it possible to test using LineageLogger.

"Whether Hive provides lineage information to hooks."),
"Whether Hive provides lineage information to hooks." +
"Deprecated: use hive.lineage.statement.filter instead."),
HIVE_LINEAGE_STATEMENT_FILTER("hive.lineage.statement.filter", "ALL",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest using NONE as default.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NONE means that this functionality is never triggered. IMHO all functionality should be tested otherwise follow-up changes in the project may make it impossible to turn it on in the future. We have experienced this in the past a few times.

Copy link
Member

@deniskuzZ deniskuzZ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM +1

LOAD(parseContext -> !(parseContext.getLoadTableWork() == null || parseContext.getLoadTableWork().isEmpty())),
QUERY(parseContext -> parseContext.getQueryProperties().isQuery()),
ALL(parseContext -> true),
NONE(parseContext -> false);
Copy link
Member

@deniskuzZ deniskuzZ Aug 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i still think that NONE is redundant, but I am fine to keep it

Copy link

sonarcloud bot commented Aug 23, 2024

@kasakrisz kasakrisz merged commit d3bbb03 into apache:master Aug 26, 2024
6 checks passed
@kasakrisz kasakrisz deleted the HIVE-28409-master-atlas-lineage branch August 26, 2024 07:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants