Skip to content
This repository has been archived by the owner on Jun 14, 2024. It is now read-only.

Add dev config for nested column support #466

Merged
merged 4 commits into from
Jun 23, 2021

Conversation

sezruby
Copy link
Collaborator

@sezruby sezruby commented Jun 18, 2021

What is the context for this pull request?

What changes were proposed in this pull request?

Disable index creation with nested columns until it's fully supported.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit test

@sezruby sezruby requested review from imback82 and clee704 June 18, 2021 20:21
@sezruby
Copy link
Collaborator Author

sezruby commented Jun 18, 2021

@andrei-ionescu I created a temp config for disabling nested column support (for v0.5 release).
We could remove the config when it's fully supported. Could you have a look at this?

@sezruby sezruby self-assigned this Jun 18, 2021
@sezruby sezruby added this to the v0.5.0 milestone Jun 18, 2021
@andrei-ionescu
Copy link
Contributor

@sezruby Please provide a clear explanation of the reason to add this flag. Is there a technical reason to add it and have this "nested column based index creation feature" disabled by default?

In the project are other features that are partially supported and I don't understand why this is different and needs to have this new flag.

By the way, adding this flag cannot stop one to set it and have the nested column based index creation enabled.

One more thing, if such a nested column based index is created, when they'll try to use it, it will not be applied. The results will be correct and impact is only that performance wise Hyperspace does not provide enough performance in this case. The user may just open some tickets in regards to it.

On the other hand, I offered my time to work on adding the full nested field support into Hyperspace for v0.5 but due to different priorities and lack of time from your team, the nested field support had to be postponed and I agreed with it, but with this PR I'm under the impression that this feature is in your way and needs to be removed (more work from my side to put it back). Maybe I'm getting it wrong and please correct me if so.

@rapoth,

  1. Do you have some more info on this decision?
  2. When should we resume the nested field support in Hyperspace? Do you have a time frame?

@sezruby
Copy link
Collaborator Author

sezruby commented Jun 18, 2021

@andrei-ionescu Sorry for the lack of explanation. We need to release v0.5 this month to deliver some bug fixes.
Without the config, customers may create indexes with nested columns and ask why it's not applied.
To avoid unnecessary time consuming analysis, (it's just for quick failure), I introduced this config to deliver 0.5 without those kind of issues.

It's ok to deliver incomplete PRs without the temp config while developing, but for release, we need to disable (or exclude) incomplete features to avoid unnecessary issue handling.

It's a temporal / hidden config, so I don't expect anyone will use the config.

@andrei-ionescu
Copy link
Contributor

andrei-ionescu commented Jun 19, 2021

@sezruby The feature of creating an index over a nested field is also hidden, the users may not use it at all. And this is the way Hyperspace did work before I added it.

Before adding this functionality I remember that there hasn't been any error specific to the fact that nested fields are not supported - it just did not work.

From my perspective with or without it Hyperspace works the same as before.

So, let me see if I got right what you're saying:

  1. There is no technical issue forcing you do this.
  2. You're afraid that on your products the customers will possibly use this feature and wonder why does not work when querying.

My questions are:

  1. How was it in version 0.4 and why keeping the feature as it is in 0.5 is worse?
  2. Why do you want the customers to be locked in the fact that Hyperspace does not support indexes on nested fields? Is this nested field support totally removed from Hyperspace roadmap?
  3. Can "nested field support" get into v0.5.0?
  4. If "not" is the answer to question 3, then is it v0.6.0 and what's the timeframe for it?

@rapoth, @imback82: I may need you input here are there are some product related questions that I would like to get some answers. Thank you.

@sezruby
Copy link
Collaborator Author

sezruby commented Jun 19, 2021

1=>
In version 0.4

scala> hs.createIndex(dff, IndexConfig("nestedIndex", Seq("nestedC"), Seq("hsh")))
// succeeded

scala> hs.createIndex(dff, IndexConfig("nestedIndex", Seq("nestedC.id"), Seq("hsh")))
com.microsoft.hyperspace.HyperspaceException: Index config is not applicable to dataframe schema.
  at com.microsoft.hyperspace.actions.CreateAction.validate(CreateAction.scala:53)
  at com.microsoft.hyperspace.actions.Action$class.run(Action.scala:88)
  at com.microsoft.hyperspace.actions.CreateAction.run(CreateAction.scala:30)
  at com.microsoft.hyperspace.index.IndexCollectionManager.create(IndexCollectionManager.scala:43)
  at com.microsoft.hyperspace.index.CachingIndexCollectionManager.create(CachingIndexCollectionManager.scala:77)
  at com.microsoft.hyperspace.Hyperspace.createIndex(Hyperspace.scala:42)
  ... 54 elided

// the index with "nestedC" can be applied
scala> val filter = dff.filter($"nestedC".isNotNull).select("hsh", "nestedC")
filter: org.apache.spark.sql.DataFrame = [hsh: string, nestedC: struct<id: string, leaf: struct<id: string, cnt: int>>]

scala> filter.queryExecution.optimizedPlan
res4: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
Project [hsh#13, nestedC#17]
+- Filter isnotnull(nestedC#17)
   +- Relation[hsh#13,nestedC#17] Hyperspace(Type: CI, Name: nestedIndex, LogVersion: 1)

in current master:

scala> hs.createIndex(dff, IndexConfig("nestedIndex2", Seq("nestedC"), Seq("hsh")))
scala.MatchError: List() (of class scala.collection.immutable.Nil$)
  at com.microsoft.hyperspace.util.ResolverUtils$.com$microsoft$hyperspace$util$ResolverUtils$$getColumnNameFromSchema(ResolverUtils.scala:215)
  at com.microsoft.hyperspace.util.ResolverUtils$.com$microsoft$hyperspace$util$ResolverUtils$$getColumnNameFromSchema(ResolverUtils.scala:220)
  at com.microsoft.hyperspace.util.ResolverUtils$$anonfun$1$$anonfun$apply$2.apply(ResolverUtils.scala:174)
  at com.microsoft.hyperspace.util.ResolverUtils$$anonfun$1$$anonfun$apply$2.apply(ResolverUtils.scala:171)
  at scala.Option.map(Option.scala:146)
  at com.microsoft.hyperspace.util.ResolverUtils$$anonfun$1.apply(ResolverUtils.scala:171)
  at com.microsoft.hyperspace.util.ResolverUtils$$anonfun$1.apply(ResolverUtils.scala:168)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.immutable.List.foreach(List.scala:392)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.immutable.List.map(List.scala:296)
  at com.microsoft.hyperspace.util.ResolverUtils$.resolve(ResolverUtils.scala:168)
  at com.microsoft.hyperspace.actions.CreateAction.isValidIndexSchema(CreateAction.scala:77)
  at com.microsoft.hyperspace.actions.CreateAction.validate(CreateAction.scala:60)
  at com.microsoft.hyperspace.actions.Action$class.run(Action.scala:89)
  at com.microsoft.hyperspace.actions.CreateAction.run(CreateAction.scala:29)
  at com.microsoft.hyperspace.index.IndexCollectionManager.create(IndexCollectionManager.scala:47)
  at com.microsoft.hyperspace.index.CachingIndexCollectionManager.create(CachingIndexCollectionManager.scala:78)
  at com.microsoft.hyperspace.Hyperspace$$anonfun$createIndex$1.apply$mcV$sp(Hyperspace.scala:45)
  at com.microsoft.hyperspace.Hyperspace.withHyperspaceRuleDisabled(Hyperspace.scala:188)
  at com.microsoft.hyperspace.Hyperspace.createIndex(Hyperspace.scala:44)
  ... 54 elided

scala> hs.createIndex(dff, IndexConfig("nestedIndex2", Seq("nestedC.id"), Seq("hsh")))
=> succeeded

with this fix:

cala> hs.createIndex(dff, IndexConfig("nested3", Seq("nestedC"), Seq("hsh")))
scala.MatchError: List() (of class scala.collection.immutable.Nil$)
  at com.microsoft.hyperspace.util.ResolverUtils$.com$microsoft$hyperspace$util$ResolverUtils$$getColumnNameFromSchema(ResolverUtils.scala:215)
  at com.microsoft.hyperspace.util.ResolverUtils$.com$microsoft$hyperspace$util$ResolverUtils$$getColumnNameFromSchema(ResolverUtils.scala:220)
  at com.microsoft.hyperspace.util.ResolverUtils$$anonfun$1$$anonfun$apply$2.apply(ResolverUtils.scala:174)
  at com.microsoft.hyperspace.util.ResolverUtils$$anonfun$1$$anonfun$apply$2.apply(ResolverUtils.scala:171)
  at scala.Option.map(Option.scala:146)
  at com.microsoft.hyperspace.util.ResolverUtils$$anonfun$1.apply(ResolverUtils.scala:171)
  at com.microsoft.hyperspace.util.ResolverUtils$$anonfun$1.apply(ResolverUtils.scala:168)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.immutable.List.foreach(List.scala:392)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.immutable.List.map(List.scala:296)
  at com.microsoft.hyperspace.util.ResolverUtils$.resolve(ResolverUtils.scala:168)
  at com.microsoft.hyperspace.actions.CreateAction.validate(CreateAction.scala:63)
  at com.microsoft.hyperspace.actions.Action$class.run(Action.scala:89)
  at com.microsoft.hyperspace.actions.CreateAction.run(CreateAction.scala:29)
  at com.microsoft.hyperspace.index.IndexCollectionManager.create(IndexCollectionManager.scala:47)
  at com.microsoft.hyperspace.index.CachingIndexCollectionManager.create(CachingIndexCollectionManager.scala:81)
  at com.microsoft.hyperspace.Hyperspace$$anonfun$createIndex$1.apply$mcV$sp(Hyperspace.scala:45)
  at com.microsoft.hyperspace.Hyperspace.withHyperspaceRuleDisabled(Hyperspace.scala:177)
  at com.microsoft.hyperspace.Hyperspace.createIndex(Hyperspace.scala:44)
  ... 55 elided

scala> hs.createIndex(dff, IndexConfig("nested3", Seq("nestedC.id"), Seq("hsh")))
com.microsoft.hyperspace.HyperspaceException: Hyperspace does not support nested columns.
  at com.microsoft.hyperspace.actions.CreateAction.validate(CreateAction.scala:71)
  at com.microsoft.hyperspace.actions.Action$class.run(Action.scala:89)
  at com.microsoft.hyperspace.actions.CreateAction.run(CreateAction.scala:29)
  at com.microsoft.hyperspace.index.IndexCollectionManager.create(IndexCollectionManager.scala:47)
  at com.microsoft.hyperspace.index.CachingIndexCollectionManager.create(CachingIndexCollectionManager.scala:81)
  at com.microsoft.hyperspace.Hyperspace$$anonfun$createIndex$1.apply$mcV$sp(Hyperspace.scala:45)
  at com.microsoft.hyperspace.Hyperspace.withHyperspaceRuleDisabled(Hyperspace.scala:177)
  at com.microsoft.hyperspace.Hyperspace.createIndex(Hyperspace.scala:44)
  ... 54 elided

From this result, I'll

  • create a fix for the ResolverUtil
  • change the msg to "Hyperspace does not support nested columns yet"

IMO quick failure and a temp config for incomplete / unstable features are a kind of best practice for development.
There's no reason we should keep the bad practice.

2=>
I would have liked to put this PR, because

  1. successfully created (but useless) indexes might take much amount of data in the storage.
  2. successfully created indexes may give the impression that Hyperspace can support it.

This PR doesn't mean that Hyperspace won't support nested field at all in the future, and there's no such decision between our team. This is a quick fix for release, just because I strongly thought that we should disable incomplete features before release to avoid any confusion.

It was my mistake that I didn't inform you first before this PR and give enough information about it.
This is the first time I work with outside of company and also open source community, there's still a bit of trial and error.

3=>
Though our refactoring is almost done, still there're some remaining works. (@clee704 please correct me if I'm wrong)
As we should release the version by this month, I'm not sure we could make it. v0.6 would be safe.

4=>
There's no specific time frame but Data Skipping index + another type of index will be the included.
It would be the end of July or August, from my expectation.

@rapoth and @imback82 are OOO until next week / next month, so there will be some delay on response.

@clee704
Copy link

clee704 commented Jun 21, 2021

Having a feature toggle for a non-trivial under-tested new feature is a good thing. Maybe I should make one for DataSkippingIndex, too 😄

The PR message can be reworded like "Add a feature toggle for nested columns". I think we can drop "temp" in the config key, and make the feature fully pluggable at will. Or maybe "dev" instead of "temp".

@sezruby sezruby changed the title Disable index creation with nested columns Add dev config for nested column support Jun 21, 2021
clee704
clee704 previously approved these changes Jun 22, 2021
Copy link

@clee704 clee704 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

# Conflicts:
#	src/test/scala/com/microsoft/hyperspace/index/CreateIndexNestedTest.scala
Copy link
Contributor

@andrei-ionescu andrei-ionescu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no need to add a flag to disable this functionality. Just check if there is any nested field in the config and throw the needed exception.

Tests can be disabled with ignore for this release.

I can develop later by removing the exists(_.isNested) condition and re-enabling the tests.

@sezruby, @clee704: Thanks for clarifying things.

As @clee704 noted, the same approach should be to any big feature regardless if is developed by your internal team or by someone from the open source community.

throw HyperspaceException("Index config is not applicable to dataframe schema.")
}

// valid state check
// TODO: Temporarily block creating indexes using nested columns until it's fully supported.
if (!(HyperspaceConf.nestedColumnEnabled(spark) || resolvedColumns.get.forall(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why adding a flag? Just check if there is any nested column and throw. And disable the tests.

Suggested change
if (!(HyperspaceConf.nestedColumnEnabled(spark) || resolvedColumns.get.forall(
if (resolvedColumns.get.exists(_.isNested)) {

@andrei-ionescu
Copy link
Contributor

@sezruby I need a clear time frame when to get back working on the nested field feature. I need to plan my time as I also have other things on my plate and I would like as much as possible to stick to that time frame.

@clee704
Copy link

clee704 commented Jun 22, 2021

As @clee704 noted, the same approach should be to any big feature regardless if is developed by your internal team or by someone from the open source community.

I'm not sure what do you mean by this. There is no half-implemented feature on the master branch except this. I'm going to add a feature toggle for DataSkippingIndex which is not merged yet, but you are against using a feature toggle. Can you explain why?

@andrei-ionescu
Copy link
Contributor

andrei-ionescu commented Jun 22, 2021

@clee704

There is no half-implemented feature on the master branch except this

You guys have been merging different PRs with refactor in master that address a part of a future feature. For me that is a partial implemented feature. I don't want to go into this discussion.

I'm not sure what do you mean by this.

What I wanted to say is that I'm glad that you come with this idea of having a unified approach when developing big features.

but you are against using a feature toggle. Can you explain why?

It is not that I'm against a flag. It is just the fact that there is a simpler approach to disable it.

For the "nested feature" the simplest thing to do is to directly throw if a nested field is detected and disable the tests. When I'll resume developing the feature I'll remove the throw lines and enable the tests. Doing it with a flag all require do some more extra cleanup.

@clee704
Copy link

clee704 commented Jun 22, 2021

You guys have been merging different PRs with refactor in master that address a part of a future feature. For me that is a partial implemented feature. I don't want to go into this discussion.

Refactoring is not a feature. A feature provides new functionality to users. Refactoring does not.

@clee704
Copy link

clee704 commented Jun 22, 2021

Currently we're not doing it but disabling the tests can drop the code coverage and it would be an issue if we are going to add coverage check to CI. Having a non-tested, unused code sitting around in the codebase doesn't feel right.

But if you insist then I'm fine and I'll leave it up to @sezruby.

@andrei-ionescu
Copy link
Contributor

@clee704

I already said that I prefer not to get into this discussion but now that you mentioned it I'll give some feedback.

You know the phrase: "don't fix it if it works!"

In the majority of the cases refactoring happens due to new features that needs to be added. So, with this being said, any refactor is a partial feature, at least from my perspective.

@sezruby
Copy link
Collaborator Author

sezruby commented Jun 22, 2021

I'd like to keep the dev config approach as I don't want to break the tests without knowing it.
@clee704 Could you approve the change again? it's dismissed due to the merge conflict.

@andrei-ionescu please note that @clee704 and I are NOT the kind of person who doesn't fix the code just because it's working. @imback82 has even higher standards.

BTW WHY ARE YOU ARGUING WITH US? I don't understand what you're doing here.
We made enough justification, but you're still complaining about it and blaming us.
If you think this's the first time using disabled config in this project, I insist that I've done the rule refactoring with a temporal "disabled" package & tests for it, and enabled after it's fully delivered.
Anyway this is not a helpful discussion at all.

Copy link
Contributor

@imback82 imback82 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Disabling temporarily via config seems reasonable to me. Thanks @sezruby!

@sezruby sezruby merged commit 1fe104f into microsoft:master Jun 23, 2021
@clee704
Copy link

clee704 commented Jun 23, 2021

I forgot to add: the nested column support is only relevant to CoveringIndex. Other index types might support it without any change to the framework. For example, DataSkippingIndex can support nested columns out of the box because it doesn't directly store the column values in the index. Therefore, a more correct fix would be blocking nested columns in CoveringIndex. But this is a minor issue as currently we only have CoveringIndex now. This is just something to remember when we continue to work on the nested column support later.

@andrei-ionescu
Copy link
Contributor

@sezruby I'm not arguing with you. I just propose different solutions than what you already did choose to do and I try to compare the differences between your proposal and mine and build a case for my proposal. The fact that it is in contradiction with what you are planning is in fact part of the development itself.

@andrei-ionescu please note that @clee704 and I are NOT the kind of person who doesn't fix the code just because it's working. @imback82 has even higher standards.

I never did think of you guys having low standards. On the contrary, my impression is that you have pretty high standards. I'm used to work in all kinds of projects (start-up, research, PoCs, production level, etc) and I'm comfortable in any of them. From my experience there should be a balance in regards the standards too, as these may slow the development in the early stages.

There are things that I see that you can improve in regards to doing open development. The biggest thing that in my opinion is hindering the open development is that there is a "double standard development" -- what's on your direct interests is treated differently than other things. This is understandable up to a point but after that it drives developers off the project. You need to find the balance.

There are other things as well in regards to open development but I'll stop here. If you want to discuss about these open dev things we can have a separate context.

@andrei-ionescu
Copy link
Contributor

@sezruby I need a clear time frame when to get back working on the nested field feature. I need to plan my time as I also have other things on my plate and I would like as much as possible to stick to that time frame.

I'm getting back to my ask which didn't get an answer yet. Can you provide a time frame for me so that I can plan my time accordingly?

@sezruby
Copy link
Collaborator Author

sezruby commented Jun 24, 2021

I'm getting back to my ask which didn't get an answer yet. Can you provide a time frame for me so that I can plan my time accordingly?

I'm not sure but you can resume after merging #461 - early or mid July.
As it also has some refactoring change and another refactoring change might be required for different index type support.

@sezruby sezruby deleted the nestedconfig branch June 30, 2021 23:09
clee704 pushed a commit that referenced this pull request Jan 11, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants