Skip to content
This repository has been archived by the owner on Jun 14, 2024. It is now read-only.

Use indexes subdirectory for custom index system path #369

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

sezruby
Copy link
Collaborator

@sezruby sezruby commented Feb 26, 2021

What is the context for this pull request?

  • Tracking Issue: n/a
  • Parent Issue: n/a
  • Dependencies: n/a

What changes were proposed in this pull request?

Use subdirectory for custom index system path for consistency & maintenance.
e.g. with indexDirName = "indexes"

current:
- /custom/system/path/index1
- /custom/system/path/index2

with this change:
- /custom/system/path/indexes/index1
- /custom/system/path/indexes/index2

Added new config for the subdir name

// Config used for subdirectory name under the system path.
  val INDEX_DIR_NAME = "spark.hyperspace.system.indexDirName"
  val INDEX_DIR_NAME_DEFAULT = "indexes"

Does this PR introduce any user-facing change?

Yes, this PR changes the system directory path.
This is a breaking change.

How was this patch tested?

Need to fix tests

@sezruby
Copy link
Collaborator Author

sezruby commented Feb 26, 2021

@imback82 @rapoth @apoorvedave1 @thugsatbay

Though this change will be a breaking change, I think we need this sub root dir to avoid any kind of interference.
We shouldn't rely on user's input. WDYT?

@thugsatbay
Copy link
Contributor

  • One advantage I see is that user index are more organized.
  • Will we be keeping some files in the indexes folder that denote some global state of all the index created so far ?
  • What kind of interference are we expecting ?
  • Will user ever come to the hyperspace index dir and play or move files ?

@sezruby
Copy link
Collaborator Author

sezruby commented Feb 27, 2021

  • we might have some metadata later
  • if there's other directories in there, index manager tries to find or build IndexLogEntry on it and causes some exception, or directory size estimation or something like that
  • rarely but possible

The default behavior is creating indexes under the spark warehouse dir.
But if the system path config is set, we don't create indexes under it.

@imback82 could you add some comment?

@imback82
Copy link
Contributor

I guess the only concern I have is the breaking change as you pointed out. @rapoth?

@andrei-ionescu
Copy link
Contributor

I'll give the example of using custom indexing path with Iceberg.

Iceberg has its own metadata folder where it's keeping its own information. That is the default value but can be changed if needed through some table or custom define Spark parameters. This is named metadata_location and is a relative path starting from the dataset path. In the case of Iceberg it is set in stone that the main use case is keeping the metadata next to the data.

In the case of Hyperspace, the current possibility of setting the path to whatever value it is needed is the best option although the user needs to know not to use a folder used for other purposes (ie: metadata in Iceberg case).

In my opinion, adding the hardcoded indexes value inside the path adds some limitations to the current freedom.

What should some one do if they want to keep the indexes next to their data using custom indexes path but the indexes folder inside the dataset is already used by other indexing mechanism?

@sezruby
Copy link
Collaborator Author

sezruby commented Mar 4, 2021

@andrei-ionescu That's a fair point but currently Hyperspace doesn't have "metadata" dir and rely on the directory structure under the system path; so I suggest this sub-root dir. We don't support different index data locations yet (#243), but I think Hyperspace also needs "metadata" dir to maintain indexes located in different paths.

@andrei-ionescu
Copy link
Contributor

andrei-ionescu commented Mar 4, 2021

@sezruby I'm not against having a default folder name for the custom index system path, but I'm against of having it hardcoded (burned in code) as indexes.

I'm for making the sub-folder a setting at all levels: either if the user chooses the Spark warehouse (default) or dataset path (custom path) or other path if a new suffix is added to the path, the suffix should be a setting.

I see that in HyperspaceConf are these two constants: INDEX_SYSTEM_PATH and INDEXES_DIR. I see that INDEXES_DIR is not used in code and I would suggest to make use of it too, or rename it for a better understanding.

Code wise, this what I suggest:

def systemPath: Path = {
  val indexesSuffix = conf.getConfString(IndexConstants.INDEXES_DIR_SUFFIX, IndexConstants.INDEXES_DIR)
  val defaultIndexesPath = conf.getConfString("spark.sql.warehouse.dir")
  new Path(conf.getConfString(IndexConstants.INDEX_SYSTEM_PATH, defaultIndexesPath), indexesSuffix)
}

And add the new INDEXES_DIR_SUFFIX into IndexConstants

val INDEXES_DIR_SUFFIX = "spark.hyperspace.indexes.dir.suffix"

Just to conclude, I'm not against having this subdirectory but 1) let's make it settable 2) let's make it consistent over all possible pathing options in Hyperspace.

@sezruby
Copy link
Collaborator Author

sezruby commented Mar 5, 2021

@andrei-ionescu Yep sounds good to me. Thanks for the good suggestion :)

@rapoth any opinion on this PR?

@sezruby sezruby changed the title [WIP] Use indexes subdirectory for custom index system path Use indexes subdirectory for custom index system path Mar 5, 2021
@sezruby sezruby self-assigned this Mar 7, 2021
andrei-ionescu
andrei-ionescu previously approved these changes Mar 10, 2021
Copy link
Contributor

@andrei-ionescu andrei-ionescu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@sezruby
Copy link
Collaborator Author

sezruby commented Mar 11, 2021

@imback82 Now at least we have a workaround ( indexDirName = "" ) for the breaking change. WDYT?

// Config used for setting the system path, which is considered as a "root" path for Hyperspace;
// e.g, indexes are created under the system path.
val INDEX_SYSTEM_PATH = "spark.hyperspace.system.path"

// Config used for subdirectory name under the system path.
val INDEX_DIR_NAME = "spark.hyperspace.system.indexDirName"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe simply DIR or LOCATION instead of DIR_NAME?

@@ -63,9 +63,16 @@ private[hyperspace] class PathResolver(conf: SQLConf, hadoopConf: Configuration)
* @return Hyperspace index system path.
*/
def systemPath: Path = {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little bit confused by now about what a system path is in Hyperspace, because the value of INDEX_SYSTEM_PATH and the return value of this function can be different, yet both are being called "system paths". Which one is the "system path"? Shouldn't we name either one differently? Maybe we can 1) improve the docstring, 2) change the config name from INDEX_SYSTEM_PATH to something like INDEX_SYSTEM_ROOT, and/or 3) use a more self-explanatory name than system path.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I think it's better to change the function name rather than the config string. So my suggestion is 1) keep "system path" as it is (w/o subdir) 2) give another name to "system path"/subdir path.

How about indexLocationPath?

cc @imback82

@clee704
Copy link

clee704 commented Apr 9, 2021

I think this PR is good in general, but we need some clarification and write something about how and where data is stored in Hyperspace. Maybe there is one already?

@sezruby
Copy link
Collaborator Author

sezruby commented Apr 9, 2021

Maybe there is one already?

Maybe this one? https://microsoft.github.io/hyperspace/docs/toh-indexes-on-the-lake/
I think it's good to add this link to quick-start guide for reference.

// Config used for setting the system path, which is considered as a "root" path for Hyperspace;
// e.g, indexes are created under the system path.
val INDEX_SYSTEM_PATH = "spark.hyperspace.system.path"

// Config used for subdirectory name under the system path.
val INDEX_DIR_NAME = "spark.hyperspace.system.indexDirName"
val INDEX_DIR_NAME_DEFAULT = "indexes"
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NOTE: will change the default dir name as "hyperspace".

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants