Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for OpenSearch alias type #3246

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

qianheng-aws
Copy link
Contributor

@qianheng-aws qianheng-aws commented Jan 15, 2025

Description

Add Support for OpenSearch alias type, so PPL can query fields of alias type correctly.

Implementation:

  • Add a new ExprType named OpenSearchAliasType
  • Add special parsing logic for alias type in OpenSearchDataType::parseMapping
  • Use original path and type when constructing ReferenceExpression. For now, only alias type will have different original path and type.
  • When constructing _source in OpenSearchRequestBuilder::pushDownProjects, use path instead of attr since they are different for alias type. The former one is more accurate since OpenSearch don't accept fields of alias type in _source.

e.g. After this PR:

# Create Index with alias type
PUT /try
{
    "mappings":{
        "properties": {
            "time1": {
                "type": "date"
            },
            "time2": {
                "type": "alias",
                "path": "time1"
            }
        }
    }
}

# Add doc
POST try/_doc
{
  "time1": 1736911491131
}
# Query with field of alias type
POST /_plugins/_ppl/
{
   "query": "source=try |where `time2` > '2025-01-01' |  fields `time2`"
}

# Get result
{
  "schema": [
    {
      "name": "time2",
      "type": "timestamp"
    }
  ],
  "datarows": [
    [
      "2025-01-15 03:24:51.131"
    ]
  ],
  "total": 1,
  "size": 1
}

Related Issues

Resolves #3069

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • New functionality has javadoc added.
  • New functionality has a user manual doc added.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Heng Qian <[email protected]>
(cherry picked from commit 2f475af)
Signed-off-by: Heng Qian <[email protected]>
Copy link
Member

@YANG-DB YANG-DB left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice - thanks !

* The type of alias. See <a
* href="https://opensearch.org/docs/latest/opensearch/supported-field-types/alias/">doc</a>
*/
public class OpenSearchAliasType extends OpenSearchDataType {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should an alias be treated as a new data type, or should it remain as just an alias for the original field? Additionally, what are the expectations for the output of the DESCRIBE TABLE command? for example, describe try?

Copy link
Contributor Author

@qianheng-aws qianheng-aws Jan 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. I've thought of resolving alias type to NamedExpression which is a resolved expression for Alias. In that proposal, we still need to maintain this OpenSearchAliasType in TypeEnvironment when analyzing, so as to identify ident of alias type, and resolve them to NamedExpression(i.e. Alias) with its delegated to be a ReferenceExpression to the original field.

  2. With the current code, describe try returns:

{
  "schema": [{"name": "TABLE_CAT","type": "string"},{"name": "TABLE_SCHEM","type": "string"},{"name": "TABLE_NAME","type": "string"},{"name": "COLUMN_NAME","type": "string"},{"name": "DATA_TYPE","type": "string"},{"name": "TYPE_NAME","type": "string"},{"name": "COLUMN_SIZE","type": "string"},{"name": "BUFFER_LENGTH","type": "string"},{"name": "DECIMAL_DIGITS","type": "string"},{"name": "NUM_PREC_RADIX","type": "string"},{"name": "NULLABLE","type": "string"},{"name": "REMARKS","type": "string"},{"name": "COLUMN_DEF","type": "string"},{"name": "SQL_DATA_TYPE","type": "string"},{"name": "SQL_DATETIME_SUB","type": "string"},{"name": "CHAR_OCTET_LENGTH","type": "string"},{"name": "ORDINAL_POSITION","type": "string"},{"name": "IS_NULLABLE","type": "string"},{"name": "SCOPE_CATALOG","type": "string"},{"name": "SCOPE_SCHEMA","type": "string"},{"name": "SCOPE_TABLE","type": "string"},{"name": "SOURCE_DATA_TYPE","type": "string"},{"name": "IS_AUTOINCREMENT","type": "string"},{"name": "IS_GENERATEDCOLUMN","type": "string"}
  ],
"datarows": [
	[ "opensearch", null, "try", "time1", null, "timestamp", null, null, null, 10, 2, null, null, null, null, null, 0, "", null, null, null, null, "NO", ""],
	[ "opensearch", null, "try", "time2", null, "timestamp", null, null, null, 10, 2, null, null, null, null, null, 1, "", null, null, null, null, "NO", ""]
  ],
  "total": 2,
  "size": 2
}

Copy link
Collaborator

@penghuo penghuo Jan 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've thought of resolving alias type to NamedExpression which is a resolved expression for Alias

+1, agree
instead create new data type, could we add alias field when parseMapping? for instance,

    Set<Alias> aliases = new HashSet<>()
    indexMapping.forEach(
        (k, v) -> {

          var type = ((String) innerMap.getOrDefault("type", "object")).replace("_", "");
          if (type == "alias") {
             aliases.add(new Alias(aliasField, rawField)
           }
          // current code
         result
        });
   for(alias: aliases) {
     result.put(alias.aliasField, result.get(alias.rawField)
   }
   return result

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding alias field when parse mapping is more simply. But what if the original field definition changed? In DSL, the alias and field definitions are "static". But in PPL, the field definition can be changed by eval <original_field> = xxx. Should the alias change its returns correspondingly.

Copy link
Contributor Author

@qianheng-aws qianheng-aws Jan 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@penghuo it seems not work to parse alias field in the way like that, as the information of path cannot be recorded anywhere. For the example in the description, in analyzing process, we only know the datatype of time2 but we cannot resolve its value when executing, because time2 isn't an actual exist value in the source of OpenSearch index.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in analyzing process, we only know the datatype of time2 but we cannot resolve its value when executing

agree.
my concern is alias is not a data type even when move to Calcite, right? is it possible to resolve alias field as NamedExpression?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was also thinking alias field will be "replaced" everywhere in the query plan after analyzing, similar as table alias?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm afraid we need to add alias type as well when moving to Calcite. But it's a data type only exist in analyzing process and will be resolved before execution.

is it possible to resolve alias field as NamedExpression?

Yes if using the way mentioned in the beginning of this thread. Taking the above example, we can change resolving time2 from

ReferenceExpression(attr = `time2`, path =`time1`, type = `date`)

to

NamedExpression(name = `time2`, delegated = ReferenceExpression(attr = `time1`, path =`time1`, type = `date`))

They two are nearly semantic equal I think.

@penghuo penghuo added backport 2.x enhancement New feature or request labels Jan 16, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 2.x enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] PPL does not support querying Alias fields
5 participants