Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Missing DefaultDimensionSpec in Druid query when querying with multiple Lookup dimensions. #716

Open
upendrareddy opened this issue Sep 21, 2020 · 10 comments

Comments

@upendrareddy
Copy link

When we query Maha with multiple lookup dimensions in the select fields we are seeing some of the fields returned as nulls in the results. The underlying druid query issued did not have these lookups listed in the default dimension specs.

To elaborate further
The issued maha query is of the following format.
Maha Query

{
  "cube": "test_cube",
  "rowsPerPage": 1000,
  "selectFields": [
    {
      "field": "Adunit ID"
    },
   {
      "field": "Adunit Name"  //Lookup based on Adunit ID
    },
    {
      "field": "Adgroup ID"
    },
   {
      "field": "Adgroup Name" //Lookup based on Adgroup ID
    },
    {
      "field": "Adserver Requests"
    }
  ],
  "filterExpressions": [
    {
      "field": "Publisher ID",
      "operator": "=",
      "value": "xxxxxxxxxAAAAAABBBBBBBBBB"
    },
    {
      "field": "Day",
      "operator": "Between",
      "from": "2020-09-17",
      "to": "2020-09-17"
    }
  ]
}

Output

{
  "header": {
    "cube": "test_cube",
    "fields": [
      {
        "fieldName": "Adunit ID",
        "fieldType": "DIM"
      },
      {
        "fieldName": "Adunit Name",
        "fieldType": "DIM"
      },
      {
        "fieldName": "Adgroup ID",
        "fieldType": "DIM"
      },
      {
        "fieldName": "Adgroup Name",
        "fieldType": "DIM"
      },
      {
        "fieldName": "Adserver Requests",
        "fieldType": "FACT"
      }
    ],
    "maxRows": 1000,
    "debug": {}
  },
  "rows": [
    [
      "dddddddddddAdunitI1dddddddddddddd",
      "dddddddddddAdunitI1 Name dddddddddddddd",
      null,   // _Missing adgroup_id_
      "dddddddddddAdgroup1 Name dddddddddddddd",
      0
    ],
    [
      "dddddddddddAdunitId2ddddddddddddd",
      "dddddddddddAdunitId2 Name ddddddddddddd",
      null,   // _Missing adgroup_id_
      "dddddddddddAdgroup2 Name dddddddddddddd",
      1
    ]
  ],
  "curators": {}
}

Druid Query Created by Maha

{
  "queryType": "groupBy",
  "dataSource": {
    "type": "table",
    "name": "test_cube"
  },
  "intervals": {
    "type": "intervals",
    "intervals": [
      "2020-09-17T00:00:00.000Z/2020-09-18T00:00:00.000Z"
    ]
  },
  "virtualColumns": [],
  "filter": {
    "type": "and",
    "fields": [
      {
        "type": "or",
        "fields": [
          {
            "type": "selector",
            "dimension": "__time",
            "value": "2020-09-17",
            "extractionFn": {
              "type": "timeFormat",
              "format": "YYYY-MM-dd",
              "timeZone": "UTC",
              "granularity": {
                "type": "none"
              },
              "asMillis": false
            }
          }
        ]
      },
      {
        "type": "selector",
        "dimension": "pubId",
        "value": "xxxxxxxxxAAAAAABBBBBBBBBB"
      }
    ]
  },
  "granularity": {
    "type": "all"
  },
  "dimensions": [     // No adgroup_id added here. 
    {
      "type": "default",
      "dimension": "adunitId",
      "outputName": "Adunit ID",
      "outputType": "STRING"
    },
    {
      "type": "extraction",
      "dimension": "adgroupId",
      "outputName": "Adgroup Name",
      "outputType": "STRING",
      "extractionFn": {
        "type": "registeredLookup",
        "lookup": "adgroup_names",
        "retainMissingValue": false,
        "replaceMissingValueWith": "UNKNOWN"
      }
    },
    {
      "type": "extraction",
      "dimension": "adunitId",
      "outputName": "Adunit Name",
      "outputType": "STRING",
      "extractionFn": {
        "type": "registeredLookup",
        "lookup": "adunit_names",
        "retainMissingValue": false,
        "replaceMissingValueWith": "UNKNOWN"
      }
    }
  ],
  "aggregations": [
    {
      "type": "longSum",
      "name": "Adserver Requests",
      "fieldName": "adserverRequests"
    }
  ],
  "postAggregations": [],
  "limitSpec": {
    "type": "default",
    "columns": [],
    "limit": 10000000
  },
  "context": {
    "groupByStrategy": "v2",
    "applyLimitPushDown": "false",
    "implyUser": "internal_user",
    "priority": 10,
    "userId": "internal_user",
    "uncoveredIntervalsLimit": 1,
    "groupByIsSingleThreaded": true,
    "timeout": 900000,
    "queryId": "9292389f-2a7f-4e12-a39a-6f727097ab92"
  },
  "descending": false
}

Upon debugging further I stumbled upon the variable factRequestCols(Set of Strings) at https://github.com/yahoo/maha/blob/master/core/src/main/scala/com/yahoo/maha/core/query/druid/DruidQueryGenerator.scala#L353 which is being passed on to method at https://github.com/yahoo/maha/blob/master/core/src/main/scala/com/yahoo/maha/core/query/druid/DruidQueryGenerator.scala#L381 where druid queries dimension specs are being created in getDimensions method based on factRequestCols passed.

I am not quite sure I understand the logic of factRequestCols set creation here but adgroup_id dimension is not getting included in the resulting set because of which it is not getting added to Druid query dimension spec as well.

I locally overrode the code by passing queryContext.factBestCandidate.requestCols to getDimensions method at https://github.com/yahoo/maha/blob/master/core/src/main/scala/com/yahoo/maha/core/query/druid/DruidQueryGenerator.scala#L381 and it fixed the issue and started populating the dimension spec in druid query as well as had adgroup_id values in the resulting out.

Output after the change

{
  "header": {
    "cube": "platform_performance_cube",
    "fields": [
      {
        "fieldName": "Adunit ID",
        "fieldType": "DIM"
      },
      {
        "fieldName": "Adunit Name",
        "fieldType": "DIM"
      },
      {
        "fieldName": "Adgroup ID",
        "fieldType": "DIM"
      },
      {
        "fieldName": "Adgroup Name",
        "fieldType": "DIM"
      },
      {
        "fieldName": "Adserver Requests",
        "fieldType": "FACT"
      }
    ],
    "maxRows": 1000,
    "debug": {}
  },
  "rows": [
    [
      "dddddddddddAdunitI1dddddddddddddd",
      "dddddddddddAdunitI1 Name dddddddddddddd",
      "dddddddddddAdgroup1dddddddddddddd",
      "dddddddddddAdgroup1 Name dddddddddddddd",
      0
    ],
    [
      "dddddddddddAdunitId2ddddddddddddd",
      "dddddddddddAdunitId2 Name ddddddddddddd",
      "dddddddddddAdgroup2dddddddddddddd",
      "dddddddddddAdgroup2 Name dddddddddddddd",
      1
    ]
  ],
  "curators": {}
}

Could you please help me with the context of factRequestCols set creation and also let me know if the logic of the factRequestCols set creation or anything else needs to be changed to include the missing dimensions.

@patelh
Copy link
Collaborator

patelh commented Sep 21, 2020

Does the request have forceDimensionDriven set to true? Try to get debug info to see if isDimDriven or forceDimensionDriven are true on request model.

@patelh
Copy link
Collaborator

patelh commented Sep 21, 2020

Also try running the request with forceDimensionDriven: "false" to see what the query looks like.

@upendrareddy
Copy link
Author

Following are different boolean RequestModel variables set in our case.

true,  //hasFactFilters
false, //hasMetricFilters
true,  //hasNonFKFactFilters
false, //hasDimFilters
false, //hasNonFKDimFilters
false, //hasFactSortBy
false, //hasDimSortBy
false, //isFactDriven
false, //forceDimDriven
false, //forceFactDriven
false, //hasNonDrivingDimSortOrFilter
false, //hasDrivingDimNonFKNonPKSortBy
false, //hasNonDrivingDimNonFKNonPKFilter
false, //anyDimHasNonFKNonForceFilter

forceDimensionDriven, isDimDriven set to false in our case.

@upendrareddy
Copy link
Author

Looks like isDimDriven is the opposite of isFactDriven, So it is being set to true in our case.

@patelh
Copy link
Collaborator

patelh commented Sep 21, 2020

You may have identified a corner case which may not have coverage. Try setting forceFactDriven to true.

@upendrareddy
Copy link
Author

Yes, setting the forceFactDriven variable to true helped with generating the desired Druid query.

I see the variable can be set in Maha request JSON payload. Should the variable be defaulted to true since most of the queries are fact-driven. Let me know if you have any recommendations.

@patelh
Copy link
Collaborator

patelh commented Sep 22, 2020

Use that as a work around until we figure out what should be the right behavior in the dim driven case. Thanks for reporting it!

@upendrareddy
Copy link
Author

Thanks for the workaround. Are there any repercussions by using the forceFactDriven on all the queries?

@patelh
Copy link
Collaborator

patelh commented Sep 22, 2020

@upendrareddy Unless you have a dim driven use case (e.g. entity management in the UI, e.g. Campaign Management view), they should all be fact driven so I don't see any issues with doing it for all queries.

@upendrareddy
Copy link
Author

Got it. Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants