
constraints of input datasets for each data-mining algorithm #13

Open
HimmelStein opened this issue Sep 5, 2017 · 9 comments

@HimmelStein
Contributor

When a user selects a dataset and moves on to the data-mining service, Indigo shall display only the data-mining algorithms that can be applied to the selected dataset.

So please describe the constraints on input datasets for the data-mining algorithm you developed (send them to me by email before this Thursday).

@wk0206

wk0206 commented Sep 20, 2017

@larjohn As described in the mail, I will add the constraints of each data-mining algorithm to dam.json, so that when you visit the DAM for the first time, you can get the list.
For now I am focusing on Timeseries, using it as a sample:

Facts: at least one time dimension with years as values, and three or more years available
Aggregates: at least one time dimension as drilldown

I changed dam.json to the following format:

  • add "conditions" at the same level as "aggregate"

  • add "dimension", "numberRestriction", "formatRestriction", "description" as values

  • all values use the String format

  • "three or more years available" becomes "3+" as a String

> "time_series": {
>     "configurations": {
>       "aggregate": {
>         "inputs": {
>         },
>         "outputs": {
>         },
>         "prompt": XX,
>         "method": XX,
>         "endpoint": XX,
>         "name": "aggregate",
>         "title": "Timeseries of aggregated fiscal data"
>       },
>       "conditions": {
>         "Facts": {
>           "dimension": "year",
>           "numberRestriction": "3+",
>           "formatRestriction": "",
>           "description":"at least one time dimension with years as values, and three or more years available"
>         },
>         "Aggregates": {
>           "dimension": "time",
>           "numberRestriction": "",
>           "formatRestriction": "drilldown",
>           "description":"at least one time dimension as drilldown"
>         }
>       }
>     },
>     "name": "time_series",
>     "title": "Time Series",
>     "description": XX
>   }

The difficult part is how to capture the semantics of "time dimension with years as values": should we only check the dimension title to make sure it contains "year", or do we have to look at the values and guarantee a regex filter like "^(19|20)\d{2}$"?

Actually, regarding how to write and read this condition, we would like to hear your opinion.
If you prefer some other format, such as putting the conditions at the same level as "inputs" or "endpoint", that is OK for me too.
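For the value-based interpretation, a minimal sketch (the helper name is hypothetical; the regex is the one mentioned above) could look like:

```python
import re

# The regex filter discussed above: values must look like 4-digit years.
YEAR_RE = re.compile(r"^(19|20)\d{2}$")

def satisfies_year_condition(values, min_distinct=3):
    """True if every distinct value looks like a year and at least
    min_distinct different years are available (the "3+" restriction)."""
    years = {str(v) for v in values}
    return all(YEAR_RE.match(y) for y in years) and len(years) >= min_distinct
```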

@larjohn

larjohn commented Sep 27, 2017

@wk0206 time dimensions should be found in the package query, with datetime dimension type:


{
  "model": {
    "dimensions": {
      "global__functionalClassification__78f10": {
        "dimensionType": "classification"
      },
      "global__economicClassification__569a2": {
        "dimensionType": "classification"
      },
      "global__budgetPhase__afd93": {
        "dimensionType": "classification"
      },
      "global__administrativeClassification__854d0": {
        "dimensionType": "classification"
      },
      "global__classification__9ddd4": {
        "dimensionType": "classification"
      },
      "global__currency__1a842": {
        "dimensionType": "classification"
      },
      "global__fiscalPeriod__28951": {
        "dimensionType": "datetime"
      },
      "global__operationCharacter__0c040": {
        "dimensionType": "classification"
      },
      "global__organization__0eba1": {
        "dimensionType": "location"
      },
      "global__administrativeClassification__70a05": {
        "dimensionType": "classification"
      },
      "global__administrativeClassification__f9d35": {
        "dimensionType": "classification"
      },
      "global__administrativeClassification__38bee": {
        "dimensionType": "classification"
      },
      "global__administrativeClassification__13968": {
        "dimensionType": "classification"
      },
      "global__date__99de8": {
        "dimensionType": "classification"
      }
    },
    "measures": {
      "global__amount__0397f": {
        "currency": "EUR",
        "title": "Global amount"
      }
    }
  },
  "countryCode": null,
  "cityCode": null,
  "name": "global",
  "title": "Global dataset"
}

Regarding the constraints, please put them inside each configuration, as each configuration (usually facts vs. aggregates) could have a different starting point and require different things, so they can't share the same constraints every time.

Also note that most of the constraints might better be applied at the DAM level, so that instead of requesting the algorithms with a generic request, Indigo should request per dataset and only get the algorithms that apply. A good strategy would be to cache the constraints analysis to avoid overhead.
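The caching idea could be sketched as follows (all names and the model shape are hypothetical stand-ins for the DAM package queries; the point is that the analysis per dataset is computed once and reused):

```python
from functools import lru_cache

# Stand-in for the DAM package queries; in practice these would be fetched.
PACKAGE_MODELS = {
    "global": {"datetime_dimensions": 1},
    "athens_2016": {"datetime_dimensions": 0},
}

@lru_cache(maxsize=None)
def satisfies_time_series(dataset_name):
    """Cached constraint analysis: True if the dataset has at least one
    datetime dimension, so Time Series can be offered for it."""
    return PACKAGE_MODELS[dataset_name]["datetime_dimensions"] >= 1
```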

@HimmelStein
Contributor Author

@larjohn Why is there a random tail on each global key? E.g., what is the function of "__0eba1" in "global__organization__0eba1"?

@HimmelStein
Contributor Author

@larjohn Let us take the data-mining function 'time series' as an example. The applicable datasets must have a dimension 'fiscalPeriod', and there shall be at least 3 different values in the dimension 'fiscalPeriod'.

"time_series": {
  "configurations": {
    "aggregate": {
      "inputs": {
      },
      "outputs": {
      },
      "endpoint": <..>,
      "name": "aggregate",
      "title": "Timeseries of aggregated fiscal data"
    },
    "conditions": {
      "Facts": {
        "dimension": "datetime",
        "numberRestriction": "3+",
        "formatRestriction": "",
        "description": "at least one dimension of type \"datetime\" with three or more different values available"
      },
      "Aggregates": {
        "dimension": "datetime",
        "numberRestriction": "",
        "formatRestriction": "drilldown",
        "description": "at least one time dimension as drilldown"
      }
    }
  },
  "name": "time_series",
  "title": "Time Series"
}
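Evaluating the "Facts" condition against the package model posted above (the one with "dimensionType" per dimension) could be sketched like this; the function and argument names are hypothetical:

```python
# DAM-level check for the "Facts" condition: at least one dimension of
# type "datetime" whose observed values include 3+ distinct entries
# (the "numberRestriction": "3+" condition).
def applicable_for_time_series(package, distinct_values_by_dimension):
    for name, dim in package["model"]["dimensions"].items():
        if dim.get("dimensionType") == "datetime":
            if len(distinct_values_by_dimension.get(name, ())) >= 3:
                return True
    return False
```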

@larjohn

larjohn commented Oct 12, 2017

@HimmelStein sorry for the delay - I have been sick since last week...

The 'random' tail ensures that datasets from the same region that have similar last URI parts get different names. I can't recall exactly what led me to this, but here is an example:

http://datasets.obeu.com/athens/2016/expenditure
http://datasets.obeu.com/thessaloniki/2013/expenditure

In order to select a simple name for those two (not containing dashes etc.), one would use the last part of the URI, but here it is the same for both. So creating a hash of the URI and taking a part of it minimizes name clashes.
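The naming scheme described above could be sketched as follows (SHA-1 and the 5-character tail length are assumptions for illustration; the actual hash and length used are not stated):

```python
import hashlib

def short_name(uri):
    """Combine the last URI part with a short hash of the full URI,
    so otherwise-clashing names like "expenditure" become distinct."""
    last = uri.rstrip("/").rsplit("/", 1)[-1]          # e.g. "expenditure"
    tail = hashlib.sha1(uri.encode()).hexdigest()[:5]  # short hash of the URI
    return f"{last}__{tail}"
```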

The restriction seems good, give me some time to implement it in Indigo.

@larjohn

larjohn commented Oct 12, 2017

@HimmelStein I can't find the updated dam.json. Can you check so that I can update the running instance on the Fraunhofer server?

@HimmelStein
Contributor Author

@larjohn We have not checked it in, as we are waiting for your feedback on the format (the conditions used for time series); see my last comment above (the JSON structure).

@wk0206

wk0206 commented Oct 17, 2017

@larjohn I updated the dam.json, please check.

@larjohn

larjohn commented Oct 18, 2017

 "conditions": {
          "Facts": {
            "dimension": "datetime",
            "numberRestriction": "3+",
            "formatRestriction": "",
            "description": "at least one dimension of type \"datetime\" with three or more different values available"
          },
          "Aggregates": {
            "dimension": "datetime",
            "numberRestriction": "",
            "formatRestriction": "drilldown",
            "description":"at least one time dimension as drilldown"
          }
        }

So I revisited the constraints, here are my comments:

  1. The constraints should be embedded into each configuration they apply to, not the whole algorithm, as different configurations may require different things.

  2. The filtering is more obviously done at the DAM level, but...

  3. ...the first constraint is evaluated before building the algorithm input, while the second is evaluated while building it. The former should be evaluated at the DAM level (do not show datasets that cannot produce any correct input for this algorithm configuration). The latter should be evaluated on the front-end (Indigo) side. So let's define a way to separate them (a custom attribute?).

  4. dimension should be dimension_type, numberRestriction should be cardinalNumberRestriction, and formatRestriction should be roleRestriction (with these possible values: {measure, field, aggregate, drilldown, sort, cut}).
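Putting points 1, 3, and 4 together, a conditions block inside a configuration might look like this (a sketch only, not an agreed format; "evaluationLevel" is a hypothetical name for the custom attribute from point 3):

```
"aggregate": {
  "conditions": {
    "Facts": {
      "dimension_type": "datetime",
      "cardinalNumberRestriction": "3+",
      "roleRestriction": "",
      "evaluationLevel": "DAM",
      "description": "at least one dimension of type \"datetime\" with three or more different values available"
    },
    "Aggregates": {
      "dimension_type": "datetime",
      "cardinalNumberRestriction": "",
      "roleRestriction": "drilldown",
      "evaluationLevel": "indigo",
      "description": "at least one time dimension as drilldown"
    }
  }
}
```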
