
constraints of input datasets for each data-mining algorithm #13

Open
HimmelStein opened this issue Sep 5, 2017 · 9 comments

@HimmelStein
Contributor

When a user selects a dataset and moves on to the data-mining service, Indigo shall display only the data-mining algorithms that can be applied to the selected dataset.

So please describe the constraints on input datasets for the data-mining algorithm you developed (send them to me by email before this Thursday).

@wk0206

wk0206 commented Sep 20, 2017

@larjohn As described in the mail, I will add the constraints of each data-mining algorithm to dam.json, so that when you visit the DAM for the first time, you can get the list.
For now I am focusing on Timeseries, using it as a sample:

Facts: at least one time dimension with years as values, and three or more years available
Aggregates: at least one time dimension as drilldown

I changed dam.json to the following format:

  • add "conditions" at the same level as "aggregate"

  • add "dimension", "numberRestriction", "formatRestriction", "description" as values

  • all values use the String format

  • "three or more years available" becomes "3+" as a String

> "time_series": {
>     "configurations": {
>       "aggregate": {
>         "inputs": {
>         },
>         "outputs": {
>         },
>         "prompt": XX,
>         "method": XX,
>         "endpoint": XX,
>         "name": "aggregate",
>         "title": "Timeseries of aggregated fiscal data"
>       },
>       "conditions": {
>         "Facts": {
>           "dimension": "year",
>           "numberRestriction": "3+",
>           "formatRestriction": "",
>           "description":"at least one time dimension with years as values, and three or more years available"
>         },
>         "Aggregates": {
>           "dimension": "time",
>           "numberRestriction": "",
>           "formatRestriction": "drilldown",
>           "description":"at least one time dimension as drilldown"
>         }
>       }
>     },
>     "name": "time_series",
>     "title": "Time Series",
>     "description": XX
>   }

The difficult part is how to capture the semantics of "time dimension with years as values": should we only check the dimension title to make sure it contains "year", or do we have to look at the values and guarantee a regex filter like "^(19|20)\d{2}$"?

Actually, regarding how to write and read this condition, we would like to hear your opinion.
If you prefer some other format, such as putting the conditions at the same level as "inputs" or "endpoint", that is OK for me too.
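For the value-based interpretation, a minimal sketch (the helper name is hypothetical; the regex is the one mentioned above) could look like:

```python
import re

# The regex filter discussed above: values must look like 4-digit years.
YEAR_RE = re.compile(r"^(19|20)\d{2}$")

def satisfies_year_condition(values, min_distinct=3):
    """True if every distinct value looks like a year and at least
    min_distinct different years are available (the "3+" restriction)."""
    years = {str(v) for v in values}
    return all(YEAR_RE.match(y) for y in years) and len(years) >= min_distinct
```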

@larjohn

larjohn commented Sep 27, 2017

@wk0206 time dimensions should be found in the package query, with datetime dimension type:


{
  "model": {
    "dimensions": {
      "global__functionalClassification__78f10": {
        "dimensionType": "classification"
      },
      "global__economicClassification__569a2": {
        "dimensionType": "classification"
      },
      "global__budgetPhase__afd93": {
        "dimensionType": "classification"
      },
      "global__administrativeClassification__854d0": {
        "dimensionType": "classification"
      },
      "global__classification__9ddd4": {
        "dimensionType": "classification"
      },
      "global__currency__1a842": {
        "dimensionType": "classification"
      },
      "global__fiscalPeriod__28951": {
        "dimensionType": "datetime"
      },
      "global__operationCharacter__0c040": {
        "dimensionType": "classification"
      },
      "global__organization__0eba1": {
        "dimensionType": "location"
      },
      "global__administrativeClassification__70a05": {
        "dimensionType": "classification"
      },
      "global__administrativeClassification__f9d35": {
        "dimensionType": "classification"
      },
      "global__administrativeClassification__38bee": {
        "dimensionType": "classification"
      },
      "global__administrativeClassification__13968": {
        "dimensionType": "classification"
      },
      "global__date__99de8": {
        "dimensionType": "classification"
      }
    },
    "measures": {
      "global__amount__0397f": {
        "currency": "EUR",
        "title": "Global amount"
      }
    }
  },
  "countryCode": null,
  "cityCode": null,
  "name": "global",
  "title": "Global dataset"
}

Regarding the constraints, please put them inside each configuration, as each configuration (usually facts vs. aggregates) could have a different starting point and require different things, so they can't share the same constraints every time.

Also note that most of the constraints might better be applied at the DAM level, so that instead of requesting the algorithms with a generic request, Indigo should request per dataset and only get the algorithms that apply. A good strategy would be to cache the constraints analysis to avoid overhead.
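The caching idea could be sketched as follows (all names and the model shape are hypothetical stand-ins for the DAM package queries; the point is that the analysis per dataset is computed once and reused):

```python
from functools import lru_cache

# Stand-in for the DAM package queries; in practice these would be fetched.
PACKAGE_MODELS = {
    "global": {"datetime_dimensions": 1},
    "athens_2016": {"datetime_dimensions": 0},
}

@lru_cache(maxsize=None)
def satisfies_time_series(dataset_name):
    """Cached constraint analysis: True if the dataset has at least one
    datetime dimension, so Time Series can be offered for it."""
    return PACKAGE_MODELS[dataset_name]["datetime_dimensions"] >= 1
```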

@HimmelStein
Contributor Author

@larjohn Why is there a random tail on each global key? E.g., what is the function of "__0eba1" in "global__organization__0eba1"?

@HimmelStein
Contributor Author

@larjohn Let us take the data-mining function 'time series' as an example. The applicable datasets must have a dimension 'fiscalPeriod', and there shall be at least 3 different values in the dimension 'fiscalPeriod'.

"time_series": {
  "configurations": {
    "aggregate": {
      "inputs": {
      },
      "outputs": {
      },
      "endpoint": <..>,
      "name": "aggregate",
      "title": "Timeseries of aggregated fiscal data"
    },
    "conditions": {
      "Facts": {
        "dimension": "datetime",
        "numberRestriction": "3+",
        "formatRestriction": "",
        "description": "at least one dimension of type \"datetime\" with three or more different values available"
      },
      "Aggregates": {
        "dimension": "datetime",
        "numberRestriction": "",
        "formatRestriction": "drilldown",
        "description": "at least one time dimension as drilldown"
      }
    }
  },
  "name": "time_series",
  "title": "Time Series"
}
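Evaluating the "Facts" condition against the package model posted above (the one with "dimensionType" per dimension) could be sketched like this; the function and argument names are hypothetical:

```python
# DAM-level check for the "Facts" condition: at least one dimension of
# type "datetime" whose observed values include 3+ distinct entries
# (the "numberRestriction": "3+" condition).
def applicable_for_time_series(package, distinct_values_by_dimension):
    for name, dim in package["model"]["dimensions"].items():
        if dim.get("dimensionType") == "datetime":
            if len(distinct_values_by_dimension.get(name, ())) >= 3:
                return True
    return False
```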

@larjohn

larjohn commented Oct 12, 2017

@HimmelStein sorry for the delay - I have been sick since last week...

The 'random' tail ensures that datasets from the same region that have similar last URI parts get different names. I can't recall exactly what led me to this, but here is an example:

http://datasets.obeu.com/athens/2016/expenditure
http://datasets.obeu.com/thessaloniki/2013/expenditure

In order to select a simple name for those two (not containing dashes etc.), one would use the last part of the URI, but here it is the same for both. So creating a hash of the URI and taking a part of it minimizes name clashes.
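The naming scheme described above could be sketched as follows (SHA-1 and the 5-character tail length are assumptions for illustration; the actual hash and length used are not stated):

```python
import hashlib

def short_name(uri):
    """Combine the last URI part with a short hash of the full URI,
    so otherwise-clashing names like "expenditure" become distinct."""
    last = uri.rstrip("/").rsplit("/", 1)[-1]          # e.g. "expenditure"
    tail = hashlib.sha1(uri.encode()).hexdigest()[:5]  # short hash of the URI
    return f"{last}__{tail}"
```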

The restriction seems good, give me some time to implement it in Indigo.

@larjohn

larjohn commented Oct 12, 2017

@HimmelStein I can't find the updated dam.json. Can you check so that I can update the running instance on the Fraunhofer server?

@HimmelStein
Contributor Author

@larjohn We have not checked it in, as we are waiting for your feedback on the format (the conditions used for time series); see my last comment above (the JSON structure).

@wk0206

wk0206 commented Oct 17, 2017

@larjohn I updated the dam.json, please check.

@larjohn

larjohn commented Oct 18, 2017

 "conditions": {
          "Facts": {
            "dimension": "datetime",
            "numberRestriction": "3+",
            "formatRestriction": "",
            "description": "at least one dimension of type \"datetime\" with three or more different values available"
          },
          "Aggregates": {
            "dimension": "datetime",
            "numberRestriction": "",
            "formatRestriction": "drilldown",
            "description":"at least one time dimension as drilldown"
          }
        }

So I revisited the constraints, here are my comments:

  1. The constraints should be embedded into each configuration they apply to, not the whole algorithm, as different configurations may require different things.

  2. The filtering is more obviously done at the DAM level, but...

  3. ...the first constraint is evaluated before building the algorithm input, while the second is evaluated while building it. The former should be evaluated at the DAM level (do not show datasets that cannot produce any correct input for this algorithm configuration). The latter should be evaluated on the front-end (Indigo) side. So let's define a way to separate them (a custom attribute?).

  4. dimension should be dimension_type, numberRestriction should be cardinalNumberRestriction, and formatRestriction should be roleRestriction (with these possible values: {measure, field, aggregate, drilldown, sort, cut}).
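Putting points 1, 3, and 4 together, a conditions block inside a configuration might look like this (a sketch only, not an agreed format; "evaluationLevel" is a hypothetical name for the custom attribute from point 3):

```
"aggregate": {
  "conditions": {
    "Facts": {
      "dimension_type": "datetime",
      "cardinalNumberRestriction": "3+",
      "roleRestriction": "",
      "evaluationLevel": "DAM",
      "description": "at least one dimension of type \"datetime\" with three or more different values available"
    },
    "Aggregates": {
      "dimension_type": "datetime",
      "cardinalNumberRestriction": "",
      "roleRestriction": "drilldown",
      "evaluationLevel": "indigo",
      "description": "at least one time dimension as drilldown"
    }
  }
}
```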
