This repository has been archived by the owner on Aug 13, 2021. It is now read-only.

[267] Tidy & slim schema transformations #281

Merged (22 commits) on Jun 9, 2020

@jaklinger jaklinger commented Jun 3, 2020

Related to #267, which won't be closed until #280 (the base of this branch) is addressed.

  • Prune deprecated schema transformations
  • Only one ontology per dataset (no project specific ontologies)
  • Tidy directory naming, and refactor references accordingly
  • Add entity_type to new dataset ontologies
  • Remove unused ontologies from tier_1.json (renamed ontology.json)
  • Add tests to ensure ES mappings are both self-consistent and consistent with the slimmed schema transformations
  • Rollback any inadvertently changed json files, unrelated to this PR
  • Confirm all new mappings and old mappings are equivalent, then prune old mappings
  • Spawn PR [267b] Iterate on development pipelines with new ES config setup #282 to iterate on dev pipeline validation
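
The consistency check mentioned in the task list could be sketched roughly as below. This is a minimal illustration, not the tests added in this PR. The function names `mapped_fields` and `check_consistency`, the shape of the transformation (a flat dict from source field name to target field name) and the example field names are all assumptions. Only the `mappings` / `_doc` nesting follows the mapping layout visible in the diff further down, with `properties` being the standard Elasticsearch key for field definitions.

```python
def mapped_fields(es_mapping, doc_type="_doc"):
    """Return the set of top-level field names declared in an ES mapping."""
    return set(es_mapping["mappings"][doc_type].get("properties", {}))


def check_consistency(transformation, es_mapping):
    """Compare transformation targets against the fields in the mapping.

    Returns a tuple (missing, unused):
      missing: target fields that the mapping does not declare
      unused:  mapped fields that no transformation produces
    """
    targets = set(transformation.values())
    fields = mapped_fields(es_mapping)
    return sorted(targets - fields), sorted(fields - targets)


# Hypothetical example data, for illustration only
transformation = {"title": "title_of_article", "abstract": "textBody_abstract_article"}
es_mapping = {
    "mappings": {
        "_doc": {
            "properties": {
                "title_of_article": {"type": "text"},
                "textBody_abstract_article": {"type": "text"},
            }
        }
    }
}

missing, unused = check_consistency(transformation, es_mapping)
assert not missing and not unused
```

Returning both differences, rather than a single boolean, makes a failing test report which field names are out of step.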

@jaklinger jaklinger changed the title from "pruned deprecated schema transformations" to "[267] Tidy a slim schema transformations" Jun 3, 2020
@jaklinger jaklinger self-assigned this Jun 3, 2020
@jaklinger jaklinger changed the title from "[267] Tidy a slim schema transformations" to "[267] Tidy & slim schema transformations" Jun 5, 2020
@jaklinger jaklinger commented

@mindrones
For the review, could you please focus on the files under nesta/core/schemas/tier_1/? All other changes are in the Python code, reflecting the corresponding simplifications to the directory and config structure.

@jaklinger jaklinger requested a review from mindrones June 5, 2020 14:50
@jaklinger jaklinger marked this pull request as ready for review June 5, 2020 14:50

@mindrones mindrones left a comment

Just noted down a couple of changes I'd like to see, but not really sure what implications these would have.

nesta/core/schemas/tier_1/datasets/arxiv.json (two resolved review threads)
@@ -1,24 +1,24 @@
{
"mappings":{
"_doc":{

@mindrones mindrones commented

(just a comment, in Svizzle I'm actually going back to tabs as it's editor configurable so it makes everyone happy :)

@jaklinger jaklinger commented

oops, actually I have a new unit test that this would have failed (all json in the repo must be clean to pass) - so this will fail anyway :)
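
A check of the kind described in this comment could look roughly like the sketch below. It assumes that "clean" means the file content is exactly what `json.dumps` produces with sorted keys and a fixed indent. The actual definition used by the repository's test is not shown in this thread, and `is_clean_json` and the four-space indent are illustrative choices.

```python
import json


def is_clean_json(text, indent=4):
    """True if `text` is already key-sorted, consistently indented JSON."""
    canonical = json.dumps(json.loads(text), sort_keys=True, indent=indent)
    return text.strip() == canonical


# Sorted keys, four-space indent: passes
clean = '{\n    "a": 1,\n    "b": {\n        "c": 2\n    }\n}'
# Same data, but unsorted and on one line: fails
unsorted = '{"b": {"c": 2}, "a": 1}'
# Tab-indented: fails under this (assumed) space-indented convention
tabbed = '{\n\t"a": 1\n}'

assert is_clean_json(clean)
assert not is_clean_json(unsorted)
assert not is_clean_json(tabbed)
```

Under this definition a tab-indented file is rejected even when its keys are sorted, which is consistent with the remark that such a file "will fail anyway".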


@mindrones mindrones left a comment

Good to merge for my part 👍

* changed schema_transformor to use new simpler mapping

* removed to/from keys

* new null syntax mapping implemented

* adding temporary eurito-dev index to avoid conflating es7 compatibility issues

* adding temporary eurito-dev index to avoid conflating es7 compatibility issues

* testing es7 on cordis only

* testing es7 on cordis only

* testing es7 on cordis only

* changes to make cordis es7 run

* eurito-dev iteration

* compatibility issues between arxlive and eurito arxiv

* sorted json

* pycountry change no longer assumes not null country

* needed to split pathstub args

* removed redundant es mappings

* old new index paradigm fix

Co-authored-by: Joel Klinger <[email protected]>
@jaklinger jaklinger merged commit 844b2a2 into 267_es_mappings Jun 9, 2020
@jaklinger jaklinger deleted the 267a_schematrans branch June 9, 2020 11:34
jaklinger added a commit that referenced this pull request Jun 9, 2020
* changed branch name

* mappings build

* updated docs

* updated docs

* updated docs

* added docstrings

* added dynamic strict to settings

* removed index.json in favour of a single defaults file

* using soft alias until a future PR to minimise changes

* cleaned and sorted json

* [267] Tidy & slim schema transformations (#281)

* pruned deprecated schema transformations

* updated fos fieldname on arxlive

* unified data set schema transformations

* restructured directory

* refactored references to schema_transformation

* refactored references to schema_transformation

* slimmed down transformations, and included entity_type

* pruned ontology

* tidied schemas

* consistency tests

* reverted unrelated json file

* harmonised name fieldsofstudy across arxiv

* added novelty back in

* sorted json

* sorted json

* sorted json

Co-authored-by: Joel Klinger <[email protected]>

Co-authored-by: Joel Klinger <[email protected]>
jaklinger added a commit that referenced this pull request Jun 9, 2020
* make sure conf dir is empty

* simplified es config

* added orm es config reader

* modified setup_es to pick up new es config

* swapped es_mode for boolean

* aliases now consistent with config

* aliases now automatically located

* added endpoint field to estasks

* added endpoint field to sql2estasks

* [267] Pool ES mappings across datasets (#280)

* patched out es config setup from tests

* removed redundant tests

* fixed json formatting

* none included for testing

* picked up bug in test

Co-authored-by: Joel Klinger <[email protected]>
jaklinger added a commit that referenced this pull request Jun 26, 2020
* make sure conf dir is empty

* simplified es config

* added orm es config reader

* modified setup_es to pick up new es config

* swapped es_mode for boolean

* aliases now consistent with config

* aliases now automatically located

* added endpoint field to estasks

* added endpoint field to sql2estasks

* changed branch name

* mappings build

* updated docs

* updated docs

* updated docs

* added docstrings

* pruned deprecated schema transformations

* updated fos fieldname on arxlive

* unified data set schema transformations

* restructured directory

* refactored references to schema_transformation

* refactored references to schema_transformation

* slimmed down transformations, and included entity_type

* pruned ontology

* tidied schemas

* consistency tests

* reverted unrelated json file

* added dynamic strict to settings

* removed index.json in favour of a single defaults file

* harmonised name fieldsofstudy across arxiv

* using soft alias until a future PR to minimise changes

* added novelty back in

* sorted json

* sorted json

* sorted json

* changed schema_transformor to use new simpler mapping

* removed to/from keys

* new null syntax mapping implemented

* cleaned and sorted json

* adding temporary eurito-dev index to avoid conflating es7 compatibility issues

* adding temporary eurito-dev index to avoid conflating es7 compatibility issues

* testing es7 on cordis only

* testing es7 on cordis only

* testing es7 on cordis only

* changes to make cordis es7 run

* eurito-dev iteration

* compatibility issues between arxlive and eurito arxiv

* sorted json

* pycountry change no longer assumes not null country

* needed to split pathstub args

* removed redundant es mappings

* empty gtr transformation

* [267] Pool ES mappings across datasets (#280)

* patched out es config setup from tests

* removed redundant tests

* fixed json formatting

* fixed bad table name (NB table was empty anyway)

* fixed bad table name (NB table was empty anyway)

* gtr ontology

* none included for testing

* added schema transformation

* picked up bug in test

* gtr ontology is self consistent

* added gtr mapping

* added gtr to config

* fixed merge conflicts

* fixed merge conflicts

* changed json field names

* institutes are now analyzed and text

* sorted and cleaned json

* added geopoint

* fixed bad json

* fixed bad json

Co-authored-by: Joel Klinger <[email protected]>
jaklinger added a commit that referenced this pull request Jul 13, 2020
* make sure conf dir is empty

* simplified es config

* added orm es config reader

* modified setup_es to pick up new es config

* swapped es_mode for boolean

* aliases now consistent with config

* aliases now automatically located

* added endpoint field to estasks

* added endpoint field to sql2estasks

* changed branch name

* mappings build

* updated docs

* updated docs

* updated docs

* added docstrings

* pruned deprecated schema transformations

* updated fos fieldname on arxlive

* unified data set schema transformations

* restructured directory

* refactored references to schema_transformation

* refactored references to schema_transformation

* slimmed down transformations, and included entity_type

* pruned ontology

* tidied schemas

* consistency tests

* reverted unrelated json file

* added dynamic strict to settings

* removed index.json in favour of a single defaults file

* harmonised name fieldsofstudy across arxiv

* using soft alias until a future PR to minimise changes

* added novelty back in

* sorted json

* sorted json

* sorted json

* changed schema_transformor to use new simpler mapping

* removed to/from keys

* new null syntax mapping implemented

* cleaned and sorted json

* adding temporary eurito-dev index to avoid conflating es7 compatibility issues

* adding temporary eurito-dev index to avoid conflating es7 compatibility issues

* testing es7 on cordis only

* testing es7 on cordis only

* testing es7 on cordis only

* changes to make cordis es7 run

* eurito-dev iteration

* compatibility issues between arxlive and eurito arxiv

* sorted json

* pycountry change no longer assumes not null country

* needed to split pathstub args

* removed redundant es mappings

* empty gtr transformation

* [267] Pool ES mappings across datasets (#280)

* patched out es config setup from tests

* removed redundant tests

* fixed json formatting

* fixed bad table name (NB table was empty anyway)

* fixed bad table name (NB table was empty anyway)

* gtr ontology

* none included for testing

* added schema transformation

* picked up bug in test

* gtr ontology is self consistent

* added gtr mapping

* added gtr to config

* fixed merge conflicts

* fixed merge conflicts

* changed json field names

* institutes are now analyzed and text

* sorted and cleaned json

* added gtr batchable

* empty test commit

* couple of tests

* tidied json

* added schema module to reqs, finished tests

* set up root task

* moved to es7 image

* removed standard token filter, as it is deprecated in es6.5 then removed in es7

* removed start/end dates since they're empty

* misalignment between batchable keys and field names

* fixed mapping and removed outcomes due to mapping explosion

* removed seconds from fund date fields

* tidied json

* added none value edgecase to str truncation

* Update elasticsearchplus.py

Co-authored-by: Joel Klinger <[email protected]>
jaklinger added a commit that referenced this pull request Sep 28, 2020
jaklinger added a commit that referenced this pull request Sep 28, 2020
jaklinger added a commit that referenced this pull request Sep 28, 2020
2 participants