Create unified ES end-point for health scanner #215

bishax · 2020-01-22T14:17:34Z

This would unify the three datasets under a common schema, removing the need for GraphQL, whose cold Lambda starts are currently causing poor performance.
This will be implemented as a Luigi task that takes data from the latest ES index for each dataset, applies a remapping, and inserts the result under a unified index (adding a flag for the dataset type).
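
A minimal sketch of the remapping step described above, independent of Luigi and of any ES client. The field maps are partial and only illustrative (the field names come from the mappings discussed later in this thread), and `remap` and `FIELD_MAPS` are hypothetical names, not existing code.

```python
from typing import Any, Dict

# Hypothetical, partial per-dataset field maps (old name -> unified name).
FIELD_MAPS: Dict[str, Dict[str, str]] = {
    "company": {"name_of_organisation": "name"},
    "meetup": {"name_of_group": "name", "id_of_group": "id"},
}


def remap(doc: Dict[str, Any], entity_type: str,
          field_maps: Dict[str, Dict[str, str]] = FIELD_MAPS) -> Dict[str, Any]:
    """Rename the fields of one document and flag which dataset it came from."""
    mapping = field_maps[entity_type]
    # Keep only the fields listed in the mapping; unmapped fields are dropped.
    unified = {new: doc[old] for old, new in mapping.items() if old in doc}
    # The dataset flag (assumed here to be called type_of_entity).
    unified["type_of_entity"] = entity_type
    return unified
```

A task along these lines would read each document from the source index, pass it through `remap`, and bulk-insert the result into the unified index.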

mindrones · 2020-01-22T15:14:00Z

adding a flag for the dataset type

there should be type_of_entity

Write schema remappings

I'll do this now for the discussion

bishax · 2020-01-22T15:40:57Z

type_of_entity : meetup, company, project ?

mindrones · 2020-01-22T15:50:59Z

yes

mindrones · 2020-01-22T17:03:18Z

This is just comparing CB, MU, NIH for Mosaic (not EURITO).

(EDIT: make sure to scroll the yaml portions, for some reason they have a max height)

identification

id:
  mu:
    id_of_group: string
  nih:
    id_of_project: string
name:
  cb:
    name_of_organisation: string
  mu:
    name_of_group: string
  nih:
    title_of_organisation: string

content

description:
  cb:
    textBody_descriptive_organisation: string
  mu:
    textBody_descriptive_group: string
  nih:
    textBody_descriptive_project: string
brief:
  cb:
    textBody_summary_organisation: string
  nih:
    textBody_abstract_project: string
title:
  nih:
    title_of_project: string

time

start_date:
  cb:
    date_birth_organisation: date # yyyy-MM-dd
  mu:
    date_start_group: date # yyyy-MM-dd
  nih:
    date_start_project: date # yyyy-MM-dd
end_date:
  cb:
    date_death_organisation: date # yyyy-MM-dd
  nih:
    date_end_project: date # yyyy-MM-dd
update_date:
  cb:
    datetime_updated_organisation: date # yyyy-MM-dd HH:mm:ss

geo

continentName:
  cb, nih:
    placeName_continent_organisation: string
  mu:
    placeName_continent_group: string
continentId:
  cb, nih:
    id_of_continent: string
    id_continent_organisation: string
  mu:
    id_continent_group: string
countryId:
  cb:
    id_iso2_country: string
    id_iso3_country: string
    id_isoNumeric_country: integer
  mu:
    id_country_group: string
    id_iso2_country: string
    id_iso3_country: string
    id_isoNumeric_country: integer
  nih:
    id_iso2_country: string
    id_iso3_country: string
    id_isoNumeric_country: integer
countryName:
  cb, nih:
    placeName_country_organisation: string
  mu:
    placeName_country_group: string
stateId:
  cb, mu, nih:
    id_state_organisation: string
stateName:
  cb, nih:
    placeName_state_organisation: string
regionName:
  cb:
    placeName_region_organisation: string
city:
  cb, nih:
    placeName_city_organisation: string
  mu:
    placeName_city_group: string
zipcode:
  nih:
    placeName_zipcode_organisation: string
address:
  cb:
    address_of_organisation: string
location:
  cb:
    coordinate_of_city:
      lat: float
      lon: float
  mu:
    coordinate_of_group:
      lat: float
      lon: float
  nih:
    coordinate_of_organisation:
      lat: float
      lon: float

metrics

novelty:
  cb:
    rank_rhodonite_organisation: float
  mu:
    rank_rhodonite_group: float
  nih:
    rank_rhodonite_abstract: float
size:
  cb:
    count_employee_organisation: string # 1-100
  mu:
    count_member_group: integer

classification

type_of_entity:
  cb, mu, nih:
    type_of_entity: string
is_duplicate:
  nih:
    booleanFlag_duplicate_abstract: boolean
is_autotranslated:
  mu:
    booleanFlag_autotranslated_entity: boolean
is_health:
  cb:
    booleanFlag_health_organisation: boolean
terms_mesh:
  cb:
    terms_mesh_description: string[]
  mu:
    terms_mesh_group: string[]
  nih:
    terms_mesh_abstract: string[]
terms_sdg:
  nih:
    terms_sdg_abstract: string[]
terms_place:
  cb, mu, nih:
    terms_of_countryTags: string[]
terms_topics:
  mu:
    terms_topics_group: string[]
  nih:
    terms_descriptive_project: string[]
terms_funders:
  cb, nih:
    terms_of_funders: string[]
terms_language:
  mu:
    terms_iso2lang_entity: string[]

web

url_cb:
  cb:
    url_crunchBase_organisation: string
url_fb:
  cb:
    url_facebook_organisation: string
url_li:
  cb:
    url_linkedIn_organisation: string
url_site:
  cb:
    url_of_organisation: string
  mu:
    url_of_group: string
url_tw:
  cb:
    url_twitter_organisation: string

funding

funding_cost:
  cb:
    cost_of_funding: float
  nih:
    cost_total_project: float
funding_rounds:
  cb:
    count_rounds_funding: integer
  nih:
    json_funding_project:
      []:
        cost_ref: long
        end_date: date
        start_date: date
        year: integer
funding_currency:
  cb:
    currency_of_funding: string
  nih:
    currency_total_cost: string
funding_last_date: # gah..
  cb:
    date_last_funding: date # yyyy-MM-dd
funding_year:
  nih:
    year_fiscal_funding: integer
funding_entity:
  nih:
    title_of_funder: string

# this could become an object (see also `json_funding_project` above)
#funding:
#  cost: float
#  rounds: integer
#  currency: string
#  date_last_funding?: date # yyyy-MM-dd

custom

owner:
  cb:
    id_parent_organisation: string
status:
  cb:
    status_of_organisation: string
alias:
  cb:
    terms_alias_organisation: string[]
terms_category:
  cb:
    terms_category_organisation: string[] # multiple, of a group of known categories
  mu:
    name_of_category: string # single, of a group of known categories
terms_subcategory:
  cb:
    terms_subcategory_organisation: string[]
roles:
  cb:
    terms_roles_organisation: string[]
type:
  cb:
    type_of_organisation: string

unused

cb:
  _cost_usd2018_organisation: float
  _terms_sdg_summary: string[]

mu:
  _id_state_group: string
  _placeName_state_group: string
  _terms_memberOrigin_group: string[]
  _terms_sdg_description: string[]

mindrones · 2020-01-22T17:14:44Z

Back then, even when using aliases, the response still contained items with the original, non-aliased schema (which basically defeats the purpose of aliasing, although it does help when composing the query).

As an alternative to this re-mapping, we could investigate whether newer versions of Elasticsearch can return items with the aliased schema.

bishax · 2020-01-23T08:27:06Z

Even if that option were now available, I think migrating to a new version of ES would be a larger effort, particularly if there have been any breaking changes.
Furthermore, this way reduces the number of queries needing to be made?

bishax · 2020-01-23T14:08:46Z

id:
  mu:
    id_of_group: string
  nih:
    id_of_project: string

Why not id_parent_organisation for cb?

mindrones · 2020-01-23T14:12:59Z

That would be the id of the main entity; id_parent_organisation identifies another company in Crunchbase, I think.

Btw, if discussing via snippets sounds difficult we can start a branch and review mappings via PR comments?

mindrones · 2020-01-23T14:14:22Z

Not sure why there is no id_of_organisation for Crunchbase entities.

bishax · 2020-01-23T14:14:55Z

Is there any documentation for RWJF outside of nestauk/nesta?

bishax · 2020-01-23T14:16:47Z

I have a branch.
I'll push and open a PR when I have a first pass

mindrones · 2020-01-23T14:30:17Z

I'll push and open a PR when I have a first pass

OK.

Furthermore, this way reduces the number of queries needing to be made?

I don't think so: by using the alias health_scanner we can query all the endpoints behind that alias at the same time (the problem, as we discussed, being that you get an array of items, each with the schema of its originating index).
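
To illustrate the problem: each hit in an Elasticsearch search response carries the `_index` it came from and a `_source` in that index's own schema, so a client querying through the alias would have to normalise per index. A rough sketch, where the index names and the partial field maps are made up for illustration:

```python
from typing import Any, Dict, List

# Hypothetical index names and partial field maps, for illustration only.
MAPS_BY_INDEX: Dict[str, Dict[str, str]] = {
    "companies_v1": {"name_of_organisation": "name"},
    "meetup_v1": {"name_of_group": "name"},
}


def normalise_hits(hits: List[Dict[str, Any]],
                   maps_by_index: Dict[str, Dict[str, str]] = MAPS_BY_INDEX
                   ) -> List[Dict[str, Any]]:
    """Rename the fields of each hit according to the index it came from."""
    out = []
    for hit in hits:
        mapping = maps_by_index[hit["_index"]]
        source = hit["_source"]
        out.append({new: source[old] for old, new in mapping.items() if old in source})
    return out


# Shape of the hits an alias query returns: mixed schemas in a single array.
hits = [
    {"_index": "companies_v1", "_source": {"name_of_organisation": "Acme"}},
    {"_index": "meetup_v1", "_source": {"name_of_group": "Health Hackers"}},
]
```

With a unified index this client-side step would not be needed, since every item would already share one schema.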

mindrones · 2020-05-29T14:29:40Z

@jaklinger here's the definitive mapping in CSV format:

new_name,CB,MU,NIH
address,address_of_organisation,,
brief,textBody_summary_organisation,,textBody_abstract_project
continent_id,id_of_continent,id_continent_group,id_of_continent
continent,placeName_continent_organisation,placeName_continent_group,placeName_continent_organisation
country_id,id_iso2_country,id_iso2_country,id_iso2_country
country,placeName_country_organisation,placeName_country_group,placeName_country_organisation
city,placeName_city_organisation,placeName_city_group,placeName_city_organisation
date_end,date_death_organisation,,date_end_project
date_start,date_birth_organisation,date_start_group,date_start_project
date_update,datetime_updated_organisation,,
description,textBody_descriptive_organisation,textBody_descriptive_group,textBody_descriptive_project
funding_cost,cost_of_funding,,cost_total_project
funding_currency,currency_of_funding,,currency_total_cost
funding_rounds,count_rounds_funding,,json_funding_project
funding_year,,,year_fiscal_funding
funder,,,title_of_funder
id,,id_of_group,id_of_project
is_autotranslated,,booleanFlag_autotranslated_entity,
is_duplicate,,,booleanFlag_duplicate_abstract
is_health,booleanFlag_health_organisation,,
location,coordinate_of_city,coordinate_of_group,coordinate_of_organisation
name,name_of_organisation,name_of_group,title_of_organisation
novelty,rank_rhodonite_organisation,rank_rhodonite_group,rank_rhodonite_abstract
parent_id,id_parent_organisation,,
region_name,placeName_region_organisation,,
source,type_of_entity,type_of_entity,type_of_entity
state_id,id_state_organisation,id_state_organisation,id_state_organisation
state,placeName_state_organisation,,placeName_state_organisation
status,status_of_organisation,,
size,count_employee_organisation,count_member_group,
terms_alias,terms_alias_organisation,,
terms_category,terms_category_organisation,name_of_category,
terms_funder,terms_of_funders,,terms_of_funders
terms_lang,,terms_iso2lang_entity,
terms_mesh,terms_mesh_description,terms_mesh_group,terms_mesh_abstract
terms_place,terms_of_countryTags,terms_of_countryTags,terms_of_countryTags
terms_role,terms_roles_organisation,,
terms_sdg,,,terms_sdg_abstract
terms_subcategory,terms_subcategory_organisation,,
terms_topics,,terms_topics_group,terms_descriptive_project
title,,,title_of_project
type,type_of_organisation,,
url_source,url_crunchBase_organisation,url_of_group,
url_fb,url_facebook_organisation,,
url_li,url_linkedIn_organisation,,
url_site,url_of_organisation,,
url_tw,url_twitter_organisation,,
zipcode,,,placeName_zipcode_organisation
<remove>,id_iso3_country,id_iso3_country,id_iso3_country
<remove>,id_isoNumeric_country,id_isoNumeric_country,id_isoNumeric_country
<remove>,id_continent_organisation (dupe),_id_state_group,id_continent_organisation (dupe)
<remove>,date_last_funding,_placeName_state_group,
<remove>,_cost_usd2018_organisation,_terms_memberOrigin_group,
<remove>,_terms_sdg_summary,_terms_sdg_description,

mindrones · 2020-05-29T14:31:05Z

I've marked some fields for removal because they're duplicated, redundant, temporary, or unused: see the <remove> rows.
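
As a sketch of how the CSV above could be consumed, the snippet below turns it into one rename map per source plus a list of fields to drop. It uses a three-row excerpt of the CSV; `build_rename_maps` is a hypothetical helper, and annotations such as "(dupe)" in the full file would need stripping first.

```python
import csv
import io
from typing import Dict, List, Tuple

# Excerpt of the mapping CSV from this thread.
CSV_EXCERPT = """new_name,CB,MU,NIH
brief,textBody_summary_organisation,,textBody_abstract_project
id,,id_of_group,id_of_project
<remove>,id_iso3_country,id_iso3_country,id_iso3_country
"""


def build_rename_maps(csv_text: str) -> Tuple[Dict[str, Dict[str, str]], Dict[str, List[str]]]:
    """Return ({source: {old_name: new_name}}, {source: [fields to remove]})."""
    reader = csv.DictReader(io.StringIO(csv_text))
    sources = [col for col in reader.fieldnames if col != "new_name"]
    renames: Dict[str, Dict[str, str]] = {s: {} for s in sources}
    removed: Dict[str, List[str]] = {s: [] for s in sources}
    for row in reader:
        for source in sources:
            old = (row[source] or "").strip()
            if not old:
                continue  # this source has no field for this unified name
            if row["new_name"] == "<remove>":
                removed[source].append(old)
            else:
                renames[source][old] = row["new_name"]
    return renames, removed


renames, removed = build_rename_maps(CSV_EXCERPT)
```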

mindrones · 2020-05-29T16:15:38Z

In the above CSV, I've renamed dataset to source, and I'd ask that the current values of type_of_entity be changed to crunchbase, meetup and NIH, so that in arxlive this could be source = arxiv | biorxiv | medrxiv (instead of article_source, for uniformity).

mindrones · 2020-05-30T13:32:49Z

In the above csv, I've changed alias into terms_alias (I didn't realise it is an array).

bishax self-assigned this Jan 22, 2020

mindrones added this to the Mosaic 0.3.x milestone Jan 22, 2020

mindrones self-assigned this Jan 22, 2020

bishax mentioned this issue Jan 23, 2020

[215] health scanner unify ES #216

Draft

mindrones added proj: HealthMosaic tech: ES labels May 29, 2020

mindrones mentioned this issue Jun 5, 2020

[267] Tidy & slim schema transformations #281

Merged

9 tasks
