Create unified ES end-point for health scanner #215
Comments
there should be
I'll do this now for the discussion
yes
This is just comparing CB, MU, NIH for Mosaic (not EURITO). (EDIT: make sure to scroll the yaml portions, for some reason they have a max height)

identification
id:
  mu:
    id_of_group: string
  nih:
    id_of_project: string
name:
  cb:
    name_of_organisation: string
  mu:
    name_of_group: string
  nih:
    title_of_organisation: string

content
description:
  cb:
    textBody_descriptive_organisation: string
  mu:
    textBody_descriptive_group: string
  nih:
    textBody_descriptive_project: string
brief:
  cb:
    textBody_summary_organisation: string
  nih:
    textBody_abstract_project: string
title:
  nih:
    title_of_project: string

time
start_date:
  cb:
    date_birth_organisation: date # yyyy-MM-dd
  mu:
    date_start_group: date # yyyy-MM-dd
  nih:
    date_start_project: date # yyyy-MM-dd
end_date:
  cb:
    date_death_organisation: date # yyyy-MM-dd
  nih:
    date_end_project: date # yyyy-MM-dd
update_date:
  cb:
    datetime_updated_organisation: date # yyyy-MM-dd HH:mm:ss

geo
continentName:
  cb, nih:
    placeName_continent_organisation: string
  mu:
    placeName_continent_group: string
continentId:
  cb, nih:
    id_of_continent: string
    id_continent_organisation: string
  mu:
    id_continent_group: string
countryId:
  cb:
    id_iso2_country: string
    id_iso3_country: string
    id_isoNumeric_country: integer
  mu:
    id_country_group: string
    id_iso2_country: string
    id_iso3_country: string
    id_isoNumeric_country: integer
  nih:
    id_iso2_country: string
    id_iso3_country: string
    id_isoNumeric_country: integer
countryName:
  cb, nih:
    placeName_country_organisation: string
  mu:
    placeName_country_group: string
stateId:
  cb, mu, nih:
    id_state_organisation: string
stateName:
  cb, nih:
    placeName_state_organisation: string
regionName:
  cb:
    placeName_region_organisation: string
city:
  cb, nih:
    placeName_city_organisation: string
  mu:
    placeName_city_group: string
zipcode:
  nih:
    placeName_zipcode_organisation: string
address:
  cb:
    address_of_organisation: string
location:
  cb:
    coordinate_of_city:
      lat: float
      lon: float
  mu:
    coordinate_of_group:
      lat: float
      lon: float
  nih:
    coordinate_of_organisation:
      lat: float
      lon: float

metrics
novelty:
  cb:
    rank_rhodonite_organisation: float
  mu:
    rank_rhodonite_group: float
  nih:
    rank_rhodonite_abstract: float
size:
  cb:
    count_employee_organisation: string # 1-100
  mu:
    count_member_group: integer

classification
type_of_entity:
  cb, mu, nih:
    type_of_entity: string
is_duplicate:
  nih:
    booleanFlag_duplicate_abstract: boolean
is_autotranslated:
  mu:
    booleanFlag_autotranslated_entity: boolean
is_health:
  cb:
    booleanFlag_health_organisation: boolean
terms_mesh:
  cb:
    terms_mesh_description: string[]
  mu:
    terms_mesh_group: string[]
  nih:
    terms_mesh_abstract: string[]
terms_sdg:
  nih:
    terms_sdg_abstract: string[]
terms_place:
  cb, mu, nih:
    terms_of_countryTags: string[]
terms_topics:
  mu:
    terms_topics_group: string[]
  nih:
    terms_descriptive_project: string[]
terms_funders:
  cb, nih:
    terms_of_funders: string[]
terms_language:
  mu:
    terms_iso2lang_entity: string[]

web
url_cb:
  cb:
    url_crunchBase_organisation: string
url_fb:
  cb:
    url_facebook_organisation: string
url_li:
  cb:
    url_linkedIn_organisation: string
url_site:
  cb:
    url_of_organisation: string
  mu:
    url_of_group: string
url_tw:
  cb:
    url_twitter_organisation: string

funding
funding_cost:
  cb:
    cost_of_funding: float
  nih:
    cost_total_project: float
funding_rounds:
  cb:
    count_rounds_funding: integer
  nih:
    json_funding_project:
      []:
        cost_ref: long
        end_date: date
        start_date: date
        year: integer
funding_currency:
  cb:
    currency_of_funding: string
  nih:
    currency_total_cost: string
funding_last_date: # gah..
  cb:
    date_last_funding: date # yyyy-MM-dd
funding_year:
  nih:
    year_fiscal_funding: integer
funding_entity:
  nih:
    title_of_funder: string
# this could become an object (see also `json_funding_project` above)
#funding:
#  cost: float
#  rounds: integer
#  currency: string
#  date_last_funding?: date # yyyy-MM-dd

custom
owner:
  cb:
    id_parent_organisation: string
status:
  cb:
    status_of_organisation: string
alias:
  cb:
    terms_alias_organisation: string[]
terms_category:
  cb:
    terms_category_organisation: string[] # multiple, of a group of known categories
  mu:
    name_of_category: string # single, of a group of known categories
terms_subcategory:
  cb:
    terms_subcategory_organisation: string[]
roles:
  cb:
    terms_roles_organisation: string[]
type:
  cb:
    type_of_organisation: string

unused
cb:
  _cost_usd2018_organisation: float
  _terms_sdg_summary: string[]
mu:
  _id_state_group: string
  _placeName_state_group: string
  _terms_memberOrigin_group: string[]
  _terms_sdg_description: string[]
Back then, even when using aliases, the response still contained items with the original, non-aliased schema (which basically defeats the purpose of aliasing, although it helps when composing the query). As an alternative to this re-mapping, we could investigate whether newer versions of Elasticsearch can return items with the aliased schema.
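To make the aliasing point concrete, here is a minimal sketch of a field alias, assuming the typeless mapping syntax of Elasticsearch 7 and later. The alias name `name` and the helper `search_body` are illustrative assumptions, not taken from the project's actual mappings. A query can target the alias, but the documents in the response keep the original field name, which is the limitation described above.

```python
# Sketch only: index mapping with a field alias (Elasticsearch 7+ syntax assumed).
# "name" is a hypothetical alias pointing at the real CB field.
ALIAS_MAPPING = {
    "mappings": {
        "properties": {
            "name_of_organisation": {"type": "text"},
            "name": {"type": "alias", "path": "name_of_organisation"},
        }
    }
}


def search_body(term: str) -> dict:
    """Build a match query against the alias (hypothetical helper)."""
    # The query may use the alias, but each hit is still returned
    # under the original field name, "name_of_organisation".
    return {"query": {"match": {"name": term}}}
```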
Even if that option were now available, I think migrating to a new version of ES would be a larger effort, particularly if there have been any breaking changes.
Why not
That would be the
Btw, if discussing via snippets sounds difficult, we can start a branch and review mappings via PR comments?
Not sure why there is no
Is there any documentation for RWJF outside of
I have a branch.
OK.
I don't think so, as by using the alias
@jaklinger here's the definitive mapping in CSV format:
I've marked some fields for removal as they're duplicate or redundant, temporary or unused, see
In the above
In the above
This would unify the three datasets under a common schema, removing the need for GraphQL, whose cold Lambda starts are currently causing poor performance.
This will be implemented as a Luigi task that takes data from the latest ES index for each dataset, applies a remapping, and inserts the documents into a unified index (adding a flag for the dataset type).
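The remapping step could look roughly like the sketch below. It is not the actual Luigi task: the `FIELD_MAP` excerpt is taken from the comparison in the comments rather than from the definitive CSV mapping, and the flag name `dataset` and the functions `remap` and `remap_all` are assumptions made for illustration. Reading from and writing to Elasticsearch is left out so that the remapping logic stands alone.

```python
from typing import Any, Dict, Iterable, Iterator

# Illustrative excerpt: unified field -> dataset -> source field.
# Field names come from the comparison in the comments, not the final CSV.
FIELD_MAP: Dict[str, Dict[str, str]] = {
    "name": {
        "cb": "name_of_organisation",
        "mu": "name_of_group",
        "nih": "title_of_organisation",
    },
    "description": {
        "cb": "textBody_descriptive_organisation",
        "mu": "textBody_descriptive_group",
        "nih": "textBody_descriptive_project",
    },
    "start_date": {
        "cb": "date_birth_organisation",
        "mu": "date_start_group",
        "nih": "date_start_project",
    },
}


def remap(doc: Dict[str, Any], dataset: str) -> Dict[str, Any]:
    """Rename one source document's fields to the unified schema."""
    # "dataset" is a hypothetical name for the dataset-type flag.
    unified: Dict[str, Any] = {"dataset": dataset}
    for target, sources in FIELD_MAP.items():
        source = sources.get(dataset)
        # Skip fields this dataset does not provide or this document lacks.
        if source is not None and source in doc:
            unified[target] = doc[source]
    return unified


def remap_all(
    docs_by_dataset: Dict[str, Iterable[Dict[str, Any]]]
) -> Iterator[Dict[str, Any]]:
    """Yield unified documents for every dataset, ready for bulk indexing."""
    for dataset, docs in docs_by_dataset.items():
        for doc in docs:
            yield remap(doc, dataset)
```

Fields that are not listed in the mapping are dropped, which matches the idea of removing unused or redundant fields, though the real task may choose to keep them.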