naustica/openalex

Workflow for Processing and Loading OpenAlex data into Google BigQuery

This repository contains instructions on how to extract and transform OpenAlex data for data analysis with Google BigQuery.

Requirements

The following packages are required for this workflow.

Download Snapshot

OpenAlex snapshots are available through AWS. Instructions for downloading can be found here: https://docs.openalex.org/download-all-data/download-to-your-machine.

$ aws s3 sync 's3://openalex' 'openalex-snapshot' --no-sign-request

Data transformation

To reduce the size of the data stored in BigQuery, the works entity is transformed before loading. The transformation is carried out on the High Performance Cluster of the GWDG Göttingen, but the script can also be used on other servers with only minor adjustments. Other entities, such as authors, publishers, institutions, funders and sources, are not affected by the data transformation step.

$ sbatch openalex_works_hpc.sh
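The contents of openalex_works_hpc.sh are not shown in this README, so the following is only a minimal sketch of what a per-file transformation pass could look like, assuming the snapshot part files are gzip-compressed JSON Lines (one work per line). The function name `transform_file` and the `transform` callback are hypothetical and are not taken from the repository's scripts.

```python
import gzip
import json
from pathlib import Path
from typing import Any, Callable, Dict


def transform_file(
    src: Path,
    dst: Path,
    transform: Callable[[Dict[str, Any]], Dict[str, Any]],
) -> int:
    """Stream a gzipped JSON Lines file, apply `transform` to each record,
    and write the result as gzipped JSON Lines. Returns the record count."""
    count = 0
    with gzip.open(src, "rt", encoding="utf-8") as fin, \
            gzip.open(dst, "wt", encoding="utf-8") as fout:
        for line in fin:
            if not line.strip():
                continue  # skip blank lines
            record = transform(json.loads(line))
            fout.write(json.dumps(record) + "\n")
            count += 1
    return count
```

Processing line by line keeps memory use flat regardless of file size, which is why a pass like this can run as one independent job per part file on a cluster.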

Uploading Files to Google Bucket

Files can be uploaded to a Google Cloud Storage bucket using gsutil. Note that only data in the works entity has been transformed. All other data can be uploaded unchanged from openalex-snapshot/data.

$ gsutil -m cp -r /scratch/users/haupka/works gs://bigschol

Creating a BigQuery Table

Use bq load to create a table in BigQuery from data stored in a Google Cloud Storage bucket. Schemas for the tables can be found here.

$ bq load --ignore_unknown_values --source_format=NEWLINE_DELIMITED_JSON subugoe-collaborative:openalex.works 'gs://bigschol/works/*.gz' schema_openalex_work.json
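Because --ignore_unknown_values makes bq drop values that are not represented in the schema instead of failing, it can be useful to check beforehand which fields of a record would be discarded. The following is a minimal sketch, assuming the schema file is a BigQuery JSON schema (a list of objects with a `name` key). The helper `unknown_fields` and the two-field sample schema are hypothetical and only inspect top-level fields.

```python
import json
from typing import Any, Dict, List


def unknown_fields(record: Dict[str, Any], schema: List[Dict[str, Any]]) -> List[str]:
    """Return the top-level keys of `record` that the schema does not define."""
    known = {field["name"] for field in schema}
    return sorted(set(record) - known)


# Hypothetical two-field schema for illustration only. In practice the schema
# could be read with json.load() from a file such as schema_openalex_work.json.
sample_schema = [
    {"name": "id", "type": "STRING", "mode": "NULLABLE"},
    {"name": "has_abstract", "type": "BOOLEAN", "mode": "NULLABLE"},
]

sample_record = {"id": "W1", "has_abstract": True, "mesh": []}
print(unknown_fields(sample_record, sample_schema))
```

Any name this reports is a field that the load would skip without an error.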

Notes

  • The following fields are not included in the works schema: mesh, related_works, concepts.
  • During the data transformation step, the field abstract_inverted_index is replaced by an additional field has_abstract.
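The two notes above can be sketched as a per-record function. This is an illustration only, not the repository's actual code: the name `transform_work` is hypothetical, and it assumes that has_abstract is a boolean that is true when abstract_inverted_index is present and non-empty.

```python
from typing import Any, Dict

# Fields that are not part of the works schema, per the notes above.
DROPPED_FIELDS = ("mesh", "related_works", "concepts")


def transform_work(work: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `work` without the dropped fields and with
    abstract_inverted_index replaced by a has_abstract flag."""
    out = {key: value for key, value in work.items() if key not in DROPPED_FIELDS}
    inverted_index = out.pop("abstract_inverted_index", None)
    # Assumption: a missing, null, or empty index counts as "no abstract".
    out["has_abstract"] = bool(inverted_index)
    return out
```

The function builds a new dictionary rather than modifying its input, so the original record stays available for comparison or logging.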
