This repository contains instructions for extracting and transforming OpenAlex data for analysis with Google BigQuery.
The following packages are required for this workflow.
OpenAlex snapshots are available through AWS. Instructions for downloading can be found here: https://docs.openalex.org/download-all-data/download-to-your-machine.
$ aws s3 sync 's3://openalex' 'openalex-snapshot' --no-sign-request
To reduce the size of the data stored in BigQuery, a data transformation step is applied to the `works` entity. The transformation is carried out on the high-performance cluster of the GWDG Göttingen, but the script can also be used on other servers with minor adjustments. The entities `authors`, `publishers`, `institutions`, `funders` and `sources` are not affected by the transformation step.
$ sbatch openalex_works_hpc.sh
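The transformation script itself is not reproduced here, but the kind of per-record rewrite it performs can be sketched with `jq` (assuming `jq` is available; the input record below is a hypothetical, heavily abbreviated `works` line):

```shell
# Sketch of the works transformation: drop fields excluded from the schema
# and replace abstract_inverted_index with a boolean has_abstract flag.
echo '{"id":"W1","abstract_inverted_index":{"Deep":[0]},"mesh":[],"concepts":[]}' \
  | jq -c 'del(.mesh, .related_works, .concepts)
           | .has_abstract = (.abstract_inverted_index != null)
           | del(.abstract_inverted_index)'
# → {"id":"W1","has_abstract":true}
```

In the actual workflow this kind of filter would be applied to each gzipped JSON Lines file of the snapshot rather than to a single echoed record.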
Files can be uploaded to a Google Cloud Storage bucket using `gsutil`. Note that only data in the `works` entity has been transformed; all other data can be found in `openalex-snapshot/data`.
$ gsutil -m cp -r /scratch/users/haupka/works gs://bigschol
Use `bq load` to create a table in BigQuery from data stored in a Google Cloud Storage bucket. Schemas for the tables can be found here.
$ bq load --ignore_unknown_values --source_format=NEWLINE_DELIMITED_JSON subugoe-collaborative:openalex.works gs://bigschol/works/*.gz schema_openalex_work.json
- The following fields are not included in the `works` schema: `mesh`, `related_works`, `concepts`.
- An additional field `has_abstract` is added during the data transformation step, replacing the field `abstract_inverted_index`.