This repository contains instructions for extracting and transforming OpenAlex data for analysis with Google BigQuery.
The following packages are required for this workflow.
OpenAlex snapshots are available through AWS. Instructions for downloading can be found here: https://docs.openalex.org/download-all-data/download-to-your-machine.
$ aws s3 sync 's3://openalex' 'openalex-snapshot' --no-sign-request
To reduce the size of the data stored in BigQuery, a data transformation step is applied to the `works` entity. The transformation is carried out on the high-performance cluster of the GWDG Göttingen, but the script can also be used on other servers with minor adjustments. The entities `authors`, `publishers`, `institutions`, `funders` and `sources` are not affected by the transformation step.
$ sbatch openalex_works_hpc.sh
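The transformation script itself is not reproduced here, but the kind of per-record rewrite it performs can be sketched with `jq` (assuming `jq` is available; the input record below is a hypothetical, heavily abbreviated `works` line):

```shell
# Sketch of the works transformation: drop fields excluded from the schema
# and replace abstract_inverted_index with a boolean has_abstract flag.
echo '{"id":"W1","abstract_inverted_index":{"Deep":[0]},"mesh":[],"concepts":[]}' \
  | jq -c 'del(.mesh, .related_works, .concepts)
           | .has_abstract = (.abstract_inverted_index != null)
           | del(.abstract_inverted_index)'
# → {"id":"W1","has_abstract":true}
```

In the actual workflow this kind of filter would be applied to each gzipped JSON Lines file of the snapshot rather than to a single echoed record.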
Files can be uploaded to a Google Cloud Storage bucket using `gsutil`. Note that only data in the `works` entity has been transformed; all other data can be found in `openalex-snapshot/data`.
$ gsutil -m cp -r /scratch/users/haupka/works gs://bigschol
Use `bq load` to create a table in BigQuery from data stored in a Google Cloud Storage bucket. Schemas for the tables can be found here.
$ bq load --ignore_unknown_values --source_format=NEWLINE_DELIMITED_JSON subugoe-collaborative:openalex.works gs://bigschol/works/*.gz schema_openalex_work.json
- The following fields are not included in the `works` schema: `mesh`, `related_works`, `concepts`.
- An additional field `has_abstract` is added during the data transformation step, replacing the field `abstract_inverted_index`.