This repository contains various scripts and configuration for converting MARC bibliographic records into RDF, for use at the National Library of Finland.
The main component is a conversion pipeline driven by a Makefile that defines rules for realizing the conversion steps using command line tools.
The steps of the conversion are:
- Start with a file of MARC records in Aleph sequential format
- Split the file into smaller batches
- Preprocess using unix tools such as grep and sed, to remove some local peculiarities
- Convert to MARCXML and enrich the MARC records, using Catmandu
- Run the Library of Congress marc2bibframe2 XSLT conversion from MARC to BIBFRAME RDF
- Convert the BIBFRAME RDF/XML data into N-Triples format and fix up some bad URIs
- Calculate work keys (e.g. author+title combination) used later for merging data about the same creative work
- Convert the BIBFRAME data into Schema.org RDF in N-Triples format
- Reconcile entities in the Schema.org data against external sources (e.g. YSA/YSO, Corporate names authority, RDA vocabularies)
- Merge the Schema.org data about the same works
- Calculate agent keys used for merging data about the same agent (person or organization)
- Merge the agents based on agent keys
- Convert the raw Schema.org data to HDT format so the full data set can be queried with SPARQL from the command line
- Consolidate the data by e.g. rewriting URIs and moving subjects into the original work
- Convert the consolidated data to HDT
- ??? (TBD)
- Profit!
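The first steps above (splitting and preprocessing) can be sketched with standard Unix tools. This is only a minimal illustration, not the actual Makefile rules of this repository: the function names, the batch size, the output file naming and the dropped field tag 999 are all hypothetical. It also assumes that every line of the Aleph sequential input starts with the system number of the record it belongs to, followed by a space.

```shell
# Sketch only: names, batch size and the dropped tag are illustrative assumptions.

# split_batches INPUT PREFIX RECORDS_PER_BATCH
# Writes PREFIX-0000.alephseq, PREFIX-0001.alephseq, ... and keeps all lines
# of a record (same first whitespace-separated field) in the same batch.
split_batches() {
    awk -v prefix="$2" -v size="$3" '
        NR == 1 || $1 != last {           # first line of a new record
            if (count % size == 0) {      # time to start a new batch file
                if (out != "") close(out)
                out = sprintf("%s-%04d.alephseq", prefix, count / size)
            }
            count++
            last = $1
        }
        { print > out }
    ' "$1"
}

# preprocess_batch INPUT
# Drops lines of a (hypothetical) local field 999 and strips trailing whitespace.
preprocess_batch() {
    grep -v '^[0-9]* 999' "$1" | sed 's/[[:space:]]*$//'
}
```

For instance, `split_batches input.alephseq batch 10000` followed by `preprocess_batch batch-0000.alephseq` would print the cleaned first batch; splitting on record boundaries rather than on a fixed line count avoids cutting a record in half.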
Command line tools are assumed to be available in $PATH, but the paths can be overridden on the make command line, e.g. make CATMANDU=/opt/catmandu
- Apache Jena command line utilities sparql and rsparql
- Catmandu utility catmandu
- uconv utility from Ubuntu package icu-devtools
- xsltproc utility from Ubuntu package xsltproc
- hdt-cpp command line utilities rdf2hdt and hdtSearch
- hdt-java command line utility hdtsparql.sh
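Before running the pipeline it can be useful to check which of these tools are actually found in $PATH. The following is a minimal sketch; the function name missing_tools is hypothetical and not part of this repository, and the tool names are simply those listed above.

```shell
# Sketch only: prints the name of each listed tool that is not found in $PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Tool names taken from the list above; no output means everything was found.
missing_tools sparql rsparql catmandu uconv xsltproc rdf2hdt hdtSearch hdtsparql.sh
```

Any tool reported as missing can either be installed or pointed to explicitly on the make command line, as in the CATMANDU example.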
In addition to the above:
- bats in $PATH
- xmllint utility from Ubuntu package libxml2-utils in $PATH