chgov-brprotokolle-server

Context

The chgov-brprotokolle project revolves around managing, retrieving and displaying historic minutes of the Federal Council, based on the IIIF standard; the live setup can be experienced on the Federal Archives' site. The project is split across four dedicated repositories. This repository, chgov-brprotokolle-server, is the backend for the ingestion of minutes and the interface for SOLR search requests. It was developed in TypeScript and is based on the archival-iiif-server. The other repositories are the publicly accessible frontend (chgov-brprotokolle-frontend), a frontend utility that enables proper OCR display in Mirador (chgov-brprotokolle-mirador-ocr-helper) and the documentation (chgov-brprotokolle-markdown). The frontend is written in React and the frontend utility in plain JavaScript.

Architecture and components

The backend server has two major tasks: ingestion and search routing. The latter is more or less passed directly to the corresponding SOLR instance; its objective is to provide an interface for search queries. The former, outlined below, stores data in the SOLR instance and creates IIIF representations of the minutes.
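To illustrate the search-routing side, the sketch below shows a minimal pass-through handler that forwards a query to SOLR's select handler and returns the response unchanged. It is a rough sketch under stated assumptions: the Koa/@koa/router setup, the /search route and the brprotokolle core name are illustrative choices, not the actual configuration of this server.

```typescript
// Minimal sketch of a search pass-through to SOLR (route and core name are assumptions).
import Koa from 'koa';
import Router from '@koa/router';

const SOLR_BASE = process.env.SOLR_URL ?? 'http://localhost:8983/solr';
const SOLR_CORE = 'brprotokolle'; // assumption: the real core name may differ

const router = new Router();

// Forward the client's query string to SOLR and return the SOLR response body.
router.get('/search', async ctx => {
  const params = new URLSearchParams({
    q: String(ctx.query.q ?? '*:*'),
    wt: 'json',
  });
  const res = await fetch(`${SOLR_BASE}/${SOLR_CORE}/select?${params}`);
  ctx.status = res.status;
  ctx.body = await res.json();
});

new Koa().use(router.routes()).listen(3000);
```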

Pipeline Ingestion

(Pipeline overview diagram)

The ingestion pipeline handles either handwritten minutes (e.g. with OCR provided by the Transkribus project) or machine-written minutes (e.g. as PDF files, with no OCR provided), enriches the minutes with the provided metadata and ultimately stores the relevant information in a SOLR instance. To start the ingestion, files in the appropriate format have to be added to the HOTFOLDER, where the dirWatcher picks them up. Then, depending on the type of minutes, the collectionBuilder prepares handwritten minutes for further processing. Machine-written minutes are ingested as single PDFs, so before further processing the images have to be extracted (imgExtractor) and OCR is subsequently extracted from the images (ocrExtractor). At this point the images, OCR data and metadata are available and there is no longer a distinction between machine-written and handwritten minutes. The OCR data is compiled into a single text file, the OCR plaintext, which is stored together with the images and OCR data under the DATAFOLDER directory. The metadata and the known locations of the images, OCR data and OCR plaintext are used to generate the IIIF manifests (manifestCreate). These manifests are delivered by an external webserver and are not otherwise part of the backend project. The pipeline is built so that the solrAdd step finalises the ingestion and adds the relevant information to the SOLR instance.
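To make the branching at the start of the pipeline concrete, the following sketch shows how a hotfolder watcher could dispatch new arrivals to either the machine-written or the handwritten branch. This is illustrative only: the use of chokidar, the environment variable and the queueTask helper are assumptions, while the step names follow the description above.

```typescript
// Illustrative sketch of the hotfolder dispatch (library choice and
// helper names are assumptions, not the server's actual implementation).
import chokidar from 'chokidar';
import path from 'path';

const HOTFOLDER = process.env.HOTFOLDER ?? '/data/hotfolder';

// Hypothetical queue helper: enqueue a named pipeline step for a file.
function queueTask(step: string, file: string): void {
  console.log(`queued ${step} for ${file}`);
}

chokidar.watch(HOTFOLDER, { ignoreInitial: false }).on('add', file => {
  if (path.extname(file).toLowerCase() === '.pdf') {
    // Machine-written minutes: extract images first, then OCR,
    // before manifestCreate and solrAdd.
    queueTask('imgExtractor', file);
  } else {
    // Handwritten minutes arrive with Transkribus OCR and go
    // straight to the collectionBuilder.
    queueTask('collectionBuilder', file);
  }
});
```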

First steps

Preparations

Before setting up the backend server, a running SOLR instance, prepared with the appropriate schema and plugin, is required.
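As a quick sanity check, SOLR's standard ping handler can be queried before starting the server. The sketch below assumes Node 18+ (global fetch); the SOLR URL and the core name brprotokolle are placeholders to be replaced with the values of your instance.

```typescript
// Quick check that the SOLR core is reachable (URL and core name are placeholders).
const SOLR_URL = process.env.SOLR_URL ?? 'http://localhost:8983/solr';
const CORE = 'brprotokolle';

async function checkSolr(): Promise<void> {
  const res = await fetch(`${SOLR_URL}/${CORE}/admin/ping?wt=json`);
  if (!res.ok) {
    throw new Error(`SOLR ping failed with HTTP ${res.status}`);
  }
  const body = await res.json() as { status?: string };
  console.log(`SOLR ping status: ${body.status ?? 'unknown'}`);
}

checkSolr().catch(err => {
  console.error(err);
  process.exit(1);
});
```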

Install

Installation of the development environment is done by calling npm install, as this is a Node.js project.

Customization

General

Custom elements for the pipeline can be added as described in the archival-iiif-server documentation; a purely illustrative sketch follows below.
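The shape shown here is not the archival-iiif-server API; it is only a hypothetical sketch of what a custom pipeline step might look like (an async function receiving a task payload). The IngestTask interface and the step signature are assumptions; consult the archival-iiif-server documentation for the actual extension points.

```typescript
// Purely illustrative shape of a custom pipeline step; the actual
// interface is defined by the archival-iiif-server documentation.
interface IngestTask {
  collectionId: string;
  data: Record<string, unknown>;
}

// A step receives a task, performs its work and typically hands the
// result on to the next step in the pipeline.
export default async function myCustomStep(task: IngestTask): Promise<void> {
  console.log(`processing collection ${task.collectionId}`);
  // ... custom processing goes here ...
}
```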

Run tests

There are no automated tests available; end-to-end runs have to be checked manually.

Authors

License

GNU Affero General Public License (AGPLv3), see LICENSE

Contribute

This repository is a copy that is updated regularly; therefore, contributions via pull requests are not possible. However, independent copies (forks) are possible under consideration of the MIT license.

Contact

  • For general questions (and technical support), please contact the Swiss Federal Archives by e-mail at [email protected].
  • Technical questions or problems concerning the source code can be posted here on GitHub via the "Issues" interface.
