Reusable NLP pipelines: identify language, assess OCR quality, model topics, and extract news‑agency entities from any text.

Python Package: impresso-pipelines

Overview

This repository contains a Python package designed for modular and efficient text processing workflows. Currently, it includes the following subpackages:

  • Language Identification Pipeline: Identifies the language of input text and returns a probability score.
  • OCR QA Pipeline: Assesses the quality of OCR text by estimating the proportion of recognized words (0–1), using efficient language-specific Bloom filters.
  • LDA Topic Modeling Pipeline: Soft clustering of input texts using LDA-based topic modeling.
  • News Agencies Pipeline: Extracts and ranks news agency entities from text, providing relevance scores and optional links to Wikidata.
  • Lucene/Solr Normalization Pipeline: Replicates Solr’s language-specific text normalization to clarify how input text is tokenized and indexed in Impresso.
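
To illustrate the idea behind the OCR QA score described above, here is a minimal, self-contained sketch of a Bloom-filter lexicon lookup. It is not the package's implementation: the `BloomFilter` class, the `ocr_quality` function, the filter size, the hash scheme, and the tiny word list are all assumptions made for illustration.

```python
import hashlib
import re


class BloomFilter:
    """Tiny Bloom filter: no false negatives, a small rate of false positives."""

    def __init__(self, size_bits=8192, num_hashes=4):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, word):
        # Derive several bit positions by salting a single hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{word}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, word):
        for pos in self._positions(word):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, word):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(word)
        )


def ocr_quality(text, known_words):
    """Return the share of alphabetic tokens found in the lexicon (0.0 to 1.0)."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in known_words)
    return hits / len(tokens)


# Hypothetical five-word lexicon; a real one would hold a full vocabulary.
lexicon = BloomFilter()
for word in ["the", "quick", "brown", "fox", "jumps"]:
    lexicon.add(word)

score_clean = ocr_quality("The quick brown fox jumps", lexicon)
score_noisy = ocr_quality("Tne quicl brovvn fox jumps", lexicon)
print(score_clean, score_noisy)
```

A Bloom filter stores a large vocabulary in a fixed number of bits, which is why it suits per-language lexicons. The trade-off is that an occasional unknown word can be reported as known, so the score can slightly overestimate quality.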

Installation

To install the full package with all submodules:

pip install "impresso-pipelines[all]"

The [all] extra installs all dependencies required for each component.

To install individual modules without unnecessary dependencies, use:

pip install "impresso-pipelines[langident]"         # Language Identification
pip install "impresso-pipelines[ocrqa]"             # OCR QA
pip install "impresso-pipelines[ldatopics]"         # LDA Topics
pip install "impresso-pipelines[newsagencies]"      # News Agencies
pip install "impresso-pipelines[solrnormalization]" # Solr text normalization
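
After a partial installation, a quick check like the following can show which subpackages are usable. This is a sketch: the helper `check_pipelines` is hypothetical, and it assumes that a subpackage whose dependencies are missing raises `ImportError` when imported.

```python
import importlib

# Subpackage names taken from the extras listed above.
SUBPACKAGES = ["langident", "ocrqa", "ldatopics", "newsagencies", "solrnormalization"]


def check_pipelines(package="impresso_pipelines"):
    """Return {subpackage: bool} telling whether each one imports cleanly."""
    status = {}
    for name in SUBPACKAGES:
        try:
            importlib.import_module(f"{package}.{name}")
            status[name] = True
        except ImportError:
            # Covers a missing package as well as missing optional dependencies.
            status[name] = False
    return status


for name, ok in check_pipelines().items():
    print(f"{name:<18} {'available' if ok else 'not installed'}")
```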

Usage

Each pipeline is instantiated from a corresponding class.

from impresso_pipelines.langident import LangIdentPipeline
from impresso_pipelines.ocrqa import OCRQAPipeline
from impresso_pipelines.ldatopics import LDATopicsPipeline
from impresso_pipelines.newsagencies import NewsAgenciesPipeline
from impresso_pipelines.solrnormalization import SolrNormalizationPipeline
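
As a minimal sketch of what instantiating and running a pipeline might look like, the example below uses the language identification pipeline. It assumes that a pipeline instance can be called directly on a string and returns a dictionary; consult the subpackage README for the exact call signature and return format. The wrapper `identify_language` is a hypothetical helper that returns `None` when the package is not installed.

```python
def identify_language(text):
    """Run language identification, or return None if the package is missing."""
    try:
        from impresso_pipelines.langident import LangIdentPipeline
    except ImportError:
        return None
    pipeline = LangIdentPipeline()
    # Assumption: the pipeline object is callable on a plain string.
    return pipeline(text)


result = identify_language("Ein kleiner Hund namens Max lebte in einem ruhigen Dorf.")
print(result)
```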

Pipeline Examples

For usage examples, refer to the README file of each subpackage.

The repository's interactive notebooks provide further examples.

Future Plans

Additional functionality will be added to extend use cases and support further processing tasks.

About Impresso

Impresso project

Impresso - Media Monitoring of the Past is an interdisciplinary research project that aims to develop and consolidate tools for processing and exploring large collections of media archives across modalities, time, languages and national borders. The first project (2017-2021) was funded by the Swiss National Science Foundation under grant No. CRSII5_173719 and the second project (2023-2027) by the SNSF under grant No. CRSII5_213585 and the Luxembourg National Research Fund under grant No. 17498891.

Copyright

Copyright (C) 2025 The Impresso team.

License

This program is provided as open source under the GNU Affero General Public License v3 or later.
