Welcome to the Impresso Make-Based Offline (NLP) Processing Cookbook! This repository provides a comprehensive guide and set of tools for processing newspaper content. The build system leverages Makefiles to orchestrate complex workflows, ensuring efficient and scalable data processing. By utilizing S3 for data storage and local stamp files for tracking progress, this system supports distributed processing across multiple machines without conflicts.
- Build System Structure
- Uploading to impresso S3 bucket
- Processing Workflow Overview
- Setup Guide
- Makefile Targets
- Usage Examples
- Contributing
- License
A minimal package containing the Python code shared by most processing pipelines in the cookbook can be installed with:
```bash
# install via pip
python3 -m pip install git+https://github.com/impresso/impresso-make-cookbook.git@main#subdirectory=lib
```

```toml
# or add the following to your Pipfile
impresso-cookbook = {git = "https://github.com/impresso/impresso-make-cookbook.git", ref = "main", subdirectory = "lib"}
```
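To verify the installation, you can try importing the library; note that the module name `impresso_cookbook` is an assumption derived from the distribution name and may differ:

```bash
# Optional sanity check; the module name impresso_cookbook is an assumption
# derived from the distribution name and may differ.
python3 -c "import impresso_cookbook; print(impresso_cookbook.__file__)"
```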
The build system is organized into several make include files:
- `config.local.mk`: Local configuration overrides (not in the repository)
- `config.mk`: Main configuration file with default settings
- `cookbook/make_settings.mk`: Core make settings and shell configuration
- `cookbook/log.mk`: Logging utilities with configurable log levels
- `cookbook/setup.mk`: General setup targets and directory management
- `cookbook/sync.mk`: Data synchronization between S3 and local storage
- `cookbook/clean.mk`: Cleanup targets for build artifacts
- `cookbook/processing.mk`: Processing configuration and behavior settings
- `cookbook/main_targets.mk`: Core processing targets and parallelization
- `cookbook/newspaper_list.mk`: Newspaper list management and S3 discovery
- `cookbook/local_to_s3.mk`: Path conversion utilities between local and S3
- `cookbook/aws.mk`: AWS CLI configuration and testing
- `cookbook/paths_*.mk`: Path definitions for different processing stages
  - `paths_canonical.mk`: Canonical newspaper content paths
  - `paths_rebuilt.mk`: Rebuilt newspaper content paths
  - `paths_lingproc.mk`: Linguistic processing paths
  - `paths_ocrqa.mk`: OCR quality assessment paths
  - `paths_langident.mk`: Language identification paths
  - `paths_topics.mk`: Topic modeling paths
  - `paths_bboxqa.mk`: Bounding box quality assessment paths
- `cookbook/processing_*.mk`: Processing targets for different NLP tasks
  - `processing_lingproc.mk`: Linguistic processing (POS tagging, NER)
  - `processing_ocrqa.mk`: OCR quality assessment
  - `processing_langident.mk`: Language identification
  - `processing_topics.mk`: Topic modeling with Mallet
  - `processing_bboxqa.mk`: Bounding box quality assessment
- `cookbook/sync_*.mk`: Data synchronization for different processing stages
  - `sync_canonical.mk`: Canonical content synchronization
  - `sync_rebuilt.mk`: Rebuilt content synchronization
  - `sync_lingproc.mk`: Linguistic processing data sync
  - `sync_ocrqa.mk`: OCR QA data synchronization
  - `sync_langident.mk`: Language identification data sync
  - `sync_topics.mk`: Topic modeling data synchronization
  - `sync_bboxqa.mk`: Bounding box QA data synchronization
- `cookbook/setup_*.mk`: Setup targets for different processing environments
  - `setup_python.mk`: Python environment setup
  - `setup_lingproc.mk`: Linguistic processing environment
  - `setup_ocrqa.mk`: OCR quality assessment setup
  - `setup_topics.mk`: Topic modeling environment setup
  - `setup_aws.mk`: AWS CLI setup and configuration
- `cookbook/aggregators_*.mk`: Data aggregation targets
  - `aggregators_langident.mk`: Language identification statistics
  - `aggregators_bboxqa.mk`: Bounding box QA statistics
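A pipeline Makefile is typically composed from these includes. The following is a minimal sketch; the selection and order shown here are assumptions, so consult an actual pipeline Makefile for the authoritative composition:

```make
# Minimal sketch of a pipeline Makefile composed from cookbook includes
# (illustrative selection and order, not taken from a real pipeline).
-include config.local.mk                 # local overrides first, so ?= defaults below do not win
include config.mk                        # default settings
include cookbook/make_settings.mk        # core make/shell settings
include cookbook/log.mk                  # logging helpers
include cookbook/setup.mk                # setup targets and directories
include cookbook/paths_rebuilt.mk        # input path definitions
include cookbook/paths_lingproc.mk       # output path definitions
include cookbook/sync_rebuilt.mk         # input synchronization
include cookbook/sync_lingproc.mk        # output synchronization
include cookbook/processing_lingproc.mk  # processing targets
include cookbook/main_targets.mk         # newspaper/collection entry points
```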
Ensure that the environment variables `SE_ACCESS_KEY` and `SE_SECRET_KEY` for access to the S3 impresso infrastructure are set, e.g., by setting them in a local `.env` file.
The build process uploads the processed data to the impresso S3 bucket.
This overview explains the impresso linguistic preprocessing pipeline, focusing on efficient data processing, distributed scalability, and minimizing interference between machines.
All input and output data reside on S3, allowing multiple machines to access shared data without conflicts. Processing directly from S3 reduces the need for local storage.
Local stamp files mirror S3 metadata, enabling machines to independently track and manage processing tasks without downloading full datasets. This prevents interference between machines, as builds are verified against S3 before processing starts, ensuring no overwrites or duplicate results.
The Makefile orchestrates the pipeline by defining independent targets and dependencies based on stamp files. Each machine maintains its local state, ensuring stateless and conflict-free builds.
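The following is an illustrative sketch of the stamp-file pattern, not the cookbook's actual rules: the stamp name, bucket variables, and `PROCESS_CMD` placeholder are assumptions for illustration.

```make
# Illustrative stamp-file rule (not the cookbook's actual code). The stamp
# records that the corresponding output already exists on S3, so every
# machine that sees it skips the expensive processing step.
# PROCESS_CMD is a hypothetical placeholder for the real processing script.
$(BUILD_DIR)/%.stamp:
	mkdir -p $(dir $@)
	@if aws s3 ls "s3://$(S3_BUCKET_LINGPROC)/$*" > /dev/null 2>&1; then \
	    echo "output already exists on S3, not reprocessing $*"; \
	else \
	    $(PROCESS_CMD) --input "s3://$(S3_BUCKET_REBUILT)/$*" \
	                   --output "s3://$(S3_BUCKET_LINGPROC)/$*"; \
	fi
	touch $@
```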
Processing scripts operate independently, handling data in a randomized order. Inputs are read from S3, outputs are uploaded back to S3, and no synchronization is required between machines. Additional machines can join or leave without disrupting ongoing tasks.
Processed files are validated locally and uploaded to S3 with integrity checks (e.g., JSON schema validation and md5sum). Results are never overwritten, ensuring consistency even with concurrent processing.
By leveraging S3 and stamp files, machines with limited storage (e.g., 100GB) can process large datasets efficiently without downloading entire files.
- Local Parallelization: Each machine uses Make's parallel build feature to maximize CPU utilization.
- Distributed Parallelization: Machines process separate subsets of data independently (e.g., by newspaper or date range) and write results to S3 without coordination.
- Stateless Processing: Scripts rely only on S3 and local configurations, avoiding shared state.
- Custom Configurations: Each machine uses local configuration files or environment variables to tailor processing behavior.
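As a concrete illustration of the distributed strategy, two machines can split the workload by pointing the same target at different newspaper subsets (the subset file names are illustrative; `NEWSPAPERS_TO_PROCESS_FILE` appears in the usage examples below):

```bash
# Each machine processes its own subset and uploads to S3 independently,
# so no coordination between machines is required.
echo "gazette-de-lausanne" > subset-a.txt   # illustrative subset files
echo "journal-de-geneve" > subset-b.txt

make collection NEWSPAPERS_TO_PROCESS_FILE=subset-a.txt   # on machine A
make collection NEWSPAPERS_TO_PROCESS_FILE=subset-b.txt   # on machine B
```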
- Python 3.11
- AWS CLI
- Git
- Make, `remake`
- Additional tools: `git-lfs`, `coreutils`, `parallel`
| Case | Recipe? | Our Comment Terminology | GNU Make Terminology |
|---|---|---|---|
| User-configurable variable (`?=`) | ❌ | USER-VARIABLE | "Recursive Variable (User-Overridable)" |
| Internal computed variable (`:=`) | ❌ | VARIABLE | "Simply Expanded Variable" |
| Transformation function (`define … endef`) | ❌ | FUNCTION | "Multiline Variable (Make Function)" |
| Target without a recipe (`.PHONY`) | ❌ | TARGET | "Phony Target (Dependency-Only Target)" |
| Target with a recipe that creates a file | ✅ | FILE-RULE | "File Target (Explicit Rule)" |
| Target that creates a timestamp file | ✅ | STAMPED-FILE-RULE | "File Target (Explicit Rule with Timestamp Purpose)" |
| Double-colon target with no recipe (`::`) | ❌ | DOUBLE-COLON-TARGET | "Double-Colon Target (Dependency-Only Target)" |
| Double-colon target with a recipe (`::`) | ✅ | DOUBLE-COLON-TARGET-RULE | "Double-Colon Target (Explicit Rule)" |
- Recursive Variable (User-Overridable) → Defined using `?=`, allowing users to override it.
- Simply Expanded Variable → Defined using `:=`, evaluated only once.
- Multiline Variable (Make Function) → A `define … endef` construct that acts as a function or script snippet.
- Phony Target (Dependency-Only Target) → A `.PHONY` target that does not create an actual file.
- File Target (Explicit Rule) → A normal rule that produces a file.
- File Target (Explicit Rule with Timestamp Purpose) → A special case of an explicit rule where the file primarily serves as a timestamp.
- Double-Colon Target (Dependency-Only Target) → A dependency-only target using `::`, allowing multiple independent rules.
- Double-Colon Target (Explicit Rule) → A `::` target that executes independently from others of the same name.
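The following toy snippet illustrates each construct from the table; it is a standalone example, not code taken from the cookbook:

```make
# USER-VARIABLE: user-overridable default (e.g., make BUILD_DIR=test.d)
BUILD_DIR ?= build.d

# VARIABLE: computed once, at definition time
NPROC := $(shell nproc)

# FUNCTION: multiline variable invoked via $(call ...)
define log.info
echo "INFO: $(1)"
endef

# TARGET: phony target that never creates a file
.PHONY: all
all: $(BUILD_DIR)/output.txt

# FILE-RULE: creates the file it is named after
$(BUILD_DIR)/output.txt:
	mkdir -p $(dir $@)
	echo "done" > $@

# STAMPED-FILE-RULE: the file exists only to record that work happened
$(BUILD_DIR)/sync.stamp:
	$(call log.info,syncing data)
	touch $@

# DOUBLE-COLON-TARGET-RULE: independent rules sharing one target name
clean::
	rm -rf $(BUILD_DIR)

clean::
	rm -f *.log
```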
1. Clone the repository:

   ```bash
   git clone https://github.com/impresso/impresso-make-cookbook.git
   cd impresso-make-cookbook
   ```

2. Set up environment variables: create a `.env` file in the project root:

   ```
   SE_ACCESS_KEY=your_access_key
   SE_SECRET_KEY=your_secret_key
   SE_HOST_URL=https://os.zhdk.cloud.switch.ch/
   ```

3. Install system dependencies:

   ```bash
   # On Ubuntu/Debian
   sudo apt-get install -y make git-lfs parallel coreutils openjdk-17-jre-headless
   # On macOS
   brew install make git-lfs parallel coreutils openjdk@17
   ```

4. Set up the Python environment:

   ```bash
   make setup-python-env  # installs Python 3.11, pip, and pipenv
   ```

5. Install Python dependencies:

   ```bash
   pipenv install
   # or
   python3 -m pip install -r requirements.txt
   ```

6. Configure the AWS CLI:

   ```bash
   make create-aws-config
   make test-aws
   ```

7. Run the initial setup:

   ```bash
   make setup
   ```
The cookbook provides several categories of makefile targets:
- `make help`: Display all available targets with descriptions
- `make setup`: Initialize environment and create necessary directories
- `make newspaper`: Process a single newspaper (uses the `NEWSPAPER` variable)
- `make collection`: Process multiple newspapers in parallel
- `make all`: Complete processing pipeline with fresh data sync
The build system automatically detects CPU cores and configures parallel processing:
- `NPROC`: Automatically detected number of CPU cores
- `PARALLEL_JOBS`: Maximum parallel jobs (defaults to `NPROC`)
- `COLLECTION_JOBS`: Number of parallel newspaper collections (defaults to `NPROC/2`)
- `NEWSPAPER_JOBS`: Jobs per newspaper (defaults to `PARALLEL_JOBS/COLLECTION_JOBS`)
- `MAX_LOAD`: Maximum system load average for job scheduling
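These variables can be overridden per invocation, for example:

```bash
# Limit parallelism on a shared machine (values are illustrative)
make collection COLLECTION_JOBS=2 NEWSPAPER_JOBS=2 MAX_LOAD=4
```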
- `make langident-target`: Run language identification pipeline
- `make impresso-lid-stage1a-target`: Initial language classification
- `make impresso-lid-stage1b-target`: Collect language statistics
- `make impresso-lid-stage2-target`: Final language decisions with ensemble
- `make lingproc-target`: Run linguistic processing (POS tagging, NER)
- `make check-spacy-pipelines`: Validate spaCy model installations
- `make ocrqa-target`: Run OCR quality assessment
- `make check-python-installation-hf`: Test HuggingFace Hub setup
- `make topics-target`: Run topic modeling with Mallet
- `make check-python-installation`: Test Java/JPype setup for Mallet
- `make bboxqa-target`: Run bounding box quality assessment
- `make sync`: Synchronize both input and output data with S3
- `make sync-input`: Download input data from S3
- `make sync-output`: Upload output data to S3
- `make resync`: Force complete resynchronization
- `make resync-input`: Force input data resynchronization
- `make resync-output`: Force output data resynchronization
- `make clean-build`: Remove entire build directory
- `make clean-sync-input`: Remove synchronized input data
- `make clean-sync-output`: Remove synchronized output data
- `make clean-sync`: Remove all synchronized data
- `make setup-python-env`: Install Python, pip, and pipenv
- `make create-aws-config`: Generate AWS configuration from `.env`
- `make test-aws`: Test AWS S3 connectivity
- `make newspaper-list-target`: Generate list of newspapers to process
- `make update-pip-requirements-file`: Update `requirements.txt` from `Pipfile`
- `make aggregate`: Generate aggregated statistics
- `make aggregate-pagestats`: Aggregate page-level statistics
- `make aggregate-iiif-errors`: Aggregate IIIF error statistics
- `make test-LocalToS3`: Test path conversion utilities
- `make check-parallel`: Verify GNU parallel installation
- `make test_debug_level`: Test logging configuration at different levels
```bash
# Process a single newspaper
make newspaper NEWSPAPER=gazette-de-lausanne

# Process with custom parallel settings
make newspaper NEWSPAPER=journal-de-geneve PARALLEL_JOBS=4

# Process a specific processing stage
make lingproc-target NEWSPAPER=actionfem
```

```bash
# Process multiple newspapers using the collection target
make collection

# Process with custom job limits
make collection COLLECTION_JOBS=4 MAX_LOAD=8

# Process with specific newspaper sorting
make collection NEWSPAPER_YEAR_SORTING=cat   # chronological order
make collection NEWSPAPER_YEAR_SORTING=shuf  # random order

# Process using GNU parallel with custom settings
make collection COLLECTION_JOBS=6 NEWSPAPER_JOBS=2
```

```bash
# Sync specific dataset types
make sync-input-rebuilt NEWSPAPER=gazette-de-lausanne
make sync-output-lingproc NEWSPAPER=actionfem

# Force resync with fresh data
make resync NEWSPAPER=journal-de-geneve

# Clean up specific processing outputs
make clean-sync-lingproc
make clean-sync-output
```

```bash
# Set up complete environment
make setup-python-env
make create-aws-config
make setup

# Test environment components
make test-aws
make check-spacy-pipelines
make check-python-installation
```

```bash
# Configure custom paths
make newspaper S3_BUCKET_CANONICAL=12-canonical-test BUILD_DIR=test.d

# Language identification with custom models
make langident-target \
    LANGIDENT_IMPPRESSO_FASTTEXT_MODEL_OPTION=models/custom-lid.bin \
    LANGIDENT_STAGE1A_MINIMAL_TEXT_LENGTH_OPTION=150

# OCR quality assessment with specific languages
make ocrqa-target \
    OCRQA_LANGUAGES_OPTION="de fr en" \
    OCRQA_MIN_SUBTOKENS_OPTION="--min-subtokens 5"

# Topic modeling with custom Mallet seed
make topics-target \
    MALLET_RANDOM_SEED=123 \
    MODEL_VERSION_TOPICS=v3.0.0

# Linguistic processing with validation
make lingproc-target \
    LINGPROC_VALIDATE_OPTION=--validate \
    LOGGING_LEVEL=DEBUG
```

```bash
# Enable debug logging
make newspaper LOGGING_LEVEL=DEBUG

# Process with dry-run mode (no S3 uploads)
make lingproc-target PROCESSING_S3_OUTPUT_DRY_RUN=--s3-output-dry-run

# Monitor processing status
make status                        # if implemented
make logs TARGET=lingproc-target   # if implemented

# Test specific components
make test-LocalToS3
make test_debug_level
```

```bash
# Full production run with optimal settings
make all \
    COLLECTION_JOBS=8 \
    MAX_LOAD=12 \
    NEWSPAPER_YEAR_SORTING=shuf \
    LOGGING_LEVEL=INFO

# Process specific newspaper subset
echo "gazette-de-lausanne journal-de-geneve" > newspapers.txt
make collection NEWSPAPERS_TO_PROCESS_FILE=newspapers.txt
```
The cookbook uses several environment variables for configuration:
- `SE_ACCESS_KEY`: S3 access key for authentication
- `SE_SECRET_KEY`: S3 secret key for authentication
- `SE_HOST_URL`: S3 endpoint URL (defaults to `https://os.zhdk.cloud.switch.ch/`)
The cookbook includes a sophisticated logging system with multiple levels:
- `LOGGING_LEVEL`: Set to `DEBUG`, `INFO`, `WARNING`, or `ERROR`
- Debug logging provides detailed information about variable values and processing steps
- All makefiles use consistent logging functions: `log.debug`, `log.info`, `log.warning`, `log.error`
```bash
# Enable debug logging for detailed output
make newspaper LOGGING_LEVEL=DEBUG

# Set to WARNING to reduce output
make collection LOGGING_LEVEL=WARNING
```
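Inside the makefiles themselves, a recipe might invoke these helpers roughly as sketched below; the exact `$(call ...)` convention is an assumption, and the authoritative definitions live in `cookbook/log.mk`:

```make
# Hypothetical recipe using the cookbook's logging helpers; the $(call ...)
# convention shown here is an assumption, see cookbook/log.mk.
sync-input:
	$(call log.info,Starting input sync for $(NEWSPAPER))
	$(call log.debug,BUILD_DIR is set to $(BUILD_DIR))
```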
Key user-configurable variables (can be overridden):
- `PARALLEL_JOBS`: Maximum parallel jobs (auto-detected from CPU cores)
- `COLLECTION_JOBS`: Number of parallel newspaper collections
- `NEWSPAPER_JOBS`: Jobs per newspaper processing
- `MAX_LOAD`: Maximum system load average for job scheduling
- `PROCESSING_S3_OUTPUT_DRY_RUN`: Set to `--s3-output-dry-run` to prevent S3 uploads
- `PROCESSING_KEEP_TIMESTAMP_ONLY_OPTION`: Keep only timestamp files after S3 upload
- `PROCESSING_QUIT_IF_S3_OUTPUT_EXISTS_OPTION`: Skip processing if output exists on S3
- `NEWSPAPER`: Target newspaper to process
- `NEWSPAPER_YEAR_SORTING`: Sort order (`shuf` for random, `cat` for chronological)
- `BUILD_DIR`: Local build directory (defaults to `build.d`)
- `LANGIDENT_LID_SYSTEMS_OPTION`: LID systems to use (e.g., `langid impresso_ft wp_ft`)
- `LANGIDENT_STAGE1A_MINIMAL_TEXT_LENGTH_OPTION`: Minimum text length for stage 1a
- `LANGIDENT_BOOST_FACTOR_OPTION`: Boost factor for language scoring
- `OCRQA_LANGUAGES_OPTION`: Languages for OCR QA (e.g., `de fr`)
- `OCRQA_BLOOMFILTERS_OPTION`: Bloom filter files for OCR assessment
- `OCRQA_MIN_SUBTOKENS_OPTION`: Minimum subtokens for processing
- `MALLET_RANDOM_SEED`: Random seed for Mallet topic modeling
- `MODEL_VERSION_TOPICS`: Version identifier for topic models
- `LANG_TOPICS`: Language specification for topic models
The cookbook uses a sophisticated path management system:
- Input paths: `paths_canonical.mk`, `paths_rebuilt.mk`
- Output paths: `paths_lingproc.mk`, `paths_ocrqa.mk`, `paths_topics.mk`, etc.
- Automatic conversion between local and S3 paths via the `LocalToS3` function
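As a rough sketch of what such a conversion might look like (the actual `LocalToS3` definition lives in `cookbook/local_to_s3.mk` and may differ):

```make
# Hypothetical LocalToS3-style helper: strip the local build prefix and
# re-root the path under s3://. Not the cookbook's actual definition.
define LocalToS3
$(patsubst $(BUILD_DIR)/%,s3://%,$(1))
endef

# $(call LocalToS3,build.d/22-rebuilt-final/actionfem/file.jsonl.bz2)
#   -> s3://22-rebuilt-final/actionfem/file.jsonl.bz2
```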
Different processing stages use different S3 buckets:
- `S3_BUCKET_CANONICAL`: Canonical newspaper content (e.g., `12-canonical-final`)
- `S3_BUCKET_REBUILT`: Rebuilt newspaper data (e.g., `22-rebuilt-final`)
- `S3_BUCKET_LINGPROC`: Linguistic processing outputs (e.g., `40-processed-data-sandbox`)
- `S3_BUCKET_TOPICS`: Topic modeling results (e.g., `41-processed-data-staging`)
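These bucket variables can be overridden per invocation to target a test or staging bucket, for example:

```bash
# Route linguistic processing output to the sandbox bucket (example value
# from the list above; adjust to your environment)
make lingproc-target S3_BUCKET_LINGPROC=40-processed-data-sandbox
```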
Impresso - Media Monitoring of the Past is an interdisciplinary research project that aims to develop and consolidate tools for processing and exploring large collections of media archives across modalities, time, languages and national borders. The first project (2017-2021) was funded by the Swiss National Science Foundation under grant No. CRSII5_173719 and the second project (2023-2027) by the SNSF under grant No. CRSII5_213585 and the Luxembourg National Research Fund under grant No. 17498891.
Copyright (C) 2024 The Impresso team.
This program is provided as open source under the GNU Affero General Public License v3 or later.