Find, download, and classify company Sustainability & ESG reports - automatically.
⭐️ If this project helps you, please give it a star! Stars help others discover it.
Report Scout is a two-stage Python pipeline that finds, downloads, and classifies a company's sustainability and ESG-related reports (PDFs).
It uses ScaleSERP for discovery and OpenAI for text classification (relevance, report type, year), saving clean artifacts under outputs/.
TL;DR: Provide a CSV of companies → run one command → get a folder of vetted sustainability reports per company + CSV summaries.
- 🌐 Multilingual search with per-country Google domains (automatic native-language ↔ English fallback).
- 🔎 Stage 1: Discover & score `pdf_linkN` / `page_linkN` candidates per company.
- 📥 Stage 2: Download PDFs, extract text, classify via LLM, and save only acceptable reports.
- 🧠 Heuristic scoring to prefer official company sites & recent years.
- 🧹 Clean CSV artifacts for links, results, failed downloads, and type mismatches.
- 🗂️ Structured outputs (`outputs/pdfs`, `outputs/html`, temp dirs) with collision-safe names.
```text
report_fetcher/
  __init__.py
  config.py            # global settings (paths, API keys, report types, limits)
  search.py            # Stage 1: ScaleSERP discovery + multilingual scoring
  pipeline_stage1.py   # Stage 1 orchestration and link table
  pipeline_stage2.py   # Stage 2 orchestration, fallbacks, result tables
  fetch.py             # robust PDF download + HTML fallback parsing
  pdf_utils.py         # PDF text extraction + save/classify gate
  classify.py          # OpenAI classification helpers
  utils.py             # link scoring & selection utilities
run_pipeline.py        # CLI entry point
```
- Python 3.10+ recommended.
- Install dependencies (example with `uv` or `pip`):

  ```bash
  # using uv
  uv pip install -r requirements.txt

  # or pip
  pip install -r requirements.txt
  ```

  If you don't keep a `requirements.txt` yet, you'll likely need:
  `pandas tqdm requests beautifulsoup4 PyMuPDF python-dotenv openai`.
- Set environment variables (or a `.env` file next to `run_pipeline.py`):

  ```bash
  OPENAI_API_KEY=sk-...
  SCALESERP_API_KEY=...
  ```

  The CLI also lets you pass `--openai-api-key` and `--scaleserp-api-key` directly.
Prepare a CSV with two columns: Company and Country (column names are configurable). Example:
```csv
Company,Country
ABB,Switzerland
Telefónica,Spain
```
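The column names default to `Company` and `Country` but can be changed in `config.py` (`COMPANY_NAME_COLUMN`, `COUNTRY_COLUMN`). Below is a hypothetical sketch of how those settings might be applied when reading the CSV; the repo's actual loading code may differ:

```python
# Hypothetical illustration of the configurable column names in use.
# COMPANY_NAME_COLUMN / COUNTRY_COLUMN mirror the config.py settings documented
# in this README; the real loading code may differ.
import pandas as pd

COMPANY_NAME_COLUMN = "Company"   # change if your CSV uses different headers
COUNTRY_COLUMN = "Country"

companies = pd.read_csv("companies.csv")
for _, row in companies.iterrows():
    print(f"Queued: {row[COMPANY_NAME_COLUMN]} ({row[COUNTRY_COLUMN]})")
```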
Run the pipeline:
```bash
python run_pipeline.py --input-csv companies.csv -vv
```

- `-v` → INFO logs
- `-vv` → DEBUG logs
Outputs land in `outputs/` (configurable), including:

- `outputs/stage1_links.csv`
- `outputs/stage2_results.csv`
- `outputs/stage2_failed_downloads.csv`
- `outputs/stage2_type_mismatch.csv`
- downloaded PDFs under `outputs/pdfs/`
- For each company, runs two ScaleSERP queries:
  - PDF results (`filetype:pdf`)
  - Non-PDF page results (`-filetype:pdf`)
- Ranks results with a heuristic that favors:
  - company-domain matches, recent years in snippets/URLs, accepted report-type keywords/synonyms, clean URLs, and known good path segments (see the scoring sketch after this list).
- Produces a table with `pdf_link1..N` and `page_link1..N` per company.
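The actual weighting lives in `report_fetcher/utils.py`; the sketch below only illustrates the kind of heuristic described above, with made-up weights and keyword lists:

```python
# Illustrative sketch only: a link-scoring heuristic in the spirit described
# above (company-domain match, recent years, report-type keywords, clean URLs,
# known good path segments). Weights and keyword lists are invented examples.
from urllib.parse import urlparse

ACCEPTED_KEYWORDS = ("sustainability", "esg", "annual-report", "csr", "integrated")
GOOD_PATH_SEGMENTS = ("/investors", "/sustainability", "/reports")

def score_link(url: str, snippet: str, company_domain: str, min_year: int = 2023) -> int:
    score = 0
    parsed = urlparse(url)
    host, path = parsed.netloc.lower(), parsed.path.lower()

    if company_domain and company_domain.lower() in host:
        score += 5                                    # official company site
    if any(str(y) in url or str(y) in snippet for y in range(min_year, min_year + 3)):
        score += 3                                    # recent year in URL/snippet
    if any(k in path or k in snippet.lower() for k in ACCEPTED_KEYWORDS):
        score += 2                                    # accepted report-type keywords
    if any(seg in path for seg in GOOD_PATH_SEGMENTS):
        score += 1                                    # known good path segment
    if "?" not in url and "#" not in url:
        score += 1                                    # clean URL
    return score
```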
- Visits PDF links first; if none are saved, it can process page links:
  - Filters for likely official pages (domain match), scrapes them for deeper PDF links, and caps results per domain and globally.
- Downloads PDFs (with realistic headers and a fallback HTML parse if a viewer page is returned).
- Extracts text (first ~10–15 pages) and asks an LLM for:
  - `Relevance` (Yes/No), `CorrectCompany` (Yes/No), `ReportType` (raw label), `Year` (latest 4-digit year).
- Keeps the file only if:
  - `Relevance=Yes` and `CorrectCompany=Yes`,
  - the `Year` meets the configured minimum (e.g., ≥ 2023),
  - the `ReportType` maps to an allowed canonical type (no cross-upgrades; strict contains-match).
- Files are saved as `Company_CanonicalType_Year[_#].pdf` (collision-safe); a sketch of the gate and naming follows below.
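A minimal sketch of the acceptance gate and collision-safe naming, assuming the field names and defaults listed in this README; the real logic in `report_fetcher/pdf_utils.py` and `classify.py` may differ in detail:

```python
# Minimal sketch of the Stage 2 acceptance gate and collision-safe naming
# described above. Field names mirror the LLM outputs listed in this README;
# treat this as an illustration, not the repo's implementation.
import os
import re

ACCEPTABLE_REPORT_TYPES = ["Annual", "Integrated", "CSR", "ESG", "Impact", "CDP"]
MIN_ACCEPTABLE_REPORT_YEAR = 2023

def canonical_type(raw_label: str) -> str | None:
    """Strict contains-match against accepted canonical types (no cross-upgrades)."""
    for t in ACCEPTABLE_REPORT_TYPES:
        if t.lower() in raw_label.lower():
            return t
    return None

def gate(company: str, relevance: str, correct_company: str,
         report_type_raw: str, year_raw: str, out_dir: str = "outputs/pdfs") -> str | None:
    """Return a collision-safe target path if the report passes all gates, else None."""
    if relevance != "Yes" or correct_company != "Yes":
        return None
    years = [int(y) for y in re.findall(r"\b(?:19|20)\d{2}\b", year_raw)]
    if not years or max(years) < MIN_ACCEPTABLE_REPORT_YEAR:
        return None                                     # fails the year gate
    ctype = canonical_type(report_type_raw)
    if ctype is None:
        return None                                     # would land in the type-mismatch CSV
    base, n = f"{company}_{ctype}_{max(years)}", 2
    path = os.path.join(out_dir, f"{base}.pdf")
    while os.path.exists(path):                          # collision-safe: _2, _3, ...
        path = os.path.join(out_dir, f"{base}_{n}.pdf")
        n += 1
    return path
```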
All knobs live in `report_fetcher/config.py`. Common options:

- Paths: `BASE_DIR` (default `outputs/`), `PDF_DIR`, `HTML_DIR`, `TEMP_PDF_DIR`, `CONFIRMED_PDF_DIR`
- API & model: `OPENAI_API_KEY`, `SCALESERP_API_KEY`, `OPENAI_MODEL` (default `gpt-5-mini`)
- Report types: `ACCEPTABLE_REPORT_TYPES` (default: `["Annual", "Integrated", "CSR", "ESG", "Impact", "CDP"]`)
- Search (Stage 1): `MAX_PDF_RESULTS`, `MAX_PAGE_RESULTS`, `SEARCH_LANGUAGE_PRIORITY = "native" | "english"`, `STAGE1_SCORING_ENABLED = True`, `COMPANY_NAME_COLUMN`, `COUNTRY_COLUMN`
- Year gate: `MIN_ACCEPTABLE_REPORT_YEAR` (e.g., `2023`)
- Stage 2 behavior: `STAGE2_PAGE_PROCESSING_CONFIG = "process_if_no_pdfs" | "process_all" | "skip"`, plus per-domain and overall caps for pages and PDFs
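For orientation, here is an illustrative excerpt of the kinds of settings `config.py` exposes. The names are taken from the list above, but the values shown are examples rather than guaranteed defaults:

```python
# Illustrative excerpt only; values are examples, not the repo's shipped defaults.
from pathlib import Path

# Paths
BASE_DIR = Path("outputs")
PDF_DIR = BASE_DIR / "pdfs"
HTML_DIR = BASE_DIR / "html"

# API & model
OPENAI_MODEL = "gpt-5-mini"

# Report types accepted by the Stage 2 gate
ACCEPTABLE_REPORT_TYPES = ["Annual", "Integrated", "CSR", "ESG", "Impact", "CDP"]

# Search (Stage 1)
MAX_PDF_RESULTS = 5                      # example value
MAX_PAGE_RESULTS = 5                     # example value
SEARCH_LANGUAGE_PRIORITY = "native"      # or "english"
STAGE1_SCORING_ENABLED = True
COMPANY_NAME_COLUMN = "Company"
COUNTRY_COLUMN = "Country"

# Year gate
MIN_ACCEPTABLE_REPORT_YEAR = 2023

# Stage 2 behavior
STAGE2_PAGE_PROCESSING_CONFIG = "process_if_no_pdfs"   # or "process_all" / "skip"
```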
You can also pass API keys via CLI. See CLI below.
```bash
python run_pipeline.py \
  --input-csv companies.csv \
  --openai-api-key $OPENAI_API_KEY \
  --scaleserp-api-key $SCALESERP_API_KEY \
  -vv
```

If `--input-csv` is omitted, a tiny built-in sample is used. The script writes cleaned CSVs (single `\n` line endings, trimmed strings, no blank “every second row” artifacts on Windows).
- `stage1_links.csv` — one row per company, plus columns `pdf_link1..MAX_PDF_RESULTS` and `page_link1..MAX_PAGE_RESULTS`.
- `stage2_results.csv` — one row per company with `Company`, `Country`, and `PDF files` (a comma-separated list of saved filenames).
- `stage2_failed_downloads.csv` — rows for any download that returned HTML, failed, or was unreachable. Columns: `Company`, `Country`, `Failed_Link`, `Reason`.
- `stage2_type_mismatch.csv` — reports judged relevant but with a non-accepted report type. Columns: `Company`, `Country`, `PDF_Link`, `AI_Report_Type`, `AI_Year_Raw`, `Final_Year_Used`, `Reason`.

Downloaded PDFs live in `outputs/pdfs/`.
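A quick way to inspect the artifacts after a run (assumes `pandas` is installed, the default output paths, and the column names described above):

```python
# Quick inspection of the Stage 2 artifacts after a run. Column names follow the
# descriptions in this README; adjust paths if you changed BASE_DIR.
import pandas as pd

results = pd.read_csv("outputs/stage2_results.csv")
failed = pd.read_csv("outputs/stage2_failed_downloads.csv")

print(results[["Company", "Country", "PDF files"]].head())
print(f"{len(failed)} failed downloads; see the Reason column for details.")
```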
- Logging: use `-v` / `-vv` to increase verbosity.
- `.env` support: a `.env` file is loaded automatically if `python-dotenv` is installed.
- HTTP strategy: realistic headers, byte-range requests, viewer-page parsing, and simple origin warm-ups (a simplified download sketch follows below).
- LLM prompts: strict one-line output for easy parsing; only the first pages are read, for speed.
- Safety rails: file-type checks, year gates, strict report-type canonicalization, domain filtering.
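A rough, simplified sketch of that download strategy: browser-like headers, a PDF magic-byte check, and an HTML fallback that looks for a PDF link inside viewer pages. The real `fetch.py` is more thorough (byte-range requests, origin warm-ups); treat this as an illustration only:

```python
# Simplified illustration of the download approach noted above; not the repo's
# actual fetch.py.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept": "application/pdf,text/html;q=0.9,*/*;q=0.8",
}

def fetch_pdf(url: str, timeout: int = 30) -> bytes | None:
    resp = requests.get(url, headers=HEADERS, timeout=timeout, allow_redirects=True)
    if resp.content[:5] == b"%PDF-":
        return resp.content                       # a genuine PDF
    if "text/html" in resp.headers.get("Content-Type", ""):
        # Likely a viewer page: look for an embedded or linked PDF and retry once.
        soup = BeautifulSoup(resp.text, "html.parser")
        for tag in soup.find_all(["a", "embed", "iframe"]):
            href = (tag.get("href") or tag.get("src") or "").strip()
            if href.lower().endswith(".pdf"):
                inner = requests.get(urljoin(url, href), headers=HEADERS, timeout=timeout)
                if inner.content[:5] == b"%PDF-":
                    return inner.content
    return None
```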
- “OpenAI configuration error” → provide a valid key via env or `--openai-api-key`.
- No PDFs found → try `STAGE2_PAGE_PROCESSING_CONFIG = "process_all"` or switch `SEARCH_LANGUAGE_PRIORITY`.
- Type mismatch (relevant but not in the accepted list) → expand `ACCEPTABLE_REPORT_TYPES` if that is intended.
- Corrupt/HTML “PDFs” → the downloader will attempt to parse viewer pages; some sites may still require manual handling.
Issues and PRs welcome! Please include reproducible examples (company + country), console logs (-vv), and your config.py deltas.
Start with our good first issues or help wanted.
MIT License — see the LICENSE file for details.
Copyright (c) 2025 Arboretica B.V.
- Developed by Markas Nausėda, 2025
- Copyright ownership: Arboretica
- ScaleSERP for search results.
- OpenAI for text classification.
- PyMuPDF for fast PDF text extraction.

