
A data engineering pipeline that ingests public cyber threat indicators from ThreatFox, AbuseIPDB, and PhishTank, enriches IOCs via IPinfo and VirusTotal, transforms them with dbt, and simulates real-world lifecycle management for security intel.


pduebel/threat-intel-pipeline


πŸ›‘οΈ Threat Intelligence Pipeline (Work In Progress)

An evolving Kafka-based pipeline for ingesting and enriching cyber threat indicators, orchestrated using Airflow and containerized infrastructure. This project is being developed to explore scalable threat intel workflows and showcase hands-on data engineering capabilities.


⚠️ Work In Progress

This repository is currently under active development. Some components are outlined but not yet functional, and full setup instructions will be added as implementation proceeds. The repo serves as both a technical playground and a conceptual showcase of modern pipeline architecture.


🧱 Planned Architecture Overview

This project is structured around a modular flow designed to simulate real-world threat ingestion and analysis.

1. Ingestion

  • ✅ ThreatFox: API-based feed of fresh IPs, domains, and hashes linked to active malware.
  • 🔜 AbuseIPDB: Recently reported malicious IPs.
  • 🔜 PhishTank (via OTX Pulse): Confirmed phishing URLs across industry targets.
  • ⏱️ Scheduled Python scripts and Airflow DAGs (every 10–30 mins) to simulate a near-real-time feed.
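
Since each feed returns indicators in its own shape, an ingestion script would likely flatten them into one common record before passing them downstream. The following is a minimal sketch of that step for a ThreatFox-style payload. The function name `normalize_threatfox`, the output record layout, and the `IOC_TYPE_MAP` mapping are illustrative assumptions, not part of this repository, and the input field names (`query_status`, `data`, `ioc`, `ioc_type`) are assumed from the public ThreatFox API and should be checked against its documentation.

```python
from datetime import datetime, timezone

# Assumed mapping from ThreatFox-style ioc_type values to pipeline-level types.
IOC_TYPE_MAP = {
    "ip:port": "ip",
    "domain": "domain",
    "url": "url",
    "md5_hash": "hash",
    "sha1_hash": "hash",
    "sha256_hash": "hash",
}


def normalize_threatfox(payload, ingested_at=None):
    """Flatten a ThreatFox-style response dict into uniform IOC records."""
    # Treat anything other than an explicit "ok" status as an empty batch.
    if payload.get("query_status") != "ok":
        return []
    ingested_at = ingested_at or datetime.now(timezone.utc).isoformat()

    records = []
    for row in payload.get("data") or []:
        raw_type = row.get("ioc_type", "")
        value = row.get("ioc", "")
        port = None
        if raw_type == "ip:port":
            # Split "address:port" so the address can be joined on later.
            host, sep, maybe_port = value.rpartition(":")
            if sep:
                value, port = host, maybe_port
        records.append(
            {
                "source": "threatfox",
                "ioc_type": IOC_TYPE_MAP.get(raw_type, "unknown"),
                "value": value,
                "port": port,
                "malware": row.get("malware"),
                "first_seen": row.get("first_seen"),
                "ingested_at": ingested_at,
            }
        )
    return records
```

A scheduled script or Airflow task could call this on each polled response and forward the resulting records. Because it performs no network I/O, it can be unit-tested with recorded payloads.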

2. Enrichment

  • 🔜 IPinfo.io: Geo-tagging and ASN data for IP indicators.
  • ⏳ (Optional) VirusTotal / URLScan.io: Enrichment metadata and detection scores (within API limits).
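
Because enrichment APIs are rate limited, one plausible approach is to look up each distinct IP only once per run and reuse the result. The sketch below illustrates that idea with an injected `lookup` callable, so no real API client is assumed. The names `enrich_ips` and `split_org` are hypothetical, and the `country` and `org` fields (with `org` formatted like "AS64500 Example Net") are assumptions about what an IPinfo-style response might contain.

```python
import re

# Matches strings such as "AS64500 Example Net".
ASN_PATTERN = re.compile(r"^(AS\d+)\s+(.*)$")


def split_org(org):
    """Split an 'AS<number> <name>' string into (asn, name)."""
    match = ASN_PATTERN.match(org or "")
    if not match:
        return None, org or None
    return match.group(1), match.group(2)


def enrich_ips(records, lookup, cache=None):
    """Attach country and ASN fields to IP records.

    `lookup` is any callable taking an IP string and returning a dict
    (or None); in a real pipeline it would wrap the enrichment API.
    """
    cache = {} if cache is None else cache
    enriched = []
    for record in records:
        if record.get("ioc_type") != "ip":
            # Non-IP indicators pass through unchanged.
            enriched.append(dict(record))
            continue
        ip = record["value"]
        if ip not in cache:
            cache[ip] = lookup(ip)  # one external call per distinct IP
        info = cache[ip] or {}
        asn, as_name = split_org(info.get("org"))
        enriched.append(
            {**record, "country": info.get("country"), "asn": asn, "as_name": as_name}
        )
    return enriched
```

Passing a shared `cache` dict across runs would reduce calls further, at the cost of serving stale geo data until the cache is cleared.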

3. Transformation (dbt)

  • 🔜 Normalize into structured models:
    • stg_raw_iocs: base extraction
    • dim_ips, dim_domains, dim_hashes: dimension tables
    • fct_threat_events: enriched and timestamped threat data
  • 📊 Build insights such as:
    • IOC distribution by country, type, source
    • Recurring IPs and campaign freshness timelines
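
The actual models would be dbt SQL files, but the shape of the staging-to-dimension-and-fact split can be previewed with an in-memory SQLite database. This is only a sketch: the column names and the three sample rows (using documentation-reserved addresses) are made up for illustration, and the real models may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Staging table with a few made-up sample rows.
    CREATE TABLE stg_raw_iocs (
        value TEXT, ioc_type TEXT, source TEXT, country TEXT, first_seen TEXT
    );
    INSERT INTO stg_raw_iocs VALUES
        ('203.0.113.7', 'ip',     'threatfox', 'NL', '2024-01-01T00:00:00Z'),
        ('203.0.113.7', 'ip',     'abuseipdb', 'NL', '2024-01-02T00:00:00Z'),
        ('bad.example', 'domain', 'threatfox', NULL, '2024-01-02T00:00:00Z');

    -- Dimension: one row per distinct IP.
    CREATE TABLE dim_ips AS
        SELECT DISTINCT value AS ip, country
        FROM stg_raw_iocs
        WHERE ioc_type = 'ip';

    -- Fact: one row per observed indicator event.
    CREATE TABLE fct_threat_events AS
        SELECT value AS indicator, ioc_type, source, first_seen AS event_ts
        FROM stg_raw_iocs;
    """
)

# IOC distribution by source and type.
by_source = conn.execute(
    "SELECT source, ioc_type, COUNT(*) FROM fct_threat_events "
    "GROUP BY source, ioc_type ORDER BY source, ioc_type"
).fetchall()

# Recurring indicators: those seen in more than one event.
recurring = conn.execute(
    "SELECT indicator, COUNT(*) FROM fct_threat_events "
    "GROUP BY indicator HAVING COUNT(*) > 1"
).fetchall()

dim_ip_rows = conn.execute("SELECT ip, country FROM dim_ips").fetchall()
```

In dbt, each `CREATE TABLE ... AS SELECT` would typically become its own model file containing only the `SELECT`, with dependencies between models expressed through `ref()`.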

4. Output & Analysis (Optional)

  • 🔜 Jupyter / Streamlit App:
    • Visual threat timelines
    • IP heatmaps
    • Top malicious infrastructures by ASN or registrar
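
As a rough idea of what could sit behind a "top infrastructures by ASN" view, the sketch below counts events per ASN. The function name `top_asns` and the `asn` field on each event are assumptions carried over for illustration. A notebook or Streamlit app would more likely query the fact table directly.

```python
from collections import Counter


def top_asns(events, n=5):
    """Return the n most frequent ASNs as (asn, event_count) pairs."""
    # Events without an ASN (for example domains or hashes) are skipped.
    counts = Counter(event["asn"] for event in events if event.get("asn"))
    return counts.most_common(n)
```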
