An evolving Kafka-based pipeline for ingesting and enriching cyber threat indicators, orchestrated with Airflow and running on containerized infrastructure. The project explores scalable threat-intelligence workflows and showcases hands-on data engineering.
This repository is currently under active development. Some components are outlined but not yet functional, and full setup instructions will be added as implementation proceeds. The repo serves as both a technical playground and a conceptual showcase of modern pipeline architecture.
This project is structured around a modular flow designed to simulate real-world threat ingestion and analysis.
- ThreatFox: API-based feed of fresh IPs, domains, hashes linked to active malware.
- AbuseIPDB: Recently reported malicious IPs.
- PhishTank (via OTX Pulse): Confirmed phishing URLs across industry targets.
- Scheduled Python scripts and Airflow DAGs (every 10–30 mins) to simulate a near-real-time feed.
- IPinfo.io: Geo-tagging and ASN data for IP indicators.
- (Optional) VirusTotal / URLScan.io: Enrichment metadata and detection scores (within API limits).
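
As a rough illustration of the enrichment step, the sketch below attaches geo and ASN data to IP indicators and leaves other indicator types untouched. It is not the project's implementation: the `Indicator` record, the `enrich_ip` helper, and the injected `lookup` callable are hypothetical names, and `stub_lookup` stands in for a real IPinfo.io request so the example runs offline with made-up values.

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional


@dataclass(frozen=True)
class Indicator:
    """Hypothetical in-memory shape for one indicator of compromise."""
    value: str
    ioc_type: str  # "ip", "domain", or "hash"
    source: str
    country: Optional[str] = None
    asn: Optional[str] = None


# Any callable that maps an IP string to a dict of enrichment fields.
GeoLookup = Callable[[str], dict]


def enrich_ip(indicator: Indicator, lookup: GeoLookup) -> Indicator:
    """Return a copy with country/ASN filled in; non-IP indicators pass through."""
    if indicator.ioc_type != "ip":
        return indicator
    info = lookup(indicator.value)
    return replace(indicator, country=info.get("country"), asn=info.get("asn"))


def stub_lookup(ip: str) -> dict:
    # Stand-in for a real geo/ASN API call; the values are invented.
    return {"country": "NL", "asn": "AS64500"}


enriched = enrich_ip(Indicator("203.0.113.7", "ip", "ThreatFox"), stub_lookup)
untouched = enrich_ip(Indicator("phish.example", "domain", "PhishTank"), stub_lookup)
```

Passing the lookup in as a parameter keeps the API client swappable, which may help when staying within rate limits or testing without network access.
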
- Normalize into structured models:
  - `stg_raw_iocs`: base extraction
  - `dim_ips`, `dim_domains`, `dim_hashes`: dimension tables
  - `fct_threat_events`: enriched and timestamped threat data
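
The staging, dimension, and fact layering above could look roughly like the following, using an in-memory SQLite database. Only the table names come from this README; the column names and sample rows are assumptions for illustration, and enrichment columns are left out to keep the sketch short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Assumed minimal schemas; the real models may carry more columns.
conn.executescript("""
CREATE TABLE stg_raw_iocs (indicator TEXT, ioc_type TEXT, source TEXT, seen_at TEXT);
CREATE TABLE dim_ips (ip TEXT PRIMARY KEY);
CREATE TABLE dim_domains (domain TEXT PRIMARY KEY);
CREATE TABLE dim_hashes (hash_value TEXT PRIMARY KEY);
CREATE TABLE fct_threat_events (indicator TEXT, ioc_type TEXT, source TEXT, seen_at TEXT);
""")

# Invented sample rows: the same IP reported by two feeds, plus one domain.
raw_rows = [
    ("203.0.113.7", "ip", "ThreatFox", "2024-01-01T00:00:00Z"),
    ("203.0.113.7", "ip", "AbuseIPDB", "2024-01-01T00:10:00Z"),
    ("phish.example", "domain", "PhishTank", "2024-01-01T00:20:00Z"),
]
conn.executemany("INSERT INTO stg_raw_iocs VALUES (?, ?, ?, ?)", raw_rows)

# Dimensions keep one row per distinct indicator; the fact table keeps every sighting.
conn.executescript("""
INSERT OR IGNORE INTO dim_ips
    SELECT DISTINCT indicator FROM stg_raw_iocs WHERE ioc_type = 'ip';
INSERT OR IGNORE INTO dim_domains
    SELECT DISTINCT indicator FROM stg_raw_iocs WHERE ioc_type = 'domain';
INSERT OR IGNORE INTO dim_hashes
    SELECT DISTINCT indicator FROM stg_raw_iocs WHERE ioc_type = 'hash';
INSERT INTO fct_threat_events
    SELECT indicator, ioc_type, source, seen_at FROM stg_raw_iocs;
""")


def count_rows(table: str) -> int:
    """Row count for one of the fixed table names defined above."""
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


ip_count = count_rows("dim_ips")
domain_count = count_rows("dim_domains")
hash_count = count_rows("dim_hashes")
event_count = count_rows("fct_threat_events")
```
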
- Build insights such as:
  - IOC distribution by country, type, source
  - Recurring IPs and campaign freshness timelines
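
A minimal sketch of two of these insights, computed over a handful of invented event records. The dictionary keys, the `distribution` and `recurring_ips` helpers, and the `min_sightings` threshold are assumptions for the example rather than part of the project.

```python
from collections import Counter

# Invented events; in the pipeline these would come from fct_threat_events.
events = [
    {"indicator": "203.0.113.7", "ioc_type": "ip", "source": "ThreatFox", "country": "NL"},
    {"indicator": "203.0.113.7", "ioc_type": "ip", "source": "AbuseIPDB", "country": "NL"},
    {"indicator": "198.51.100.9", "ioc_type": "ip", "source": "AbuseIPDB", "country": "US"},
    {"indicator": "phish.example", "ioc_type": "domain", "source": "PhishTank", "country": None},
]


def distribution(rows, key):
    """Count events per value of `key`, skipping rows where it is missing."""
    return Counter(row[key] for row in rows if row.get(key) is not None)


def recurring_ips(rows, min_sightings=2):
    """Return IPs seen at least `min_sightings` times, sorted for stable output."""
    counts = Counter(row["indicator"] for row in rows if row["ioc_type"] == "ip")
    return sorted(ip for ip, seen in counts.items() if seen >= min_sightings)


by_country = distribution(events, "country")
by_type = distribution(events, "ioc_type")
by_source = distribution(events, "source")
repeat_offenders = recurring_ips(events)
```
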
- Jupyter / Streamlit App:
  - Visual threat timelines
  - IP heatmaps
  - Top malicious infrastructures by ASN or registrar