A data pipeline that collects trending-video data from YouTube, processes it through Kafka and Spark, stores it in Azure Synapse Analytics, and enables analysis through Looker Studio.
The pipeline follows these steps:
- Fetches trending-video data from the YouTube API
- Streams data through Kafka for real-time processing
- Processes and transforms data using Spark Streaming
- Stores results in Azure Synapse Analytics
- Prepares formatted data for Looker Studio visualization
Technology stack:
- Data Collection: YouTube Data API v3
- Streaming: Apache Kafka
- Processing: Apache Spark
- Storage: Azure Synapse Analytics
- Containerization: Docker, Docker Compose
- Visualization: Looker Studio
- Language: Python 3.9+
To get a YouTube API key:
- Go to the Google Cloud Console
- Create a new project or select an existing one
- Enable the YouTube Data API v3:
  - Go to APIs & Services > Library
  - Search for "YouTube Data API v3"
  - Click Enable
- Create credentials:
  - Go to APIs & Services > Credentials
  - Click Create Credentials > API Key
  - Copy the API key
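To confirm the key works before wiring up the pipeline, a one-off check with google-api-python-client (a sketch; the region code is an assumption, adjust it to your target market):

```python
from googleapiclient.discovery import build  # pip install google-api-python-client

# Build a YouTube Data API v3 client with the key you just created
youtube = build("youtube", "v3", developerKey="your_youtube_api_key")

# Request the single most popular video to verify the key is valid
response = youtube.videos().list(
    chart="mostPopular",
    part="snippet,statistics",
    regionCode="US",  # assumption: pick your target region
    maxResults=1,
).execute()

print(response["items"][0]["snippet"]["title"])
```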
To set up Azure Synapse:
- Create an Azure account if you don't have one
- Create a Synapse workspace:
  - Go to the Azure Portal
  - Search for "Synapse Analytics"
  - Create a new workspace
- Get the storage account details:
  - In your workspace, go to Storage Settings
  - Copy the storage account URL
  - Create a container or use the default one
  - Note down the container name
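With those details in hand, a quick access check (a sketch assuming the azure-identity and azure-storage-blob packages and that you are signed in, e.g. via `az login`; the account URL below is a hypothetical placeholder):

```python
from azure.identity import DefaultAzureCredential  # pip install azure-identity
from azure.storage.blob import BlobServiceClient   # pip install azure-storage-blob

# Account URL copied from the workspace's storage settings (hypothetical value)
service = BlobServiceClient(
    account_url="https://<your-account>.blob.core.windows.net",
    credential=DefaultAzureCredential(),
)

# Listing blobs in the container confirms the URL, name, and permissions
container = service.get_container_client("your_container_name")
for blob in container.list_blobs():
    print(blob.name)
```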
- Clone the repository and install the dependencies:
  ```bash
  pip install -r requirements.txt
  ```
- Create a `.env` file with the required credentials:
  ```env
  YOUTUBE_API_KEY=your_youtube_api_key
  KAFKA_BOOTSTRAP_SERVERS=localhost:29092
  KAFKA_TOPIC=youtube_trending
  SYNAPSE_STORAGE_ACCOUNT_URL=your_synapse_url
  SYNAPSE_CONTAINER_NAME=your_container_name
  ```
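How the scripts pick these values up is an implementation detail, but a typical pattern is python-dotenv (an assumption, shown here as a sketch):

```python
import os

from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads .env from the working directory

api_key = os.getenv("YOUTUBE_API_KEY")
bootstrap = os.getenv("KAFKA_BOOTSTRAP_SERVERS", "localhost:29092")
topic = os.getenv("KAFKA_TOPIC", "youtube_trending")
```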
- Start Kafka:
  ```bash
  docker-compose up -d
  ```
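Before starting the producer, it's worth confirming the broker containers came up:

```bash
docker-compose ps
```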
- Start the Kafka producer to fetch YouTube data:
  ```bash
  python src/ingestion/kafka_producer.py
  ```
  This will fetch trending videos every 15 minutes and send them to Kafka.
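In outline, the producer is a fetch-and-publish loop. A minimal sketch assuming kafka-python (the actual implementation lives in `src/ingestion/kafka_producer.py`):

```python
import json
import os
import time

from googleapiclient.discovery import build
from kafka import KafkaProducer  # pip install kafka-python

# Build the API client and producer from the .env configuration
youtube = build("youtube", "v3", developerKey=os.environ["YOUTUBE_API_KEY"])
producer = KafkaProducer(
    bootstrap_servers=os.environ.get("KAFKA_BOOTSTRAP_SERVERS", "localhost:29092"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
topic = os.environ.get("KAFKA_TOPIC", "youtube_trending")

while True:
    # Fetch the current trending list (50 is the API's per-page maximum)
    response = youtube.videos().list(
        chart="mostPopular",
        part="snippet,statistics,contentDetails",
        regionCode="US",  # assumption: pick your target region
        maxResults=50,
    ).execute()

    # One Kafka message per video
    for item in response["items"]:
        producer.send(topic, item)
    producer.flush()

    time.sleep(15 * 60)  # 15-minute cadence, per the description above
```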
- Start the Spark streaming processor:
  ```bash
  python src/processing/spark_streaming.py
  ```
  This processes the data and prepares it for storage.
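Conceptually, the processor reads the Kafka topic with Spark Structured Streaming and flattens the JSON into the columns described below. A minimal sketch (assuming the spark-sql-kafka connector package is on the classpath; the real job is in `src/processing/spark_streaming.py` and writes to Synapse rather than the console):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.appName("youtube-trending").getOrCreate()

# Illustrative schema covering a few of the fields described below;
# the real job extracts the full set (statistics, contentDetails, ...)
schema = StructType([
    StructField("id", StringType()),
    StructField("snippet", StructType([
        StructField("title", StringType()),
        StructField("channelTitle", StringType()),
    ])),
])

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:29092")
    .option("subscribe", "youtube_trending")
    .load()
)

videos = (
    raw.select(from_json(col("value").cast("string"), schema).alias("v"))
    .select(
        col("v.id").alias("video_id"),
        col("v.snippet.title").alias("title"),
        col("v.snippet.channelTitle").alias("channel_title"),
    )
)

# The real job writes to Synapse; console output keeps the sketch self-contained
videos.writeStream.format("console").start().awaitTermination()
```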
- Check the data in Synapse:
  ```bash
  python src/monitoring/synapse_data_checker.py
  ```
  This verifies data quality and schema compliance in Azure Synapse.
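The checks themselves live in `src/monitoring/synapse_data_checker.py`; as a sketch of the idea, assuming the processed data lands as Parquet files in the container (the account URL is a hypothetical placeholder, and pandas needs pyarrow to read Parquet):

```python
import io

import pandas as pd  # reading Parquet also requires pyarrow
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient(
    account_url="https://<your-account>.blob.core.windows.net",  # hypothetical URL
    credential=DefaultAzureCredential(),
)
container = service.get_container_client("your_container_name")

# A subset of the expected schema (see the field list below)
expected = {"video_id", "title", "view_count", "like_count", "comment_count"}

for blob in container.list_blobs():
    if not blob.name.endswith(".parquet"):
        continue
    df = pd.read_parquet(io.BytesIO(container.download_blob(blob.name).readall()))
    missing = expected - set(df.columns)
    print(f"{blob.name}: {len(df)} rows, missing columns: {missing or 'none'}")
```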
- Prepare the data for Looker:
  ```bash
  python src/visualization/looker_data.py
  ```
  This formats and prepares the data for Looker Studio visualization.
The pipeline processes and stores two main data types:
Data about each trending video:
- video_id: Unique video identifier
- title: Video title
- channel_title: Channel name
- publish_time: When the video was published
- fetch_time: When the data was collected
- processing_time: When the data was processed
- view_count: Total views
- like_count: Total likes
- comment_count: Total comments
- duration_seconds: Video duration in seconds
- category_id: YouTube category ID
Tags associated with each video:
- video_id: Video identifier
- tag: Individual tag text
After preparing the data using `looker_data.py`, you'll find two CSV files in the `data/looker` directory:
- `video_metrics.csv`: Contains all video metrics
- `video_tags.csv`: Contains video tags data
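Conceptually, producing these two files splits each video record into a metrics row and one row per tag. A sketch with pandas (the input row is purely illustrative; the real logic is in `src/visualization/looker_data.py`):

```python
import os

import pandas as pd

# Illustrative input: one row per trending video, tags as a list-valued column
videos = pd.DataFrame([
    {"video_id": "abc123", "title": "Example", "view_count": 1000,
     "like_count": 50, "tags": ["music", "live"]},  # hypothetical row
])

os.makedirs("data/looker", exist_ok=True)

# Metrics table: everything except the list-valued tags column
videos.drop(columns=["tags"]).to_csv("data/looker/video_metrics.csv", index=False)

# Tags table: one (video_id, tag) row per tag
tags = (
    videos[["video_id", "tags"]]
    .explode("tags")
    .rename(columns={"tags": "tag"})
    .dropna()
)
tags.to_csv("data/looker/video_tags.csv", index=False)
```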
To create visualizations:
- Go to Looker Studio
- Click "Create" and select "New Report"
- Click "Create new data source" and select "File Upload"
- First upload `video_metrics.csv`:
  - Set date fields: publish_time, fetch_time, processing_time
  - Ensure numeric fields (view_count, like_count, etc.) are set as numbers
  - Click "Connect"
- Add `video_tags.csv` as an additional data source:
  - Click "Add data" in the Resources menu
  - Upload `video_tags.csv`
  - Ensure both video_id and tag are set as text fields
  - Click "Connect"
- Create a blend:
  - Click "Add a Blend"
  - Select both data sources
  - Join using "video_id" as the join key
  - Click "Save"
- You can now create various visualizations:
  - Trending videos by view count
  - Tag clouds for popular topics
  - Channel performance metrics
  - Time-based trends
  - Engagement rate analysis (see the calculated-field sketch below)
  - Category distribution
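For the engagement rate analysis, one option is a calculated field in Looker Studio over the blended data. Assuming the metric fields from `video_metrics.csv`, something like:

```
(like_count + comment_count) / view_count
```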
An example of the finished report can be viewed here: https://lookerstudio.google.com/reporting/1a5ffb1c-70f4-4123-8ecd-079740a55d3b/page/YEiME/edit
Prerequisites:
- Python 3.9+
- Docker and Docker Compose
- YouTube Data API key
- Azure account with Synapse Analytics access