A comprehensive AWS S3 Tables/Iceberg data loader for processing and analyzing web event data from virtual tours and website interactions, designed to support clickstream analysis and user journey tracking.
This system implements a modern Lambda Architecture pattern with real-time and batch processing capabilities using AWS S3 Tables with Apache Iceberg:
- Ingestion Layer: AWS Kinesis streams for real-time events + S3 for historical data
- Processing Layer: AWS Glue ETL jobs + Lambda functions for data transformation
- Storage Layer: AWS S3 Tables with Iceberg format for ACID transactions and time travel
- Analytics Layer: Amazon Athena with Iceberg support for high-performance queries
- Orchestration: Apache Airflow for workflow management and table maintenance
- Transformation: dbt for data modeling and analytics with Iceberg table support
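For orientation, a minimal producer-side sketch of the ingestion layer, assuming the peek-web-events-stream stream name referenced later in this README and default boto3 credentials; the event fields shown are illustrative only, not the full event schema.

```python
# Minimal ingestion sketch: push one web event onto the Kinesis stream.
# Assumptions: the stream name matches the deployed peek-web-events-stream,
# and the event fields are illustrative, not the production schema.
import json
import uuid
from datetime import datetime, timezone

import boto3

kinesis = boto3.client("kinesis")

event = {
    "event_id": str(uuid.uuid4()),
    "event_type": "space_view",          # hypothetical event type
    "session_id": "session-123",         # hypothetical session identifier
    "occurred_at": datetime.now(timezone.utc).isoformat(),
}

kinesis.put_record(
    StreamName="peek-web-events-stream",
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["session_id"],    # keeps a session's events on one shard, in order
)
```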
├── cdk/                          # AWS CDK infrastructure code
│   ├── lib/
│   │   └── web-events-data-lake-stack.ts
│   ├── app.ts
│   └── package.json
├── glue/                         # Glue ETL job scripts
│   ├── kinesis_processor.py
│   └── s3_processor.py
├── lambda/                       # Lambda function code
│   └── kinesis-processor/
├── airflow/                      # Airflow DAGs
│   └── dags/
│       └── web_events_comprehensive_pipeline.py
├── dbt/                          # dbt models for analytics
│   ├── models/
│   │   ├── staging/
│   │   ├── intermediate/
│   │   └── marts/
│   └── dbt_project.yml
├── sql/                          # SQL schemas and queries
├── monitoring/                   # Monitoring configurations
└── docs/                         # Documentation
- AWS CLI configured with appropriate permissions
- Node.js 18+ (for CDK)
- Python 3.9+ (for Glue/Lambda)
- dbt CLI (for analytics models)
cd cdk
npm install
npm run build
cdk bootstrap
cdk deploy WebEventsDataLakeStack
# Set required Airflow variables
airflow variables set aws_account_id YOUR_AWS_ACCOUNT_ID
airflow variables set source_s3_bucket YOUR_SOURCE_BUCKET_NAME
# Upload DAGs
cp airflow/dags/* $AIRFLOW_HOME/dags/
cd dbt
dbt deps
dbt seed
dbt run --target dev
dbt test
# Create S3 Batch Operations job for Glacier restore
# (MANIFEST_ETAG is the ETag of the inventory manifest.json; retrieve it with
# aws s3api head-object before creating the job)
aws s3control create-job \
--account-id YOUR_ACCOUNT_ID \
--operation '{"S3InitiateRestoreObject":{"ExpirationInDays":7,"GlacierJobTier":"STANDARD"}}' \
--manifest '{"Spec":{"Format":"S3InventoryReport_CSV_20161130"},"Location":{"ObjectArn":"arn:aws:s3:::peek-inventory-bucket/prod-backup-web-event/archived-web-events/LATEST/manifest.json","ETag":"MANIFEST_ETAG"}}' \
--report '{"Enabled":false}' \
--role-arn arn:aws:iam::YOUR_ACCOUNT_ID:role/S3BatchOperationsRole \
--priority 100
# Monitor restore job status
aws s3control describe-job --account-id YOUR_ACCOUNT_ID --job-id JOB_ID
# Wait for Glacier restore completion (3-5 days), then trigger processing
airflow dags trigger web_events_historical_processing \
--conf '{"source_bucket": "prod-backup-web-events"}'
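The same restore-and-trigger sequence can be scripted; below is a minimal boto3 sketch that waits for the Batch Operations job to reach a terminal state before the Airflow DAG is triggered (account and job IDs are the same placeholders used above).

```python
# Poll the S3 Batch Operations restore job until it finishes.
# ACCOUNT_ID and JOB_ID are placeholders; note that individual objects may
# still be restoring after the batch job itself reports Complete.
import time

import boto3

s3control = boto3.client("s3control")

def wait_for_batch_job(account_id: str, job_id: str, poll_seconds: int = 300) -> str:
    while True:
        status = s3control.describe_job(AccountId=account_id, JobId=job_id)["Job"]["Status"]
        if status in ("Complete", "Failed", "Cancelled"):
            return status
        time.sleep(poll_seconds)
```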
- Events → Kinesis Stream - Web events sent to Kinesis Data Streams
- Lambda Processing - Base64 decoding, enrichment, and S3 writing
- Glue Streaming - Continuous processing and data quality checks
- S3 Raw Zone - Partitioned storage by date/hour
- Athena/Redshift - Real-time querying capabilities
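A minimal sketch of the Lambda step in the flow above (Base64 decoding, light enrichment, and a date/hour-partitioned write to the raw zone); the bucket name and key layout are illustrative assumptions, not the deployed configuration.

```python
# Hypothetical Kinesis-triggered Lambda handler: decode records, stamp a
# processing time, and write them to a date/hour-partitioned raw prefix.
import base64
import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")
RAW_BUCKET = "peek-web-events-raw"  # assumption: replace with the deployed raw bucket

def handler(event, context):
    now = datetime.now(timezone.utc)
    lines = []
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        payload["processed_at"] = now.isoformat()   # simple enrichment example
        lines.append(json.dumps(payload))

    key = f"raw/date={now:%Y-%m-%d}/hour={now:%H}/{context.aws_request_id}.jsonl"
    s3.put_object(Bucket=RAW_BUCKET, Key=key, Body="\n".join(lines).encode("utf-8"))
    return {"records_written": len(lines)}
```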
- S3 Inventory Analysis - Daily inventory manifest identifies Glacier-stored objects
- S3 Batch Operations Restore - Automated Glacier restore using inventory manifest
- Restore Monitoring - 3-5 day restore process with job status tracking
- Glue ETL Jobs - Data parsing, quality scoring, and Iceberg format conversion
- S3 Tables/Iceberg - Final storage in queryable Iceberg format
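The inventory-analysis step above amounts to filtering the daily S3 Inventory data files for objects still in Glacier storage classes; a minimal sketch follows, assuming a CSV inventory with Bucket, Key, Size, and StorageClass columns in that order (adjust to the real inventory configuration).

```python
# Filter a gzipped S3 Inventory CSV data file for Glacier-class objects.
# Column order is an assumption; match it to the configured inventory fields.
import csv
import gzip
import io

import boto3

s3 = boto3.client("s3")

def glacier_objects(inventory_bucket: str, data_file_key: str):
    body = s3.get_object(Bucket=inventory_bucket, Key=data_file_key)["Body"].read()
    with gzip.open(io.BytesIO(body), mode="rt", newline="") as handle:
        for bucket, key, size, storage_class in csv.reader(handle):
            if storage_class in ("GLACIER", "DEEP_ARCHIVE", "GLACIER_IR"):
                yield bucket, key
```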
- Historical S3 Data - 800GB+ of existing web events (mixed storage classes)
- Glue ETL Jobs - Data parsing, quality scoring, and enrichment
- S3 Curated Zone - Clean, analysis-ready data
- dbt Transformations - Business logic and analytics models
- S3 Analytics Zone - Aggregated metrics and insights
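The Glue-to-Iceberg hand-off in the flows above boils down to an Iceberg-aware Spark write; a condensed sketch, assuming a Glue Spark job with the Iceberg connector enabled and an existing table in the Glue Data Catalog (catalog, database, table, and paths below are placeholders, and the deployed jobs may instead target the S3 Tables catalog).

```python
# Condensed Glue/Spark sketch: read raw JSON events and append them to an
# existing Iceberg table as an ACID transaction. Names and paths are
# placeholders; real Glue jobs usually receive these as job parameters.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .config("spark.sql.catalog.glue_catalog.warehouse", "s3://peek-web-events-curated/warehouse/")
    .getOrCreate()
)

events = (
    spark.read.json("s3://peek-web-events-raw/raw/")
    .withColumn("event_date", F.to_date("occurred_at"))
)

events.writeTo("glue_catalog.analytics.web_events").append()
```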
- Automated Quality Scoring - Events scored 0-1 based on completeness
- Bot Detection - User agent analysis and behavioral patterns
- Schema Validation - Ensures data consistency across pipeline
- Data Lineage - Full traceability from source to analytics
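A simplified sketch of the scoring and bot-detection logic described above; the required fields and user-agent markers are illustrative, not the production rules.

```python
# Illustrative quality scoring: the score is the fraction of expected fields
# that are present and non-empty, and obvious bots are flagged from the user
# agent. Field list and markers are examples only.
REQUIRED_FIELDS = ["event_id", "event_type", "session_id", "occurred_at", "user_agent"]
BOT_MARKERS = ("bot", "crawler", "spider", "headless")

def quality_score(event: dict) -> float:
    present = sum(1 for field in REQUIRED_FIELDS if event.get(field) not in (None, ""))
    return present / len(REQUIRED_FIELDS)

def is_probable_bot(event: dict) -> bool:
    user_agent = (event.get("user_agent") or "").lower()
    return any(marker in user_agent for marker in BOT_MARKERS)
```

Events falling below the 0.8 score threshold used for alerting later in this README would typically be quarantined rather than promoted to the curated zone.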
- Space Interaction Tracking - Detailed engagement with virtual spaces
- User Journey Mapping - Complete clickstream flow analysis
- Conversion Funnel Analysis - Multi-step user progression tracking
- Device & Geographic Analysis - Cross-platform user behavior
- Partitioned Storage - Date-based partitioning for query performance
- Columnar Format - Parquet with Snappy compression
- Query Optimization - Predicate pushdown and partition pruning
- Auto-scaling - Dynamic resource allocation based on load
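The storage and query optimizations above translate into a write path roughly like the following sketch (paths are placeholders); partitioning by date is what enables partition pruning, and Snappy-compressed Parquet is the columnar format noted above.

```python
# Write events as date-partitioned, Snappy-compressed Parquet so that queries
# filtering on event_date only scan matching partitions. Paths are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

events = (
    spark.read.json("s3://peek-web-events-raw/raw/")
    .withColumn("event_date", F.to_date("occurred_at"))
)

(
    events.write.mode("append")
    .partitionBy("event_date")
    .option("compression", "snappy")
    .parquet("s3://peek-web-events-curated/web_events/")
)

# A filter on the partition column only touches the matching date prefixes.
recent = (
    spark.read.parquet("s3://peek-web-events-curated/web_events/")
    .where("event_date >= date '2024-01-01'")
)
```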
-- Example: Virtual tour engagement funnel
SELECT
journey_type,
engagement_tier,
COUNT(*) as sessions,
AVG(session_duration_minutes) as avg_duration,
AVG(conversion_rate) as conversion_rate
FROM analytics.user_journey_analysis
WHERE analysis_date >= CURRENT_DATE - 7
GROUP BY journey_type, engagement_tier
ORDER BY sessions DESC;
-- Example: Most engaging virtual spaces
SELECT
space_name,
space_type,
unique_sessions,
avg_time_on_space,
views_per_session
FROM analytics.daily_space_engagement
WHERE analysis_date = CURRENT_DATE - 1
ORDER BY avg_time_on_space DESC
LIMIT 10;
-- Example: Weekly user retention
SELECT
cohort_week,
weeks_since_first_visit,
retention_rate
FROM analytics.daily_cohort_analysis
WHERE analysis_date = CURRENT_DATE - 1
ORDER BY cohort_week, weeks_since_first_visit;
- Real-time Metrics - Kinesis ingestion, Lambda performance
- Data Quality Metrics - Processing success rates, error counts
- Business Metrics - User engagement, conversion rates
- Cost Monitoring - Resource utilization and spend tracking
- Data processing delays > 15 minutes
- Error rates > 1% over 1 hour
- Data quality score drops below 0.8
- Storage costs exceed budget thresholds
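The thresholds above can be enforced as CloudWatch alarms; a minimal boto3 sketch for the quality-score alert, assuming the pipeline publishes a custom data_quality_score metric (the namespace, metric name, and SNS topic ARN are placeholders).

```python
# Alarm when the published data quality score drops below 0.8.
# Namespace, metric name, and SNS topic ARN are placeholder assumptions.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="web-events-data-quality-score-low",
    Namespace="WebEvents",
    MetricName="data_quality_score",
    Statistic="Average",
    Period=3600,
    EvaluationPeriods=1,
    Threshold=0.8,
    ComparisonOperator="LessThanThreshold",
    TreatMissingData="breaching",
    AlarmActions=["arn:aws:sns:us-east-1:YOUR_ACCOUNT_ID:data-quality-alerts"],
)
```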
# Test dbt models locally
dbt run --target dev --models staging
dbt test --target dev
# Validate Glue jobs
python glue/s3_processor.py --local-mode
# Infrastructure changes
cdk diff
cdk deploy
# Data model updates
dbt run --target prod
dbt docs generate
dbt docs serve
# Run comprehensive data quality checks
airflow dags trigger web_events_data_quality_monitoring
# Manual quality validation
python scripts/validate_data_quality.py --date 2024-01-15
- Monthly AWS Costs: ~$6,500
- S3 Storage: $1,500
- Redshift: $3,000
- Glue: $800
- Kinesis: $500
- Other: $700
- S3 Intelligent Tiering - Automatic cost optimization
- Spot Instances - 60-80% savings on EMR workloads
- Reserved Capacity - Redshift reserved instances
- Lifecycle Policies - Automated data archival
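As an example of the lifecycle-policy item above, a rule like the following moves aging raw events into Intelligent-Tiering and then Glacier (bucket, prefix, and day counts are placeholders to align with the actual retention policy).

```python
# Example lifecycle rule for the raw zone: Intelligent-Tiering after 30 days,
# Glacier after 180 days. Bucket, prefix, and day counts are placeholders.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="peek-web-events-raw",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-raw-web-events",
                "Status": "Enabled",
                "Filter": {"Prefix": "raw/"},
                "Transitions": [
                    {"Days": 30, "StorageClass": "INTELLIGENT_TIERING"},
                    {"Days": 180, "StorageClass": "GLACIER"},
                ],
            }
        ]
    },
)
```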
- At Rest: S3 KMS encryption, Redshift cluster encryption
- In Transit: TLS 1.2 for all data transfers
- Key Management: Customer-managed KMS keys
- IAM Policies - Least privilege access
- VPC Endpoints - Private AWS service connectivity
- Data Masking - PII protection in non-production
- Audit Logging - CloudTrail for all API calls
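For the at-rest encryption item, bucket-level defaults can be set as in the sketch below, assuming a customer-managed KMS key (the key ARN and bucket name are placeholders).

```python
# Enforce SSE-KMS default encryption with a customer-managed key.
# Key ARN and bucket name are placeholders.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_encryption(
    Bucket="peek-web-events-raw",
    ServerSideEncryptionConfiguration={
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": "arn:aws:kms:us-east-1:YOUR_ACCOUNT_ID:key/KEY_ID",
                },
                "BucketKeyEnabled": True,
            }
        ]
    },
)
```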
1. Kinesis Processing Delays
# Check Lambda concurrency limits
aws lambda get-function-concurrency --function-name KinesisProcessor
# Monitor Kinesis metrics
aws cloudwatch get-metric-statistics --namespace AWS/Kinesis \
--metric-name IncomingRecords \
--dimensions Name=StreamName,Value=peek-web-events-stream \
--start-time START_TIME --end-time END_TIME --period 300 --statistics Sum
2. Glue Job Failures
# Check job logs
aws logs describe-log-groups --log-group-name-prefix /aws-glue/jobs
# Review job bookmarks
aws glue get-job-bookmark --job-name peek-web-events-s3-processor
3. Data Quality Issues
# Run data quality validation
dbt test --models staging
dbt run-operation check_data_freshness
4. S3 Batch Operations Glacier Restore Issues
# Check batch operations job status
aws s3control list-jobs --account-id YOUR_ACCOUNT_ID --job-statuses Active,Complete,Failed
aws s3control describe-job --account-id YOUR_ACCOUNT_ID --job-id JOB_ID
# Verify restore status of individual objects
aws s3 head-object --bucket prod-backup-web-events --key path/to/file.json
# Check inventory manifest validity
aws s3 ls s3://peek-inventory-bucket/prod-backup-web-event/archived-web-events/ --recursive
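The per-object check can also be scripted: head_object returns a Restore field that shows whether a restore is still in progress or when the restored copy expires (bucket and key below reuse the placeholders from the CLI example).

```python
# Inspect the restore status of one archived object. The Restore field is
# absent until a restore has been requested for the object.
import boto3

s3 = boto3.client("s3")

response = s3.head_object(Bucket="prod-backup-web-events", Key="path/to/file.json")
restore = response.get("Restore")
if restore is None:
    print("No restore requested for this object")
elif 'ongoing-request="true"' in restore:
    print("Restore still in progress")
else:
    print(f"Restored copy available: {restore}")
```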
- System Architecture: Comprehensive architecture overview with data flow diagrams
- Deployment Guide: Step-by-step deployment instructions for all environments
- Operational Runbook: Daily operations, incident response, and maintenance procedures
- Glacier Restore Guide: S3 Batch Operations guide for restoring archived data from Glacier
- Data Dictionary: Complete field definitions, business logic, and transformations
- Test Documentation: Testing strategy, test cases, and coverage reports
- dbt Docs: Available at http://localhost:8080 after running dbt docs serve
- API Reference: Generated from code documentation
- Schema Definitions: See sql/iceberg_schemas.sql for complete table schemas
- S3 Tables/Iceberg Integration: ACID transactions, time travel queries, schema evolution
- Spatial Analytics: 3D virtual tour navigation analysis with room-level insights
- Data Quality Framework: Automated scoring, bot detection, and validation
- Real-time Processing: Kinesis stream processing with sub-second latency
- Advanced Analytics: User journey analysis, conversion funnels, cohort retention
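Time travel, one of the Iceberg features listed above, can be exercised directly from Athena; a minimal boto3 sketch follows, where the analytics.web_events table name, results location, and timestamp are placeholder assumptions.

```python
# Run an Iceberg time-travel query through Athena. Table name, output
# location, and the timestamp literal are placeholder assumptions.
import boto3

athena = boto3.client("athena")

query = """
SELECT COUNT(*) AS events
FROM analytics.web_events
FOR TIMESTAMP AS OF TIMESTAMP '2024-07-01 00:00:00 UTC'
"""

athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://peek-athena-results/"},
)
```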
- Follow the established project structure
- Add comprehensive tests for new features
- Update documentation for any changes
- Follow data quality standards and validation rules
- Test changes in dev environment before production deployment
Project Team: Data Engineering Team
Last Updated: 2024-07-28
Version: 1.0.0