This repository contains my work for the Large Scale Data Processing course at Wroclaw University of Science and Technology. The main goal of the laboratories was to implement a distributed Reddit analysis tool with the given architecture:
- Linux - bash, ssh, scp, tmux, htop, kill, killall, pipe operator, ls, sed, vim, cat
- Docker - Dockerfile, docker-compose, containers in general
- Python - pip, virtualenv, requirements, tox
- Parallel computation in Python
- Celery
- Task queue (RabbitMQ)
- System monitoring (Prometheus, InfluxDB)
- Reddit API usage
- Text embedding (magnitude library)
- Data persistence (MongoDB)
- Data analysis (Redash)
- PySpark
- Linear regression
- Binary classification
- Multi-class classification
- Kubernetes
- K3s
- Helm
- Docker
- Application deployment (AWS EC2)
- Serving
- API (Flask)
- SPA (Streamlit)
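
To give a feel for the "parallel computation in Python" item above, here is a minimal sketch that uses only the standard library. It is not the code from this repository: the course setup relies on Celery with RabbitMQ, while this example swaps in a local thread pool as a stand-in. The sample posts and the `analyze_post` and `analyze_all` functions are hypothetical and exist only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sample data standing in for posts fetched from the Reddit API.
SAMPLE_POSTS = [
    {"id": "a1", "title": "Large scale data processing with Spark"},
    {"id": "b2", "title": "Celery and RabbitMQ basics"},
    {"id": "c3", "title": "Deploying to Kubernetes"},
]


def analyze_post(post):
    """Toy analysis step (assumption): count the words in the post title."""
    return {"id": post["id"], "word_count": len(post["title"].split())}


def analyze_all(posts, max_workers=4):
    """Run analyze_post over all posts using a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in the same order as the inputs.
        return list(pool.map(analyze_post, posts))


if __name__ == "__main__":
    for result in analyze_all(SAMPLE_POSTS):
        print(result)
```

A thread pool suits I/O-bound steps such as API calls. In a distributed setup, a task queue such as Celery plays a similar role, with workers running on separate machines.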