# 🏢 Digital Twin for Smart Building (Historical Replay System)
This project builds a **Digital Twin** of a smart building by replaying historical IoT sensor data in real time.
It reconstructs the building’s evolving state using **machine learning**, allowing analysis of anomalies, sensor health, and environmental behavior patterns.
---
## 📦 Dataset
The dataset used in this project comes from the **SMART Infrastructure Facility, University of Wollongong (Australia)**.
🔗 **Download link:**
[Smart Building IoT Sensor Dataset (Google Drive)](https://drive.google.com/file/d/1HvaTByQp1sqvPsJDSD9nyn5EmsUTwmW6/view?usp=share_link)
(original source: https://researchdata.edu.au/smart-building-iot-sensor-data/557052)

> **Note:** The raw dataset is roughly 1 GB and cannot be committed to GitHub due to its size. After downloading, place the raw file in `data/raw/`.
After downloading, store the cleaned version in `data/processed/building_replay` as a **Parquet (.parquet)** file.
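To sanity-check the processed file before running the replay, it can be loaded with pandas (this assumes `pandas` and a Parquet engine such as `pyarrow` are installed; the column names follow the dataset description later in this README):

```python
# Quick sanity check of the processed replay dataset.
import pandas as pd

df = pd.read_parquet("data/processed/building_replay")  # single file or directory of .parquet files
print(df.shape)
print(df.columns.tolist())                      # expect date_time, room_id, temp, humidity, ...
print(df["date_time"].min(), "to", df["date_time"].max())
```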
---
## 🚀 How to Run the Project
### 1️⃣ Clone the Repository
```bash
git clone https://github.com/moksh2212/digital-twin-project.git
cd digital-twin-project
```

### 2️⃣ Create a Virtual Environment

**macOS / Linux:**

```bash
python3 -m venv venv
source venv/bin/activate
```

**Windows (PowerShell):**

```powershell
python -m venv venv
venv\Scripts\activate
```

### 3️⃣ Install Dependencies

```bash
pip install -r requirements.txt
```

### 4️⃣ Run the Replay Engine

```bash
python -m src.replay.replay_engine
```

Or specify a time window:

```python
from src.replay.replay_engine import ReplayEngine

replayer = ReplayEngine("../data/processed/building_replay", speed=1000)
final_state = replayer.run("2019-02-09", "2019-02-10", log_every=1000)
```

---

## 🧠 Overview

A **Digital Twin** is a virtual replica of a physical system — in this case, a smart building equipped with temperature, humidity, CO₂, light, and motion sensors. This project reads historical sensor data, replays it in chronological order, and continuously updates an internal representation model of the building.
The twin uses several ML models to:
- Detect abnormal sensor behavior
- Assess sensor health
- Cluster rooms based on environmental patterns
---

## 🏠 Building Representation Model

**File:** `src/representation/building_model.py`
The Building Representation Model maintains the current state of each room in the twin. Every incoming sensor event updates this state, storing:
- Timestamp
- Latest sensor readings (`temp`, `humidity`, `co2`, `light`, `movement`, etc.)
- ML model results (`anomaly_flag`, `health_status`, `cluster_label`)
This model acts as the core memory of the digital twin.
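The real class lives in `src/representation/building_model.py`; the sketch below only illustrates the idea of a per-room state store, and its attribute and method names are assumptions, not the project's exact API:

```python
# Illustrative per-room state store (hypothetical names; see
# src/representation/building_model.py for the real implementation).
class BuildingModel:
    def __init__(self):
        # room_id -> latest known state for that room
        self.rooms = {}

    def update(self, event: dict, ml_results: dict) -> None:
        """Merge a sensor event and the ML outputs into the room's current state."""
        state = self.rooms.setdefault(event["room_id"], {})
        state["date_time"] = event["date_time"]
        # Keep the latest raw readings...
        for key in ("temp", "humidity", "co2", "light", "movement", "voltage", "rssi", "snr"):
            if key in event:
                state[key] = event[key]
        # ...and the latest ML results (anomaly_flag, health_status, cluster_label).
        state.update(ml_results)

    def snapshot(self) -> dict:
        """Return the current state of every room."""
        return self.rooms
```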
---

## 🔁 Replay Engine

**File:** `src/replay/replay_engine.py`
The Replay Engine replays historical data from a processed parquet dataset and feeds each event sequentially to the twin.
Main steps:
1. Load and sort the dataset by timestamp.
2. Iterate through events between a start and end date.
3. For each event:
   - Update the building model.
   - Run all ML models on live features.
   - Update the room's digital twin state.
Example:

```python
replayer = ReplayEngine("../data/processed/building_replay", speed=1000)
final_state = replayer.run("2019-02-09", "2019-02-10", log_every=1000)
```

- `speed` — controls the replay rate (real-time or fast-forward).
- `log_every` — prints periodic progress logs.
- `final_state` — the final building snapshot after the replay.
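For reference, here is a simplified sketch of what the replay loop does internally, assuming the processed data is a pandas-readable Parquet dataset with `date_time` and `room_id` columns (the real engine in `src/replay/replay_engine.py` additionally runs the ML models and respects the `speed` setting):

```python
# Simplified sketch of the replay loop (illustration only).
import pandas as pd

def replay(parquet_path: str, start: str, end: str, log_every: int = 1000) -> dict:
    # 1. Load and sort the dataset by timestamp.
    df = pd.read_parquet(parquet_path)
    df["date_time"] = pd.to_datetime(df["date_time"])
    df = df.sort_values("date_time")

    # 2. Keep only events inside the requested time window.
    window = df[(df["date_time"] >= start) & (df["date_time"] < end)]

    # 3. Feed each event to the twin in chronological order.
    twin = {}  # room_id -> latest known state (stand-in for the BuildingModel)
    for i, event in enumerate(window.to_dict(orient="records"), start=1):
        twin.setdefault(event["room_id"], {}).update(event)
        if i % log_every == 0:
            print(f"Processed {i} events | latest room {event['room_id']}")
    return twin
```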
---

## 🏗️ Architecture

```mermaid
flowchart TD
    A["Historical Sensor Data (Parquet)"] --> B[Replay Engine]
    B --> C["Preprocess & Feature Extraction"]
    C --> D[Anomaly Detector]
    C --> E[Sensor Health Model]
    C --> F[Room Clustering Model]
    D --> G[Building Representation Model]
    E --> G
    F --> G
    G --> H[Digital Twin State Updated]
    H --> I[Visualization / Analysis]
```
Explanation:
- Data flows chronologically through the Replay Engine.
- Each ML model produces independent predictions.
- The combined results continuously update the building’s digital state.
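Expressed as code, each event fans out to the three models, and their independent outputs are merged into the room's state. The sketch below is only illustrative; the `predict_one` and `update` method names are assumptions, not the project's exact API:

```python
# Illustrative per-event flow: one sensor event -> three independent predictions
# -> one state update. Method names here are assumptions, not the project's API.
def process_event(event, anomaly_model, health_model, clustering_model, building_model):
    ml_results = {
        "anomaly_flag": anomaly_model.predict_one(event),       # 1 = normal, -1 = anomaly
        "health_status": health_model.predict_one(event),       # 1 = healthy, 0 = unhealthy
        "cluster_label": clustering_model.predict_one(event),   # cluster id in {0, 1, 2, 3}
    }
    building_model.update(event, ml_results)
    return ml_results
```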
---

## 🤖 Machine Learning Models

The system integrates three lightweight ML components, each focusing on a specific analytical dimension.
### 1️⃣ Anomaly Detection Model

**File:** `src/ml/anomaly.py`
**Goal:** Detect abnormal or unexpected sensor behavior in real time.

**Description:**

- Learns normal operating patterns of sensor data (`temp`, `humidity`, `co2`, `light`, `movement`).
- Flags anomalies that deviate significantly from the learned distribution.
- Output: `1` → normal behavior, `-1` → anomaly detected.
**Tech Stack:**

- Algorithm: Isolation Forest (scikit-learn)
- Features: `[temp, humidity, co2, light, movement]`

**Purpose:** Helps identify faulty sensors or unusual environmental conditions (e.g., abnormal CO₂ spikes or temperature drops).
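As an illustration of the approach, a minimal Isolation Forest detector could look like the sketch below (the wrapper class and its methods are assumptions for this example, not the exact API of `src/ml/anomaly.py`):

```python
# Illustrative Isolation Forest anomaly detector (a sketch, not the project's exact API).
import numpy as np
from sklearn.ensemble import IsolationForest

FEATURES = ["temp", "humidity", "co2", "light", "movement"]

class AnomalyDetector:
    def __init__(self, contamination=0.01, random_state=42):
        self.model = IsolationForest(contamination=contamination, random_state=random_state)

    def fit(self, df):
        # Learn the "normal" operating distribution from historical readings.
        self.model.fit(df[FEATURES].astype(float))
        return self

    def predict_one(self, event: dict) -> int:
        # 1 -> normal behavior, -1 -> anomaly.
        x = np.array([[float(event[f]) for f in FEATURES]], dtype=float)
        return int(self.model.predict(x)[0])
```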
### 2️⃣ Sensor Health Model

**File:** `src/ml/sensor_health.py`
**Goal:** Predict whether a sensor is operating healthily or showing signs of degradation.

**Description:**

- Evaluates communication and power-related metrics (`voltage`, `rssi`, `snr`).
- Automatically labels training data based on defined thresholds: low voltage (< 3.5 V) or weak signal → unhealthy.
- Output: `1` → healthy sensor, `0` → unhealthy sensor.
**Tech Stack:**

- Algorithm: Random Forest Classifier
- Preprocessing: StandardScaler for normalization
- Features: `[voltage, rssi, snr]`

**Purpose:** Monitors sensor hardware reliability — detects sensors with weak signal strength or low battery.
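A minimal sketch of this approach with scikit-learn follows; the 3.5 V cut-off comes from the description above, while the RSSI threshold and the class/method names are assumptions for illustration:

```python
# Illustrative sensor-health classifier (sketch; the RSSI cut-off is assumed for the example).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

FEATURES = ["voltage", "rssi", "snr"]

class SensorHealthModel:
    def __init__(self):
        self.pipeline = make_pipeline(StandardScaler(), RandomForestClassifier(n_estimators=100))

    def fit(self, df):
        # Auto-label training data: low voltage (< 3.5 V) or weak signal -> unhealthy (0), else healthy (1).
        weak_signal = df["rssi"] < -110            # assumed cut-off for illustration
        labels = np.where((df["voltage"] < 3.5) | weak_signal, 0, 1)
        self.pipeline.fit(df[FEATURES], labels)
        return self

    def predict_one(self, event: dict) -> int:
        # 1 -> healthy sensor, 0 -> unhealthy sensor.
        x = np.array([[event[f] for f in FEATURES]], dtype=float)
        return int(self.pipeline.predict(x)[0])
```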
### 3️⃣ Room Clustering Model

**File:** `src/ml/room_clustering.py`
**Goal:** Group rooms or sensor readings based on similar environmental patterns.

**Description:**

- Uses unsupervised learning to cluster readings or rooms into environmental behavior groups.
- Output: `cluster_label` ∈ {0, 1, 2, 3}.
- Two operating modes:
  - **Static:** clusters based on average per-room conditions.
  - **Dynamic:** clusters raw live readings directly (used in replay).
**Tech Stack:**

- Algorithm: K-Means Clustering
- Preprocessing: StandardScaler
- Features: `[temp, humidity, co2, light, movement]`

**Purpose:** Identifies room behavior types such as “occupied and bright” vs. “idle and dark,” enabling contextual analysis of building usage.
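A minimal sketch of the two clustering modes with scikit-learn (class and method names are assumptions for illustration, not the exact API of `src/ml/room_clustering.py`):

```python
# Illustrative K-Means room clustering with static and dynamic modes (sketch only).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

FEATURES = ["temp", "humidity", "co2", "light", "movement"]

class RoomClusteringModel:
    def __init__(self, n_clusters=4):
        self.scaler = StandardScaler()
        self.kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42)

    def fit_static(self, df):
        # Static mode: cluster rooms by their average environmental conditions.
        room_means = df.groupby("room_id")[FEATURES].mean()
        self.kmeans.fit(self.scaler.fit_transform(room_means))
        return self

    def fit_dynamic(self, df):
        # Dynamic mode: cluster raw live readings directly (used during replay).
        self.kmeans.fit(self.scaler.fit_transform(df[FEATURES].astype(float)))
        return self

    def predict_one(self, event: dict) -> int:
        # Returns a cluster_label in {0, ..., n_clusters - 1}.
        x = np.array([[float(event[f]) for f in FEATURES]], dtype=float)
        return int(self.kmeans.predict(self.scaler.transform(x))[0])
```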
---

## 🔄 ML Pipeline Summary

| Step | Model | Input | Output | Purpose |
|---|---|---|---|---|
| 1 | AnomalyDetector | Sensor values | `anomaly_flag` | Detect abnormal readings |
| 2 | SensorHealthModel | `voltage`, `rssi`, `snr` | `health_status` | Assess sensor health |
| 3 | RoomClusteringModel | `temp`, `humidity`, `co2`, `light`, `movement` | `cluster_label` | Group similar room behaviors |
---

## 📁 Project Structure

```
digital-twin-project/
│
├── data/
│   ├── raw/                      # Original sensor datasets
│   ├── processed/
│   │   └── building_replay       # Cleaned parquet file for replay
│
├── src/
│   ├── ml/
│   │   ├── anomaly.py
│   │   ├── sensor_health.py
│   │   └── room_clustering.py
│   │
│   ├── replay/
│   │   └── replay_engine.py
│   │
│   └── representation/
│       └── building_model.py
│
├── notebooks/                    # Jupyter notebooks for testing or analysis
├── README.md                     # Documentation (this file)
└── requirements.txt              # Dependencies
```
---

## 🖥️ Example Output

```
✅ All ML models trained.
Processed 5000 events | Latest room 6.315:
{
  'date_time': '2019-02-09 17:56:31',
  'temp': 27.23,
  'humidity': 44.0,
  'co2': 55.0,
  'light': 1.0,
  'movement': False,
  'voltage': 4.24,
  'rssi': -99.0,
  'snr': 11.2,
  'anomaly_flag': 1,
  'health_status': 1,
  'cluster_label': 2
}
🎬 Replay complete.
```
---

## 📊 Dataset Description

The dataset originates from LoRaWAN building sensors deployed at the SMART Infrastructure Facility, University of Wollongong.

Typical columns:

| Feature | Description |
|---|---|
| `date_time` | Timestamp of reading |
| `room_id` | Room identifier |
| `temp` | Temperature (°C) |
| `humidity` | Relative humidity (%) |
| `co2` | CO₂ level (ppm) |
| `light` | Light intensity |
| `movement` | Occupancy/motion detected |
| `voltage`, `rssi`, `snr` | Sensor network health metrics |
---

## 🛠️ Tech Stack

- Language: Python 3.11+
- Data Processing: pandas, numpy
- Machine Learning: scikit-learn
- Storage: Apache Parquet
- Visualization (optional): matplotlib, seaborn
---

## 🔮 Future Work

- Integrate energy-aware analytics for sustainability tracking.
- Add real-time dashboard for digital twin visualization.
- Explore deep learning for adaptive anomaly detection.
- Extend to multi-building or campus-scale twins.