# 🏢 Digital Twin for Smart Building (Historical Replay System)
This project builds a **Digital Twin** of a smart building by replaying historical IoT sensor data in real time.
It reconstructs the building’s evolving state using **machine learning**, allowing analysis of anomalies, sensor health, and environmental behavior patterns.
---
## 📦 Dataset
The dataset used in this project comes from the **SMART Infrastructure Facility, University of Wollongong (Australia)**.
🔗 **Download link:**
[Smart Building IoT Sensor Dataset (Google Drive)](https://drive.google.com/file/d/1HvaTByQp1sqvPsJDSD9nyn5EmsUTwmW6/view?usp=share_link)
(original source: https://researchdata.edu.au/smart-building-iot-sensor-data/557052)

> **Note:** The raw dataset is roughly 1 GB and cannot be committed to GitHub due to its size. After downloading, place the raw file in `data/raw/`.
After downloading, store the cleaned version in `data/processed/building_replay` as a **Parquet (.parquet)** file.
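To sanity-check the processed file before running the replay, it can be loaded with pandas (this assumes `pandas` and a Parquet engine such as `pyarrow` are installed; the column names follow the dataset description later in this README):

```python
# Quick sanity check of the processed replay dataset.
import pandas as pd

df = pd.read_parquet("data/processed/building_replay")  # single file or directory of .parquet files
print(df.shape)
print(df.columns.tolist())                      # expect date_time, room_id, temp, humidity, ...
print(df["date_time"].min(), "to", df["date_time"].max())
```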
---
## 🚀 How to Run the Project
### 1️⃣ Clone the Repository
```bash
git clone https://github.com/moksh2212/digital-twin-project.git
cd digital-twin-project
```

### 2️⃣ Create a Virtual Environment

**macOS / Linux:**

```bash
python3 -m venv venv
source venv/bin/activate
```

**Windows (PowerShell):**

```powershell
python -m venv venv
venv\Scripts\activate
```

### 3️⃣ Install Dependencies

```bash
pip install -r requirements.txt
```

### 4️⃣ Run the Replay Engine

```bash
python -m src.replay.replay_engine
```

Or specify a time window:

```python
from src.replay.replay_engine import ReplayEngine

replayer = ReplayEngine("../data/processed/building_replay", speed=1000)
final_state = replayer.run("2019-02-09", "2019-02-10", log_every=1000)
```

---

## 🧠 Overview

A **Digital Twin** is a virtual replica of a physical system — in this case, a smart building equipped with temperature, humidity, CO₂, light, and motion sensors. This project reads historical sensor data, replays it in chronological order, and continuously updates an internal representation model of the building.
The twin uses several ML models to:
- Detect abnormal sensor behavior
- Assess sensor health
- Cluster rooms based on environmental patterns
---

## 🏠 Building Representation Model

**File:** `src/representation/building_model.py`
The Building Representation Model maintains the current state of each room in the twin. Every incoming sensor event updates this state, storing:
- Timestamp
- Latest sensor readings (`temp`, `humidity`, `co2`, `light`, `movement`, etc.)
- ML model results (`anomaly_flag`, `health_status`, `cluster_label`)
This model acts as the core memory of the digital twin.
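The real class lives in `src/representation/building_model.py`; the sketch below only illustrates the idea of a per-room state store, and its attribute and method names are assumptions, not the project's exact API:

```python
# Illustrative per-room state store (hypothetical names; see
# src/representation/building_model.py for the real implementation).
class BuildingModel:
    def __init__(self):
        # room_id -> latest known state for that room
        self.rooms = {}

    def update(self, event: dict, ml_results: dict) -> None:
        """Merge a sensor event and the ML outputs into the room's current state."""
        state = self.rooms.setdefault(event["room_id"], {})
        state["date_time"] = event["date_time"]
        # Keep the latest raw readings...
        for key in ("temp", "humidity", "co2", "light", "movement", "voltage", "rssi", "snr"):
            if key in event:
                state[key] = event[key]
        # ...and the latest ML results (anomaly_flag, health_status, cluster_label).
        state.update(ml_results)

    def snapshot(self) -> dict:
        """Return the current state of every room."""
        return self.rooms
```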
---

## 🔁 Replay Engine

**File:** `src/replay/replay_engine.py`
The Replay Engine replays historical data from a processed parquet dataset and feeds each event sequentially to the twin.
Main steps:
1. Load and sort the dataset by timestamp.
2. Iterate through events between a start and end date.
3. For each event:
   - Update the building model.
   - Run all ML models on live features.
   - Update the room's digital twin state.
Example:

```python
replayer = ReplayEngine("../data/processed/building_replay", speed=1000)
final_state = replayer.run("2019-02-09", "2019-02-10", log_every=1000)
```

- `speed` — controls the replay rate (real-time or fast-forward).
- `log_every` — prints periodic progress logs.
- `final_state` — the final building snapshot after the replay.
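For reference, here is a simplified sketch of what the replay loop does internally, assuming the processed data is a pandas-readable Parquet dataset with `date_time` and `room_id` columns (the real engine in `src/replay/replay_engine.py` additionally runs the ML models and respects the `speed` setting):

```python
# Simplified sketch of the replay loop (illustration only).
import pandas as pd

def replay(parquet_path: str, start: str, end: str, log_every: int = 1000) -> dict:
    # 1. Load and sort the dataset by timestamp.
    df = pd.read_parquet(parquet_path)
    df["date_time"] = pd.to_datetime(df["date_time"])
    df = df.sort_values("date_time")

    # 2. Keep only events inside the requested time window.
    window = df[(df["date_time"] >= start) & (df["date_time"] < end)]

    # 3. Feed each event to the twin in chronological order.
    twin = {}  # room_id -> latest known state (stand-in for the BuildingModel)
    for i, event in enumerate(window.to_dict(orient="records"), start=1):
        twin.setdefault(event["room_id"], {}).update(event)
        if i % log_every == 0:
            print(f"Processed {i} events | latest room {event['room_id']}")
    return twin
```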
---

## 🏗️ Architecture

```mermaid
flowchart TD
    A["Historical Sensor Data (Parquet)"] --> B[Replay Engine]
    B --> C["Preprocess & Feature Extraction"]
    C --> D[Anomaly Detector]
    C --> E[Sensor Health Model]
    C --> F[Room Clustering Model]
    D --> G[Building Representation Model]
    E --> G
    F --> G
    G --> H[Digital Twin State Updated]
    H --> I[Visualization / Analysis]
```
Explanation:
- Data flows chronologically through the Replay Engine.
- Each ML model produces independent predictions.
- The combined results continuously update the building’s digital state.
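Expressed as code, each event fans out to the three models, and their independent outputs are merged into the room's state. The sketch below is only illustrative; the `predict_one` and `update` method names are assumptions, not the project's exact API:

```python
# Illustrative per-event flow: one sensor event -> three independent predictions
# -> one state update. Method names here are assumptions, not the project's API.
def process_event(event, anomaly_model, health_model, clustering_model, building_model):
    ml_results = {
        "anomaly_flag": anomaly_model.predict_one(event),       # 1 = normal, -1 = anomaly
        "health_status": health_model.predict_one(event),       # 1 = healthy, 0 = unhealthy
        "cluster_label": clustering_model.predict_one(event),   # cluster id in {0, 1, 2, 3}
    }
    building_model.update(event, ml_results)
    return ml_results
```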
---

## 🤖 Machine Learning Models

The system integrates three lightweight ML components, each focusing on a specific analytical dimension.
### 1️⃣ Anomaly Detection Model

**File:** `src/ml/anomaly.py`
**Goal:** Detect abnormal or unexpected sensor behavior in real time.

**Description:**

- Learns normal operating patterns of sensor data (`temp`, `humidity`, `co2`, `light`, `movement`).
- Flags anomalies that deviate significantly from the learned distribution.
- Output: `1` → normal behavior, `-1` → anomaly detected.
**Tech Stack:**

- Algorithm: Isolation Forest (scikit-learn)
- Features: `[temp, humidity, co2, light, movement]`

**Purpose:** Helps identify faulty sensors or unusual environmental conditions (e.g., abnormal CO₂ spikes or temperature drops).
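As an illustration of the approach, a minimal Isolation Forest detector could look like the sketch below (the wrapper class and its methods are assumptions for this example, not the exact API of `src/ml/anomaly.py`):

```python
# Illustrative Isolation Forest anomaly detector (a sketch, not the project's exact API).
import numpy as np
from sklearn.ensemble import IsolationForest

FEATURES = ["temp", "humidity", "co2", "light", "movement"]

class AnomalyDetector:
    def __init__(self, contamination=0.01, random_state=42):
        self.model = IsolationForest(contamination=contamination, random_state=random_state)

    def fit(self, df):
        # Learn the "normal" operating distribution from historical readings.
        self.model.fit(df[FEATURES].astype(float))
        return self

    def predict_one(self, event: dict) -> int:
        # 1 -> normal behavior, -1 -> anomaly.
        x = np.array([[float(event[f]) for f in FEATURES]], dtype=float)
        return int(self.model.predict(x)[0])
```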
### 2️⃣ Sensor Health Model

**File:** `src/ml/sensor_health.py`
**Goal:** Predict whether a sensor is operating healthily or showing signs of degradation.

**Description:**

- Evaluates communication and power-related metrics (`voltage`, `rssi`, `snr`).
- Automatically labels training data based on defined thresholds: low voltage (< 3.5 V) or weak signal → unhealthy.
- Output: `1` → healthy sensor, `0` → unhealthy sensor.
**Tech Stack:**

- Algorithm: Random Forest Classifier
- Preprocessing: StandardScaler for normalization
- Features: `[voltage, rssi, snr]`

**Purpose:** Monitors sensor hardware reliability — detects sensors with weak signal strength or low battery.
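A minimal sketch of this approach with scikit-learn follows; the 3.5 V cut-off comes from the description above, while the RSSI threshold and the class/method names are assumptions for illustration:

```python
# Illustrative sensor-health classifier (sketch; the RSSI cut-off is assumed for the example).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

FEATURES = ["voltage", "rssi", "snr"]

class SensorHealthModel:
    def __init__(self):
        self.pipeline = make_pipeline(StandardScaler(), RandomForestClassifier(n_estimators=100))

    def fit(self, df):
        # Auto-label training data: low voltage (< 3.5 V) or weak signal -> unhealthy (0), else healthy (1).
        weak_signal = df["rssi"] < -110            # assumed cut-off for illustration
        labels = np.where((df["voltage"] < 3.5) | weak_signal, 0, 1)
        self.pipeline.fit(df[FEATURES], labels)
        return self

    def predict_one(self, event: dict) -> int:
        # 1 -> healthy sensor, 0 -> unhealthy sensor.
        x = np.array([[event[f] for f in FEATURES]], dtype=float)
        return int(self.pipeline.predict(x)[0])
```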
### 3️⃣ Room Clustering Model

**File:** `src/ml/room_clustering.py`
**Goal:** Group rooms or sensor readings based on similar environmental patterns.

**Description:**

- Uses unsupervised learning to cluster readings or rooms into environmental behavior groups.
- Output: `cluster_label` ∈ {0, 1, 2, 3}.
- Two operating modes:
  - **Static:** clusters based on average per-room conditions.
  - **Dynamic:** clusters raw live readings directly (used in replay).
**Tech Stack:**

- Algorithm: K-Means Clustering
- Preprocessing: StandardScaler
- Features: `[temp, humidity, co2, light, movement]`

**Purpose:** Identifies room behavior types such as “occupied and bright” vs. “idle and dark,” enabling contextual analysis of building usage.
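A minimal sketch of the two clustering modes with scikit-learn (class and method names are assumptions for illustration, not the exact API of `src/ml/room_clustering.py`):

```python
# Illustrative K-Means room clustering with static and dynamic modes (sketch only).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

FEATURES = ["temp", "humidity", "co2", "light", "movement"]

class RoomClusteringModel:
    def __init__(self, n_clusters=4):
        self.scaler = StandardScaler()
        self.kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42)

    def fit_static(self, df):
        # Static mode: cluster rooms by their average environmental conditions.
        room_means = df.groupby("room_id")[FEATURES].mean()
        self.kmeans.fit(self.scaler.fit_transform(room_means))
        return self

    def fit_dynamic(self, df):
        # Dynamic mode: cluster raw live readings directly (used during replay).
        self.kmeans.fit(self.scaler.fit_transform(df[FEATURES].astype(float)))
        return self

    def predict_one(self, event: dict) -> int:
        # Returns a cluster_label in {0, ..., n_clusters - 1}.
        x = np.array([[float(event[f]) for f in FEATURES]], dtype=float)
        return int(self.kmeans.predict(self.scaler.transform(x))[0])
```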
---

## 🔄 ML Pipeline Summary

| Step | Model | Input | Output | Purpose |
|---|---|---|---|---|
| 1 | AnomalyDetector | Sensor values | `anomaly_flag` | Detect abnormal readings |
| 2 | SensorHealthModel | `voltage`, `rssi`, `snr` | `health_status` | Assess sensor health |
| 3 | RoomClusteringModel | `temp`, `humidity`, `co2`, `light`, `movement` | `cluster_label` | Group similar room behaviors |
---

## 📁 Project Structure

```
digital-twin-project/
│
├── data/
│   ├── raw/                      # Original sensor datasets
│   ├── processed/
│   │   └── building_replay       # Cleaned parquet file for replay
│
├── src/
│   ├── ml/
│   │   ├── anomaly.py
│   │   ├── sensor_health.py
│   │   └── room_clustering.py
│   │
│   ├── replay/
│   │   └── replay_engine.py
│   │
│   └── representation/
│       └── building_model.py
│
├── notebooks/                    # Jupyter notebooks for testing or analysis
├── README.md                     # Documentation (this file)
└── requirements.txt              # Dependencies
```
---

## 🖥️ Example Output

```
✅ All ML models trained.
Processed 5000 events | Latest room 6.315:
{
  'date_time': '2019-02-09 17:56:31',
  'temp': 27.23,
  'humidity': 44.0,
  'co2': 55.0,
  'light': 1.0,
  'movement': False,
  'voltage': 4.24,
  'rssi': -99.0,
  'snr': 11.2,
  'anomaly_flag': 1,
  'health_status': 1,
  'cluster_label': 2
}
🎬 Replay complete.
```
---

## 📊 Dataset Description

The dataset originates from LoRaWAN building sensors deployed at the SMART Infrastructure Facility, University of Wollongong.

Typical columns:

| Feature | Description |
|---|---|
| `date_time` | Timestamp of reading |
| `room_id` | Room identifier |
| `temp` | Temperature (°C) |
| `humidity` | Relative humidity (%) |
| `co2` | CO₂ level (ppm) |
| `light` | Light intensity |
| `movement` | Occupancy/motion detected |
| `voltage`, `rssi`, `snr` | Sensor network health metrics |
---

## 🛠️ Tech Stack

- Language: Python 3.11+
- Data Processing: pandas, numpy
- Machine Learning: scikit-learn
- Storage: Apache Parquet
- Visualization (optional): matplotlib, seaborn
---

## 🔮 Future Work

- Integrate energy-aware analytics for sustainability tracking.
- Add real-time dashboard for digital twin visualization.
- Explore deep learning for adaptive anomaly detection.
- Extend to multi-building or campus-scale twins.