
# 🏢 Digital Twin for Smart Building (Historical Replay System)

This project builds a **Digital Twin** of a smart building by replaying historical IoT sensor data in real time.  
It reconstructs the building’s evolving state using **machine learning**, allowing analysis of anomalies, sensor health, and environmental behavior patterns.

---

## 📦 Dataset

The dataset used in this project comes from the **SMART Infrastructure Facility, University of Wollongong (Australia)**.

🔗 **Download link:**
[Smart Building IoT Sensor Dataset (Google Drive)](https://drive.google.com/file/d/1HvaTByQp1sqvPsJDSD9nyn5EmsUTwmW6/view?usp=share_link)
(original listing: [Research Data Australia](https://researchdata.edu.au/smart-building-iot-sensor-data/557052))

**Note:** The raw dataset (~1 GB) is too large to commit to GitHub, so place it in the `data/raw/` folder after downloading. Store the cleaned version as a **Parquet (.parquet)** file in:

`data/processed/building_replay`

---

## 🚀 How to Run the Project

### 1️⃣ Clone the Repository

```bash
git clone https://github.com/moksh2212/digital-twin-project.git
cd digital-twin-project
```

### 2️⃣ Create and Activate a Virtual Environment

For macOS / Linux:

```bash
python3 -m venv venv
source venv/bin/activate
```

For Windows (PowerShell):

```powershell
python -m venv venv
venv\Scripts\activate
```

### 3️⃣ Install Dependencies

```bash
pip install -r requirements.txt
```

### 4️⃣ Run the Replay Engine

```bash
python -m src.replay.replay_engine
```

Or specify a time window:

```python
from src.replay.replay_engine import ReplayEngine

replayer = ReplayEngine("../data/processed/building_replay", speed=1000)
final_state = replayer.run("2019-02-09", "2019-02-10", log_every=1000)
```

## 🌍 Overview

A **Digital Twin** is a virtual replica of a physical system — in this case, a smart building equipped with temperature, humidity, CO₂, light, and motion sensors. This project reads historical sensor data, replays it in chronological order, and continuously updates an internal representation model of the building.

The twin uses several ML models to:

- Detect abnormal sensor behavior
- Assess sensor health
- Cluster rooms based on environmental patterns

## 🧩 Architecture

### 1️⃣ Representation Model

**File:** `src/representation/building_model.py`

The Building Representation Model maintains the current state of each room in the twin. Every incoming sensor event updates this state, storing:

- Timestamp
- Latest sensor readings (`temp`, `humidity`, `co2`, `light`, `movement`, etc.)
- ML model results (`anomaly_flag`, `health_status`, `cluster_label`)

This model acts as the core memory of the digital twin.
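
A minimal sketch of the idea behind such a model (the class and method names here are illustrative, not necessarily the project's actual API in `building_model.py`):

```python
from typing import Any, Dict


class BuildingModel:
    """Keeps the latest known state for every room in the twin."""

    def __init__(self) -> None:
        # room_id -> {field name -> latest value}
        self.rooms: Dict[str, Dict[str, Any]] = {}

    def update(self, room_id: str, event: Dict[str, Any]) -> None:
        """Merge an incoming sensor event into the room's current state."""
        self.rooms.setdefault(room_id, {}).update(event)

    def state(self, room_id: str) -> Dict[str, Any]:
        """Return the room's current state (empty if never seen)."""
        return self.rooms.get(room_id, {})


model = BuildingModel()
model.update("6.315", {"date_time": "2019-02-09 17:56:31", "temp": 27.23})
model.update("6.315", {"anomaly_flag": 1, "health_status": 1})
print(model.state("6.315")["temp"])  # 27.23
```

Because each event only overwrites the fields it carries, the room state always reflects the latest value per field, which is exactly what a per-room snapshot needs.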


### 2️⃣ Historical Replay Model

**File:** `src/replay/replay_engine.py`

The Replay Engine replays historical data from a processed Parquet dataset and feeds each event sequentially to the twin.

Main steps:

1. Load and sort the dataset by timestamp.
2. Iterate through events between a start and end date.
3. For each event:
   - Update the building model.
   - Run all ML models on the live features.
   - Update the room's digital twin state.

Example:

```python
replayer = ReplayEngine("../data/processed/building_replay", speed=1000)
final_state = replayer.run("2019-02-09", "2019-02-10", log_every=1000)
```

- `speed` — controls the replay rate (real-time or fast-forward).
- `log_every` — prints periodic progress logs.
- `final_state` — the final building snapshot after the replay.
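
The core loop can be sketched as below. This is a simplified, in-memory stand-in for the real engine (the actual `ReplayEngine` reads the Parquet dataset, honors `speed`, and also applies the ML models per event):

```python
import pandas as pd


def replay(df: pd.DataFrame, start: str, end: str, log_every: int = 1000) -> dict:
    """Feed events in chronological order and build up per-room state."""
    df = df.sort_values("date_time")
    window = df[(df["date_time"] >= start) & (df["date_time"] < end)]

    state: dict = {}
    for i, event in enumerate(window.to_dict(orient="records"), start=1):
        room = event.pop("room_id")
        state.setdefault(room, {}).update(event)  # latest reading wins
        if i % log_every == 0:
            print(f"Processed {i} events | Latest room {room}")
    return state


df = pd.DataFrame({
    "date_time": ["2019-02-09 10:00", "2019-02-09 11:00", "2019-02-10 09:00"],
    "room_id": ["6.315", "6.315", "6.316"],
    "temp": [25.0, 27.2, 22.1],
})
final = replay(df, "2019-02-09", "2019-02-10")
print(final["6.315"]["temp"])  # 27.2 — only events inside the window applied
```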

## ⚙️ Algorithmic Flow

```mermaid
flowchart TD
    A["Historical Sensor Data (Parquet)"] --> B[Replay Engine]
    B --> C[Preprocess & Feature Extraction]
    C --> D[Anomaly Detector]
    C --> E[Sensor Health Model]
    C --> F[Room Clustering Model]
    D --> G[Building Representation Model]
    E --> G
    F --> G
    G --> H[Digital Twin State Updated]
    H --> I[Visualization / Analysis]
```

Explanation:

- Data flows chronologically through the Replay Engine.
- Each ML model produces independent predictions.
- The combined results continuously update the building’s digital state.

## 🧠 Machine Learning Models

The system integrates three lightweight ML components, each focusing on a specific analytical dimension.


### 🔍 1. Anomaly Detection Model

**File:** `src/ml/anomaly.py`
**Goal:** Detect abnormal or unexpected sensor behavior in real time.

Description:

- Learns the normal operating patterns of sensor data (`temp`, `humidity`, `co2`, `light`, `movement`).
- Flags anomalies that deviate significantly from the learned distribution.
- Output:
  - `1` → Normal behavior
  - `-1` → Anomaly detected

Tech Stack:

- Algorithm: Isolation Forest (scikit-learn)
- Features: `[temp, humidity, co2, light, movement]`

Purpose: Helps identify faulty sensors or unusual environmental conditions (e.g., abnormal CO₂ spikes or temperature drops).


### ⚙️ 2. Sensor Health Model

**File:** `src/ml/sensor_health.py`
**Goal:** Predict whether a sensor is operating healthily or showing signs of degradation.

Description:

- Evaluates communication and power-related metrics (`voltage`, `rssi`, `snr`).
- Automatically labels training data based on defined thresholds:
  - Low voltage (< 3.5 V) or a weak signal → unhealthy.
- Output:
  - `1` → Healthy sensor
  - `0` → Unhealthy sensor

Tech Stack:

- Algorithm: Random Forest Classifier
- Preprocessing: StandardScaler for normalization
- Features: `[voltage, rssi, snr]`

Purpose: Monitors sensor hardware reliability — detects sensors with weak signal strength or low battery.
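
A sketch of the threshold-labeling-plus-classifier approach. The 3.5 V cutoff comes from the description above; the RSSI cutoff (-110 dBm) and the synthetic telemetry are assumptions for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)

# Synthetic telemetry: [voltage, rssi, snr]
X = np.column_stack([
    rng.uniform(3.0, 4.3, 400),    # voltage (V)
    rng.uniform(-120, -60, 400),   # rssi (dBm)
    rng.uniform(-5, 15, 400),      # snr (dB)
])

# Auto-label: low voltage (< 3.5 V) or weak signal (< -110 dBm, assumed)
# -> unhealthy (0); otherwise healthy (1)
y = ((X[:, 0] >= 3.5) & (X[:, 1] >= -110)).astype(int)

model = make_pipeline(StandardScaler(), RandomForestClassifier(random_state=0))
model.fit(X, y)

print(model.predict([[4.2, -90.0, 10.0],    # strong battery and signal
                     [3.1, -118.0, 2.0]]))  # low voltage, weak signal
```

Labeling from thresholds avoids hand-annotating training data, while the classifier generalizes the rule to noisy, borderline readings.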


### 🏠 3. Room Clustering Model

**File:** `src/ml/room_clustering.py`
**Goal:** Group rooms or sensor readings based on similar environmental patterns.

Description:

- Uses unsupervised learning to cluster readings or rooms into environmental behavior groups.
- Output:
  - `cluster_label ∈ {0, 1, 2, 3}`
- Two operating modes:
  - **Static:** Clusters based on average per-room conditions.
  - **Dynamic:** Clusters raw live readings directly (used in replay).

Tech Stack:

- Algorithm: K-Means Clustering
- Preprocessing: StandardScaler
- Features: `[temp, humidity, co2, light, movement]`

Purpose: Identifies room behavior types such as “occupied and bright” vs. “idle and dark,” enabling contextual analysis of building usage.
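
A sketch of the dynamic mode using scikit-learn's `KMeans` behind a `StandardScaler` (the two synthetic regimes below are illustrative, not the project's data):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic readings around two regimes: "occupied and bright" vs "idle and dark"
# Columns: [temp, humidity, co2, light, movement]
occupied = np.column_stack([
    rng.normal(25, 0.5, 100),
    rng.normal(50, 2, 100),
    rng.normal(800, 50, 100),   # CO2 rises with occupancy
    rng.normal(500, 30, 100),   # lights on
    np.ones(100),
])
idle = np.column_stack([
    rng.normal(21, 0.5, 100),
    rng.normal(40, 2, 100),
    rng.normal(420, 30, 100),
    rng.normal(5, 2, 100),      # lights off
    np.zeros(100),
])
X = np.vstack([occupied, idle])

clusterer = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=4, random_state=0, n_init=10),
)
labels = clusterer.fit_predict(X)
print(sorted(set(labels)))  # labels drawn from {0, 1, 2, 3}
```

Scaling first matters: without it, the CO₂ column (hundreds of ppm) would dominate the Euclidean distances and drown out `movement` and `temp`.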


## 🧾 ML Integration Summary

| Step | Model | Input | Output | Purpose |
|------|-------|-------|--------|---------|
| 1 | AnomalyDetector | Sensor values | `anomaly_flag` | Detect abnormal readings |
| 2 | SensorHealthModel | `voltage`, `rssi`, `snr` | `health_status` | Assess sensor health |
| 3 | RoomClusteringModel | `temp`, `humidity`, `co2`, `light`, `movement` | `cluster_label` | Group similar room behaviors |

## 🗂️ Project Folder Structure

```text
digital-twin-project/
│
├── data/
│   ├── raw/                  # Original sensor datasets
│   ├── processed/
│   │   └── building_replay   # Cleaned parquet file for replay
│
├── src/
│   ├── ml/
│   │   ├── anomaly.py
│   │   ├── sensor_health.py
│   │   └── room_clustering.py
│   │
│   ├── replay/
│   │   └── replay_engine.py
│   │
│   └── representation/
│       └── building_model.py
│
├── notebooks/                # Jupyter notebooks for testing or analysis
├── README.md                 # Documentation (this file)
└── requirements.txt          # Dependencies
```

## 📊 Example Replay Output

```text
✅ All ML models trained.

Processed 5000 events | Latest room 6.315:
{
  'date_time': '2019-02-09 17:56:31',
  'temp': 27.23,
  'humidity': 44.0,
  'co2': 55.0,
  'light': 1.0,
  'movement': False,
  'voltage': 4.24,
  'rssi': -99.0,
  'snr': 11.2,
  'anomaly_flag': 1,
  'health_status': 1,
  'cluster_label': 2
}
🎬 Replay complete.
```

## 💾 Data Used

The dataset originates from LoRaWAN building sensors deployed at the SMART Infrastructure Facility, University of Wollongong.

Typical columns:

| Feature | Description |
|---------|-------------|
| `date_time` | Timestamp of reading |
| `room_id` | Room identifier |
| `temp` | Temperature (°C) |
| `humidity` | Relative humidity (%) |
| `co2` | CO₂ level (ppm) |
| `light` | Light intensity |
| `movement` | Occupancy/motion detected |
| `voltage`, `rssi`, `snr` | Sensor network health metrics |

## 🧮 Technologies & Libraries

- **Language:** Python 3.11+
- **Data Processing:** pandas, numpy
- **Machine Learning:** scikit-learn
- **Storage:** Apache Parquet
- **Visualization (optional):** matplotlib, seaborn

## 📈 Future Improvements

- Integrate energy-aware analytics for sustainability tracking.
- Add a real-time dashboard for digital twin visualization.
- Explore deep learning for adaptive anomaly detection.
- Extend to multi-building or campus-scale twins.

