GPU Monitor

A service for monitoring GPU metrics and sending them to a central server.

Features

  • Monitors NVIDIA GPU metrics, including:
    • Memory usage (total, used, free)
    • SM and memory utilization
    • Temperature
    • Fan speed
    • Power usage
    • Clock speeds (graphics and memory)
  • Aggregates metrics hourly
  • Sends aggregated data to a central server
  • Handles offline mode and retries
  • Maintains data retention policies:
    • Raw data: 30 days
    • Aggregated data: 1 year
  • Supports multiple GPUs
  • Provides detailed logging
  • Cross-platform support (Windows, Linux, macOS)
  • Python 3.6+ compatibility

Requirements

  • Python 3.6 or higher
  • NVIDIA GPU with NVIDIA drivers installed
  • NVIDIA Management Library (NVML)
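
For reference, the per-GPU metrics listed under Features map directly onto NVML queries. The snippet below is an illustrative sketch using the pynvml (nvidia-ml-py) bindings; it is not necessarily how gpu_monitor reads these values internally.

import pynvml  # nvidia-ml-py bindings (shown for illustration only)

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)         # total / used / free, in bytes
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # SM (.gpu) and memory (.memory) utilization, %
        sample = {
            "uid": pynvml.nvmlDeviceGetUUID(handle),          # may be bytes on older bindings
            "name": pynvml.nvmlDeviceGetName(handle),
            "temperature": pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU),
            "memory_used": mem.used,
            "memory_total": mem.total,
            "gpu_utilization": util.gpu,
            "memory_utilization": util.memory,
            "power_usage": pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0,  # milliwatts -> watts
            "fan_speed": pynvml.nvmlDeviceGetFanSpeed(handle),
            "graphics_clock": pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_GRAPHICS),
            "memory_clock": pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_MEM),
        }
        print(sample)
finally:
    pynvml.nvmlShutdown()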

Installation

From PyPI

pip install gpu-monitor

From Source

  1. Clone the repository:
git clone https://github.com/targoman/gpu_monitor.git
cd gpu_monitor
  2. Install dependencies:
pip install -e .

Platform-Specific Notes

Windows

  • Colorama is automatically installed for Windows color support
  • NVML is fully supported
  • Windows 10 and 11 are supported
  • Run as a Windows Service using NSSM (see Running as a Service)

Linux

  • NVML is fully supported
  • Most Linux distributions are supported
  • Run as a systemd service (see Running as a Service)

macOS

  • NVML has limited support on macOS
  • Some GPU metrics may not be available
  • Consider using alternative monitoring tools for macOS
  • Run as a launchd service (see Running as a Service)

Configuration

The tool can be configured using a JSON configuration file. By default, it looks for config.json in the current directory.

Example configuration:

{
    "server": {
        "url": "https://example.com/api/metrics",
        "contract_number": "12345"
    },
    "collection": {
        "interval_seconds": 60,
        "aggregation_interval_hours": 1
    },
    "database": {
        "file": "gpu_metrics.db"
    },
    "logging": {
        "file": "gpu_monitor.log",
        "level": "INFO"
    }
}
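
A minimal sketch of loading such a file with fallback defaults is shown below; the helper function and the exact defaults are illustrative assumptions, not the tool's actual loader.

import json

# Defaults mirror the example above; the "server" section has no sensible default and must come from the file.
DEFAULTS = {
    "collection": {"interval_seconds": 60, "aggregation_interval_hours": 1},
    "database": {"file": "gpu_metrics.db"},
    "logging": {"file": "gpu_monitor.log", "level": "INFO"},
}

def load_config(path="config.json"):
    """Read the JSON config file and overlay it on the defaults, section by section."""
    try:
        with open(path, "r", encoding="utf-8") as fh:
            user_cfg = json.load(fh)
    except FileNotFoundError:
        user_cfg = {}
    cfg = {section: dict(values) for section, values in DEFAULTS.items()}
    for section, values in user_cfg.items():
        cfg.setdefault(section, {}).update(values)
    return cfg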

Usage

Basic Usage

gpu_monitor

Command Line Options

Option | Short | Description
--help | -h | Show help message and exit
--config CONFIG | -c | Path to configuration file
--offline | -o | Run in offline mode (don't send data to server)
--verbose | -v | Enable verbose logging
--log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL} | -l | Set logging level
--list-sends | -ls | List all send attempts
--search-send SEARCH_SEND | -ss | Search for a specific send attempt by time
--show-collection SHOW_COLLECTION | -sc | Show collection data for a specific time
--output-format {json,csv} | -f | Output format for collection data (default: json)

Usage Examples

Online Mode

  1. Start monitoring with default settings:
gpu_monitor
  2. View current GPU metrics in JSON format:
gpu_monitor --show-collection "" --output-format json

Example output:

[
  {
    "timestamp": "2024-03-20T14:30:00",
    "gpus": [
      {
        "uid": "GPU-2e87a766-ac74-ea75-1dbe-13eb97066bd5",
        "pci_bus_id": "00000000:25:00.0",
        "name": "NVIDIA GeForce RTX 4090",
        "temperature": 29,
        "memory_used": 361431040,
        "memory_total": 24146608128,
        "gpu_utilization": 0,
        "memory_utilization": 0,
        "power_usage": 12.985,
        "fan_speed": 30,
        "graphics_clock": 210,
        "memory_clock": 405
      },
      {
        "uid": "GPU-1a2b3c4d-5e6f-7g8h-9i0j-1k2l3m4n5o6p",
        "pci_bus_id": "00000000:26:00.0",
        "name": "NVIDIA H100",
        "temperature": 35,
        "memory_used": 4294967296,
        "memory_total": 85899345920,
        "gpu_utilization": 0,
        "memory_utilization": 0,
        "power_usage": 15.234,
        "fan_speed": 25,
        "graphics_clock": 180,
        "memory_clock": 350
      }
    ]
  }
]
  3. View aggregated hourly data in CSV format:
gpu_monitor --show-collection "2024-03-20T14:00:00" --output-format csv

Example output:

timestamp,uid,pci_bus_id,name,temperature,memory_used,memory_total,gpu_utilization,memory_utilization,power_usage,fan_speed,graphics_clock,memory_clock
2024-03-20T14:00:00,GPU-2e87a766-ac74-ea75-1dbe-13eb97066bd5,00000000:25:00.0,NVIDIA GeForce RTX 4090,29,361431040,24146608128,0,0,12.985,30,210,405
2024-03-20T14:00:00,GPU-1a2b3c4d-5e6f-7g8h-9i0j-1k2l3m4n5o6p,00000000:26:00.0,NVIDIA H100,35,4294967296,85899345920,0,0,15.234,25,180,350

Offline Mode

  1. Start monitoring in offline mode:
gpu_monitor --offline
  2. View send attempts history:
gpu_monitor --list-sends

Example output:

Send Attempts:
Aggregation Time | Attempts | First Attempt | Last Attempt | Last Error | UID | Sent
2024-03-20T13:00:00 | 3 | 2024-03-20T14:00:05 | 2024-03-20T14:00:15 | Connection timeout | abc123 | 0
2024-03-20T12:00:00 | 1 | 2024-03-20T13:00:05 | 2024-03-20T13:00:05 | None | def456 | 1
  3. Search for a specific send attempt:
gpu_monitor --search-send "2024-03-20T13:00:00"

Example output:

Send Attempt Details:
Aggregation Time: 2024-03-20T13:00:00
Attempts: 3
First Attempt: 2024-03-20T14:00:05
Last Attempt: 2024-03-20T14:00:15
Last Error: Connection timeout
UID: abc123
Sent: 0
  4. View raw collection data for a specific time:
gpu_monitor --show-collection "2024-03-20T13:30:00" --output-format json

Example output:

[
  {
    "timestamp": "2024-03-20T13:30:00",
    "gpus": [
      {
        "uid": "GPU-2e87a766-ac74-ea75-1dbe-13eb97066bd5",
        "pci_bus_id": "00000000:25:00.0",
        "name": "NVIDIA GeForce RTX 4090",
        "temperature": 29,
        "memory_used": 361431040,
        "memory_total": 24146608128,
        "gpu_utilization": 0,
        "memory_utilization": 0,
        "power_usage": 12.985,
        "fan_speed": 30,
        "graphics_clock": 210,
        "memory_clock": 405
      },
      {
        "uid": "GPU-1a2b3c4d-5e6f-7g8h-9i0j-1k2l3m4n5o6p",
        "pci_bus_id": "00000000:26:00.0",
        "name": "NVIDIA H100",
        "temperature": 35,
        "memory_used": 4294967296,
        "memory_total": 85899345920,
        "gpu_utilization": 0,
        "memory_utilization": 0,
        "power_usage": 15.234,
        "fan_speed": 25,
        "graphics_clock": 180,
        "memory_clock": 350
      }
    ]
  }
]

Running as a Service

Linux (systemd)

  1. Install the systemd service:
sudo cp scripts/gpu-monitor.service /etc/systemd/system/
sudo systemctl daemon-reload
  2. Start the service:
sudo systemctl start gpu-monitor
  3. Enable auto-start:
sudo systemctl enable gpu-monitor

Windows (NSSM)

  1. Download and install NSSM from nssm.cc
  2. Install the service:
nssm install GPU-Monitor "C:\Path\To\Python\python.exe" "C:\Path\To\gpu_monitor"
  3. Start the service:
nssm start GPU-Monitor

macOS (launchd)

  1. Create the service file:
cp scripts/com.targoman.gpu-monitor.plist ~/Library/LaunchAgents/
  2. Load the service:
launchctl load ~/Library/LaunchAgents/com.targoman.gpu-monitor.plist

Data Flow

  1. Collection:

    • Collects raw GPU metrics every minute
    • Stores raw data in SQLite database
    • Raw data is retained for 30 days
  2. Aggregation:

    • Aggregates raw data hourly
    • Calculates averages for all metrics
    • Stores aggregated data in SQLite database
    • Aggregated data is retained for 1 year
  3. Transmission:

    • Sends aggregated data to central server
    • Implements retry logic (max 10 attempts)
    • Records all send attempts
    • Verifies data integrity with server response
    • Handles offline mode gracefully
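
The hourly aggregation and retention policy described above can be pictured as a single SQLite rollup plus two deletes. The table and column names below are assumptions for illustration only; they are not the actual schema of gpu_metrics.db.

import sqlite3

def aggregate_and_prune(db_file="gpu_metrics.db"):
    """Roll the last hour of raw samples into hourly averages, then apply retention."""
    with sqlite3.connect(db_file) as conn:
        # Illustrative tables; created here only so the sketch runs standalone.
        conn.execute("""CREATE TABLE IF NOT EXISTS raw_metrics (
            timestamp TEXT, uid TEXT, temperature REAL, gpu_utilization REAL, power_usage REAL)""")
        conn.execute("""CREATE TABLE IF NOT EXISTS hourly_metrics (
            hour TEXT, uid TEXT, avg_temperature REAL, avg_gpu_utilization REAL, avg_power_usage REAL)""")
        # Average the last hour of raw samples into one row per GPU.
        conn.execute("""
            INSERT INTO hourly_metrics
            SELECT strftime('%Y-%m-%dT%H:00:00', timestamp), uid,
                   AVG(temperature), AVG(gpu_utilization), AVG(power_usage)
            FROM raw_metrics
            WHERE timestamp >= datetime('now', '-1 hour')
            GROUP BY strftime('%Y-%m-%dT%H:00:00', timestamp), uid
        """)
        # Retention: raw data is kept for 30 days, aggregated data for 1 year.
        conn.execute("DELETE FROM raw_metrics WHERE timestamp < datetime('now', '-30 days')")
        conn.execute("DELETE FROM hourly_metrics WHERE hour < datetime('now', '-1 year')")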

Error Handling

  • Collection Errors:

    • Logs errors but continues operation
    • Retries on next collection cycle
  • Aggregation Errors:

    • Logs errors but continues operation
    • Retries on next aggregation cycle
  • Transmission Errors:

    • Implements retry logic (max 10 attempts)
    • Records all attempts in database
    • Verifies data integrity
    • Handles offline mode
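
Roughly, the transmission retry behaviour above can be sketched as follows; the function name, payload shape, and endpoint handling are assumptions, not the project's API.

import json
import time
import urllib.request

MAX_ATTEMPTS = 10  # matches the documented retry limit

def send_with_retry(url, payload, delay_seconds=60):
    """Try to POST one aggregated record; give up after MAX_ATTEMPTS failed attempts."""
    body = json.dumps(payload).encode("utf-8")
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            req = urllib.request.Request(url, data=body,
                                         headers={"Content-Type": "application/json"})
            with urllib.request.urlopen(req, timeout=30) as resp:
                if resp.status == 200:
                    return True   # server acknowledged the record
        except Exception as exc:  # log the failure; the tool itself also records each attempt in its database
            print(f"attempt {attempt}/{MAX_ATTEMPTS} failed: {exc}")
        if attempt < MAX_ATTEMPTS:
            time.sleep(delay_seconds)
    return False                  # retries exhausted; record stays marked as unsent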

Logging

  • Logs are written to both file and console
  • Supports different log levels:
    • DEBUG: Detailed information for debugging
    • INFO: General operational information
    • WARNING: Warning messages for potential issues
    • ERROR: Error messages for serious problems
    • CRITICAL: Critical errors that may prevent operation
  • Includes timestamps and color-coded output
  • Control logging level via:
    • Command line: --log-level LEVEL or -l LEVEL
    • Config file: logging.level setting
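
A minimal sketch of the dual file-and-console logging described above, with the level taken from the config file or command line (the helper below is illustrative, not the tool's internal setup):

import logging

def setup_logging(log_file="gpu_monitor.log", level="INFO"):
    """Send log records to both a file and the console at the requested level."""
    fmt = logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    root = logging.getLogger()
    root.setLevel(getattr(logging, level.upper(), logging.INFO))
    for handler in (logging.FileHandler(log_file), logging.StreamHandler()):
        handler.setFormatter(fmt)
        root.addHandler(handler)
    # Color-coded console output (e.g. via colorama on Windows) is omitted here for brevity.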

Development

Setting Up Development Environment

  1. Clone the repository:
git clone https://github.com/targoman/gpu_monitor.git
cd gpu_monitor
  2. Install development dependencies:
pip install -e ".[dev]"

Running Tests

python -m unittest discover tests

Code Style

The project follows the PEP 8 style guide. To check code style:

flake8 src tests

License

This project is licensed under the GNU General Public License v3.0 - see the LICENSE file for details.

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Commit your changes
  4. Push to the branch
  5. Create a Pull Request

Acknowledgments

  • NVIDIA Management Library (NVML) for GPU monitoring capabilities
  • The open-source community for various tools and libraries used in this project
