Automatically archive time-series data from PostgreSQL to S3 as Parquet files. Perfect for IoT applications where you want to:
- Keep recent data in PostgreSQL for fast querying
- Archive older data to S3 in a cost-effective format
- Keep archived data organized in time-based partitions
- Enable analytical queries on historical data
# Using Docker
docker run --rm \
-e PG_CONN_STRING="postgresql://user:pass@host:5432/dbname" \
-e S3_BUCKET="my-archive-bucket" \
-e TABLE_NAMES="iot_data,iot_metrics" \
-e AWS_ACCESS_KEY_ID=xxx \
-e AWS_SECRET_ACCESS_KEY=xxx \
ghcr.io/johnonline35/pg-archiver:latest
# Or download a binary from releases:
# https://github.com/johnonline35/pg-archiver/releases
What it does:
- Connects to your PostgreSQL database
- Finds rows older than the retention period (default: 90 days) in the specified tables
- Converts them to Parquet format (optimized for analytical queries)
- Uploads them to S3 with time-based partitioning (year/month)
- Safely removes the archived rows from PostgreSQL (see the sketch after this list)
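Conceptually, each run boils down to SQL like the following. This is a simplified sketch, not the actual implementation (the tool is written in Go and streams rows out as Parquet); the 90-day cutoff and iot_data table are just the documented defaults, and the object key in the comment is illustrative:

-- 1. Rows past the retention cutoff are selected for archival...
SELECT id, timestamp, device_id, value
FROM iot_data
WHERE timestamp < now() - interval '90 days';

-- ...and written to a Parquet file under a time-based key, e.g.
--   s3://my-archive-bucket/year=2024/month=11/... (illustrative)

-- 2. Only after the upload succeeds are the same rows removed:
DELETE FROM iot_data
WHERE timestamp < now() - interval '90 days';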
Environment variables:
PG_CONN_STRING          PostgreSQL connection string (required)
S3_BUCKET               S3 bucket name (required)
AWS_ACCESS_KEY_ID       AWS access key (standard AWS SDK credential variable)
AWS_SECRET_ACCESS_KEY   AWS secret key (standard AWS SDK credential variable)
AWS_REGION              AWS region (default: us-east-1)
TABLE_NAMES             Comma-separated list of source table names (default: iot_data)
AWS_ENDPOINT_URL        Custom S3 endpoint, e.g. for testing against LocalStack
Expected table schema (for each table):
CREATE TABLE iot_data (
id BIGINT,
timestamp TIMESTAMP,
device_id TEXT,
value DOUBLE PRECISION
);
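For local testing it helps to have rows that are already past the default 90-day cutoff. A seeding snippet like this works (illustrative: the 120-day offset, row count, and sensor names are arbitrary):

-- Insert 1,000 readings, one minute apart, starting 120 days ago,
-- so they all fall outside the default 90-day retention window.
INSERT INTO iot_data (id, timestamp, device_id, value)
SELECT
    g AS id,
    now() - interval '120 days' + g * interval '1 minute' AS timestamp,
    'sensor' || (g % 5) AS device_id,
    random() * 100 AS value
FROM generate_series(1, 1000) AS g;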
Test locally using LocalStack:
# Start LocalStack
docker run -d \
--name test-s3 \
-p 4566:4566 \
-e SERVICES=s3 \
localstack/localstack:latest
# Create test bucket
aws --endpoint-url=http://localhost:4566 s3 mb s3://test-bucket
# Run archiver with LocalStack
# Note: "localhost" inside the container is the container itself, not your
# machine. Use host.docker.internal (Docker Desktop) or run with
# --network host on Linux so the archiver can reach LocalStack.
docker run --rm \
-e PG_CONN_STRING="postgresql://user:pass@host:5432/dbname" \
-e S3_BUCKET="test-bucket" \
-e TABLE_NAMES="iot_data,iot_metrics" \
-e AWS_ACCESS_KEY_ID=test \
-e AWS_SECRET_ACCESS_KEY=test \
-e AWS_ENDPOINT_URL="http://host.docker.internal:4566" \
ghcr.io/johnonline35/pg-archiver:latest
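To verify that the run actually wrote Parquet files, you can query LocalStack directly from DuckDB. The settings below are DuckDB's httpfs options, and the glob assumes the year/month layout described above:

-- Point DuckDB's httpfs extension at LocalStack
INSTALL httpfs;
LOAD httpfs;
SET s3_endpoint = 'localhost:4566';
SET s3_access_key_id = 'test';
SET s3_secret_access_key = 'test';
SET s3_use_ssl = false;
SET s3_url_style = 'path';

-- Count the rows that made it into the archive
SELECT count(*) AS archived_rows
FROM parquet_scan('s3://test-bucket/year=*/month=*/*.parquet');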
The archived Parquet files can be queried using tools like:
- DuckDB
- AWS Athena
- Apache Spark
- Any Parquet-compatible query engine
Example using DuckDB:
-- (reading s3:// paths requires the httpfs extension, as above)

-- Query one table's archived data
SELECT *
FROM parquet_scan('s3://my-archive-bucket/year=2024/month=11/*.parquet')
WHERE table_name = 'iot_data'
AND device_id = 'sensor1'
AND timestamp >= '2024-01-01';

-- Or query across several tables at once
SELECT *
FROM parquet_scan('s3://my-archive-bucket/year=2024/month=11/*.parquet')
WHERE table_name IN ('iot_data', 'iot_metrics')
AND timestamp >= '2024-01-01';
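Because the files are hive-partitioned, the year and month path segments can be exposed as regular columns (hive_partitioning is a standard parquet_scan option; the table_name column follows the examples above):

-- Monthly average per device, with partition columns exposed
SELECT year, month, device_id, avg(value) AS avg_value
FROM parquet_scan('s3://my-archive-bucket/year=*/month=*/*.parquet',
                  hive_partitioning = true)
WHERE table_name = 'iot_data'
GROUP BY year, month, device_id
ORDER BY year, month, device_id;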
Installation options:
- Docker:
  docker pull ghcr.io/johnonline35/pg-archiver:latest
- Pre-built binaries (from the releases page):
  - Linux (amd64): pg-archiver-linux-amd64
  - macOS (Intel): pg-archiver-darwin-amd64
  - macOS (Apple Silicon): pg-archiver-darwin-arm64
- Build from source:
  go install github.com/johnonline35/pg-archiver@latest
Contributions welcome! Some areas we'd love help with:
- Real-time archival using NOTIFY/LISTEN
- Support for more column types
- Additional output formats
- Monitoring and metrics
- Support for custom table schemas
License: MIT