Librarian

Librarian is a modern, cloud-native alternative to Kafka Connect. It uses native data replication technologies (such as Postgres replication and MongoDB change streams) to archive data efficiently.

Think of Librarian as a Kafka Connect CDC source for the modern data world.

What makes Librarian modern?

Librarian is distributed as a single binary with no external dependencies.
Librarian includes data integrity checks to ensure correctness.
Librarian includes data-oriented observability out of the box, including latency and completeness.
Librarian is runnable as a daemon or as a batch process.

What makes Librarian cloud-native?

Librarian natively includes modern telemetry, including metrics and tracing.
Librarian is performant and efficient, deployable on modest hardware instances.

Features and Roadmap.

Data Snapshots

Snapshots capture the state of data at a specific point in time. They are simple to create and maintain because each snapshot is an isolated copy. This reduces complexity compared to incremental or differential backups.

Snapshots are operationally forgiving. If a snapshot fails or becomes corrupted, previous snapshots remain intact. Each snapshot is independent, allowing easy recovery without affecting other snapshots. This independence simplifies debugging and rollback processes.

By using snapshots, you ensure data durability and maintain a reliable history of changes.

Librarian supports snapshots of Postgres tables and archives each snapshot as Parquet, either locally or remotely in S3.

The following image describes Librarian snapshots:

Librarian issues a SELECT query to Postgres.
Librarian "preserves" (serializes/encodes) the data as Parquet.
Librarian saves the snapshot data to local disk or S3.
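The three steps above can be sketched as a small read, encode, write pipeline. This is only an illustration under stated assumptions, not Librarian's actual code: the `Source` interface, `memorySource`, `snapshot`, and `runExample` names are hypothetical, the rows are made-up sample values, CSV from the Go standard library stands in for Parquet, and an in-memory buffer stands in for local disk or S3.

```go
package main

import (
	"bytes"
	"encoding/csv"
	"fmt"
	"io"
)

// Row is one record read from the source table.
type Row []string

// Source yields rows. In Librarian this role is played by a SELECT
// against Postgres; the interface itself is hypothetical.
type Source interface {
	Rows() ([]Row, error)
}

// memorySource is an in-memory stand-in for the Postgres query.
type memorySource struct{ rows []Row }

func (m memorySource) Rows() ([]Row, error) { return m.rows, nil }

// snapshot reads every row from src, encodes it, and writes it to dst.
// It returns the number of rows read and the number written so the
// caller can compare the two counts.
func snapshot(src Source, dst io.Writer) (int, int, error) {
	rows, err := src.Rows()
	if err != nil {
		return 0, 0, err
	}
	w := csv.NewWriter(dst) // CSV is a stand-in; Librarian encodes to Parquet
	written := 0
	for _, r := range rows {
		if err := w.Write(r); err != nil {
			return len(rows), written, err
		}
		written++
	}
	w.Flush()
	return len(rows), written, w.Error()
}

// runExample wires the stand-ins together with made-up sample rows.
func runExample() (int, int, string, error) {
	src := memorySource{rows: []Row{
		{"1", "Main St", "250000"},
		{"2", "Oak Ave", "310000"},
	}}
	var out bytes.Buffer // stand-in for a local file or an S3 object
	read, written, err := snapshot(src, &out)
	return read, written, out.String(), err
}

func main() {
	read, written, encoded, err := runExample()
	if err != nil {
		fmt.Println("snapshot failed:", err)
		return
	}
	fmt.Printf("read=%d written=%d\n%s", read, written, encoded)
}
```

Returning both counts mirrors the idea of comparing source records against processed records after a snapshot completes.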

Quickstart

The easiest way to get started with Librarian is to clone this repo.

Tutorial: Generate a Postgres Parquet Snapshot Locally

This tutorial performs a local snapshot of a sample Postgres dataset. The end result is a valid local Parquet file that contains a snapshot of the Postgres table.

Start Postgres Locally

docker-compose -f dev/compose.yml up -d

Use Librarian to snapshot the Postgres property sales test dataset and save it locally

time go run cmd/librarian/main.go archiver snapshot -c dev/examples/property-sales.snapshot.yml

Check Librarian's stdout for the location of the Parquet snapshot file

Query the Parquet file using DuckDB :)

Inspect the snapshot catalog

cat data/property_sales/7900e2f0-b75a-11ef-8e40-9e78fe1d02fa/catalog.json | jq .

{
  "id": "7900e2f0-b75a-11ef-8e40-9e78fe1d02fa",
  "start_time": "2024-12-11T00:54:27.725337Z",
  "end_time": "2024-12-11T00:54:32.492852Z",
  "source": "public.property_sales",
  "num_source_records": 1097629,
  "num_records_processed": 1097629,
  "success": true
}
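The catalog can also be checked programmatically. The sketch below, which is not part of Librarian, decodes the catalog shown above and treats a snapshot as complete when `success` is true and `num_source_records` equals `num_records_processed`. The `Catalog` type and the `parseCatalog`, `Complete`, and `Duration` names are assumptions made for this example; only the JSON field names come from the output above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Catalog mirrors the fields of the catalog.json example above.
// The Go type itself is hypothetical, not Librarian's own definition.
type Catalog struct {
	ID                  string    `json:"id"`
	StartTime           time.Time `json:"start_time"`
	EndTime             time.Time `json:"end_time"`
	Source              string    `json:"source"`
	NumSourceRecords    int64     `json:"num_source_records"`
	NumRecordsProcessed int64     `json:"num_records_processed"`
	Success             bool      `json:"success"`
}

// parseCatalog decodes the raw bytes of a catalog.json file.
func parseCatalog(raw []byte) (Catalog, error) {
	var c Catalog
	err := json.Unmarshal(raw, &c)
	return c, err
}

// Complete reports whether the snapshot succeeded and every source
// record was processed.
func (c Catalog) Complete() bool {
	return c.Success && c.NumSourceRecords == c.NumRecordsProcessed
}

// Duration is the wall-clock time between the start and end timestamps.
func (c Catalog) Duration() time.Duration {
	return c.EndTime.Sub(c.StartTime)
}

// sampleCatalog is the catalog printed in the tutorial above.
const sampleCatalog = `{
  "id": "7900e2f0-b75a-11ef-8e40-9e78fe1d02fa",
  "start_time": "2024-12-11T00:54:27.725337Z",
  "end_time": "2024-12-11T00:54:32.492852Z",
  "source": "public.property_sales",
  "num_source_records": 1097629,
  "num_records_processed": 1097629,
  "success": true
}`

func main() {
	c, err := parseCatalog([]byte(sampleCatalog))
	if err != nil {
		fmt.Println("could not parse catalog:", err)
		return
	}
	fmt.Printf("source=%s complete=%t duration=%s\n",
		c.Source, c.Complete(), c.Duration())
}
```

In practice the bytes would come from reading the `catalog.json` file rather than from an embedded constant.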