ekrist1/tinyfinder

Simple Search Service

A lightweight, powerful full-text search service built with Rust, Tantivy, and SQLite. Think Elasticsearch/Solr but much simpler to deploy and operate.

Features

  • 🚀 Fast: Built on Tantivy, Rust's answer to Lucene
  • 💾 Simple Storage: Uses SQLite for metadata and Tantivy's built-in index storage
  • 🔌 RESTful API: Easy integration with any application
  • 🐳 Easy Deploy: Single binary or Docker container
  • 🔍 Full-Text Search: BM25 ranking, phrase queries, fuzzy matching
  • 🤖 Generative Answers: Mistral-powered, source-grounded responses (optional)
  • 🌍 Multi-language: Supports Norwegian, English, and more
  • 📊 Lightweight: Runs on 512MB RAM

Quick Start

Option 1: Docker (Recommended)

# Clone or extract the project
cd search-service

# Start with Docker Compose
docker-compose up -d

# Check health
curl http://localhost:3000/health

Docker Compose loads environment variables from .env (see env_file in docker-compose.yml).

Option 2: Build from Source

# Install Rust (if not already installed)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Build
cargo build --release

# Run
./target/release/simple-search-service

The service listens on http://localhost:3000 by default (set PORT to change the port).

API Documentation

Health Check

GET /health

Response:

{
  "status": "healthy",
  "service": "simple-search-service",
  "version": "0.1.0"
}

Create Index

POST /indices
Content-Type: application/json

{
  "name": "products",
  "fields": [
    {
      "name": "title",
      "field_type": "text",
      "stored": true,
      "indexed": true
    },
    {
      "name": "description",
      "field_type": "text",
      "stored": true,
      "indexed": true
    },
    {
      "name": "price",
      "field_type": "f64",
      "stored": true,
      "indexed": true
    }
  ]
}

Field types: text, string, i64, f64, date

To sort or aggregate on a field, set "fast": true on it when creating the index. Sorting requires it for every supported sort type (i64, f64, date).
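
As a minimal sketch of the rule above, the snippet below builds a create-index payload with a date field marked as fast. The starts_at field name is taken from the sorting examples in this README; the events index name and the field helper are hypothetical, and omitting "fast" for other fields simply mirrors the Create Index example rather than any documented default.

```javascript
// Hypothetical helper: builds one field definition for POST /indices.
// "fast" is only included when requested, mirroring the example above,
// which omits it for fields that are not sorted or aggregated on.
function field(name, fieldType, { fast = false } = {}) {
  const def = { name, field_type: fieldType, stored: true, indexed: true };
  if (fast) def.fast = true;
  return def;
}

// Hypothetical "events" index with a sortable date field.
const createIndexPayload = {
  name: 'events',
  fields: [
    field('title', 'text'),
    field('starts_at', 'date', { fast: true }),
  ],
};

console.log(JSON.stringify(createIndexPayload, null, 2));
```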

List Indices

GET /indices

Response:

{
  "success": true,
  "data": [
    {
      "name": "products",
      "document_count": 1250,
      "created_at": "2025-01-16T10:30:00Z"
    }
  ]
}

Add Documents

POST /indices/products/documents
Content-Type: application/json

{
  "documents": [
    {
      "id": "prod_001",
      "fields": {
        "title": "Smil Barnehage Bergen",
        "description": "Modern barnehage i Bergen sentrum med fokus på læring gjennom lek",
        "price": 15000.0
      }
    },
    {
      "id": "prod_002",
      "fields": {
        "title": "Lekeland Barnehage",
        "description": "Familievennlig barnehage med store uteområder",
        "price": 12500.0
      }
    }
  ]
}

Search

POST /indices/products/search
Content-Type: application/json

{
  "query": "barnehage bergen",
  "limit": 10,
  "fields": ["title", "description"],
  "boost": {
    "title": 2.0
  },
  "fuzzy": true,
  "sort": {
    "field": "starts_at",
    "order": "desc"
  }
}

Response:

{
  "success": true,
  "data": {
    "took_ms": 2.4,
    "total": 2,
    "hits": [
      {
        "id": "prod_001",
        "score": 8.42,
        "fields": {
          "id": "prod_001",
          "title": "Smil Barnehage Bergen",
          "description": "Modern barnehage i Bergen sentrum...",
          "price": 15000.0
        }
      }
    ]
  }
}
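
A small sketch of reading the response envelope shown above. It assumes every successful response has the success/data shape from the example; the shape of an error response is not documented here, so the error field read below is a guess, and extractHits is a hypothetical helper name.

```javascript
// Extracts id, score, and title from a search response shaped like the
// example above. Throws when "success" is not true.
function extractHits(response) {
  if (!response || response.success !== true) {
    // Assumption: the error payload shape is not documented in this README.
    throw new Error(`search failed: ${JSON.stringify(response && response.error)}`);
  }
  return response.data.hits.map((hit) => ({
    id: hit.id,
    score: hit.score,
    title: hit.fields.title,
  }));
}

// Sample data copied from the response example above.
const sampleResponse = {
  success: true,
  data: {
    took_ms: 2.4,
    total: 2,
    hits: [
      {
        id: 'prod_001',
        score: 8.42,
        fields: { id: 'prod_001', title: 'Smil Barnehage Bergen', price: 15000.0 },
      },
    ],
  },
};

console.log(extractHits(sampleResponse));
```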

Partial and fuzzy matching

  • Append an asterisk to any term (for example, "query": "eventyr*") to perform a prefix search that matches tokens beginning with that fragment.
  • Set "fuzzy": true in the search payload to tolerate a single-character typo (insertions, deletions, substitutions, or transpositions), which helps catch misspellings like evntyr.
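
The two modes above can be sketched as a small payload builder. buildMatchPayload is a hypothetical helper, not part of the service; it only assembles the JSON body for POST /indices/:name/search.

```javascript
// Builds a search payload for the two matching modes described above.
// "prefix" appends an asterisk to the end of the query, which turns the
// last term into a prefix search; "fuzzy" sets fuzzy: true instead.
function buildMatchPayload(text, mode, limit = 10) {
  const query = text.trim();
  if (query === '') throw new Error('query must not be empty');
  if (mode === 'prefix') {
    return { query: query.endsWith('*') ? query : `${query}*`, limit };
  }
  if (mode === 'fuzzy') {
    return { query, limit, fuzzy: true };
  }
  throw new Error(`unknown mode: ${mode}`);
}

console.log(buildMatchPayload('eventyr', 'prefix'));
console.log(buildMatchPayload('evntyr', 'fuzzy'));
```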

Sorting by date

To sort by a date field, define the field as "field_type": "date" and set "fast": true when creating the index. Then pass the sort object in the search request:

{
  "query": "barnehage",
  "limit": 10,
  "sort": {
    "field": "starts_at",
    "order": "asc"
  }
}

Supported sort field types: i64, f64, date (must be fast: true).
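
To catch sort mistakes before sending a request, a client could check the field definition it used when creating the index. This is a hedged client-side sketch: withSort is a hypothetical helper, and how the service itself reports an unsortable field is not covered here.

```javascript
const SORTABLE_TYPES = new Set(['i64', 'f64', 'date']);

// Returns a copy of the search payload with a sort clause added, after
// checking the field definition (as sent to POST /indices) against the
// rules above: supported type, and "fast": true.
function withSort(payload, fields, fieldName, order = 'asc') {
  const def = fields.find((f) => f.name === fieldName);
  if (!def) throw new Error(`unknown field: ${fieldName}`);
  if (!SORTABLE_TYPES.has(def.field_type)) {
    throw new Error(`${fieldName} has unsortable type ${def.field_type}`);
  }
  if (def.fast !== true) {
    throw new Error(`${fieldName} must be created with "fast": true`);
  }
  if (order !== 'asc' && order !== 'desc') {
    throw new Error(`invalid order: ${order}`);
  }
  return { ...payload, sort: { field: fieldName, order } };
}

const schemaFields = [
  { name: 'title', field_type: 'text', stored: true, indexed: true },
  { name: 'starts_at', field_type: 'date', stored: true, indexed: true, fast: true },
];

console.log(withSort({ query: 'barnehage', limit: 10 }, schemaFields, 'starts_at', 'asc'));
```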

Generative Answers (Mistral)

This endpoint runs a search, then asks Mistral to summarize the top hits into a grounded answer. If stream is true (default), the response is an SSE stream.

POST /indices/products/answer
Content-Type: application/json

{
  "query": "hvor er familievennlig barnehage",
  "search_limit": 5,
  "fields": ["title", "description", "location"],
  "fuzzy": true,
  "stream": false,
  "temperature": 0.2
}

Response (non-streaming):

{
  "success": true,
  "data": {
    "answer": "...",
    "model": "mistral-large-latest",
    "search_took_ms": 3.1,
    "llm_took_ms": 412.7,
    "total_took_ms": 418.5,
    "sources": [
      {
        "id": "kg_001",
        "score": 8.42,
        "fields": {
          "title": "Lekeland Barnehage",
          "description": "Familievennlig barnehage ..."
        }
      }
    ]
  }
}

Streaming (SSE) example:

curl -N http://localhost:3000/indices/kindergartens/answer \
  -H "Content-Type: application/json" \
  -d '{"query":"hvor er familievennlig barnehage","stream":true}'

The stream emits:

  • event: meta with JSON containing model, search_took_ms, and sources
  • data: chunks with partial answer text
  • event: done when finished
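
The sketch below parses such a stream on the client. It follows the general Server-Sent Events framing (events separated by a blank line, one optional space after the colon) and assumes the meta payload is JSON and unnamed data chunks are raw text, as described in the list above. The sample stream is made up for illustration, including the empty data line on the done event; the exact bytes the service emits may differ.

```javascript
// Parses a complete SSE text body into { event, data } records.
// An event without an "event:" line gets the SSE default name "message".
function parseSse(text) {
  const events = [];
  for (const block of text.split(/\r?\n\r?\n/)) {
    let event = 'message';
    const data = [];
    let seen = false;
    for (const line of block.split(/\r?\n/)) {
      if (line === '' || line.startsWith(':')) continue; // blank or comment
      const idx = line.indexOf(':');
      const name = idx === -1 ? line : line.slice(0, idx);
      let value = idx === -1 ? '' : line.slice(idx + 1);
      if (value.startsWith(' ')) value = value.slice(1); // strip one space
      if (name === 'event') { event = value; seen = true; }
      else if (name === 'data') { data.push(value); seen = true; }
    }
    if (seen) events.push({ event, data: data.join('\n') });
  }
  return events;
}

// Folds parsed events into the metadata, the full answer text, and a
// flag telling whether the done event was seen.
function collectAnswer(events) {
  let meta = null;
  let answer = '';
  let done = false;
  for (const e of events) {
    if (e.event === 'meta') meta = JSON.parse(e.data);
    else if (e.event === 'done') done = true;
    else answer += e.data;
  }
  return { meta, answer, done };
}

// Made-up sample stream for illustration only.
const sampleStream = [
  'event: meta',
  'data: {"model":"mistral-large-latest","search_took_ms":3.1,"sources":[]}',
  '',
  'data: Lekeland',
  '',
  'data:  Barnehage',
  '',
  'event: done',
  'data: ',
  '',
  '',
].join('\n');

console.log(collectAnswer(parseSse(sampleStream)));
```

For a live connection, the same parser can be fed from a buffer that accumulates response chunks and is split at each blank line.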

Delete Document

DELETE /indices/products/documents/prod_001

Delete Index

DELETE /indices/products

Bulk Operations

POST /indices/products/bulk
Content-Type: application/json

{
  "operations": [
    {
      "operation": "index",
      "document": {
        "id": "prod_003",
        "fields": {
          "title": "New Product",
          "description": "Description here"
        }
      }
    },
    {
      "operation": "delete",
      "id": "prod_001"
    }
  ]
}
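
A bulk payload like the one above can be assembled from plain lists. Both helpers below are hypothetical; in particular, this README does not state a maximum bulk size, so the batch size passed to chunk is an arbitrary choice.

```javascript
// Builds the body for POST /indices/:name/bulk from documents to index
// and document ids to delete, using the operation shapes shown above.
function buildBulkPayload(documents, deleteIds) {
  return {
    operations: [
      ...documents.map((document) => ({ operation: 'index', document })),
      ...deleteIds.map((id) => ({ operation: 'delete', id })),
    ],
  };
}

// Splits a list into batches of at most "size" items, so that a large
// import can be sent as several bulk requests.
function chunk(items, size) {
  if (!Number.isInteger(size) || size < 1) {
    throw new Error('size must be a positive integer');
  }
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

const bulkPayload = buildBulkPayload(
  [{ id: 'prod_003', fields: { title: 'New Product', description: 'Description here' } }],
  ['prod_001'],
);
console.log(JSON.stringify(bulkPayload, null, 2));
```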

Integration Examples

Laravel/PHP

use Illuminate\Support\Facades\Http;

// Create index
$response = Http::post('http://localhost:3000/indices', [
    'name' => 'kindergartens',
    'fields' => [
        ['name' => 'title', 'field_type' => 'text', 'stored' => true, 'indexed' => true],
        ['name' => 'description', 'field_type' => 'text', 'stored' => true, 'indexed' => true],
    ]
]);

// Add documents
$response = Http::post('http://localhost:3000/indices/kindergartens/documents', [
    'documents' => [
        [
            'id' => 'kg_001',
            'fields' => [
                'title' => 'Smil Barnehage',
                'description' => 'En flott barnehage i Bergen',
            ]
        ]
    ]
]);

// Search
$response = Http::post('http://localhost:3000/indices/kindergartens/search', [
    'query' => 'barnehage bergen',
    'limit' => 10
]);

$results = $response->json()['data'];

Exact match filter

curl -X POST http://localhost:3000/indices/myindex/search \
  -H 'Content-Type: application/json' \
  -d '{ "query": "collection_handle:my-collection", "limit": 10 }' | jq '.'

Combine with search terms

curl -X POST http://localhost:3000/indices/myindex/search \
  -H 'Content-Type: application/json' \
  -d '{ "query": "tariff AND collection_handle:my-collection", "limit": 10, "fuzzy": true }' | jq '.'

Multiple collections (using OR)

curl -X POST http://localhost:3000/indices/myindex/search \
  -H 'Content-Type: application/json' \
  -d '{ "query": "tariff AND (collection_handle:collection-a OR collection_handle:collection-b)", "limit": 10 }' | jq '.'

Multiple collections (using IN syntax - more efficient)

curl -X POST http://localhost:3000/indices/myindex/search \
  -H 'Content-Type: application/json' \
  -d '{ "query": "tariff AND collection_handle:IN[collection-a,collection-b,collection-c]", "limit": 10 }' | jq '.'
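
The filter strings used in the curl examples above can be generated rather than hand-written. This is a sketch under stated assumptions: inFilter and filteredQuery are hypothetical helpers, and because escaping rules for filter values are not documented here, values containing whitespace, commas, brackets, or quotes are rejected instead of escaped.

```javascript
// Builds "field:value" for one value, or "field:IN[a,b,c]" for several.
function inFilter(fieldName, values) {
  if (values.length === 0) throw new Error('at least one value is required');
  for (const value of values) {
    // Assumption: escaping is undocumented, so refuse risky characters.
    if (/[\s,\[\]"]/.test(value)) {
      throw new Error(`unsupported character in value: ${value}`);
    }
  }
  return values.length === 1
    ? `${fieldName}:${values[0]}`
    : `${fieldName}:IN[${values.join(',')}]`;
}

// Combines optional search terms with a filter using AND.
function filteredQuery(terms, filter) {
  return terms ? `${terms} AND ${filter}` : filter;
}

const collectionFilter = inFilter('collection_handle', ['collection-a', 'collection-b', 'collection-c']);
console.log(filteredQuery('tariff', collectionFilter));
```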

JavaScript/Node.js

// Add documents
const response = await fetch('http://localhost:3000/indices/products/documents', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    documents: [
      {
        id: 'prod_001',
        fields: {
          title: 'Product Name',
          description: 'Product description'
        }
      }
    ]
  })
});

// Search
const searchResponse = await fetch('http://localhost:3000/indices/products/search', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    query: 'search term',
    limit: 10
  })
});

const results = await searchResponse.json();

Configuration

Environment variables:

  • DATA_DIR: Data directory path (default: ./data)
  • PORT: Server port (default: 3000)
  • RUST_LOG: Log level (default: info, options: trace, debug, info, warn, error)
  • MISTRAL_API_KEY: API key for Mistral (enables /indices/:name/answer)
  • MISTRAL_MODEL: Mistral model name (default: mistral-large-latest)
  • MISTRAL_BASE_URL: Base URL for Mistral-compatible API (default: https://api.mistral.ai/v1)

.env is loaded automatically at startup (if present in the project root).
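
For tooling that needs to agree with the service on these settings (for example a deploy script), the documented defaults can be resolved as follows. This is an illustrative sketch, not the service's own configuration code; loadConfig and its property names are hypothetical.

```javascript
// Resolves the environment variables listed above, applying the
// documented defaults. Pass process.env or any plain object.
function loadConfig(env) {
  const port = Number(env.PORT ?? 3000);
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`invalid PORT: ${env.PORT}`);
  }
  return {
    dataDir: env.DATA_DIR ?? './data',
    port,
    logLevel: env.RUST_LOG ?? 'info',
    mistralModel: env.MISTRAL_MODEL ?? 'mistral-large-latest',
    mistralBaseUrl: env.MISTRAL_BASE_URL ?? 'https://api.mistral.ai/v1',
    // The answer endpoint is enabled only when an API key is present.
    answersEnabled: Boolean(env.MISTRAL_API_KEY),
  };
}

console.log(loadConfig({ PORT: '8080', DATA_DIR: '/var/lib/search-service' }));
```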

Performance Tips

  1. Bulk Operations: Use bulk endpoints for adding multiple documents
  2. Field Selection: Only store fields you need to display in results
  3. Index Size: Expect index size to be 10-20% of original text
  4. Memory: Allocate ~50MB per active index + buffer
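
The rules of thumb in tips 3 and 4 translate into simple arithmetic. The sketch below is only as accurate as those estimates; the size of the "buffer" is not specified above, so it is left as a parameter, and estimateResources is a hypothetical helper.

```javascript
// Rough sizing from the tips above: index size is 10-20% of the original
// text, and memory is about 50MB per active index plus a buffer.
function estimateResources(textBytes, activeIndices, bufferMb = 0) {
  return {
    indexBytesMin: Math.round(textBytes * 0.1),
    indexBytesMax: Math.round(textBytes * 0.2),
    memoryMb: activeIndices * 50 + bufferMb,
  };
}

// Example: 1 GB of text across 4 active indices, with a 100MB buffer.
console.log(estimateResources(1e9, 4, 100));
```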

Production Deployment

Systemd Service

Create /etc/systemd/system/search-service.service:

[Unit]
Description=Simple Search Service
After=network.target

[Service]
Type=simple
User=search
WorkingDirectory=/opt/search-service
Environment="DATA_DIR=/var/lib/search-service"
Environment="PORT=3000"
ExecStart=/opt/search-service/simple-search-service
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

Then reload systemd, enable the service, and start it:

sudo systemctl daemon-reload
sudo systemctl enable search-service
sudo systemctl start search-service

Nginx Reverse Proxy

server {
    listen 80;
    server_name search.yourdomain.com;

    location / {
        proxy_pass http://localhost:3000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}

Monitoring

The service exposes a /health endpoint for health checks:

# Docker health check
HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
  CMD curl -f http://localhost:3000/health || exit 1
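
Outside Docker, the same check can be done from a script, for example to wait for the service after a deploy. In this sketch the fetch function is injected so the polling logic can be exercised without a running server; the retry count and delay are arbitrary, and both helper names are hypothetical.

```javascript
// Returns true when a /health response body reports a healthy service,
// based on the "status": "healthy" field shown in the Health Check example.
function isHealthy(body) {
  return Boolean(body) && body.status === 'healthy';
}

// Polls until the service is healthy or the attempts run out.
// fetchJson is any async function that returns the parsed /health body.
async function waitForHealthy(fetchJson, { attempts = 5, delayMs = 1000 } = {}) {
  for (let i = 0; i < attempts; i += 1) {
    try {
      if (isHealthy(await fetchJson())) return true;
    } catch (err) {
      // Treat network errors as "not ready yet" and retry.
    }
    if (i < attempts - 1) {
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  return false;
}

// Usage against a local instance (requires a runtime with global fetch):
// waitForHealthy(() => fetch('http://localhost:3000/health').then((r) => r.json()))
//   .then((ok) => console.log(ok ? 'healthy' : 'unavailable'));
```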

Backup

The data directory contains:

  • metadata.db: SQLite database with metadata
  • indices/: Directory with Tantivy index files

To back up the service, archive the entire data directory:

# Backup
tar -czf search-backup-$(date +%Y%m%d).tar.gz data/

# Restore
tar -xzf search-backup-20250116.tar.gz

Use Cases

  • E-commerce: Product search with faceted filtering
  • Documentation: Technical documentation search
  • CRM: Customer and contact search
  • Content Management: Article and page search
  • Internal Tools: Log search, ticket search

Comparison with Elasticsearch

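| Feature    | Simple Search Service    | Elasticsearch          |
|------------|--------------------------|------------------------|
| Memory     | ~512MB                   | ~2GB minimum           |
| Deployment | Single binary            | JVM + cluster          |
| Setup Time | < 1 minute               | 15-30 minutes          |
| Cluster    | No                       | Yes                    |
| Scaling    | Vertical                 | Horizontal             |
| Best For   | Single server, <10M docs | Distributed, >10M docs |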

License

MIT License - feel free to use in commercial projects

Support

For issues or questions, please open an issue on the GitHub repository.
