Name	Name	Last commit message	Last commit date
parent directory ..
climate	climate
edgar	edgar
framework	framework
openalex	openalex
Cargo.lock	Cargo.lock
Cargo.toml	Cargo.toml
README.md	README.md

RuVector Dataset Discovery Framework

Find hidden patterns and connections in massive datasets that traditional tools miss.

RuVector turns your data—research papers, climate records, financial filings—into a connected graph, then uses cutting-edge algorithms to spot emerging trends, cross-domain relationships, and regime shifts before they become obvious.

Why RuVector?

Most data analysis tools excel at answering questions you already know to ask. RuVector is different: it helps you discover what you don't know you're looking for.

Real-world examples:

🔬 Research: Spot a new field forming 6-12 months before it gets a name, by detecting when papers start citing across traditional boundaries
🌍 Climate: Detect regime shifts in weather patterns that correlate with economic disruptions
💰 Finance: Find companies whose narratives are diverging from their peers—often an early warning signal

Features

Feature	What It Does	Why It Matters
Vector Memory	Stores data as 384-1536 dim embeddings	Similar concepts cluster together automatically
HNSW Index	O(log n) approximate nearest neighbor search	10-50x faster than brute force for large datasets
Graph Structure	Connects related items with weighted edges	Reveals hidden relationships in your data
Min-Cut Analysis	Measures how "connected" your network is	Detects regime changes and fragmentation
Cross-Domain Detection	Finds bridges between different fields	Discovers unexpected correlations (e.g., climate → finance)
ONNX Embeddings	Neural semantic embeddings (MiniLM, BGE, etc.)	Production-quality text understanding
Causality Testing	Checks if changes in X predict changes in Y	Moves beyond correlation to actionable insights
Statistical Rigor	Reports p-values and effect sizes	Know which findings are real vs. noise

What's New in v0.3.0

HNSW Integration: O(n log n) similarity search replaces O(n²) brute force
Similarity Cache: 2-3x speedup for repeated similarity queries
Batch ONNX Embeddings: Chunked processing with progress callbacks
Shared Utils Module: cosine_similarity, euclidean_distance, normalize_vector
Auto-connect by Embeddings: CoherenceEngine creates edges from vector similarity

Performance

⚡ 10-50x faster similarity search (HNSW vs brute force)
⚡ 8.8x faster batch vector insertion (parallel processing)
⚡ 2.9x faster similarity computation (SIMD acceleration)
⚡ 2-3x faster repeated queries (similarity cache)
📊 Works with millions of records on standard hardware

Quick Start

Prerequisites

# Ensure you're in the ruvector workspace
cd /workspaces/ruvector

Run Your First Example

# 1. Performance benchmark - see the speed improvements
cargo run --example optimized_benchmark -p ruvector-data-framework --features parallel --release

# 2. Discovery hunter - find patterns in sample data
cargo run --example discovery_hunter -p ruvector-data-framework --features parallel --release

# 3. Cross-domain analysis - detect bridges between fields
cargo run --example cross_domain_discovery -p ruvector-data-framework --release

Domain-Specific Examples

# Climate: Detect weather regime shifts
cargo run --example regime_detector -p ruvector-data-climate

# Finance: Monitor corporate filing coherence
cargo run --example coherence_watch -p ruvector-data-edgar

What You'll See

🔍 Discovery Results:
   Pattern: Climate ↔ Finance bridge detected
   Strength: 0.73 (strong connection)
   P-value: 0.031 (statistically significant)

   → Drought indices may predict utility sector
     performance with a 3-period lag

The Discovery Thesis

RuVector's unique combination of vector memory, graph structures, and dynamic minimum cut algorithms enables discoveries that most analysis tools miss:

Emerging patterns before they have names: Detect topic splits and merges as cut boundaries shift over time
Non-obvious cross-domain bridges: Find small "connector" subgraphs where disciplines quietly start citing each other
Causal leverage maps: Link funders, labs, venues, and downstream citations to spot high-impact intervention points
Regime shifts in time series: Use coherence breaks to flag fundamental changes in system behavior

Tutorial

1. Creating the Engine

use ruvector_data_framework::optimized::{
    OptimizedDiscoveryEngine, OptimizedConfig,
};
use ruvector_data_framework::ruvector_native::{
    Domain, SemanticVector,
};

let config = OptimizedConfig {
    similarity_threshold: 0.55,   // Minimum cosine similarity
    mincut_sensitivity: 0.10,     // Coherence change threshold
    cross_domain: true,           // Enable cross-domain discovery
    use_simd: true,               // SIMD acceleration
    significance_threshold: 0.05, // P-value threshold
    causality_lookback: 12,       // Temporal lookback periods
    ..Default::default()
};

let mut engine = OptimizedDiscoveryEngine::new(config);

2. Adding Data

use std::collections::HashMap;
use chrono::Utc;

// Single vector
let vector = SemanticVector {
    id: "climate_drought_2024".to_string(),
    embedding: generate_embedding(), // 128-dim vector
    domain: Domain::Climate,
    timestamp: Utc::now(),
    metadata: HashMap::from([
        ("region".to_string(), "sahel".to_string()),
        ("severity".to_string(), "extreme".to_string()),
    ]),
};
let node_id = engine.add_vector(vector);

// Batch insertion (8.8x faster)
#[cfg(feature = "parallel")]
{
    let vectors: Vec<SemanticVector> = load_vectors();
    let node_ids = engine.add_vectors_batch(vectors);
}

3. Computing Coherence

let snapshot = engine.compute_coherence();

println!("Min-cut value: {:.3}", snapshot.mincut_value);
println!("Partition sizes: {:?}", snapshot.partition_sizes);
println!("Boundary nodes: {:?}", snapshot.boundary_nodes);

Interpretation:

Min-cut Trend	Meaning
Rising	Network consolidating, stronger connections
Falling	Fragmentation, potential regime change
Stable	Steady state, consistent structure

4. Pattern Detection

let patterns = engine.detect_patterns_with_significance();

for pattern in patterns.iter().filter(|p| p.is_significant) {
    println!("{}", pattern.pattern.description);
    println!("  P-value: {:.4}", pattern.p_value);
    println!("  Effect size: {:.3}", pattern.effect_size);
}

Pattern Types:

Type	Description	Example
`CoherenceBreak`	Min-cut dropped significantly	Network fragmentation crisis
`Consolidation`	Min-cut increased	Market convergence
`BridgeFormation`	Cross-domain connections	Climate-finance link
`Cascade`	Temporal causality	Climate → Finance lag-3
`EmergingCluster`	New dense subgraph	Research topic emerging

5. Cross-Domain Analysis

// Check coupling strength
let stats = engine.stats();
let coupling = stats.cross_domain_edges as f64 / stats.total_edges as f64;
println!("Cross-domain coupling: {:.1}%", coupling * 100.0);

// Domain coherence scores
for domain in [Domain::Climate, Domain::Finance, Domain::Research] {
    if let Some(coh) = engine.domain_coherence(domain) {
        println!("{:?}: {:.3}", domain, coh);
    }
}

Performance Benchmarks

Operation	Baseline	Optimized	Speedup
Vector Insertion	133ms	15ms	8.84x
SIMD Cosine	432ms	148ms	2.91x
Pattern Detection	524ms	655ms	-

Datasets

1. OpenAlex (Research Intelligence)

Best for: Emerging field detection, cross-discipline bridges

250M+ works, 90M+ authors
Native graph structure
Bulk download + API access

use ruvector_data_openalex::{OpenAlexConfig, FrontierRadar};

let radar = FrontierRadar::new(OpenAlexConfig::default());
let frontiers = radar.detect_emerging_topics(papers);

2. NOAA + NASA (Climate Intelligence)

Best for: Regime shift detection, anomaly prediction

Weather observations, satellite imagery
Time series → graph transformation
Economic risk modeling

use ruvector_data_climate::{ClimateConfig, RegimeDetector};

let detector = RegimeDetector::new(config);
let shifts = detector.detect_shifts();

3. SEC EDGAR (Financial Intelligence)

Best for: Corporate risk signals, peer divergence

XBRL financial statements
10-K/10-Q filings
Narrative + fundamental analysis

use ruvector_data_edgar::{EdgarConfig, CoherenceMonitor};

let monitor = CoherenceMonitor::new(config);
let alerts = monitor.analyze_filing(filing);

Directory Structure

examples/data/
├── README.md                 # This file
├── Cargo.toml               # Workspace manifest
├── framework/               # Core discovery framework
│   ├── src/
│   │   ├── lib.rs              # Framework exports
│   │   ├── ruvector_native.rs  # Native engine with Stoer-Wagner
│   │   ├── optimized.rs        # SIMD + parallel optimizations
│   │   ├── coherence.rs        # Coherence signal computation
│   │   ├── discovery.rs        # Pattern detection
│   │   └── ingester.rs         # Data ingestion
│   └── examples/
│       ├── cross_domain_discovery.rs  # Cross-domain patterns
│       ├── optimized_benchmark.rs     # Performance comparison
│       └── discovery_hunter.rs        # Novel pattern search
├── openalex/               # OpenAlex integration
├── climate/                # NOAA/NASA integration
└── edgar/                  # SEC EDGAR integration

Configuration Reference

OptimizedConfig

Parameter	Default	Description
`similarity_threshold`	0.65	Minimum cosine similarity for edges
`mincut_sensitivity`	0.12	Sensitivity to coherence changes
`cross_domain`	true	Enable cross-domain discovery
`batch_size`	256	Parallel batch size
`use_simd`	true	Enable SIMD acceleration
`similarity_cache_size`	10000	Max cached similarity pairs
`significance_threshold`	0.05	P-value threshold
`causality_lookback`	10	Temporal lookback periods
`causality_min_correlation`	0.6	Minimum correlation for causality

CoherenceConfig (v0.3.0)

Parameter	Default	Description
`similarity_threshold`	0.5	Min similarity for auto-connecting embeddings
`use_embeddings`	true	Auto-create edges from embedding similarity
`hnsw_k_neighbors`	50	Neighbors to search per vector (HNSW)
`hnsw_min_records`	100	Min records to trigger HNSW (else brute force)
`min_edge_weight`	0.01	Minimum edge weight threshold
`approximate`	true	Use approximate min-cut for speed
`parallel`	true	Enable parallel computation

Discovery Examples

Climate-Finance Bridge

Detected: Climate ↔ Finance bridge
  Strength: 0.73
  Connections: 197

Hypothesis: Drought indices may predict
  utility sector performance with lag-2

Regime Shift Detection

Min-cut trajectory:
  t=0: 72.5 (baseline)
  t=1: 73.3 (+1.1%)
  t=2: 74.5 (+1.6%) ← Consolidation

Effect size: 2.99 (large)
P-value: 0.042 (significant)

Causality Pattern

Climate → Finance causality detected
  F-statistic: 4.23
  Optimal lag: 3 periods
  Correlation: 0.67
  P-value: 0.031

Algorithms

HNSW (Hierarchical Navigable Small World)

Approximate nearest neighbor search in high-dimensional spaces.

Complexity: O(log n) search, O(log n) insert
Use: Fast similarity search for edge creation
Parameters: m=16, ef_construction=200, ef_search=50

Stoer-Wagner Min-Cut

Computes minimum cut of weighted undirected graph.

Complexity: O(VE + V² log V)
Use: Network coherence measurement

SIMD Cosine Similarity

Processes 8 floats per iteration using AVX2.

Speedup: 2.9x vs scalar
Fallback: Chunked scalar (8 floats per iteration)

Granger Causality

Tests if past values of X predict Y.

Compute cross-correlation at lags 1..k
Find optimal lag with max |correlation|
Calculate F-statistic
Convert to p-value

Best Practices

Start with low thresholds - Use similarity_threshold: 0.45 for exploration
Use batch insertion - add_vectors_batch() is 8x faster
Monitor coherence trends - Min-cut trajectory predicts regime changes
Filter by significance - Focus on p_value < 0.05
Validate causality - Temporal patterns need domain expertise

Troubleshooting

Problem	Solution
No patterns detected	Lower `mincut_sensitivity` to 0.05
Too many edges	Raise `similarity_threshold` to 0.70
Slow performance	Use `--features parallel --release`
Memory issues	Reduce `batch_size`

References

License

MIT OR Apache-2.0

FilesExpand file tree

data

Directory actions

More options