Find hidden patterns and connections in massive datasets that traditional tools miss.
RuVector turns your data (research papers, climate records, financial filings) into a connected graph, then uses HNSW similarity search and dynamic minimum-cut analysis to spot emerging trends, cross-domain relationships, and regime shifts before they become obvious.
Most data analysis tools excel at answering questions you already know to ask. RuVector is different: it helps you discover what you don't know you're looking for.
Real-world examples:
- 🔬 Research: Spot a new field forming 6-12 months before it gets a name, by detecting when papers start citing across traditional boundaries
- 🌍 Climate: Detect regime shifts in weather patterns that correlate with economic disruptions
- 💰 Finance: Find companies whose narratives are diverging from their peers—often an early warning signal
| Feature | What It Does | Why It Matters |
|---|---|---|
| Vector Memory | Stores data as 384-1536 dim embeddings | Similar concepts cluster together automatically |
| HNSW Index | O(log n) approximate nearest neighbor search | 10-50x faster than brute force for large datasets |
| Graph Structure | Connects related items with weighted edges | Reveals hidden relationships in your data |
| Min-Cut Analysis | Measures how "connected" your network is | Detects regime changes and fragmentation |
| Cross-Domain Detection | Finds bridges between different fields | Discovers unexpected correlations (e.g., climate → finance) |
| ONNX Embeddings | Neural semantic embeddings (MiniLM, BGE, etc.) | Production-quality text understanding |
| Causality Testing | Checks if changes in X predict changes in Y | Moves beyond correlation to actionable insights |
| Statistical Rigor | Reports p-values and effect sizes | Know which findings are real vs. noise |
- HNSW Integration: O(n log n) all-pairs similarity search (O(log n) per query) replaces O(n²) brute force
- Similarity Cache: 2-3x speedup for repeated similarity queries
- Batch ONNX Embeddings: Chunked processing with progress callbacks
- Shared Utils Module: cosine_similarity, euclidean_distance, normalize_vector
- Auto-connect by Embeddings: CoherenceEngine creates edges from vector similarity
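
The three shared utilities named above are standard vector operations. The following is a minimal sketch of what such helpers could look like; the signatures and the zero-norm handling are assumptions made for illustration and are not taken from the ruvector source.

```rust
/// Cosine similarity of two equal-length vectors.
/// Assumption: returns 0.0 when either vector has zero norm.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must have the same dimension");
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0;
    }
    dot / (norm_a * norm_b)
}

/// Euclidean (L2) distance between two equal-length vectors.
pub fn euclidean_distance(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must have the same dimension");
    a.iter()
        .zip(b)
        .map(|(x, y)| (x - y) * (x - y))
        .sum::<f32>()
        .sqrt()
}

/// Returns a unit-length copy of `v`.
/// Assumption: a zero vector is returned unchanged.
pub fn normalize_vector(v: &[f32]) -> Vec<f32> {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm == 0.0 {
        return v.to_vec();
    }
    v.iter().map(|x| x / norm).collect()
}
```

Normalizing embeddings once at insertion time reduces cosine similarity to a plain dot product, which is one common reason these helpers are grouped together.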
- ⚡ 10-50x faster similarity search (HNSW vs brute force)
- ⚡ 8.8x faster batch vector insertion (parallel processing)
- ⚡ 2.9x faster similarity computation (SIMD acceleration)
- ⚡ 2-3x faster repeated queries (similarity cache)
- 📊 Works with millions of records on standard hardware
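
To illustrate why a similarity cache speeds up repeated queries, here is a small sketch of a bounded cache keyed by an unordered pair of node ids. The type name `SimilarityCache`, its methods, and the "stop inserting when full" policy are hypothetical; the real cache may use a different eviction strategy.

```rust
use std::collections::HashMap;

/// Hypothetical bounded cache for pairwise similarities.
pub struct SimilarityCache {
    entries: HashMap<(usize, usize), f32>,
    max_entries: usize,
    pub hits: u64,
    pub misses: u64,
}

impl SimilarityCache {
    pub fn new(max_entries: usize) -> Self {
        Self {
            entries: HashMap::new(),
            max_entries,
            hits: 0,
            misses: 0,
        }
    }

    /// Returns the cached similarity for the unordered pair (a, b),
    /// calling `compute` only on a cache miss.
    pub fn get_or_compute<F: FnOnce() -> f32>(&mut self, a: usize, b: usize, compute: F) -> f32 {
        // Similarity is symmetric, so (a, b) and (b, a) share one key.
        let key = if a <= b { (a, b) } else { (b, a) };
        if let Some(&value) = self.entries.get(&key) {
            self.hits += 1;
            return value;
        }
        self.misses += 1;
        let value = compute();
        // Assumption: once the cache is full, new pairs are simply not stored.
        if self.entries.len() < self.max_entries {
            self.entries.insert(key, value);
        }
        value
    }
}
```

The capacity argument plays the role of `similarity_cache_size` in the configuration; the hit and miss counters make it easy to check whether the cache is actually helping on a given workload.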
# Ensure you're in the ruvector workspace
cd /workspaces/ruvector

# 1. Performance benchmark - see the speed improvements
cargo run --example optimized_benchmark -p ruvector-data-framework --features parallel --release
# 2. Discovery hunter - find patterns in sample data
cargo run --example discovery_hunter -p ruvector-data-framework --features parallel --release
# 3. Cross-domain analysis - detect bridges between fields
cargo run --example cross_domain_discovery -p ruvector-data-framework --release

# Climate: Detect weather regime shifts
cargo run --example regime_detector -p ruvector-data-climate
# Finance: Monitor corporate filing coherence
cargo run --example coherence_watch -p ruvector-data-edgar

🔍 Discovery Results:
Pattern: Climate ↔ Finance bridge detected
Strength: 0.73 (strong connection)
P-value: 0.031 (statistically significant)
→ Drought indices may predict utility sector
performance with a 3-period lag
RuVector's unique combination of vector memory, graph structures, and dynamic minimum cut algorithms enables discoveries that most analysis tools miss:
- Emerging patterns before they have names: Detect topic splits and merges as cut boundaries shift over time
- Non-obvious cross-domain bridges: Find small "connector" subgraphs where disciplines quietly start citing each other
- Causal leverage maps: Link funders, labs, venues, and downstream citations to spot high-impact intervention points
- Regime shifts in time series: Use coherence breaks to flag fundamental changes in system behavior
use ruvector_data_framework::optimized::{
OptimizedDiscoveryEngine, OptimizedConfig,
};
use ruvector_data_framework::ruvector_native::{
Domain, SemanticVector,
};
let config = OptimizedConfig {
similarity_threshold: 0.55, // Minimum cosine similarity
mincut_sensitivity: 0.10, // Coherence change threshold
cross_domain: true, // Enable cross-domain discovery
use_simd: true, // SIMD acceleration
significance_threshold: 0.05, // P-value threshold
causality_lookback: 12, // Temporal lookback periods
..Default::default()
};
let mut engine = OptimizedDiscoveryEngine::new(config);

use std::collections::HashMap;
use chrono::Utc;
// Single vector
let vector = SemanticVector {
id: "climate_drought_2024".to_string(),
embedding: generate_embedding(), // 128-dim vector
domain: Domain::Climate,
timestamp: Utc::now(),
metadata: HashMap::from([
("region".to_string(), "sahel".to_string()),
("severity".to_string(), "extreme".to_string()),
]),
};
let node_id = engine.add_vector(vector);
// Batch insertion (8.8x faster)
#[cfg(feature = "parallel")]
{
let vectors: Vec<SemanticVector> = load_vectors();
let node_ids = engine.add_vectors_batch(vectors);
}

let snapshot = engine.compute_coherence();
println!("Min-cut value: {:.3}", snapshot.mincut_value);
println!("Partition sizes: {:?}", snapshot.partition_sizes);
println!("Boundary nodes: {:?}", snapshot.boundary_nodes);

Interpretation:
| Min-cut Trend | Meaning |
|---|---|
| Rising | Network consolidating, stronger connections |
| Falling | Fragmentation, potential regime change |
| Stable | Steady state, consistent structure |
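
The table above can be turned into a simple rule over a history of min-cut values. The sketch below compares the last two values and labels the relative change; `MinCutTrend`, `classify_trend`, and the way the threshold is applied are assumptions for illustration and do not describe how the engine uses `mincut_sensitivity` internally.

```rust
/// Hypothetical labels matching the interpretation table.
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum MinCutTrend {
    Rising,
    Falling,
    Stable,
}

/// Classifies the relative change between the last two min-cut values.
/// `sensitivity` is the minimum relative change (e.g. 0.05 = 5%) that
/// counts as a real move rather than noise.
pub fn classify_trend(history: &[f64], sensitivity: f64) -> MinCutTrend {
    if history.len() < 2 {
        return MinCutTrend::Stable;
    }
    let prev = history[history.len() - 2];
    let curr = history[history.len() - 1];
    // Avoid dividing by zero when the previous cut was empty.
    if prev.abs() < f64::EPSILON {
        return MinCutTrend::Stable;
    }
    let change = (curr - prev) / prev;
    if change > sensitivity {
        MinCutTrend::Rising
    } else if change < -sensitivity {
        MinCutTrend::Falling
    } else {
        MinCutTrend::Stable
    }
}
```

A production detector would likely smooth over several snapshots rather than a single step, but the same three-way split applies.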
let patterns = engine.detect_patterns_with_significance();
for pattern in patterns.iter().filter(|p| p.is_significant) {
println!("{}", pattern.pattern.description);
println!(" P-value: {:.4}", pattern.p_value);
println!(" Effect size: {:.3}", pattern.effect_size);
}

Pattern Types:
| Type | Description | Example |
|---|---|---|
| CoherenceBreak | Min-cut dropped significantly | Network fragmentation crisis |
| Consolidation | Min-cut increased | Market convergence |
| BridgeFormation | Cross-domain connections | Climate-finance link |
| Cascade | Temporal causality | Climate → Finance lag-3 |
| EmergingCluster | New dense subgraph | Research topic emerging |
// Check coupling strength
let stats = engine.stats();
let coupling = stats.cross_domain_edges as f64 / stats.total_edges as f64;
println!("Cross-domain coupling: {:.1}%", coupling * 100.0);
// Domain coherence scores
for domain in [Domain::Climate, Domain::Finance, Domain::Research] {
if let Some(coh) = engine.domain_coherence(domain) {
println!("{:?}: {:.3}", domain, coh);
}
}

| Operation | Baseline | Optimized | Speedup |
|---|---|---|---|
| Vector Insertion | 133ms | 15ms | 8.84x |
| SIMD Cosine | 432ms | 148ms | 2.91x |
| Pattern Detection | 524ms | 655ms | - |
Best for: Emerging field detection, cross-discipline bridges
- 250M+ works, 90M+ authors
- Native graph structure
- Bulk download + API access
use ruvector_data_openalex::{OpenAlexConfig, FrontierRadar};
let radar = FrontierRadar::new(OpenAlexConfig::default());
let frontiers = radar.detect_emerging_topics(papers);

Best for: Regime shift detection, anomaly prediction
- Weather observations, satellite imagery
- Time series → graph transformation
- Economic risk modeling
use ruvector_data_climate::{ClimateConfig, RegimeDetector};
let detector = RegimeDetector::new(config);
let shifts = detector.detect_shifts();

Best for: Corporate risk signals, peer divergence
- XBRL financial statements
- 10-K/10-Q filings
- Narrative + fundamental analysis
use ruvector_data_edgar::{EdgarConfig, CoherenceMonitor};
let monitor = CoherenceMonitor::new(config);
let alerts = monitor.analyze_filing(filing);

examples/data/
├── README.md                          # This file
├── Cargo.toml                         # Workspace manifest
├── framework/                         # Core discovery framework
│   ├── src/
│   │   ├── lib.rs                     # Framework exports
│   │   ├── ruvector_native.rs         # Native engine with Stoer-Wagner
│   │   ├── optimized.rs               # SIMD + parallel optimizations
│   │   ├── coherence.rs               # Coherence signal computation
│   │   ├── discovery.rs               # Pattern detection
│   │   └── ingester.rs                # Data ingestion
│   └── examples/
│       ├── cross_domain_discovery.rs  # Cross-domain patterns
│       ├── optimized_benchmark.rs     # Performance comparison
│       └── discovery_hunter.rs        # Novel pattern search
├── openalex/                          # OpenAlex integration
├── climate/                           # NOAA/NASA integration
└── edgar/                             # SEC EDGAR integration
| Parameter | Default | Description |
|---|---|---|
| similarity_threshold | 0.65 | Minimum cosine similarity for edges |
| mincut_sensitivity | 0.12 | Sensitivity to coherence changes |
| cross_domain | true | Enable cross-domain discovery |
| batch_size | 256 | Parallel batch size |
| use_simd | true | Enable SIMD acceleration |
| similarity_cache_size | 10000 | Max cached similarity pairs |
| significance_threshold | 0.05 | P-value threshold |
| causality_lookback | 10 | Temporal lookback periods |
| causality_min_correlation | 0.6 | Minimum correlation for causality |
| Parameter | Default | Description |
|---|---|---|
| similarity_threshold | 0.5 | Min similarity for auto-connecting embeddings |
| use_embeddings | true | Auto-create edges from embedding similarity |
| hnsw_k_neighbors | 50 | Neighbors to search per vector (HNSW) |
| hnsw_min_records | 100 | Min records to trigger HNSW (else brute force) |
| min_edge_weight | 0.01 | Minimum edge weight threshold |
| approximate | true | Use approximate min-cut for speed |
| parallel | true | Enable parallel computation |
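
To make the roles of `similarity_threshold` and `min_edge_weight` concrete, the sketch below shows the brute-force path that the table says applies below `hnsw_min_records`: every pair of embeddings is compared and an edge is kept only if its cosine similarity clears both thresholds. The `Edge` struct and `connect_by_similarity` function are hypothetical names, and using the similarity directly as the edge weight is an assumption.

```rust
/// Hypothetical weighted edge between two node indices.
#[derive(Debug, Clone, PartialEq)]
pub struct Edge {
    pub source: usize,
    pub target: usize,
    pub weight: f32,
}

/// Cosine similarity; returns 0.0 if either vector has zero norm.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 {
        0.0
    } else {
        dot / (na * nb)
    }
}

/// Brute-force O(n^2) edge creation from embedding similarity.
/// Assumption: the edge weight is the cosine similarity itself.
pub fn connect_by_similarity(
    embeddings: &[Vec<f32>],
    similarity_threshold: f32,
    min_edge_weight: f32,
) -> Vec<Edge> {
    let mut edges = Vec::new();
    for i in 0..embeddings.len() {
        for j in (i + 1)..embeddings.len() {
            let sim = cosine(&embeddings[i], &embeddings[j]);
            if sim >= similarity_threshold && sim >= min_edge_weight {
                edges.push(Edge { source: i, target: j, weight: sim });
            }
        }
    }
    edges
}
```

The quadratic loop is what makes an approximate index attractive once the record count grows: HNSW would restrict the inner loop to roughly `hnsw_k_neighbors` candidates per vector.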
Detected: Climate ↔ Finance bridge
Strength: 0.73
Connections: 197
Hypothesis: Drought indices may predict
utility sector performance with lag-2
Min-cut trajectory:
t=0: 72.5 (baseline)
t=1: 73.3 (+1.1%)
t=2: 74.5 (+1.6%) ← Consolidation
Effect size: 2.99 (large)
P-value: 0.042 (significant)
Climate → Finance causality detected
F-statistic: 4.23
Optimal lag: 3 periods
Correlation: 0.67
P-value: 0.031
HNSW (Hierarchical Navigable Small World) provides approximate nearest neighbor search in high-dimensional spaces.
- Complexity: O(log n) search, O(log n) insert
- Use: Fast similarity search for edge creation
- Parameters: m=16, ef_construction=200, ef_search=50
Stoer-Wagner computes the minimum cut of a weighted undirected graph.
- Complexity: O(VE + V² log V)
- Use: Network coherence measurement
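
For readers unfamiliar with the algorithm, here is a compact sketch of Stoer-Wagner on a dense adjacency matrix. It is the simple O(V³) array-based variant, not the heap-based version that reaches the complexity quoted above, and the function name and matrix representation are assumptions rather than the ruvector API.

```rust
/// Global minimum cut weight of an undirected graph given as a symmetric
/// adjacency matrix. Returns None for graphs with fewer than two nodes.
pub fn stoer_wagner_min_cut(weights: &[Vec<f64>]) -> Option<f64> {
    let n = weights.len();
    if n < 2 {
        return None;
    }
    let mut w: Vec<Vec<f64>> = weights.to_vec();
    // Supernodes that have not yet been merged away.
    let mut active: Vec<usize> = (0..n).collect();
    let mut best = f64::INFINITY;

    while active.len() > 1 {
        let m = active.len();
        let mut added = vec![false; m];
        // conn[i] = total weight from active[i] to the growing set A.
        let mut conn = vec![0.0f64; m];
        let mut prev = 0usize;
        let mut last = 0usize;

        // Minimum cut phase: repeatedly add the most tightly connected node.
        for step in 0..m {
            let mut sel = usize::MAX;
            for i in 0..m {
                if !added[i] && (sel == usize::MAX || conn[i] > conn[sel]) {
                    sel = i;
                }
            }
            added[sel] = true;
            if step == m - 1 {
                // Cut of the phase: the last node versus everything else.
                best = best.min(conn[sel]);
                last = sel;
            } else {
                prev = sel;
                for i in 0..m {
                    if !added[i] {
                        conn[i] += w[active[sel]][active[i]];
                    }
                }
            }
        }

        // Merge the last node into the second-to-last one.
        let (s, t) = (active[prev], active[last]);
        for &v in &active {
            if v != s && v != t {
                w[s][v] += w[t][v];
                w[v][s] = w[s][v];
            }
        }
        active.remove(last);
    }
    Some(best)
}
```

A falling return value across successive snapshots corresponds to the "Falling" row of the min-cut interpretation: the graph is becoming easier to split in two.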
SIMD cosine similarity processes 8 floats per iteration using AVX2.
- Speedup: 2.9x vs scalar
- Fallback: Chunked scalar (8 floats per iteration)
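
The chunked scalar fallback can be sketched in portable Rust: the inputs are walked 8 floats at a time with 8 independent accumulators, which mirrors the AVX2 lane layout and gives the compiler a chance to auto-vectorize. The function names below are hypothetical and this is not the library's actual kernel.

```rust
const LANES: usize = 8;

/// Dot product computed in chunks of 8 floats with 8 accumulators.
pub fn dot_chunked(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must have the same dimension");
    let mut acc = [0.0f32; LANES];
    let mut a_chunks = a.chunks_exact(LANES);
    let mut b_chunks = b.chunks_exact(LANES);
    for (ca, cb) in a_chunks.by_ref().zip(b_chunks.by_ref()) {
        for lane in 0..LANES {
            acc[lane] += ca[lane] * cb[lane];
        }
    }
    // Elements left over when the length is not a multiple of 8.
    let tail: f32 = a_chunks
        .remainder()
        .iter()
        .zip(b_chunks.remainder())
        .map(|(x, y)| x * y)
        .sum();
    acc.iter().sum::<f32>() + tail
}

/// Cosine similarity built on the chunked dot product.
/// Assumption: returns 0.0 when either vector has zero norm.
pub fn cosine_chunked(a: &[f32], b: &[f32]) -> f32 {
    let denom = (dot_chunked(a, a) * dot_chunked(b, b)).sqrt();
    if denom == 0.0 {
        0.0
    } else {
        dot_chunked(a, b) / denom
    }
}
```

Keeping separate accumulators per lane breaks the dependency chain of a single running sum, which is the main reason the chunked form tends to be faster than a naive loop even without explicit intrinsics.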
Causality testing checks whether past values of X help predict Y:
- Compute cross-correlation at lags 1..k
- Find optimal lag with max |correlation|
- Calculate F-statistic
- Convert to p-value
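
The first three steps above can be sketched as a lagged Pearson correlation scan. This is a simplified illustration, not the framework's implementation: the function names are hypothetical, the F-statistic uses the single-predictor regression identity F = r²(n − 2) / (1 − r²), and the final conversion to a p-value is omitted because it needs an F-distribution CDF from a statistics crate.

```rust
/// Pearson correlation of two equal-length series (0.0 if either is constant).
fn pearson(x: &[f64], y: &[f64]) -> f64 {
    let n = x.len() as f64;
    let mean_x = x.iter().sum::<f64>() / n;
    let mean_y = y.iter().sum::<f64>() / n;
    let (mut cov, mut var_x, mut var_y) = (0.0, 0.0, 0.0);
    for (a, b) in x.iter().zip(y) {
        cov += (a - mean_x) * (b - mean_y);
        var_x += (a - mean_x) * (a - mean_x);
        var_y += (b - mean_y) * (b - mean_y);
    }
    if var_x == 0.0 || var_y == 0.0 {
        0.0
    } else {
        cov / (var_x * var_y).sqrt()
    }
}

/// Scans lags 1..=max_lag and returns (lag, correlation, f_statistic) for
/// the lag with the largest |correlation| between x[t] and y[t + lag].
pub fn best_lag(x: &[f64], y: &[f64], max_lag: usize) -> Option<(usize, f64, f64)> {
    assert_eq!(x.len(), y.len(), "series must have the same length");
    let mut best: Option<(usize, f64, f64)> = None;
    for lag in 1..=max_lag {
        // Need at least 3 aligned pairs so that n - 2 is positive.
        if x.len() < lag + 3 {
            break;
        }
        let n = x.len() - lag;
        let r = pearson(&x[..n], &y[lag..]);
        // Simplification: F-statistic for a one-predictor linear regression.
        let f = r * r * (n as f64 - 2.0) / (1.0 - r * r).max(f64::EPSILON);
        if best.map_or(true, |(_, best_r, _)| r.abs() > best_r.abs()) {
            best = Some((lag, r, f));
        }
    }
    best
}
```

Here `max_lag` corresponds to `causality_lookback`. A high correlation at some lag is only a candidate signal; a full Granger-style test would compare a model of Y with and without lagged X terms.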
- Start with low thresholds - Use similarity_threshold: 0.45 for exploration
- Use batch insertion - add_vectors_batch() is 8x faster
- Monitor coherence trends - Min-cut trajectory predicts regime changes
- Filter by significance - Focus on p_value < 0.05
- Validate causality - Temporal patterns need domain expertise
| Problem | Solution |
|---|---|
| No patterns detected | Lower mincut_sensitivity to 0.05 |
| Too many edges | Raise similarity_threshold to 0.70 |
| Slow performance | Use --features parallel --release |
| Memory issues | Reduce batch_size |
MIT OR Apache-2.0