Indexing bugs #1948
Logs attached

**Bug Report – Copilot Plus & Smart Connections (Obsidian plugins)**

**1. Summary**

When indexing a large Obsidian vault with Copilot Plus (and, to a lesser extent, Smart Connections), the application frequently hangs or crashes. The problem appears tied to:
Smart Connections is comparatively stable, but both plugins suffer from index corruption after a freeze.

**2. Environment**
**3. Steps to Reproduce**

**4. Expected Behaviour**

**5. Actual Behaviour**

**6. Workarounds Attempted (by the user)**
These mitigations improve the success probability but do not eliminate the underlying issue.

**7. Logs & Attachments**

The user has already sent the complete log files (see the attached obsidian.md-1760869775478.log).

**8. Suggested Improvements for the Developers**
**9. Conclusion**

The current implementation of Copilot Plus (and, indirectly, Smart Connections) is highly sensitive to vault size, file-system layout, batch configuration, and interaction with the Obsidian developer console. The user's extensive manual testing demonstrates reproducible failure modes that lead to UI freezes and index corruption. Implementing the suggested safeguards would greatly increase robustness for power users who need to index large knowledge bases.

Attachment: obsidian.md-1760869775478.log
I think it's JSON's string size limit issue. cc @logancyang
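If so, the failure mode would look roughly like the sketch below; the document counts and sizes here are made up for illustration, and the exact V8 limit varies by version:

```ts
// Sketch: serializing a large in-memory index into a single JSON string can
// exceed V8's maximum string length (roughly 2^29 UTF-16 code units on 64-bit
// builds), which surfaces as "RangeError: Invalid string length".
const doc = {
  content: "x".repeat(2_000), // ~2 KB of note text
  embedding: Array.from({ length: 1536 }, () => 0.12345), // typical vector size
};
const index: Record<string, typeof doc> = {};
for (let i = 0; i < 100_000; i++) index[`doc-${i}`] = doc; // a very large vault

try {
  const blob = JSON.stringify(index); // one giant string for the whole index
  console.log(`serialized ${blob.length.toLocaleString()} chars`);
} catch (err) {
  console.error(err); // with enough documents: a RangeError, not an OOM crash
}
```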
@logancyang At this point I have found two ways to index a large number of files with your system.

The first variant: instead of running global indexing for the entire vault, I open each note separately and press the button at the top right that indexes only that one note. So far this has caused no problems — with 1,500 notes indexed this way, nothing has frozen even once.

The second variant: if the files are very small, I was able to index 18,000 of them. Even so, there are occasional failures that appear to be triggered not by any single file; rather, some part of the system — perhaps access control, logging, or buffer overflows — causes these failures and freezes. My assumption is that queues for user-action control, resource allocation, etc. cause the errors and hangs, but your system can definitely index many files. I cannot say whether this will hold up for weeks or months; indexing one file at a time as files appear might hit another failure tomorrow or tonight, and everything could collapse again. For now it works.

If I were in your place, I would analyze the difference between indexing a large number of small files versus big ones, and between indexing large files one by one versus in batches of 50–100 at once. Perhaps it would be worth running a small experiment: build a version of the indexing system that emulates the loop triggered when each file is indexed individually, and simply feed files into it one by one. Possibly, if the architecture of mass indexing is changed this way, these errors will disappear. Naturally, the settings would need to hardcode the batch size to exactly one file per batch. Call this mode something like “BigVault Index Mode” or “Safety Mode”.
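A minimal sketch of that experiment — driving the existing single-note path in a loop so the effective batch size is one. `reindexFile` appears in the plugin code cited later in this thread; the rest is assumed glue:

```ts
import { App, TFile } from "obsidian";

// Hypothetical "Safety Mode": reuse the per-note indexing path instead of
// batch indexing. Awaiting each call serializes the work, making the
// effective batch size exactly one file.
async function indexVaultOneByOne(
  app: App,
  indexOps: { reindexFile(f: TFile): Promise<void> } // shape assumed
): Promise<void> {
  for (const file of app.vault.getMarkdownFiles()) {
    await indexOps.reindexFile(file); // one file per "batch"
  }
}
```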
I manually indexed 1,700 files. I think this method saves the index to RAM? I don't see any file modification. But the index isn't saved after the app crashes.
No, the index doesn't save to RAM — it saves to disk in batches, with checkpoints. What file are you looking at? Use the "list all indexed files" command for more details.
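For reference, a trimmed sketch of that checkpoint behavior, simplified from the plugin code quoted later in this thread (the modulo check is a simplification of the actual checkpoint-crossing logic, and the batch size 32 is just an assumed example):

```ts
// Simplified from src/search/indexOperations.ts (cited later in this thread):
// the index lives in an in-memory Orama DB while indexing, but is flushed to
// disk every checkpointInterval = 8 * embeddingBatchSize documents, so a
// crash loses at most one checkpoint window of work.
async function maybeCheckpoint(
  indexedCount: number,
  embeddingBatchSize: number,
  saveDB: () => Promise<void> // dbOps.saveDB in the real plugin
): Promise<void> {
  const checkpointInterval = 8 * embeddingBatchSize;
  if (indexedCount > 0 && indexedCount % checkpointInterval === 0) {
    await saveDB(); // persists all partitions via ChunkedStorage
  }
}

// Example: with an assumed batch size of 32, this saves every 256 documents.
maybeCheckpoint(256, 32, async () => console.log("checkpoint save"));
```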
We have tested with several other users with 100 GB+ vaults. They are using a similar approach to the one you described — index a subset, then add more one by one with the button. It's not worth the engineering hours, since this approach has clear limitations. We are moving to real vector DBs in self-host mode as the proper solution.
I've been researching better RAG systems. My current preference is https://platform.cognee.ai/auth — Graph+Vector "human-like" memory, with agentic graph generation inside that memory. Can you integrate this? Maybe in MCP mode, or via an API call from your agent? Relational store — tracks your documents, their chunks, and provenance (i.e. where each piece of data came from and how it's linked to the source).
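Purely as an illustration of the "API call from your agent" idea — the endpoint URL, payload shape, and auth scheme below are invented placeholders, not cognee's actual API:

```ts
// Hypothetical integration sketch: the plugin's agent forwards a query to an
// external Graph+Vector memory service over HTTP. Endpoint, request body, and
// response shape are assumptions for illustration only.
async function queryExternalMemory(query: string, apiKey: string): Promise<unknown> {
  const res = await fetch("https://memory.example.com/v1/search", { // placeholder URL
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`,
    },
    body: JSON.stringify({ query, top_k: 10 }), // assumed parameters
  });
  if (!res.ok) throw new Error(`memory service returned HTTP ${res.status}`);
  return res.json();
}
```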
@logancyang Based on my analysis of the codebase, I need to provide you with some important context: the Obsidian Copilot plugin's indexing pipeline has limited resource-management capabilities and does not implement storage-type-aware I/O optimizations or inter-plugin coordination mechanisms. The persistent

## Resource Management Implementation

The plugin implements basic resource-control mechanisms:

- Rate Limiting: A simple time-based rate limiter controls embedding API requests 1
- Batch Processing: Documents are processed in configurable batches, but this only affects embedding API calls, not I/O operations 2
- Checkpointing: Database saves occur at intervals calculated as 8 × the embedding batch size 3

## I/O Operations and Storage Limitations

Critical Issue with Partitioning: While the plugin detects the configured number of partitions, the chunked storage system uses hash-based document distribution 5 , but partition 0 receives additional global metadata that includes the entire index structure 6 . This means:
No Storage-Type Awareness: The plugin uses Obsidian's abstract vault adapter without any special handling for SSD vs. HDD or I/O prioritization 8

## Concurrency Control and Conflict Prevention

Internal Concurrency: The plugin uses a mutex for atomic upsert operations 9 , but this only prevents internal race conditions within the plugin itself.

No Inter-Plugin Coordination: There is no mechanism to prevent conflicts with other plugins that might be accessing the same files or performing I/O operations simultaneously. The only coordination is a 5-second debounce for reindexing 10

Background Operations: The plugin uses a setTimeout-delayed background task (100 ms) for the final save and integrity check

## Why Your Issue Persists

The persistent
## Notes

The plugin's error handling shows awareness of the string length issue and recommends increasing partitions 14 , but this solution has inherent limitations for very large vaults. The codebase doesn't implement:
For vaults exceeding the architectural limits, the current implementation cannot reliably prevent these errors regardless of configuration adjustments.

## Citations

File: src/rateLimiter.ts (L1-28)

```ts
export class RateLimiter {
private lastRequestTime = 0;
private requestsPerMin: number;
constructor(requestsPerMin: number) {
this.requestsPerMin = requestsPerMin;
}
setRequestsPerMin(requestsPerMin: number) {
this.requestsPerMin = requestsPerMin;
}
getRequestsPerMin(): number {
return this.requestsPerMin;
}
async wait(): Promise<void> {
const now = Date.now();
const timeSinceLastRequest = now - this.lastRequestTime;
const timeToWait = Math.max(0, 60000 / this.requestsPerMin - timeSinceLastRequest);
if (timeToWait > 0) {
await new Promise((resolve) => setTimeout(resolve, timeToWait));
}
this.lastRequestTime = Date.now();
}
}
```

File: src/search/indexOperations.ts (L42-58)

```ts
constructor(
private app: App,
private dbOps: DBOperations,
private embeddingsManager: EmbeddingsManager
) {
const settings = getSettings();
this.rateLimiter = new RateLimiter(settings.embeddingRequestsPerMin);
this.embeddingBatchSize = settings.embeddingBatchSize;
this.checkpointInterval = 8 * this.embeddingBatchSize;
// Subscribe to settings changes
subscribeToSettingsChange(async () => {
const settings = getSettings();
this.rateLimiter = new RateLimiter(settings.embeddingRequestsPerMin);
this.embeddingBatchSize = settings.embeddingBatchSize;
this.checkpointInterval = 8 * this.embeddingBatchSize;
});
```

File: src/search/indexOperations.ts (L107-183)

```ts
for (let i = 0; i < allChunks.length; i += this.embeddingBatchSize) {
if (this.state.isIndexingCancelled) break;
await this.handlePause();
const batch = allChunks.slice(i, i + this.embeddingBatchSize);
try {
await this.rateLimiter.wait();
const embeddings = await embeddingInstance.embedDocuments(
batch.map((chunk) => chunk.content)
);
// Validate embeddings
if (!embeddings || embeddings.length !== batch.length) {
throw new Error(
`Embedding model returned ${embeddings?.length ?? 0} embeddings for ${batch.length} documents`
);
}
// Save batch to database
for (let j = 0; j < batch.length; j++) {
const chunk = batch[j];
const embedding = embeddings[j];
// Skip documents with invalid embeddings
if (!embedding || !Array.isArray(embedding) || embedding.length === 0) {
logError(`Invalid embedding for document ${chunk.fileInfo.path}: ${embedding}`);
this.dbOps.markFileMissingEmbeddings(chunk.fileInfo.path);
continue;
}
try {
await this.dbOps.upsert({
...chunk.fileInfo,
id: this.getDocHash(chunk.content),
content: chunk.content,
embedding,
created_at: Date.now(),
nchars: chunk.content.length,
});
// Mark success for this file
this.state.processedFiles.add(chunk.fileInfo.path);
} catch (err) {
// Log error but continue processing other documents in batch
this.handleError(err, {
filePath: chunk.fileInfo.path,
errors,
});
this.dbOps.markFileMissingEmbeddings(chunk.fileInfo.path);
continue;
}
}
// Update progress after the batch
this.state.indexedCount = this.state.processedFiles.size;
this.updateIndexingNoticeMessage();
// Calculate if we've crossed a checkpoint threshold
const previousCheckpoint = Math.floor(
(this.state.indexedCount - batch.length) / this.checkpointInterval
);
const currentCheckpoint = Math.floor(this.state.indexedCount / this.checkpointInterval);
if (currentCheckpoint > previousCheckpoint) {
await this.dbOps.saveDB();
console.log("Copilot index checkpoint save completed.");
}
} catch (err) {
this.handleError(err, {
filePath: batch?.[0]?.fileInfo?.path,
errors,
batch,
});
if (this.isRateLimitError(err)) {
break;
}
}
}
```

File: src/search/indexOperations.ts (L189-201)

```ts
setTimeout(() => {
this.dbOps
.saveDB()
.then(() => {
logInfo("Copilot index final save completed.");
this.dbOps.checkIndexIntegrity().catch((err) => {
logError("Background integrity check failed:", err);
});
})
.catch((err) => {
logError("Background save failed:", err);
});
}, 100); // 100ms delay
```

File: src/search/indexOperations.ts (L437-449)

```ts
private isStringLengthError(error: any): boolean {
if (!error) return false;
// Check if it's a direct RangeError
if (error instanceof RangeError && error.message.toLowerCase().includes("string length")) {
return true;
}
// Check the error message at any depth
const message = error.message || error.toString();
const lowerMessage = message.toLowerCase();
return lowerMessage.includes("string length") || lowerMessage.includes("rangeerror");
}
```

File: src/search/indexOperations.ts (L492-498)

```ts
if (this.isStringLengthError(error)) {
new Notice(
"Vault is too large for 1 partition, please increase the number of partitions in your Copilot QA settings!",
10000 // Show for 10 seconds
);
return;
}
```

File: src/search/chunkedStorage.ts (L36-52)

```ts
public assignDocumentToPartition(docId: string, totalPartitions: number): number {
// 1. Convert string to array of characters
const chars = Array.from(docId); // e.g., "abc" -> ['a', 'b', 'c']
// 2. Create a hash using the djb2 algorithm
const hash = chars.reduce((acc, char) => {
// For each character:
// a. Left shift acc by 5 (multiply by 32): acc << 5
// b. Subtract original acc: (acc << 5) - acc
// This is equivalent to: acc * 31
// c. Add character code: + char.charCodeAt(0)
return (acc << 5) - acc + char.charCodeAt(0);
}, 0);
// 3. Take absolute value and modulo to get partition number
return Math.abs(hash) % totalPartitions;
}
```

File: src/search/chunkedStorage.ts (L99-104)

```ts
private async ensureDirectoryExists(filePath: string): Promise<void> {
const dir = filePath.substring(0, filePath.lastIndexOf("/"));
if (!(await this.app.vault.adapter.exists(dir))) {
await this.app.vault.adapter.mkdir(dir);
}
}
```

File: src/search/chunkedStorage.ts (L164-172)

```ts
// Create global data object (excluding partitioned fields)
const globalData = {
...rawData,
docs: { docs: {}, count: 0 },
index: {
...(rawData as any).index,
vectorIndexes: undefined,
},
};
```

File: src/search/chunkedStorage.ts (L196-207)

```ts
// For first partition, include global data
const finalPartitionData =
partitionIndex === 0
? {
...globalData,
docs: partitionData.docs,
index: {
...globalData.index,
vectorIndexes: partitionData.index.vectorIndexes,
},
}
: partitionData;
```

File: src/search/chunkedStorage.ts (L209-216)

```ts
const chunkPath = this.getChunkPath(partitionIndex);
await this.ensureDirectoryExists(chunkPath);
await this.app.vault.adapter.write(chunkPath, JSON.stringify(finalPartitionData));
if (getSettings().debug) {
console.log(`Saved partition ${partitionIndex + 1}/${numPartitions}`);
}
}
```

File: src/search/dbOperations.ts (L116-147)

```ts
async saveDB() {
if (Platform.isMobile && getSettings().disableIndexOnMobile) {
return;
}
if (!this.oramaDb || !this.chunkedStorage) {
// Instead of throwing immediately, try to initialize.
// Crucial for new user onboarding.
try {
await this.initializeDB(await EmbeddingsManager.getInstance().getEmbeddingsAPI());
// If still not initialized after attempt, then throw
if (!this.oramaDb || !this.chunkedStorage) {
throw new CustomError("Orama database not found.");
}
} catch (error) {
logError("Failed to initialize database during save:", error);
throw new CustomError("Failed to initialize and save database.");
}
}
try {
await this.chunkedStorage.saveDatabase(this.oramaDb);
this.hasUnsavedChanges = false;
if (getSettings().debug) {
logInfo("Orama database saved successfully at:", this.dbPath);
}
} catch (error) {
logError(`Error saving Orama database:`, error);
throw error;
}
}
```

File: src/search/dbOperations.ts (L362-416)

```ts
async upsert(docToSave: any): Promise<any | undefined> {
if (!this.oramaDb) throw new Error("DB not initialized");
const db = this.oramaDb;
// Use mutex to make the operation atomic
return await this.upsertMutex.runExclusive(async () => {
try {
// Calculate partition first
const partition = this.chunkedStorage?.assignDocumentToPartition(
docToSave.id,
getSettings().numPartitions
);
// Check if document exists
const existingDoc = await search(db, {
term: docToSave.id,
properties: ["id"],
limit: 1,
});
if (existingDoc.hits.length > 0) {
await remove(db, existingDoc.hits[0].id);
}
// Insert into the assigned partition
try {
await insert(db, docToSave);
logInfo(
`${existingDoc.hits.length > 0 ? "Updated" : "Inserted"} document ${docToSave.id} in partition ${partition}`
);
this.markUnsavedChanges();
return docToSave;
} catch (insertErr) {
logError(
`Failed to ${existingDoc.hits.length > 0 ? "update" : "insert"} document ${docToSave.id}:`,
insertErr
);
// If we removed an existing document but failed to insert the new one,
// we should try to restore the old document
if (existingDoc.hits.length > 0) {
try {
await insert(db, existingDoc.hits[0].document);
} catch (restoreErr) {
logError("Failed to restore previous document version:", restoreErr);
}
}
return undefined;
}
} catch (err) {
logError(`Error upserting document ${docToSave.id}:`, err);
return undefined;
}
});
}
```

File: src/search/indexEventHandler.ts (L9-87)

```ts
const DEBOUNCE_DELAY = 5000; // 5 seconds
export class IndexEventHandler {
private debounceTimer: number | null = null;
private lastActiveFile: TFile | null = null;
private lastActiveFileMtime: number | null = null;
constructor(
private app: App,
private indexOps: IndexOperations,
private dbOps: DBOperations
) {
this.initializeEventListeners();
}
private initializeEventListeners() {
if (getSettings().debug) {
console.log("Copilot Plus: Initializing event listeners");
}
this.app.workspace.on("active-leaf-change", this.handleActiveLeafChange);
this.app.vault.on("delete", this.handleFileDelete);
}
private handleActiveLeafChange = async (leaf: any) => {
if (Platform.isMobile && getSettings().disableIndexOnMobile) {
return;
}
const currentChainType = getChainType();
if (currentChainType !== ChainType.COPILOT_PLUS_CHAIN) {
return;
}
// Get the previously active file that we need to check
const fileToCheck = this.lastActiveFile;
const previousMtime = this.lastActiveFileMtime;
// Update tracking for the new active file
const currentView = leaf?.view;
this.lastActiveFile = currentView instanceof MarkdownView ? currentView.file : null;
this.lastActiveFileMtime = this.lastActiveFile?.stat?.mtime ?? null;
// If there was no previous file or it's the same as current, do nothing
if (!fileToCheck || fileToCheck === this.lastActiveFile) {
return;
}
// Safety check for file stats and mtime
if (!fileToCheck?.stat?.mtime || previousMtime === null) {
return;
}
// Only process markdown files that match inclusion/exclusion patterns
if (fileToCheck.extension === "md") {
const { inclusions, exclusions } = getMatchingPatterns();
const shouldProcess = shouldIndexFile(fileToCheck, inclusions, exclusions);
// Check if file was modified while it was active
const wasModified = previousMtime !== null && fileToCheck.stat.mtime > previousMtime;
if (shouldProcess && wasModified) {
this.debouncedReindexFile(fileToCheck);
}
}
};
private debouncedReindexFile = (file: TFile) => {
if (this.debounceTimer !== null) {
window.clearTimeout(this.debounceTimer);
}
this.debounceTimer = window.setTimeout(() => {
if (getSettings().debug) {
console.log("Copilot Plus: Triggering reindex for file ", file.path);
}
this.indexOps.reindexFile(file);
this.debounceTimer = null;
}, DEBOUNCE_DELAY);
};
```
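To make the partitioning behavior cited above concrete, here is a standalone restatement of the hash (the document IDs below are invented; the real plugin uses content hashes as IDs):

```ts
// djb2-style hash mod N, as in chunkedStorage.assignDocumentToPartition:
// document IDs spread roughly uniformly across partitions, but partition 0
// always also carries the global metadata, so it stays the largest file.
function assignToPartition(docId: string, totalPartitions: number): number {
  const hash = Array.from(docId).reduce(
    (acc, ch) => (acc << 5) - acc + ch.charCodeAt(0), // acc * 31 + char code
    0
  );
  return Math.abs(hash) % totalPartitions;
}

for (const id of ["notes/a.md#0", "notes/b.md#0", "daily/2024-01-01.md#3"]) {
  console.log(id, "->", assignToPartition(id, 8)); // some value in 0..7
}
```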
@logancyang

# Single-File vs Batch Indexing Differences

The Obsidian Copilot plugin's single-file and batch indexing processes differ primarily in how and when index data is persisted to disk, not in where it's stored. Both use the same underlying Orama database and ChunkedStorage system.

## Storage Location

Both processes store data in the same location — the Orama database managed by DBOperations.

## Key Difference: Deferred vs Immediate Persistence

Single-File Indexing (
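In terms of the code cited elsewhere in this thread, that deferred-vs-immediate contrast reduces to markUnsavedChanges versus saveDB — a minimal sketch, with the interface narrowed for illustration:

```ts
// From the code cited in this thread: single-file reindexing only marks the
// in-memory Orama DB dirty (dbOps.markUnsavedChanges()), deferring the disk
// write, while batch indexing awaits dbOps.saveDB() at every checkpoint,
// which rewrites the partitions via ChunkedStorage.
interface DBOps {
  markUnsavedChanges(): void;
  saveDB(): Promise<void>;
}

function afterSingleFileUpsert(dbOps: DBOps): void {
  dbOps.markUnsavedChanges(); // RAM only; persisted later (e.g. final save)
}

async function afterBatchCheckpoint(dbOps: DBOps): Promise<void> {
  await dbOps.saveDB(); // immediate full write of every partition
}
```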
@logancyang To modify the indexing process so the index stays in RAM during a run:

## Key Modification: Remove Checkpoint Saves

The primary change is to remove the checkpoint save logic that currently saves the database multiple times during the indexing process. Currently, the database is saved whenever the indexed count crosses a checkpoint threshold (see the first citation below). This checkpoint logic should be removed or commented out. The function will then work similarly to the single-file indexing path.

## How It Works

The system already operates primarily in RAM:
## Recommended Implementation

Remove this checkpoint code block: 6

Keep the final save at the end: 7

## Notes
## Citations

File: src/search/indexOperations.ts (L163-172)

```ts
// Calculate if we've crossed a checkpoint threshold
const previousCheckpoint = Math.floor(
(this.state.indexedCount - batch.length) / this.checkpointInterval
);
const currentCheckpoint = Math.floor(this.state.indexedCount / this.checkpointInterval);
if (currentCheckpoint > previousCheckpoint) {
await this.dbOps.saveDB();
console.log("Copilot index checkpoint save completed.");
}
```

File: src/search/indexOperations.ts (L189-201)

```ts
setTimeout(() => {
this.dbOps
.saveDB()
.then(() => {
logInfo("Copilot index final save completed.");
this.dbOps.checkIndexIntegrity().catch((err) => {
logError("Background integrity check failed:", err);
});
})
.catch((err) => {
logError("Background save failed:", err);
});
}, 100); // 100ms delay
```

File: src/search/indexOperations.ts (L571-571)

```ts
this.dbOps.markUnsavedChanges();
```

File: src/search/dbOperations.ts (L116-147)

```ts
async saveDB() {
if (Platform.isMobile && getSettings().disableIndexOnMobile) {
return;
}
if (!this.oramaDb || !this.chunkedStorage) {
// Instead of throwing immediately, try to initialize.
// Crucial for new user onboarding.
try {
await this.initializeDB(await EmbeddingsManager.getInstance().getEmbeddingsAPI());
// If still not initialized after attempt, then throw
if (!this.oramaDb || !this.chunkedStorage) {
throw new CustomError("Orama database not found.");
}
} catch (error) {
logError("Failed to initialize database during save:", error);
throw new CustomError("Failed to initialize and save database.");
}
}
try {
await this.chunkedStorage.saveDatabase(this.oramaDb);
this.hasUnsavedChanges = false;
if (getSettings().debug) {
logInfo("Orama database saved successfully at:", this.dbPath);
}
} catch (error) {
logError(`Error saving Orama database:`, error);
throw error;
}
}
```

File: src/search/dbOperations.ts (L362-416)

```ts
async upsert(docToSave: any): Promise<any | undefined> {
if (!this.oramaDb) throw new Error("DB not initialized");
const db = this.oramaDb;
// Use mutex to make the operation atomic
return await this.upsertMutex.runExclusive(async () => {
try {
// Calculate partition first
const partition = this.chunkedStorage?.assignDocumentToPartition(
docToSave.id,
getSettings().numPartitions
);
// Check if document exists
const existingDoc = await search(db, {
term: docToSave.id,
properties: ["id"],
limit: 1,
});
if (existingDoc.hits.length > 0) {
await remove(db, existingDoc.hits[0].id);
}
// Insert into the assigned partition
try {
await insert(db, docToSave);
logInfo(
`${existingDoc.hits.length > 0 ? "Updated" : "Inserted"} document ${docToSave.id} in partition ${partition}`
);
this.markUnsavedChanges();
return docToSave;
} catch (insertErr) {
logError(
`Failed to ${existingDoc.hits.length > 0 ? "update" : "insert"} document ${docToSave.id}:`,
insertErr
);
// If we removed an existing document but failed to insert the new one,
// we should try to restore the old document
if (existingDoc.hits.length > 0) {
try {
await insert(db, existingDoc.hits[0].document);
} catch (restoreErr) {
logError("Failed to restore previous document version:", restoreErr);
}
}
return undefined;
}
} catch (err) {
logError(`Error upserting document ${docToSave.id}:`, err);
return undefined;
}
});
}
```

File: src/search/chunkedStorage.ts (L106-224)

```ts
async saveDatabase(db: Orama<any>): Promise<void> {
try {
const rawData: RawData = await save(db);
const numPartitions = getSettings().numPartitions;
if (numPartitions === 1) {
const legacyPath = this.getLegacyPath();
await this.ensureDirectoryExists(legacyPath);
await this.app.vault.adapter.write(
legacyPath,
JSON.stringify({
...rawData,
schema: db.schema,
})
);
return;
}
// NOTE: Orama RawData docs can be either an array or an object
const docsData = (rawData as any).docs?.docs;
const rawDocs = Array.isArray(docsData) ? docsData : Object.values(docsData || {});
if (getSettings().debug) {
console.log(`Starting save with ${rawDocs.length ?? 0} total documents`);
}
if (!rawDocs || rawDocs.length === 0) {
const metadata: ChunkMetadata = {
numPartitions,
vectorLength: db.schema.embedding.match(/\d+/)[0],
schema: db.schema,
lastModified: Date.now(),
documentPartitions: {},
};
const metadataPath = this.getMetadataPath();
await this.ensureDirectoryExists(metadataPath);
await this.app.vault.adapter.write(metadataPath, JSON.stringify(metadata));
if (getSettings().debug) {
console.log("Saved empty database state");
}
return;
}
const partitions = this.distributeDocumentsToPartitions(rawDocs, numPartitions);
const metadata: ChunkMetadata = {
numPartitions,
vectorLength: db.schema.embedding.match(/\d+/)[0],
schema: db.schema,
lastModified: Date.now(),
documentPartitions: Object.fromEntries(
rawDocs.map((doc: any) => [doc.id, this.assignDocumentToPartition(doc.id, numPartitions)])
),
};
await this.saveMetadata(metadata);
// Create global data object (excluding partitioned fields)
const globalData = {
...rawData,
docs: { docs: {}, count: 0 },
index: {
...(rawData as any).index,
vectorIndexes: undefined,
},
};
// Save partitions
for (const [partitionIndex, docs] of partitions.entries()) {
// Create partition-specific data
const partitionData = {
index: {
vectorIndexes: {
embedding: {
size: (rawData as any).index.vectorIndexes.embedding.size,
vectors: Object.fromEntries(
Object.entries((rawData as any).index.vectorIndexes.embedding.vectors).filter(
([id]) => docs.some((doc) => doc.id === id)
)
),
},
},
},
docs: {
docs: Object.fromEntries(docs.map((doc, index) => [(index + 1).toString(), doc])),
count: docs.length,
},
};
// For first partition, include global data
const finalPartitionData =
partitionIndex === 0
? {
...globalData,
docs: partitionData.docs,
index: {
...globalData.index,
vectorIndexes: partitionData.index.vectorIndexes,
},
}
: partitionData;
const chunkPath = this.getChunkPath(partitionIndex);
await this.ensureDirectoryExists(chunkPath);
await this.app.vault.adapter.write(chunkPath, JSON.stringify(finalPartitionData));
if (getSettings().debug) {
console.log(`Saved partition ${partitionIndex + 1}/${numPartitions}`);
}
}
if (getSettings().debug) {
console.log("Saved all partitions");
}
} catch (error) {
console.error(`Error saving database:`, error);
throw new CustomError(`Failed to save database: ${error.message}`);
}
}
```
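If crash safety still matters, a gentler variant of this recommendation is to gate the checkpoint save behind a setting rather than delete it. A sketch of a drop-in replacement for the cited checkpoint block, where ramOnlyIndexing is a hypothetical flag that does not exist in the plugin today:

```ts
// Sketch: skip checkpoint saves only when a (hypothetical) ramOnlyIndexing
// setting is enabled, instead of removing the block outright. The final save
// and integrity check at the end of indexing stay untouched.
if (!getSettings().ramOnlyIndexing) { // hypothetical flag, not in the plugin
  const previousCheckpoint = Math.floor(
    (this.state.indexedCount - batch.length) / this.checkpointInterval
  );
  const currentCheckpoint = Math.floor(this.state.indexedCount / this.checkpointInterval);
  if (currentCheckpoint > previousCheckpoint) {
    await this.dbOps.saveDB();
    console.log("Copilot index checkpoint save completed.");
  }
}
```

This keeps crash recovery as an explicit user trade-off rather than silently removing it.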
We have moved over to index-free search in v3. Our "semantic search" option is a backward-compatible mode. Having the vector DB inside the Electron/browser environment has its limitations. The right way for larger vaults is to use agentic search and/or a dedicated standalone local vector DB.
With the upcoming "self-host" mode we will have a much more scalable local vector db that is cross-platform but separate from the Obsidian environment.
FYI this is on our website.