Skip to content

feat: Store JSON occurrences in sqlite #321

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

mtaran
Copy link

@mtaran mtaran commented Jun 12, 2025

This is a follow-up to #319, which implemented the initial SCIP->sqlite conversion. It inserted the occurrences information into the DB as a compressed protobuf for reasons of compactness. But this is not conducive to querying or exploring the data in ad hoc ways. This PR changes the format to store a JSON representation of the occurrences instead, since JSON is natively supported by sqlite. The index I'm working with only increased from 18 MB to 33 MB after this change, which to me feels like it's worth the trade-off.

Let me know what you think!

Test plan

Updated the unit tests.

@mtaran
Copy link
Author

mtaran commented Jun 13, 2025

FYI: @varungandhi-src, @antonsviridov-src

@varungandhi-src
Copy link
Contributor

varungandhi-src commented Jun 17, 2025

The index I'm working with only increased from 18 MB to 33 MB after this change, which to me feels like it's worth the trade-off.

We have to work with multi-GB indexes sometimes, so size and DB creation speed are both definitely concerns for us. I would need to do a test of this patch on some of the larger indexes we have.

Could you share what kinds of queries you're interested in running? E.g. if we provided a CLI tool to do things such as finding definitions, finding references etc., would that be enough?

I wonder if it makes sense to put this (or the compressed version) behind a separate flag... 🤔

@mtaran
Copy link
Author

mtaran commented Jun 17, 2025

The specific thing I was trying to do is to make a dependency tree of the dependencies between code elements that use a certain API. If there was a CLI for finding references, I'd be able to use it to find where that API is used. But then I'd also need a way to find how those files (or ideally classes etc., but semanticdb didn't provide that info 😢) relate to each other, e.g. by finding what cross-references there are within that set of files.

When I gave Claude the schema etc., it was able to generate the correct SQL for me without issues (which is a great benefit of actually using a SQL DB for this sort of thing, and more specifically of using JSON for reference info):

WITH relevant_files AS (
    SELECT DISTINCT d.id, d.relative_path
    FROM documents d
    JOIN chunks c ON d.id = c.document_id
    JOIN mentions m ON c.id = m.chunk_id
    JOIN global_symbols gs ON m.symbol_id = gs.id
    WHERE gs.symbol LIKE '%/MyAPI#%' 
)
SELECT DISTINCT 
    d_dependent.relative_path as dependent_file,
    d_provider.relative_path as dependency_file
FROM relevant_files d_dependent
JOIN chunks c_dependent ON d_dependent.id = c_dependent.document_id
JOIN mentions m_dependent ON c_dependent.id = m_dependent.chunk_id
JOIN global_symbols gs ON m_dependent.symbol_id = gs.id
JOIN mentions m_provider ON gs.id = m_provider.symbol_id
JOIN chunks c_provider ON m_provider.chunk_id = c_provider.id
JOIN relevant_files d_provider ON c_provider.document_id = d_provider.id
WHERE m_dependent.role != 1  -- dependent file uses/references the symbol
  AND m_provider.role = 1     -- provider file defines the symbol
  AND d_dependent.id != d_provider.id  -- exclude self-dependencies
ORDER BY dependent_file, dependency_file;

I didn't expect that Sourcegraph would be using this sqlite version of the indexes, since I saw the original PR was marked as resolving #233 which just talked about external usage of this kind of DB as a convenient interchange format. IMHO using JSON better fulfils that specific purpose, but it's very understandable that Sourcegraph has different priorities internally.

I'm happy to use my fork if I ever need to, but it'd definitely be convenient if this JSON form of the DB were in the mainline CLI. Perhaps the DB could contain both occurrences_proto BLOB & occurrences_json TEXT fields, and the CLI could just have a --occurrences-format flag to control which gets populated?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants