feat(silo): add a MutationProfile filter#1189

Open
Taepper wants to merge 1 commit into main from 1179-mutation-profile

Conversation

@Taepper
Collaborator

@Taepper Taepper commented Mar 2, 2026

resolves #1179

Summary

This adds a MutationProfile filter to silo. The behavior of this filter is outlined in #1179.

PR Checklist

  • All necessary documentation has been adapted or there is an issue to do so.
  • The implemented feature is covered by an appropriate test.

@github-actions
Contributor

github-actions bot commented Mar 2, 2026

This is a preview of the changelog of the next release. If this branch is not up-to-date with the current main branch, the changelog may not be accurate. Rebase your branch on the main branch to get the most accurate changelog.

Note that this might contain changes that are on main, but not yet released.

Changelog:

0.11.1 (2026-03-11)

Features

  • silo: add a MutationProfile filter (9832a96)
  • silo: allow sequence-column inputs to be zstd-compressed in base64 format (1f2bbc5)

Bug Fixes

  • silo: let N (X for amino acids) also code for - (a081630)


Copilot AI left a comment

Pull request overview

Adds a new MutationProfile filter expression to SILO (per #1179) and introduces an optimized compilation path for large N-Of expressions over sequence positions to avoid repeated vertical-index lookups. This enables efficient “distance to profile” queries that expand into many per-position symbol conditions.
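To make the Not(N-Of(...)) form concrete, here is a minimal sketch of the equivalence it relies on. It assumes that `distance` means the maximum number of positions at which a sequence may differ from the profile; the authoritative definition is in #1179. The helper names are hypothetical, and ambiguity codes are ignored for brevity.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts positions where `sequence` differs from `profile`.
// Symbols are compared exactly; ambiguity codes are not handled in this sketch.
std::size_t countMismatches(const std::string& sequence, const std::string& profile) {
   std::size_t mismatches = 0;
   for (std::size_t i = 0; i < profile.size() && i < sequence.size(); ++i) {
      if (sequence[i] != profile[i]) {
         ++mismatches;
      }
   }
   return mismatches;
}

// "At least n of the per-position mismatch conditions hold."
bool atLeastNMismatches(const std::string& sequence, const std::string& profile, std::size_t n) {
   return countMismatches(sequence, profile) >= n;
}

// Assumed semantics: a sequence is within `distance` of the profile exactly when
// it is NOT the case that at least (distance + 1) positions mismatch.
bool matchesProfile(const std::string& sequence, const std::string& profile, std::size_t distance) {
   return !atLeastNMismatches(sequence, profile, distance + 1);
}
```

Under this assumption, a profile over a whole genome expands into one mismatch condition per position, which is why the size of the N-Of expression matters for performance.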

Changes:

  • Introduces NucleotideMutationProfile / AminoAcidMutationProfile expression that rewrites into an N-Of/Not form.
  • Adds a single-pass vertical-index DP helper (VerticalSequenceIndex::buildNOfDpTable) and a new NOf compile fast-path for SymbolInSet children on the same sequence.
  • Adds docs, integration tests, and performance benchmarks/utilities for profiling the new behavior.
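For readers unfamiliar with threshold DPs over bitmaps, the following is a generic sketch of the idea, not the code in this PR. It assumes each child condition is available as a bitmap of matching rows; `RowMask` and `atLeastNOf` are hypothetical names, and a plain 64-bit mask stands in for whatever bitmap type the index actually uses.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Each child condition is a bitmask over up to 64 rows: bit r is set when row r
// satisfies that condition. A real index would use compressed bitmaps instead.
using RowMask = std::uint64_t;

// Returns the rows that satisfy at least `n` of the `children` conditions.
// dp[k] holds the rows known to satisfy at least k of the children seen so far.
RowMask atLeastNOf(const std::vector<RowMask>& children, std::size_t n, RowMask all_rows) {
   std::vector<RowMask> dp(n + 1, 0);
   dp[0] = all_rows;
   for (const RowMask child : children) {
      // Iterate k downwards so each child is counted at most once per row.
      for (std::size_t k = n; k >= 1; --k) {
         dp[k] |= dp[k - 1] & child;
      }
   }
   return dp[n];
}
```

Each child is visited once and only `n + 1` bitmaps are kept, so the cost grows with the number of children times the threshold rather than with the number of child combinations.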

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| src/silo/query_engine/filter/expressions/mutation_profile.h | Defines the new MutationProfile expression template and JSON parsing hook. |
| src/silo/query_engine/filter/expressions/mutation_profile.cpp | Implements profile construction (querySequence / sequenceId / mutations) and rewrite to Not(N-Of(...)). |
| src/silo/query_engine/filter/expressions/expression.cpp | Registers new expression types: NucleotideMutationProfile and AminoAcidMutationProfile. |
| src/silo/query_engine/filter/expressions/symbol_in_set.h | Adds getters needed for the new NOf compilation optimization. |
| src/silo/query_engine/filter/expressions/nof.cpp | Adds optimized compile path that batches vertical-index access and inlines the threshold DP. |
| src/silo/storage/column/vertical_sequence_index.h | Declares PositionQuery and buildNOfDpTable DP helper. |
| src/silo/storage/column/vertical_sequence_index.cpp | Implements buildNOfDpTable with a forward scan over vertical_bitmaps. |
| src/silo/test/mutation_profile.test.cpp | Adds integration tests for NucleotideMutationProfile behavior and error cases. |
| documentation/query_documentation.md | Documents NucleotideMutationProfile and AminoAcidMutationProfile JSON formats and semantics. |
| performance/sequence_generator.h | Adds shared benchmark utilities for generating synthetic sequences/reads and initializing DBs. |
| performance/nof_sequence_filter.cpp | Adds a benchmark targeting the large-N-Of optimization via MutationProfile. |
| performance/many_short_read_filters.cpp | Refactors to reuse sequence_generator.h. |
| performance/CMakeLists.txt | Ensures benchmarks can include performance headers and adds the new benchmark target. |

@Taepper Taepper force-pushed the 1179-mutation-profile branch from 3c324c7 to d15c000 Compare March 3, 2026 09:30
@Taepper Taepper force-pushed the 1179-mutation-profile branch from d15c000 to 9832a96 Compare March 11, 2026 14:08
}

template <typename SymbolType>
std::unique_ptr<Expression> MutationProfile<SymbolType>::rewrite(
Contributor

After this, we get a gigantic log message:

[2026-03-18 08:52:33.409] [logger] [debug] [database.cpp:531] Request Id [7abbed49-a609-4294-9941-4c4173da3621] - Filter after rewrite for partition 0: !([-2147483647-of:(main:symbol at position 1 in {-, C, G, T, Y, S, K, B}), (main:symbol at position 2 in {-, A, C, G, R, S, M, V}), (... goes on a bit for every position in the genome)

Does it make sense to do something about that?
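One possible option, sketched here purely as an illustration and not taken from the PR, would be to cap the length of the expression string before it reaches the debug log. `truncateForLog` is a hypothetical helper name, and the cut-off length would be a choice for the maintainers.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not part of the PR: shortens a long expression string
// for logging by keeping the first `max_length` characters and noting how
// many characters were dropped.
std::string truncateForLog(const std::string& text, std::size_t max_length) {
   if (text.size() <= max_length) {
      return text;
   }
   const std::size_t omitted = text.size() - max_length;
   return text.substr(0, max_length) + "... (" + std::to_string(omitted) + " more characters)";
}
```

This keeps the start of the rewritten filter visible for debugging while bounding the size of each log line.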

"filterExpression": {
"type": "NucleotideMutationProfile",
"distance": 0,
"mutations": [{"position": 1, "symbol": "C"}]
Contributor

Should we add a test where a mutation's position is out of bounds?

Development

Successfully merging this pull request may close these issues.

Add MutationProfile filter

3 participants