Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Less stringent alignment and seed filtering #453

Draft
wants to merge 205 commits into
base: master
Choose a base branch
from
Draft
Show file tree
Hide file tree
Changes from 25 commits
Commits
Show all changes
205 commits
Select commit Hold shift + click to select a range
94658fa
by default, don't put a limit on the number of alignments
hmusta Jun 5, 2023
764acc9
fix server errors
hmusta Jun 6, 2023
3ded0e8
fix simple_align integration tests
hmusta Jun 6, 2023
76a9d4d
cleanup
hmusta Jun 6, 2023
98332f1
minor
hmusta Jun 6, 2023
42b7dc2
cleanup
hmusta Jun 6, 2023
7115837
Update config.cpp
hmusta Jun 7, 2023
dbd28ee
if a label is discarded, add the seed as an alignment without extension
hmusta Jun 7, 2023
271fdba
if a seed is filtered out, report the filtered out part as an alignment
hmusta Jun 9, 2023
6978902
respect min_path_score
hmusta Jun 12, 2023
4bbd320
Add pandas as a requirement
hmusta Jun 12, 2023
e8855ef
test
hmusta Jun 13, 2023
7323418
handle annotations for dummy nodes:
hmusta Jun 14, 2023
2203f08
annotate dummy nodes during alignment
hmusta Jun 14, 2023
b5cacd6
fixes
hmusta Jun 15, 2023
d9bf01c
fix compilation issues on clang
hmusta Jun 15, 2023
e7c9921
fixes
hmusta Jun 15, 2023
92946b0
minor
hmusta Jun 15, 2023
b222c78
only BASIC graph for protein tests
hmusta Jun 15, 2023
f404bb3
avoid failing structural binding capture in lambda
hmusta Jun 15, 2023
b124536
don't fetch annotations for dummy node neighbours if they've e alread…
hmusta Jun 15, 2023
405472f
fix unit test for proteins
hmusta Jun 15, 2023
bf251be
helper function for merging seeds into maximal unique matches
hmusta Jun 15, 2023
6394160
merge labeled seeds
hmusta Jun 15, 2023
40cd62e
warn instead of assert failure when using masked DBGSuccinct
hmusta Jun 16, 2023
fabcc51
fix
hmusta Jun 16, 2023
ebeacaf
more reporting
hmusta Jun 16, 2023
b336f1e
fix
hmusta Jun 16, 2023
488a793
fix unit test
hmusta Jun 16, 2023
7e1012f
disable integration test
hmusta Jun 16, 2023
8ef3323
fix seed filtering after extension
hmusta Jun 16, 2023
d156dc2
ensure seeds are correctly sorted after filtering
hmusta Jun 16, 2023
5dbacf2
simplify suffix seeding
hmusta Jun 16, 2023
c1bee7e
fix
hmusta Jun 16, 2023
45af9ae
cleanup
hmusta Jun 19, 2023
f33f867
fixes
hmusta Jun 19, 2023
2b94995
fix for canonical graphs
hmusta Jun 19, 2023
f591394
cleanup
hmusta Jun 19, 2023
f5d3a2a
cleanup
hmusta Jun 19, 2023
15fb3ec
redo suffix seeding on primary graphs
hmusta Jun 19, 2023
9f00719
fix computation of number of matching characters. include first k-len…
hmusta Jun 19, 2023
824484b
remove redundant seeds
hmusta Jun 19, 2023
ab0f5f6
find fewer seeds
hmusta Jun 19, 2023
2e8e49d
find fewer seeds
hmusta Jun 19, 2023
beae397
add assertion
hmusta Jun 19, 2023
85e61cf
fixed
hmusta Jun 19, 2023
d7eb8c9
fix
hmusta Jun 20, 2023
4949bf4
avoid creating redundant suffix seeds
hmusta Jun 20, 2023
1b6c572
more seed filtering
hmusta Jun 20, 2023
c2b3e79
cleanup
hmusta Jun 20, 2023
8aa04b9
optim
hmusta Jun 20, 2023
c9fb9fb
fix
hmusta Jun 20, 2023
d945eed
print errors when setUpClass happens
hmusta Jun 20, 2023
e5af33a
fix
hmusta Jun 20, 2023
5ab52be
cleanup alignment aggregator
hmusta Jun 21, 2023
a1af22a
minor
hmusta Jun 21, 2023
bfa7015
fix
hmusta Jun 21, 2023
8d0afea
fix seed filtering
hmusta Jun 21, 2023
00d52ce
fix
hmusta Jun 21, 2023
54ddb56
extra asserts. ensure that seeds are in correct order
hmusta Jun 21, 2023
148d49e
minor
hmusta Jun 21, 2023
79504fb
off by one
hmusta Jun 21, 2023
96c5b6c
test
hmusta Jun 21, 2023
9987a00
test
hmusta Jun 22, 2023
2bd0d00
Merge remote-tracking branch 'origin/master' into hm/aln_alt
hmusta Jun 23, 2023
8ba12a4
rewrite fetching
hmusta Jun 23, 2023
ffa8ce7
cleanup
hmusta Jun 23, 2023
a999508
cleanup
hmusta Jun 23, 2023
9258561
more cleanup
hmusta Jun 23, 2023
3ee4efd
fix for coords in CANONICAL graphs
hmusta Jun 23, 2023
044d6b7
minor
hmusta Jun 23, 2023
d26221a
minor
hmusta Jun 23, 2023
e79f643
minor
hmusta Jun 23, 2023
8b7d64b
Add checks
hmusta Jun 23, 2023
f2698fb
check coordinate consistency when merging seeds
hmusta Jun 23, 2023
0e874ec
don't seed both orientations in a CANONICAL graph
hmusta Jun 26, 2023
9b8e7b7
Merge remote-tracking branch 'origin/master' into hm/aln_alt
hmusta Jun 26, 2023
568ced2
filter out low-complexity sub-k seeds
hmusta Jun 26, 2023
2286007
move low complexity filter outside
hmusta Jun 26, 2023
65476f8
more messages
hmusta Jun 26, 2023
ad879da
less verbose messages
hmusta Jun 26, 2023
d6a5eb7
disable unit test
hmusta Jun 27, 2023
d5c0b89
extra checks
hmusta Jun 27, 2023
a6ecdd2
fix
hmusta Jun 27, 2023
714da0c
less backtracking for unannotated graphs
hmusta Jun 27, 2023
f0f73f1
don't fetch labels for low complexity seeds
hmusta Jun 28, 2023
dc2de87
less prefix seeding
hmusta Jun 28, 2023
aa999c9
find fewer seeds
hmusta Jun 29, 2023
7f6ffb2
merge annotations better when forming mums
hmusta Jun 30, 2023
c026a3e
unneeded parameter
hmusta Jul 1, 2023
bff8daf
test
hmusta Jul 1, 2023
dcc6b3d
fix
hmusta Jul 3, 2023
8dc496c
minor, change default params
hmusta Jul 3, 2023
35dad6b
include more extensions if they add more labels
hmusta Jul 3, 2023
45cdede
fixes
hmusta Jul 4, 2023
d643222
fix
hmusta Jul 4, 2023
493a78d
don't report seeds filtered out after extensions
hmusta Jul 4, 2023
472cae3
fix unit test
hmusta Jul 4, 2023
63f20dd
fix labels
hmusta Jul 4, 2023
86bbe7a
simplify suffix seeding
hmusta Jul 4, 2023
441e4fd
path simplification
hmusta Jul 4, 2023
f80cede
Revert "path simplification"
hmusta Jul 4, 2023
e473e44
Revert "simplify suffix seeding"
hmusta Jul 4, 2023
50b39d2
cleanup suffix seeding
hmusta Jul 6, 2023
cadb30f
more filtration
hmusta Jul 6, 2023
e3fe498
Revert "more filtration"
hmusta Jul 6, 2023
e01668d
far fewer seeds
hmusta Jul 7, 2023
8e55c6f
fix
hmusta Jul 7, 2023
41456d7
fix
hmusta Jul 7, 2023
73c8185
fewer seeds
hmusta Jul 7, 2023
0b1d054
fix
hmusta Jul 7, 2023
13ced4c
cleanup. toggle shorter suffix seeds with a flag
hmusta Jul 10, 2023
8b65639
Add in heuristic chainer
hmusta Jul 11, 2023
548aed6
t1
hmusta Jul 11, 2023
c2d7129
t2
hmusta Jul 11, 2023
aada964
t3
hmusta Jul 11, 2023
96f5918
t4
hmusta Jul 11, 2023
b2969d8
t5
hmusta Jul 11, 2023
f005460
t6
hmusta Jul 11, 2023
7b0e442
t5
hmusta Jul 11, 2023
b0fee9d
fix
hmusta Jul 11, 2023
a35ebc2
fix compilation in clang
hmusta Jul 11, 2023
5f2c791
find k-mer seeds if present
hmusta Jul 11, 2023
a38c13f
disable some tests for protein graphs
hmusta Jul 11, 2023
b1d3161
minor
hmusta Jul 11, 2023
fb65f9e
minor
hmusta Jul 11, 2023
c67b38f
fix
hmusta Jul 11, 2023
6b27a25
allow splice if sharing at least one label
hmusta Jul 11, 2023
30e376a
extra check
hmusta Jul 11, 2023
8819044
t
hmusta Jul 12, 2023
872b769
Revert "extra check"
hmusta Jul 12, 2023
d4b3238
fewer redundant alignments
hmusta Jul 12, 2023
093f641
minor
hmusta Jul 12, 2023
e6dbd55
fixes
hmusta Jul 13, 2023
b2a3740
last try
hmusta Jul 13, 2023
8771730
try this
hmusta Jul 13, 2023
9603009
this works
hmusta Jul 14, 2023
9222ca4
cleanup
hmusta Jul 14, 2023
768f283
anchor filtering
hmusta Jul 14, 2023
e931660
fix
hmusta Jul 14, 2023
87f7c8d
cleanup
hmusta Jul 14, 2023
ba3467e
cleanup
hmusta Jul 14, 2023
26ddc6f
cleanup
hmusta Jul 14, 2023
b400284
Added extra check
hmusta Jul 14, 2023
f9c6936
extra checks
hmusta Jul 14, 2023
a22d910
minor
hmusta Jul 14, 2023
7689276
fixes
hmusta Jul 14, 2023
32fa3f4
fix
hmusta Jul 14, 2023
83d378b
discard chains all from the same alignment
hmusta Jul 14, 2023
3bbbc5c
fix corner cases
hmusta Jul 15, 2023
3ba2a43
minor
hmusta Jul 15, 2023
e5ec2a3
minor
hmusta Jul 15, 2023
337684a
pass score so far
hmusta Jul 15, 2023
feb72db
test
hmusta Jul 15, 2023
ad16920
fix
hmusta Jul 15, 2023
1d39d29
fixes
hmusta Jul 15, 2023
b0f463d
callback full alignments
hmusta Jul 15, 2023
ea16efb
fix scoring
hmusta Jul 15, 2023
793f5d1
discard chains with short front
hmusta Jul 15, 2023
4943979
only consider chains if they're better than input alignments
hmusta Jul 15, 2023
f39e03f
fewer suffix ranges
hmusta Jul 16, 2023
a186a3d
fixes
hmusta Jul 16, 2023
41279d0
lots of fixes
hmusta Jul 16, 2023
51ef02c
fixes
hmusta Jul 16, 2023
f2e5b74
better filtering
hmusta Jul 16, 2023
2d2400b
more filtering
hmusta Jul 16, 2023
00e6615
update repos before starting CI workflow
hmusta Jul 16, 2023
54b76ef
remove redundant checks
hmusta Jul 16, 2023
e238a24
only connect chains with subsets of labels
hmusta Jul 16, 2023
1d493d7
remove redundant
hmusta Jul 16, 2023
678469c
fix another corner case
hmusta Jul 16, 2023
c64bfc4
put chains at the front
hmusta Jul 16, 2023
d4b5535
keep adding chains as long as a label is extended
hmusta Jul 17, 2023
89b37b8
fix corner case
hmusta Jul 17, 2023
cbe9410
Merge remote-tracking branch 'origin/master' into hm/aln_alt
hmusta Jul 17, 2023
93dc25c
fixes for handling label diffs
hmusta Jul 18, 2023
bd675cd
chain one label at a time
hmusta Jul 18, 2023
32b2949
filter chains before constructing them
hmusta Jul 18, 2023
65b691d
cleanup
hmusta Jul 18, 2023
ee5f386
chain per label
hmusta Jul 18, 2023
12065ec
better seed merging
hmusta Jul 18, 2023
df970ae
fix compilation error
hmusta Jul 18, 2023
5a43e58
merge identical chains, fold in their labels
hmusta Jul 18, 2023
d11da11
merge identical chains if they have no labels
hmusta Jul 18, 2023
b2d36e6
don't apply the seed complexity filter to k-length seeds
hmusta Jul 19, 2023
7ec7a30
don't prematurely discard seeds if low coverage
hmusta Jul 19, 2023
f5b6f09
clean up suffix seeding
hmusta Jul 19, 2023
59fa85a
cleaned up suffix seeding
hmusta Jul 19, 2023
0ca80c7
extra check
hmusta Jul 19, 2023
2816905
fix test
hmusta Jul 19, 2023
e69d7d3
fix seed complexity check
hmusta Jul 19, 2023
1381823
fix complexity checking on rc strand
hmusta Jul 19, 2023
72ab02f
fix annotation fetching on CANONICAL mode graphs
hmusta Jul 19, 2023
eb0b609
change default seed length back to 19
hmusta Jul 24, 2023
2b3a748
Merge branch 'master' into hm/aln_alt
hmusta Jul 28, 2023
a0d4281
fix
hmusta Jul 28, 2023
220e388
fix
hmusta Aug 24, 2023
1c070d1
minor
hmusta Sep 7, 2023
d5f4026
fix for coordinates
hmusta Sep 7, 2023
0e934a8
for now, disable some checks
hmusta Sep 7, 2023
d1fc8c4
put vscode in gitignore
hmusta Sep 7, 2023
96a875c
minor cleanup
hmusta Sep 7, 2023
2684005
minor
hmusta Sep 7, 2023
5bb30fd
fewer checks in debug mode
hmusta Sep 7, 2023
5705309
remove superfluous asserts
hmusta Sep 8, 2023
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion metagraph/CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -288,7 +288,7 @@ target_include_directories(caches INTERFACE external-libraries/caches/include)

add_library(eigen INTERFACE)
target_include_directories(eigen INTERFACE external-libraries/eigen)
target_compile_options(eigen INTERFACE -Wno-unused-but-set-variable)
target_compile_options(eigen INTERFACE -Wno-unused-but-set-variable -Wno-unused-parameter)


set(Boost_USE_STATIC_LIBS ON)
Expand Down
68 changes: 34 additions & 34 deletions metagraph/integration_tests/test_align.py

Large diffs are not rendered by default.

8 changes: 1 addition & 7 deletions metagraph/src/cli/config/config.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -518,12 +518,6 @@ Config::Config(int argc, char *argv[]) {
print_usage_and_exit = true;
}

// only the best alignment is used in query
// |alignment_num_alternative_paths| must be set to 1
if (identity == QUERY && align_sequences
&& alignment_num_alternative_paths != 1)
print_usage_and_exit = true;

if (identity == ALIGN && infbase.empty())
print_usage_and_exit = true;

Expand Down Expand Up @@ -1063,7 +1057,7 @@ if (advanced) {
fprintf(stderr, "\t --align-only-forwards \t\t\tdo not align backwards from a seed on basic-mode graphs [off]\n");
fprintf(stderr, "\t --align-no-seed-complexity-filter \t\t\t\tdisable the filter for low-complexity seeds. [off]\n");
}
fprintf(stderr, "\t --align-alternative-alignments \t\tthe number of alternative paths to report per seed [1]\n");
fprintf(stderr, "\t --align-alternative-alignments \t\tthe maximum number of paths to report per seed [inf]\n");
fprintf(stderr, "\t --align-chain \t\t\t\tconstruct seed chains before alignment. Useful for long error-prone reads. [off]\n");
fprintf(stderr, "\t --align-post-chain \t\t\tperform multiple local alignments and chain them together into a single alignment. Useful for long error-prone reads. [off]\n");
fprintf(stderr, "\t \t\t\t\t\t\tA '$' inserted into the reference sequence indicates a jump in the graph.\n");
Expand Down
2 changes: 1 addition & 1 deletion metagraph/src/cli/config/config.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -124,7 +124,7 @@ class Config {
int32_t alignment_min_path_score = 0;
int32_t alignment_xdrop = 27;

size_t alignment_num_alternative_paths = 1;
size_t alignment_num_alternative_paths = std::numeric_limits<size_t>::max();
size_t alignment_min_seed_length = 19;
size_t alignment_max_seed_length = std::numeric_limits<size_t>::max();
size_t alignment_max_num_seeds_per_locus = 1000;
Expand Down
7 changes: 2 additions & 5 deletions metagraph/src/cli/query.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -1172,9 +1172,6 @@ int query_graph(Config *config) {

std::unique_ptr<align::DBGAlignerConfig> aligner_config;
if (config->align_sequences) {
assert(config->alignment_num_alternative_paths == 1u
&& "only the best alignment is used in query");

aligner_config.reset(new align::DBGAlignerConfig(
initialize_aligner_config(*config, *graph)
));
Expand Down Expand Up @@ -1233,10 +1230,10 @@ Alignment align_sequence(std::string *seq,
+ revised_config.left_end_bonus + revised_config.right_end_bonus;
auto alignments = aligner.align(*seq);

assert(alignments.size() <= 1 && "Only the best alignment is needed");

if (alignments.size()) {
// TODO: incorporate multiple alignments
auto &match = alignments[0];

// modify sequence for querying with the best alignment
if (match.get_offset()) {
*seq = graph.get_node_sequence(match.get_nodes()[0]).substr(0, match.get_offset())
Expand Down
6 changes: 3 additions & 3 deletions metagraph/src/cli/server.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -151,13 +151,13 @@ std::string process_align_request(const std::string &received_message,

config.alignment_num_alternative_paths = json.get(
"max_alternative_alignments",
(uint64_t)config.alignment_num_alternative_paths).asInt();
(uint64_t)config.alignment_num_alternative_paths).asUInt64();

if (!config.alignment_num_alternative_paths) {
// TODO: better throw an exception and send an error response to the client
logger->warn("[Server] Got invalid value of alignment_num_alternative_paths = {}."
" The default value of 1 will be used instead...", config.alignment_num_alternative_paths);
config.alignment_num_alternative_paths = 1;
" The default value of inf will be used instead...", config.alignment_num_alternative_paths);
config.alignment_num_alternative_paths = std::numeric_limits<size_t>::max();
}

config.alignment_min_exact_match
Expand Down
31 changes: 0 additions & 31 deletions metagraph/src/graph/alignment/aligner_aggregator.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -46,7 +46,6 @@ class AlignmentAggregator {
bool add_alignment(Alignment&& alignment);

score_t get_global_cutoff() const;
score_t get_score_cutoff(const Columns &labels) const;

std::vector<Alignment> get_alignments();

Expand All @@ -59,8 +58,6 @@ class AlignmentAggregator {
VectorMap<Column, PathQueue> path_queue_;
PathQueue unlabeled_;
ValCmp cmp_;

score_t get_label_cutoff(Column label) const;
};

// return true if the alignment was added
Expand Down Expand Up @@ -148,34 +145,6 @@ ::get_global_cutoff() const -> score_t {
return cur_max > 0 ? cur_max * config_.rel_score_cutoff : cur_max;
}

// TODO: define it the same way as in get_global_cutoff()?
template <class AlignmentCompare>
inline auto AlignmentAggregator<AlignmentCompare>
::get_score_cutoff(const Vector<Column> &labels) const -> score_t {
assert(labels.size());

score_t global_min = get_global_cutoff();

score_t min_score = std::numeric_limits<score_t>::max();
for (Column label : labels) {
min_score = std::min(min_score, get_label_cutoff(label));
if (min_score < global_min)
return global_min;
}
return min_score;
}

template <class AlignmentCompare>
inline auto AlignmentAggregator<AlignmentCompare>
::get_label_cutoff(Column label) const -> score_t {
auto find = path_queue_.find(label);
return find == path_queue_.end()
|| find->second.size() < config_.num_alternative_paths
|| config_.post_chain_alignments
? config_.ninf
: find->second.minimum()->get_score();
}

template <class AlignmentCompare>
inline std::vector<Alignment> AlignmentAggregator<AlignmentCompare>::get_alignments() {
// move all alignments to one vector
Expand Down
147 changes: 113 additions & 34 deletions metagraph/src/graph/alignment/aligner_labeled.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -108,6 +108,11 @@ void LabeledExtender::flush() {
auto cur_labels = annotation_buffer_.get_labels(table_elem.node);
assert(cur_labels);

if (cur_labels->empty()) {
assert(table_elem.offset - this->seed_->get_offset() < graph_->get_k());
continue;
}

#ifndef NDEBUG
if (table[parent_i].offset >= 0
&& static_cast<size_t>(table[parent_i].offset) >= graph_->get_k() - 1) {
Expand Down Expand Up @@ -227,6 +232,13 @@ ::call_outgoing(node_index node,
auto next_labels = annotation_buffer_.get_labels(next);
assert(next_labels);

if (next_labels->empty()) {
assert(next_offset < graph_->get_k());
node_labels_.push_back(node_labels_[table_i]);
callback(next, c, score);
continue;
}

Columns intersect_labels;
std::set_intersection(columns.begin(), columns.end(),
next_labels->begin(), next_labels->end(),
Expand Down Expand Up @@ -263,6 +275,13 @@ ::call_outgoing(node_index node,
std::swap(base_coords, next_coords);
}

if (next_labels->empty()) {
assert(next_offset < graph_->get_k());
node_labels_.push_back(node_labels_[table_i]);
callback(next, c, score);
continue;
}

// check if at least one label has consistent coordinates
Columns intersect_labels;

Expand Down Expand Up @@ -454,7 +473,8 @@ ::LabeledAligner(const DeBruijnGraph &graph,
const DBGAlignerConfig &config,
const Annotator &annotator)
: DBGAligner<Seeder, Extender, AlignmentCompare>(graph, config),
annotation_buffer_(graph, annotator) {
annotation_buffer_(graph, annotator),
max_seed_length_(this->config_.max_seed_length) {
// do not use a global xdrop cutoff since we need separate cutoffs for each label
if (annotation_buffer_.has_coordinates()) {
logger->trace("Coordinates detected. Enabling seed chaining");
Expand All @@ -478,9 +498,12 @@ LabeledAligner<Seeder, Extender, AlignmentCompare>::~LabeledAligner() {
template <class Seeder, class Extender, class AlignmentCompare>
auto LabeledAligner<Seeder, Extender, AlignmentCompare>
::build_seeders(const std::vector<IDBGAligner::Query> &seq_batch,
const std::vector<AlignmentResults> &wrapped_seqs) const -> BatchSeeders {
const std::vector<AlignmentResults> &wrapped_seqs,
std::vector<std::pair<std::vector<Seed>, std::vector<Seed>>> &discarded_seeds) const -> BatchSeeders {
BatchSeeders seeders
= DBGAligner<Seeder, Extender, AlignmentCompare>::build_seeders(seq_batch, wrapped_seqs);
= DBGAligner<Seeder, Extender, AlignmentCompare>::build_seeders(seq_batch, wrapped_seqs, discarded_seeds);

assert(discarded_seeds.size() == seq_batch.size());

// now we're going to filter the seeds
logger->trace("Filtering seeds by label. Cur mem usage {} MB", get_curr_RSS() / 1e6);
Expand Down Expand Up @@ -532,7 +555,7 @@ ::build_seeders(const std::vector<IDBGAligner::Query> &seq_batch,
auto &[seeder, seeder_rc] = seeders[i];
auto &[seeds, num_matching] = counted_seeds[i];
if (seeds.size()) {
num_matching = filter_seeds(seeds);
num_matching = filter_seeds(seeds, discarded_seeds[i].first);
num_seeds_left += seeds.size();
}

Expand All @@ -542,7 +565,7 @@ ::build_seeders(const std::vector<IDBGAligner::Query> &seq_batch,
if (has_rc[i]) {
auto &[seeds, num_matching] = counted_seeds_rc[i];
if (seeds.size()) {
num_matching = filter_seeds(seeds);
num_matching = filter_seeds(seeds, discarded_seeds[i].second);
num_seeds_rc_left += seeds.size();
}

Expand Down Expand Up @@ -610,14 +633,16 @@ void matched_intersection(AIt a_begin, AIt a_end, BIt a_c_begin,

template <class Seeder, class Extender, class AlignmentCompare>
size_t LabeledAligner<Seeder, Extender, AlignmentCompare>
::filter_seeds(std::vector<Seed> &seeds) const {
::filter_seeds(std::vector<Seed> &seeds,
std::vector<Seed> &discarded_seeds) const {
if (seeds.empty())
return 0;

size_t query_size = seeds[0].get_clipping() + seeds[0].get_end_clipping()
+ seeds[0].get_query_view().size();

Columns labels;
Columns discarded_labels;

{
VectorMap<Column, sdsl::bit_vector> label_mapper;
Expand Down Expand Up @@ -660,47 +685,88 @@ ::filter_seeds(std::vector<Seed> &seeds) const {
DEBUG_LOG("Keeping {} / {} labels",
std::distance(label_counts.begin(), it), label_counts.size());

label_counts.erase(it, label_counts.end());
labels.reserve(it - label_counts.begin());
discarded_labels.reserve(label_counts.end() - it);

labels.reserve(label_counts.size());
for (const auto &[label, count] : label_counts) {
labels.push_back(label);
}
}

if (labels.empty()) {
seeds.clear();
return 0;
std::transform(label_counts.begin(), it, std::back_inserter(labels),
[&](const auto &a) { return a.first; });
std::transform(it, label_counts.end(), std::back_inserter(discarded_labels),
[&](const auto &a) { return a.first; });
}

std::sort(labels.begin(), labels.end());
std::sort(discarded_labels.begin(), discarded_labels.end());

for (size_t j = 0; j < seeds.size(); ++j) {
Seed &seed = seeds[j];
const std::vector<node_index> &nodes = seed.get_nodes();
assert(nodes.size() == 1);
if (!seed.label_encoder) {
seed.label_columns.clear();
auto [fetch_labels, fetch_coords] = annotation_buffer_.get_labels_and_coords(nodes[0]);
assert(fetch_labels);
if (annotation_buffer_.has_coordinates()) {
assert(fetch_coords);
matched_intersection(fetch_labels->begin(), fetch_labels->end(),
fetch_coords->begin(),
labels.begin(), labels.end(),
std::back_inserter(seed.label_columns),
std::back_inserter(seed.label_coordinates));
if (seed.get_offset() && seed.label_coordinates.size()) {
for (auto &tuple : seed.label_coordinates) {

// if a seed already as labels, use them
if (seed.label_encoder)
continue;

seed.label_columns.clear();
auto [fetch_labels, fetch_coords] = annotation_buffer_.get_labels_and_coords(nodes[0]);
assert(fetch_labels);
if (annotation_buffer_.has_coordinates()) {
Alignment::Columns discarded_columns;
Alignment::CoordinateSet discarded_coords;
bool added_discarded = false;
assert(fetch_coords);
assert(seed.label_coordinates.empty());
matched_intersection(fetch_labels->begin(), fetch_labels->end(),
fetch_coords->begin(),
labels.begin(), labels.end(),
std::back_inserter(seed.label_columns),
std::back_inserter(seed.label_coordinates));
matched_intersection(fetch_labels->begin(), fetch_labels->end(),
fetch_coords->begin(),
discarded_labels.begin(), discarded_labels.end(),
std::back_inserter(discarded_columns),
std::back_inserter(discarded_coords));

if (seed.label_columns.size())
seed.label_encoder = &annotation_buffer_.get_annotator().get_label_encoder();

if (discarded_columns.size()) {
added_discarded = true;
auto &discarded_seed = discarded_seeds.emplace_back(seed);
discarded_seed.label_encoder = &annotation_buffer_.get_annotator().get_label_encoder();
std::swap(discarded_seed.label_columns, discarded_columns);
std::swap(discarded_seed.label_coordinates, discarded_coords);
}

if (seed.get_offset()) {
for (auto &tuple : seed.label_coordinates) {
for (auto &coord : tuple) {
coord += seed.get_offset();
}
}

if (added_discarded) {
for (auto &tuple : discarded_seeds.back().label_coordinates) {
for (auto &coord : tuple) {
coord += seed.get_offset();
coord += discarded_seeds.back().get_offset();
}
}
}
} else {
std::set_intersection(fetch_labels->begin(), fetch_labels->end(),
labels.begin(), labels.end(),
std::back_inserter(seed.label_columns));
}
} else {
Alignment::Columns discarded_columns;

std::set_intersection(fetch_labels->begin(), fetch_labels->end(),
labels.begin(), labels.end(),
std::back_inserter(seed.label_columns));

std::set_intersection(fetch_labels->begin(), fetch_labels->end(),
discarded_labels.begin(), discarded_labels.end(),
std::back_inserter(discarded_columns));

if (discarded_columns.size()) {
auto &discarded_seed = discarded_seeds.emplace_back(seed);
discarded_seed.label_encoder = &annotation_buffer_.get_annotator().get_label_encoder();
std::swap(discarded_seed.label_columns, discarded_columns);
}

if (seed.label_columns.size())
Expand All @@ -718,6 +784,19 @@ ::filter_seeds(std::vector<Seed> &seeds) const {
return a.get_query_view().size() >= this->config_.min_seed_length;
}));

auto end = merge_into_unitig_mums(this->graph_, seeds.data(), seeds.data() + seeds.size(),
this->config_.min_seed_length, max_seed_length_);
seeds.erase(seeds.begin() + (end - seeds.data()), seeds.end());
assert(seeds.size());

if (discarded_seeds.size()) {
end = merge_into_unitig_mums(this->graph_, discarded_seeds.data(),
discarded_seeds.data() + discarded_seeds.size(),
this->config_.min_seed_length);
discarded_seeds.erase(discarded_seeds.begin() + (end - discarded_seeds.data()),
discarded_seeds.end());
}

return get_num_char_matches_in_seeds(seeds.begin(), seeds.end());
}

Expand Down
6 changes: 4 additions & 2 deletions metagraph/src/graph/alignment/aligner_labeled.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -149,10 +149,12 @@ class LabeledAligner : public DBGAligner<Seeder, Extender, AlignmentCompare>, pu
typedef typename DBGAligner<Seeder, Extender, AlignmentCompare>::BatchSeeders BatchSeeders;
BatchSeeders
virtual build_seeders(const std::vector<IDBGAligner::Query> &seq_batch,
const std::vector<AlignmentResults> &wrapped_seqs) const override final;
const std::vector<AlignmentResults> &wrapped_seqs,
std::vector<std::pair<std::vector<Seed>, std::vector<Seed>>> &discarded_seeds) const override final;

// helper for the build_seeders method
size_t filter_seeds(std::vector<Seed> &seeds) const;
size_t filter_seeds(std::vector<Seed> &seeds,
std::vector<Seed> &discarded_seeds) const;
};

} // namespace align
Expand Down
Loading