-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Regions index #37
Merged
Merged
Regions index #37
Changes from 9 commits
Commits
Show all changes
10 commits
Select commit
Hold shift + click to select a range
b5399c0
Rewrite asserts in assert_vcfs_close to give better failure messages
tomwhite 3cf19a1
Don't create a variant mask if no regions or targets are specified
tomwhite a47f7a6
Add index command to create an index to support efficient region queries
tomwhite 4c17f5a
Update vcf writer to use variant_position_end rather than computing i…
tomwhite 4f97d6d
Only load variant_contig/position chunks that overlap query region
tomwhite 8339dbc
Create index in tests
tomwhite ee74315
Support complement (^) in targets
tomwhite 8661998
refactor
tomwhite b3a56aa
Support both regions and targets at once
tomwhite b9c22e9
Use variant_length field rather than variant_position_end
tomwhite File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
import pytest | ||
|
||
# rewrite asserts in assert_vcfs_close to give better failure messages | ||
pytest.register_assert_rewrite("tests.utils") |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,12 +1,58 @@ | ||
import re | ||
from typing import Any, Optional | ||
|
||
import numcodecs | ||
import numpy as np | ||
import pandas as pd | ||
import pyranges | ||
import zarr | ||
from pyranges import PyRanges | ||
|
||
|
||
def parse_region(region: str) -> tuple[str, Optional[int], Optional[int]]: | ||
def create_index(vcz) -> None: | ||
"""Create an index to support efficient region queries.""" | ||
|
||
root = zarr.open(vcz, mode="r+") | ||
|
||
contig = root["variant_contig"] | ||
pos = root["variant_position"] | ||
end = root["variant_position_end"] | ||
|
||
assert contig.cdata_shape == pos.cdata_shape | ||
|
||
index = [] | ||
|
||
for v_chunk in range(pos.cdata_shape[0]): | ||
c = contig.blocks[v_chunk] | ||
p = pos.blocks[v_chunk] | ||
e = end.blocks[v_chunk] | ||
|
||
# create a row for each contig in the chunk | ||
d = np.diff(c, append=-1) | ||
c_start_idx = 0 | ||
for c_end_idx in np.nonzero(d)[0]: | ||
assert c[c_start_idx] == c[c_end_idx] | ||
index.append( | ||
( | ||
v_chunk, # chunk index | ||
c[c_start_idx], # contig ID | ||
p[c_start_idx], # start | ||
p[c_end_idx], # end | ||
np.max(e[c_start_idx : c_end_idx + 1]), # max end | ||
c_end_idx - c_start_idx + 1, # num records | ||
) | ||
) | ||
c_start_idx = c_end_idx + 1 | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This looks just right! 🤩 |
||
|
||
index = np.array(index, dtype=np.int32) | ||
root.array( | ||
"region_index", | ||
data=index, | ||
compressor=numcodecs.Blosc("zstd", clevel=9, shuffle=0), | ||
overwrite=True, | ||
) | ||
|
||
|
||
def parse_region_string(region: str) -> tuple[str, Optional[int], Optional[int]]: | ||
"""Return the contig, start position and end position from a region string.""" | ||
if re.search(r":\d+-\d*$", region): | ||
contig, start_end = region.rsplit(":", 1) | ||
|
@@ -20,32 +66,10 @@ def parse_region(region: str) -> tuple[str, Optional[int], Optional[int]]: | |
return contig, None, None | ||
|
||
|
||
def parse_regions(targets: str) -> list[tuple[str, Optional[int], Optional[int]]]: | ||
return [parse_region(region) for region in targets.split(",")] | ||
|
||
|
||
def parse_targets(targets: str) -> list[tuple[str, Optional[int], Optional[int]]]: | ||
return [parse_region(region) for region in targets.split(",")] | ||
|
||
|
||
def regions_to_selection( | ||
all_contigs: list[str], | ||
variant_contig: Any, | ||
variant_position: Any, | ||
variant_length: Any, | ||
regions: list[tuple[str, Optional[int], Optional[int]]], | ||
): | ||
# subtract 1 from start coordinate to convert intervals | ||
# from VCF (1-based, fully-closed) to Python (0-based, half-open) | ||
variant_start = variant_position - 1 | ||
variant_end = variant_start + variant_length | ||
df = pd.DataFrame( | ||
{"Chromosome": variant_contig, "Start": variant_start, "End": variant_end} | ||
) | ||
|
||
# save original index as column so we can retrieve it after finding overlap | ||
df["index"] = df.index | ||
variants = pyranges.PyRanges(df) | ||
def regions_to_pyranges( | ||
regions: list[tuple[str, Optional[int], Optional[int]]], all_contigs: list[str] | ||
) -> PyRanges: | ||
"""Convert region tuples to a PyRanges object.""" | ||
|
||
chromosomes = [] | ||
starts = [] | ||
|
@@ -63,9 +87,139 @@ def regions_to_selection( | |
starts.append(start) | ||
ends.append(end) | ||
|
||
query = pyranges.PyRanges(chromosomes=chromosomes, starts=starts, ends=ends) | ||
return PyRanges(chromosomes=chromosomes, starts=starts, ends=ends) | ||
|
||
|
||
def parse_regions(regions: Optional[str], all_contigs: list[str]) -> Optional[PyRanges]: | ||
"""Return a PyRanges object from a comma-separated set of region strings.""" | ||
if regions is None: | ||
return None | ||
return regions_to_pyranges( | ||
[parse_region_string(region) for region in regions.split(",")], all_contigs | ||
) | ||
|
||
|
||
def parse_targets( | ||
targets: Optional[str], all_contigs: list[str] | ||
) -> tuple[Optional[PyRanges], bool]: | ||
"""Return a PyRanges object from a comma-separated set of region strings, | ||
optionally preceeded by a ^ character to indicate complement.""" | ||
if targets is None: | ||
return None, False | ||
complement = targets.startswith("^") | ||
return parse_regions( | ||
targets[1:] if complement else targets, all_contigs | ||
), complement | ||
|
||
|
||
def regions_to_chunk_indexes( | ||
regions: Optional[PyRanges], | ||
targets: Optional[PyRanges], | ||
complement: bool, | ||
regions_index: Any, | ||
): | ||
"""Return chunks indexes that overlap the given regions or targets. | ||
|
||
If both regions and targets are specified then only regions are used | ||
to find overlapping chunks (since targets are used later to refine). | ||
|
||
If only targets are specified then they are used to find overlapping chunks, | ||
taking into account the complement flag. | ||
""" | ||
|
||
# Create pyranges for chunks using the region index. | ||
# For regions use max end position, for targets just end position | ||
chunk_index = regions_index[:, 0] | ||
contig_id = regions_index[:, 1] | ||
start_position = regions_index[:, 2] | ||
end_position = regions_index[:, 3] | ||
max_end_position = regions_index[:, 4] | ||
df = pd.DataFrame( | ||
{ | ||
"chunk_index": chunk_index, | ||
"Chromosome": contig_id, | ||
"Start": start_position, | ||
"End": max_end_position if regions is not None else end_position, | ||
} | ||
) | ||
chunk_regions = PyRanges(df) | ||
|
||
if regions is not None: | ||
overlap = chunk_regions.overlap(regions) | ||
elif complement: | ||
overlap = chunk_regions.subtract(targets) | ||
else: | ||
overlap = chunk_regions.overlap(targets) | ||
if overlap.empty: | ||
return np.empty((0,), dtype=np.int64) | ||
chunk_indexes = overlap.df["chunk_index"].to_numpy() | ||
chunk_indexes = np.unique(chunk_indexes) | ||
return chunk_indexes | ||
|
||
|
||
def regions_to_selection( | ||
regions: Optional[PyRanges], | ||
targets: Optional[PyRanges], | ||
complement: bool, | ||
variant_contig: Any, | ||
variant_position: Any, | ||
variant_end: Any, | ||
): | ||
"""Return a variant selection that corresponds to the given regions and targets. | ||
|
||
If both regions and targets are specified then they are both used to find | ||
overlapping variants. | ||
""" | ||
|
||
# subtract 1 from start coordinate to convert intervals | ||
# from VCF (1-based, fully-closed) to Python (0-based, half-open) | ||
variant_start = variant_position - 1 | ||
|
||
if regions is not None: | ||
df = pd.DataFrame( | ||
{"Chromosome": variant_contig, "Start": variant_start, "End": variant_end} | ||
) | ||
# save original index as column so we can retrieve it after finding overlap | ||
df["index"] = df.index | ||
variant_regions = PyRanges(df) | ||
else: | ||
variant_regions = None | ||
|
||
if targets is not None: | ||
targets_variant_end = variant_position # length 1 | ||
df = pd.DataFrame( | ||
{ | ||
"Chromosome": variant_contig, | ||
"Start": variant_start, | ||
"End": targets_variant_end, | ||
} | ||
) | ||
# save original index as column so we can retrieve it after finding overlap | ||
df["index"] = df.index | ||
variant_targets = PyRanges(df) | ||
else: | ||
variant_targets = None | ||
|
||
if variant_regions is not None: | ||
regions_overlap = variant_regions.overlap(regions) | ||
else: | ||
regions_overlap = None | ||
|
||
if variant_targets is not None: | ||
if complement: | ||
targets_overlap = variant_targets.subtract(targets) | ||
else: | ||
targets_overlap = variant_targets.overlap(targets) | ||
else: | ||
targets_overlap = None | ||
|
||
if regions_overlap is not None and targets_overlap is not None: | ||
overlap = regions_overlap.overlap(targets_overlap) | ||
elif regions_overlap is not None: | ||
overlap = regions_overlap | ||
else: | ||
overlap = targets_overlap | ||
|
||
overlap = variants.overlap(query) | ||
if overlap.empty: | ||
return np.empty((0,), dtype=np.int64) | ||
return overlap.df["index"].to_numpy() |
Oops, something went wrong.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
variant_length
might be an easier name to understand and computing theend
is simple from that. But that's a discussion for elsewhere