Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial version of num records #295

Merged
merged 7 commits into from
Jan 19, 2024
Merged

Conversation

jeromekelleher
Copy link
Contributor

I nerd-sniped myself into doing #294 馃槅

It seems to be mostly working, but I'm not familiar either with Cython or htslib, so I'm not that confident there's no nasty surprises lurking. The logic around opening indexes is tricky: I've tried to reproduce what's used elsewhere in the file, but I'm really not clear about the lifetimes etc.

It's also not clear to me when this fails - does it need to have an index created by a recent version of bcftools?

I'd like to do some more testing, where we check it against more bcf files, and ideally with indexes created by different tools.

I'll make a few inline comments on specifics.

I also fixed a few compiler warnings in 191654e - happy to drop this commit if it's off-topic.

Also needs CFLAGS="-DNPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION"
which could be added to setup.py
@@ -78,7 +81,7 @@ cdef extern from "htslib/hts.h":
cdef extern from "htslib/tbx.h":

ctypedef struct tbx_t:
pass
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this so that Cython "knows about" the idx member of this struct. However, it has two other members - I'm not sure if it's safe to just ignore them? Originally I copied the full struct definition in here, but that seems worse, if it's not needed.

Copy link
Owner

@brentp brentp Jan 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to find documentation of this in cython, but failed. My memory is that you should either declare none of the fields or all of the fields. The definition of tbx_t is simple enough that it is at least it is easy to explicitly list all of them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

idx = self.idx.idx
assert self.hidx == NULL

if idx == NULL:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought a bit about this, and since you need an index to compute this, it seems best to make it a hard error so that the user is alerted. Happy to discuss of course (also the type of the exception is debateable)

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine with me. You could just document in num_records docstring that is will raise a ValueError if index is not found.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

assert len(list(b)) == 10

b.set_index("{}/test-diff.csi".format(HERE))
print(b.num_records)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fails on the hts_idx_get_stat call, but it's not clear why. I'm assuming this test case is used to check what happens when you automatically load the wrong index, but then replace with a good one?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure why it would fail on the first call either!?
So the print on 1365 works?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think you could try reindexing, make a new test.snpeff.bcf.csi and see if it still fails

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one was pretty weird. It seems that the return value of hts_idx_get_stat isn't actually diagnostic of errors, and bcftools index -n still reports a value when it is negative. So, I think we just have to ignore it. This test.snpeff.bcf file works fine when we do that.

Copy link
Owner

@brentp brentp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Thanks very much @jeromekelleher
Would like to understand the one failure.

@@ -78,7 +81,7 @@ cdef extern from "htslib/hts.h":
cdef extern from "htslib/tbx.h":

ctypedef struct tbx_t:
pass
Copy link
Owner

@brentp brentp Jan 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to find documentation of this in cython, but failed. My memory is that you should either declare none of the fields or all of the fields. The definition of tbx_t is simple enough that it is at least it is easy to explicitly list all of them.

idx = self.idx.idx
assert self.hidx == NULL

if idx == NULL:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine with me. You could just document in num_records docstring that is will raise a ValueError if index is not found.

cdef int ret, tid, nseq;
cdef hts_idx_t *idx = NULL;

if self.hidx == NULL and self.idx == NULL:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not required for this PR, but perhaps we should extract out this stuff to a _find_index method since it's repeated elsewhere.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've created an _open_index function that's shared between num_records and seqnames, as they are very similar. I'm a bit wary of the other places the indexes are used, as it's really not clear to me when you have to use the tbx struct (which has an embedded hts_idx) and when you can use the other for queries. I thought it best to leave well enough alone for now.

assert len(list(b)) == 10

b.set_index("{}/test-diff.csi".format(HERE))
print(b.num_records)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure why it would fail on the first call either!?
So the print on 1365 works?

assert len(list(b)) == 10

b.set_index("{}/test-diff.csi".format(HERE))
print(b.num_records)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think you could try reindexing, make a new test.snpeff.bcf.csi and see if it still fails

@jeromekelleher
Copy link
Contributor Author

OK, I think this is ready for another look @brentp !

I added an extra index-freeing step in set_index, as I'm pretty sure it was leaking memory when set_index is called multiple times (or after seqnames, etc). This seems a fairly harmless change?

@brentp brentp merged commit 5d5c6bd into brentp:main Jan 19, 2024
12 checks passed
@brentp
Copy link
Owner

brentp commented Jan 19, 2024

nicely done. will tag a new version shortly.

@jeromekelleher
Copy link
Contributor Author

Awesome, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants