Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

partition_into_regions fails to partition correctly in when absent contigs are listed in the header. #1169

Closed
benjeffery opened this issue Jan 17, 2024 · 9 comments
Labels
bug Something isn't working

Comments

@benjeffery
Copy link
Collaborator

benjeffery commented Jan 17, 2024

I have a 56GB VCF which contains the variants for part of chr20, however the header of the VCF lists over 2000 contigs! This confuses partition_into_regions which returns a region for each of the absent contigs and leaves all the actual variants in the VCF in one huge part. It also somewhat splits up the first (absent) contig.

Digging into the code we have:

def get_sequence_names(vcf_path: PathType, index: Any) -> Any:
    try:
        # tbi stores sequence names
        return index.sequence_names
    except AttributeError:
        # ... but csi doesn't, so fall back to the VCF header
        return VCF(vcf_path).seqnames

The VCF I am using is indexed with a csi file, so this is where the confusion is happening as sgkit is using the header to determine what csi means by contig 0.

Here (samtools/bcftools#816 (comment)) they say that the tabix metadata is in the CSI aux field, and indeed I can see a string chr20 in that field (along with other things) so maybe we can try to read the contig names from the CSI? Will dig a bit further, but thought worth reporting in case anyone has experience here.

@benjeffery
Copy link
Collaborator Author

Hard coding the return value of get_sequence_name as a hack fixes this issue, so now seeing if we can get that from the CSI reliably.

@jeromekelleher
Copy link
Collaborator

Could reindex with tabix as a quick workaround, and see if the code works in that case?

@benjeffery
Copy link
Collaborator Author

Yes, I'd have to copy the VCF somewhere with write perms. Will try once this parse is complete, its over half way done.

@jeromekelleher
Copy link
Collaborator

Right - can't have an index anywhere else except in the same dir. Sigh.

@jeromekelleher
Copy link
Collaborator

I think this can be fixed by adding some functionality to cyvcf2. I made a start on something similar here:

brentp/cyvcf2#295

Alternatively, we are actually fully parsing the indexes locally, so could probably derive the counts per-contig there.

@jeromekelleher
Copy link
Collaborator

Can you dig in a bit more here @benjeffery, and maybe give us the output of partition_into_regions? I wonder if the CSI index just isn't splitting up the these small regions effectively

@jeromekelleher
Copy link
Collaborator

Ah, I'm just after hitting this issue now. CSI indexed VCFs with multiple contigs in the header are not being treated correctly.

@jeromekelleher
Copy link
Collaborator

Marking this as a bug, as there's a good chance of data-loss as a result of this.

@jeromekelleher jeromekelleher added the bug Something isn't working label Feb 27, 2024
@jeromekelleher
Copy link
Collaborator

jeromekelleher commented Mar 3, 2024

Closing in favour of #1201 ( think this is an instance of that bug)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants