HLA-Genotyping: <error> Converting "HLA-A*01:01:01:01" of size 17 to int failed #151

Open
ghost opened this issue May 20, 2024 · 3 comments

Comments

@ghost

ghost commented May 20, 2024

Good morning, I wanted to ask for some guidance on an issue I've been running into while genotyping structural variants with graphtyper. I am attempting a test run of graphtyper v2 on multiple samples with HLA contigs. The VCF file has 30 samples called by manta and merged using JasmineSV. Similarly to svimmer, JasmineSV preserves the SV information from the original VCF file, and I have previously used graphtyper's genotype_sv command successfully to genotype structural variants from manta and smoove merged by JasmineSV.

The problem is that my true samples have HLA contigs containing structural variants that must be genotyped as well. When genotyping with the appropriate reference genome, regardless of whether I use a region file or a specific HLA contig as the region, I get the following error:

<error> Converting "HLA-A*01:01:01:01" of size 17 to int failed.

Because of this, I changed the HLA contig names in the regions file, VCF file, and reference FASTA (working on separate copies so as not to overwrite the original files) to use underscores instead of colons, i.e. "HLA-A*01_01_01_01". For a short while this seemed to work: the command ran without issue until it attempted to genotype the HLA regions.

Command Used:

 ./graphtyper genotype_sv \
   /homes/lcass09/sv_calling/output/graphtyper_output/Homo_sapiens_assembly38_HLA.fasta \
   /homes/user/sv_calling/output/graphtyper_output/jasmine_30_samples_HLA_strands_sorted.vcf.gz \
   --sams=/homes/user/sv_calling/output/SURVIVOR_merge_output/30_bam_samples.txt \
   --region_file=/homes/user/sv_calling/output/graphtyper_output/sorted_combined_chromosome_contig_list.txt \
   --output=/homes/user/sv_calling/output/graphtyper_output/30_Samples_Mantaonly

Output observed after multiple hours:


[W::tbx_parse1] VCF INFO/END=22916 is smaller than POS at chr1:137018                                   
This tag will be ignored. Note: only one invalid END tag will be reported.                              
[2024-05-15 20:39:25.780] <warning> [constructor.cpp:719] I do not know how to add an insertion at position 44044804                                                                                            
[2024-05-15 20:39:36.525] <warning> [constructor.cpp:719] I do not know how to add an insertion at position 44044804                                                                                            
[2024-05-16 09:56:59.814] <warning> [constructor.cpp:719] I do not know how to add an insertion at position 94051932                                                                                            
[2024-05-16 09:57:12.866] <warning> [constructor.cpp:719] I do not know how to add an insertion at position 94051932                                                                                            
[2024-05-16 11:33:48.384] <warning> [constructor.cpp:719] I do not know how to add an insertion at position 80930428                                                                                            
[2024-05-16 16:26:31.266] <warning> [constructor.cpp:719] I do not know how to add an insertion at position 32030496                                                                                            
[2024-05-16 16:26:42.076] <warning> [constructor.cpp:719] I do not know how to add an insertion at position 32030496                                                                                            
[2024-05-16 20:17:50.450] <error> hts_reader.cpp:113 Failed to query region 'HLA-A*01_01_01_01:1-203503'
[2024-05-16 20:17:50.467] <error> hts_reader.cpp:113 Failed to query region 'HLA-A*01_01_01_01:1-203503'
[2024-05-16 20:17:50.467] <error> hts_reader.cpp:113 Failed to query region 'HLA-A*01_01_01_01:1-203503'
[2024-05-16 20:17:50.467] <error> hts_reader.cpp:113 Failed to query region 'HLA-A*01_01_01_01:1-203503'
[2024-05-16 20:17:50.467] <error> hts_reader.cpp:113 Failed to query region 'HLA-A*01_01_01_01:1-203503'
[2024-05-16 20:17:50.467] <error> hts_reader.cpp:113 Failed to query region 'HLA-A*01_01_01_01:1-203503'
*** Error in './graphtyper': corrupted double-linked list: 0x0000000001ee1460 ***                       
Segmentation fault (core dumped)           

I'm not exactly sure why this is happening or how I can circumvent it. It's relatively important to keep the HLA contigs for my purposes, so if there's any way to deal with this I would greatly appreciate any guidance or insight.

Thank you for your time and patience.

@hannespetur
Member

Hello, sorry about the problems regarding colons in contig names.

I think the error means that the HLA-A*01_01_01_01 contig is not present in your BAMs, because you have changed the contig names in your FASTA/VCF/regions files but not in your BAMs. Is it possible for you to reheader your BAMs with the new contig names?
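For illustration, the header rewrite behind such a reheader could be sketched as follows. This is a minimal sketch, not part of graphtyper: the function name `rename_sq_contigs` is hypothetical, and it assumes the header has been dumped to text (for example with `samtools view -H`) and that only the `SN:` field of `@SQ` lines needs the colon-to-underscore change.

```python
def rename_sq_contigs(header_text: str, old: str = ":", new: str = "_") -> str:
    """Replace `old` with `new` inside the SN: field of every @SQ header line.

    All other header lines and fields (LN:, @PG command lines, ...) are
    returned unchanged.
    """
    out = []
    for line in header_text.splitlines():
        if line.startswith("@SQ"):
            fields = line.split("\t")
            fields = [
                # Keep the "SN:" tag prefix and rewrite only the contig name.
                "SN:" + f[3:].replace(old, new) if f.startswith("SN:") else f
                for f in fields
            ]
            line = "\t".join(fields)
        out.append(line)
    return "\n".join(out) + "\n"
```

The rewritten header could then be applied with `samtools reheader new_header.sam original.bam > renamed.bam` (file names here are placeholders), which writes a new file and leaves the original BAM untouched; the copy would need its own index. Only the header changes, so text tags inside records that spell out contig names, such as SA, would keep the old names.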

If that does not work, can you extract a small subset of your data and share it with me for reproducing on my end?

Best,
Hannes

@ghost
Author

ghost commented May 27, 2024

No worries. Is there a way to get around that issue aside from changing the contig names in the files? A lot of tools, including bcftools, also seem to struggle with HLA contigs in a very similar manner.
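One plausible reason colons cause trouble, sketched here purely as an illustration: region strings are conventionally written as `contig:start-end`, so a colon inside the contig name makes the split point ambiguous. The two functions below are hypothetical and are not taken from graphtyper or bcftools; they only contrast splitting at the first colon with splitting at the last one.

```python
def split_first_colon(region: str):
    """Naive parse: treat everything after the FIRST ':' as 'start-end'."""
    contig, _, span = region.partition(":")
    start, _, end = span.partition("-")
    # int() raises ValueError when the span still contains pieces of the name.
    return contig, int(start), int(end)


def split_last_colon(region: str):
    """Parse from the right, so colons inside the contig name are preserved."""
    contig, _, span = region.rpartition(":")
    start, _, end = span.partition("-")
    return contig, int(start), int(end)
```

With a region such as `HLA-A*01:01:01:01:1-3503`, the first variant fails on the integer conversion while the second recovers the full contig name. Even the right-hand split cannot tell a bare contig name like `HLA-A*01:01:01:01` apart from a name followed by a coordinate, which is why renaming the contigs sidesteps the problem entirely.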

Unfortunately the BAM files themselves cannot be rewritten, but I can try to duplicate them and then reheader the copies so that their HLA contig names match the FASTA and VCF files of interest.

As for the shared data, I sincerely apologise that I cannot share data at the moment, but I'll get back to you on that after I try the re-headering very soon.

@ghost
Author

ghost commented Jul 16, 2024

@hannespetur

Apologies for the delay. The fix was successful and the genotyping ran on the HLA contigs. I now have one more issue to resolve. I re-attempted the genotyping using a merged VCF of 50 samples called by both manta and smoove, first merged by caller using jasmine for each sample, and then merged across all samples using jasmine once again.

Once the genotyping had run, I used bcftools concat to create a final merged VCF. The input VCF started with about 127,000 structural variants and the output VCF had 157,000. After filtering to keep only those with SVMODEL=AGGREGATED, this number goes down to 63,000 structural variants in total.

I did not filter by PASS for the input or output vcf. I unfortunately cannot share my data to show an example.

I also checked the number of structural variants prior to merging the final output vcf, and it was still 157,000 structural variants.
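To narrow down where these numbers diverge, a per-value tally of the SVMODEL INFO field might help. The sketch below is an assumption-laden helper, not a graphtyper feature: the function name `count_by_info_key` is hypothetical, and it simply counts records per value of one INFO key in a plain or gzip/bgzip-compressed VCF.

```python
import gzip
from collections import Counter


def count_by_info_key(vcf_path: str, key: str = "SVMODEL") -> Counter:
    """Count VCF records per value of one INFO key ('<missing>' if absent)."""
    opener = gzip.open if vcf_path.endswith(".gz") else open
    counts = Counter()
    with opener(vcf_path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # skip meta-information and the column header
            info = line.rstrip("\n").split("\t")[7]  # INFO is the 8th column
            value = "<missing>"
            for entry in info.split(";"):
                name, _, val = entry.partition("=")
                if name == key:
                    value = val
                    break
            counts[value] += 1
    return counts
```

If the output turns out to hold several records per input variant under different SVMODEL values, the tally would show it directly, and comparing the AGGREGATED count against the input count would then be the more meaningful check.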

Is this due to automatic filtering that graphtyper carries out? If so, what kind of filtering is it, and can you explain it in some detail? Could it also have to do with graphtyper's lower genotyping accuracy on other callers such as smoove?
