paired reads have different names #18

Open
Mahmoudbassuoni opened this issue Mar 20, 2023 · 11 comments

Comments

@Mahmoudbassuoni

Mahmoudbassuoni commented Mar 20, 2023

Hi, I am trying to run an alignment with bwa mem on the two files "U0a_CGATGT_L001_R1_001.fastq.gz" and "U0a_CGATGT_L001_R2_001.fastq.gz", which I downloaded from the FTP site, against the reference "GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta.gz". The command I am using is:
bwa mem -t 16 -R '@RG\tID:H814YADXX.5.CGATGT.1101\tSM:HG001\tPL:illumina' GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta.gz U0a_CGATGT_L001_R1_001.fastq.gz U0a_CGATGT_L001_R2_001.fastq.gz | samtools view -b - >HG001.GRCh38_no_alt_analysis_set.bam

but I am getting an error with the sequence headers:

[mem_sam_pe] paired reads have different names: "HWACAGATTTTGT", "HWI-D00360:5:H814YADXX:1:1102:11719:83283"
[mem_sam_pe] paired reads have different names: "HWACTATTDDD", "HWI-D00360:5:H814YADXX:1:1102:11293:83492"
[mem_sam_pe] paired reads have different names: "@@faaa(+:A0&AA", "HWI-D00360:5:H814YADXX:1:1102:11730:83321"
[mem_sam_pe] paired reads have different names: "HWI-A@HWI-D00360:5:H814YX:1:1102:10399:83348", "HWI-00360:5:H814YADXX:1:1102:11699:83300"
[mem_sam_pe] paired reads have different names: "ACD00360TJJJJC@AGCCCTGCACCACCTAATAAGAACTGGAAAGTCEEDDDDDDDD", "HWI-D00360:5:H814YADXX:1:1102:11719:83361"
[mem_sam_pe] paired reads have different names: "HWCTAAAATC:BDDDDFDDDDDDCEDDDHJJEHIIIJJJHHH>HFFEEEEET:83ACDDDDTAAATTEDDDDDDEDDDDJJFHJJJJJJJJJJJJJJJJJJJJJJIJJJJJJ@T4BJJJTTATCTTG>FGGCAGGCTJJIJJJJJEDEECDDFAAGTAAADDDDDDDCTCTTCTTGTTTTCCCC>AGCC60:5:HC814YJDDDCCDDIGCCCTTC1IIIIHIEDDD@FFFCTTC1IIIIHIEDCCC;>CC60:5:H:0:CGADXX:1:1ATGTTTA:N:0:CGAC>CGAC>CG3AGGCTGAGGYADXX:JJJJJJJJIJJA0360GAIAGEEDEEEEC:GJIIJJJC:0:CGATGIFFFHHHHHJJJJJDEDDDDDGDEDDDDGTTTTTAT@HHJJJTGT", "HWI-D00360:5:H814YADXX:1:1102:11549:83491"
[mem_sam_pe] paired reads have different names: "HWCATCCTCCCAAGACTAADD@FFFC99:833C99:833CGCTTTGFHH@FFFFDDDCCCDCFB:>CA8>A??CC:A:ACTTACTCAAAAAACTATH814CAAATGCAGDDD:TTAAGTTCACAGCGA8DEDDDDDGJJJJJJDDDDDBDDDDDDDDDDDDDTGGACTTTJJHHHF60:5:HHH@FFFGTGGCAGGCTCCTGTAACGDDDDDDDDATGAACTCIACTAGDDDBBDDG9ATGGAATTTGACTTGADXX:1CACCTGCCAAACATACCCGTCTTTACC(G36CAGACCACCTGGACTTCCAGGEECDCDCDGAGGCCTGGCCATGTTATATGAAGTGIDXX:1CACCTGCCAAACATACCCGT", "HWI-D00360:5:H814YADXX:1:1102:11746:83407"
[mem_sam_pe] paired reads have different names: "HWACTATTDEFFFHHHHCCTTGTGTE:@DDDD49?IJJIGIG83407", "HWI-D00360:5:H814YADXX:1:1102:11545:83354"

I tried sorting the two files with fastq-sort but still get the same error. Can anyone help?
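
Several of the "names" in the log above contain sequence and quality characters where a read ID should be, which suggests that at least one input is not being read as well-formed 4-line FASTQ records, so sorting alone would not help. Below is a minimal Python sketch of a pre-alignment check. It assumes gzipped, 4-line-per-record FASTQ, and it only roughly mirrors how an aligner compares names (first whitespace-delimited token, with a trailing /1 or /2 dropped). The function names are illustrative and are not part of bwa.

```python
import gzip
from itertools import zip_longest


def read_names(path):
    """Yield the read name of each 4-line FASTQ record in a gzipped file."""
    with gzip.open(path, "rt") as handle:
        for index, line in enumerate(handle):
            if index % 4 != 0:
                continue  # only the first line of each record is a header
            if not line.startswith("@"):
                raise ValueError(f"{path}: line {index + 1} is not a FASTQ header")
            # Keep the token before the first whitespace, drop a trailing /1 or /2.
            fields = line[1:].split()
            name = fields[0] if fields else ""
            if name.endswith(("/1", "/2")):
                name = name[:-2]
            yield name


def first_name_mismatch(r1_path, r2_path):
    """Return (record_number, r1_name, r2_name) for the first mismatch, else None.

    A file that ends early shows up as a mismatch against None.
    """
    pairs = zip_longest(read_names(r1_path), read_names(r2_path))
    for number, (name1, name2) in enumerate(pairs, start=1):
        if name1 != name2:
            return number, name1, name2
    return None
```

Calling `first_name_mismatch("U0a_CGATGT_L001_R1_001.fastq.gz", "U0a_CGATGT_L001_R2_001.fastq.gz")` on intact, correctly paired files should return `None`. A truncated or corrupted download will typically raise an exception from the `gzip` module or the `ValueError` above instead, which is itself a useful signal.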

@chunlinxiao
Contributor

chunlinxiao commented Mar 20, 2023

You need to use the sequence.index file (https://github.com/genome-in-a-bottle/giab_data_indexes/blob/master/NA12878/sequence.index.NA12878_Illumina300X_wgs_09252015 in your case) to match R1 and R2 files.

For the 300X ILMN raw reads, some R1/R2 files have the same names but are located in different directories, e.g.:

ftp://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/131219_D00360_005_BH814YADXX/Project_RM8398/Sample_U0a/U0a_CGATGT_L001_R1_001.fastq.gz cabfe5b609fb1fe11619fdc72060185c ftp://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/131219_D00360_005_BH814YADXX/Project_RM8398/Sample_U0a/U0a_CGATGT_L001_R2_001.fastq.gz 6f0faed9249c1a850e6ce57c61e26e04 HG001

ftp://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/131219_D00360_006_AH81VLADXX/Project_RM8398/Sample_U0a/U0a_CGATGT_L001_R1_001.fastq.gz cc35b61053fe7505715f93175bbb16c4 ftp://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/131219_D00360_006_AH81VLADXX/Project_RM8398/Sample_U0a/U0a_CGATGT_L001_R2_001.fastq.gz cd12a23c3d71061e1bc673ce8c598dba HG001

Hope this helps.
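
As an illustration of matching mates through the index rather than by file name, here is a small Python sketch. It assumes each data row has the five whitespace-separated columns visible in the two rows above (R1 URL, R1 md5, R2 URL, R2 md5, sample) and skips anything else, such as a header line. The names `FastqPair`, `parse_index`, `mate_for`, and `run_directory` are hypothetical helpers, not part of any GIAB tooling.

```python
import posixpath
from typing import Iterable, List, NamedTuple, Optional
from urllib.parse import urlparse


class FastqPair(NamedTuple):
    r1_url: str
    r1_md5: str
    r2_url: str
    r2_md5: str
    sample: str


def parse_index(lines: Iterable[str]) -> List[FastqPair]:
    """Parse index rows, keeping only 5-column rows that start with an ftp URL."""
    pairs = []
    for line in lines:
        fields = line.split()
        if len(fields) == 5 and fields[0].startswith("ftp://"):
            pairs.append(FastqPair(*fields))
    return pairs


def mate_for(pairs: List[FastqPair], r1_url: str) -> Optional[FastqPair]:
    """Look up a row by the full R1 URL; the base file name alone is not unique."""
    for pair in pairs:
        if pair.r1_url == r1_url:
            return pair
    return None


def run_directory(url: str) -> str:
    """Return the directory part of a URL path, which identifies the run."""
    return posixpath.dirname(urlparse(url).path)
```

Keying on the full URL matters here because both example rows end in `U0a_CGATGT_L001_R1_001.fastq.gz`; only the run directory distinguishes them.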

@Mahmoudbassuoni
Author

Yes, I used the forward and reverse reads for the same run from the same folder, which should correspond to a single line in the index you linked. That is, I took both FTP links from one line, so they should belong to the same run.

@chunlinxiao
Contributor

In your example, can you post the full paths of the two files you used for mapping? Have you checked the md5?
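
For reference, a minimal Python sketch of such an md5 check, reading the file in chunks so that large FASTQ archives do not need to fit in memory. The helper names are illustrative; the expected value would be the md5 listed next to the file in the sequence.index.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the hex md5 of a file's bytes exactly as stored on disk."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_index(path, expected_md5):
    """Compare a file's md5 with the value listed in the index (case-insensitive)."""
    return file_md5(path) == expected_md5.strip().lower()
```

Note that this hashes the compressed `.fastq.gz` bytes, so it is sensitive to anything that changes the archive itself, not only the reads inside it.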

@Mahmoudbassuoni
Author

Hi,
This was the forward (R1) file:
"ftp://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/131219_D00360_005_BH814YADXX/Project_RM8398/Sample_U0a/U0a_CGATGT_L001_R1_001.fastq.gz"
and this was the reverse (R2) one:
"ftp://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/131219_D00360_005_BH814YADXX/Project_RM8398/Sample_U0a/U0a_CGATGT_L001_R2_001.fastq.gz"

@Mahmoudbassuoni
Author

I have checked the md5 now, and it looks like something went wrong with the download. I am downloading the files again and will check once more, then get back to you. Thanks.

@Mahmoudbassuoni
Author

Hi @chunlinxiao,
I have downloaded the files again, but the md5sum output still does not match the one on the FTP site. I am not sure what could be wrong. I tried the same thing with another pair of files and the same thing happens.

@Mahmoudbassuoni
Author

I tried the alignment with two paired read files from the folder "giab/ftp/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/131219_D00360_005_BH814YADXX/Project_RM8398/Sample_U0a/" and it went fine, but I am not sure about the data quality: those files are from 2014, whereas the files above are from 2020, so the latter should be more reliable.

@chunlinxiao
Contributor

Thanks for the update, and glad your alignment now runs fine. I also tested your pairs on our side and nothing was wrong, so the paired data is fine.

Regarding the md5: we recently performed a metadata collection/analysis on all FASTQs that involved gunzip/gzip, which may produce different md5s (from a different gz file header if gzip -n is not used). However, the uncompressed files (the FASTQ files) are unchanged, with identical md5s. The sequence.index files may need to be updated accordingly.
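
The header effect can be reproduced with a short Python sketch. It compresses the same bytes twice with different timestamps in the gzip header (the `mtime` argument of `gzip.compress`, available since Python 3.8) and shows that the archive md5s differ while the decompressed content is identical. The toy record and the helper `uncompressed_md5` are illustrative assumptions, not the procedure used on the FTP site.

```python
import gzip
import hashlib


def md5_hex(data):
    """Return the hex md5 of a bytes object."""
    return hashlib.md5(data).hexdigest()


def uncompressed_md5(path, chunk_size=1 << 20):
    """Return the hex md5 of the decompressed content of a .gz file."""
    digest = hashlib.md5()
    with gzip.open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# A toy FASTQ record standing in for a real file.
fastq = b"@read1\nACGT\n+\nIIII\n"

# Same payload, different timestamp stored in the gzip header.
first = gzip.compress(fastq, mtime=1)
second = gzip.compress(fastq, mtime=2)

archives_differ = md5_hex(first) != md5_hex(second)
contents_match = gzip.decompress(first) == gzip.decompress(second) == fastq
print(archives_differ, contents_match)
```

In practice this means a downloaded `.fastq.gz` can fail an md5 check against an older index entry and still hold exactly the same reads; comparing `uncompressed_md5` values is one way to tell that case apart from a damaged download.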

@Mahmoudbassuoni
Author

So what do you think about relying on the old FASTQs from 2014? I am running a benchmarking process, so is it fine to use those FASTQs together with the VCFs from the NIST V4 directory?

@jzook
Contributor

jzook commented Mar 22, 2023

Hi @Mahmoudbassuoni, all of the files in those directories were generated ~2014. They are probably OK to use for some purposes, but if you want to understand how your methods work on more recent Illumina data, you may want to use data from this publication: https://doi.org/10.1101/2020.12.11.422022.

@chunlinxiao
Contributor

Hi @Mahmoudbassuoni, the md5s were updated in sequence.index.NA12878_Illumina300X_wgs_09252015_updated (you can follow the link from the table).
