Feature Request: giraffe mapping of CRAMs with multiple RGs #151

Open
jjfarrell opened this issue Jan 28, 2025 · 0 comments
jjfarrell commented Jan 28, 2025

The present WDL workflow supports only one read group (RG) per CRAM. When the workflow creates paired-end FASTQ files, the RG is not preserved, so insert size estimates are based on reads from all RGs in the CRAM. This is problematic when the RGs have different insert sizes. If the RG is added to the paired-end FASTQ files, the kmc step unfortunately breaks and does not recognize the read pairs.

If a CRAM has multiple RGs, it should first be split into one BAM file per RG (https://www.htslib.org/doc/samtools-split.html). The paired-end FASTQ files collated from each BAM could then preserve the RG. The giraffe alignment would then be based on the insert size of each RG in the CRAM. The giraffe-mapped BAMs for each RG can then be merged into one BAM, with every RG in the header and each read properly tagged with its original RG.
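The split, collate, and merge steps proposed above could look roughly like the sketch below. It is not taken from the workflow: the function names, the file naming scheme, and the `DRY_RUN` switch are hypothetical, and it assumes a reasonably recent samtools. Keeping one FASTQ pair per RG means the RG is tracked by file name, so the read names seen by kmc stay unchanged.

```shell
#!/usr/bin/env bash
# Sketch only: helper names, file names and DRY_RUN are illustrative assumptions.
set -eu

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Split a CRAM into one BAM per read group.
# In the samtools split -f format string, "%!" expands to the @RG ID.
split_by_rg() {
  local cram="$1" ref="$2" prefix="$3"
  run samtools split --reference "$ref" --output-fmt BAM \
    -f "${prefix}"'.%!.bam' "$cram"
}

# Collate one per-RG BAM so mates are adjacent, then write paired FASTQ files.
# Unpaired and singleton reads are discarded here for simplicity.
rg_bam_to_fastq() {
  local bam="$1" out="$2"
  run samtools collate -o "${out}.collated.bam" "$bam"
  run samtools fastq -1 "${out}_1.fq.gz" -2 "${out}_2.fq.gz" \
    -0 /dev/null -s /dev/null "${out}.collated.bam"
}

# Merge the per-RG giraffe BAMs; their @RG header lines are combined.
merge_rg_bams() {
  local out="$1"
  shift
  run samtools merge "$out" "$@"
}
```

Running with `DRY_RUN=1` prints the commands instead of executing them, which makes it easy to check the plan for a given CRAM before adapting it into WDL tasks. The giraffe step in between would run once per RG, each with its own `--read-group` value.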

This is how the RG is presently specified when giraffe is run in the WDL tasks. Every RG ID is hard-coded as "1", so the RGs in the original CRAM are lost.

        vg giraffe \
          --progress \
          --read-group "ID:1 LB:lib1 SM:~{in_sample_name} PL:illumina PU:unit1" \
          --sample "~{in_sample_name}" \
          --output-format BAM \
          ~{in_giraffe_options} \
          --ref-paths ~{in_ref_dict} \
          -f ~{in_left_read_pair_chunk_file} -f ~{in_right_read_pair_chunk_file} \
          -x ~{in_xg_file} \
          -H ~{in_gbwt_file} \
          -g ~{in_ggbwt_file} \
          -d ~{in_dist_file} \
          -m ~{in_min_file} \
          -t ~{in_map_cores} > ~{in_sample_name}.${READ_CHUNK_ID}.bam
    >>>
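To carry the original RG through instead of the hard-coded value, the `--read-group` string could be derived from each `@RG` line in the CRAM header. The helper below is a hypothetical sketch: it assumes giraffe accepts the same space-separated `TAG:value` form used in the command above for any tag, and that no tag value contains a space or tab.

```shell
# Sketch only: rg_header_to_giraffe is a hypothetical helper, not part of the workflow.
# Convert one tab-separated @RG header line into the space-separated
# "ID:... LB:... SM:..." form passed to --read-group in the command above.
rg_header_to_giraffe() {
  # Drop the leading "@RG" field, then turn the remaining tabs into spaces.
  printf '%s\n' "$1" | cut -f2- | tr '\t' ' '
}
```

The header lines themselves can be listed with `samtools view -H in.cram | grep '^@RG'`, one per RG.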