Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Question regarding align_trim.pl #99

Closed
ARastrojo opened this issue Jan 12, 2022 · 17 comments · Fixed by #137
Closed

Question regarding align_trim.pl #99

ARastrojo opened this issue Jan 12, 2022 · 17 comments · Fixed by #137

Comments

@ARastrojo
Copy link

Hi all,

I am using align_trim.py script to remove primers from the alignment. In me case, we have performed SARS2 amplification with v3 primer set, and then sequenced with illumina (Miseq).

I was expecting align_trim.py to trims primers from aligned reads only if reads starts or ends in the same positions as a primers starts or ends, and both reads and primers are in the same orientation (I think it has no sense to remove a reverse primer from a forward read, as this forward read can come from and overlapped amplicon). But after executing the program, we can see that all positions in the alignment were a primers is mapped are remove, and all bases up to the end of the read (when the read is forward and the primers is reverse for instance), among other strange things, it could be long to explain...

Here is a image of a couple of reads (in a very sad region, with very few mapped reads)

Captura de pantalla 2022-01-13 a las 0 34 03

So, I do not know is I am losing something is the program, or if in nanopore reads, where I do not have any experience, this behavior of the program is the expected.

Best,

Alberto

@BioWilko
Copy link
Collaborator

Hi Alberto

align_trim is designed for full amplicon unpaired read data and relies on that type of data to function properly. There is version of it I have adapted to work with paired and fragmented illumina data which you can find here but I'm still validating this so it comes with absolutely zero guarantees.

@charlesfoster
Copy link

Hi @BioWilko,

Based on your comment above, am I correct in assuming that artic's align_trim also does not function optimally on Nanopore data if a rapid barcoding kit has been used, since amplicons can be fragmented? We exclusively use the rapid-barcoding kit here instead of the ligation sequencing kit(s). I've been playing around with alternatives to align_trim in a snakemake workflow, and currently I'm trialling using samtools ampliconclip instead.

Thanks,
Charles

@BioWilko
Copy link
Collaborator

BioWilko commented Feb 10, 2022

That is correct however I also have a version of align_trim I've modified to work in the case of fragmented nanopore data available here if you utilise the --fragmented flag however as before this comes with zero guarantees, use at your own risk etc etc but based on my very limited testing it seems to work.

@charlesfoster
Copy link

Thanks for the reply. I gave the script a crack by (1) dropping it in as a replacement in a clone of the artic-ncov2019 conda environment, and (2) creating a new conda environment after cloning your fork of the fieldbioinformatics repo. The align_trim step errored out for me. No biggie, but just in case you find it useful as an edge case and if you want to continue modifying align_trim, the error was as follows:

Traceback (most recent call last):
  File "/home/vrl/miniconda3/envs/artic-clone/bin/align_trim", line 8, in <module>
    sys.exit(main())
  File "/home/vrl/miniconda3/envs/artic-clone/lib/python3.6/site-packages/artic/align_trim.py", line 379, in main
    go(args)
  File "/home/vrl/miniconda3/envs/artic-clone/lib/python3.6/site-packages/artic/align_trim.py", line 232, in go
    p1 = find_primer(bed, segment.reference_start, "+", args)
  File "/home/vrl/miniconda3/envs/artic-clone/lib/python3.6/site-packages/artic/align_trim.py", line 45, in find_primer
    key=itemgetter(0),
ValueError: min() arg is an empty sequence

I get a bit lost in some of the functions, so I'm not sure what the issue is.

@BioWilko
Copy link
Collaborator

BioWilko commented Feb 11, 2022

Would you be willing to share the fastqs for the data you used here? It would be super handy for getting to the bottom of the issue

Edit: Oh you used conda, that will cause problems since I haven't packaged the fork, you should follow the compile from source instructions to use this fork properly.

@charlesfoster
Copy link

Apologies for my slow reply. I previously installed your fork with the following steps:

  1. git clone https://github.com/BioWilko/fieldbioinformatics.git
  2. cd fieldbioinformatics
  3. mamba env create -f environment.yml # I changed the name of the env to artic-clone first
  4. conda activate artic-clone
  5. python setup.py install

Forgive me, but I can't see any different instructions for compiling from source, have I missed something?

I checked for the --fragmented flag with align trim -h:

usage: align_trim [-h] [--normalise NORMALISE] [--report REPORT] [--start]
                  [--no-read-groups] [--verbose] [--remove-incorrect-pairs]
                  [--fragmented]
                  bedfile

Trim alignments from an amplicon scheme.

positional arguments:
  bedfile               BED file containing the amplicon scheme

optional arguments:
  -h, --help            show this help message and exit
  --normalise NORMALISE
                        Subsample to n coverage per strand
  --report REPORT       Output report to file
  --start               Trim to start of primers instead of ends
  --no-read-groups      Do not divide reads into groups in SAM output
  --verbose             Debug mode
  --remove-incorrect-pairs
  --fragmented

I used the following command to trim the primers:

align_trim --fragmented --normalise 10000 nCoV-2019.scheme.bed --remove-incorrect-pairs --report test.alignreport.txt < test.sorted.bam

Same error as above.

However, when I tried to replicate this problem today on a different computer, I could not do so. When I follow the link you provided above, there is no longer a --fragmented command line argument. Now I'm thoroughly confused!

image

In any case, the problematic reads and bamfile can be found here: https://unsw-my.sharepoint.com/:f:/g/personal/z3533036_ad_unsw_edu_au/ErSNhdfncMRDvddJFQhV0N4BPB4Y3LRp9jfsaCFIUagz6g?e=M4o0Q8

@BioWilko
Copy link
Collaborator

Sorry that's on me entirely, I had to revert my repo for a version bump PR. However chris wright at ONT has written an amplicon overlap based version of align_trim which should suit your purposes which you can find here: https://github.com/epi2me-labs/fieldbioinformatics/tree/align_trim

@jonbra
Copy link

jonbra commented Sep 5, 2024

Hi, I think I might have the same problem. Tried to use the version you linked to from epi2me, but got the same problem.
I am using Nanopore data from the Nanopore rapidkit. And we are sequencing a small virus genome, ca. 3kb, and it's (partially) circular. The majority of reads are skipped as "not correctly paired" and the output of align_trim is completely empty. In the end I get a longshot error: "Max read coverage set to 0. printing empty VCF file".

@BioWilko
Copy link
Collaborator

BioWilko commented Sep 5, 2024

Fieldbioinformatics doesn't support circular genomes (yet) but I don't know what the specific issue would be in this case, my assumption would be that only amplicons crossing the reference boundaries would be an issue....

If you're having trouble with the epi2me version of fieldbioinformatics I suggest you raise the issues there since it differs in a few key ways, but if all you want is rapid barcoding support then I suggest you try the pre-release of fieldbioinformatics 1.4.0, this does support rapid barcoded data but not circular genomes.

@jonbra
Copy link

jonbra commented Sep 6, 2024

Thanks @BioWilko for the quick response! I managed to run the 1.4.0 release, but still have similar issues. More than 60000 reads map in the first round, but it seems that everything is removed by align_trim because the reads are not correctly paired. Could it be something wrong with my primer file?

@BioWilko
Copy link
Collaborator

BioWilko commented Sep 6, 2024

It sounds like it, if you can share your reads, reference, and bed file I can have a look and give you a better answer

@jonbra
Copy link

jonbra commented Sep 6, 2024

Thank you so much for the help!
Do you have access to download this file?
https://www.dropbox.com/scl/fi/ep0r3g5gqz3nbqs2lrbwt/hbv_artic.tar?rlkey=w0dmecvnuhz4fe3l57nb9msky&st=nrjvxr7l&dl=0

@jonbra
Copy link

jonbra commented Sep 6, 2024

This is the command I used and the 1.4.0 release:
artic minion --model r1041_e82_400bps_hac_g615 --normalise 500 --threads 6 --scheme-directory ../primer-schemes/ --bed ../primer-schemes/Ringlander/V4/Ringlander.scheme.bed --ref ../primer-schemes/Ringlander/V4/Ringlander.reference.fasta --read-file /home/jonr/Prosjekter/HBV/work/32/5d92f661e2c02eb1ca4c03d808e57a/HBV010bc01.fastq.gz MINION_TEST

@BioWilko
Copy link
Collaborator

BioWilko commented Sep 6, 2024

Ah, here's your problem, the bedfile is malformed:

NC_003977.2     3094    3112    HBV_3131_LEFT   1       +
NC_003977.2     333     355     HBV_353_RIGHT   1       -
NC_003977.2     301     321     HBV_299_LEFT    2       +
NC_003977.2     670     694     HBV_692_RIGHT   2       -
NC_003977.2     636     657     HBV_634_LEFT    1       +
NC_003977.2     1000    1021    HBV_1019_RIGHT  1       -
NC_003977.2     960     985     HBV_958_LEFT    2       +
NC_003977.2     1299    1315    HBV_1313_RIGHT  2       -
NC_003977.2     1265    1285    HBV_1263_LEFT   1       +
NC_003977.2     1611    1629    HBV_1627_RIGHT  1       -
NC_003977.2     1593    1610    HBV_1590_LEFT   2       +
NC_003977.2     1943    1967    HBV_1965_RIGHT  2       -
NC_003977.2     1908    1932    HBV_1906_LEFT   1       +
NC_003977.2     2307    2331    HBV_2329_RIGHT  1       -
NC_003977.2     2053    2072    HBV_2049_LEFT   2       +
NC_003977.2     2474    2496    HBV_2500_RIGHT  2       -
NC_003977.2     2450    2474    HBV_2454_LEFT   1       +
NC_003977.2     2835    2856    HBV_2860_RIGHT  1       -
NC_003977.2     2807    2831    HBV_2811_LEFT   2       +
NC_003977.2     3166    2       HBV_2_RIGHT     2       -

The number in the middle of the primer name is the amplicon number and they need to match e.g. HBV_1_LEFT and HBV_1_RIGHT fieldbioinformatics requires matching names to know which primers are used to amplify which amplicon.

If you want to see the bedfile specifications you can see them here: https://github.com/ChrisgKent/primal-page, fieldbioinformatics supports any of these formats so long as it is used consistently.

@jonbra
Copy link

jonbra commented Sep 6, 2024

Ah... I see. Thanks!! These are custom primers we designed without primalscheme. I tried to format everything corectly, but didn't notice this 😳

@jonbra
Copy link

jonbra commented Sep 6, 2024

The pipeline works now

@BioWilko
Copy link
Collaborator

BioWilko commented Sep 6, 2024

Unfortunately it still won't be able to handle amplicons crossing the reference boundary, that functionality will hopefully be added soon!

@BioWilko BioWilko linked a pull request Sep 20, 2024 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants