scripts used to analyse horizontal transfer and evolution of transposable elements in 307 vertebrate species

jeanlain/HTvertebrates

HTvertebrates

Scripts used in "Horizontal transfer and evolution of transposable elements in vertebrates" by Hua-Hao Zhang, Jean Peccoud, Min-Rui-Xuan Xu, Xiao-Gu Zhang, Clément Gilbert (doi: 10.1038/s41467-020-15149-4).

These scripts are made publicly available to show how parts of the analysis were automated. They come with no guarantee regarding their use. For those who want to use the pipeline, see below:

Requirements

The pipeline was not tested with other versions of the above programs, but more recent versions probably work.

Hardware requirements: a Linux cluster with

  • ≥200 CPUs
  • ≥0.5 TB of system memory
  • ≥2 TB of free hard drive space
  • an internet connection to download the genome sequences (>10 MB/s is recommended).

On this hardware, the pipeline should take 1-2 months to complete.

File description

The R scripts whose names start with numbers performed successive stages of the analysis. The purpose of each script is described by comments at the beginning of the script.

  • HTvFunctions.R contains functions required for the other scripts and is sourced automatically.
  • circularPlots.R contains functions used to draw Figure 2 of the paper.
  • The remaining scripts are launched via Rscript (for long, CPU-intensive tasks) from the scripts whose names start with numbers.

The files in directory additonal_files are required by the scripts:

  • supplementary-data1-genomes_and_accessions.txt gives general information about the genomes and is used to download genome sequences from NCBI.
  • ftp_links.txt contains URLs to the genome sequences, also used to download genome sequences.
  • timetree.nwk is the timetree (Newick format) used throughout the analysis.
  • namedClades.txt is a table of major vertebrate clades in this tree, with their names and color codes used to make some of the paper's figures (these figures are generated by the scripts).
  • superF.txt gives the correspondence between RepeatModeler family codes (first column), TE class (2nd column) and more common TE superfamily names (3rd column). It is used in stages 15 and 16.
  • supplementary-data3-TEcomposition_per_species.txt is generated by the scripts and is provided with the paper, but we also provide it here to facilitate the reproduction of the results.

The directory demo_TeKaKs is provided to demo the script TEKaKs.R (see Demonstration of TEKaKs.R below), but is not required to run the pipeline.

Installation

In a bash-compatible terminal that can execute git, paste

git clone https://github.com/jeanlain/HTvertebrates.git
cd HTvertebrates/

Alternatively, download https://github.com/jeanlain/HTvertebrates/archive/master.zip and uncompress the zip file.

Usage

Run the R scripts whose names start with numbers in the corresponding order, always from the HTvertebrates/ directory, which should be set as the working directory. We recommend running these scripts (except stage 1) in interactive mode.
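To see the order in which the numbered scripts are meant to be run, something like the following may help. It is only a sketch: it assumes the stage scripts sit at the top level of HTvertebrates/ and that each name starts with the stage number followed by a non-digit character. The function name list_stage_scripts is hypothetical and is not part of the repository.

```shell
# Hypothetical helper: print the R scripts whose names start with a number,
# sorted by that leading number (so that stage 2 is listed before stage 10).
list_stage_scripts() {
    dir="${1:-.}"                 # directory to scan, current one by default
    for f in "$dir"/[0-9]*.R; do
        [ -e "$f" ] || continue   # no match: the glob stays literal, skip it
        basename "$f"
    done | sort -n
}

# Usage, from the HTvertebrates/ directory:
#   list_stage_scripts
```

The helper only lists the scripts. Since interactive use is recommended, each script is still meant to be opened and run by hand.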

Adapting this pipeline to other datasets or hardware configurations, and automating all procedures, requires modifications to the code. Some parts of the analysis were not automated.

Demonstration of TEKaKs.R to compute pairwise Ka and Ks on transposable elements

We detail how to run TEKaKs.R on a demo dataset, but note that this script (like all the others) is not intended for use outside the study associated with the paper.

This script computes Ka, Ks, as well as overall molecular distances on pairs of homologous transposable elements (TEs), based on HSPs between these TEs and on HSPs between TEs and proteins (HSPs are not generated by this script). See the paper's method section for a description of the approach.

The hardware requirement for this demo is a Mac/Unix/Linux computer with at least 8 GB of RAM and 1 GB of free hard drive space, and which is able to execute R 3.4+ in a terminal. A Windows computer cannot run the script, as some R functions are not supported under Windows (namely those of the parallel R package).

A Mac computer may have issues installing the igraph R package from sources, as macOS lacks a Fortran compiler. However, the igraph package may be installed manually by specifying NOT to install packages from sources (which is not possible to do via Rscript).

The other programs mentioned in the Requirements section need not be installed for this demo.

The demo_TeKaKs/ directory must be immediately within the HTvertebrates/ directory. It contains the following:

  • TEhitFile.txt is a file of TE-TE HSPs in typical blast tabular format, but only listing sequence names and HSP coordinates.
  • blastxFile.txt is a tabular file of TE-protein HSPs. The fields indicate the TE sequence name, start and end coordinates of the HSP on this sequence (where start < end), start coordinate of the HSP on the protein, and whether the TE sequence is aligned on the protein in reverse direction.
  • fastaFile.fas is a fasta file of the TE sequences whose names are in the two aforementioned files.

The nature of these files is also explained by comments in TEKaKs.R.
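As an illustration of the blastxFile.txt layout described above, the following sketch flags HSPs that violate the start < end convention. It assumes the file is tab-separated, with the TE start and end coordinates in columns 2 and 3 as listed above. Whether the file has a header line is not stated here, so lines whose columns 2 and 3 are not integers are simply skipped. The function name check_hsp_coords is hypothetical.

```shell
# Hypothetical check: print the lines of a TE-protein HSP table whose start
# coordinate (column 2) is not smaller than the end coordinate (column 3).
# Lines where columns 2 and 3 are not integers (e.g. a header) are skipped.
# The exit status is 1 if at least one such line was found, 0 otherwise.
check_hsp_coords() {
    awk -F '\t' '
        $2 ~ /^[0-9]+$/ && $3 ~ /^[0-9]+$/ && ($2 + 0) >= ($3 + 0) {
            print FILENAME ": line " NR ": start >= end: " $0
            bad = 1
        }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}

# Usage, from the HTvertebrates/ directory:
#   check_hsp_coords demo_TeKaKs/blastxFile.txt
```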

To run the demo, paste the following in the terminal session that you used to install the pipeline:

Rscript TEKaKs.R demo_TeKaKs/TEhitFile.txt demo_TeKaKs/blastxFile.txt demo_TeKaKs/fastaFile.fas demo_TeKaKs/output 2

where demo_TeKaKs/output is the output folder (automatically created) and 2 is the number of CPUs to use.

The script should run in less than 5 minutes on a standard desktop PC.
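If the demo fails to start, a missing input file or a wrong working directory is a likely cause. The wrapper below is a minimal sketch that checks for Rscript and for the three demo input files before launching the same command as above. The function name run_tekaks_demo is hypothetical, and the wrapper assumes it is called from the HTvertebrates/ directory.

```shell
# Hypothetical wrapper around the demo command shown above.
# It only checks that Rscript and the input files are present, then launches
# TEKaKs.R with the same arguments as in the demo.
run_tekaks_demo() {
    demo="demo_TeKaKs"
    ncpu="${1:-2}"    # number of CPUs to use, 2 as in the demo command
    command -v Rscript >/dev/null 2>&1 || {
        echo "Rscript not found in PATH" >&2
        return 1
    }
    for f in TEKaKs.R "$demo/TEhitFile.txt" "$demo/blastxFile.txt" "$demo/fastaFile.fas"; do
        [ -f "$f" ] || {
            echo "missing file: $f (run this from HTvertebrates/)" >&2
            return 1
        }
    done
    Rscript TEKaKs.R "$demo/TEhitFile.txt" "$demo/blastxFile.txt" \
        "$demo/fastaFile.fas" "$demo/output" "$ncpu"
}

# Usage, from the HTvertebrates/ directory:
#   run_tekaks_demo 2
```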

Results will be found in demo_TeKaKs/output/allKaKs.txt. This tabular file contains the following fields:

  • hit is an identifier for each HSP, which corresponds to the row index of each HSP in TEhitFile.txt.
  • ka, ks, vka and vks are the results of Ka and Ks computations (see the kaks() function of seqinr).
  • length is the length of the alignment on which the above were computed.
  • nMut is the number of substitutions in this alignment.
  • K80distance and rawDistance are molecular distances (according to Kimura 1980 or without any correction) between sequences in the HSP. These are computed before any of the processing required for the Ka Ks computations (the removal of certain nucleotides and codons, see the method section of the paper).
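A common follow-up on such a table is the Ka/Ks ratio per HSP. The sketch below is not part of the pipeline. It assumes that allKaKs.txt is tab-separated and starts with a header line whose column names match the field names listed above (hit, ka, ks); the function name kaks_ratio and the output column name ka_ks are hypothetical.

```shell
# Hypothetical helper: print hit, ka, ks and ka/ks for each row of a
# tab-separated table whose header line names the columns "hit", "ka", "ks".
# Rows where ka or ks is not numeric, or where ks is not positive, are skipped.
kaks_ratio() {
    awk -F '\t' '
        NR == 1 {
            # Locate the columns by name rather than by position.
            for (i = 1; i <= NF; i++) col[$i] = i
            if (!("hit" in col) || !("ka" in col) || !("ks" in col)) {
                print "expected columns hit, ka and ks in the header" > "/dev/stderr"
                exit 1
            }
            print "hit\tka\tks\tka_ks"
            next
        }
        {
            ka = $(col["ka"]); ks = $(col["ks"])
            if (ka ~ /^[0-9.eE+-]+$/ && ks ~ /^[0-9.eE+-]+$/ && ks + 0 > 0)
                printf "%s\t%s\t%s\t%.4f\n", $(col["hit"]), ka, ks, ka / ks
        }
    ' "$1"
}

# Usage, once the demo has completed:
#   kaks_ratio demo_TeKaKs/output/allKaKs.txt
```

Looking columns up by header name keeps the sketch independent of the column order, which is not specified here.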

Output of the pipeline

More than 1 TB of intermediate files are generated. The final output corresponds to the results of the publication (please see the publication for their description).

  • Figure2.pdf, Figure3.pdf and Figure4.pdf are produced at stages 14, 15 and 16 respectively. They correspond to figures of the main text.
  • FigureS1.pdf is generated at stage 5. It corresponds to supplementary figure 1.
  • FigureS2.pdf is generated at stage 11. It corresponds to supplementary figure 2.
  • FigureS[3-6].pdf are generated at stage 16. They correspond to supplementary figures 3-6.
  • tableS1.txt is generated at stage 15. It corresponds to supplementary table 1.
  • tableS2.txt is generated at stage 16. It corresponds to supplementary table 2.
  • supplementary-data3-TEcomposition_per_species.txt is generated at stage 2.
  • supplementary-data4-retained_hits.txt is generated at stage 12.
