Scripts used in "Horizontal transfer and evolution of transposable elements in vertebrates" by Hua-Hao Zhang, Jean Peccoud, Min-Rui-Xuan Xu, Xiao-Gu Zhang, Clément Gilbert (doi: 10.1038/s41467-020-15149-4).
These scripts are publicly available to indicate how parts of the analysis were automated. There is no guarantee regarding their use. For those who want to use the pipeline, see below:
- R 3.4+ with the following packages:
  - data.table 1.11.4
  - stringi 1.2.4
  - matrixStats 0.54
  - igraph 1.2.4.1
  - ape 5.1
  - seqinr 3.4-5
  - Biostrings 2.52
  - RColorBrewer 1.1-2
  (If not found, these packages are installed automatically by the pipeline.)
- RepeatModeler 1.0.10
- RepeatMasker 4.0.7
- BUSCO 3.0.1
- NCBI BLAST+ 2.6.0
- diamond 0.9.19
- seqtk 1.2-r94
- Slurm Workload Manager 17.11.7
The pipeline was not tested with other versions of the above programs, but more recent versions probably work.
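A quick way to see which of the programs listed above are visible on the current PATH is sketched below. The executable names are assumptions (typical names for these tools, such as run_BUSCO.py for BUSCO 3 and sbatch for Slurm). They are not taken from the pipeline's code, so adjust them to your installation.

```shell
# Print the names of the given executables that are NOT found on PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Assumed executable names, one per program in the requirements list.
missing="$(missing_tools Rscript RepeatModeler RepeatMasker run_BUSCO.py \
                         blastn diamond seqtk sbatch)"

if [ -n "$missing" ]; then
    printf 'Not found on PATH:\n%s\n' "$missing"
else
    echo "All listed executables were found."
fi
```

This only checks that the programs can be found; it does not compare their versions with those listed above.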
Hardware requirements: a Linux cluster with
- ≥200 CPUs
- ≥0.5 TB of system memory
- ≥2 TB of free hard drive space
- an internet connection to download the genome sequences (>10 MB/s is recommended).
On this hardware, the pipeline should take 1-2 months to complete.
The R scripts whose names start with numbers performed successive stages of the analysis. The purpose of each script is described by comments at the beginning of the script.
- HTvFunctions.R contains functions required for the other scripts and is sourced automatically.
- circularPlots.R contains functions used to draw Figure 2 of the paper.
- The remaining scripts are launched via Rscript (for long, CPU-intensive tasks) from the scripts whose names start with numbers.
The files in directory additonal_files are required by the scripts:
- supplementary-data1-genomes_and_accessions.txt gives general information about the genomes and is used to download genome sequences from NCBI.
- ftp_links.txt contains URLs to the genome sequences, also used to download genome sequences.
- timetree.nwk is the timetree (newick format) used throughout the analysis.
- namedClades.txt is a table of major vertebrate clades in this tree, with their names and color codes used to make some of the paper's figures (these figures are generated by the scripts).
- superF.txt gives the correspondence between RepeatModeler family codes (first column), TE class (2nd column) and more common TE superfamily names (3rd column). It is used in stages 15 and 16.
- supplementary-data3-TEcomposition_per_species.txt is generated by the scripts and is provided with the paper, but we also provide it here to facilitate the reproduction of the results.
The directory demo_TeKaKs is provided to demo the script TEKaKs.R (see Demonstration of TEKaKs.R below), but is not required to run the pipeline.
In a bash-compatible terminal that can execute git, paste
git clone https://github.com/jeanlain/HTvertebrates.git
cd HTvertebrates/
Alternatively, download https://github.com/jeanlain/HTvertebrates/archive/master.zip and uncompress the zip file.
Run the R scripts whose names start with numbers in the corresponding order, always from the HTvertebrates/ directory, which should be set as the working directory.
We recommend running these scripts (except stage 1) in interactive mode.
Adapting this pipeline to other datasets or hardware configurations, or automating all procedures, requires modifications to the code. Some parts of the analysis were not automated.
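To see the order in which the stage scripts are meant to be run, the following sketch lists the R scripts whose names start with a number, sorted by that number. It assumes the stage scripts sit directly in the HTvertebrates/ directory and end in .R. It only lists them and does not run anything, since most stages are meant to be run interactively.

```shell
# List the .R files in a directory whose names start with a digit,
# sorted by their leading number (so that 10 comes after 2).
list_stage_scripts() {
    local dir="${1:-.}"
    find "$dir" -maxdepth 1 -type f -name '[0-9]*.R' -exec basename {} \; | sort -n
}

# From within HTvertebrates/, print the stage scripts in running order.
list_stage_scripts .
```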
We detail how to run TEKaKs.R on a demo dataset, but note that this script (like all others) is not intended for use in any context other than the study associated with the paper.
This script computes Ka and Ks, as well as overall molecular distances, on pairs of homologous transposable elements (TEs), based on HSPs between these TEs and on HSPs between TEs and proteins (HSPs are not generated by this script). See the paper's method section for a description of the approach.
The hardware requirement for this demo is a Mac/Unix/Linux computer with at least 8 GB of RAM and 1 GB of free hard drive space, and which is able to execute R 3.4+ in a terminal. A Windows computer cannot run the script, as some R functions are not supported under Windows (namely those of the parallel R package).
A Mac computer may have issues installing the igraph R package from sources, as macOS lacks a Fortran compiler. However, the igraph package may be installed manually by specifying NOT to install packages from sources (which is not possible to do via Rscript).
The other programs mentioned in the Requirements section need not be installed for this demo.
The demo_TeKaKs/ directory must be immediately within the HTvertebrates/ directory. It contains the following:
- TEhitFile.txt is a file of TE-TE HSPs in typical blast tabular format, but only listing sequence names and HSP coordinates.
- blastxFile.txt is a tabular file of TE-protein HSPs. The fields indicate the TE sequence name, the start and end coordinates of the HSP on this sequence (where start < end), the start coordinate of the HSP on the protein, and whether the TE sequence is aligned on the protein in reverse direction.
- fastaFile.fas is a fasta file of the TE sequences whose names are in the two aforementioned files.
The nature of these files is also explained by comments in TEKaKs.R.
To run the demo, paste the following in the terminal session that you used to install the pipeline:
Rscript TEKaKs.R demo_TeKaKs/TEhitFile.txt demo_TeKaKs/blastxFile.txt demo_TeKaKs/fastaFile.fas demo_TeKaKs/output 2
where demo_TeKaKs/output is the output folder (automatically created) and 2 is the number of CPUs to use.
The script should run in less than 5 minutes on a standard desktop PC.
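Before launching the demo command above, it may help to confirm that the three input files are where the command expects them. The sketch below checks this; the function name check_demo_inputs is hypothetical and is not part of the repository.

```shell
# Return 0 if the three demo input files exist and are non-empty in the
# given directory (default: demo_TeKaKs); otherwise report each problem
# on stderr and return 1.
check_demo_inputs() {
    local dir="${1:-demo_TeKaKs}"
    local status=0
    local f
    for f in TEhitFile.txt blastxFile.txt fastaFile.fas; do
        if [ ! -s "$dir/$f" ]; then
            echo "Missing or empty: $dir/$f" >&2
            status=1
        fi
    done
    return "$status"
}

# Possible use, from within HTvertebrates/:
#   check_demo_inputs demo_TeKaKs && echo "Demo inputs look complete."
```

A failed check usually means the terminal is not in the HTvertebrates/ directory, or that demo_TeKaKs/ was moved.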
Results will be found in demo_TeKaKs/output/allKaKs.txt. This tabular file contains the following fields:
- hit is an identifier for each HSP, which corresponds to the row index of each HSP in TEhitFile.txt.
- ka, ks, vka and vks are the results of Ka and Ks computations (see the kaks() function of seqinr).
- length is the length of the alignment on which the above were computed.
- nMut is the number of substitutions in this alignment.
- K80distance and rawDistance are molecular distances (according to Kimura 1980 or without any correction) between sequences in the HSP. These are computed before any of the processing required for the Ka Ks computations (the removal of certain nucleotides and codons, see the method section of the paper).
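For a quick look at a few of these fields, columns can be pulled out of the result file by name. The sketch below assumes that allKaKs.txt is tab-delimited and starts with a header row carrying the field names listed above; this has not been checked against the script's actual output. The function name select_columns is hypothetical.

```shell
# Print the named columns of a tab-delimited file that has a header row.
# Usage: select_columns FILE NAME [NAME...]
select_columns() {
    local file="$1"
    shift
    awk -F'\t' -v OFS='\t' -v wanted="$*" '
        NR == 1 {
            # Map each header name to its column index.
            n = split(wanted, names, " ")
            for (i = 1; i <= NF; i++) col[$i] = i
            for (j = 1; j <= n; j++) {
                if (!(names[j] in col)) {
                    print "column not found: " names[j] > "/dev/stderr"
                    exit 1
                }
            }
        }
        {
            # Emit the requested columns in the requested order
            # (the header row is printed as well).
            line = $(col[names[1]])
            for (j = 2; j <= n; j++) line = line OFS $(col[names[j]])
            print line
        }
    ' "$file"
}

# Possible use, after the demo has completed:
#   select_columns demo_TeKaKs/output/allKaKs.txt hit ka ks K80distance | head
```

Looking columns up by name rather than by position keeps the sketch valid if the column order differs from the order in which the fields are described here.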
More than 1 TB of intermediate files are generated. The final output corresponds to the results of the publication (please see the publication for their description).
- Figure2.pdf, Figure3.pdf and Figure4.pdf are produced at stages 14, 15 and 16 respectively. They correspond to figures of the main text.
- FigureS1.pdf is generated at stage 5. It corresponds to supplementary figure 1.
- FigureS2.pdf is generated at stage 11. It corresponds to supplementary figure 2.
- FigureS[3-6].pdf are generated at stage 16. They correspond to supplementary figures 3-6.
- tableS1.txt is generated at stage 15. It corresponds to supplementary table 1.
- tableS2.txt is generated at stage 16. It corresponds to supplementary table 2.
- supplementary-data3-TEcomposition_per_species.txt is generated at stage 2.
- supplementary-data4-retained_hits.txt is generated at stage 12.