
Miscellaneous bioinformatics scripts and utils I use


FNL-MoCha/biofx_utils

 
 


Miscellaneous Bioinformatics Scripts

This is a set of tools I developed to do some basic but helpful things with Ion Torrent NGS data. Some of the tools described below are more general in nature, however, and are hopefully useful for quick jobs.

Included Scripts

For now, here is a brief description of the tools included in this repo. Each one has a full set of help docs, which can be accessed by passing the -h or --help option to the script.

getseq.pl

Current Version: v1.3.0_110515

Requirements:

  • Perl Modules:

    • LWP::Simple
    • XML::Twig
    • Sort::Versions
    • Data::Dump

Description:

Script to query the UCSC DAS server and pull out sequence information given a position string. Can use either a single string entry:

chrx:123435

or can use a file with positions in the same format and get a batch output.
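As a rough illustration of the position-string handling described above (in Python rather than the script's Perl), the sketch below parses a `chr:position` string and builds a DAS-style request URL. The function names, the `hg19` default build, and the exact URL layout are assumptions made for this example and are not taken from getseq.pl.

```python
import re
from urllib.parse import urlencode

# Assumed layout of the UCSC DAS DNA endpoint; verify against UCSC's documentation.
DAS_URL = "http://genome.ucsc.edu/cgi-bin/das/{build}/dna"

# Matches strings such as "chrx:123435" (chromosome label, colon, position).
POSITION_RE = re.compile(r"^(chr[0-9A-Za-z_]+):(\d+)$")


def parse_position(text):
    """Split a 'chr:position' string into (chromosome, position)."""
    match = POSITION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a chr:position string: {text!r}")
    return match.group(1), int(match.group(2))


def das_url(text, build="hg19"):
    """Build a request URL for the single base at the given position."""
    chrom, pos = parse_position(text)
    # DAS segments are written as chrom:start,end; keep ':' and ',' unescaped.
    query = urlencode({"segment": f"{chrom}:{pos},{pos}"}, safe=":,")
    return DAS_URL.format(build=build) + "?" + query


def read_positions(path):
    """Read a batch file with one position string per line, skipping blanks."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]
```

For batch use, `read_positions` can feed each line to `das_url`; actually fetching and parsing the XML response is left out of this sketch.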

vcfExtractor.pl

Current Version: v8.1.102218

Requirements:

  • VCF Tools
  • Perl Modules:
    • Sort::Versions
    • Data::Dump

Description:

Script to parse Ion Torrent specific VCF files and pull out variant data. This works with TS v4.2 and v5.0 files, whether or not they were run through an Ion Reporter system.

In order to run this utility, you'll need to have the package installed and VCF Tools in your $PATH. There may be some other non-standard Perl modules to be installed, such as Sort::Versions, Data::Dump, etc. All can easily be installed from CPAN as usual.

See the help documentation for this script for details on the options and functionality of this tool:

$ vcfExtractor.pl -h

readlength_histogram.pl

Current Version: v7.25.031418

Requirements:

Description:

Read in an Ion Torrent BAM file and generate a read length histogram plot from the sample. This script requires the Perl Statistics::R module, as well as the most excellent ggplot2 library for R.

map_refs.py

Current Version: v0.1.020918

Requirements:

Description:

Map coordinates between two reference assemblies (e.g. hg18 and hg19), keeping them in order. This utility requires Konstantin's excellent Python pyliftover library, which leverages the UCSC liftOver utility for mapping reference assemblies.
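The sketch below shows one way such a mapping could be done with pyliftover. It is not the code in map_refs.py: the function names are hypothetical, and taking only the first hit for each position is an assumption of this example. Note that pyliftover works with 0-based coordinates.

```python
def map_positions(convert, positions):
    """Pair each (chrom, pos) with its lifted-over coordinate, in input order.

    `convert` is any callable shaped like pyliftover's
    LiftOver.convert_coordinate(chrom, pos): it returns a list of
    (chrom, pos, strand, score) tuples, an empty list when the position
    has no counterpart, or None when the chromosome is unknown.
    """
    mapped = []
    for chrom, pos in positions:
        hits = convert(chrom, pos)
        # Assumption for this sketch: keep only the first (best) hit.
        target = (hits[0][0], hits[0][1]) if hits else None
        mapped.append(((chrom, pos), target))
    return mapped


def lift_hg18_to_hg19(positions):
    """Convenience wrapper; needs pyliftover installed and network access."""
    from pyliftover import LiftOver  # imported lazily so the helper above has no dependency

    lifter = LiftOver("hg18", "hg19")  # fetches the chain file on first use
    return map_positions(lifter.convert_coordinate, positions)
```

Calling `lift_hg18_to_hg19([("chr1", 1000000)])` would return a list of `(source, target)` pairs, with `None` as the target wherever a position could not be mapped.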

germline_merge.pl

Current Version: v0.4.030118

Requirements:

  • vcfExtractor.pl

Description:

Combine OCA Ion Reporter blood and tumor VCFs to generate a tumor / normal comparison file. In reality, though the labels will be 'blood' and 'tumor' related, the data really can be generated by comparing any two VCF files from Ion Reporter.

get_clinvar_variant_data.py

Current Version: 2.0_111417

Requirements:

Description:

Input a file, a comma separated list, or a single ClinVar ID and get a table of variant information derived from ClinVar using NCBI's eutils API. Filtering is not possible for now, but will be added later.
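As a minimal sketch of the input handling described above, the following normalizes a comma separated ID string and builds an E-utilities `esummary` request for the `clinvar` database. The function names are hypothetical, and treating the IDs as numeric UIDs is an assumption; the actual script may query ClinVar differently.

```python
from urllib.parse import urlencode

# NCBI E-utilities summary endpoint.
ESUMMARY_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"


def parse_ids(text):
    """Turn '12345' or '12345, 678' into an ordered list of unique ID strings."""
    ids = []
    for token in text.split(","):
        token = token.strip()
        if not token:
            continue
        # Assumption for this sketch: ClinVar IDs are plain integers.
        if not token.isdigit():
            raise ValueError(f"not a numeric ClinVar ID: {token!r}")
        if token not in ids:
            ids.append(token)
    return ids


def esummary_url(ids):
    """Build a single esummary request covering all of the given IDs."""
    query = urlencode(
        {"db": "clinvar", "id": ",".join(ids), "retmode": "json"},
        safe=",",  # keep the comma separated ID list readable
    )
    return ESUMMARY_URL + "?" + query
```

IDs read from a file, one per line, can be joined with commas and passed through `parse_ids` in the same way; fetching the URL and tabulating the JSON response are left out here.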

get_pathway.py

Current Version: v1.3.121318

Requirements:

Description:

Using a pathway lookup table in resources, generate a list of oncogenic pathways for a gene or set of genes. The pathway lookup tables still need refining, but the hope is that this will be a good annotation tool that can be integrated into other pipelines.

protein_domain_retrieve.py

Current Version: v0.2.121517

Requirements:

Description:

Starting with a correctly formatted HUGO gene ID, retrieve protein domain position information from EMBL in a JSON format that can be used as a lookup DB in other programs. You can either load a comma separated string of IDs, or a batch file containing a list of IDs, one per line, to look up.

get_gene_by_coord.pl

Current Version: v0.8.121718

Requirements:

Description:

Script to read in a GRCh37 (hg19) coordinate in the format chr:position and output a HUGO gene name. The input can be a comma separated list of coordinates or a file containing a batch of coordinates to look up, one per line. This script is written with parallel processing in mind, so batch lookups are fast.
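To illustrate the kind of coordinate-to-gene lookup described above, here is a small Python sketch that uses a sorted interval table and binary search. The gene table is entirely made up, the function names are hypothetical, and the parallel processing the script uses is not shown; nothing here is taken from get_gene_by_coord.pl itself.

```python
from bisect import bisect_right

# Made-up example intervals: chromosome -> sorted, non-overlapping
# (start, end, gene name) tuples. Real data would come from an annotation file.
GENES = {
    "chr1": [(1000, 2000, "GENE_A"), (5000, 9000, "GENE_B")],
    "chr2": [(300, 800, "GENE_C")],
}

# Start positions per chromosome, precomputed for binary search.
STARTS = {chrom: [start for start, _, _ in rows] for chrom, rows in GENES.items()}


def gene_at(coord):
    """Return the gene containing a 'chr:position' coordinate, or None."""
    chrom, _, pos_text = coord.strip().partition(":")
    pos = int(pos_text)
    # Find the last interval whose start is <= pos, then check its end.
    idx = bisect_right(STARTS.get(chrom, []), pos) - 1
    if idx >= 0:
        start, end, name = GENES[chrom][idx]
        if start <= pos <= end:
            return name
    return None


def genes_for(coord_list):
    """Look up a comma separated list of coordinates."""
    coords = [c.strip() for c in coord_list.split(",") if c.strip()]
    return {coord: gene_at(coord) for coord in coords}
```

Because each lookup is independent, a batch of coordinates could be split across worker processes, which is one plausible reading of the parallel design mentioned above.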

Languages

  • Perl 78.4%
  • Python 21.6%