Pending Tasks

Toni Westbrook edited this page Jun 25, 2015 · 39 revisions

####Research, Design, Development####

  • Tool for automated/scripted GO analysis of PALADIN results to measure effectiveness (of identifying functionality, as opposed to taxonomy)
  • Post-processing functionality (SAM translation, embedding protein info, BAM generation, output options)
  • Options refinement (remove BWA options not usable/relevant to PALADIN, ensure argument checking works with new PALADIN-related options)
  • Code cleanup
  • Incorporate ORF detection hinting depending on results obtained below

  • Fix intentional memory leak issue in the getSequenceORF function. (6/21/15)
  • For each major algorithm variant that we will use, create a command line argument to select it and set appropriate options. Part of this will be detection of single vs multi frame protein translation in the reference index during the alignment phase and adjusting functionality accordingly (including for Post-processing, see below) (6/21/15)
  • Create script to generate a mapping between UniProt/RefSeq IDs and CDS entry for mapped reads (for use in GO work) (6/16/15)
  • Create second version of UniProt nucleotide database with the references removed for our 6 MCBS913 species (6/9/15)
  • Fix >2GB indexing issue (inherent to BWA), necessary for UniProt nucleotide testing (6/10/15)
  • Fix memory leak in sequence header name parsing (causing huge amounts of memory to be used)
  • Build nucleotide sequence database for each corresponding protein sequence in the UniProt database (for the 95% where possible)
  • Move multi-frame protein translation from indexing to alignment - then choose best alignments during SAM output
  • Research reference set used by PhyloSift, clone environment in PALADIN
  • Add command arguments for ORF length
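
The multi-frame translation and ORF length items above can be illustrated with a minimal Python sketch. It is not PALADIN's implementation (PALADIN is built on BWA): the names `find_orfs` and `min_len` are hypothetical, and an "ORF" here is assumed to be simply a run of codons without a stop, scanned in all six frames.

```python
# Illustrative sketch only: hypothetical helper names, not PALADIN code.
STOPS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (non-ACGT characters are left unchanged)."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_orfs(seq, min_len=3):
    """Find stop-free codon runs of at least min_len codons in all six frames.

    Returns (strand, frame, start, end) tuples; start/end are 0-based offsets
    on the scanned strand, with end exclusive.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            run_start = frame
            pos = frame
            while pos + 3 <= len(s):
                if s[pos:pos + 3] in STOPS:
                    # A stop codon closes the current run.
                    if (pos - run_start) // 3 >= min_len:
                        orfs.append((strand, frame, run_start, pos))
                    run_start = pos + 3
                pos += 3
            # Keep a run that reaches the end of the sequence without a stop.
            if (pos - run_start) // 3 >= min_len:
                orfs.append((strand, frame, run_start, pos))
    return orfs
```

Raising `min_len` trades sensitivity for fewer spurious candidates, which is the kind of trade-off a command-line ORF length argument would expose.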

####Testing and Verification####

  • For all tests, perform accompanying GO analysis (create an individual FINISHED item below for each completed test, remove this item when all tests are done)
  • Align: Generated metagenomic reads against UniProt (Full and Filtered) using Novo, plain
  • Align: Generated metagenomic reads against UniProt (Full and Filtered) using Novo, degenerate
  • Align: Real metagenomic reads against UniProt using BWA
  • Align: Real metagenomic reads against UniProt using Novo, plain
  • Align: Real metagenomic reads against UniProt using Novo, degenerate
  • Test with reads of length 150
  • After incorporating any hinting functionality from tests above, retest all algorithm variants on the real metagenomic reads (with full UniProt DB)

  • Test frequency ordering of specific stop codons vs GC content (similar to test below) to see if specific stops differ from overall pattern (6/24/15)
  • Verify alignment of mapped reads, develop error measure, look for patterns/recurring issues in misalignment (The non-GO portion of this is done - verification now needs to be performed at the functional level) (6/24/15)
  • Complete tests on GC content vs stop codon frequency frame order (6/17/15)
  • Generated metagenomic reads against UniProt (Full and Filtered) using BWA (6/17/15)
  • Complete tests on stop codon frequency frame order (6/14/15)
  • Run all generated metagenome reads tests again using the filtered UniProt database (6/11/15)
  • Complete test of algorithm variant 1 on the generated metagenome reads (6/10/15)
  • Run stop codon frequency stats (counts, frame order) on UniProt, MCBS913 species, and Acidovorax
  • Test variant 2 and 3 completed with generated metagenome reads
  • Initial test of ORF detection strategies (See initial list here)

####Paper####
