ChangeLog

Version 0.40.0
  * Documentation is no longer packaged with the downloads. It's on the GitHub wiki.
  * Discontinued building the Linux binary since it is unlikely to be portable.
    If you were using this and would like us to keep generating it, let us know!


Version 0.39.0
  * Fixed MAJOR ERRORS in assigning mutations to promoter regions of genes (#374)
  * We are migrating the breseq documentation to the GitHub wiki. Please check there
    for the most up-to-date version. The docs packaged with breseq are being phased out.
  * Commands like breseq BAM2COV and BAM2ALN are now more forgiving in the ways regions can be specified.
    * Commas in positions are removed, so REL606:627,165-627,185 is read as REL606:627165-627185.
    * A single position can be given as either REL606:627165 or REL606:627165-627165.
    * Hyphen-like characters (–) are automatically converted to hyphens (-).
  * Removed an extra comma in gdtools COMPARE table output header (#368)
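
    The region rules listed for this version can be sketched as below. This is only
    an illustration in Python, not breseq's own code: the function name
    normalize_region and the exact set of hyphen-like characters are assumptions.

```python
import re
from typing import Tuple

# Assumed set of "hyphen-like" characters (various dashes and the minus sign).
# The characters breseq actually converts may differ.
HYPHEN_LIKE = "\u2010\u2011\u2012\u2013\u2014\u2212"


def normalize_region(region: str) -> Tuple[str, int, int]:
    """Parse 'SEQID:START-END' or 'SEQID:POS' into (seq_id, start, end)."""
    # Split on the last colon so that the sequence ID itself is left untouched.
    seq_id, sep, coords = region.rpartition(":")
    if not sep or not seq_id:
        raise ValueError("expected SEQID:START-END, got %r" % region)
    # Convert hyphen-like characters to plain hyphens, then drop commas.
    coords = coords.translate({ord(c): "-" for c in HYPHEN_LIKE})
    coords = coords.replace(",", "")
    match = re.fullmatch(r"(\d+)(?:-(\d+))?", coords)
    if match is None:
        raise ValueError("could not parse coordinates in %r" % region)
    start = int(match.group(1))
    # A single position is treated as a region of length one.
    end = int(match.group(2)) if match.group(2) else start
    return seq_id, start, end
```

    With this sketch, REL606:627,165-627,185 and REL606:627165-627185 give the same
    result, and REL606:627165 is equivalent to REL606:627165-627165.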

Version 0.38.3
  * Miscellaneous bug fixes related to nonstandard reference file loading.
  * Fixed problem adding --isescan-results with GenBank input to breseq CONVERT-REFERENCE.
  * Some commands (breseq BAM2ALN, breseq BAM2COV) will now create output directories if they don't already exist.
  * breseq now allows plus (+) and minus (−) signs to be used in reference seq ids, in addition to underscores and hyphens.
  * Changed syntax for experimental breseq SOFT-CLIPPING command to allow multiple BAM inputs.

Version 0.38.2
  * Miscellaneous bug fixes and updates to reference file handling.
  * Updates to reference format documentation to describe how to get better IS prediction.
  * Added experimental breseq SOFT-CLIPPING subcommand.

Version 0.38.1
  * Fix for infinite loop triggered by help when terminal window was too narrow.
  
Version 0.38.0
  * breseq can now analyze long-read sequencing data. It will split long reads into short subsequence
    reads that are of a length that works for its read mapping parameters and new junction analysis.
    Read splitting can be controlled with the --long-read-* options. If you are looking for structural
    variation, you should be performing de novo assembly and genome comparisons using other tools!
    breseq will not be able to identify or fully resolve large or complex structural variants.
  * If you are using the current generation of Nanopore long read data, we recommend using the 
    --nanopore option, which sets read mapping parameters that speed up the analysis and filters 
    out predictions in error-prone homopolymer repeats of 4 or more bases.
  * The MacOSX executable is now universal (should run on both Intel and M1 Macs).
  * Changed how reads that map equally to the reference and new junction get resolved to
    eliminate junctions that could get assigned high frequencies due to very little evidence,
    particularly in polymorphism mode. This approach is more conservative and may decrease the
    predicted frequencies of junctions, but should only have minor effects in consensus mode.
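
    To make the homopolymer criterion mentioned for --nanopore concrete, here is a
    minimal Python sketch of testing whether a reference position lies in a run of
    4 or more identical bases. The function name in_homopolymer and its signature
    are hypothetical, and breseq's actual filter may apply additional rules.

```python
def in_homopolymer(seq: str, pos: int, min_length: int = 4) -> bool:
    """Return True if 1-based position pos lies in a run of identical
    bases that is at least min_length long."""
    if not 1 <= pos <= len(seq):
        raise ValueError("position %d is outside the sequence" % pos)
    i = pos - 1
    base = seq[i].upper()
    # Extend the run to the left while the base stays the same.
    left = i
    while left > 0 and seq[left - 1].upper() == base:
        left -= 1
    # Extend the run to the right while the base stays the same.
    right = i
    while right < len(seq) - 1 and seq[right + 1].upper() == base:
        right += 1
    return right - left + 1 >= min_length
```

    In the sequence ACGTTTTAC, positions 4 through 7 would be flagged, while a run
    of only three identical bases would not.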

Version 0.37.1
  * Fixed bug in the way that indels in homopolymers were realigned for consistency
    that sporadically caused fatal "Covariate 'quality' exceeded enforced maximum" errors.
  * Fixed fatal "Attempt to add duplicate of this existing entry" error that could arise when
    using user evidence Genome Diff files merged from outputs of previous runs.

Version 0.37.0
  * Warns that breseq results can vary (usually only slightly) with different
    bowtie2 versions and recommends a specific version for consistency.
  * Added back code for shifting indels to improve mutation calling.
  * Added an option for gdtools APPLY to output a new GD file showing the
    positions of each mutation in the (mutated) output genome.
  * Updated documentation related to installation and explaining junction orientations.
  * Fixed crash that could occur if there was a deletion spanning the origin of a
    circular reference sequence.

Version 0.36.1
  * Added an appendix to the manual on how reference sequence files are loaded/used.
  * When loading GenBank files that have a LOCUS line and source feature with different
    lengths, the LOCUS line length is now used to improve compatibility with sequences
    that have been edited in programs that do not update the source feature.
  * Added a warning when CDS features with lengths that are not a multiple of 3 are loaded
    with a suggestion to change these to pseudogenes. Changed a fatal error when annotating a
    mutation in an incomplete final codon in one of these CDS features to a warning and generic coding mutation.
  * breseq will no longer throw a fatal error if read files that have the same name
    but are located at different paths are input.
  * Fixed gdtools MERGE not preserving header information from first GD file.
  * Fixed treatment of UNDEFINED by gdtools FILTER/REMOVE to allow matches to != UNDEFINED.

Version 0.36.0
  * Fixed an issue causing JC evidence to not save info in annotated GD files.
  * Fixed a crash when annotating GD files containing INV mutations.
  * Fixed empty line at the end of the FASTQ input causing an error.
  * Major updates to gdtools commands.
  * gdtools COMPARE/ANNOTATE
    * New TABLE format that provides a text (CSV) version of the HTML-style compare table.
    * TEXT_* output columns added if -b option provided.
    * Changed HTML default to not repeat header lines for easier loading into Excel.
      If you liked these, you can add them back as before with --repeat-header 10.
  * gdtools FILTER/REMOVE
    * Added the ability to preserve evidence entries.
    * Added control over commenting/removing filtered lines, renumbering IDs, and verbose messages.
    * Fixed (and explained) logic for matching the UNDEFINED value.
    * New ASSIGNED field for filtering on whether evidence is used.
    * Added some examples of usage to the command line help.

Version 0.35.7
  * Speedup and fix of rare bug related to annotating mutations to properly account for cases
    in which multiple base substitutions occur in the same codon.

Version 0.35.6
  * Should be compatible with GenBank reference files produced by Prokka and
    NCBI PGAP, and with GFF3 files produced by PGAP.

Version 0.35.5
  * Bug fix for incorrect "Plots could not be created" message in HTML pages when -j is >1.
  * Bug fix for #254: "Provided translation table #0 does not have 64 codons". This was a rare
    case where two SNPs occur in the same codon of a CDS and also overlap a noncoding gene.

Version 0.35.4
  * Critical bug fix for crashes during output creation when -j is >1.

Version 0.35.3
  * Fixed problem where HTML coverage and alignment files were not produced. (Bug introduced in v0.35.2)
  * Fixed overlap and annotation position problems applying INT mutations with length zero.
  * Added --seq-id|-s option to gdtools APPLY so that one can remove extra reference sequences that
    were only provided for INT, CON, MOB, etc. mutations before output.

Version 0.35.2
  * Generation of output files is now multithreaded.
  * Changed --use-version-as-seq-id option to --genbank-field-for-seq-id to give more control over
    choosing which part of the header is used for the sequence ID. The default is to assign to the
    first existing field with the preference: LOCUS > ACCESSION > VERSION.
  * Fixed GenBank parsing to be more forgiving of different LOCUS lines output by other programs.
    Now only gives a warning if the sequence length is missing and looks for the linear|circular
    designation at different locations within the LOCUS line. Falls back to the source feature
    length if the sequence length is missing from the LOCUS line.
  * Changed priority of name used for features in GenBank files to name > gene > locus_tag > label > note.
  * Added mutation type INT for applying GenomeDiffs to insert a sequence with annotations.
  * Fixed multiple identical GD entry error related to --user-evidence leading to
    double mutation predictions.
  * Changed error from loading reference genome files with repeat regions annotated
    with complex/multiple locations to a more informative warning.
  * Fixed major memory leak during the output step related to assigning read counts to
    JC evidence from BAM files. Rewrote to further improve performance by making access
    to the required files for this step persistent.
  * Added summary.json and output.vcf to the output directory (they are also present in the data directory).
  * Fixed --total-only mode for breseq BAM2COV to give expected output in --table mode.
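
    The "first existing field" rule described above for --genbank-field-for-seq-id
    can be illustrated with a short Python sketch. The function name first_present
    and the dictionary representation of a GenBank header are assumptions made for
    this example only. They are not part of breseq.

```python
from typing import Mapping, Optional, Sequence


def first_present(fields: Mapping[str, Optional[str]],
                  preference: Sequence[str]) -> str:
    """Return the value of the first field in preference that exists
    and is non-empty."""
    for name in preference:
        value = fields.get(name)
        if value:
            return value
    raise ValueError("none of the fields %s are present" % list(preference))


# Preference order stated in the entry above: LOCUS > ACCESSION > VERSION.
SEQ_ID_PREFERENCE = ("LOCUS", "ACCESSION", "VERSION")
```

    The same first-match pattern would fit the feature name priority listed above
    (name > gene > locus_tag > label > note) by passing a different preference tuple.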

Version 0.35.1
  * Fixed bug introduced in v0.35.0 that resulted in displaying an error message that
    "Coverage plot could not be generated" instead of the plot on MC evidence pages.
  * Added missing input file 'empty.fastq' needed for ./run_tests.sh.
  * Added FASTA output option to gdtools COMPARE/ANNOTATE. This will generate a
    multi-FASTA file with a genotype alignment suitable for building phylogenetic
    trees using parsimony methods. The content is equivalent to that of the PHYLIP
    output option, but without the length limitation on the genome names.
  * Added --use-version-as-seq-id option to breseq. If present, it will use the full
    VERSION in an input GenBank file (e.g., NC_001416.1) as the sequence ID instead of
    the LOCUS (e.g., NC_001416). You will need to use the converted reference file
    (data/reference.gff) for further breseq and gdtools operations on breseq output
    generated using this option.
  * While not generally recommended, using read and reference file names and output paths
    that contain spaces in them should now work.
  * Fixed bug in gdtools ANNOTATE that caused it to fail with --preserve-evidence.
  * Miscellaneous improvements to other gdtools subcommands.

Version 0.35.0
   * Increased specificity of JC prediction. Reads are now only assigned to candidate
     junctions that are their best matches. They are never assigned to other junctions,
     even if they match them better than the reference. To get the earlier behavior
     (which might be preferred for consensus mode or for predicting polymorphisms from
     a population with little diversity), use the --junction-allow-suboptimal-matches option.
     This option can result in some reads being misassigned to the wrong junctions.
   * Fixed various errors when using --user-evidence files and --aligned-sam input.

Version 0.34.2
   * Improved compatibility with versions of R compiled with limited graphics support.
   * Increased resolution of coverage plots.

Version 0.34.1
   * Fixed bug that predicted zero-length deletions in circular genomes.
   * Fixed bug that caused runs to fail when an empty read file was provided as input.
   * Fixed bug in predicting mutations from junctions with side_1_position > side_2_position.
   * Added pipeline control options (--skip-*-prediction). Implementation is still experimental.

Version 0.34.0
   * Fixed bug where two rejected JC evidence items could predict a mutation.
   * Added new option --junction-minimum-pr-no-read-start-per-position to prevent saturation
     of coverage in high-coverage samples from leading to rejection of valid JC prediction.
     Set this option to 0.0 if you want to use the behavior of previous versions.
   * Changed default for --consensus-frequency-cutoff to be 0.8 in polymorphism mode,
     instead of 0.0 to prevent accepting rejected polymorphism RA items as consensus predictions.
   * Added the ability to predict DEL spanning the origin of a circular reference sequence.
   * Changed plotting scripts to be more compatible with fonts available to different versions of R.
   * Miscellaneous fixes to make gdtools commands more robust.

Version 0.33.2
   * Fixed extremely rare fatal error in constructing junction sequences with overlap
     that ended at the boundaries of reference sequences.
   * Multiple gdtools COMPARE updates. Fixed bug leading to empty PHYLIP output and
     restored sorting of columns in HTML output based on GenomeDiff metadata.

Version 0.33.1
   * CRITICAL bug fix for a crash during the final output step when there are
     large deletions overlapping a single gene.
   * Positions of indel mutations are no longer normalized in polymorphism mode
     because this process can fail and lead to duplicate GD entries. Note: you *can*
     still use gdtools NORMALIZE on a GD file output by a polymorphism mode run if you
     want to compare it to GD files from consensus mode. Just be aware that NORMALIZE
     may fail if you have certain complicated cases with multiple base substitutions
     and indels near one another such that shifting them can lead to conflicts.
   * Added option to preserve evidence entries to gdtools ANNOTATE.

Version 0.33.0
   * Changed behavior of gdtools SUBTRACT to ignore mutation frequencies by default
     and fixed floating point error when frequencies are taken into account.
   * Separated output of normal annotation of mutations in a GD file and
     additional HTML-specific fields. gdtools ANNOTATE/COMPARE has a new option to
     add the supplementary html_* and position_start/end fields if you need them.
   * Changed character separating annotation information about multiple overlapping
     genes to '|' from ';'.
   * Improved automatic detection of libunwind and added configure option (--without-libunwind).
     Changed default settings to not use libunwind for binary builds to improve compatibility.

Version 0.32.1
   * Added --minimum-mapping-quality option. Currently it is OFF by default.
   * Fixed incorrect ordering of read mapping criteria that could lead to a longer but worse alignment
     being accepted over a shorter match that did not meet --required-match-fraction.
   * Removed the gdtools MERGE subcommand (it had become equivalent to gdtools UNION).
   * Fixed adding correct gene features to copies created by AMP mutations in gdtools APPLY.

Version 0.32.0
   * If a mutation overlaps multiple genes, its effect on each gene is now annotated.
     (Previously only the first gene encountered was annotated.) The relevant Genome Diff
     fields use a semicolon to separate the information about each impacted gene.
   * New output of summary information and settings in JSON format in data directory file
     'summary.json'.
   * Corrected case in polymorphism mode where failing consensus mutations could be
     incorrectly predicted if a polymorphism was rejected. More thorough and consistent
     output of rejection information for RA evidence.
   * Updates to VCF output.
   * Fixed problem reading some FASTQ files (discovered on Nanopore data).
   * More granular options to filter predictions on the coverage of reads supporting them:
     e.g., --polymorphism-minimum-variant-coverage
           --polymorphism-minimum-total-coverage
           --polymorphism-minimum-variant-coverage-each-strand
           --polymorphism-minimum-total-coverage-each-strand
     WARNING: Previous versions of these options have been renamed for consistency.
   * New options to customize bowtie2 options used for alignment. (Only intended for expert users!)
       --bowtie2-scoring, --bowtie2-stage1, --bowtie2-stage2, --bowtie2-junction
   * Removed extra slash in paths to invoke R scripts.
   * Removed split alignment pieces that were completely soft-padded in output BAM file.
   * C++11 is now required for compilation. Optimization added to default compiler settings.
   * Fixed compatibility issues when compiling with certain versions of GCC.

Version 0.31.1
   * Critical fix for checking new bowtie2 versions with extra decimal place (e.g., 2.3.3.1).
   * Additional bug fixes for fitting coverage distribution.
   * Fix for `gdtools SUBTRACT` when using both polymorphism and consensus inputs.
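
    The bowtie2 version check fixed in this release concerns version strings with a
    fourth numeric field. One common way to compare such strings is sketched below
    in Python. This is a generic illustration, and the function name parse_version
    is hypothetical. It does not show how breseq itself performs the check.

```python
import re
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn the leading dotted-number part of a version string into a
    tuple of integers, e.g. '2.3.3.1' -> (2, 3, 3, 1)."""
    match = re.match(r"\d+(?:\.\d+)*", text.strip())
    if match is None:
        raise ValueError("no version number found in %r" % text)
    return tuple(int(part) for part in match.group(0).split("."))
```

    Tuples compare element by element, so a version with any number of fields can be
    ordered against another without assuming exactly three components.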

Version 0.31.0
   * Fixed critical error that could give negative positions for mutations when a reference sequence
     fragment was called as deleted.
   * Fixed several cases where fitting coverage distribution could fail or lead to a fatal error.

Version 0.30.2
   * Improved robustness of fitting coverage distribution and added fallbacks if fitting fails.
   * More information at the command line and on the HTML summary page about when insufficient
     coverage leads to an entire reference sequence being called as deleted and when predictions
     will be less reliable due to failure fitting the coverage distribution.
   * More detailed debugging information output for errors and crashes.

Version 0.30.1
   * Reports an error if Bowtie2 version 2.3.1 is used. Upgrade to a newer Bowtie2 version.
   * Command line options added to allow control over how reads are filtered.
   * Reads with length ≤18 are now ignored by default to avoid problematic memory usage associated
     with how they can ambiguously match many different junction sequences.
   * Fixed bugs related to duplicate INS/DEL mutations being predicted in polymorphism mode.

Version 0.30.0
   * Options used in calls to bowtie2 were changed to be compatible with v2.3.0. These changes
     may result in (small) changes in how reads are aligned versus in older breseq versions.
   * Fixed bug in polymorphism mode option --polymorphism-reject-surrounding-homopolymer-length
     and its default setting. This filter had been operating differently, and more stringently than
     intended, resulting in the rejection of certain polymorphisms when the same base as the variant
     base was present in the reference genome at positions immediately before and after the mutation.
   * HTML and GD output now include a POLYMORPHISM_REJECT field describing why a putative polymorphism
     was reassigned as a consensus mutation with 100% frequency (i.e., rejected as a polymorphism).
   * Fixed rare cases where certain GenomeDiff files output by breseq failed when they were used
     with gdtools APPLY.
   * Fixed various bugs related to reference files containing nonstandard bases, spliced genes, and
     certain seq_id names.

Version 0.29.0
   * CRITICAL fixes for polymorphism mode bugs that could lead to missing high frequency
     mutations when compiled on certain platforms and in very high coverage samples.
   * Updated included SAMtools version to 1.3.1; added multithreading to samtools sort steps.
   * Other fixes to enable output from other mapping programs using --aligned-sam mode.
   * Other minor bug fixes and improvements.

Version 0.28.1
   * Fixed R out-of-range integer error encountered when one of the reference
     sequences had very high read-depth coverage.
   * Extra validation steps for Genome Diff files with complex mutational series
     (using 'within' and 'before' fields) may require updates (corrections) for
     manually edited Genome Diff files when they are used with certain gdtools commands.

Version 0.28.0
   * Revamped how gene locations are handled internally to deal with complicated
     cases: trans-spliced genes, internal frameshifts, indeterminate start/end
     coordinates found in incomplete genomes, genes that cross the origin of
     circular chromosomes, etc.
   * Optimizations to improve slow fitting of very high coverage reference sequences,
     and for junction predictions involving these sequences (for example, in samples
     containing high copy number plasmids).
   * Various fixes for gdtools commands.

Version 0.27.2
   * Added breseq CL-TABULATE command for analyzing changes in contingency loci
     (hypermutable homopolymer stretches).
   * Bug fixes and updates to various gdtools and breseq utility commands.
   * Fixed rare bug leading to duplicate mutation predictions in repeat regions.

Version 0.27.1
   * Improved stability; fixed two sources of rare crashes.
   * The samtools executable is no longer included/installed/required.
   * Fixed backward compatibility with RA lines that was broken in 0.27.0.

Version 0.27.0
   * Now accepts gzipped (*.gz) FASTQ input files.
   * Logic and options controlling predictions of consensus versus polymorphic
     mutations improved and now explained via flowcharts in the documentation.
   * Three new tutorials from EMBO course added to the documentation.
   * Continued updates and improvements to various gdtools commands.

Version 0.26.1
   * Fixed a crash that sometimes occurred when processing polymorphism statistics.
   * Corrected handling of shifted positions for some indel mutations.
   * Updates to several gdtools commands. Usage changed in some cases.

Version 0.26.0
   * First version with binary distributions for Linux and MacOSX.
   * Updated HTML documentation (installation instructions, annotated bibliography, FAQ).

Version 0.25d
   * Fixed problems introduced when program_data_path moved to being relative
     rather than absolute; affected bam2cov for example.

Version 0.25c
    * Maintenance release with bug fix for a rare problem affecting
      de novo assembled reference sequences.

Version 0.25b
    * Maintenance release with bug fixes.
    * Fixed crash of several gdtools commands.
    * Fixed crash under rare circumstances in polymorphism mode.

Version 0.25a
    * Maintenance release with bug fixes.
    * Fixed floating point error encountered on some compilers/systems.
    * Fixed compilation of samtools/breseq with MacOSX clang compiler.
    * Miscellaneous bug fixes related to gdtools commands.

Version 0.25
    * Release for publication describing consensus SV prediction:
      Barrick, J.E., Colburn, G., Deatherage, D.E., Traverse, C.C.,
      Strand, M.D., Borges, J.J., Knoester, D.B., Reba, A., Meyer, A.G. (2014)
      Identifying structural variation in haploid microbial genomes from short-read
      resequencing data using breseq. BMC Genomics. 15:1039.

Version 0.24
    * Fix for crash when some junction alignment files had long names.
    * Added Variant Call Format (VCF) output and associated gdtools command GD2VCF.
    * Improved robustness of fitting coverage distribution for low coverage reference sequences.
    * Fixed compilation on Mac OS 10.9 / Clang.
    * Fixed errors in parsing some GenBank files.
    * Fixed error causing crash when a polymorphism was very close to the beginning of a sequence.
    * Fixed errors with assigning added or deleted sequence at ends of complex mobile element insertions.
    * Fixed error predicting zero length deletions causing crash with certain reference sequences.
    * Improvements and updates to gdtools commands.
    * Changed the HTML display of read alignments to highlight differences from the reference.
    * Improved precision of polymorphism predictions for low-frequency mutations.
    * Other miscellaneous bug fixes.

Version 0.23
    * Compiling in Cygwin should now be possible.
    * Fix for crash on single files containing multiple references when one that was not
      the last had a lowercase nucleotide in it.
    * Fix for incorrect amino acid annotations for some codon changes related to the use of
      alternative start codons and when there are multiple base substitutions in one codon.
    * Added DEVELOPER instruction file for compiling the repository version of code.

Version 0.22
    * Fixed crash due to change in output format in newer bowtie2 versions (≥2.0.4).
      breseq should now produce consistent results for all versions (≥2.0.0-beta7).
    * Added -s,--junction-only-reference option to include reference sequences
      that can be part of a new junction, but are not otherwise processed for calling
      mutations. This option is useful if you want to find mobile elements that are not
      native to the main genome, e.g. the locations of transposons in a mutant strain, and
      have another reference sequence that contains those mobile elements.
    * Improved matching MC evidence to certain MOBs.
    * Removed duplicate -j command line option definition.
    * Updated bundled SAMtools to version 0.1.18.
    * Command line now prints versions and locations of external programs used by breseq.

Version 0.21
    * Fixed improper failure of the check for bowtie2 versions 2.0.0+.
    * Adjusted bowtie2 seed parameters to recover some of the sensitivity lost in
      short read data sets due to previous choices.
    * Added limits to numbers of alignments to record per read (2000) and the total
      number of pairs to consider when creating junction candidates to prevent excessive
      memory usage in genomes (like yeast) with many repeats (telomeric regions).
    * Fixed DESTDIR usage in automake files.

Version 0.20
    * Fixed merging results of first and second alignment stages when read numbers
      overflowed a counter used to rename reads.
    * Fixed negative junction skew predictions when the number of bases in a dataset
      overflowed uint32_t size.
    * Fixed the display of some statistics on the Summary output page.
    * Fixes to detecting required bowtie2 and R versions.
    * Fixed improper predictions of deletion size from junctions in rare cases.
    * Fixed possible overflow in base numbers that could cause incorrect calculations of
      average coverage that led to spurious junction predictions.

Version 0.19
    * bowtie2 instead of ssaha2 now used for mapping reads. All reads are mapped as if
      they are single-ended. Use the -j option to control how many processors bowtie2
      alignment uses.
    * Various improvements to junction prediction.
    * Added option to run pipeline on SAM file of aligned reads (--aligned-sam)
      instead of unaligned FASTQ to enable users to map reads with their favorite
      program. In this mode only missing coverage and read alignment evidence are
      compiled and used to predict mutations. (There is no new junction evidence).
    * Read alignment evidence now uses mapping scores in its probability model, greatly
      reducing some kinds of false positives.
    * Mixed positions with predicted frequencies ≥10% are output by default on the
      marginal evidence page, even when running in normal (consensus) mode. They are
      found using a conservative version of --polymorphism-prediction in order to display
      variants in genes that may be duplicated or subject to other errors caused by
      unknown divergence from the reference sequence.

2012-06-12 Version 0.18
   * Fixed misnumbering and mistranslation of amino acid changes in genes on the
     reverse genomic strand: a huge bug introduced in breseq-0.17e. (Thanks to Tim
     Cooper for reporting this problem.)
   * Improved use of different genetic codes and translation of split protein reading
     frames. All NCBI translation table codes should be recognized correctly now in
     GenBank input files. Alternative initiation codons should be correctly interpreted.
     Internal frameshifts in genes should be correctly translated.
   * Fixed crash during the output step when no junctions were predicted. (Thanks to
     Charles Traverse for reporting.)

2012-05-19 Version 0.17e
    * Updated documentation to change genomediff to gdtools.
    * Standardized some arguments to gdtools (still experimental).
    * Reference sequences now show strands in HTML alignments.

2012-05-03 Version 0.17d
    * Added back access to polymorphism-* options at command line. (Use breseq -h to
      see a list of all options, instead of just commonly used options.)

2012-04-22 Version 0.17c
    * Support for nonbacterial genetic codes.
    * Fixed possible fatal bugs with MC evidence when using multiple reference sequences.

2012-04-13 Version 0.17b
    * Updates to gdtools utility and documentation.
    * Fixed some spurious predictions of very large insertions from only JC evidence.
    * Various other changes.

2012-03-16 Version 0.17
    * Changed how junctions are output to HTML.
    * Added various sub-commands.
    * Fixed bugs in HTML output.
    * Fixed bugs in mutation predictions.
    * Numerous other changes.
    * Portions of the HTML documentation are out of date, especially those describing the
      junction prediction scoring scheme and the genomediff utilities.

2011-10-19 Version 0.16
    * Fixed fatal errors when predicting junction candidates.
    * Changed how new candidate junctions are scored and accepted in preparation for
      calculating E-values.
    * Other bug fixes.


2011-10-04 Version 0.15
    * Better prediction of amplification mutations from junction evidence.
    * Fix for predictions involving junctions to nested repeat_regions.
    * All reads are now required to match over 90% of their length by default, greatly reducing
      spurious junction predictions in data sets with adaptor contamination and when
      using a diverged reference genome.

2011-09-10 Version 0.14
    * This update primarily addresses several bugs reported by users, including:
        * Errors using some NCBI/SRA datasets.
        * Errors when using simulated read data.
        * Errors when some reference contigs have very little or no coverage.
    * In addition, Perl is no longer required for installation. breseq has been
      completely ported to C++ with this version. If you encounter any issues we
      might have introduced during this transition, please let us know.

2011-08-15 Version 0.13
    * Fixed error using input FASTQ with names containing spaces, such as those from the SRA.

2011-07-31 Version 0.12
    * Miscellaneous updates.

2011-07-14 Version 0.11
    * Removed BOOST dependency.