Skip to content

Generate and process BAM files from Illumina sequencing instrument files

Notifications You must be signed in to change notification settings

wtsi-npg/illumina2bam

Repository files navigation

Here is a collection of tools to generate or process BAM/SAM files using Picard Java API.

It currently includes Illumina2bam, BamIndexDecoder, BamReadTrimmer, BamMerger, BamTagStripper, BamQualityQuantisation, ChangeBamHeader, SplitBamByReadGroup and AlignmentFilter etc.

It is a NetBeans Java project. You should be able to open it from NetBeans directly.

To test: ant test.

To generate jar files: ant jar.

You can find more information from http://gq1.github.com/illumina2bam/.

---------------------------------------------------

Illumina2bam

This tool converts Illumina BCL files directly to bam files using Picard JAVA API.

It will create a bam file for an entire lane.

It puts an indexing read quality and sequence in tags (we're suggesting the use of) "QT" and "BC" e.g. QT:Z:HHHHHHHH   BC:Z:ATCACGTT, on the first read record.
User can overwrite them from command line.

You can pipe the output bam to BamIndexDecoder to deplex.

It puts what preceding Illumina code has run e.g. RTA, in PG headers.

By default it only puts in PF filtered reads. There's an option to put all reads in the BAM.

Default bcl2qseq will also exclude sequence identified as TruSeq controls from passing the qseq filter field. Currently we don't do that yet. These reads should be included in the output.

We expect read qualities to differ from those produced via default bcl2qseq as we'll give Ns a quality of 0 (rather than 2 which we're not convinced about), and we'll not do the EAMSS(?) flattening of the 3' tail of read to  qvals of 2 a.k.a. "The Killer Bs".

There's also an option to record the second basecall from SCL files.

RunInfo or runParamaters under runfolder or Config XML files under Intensities and BaseCalls directory are used to get tile list, cycle ranges per read and other meta data.

CLOCS and filter files are also needed. If Clocs file missing, it will try to use locs or pos file instead.

All reads are supposed in one read group, by default, which id is 1. Sample and library name can be passed from command line, otherwise use unknown.

EXAMPLE TO RUN:

java -jar "Illumina2bam.jar" INTENSITY_DIR=testdata/110323_HS13_06000_B_B039WABXX/Data/Intensities LANE=1 OUTPUT=testdata/6000_1.bam  VALIDATION_STRINGENCY=STRICT CREATE_INDEX=false CREATE_MD5_FILE=true FIRST_TILE=1101 TILE_LIMIT=1  COMPRESSION_LEVEL=1

HELP:

java -jar "Illumina2bam.jar" -h

USAGE: Illumina2bam [options]

Convert Illumina BCL to BAM or SAM file. Version: 0.03

Options:

--help
-h                            Displays options specific to this tool.

--stdhelp
-H                            Displays options specific to this tool AND options common to all Picard command line 
                              tools.

--version                     Displays program version.

RUN_FOLDER=File
R=File                        Illumina runfolder directory including runParameters xml file under it, upwards two 
                              levels from Intensities directory if not given.  Default value: null. 

INTENSITY_DIR=File
I=File                        Illumina intensities directory including config xml file, and clocs, locs or pos files 
                              under lane directory.  Required. 

BASECALLS_DIR=File
B=File                        Illumina basecalls directory including config xml file, and filter files, bcl, maybe scl 
                              files under lane cycle directory, using BaseCalls directory under intensities if not 
                              given.   Default value: null. 

LANE=Integer
L=Integer                     Lane number.  Required. 

OUTPUT=File
O=File                        Output file name.  Required. 

GENERATE_SECONDARY_BASE_CALLS=Boolean
E2=Boolean                    Including second base call or not, default false.  Default value: false. This option can 
                              be set to 'null' to clear the default value. Possible values: {true, false} 

PF_FILTER=Boolean
PF=Boolean                    Filter cluster or not, default true.  Default value: true. This option can be set to 
                              'null' to clear the default value. Possible values: {true, false} 

READ_GROUP_ID=String
RG=String                     ID used to link RG header record with RG tag in SAM record, default 1.  Default value: 1. 
                              This option can be set to 'null' to clear the default value. 

SAMPLE_ALIAS=String
SM=String                     The name of the sequenced sample, using library name if not given.  Default value: null. 

LIBRARY_NAME=String
LB=String                     The name of the sequenced library, default unknown.  Default value: unknown. This option 
                              can be set to 'null' to clear the default value. 

STUDY_NAME=String
ST=String                     The name of the study.  Default value: null. 

PLATFORM_UNIT=String
PU=String                     The platform unit, using runfolder name plus lane number if not given.  Default value: 
                              null. 

RUN_START_DATE=Iso8601Date    The start date of the run, read from config file if not given.  Default value: null. 

SEQUENCING_CENTER=String
SC=String                     Sequence center name, default SC for Sanger Center.  Default value: SC. This option can 
                              be set to 'null' to clear the default value. 

PLATFORM=String               The name of the sequencing technology that produced the read, default ILLUMINA.  Default 
                              value: ILLUMINA. This option can be set to 'null' to clear the default value. 

FIRST_TILE=Integer            If set, this is the first tile to be processed (for debugging).  Note that tiles are not 
                              processed in numerical order.  Default value: null. 

TILE_LIMIT=Integer            If set, process no more than this many tiles (for debugging).  Default value: null. 

BARCODE_SEQUENCE_TAG_NAME=String
BC_SEQ=String                 Tag name for barcode sequence.  Default value: BC. This option can be set to 'null' to 
                              clear the default value. 

BARCODE_QUALITY_TAG_NAME=String
BC_QUAL=String                Tag name for barcode quality.  Default value: QT. This option can be set to 'null' to 
                              clear the default value. 

SECOND_BARCODE_SEQUENCE_TAG_NAME=String
SEC_BC_SEQ=String             Tag name for second  barcode sequence.  Default value: null. 

SECOND_BARCODE_QUALITY_TAG_NAME=String
SEC_BC_QUAL=String            Tag name for second barcode quality.  Default value: null. 

-----------------------------------------------

BamReadTrimmer

Strip part of a read (fixed position) - typically a prefix of the forward read, and optionally place this and its quality in BAM tags.

EXAMPLE TO RUN:

java -jar "BamReadTrimmer.jar" INPUT=testdata/bam/6210_8.sam OUTPUT=testdata/6210_8_trimmed.bam FIRST_POSITION_TO_TRIM=1 TRIM_LENGTH=3 CREATE_MD5_FILE=true ONLY_FORWARD_READ=true SAVE_TRIM=true TRIM_BASE_TAG=rs TRIM_QUALITY_TAG=qs VALIDATION_STRINGENCY=SILENT

-----------------------------------------------

BamMerger

Merge BAM/SAM alignment info in a bam with the data in an unmapped BAM file, producing a third BAM file that has alignment data and all the additional data from the unmapped BAM. The SQ records and alignment PG records in the aligned bam file will be added to the header of unmampped bam file to form the new header of output file.

EXAMPLE TO RUN:

java -jar "BamMerger.jar" ALIGNED=testdata/bam/6210_8_aligned.sam I=testdata/bam/6210_8.sam OUTPUT=testdata/6210_8_merged.bam VALIDATION_STRINGENCY=SILENT

-----------------------------------------------

STANDARD PICARD OPTIONS:

TMP_DIR=File                  Default value: /tmp/username. This option can be set to 'null' to clear the default value. 

VERBOSITY=LogLevel            Control verbosity of logging.  Default value: INFO. This option can be set to 'null' to 
                              clear the default value. Possible values: {ERROR, WARNING, INFO, DEBUG} 

QUIET=Boolean                 Whether to suppress job-summary info on System.err.  Default value: false. This option 
                              can be set to 'null' to clear the default value. Possible values: {true, false} 

VALIDATION_STRINGENCY=ValidationStringency
                              Validation stringency for all SAM files read by this program.  Setting stringency to 
                              SILENT can improve performance when processing a BAM file in which variable-length data 
                              (read, qualities, tags) do not otherwise need to be decoded.  Default value: STRICT. This 
                              option can be set to 'null' to clear the default value. Possible values: {STRICT, 
                              LENIENT, SILENT} 

COMPRESSION_LEVEL=Integer     Compression level for all compressed files created (e.g. BAM and GELI).  Default value: 
                              5. This option can be set to 'null' to clear the default value. 

MAX_RECORDS_IN_RAM=Integer    When writing SAM files that need to be sorted, this will specify the number of records 
                              stored in RAM before spilling to disk. Increasing this number reduces the number of file 
                              handles needed to sort a SAM file, and increases the amount of RAM needed.  Default 
                              value: 500000. This option can be set to 'null' to clear the default value. 

CREATE_INDEX=Boolean          Whether to create a BAM index when writing a coordinate-sorted BAM file.  Default value: 
                              false. This option can be set to 'null' to clear the default value. Possible values: 
                              {true, false} 

CREATE_MD5_FILE=Boolean       Whether to create an MD5 digest for any BAM files created.    Default value: false. This 
                              option can be set to 'null' to clear the default value. Possible values: {true, false} 

OPTIONS_FILE=File             File of OPTION_NAME=value pairs.  No positional parameters allowed.  Unlike command-line 
                              options, unrecognized options are ignored.  A single-valued option set in an options file 
                              may be overridden by a subsequent command-line option.  A line starting with '#' is 
                              considered a comment.  This option may be specified 0 or more times.