The release notes for each release are provided in the release-notes.md
file that accompanies the data files.
A table with brief descriptions for each data file is provided in the data-files-description.md
file included in the download.
PBTA data files are all files derived from samples (e.g., tumors, cell lines) that are processed upstream of this repository and are not the product of any analysis code in the AlexsLemonade/OpenPBTA-analysis
repository.
MendQC output files *readDist.txt
and *bam_umend_qc.tsv
, along with a manifest mapping filename to biospecimen, are provided.
pbta-mend-qc-results.tar.gz
pbta-mend-qc-manifest.tsv
Somatic Single Nucleotide Variant (SNV) data are provided in Annotated MAF format files for each of the applied software packages and denoted with the pbta-snv
prefix. Briefly, VCFs are VEP annotated and converted to MAF format via vcf2maf to produce the following files.
pbta-snv-lancet.vep.maf.gz
pbta-snv-mutect2.vep.maf.gz
pbta-snv-strelka2.vep.maf.gz
pbta-snv-vardict.vep.maf.gz
A subset of TCGA brain tumor MAF files are also provided and denoted with the pbta-tcga-snv
prefix:
pbta-tcga-snv-lancet.vep.maf.gz
pbta-tcga-snv-mutect2.vep.maf.gz
pbta-tcga-snv-strelka2.vep.maf.gz
The manifest file pbta-tcga-manifest.tsv
contains primary diagnosis information as well as a columns denoting which BED files correspond to that sample.
Capture_Kit
refers to the BED file(s) obtained from GDC (hg19).
BED_In_Use
is the file used for mutation calling and what should be used for downstream analyses (lifted over to hg38).
For some samples, the capture kit was not uniquely identified by the GDC (denoted by |
in Capture_Kit
) and the intersection of multiple BEDs was used.
These new BED files are included in the data download.
Somatic Copy Number Variant (CNV) data are provided in a modified SEG format for each of the applied software packages and denoted with the pbta-cnv
prefix.
Somatic copy number data is only generated for whole genome sequencing (WGS) samples.
pbta-cnv-cnvkit.seg.gz
is the the CNVkit SEG file. This file contains an additional columncopy.num
to denote copy number of each segment, derived from the CNS file output of the algorithm described here.pbta-cnv-controlfreec.tsv.gz
is the ControlFreeC TSV file. It is a merge of*_CNVs
files produced from the algorithm, and columns are described here.
The copy number annotated in the CNVkit SEG file is annotated with respect to ploidy 2, however, the status annotated in the ControlFreeC TSV file is annotated with respect to inferred ploidy from the algorithm, which is recorded in the pbta-histologies.tsv
file. See the table below for examples of possible interpretations.
Ploidy | Copy Number | Gain/Loss Interpretation |
---|---|---|
2 | 0 | Deep deletion; homozygous deletion |
2 | 1 | Loss; hemizygous deletion |
2 | 2 | Copy neutral |
2 | 3 | Gain; one copy gain |
2 | 4 | Gain; two copy gain |
2 | 5+ | Amplification; possible amplification |
3 | 0 | Deep deletion; 3 copy loss |
3 | 1 | Loss; 2 copy loss |
3 | 2 | Loss; 1 copy loss |
3 | 3 | Copy neutral |
3 | 4 | Gain; one copy gain |
3 | 5 | Gain; two copy gain |
3 | 6 | Gain; three copy gain |
3 | 7+ | Amplification; possible amplification |
4 | 0 | Deep deletion; 4 copy loss |
4 | 1 | Loss; 3 copy loss |
4 | 2 | Loss; 2 copy loss |
4 | 3 | Loss; 1 copy loss |
4 | 4 | Copy neutral |
4 | 5 | Gain; one copy gain |
4 | 6 | Gain; two copy gain |
4 | 7 | Gain; two copy gain |
4 | 8+ | Amplification; possible amplification |
Somatic Structural Variant Data (Somatic SV) are provided in the Annotated Manta TSV format produced by the applied software packages and denoted with the pbta-sv
prefix.
Somatic structural variant data is only generated for whole genome sequencing (WGS) samples.
pbta-sv-manta.tsv.gz
Gene expression estimates from the applied software packages are provided as a feature (e.g., gene or transcript) by sample matrix.
For each method/measure (e.g., RSEM TPM, RSEM isoform counts), there are two matrices provided: one for each library selection method (poly-A
, stranded
).
See this notebook for more information about why this was necessary.
Gene expression are available in multiple forms in the following files:
pbta-gene-counts-rsem-expected_count.polya.rds
pbta-gene-counts-rsem-expected_count.stranded.rds
pbta-gene-expression-kallisto.polya.rds
pbta-gene-expression-kallisto.stranded.rds
pbta-gene-expression-rsem-fpkm-collapsed.polya.rds
pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds
pbta-gene-expression-rsem-fpkm-collapsed_table.polya.rds
pbta-gene-expression-rsem-fpkm-collapsed_table.stranded.rds
pbta-gene-expression-rsem-fpkm.polya.rds
pbta-gene-expression-rsem-fpkm.stranded.rds
pbta-gene-expression-rsem-tpm.polya.rds
pbta-gene-expression-rsem-tpm.stranded.rds
pbta-isoform-counts-rsem-expected_count.polya.rds
pbta-isoform-counts-rsem-expected_count.stranded.rds
pbta-isoform-expression-rsem-tpm.polya.rds
pbta-isoform-expression-rsem-tpm.stranded.rds
See the data description file for more information about the individual gene expression files.
If your analysis requires de-duplicated gene symbols as row names, please use the collapsed matrices provided as part of the data download (see below).
STAR *Log.final.out
files, along with a manifest mapping filename to biospecimen, are provided (see section 4.1 here).
pbta-star-log-final.tar.gz
pbta-star-log-manifest.tsv
Gene Fusions produced by the applied software packages are provided as Arriba TSV and STARFusion TSV respectively.
These files are denoted with the prefix pbta-fusion
.
pbta-fusion-arriba.tsv.gz
pbta-fusion-starfusion.tsv.gz
Harmonized clinical data are released as tab separated values in the following file:
pbta-histologies.tsv
Many analyses rely on examining DNA-seq (e.g., WGS/WXS/Panel) and RNA-seq data together.
The identifiers in the Kids_First_Biospecimen_ID
column of the pbta-histologies.tsv
file will be non-overlapping for different experimental strategies, as these identifiers map to a library or assay.
Analysts should use the identifiers in the sample_id
to connect DNA-seq and RNA-seq assays from the same sample.
For an example, see the sample_id
mapping step of the OncoPrint pipeline.
Note that some individual participants (tracked via the Kids_First_Participant_ID
column) will have multiple samples included in the PBTA dataset.
Please see the independent specimens files section for more information.
Analysis files are created by a script in analyses/*
.
They can be viewed as derivatives of PBTA data files.
Consensus mutation files are products of the analyses/snv-callers
analysis module.
pbta-snv-consensus-mutation.maf.tsv.gz
contains consensus calls for SNVs and small indels. The calls in the file are created as the intersection of calls from Strelka2, Mutect2, Lancet, where position, change, and sample were the same for all callers. Multinucleotide variant calls from Mutect2 and Lancet were separated into consecutive SNVs before merging. All columns in the included file are derived from the Strelka2 calls. Note that this file is not strictly a MAF file, as it adds a Variant Allele Frequency (VAF
) column and does not contain a version comment as the first line.pbta-snv-consensus-mutation-tmb-all.tsv
includes tumor mutation burden statistics that are calculated based calculated from Strelka2 and Mutect2 SNV consensus, and the intersection of Strelka2 and Mutect2 BED windows sizes.pbta-snv-consensus-mutation-tmb-coding.tsv
contains coding only tumor mutation burden statistics calculated from the number of coding sequence Strelka2, Mutect2, and Lancet consensus SNVs and size of the intersection of all three callers' BED windows and the Gencode v27 coding sequence regions.
Mutation calls that overlap hotspots from MSKCC cancer hotspot database or overlapping TERT promoter region are retained even if called by 1 caller ,excluding Vardict-only calls because Vardict uniquely calls a large number (~39 million) of very low VAF mutations as discussed here suggesting these could be false calls.
pbta-snv-scavenged-hotspots.maf.tsv.gz
Collapsed expression matrices are products of the analyses/collapse-rnaseq
analysis module.
In cases where more than one Ensembl gene identifier maps to the same gene symbol, the instance of the gene symbol with the maximum mean FPKM in the RSEM FPKM file is retained to produce the following files:
pbta-gene-expression-rsem-fpkm-collapsed.polya.rds
pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds
Independent specimen lists are the products of the analyses/independent-samples
analysis module.
For participants with multiple tumor specimens, independent specimen lists are provided as TSV files with columns for participant ID (Kids_First_Participant_ID
) and specimen ID (Kids_First_Biospecimen_ID
).
The methods are described here).
These files are used for analyses such as mutation co-occurence, where repeated samples might cause bias.
independent-specimens.wgs.primary.tsv
with WGS samples and only primary tumors.independent-specimens.wgs.primary-plus.tsv
as above, but including non-primary tumors where a primary tumor sample is not available.independent-specimens.wgswxs.primary.tsv
Only primary tumors, but with WXS where WGS is not available.independent-specimens.wgswxs.primary-plus.tsv
as above, but including non-primary tumors where a primary tumor sample is not available.
Note that these independent specimen files do not address the issue of participants with multiple tumor specimens in RNA-seq data at this time.
The filtered and prioritized fusion and downstream files are a product of the analyses/fusion_filtering
analysis module.
pbta-fusion-putative-oncogenic.tsv
contains the filtered and prioritized fusions. The methods are described here.pbta-fusion-recurrently-fused-genes-byhistology.tsv
is a table that includes counts of recurrently fused genes by broad histology.pbta-fusion-recurrently-fused-genes-bysample.tsv
contains a binary matrix that denotes the presence or absence of a recurrently fused gene in an individual RNA-seq specimen.fusion_summary_embryonal_foi.tsv
contains a binary matrix that denotes the presence or absence of a recurrent embryonal tumor fusions of interest per individual RNA-seq specimen.fusion_summary_ependymoma_foi.tsv
contains a binary matrix that denotes the presence or absence of a recurrent ependymal tumor fusions of interest per individual RNA-seq specimen.
Copy number consensus calls from the copy number and structural variant callers are a product of the analyses/copy_number_consensus_call
analysis module.
pbta-cnv-consensus.seg.gz
contains consensus segments and segment means (log R ratios) from two or more callers, as described in the analysis README.
Focal copy number files map the consensus calls (genomic segments) above to genes for downstream analysis and are a product of the analysis/focal-cn-file-preparation
.
Note: these files contain biospecimens and genes with copy number changes; neutral regions are excluded.
consensus_seg_annotated_cn_autosomes.tsv.gz
contains focal gene copy number alterations for all autosomes.consensus_seg_annotated_cn_x_and_y.tsv.gz
contains focal gene copy number alterations for the sex chromosomes.
pbta-cnv-cnvkit-gistic.zip
is the output of running GISTIC 2.0 on the CNVkit results (pbta-cnv-cnvkit.seg
).
pbta-cnv-consensus-gistic.zip
is the output of running GISTIC 2.0 on the CNV consensus calls (pbta-cnv-consensus.seg.gz
), described below.
The scripts used to run GISTIC are linked here: CNVkit and Consensus calls.
Note that GISTIC is run on the entire cohort and therefore the output reflects regions that are significantly amplified or deleted across the entire cohort.
The GISTIC output data files below, which are commonly leveraged for downstream analyses, are described in more detail on the Broad Institute's GenePattern website.
all_lesions.conf_90.txt
(90% confidence level): this file contains significant regions of amplification and deletion and samples with amplifications/deletions in each of these regionsamp_genes.conf_90.txt
(90% confidence level): table of amplification peaks and genes within themdel_genes.conf_90.txt
(90% confidence level): table of deletion peaks and genes within themall_thresholded.by_genes.txt
: table of high- and low-level amplifications and deletions using sample-specific thresholds for high-level (output insample_cutoffs.txt
file) and default low-level thresholds (+/-0.1)
all_data_by_genes.txt
: This file contains a table of gene symbol, gene ID, cytoband, and Log R Ratios (LRR) for each sample (not thresholded).broad_data_by_genes.txt
: This file contains a table of gene symbol, gene ID, cytoband, and LRR for each sample.focal_data_by_genes.txt
: This file contains a matrix of gene LRR by sample.sample_seg_counts.txt
: By default, samples with >2500 segments are excluded from GISTIC analyses; samples are annotated as included or excluded in this file.broad_values_by_arm.txt
: This file contains a matrix of chromosomal arm LRR by sample.
broad_values_by_arm.txt
for molecular subtyping in which chromosomal arms are commonly gained/amplified or deletedall_thresholded.by_genes.txt
for gene-level copy-number analyses
The clinical manifest will be updated and versioned as molecular subgroups are identified based on genomic analyses.
Analyses related to molecular subtyping are as follows: