JSON files output by Expansion Hunter

JSON files generated by Expansion Hunter contain information about sample parameters (SampleParameters field) and analysis results summarized by locus (LocusResults field). The locus results contain the following fields.

  • AlleleCount The expected number of alleles at the locus
  • Coverage Estimated read coverage at the locus
  • FragmentLength The fragment size estimated from read pairs fully contained in either the left or right flank of the repeat region
  • LocusId Locus identifier
  • ReadLength Mean read length at the locus
  • Variants Genotypes and other information describing each variant analyzed at the locus

The structure of records in the Variants field is different for repeats and small variants (insertions, deletions, and swaps).
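As a rough illustration of the layout described above, the sketch below walks every locus and yields its variant records. It assumes that LocusResults and Variants are JSON objects keyed by identifier; the embedded document is a hypothetical miniature with invented values, not real Expansion Hunter output, and `iter_variants` is a helper name made up for this example.

```python
import json

# Hypothetical miniature output; values are invented for illustration only.
EXAMPLE = json.loads("""
{
  "SampleParameters": {},
  "LocusResults": {
    "HTT": {
      "LocusId": "HTT",
      "AlleleCount": 2,
      "Coverage": 30.5,
      "FragmentLength": 400,
      "ReadLength": 150,
      "Variants": {
        "HTT": {"VariantId": "HTT", "VariantType": "Repeat"}
      }
    }
  }
}
""")


def iter_variants(results):
    """Yield (locus_id, variant_record) pairs from a parsed output file.

    Assumes LocusResults and Variants are objects keyed by identifier.
    """
    for locus_key, locus in results.get("LocusResults", {}).items():
        locus_id = locus.get("LocusId", locus_key)
        for variant in locus.get("Variants", {}).values():
            yield locus_id, variant


summary = [(locus_id, v["VariantId"], v["VariantType"])
           for locus_id, v in iter_variants(EXAMPLE)]
print(summary)
```

Branching on VariantType inside the loop is one way to route each record to repeat-specific or small-variant-specific handling.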

Repeat records

Repeat records contain the following fields.

  • VariantId Unique variant identifier
  • RepeatUnit Repeat unit in the reference orientation
  • VariantType Always set to "Repeat"
  • VariantSubtype Either "Repeat" or "RareRepeat"
  • Genotype Repeat genotype given by the size of each repeat allele
  • GenotypeConfidenceInterval Size confidence interval for each repeat allele
  • CountsOfSpanningReads Summary of identified spanning reads given as an array with entries (n, m) where n is the number of repeat units spanned by the read and m is the number of such reads
  • CountsOfFlankingReads An analog of CountsOfSpanningReads for flanking reads
  • CountsOfInrepeatReads An analog of CountsOfSpanningReads for in-repeat reads
  • CountsOfHighQualityUnambiguousReads (optional) Summary of high-quality unambiguous reads by allele size, in the same format as CountsOfSpanningReads. Only present when allele quality metrics are enabled. Only counts reads that overlap the repeat region. High-quality means match rate ≥ 0.9; unambiguous means the read is consistent with only one haplotype. See AlleleQualityMetrics for details.
  • ReferenceRegion 0-based half-open reference coordinates of the repeat region (chrom:start-end)
  • ConsensusSequences (optional) Array of consensus sequences, one per allele. See Consensus sequences below for details.
  • ConsensusSequencesReadSupport (optional) Array of read support strings corresponding to ConsensusSequences. See Consensus sequences below for details.
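Two of the fields above benefit from a small parsing step: ReferenceRegion (0-based half-open, so the length is simply end minus start) and the (n, m) read-count summaries. The sketch below is a minimal, non-authoritative helper set. It assumes the counts may be serialized as a string such as "(17, 4), (19, 6)" and also accepts a list of pairs; the helper names and the sample values are hypothetical.

```python
import re


def parse_region(region):
    """Split 'chrom:start-end' into (chrom, start, end) with integer coordinates."""
    chrom, _, span = region.rpartition(":")
    start, _, end = span.partition("-")
    return chrom, int(start), int(end)


def region_length(region):
    """Length in bases; with 0-based half-open coordinates this is end - start."""
    _, start, end = parse_region(region)
    return end - start


def parse_counts(counts):
    """Turn (n, m) entries into {n: m}, where n is a repeat-unit count
    and m is the number of reads with that count.

    Assumption: counts is either a string like "(17, 4), (19, 6)"
    or an iterable of (n, m) pairs.
    """
    if isinstance(counts, str):
        pairs = re.findall(r"\((\d+),\s*(\d+)\)", counts)
        return {int(n): int(m) for n, m in pairs}
    return {int(n): int(m) for n, m in counts}


# Illustrative values only.
length = region_length("chr4:3074876-3074933")
spanning = parse_counts("(17, 4), (19, 6)")
total_spanning_reads = sum(spanning.values())
print(length, spanning, total_spanning_reads)
```

Dividing the region length by the length of RepeatUnit gives the repeat size of the reference allele, which can be handy when comparing against Genotype.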

Small variant records

Records for small variants contain the following fields.

  • VariantId Unique variant identifier
  • VariantType Always set to "SmallVariant"
  • VariantSubtype Either "Insertion", "Deletion", or "Swap"
  • RepeatUnit Repeat unit in the reference orientation
  • Genotype Formed, as usual, from 0s and 1s corresponding to ref and alt alleles
  • CountOfAltReads Number of reads supporting the alt allele
  • CountOfRefReads Number of reads supporting the ref allele
  • ReferenceRegion Reference region of the variant
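One simple sanity check on a small variant record is the fraction of reads supporting the alt allele. The sketch below computes it from CountOfAltReads and CountOfRefReads; the record shown is hypothetical and `alt_read_fraction` is a name introduced only for this example.

```python
def alt_read_fraction(record):
    """Return alt / (alt + ref), or None when no reads support either allele."""
    alt = record["CountOfAltReads"]
    ref = record["CountOfRefReads"]
    total = alt + ref
    return alt / total if total else None


# Hypothetical small variant record with invented counts.
record = {
    "VariantId": "EXAMPLE_SNV",
    "VariantType": "SmallVariant",
    "VariantSubtype": "Swap",
    "CountOfAltReads": 12,
    "CountOfRefReads": 12,
}
fraction = alt_read_fraction(record)
print(fraction)
```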

Allele quality metrics

When quality metrics are enabled (the default), repeat records will also include an AlleleQualityMetrics field containing per-allele quality information. This provides detailed metrics useful for filtering and quality assessment. See AlleleQualityMetrics for full documentation.

Example:

"AlleleQualityMetrics": {
    "VariantId": "HTT",
    "Alleles": [
        {
            "AlleleNumber": 1,
            "AlleleSize": 19,
            "Depth": 15.0,
            "QD": 0.92,
            "MeanInsertedBasesWithinRepeats": 0.0,
            "MeanDeletedBasesWithinRepeats": 0.0,
            "StrandBiasBinomialPhred": 0.0,
            "LeftFlankNormalizedDepth": 1.2,
            "RightFlankNormalizedDepth": 1.1,
            "HighQualityUnambiguousReads": 12,
            "ConfidenceIntervalDividedByAlleleSize": 0.0
        }
    ]
}
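A filtering step built on these metrics might look like the sketch below. The thresholds (`min_reads`, `min_depth`) are hypothetical placeholders rather than recommended values, and only two of the fields from the example are used; appropriate cutoffs depend on the data and on the definitions in the AlleleQualityMetrics documentation.

```python
def passing_alleles(metrics, min_reads=10, min_depth=10.0):
    """Return the AlleleNumber of each allele that meets both thresholds.

    Thresholds are illustrative assumptions, not recommended values.
    """
    return [
        allele["AlleleNumber"]
        for allele in metrics["Alleles"]
        if allele["HighQualityUnambiguousReads"] >= min_reads
        and allele["Depth"] >= min_depth
    ]


# A trimmed copy of the example record above.
metrics = {
    "VariantId": "HTT",
    "Alleles": [
        {"AlleleNumber": 1, "AlleleSize": 19, "Depth": 15.0,
         "HighQualityUnambiguousReads": 12},
    ],
}
kept = passing_alleles(metrics)
kept_strict = passing_alleles(metrics, min_reads=20)
print(kept, kept_strict)
```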

Consensus sequences

The JSON output file now provides a simple consensus sequence for each allele. It is generated by taking the most common base at each position across confidently-placed reads (the darker-colored reads in a REViewer visualization). Insertions, deletions, and soft-clipped bases within reads are not currently incorporated into the consensus sequence. A single consensus sequence is generated for homozygous or hemizygous genotypes.
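The per-position majority vote described above can be sketched as follows. This is not Expansion Hunter's implementation: it assumes a hypothetical input of gap-free reads given as (start_offset, bases) pairs relative to the allele, reports uncovered positions as 'N', and breaks ties arbitrarily.

```python
from collections import Counter


def simple_consensus(length, placed_reads):
    """Build a consensus of the given length by per-position majority vote.

    placed_reads: iterable of (start_offset, bases) pairs. Indels and
    soft clips are assumed to have been excluded already.
    """
    columns = [Counter() for _ in range(length)]
    for start, bases in placed_reads:
        for i, base in enumerate(bases):
            pos = start + i
            if 0 <= pos < length:
                columns[pos][base] += 1

    consensus = []
    for column in columns:
        if column:
            # Most common base at this position; ties resolved arbitrarily.
            consensus.append(column.most_common(1)[0][0])
        else:
            # No confidently-placed read overlaps this position.
            consensus.append("N")
    return "".join(consensus)


reads = [(0, "CAGCA"), (0, "CAGCA"), (1, "ATCAG")]
consensus = simple_consensus(8, reads)
print(consensus)
```

The third read disagrees at one position (T versus G) and is outvoted there, while the last two positions have no coverage and come out as 'N'.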

Limitation: Consensus sequences are only generated for loci with a single repeat variant. For loci with multiple repeat variants, consensus building is skipped and a warning is logged. This is because the current implementation builds consensus across the entire locus graph, which would produce a mixed consensus that is not meaningful to attach to individual variant findings.

  • ConsensusSequences Array of consensus sequences, one per allele. Positions not overlapped by any confidently-placed reads are reported as 'N'.
  • ConsensusSequencesReadSupport Array of strings, one per allele. Each string is a sequence of digits the same length as the corresponding consensus sequence, and each digit (0-9) represents the number of confidently-placed reads supporting the base reported at that position in the consensus sequence. A value of '9' indicates 9 or more reads.
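Because the two arrays are aligned position by position, the support string can be used to mask weakly supported bases. The sketch below replaces any base backed by fewer than `min_reads` reads with 'N'; the threshold and the function name are hypothetical, and the input is the second allele from this section's example values.

```python
def mask_low_support(consensus, support, min_reads=3):
    """Replace bases supported by fewer than min_reads reads with 'N'.

    Each support digit is capped at 9 (meaning 9 or more reads), so
    thresholds above 9 cannot be distinguished and would mask everything.
    """
    if len(consensus) != len(support):
        raise ValueError("consensus and support must have the same length")
    return "".join(
        base if int(digit) >= min_reads else "N"
        for base, digit in zip(consensus, support)
    )


consensus_seq = "CAGCAGCAGCAGCAGCAGCAGCAGCNNNGG"
read_support = "888877763333333333333333300022"
masked = mask_low_support(consensus_seq, read_support, min_reads=3)
print(masked)
```

With a threshold of 3, the trailing "GG" (supported by only 2 reads each) is masked along with the positions that were already 'N'.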

Example:

"ConsensusSequences": ["CAGCAG", "CAGCAGCAGCAGCAGCAGCAGCAGCNNNGG"],
"ConsensusSequencesReadSupport": ["555566", "888877763333333333333333300022"]

Catalog field passthrough

When --copy-catalog-fields is used, any extra fields from the input variant catalog will be copied to each locus record in the output JSON. This allows annotation fields like Gene, Diseases, PathogenicMin, etc. to flow through without a separate join step.
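On the consuming side, passthrough fields can be separated from the standard ones by excluding the locus-level keys documented earlier on this page. The sketch below assumes the copied fields sit at the top level of each locus record; the record and its annotation values are hypothetical.

```python
# Locus-level fields documented for the output JSON.
DOCUMENTED_LOCUS_FIELDS = {
    "AlleleCount", "Coverage", "FragmentLength",
    "LocusId", "ReadLength", "Variants",
}


def catalog_annotations(locus_record):
    """Return only the keys that are not standard locus-level fields."""
    return {
        key: value
        for key, value in locus_record.items()
        if key not in DOCUMENTED_LOCUS_FIELDS
    }


# Hypothetical locus record with invented annotation values.
locus = {
    "LocusId": "HTT",
    "AlleleCount": 2,
    "Coverage": 30.5,
    "Variants": {},
    "Gene": "HTT",
    "PathogenicMin": 40,
}
annotations = catalog_annotations(locus)
print(annotations)
```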