JSON files output by Expansion Hunter

JSON files generated by Expansion Hunter contain information about sample parameters (SampleParameters field) and analysis results summarized by locus (LocusResults field). Each locus result contains the following fields.

AlleleCount The expected number of alleles at the locus
Coverage Estimated read coverage at the locus
FragmentLength The fragment size estimated from read pairs fully contained in either the left or right flank of the repeat region
LocusId Locus identifier
ReadLength Mean read length at the locus
Variants Genotypes and other information describing each variant analyzed at the locus
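As a rough illustration of reading these fields, the sketch below loads a result and collects a few locus-level values. The embedded JSON is a made-up, trimmed-down sample, and the layout it assumes (LocusResults as an object keyed by locus ID, Variants as an object keyed by variant ID, Genotype as a string) is not stated above, so adjust it if your files differ.

```python
import json

# Hypothetical sample for illustration only; real files contain more fields.
SAMPLE = """
{
  "SampleParameters": {},
  "LocusResults": {
    "HTT": {
      "AlleleCount": 2,
      "Coverage": 35.2,
      "FragmentLength": 420,
      "LocusId": "HTT",
      "ReadLength": 150,
      "Variants": {"HTT": {"VariantType": "Repeat", "Genotype": "19/22"}}
    }
  }
}
"""

def summarize_loci(doc):
    """Return one (LocusId, AlleleCount, Coverage, number of variants) tuple per locus."""
    rows = []
    # Assumption: LocusResults maps locus IDs to locus records.
    for locus in doc.get("LocusResults", {}).values():
        rows.append((
            locus["LocusId"],
            locus["AlleleCount"],
            locus["Coverage"],
            len(locus.get("Variants", {})),
        ))
    return rows

doc = json.loads(SAMPLE)
rows = summarize_loci(doc)
for row in rows:
    print(row)
```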

The structure of records in the Variants field is different for repeats and small variants (insertions, deletions, and swaps).

Repeat records

Repeat records contain the following fields.

VariantId Unique variant identifier
RepeatUnit Repeat unit in the reference orientation
VariantType Always set to "Repeat"
VariantSubtype Either "Repeat" or "RareRepeat"
Genotype Repeat genotype given by the size of each repeat allele
GenotypeConfidenceInterval Size confidence interval for each repeat allele
CountsOfSpanningReads Summary of identified spanning reads given as an array with entries (n, m) where n is the number of repeat units spanned by the read and m is the number of such reads
CountsOfFlankingReads An analog of CountsOfSpanningReads for flanking reads
CountsOfInrepeatReads An analog of CountsOfSpanningReads for in-repeat reads
CountsOfHighQualityUnambiguousReads (optional) Summary of high-quality unambiguous reads by allele size, in the same format as CountsOfSpanningReads. Only present when allele quality metrics are enabled. Only counts reads that overlap the repeat region. High-quality means match rate ≥ 0.9; unambiguous means the read is consistent with only one haplotype. See AlleleQualityMetrics for details.
ReferenceRegion 0-based half-open reference coordinates of the repeat region (chrom:start-end)
ConsensusSequences (optional) Array of consensus sequences, one per allele. See Consensus sequences below for details.
ConsensusSequencesReadSupport (optional) Array of read support strings corresponding to ConsensusSequences. See Consensus sequences below for details.
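The following sketch shows one way to turn the genotype and read-count fields of a repeat record into Python values. It assumes that Genotype is a string of allele sizes separated by "/" and that the read-count fields are strings of "(n, m)" pairs; neither encoding is spelled out above, and the record shown is made up.

```python
import re

def parse_genotype(genotype):
    """Split a genotype such as "19/22" into a list of allele sizes."""
    return [int(size) for size in genotype.split("/")]

def parse_read_counts(counts):
    """Turn "(19, 5), (22, 3)" into {19: 5, 22: 3}; returns {} when no pairs are found."""
    pairs = re.findall(r"\((\d+),\s*(\d+)\)", counts)
    return {int(n): int(m) for n, m in pairs}

# Hypothetical repeat record (values are illustrative, not real output).
record = {
    "VariantId": "HTT",
    "VariantType": "Repeat",
    "Genotype": "19/22",
    "CountsOfSpanningReads": "(19, 5), (22, 3)",
    "CountsOfFlankingReads": "(3, 1), (10, 2)",
    "CountsOfInrepeatReads": "()",
}

alleles = parse_genotype(record["Genotype"])
spanning = parse_read_counts(record["CountsOfSpanningReads"])
# Number of spanning reads whose size matches each called allele.
support = {size: spanning.get(size, 0) for size in alleles}
print(alleles, support)
```

A hemizygous genotype with a single allele size is handled by the same split and yields a one-element list.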

Small variant records

Records for small variants contain the following fields.

VariantId Unique variant identifier
VariantType Always set to "SmallVariant"
VariantSubtype One of "Insertion", "Deletion", or "Swap"
RepeatUnit Repeat unit in the reference orientation
Genotype Genotype formed from 0s and 1s, where 0 denotes the ref allele and 1 denotes the alt allele
CountOfAltReads Number of reads supporting the alt allele
CountOfRefReads Number of reads supporting the ref allele
ReferenceRegion Reference region of the variant
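The two read counts can be combined into an alt-read fraction as a quick sanity check on a small variant call. This is a minimal sketch; the helper name is hypothetical and the counts in the example are made up.

```python
def alt_read_fraction(record):
    """Fraction of supporting reads that carry the alt allele, or None if there are no reads."""
    alt = record["CountOfAltReads"]
    ref = record["CountOfRefReads"]
    total = alt + ref
    return alt / total if total else None

# Hypothetical small variant record (values are illustrative).
example = {
    "VariantType": "SmallVariant",
    "VariantSubtype": "Deletion",
    "CountOfAltReads": 6,
    "CountOfRefReads": 18,
}
fraction = alt_read_fraction(example)
print(fraction)
```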

Allele quality metrics

When quality metrics are enabled (the default), repeat records will also include an AlleleQualityMetrics field containing per-allele quality information. This provides detailed metrics useful for filtering and quality assessment. See AlleleQualityMetrics for full documentation.

Example:

"AlleleQualityMetrics": {
    "VariantId": "HTT",
    "Alleles": [
        {
            "AlleleNumber": 1,
            "AlleleSize": 19,
            "Depth": 15.0,
            "QD": 0.92,
            "MeanInsertedBasesWithinRepeats": 0.0,
            "MeanDeletedBasesWithinRepeats": 0.0,
            "StrandBiasBinomialPhred": 0.0,
            "LeftFlankNormalizedDepth": 1.2,
            "RightFlankNormalizedDepth": 1.1,
            "HighQualityUnambiguousReads": 12,
            "ConfidenceIntervalDividedByAlleleSize": 0.0
        }
    ]
}
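As one possible use of these metrics for filtering, the sketch below flags alleles whose QD or HighQualityUnambiguousReads fall under a cutoff. The cutoff values and the helper name are hypothetical and chosen only for illustration; they are not recommended thresholds.

```python
def low_quality_alleles(metrics, min_qd=0.5, min_hq_reads=5):
    """Return the AlleleNumber of every allele that fails either illustrative cutoff."""
    flagged = []
    for allele in metrics["Alleles"]:
        too_low_qd = allele["QD"] < min_qd
        too_few_reads = allele["HighQualityUnambiguousReads"] < min_hq_reads
        if too_low_qd or too_few_reads:
            flagged.append(allele["AlleleNumber"])
    return flagged

# Subset of the example record shown above.
metrics = {
    "VariantId": "HTT",
    "Alleles": [
        {
            "AlleleNumber": 1,
            "AlleleSize": 19,
            "Depth": 15.0,
            "QD": 0.92,
            "HighQualityUnambiguousReads": 12,
        }
    ],
}

passing = low_quality_alleles(metrics)
strict = low_quality_alleles(metrics, min_hq_reads=20)
print(passing, strict)
```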

Consensus sequences

The JSON output file now provides a simple consensus sequence for each allele. It is generated by taking the most common base at each position across confidently-placed reads (the darker-colored reads in a REViewer visualization). Insertions, deletions, and soft-clipped bases within reads are not currently incorporated into the consensus sequence. A single consensus sequence is generated for homozygous or hemizygous genotypes.
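The per-position majority vote described above can be sketched as follows. This is a simplified illustration and not the tool's implementation: it assumes the reads have already been laid out as equal-length strings with "." marking positions a read does not cover, whereas the real procedure works on graph alignments. Ties are broken arbitrarily here.

```python
from collections import Counter

def build_consensus(aligned_reads):
    """Return (consensus, read_support) for equal-length reads; '.' means not covered."""
    consensus, support = [], []
    for column in zip(*aligned_reads):
        counts = Counter(base for base in column if base != ".")
        if not counts:
            # No read covers this position.
            consensus.append("N")
            support.append("0")
            continue
        base, n = counts.most_common(1)[0]
        consensus.append(base)
        # Support is written as a single digit, so cap it at 9.
        support.append(str(min(n, 9)))
    return "".join(consensus), "".join(support)

# Made-up reads for illustration.
reads = ["CAGCA.", "CAGCAG", ".AGTAG"]
sequence, read_support = build_consensus(reads)
print(sequence, read_support)
```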

Limitation: Consensus sequences are only generated for loci with a single repeat variant. For loci with multiple repeat variants, consensus building is skipped and a warning is logged. This is because the current implementation builds consensus across the entire locus graph, which would produce a mixed consensus that is not meaningful to attach to individual variant findings.

ConsensusSequences Array of consensus sequences, one per allele. Positions not overlapped by any confidently-placed reads are reported as 'N'.
ConsensusSequencesReadSupport Array of strings, one per allele. Each string is a sequence of digits the same length as the corresponding consensus sequence, and each digit (0-9) represents the number of confidently-placed reads supporting the base reported at that position in the consensus sequence. A value of '9' indicates 9 or more reads.

Example:

"ConsensusSequences": ["CAGCAG", "CAGCAGCAGCAGCAGCAGCAGCAGCNNNGG"],
"ConsensusSequencesReadSupport": ["555566", "888877763333333333333333300022"]
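Because each read support string lines up position by position with its consensus sequence, the two can be zipped together. The sketch below masks bases supported by fewer than a chosen number of reads; the helper name, the cutoff, and the short sequences are hypothetical.

```python
def mask_low_support(consensus, support, min_reads=3):
    """Replace bases supported by fewer than min_reads reads with 'N'."""
    if len(consensus) != len(support):
        raise ValueError("consensus and read support must have the same length")
    return "".join(
        base if int(digit) >= min_reads else "N"
        for base, digit in zip(consensus, support)
    )

# Made-up values in the same format as the example above.
consensus = "CAGCAGCNNNGG"
support = "887633300022"
masked = mask_low_support(consensus, support)
print(masked)
```

Keep in mind that a digit of 9 means 9 or more reads, so cutoffs above 9 cannot be expressed with this field.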

Catalog field passthrough

When --copy-catalog-fields is used, any extra fields from the input variant catalog will be copied to each locus record in the output JSON. This allows annotation fields like Gene, Diseases, PathogenicMin, etc. to flow through without a separate join step.
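With the catalog fields copied through, annotation-based checks can be done directly on the output. The sketch below lists loci whose longest repeat allele reaches PathogenicMin. It assumes that the copied fields sit at the top level of each locus record, that PathogenicMin is numeric, that LocusResults and Variants are objects keyed by ID, and that Genotype is a "/"-separated string; the data shown is made up.

```python
def loci_at_or_above_threshold(locus_results):
    """Return (locus ID, Gene, longest allele) for repeats reaching PathogenicMin."""
    hits = []
    for locus_id, locus in locus_results.items():
        threshold = locus.get("PathogenicMin")
        if threshold is None:
            continue  # field was not present in the catalog entry
        for variant in locus.get("Variants", {}).values():
            if variant.get("VariantType") != "Repeat" or not variant.get("Genotype"):
                continue
            longest = max(int(size) for size in variant["Genotype"].split("/"))
            if longest >= threshold:
                hits.append((locus_id, locus.get("Gene"), longest))
    return hits

# Hypothetical output fragment with passthrough fields (values are illustrative).
locus_results = {
    "LOCUS_A": {
        "Gene": "GENE_A",
        "PathogenicMin": 40,
        "Variants": {"LOCUS_A": {"VariantType": "Repeat", "Genotype": "19/45"}},
    },
    "LOCUS_B": {
        "Gene": "GENE_B",
        "PathogenicMin": 40,
        "Variants": {"LOCUS_B": {"VariantType": "Repeat", "Genotype": "19/22"}},
    },
}

hits = loci_at_or_above_threshold(locus_results)
print(hits)
```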
