JSON files generated by Expansion Hunter contain information about sample
parameters (SampleParameters field) and analysis results summarized by
locus (LocusResults field). The locus results contain these fields:

| Field | Description |
|---|---|
| AlleleCount | The expected number of alleles at the locus |
| Coverage | Estimated read coverage at the locus |
| FragmentLength | The fragment size estimated from read pairs fully contained in either the left or right flank of the repeat region |
| LocusId | Locus identifier |
| ReadLength | Mean read length at the locus |
| Variants | Genotypes and other information describing each variant analyzed at the locus |

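As an illustration, the locus-level fields can be read with Python's standard `json` module. This is a minimal sketch, not part of Expansion Hunter: it assumes `LocusResults` is either an object keyed by locus ID or an array of locus records (the text above does not say which), and the helper name `iter_loci` and the inline sample values are made up for the example.

```python
import json

def iter_loci(report):
    """Yield locus records from a parsed Expansion Hunter JSON report.

    Assumption: LocusResults is either a dict keyed by locus ID or a list
    of locus records; both shapes are handled.
    """
    loci = report.get("LocusResults", {})
    records = loci.values() if isinstance(loci, dict) else loci
    for record in records:
        yield record

# Hypothetical, hand-written sample; a real file would be read with json.load().
sample = json.loads("""
{
  "SampleParameters": {},
  "LocusResults": {
    "HTT": {
      "AlleleCount": 2,
      "Coverage": 30.5,
      "FragmentLength": 400,
      "LocusId": "HTT",
      "ReadLength": 150,
      "Variants": {}
    }
  }
}
""")

# Collect a few of the documented locus-level fields.
summary = [(r["LocusId"], r["AlleleCount"], r["Coverage"]) for r in iter_loci(sample)]
print(summary)
```

Handling both container shapes keeps the sketch usable even if the assumption about `LocusResults` turns out to be wrong for a given version.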
The structure of records in the Variants field differs between repeats and
small variants (insertions, deletions, and swaps).
Repeat records contain the following fields.

| Field | Description |
|---|---|
| VariantId | Unique variant identifier |
| RepeatUnit | Repeat unit in the reference orientation |
| VariantType | Always set to "Repeat" |
| VariantSubtype | Either "Repeat" or "RareRepeat" |
| Genotype | Repeat genotype given by the size of each repeat allele |
| GenotypeConfidenceInterval | Size confidence interval for each repeat allele |
| CountsOfSpanningReads | Summary of identified spanning reads given as an array with entries `(n, m)`, where `n` is the number of repeat units spanned by the read and `m` is the number of such reads |
| CountsOfFlankingReads | An analog of `CountsOfSpanningReads` for flanking reads |
| CountsOfInrepeatReads | An analog of `CountsOfSpanningReads` for in-repeat reads |
| CountsOfHighQualityUnambiguousReads | (optional) Summary of high-quality unambiguous reads by allele size, in the same format as `CountsOfSpanningReads`. Only present when allele quality metrics are enabled. Only counts reads that overlap the repeat region. High-quality means match rate ≥ 0.9; unambiguous means the read is consistent with only one haplotype. See AlleleQualityMetrics for details. |
| ReferenceRegion | 0-based half-open reference coordinates of the repeat region (`chrom:start-end`) |
| ConsensusSequences | (optional) Array of consensus sequences, one per allele. See Consensus sequences below for details. |
| ConsensusSequencesReadSupport | (optional) Array of read support strings corresponding to `ConsensusSequences`. See Consensus sequences below for details. |

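The read-count and genotype fields usually need a small amount of parsing before they can be used. The sketch below is illustrative only and rests on assumptions that the description above does not confirm: that a `Counts*` field is either a string such as `"(19, 12), (21, 9)"` or an already-parsed array of `[n, m]` pairs, and that a genotype is written as allele sizes separated by `/`. The function names are hypothetical.

```python
import re

def parse_read_counts(value):
    """Return a list of (n, m) tuples from a Counts* field.

    Assumption: the field is either a string such as "(19, 12), (21, 9)"
    or an already-parsed array of [n, m] pairs.
    """
    if isinstance(value, str):
        pairs = re.findall(r"\((\d+),\s*(\d+)\)", value)
        return [(int(n), int(m)) for n, m in pairs]
    return [(int(n), int(m)) for n, m in value]

def parse_genotype(value):
    """Split a genotype such as "19/21" into a list of allele sizes.

    Assumption: allele sizes are separated by "/"; a hemizygous call
    would then contain a single number.
    """
    return [int(size) for size in value.split("/")]

# Made-up values for illustration.
counts = parse_read_counts("(19, 12), (21, 9)")
total_reads = sum(m for _, m in counts)   # m is the number of reads per size
alleles = parse_genotype("19/21")
print(counts, total_reads, alleles)
```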
Records for small variants contain the following fields.

| Field | Description |
|---|---|
| VariantId | Unique variant identifier |
| VariantType | Always set to "SmallVariant" |
| VariantSubtype | Either "Insertion", "Deletion", or "Swap" |
| RepeatUnit | Repeat unit in the reference orientation |
| Genotype | Formed, as usual, from `0`s and `1`s corresponding to ref and alt alleles |
| CountOfAltReads | Number of reads supporting the alt allele |
| CountOfRefReads | Number of reads supporting the ref allele |
| ReferenceRegion | Reference region of the variant |

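A common sanity check on a small-variant record is to compare the genotype with the fraction of reads that support the alt allele. The following sketch uses only the fields listed above; the function names and the sample record are hypothetical, and the `/` separator in the genotype is an assumption based on the usual VCF-style notation.

```python
def alt_read_fraction(variant):
    """Fraction of reads supporting the alt allele, or None if no reads."""
    alt = variant["CountOfAltReads"]
    ref = variant["CountOfRefReads"]
    total = alt + ref
    return alt / total if total else None

def alt_allele_count(genotype):
    """Count alt alleles in a genotype such as "0/1".

    Assumption: alleles are separated by "/".
    """
    return sum(1 for allele in genotype.split("/") if allele == "1")

# Made-up record for illustration.
record = {
    "VariantType": "SmallVariant",
    "VariantSubtype": "Deletion",
    "Genotype": "0/1",
    "CountOfAltReads": 12,
    "CountOfRefReads": 18,
}

fraction = alt_read_fraction(record)
n_alt = alt_allele_count(record["Genotype"])
print(fraction, n_alt)
```

A heterozygous call with an alt fraction far from one half may deserve a closer look, though any cutoff for that is a judgment call and not something this document specifies.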
When quality metrics are enabled (the default), repeat records will also include
an AlleleQualityMetrics field containing per-allele quality information. This
provides detailed metrics useful for filtering and quality assessment. See
AlleleQualityMetrics for full documentation.
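The per-allele metrics lend themselves to simple filtering. The sketch below flags alleles with little read support; the field names `Alleles`, `AlleleNumber`, `Depth`, and `HighQualityUnambiguousReads` follow the example record in this section, while the function name and both cutoffs are hypothetical values chosen for illustration, not recommended thresholds.

```python
def low_support_alleles(variant, min_reads=5, min_depth=10.0):
    """Return the allele numbers whose metrics fall below the given cutoffs.

    The default cutoffs are hypothetical and exist only for this example.
    """
    metrics = variant.get("AlleleQualityMetrics")
    if metrics is None:  # field is absent when quality metrics are disabled
        return []
    flagged = []
    for allele in metrics["Alleles"]:
        too_few_reads = allele["HighQualityUnambiguousReads"] < min_reads
        too_shallow = allele["Depth"] < min_depth
        if too_few_reads or too_shallow:
            flagged.append(allele["AlleleNumber"])
    return flagged

# Made-up repeat record with two alleles.
variant = {
    "VariantId": "HTT",
    "AlleleQualityMetrics": {
        "VariantId": "HTT",
        "Alleles": [
            {"AlleleNumber": 1, "Depth": 15.0, "HighQualityUnambiguousReads": 12},
            {"AlleleNumber": 2, "Depth": 6.0, "HighQualityUnambiguousReads": 3},
        ],
    },
}

flagged = low_support_alleles(variant)
print(flagged)
```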
Example:

    "AlleleQualityMetrics": {
      "VariantId": "HTT",
      "Alleles": [
        {
          "AlleleNumber": 1,
          "AlleleSize": 19,
          "Depth": 15.0,
          "QD": 0.92,
          "MeanInsertedBasesWithinRepeats": 0.0,
          "MeanDeletedBasesWithinRepeats": 0.0,
          "StrandBiasBinomialPhred": 0.0,
          "LeftFlankNormalizedDepth": 1.2,
          "RightFlankNormalizedDepth": 1.1,
          "HighQualityUnambiguousReads": 12,
          "ConfidenceIntervalDividedByAlleleSize": 0.0
        }
      ]
    }

The JSON output file now provides a simple consensus sequence for each allele. It is generated by taking the most common base at each position across confidently-placed reads (the darker-colored reads in a REViewer visualization). Insertions, deletions, and soft-clipped bases within reads are not currently incorporated into the consensus sequence. A single consensus sequence is generated for homozygous or hemizygous genotypes.
Limitation: Consensus sequences are only generated for loci with a single repeat variant. For loci with multiple repeat variants, consensus building is skipped and a warning is logged. This is because the current implementation builds consensus across the entire locus graph, which would produce a mixed consensus that is not meaningful to attach to individual variant findings.

| Field | Description |
|---|---|
| ConsensusSequences | Array of consensus sequences, one per allele. Positions not overlapped by any confidently-placed reads are reported as 'N'. |
| ConsensusSequencesReadSupport | Array of strings, one per allele. Each string is a sequence of digits the same length as the corresponding consensus sequence, and each digit (0-9) represents the number of confidently-placed reads supporting the base reported at that position in the consensus sequence. A value of '9' indicates 9 or more reads. |

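The majority-vote idea behind these two fields can be sketched in a few lines. This is an illustration of the described behavior, not Expansion Hunter's implementation: it assumes each confidently-placed read is given as a gap-free `(start, sequence)` placement on a linear allele of known length, and the function name `build_consensus` is hypothetical. Ties between bases are broken arbitrarily here, since the text does not say how they are resolved.

```python
from collections import Counter

def build_consensus(reads, length):
    """Majority-vote consensus over gap-free read placements.

    reads: list of (start, sequence) pairs, start being a 0-based offset.
    Returns (consensus, support) strings, both of the given length.
    """
    columns = [Counter() for _ in range(length)]
    for start, seq in reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < length:
                columns[pos][base] += 1

    consensus, support = [], []
    for column in columns:
        if not column:
            # No read overlaps this position.
            consensus.append("N")
            support.append("0")
        else:
            base, count = column.most_common(1)[0]
            consensus.append(base)
            support.append(str(min(count, 9)))  # '9' means 9 or more reads
    return "".join(consensus), "".join(support)

# Made-up placements for illustration.
reads = [(0, "CAGCAG"), (0, "CAGCAG"), (3, "CAG"), (8, "GG")]
consensus, support = build_consensus(reads, 10)
print(consensus)  # CAGCAGNNGG
print(support)    # 2223330011
```

Because the support string has one digit per consensus base, downstream code can pair the two with `zip(consensus, support)`, for example to mask bases supported by fewer reads than some chosen minimum.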
Example:

    "ConsensusSequences": ["CAGCAG", "CAGCAGCAGCAGCAGCAGCAGCAGCNNNGG"],
    "ConsensusSequencesReadSupport": ["555566", "888877763333333333333333300022"]

When --copy-catalog-fields is used, any extra fields from the input variant
catalog will be copied to each locus record in the output JSON. This allows
annotation fields like Gene, Diseases, PathogenicMin, etc. to flow through
without a separate join step.
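One way to pick out the copied annotations downstream is to treat every key that is not a documented locus field as a catalog field. The sketch below does that; the helper name `copied_catalog_fields` and the sample values are hypothetical, and the set of standard fields is taken from the locus field list earlier in this document.

```python
# Locus-level fields documented for the output JSON.
STANDARD_LOCUS_FIELDS = {
    "AlleleCount", "Coverage", "FragmentLength",
    "LocusId", "ReadLength", "Variants",
}

def copied_catalog_fields(locus_record):
    """Return the keys of a locus record that are not standard output fields."""
    return {
        key: value
        for key, value in locus_record.items()
        if key not in STANDARD_LOCUS_FIELDS
    }

# Made-up locus record with two annotation fields copied from the catalog.
locus = {
    "LocusId": "HTT",
    "AlleleCount": 2,
    "Coverage": 30.5,
    "FragmentLength": 400,
    "ReadLength": 150,
    "Variants": {},
    "Gene": "HTT",
    "PathogenicMin": 40,
}

extras = copied_catalog_fields(locus)
print(extras)
```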