Contributor

@Alex-Wengg Alex-Wengg commented Dec 8, 2025

Summary

  • SystemInfo.isAppleSilicon and SystemInfo.isIntelMac to detect platform
  • AsrModels.isModelValid() validates all 4 Parakeet components (Preprocessor, Encoder, Decoder, Joint) can load without corruption
  • Reuse decoder state arrays to prevent memory accumulation during streaming
  • Handle non-contiguous strides in copyData
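
The PR does not show how `SystemInfo` is implemented, so the following is only a minimal sketch of one way such flags could be derived, using Swift's compile-time `arch()` and `os()` conditions. `PlatformCheck` is a hypothetical name, not the library's API. One limitation of this approach: an x86_64 build running under Rosetta 2 on Apple Silicon would report itself as Intel, because the check reflects the compiled slice rather than the physical CPU.

```swift
/// Hypothetical stand-in for the PR's `SystemInfo`; the real implementation
/// is not shown in this thread.
enum PlatformCheck {
    /// True when this binary was compiled for arm64 macOS.
    static var isAppleSilicon: Bool {
        #if os(macOS) && arch(arm64)
        return true
        #else
        return false
        #endif
    }

    /// True when this binary was compiled for x86_64 macOS.
    static var isIntelMac: Bool {
        #if os(macOS) && arch(x86_64)
        return true
        #else
        return false
        #endif
    }
}

// Example gate for a model picker: hide Parakeet when built for Intel.
let parakeetSelectable = !PlatformCheck.isIntelMac
print("Parakeet selectable:", parakeetSelectable)
```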

What VoiceInk should do (app side)

| Issue | VoiceInk Fix |
|-------|-------------|
| Intel Mac users selecting Parakeet | Use `SystemInfo.isIntelMac` to hide/disable Parakeet models in UI |
| Infinite "Transcribing" hang | Add timeout to transcription calls with user-facing error |
| 20-30s delay after sleep | Show "Loading model..." UI during model load (ANE recompilation is Apple's `anecompilerservice`, cannot be sped up) |
| Model corruption | Use `AsrModels.isModelValid()` before transcription, prompt re-download if invalid |
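
For the "add timeout to transcription calls" row, one possible app-side shape is a generic wrapper built on Swift structured concurrency. This is a sketch under assumptions: `withTimeout` and `TranscriptionTimeout` are hypothetical names, and the transcription call itself is left to the caller. Note that a task group waits for its child tasks before returning, so this only helps if the wrapped call responds to cancellation; a call that ignores cancellation would still block.

```swift
/// Hypothetical error type surfaced to the UI when the deadline passes.
struct TranscriptionTimeout: Error {}

/// Runs `operation` and a timer concurrently and returns whichever finishes first.
/// The losing task is cancelled; the operation must cooperate with cancellation.
func withTimeout<T: Sendable>(
    seconds: Double,
    operation: @escaping @Sendable () async throws -> T
) async throws -> T {
    try await withThrowingTaskGroup(of: T.self) { group in
        // Cancel whichever child is still running once we leave the group body.
        defer { group.cancelAll() }

        group.addTask { try await operation() }
        group.addTask {
            try await Task.sleep(nanoseconds: UInt64(max(0, seconds) * 1_000_000_000))
            throw TranscriptionTimeout()
        }

        // The first child to finish decides the outcome: a value or a thrown error.
        guard let first = try await group.next() else {
            throw TranscriptionTimeout()
        }
        return first
    }
}
```

A caller would wrap its own transcription call in the closure and map `TranscriptionTimeout` to a user-facing error instead of leaving the UI in "Transcribing".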

github-actions bot commented Dec 8, 2025

VAD Benchmark Results

Performance Comparison

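| Dataset | Accuracy | Precision | Recall | F1-Score | RTFx | Files |
|---------|----------|-----------|--------|----------|------|-------|
| MUSAN | 92.0% | 86.2% | 100.0% | 92.6% | 728.9x faster | 50 |
| VOiCES | 92.0% | 86.2% | 100.0% | 92.6% | 806.5x faster | 50 |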

Dataset Details

  • MUSAN: Music, Speech, and Noise dataset - standard VAD evaluation
  • VOiCES: Voices Obscured in Complex Environmental Settings - tests robustness in real-world conditions

✅: Average F1-Score above 70%

github-actions bot commented Dec 8, 2025

Offline VBx Pipeline Results

Speaker Diarization Performance (VBx Batch Mode)

Optimal clustering with Hungarian algorithm for maximum accuracy

| Metric | Value | Target | Status | Description |
|--------|-------|--------|--------|-------------|
| DER | 14.5% | <20% | | Diarization Error Rate (lower is better) |
| RTFx | 3.36x | >1.0x | | Real-Time Factor (higher is faster) |

Offline VBx Pipeline Timing Breakdown

Time spent in each stage of batch diarization

| Stage | Time (s) | % | Description |
|-------|----------|---|-------------|
| Model Download | 15.506 | 5.0 | Fetching diarization models |
| Model Compile | 6.645 | 2.1 | CoreML compilation |
| Audio Load | 0.084 | 0.0 | Loading audio file |
| Segmentation | 32.750 | 10.5 | VAD + speech detection |
| Embedding | 309.125 | 98.9 | Speaker embedding extraction |
| Clustering (VBx) | 2.839 | 0.9 | Hungarian algorithm + VBx clustering |
| Total | 312.526 | 100 | Full VBx pipeline |

Speaker Diarization Research Comparison

Offline VBx achieves competitive accuracy with batch processing

| Method | DER | Mode | Description |
|--------|-----|------|-------------|
| FluidAudio (Offline) | 14.5% | VBx Batch | On-device CoreML with optimal clustering |
| FluidAudio (Streaming) | 17.7% | Chunk-based | First-occurrence speaker mapping |
| Research baseline | 18-30% | Various | Standard dataset performance |

Pipeline Details:

  • Mode: Offline VBx with Hungarian algorithm for optimal speaker-to-cluster assignment
  • Segmentation: VAD-based voice activity detection
  • Embeddings: WeSpeaker-compatible speaker embeddings
  • Clustering: PowerSet with VBx refinement
  • Accuracy: Higher than streaming due to optimal post-hoc mapping

🎯 Offline VBx Test • AMI Corpus ES2004a • 1049.0s meeting audio • 344.7s processing • Test runtime: 5m 45s • 12/13/2025, 09:40 PM EST

github-actions bot commented Dec 8, 2025

Speaker Diarization Benchmark Results

Speaker Diarization Performance

Evaluating "who spoke when" detection accuracy

| Metric | Value | Target | Status | Description |
|--------|-------|--------|--------|-------------|
| DER | 15.1% | <30% | | Diarization Error Rate (lower is better) |
| JER | 24.9% | <25% | | Jaccard Error Rate |
| RTFx | 14.34x | >1.0x | | Real-Time Factor (higher is faster) |

Diarization Pipeline Timing Breakdown

Time spent in each stage of speaker diarization

| Stage | Time (s) | % | Description |
|-------|----------|---|-------------|
| Model Download | 8.572 | 11.7 | Fetching diarization models |
| Model Compile | 3.674 | 5.0 | CoreML compilation |
| Audio Load | 0.101 | 0.1 | Loading audio file |
| Segmentation | 21.938 | 30.0 | Detecting speech regions |
| Embedding | 36.563 | 50.0 | Extracting speaker voices |
| Clustering | 14.625 | 20.0 | Grouping same speakers |
| Total | 73.198 | 100 | Full pipeline |

Speaker Diarization Research Comparison

Research baselines typically achieve 18-30% DER on standard datasets

| Method | DER | Notes |
|--------|-----|-------|
| FluidAudio | 15.1% | On-device CoreML |
| Research baseline | 18-30% | Standard dataset performance |

Note: RTFx shown above is from GitHub Actions runner. On Apple Silicon with ANE:

  • M2 MacBook Air (2022): Runs at 150x real-time (RTFx)
  • Performance scales with Apple Neural Engine capabilities

🎯 Speaker Diarization Test • AMI Corpus ES2004a • 1049.0s meeting audio • 73.1s diarization time • Test runtime: 1m 55s • 12/13/2025, 09:38 PM EST

github-actions bot commented Dec 8, 2025

ASR Benchmark Results ✅

Status: All benchmarks passed

Parakeet v3 (multilingual)

| Dataset | WER Avg | WER Med | RTFx | Status |
|---------|---------|---------|------|--------|
| test-clean | 0.57% | 0.00% | 3.42x | |
| test-other | 1.35% | 0.00% | 2.49x | |

Parakeet v2 (English-optimized)

| Dataset | WER Avg | WER Med | RTFx | Status |
|---------|---------|---------|------|--------|
| test-clean | 0.40% | 0.00% | 3.52x | |
| test-other | 1.00% | 0.00% | 2.43x | |

Streaming (v3)

| Metric | Value | Description |
|--------|-------|-------------|
| WER | 0.00% | Word Error Rate in streaming mode |
| RTFx | 0.40x | Streaming real-time factor |
| Avg Chunk Time | 2.186s | Average time to process each chunk |
| Max Chunk Time | 2.993s | Maximum chunk processing time |
| First Token | 2.619s | Latency to first transcription token |
| Total Chunks | 31 | Number of chunks processed |

Streaming (v2)

| Metric | Value | Description |
|--------|-------|-------------|
| WER | 0.00% | Word Error Rate in streaming mode |
| RTFx | 0.39x | Streaming real-time factor |
| Avg Chunk Time | 2.257s | Average time to process each chunk |
| Max Chunk Time | 2.953s | Maximum chunk processing time |
| First Token | 2.322s | Latency to first transcription token |
| Total Chunks | 31 | Number of chunks processed |

Streaming tests use 5 files with 0.5s chunks to simulate real-time audio streaming

25 files per dataset • Test runtime: 7m44s • 12/13/2025, 09:43 PM EST

RTFx = Real-Time Factor (higher is better) • Calculated as: Total audio duration ÷ Total processing time
Processing time includes: Model inference on Apple Neural Engine, audio preprocessing, state resets between files, token-to-text conversion, and file I/O
Example: RTFx of 2.0x means 10 seconds of audio processed in 5 seconds (2x faster than real-time)
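
The RTFx definition above is a single division, sketched here as a small function. `realTimeFactor` is a hypothetical helper name used only for illustration; the inputs reproduce the worked example from the text.

```swift
/// Real-time factor as defined above: total audio duration ÷ total processing time.
/// Values above 1.0 mean faster than real time; values below 1.0 mean slower.
func realTimeFactor(audioSeconds: Double, processingSeconds: Double) -> Double {
    precondition(processingSeconds > 0, "processing time must be positive")
    return audioSeconds / processingSeconds
}

// The example from the text: 10 seconds of audio processed in 5 seconds.
let exampleRTFx = realTimeFactor(audioSeconds: 10, processingSeconds: 5)
print(exampleRTFx) // 2.0
```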

Expected RTFx Performance on Physical M1 Hardware:

• M1 Mac: ~28x (clean), ~25x (other)
• CI shows ~0.5-3x due to virtualization limitations

Testing methodology follows HuggingFace Open ASR Leaderboard

Alex-Wengg and others added 3 commits December 9, 2025 22:43
Changes TdtDecoderState.update() to copy data into existing arrays
instead of replacing array references.

Before: hiddenState = newArray (orphans old array, memory accumulates)
After: hiddenState.copyData(from: newArray) (reuses same array)

This prevents MLMultiArray instances from accumulating over long
transcription sessions, which could cause progressive slowdown.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>

Update tests to verify that TdtDecoderState.update() reuses existing
arrays and copies values into them, rather than checking for object
identity with the new arrays from decoder output.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
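
The two commits above describe replacing array references versus copying into existing arrays, and testing for reuse rather than identity with the decoder output. A minimal sketch of that idea follows. To keep it runnable without CoreML, `StateBuffer` is a hypothetical reference-type stand-in for `MLMultiArray`, and `DecoderStateSketch` is not the library's `TdtDecoderState`.

```swift
/// Reference-type stand-in for MLMultiArray (an assumption made so the
/// sketch runs without CoreML).
final class StateBuffer {
    var values: [Float]
    init(_ values: [Float]) { self.values = values }

    /// Copies element values from `other` into this buffer; the object itself is kept.
    func copyData(from other: StateBuffer) {
        precondition(values.count == other.values.count, "shape mismatch")
        for i in values.indices { values[i] = other.values[i] }
    }
}

/// Hypothetical decoder state holder illustrating the "After" behaviour.
struct DecoderStateSketch {
    let hiddenState: StateBuffer
    let cellState: StateBuffer

    /// Keeps the same buffer objects across steps; only their contents change.
    func update(hidden: StateBuffer, cell: StateBuffer) {
        hiddenState.copyData(from: hidden)
        cellState.copyData(from: cell)
    }
}

let originalHidden = StateBuffer([0, 0])
let state = DecoderStateSketch(hiddenState: originalHidden, cellState: StateBuffer([0, 0]))
let decoderHidden = StateBuffer([1, 2])
state.update(hidden: decoderHidden, cell: StateBuffer([3, 4]))

// The state still owns its original buffer, now holding the new values.
print(state.hiddenState === originalHidden, state.hiddenState.values)
```

A test in this style checks that the state's buffer is still the original object and that its values match the decoder output, instead of asserting identity with the output array.
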
The previous memcpy-based copy assumed contiguous memory layout, but
CoreML output arrays may have different strides than our ANE-aligned
arrays. This could cause incorrect data copying and affect WER.

Now checks if both arrays are contiguous before using fast memcpy,
otherwise falls back to element-by-element copy that respects strides.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
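
The contiguity check and strided fallback described in this commit can be sketched in plain Swift. This is an illustration under assumptions, not the PR's code: `StridedBuffer` is a hypothetical stand-in that carries `shape` and `strides` the way `MLMultiArray` does, and the free function `copyData(from:into:)` is named after the method in the commit message only for readability.

```swift
/// Plain-Swift stand-in for a strided tensor (hypothetical type).
struct StridedBuffer {
    var storage: [Float]
    let shape: [Int]
    let strides: [Int]

    /// True when the strides describe a dense row-major layout for `shape`.
    var isContiguous: Bool {
        var expected = 1
        for axis in shape.indices.reversed() {
            if shape[axis] > 1 && strides[axis] != expected { return false }
            expected *= shape[axis]
        }
        return true
    }
}

/// Copies logical elements from `source` into `destination`, respecting both layouts.
func copyData(from source: StridedBuffer, into destination: inout StridedBuffer) {
    precondition(source.shape == destination.shape, "shape mismatch")
    let count = source.shape.reduce(1, *)

    if source.isContiguous && destination.isContiguous {
        // Fast path: both layouts are dense, so a flat copy is correct.
        destination.storage.replaceSubrange(0..<count, with: source.storage[0..<count])
        return
    }

    // Slow path: visit every logical index and map it through each array's strides.
    var index = [Int](repeating: 0, count: source.shape.count)
    for _ in 0..<count {
        var srcOffset = 0
        var dstOffset = 0
        for axis in index.indices {
            srcOffset += index[axis] * source.strides[axis]
            dstOffset += index[axis] * destination.strides[axis]
        }
        destination.storage[dstOffset] = source.storage[srcOffset]

        // Advance the multi-dimensional index like an odometer (last axis fastest).
        var axis = index.count - 1
        while axis >= 0 {
            index[axis] += 1
            if index[axis] < source.shape[axis] { break }
            index[axis] = 0
            axis -= 1
        }
    }
}

// A 2x3 source whose rows are padded to 4 elements, copied into a dense 2x3 destination.
let padded = StridedBuffer(storage: [1, 2, 3, 0, 4, 5, 6, 0], shape: [2, 3], strides: [4, 1])
var dense = StridedBuffer(storage: [Float](repeating: 0, count: 6), shape: [2, 3], strides: [3, 1])
copyData(from: padded, into: &dense)
print(dense.storage) // [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
```

A flat copy of the first six storage elements would instead produce `[1, 2, 3, 0, 4, 5]`, which is the kind of silent corruption the commit message warns could affect WER.
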
@Alex-Wengg Alex-Wengg changed the title from "Fix: Add timeout support and stalling prevention mechanisms" to "Add Intel Mac detection and model validation utilities" on Dec 13, 2025
mutating func update(from decoderOutput: MLFeatureProvider) {
    hiddenState = decoderOutput.featureValue(for: "h_out")?.multiArrayValue ?? hiddenState
    cellState = decoderOutput.featureValue(for: "c_out")?.multiArrayValue ?? cellState
    // Copy data into existing arrays instead of replacing them to avoid memory leaks.

Member

should we just remove the memory optimization and see how much worse the perf is? iirc it's not a huge performance difference

Member

what about RTFx and latency?

Member

this should not affect WER

Contributor Author

RTFx is about the same after running 1000 files on the main vs this branch

- Add SystemInfo.isAppleSilicon and SystemInfo.isIntelMac for architecture detection
- Add AsrModels.isModelValid() to validate Parakeet models can load
  - Returns false on Intel Macs (no ANE support)
  - Validates all 4 model components (Preprocessor, Encoder, Decoder, Joint)
  - Uses CPU-only loading to avoid triggering ANE compilation during validation

These utilities help apps guard UI for Intel Mac users and validate model integrity.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
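
The commit message says validation loads each component with CPU-only compute units so that it does not trigger ANE compilation. A minimal sketch of that idea with the public Core ML API is shown below. `allModelsLoadOnCPU` is a hypothetical helper, the URLs are assumed to point at compiled `.mlmodelc` bundles, and the actual signature and internals of `AsrModels.isModelValid()` are not shown in this thread.

```swift
import Foundation
#if canImport(CoreML)
import CoreML
#endif

/// Hypothetical helper: returns true only if every model at `modelURLs`
/// can be loaded with CPU-only compute units on an arm64 build.
func allModelsLoadOnCPU(_ modelURLs: [URL]) -> Bool {
    #if canImport(CoreML) && arch(arm64)
    let configuration = MLModelConfiguration()
    // CPU-only loading avoids scheduling work on the Neural Engine during the check.
    configuration.computeUnits = .cpuOnly

    for url in modelURLs {
        do {
            _ = try MLModel(contentsOf: url, configuration: configuration)
        } catch {
            // A missing or corrupt model bundle surfaces as a thrown error.
            return false
        }
    }
    return true
    #else
    // Mirrors the commit's rule of reporting "not valid" off Apple Silicon.
    return false
    #endif
}
```

An app could call this with the four component URLs (Preprocessor, Encoder, Decoder, Joint) before transcription and prompt a re-download when it returns false.
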
Alex-Wengg and others added 2 commits December 13, 2025 21:14
Replace vDSP/memcpy-based implementations with simple loops:
- Remove import Accelerate
- Simplify resetData(to:) to use basic loop
- Simplify copyData(from:) to use basic loop
- Remove isContiguousLayout() helper

Benchmarks show the simple implementation is ~8% faster and uses
84% less memory (179 MB vs 1.14 GB peak) than the optimized version.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

Member

@BrandonWeng BrandonWeng left a comment

lgtm but remove json files pls

@Alex-Wengg Alex-Wengg merged commit 6c352d8 into main Dec 14, 2025
9 checks passed
@Alex-Wengg Alex-Wengg deleted the fix/stalling-issues branch December 14, 2025 02:47
Alex-Wengg added a commit that referenced this pull request Dec 18, 2025
## Summary
- `SystemInfo.isAppleSilicon` and `SystemInfo.isIntelMac` to detect
platform
- `AsrModels.isModelValid()` validates all 4 Parakeet components
(Preprocessor, Encoder, Decoder, Joint) can load without corruption
- Reuse decoder state arrays to prevent memory accumulation during
streaming
- Handle non-contiguous strides in copyData

### What VoiceInk should do (app side)

| Issue | VoiceInk Fix |
|-------|-------------|
| Intel Mac users selecting Parakeet | Use `SystemInfo.isIntelMac` to
hide/disable Parakeet models in UI |
| Infinite "Transcribing" hang | Add timeout to transcription calls with
user-facing error |
| 20-30s delay after sleep | Show "Loading model..." UI during model
load (ANE recompilation is Apple's `anecompilerservice`, cannot be sped
up) |
| Model corruption | Use `AsrModels.isModelValid()` before
transcription, prompt re-download if invalid |

---------
Alex-Wengg added a commit that referenced this pull request Jan 1, 2026
Alex-Wengg added a commit that referenced this pull request Jan 1, 2026
Alex-Wengg added a commit that referenced this pull request Jan 1, 2026
Alex-Wengg added a commit that referenced this pull request Jan 1, 2026
Alex-Wengg added a commit that referenced this pull request Jan 1, 2026
SGD2718 pushed a commit that referenced this pull request Jan 4, 2026
Alex-Wengg added a commit that referenced this pull request Jan 5, 2026
3 participants