Transcription restarts corner cases difficult to handle with combined transcription+translation #117

Open
@angrave


The transcription update will fail (by design) if there are multiple Transcription entities for the same language and video.
Some comments:

Most of this invalid data is 2020 data, but we should address the whole dataset and then implement a constraint, together with adding a new column (e.g. "source" or "kind") to allow multiple transcripts per video.
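As a rough sketch of the cleanup step, the duplicates could be found by grouping on (video, language) before any constraint is added. The row shape `(transcription_id, video_id, language)` and the function name are hypothetical, not the actual schema:

```python
from collections import defaultdict


def find_duplicate_transcriptions(rows):
    """Return {(video_id, language): [transcription_id, ...]} for every
    video/language pair that has more than one Transcription.

    `rows` is an iterable of (transcription_id, video_id, language) tuples.
    This tuple shape is an assumption for illustration only.
    """
    groups = defaultdict(list)
    for transcription_id, video_id, language in rows:
        groups[(video_id, language)].append(transcription_id)
    # Keep only the pairs that would violate a uniqueness constraint.
    return {key: ids for key, ids in groups.items() if len(ids) > 1}
```

Once a "source" or "kind" column exists, the grouping key would presumably become (video, language, kind), so that multiple transcripts per video stay legal as long as their kinds differ.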

The new implementation looks for the min-max across all languages, i.e. it finds the last caption for each language, determines the earliest of those, and trims the audio from there. The max time for each language is then used to ensure that captions are added only once we reach a time that is still unprocessed for that particular language.

This is useful because sometimes one particular language is lagging, e.g. it stopped when one translation never arrived.
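The min-max restart logic described above can be sketched roughly as follows. The caption tuple shape and both function names are assumptions for illustration, and the per-language reading of "max time" is an interpretation of the description rather than a copy of the real code:

```python
def restart_point(captions):
    """Compute where to resume and the last processed time per language.

    `captions` is an iterable of (language, begin_seconds, end_seconds)
    tuples (a hypothetical shape). Returns (resume_from, last_end) where
    last_end maps each language to the end time of its last caption and
    resume_from is the earliest of those end times.
    """
    last_end = {}
    for language, _begin, end in captions:
        last_end[language] = max(end, last_end.get(language, 0.0))
    # The lagging language decides where the audio is trimmed.
    resume_from = min(last_end.values(), default=0.0)
    return resume_from, last_end


def is_unprocessed(language, begin, last_end):
    """True if a new caption starts at or after the last stored caption
    for this language, so adding it would not create a duplicate."""
    return begin >= last_end.get(language, 0.0)
```

For example, if English has captions up to 12 s and Spanish only up to 5 s, the audio is trimmed at 5 s, and English captions produced between 5 s and 12 s are skipped while Spanish ones are stored.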

However, some videos have large portions of time where there is no transcription (event=NOMATCH). We don't want to have to transcribe that audio again when we do a restart, but simply recording the lastsuccesstime is insufficient.
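One possible way to avoid re-transcribing NOMATCH regions, sketched under the assumption that every recognizer event (including NOMATCH) is stored with its begin and end time. The event tuple shape, the `tolerance` parameter, and the function name are hypothetical:

```python
def covered_until(events, tolerance=0.5):
    """Return the time up to which the audio has been processed contiguously.

    `events` is an iterable of (begin_seconds, end_seconds, kind) tuples,
    where kind is e.g. "MATCH" or "NOMATCH" (an assumed shape). NOMATCH
    events count as processed audio, so a long silence is not paid for twice.
    Scanning stops at the first gap larger than `tolerance` seconds.
    """
    covered = 0.0
    for begin, end, _kind in sorted(events):
        if begin > covered + tolerance:
            break  # unprocessed gap: a restart should resume from `covered`
        covered = max(covered, end)
    return covered
```

With events `(0, 10, "MATCH")` and `(10, 40, "NOMATCH")`, a restart would resume at 40 s, whereas tracking only the last successful caption would resume at 10 s.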

Also, if we add a new translation, it would start from the beginning (and use the uncorrected transcriptions).

This suggests that a future design should separate the transcription from the translation. That would save some credits, because we would not pay for NOMATCH regions twice if we have to restart the task. (It would also allow translation of artificially inserted captions, e.g. [silence].)

The worst case is an hour-long silence that fails halfway through: the restart would start from the beginning. Fortunately, "ServiceTimeouts" do not seem to occur if there are no transcriptions to translate.
