Creating a diarization broadcast corpus

judyfong edited this page Jul 17, 2020 · 10 revisions

Requirements

  • Gecko
  • rttm files
  • corresponding videos of episode

Tips

Label speaker turns which last at least 60 ms. (CHANGED)
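One way to spot turns below the 60 ms threshold during review is to scan the rttm file. The following is a minimal sketch, assuming the usual space-separated rttm layout in which the fourth and fifth fields are the turn onset and duration in seconds and the eighth field is the speaker label. The function name `short_turns` and the sample lines are made up for illustration and are not part of the project tooling.

```python
MIN_TURN_SECONDS = 0.060  # 60 ms threshold from the tip above


def short_turns(rttm_lines, min_duration=MIN_TURN_SECONDS):
    """Return (onset, duration, speaker) for SPEAKER turns shorter than min_duration."""
    flagged = []
    for line in rttm_lines:
        fields = line.split()
        # Skip blank lines and anything that is not a SPEAKER record.
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue
        onset, duration, speaker = float(fields[3]), float(fields[4]), fields[7]
        if duration < min_duration:
            flagged.append((onset, duration, speaker))
    return flagged


# Hypothetical rttm lines, not taken from the corpus.
sample = [
    "SPEAKER episode01 1 0.00 4.20 <NA> <NA> 1 <NA> <NA>",
    "SPEAKER episode01 1 4.20 0.03 <NA> <NA> 2 <NA> <NA>",
]
print(short_turns(sample))
```

Flagged turns are candidates for merging or deleting in Gecko; the script only reports them and does not modify the file.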

Each speaker gets their own speaker number per recording/episode.

Unknown speakers get labelled Unknown 01 etc.

There are at least two ways to create the csv file.

  1. Follow Aríel's video called My how to make csv SUPEREASY.mp4. In it he uses VSCode with the json2csv extension and does some formatting.
  2. Add all the speakers to one segment in Gecko, copy the resulting list, and then remove them from the segment again. The copied list is the starting point for the csv file.

Process

  1. Generate the proposed rttm files for 28 episodes that week.
  2. Labelling - Gecko
    1. Open Gecko. If you use the Gecko version linked here, you can save partially corrected json files and reload them into the editor to continue editing later.
    2. Upload the video file & rttm file
    3. Adjust the segment start and end times to match speaker turns.
    4. Add missing speaker turns.
    5. Correct speaker labels/numbers. Add new ones if necessary.
    6. Write down the full speaker names which correspond to each speaker number. These go in a csv file.
    7. Label music, foreign language, or noise. They're available as default labels.
    8. Segments which are only silence can be deleted.
    9. Review the segments in case you missed anything or added tiny segments.
    10. Export as json, srt, and rttm.
  3. Turn in the csv, json, srt, and corrected rttm files to the relevant folders. Then get new rttm and video files.
  4. Repeat for a new episode.
  5. Judy reports the new DER with that week's data. When it is under 10%, this project is done.
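For reference, DER (diarization error rate) is conventionally the sum of missed speech, false alarm speech, and speaker confusion time, divided by the total reference speech time. Below is a minimal sketch of that arithmetic with made-up durations. It is not the scoring tool used in this project, which this page does not name, and the function name is only illustrative.

```python
def diarization_error_rate(missed, false_alarm, confusion, total_reference):
    """DER = (missed + false alarm + confusion) / total reference speech time."""
    if total_reference <= 0:
        raise ValueError("total_reference must be positive")
    return (missed + false_alarm + confusion) / total_reference


# Made-up durations in seconds, for illustration only.
der = diarization_error_rate(
    missed=30.0, false_alarm=20.0, confusion=40.0, total_reference=1000.0
)
print(f"DER = {der:.1%}")
# The project's stopping criterion from step 5: DER under 10%.
print("done" if der < 0.10 else "keep labelling")
```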

reco2spk_num2spk_name.csv

format

<recording/episode id>, <speaker_number in rttm file>, <speaker name>

example

Fréttirkl1900-5022010T0,1, Bogi Águstsson
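A minimal sketch of reading this file with Python's standard `csv` module, assuming there is no header row and that every line has exactly the three fields shown above. The function name `read_speaker_map` is hypothetical. `skipinitialspace=True` drops the space that follows a comma, as in the example line.

```python
import csv
import io


def read_speaker_map(csv_file):
    """Map (recording id, speaker number) -> speaker name."""
    mapping = {}
    reader = csv.reader(csv_file, skipinitialspace=True)
    for row in reader:
        if not row:
            continue  # ignore blank lines
        # Raises ValueError if a line does not have exactly three fields.
        recording_id, speaker_number, speaker_name = (field.strip() for field in row)
        mapping[(recording_id, speaker_number)] = speaker_name
    return mapping


# The example line from above, read from an in-memory file.
example = io.StringIO("Fréttirkl1900-5022010T0,1, Bogi Águstsson\n")
print(read_speaker_map(example))
```

To read a real file, open it with `open(path, newline="", encoding="utf-8")` and pass the file object to the function. A speaker name that itself contains a comma would need to be quoted in the csv file.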

Frequently Asked Questions (FAQ)

1. Do I designate a number for each speaker and if so should all speakers have different numbers?

Yes, all speakers should have different numbers and names within each episode. This means, for example, that you cannot have multiple speakers with the name "protestor" in the csv file. If you would otherwise be forced to use a name like "protestor", it is better to name them "Unknown 001", "Unknown 002", "Unknown 003", etc.
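The uniqueness rule can be checked mechanically before turning in the csv file. This is a minimal sketch, assuming the rows have already been read into (recording id, speaker number, speaker name) tuples; the function name `duplicate_entries` and the sample rows are hypothetical.

```python
from collections import Counter


def duplicate_entries(rows):
    """Find speaker numbers and names that repeat within one recording.

    rows: iterable of (recording_id, speaker_number, speaker_name) tuples.
    Returns two sorted lists of (recording_id, value) pairs that occur more
    than once: one for speaker numbers, one for speaker names.
    """
    rows = list(rows)
    number_counts = Counter((rec, num) for rec, num, _ in rows)
    name_counts = Counter((rec, name) for rec, _, name in rows)
    dup_numbers = sorted(key for key, n in number_counts.items() if n > 1)
    dup_names = sorted(key for key, n in name_counts.items() if n > 1)
    return dup_numbers, dup_names


# Hypothetical rows: "protestor" is used twice in episode01, which is not allowed.
rows = [
    ("episode01", "1", "Unknown 001"),
    ("episode01", "2", "protestor"),
    ("episode01", "3", "protestor"),
    ("episode02", "1", "Unknown 001"),
]
print(duplicate_entries(rows))
```

Reusing the same number or name in a different episode is not flagged, since numbers are assigned per recording/episode.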

2. Things like noise or music, do those segments also need to be designated with numbers?

No, numbers should only indicate speakers.

3. Is there a way to add more number segment labels?

Yes, if you look at the left side panel there's an option at the bottom. It's just like adding a new speaker.

4. "Check regions error" How do I recover from this error?

Sometimes you'll get a "check regions" error. Sometimes you can see the overlapping regions and delete the offending one visually. If not, use the keyboard shortcuts to go to the region right after the error, then use the shortcut for the previous region (Ctrl + Shift + <-) to reach the region with the error. Delete that region (Ctrl + Backspace or Ctrl + Del). This can make the error go away. Keep deleting until there are no more regions in that area. If it was a speech segment, you'll have to recreate and relabel it. Also, periodically save your json files so that you can reload them if you encounter an error and don't have to start completely from scratch.