This script is designed to run data pipelines for diarization, transcription, and metadata extraction, specifically for feeding the BrazSpeechData platform. It includes functionalities for exporting corpus datasets and transcribing audio files.
The script uses Poetry for dependency management. Keep in mind that the Dockerized version is not working at the moment. To start using this project, follow the steps below:
First, clone the repo to your desired folder.
git clone https://github.com/nilc-nlp/braz-speech-pipeline.git
Then create a virtual environment to hold the dependencies:
cd braz-speech-pipeline
python -m venv .venv
source .venv/bin/activate
Attention! This project needs Python 3.8 or higher to work.
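To confirm that the interpreter inside the activated virtual environment meets this requirement, a short check like the one below can help. It is only a sketch: the helper name `meets_minimum_python` is illustrative and is not part of this project.

```python
import sys

MINIMUM_PYTHON = (3, 8)  # minimum version stated in this README


def meets_minimum_python(version_info=sys.version_info, minimum=MINIMUM_PYTHON):
    """Return True when the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    if not meets_minimum_python():
        raise SystemExit("Python 3.8 or higher is required.")
    print("Python version OK:", sys.version.split()[0])
```

Run it with the same interpreter you used to create `.venv`, before running `poetry install`.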
Install the required dependencies by running:
poetry install
If you need to add new dependencies, run:
poetry add <dependency>
To use the Google Drive API, you need to create a token file similar to the one in example.token.json. To do so, follow the steps on this link and save the file as token.json in the root folder of this project.
This step is necessary to access the raw files on Google Drive. If you don't need to access the raw files, you can skip this step.
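Before running a command that reads from Google Drive, it can be useful to check that token.json is where the project expects it. The sketch below only checks that the file exists in the project root and parses as JSON. It does not validate the credentials themselves, and the helper name `token_file_status` is hypothetical.

```python
import json
from pathlib import Path


def token_file_status(project_root="."):
    """Report whether token.json exists in `project_root` and is valid JSON.

    Returns "missing", "invalid" or "ok". The content of the token is not
    inspected, because its fields are not described in this README.
    """
    token_path = Path(project_root) / "token.json"
    if not token_path.is_file():
        return "missing"
    try:
        json.loads(token_path.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return "invalid"
    return "ok"


if __name__ == "__main__":
    print("token.json:", token_file_status("."))
```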
To use this script, navigate to the directory where the script is located and run:
poetry run python main.py [COMMAND] [OPTIONS]
Replace [COMMAND] with one of the commands listed below, and choose among the available options.
If you need any help, run:
poetry run python main.py --help
or
poetry run python main.py [COMMAND] --help
Use this command to export corpus datasets in various formats.
Option | Description | Type | Default | Required |
---|---|---|---|---|
--corpus_id | Corpus ID | int | None | Yes |
--output-folder | Output folder | Path | './data/export' | No |
--export-audio-to-formats | List of audio formats to export (e.g., wav, mp3) | List[AudioFormat] | ["wav", "mp3"] | No |
--sample-rate | Sample rate for audio export | int | 48000 | No |
--google-drive-folder-ids | List of Google Drive folder IDs for source audios | List[str] | None | No |
--filter-format | File format to read from Google Drive | AudioFormat | None | No |
--original-audios | Whether to export original audios | bool | False | No |
--csv | Export data to CSV format | bool | False | No |
--continuous-text | Export concatenated text from audio segments | bool | False | No |
--speakers-text | Export text files organized by speaker | bool | False | No |
--json-metadata | Export audio metadata in JSON format | bool | False | No |
--textgrid | Export data in TextGrid format for use with Praat | bool | False | No |
--all | Export all data | bool | False | No |
--debug | When activated, will export only 10 audios | bool | False | No |
To run the export script, use the base command:
poetry run python main.py export [OPTIONS]
Then you can specify which data to export by using the options listed above. For example, to export all data from MUPE (corpus 1), run:
poetry run python main.py export --corpus_id 1 --all --google-drive-folder-ids ID_OF_FOLDER_ON_DRIVE --export-audio-to-formats mp3 --debug
If you want everything except the original audios, run:
poetry run python main.py export --corpus_id 1 --csv --textgrid --continuous-text --speakers-text --json-metadata --debug
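When the export is launched from another Python script, the same command line can be assembled as a list of arguments. The following is a minimal sketch: the helper `build_export_command` is hypothetical and only uses options from the table above.

```python
import shlex

# Flags that select everything except the original audios (see the table above).
EXPORT_FLAGS = ("csv", "textgrid", "continuous-text", "speakers-text", "json-metadata")


def build_export_command(corpus_id, flags=EXPORT_FLAGS, debug=False):
    """Build the argument list for `main.py export` (hypothetical helper)."""
    command = ["poetry", "run", "python", "main.py", "export", "--corpus_id", str(corpus_id)]
    command += [f"--{flag}" for flag in flags]
    if debug:
        command.append("--debug")
    return command


# Prints:
# poetry run python main.py export --corpus_id 1 --csv --textgrid --continuous-text --speakers-text --json-metadata --debug
print(shlex.join(build_export_command(1, debug=True)))
```

The returned list can be passed to `subprocess.run(command, check=True)` from the project root.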
If you need any help, run:
poetry run python main.py export --help
The --csv option exports the entire corpus database in CSV format. It writes one file named corpus_{corpus_id}_audios.csv and another named corpus_{corpus_id}_segments.csv. The first contains a list of all audios in the corpus, with their metadata. The second contains a list of all segments for each audio, with their metadata, the ASR transcription and the final transcription. For the segments, a .parquet file is also exported for faster loading.
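The two CSV files can be located and read with the standard library alone. This is a sketch under stated assumptions: the files are assumed to sit directly inside the output folder (default './data/export'), to be comma-separated and UTF-8 encoded with a header row, and the helper names are hypothetical. No column names are assumed.

```python
import csv
from pathlib import Path


def csv_export_paths(corpus_id, output_folder="./data/export"):
    """Return the expected CSV paths for a corpus (location is an assumption)."""
    folder = Path(output_folder)
    return {
        "audios": folder / f"corpus_{corpus_id}_audios.csv",
        "segments": folder / f"corpus_{corpus_id}_segments.csv",
    }


def read_rows(csv_path):
    """Read a CSV file into a list of dicts keyed by the header row."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


if __name__ == "__main__":
    for name, path in csv_export_paths(1).items():
        if path.is_file():
            print(name, "->", len(read_rows(path)), "rows")
        else:
            print(name, "-> not found at", path)
```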
The --continuous-text option creates a text file for each audio, containing the concatenated text of all its segments. Each file is saved as {audio_name}_concatenated_text.txt inside the output folder.
The --speakers-text option creates a text file for each audio, containing the text of each segment, organized by speaker. Each file is saved as {audio_name}_by_speaker.txt inside the output folder.
The --json-metadata option creates a JSON file for each audio, containing the metadata of the audio. Each file is saved as {audio_name}_metadata.json inside the output folder.
The --textgrid option creates a TextGrid file for each audio, containing the metadata of the audio. Each file is saved as {audio_name}.textgrid inside the output folder.
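The per-audio file names described above can be summarised in a small lookup, which is handy for checking that an export produced everything it should. This is only a sketch: pairing each pattern with an option is inferred from the option descriptions, the default output folder is taken from the options table, and `expected_outputs` is a hypothetical helper.

```python
from pathlib import Path

# File name patterns taken from this README, keyed by the option assumed to produce them.
PER_AUDIO_PATTERNS = {
    "--continuous-text": "{audio_name}_concatenated_text.txt",
    "--speakers-text": "{audio_name}_by_speaker.txt",
    "--json-metadata": "{audio_name}_metadata.json",
    "--textgrid": "{audio_name}.textgrid",
}


def expected_outputs(audio_name, output_folder="./data/export"):
    """Map each export option to the file path expected for `audio_name`."""
    folder = Path(output_folder)
    return {
        option: folder / pattern.format(audio_name=audio_name)
        for option, pattern in PER_AUDIO_PATTERNS.items()
    }


if __name__ == "__main__":
    # "interview_01" is a made-up audio name used only for illustration.
    for option, path in expected_outputs("interview_01").items():
        print(option, "->", path)
```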
The --original-audios option creates a copy of the original audios in the formats specified by the --export-audio-to-formats option. The audios are saved in a folder named original_audios inside the output folder. You can also set the final sample rate of the audios with --sample-rate.
This script provides functionalities for transcribing audio files using the WhisperX library. It supports various features like diarization, speaker identification, and alignment of transcribed segments with audio, specifically tailored for processing audio datasets.
Use this command to transcribe audio files from specified Google Drive folders.
Option | Description | Type | Required | Default |
---|---|---|---|---|
--corpus-id | Unique identifier for the corpus | int | Yes | None |
--folder-ids | List of Google Drive folder IDs containing audio files | List[str] | Yes | None |
--output_folder | Directory path for saving the output | Path | No | "./data/" |
--storage-output-folder-id | Google Drive folder ID to save the transcriptions. If none is provided, the transcriptions will be saved in the same folder as the audios. | str | No | None |
--format-filter | Filter audios by format | [wav,mp4,mp3] | No | None |
--save-to-drive | Flag to save transcriptions to Google Drive | bool | No | False |
--save-to-db | Flag to save transcriptions to database | bool | No | False |
--transfer-to-server | Flag to transfer transcriptions to server | bool | No | False |
To run the transcribe script, use the base command:
poetry run python main.py transcribe --corpus_id [CORPUS_ID] --folder_ids [IDs] [OPTIONS]
You can choose between the flags --save-to-drive, --save-to-db and --transfer-to-server. The first uploads the segments and transcription files back to Google Drive, the second inserts the entries into the BrazSpeechData platform database, and the third transfers the files to the server to make them available on the platform.
If you need any help, run:
poetry run python main.py transcribe --help
- feat: add support for other ASR services
- feat: add support for repositories other than Google Drive
- fix: Dockerfile for running the script
- refact: inject dependencies on the script for better modularity
- tests: add unit tests for the script