Welcome! This repo contains scripts for performing automatic speech recognition (ASR) on political ad videos.
This repo is a part of the Cross-platform Election Advertising Transparency Initiative (CREATIVE). CREATIVE has the goal of providing the public with analysis tools for more transparency of political ads across online platforms. In particular, CREATIVE provides cross-platform integration and standardization of political ads collected from Google and Facebook. CREATIVE is a joint project of the Wesleyan Media Project (WMP) and the privacy-tech-lab at Wesleyan University.
To analyze the different dimensions of political ad transparency we have developed an analysis pipeline. The scripts in this repo are part of the Data Processing Step in our pipeline.
The scripts in this repository perform automatic speech recognition on political ad videos, producing a `.csv` file that contains the videos' text recognition results.
The automatic speech recognition (ASR) scripts require Google Cloud credentials to interact with Google Cloud Storage, Google BigQuery, and the Google Cloud Speech-to-Text API.
To run the script in this repo, you need to have your own Google Cloud credentials in the form of a JSON file.
Here is how you can set up the credentials:
- Register with Google Cloud Platform (GCP) and create a project.

  NOTE: If you are on a restricted Google account, such as a school account, that prevents you from creating a Google Cloud project, you will need to use a different account.
- Set up your Google Cloud project for Speech-to-Text and BigQuery:
- Go to the Google Cloud Console.
- Click the project drop-down and select or create the project for which you want to add an API key.
- Click the navigation menu (three lines in the top left corner) and select "APIs & Services".
- Click "Library" in the left side panel, then search for and enable the "Cloud Speech-to-Text API", "BigQuery API", "Cloud Resource Manager API" and "Service Usage API".
NOTE: In order to use the Google Cloud APIs, the project must be linked to an active billing account. Documentation on how to enable billing for your project can be found here.
- Create a Service Account:
- In the Cloud Console, click the navigation menu and select "IAM & Admin".
- Click "Service Accounts" in the left side panel.
- Click "Create Service Account" located on the top under the search bar.
- Enter a name for the service account.
- Grant the service account access to your project by assigning it the BigQuery Admin and Storage Object Admin roles. There's no need to grant any users access to the service account, and so you can click through to "Done" after assigning the roles.
- Click on the service account you just created.
- Under the "Keys" tab on the top, click "Add Key", click "Create New Key", choose "JSON", and click "Create". This will download a JSON key file, which you should save locally.
For more information about setting up Google Cloud credentials for ASR, you can go to Google's ASR documentation.
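Once you have the key file, you can sanity-check it before wiring it into the scripts. The sketch below is an optional helper, not part of this repo's code; it only checks for the standard top-level fields that Google service account JSON keys contain.

```python
import json

# Standard top-level fields present in a Google service account JSON key.
REQUIRED_FIELDS = {"type", "project_id", "private_key", "client_email"}

def validate_key_file(path):
    """Return the service account email if the key file looks structurally valid."""
    with open(path) as f:
        info = json.load(f)
    missing = REQUIRED_FIELDS - info.keys()
    if missing:
        raise ValueError(f"key file is missing fields: {sorted(missing)}")
    if info["type"] != "service_account":
        raise ValueError(f"unexpected credential type: {info['type']}")
    return info["client_email"]
```

Note that this does not verify the private key itself; authenticating with `gcloud` (described below) is the authoritative check.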
- Install the Google Cloud CLI:

  To handle access authorization and to copy files between your local drive and Google Cloud Storage, you need the `gsutil` and `gcloud` command-line tools, which you get by installing the Google Cloud CLI. Follow these instructions to download the required package, using your new service account email and project ID as credentials when walking through the `gcloud init` command.

  NOTE: If you receive the error `gcloud: command not found`, try opening a new terminal window so that your changes take effect.

- Then, to confirm the authorization of the gcloud CLI using a service account key, run the following command, where `KEY_FILE` is replaced with the full path to your service account key file (source):

  ```bash
  gcloud auth login --cred-file=KEY_FILE
  ```

  If you are told that the account is already authenticated, you are good to go and do not have to overwrite the existing credentials. You can double-check the list of accounts whose credentials are stored on the local system using the command:

  ```bash
  gcloud auth list
  ```

  To switch the active account, run the command below, where `ACCOUNT` is the full email address of the account:

  ```bash
  gcloud config set account ACCOUNT
  ```

  Make sure that the service account you just created is the active account.
The ASR scripts require that you have a dataset, a table, and a storage bucket within your project.
To create a dataset (source):
- Open the BigQuery page in the Google Cloud console.
- In the Explorer panel, which can be found directly to the right of the BigQuery panel (you may have to toggle it open), select the project for which you want to create a dataset.
- Click the Actions button (three vertical dots) next to the project, and then click the Create dataset option.
- Enter any unique name for the Dataset ID. Leave all other fields as the default.
- Skip the remaining optional settings and click the Create dataset button.
To create a table (source):
- In the BigQuery Explorer panel, select a project and dataset for which you want to create a table.
- Click the Actions button (three vertical dots) next to the dataset, and then click the Create table option.
- Enter any table name for the Table field, and leave all other fields as the default.
- Skip the remaining optional settings and click the Create table button.
To create a storage bucket (source):
- In Google Cloud Console, go to the Cloud Storage Buckets page.
- Click the + Create button.
- Enter a name for the bucket, and leave all other fields as their default.
- Skip the remaining optional settings and click the Create button.
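If creating any of these resources fails with a naming error, it is usually because Cloud Storage and BigQuery have different naming rules. The checks below are an illustrative, common-case sketch of those rules (bucket names: 3-63 characters of lowercase letters, digits, hyphens, underscores, and dots, starting and ending with a letter or digit; dataset and table IDs: letters, digits, and underscores); see Google's documentation for the full constraints.

```python
import re

def valid_bucket_name(name: str) -> bool:
    """Common-case check for a Cloud Storage bucket name."""
    # 3-63 chars; lowercase letters, digits, dots, hyphens, underscores;
    # must start and end with a letter or digit.
    return re.fullmatch(r"[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]", name) is not None

def valid_bq_id(name: str) -> bool:
    """Common-case check for a BigQuery dataset or table ID."""
    # Letters, digits, and underscores only.
    return re.fullmatch(r"[A-Za-z0-9_]+", name) is not None
```

For example, `asr-demo-bucket` is a valid bucket name, while `asr-results` is not a valid BigQuery table ID (hyphens are not allowed there).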
We recommend creating and activating a Python virtual environment before running the `.py` scripts:

```bash
python3 -m venv venv
source venv/bin/activate
```
If you want to stop the virtual environment at some point, you can deactivate it:

```bash
deactivate
```
Additionally, to run the scripts in this repo, you need to install the following dependencies:

```bash
pip3 install pandas
pip3 install sox
pip3 install tqdm
pip3 install google-cloud-speech
pip3 install google-cloud-bigquery
pip3 install google-auth
```
The exact version of each package we used is listed in the `requirements.txt` file. You can install the specific versions by instead running, for instance, `pip3 install pandas==2.2.2`. This should not be necessary, but may be helpful if you run into errors.
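As a quick sanity check after installation, you can verify that the required packages are importable from your environment. This is a small illustrative helper, not part of the repo's scripts:

```python
import importlib.util

def missing_dependencies(modules=("pandas", "sox", "tqdm")):
    """Return the subset of the given module names that are not importable."""
    missing = []
    for mod in modules:
        try:
            if importlib.util.find_spec(mod) is None:
                missing.append(mod)
        except ModuleNotFoundError:
            # find_spec raises for dotted names whose parent is absent.
            missing.append(mod)
    return missing
```

An empty return value means all listed dependencies are installed in the active (virtual) environment.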
Here is how we run the files `01_asr.py` and `02_asr.py`:
- Run the following bash code to copy wav files to Google Cloud Storage. If you have not yet created a Cloud Storage Bucket within your Google Cloud project, you can do so easily by following these instructions. `storage_bucket_path` should be replaced with the path and/or name of your Storage Bucket. Furthermore, this code assumes that the `automatic-speech-recognition` folder is your current directory; otherwise, replace `./sample_wavs` with the local path leading to the `sample_wavs` folder in this repository.

  ```bash
  gsutil -m cp -r ./sample_wavs gs://storage_bucket_path
  ```
- Look through the scripts and insert your own credentials/filepaths wherever specified. Comments in the code indicate where this is necessary.
- Run the scripts in order:

  ```bash
  python3 01_asr.py
  python3 02_asr.py
  ```
- (Optional) After running both scripts, run the following bash code if you wish to copy the csv file directly from Google Cloud Storage to your local storage. The placeholder `asr_demo` should be replaced with the path and/or name of your Storage Bucket. Note that the fields of the `csv` retrieved using this command are not in order, which is why in `02_asr.py` we choose to manually make a query to retrieve the results.

  ```bash
  gsutil cp gs://asr_demo/results/*.csv ./Results/
  ```
When you run `01_asr.py` and `02_asr.py`, the resulting data is saved in two locations: a `Results` folder located in your local `automatic-speech-recognition` directory, and a `results` folder located in your chosen Storage Bucket. The data will be in `csv` format, entitled `asr_results.csv` locally and `gs_asr_results.csv` in the Storage Bucket.
Individual records in `asr_results.csv` contain the following fields:

- `filename`: the unique identifier of the video file
- `google_asr_text`: the video's text recognition result from the Google Cloud Speech-to-Text API
- `stt_confidence`: the confidence score of the text recognition result
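As an illustration of how these fields can be consumed downstream, the sketch below loads a hypothetical `asr_results.csv` (the rows shown are made up) with pandas and filters out low-confidence transcripts:

```python
import io

import pandas as pd

# Hypothetical rows mirroring the three fields described above.
sample_csv = io.StringIO(
    "filename,google_asr_text,stt_confidence\n"
    "ad_001,vote for change this november,0.93\n"
    "ad_002,paid for by the citizens committee,0.64\n"
)

df = pd.read_csv(sample_csv)
# Keep only transcripts the Speech-to-Text API was reasonably sure about;
# the 0.8 threshold is arbitrary and should be tuned for your analysis.
confident = df[df["stt_confidence"] >= 0.8]
```

In this sample, only the `ad_001` row survives the filter.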
We would like to thank our supporters!
This material is based upon work supported by the National Science Foundation under Grant Numbers 2235006, 2235007, and 2235008.
The Cross-Platform Election Advertising Transparency Initiative (CREATIVE) is a joint infrastructure project of the Wesleyan Media Project and privacy-tech-lab at Wesleyan University in Connecticut.