Shannon Liu, Maria Teresa Parreira, Wendy Ju
Read the paper here!
This repository contains the code to develop machine learning models to classify robot failure (binary classification) and successive robot failure (multiclass classification).
We explore a range of machine learning strategies to detect successive robot error for a single user or across multiple participants. Model training on single participants allows systems to learn each individual's unique way of signaling robot errors, while training on multiple participants tests generalization to unseen participants.
Our models use data extracted from videos collected in prior work and adopt different data-splitting strategies, feature representations, modality combinations, model architectures, and fusion strategies.
Data splitting strategies used:
- Error Detection (binary classification)
- Multiple Error Detection (multiclass classification)
- First Error to Successive Errors Generalization (binary classification)
- Successive Error Discrimination (multiclass classification)

These data splitting strategies were implemented in create_data_splits.py.
Features were represented as:
- raw non-normalized features
- normalized features
- normalized features with principal component analysis (PCA) applied
Modalities included facial, pose, audio, and text embeddings, and 15 combinations of these modalities were used during training.
Model architectures explored included:
- Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks
- MiniRocket
- Linear classifiers (KNN, random forest, SGD, SVM)
- Multilayer Perceptron (MLP)
- Audio Spectrogram Transformer (AST) for audio features
Fusion strategies included:
- Early fusion (concatenating features before model input)
- Intermediate fusion (processing each modality separately, then concatenating intermediate representations and training further layers)
- Late fusion (training a model per modality and combining their predictions)
This repository contains the code and data for a study exploring human responses to successive conversational errors made by robots.
Our study investigates these responses through a user study involving 26 participants, where we examine the behavioral and emotional shifts that occur when users interact with a robot named HelperBot that was wizarded to make successive conversational errors.
When analyzing the human reactions to robot failures, a codebook was created to classify verbal and nonverbal reactions, and a statistical analysis was conducted to examine the relationship between facial, audio, and body pose features and HelperBot's successive errors. More details on the codebook and statistical analysis can be found in the supplemental material or below.
This work contributes valuable insights into improving human-robot communication, particularly in scenarios where robots make repeated errors, and has potential applications for building more resilient and adaptive robotic systems.
The codebook for analyzing the video dataset of interactions between the participant and HelperBot during HelperBot's successive errors is found on page 2 of supplemental material.
The code used to plot annotations is found at plot.ipynb.
The data used to plot the annotations is found in the data subdirectory.
The plots created are found in the plots subdirectory.
A statistical analysis of the video dataset of interactions between the participant and HelperBot during HelperBot's successive errors is found on page 3 of supplemental material.
The code used for the analysis is found at full_statsanalysis.ipynb.
Recognizing robot failure by analyzing human reactions and behaviors in response to in-person robot failures.
The experiment involved a participant and a robot interacting or conversing with each other in a private room. The robot was controlled by a researcher and was engineered to create at least 3 errors.
The robot failure: not understanding the participant's order.
The robot failure was verbalized as "Sorry, I do not understand" and occurred at least 3 times. Afterward, the interaction ended with the robot verbalizing "OK, I will call the researcher".
- Analysis of Human Reactions to Robot Failure
- Features
- Labels
- Training
- Participant Exclusion
- Principal Component Analysis
See HRI25_LBR for more details on the study and findings.
Feature extraction was performed on the participant to understand and analyze facial expressions, body movements, and speech that might convey underlying emotions during the human-robot interaction. After each feature extraction tool was applied to a participant's video, the resulting outputs were processed into readable forms, which were then merged into one collective CSV file documenting all feature data per frame per participant.
The OpenFace toolkit was used to detect facial landmarks and action units and required a video file path as input. The output consisted of a CSV file containing feature data per frame for the video.
The OpenFace toolkit can be found here: https://github.com/TadasBaltrusaitis/OpenFace
Irrelevant features were excluded from the OpenFace output; this was done while merging all feature data. Facial feature exclusion is found in this Python script: feature_merge.py. For the facial feature exclusion portion of the script, the CSV file path of the facial features for each participant and the final facial feature list were required as input.
Final facial feature list (mainly action units):
facial_features = ['AU01_r', 'AU02_r', 'AU04_r', 'AU05_r', 'AU06_r', 'AU07_r', 'AU09_r', 'AU10_r',
'AU12_r', 'AU14_r', 'AU15_r', 'AU17_r', 'AU20_r', 'AU23_r', 'AU25_r', 'AU26_r', 'AU45_r', 'AU01_c',
'AU02_c', 'AU04_c', 'AU05_c', 'AU06_c', 'AU07_c', 'AU09_c', 'AU10_c', 'AU12_c', 'AU14_c', 'AU15_c',
'AU17_c', 'AU20_c', 'AU23_c', 'AU25_c', 'AU26_c', 'AU28_c', 'AU45_c', 'gaze_0_x', 'gaze_0_y',
'gaze_0_z', 'gaze_1_x', 'gaze_1_y', 'gaze_1_z', 'gaze_angle_x', 'gaze_angle_y']
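For illustration, a minimal sketch of the column-filtering step, assuming pandas and the facial_features list above (the file paths are hypothetical; the actual logic lives in feature_merge.py):

```python
import pandas as pd

# Hypothetical paths; in practice these are passed to feature_merge.py per participant.
openface_csv = "participant_02_openface.csv"
output_csv = "participant_02_facial_features.csv"

df = pd.read_csv(openface_csv)
df.columns = df.columns.str.strip()      # OpenFace headers often contain leading spaces
kept = df[["frame"] + facial_features]   # keep the frame index plus the final facial features
kept.to_csv(output_csv, index=False)
```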
The OpenPose toolkit and the BODY_25 model were used to obtain keypoints of the participant's body and required a video file path as input and a JSON file path to store the output. The JSON files were parsed and converted into CSV files listing pose features per frame for each video with this script: parse_openpose.py.
The following was executed in the command line interface to return a JSON file of 25 keypoints per frame:
bin\OpenPoseDemo.exe --video {input_video_file_path} --write_video {output_file_path} --write_json {output_file_path}
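A minimal sketch of the JSON-to-CSV parsing, in the spirit of parse_openpose.py (keypoint order follows the BODY_25 model; the directory and file names are hypothetical):

```python
import glob
import json
import os

import pandas as pd

# BODY_25 keypoint names in the model's output order.
BODY_25 = ["nose", "neck", "rightshoulder", "rightelbow", "rightwrist",
           "leftshoulder", "leftelbow", "leftwrist", "midhip", "righthip",
           "rightknee", "rightankle", "lefthip", "leftknee", "leftankle",
           "righteye", "lefteye", "rightear", "leftear", "leftbigtoe",
           "leftsmalltoe", "leftheel", "rightbigtoe", "rightsmalltoe", "rightheel"]

rows = []
# OpenPose writes one JSON file per frame; sorting the file names recovers the frame order.
for frame_idx, path in enumerate(sorted(glob.glob(os.path.join("openpose_json", "*_keypoints.json")))):
    with open(path) as f:
        data = json.load(f)
    if not data["people"]:
        continue  # no person detected in this frame
    keypoints = data["people"][0]["pose_keypoints_2d"]  # flat list of x, y, confidence triplets
    row = {"frame": frame_idx}
    for i, name in enumerate(BODY_25):
        row[f"{name}_x"], row[f"{name}_y"] = keypoints[3 * i], keypoints[3 * i + 1]
    rows.append(row)

pd.DataFrame(rows).to_csv("participant_pose_features.csv", index=False)
```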
Only upper body keypoints were relevant to the video dataset, so the lower body keypoints (mid hip, right hip, left hip, right knee, left knee, right ankle, left ankle, right big toe, left big toe, right small toe, left small toe, right heel, left heel) were removed during preprocessing. The Python script used to exclude lower body features is found here: feature_exclusion.py. Required input for the script includes the path to the directory holding all participant CSV files and a CSV file path for each participant to store the output.
Final pose feature list:
pose_features = ['nose_x', 'nose_y', 'neck_x', 'neck_y', 'rightshoulder_x', 'rightshoulder_y',
'leftshoulder_x', 'leftshoulder_y', 'rightelbow_x', 'rightelbow_y', 'leftelbow_x', 'leftelbow_y',
'rightwrist_x', 'rightwrist_y', 'leftwrist_x', 'leftwrist_y', 'righteye_x', 'righteye_y',
'lefteye_x', 'lefteye_y', 'rightear_x', 'rightear_y', 'leftear_x', 'leftear_y']
The OpenPose toolkit can be found here: https://github.com/CMU-Perceptual-Computing-Lab/openpose
In addition to the original features produced with OpenPose, for each original feature a column titled "[original feature name]_delta" was appended, containing the change in value from the previous frame to the current frame. The Python script used for updating the CSV with delta values is found here: features_delta_calculations.py. Required input to obtain delta calculations for each participant includes the CSV file path of pose features for the participant and a CSV file path to store the output with the additional delta columns and values.
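A minimal sketch of the delta computation, as a simplified stand-in for features_delta_calculations.py (it assumes the pose_features list above is defined; paths are hypothetical):

```python
import pandas as pd

df = pd.read_csv("participant_pose_features.csv")

# For every pose feature, append "<feature>_delta": the change from the previous frame's value.
for col in pose_features:
    df[f"{col}_delta"] = df[col].diff().fillna(0)  # the first frame has no previous value

df.to_csv("participant_pose_features_delta.csv", index=False)
```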
The openSMILE toolkit and the eGeMAPSv02 configuration were used to obtain audio features from the interaction between the participant and the robot and required an audio file path as input and a CSV file path as output. The following Python code was executed to return a CSV file of audio features:
```python
import opensmile
import os

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.LowLevelDescriptors,
)
file = f"{input_audio_file_path}"
y = smile.process_file(file)
y.to_csv(f"{output_file_path}")
```

The script can also be found here: audio_extraction.py. Required inputs for the script included a directory of all participant audio files and a CSV file path for each participant to store output.
Final audio feature list:
audio_features = ['Loudness_sma3', 'alphaRatio_sma3', 'hammarbergIndex_sma3', 'slope0-500_sma3', 'slope500-1500_sma3',
'spectralFlux_sma3', 'mfcc1_sma3', 'mfcc2_sma3', 'mfcc3_sma3', 'mfcc4_sma3', 'F0semitoneFrom27.5Hz_sma3nz',
'jitterLocal_sma3nz', 'shimmerLocaldB_sma3nz', 'HNRdBACF_sma3nz', 'logRelF0-H1-H2_sma3nz', 'logRelF0-H1-A3_sma3nz',
'F1frequency_sma3nz', 'F1bandwidth_sma3nz', 'F1amplitudeLogRelF0_sma3nz', 'F2frequency_sma3nz', 'F2bandwidth_sma3nz',
'F2amplitudeLogRelF0_sma3nz', 'F3frequency_sma3nz', 'F3bandwidth_sma3nz', 'F3amplitudeLogRelF0_sma3nz']
Only audio segments (partitioned via speaker diarization) spoken by the participant were relevant to the study; other segments were removed during preprocessing.
The openSMILE toolkit can be found here: https://audeering.github.io/opensmile/about.html or https://github.com/audeering/opensmile/
Speaker diarization was used to identify timestamps indicating when the participant and the robot were speaking.
The pyannote speaker diarization toolkit was used to extract timestamps. The following Python code was executed to return an RTTM file of speaker timestamps from an audio file:
```python
from pyannote.audio import Pipeline
import torch

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token=f"{huggingface_authentication_token}")

diarization = pipeline(f"{input_audio_file_path}")

for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"start={turn.start:.1f}s stop={turn.end:.1f}s speaker_{speaker}")

with open(f"{output_file_path}", "w") as rttm:
    diarization.write_rttm(rttm)
```

The output of the speaker diarization was an RTTM file. This was converted into a CSV file and "overlapped" with the openSMILE audio extraction CSV outputs to remove the timestamps at which other speakers were speaking.
The pyannote speaker diarization toolkit can be found here: https://huggingface.co/pyannote/speaker-diarization-3.1 or https://github.com/pyannote/pyannote-audio
Once the timestamps for each speaker were extracted, audio features corresponding to the participant's speech were retained, while those corresponding to other speakers' speech were removed. The Python script used to achieve this is found here: filter_audio_features.py. Required inputs for the script to filter features for a single participant include the CSV file of openSMILE audio extracted features, the CSV file of timestamps produced from speaker diarization, and a CSV file path to store the output.
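A minimal sketch of this filtering step, as a simplified stand-in for filter_audio_features.py. It assumes the diarization output has been converted to a CSV with start, end (in seconds), and speaker columns, and that the speaker label mapping to the participant is known; all file names and the label are hypothetical:

```python
import pandas as pd

audio = pd.read_csv("participant_opensmile_lld.csv")    # one row per 0.02 s openSMILE window
segments = pd.read_csv("participant_diarization.csv")   # start, end, speaker columns derived from the RTTM file

# openSMILE writes timestamps like "0 days 00:00:00.02"; convert them to seconds for comparison.
audio["start_s"] = pd.to_timedelta(audio["start"]).dt.total_seconds()

# Keep only windows that fall inside segments where the participant is speaking.
participant_segments = segments[segments["speaker"] == "SPEAKER_00"]  # label assumed to map to the participant
mask = pd.Series(False, index=audio.index)
for _, seg in participant_segments.iterrows():
    mask |= audio["start_s"].between(seg["start"], seg["end"])

audio[mask].drop(columns="start_s").to_csv("participant_audio_filtered.csv", index=False)
```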
The CSV files of features obtained from openSMILE used start and end timestamps in the format "0 days 00:00:00.00", and each row included audio features for a 0.02-second window (between the start and end timestamps). However, this study is interested in frames rather than timestamps, so start timestamps were converted into frame numbers. When multiple rows mapped to the same frame number, the average value of each feature was used as that frame's feature value. The Python script used to convert timestamps to frames and process features is found here: audio_features_frames.py. Required inputs include a directory of all participant CSV files of filtered audio features and a CSV file path for each participant to store the new output CSV.
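A minimal sketch of the timestamp-to-frame conversion and per-frame averaging, assuming a 30 fps video and the audio_features list above (the frame rate, column names, and paths are assumptions; the actual procedure lives in audio_features_frames.py):

```python
import pandas as pd

FPS = 30  # assumed video frame rate

df = pd.read_csv("participant_audio_filtered.csv")

# Parse the "0 days 00:00:00.00" start timestamps and map each 0.02 s window to a video frame.
start_seconds = pd.to_timedelta(df["start"]).dt.total_seconds()
df["frame"] = (start_seconds * FPS).astype(int)

# Several 0.02 s windows can map to the same frame; average their feature values per frame.
per_frame = df.groupby("frame")[audio_features].mean().reset_index()
per_frame.to_csv("participant_audio_per_frame.csv", index=False)
```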
After facial, pose, and audio features were processed to include relevant features per frame per participant, all features were merged into one CSV file representing the features per frame of a single participant. The Python script for merging features for each participant is found here: feature_merge.py
All participants' features were merged into a collective CSV file containing all rows from each participant's merged feature data. The Python script for merging all participant feature data is found here: feature_all_participants.py.
Features were checked for NaN and inf values and then normalized for model training. The Python script for checking feature values is found here: check_features.py, and the script for normalizing features is found here: normalization.py.
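A minimal sketch of the value check and normalization, assuming the first four columns hold participant, frame, and label information; the scaler shown here (min-max) is an assumption, and the exact method used in normalization.py may differ:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("all_participants_features.csv")   # hypothetical path for the merged feature file
feature_cols = df.columns[4:]                        # assumes participant/frame/label columns come first

# Check for missing or infinite values before scaling.
print(df[feature_cols].isna().sum().sum(), "NaN values")
print(np.isinf(df[feature_cols].to_numpy()).sum(), "inf values")

# Scale each feature to [0, 1].
df[feature_cols] = MinMaxScaler().fit_transform(df[feature_cols])
df.to_csv("all_participants_features_normalized.csv", index=False)
```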
There were three feature sets used when training our model.
The full feature set contains all final facial, pose, and audio features listed previously.
A statistical analysis was used to assess statistically significant features. Feature set: stats_features
A random forest was used to assess feature importance. Features were retained up to the point where the importance drop exceeded 40%. Feature set: rf_features
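A minimal sketch of one plausible reading of this selection rule, assuming X (a feature DataFrame) and y (frame labels) are already loaded; the exact procedure used to build rf_features may differ:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Fit a forest on the full feature set and rank features by impurity-based importance.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)

# Keep features until the relative drop between consecutive importances exceeds 40%.
rf_features = [importances.index[0]]
for prev, curr, name in zip(importances[:-1], importances[1:], importances.index[1:]):
    if (prev - curr) / prev > 0.40:
        break
    rf_features.append(name)
print(rf_features)
```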
Two types of labeling methods were used: binary labeling and multiclass labeling. A label was applied to each frame of the video of the interaction between the participant and the robot. Labels were initially appended to the pose feature CSV files.
The Python script used for efficient labeling is found here: features_create_labels_csvs.py. Required inputs for the script to store binary and multiclass labels for each participant include a CSV file to store the features and labels, the frame numbers at which binary labels should be added, and the frame numbers at which multiclass labels should be added. Additionally, a directory path is required for the location where the CSV files of features and labels for each participant are stored.
- "0" labeled frames from the beginning of the video to the first "Sorry, I do not understand" error and after "OK, I will call the researcher" to the end of the video.
- "1" labeled frames from the first "Sorry, I do not understand" error to "OK, I will call the researcher".
- "0" labeled frames from the beginning of the video to the first "Sorry, I do not understand" error and after "OK, I will call the researcher" to the end of the video.
- "1" labeled frames from the first "Sorry, I do not understand" error to the second "Sorry, I do not understand".
- "2" labeled frames from the second "Sorry, I do not understand" error to the third "Sorry, I do not understand".
- "3" labeled frames from the third "Sorry, I do not understand" error to "OK, I will call the researcher".
The same labeling procedure was used for additional errors.
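A minimal sketch of the per-frame labeling, as a simplified stand-in for features_create_labels_csvs.py (it assumes a frame column in the feature CSV; the error and end frame numbers are illustrative):

```python
import pandas as pd

df = pd.read_csv("participant_features.csv")

# Illustrative frame numbers for each "Sorry, I do not understand" error
# and for the final "OK, I will call the researcher" utterance.
error_frames = [580, 790, 1000]
end_frame = 1180

# Binary label: 1 between the first error and the end of the error episode, else 0.
df["binary_label"] = ((df["frame"] >= error_frames[0]) & (df["frame"] < end_frame)).astype(int)

# Multiclass label: 0 before the first error and after the episode, k between error k and error k+1.
boundaries = error_frames + [end_frame]
df["multiclass_label"] = 0
for k in range(len(error_frames)):
    in_segment = (df["frame"] >= boundaries[k]) & (df["frame"] < boundaries[k + 1])
    df.loc[in_segment, "multiclass_label"] = k + 1

df.to_csv("participant_features_labels.csv", index=False)
```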
The following tables contain the counts and percentages of labels per participant.
Binary label distribution per participant:

| participant | Label "0" Count | Label "0" Percentage | Label "1" Count | Label "1" Percentage | Total Number of Labels |
|---|---|---|---|---|---|
| 2 | 569 | 48.9% | 595 | 51.1% | 1164 |
| 4 | 601 | 61.2% | 381 | 38.8% | 982 |
| 5 | 575 | 49.6% | 584 | 50.4% | 1159 |
| 6 | 619 | 51.9% | 573 | 48.1% | 1192 |
| 7 | 609 | 53.8% | 524 | 46.2% | 1133 |
| 8 | 606 | 59.8% | 408 | 40.2% | 1014 |
| 9 | 582 | 55.8% | 461 | 44.2% | 1043 |
| 10 | 589 | 48.3% | 630 | 51.7% | 1219 |
| 11 | 588 | 55.2% | 477 | 44.8% | 1065 |
| 12 | 615 | 45.8% | 728 | 54.2% | 1343 |
| 14 | 571 | 54.9% | 469 | 45.1% | 1040 |
| 16 | 577 | 43.0% | 764 | 57.0% | 1341 |
| 17 | 585 | 51.7% | 546 | 48.3% | 1131 |
| 18 | 577 | 45.1% | 703 | 54.9% | 1280 |
| 19 | 579 | 49.8% | 583 | 50.2% | 1162 |
| 20 | 595 | 44.4% | 746 | 55.6% | 1341 |
| 21 | 583 | 49.0% | 607 | 51.0% | 1190 |
| 22 | 571 | 52.1% | 526 | 47.9% | 1097 |
| 23 | 596 | 50.1% | 593 | 49.9% | 1189 |
| 25 | 577 | 50.9% | 556 | 49.1% | 1133 |
| 26 | 519 | 45.1% | 631 | 54.9% | 1150 |
| 27 | 608 | 39.2% | 942 | 60.8% | 1550 |
| 28 | 600 | 53.1% | 531 | 46.9% | 1131 |
| 29 | 517 | 32.6% | 1069 | 67.4% | 1586 |
- Average percentage of "0" labels: 49.6%
- Average percentage of "1" labels: 50.4%
Multiclass label distribution per participant:

| participant | Label "0" Count | Label "0" Percentage | Label "1" Count | Label "1" Percentage | Label "2" Count | Label "2" Percentage | Label "3" Count | Label "3" Percentage | Total Number of Labels |
|---|---|---|---|---|---|---|---|---|---|
| 2 | 569 | 48.9% | 197 | 16.9% | 202 | 17.4% | 196 | 16.8% | 1164 |
| 4 | 601 | 61.2% | 132 | 13.4% | 129 | 13.1% | 120 | 12.2% | 982 |
| 5 | 575 | 49.6% | 221 | 19.1% | 130 | 11.2% | 233 | 20.1% | 1159 |
| 6 | 619 | 51.9% | 166 | 13.9% | 147 | 12.3% | 260 | 21.8% | 1192 |
| 7 | 609 | 53.8% | 159 | 14.0% | 194 | 17.1% | 171 | 15.1% | 1133 |
| 8 | 606 | 59.8% | 143 | 14.1% | 136 | 13.4% | 129 | 12.7% | 1014 |
| 9 | 582 | 55.8% | 136 | 13.0% | 164 | 15.7% | 161 | 15.4% | 1043 |
| 10 | 589 | 48.3% | 166 | 13.6% | 275 | 22.6% | 189 | 15.5% | 1219 |
| 11 | 588 | 55.2% | 140 | 13.1% | 206 | 19.3% | 131 | 12.3% | 1065 |
| 12 | 615 | 45.8% | 166 | 12.4% | 165 | 12.3% | 397 | 29.6% | 1343 |
| 14 | 571 | 54.9% | 147 | 14.1% | 160 | 15.4% | 162 | 15.6% | 1040 |
| 16 | 577 | 43.0% | 165 | 12.3% | 263 | 19.6% | 336 | 25.1% | 1341 |
| 17 | 585 | 51.7% | 155 | 13.7% | 195 | 17.2% | 196 | 17.3% | 1131 |
| 18 | 577 | 45.1% | 137 | 10.7% | 187 | 14.6% | 379 | 29.6% | 1280 |
| 19 | 579 | 49.8% | 145 | 12.5% | 212 | 18.2% | 226 | 19.4% | 1162 |
| 20 | 595 | 44.4% | 168 | 12.5% | 276 | 20.6% | 302 | 22.5% | 1341 |
| 21 | 583 | 49.0% | 156 | 13.1% | 205 | 17.2% | 246 | 20.7% | 1190 |
| 22 | 571 | 52.1% | 158 | 14.4% | 162 | 14.8% | 206 | 18.8% | 1097 |
| 23 | 596 | 50.1% | 130 | 10.9% | 341 | 28.7% | 122 | 10.3% | 1189 |
| 25 | 577 | 50.9% | 137 | 12.1% | 218 | 19.2% | 201 | 17.7% | 1133 |
| 26 | 519 | 45.1% | 170 | 14.8% | 193 | 16.8% | 268 | 23.3% | 1150 |
| 27 | 608 | 39.2% | 154 | 9.9% | 298 | 19.2% | 490 | 31.6% | 1550 |
| 28 | 600 | 53.1% | 161 | 14.2% | 234 | 20.7% | 136 | 12.0% | 1131 |
Participant 29's interaction consisted of 5 errors. Therefore, it contained 2 additional labels "4" and "5".
| participant | Label "0" Count | Label "0" Percentage | Label "1" Count | Label "1" Percentage | Label "2" Count | Label "2" Percentage | Label "3" Count | Label "3" Percentage | Label "4" Count | Label "4" Percentage | Label "5" Count | Label "5" Percentage | Total Number of Labels |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 29 | 511 | 32.2% | 233 | 14.7% | 167 | 10.5% | 191 | 12.0% | 223 | 14.1% | 261 | 16.5% | 1586 |
- Average percentage of "0" label: 49.5%
- Average percentage of "1" label: 12.7%
- Average percentage of "2" label: 16.6%
- Average percentage of "3" label: 19.0%
- Average percentage of additional labels: 2.2%
The Python script that assisted with the label analysis is found here: label_analysis.py
We evaluated model performance using two training approaches (see the sketch after this list):
- Interparticipant training
- Models are trained on data from some participants, then tested on different unseen participants.
- This assesses how well a system can predict its failures from the reactions of new, unseen individuals.
- Intraparticipant training
- Models are trained on a subset of one participant's data, then tested on a different subset of the same participant's data.
- This assesses how well a system can predict its failures from unseen reactions of the same individual.
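A minimal sketch of the two splitting schemes, as a simplified stand-in for create_data_splits.py (it assumes a merged dataframe df with a participant column; ratios and IDs are illustrative):

```python
from sklearn.model_selection import GroupShuffleSplit, train_test_split

# Interparticipant: hold out entire participants so the test set contains only unseen individuals.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(df, groups=df["participant"]))
inter_train, inter_test = df.iloc[train_idx], df.iloc[test_idx]

# Intraparticipant: split one participant's own frames into train and test subsets.
p = df[df["participant"] == 2]
intra_train, intra_test = train_test_split(p, test_size=0.2, shuffle=False)  # keep temporal order
```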
Our models were trained on 15 combinations of different modalities including facial, pose, audio, and text features.
Our models were trained using 3 different feature sets including the full set of features, statistically significant features, and features of most importance. Please refer to this section for more details on feature set selection.
Our models were trained using 3 different datasets including the original dataset itself, the normalized dataset, and the dataset resulting from a principal component analysis (PCA). Please refer to this section for more details on PCA.
We explored fusion strategies to combine the features from different modalities (see the sketch after this list):
- Early Fusion
- Modality features are concatenated, then input into the model.
- Intermediate Fusion
- Each modality is processed independently, then intermediate representations are concatenated and input into further layers of a model.
- Late Fusion
- A separate model is trained on each modality, then their predictions are combined.
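A minimal illustration of early and late fusion with a generic classifier (intermediate fusion additionally requires a network with per-modality branches whose hidden representations are concatenated). The variable names and classifier choice are assumptions, not the repository's exact setup:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# X_facial, X_pose, X_audio: per-frame feature matrices for one modality each; y: frame labels.

# Early fusion: concatenate modality features into a single input matrix before training.
X_early = np.hstack([X_facial, X_pose, X_audio])
early_clf = RandomForestClassifier().fit(X_early, y)

# Late fusion: train one classifier per modality and combine their predicted probabilities.
clfs = [RandomForestClassifier().fit(X, y) for X in (X_facial, X_pose, X_audio)]
late_proba = np.mean([clf.predict_proba(X) for clf, X in zip(clfs, (X_facial, X_pose, X_audio))], axis=0)
late_pred = late_proba.argmax(axis=1)
```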
We explored different model architectures to assess performance across different complexity levels and modalities.
- Long Short-Term Memory (LSTM)
- Gated Recurrent Unit (GRU)
- MiniRocket Model
- Linear Classifiers
- K Nearest Neighbor (KNN)
- Random Forest (RF)
- Stochastic Gradient Descent (SGD)
- Support Vector Machine (SVM)
- Multilayer Perceptron (MLP)
- Audio Spectrogram Transformer (AST) for audio features
These models were trained using interparticipant and intraparticipant splits and with different fusion strategies, feature sets, and datasets as explained above.
Participants were excluded based on the following reasons:
- Failed protocol resulting in no reaction to failures
- Distractions unrelated to the experiment resulting in no reaction to failures
- Feature extraction compound confidence scores below 0.50.
Final number of participants: 24.
PCA is a method used to reduce the number of variables in a large dataset while retaining as much of the underlying variation as possible. PCA was conducted on the dataset of 84 features containing facial, pose, and audio features. The short script below was used to retain 90% of the variance and apply PCA to the dataset.
```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Set aside the first four columns (participant, frame, and label information); the rest are features.
participant_frames_labels = df.iloc[:, :4]
x = df.iloc[:, 4:]
x = StandardScaler().fit_transform(x.values)

pca = PCA()
principal_components = pca.fit_transform(x)
print(principal_components.shape)

# Keep only enough components to explain 90% of the variance.
pca = PCA(n_components=0.90)
principal_components = pca.fit_transform(x)
print(principal_components.shape)

principal_df = pd.DataFrame(data=principal_components,
                            columns=['principal component ' + str(i) for i in range(principal_components.shape[1])])
principal_df = pd.concat([participant_frames_labels, principal_df], axis=1)
```

The script was embedded into the create_data_splits_pca.py method in create_data_splits.py.
The resulting dataframe consisted of 41 principal components.
Running PCA on pose, facial, and audio features separately yielded 13, 24, and 7 principal components, respectively.