Shannon Liu, Maria Teresa Parreira, Wendy Ju
Read the paper here!
This repository contains the code to develop machine learning models to classify robot failure (binary classification) and successive robot failure (multiclass classification).
We explore a range of machine learning strategies to detect successive robot error for a single user or across multiple participants. Model training on single participants allows systems to learn each individual's unique way of signaling robot errors, while training on multiple participants tests generalization to unseen participants.
Our models use data extracted from videos collected in prior work and adopt different data-splitting strategies, feature representations, modality combinations, model architectures, and fusion strategies.
Data splitting strategies used:
- Error Detection (binary classification)
- Multiple Error Detection (multiclass classification)
- First Error to Successive Errors Generalization (binary classification)
- Successive Error Discrimination (multiclass classification)

These data splitting strategies were implemented in create_data_splits.py.
Features were represented as:
- raw non-normalized features
- normalized features
- normalized features with principal component analysis (PCA) applied
Modalities included facial, pose, audio, and text embeddings, and 15 combinations of these modalities were used during training.
Model architectures explored included:
- Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks
- MiniRocket
- Linear classifiers (KNN, random forest, SGD, SVM)
- Multilayer Perceptron (MLP)
- Audio Spectrogram Transformer (AST) for audio features
Fusion strategies included:
- Early fusion (concatenating features before model input)
- Intermediate fusion (processing each modality separately, then concatenating intermediate representations and training further layers)
- Late fusion (training a model per modality and combining their predictions)
This repository contains the code and data for a study exploring human responses to successive conversational errors made by robots.
Our study investigates these responses through a user study involving 26 participants, where we examine the behavioral and emotional shifts that occur when users interact with a robot named HelperBot that was wizarded to make successive conversational errors.
When analyzing the human reactions to robot failures, a codebook was created to classify verbal and nonverbal reactions, and a statistical analysis was conducted to examine the relationship between facial, audio, and body pose features and HelperBot's successive errors. More details on the codebook and statistical analysis can be found in the supplemental material or below.
This work contributes valuable insights into improving human-robot communication, particularly in scenarios where robots make repeated errors, and has potential applications for building more resilient and adaptive robotic systems.
The codebook for analyzing the video dataset of interactions between the participant and HelperBot during HelperBot's successive errors is found on page 2 of supplemental material.
The code used to plot annotations is found at plot.ipynb.
The data used to plot the annotations is found in the data subdirectory.
The plots created are found in the plots subdirectory.
A statistical analysis of the video dataset of interactions between the participant and HelperBot during HelperBot's successive errors is found on page 3 of supplemental material.
The code used for the analysis is found at full_statsanalysis.ipynb.
Recognizing robot failure by analyzing human reactions and behaviors in response to in-person robot failures.
The experiment involved a participant and a robot interacting or conversing with each other in a private room. The robot was controlled by a researcher and was engineered to create at least 3 errors.
The robot failure: not understanding the participant's order.
The robot failure was verbalized as "Sorry, I do not understand" and occurred at least 3 times. Afterward, the interaction ended with the robot verbalizing "OK, I will call the researcher".
- Analysis of Human Reactions to Robot Failure
- Features
- Labels
- Training
- Participant Exclusion
- Principal Component Analysis
See HRI25_LBR for more details on the study and findings.
Feature extraction was performed on the participant to understand and analyze facial expressions, body movements, and speech that might convey underlying emotions during the human-robot interaction. After each feature extraction tool was applied to a participant's video, the resulting outputs were processed into readable forms, which were then merged into one collective CSV file documenting all feature data per frame per participant.
The OpenFace toolkit was used to detect facial landmarks and action units and required a video file path as input. The output consisted of a CSV file containing feature data per frame for the video.
The OpenFace toolkit can be found here: https://github.com/TadasBaltrusaitis/OpenFace
Irrelevant features were excluded from the OpenFace output; this was done while merging all feature data. Facial feature exclusion is found in this Python script: feature_merge.py. For the facial feature exclusion portion of the script, the CSV file path of the facial features for each participant and the final facial feature list were required as input.
Final facial feature list (mainly action units):
facial_features = ['AU01_r', 'AU02_r', 'AU04_r', 'AU05_r', 'AU06_r', 'AU07_r', 'AU09_r', 'AU10_r',
'AU12_r', 'AU14_r', 'AU15_r', 'AU17_r', 'AU20_r', 'AU23_r', 'AU25_r', 'AU26_r', 'AU45_r', 'AU01_c',
'AU02_c', 'AU04_c', 'AU05_c', 'AU06_c', 'AU07_c', 'AU09_c', 'AU10_c', 'AU12_c', 'AU14_c', 'AU15_c',
'AU17_c', 'AU20_c', 'AU23_c', 'AU25_c', 'AU26_c', 'AU28_c', 'AU45_c', 'gaze_0_x', 'gaze_0_y',
'gaze_0_z', 'gaze_1_x', 'gaze_1_y', 'gaze_1_z', 'gaze_angle_x', 'gaze_angle_y']
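For illustration, a minimal sketch of the column-filtering step, assuming pandas and the facial_features list above (the file paths are hypothetical; the actual logic lives in feature_merge.py):

```python
import pandas as pd

# Hypothetical paths; in practice these are passed to feature_merge.py per participant.
openface_csv = "participant_02_openface.csv"
output_csv = "participant_02_facial_features.csv"

df = pd.read_csv(openface_csv)
df.columns = df.columns.str.strip()      # OpenFace headers often contain leading spaces
kept = df[["frame"] + facial_features]   # keep the frame index plus the final facial features
kept.to_csv(output_csv, index=False)
```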
The OpenPose toolkit and the BODY_25 model were used to obtain keypoints of the participant's body and required a video file path as input and a JSON file path to store the output. The JSON files were parsed and converted into CSV files listing pose features per frame for each video with this script: parse_openpose.py.
The following was executed in the command line interface to return a JSON file of 25 keypoints per frame:
bin\OpenPoseDemo.exe --video {input_video_file_path} --write_video {output_file_path} --write_json {output_file_path}
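A minimal sketch of the JSON-to-CSV parsing, in the spirit of parse_openpose.py (keypoint order follows the BODY_25 model; the directory and file names are hypothetical):

```python
import glob
import json
import os

import pandas as pd

# BODY_25 keypoint names in the model's output order.
BODY_25 = ["nose", "neck", "rightshoulder", "rightelbow", "rightwrist",
           "leftshoulder", "leftelbow", "leftwrist", "midhip", "righthip",
           "rightknee", "rightankle", "lefthip", "leftknee", "leftankle",
           "righteye", "lefteye", "rightear", "leftear", "leftbigtoe",
           "leftsmalltoe", "leftheel", "rightbigtoe", "rightsmalltoe", "rightheel"]

rows = []
# OpenPose writes one JSON file per frame; sorting the file names recovers the frame order.
for frame_idx, path in enumerate(sorted(glob.glob(os.path.join("openpose_json", "*_keypoints.json")))):
    with open(path) as f:
        data = json.load(f)
    if not data["people"]:
        continue  # no person detected in this frame
    keypoints = data["people"][0]["pose_keypoints_2d"]  # flat list of x, y, confidence triplets
    row = {"frame": frame_idx}
    for i, name in enumerate(BODY_25):
        row[f"{name}_x"], row[f"{name}_y"] = keypoints[3 * i], keypoints[3 * i + 1]
    rows.append(row)

pd.DataFrame(rows).to_csv("participant_pose_features.csv", index=False)
```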
Only upper body keypoints were relevant to the video dataset, so the lower body keypoints (mid hip, right hip, left hip, right knee, left knee, right ankle, left ankle, right big toe, left big toe, right small toe, left small toe, right heel, left heel) were removed during preprocessing. The Python script used to exclude lower body features is found here: feature_exclusion.py. Required input for the script includes the path to the directory holding all participant CSV files and a CSV file path for each participant to store the output.
Final pose feature list:
pose_features = ['nose_x', 'nose_y', 'neck_x', 'neck_y', 'rightshoulder_x', 'rightshoulder_y',
'leftshoulder_x', 'leftshoulder_y', 'rightelbow_x', 'rightelbow_y', 'leftelbow_x', 'leftelbow_y',
'rightwrist_x', 'rightwrist_y', 'leftwrist_x', 'leftwrist_y', 'righteye_x', 'righteye_y',
'lefteye_x', 'lefteye_y', 'rightear_x', 'rightear_y', 'leftear_x', 'leftear_y']
The OpenPose toolkit can be found here: https://github.com/CMU-Perceptual-Computing-Lab/openpose
In addition to the original features produced with OpenPose, for each original feature a column titled "[original feature name]_delta" was appended, containing the change in value from the previous frame to the current frame. The Python script used for updating the CSV with delta values is found here: features_delta_calculations.py. Required input to obtain delta calculations for each participant includes the CSV file path of pose features for the participant and a CSV file path to store the output with the additional delta columns and values.
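A minimal sketch of the delta computation, as a simplified stand-in for features_delta_calculations.py (it assumes the pose_features list above is defined; paths are hypothetical):

```python
import pandas as pd

df = pd.read_csv("participant_pose_features.csv")

# For every pose feature, append "<feature>_delta": the change from the previous frame's value.
for col in pose_features:
    df[f"{col}_delta"] = df[col].diff().fillna(0)  # the first frame has no previous value

df.to_csv("participant_pose_features_delta.csv", index=False)
```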
The openSMILE toolkit and the eGeMAPSv02 configuration were used to obtain audio features from the interaction between the participant and the robot and required an audio file path as input and a CSV file path as output. The following Python code was executed to return a CSV file of audio features:
```python
import opensmile
import os

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.LowLevelDescriptors,
)
file = f"{input_audio_file_path}"
y = smile.process_file(file)
y.to_csv(f"{output_file_path}")
```

The script can also be found here: audio_extraction.py. Required inputs for the script included a directory of all participant audio files and a CSV file path for each participant to store output.
Final audio feature list:
audio_features = ['Loudness_sma3', 'alphaRatio_sma3', 'hammarbergIndex_sma3', 'slope0-500_sma3', 'slope500-1500_sma3',
'spectralFlux_sma3', 'mfcc1_sma3', 'mfcc2_sma3', 'mfcc3_sma3', 'mfcc4_sma3', 'F0semitoneFrom27.5Hz_sma3nz',
'jitterLocal_sma3nz', 'shimmerLocaldB_sma3nz', 'HNRdBACF_sma3nz', 'logRelF0-H1-H2_sma3nz', 'logRelF0-H1-A3_sma3nz',
'F1frequency_sma3nz', 'F1bandwidth_sma3nz', 'F1amplitudeLogRelF0_sma3nz', 'F2frequency_sma3nz', 'F2bandwidth_sma3nz',
'F2amplitudeLogRelF0_sma3nz', 'F3frequency_sma3nz', 'F3bandwidth_sma3nz', 'F3amplitudeLogRelF0_sma3nz']
Only audio segments (partitioned via speaker diarization) spoken by the participant were relevant to the study; other segments were removed during preprocessing.
The openSMILE toolkit can be found here: https://audeering.github.io/opensmile/about.html or https://github.com/audeering/opensmile/
Speaker diarization was used to identify timestamps indicating when the participant and the robot were speaking.
The pyannote speaker diarization toolkit was used to extract timestamps. The following Python code was executed to return an RTTM file of speaker timestamps from an audio file:
```python
from pyannote.audio import Pipeline
import torch

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token=f"{huggingface_authentication_token}")

diarization = pipeline(f"{input_audio_file_path}")

for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"start={turn.start:.1f}s stop={turn.end:.1f}s speaker_{speaker}")

with open(f"{output_file_path}", "w") as rttm:
    diarization.write_rttm(rttm)
```

The output of the speaker diarization was an RTTM file. This was converted into a CSV file and "overlapped" with the openSMILE audio extraction CSV outputs to remove the timestamps at which other speakers were speaking.
The pyannote speaker diarization toolkit can be found here: https://huggingface.co/pyannote/speaker-diarization-3.1 or https://github.com/pyannote/pyannote-audio
Once the timestamps for each speaker were extracted, audio features corresponding to the participant's speech were retained, while those corresponding to other speakers' speech were removed. The Python script used to achieve this is found here: filter_audio_features.py. Required inputs for the script to filter features for a single participant include the CSV file of openSMILE audio extracted features, the CSV file of timestamps produced from speaker diarization, and a CSV file path to store the output.
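A minimal sketch of this filtering step, as a simplified stand-in for filter_audio_features.py. It assumes the diarization output has been converted to a CSV with start, end (in seconds), and speaker columns, and that the speaker label mapping to the participant is known; all file names and the label are hypothetical:

```python
import pandas as pd

audio = pd.read_csv("participant_opensmile_lld.csv")    # one row per 0.02 s openSMILE window
segments = pd.read_csv("participant_diarization.csv")   # start, end, speaker columns derived from the RTTM file

# openSMILE writes timestamps like "0 days 00:00:00.02"; convert them to seconds for comparison.
audio["start_s"] = pd.to_timedelta(audio["start"]).dt.total_seconds()

# Keep only windows that fall inside segments where the participant is speaking.
participant_segments = segments[segments["speaker"] == "SPEAKER_00"]  # label assumed to map to the participant
mask = pd.Series(False, index=audio.index)
for _, seg in participant_segments.iterrows():
    mask |= audio["start_s"].between(seg["start"], seg["end"])

audio[mask].drop(columns="start_s").to_csv("participant_audio_filtered.csv", index=False)
```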
The CSV files of features obtained from openSMILE used start and end timestamps in the format "0 days 00:00:00.00", and each row included audio features for a 0.02-second window (between the start and end timestamps). However, this study is interested in frames rather than timestamps, so start timestamps were converted into frame numbers. When multiple rows mapped to the same frame number, the average value of each feature was used as that frame's feature value. The Python script used to convert timestamps to frames and process features is found here: audio_features_frames.py. Required inputs include a directory of all participant CSV files of filtered audio features and a CSV file path for each participant to store the new output CSV.
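A minimal sketch of the timestamp-to-frame conversion and per-frame averaging, assuming a 30 fps video and the audio_features list above (the frame rate, column names, and paths are assumptions; the actual procedure lives in audio_features_frames.py):

```python
import pandas as pd

FPS = 30  # assumed video frame rate

df = pd.read_csv("participant_audio_filtered.csv")

# Parse the "0 days 00:00:00.00" start timestamps and map each 0.02 s window to a video frame.
start_seconds = pd.to_timedelta(df["start"]).dt.total_seconds()
df["frame"] = (start_seconds * FPS).astype(int)

# Several 0.02 s windows can map to the same frame; average their feature values per frame.
per_frame = df.groupby("frame")[audio_features].mean().reset_index()
per_frame.to_csv("participant_audio_per_frame.csv", index=False)
```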
After facial, pose, and audio features were processed to include relevant features per frame per participant, all features were merged into one CSV file representing the features per frame of a single participant. The Python script for merging features for each participant is found here: feature_merge.py
All participants' features were merged into a collective CSV file containing all rows from each participant's merged feature data. The Python script for merging all participant feature data is found here: feature_all_participants.py.
Features were checked for NaN and inf values and then normalized for model training. The Python script for checking feature values is found here: check_features.py, and the script for normalizing features is found here: normalization.py.
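A minimal sketch of the value check and normalization, assuming the first four columns hold participant, frame, and label information; the scaler shown here (min-max) is an assumption, and the exact method used in normalization.py may differ:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("all_participants_features.csv")   # hypothetical path for the merged feature file
feature_cols = df.columns[4:]                        # assumes participant/frame/label columns come first

# Check for missing or infinite values before scaling.
print(df[feature_cols].isna().sum().sum(), "NaN values")
print(np.isinf(df[feature_cols].to_numpy()).sum(), "inf values")

# Scale each feature to [0, 1].
df[feature_cols] = MinMaxScaler().fit_transform(df[feature_cols])
df.to_csv("all_participants_features_normalized.csv", index=False)
```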
There were three feature sets used when training our model.
The full feature set contains all final facial, pose, and audio features listed previously.
A statistical analysis was used to assess statistically significant features. Feature set: stats_features
A random forest was used to assess feature importance. Features were retained up to the point where the importance drop exceeded 40%. Feature set: rf_features
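A minimal sketch of one plausible reading of this selection rule, assuming X (a feature DataFrame) and y (frame labels) are already loaded; the exact procedure used to build rf_features may differ:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Fit a forest on the full feature set and rank features by impurity-based importance.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)

# Keep features until the relative drop between consecutive importances exceeds 40%.
rf_features = [importances.index[0]]
for prev, curr, name in zip(importances[:-1], importances[1:], importances.index[1:]):
    if (prev - curr) / prev > 0.40:
        break
    rf_features.append(name)
print(rf_features)
```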
Two types of labeling methods were used: binary labeling and multiclass labeling. A label was applied to each frame of the video of the interaction between the participant and the robot. Labels were initially appended to the pose feature CSV files.
The Python script used for efficient labeling is found here: features_create_labels_csvs.py. Required inputs for the script to store binary and multiclass labels for each participant include a CSV file to store the features and labels, the frame numbers at which binary labels should be added, and the frame numbers at which multiclass labels should be added. Additionally, a directory path is required for the location where the CSV files of features and labels for each participant are stored.
- "0" labeled frames from the beginning of the video to the first "Sorry, I do not understand" error and after "OK, I will call the researcher" to the end of the video.
- "1" labeled frames from the first "Sorry, I do not understand" error to "OK, I will call the researcher".
- "0" labeled frames from the beginning of the video to the first "Sorry, I do not understand" error and after "OK, I will call the researcher" to the end of the video.
- "1" labeled frames from the first "Sorry, I do not understand" error to the second "Sorry, I do not understand".
- "2" labeled frames from the second "Sorry, I do not understand" error to the third "Sorry, I do not understand".
- "3" labeled frames from the third "Sorry, I do not understand" error to "OK, I will call the researcher".
The same labeling procedure was used for additional errors.
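A minimal sketch of the per-frame labeling, as a simplified stand-in for features_create_labels_csvs.py (it assumes a frame column in the feature CSV; the error and end frame numbers are illustrative):

```python
import pandas as pd

df = pd.read_csv("participant_features.csv")

# Illustrative frame numbers for each "Sorry, I do not understand" error
# and for the final "OK, I will call the researcher" utterance.
error_frames = [580, 790, 1000]
end_frame = 1180

# Binary label: 1 between the first error and the end of the error episode, else 0.
df["binary_label"] = ((df["frame"] >= error_frames[0]) & (df["frame"] < end_frame)).astype(int)

# Multiclass label: 0 before the first error and after the episode, k between error k and error k+1.
boundaries = error_frames + [end_frame]
df["multiclass_label"] = 0
for k in range(len(error_frames)):
    in_segment = (df["frame"] >= boundaries[k]) & (df["frame"] < boundaries[k + 1])
    df.loc[in_segment, "multiclass_label"] = k + 1

df.to_csv("participant_features_labels.csv", index=False)
```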
The following tables contain the counts and percentages of labels per participant.
Binary label distribution per participant:

| participant | Label "0" Count | Label "0" Percentage | Label "1" Count | Label "1" Percentage | Total Number of Labels |
|---|---|---|---|---|---|
| 2 | 569 | 48.9% | 595 | 51.1% | 1164 |
| 4 | 601 | 61.2% | 381 | 38.8% | 982 |
| 5 | 575 | 49.6% | 584 | 50.4% | 1159 |
| 6 | 619 | 51.9% | 573 | 48.1% | 1192 |
| 7 | 609 | 53.8% | 524 | 46.2% | 1133 |
| 8 | 606 | 59.8% | 408 | 40.2% | 1014 |
| 9 | 582 | 55.8% | 461 | 44.2% | 1043 |
| 10 | 589 | 48.3% | 630 | 51.7% | 1219 |
| 11 | 588 | 55.2% | 477 | 44.8% | 1065 |
| 12 | 615 | 45.8% | 728 | 54.2% | 1343 |
| 14 | 571 | 54.9% | 469 | 45.1% | 1040 |
| 16 | 577 | 43.0% | 764 | 57.0% | 1341 |
| 17 | 585 | 51.7% | 546 | 48.3% | 1131 |
| 18 | 577 | 45.1% | 703 | 54.9% | 1280 |
| 19 | 579 | 49.8% | 583 | 50.2% | 1162 |
| 20 | 595 | 44.4% | 746 | 55.6% | 1341 |
| 21 | 583 | 49.0% | 607 | 51.0% | 1190 |
| 22 | 571 | 52.1% | 526 | 47.9% | 1097 |
| 23 | 596 | 50.1% | 593 | 49.9% | 1189 |
| 25 | 577 | 50.9% | 556 | 49.1% | 1133 |
| 26 | 519 | 45.1% | 631 | 54.9% | 1150 |
| 27 | 608 | 39.2% | 942 | 60.8% | 1550 |
| 28 | 600 | 53.1% | 531 | 46.9% | 1131 |
| 29 | 517 | 32.6% | 1069 | 67.4% | 1586 |
- Average percentage of "0" labels: 49.6%
- Average percentage of "1" labels: 50.4%
Multiclass label distribution per participant:

| participant | Label "0" Count | Label "0" Percentage | Label "1" Count | Label "1" Percentage | Label "2" Count | Label "2" Percentage | Label "3" Count | Label "3" Percentage | Total Number of Labels |
|---|---|---|---|---|---|---|---|---|---|
| 2 | 569 | 48.9% | 197 | 16.9% | 202 | 17.4% | 196 | 16.8% | 1164 |
| 4 | 601 | 61.2% | 132 | 13.4% | 129 | 13.1% | 120 | 12.2% | 982 |
| 5 | 575 | 49.6% | 221 | 19.1% | 130 | 11.2% | 233 | 20.1% | 1159 |
| 6 | 619 | 51.9% | 166 | 13.9% | 147 | 12.3% | 260 | 21.8% | 1192 |
| 7 | 609 | 53.8% | 159 | 14.0% | 194 | 17.1% | 171 | 15.1% | 1133 |
| 8 | 606 | 59.8% | 143 | 14.1% | 136 | 13.4% | 129 | 12.7% | 1014 |
| 9 | 582 | 55.8% | 136 | 13.0% | 164 | 15.7% | 161 | 15.4% | 1043 |
| 10 | 589 | 48.3% | 166 | 13.6% | 275 | 22.6% | 189 | 15.5% | 1219 |
| 11 | 588 | 55.2% | 140 | 13.1% | 206 | 19.3% | 131 | 12.3% | 1065 |
| 12 | 615 | 45.8% | 166 | 12.4% | 165 | 12.3% | 397 | 29.6% | 1343 |
| 14 | 571 | 54.9% | 147 | 14.1% | 160 | 15.4% | 162 | 15.6% | 1040 |
| 16 | 577 | 43.0% | 165 | 12.3% | 263 | 19.6% | 336 | 25.1% | 1341 |
| 17 | 585 | 51.7% | 155 | 13.7% | 195 | 17.2% | 196 | 17.3% | 1131 |
| 18 | 577 | 45.1% | 137 | 10.7% | 187 | 14.6% | 379 | 29.6% | 1280 |
| 19 | 579 | 49.8% | 145 | 12.5% | 212 | 18.2% | 226 | 19.4% | 1162 |
| 20 | 595 | 44.4% | 168 | 12.5% | 276 | 20.6% | 302 | 22.5% | 1341 |
| 21 | 583 | 49.0% | 156 | 13.1% | 205 | 17.2% | 246 | 20.7% | 1190 |
| 22 | 571 | 52.1% | 158 | 14.4% | 162 | 14.8% | 206 | 18.8% | 1097 |
| 23 | 596 | 50.1% | 130 | 10.9% | 341 | 28.7% | 122 | 10.3% | 1189 |
| 25 | 577 | 50.9% | 137 | 12.1% | 218 | 19.2% | 201 | 17.7% | 1133 |
| 26 | 519 | 45.1% | 170 | 14.8% | 193 | 16.8% | 268 | 23.3% | 1150 |
| 27 | 608 | 39.2% | 154 | 9.9% | 298 | 19.2% | 490 | 31.6% | 1550 |
| 28 | 600 | 53.1% | 161 | 14.2% | 234 | 20.7% | 136 | 12.0% | 1131 |
Participant 29's interaction consisted of 5 errors. Therefore, it contained 2 additional labels "4" and "5".
| participant | Label "0" Count | Label "0" Percentage | Label "1" Count | Label "1" Percentage | Label "2" Count | Label "2" Percentage | Label "3" Count | Label "3" Percentage | Label "4" Count | Label "4" Percentage | Label "5" Count | Label "5" Percentage | Total Number of Labels |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 29 | 511 | 32.2% | 233 | 14.7% | 167 | 10.5% | 191 | 12.0% | 223 | 14.1% | 261 | 16.5% | 1586 |
- Average percentage of "0" label: 49.5%
- Average percentage of "1" label: 12.7%
- Average percentage of "2" label: 16.6%
- Average percentage of "3" label: 19.0%
- Average percentage of additional labels: 2.2%
The Python script that assisted with the label analysis is found here: label_analysis.py
We evaluated model performance using two training approaches (see the sketch after this list):
- Interparticipant training
- Models are trained on data from some participants, then tested on different unseen participants.
- This assesses how well a system can predict its failures from the reactions of new, unseen individuals.
- Intraparticipant training
- Models are trained on a subset of one participant's data, then tested on a different subset of the same participant's data.
- This assesses how well a system can predict its failures from unseen reactions of the same individual.
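A minimal sketch of the two splitting schemes, as a simplified stand-in for create_data_splits.py (it assumes a merged dataframe df with a participant column; ratios and IDs are illustrative):

```python
from sklearn.model_selection import GroupShuffleSplit, train_test_split

# Interparticipant: hold out entire participants so the test set contains only unseen individuals.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(df, groups=df["participant"]))
inter_train, inter_test = df.iloc[train_idx], df.iloc[test_idx]

# Intraparticipant: split one participant's own frames into train and test subsets.
p = df[df["participant"] == 2]
intra_train, intra_test = train_test_split(p, test_size=0.2, shuffle=False)  # keep temporal order
```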
Our models were trained on 15 combinations of different modalities including facial, pose, audio, and text features.
Our models were trained using 3 different feature sets including the full set of features, statistically significant features, and features of most importance. Please refer to this section for more details on feature set selection.
Our models were trained using 3 different datasets including the original dataset itself, the normalized dataset, and the dataset resulting from a principal component analysis (PCA). Please refer to this section for more details on PCA.
We explored fusion strategies to combine the features from different modalities (see the sketch after this list):
- Early Fusion
- Modality features are concatenated, then input into the model.
- Intermediate Fusion
- Each modality is processed independently, then intermediate representations are concatenated and input into further layers of a model.
- Late Fusion
- A separate model is trained on each modality, then their predictions are combined.
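A minimal illustration of early and late fusion with a generic classifier (intermediate fusion additionally requires a network with per-modality branches whose hidden representations are concatenated). The variable names and classifier choice are assumptions, not the repository's exact setup:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# X_facial, X_pose, X_audio: per-frame feature matrices for one modality each; y: frame labels.

# Early fusion: concatenate modality features into a single input matrix before training.
X_early = np.hstack([X_facial, X_pose, X_audio])
early_clf = RandomForestClassifier().fit(X_early, y)

# Late fusion: train one classifier per modality and combine their predicted probabilities.
clfs = [RandomForestClassifier().fit(X, y) for X in (X_facial, X_pose, X_audio)]
late_proba = np.mean([clf.predict_proba(X) for clf, X in zip(clfs, (X_facial, X_pose, X_audio))], axis=0)
late_pred = late_proba.argmax(axis=1)
```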
We explored different model architectures to assess performance across different complexity levels and modalities.
- Long Short-Term Memory (LSTM)
- Gated Recurrent Unit (GRU)
- MiniRocket Model
- Linear Classifiers
- K Nearest Neighbor (KNN)
- Random Forest (RF)
- Stochastic Gradient Descent (SGD)
- Support Vector Machine (SVM)
- Multilayer Perceptron (MLP)
- Audio Spectrogram Transformer (AST) for audio features
These models were trained using interparticipant and intraparticipant splits and with different fusion strategies, feature sets, and datasets as explained above.
Participants were excluded based on the following reasons:
- Failed protocol resulting in no reaction to failures
- Distractions unrelated to the experiment resulting in no reaction to failures
- Feature extraction compound confidence scores below 0.50.
Final number of participants: 24.
PCA is a method used to reduce the number of variables in a large dataset while retaining as much of the underlying variation as possible. PCA was conducted on the dataset of 84 features containing facial, pose, and audio features. The short script below was used to retain 90% of the variance and apply PCA to the dataset.
```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Set aside the first four columns (participant, frame, and label information); the rest are features.
participant_frames_labels = df.iloc[:, :4]
x = df.iloc[:, 4:]
x = StandardScaler().fit_transform(x.values)

pca = PCA()
principal_components = pca.fit_transform(x)
print(principal_components.shape)

# Keep only enough components to explain 90% of the variance.
pca = PCA(n_components=0.90)
principal_components = pca.fit_transform(x)
print(principal_components.shape)

principal_df = pd.DataFrame(data=principal_components,
                            columns=['principal component ' + str(i) for i in range(principal_components.shape[1])])
principal_df = pd.concat([participant_frames_labels, principal_df], axis=1)
```

The script was embedded into the create_data_splits_pca.py method in create_data_splits.py.
The resulting dataframe consisted of 41 principal components.
Running PCA on pose, facial, and audio features separately yielded 13, 24, and 7 principal components, respectively.