HAR - Tennis stroke classification with ML

This project shows how to classify a tennis player's strokes with machine learning, using motion data gathered from an Apple Watch.

The application required to gather this data is TennisIO. It needs the following REST API to export data from the device: TennisIOAPI.

We recommend running this notebook in Google Colab, where it was built.

# Import dependencies
import pandas as pd
import os
import matplotlib.pyplot as plt
import numpy as np
import math
import seaborn as sns
import matplotlib.ticker as plticker

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

Loading data

Data is loaded from JSON files. Each file is a list of sensor readings from the device, and each reading is tagged with the timestamp at which it was taken.

# Check files to load
folder = 'trainments'
files = os.listdir(folder)

# Load data from .json files
frames = []
for f in files:
    if f.endswith('.json'):
        d = pd.read_json(os.path.join(folder, f))
        frames.append(d)
        print(f'Loaded {f}')

data = pd.concat(frames, ignore_index=True)

Loaded 20230625_011216.json
Loaded 20230625_095026.json
Loaded 20230625_013636.json
Loaded 20230625_011333.json
Loaded 20230625_094944.json
Loaded 20230625_124903.json
Loaded 20230625_094722.json
Loaded 20230625_094808.json
Loaded 20230625_125655.json
Loaded 20230625_013444.json
Loaded 20230625_011052.json
Loaded 20230624_113411.json
Loaded 20230625_013541.json
Loaded 20230625_094833.json
Loaded 20230625_013330.json
Loaded 20230625_094910.json
Loaded 20230625_094747.json

Prepare data

Once the data is loaded, we need to handle a few tricky aspects:

The timestamp recorded by the device has one-second precision, while the device produces about 50 samples per second. We need to refine the precision by adding synthetic millisecond offsets to each sample.
Each file contains one kind of movement repeated over a period of time. We need to window the data so that each window holds a single movement.
To train the model, we summarize each movement into a few aggregate features. This lets us use simpler algorithms.
Finally, we split the data into train and test datasets so that the model is validated on data it was not trained on.

Timestamp precision

First, let's solve the timestamp precision issue. We need to know how many samples each second contains so that we can spread them evenly across that second.

# Group samples by second
grouped_per_second = data.groupby(["identifier", "kind", "timestamp"], as_index=False)['identifier'].count()
grouped_per_second['first_index'] = grouped_per_second.apply(lambda row: data[data['timestamp'] == row['timestamp']].index[0], axis=1)

grouped_per_second.head()


	kind	timestamp	identifier	first_index
0	Drive	2023-06-24 11:34:11	12	17738
1	Drive	2023-06-24 11:34:12	50	17750
2	Drive	2023-06-24 11:34:13	50	17800
3	Drive	2023-06-24 11:34:14	50	17850
4	Drive	2023-06-24 11:34:15	51	17900

# Give timestamps millisecond precision and add the time elapsed since the start of the training session
data['timestamp_millis'] = data['timestamp'].dt.round('ms')
data['timestamp_millis'] = data.apply(
    lambda row: row['timestamp_millis'] +
        pd.to_timedelta(
            # Evenly spaced offsets between 50 ms and 950 ms, one per sample recorded in this second
            np.linspace(50, 950, num=grouped_per_second[grouped_per_second['timestamp'] == row['timestamp']].iloc[0]['identifier'])
            # Position of this row within its second
            [row.name-grouped_per_second[grouped_per_second['timestamp'] == row['timestamp']].iloc[0]['first_index']]
            .astype(int),
            unit='ms'),
    axis=1)

# The identifier stores the session start as a day-first string, so parse it with an explicit format
data['time_since_start'] = data['timestamp_millis'] - pd.to_datetime(data['identifier'], format='%d-%m-%Y %H:%M:%S')

data.head()


	identifier	kind	pitch	roll	yaw	timestamp	xAcceleration	yAcceleration	zAcceleration	xRotation	yRotation	zRotation	timestamp_millis	time_since_start
0	25-06-2023 01:12:16	Drive	-1.052237	0.473681	0.278520	2023-06-25 01:12:16	-0.024398	0.001278	0.001278	-0.007708	0.014792	-0.095422	2023-06-25 01:12:16.050	0 days 00:00:00.050000
1	25-06-2023 01:12:16	Drive	-1.053003	0.471329	0.276144	2023-06-25 01:12:16	0.025908	-0.014818	-0.014818	-0.016041	-0.029510	-0.036892	2023-06-25 01:12:16.150	0 days 00:00:00.150000
2	25-06-2023 01:12:16	Drive	-1.053979	0.470275	0.275632	2023-06-25 01:12:16	0.026299	0.011936	0.011936	-0.062542	-0.011987	-0.024491	2023-06-25 01:12:16.250	0 days 00:00:00.250000
3	25-06-2023 01:12:16	Drive	-1.055339	0.470844	0.276280	2023-06-25 01:12:16	0.000859	0.009694	0.009694	-0.068776	-0.006747	-0.026771	2023-06-25 01:12:16.350	0 days 00:00:00.350000
4	25-06-2023 01:12:16	Drive	-1.056421	0.471384	0.277111	2023-06-25 01:12:16	0.019985	0.011526	0.011526	-0.041852	-0.005576	0.012301	2023-06-25 01:12:16.450	0 days 00:00:00.450000
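
The row-wise apply used for the millisecond offsets filters grouped_per_second once per sample, which can become slow on longer recordings. The following is a minimal vectorized sketch of the same idea, not the notebook's own code. The helper name add_millis and the toy frame are assumptions made for illustration, and the sketch assumes that rows sharing a second appear in recording order.

```python
import pandas as pd


def add_millis(df, ts_col="timestamp", lo=50, hi=950):
    """Spread the samples that share one second evenly between lo and hi ms."""
    pos = df.groupby(ts_col).cumcount()               # 0-based position within the second
    n = df.groupby(ts_col)[ts_col].transform("count")  # number of samples in that second
    # Same spacing as np.linspace(lo, hi, num=n)[pos]; a lone sample gets lo
    step = (hi - lo) / (n - 1).clip(lower=1)
    offset_ms = (lo + pos * step).astype(int)
    return df[ts_col] + pd.to_timedelta(offset_ms, unit="ms")


# Toy data: three samples in one second, one sample in the next
toy = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2023-06-25 01:12:16"] * 3 + ["2023-06-25 01:12:17"]
    )
})
toy["timestamp_millis"] = add_millis(toy)
print(toy)
```

Grouping once and using cumcount avoids the repeated lookups, at the cost of relying on row order inside each second.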

Window data

We're going to window the data manually. First, let's define a filter that sets xAcceleration values near 0 (between -0.1 and 0.1) to exactly 0. This suppresses the noise between strokes and helps us identify each movement.

# Zero out small xAcceleration values (|x| < 0.1) so that only the strokes stand out
data['x_accel_smoothed'] = data['xAcceleration'].mask(data['xAcceleration'].abs() < 0.1, 0)

Then we plot this attribute for one recording session. Each movement shows up as a recognizable pattern.

# One major tick per second, assuming the timedelta axis is expressed in nanoseconds
loc = plticker.MultipleLocator(base=1000000000)
sns.set(rc={'figure.figsize':(20,10)})

_tmp = data[data['identifier'] == '25-06-2023 09:50:26']

axes = sns.lineplot(y='x_accel_smoothed', x='time_since_start', data=_tmp)
axes.xaxis.set_major_locator(loc)
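
The windows in this project are chosen by hand from the plot. As a possible cross-check, runs of consecutive non-zero values in the thresholded signal can give a first guess for where each movement starts and ends. The sketch below is an assumption for illustration only: find_active_runs is a hypothetical helper, not part of the notebook, and it works on sample indices rather than on time_since_start.

```python
import numpy as np


def find_active_runs(signal):
    """Return (start, end) index pairs of consecutive non-zero samples, end exclusive."""
    active = np.asarray(signal) != 0
    # Pad with False so that every run has both a rising and a falling edge
    padded = np.concatenate(([False], active, [False])).astype(int)
    edges = np.diff(padded)
    starts = np.flatnonzero(edges == 1)
    ends = np.flatnonzero(edges == -1)
    return list(zip(starts.tolist(), ends.tolist()))


# Toy thresholded signal with two bursts of activity
runs = find_active_runs([0, 0, 0.4, -0.3, 0, 0, 0.2, 0])
print(runs)  # [(2, 4), (6, 7)]
```

A real stroke may cross zero several times, so neighbouring runs would likely need to be merged before they are usable as windows.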

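We define the start and the end of each movement in a file called windows.xlsx, which is used to split the data. The From and To values are multiplied by 10 in the code that follows to obtain seconds since the start of the session.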

windows = pd.read_excel('{}/windows.xlsx'.format(folder))
windows


	Identifier	From	To	Kind
0	2023-06-24 11:34:11	0.60	0.90	Drive
1	2023-06-24 11:34:11	0.90	1.20	Drive
2	2023-06-24 11:34:11	1.20	1.85	Drive
3	2023-06-24 11:34:11	1.85	2.15	Drive
4	2023-06-24 11:34:11	2.15	2.45	Drive
...	...	...	...	...
188	2023-06-25 09:50:26	1.05	1.30	Backhand
189	2023-06-25 09:50:26	1.30	1.50	Backhand
190	2023-06-25 09:50:26	1.50	1.80	Backhand
191	2023-06-25 09:50:26	1.80	2.05	Backhand
192	2023-06-25 09:50:26	2.05	2.30	Backhand

193 rows × 4 columns

# Cut one slice of samples per window; From and To are multiplied by 10 to get seconds
splitted_data = []
for index, row in windows.iterrows():
    d = data[
        # The session identifier in data is a day-first string
        (data['identifier'] == row['Identifier'].strftime('%d-%m-%Y %H:%M:%S')) &
        (data['time_since_start'] >= pd.to_timedelta(row['From']*10, unit='s')) &
        (data['time_since_start'] < pd.to_timedelta(row['To']*10, unit='s'))
        ]
    splitted_data.append(d)

print(f'Total movements: {len(splitted_data)}')

Total movements: 193
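
The unit conversion and the half-open interval used for each window can be checked on a toy frame. This is a minimal sketch under stated assumptions: slice_window and its scale parameter are hypothetical names, and the toy values are made up.

```python
import pandas as pd


def slice_window(df, start, end, col="time_since_start", scale=10):
    """Keep rows whose elapsed time lies in [start * scale, end * scale) seconds."""
    lo = pd.to_timedelta(start * scale, unit="s")
    hi = pd.to_timedelta(end * scale, unit="s")
    return df[(df[col] >= lo) & (df[col] < hi)]


# Toy samples taken 5.0, 6.0, 7.5, 9.0 and 9.5 seconds after the start
toy = pd.DataFrame(
    {"time_since_start": pd.to_timedelta([5.0, 6.0, 7.5, 9.0, 9.5], unit="s")}
)
window = slice_window(toy, 0.60, 0.90)
print(window)
```

Because the upper bound is exclusive, a sample that falls exactly on a boundary belongs to the next window only, so consecutive windows never share samples.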

Sum up features

We have very few samples to train a model (193 movements). Instead of feeding the raw time series to a neural network, we summarize each movement into a handful of features that simpler models can use.

We're going to use the mean of each attribute for each movement.

numeric_columns = ['xAcceleration', 'yAcceleration', 'zAcceleration', 'pitch', 'yaw', 'roll', 'xRotation', 'yRotation', 'zRotation']
columns = numeric_columns + ['kind']

# One row per movement: the mean of every sensor channel followed by the stroke label
rows = []
for movement in splitted_data:
    d = [movement[column].mean() for column in numeric_columns]
    d.append(movement['kind'].iloc[0])
    rows.append(d)

grouped_movement_data = pd.DataFrame(rows, columns=columns)

grouped_movement_data.groupby('kind')['kind'].hist()

kind
Backhand    Axes(0.125,0.11;0.775x0.77)
Drive       Axes(0.125,0.11;0.775x0.77)
Name: kind, dtype: object

Split data into train and test sets

We use the train_test_split function from sklearn.model_selection to select the train and test sets automatically from the whole dataset. It shuffles the samples and holds out a given fraction for testing, 20% here.

X_train, X_test, y_train, y_test = train_test_split(grouped_movement_data[numeric_columns], grouped_movement_data['kind'], test_size=0.2)
y_train.hist()
y_test.hist()

<Axes: >
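
With a dataset this small, a plain random split can leave the two stroke kinds unevenly represented in the test set, and the result changes on every run. The sketch below shows the stratify and random_state parameters of train_test_split on synthetic data. The toy arrays and the seed value are assumptions, and this is not the split used to produce the results in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the per-movement features: 20 movements, 3 features
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = np.array(["Drive"] * 10 + ["Backhand"] * 10)

# stratify keeps the class proportions in both sets; random_state makes the split repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(sorted(y_te.tolist()))
```

Fixing the seed makes the reported accuracy reproducible, although it still reflects only one particular split.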

Build the model

We're going to compare the accuracy of three algorithms: kNN, SVC and Decision Tree.

# kNN model
nbrs = KNeighborsClassifier(n_neighbors=2).fit(X_train, y_train)
y_pred = nbrs.predict(X_test)

print(f'kNN gives an accuracy of {accuracy_score(y_test, y_pred)*100}%')

kNN gives an accuracy of 100.0%

# SVC model
svc = SVC()
svc.fit(X_train, y_train)

y_pred = svc.predict(X_test)
print(f'SVC gives an accuracy of {accuracy_score(y_test, y_pred)*100}%')

SVC gives an accuracy of 100.0%
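
kNN and SVC both rely on distances between samples, so features with larger numeric ranges can dominate the result. StandardScaler is imported at the top of the notebook but not applied. The sketch below shows one way it could be combined with kNN in a pipeline and scored with 5-fold cross-validation, which is less sensitive to a single lucky split. It runs on synthetic data, and every name in it is an assumption rather than the notebook's own code.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Two synthetic classes; the second feature is pure noise on a much larger scale
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal([0, 0], [1, 100], size=(40, 2)),
    rng.normal([3, 0], [1, 100], size=(40, 2)),
])
y = np.array(["Drive"] * 40 + ["Backhand"] * 40)

# The scaler is refit on each training fold, so no test information leaks into it
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=2))
scores = cross_val_score(model, X, y, cv=5)
print(f"Mean accuracy over 5 folds: {scores.mean():.3f}")
```

Putting the scaler inside the pipeline matters: scaling the whole dataset before splitting would let the test samples influence the training statistics.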

# Decision Tree model
tree = DecisionTreeClassifier()
tree = tree.fit(X_train, y_train)

y_pred = tree.predict(X_test)

print(f'Decision Tree gives an accuracy of {accuracy_score(y_test, y_pred)*100}%')

cm = confusion_matrix(y_test, y_pred, labels=tree.classes_)
cmd = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=tree.classes_)
fig, ax = plt.subplots(figsize=(5,5))
cmd.plot(ax=ax)
plt.grid(False)
plt.show()

Decision Tree gives an accuracy of 97.43589743589743%
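
A fitted DecisionTreeClassifier exposes a feature_importances_ attribute, which can hint at which sensor channels the tree relies on. The following is a minimal sketch on made-up data. The two feature names are borrowed from the dataset for readability only, and the values do not come from the recordings.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

feature_names = ["xAcceleration", "pitch"]
# Toy data: the first feature separates the classes, the second does not
X = np.array([
    [0.0, 5.0], [1.0, 3.0], [0.5, 4.0],
    [10.0, 4.5], [11.0, 3.5], [10.5, 5.5],
])
y = np.array(["Drive", "Drive", "Drive", "Backhand", "Backhand", "Backhand"])

toy_tree = DecisionTreeClassifier(random_state=0).fit(X, y)
for name, importance in zip(feature_names, toy_tree.feature_importances_):
    print(f"{name}: {importance:.2f}")
```

On the real features, this kind of check could show whether a single channel is enough to tell the strokes apart.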

Conclusion

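All three models reach an accuracy close to 100% on the test set: kNN and SVC score 100%, and the Decision Tree about 97.4%. The per-movement mean features are enough to separate Drive and Backhand strokes in this dataset, although the test set contains only 39 movements.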

ivangonzalezz/TennisIOML

Folders and files

Latest commit

History

Repository files navigation

HAR - Tennis stroke classification with ML

Loading data

Prepare data

Timestamp precision

Window data

Sum up features

Split data into train and test sets

Build the model

Conclusion

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Packages