Analysis of Serie A data

I use data from the last 10 Serie A seasons to predict the current season.
All data from the 2010/2011 season through February 27th, 2021 is taken into account.

Step 1: Combine all season data

I combined all 10 CSV files from previous seasons into a single CSV for analysis.
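
A minimal sketch of this step, assuming one CSV per season with a shared header. The `combine_seasons` and `write_combined` helpers and the sample rows are hypothetical, and in-memory text stands in for the files on disk so that the example is self-contained.

```python
import csv
import io

def combine_seasons(season_files):
    """Concatenate per-season CSV files, tagging each row with its season.

    `season_files` maps a season label (e.g. "2010/2011") to an open text
    file. Returns the merged header and the list of row dictionaries.
    """
    rows, fieldnames = [], ["Season"]
    for season, handle in season_files.items():
        reader = csv.DictReader(handle)
        for name in reader.fieldnames or []:
            if name not in fieldnames:
                fieldnames.append(name)
        for row in reader:
            rows.append({"Season": season, **row})
    return fieldnames, rows

def write_combined(fieldnames, rows, out_handle):
    """Write the merged rows; columns missing from a season stay empty."""
    writer = csv.DictWriter(out_handle, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)

# Made-up sample rows, one per season, for illustration only.
season_files = {
    "2010/2011": io.StringIO("Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR\n01/09/10,Roma,Cesena,1,0,H\n"),
    "2011/2012": io.StringIO("Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR\n01/09/11,Milan,Lazio,1,1,D\n"),
}
fieldnames, rows = combine_seasons(season_files)
combined = io.StringIO()
write_combined(fieldnames, rows, combined)
```

With real files, each `io.StringIO` would be replaced by `open(path, newline="")`.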

Step 2: Delete all betting columns

Keep:

Column	Description
Season	Season
Date	Match Date (dd/mm/yy)
AwayTeam	Away Team
HomeTeam	Home Team

AS	Away Team Shots
HS	Home Team Shots
AST	Away Team Shots on Target
HST	Home Team Shots on Target
HTAG	Half Time Away Team Goals
HTHG	Half Time Home Team Goals
HTR	Half Time Result (H=Home Win, D=Draw, A=Away Win); dummies: H=3, D=1, A=0
FTAG	Full Time Away Team Goals
FTHG	Full Time Home Team Goals
FTR	Full Time Result (H=Home Win, D=Draw, A=Away Win); dummies: H=3, D=1, A=0
AC	Away Team Corners
HC	Home Team Corners
AF	Away Team Fouls Committed
HF	Home Team Fouls Committed
AY	Away Team Yellow Cards
HY	Home Team Yellow Cards
AR	Away Team Red Cards
HR	Home Team Red Cards
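
The column filter above can be sketched as follows, assuming the rows are dictionaries keyed by column name. `keep_match_columns` is a hypothetical helper, and `B365H` is only an assumed example of a betting-odds column name.

```python
KEEP_COLUMNS = [
    "Season", "Date", "AwayTeam", "HomeTeam",
    "AS", "HS", "AST", "HST", "HTAG", "HTHG", "HTR",
    "FTAG", "FTHG", "FTR", "AC", "HC", "AF", "HF",
    "AY", "HY", "AR", "HR",
]

def keep_match_columns(rows, keep=KEEP_COLUMNS):
    """Return copies of `rows` restricted to the columns listed in `keep`."""
    return [{col: row[col] for col in keep if col in row} for row in rows]

# Made-up row; "B365H" stands in for any betting column to be dropped.
sample = [{
    "Season": "2010/2011", "HomeTeam": "Roma", "AwayTeam": "Cesena",
    "FTR": "D", "B365H": "1.30",
}]
cleaned = keep_match_columns(sample)
```

Listing the columns to keep, rather than the ones to delete, means that any unexpected extra column is dropped as well.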

Step 3: Correlation

The correlation matrix showed only the very obvious anti-correlations, so not much was learned here.
One observation: away shots on target tend to be either much lower or much higher than home shots on target, so the away team's shot counts are more varied.
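
For reference, a small sketch of how such a correlation matrix can be computed with the Pearson coefficient. The `pearson` and `correlation_matrix` helpers and the four-match sample are made up for illustration; they are not the values from the analysis.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences.

    Assumes both sequences have non-zero variance.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Pairwise correlations for a dict of column name -> list of values."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Made-up values for four matches; FTR uses the H=3, D=1, A=0 encoding.
columns = {
    "FTHG": [2, 0, 1, 3],
    "FTAG": [0, 2, 1, 0],
    "FTR": [3, 0, 1, 3],
}
matrix = correlation_matrix(columns)
```

In this toy sample, home goals and away goals come out anti-correlated, which is the kind of "obvious" relationship described above.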

Step 4: Creating Index for categorical variables

I converted the half-time and full-time results into ordinal columns, with a home win = 3, a draw = 1, and a home loss = 0.
I also converted the home and away teams into numeric values.

The resulting team index:

Home_Team_Index	Team
0	None
1	Atalanta
2	Bari
3	Benevento
4	Bologna
5	Brescia
6	Cagliari
7	Carpi
8	Catania
9	Cesena
10	Chievo
11	Crotone
12	Empoli
13	Fiorentina
14	Frosinone
15	Genoa
16	Inter
17	Juventus
18	Lazio
19	Lecce
20	Livorno
21	Milan
22	Napoli
23	Novara
24	Palermo
25	Parma
26	Pescara
27	Roma
28	Sampdoria
29	Sassuolo
30	Siena
31	Spal
32	Spezia
33	Torino
34	Udinese
35	Verona
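
A minimal sketch of this encoding, assuming teams are numbered alphabetically from 1 with 0 reserved for None, as in the table above. The helper names and the `Away_Team_Index` column name are hypothetical.

```python
# Ordinal encoding of results from the home team's point of view.
RESULT_POINTS = {"H": 3, "D": 1, "A": 0}

def build_team_index(team_names):
    """Map each team to an integer, alphabetically, starting at 1."""
    index = {None: 0}
    for i, team in enumerate(sorted(set(team_names)), start=1):
        index[team] = i
    return index

def encode_match(row, team_index):
    """Replace team names and result letters with their numeric codes."""
    return {
        "Home_Team_Index": team_index[row["HomeTeam"]],
        "Away_Team_Index": team_index[row["AwayTeam"]],
        "HTR": RESULT_POINTS[row["HTR"]],
        "FTR": RESULT_POINTS[row["FTR"]],
    }

teams = ["Bari", "Atalanta", "Bologna", "Benevento"]
team_index = build_team_index(teams)
# Made-up match used only to show the encoding.
encoded = encode_match(
    {"HomeTeam": "Bari", "AwayTeam": "Bologna", "HTR": "D", "FTR": "H"},
    team_index,
)
# encoded == {"Home_Team_Index": 2, "Away_Team_Index": 4, "HTR": 1, "FTR": 3}
```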

Step 5: Logistic Regression

I ended up using softmax (multinomial logistic) regression, as the dependent variable has 3 outcomes.
I first normalized the model's X and y, then compared the model's accuracy against the baseline score; the baseline was pretty poor in comparison.
Using Cohen's kappa score to check for randomness, I found no randomness in the model's predictions.
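
To make these three metrics concrete, here is a small sketch. It assumes the baseline score means "always predict the most frequent class" and that Cohen's score refers to Cohen's kappa. The labels are made up and use the 3/1/0 encoding.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def baseline_accuracy(y_true):
    """Accuracy of always predicting the most frequent class."""
    return Counter(y_true).most_common(1)[0][1] / len(y_true)

def cohen_kappa(y_true, y_pred):
    """Agreement beyond chance: (observed - expected) / (1 - expected).

    Undefined when the expected agreement is exactly 1.
    """
    n = len(y_true)
    observed = accuracy(y_true, y_pred)
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (observed - expected) / (1 - expected)

# Made-up labels for eight matches.
y_true = [3, 3, 1, 0, 3, 0, 1, 3]
y_pred = [3, 3, 1, 0, 3, 1, 1, 0]

acc = accuracy(y_true, y_pred)             # 0.75
base = baseline_accuracy(y_true)           # 0.5
kappa = cohen_kappa(y_true, y_pred)        # about 0.619
```

A kappa near 0 means the predictions agree with the truth no more than chance would; a kappa near 1 means almost perfect agreement.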

Feature Importance - found that:
The biggest factor in a home win is Full_Time_Away_Team_Goals, followed by Away_Team_Shots_on_Target
The biggest factor in home loss is Full_Time_Home_Team_Goals, followed by Half_Time_Result, and Home_Team_Shots_on_Target
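
As a rough illustration of why a large coefficient counts as an important feature in softmax regression: each class gets its own linear score, and the softmax turns the scores into probabilities. The weights and biases below are made up for the sketch, not the fitted values from the analysis.

```python
import math

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def class_probabilities(features, weights, biases):
    """One linear score per class, then softmax across the classes."""
    scores = [
        bias + sum(w * features[name] for name, w in class_weights.items())
        for class_weights, bias in zip(weights, biases)
    ]
    return softmax(scores)

# Hypothetical coefficients for the classes [home loss, draw, home win].
weights = [
    {"FTHG": -2.0, "FTAG": 2.0},
    {"FTHG": 0.0, "FTAG": 0.0},
    {"FTHG": 2.0, "FTAG": -2.0},
]
biases = [0.0, 0.5, 0.0]

# A made-up match in which the home side scored twice and conceded none.
probs = class_probabilities({"FTHG": 2, "FTAG": 0}, weights, biases)
```

With these weights, the home-win class receives by far the highest probability, because the features with the largest coefficients dominate its score.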

I used ridge (L2) regularization instead of lasso (L1) to highlight the categories that had the strongest effect on the dataset.

Step 6: Random Forest

I used a second model, a random forest, to validate the earlier findings. The accuracy, baseline accuracy, and Cohen's kappa score were roughly the same.

Step 7: K-Folds

I used K-fold cross-validation with the random forest model so that the model could be tested on held-out data standing in for "real-world" data, and to see how it behaves on every fold.
The results were generally pretty solid; the model ended at 0.993 accuracy.
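
The fold mechanics can be sketched without any modelling library. In the sketch below, `k_fold_indices` and `cross_validate` are hypothetical helpers, and a majority-class scorer stands in for fitting the random forest so that the example stays self-contained.

```python
import random
from collections import Counter

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train, test) index lists; each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]

def cross_validate(n_samples, k, fit_and_score):
    """Run `fit_and_score(train, test)` on every fold and average the scores."""
    scores = [fit_and_score(train, test) for train, test in k_fold_indices(n_samples, k)]
    return scores, sum(scores) / len(scores)

# Made-up labels in the 3/1/0 encoding.
labels = [3, 3, 1, 0, 3, 0, 1, 3, 3, 1]

def majority_scorer(train, test):
    """Stand-in model: predict the most common training label."""
    majority = Counter(labels[i] for i in train).most_common(1)[0][0]
    return sum(labels[i] == majority for i in test) / len(test)

splits = list(k_fold_indices(len(labels), k=5))
scores, mean_score = cross_validate(len(labels), 5, majority_scorer)
```

The point of the split is that every score is measured on rows the model did not see while fitting on that fold.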

Step 8: Hyperparameters

I used a hyperparameter grid with some deliberately extreme values so that the search could tune the model.
An example is max_features: if the grid search found that 0.05 was the best value, it would mean that most of my features were of little use.
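
A hedged sketch of such a search, assuming scikit-learn is installed and that the grid is run with `GridSearchCV`. The synthetic data from `make_classification` and the grid values are placeholders, not the data or grid from the analysis.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic 3-class data standing in for the match features.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

# A float max_features is the fraction of features tried at each split.
param_grid = {
    "max_features": [0.05, 0.5, 1.0],
    "n_estimators": [25, 50],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,                # 5-fold cross-validation for every combination
    scoring="accuracy",
)
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_
```

If `best_params["max_features"]` came back as the smallest fraction, that would hint that only a few features carry the signal, which is the situation described above.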

Step 9: Predictions

I used the values from the random forest predictions (the softmax regression predictions could also have been used).
I joined these predictions with the index for the Home Teams made earlier.
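
One way predicted results could be turned into a league table is sketched below. It credits both the home and the away side for each predicted result, which is an assumption on my part: the step above only mentions joining on the home team index. `build_table` and the sample predictions are hypothetical.

```python
from collections import defaultdict

# Points earned by each side for a predicted result in the 3/1/0 encoding.
HOME_POINTS = {3: 3, 1: 1, 0: 0}
AWAY_POINTS = {3: 0, 1: 1, 0: 3}

def build_table(predictions):
    """Aggregate (home_team, away_team, predicted_result) into a ranking.

    Ties on points are broken alphabetically by club name.
    """
    points = defaultdict(int)
    for home, away, result in predictions:
        points[home] += HOME_POINTS[result]
        points[away] += AWAY_POINTS[result]
    ranked = sorted(points.items(), key=lambda item: (-item[1], item[0]))
    return [(pos, club, pts) for pos, (club, pts) in enumerate(ranked, start=1)]

# Made-up predictions for three matches.
predictions = [
    ("Juventus", "Roma", 3),
    ("Roma", "Napoli", 1),
    ("Napoli", "Juventus", 0),
]
table = build_table(predictions)
# table == [(1, "Juventus", 6), (2, "Napoli", 1), (3, "Roma", 1)]
```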

The final table looks like this:

Position	Club	Points
1	Juventus	92
2	Roma	91
3	Napoli	83
4	Inter	77
5	Milan	73
6	Sampdoria	58
7	Lazio	58
8	Genoa	49
9	Fiorentina	49
10	Atalanta	49
11	Cagliari	44
12	Parma	36
13	Udinese	35
14	Bologna	34
15	Torino	27
16	Palermo	26
17	Sassuolo	21
18	Chievo	21
19	Empoli	15
20	Catania	15

Accuracy Score = 67.75%

Step 10: Accuracy check

I checked my accuracy in terms of league positions by downloading the current Serie A standings and then devising a points scale on which a higher number of points means a worse prediction.
I got a score of 67.75%, which looks pretty good, but none of the predicted league positions were correct, so the generally high accuracy score can be seen as misleading. For example, Juventus was predicted first when in reality they are currently 4th, which added +4; a better scoring scale could have penalized this more heavily.
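
For comparison, here is a sketch of one possible position-based scale: the sum of absolute position differences, rescaled so that 1 means every position is correct. This is not the scale used above and is not expected to reproduce the 67.75% figure; the helpers and club names are placeholders.

```python
def position_error(predicted, actual):
    """Sum of absolute differences in league position; 0 means perfect.

    Both arguments are lists of club names ordered from first to last.
    """
    if set(predicted) != set(actual):
        raise ValueError("both tables must contain the same clubs")
    actual_pos = {club: pos for pos, club in enumerate(actual, start=1)}
    return sum(
        abs(pos - actual_pos[club])
        for pos, club in enumerate(predicted, start=1)
    )

def position_score(predicted, actual):
    """Rescale the error to 0..1 (needs at least two clubs).

    The largest possible error for n clubs is n * n // 2, reached when
    the predicted table is the actual table reversed.
    """
    n = len(predicted)
    worst = (n * n) // 2
    return 1 - position_error(predicted, actual) / worst

# Placeholder four-club tables.
predicted = ["Club A", "Club B", "Club C", "Club D"]
actual = ["Club D", "Club A", "Club B", "Club C"]

error = position_error(predicted, actual)   # 1 + 1 + 1 + 3 = 6
score = position_score(predicted, actual)   # 1 - 6 / 8 = 0.25
```

Because the predicted table contains clubs that may not be in the current division, the two tables would first have to be restricted to the clubs they share.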

Issues I came across

The final table initially had 35 teams, which I then had to cut to 20. Surprisingly, the table still looks very realistic, so I assume that the extra games taken into account mean that the final table could have had fewer points. The teams at the bottom had almost no points, so I also assume that the model did not take them into account much, purely because they were in the 2nd division for most of that time.
I was not able to make a prediction that used only the most recent season's data. When I tried, it ended up with far fewer points than I wanted.
There seemed to be no randomness in my model, which is concerning, and I fear this is because the model is only using the training data and not the test data. I still have much to learn.

This was my first attempt at something like this, so I will focus on improving this more and more, either with better data or more recent data. If you have any suggestions please let me know!

kingazaan/serie_a_analysis