Our group wanted to know how fans (and the lack of fans) affect player performance. We were also required to use machine learning, so here we are.
We decided to train two models on pre-COVID player stats to predict the outcome of each game: win or loss. This is our "performance with fans" baseline.
Next we run the fanless per-game player stats through the trained models to quantify the impact of missing fans on our predictions.
We pulled per-game player stats from 2015 through 2020 via the SportRadar API: first the Schedule for each year, then a loop through its games calling Game Summary for the per-player stats.
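The pull followed that Schedule-then-Game-Summary pattern. Here is a minimal sketch of how those calls can be built; the endpoint paths, version, and access level below are illustrative assumptions, since the real SportRadar URLs depend on your plan and API version:

```python
# Sketch of the pull loop. The SportRadar NBA endpoint paths below are
# assumptions for illustration, not copied from our actual code.
API_KEY = "YOUR_SPORTRADAR_KEY"  # hypothetical placeholder
BASE = "https://api.sportradar.us/nba/trial/v7/en"

def schedule_url(year):
    # one Schedule call per season...
    return f"{BASE}/games/{year}/REG/schedule.json?api_key={API_KEY}"

def game_summary_url(game_id):
    # ...then one Game Summary call per game for the per-player stats
    return f"{BASE}/games/{game_id}/summary.json?api_key={API_KEY}"

# the loop itself (requires the requests package and a real key):
# for year in range(2015, 2021):
#     schedule = requests.get(schedule_url(year)).json()
#     for game in schedule["games"]:
#         summary = requests.get(game_summary_url(game["id"])).json()
```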
The two models are a Random Forest classifier and a Sequential neural network with hidden layers.
On to the code!
# get the latest and greatest scikit-learn for your models!
# (note: the package is 'scikit-learn'; the 'sklearn' PyPI name is deprecated)
!pip install scikit-learn --upgrade
# next import dependencies for data preprocessing
import pandas as pd
from datetime import datetime as dt
import numpy as np
import requests
import config
import json
import time
import datetime
from pprint import pprint
# pull in the per game player stats with fans, conveniently saved as a csv
df = pd.read_csv('data_files/fullplayerstatslist.csv')
# here's a peek at the raw data
df.head()
# let's trim the data down to our X features...
df_dropped = df[df['Min_played'] != "00:00"]
df_dropped = df_dropped[df_dropped['Crowd'] != 'Covid']
df_dropped = df_dropped[df_dropped['Crowd'] != '0']
df_dropped = df_dropped[["Points", "Free_Throw_Percent", "Two_Pt_Percent",
                         "Three_Pt_Percent", "Assists", "Rebounds",
                         "Offensive_Rebounds", "Steals", "Personal_Fouls",
                         "Flagrant_Fouls", "Tech_Fouls", "Turnovers",
                         "Home_Away", "win"]].reset_index(drop=True)
df_dropped.head()
# grab every stat except the 'win' column as your X features
X = df_dropped.drop('win', axis=1)
print(X.shape)
# set your y to predict to 'win'
y = df_dropped['win']
print(y.shape)
# now import the tools to train and scale
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from tensorflow.keras.utils import to_categorical
# split X and y into train and test groups
X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=42)
# now scale the X data to keep everything reasonable
X_scaler = MinMaxScaler().fit(X_train)
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)
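MinMaxScaler learns each feature's min and max from the training data only, then applies those same bounds everywhere, so the test set never leaks into the fit. A minimal pure-Python sketch of the formula:

```python
# x_scaled = (x - train_min) / (train_max - train_min)
def minmax_fit(column):
    # learn the bounds from the training data only
    return min(column), max(column)

def minmax_transform(column, lo, hi):
    # apply the training bounds to any data (train or test)
    return [(x - lo) / (hi - lo) for x in column]

train = [10, 20, 30]            # e.g. a toy Points column
lo, hi = minmax_fit(train)
print(minmax_transform(train, lo, hi))   # [0.0, 0.5, 1.0]
print(minmax_transform([40], lo, hi))    # test values can land outside [0, 1]
```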
Now our stats are the X features and our target y is 'win'. Aside from a little more y preprocessing for the Sequential model, we're ready to start building models!
# we tried a few n_estimators settings (100, 1000) and landed on 200 through trial and error
from sklearn.ensemble import RandomForestClassifier
# define the model
modelRF = RandomForestClassifier(n_estimators=200)
# train on training data
modelRF.fit(X_train_scaled, y_train)
# let's see how our Random Forest did!
print(f"Training Data Score: {round(modelRF.score(X_train_scaled, y_train)*100,2)}%")
print(f"Testing Data Score: {round(modelRF.score(X_test_scaled, y_test)*100,2)}%")
# last, since we're running Random Forest, let's rank the top features
feature_names = X.columns.tolist()
preSelected_features = sorted(zip(modelRF.feature_importances_, feature_names), reverse=True)
ranked_features = pd.DataFrame(preSelected_features, columns=['Score', 'Feature'])
ranked_features = ranked_features.set_index('Feature')
ranked_features
Random Forest gets us to 69% accuracy predicting win/loss from per-player game stats - not bad!
Unsurprisingly, Points, Rebounds, Shooting Percentage, and Assists have the greatest impact.
So the pre-COVID Random Forest model is ready to go. Next, let's get Sequential with a few hidden layers.
# Sequential deep learning picked because a similar model predicted male/female voices in our class exercise; we're after the same kind of binary decision: win or loss
# the Sequential model threw an error on the raw labels the first time out, so we run y through LabelEncoder first (and it worked!)
label_encoder = LabelEncoder()
label_encoder.fit(y_train)
encoded_y_train = label_encoder.transform(y_train)
encoded_y_test = label_encoder.transform(y_test)
# Then we need to convert y labels to one-hot-encoding
y_train_categorical = to_categorical(encoded_y_train)
y_test_categorical = to_categorical(encoded_y_test)
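A minimal sketch of what the LabelEncoder + to_categorical combination does: each distinct label gets an integer index (sorted, as LabelEncoder does), and each index becomes a one-hot row:

```python
def one_hot(labels):
    # map each distinct label to an index (LabelEncoder sorts its classes)
    classes = sorted(set(labels))
    index = {c: i for i, c in enumerate(classes)}
    # one-hot: a row of zeros with a 1 at the label's index
    return [[1 if index[lab] == i else 0 for i in range(len(classes))]
            for lab in labels]

print(one_hot(["win", "loss", "win"]))
# [[0, 1], [1, 0], [0, 1]]
```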
# import the Sequential model and Dense for the hidden layers
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
# Create model and add layers
# tried increasing units by 100 per layer (e.g. 100, 200, 300, 400); less accurate
# tried fewer and more hidden layers; best accuracy came from three additional layers of 100 units, trained for 100 epochs
# input_dim set to 13 because we have 13 X features!
# hidden layers use relu, the standard nonlinearity for hidden layers
# the final layer has 2 units with softmax, so the outputs are win/loss probabilities that sum to 1
model = Sequential()
model.add(Dense(units=100, activation='relu', input_dim=13))
model.add(Dense(units=100, activation='relu'))
model.add(Dense(units=100, activation='relu'))
model.add(Dense(units=100, activation='relu'))
model.add(Dense(units=2, activation='softmax'))
# Compile and fit the model
# optimizer, loss, and metrics set same as the male/female vocal predictions - keeping it binary
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
# let's summarize and make sure we're ready to train!
model.summary()
# ok time to train our Sequential model
# tried fewer epochs but we had computer processing power to spare and the accuracy went up until we hit 100
model.fit(
    X_train_scaled,
    y_train_categorical,
    epochs=100,
    shuffle=True,
    verbose=2
)
# let's see how our Sequential model did!
model_loss, model_accuracy = model.evaluate(
    X_test_scaled, y_test_categorical, verbose=2)
print(f"Normal Neural Network - Loss: {model_loss}, Accuracy: {model_accuracy}")
# last, since we're running Sequential, let's lay out the predictions against the actuals
# (model.predict_classes was removed in newer TensorFlow; argmax over predict gives the same class indices)
encoded_predictions = np.argmax(model.predict(X_test_scaled), axis=1)
prediction_labels = label_encoder.inverse_transform(encoded_predictions)
print(f"First 10 Predictions: {prediction_labels[:10]}")
print(f"First 10 Actual labels: {y_test[:10].tolist()}")
# we can even put them all together into a data frame
pd.DataFrame({"Prediction": prediction_labels, "Actual": y_test}).reset_index(drop=True)
Sequential comes in at 60% test accuracy (training accuracy climbed to 68% by epoch 100). Again, not bad!
Also don't forget to save your model with: model.save('models/deepLearningSequential.h5')
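The Random Forest can be persisted too. Here's a sketch using joblib, a common choice for sklearn models; the file path and the tiny stand-in model are our illustration, not from the project:

```python
import os
import tempfile

import joblib
from sklearn.ensemble import RandomForestClassifier

# tiny stand-in model, just to show the save/load round trip
rf = RandomForestClassifier(n_estimators=5, random_state=42)
rf.fit([[0], [1], [2], [3]], [0, 0, 1, 1])

# save the fitted model to disk, then reload it later
path = os.path.join(tempfile.gettempdir(), "randomForest.joblib")
joblib.dump(rf, path)
rf_loaded = joblib.load(path)

# the reloaded model predicts exactly like the original
print(rf_loaded.predict([[0], [3]]))
```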
Now it's time to see how our pre-COVID models perform with fanless player game stats.
# pull in the fanless (COVID-era) player data
covidDF = pd.read_csv('data_files/player_stats_2019_pst.csv')
# dropping unwanted fields to match preCovid columns
# don't drop Covid games obviously!
covidDF_dropped = covidDF[covidDF['Min_played'] != "00:00"]
covidDF_dropped = covidDF_dropped[["Points", "Free_Throw_Percent", "Two_Pt_Percent",
                                   "Three_Pt_Percent", "Assists", "Rebounds",
                                   "Offensive_Rebounds", "Steals", "Personal_Fouls",
                                   "Flagrant_Fouls", "Tech_Fouls", "Turnovers",
                                   "Home_Away", "win"]].reset_index(drop=True)
covidDF_dropped.head()
# same X features
X = covidDF_dropped.drop('win', axis=1)
print(X.shape)
# same y goal
y = covidDF_dropped['win']
print(y.shape)
# scale the X features with the same scaler fitted on the pre-COVID training data
X_predict_scaled = X_scaler.transform(X)
print(f"COVID Data Score: {round(modelRF.score(X_predict_scaled, y)*100,2)}%")
Playing without fans drops our Random Forest accuracy by 15 percentage points (54% vs. 69% on the with-fans test data).
# reuse the label encoder fitted on the training labels, then one-hot encode as before
encoded_y_actual = label_encoder.transform(y)
y_actual_categorical = to_categorical(encoded_y_actual)
model_loss, model_accuracy = model.evaluate(
    X_predict_scaled, y_actual_categorical, verbose=2)
print(f"Normal Neural Network - Loss: {model_loss}, Accuracy: {model_accuracy}")
Playing without fans drops our Sequential accuracy by 8 percentage points (down to 52% from 60%).