Our group wanted to know how fans (and the lack of fans) affect player performance. We were also required to use machine learning, so here we are.
We decided to train two models on pre-COVID player stats to predict the outcome of each game: win or loss. This is our "performance with fans" baseline.
Next we run the fanless per-game player stats through the trained models to quantify the impact of missing fans on our predictions.
We pulled per-game player stats from 2015 through 2020 via the SportRadar API: first the Schedule for each year, then a loop through its games calling Game Summary for the per-player stats.
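The pull followed that Schedule-then-Game-Summary pattern. Here is a minimal sketch of how those calls can be built; the endpoint paths, version, and access level below are illustrative assumptions, since the real SportRadar URLs depend on your plan and API version:

```python
# Sketch of the pull loop. The SportRadar NBA endpoint paths below are
# assumptions for illustration, not copied from our actual code.
API_KEY = "YOUR_SPORTRADAR_KEY"  # hypothetical placeholder
BASE = "https://api.sportradar.us/nba/trial/v7/en"

def schedule_url(year):
    # one Schedule call per season...
    return f"{BASE}/games/{year}/REG/schedule.json?api_key={API_KEY}"

def game_summary_url(game_id):
    # ...then one Game Summary call per game for the per-player stats
    return f"{BASE}/games/{game_id}/summary.json?api_key={API_KEY}"

# the loop itself (requires the requests package and a real key):
# for year in range(2015, 2021):
#     schedule = requests.get(schedule_url(year)).json()
#     for game in schedule["games"]:
#         summary = requests.get(game_summary_url(game["id"])).json()
```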
The two models are a Random Forest classifier and a Sequential neural network with hidden layers.
On to the code!
# get the latest and greatest scikit-learn for your models!
# (note: the package is 'scikit-learn'; the 'sklearn' PyPI name is deprecated)
!pip install scikit-learn --upgrade
# next import dependencies for data preprocessing
import pandas as pd
from datetime import datetime as dt
import numpy as np
import requests
import config
import json
import time
import datetime
from pprint import pprint
# pull in the per game player stats with fans, conveniently saved as a csv
df = pd.read_csv('data_files/fullplayerstatslist.csv')
# here's a peek at the raw data
df.head()
# let's trim the data down to our X features...
df_dropped = df[df['Min_played'] != "00:00"]
df_dropped = df_dropped[df_dropped['Crowd'] != 'Covid']
df_dropped = df_dropped[df_dropped['Crowd'] != '0']
df_dropped = df_dropped[["Points", "Free_Throw_Percent", "Two_Pt_Percent",
                         "Three_Pt_Percent", "Assists", "Rebounds",
                         "Offensive_Rebounds", "Steals", "Personal_Fouls",
                         "Flagrant_Fouls", "Tech_Fouls", "Turnovers",
                         "Home_Away", "win"]].reset_index(drop=True)
df_dropped.head()
# grab every stat except the 'win' column as your X features
X = df_dropped.drop('win', axis=1)
print(X.shape)
# set your y to predict to 'win'
y = df_dropped['win']
print(y.shape)
# now import the tools to train and scale
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from tensorflow.keras.utils import to_categorical
# split X and y into train and test groups
X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=42)
# now scale the X data to keep everything reasonable
X_scaler = MinMaxScaler().fit(X_train)
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)
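MinMaxScaler learns each feature's min and max from the training data only, then applies those same bounds everywhere, so the test set never leaks into the fit. A minimal pure-Python sketch of the formula:

```python
# x_scaled = (x - train_min) / (train_max - train_min)
def minmax_fit(column):
    # learn the bounds from the training data only
    return min(column), max(column)

def minmax_transform(column, lo, hi):
    # apply the training bounds to any data (train or test)
    return [(x - lo) / (hi - lo) for x in column]

train = [10, 20, 30]            # e.g. a toy Points column
lo, hi = minmax_fit(train)
print(minmax_transform(train, lo, hi))   # [0.0, 0.5, 1.0]
print(minmax_transform([40], lo, hi))    # test values can land outside [0, 1]
```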
Now our stats are the X features and our target y is 'win'. Aside from a little more y preprocessing for the Sequential model, we're ready to start building models!
# we tried a few n_estimators settings (100, 1000) and landed on 200 through trial and error
from sklearn.ensemble import RandomForestClassifier
# define the model
modelRF = RandomForestClassifier(n_estimators=200)
# train on training data
modelRF.fit(X_train_scaled, y_train)
# let's see how our Random Forest did!
print(f"Training Data Score: {round(modelRF.score(X_train_scaled, y_train)*100,2)}%")
print(f"Testing Data Score: {round(modelRF.score(X_test_scaled, y_test)*100,2)}%")
# last, since we're running Random Forest, let's rank the top features
feature_names = X.columns.tolist()
preSelected_features = sorted(zip(modelRF.feature_importances_, feature_names), reverse=True)
ranked_features = pd.DataFrame(preSelected_features, columns=['Score', 'Feature'])
ranked_features = ranked_features.set_index('Feature')
ranked_features
Random Forest gets us to 69% accuracy predicting win/loss from per-player game stats - not bad!
Unsurprisingly, Points, Rebounds, Shooting Percentage, and Assists have the greatest impact.
So the pre-COVID Random Forest model is ready to go. Next, let's get Sequential with a few hidden layers.
# Sequential deep learning picked because a similar model predicted male/female voices in our class exercise; we're after the same kind of binary decision: win or loss
# the Sequential model threw an error on the raw labels the first time out, so we run y through LabelEncoder first (and it worked!)
label_encoder = LabelEncoder()
label_encoder.fit(y_train)
encoded_y_train = label_encoder.transform(y_train)
encoded_y_test = label_encoder.transform(y_test)
# Then we need to convert y labels to one-hot-encoding
y_train_categorical = to_categorical(encoded_y_train)
y_test_categorical = to_categorical(encoded_y_test)
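A minimal sketch of what the LabelEncoder + to_categorical combination does: each distinct label gets an integer index (sorted, as LabelEncoder does), and each index becomes a one-hot row:

```python
def one_hot(labels):
    # map each distinct label to an index (LabelEncoder sorts its classes)
    classes = sorted(set(labels))
    index = {c: i for i, c in enumerate(classes)}
    # one-hot: a row of zeros with a 1 at the label's index
    return [[1 if index[lab] == i else 0 for i in range(len(classes))]
            for lab in labels]

print(one_hot(["win", "loss", "win"]))
# [[0, 1], [1, 0], [0, 1]]
```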
# import the Sequential model and Dense for the hidden layers
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
# Create model and add layers
# tried increasing units by 100 per layer (e.g. 100, 200, 300, 400); less accurate
# tried fewer and more hidden layers; best accuracy came from three additional layers of 100 units, trained for 100 epochs
# input_dim set to 13 because we have 13 X features!
# hidden layers use relu, the standard nonlinearity for hidden layers
# the final layer has 2 units with softmax, so the outputs are win/loss probabilities that sum to 1
model = Sequential()
model.add(Dense(units=100, activation='relu', input_dim=13))
model.add(Dense(units=100, activation='relu'))
model.add(Dense(units=100, activation='relu'))
model.add(Dense(units=100, activation='relu'))
model.add(Dense(units=2, activation='softmax'))
# Compile and fit the model
# optimizer, loss, and metrics set same as the male/female vocal predictions - keeping it binary
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
# let's summarize and make sure we're ready to train!
model.summary()
# ok time to train our Sequential model
# tried fewer epochs but we had computer processing power to spare and the accuracy went up until we hit 100
model.fit(
    X_train_scaled,
    y_train_categorical,
    epochs=100,
    shuffle=True,
    verbose=2
)
# let's see how our Sequential model did!
model_loss, model_accuracy = model.evaluate(
    X_test_scaled, y_test_categorical, verbose=2)
print(f"Normal Neural Network - Loss: {model_loss}, Accuracy: {model_accuracy}")
# last, since we're running Sequential, let's lay out the predictions against the actuals
# (model.predict_classes was removed in newer TensorFlow; argmax over predict gives the same class indices)
encoded_predictions = np.argmax(model.predict(X_test_scaled), axis=1)
prediction_labels = label_encoder.inverse_transform(encoded_predictions)
print(f"First 10 Predictions: {prediction_labels[:10]}")
print(f"First 10 Actual labels: {y_test[:10].tolist()}")
# we can even put them all together into a data frame
pd.DataFrame({"Prediction": prediction_labels, "Actual": y_test}).reset_index(drop=True)
Sequential comes in at 60% test accuracy (training accuracy climbed to 68% by epoch 100). Again, not bad!
Also don't forget to save your model with: model.save('models/deepLearningSequential.h5')
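The Random Forest can be persisted too. Here's a sketch using joblib, a common choice for sklearn models; the file path and the tiny stand-in model are our illustration, not from the project:

```python
import os
import tempfile

import joblib
from sklearn.ensemble import RandomForestClassifier

# tiny stand-in model, just to show the save/load round trip
rf = RandomForestClassifier(n_estimators=5, random_state=42)
rf.fit([[0], [1], [2], [3]], [0, 0, 1, 1])

# save the fitted model to disk, then reload it later
path = os.path.join(tempfile.gettempdir(), "randomForest.joblib")
joblib.dump(rf, path)
rf_loaded = joblib.load(path)

# the reloaded model predicts exactly like the original
print(rf_loaded.predict([[0], [3]]))
```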
Now it's time to see how our pre-COVID models perform with fanless player game stats.
# pull in the fanless (COVID-era) player data
covidDF = pd.read_csv('data_files/player_stats_2019_pst.csv')
# dropping unwanted fields to match preCovid columns
# don't drop Covid games obviously!
covidDF_dropped = covidDF[covidDF['Min_played'] != "00:00"]
covidDF_dropped = covidDF_dropped[["Points", "Free_Throw_Percent", "Two_Pt_Percent",
                                   "Three_Pt_Percent", "Assists", "Rebounds",
                                   "Offensive_Rebounds", "Steals", "Personal_Fouls",
                                   "Flagrant_Fouls", "Tech_Fouls", "Turnovers",
                                   "Home_Away", "win"]].reset_index(drop=True)
covidDF_dropped.head()
# same X features
X = covidDF_dropped.drop('win', axis=1)
print(X.shape)
# same y goal
y = covidDF_dropped['win']
print(y.shape)
# scale the X features with the same scaler fitted on the pre-COVID training data
X_predict_scaled = X_scaler.transform(X)
print(f"COVID Data Score: {round(modelRF.score(X_predict_scaled, y)*100,2)}%")
Playing without fans drops our Random Forest accuracy by 15 percentage points (54% vs. 69% on the with-fans test data).
# reuse the label encoder fitted on the training labels, then one-hot encode as before
encoded_y_actual = label_encoder.transform(y)
y_actual_categorical = to_categorical(encoded_y_actual)
model_loss, model_accuracy = model.evaluate(
    X_predict_scaled, y_actual_categorical, verbose=2)
print(f"Normal Neural Network - Loss: {model_loss}, Accuracy: {model_accuracy}")
Playing without fans drops our Sequential accuracy by 8 percentage points (down to 52% from 60%).