Predicting Wins at the Highest Level in the NBA

By Bhavesh Kemburu

The NBA Finals represent the biggest stage in basketball, with one team from each conference pursuing the ultimate prize. Winning the ship often results in lucrative contracts during the offseason, statues built (literally) outside the stadium, and the support of many fanbases. Some series have been heavily one-sided while others have been extremely close, often pivoting in the opposite direction after a single sequence of events. I want to see whether stats tell the story of how teams won each game during the NBA Finals. I will analyze each team's Finals stats, other than points scored, to see if I can predict the winner of each NBA Finals and, furthermore, which areas appear most pivotal to winning individual games at the highest stage. The dataset starts in 1980, the end of the season in which the NBA added the three point line (1979-80).

If you are unfamiliar with basketball terminology, this stats link is very helpful.

Data Collection

The dataset can be found in this Kaggle link. According to its description, it was scraped from Basketball Reference. I recommend taking a look at Basketball Reference for more insight into how basketball data is organized and collected.

Loading the Data

The two CSV files we are extracting data from are identical in structure, but one contains the data for the winner of each series, while the other contains the data for the loser. It is important to note that each series is a best of seven: the first team to win four games takes the series, so the series winner can still lose individual games. We will look at individual game stats as well as averages over each series.
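As a rough sketch of this loading step, the snippet below reads two identically structured CSV sources and tags each row with whether the team won the series. The column names (Year, Team, Game, Win, PTS), the helper load_games, and the rows themselves are made up for illustration; the real files are much wider and may name things differently.

```python
import csv
import io

# Tiny stand-ins for the two files. All values are made up, and the column
# names are assumptions rather than the dataset's confirmed headers.
CHAMPIONS_CSV = """Year,Team,Game,Win,PTS
2001,Champs,1,1,109
2001,Champs,2,0,104
"""
RUNNERS_UP_CSV = """Year,Team,Game,Win,PTS
2001,RunnersUp,1,0,102
2001,RunnersUp,2,1,107
"""

def load_games(handle, series_winner):
    """Read one file and tag every row with whether the team won the series."""
    rows = []
    for row in csv.DictReader(handle):
        row["SeriesWinner"] = series_winner
        rows.append(row)
    return rows

# With real files, pass open(path, newline="") instead of io.StringIO(...).
champions = load_games(io.StringIO(CHAMPIONS_CSV), series_winner=True)
runners_up = load_games(io.StringIO(RUNNERS_UP_CSV), series_winner=False)

# The series winner can still lose individual games (Win == "0").
games_lost_by_champion = [row for row in champions if row["Win"] == "0"]
print(len(champions), len(runners_up), len(games_lost_by_champion))  # 2 2 1
```

Note that csv.DictReader returns every value as a string, so numeric columns need converting before any arithmetic.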

Removing Irrelevant Statistics

Fortunately, the data has been scraped thoroughly, so there is not much preprocessing we need to conduct. In determining whether a team will win a particular game or series, I have chosen to remove certain statistics because they are either irrelevant or make the winner too obvious. Here are the statistics I have chosen to remove:

Game - Game #. Given that we have year, the game number won't be useful in grouping each series.

MP - Total minutes played. This is the same for both teams in a game.

FG - Field Goals Made. Together with TP and FT, this makes points scored trivial to calculate.

FGA - Field Goals Attempted. We will use percentage instead.

TP - Three Pointers Made. Same reason as FG.

TPA - Three Pointers Attempted. Same reason as FGA.

FT - Free Throws Made. Same reason as FG.

FTA - Free Throws Attempted. Same reason as FGA.

PTS - Points scored. Obviously, whoever scores more points wins the game. We will keep it for now for further data analysis.
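To see why the made-shot columns give the result away, here is a small sketch showing that points can be rebuilt from FG, TP, and FT alone. It assumes FG counts both two and three pointers, as Basketball Reference box scores do; the helper names and the box score line are made up for illustration.

```python
# Columns removed in the list above (PTS is kept for now).
DROPPED = {"Game", "MP", "FG", "FGA", "TP", "TPA", "FT", "FTA"}

def points_from_makes(fg, tp, ft):
    """Points implied by made shots. FG already includes threes (assumption),
    so each made three adds one extra point on top of the two counted via FG."""
    return 2 * fg + tp + ft

def drop_columns(row, dropped=DROPPED):
    """Return a copy of a game row without the removed statistics."""
    return {name: value for name, value in row.items() if name not in dropped}

# Made-up box score line for illustration.
game = {"Game": 1, "MP": 240, "FG": 40, "FGA": 85, "TP": 10, "TPA": 28,
        "FT": 15, "FTA": 20, "PTS": 105, "AST": 24}

implied_points = points_from_makes(game["FG"], game["TP"], game["FT"])
print(implied_points == game["PTS"])  # True
print(drop_columns(game))             # {'PTS': 105, 'AST': 24}
```

A model given FG, TP, and FT for both teams could simply learn this sum, which is why those columns are removed rather than kept as features.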

Analyzing Individual Stats

From my anecdotal experience watching NBA games, here are my predictions for how each statistic relates to winning games:

Home - Home/Away Team --> The home team is usually favored to win a game, so I'll give the nod to the home team here.

ORB - Offensive rebounds --> I'll say fewer offensive rebounds for the winning team, since it will miss fewer shots overall.

DRB - Defensive rebounds --> I'll say more defensive rebounds for the winning team, since it is boxing out after the opposing team misses.

TRB - Total rebounds --> More rebounds overall, since defensive rebounds are more common than offensive.

AST - Assists --> I'm a firm believer that team basketball wins championships. However, I also know that star players dominate the game. Certain teams, like the Warriors, are more assist heavy.

STL - Steals --> Defense wins championships too, so I'll go for more steals which lead to fastbreak points. However, this could also tie into assists/turnovers, so more research needed here.

BLK - Blocks --> I'll side with the team with more blocks.

TOV - Turnovers --> More assists will lead to more turnovers. Perhaps assist/turnover ratio could help.

PF - Personal fouls --> Also difficult to analyze. Generally, more fouls lead to more free throws, which slows down game pace. However, fouls can often be due to tighter defense, end of game situations (fouling rather than giving up a shot), and other factors. I think only large differences in fouls make an impact, and even then not a very sizeable one.

TPP - three point percentage --> This could vary widely per era, so I would say that higher three point percentage is preferred especially in later eras.

FGP - field goal percentage --> Higher field goal percentage wins.

FTP - free throw percentage --> Normally higher free throw percentage is better.
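The assist/turnover ratio floated under TOV above is simple to compute from the two columns. This is a minimal sketch with made-up numbers; the function name and the choice to return None when a team has no turnovers are my own assumptions.

```python
def assist_turnover_ratio(ast, tov):
    """Assists per turnover. Returns None when there are no turnovers
    so a division by zero never leaks into later averages."""
    if tov == 0:
        return None
    return ast / tov

# Made-up team lines: the second team passes more but also gives the ball away more.
careful_team = assist_turnover_ratio(24, 12)
loose_team = assist_turnover_ratio(30, 20)
print(careful_team)  # 2.0
print(loose_team)    # 1.5
```

The second team has more assists in raw terms but a worse ratio, which is the kind of distinction the raw AST and TOV columns hide.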

Data Processing and Analysis

Let's plot relationships for winning teams and their individual statistics over the years, as well as differentials between the winning and losing teams of each series.

Winning Team in a Series Statistics Over Time

Here are some of the trends observed from teams that won.

Field Goal Percentage - Overall field goal percentage shows a decreasing trend according to the regression plot.

Three Point Percentage - Some early teams did not attempt any threes at all over the course of a series. More threes appear to be attempted over time, and the percentage rises as a result.

Free Throw Percentage - Relatively static on average.

Average Offensive Rebounds - Has decreased on average over time.

Average Defensive Rebounds - Has slightly increased over time.

Average Total Rebounds - Has decreased on average over time, mostly due to the dropoff in offensive rebounds.

Average Assists Per Game - Has decreased (surprisingly) over time on average. Some interesting data points, however, are the above-average totals in recent years, which is unsurprising: these tweets about the Warriors summarize why this team was so dominant for a stretch of years.

Average Steals Per Game - No clear trend here, only a slight decrease on average.

Average Blocks Per Game - No clear trend here either.

Average Turnovers Per Game - Slight decrease over time. Seems to correlate with a decrease in average assists.

Average Personal Fouls Per Game - Slight decrease on average over time. Might be due to a greater amount of three point attempts, meaning fewer drives overall into the paint as a percentage of total possessions.
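The increasing and decreasing labels above come from reading a regression line fit to per-series averages. A minimal sketch of that calculation follows, using an ordinary least-squares slope; the Year column name, the helper names, and the rebound counts are assumptions made up for illustration.

```python
from collections import defaultdict

def series_averages(rows, stat):
    """Average a per-game stat over each year's series -> {year: mean}."""
    by_year = defaultdict(list)
    for row in rows:
        by_year[int(row["Year"])].append(float(row[stat]))
    return {year: sum(vals) / len(vals) for year, vals in sorted(by_year.items())}

def trend_slope(averages):
    """Ordinary least-squares slope of the series average against the year."""
    years = list(averages)
    values = list(averages.values())
    mean_x = sum(years) / len(years)
    mean_y = sum(values) / len(values)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, values))
    sxx = sum((x - mean_x) ** 2 for x in years)
    return sxy / sxx

# Made-up offensive rebound counts for the series winner, two games per year.
rows = [
    {"Year": "2001", "ORB": "16"}, {"Year": "2001", "ORB": "14"},
    {"Year": "2002", "ORB": "13"}, {"Year": "2002", "ORB": "13"},
    {"Year": "2003", "ORB": "12"}, {"Year": "2003", "ORB": "10"},
]
averages = series_averages(rows, "ORB")
slope = trend_slope(averages)
print(averages)  # {2001: 15.0, 2002: 13.0, 2003: 11.0}
print(slope)     # -2.0
```

A negative slope corresponds to a "decreased over time" reading; a slope near zero corresponds to "no clear trend".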

Differentials Per Series Average (Winning Versus Losing Team)

The above data is interesting and could potentially be used to support theories on how the game of basketball has evolved over time. However, to better understand whether these stats helped teams win games, we also need to look at how the losing team in each series performed. We can plot the average differentials between winning and losing teams over the course of each series, year by year.
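One way to compute such a differential is to average a stat over each team's games in a series and subtract the loser's average from the winner's. The sketch below assumes a Year column identifies the series; the helper names and assist totals are made up for illustration.

```python
from collections import defaultdict

def series_mean(rows, stat):
    """Mean of a per-game stat for each year's series -> {year: mean}."""
    by_year = defaultdict(list)
    for row in rows:
        by_year[row["Year"]].append(row[stat])
    return {year: sum(vals) / len(vals) for year, vals in by_year.items()}

def series_differentials(winners, losers, stat):
    """Series winner's average minus series loser's average, per year."""
    win_avg = series_mean(winners, stat)
    lose_avg = series_mean(losers, stat)
    return {year: win_avg[year] - lose_avg[year]
            for year in sorted(win_avg) if year in lose_avg}

# Made-up assist totals, two games per series.
winners = [{"Year": 2001, "AST": 26}, {"Year": 2001, "AST": 24},
           {"Year": 2002, "AST": 20}, {"Year": 2002, "AST": 22}]
losers = [{"Year": 2001, "AST": 22}, {"Year": 2001, "AST": 24},
          {"Year": 2002, "AST": 23}, {"Year": 2002, "AST": 21}]

assist_diff = series_differentials(winners, losers, "AST")
print(assist_diff)  # {2001: 2.0, 2002: -1.0}
```

A positive value means the series winner averaged more of the stat than the loser; values hovering around zero suggest the stat did not separate the two teams.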

Looking at the differentials across stats, many show no clear variation and stay relatively close to zero. I was also wrong in several of my predictions, such as assists and free throw percentage being highly important. Thus, predicting the winner of an NBA Finals series solely from series-averaged team stats does not seem promising.

Exploring Competitiveness of Series

Because many of these graphs don't show much variability in stats between the winning and losing teams of each series, I want to see how competitive each series was. Let's show the distribution of 4, 5, 6, and 7 game series during this period. My plots above are averages over each series, so it would make sense for the differentials to be lower in series with more games.

As you can see, the majority of series last 6 games, and there are as many sweeps as 7 game series. Since a larger share of series go 6 or 7 games than 4 or 5, I do not believe that these statistics are enough to predict who wins an NBA series: they are averaged over each series, and a series average can be skewed by one incredible performance in a single game. To verify, let's plot average point differential as well.
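The series lengths behind that distribution can be recovered by counting one team's rows per year. This is a small sketch under the assumption that a Year column identifies the series; the function name and the rows are made up for illustration.

```python
from collections import Counter

def length_distribution(rows):
    """Count how many series lasted 4, 5, 6, or 7 games.

    rows: per-game rows for ONE of the two teams (for example the winners
    file), so that each game is counted once. Assumes a Year column."""
    games_per_year = Counter(row["Year"] for row in rows)
    return Counter(games_per_year.values())

# Made-up years: one sweep and two six-game series.
rows = [{"Year": 2001}] * 4 + [{"Year": 2002}] * 6 + [{"Year": 2003}] * 6
distribution = length_distribution(rows)
print(sorted(distribution.items()))  # [(4, 1), (6, 2)]
```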

From the chart above, the winning team's point differential averages about 5 points per game. However, for 6 and 7 game series the average point differential is much lower, denoting a more competitive series compared to 4 and 5 game series. Thus, using these statistics averaged over a series may not be enough to predict which team wins a series. However, using machine learning, we can try to predict individual game wins.
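To make the comparison by series length concrete, the sketch below groups each series' average point differential by the number of games played. Column names (Year, PTS), the helpers, and the scores are made up for illustration.

```python
from collections import defaultdict

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def point_diff_by_length(winners, losers):
    """Average per-game point differential (series winner minus loser),
    grouped by how many games the series lasted."""
    win_pts, lose_pts = defaultdict(list), defaultdict(list)
    for row in winners:
        win_pts[row["Year"]].append(row["PTS"])
    for row in losers:
        lose_pts[row["Year"]].append(row["PTS"])
    by_length = defaultdict(list)
    for year, pts in win_pts.items():
        by_length[len(pts)].append(mean(pts) - mean(lose_pts[year]))
    return {length: mean(diffs) for length, diffs in sorted(by_length.items())}

def make_rows(year, points):
    return [{"Year": year, "PTS": p} for p in points]

# Made-up scores: a lopsided sweep and a tight seven-game series.
winners = make_rows(2001, [110] * 4) + make_rows(2002, [104, 104, 104, 104, 94, 95, 95])
losers = make_rows(2001, [100] * 4) + make_rows(2002, [96, 96, 96, 96, 100, 101, 101])

result = point_diff_by_length(winners, losers)
print(result)  # {4: 10.0, 7: 2.0}
```

In the seven-game example the series winner lost three games, which pulls its average margin toward zero even though it won each of its four games by eight points.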

Machine Learning

Preparing Data

Because our data is in two different CSV files, we will need to combine them into one table. Each feature column in the new dataset will be the difference between the corresponding columns of the two files, since the purpose is to compare the stats of the two teams in each game. We do not need the individual team names, since there is already a column that records the W/L result of each game.
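A minimal sketch of this preparation step follows. It assumes the two files list the same games in the same order (so rows can be paired by position), that a Win column holds 1 when that row's team won the game, and that the label is taken from the series winner's row; these are illustrative assumptions, and the numbers are made up.

```python
def build_differentials(champions, runners_up, features):
    """One row per game: series winner's stat minus series loser's stat for
    each feature, labelled with whether the series winner won that game."""
    if len(champions) != len(runners_up):
        raise ValueError("both files must list the same games in the same order")
    X, y = [], []
    for champ, runner in zip(champions, runners_up):
        if champ["Year"] != runner["Year"]:
            raise ValueError("rows are misaligned between the two files")
        X.append([champ[name] - runner[name] for name in features])
        y.append(champ["Win"])
    return X, y

# Made-up rows for a two-game stretch; the real feature list is longer.
FEATURES = ["TRB", "AST", "TOV"]
champions = [
    {"Year": 2001, "Win": 1, "TRB": 45, "AST": 25, "TOV": 12},
    {"Year": 2001, "Win": 0, "TRB": 38, "AST": 20, "TOV": 16},
]
runners_up = [
    {"Year": 2001, "Win": 0, "TRB": 40, "AST": 21, "TOV": 15},
    {"Year": 2001, "Win": 1, "TRB": 44, "AST": 24, "TOV": 11},
]

X, y = build_differentials(champions, runners_up, FEATURES)
print(X)  # [[5, 4, -3], [-6, -4, 5]]
print(y)  # [1, 0]
```

Because the series winner also loses some games, the label column contains both classes, which a classifier needs in order to learn anything.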

Decision Tree Classifier

We will employ a decision tree classifier using our various statistics. I recommend reading more about the decision tree classifier model here. I have chosen a decision tree classifier because I am predicting whether a team wins or loses a game, a binary variable. We first need to create a new table that shows the differential between the winner and loser tables. We will also split our data into training and testing sets, with 90% of the data for training and 10% for testing. Since accuracy scores could vary between splits, I'll run 50 iterations of the decision tree and take an average.

I noticed high variation in the accuracy scores when using the decision tree, so I decided to average 10 times over 20 iterations of the train/test split. Analyzing these results, we can see that a 90/10 split produces high variance among the accuracies, which range from 80% to 85%. From these accuracy results, there does not seem to be a clear pattern in how the classifier predicts each time.
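The repeated 90/10 evaluation can be sketched without any external packages. To stay dependency-free, the snippet below swaps in a depth-one decision tree (a decision stump) for the full classifier and runs on synthetic differentials, so the printed accuracy says nothing about the real dataset. In practice a library implementation such as scikit-learn's DecisionTreeClassifier would take the stump's place. All names, the seeds, and the data-generating rule are assumptions for illustration.

```python
import random
import statistics

def fit_stump(X, y):
    """Depth-one decision tree: choose the single feature/threshold split
    with the best training accuracy. Returns (feature, threshold, label_above)."""
    best = None
    for f in range(len(X[0])):
        for threshold in sorted({row[f] for row in X}):
            # Correct predictions if rows above the threshold are called wins.
            correct = sum((row[f] > threshold) == bool(label)
                          for row, label in zip(X, y))
            for label_above, score in ((1, correct), (0, len(y) - correct)):
                if best is None or score > best[0]:
                    best = (score, f, threshold, label_above)
    return best[1:]

def predict_stump(stump, row):
    f, threshold, label_above = stump
    return label_above if row[f] > threshold else 1 - label_above

def split_accuracy(X, y, test_fraction, rng):
    """Shuffle, hold out test_fraction of the games, return test accuracy."""
    order = list(range(len(X)))
    rng.shuffle(order)
    n_test = max(1, round(len(X) * test_fraction))
    test, train = order[:n_test], order[n_test:]
    stump = fit_stump([X[i] for i in train], [y[i] for i in train])
    hits = sum(predict_stump(stump, X[i]) == y[i] for i in test)
    return hits / n_test

def repeated_accuracy(X, y, iterations=50, test_fraction=0.1, seed=0):
    """Mean and standard deviation of accuracy over repeated random splits."""
    rng = random.Random(seed)
    scores = [split_accuracy(X, y, test_fraction, rng) for _ in range(iterations)]
    return statistics.mean(scores), statistics.stdev(scores)

# Synthetic stand-in for about 220 games: the outcome depends noisily on the
# first feature only. These are NOT real Finals numbers.
data_rng = random.Random(42)
X, y = [], []
for _ in range(220):
    fgp_diff = data_rng.gauss(0, 5)
    ast_diff = data_rng.gauss(0, 4)
    won = 1 if fgp_diff + data_rng.gauss(0, 3) > 0 else 0
    X.append([fgp_diff, ast_diff])
    y.append(won)

mean_acc, spread = repeated_accuracy(X, y)
print(f"mean accuracy {mean_acc:.2f}, standard deviation {spread:.2f}")
```

With only about 22 games in each test set, a single misclassified game moves the accuracy by roughly 4.5 percentage points, which is one plausible source of the spread between runs.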

Conclusion

As a quick recap, we wanted to analyze trends in team statistics in games played during the NBA Finals. We first cleaned up our data by removing irrelevant columns for analysis. We then took a look at what the individual statistics might mean. We then plotted trends of winning teams in terms of each statistic, as well as differentials between winning and losing teams for each statistic. We also looked at how competitive series were in terms of 4, 5, 6, and 7 game series that may skew statistics in a certain way, and plotted point differentials and colored points differently based on series length. Finally, we incorporated machine learning to predict how well statistics could be used on their own to determine whether a team won a particular game.

From the high variance in the results and the relatively small sample size of games (~220), I'd say statistics on their own are not enough to predict whether a team wins a particular game. I was also incorrect in some of my expectations about winning teams, such as assist differential having a high impact.

For future analyses, we could look at more advanced stats, such as offensive rating and assist:turnover ratio, to see whether these more refined stats could more accurately predict the outcome of games.

Hope you enjoyed reading this, and learned a thing or two about NBA stats!