Predicting the 2020 NBA Champion with Machine Learning

using historical league rankings of regular season stats

Trustin Yoon
Towards Data Science


I, among millions of others, was absolutely devastated when COVID-19 suspended the NBA season back in March. As a die-hard Lakers fan, I was excited to see LeBron and co. finally take us to the promised land. But after suffering through the last 6 years of utter mediocrity without a single playoff appearance, and while still mourning the death of the great Kobe Bean Bryant (RIP), I thought,

Did it really have to be this season that a freak invisible virus comes out of nowhere when the Lakers finally have a shot at the title?

But on July 31st, basketball is back.

The restart is still a month away, though, and I’ve already consumed every kind of basketball content imaginable, from grainy 70s NBA footage to 6th grade AAU highlights. I wanted to offer the basketball community a little glimpse into the season’s future while we patiently wait for tip-off in Orlando. I’ve always loved deep analysis of teams and predicting the champion every season before the playoffs start. And after a month of learning how to code and use data analysis tools, I felt ready to start my first machine learning project to predict the 2020 NBA Champion. (The full GitHub repo can be found here: https://github.com/trustinyoon/2020-NBA-Chip-Predictor)

Data Collection

Why I used teams’ league rankings of regular season stats

I wanted to create a model that used data solely from regular season games to predict the NBA champion. I have always been fascinated by a team’s regular season rankings and believe that they hold immense weight in determining which teams make deep runs in the playoffs. Initially, I thought about using per game team stats. However, the game has changed so drastically in terms of strategy, pace, play style, athletes, and even the rules themselves that comparing per game stats across different seasons would have introduced too much noise. For example, the average points per game scored by a team in the 2003–2004 season was 93.4, whereas last season’s was 111.2. It is inaccurate to compare players or teams across different eras using the raw numbers alone.

The best way to determine a team’s strengths and weaknesses is to compare its performance relative to the rest of the competition for each season. Therefore, I decided to use regular season league rankings in every team stat category as features to train my model to predict the number of playoff wins. Using league rankings of team stats instead of per game team stats shows a clearer picture of what strategies and executions are most influential in determining playoff success across various seasons.
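As a rough illustration of why ranks travel across eras better than raw numbers, here is a minimal pandas sketch that ranks teams within each season. The team names, column names, and numbers are made up for the example and are not taken from the scraped data.

```python
import pandas as pd

# Hypothetical per game scoring for three teams in two seasons (made-up numbers).
stats = pd.DataFrame({
    "season": [2004, 2004, 2004, 2019, 2019, 2019],
    "team": ["A", "B", "C", "A", "B", "C"],
    "pts_per_game": [93.0, 97.5, 90.1, 111.0, 118.1, 107.3],
})

# Rank each team against the rest of the league within its own season.
# ascending=False gives rank 1 to the highest value; method="min" lets
# tied teams share the better rank.
stats["pts_rank"] = (
    stats.groupby("season")["pts_per_game"]
    .rank(ascending=False, method="min")
    .astype(int)
)

print(stats)
```

Team B scores about 20 more points per game in the 2019 rows than in the 2004 rows, yet it is ranked 1st in both, which is the kind of era-independent signal the rankings are meant to capture.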

Scraping

To collect the data, I used Beautiful Soup to scrape the regular season league rankings of every playoff team since the 2002–2003 season from basketball-reference.com. League rankings are numbered 1–30 since there are 30 total teams: a league rank of 1 is the best and 30 is the worst. I did not scrape older seasons because the playoff format changed in 2003 so that winning an NBA Championship requires winning four best-of-7 series, which was not the case in prior seasons. I also did not include teams that missed the playoffs since that would skew my dependent variable, the number of playoff wins.

Example of one team’s regular season table that I scraped from basketball-reference.com. The league rankings are shown in the ‘Lg Rank’ rows.
This table was scraped to collect the data of number of playoff wins for every playoff team since 2003 for my y variable.
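For readers new to Beautiful Soup, the sketch below shows the general idea of pulling a ‘Lg Rank’ row out of an HTML table. The HTML snippet, the table id, the numbers, and the helper name `parse_league_ranks` are all hypothetical stand-ins; the real page layout on basketball-reference.com may differ, and this is not the scraping code from the repo.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified HTML standing in for one team's stats table.
# The id, layout, and numbers are assumptions made for this example.
html = """
<table id="team_stats">
  <thead><tr><th></th><th>FG%</th><th>3P%</th><th>DRB</th></tr></thead>
  <tbody>
    <tr><th>Team</th><td>.480</td><td>.349</td><td>35.5</td></tr>
    <tr><th>Lg Rank</th><td>1</td><td>21</td><td>9</td></tr>
  </tbody>
</table>
"""

def parse_league_ranks(page_html):
    """Return {stat name: league rank} from the 'Lg Rank' row of the table."""
    soup = BeautifulSoup(page_html, "html.parser")
    table = soup.find("table", id="team_stats")
    # Column names come from the header row; skip the empty first cell.
    columns = [th.get_text(strip=True) for th in table.thead.find_all("th")][1:]
    for row in table.tbody.find_all("tr"):
        if row.th.get_text(strip=True) == "Lg Rank":
            ranks = [int(td.get_text(strip=True)) for td in row.find_all("td")]
            return dict(zip(columns, ranks))
    raise ValueError("no 'Lg Rank' row found")

ranks = parse_league_ranks(html)
print(ranks)
```

In a real scrape, the HTML string would come from an HTTP request for each team and season page rather than from an inline snippet.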

Data Cleaning

After constructing the dataframes in Pandas, I ran a correlation matrix to filter out the league rankings (independent variables) that had a Pearson correlation coefficient of less than 0.25 with number of playoff wins (dependent variable). This left me with the following league rankings that were at least moderately correlated with playoff wins: field goal %, 3 point %, 2 point %, defensive rebounds, opponent field goal %, opponent 2 point %, opponent blocks, regular season wins, margin of victory, simple rating system (SRS), overall offensive rating, overall defensive rating, effective field goal %, opponent effective field goal %, and attendance.
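A correlation filter of this kind can be sketched in a few lines of pandas. The data and column names below are made up, and the sketch compares the absolute value of the coefficient to the cutoff, which is an assumption on my part: a better rank is a smaller number, so a useful ranking should correlate negatively with playoff wins.

```python
import pandas as pd

# Made-up rankings and playoff wins for eight hypothetical playoff teams.
df = pd.DataFrame({
    "srs_rank":     [1, 2, 3, 4, 5, 6, 7, 8],
    "efg_rank":     [2, 1, 4, 3, 6, 5, 8, 7],
    "pace_rank":    [5, 3, 3, 5, 5, 3, 3, 5],
    "playoff_wins": [16, 12, 10, 7, 5, 4, 2, 1],
})

# Pearson correlation of every feature with the target.
corr_with_wins = df.corr()["playoff_wins"].drop("playoff_wins")

# Keep features whose correlation magnitude reaches the 0.25 cutoff
# (using the absolute value is an assumption, see the note above).
selected = corr_with_wins[corr_with_wins.abs() >= 0.25].index.tolist()

print(corr_with_wins.round(3))
print(selected)
```

Here `srs_rank` and `efg_rank` survive the filter while `pace_rank`, which is almost unrelated to wins in this toy data, is dropped.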

Multicollinearity of selected features

Being ranked in the top 3 for Regular Season Wins (W), Margin of Victory (MOV), and Simple Rating System (SRS) seems to be a decent indicator of an eventual NBA Champion. However, SRS and MOV are highly correlated since point differential is used in calculating both. I decided to remove MOV since SRS is slightly more accurate in that it takes strength of schedule into account. Attendance was also dropped since the remainder of the 2020 season will be played at Disney World in Orlando with no live crowds.

Effective Field Goal Percentages (eFG% and O_eFG%) are highly correlated with their corresponding Field Goal Percentages (2P%, O_2P%, FG%, O_FG%), so I took out the latter four since Effective Field Goal Percentages calculate a slightly more accurate rate of shooting efficiency. eFG% is also slightly more correlated with number of playoff wins.
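One simple way to spot multicollinearity like this is to list every pair of features whose correlation exceeds a cutoff. The sketch below uses made-up rankings and a 0.8 cutoff; both are assumptions for illustration, not the values or threshold used in the project.

```python
from itertools import combinations

import pandas as pd

# Made-up rankings for eight hypothetical teams.
features = pd.DataFrame({
    "srs_rank": [1, 2, 3, 4, 5, 6, 7, 8],
    "mov_rank": [1, 2, 4, 3, 5, 6, 8, 7],
    "drb_rank": [5, 3, 3, 5, 5, 3, 3, 5],
})

corr = features.corr()

# Report each feature pair once if its correlation magnitude is above 0.8
# (the cutoff is an assumption chosen for this example).
high_pairs = [
    (a, b, round(corr.loc[a, b], 3))
    for a, b in combinations(features.columns, 2)
    if abs(corr.loc[a, b]) > 0.8
]

print(high_pairs)
```

For each flagged pair, one feature can then be dropped by judgment, as was done above when keeping SRS over MOV.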

After checking for multicollinearity, I was left with the following features for my models:

  • 3 point %
  • defensive rebounds
  • opponent blocks
  • regular season wins
  • simple rating system
  • overall offensive rating
  • overall defensive rating
  • effective field goal %
  • opponent effective field goal %

Exploratory Data Analysis

Championship Team Regular Season League Rankings over Time

There is more variability in the 2000s than in the 2010s in a champion’s regular season offensive stat rankings. The most interesting trend here is 3P%, as the Golden State Warriors have revolutionized basketball with their 3 point shooting since 2014. The 3 pointer has become the most important shot in the game due to its higher expected value and its ability to space the floor to create more open 2 point field goals. Champions of recent years generally rank in the top 5–10 for each offensive team stat and in the top 5 for 3P%.

Regular season team defense stat rankings seem to have much less variability than team offense stat rankings, and being ranked in the top 5–10 in these three categories can be a good indicator of a playoff champion. Defensive rebounding (DRB) is important, but slightly less so than limiting opponent field goal efficiency (O_eFG%) and overall defensive rating (DRtg).

The number of wins (W) and Simple Rating System (SRS) in the regular season seem to be the most consistent indicators of a champion. It is rare for a championship team to rank outside the top 5 in number of regular season wins or SRS.

Underrated influence of limiting opponent’s blocks

I found Opponent’s Blocks (O_BLK) to be the most interesting feature that is moderately correlated with playoff wins. This finding made me hypothesize that teams that rank near the top of the league in O_BLK (that is, teams whose shots are blocked least often) are harder to defend, which translates into more wins. Additionally, the majority of blocks in basketball come on very short range shots, so teams whose opponents are less able to block shots can score more easily and more efficiently, since a short range shot (e.g. a layup or dunk) has a much higher probability of going in than a mid-range 2 or a 3 pointer.

Prediction Models

Projected number of playoff wins for each of the 22 teams invited to the Orlando 2020 season restart (panels, left to right: Linear Regression, Random Forests, XGBoost)

I used cross-validation when fitting the Random Forests and XGBoost models, which helped lower their mean absolute errors (MAEs) and limit overfitting.
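For reference, cross-validated MAE for a random forest can be computed with scikit-learn roughly as follows. This is a sketch on synthetic data: the three features, the relationship between rank and wins, the 5 folds, and the hyperparameters are all assumptions and do not reproduce the models in the repo.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data: 160 hypothetical playoff teams, 3 rank features.
rng = np.random.default_rng(0)
n_teams = 160
X = rng.integers(1, 31, size=(n_teams, 3)).astype(float)
# Assumed relationship: a better (lower) first-feature rank means more wins.
y = np.clip(np.round(12 - 0.4 * X[:, 0] + rng.normal(0, 2, n_teams)), 0, 16)

model = RandomForestRegressor(n_estimators=100, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# scikit-learn reports MAE as a negative score (higher is better), so flip the sign.
neg_mae = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_absolute_error")
fold_mae = -neg_mae

print("MAE per fold:", fold_mae.round(2))
print("Mean MAE:", round(fold_mae.mean(), 2))
```

Averaging the error over several folds gives a steadier estimate than a single train/test split, which matters with a dataset this small.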

In each model, the Milwaukee Bucks are projected to have the most playoff wins with the Lakers just trailing 😒. The reigning champs Toronto Raptors surprisingly rank high as well despite being largely written off by many sports analysts after the departure of the 2019 Finals MVP Kawhi Leonard.

But before I bet all my stimulus money that the Bucks will win it all this season, there are a few things to consider first.

Conclusion

Limitations

My main concern with the data is that the sample of championship teams (those with 16 playoff wins) is very small. As a result, the best team in each model will almost always have an expected win count below 16, which is not enough to win the title. I therefore interpreted the predictions as a relative scale and treated the team with the highest predicted number of wins as the projected champion.

The models’ MAEs also varied with each re-split of the test and training data. The values generally ranged between 2 and 3 for each model. I used the predictions and MAEs from a random trial for the summary findings.
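One way to quantify this split-to-split variation is to repeat the train/test split with different seeds and look at the spread of the resulting MAEs. The sketch below does this for a linear regression on synthetic data; the data, the 25% test size, and the 20 repetitions are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 160 hypothetical playoff teams, 3 rank features.
rng = np.random.default_rng(0)
n_teams = 160
X = rng.integers(1, 31, size=(n_teams, 3)).astype(float)
y = np.clip(np.round(12 - 0.4 * X[:, 0] + rng.normal(0, 2, n_teams)), 0, 16)

maes = []
for seed in range(20):
    # A different seed gives a different random train/test partition.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed
    )
    model = LinearRegression().fit(X_train, y_train)
    maes.append(mean_absolute_error(y_test, model.predict(X_test)))

maes = np.array(maes)
print(f"MAE over {len(maes)} splits: mean={maes.mean():.2f}, std={maes.std():.2f}")
```

Reporting the mean and standard deviation over many splits would be a more stable summary than the MAE from a single random trial.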

Summary

Using regular season league rankings, I found that ranking near the top of the league in Wins (W) and Simple Rating System (SRS) during the regular season is the best indicator of a champion. In recent seasons, ranking higher in offensive rating than in defensive rating has also translated into a greater chance of winning the title.

I judged the Random Forest Regressor to give the most useful predictions: it had an MAE of 2.65 and projected the Bucks as favorites for the title with 13 expected wins. XGBoost had a smaller MAE of 2.44; however, it predicted that the first-place Bucks would win only 9.3 games, which is far off from the 16 needed for a championship.

The Milwaukee Bucks were chosen by each model to be the favorite to win the title. The Lakers, Raptors, and Clippers are usually mixed in the standings behind the Bucks.

Future Work

Using the predicted number of wins is probably not the most effective way to determine the championship team. I will look to enhance my model by estimating each team’s probability of winning 16 playoff games instead of its expected number of wins.
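One possible shape for that future work, sketched here under heavy assumptions, is to train a classifier on a champion / non-champion label and rescale its outputs so they sum to 1 across the teams in a given postseason. The synthetic data, the single SRS rank feature, the choice of logistic regression, and the rescaling step are all illustrative and are not part of the current project.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic history: 17 hypothetical seasons with 16 playoff teams each.
rng = np.random.default_rng(0)
X_rows, y_rows = [], []
for _ in range(17):
    srs_rank = rng.choice(np.arange(1, 31), size=16, replace=False)
    # Assumed link: lower rank plus some noise decides the champion.
    strength = srs_rank + rng.normal(0, 3, size=16)
    champion = np.zeros(16, dtype=int)
    champion[np.argmin(strength)] = 1  # exactly one champion per season
    X_rows.append(srs_rank.reshape(-1, 1))
    y_rows.append(champion)

X = np.vstack(X_rows).astype(float)
y = np.concatenate(y_rows)

clf = LogisticRegression().fit(X, y)

# Hypothetical SRS ranks for four teams in the season being predicted.
current_ranks = np.array([[1.0], [3.0], [5.0], [12.0]])
raw = clf.predict_proba(current_ranks)[:, 1]

# Only one team can win the title, so rescale the scores to sum to 1.
title_prob = raw / raw.sum()
print(title_prob.round(3))
```

With only one champion per season the classes are very imbalanced, so any real version of this would need careful validation before its probabilities could be trusted.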

I plan to keep updating the dataset every season and continue to refine the existing models/add new ones. There is a lot of room for improvement in the algorithms, parameters, and statistics I used considering this is my first data science project.
