Every year, right before the weather breaks from winter to spring, college
basketball fans near and far face 63 of the toughest decisions of their year:
how to fill out their tournament brackets for March Madness. With a lot riding
on how well your first round selections do, these picks cannot be taken
lightly. By taking a closer look at what a team and its coach have done in the
past and in the season just prior to the tournament, we attempt to produce a
statistical analysis of the probability that a team wins its first round game.
Looking at last year's tournament, where we already know the outcome of the
games, we use our data source to estimate these first round winning
probabilities. With these probabilities we can analyze outcomes such as the
probability of a team beating a higher seed. We could even take the analysis a
step further and see whether we have found any strong trends that could help
us make future predictions in the tournament.
Data Base:
Our data was found by way of the World Wide Web. While the majority of the
information was found at www.ncaa.com and www.collegerpi.com, we also visited
many of the colleges' and universities' men's basketball web sites. The
database we produced consists of all 64 teams from the 2000-2001 NCAA
basketball tournament. Along with this list of teams we include the following
information: tournament seeds (where a 1 seed is high and a 16 seed is
considered low), strength of schedule percentage and rank from the 2000-2001
regular season (where higher percentages and lower-numbered rankings indicate
a good rating), wins from the 2000-2001 regular season, losses from the
2000-2001 regular season, winning percentage from the 2000-2001 regular
season, coach's all-time winning percentage in the NCAA tournament, number of
wins in the 2000-2001 NCAA tournament, and RPI percentage and ranking from the
2000-2001 regular season (where higher percentages and lower-numbered rankings
indicate a good rating).
Conditional Probabilities:
Conditional probabilities tell us the chance of one event happening given that
we already know that another related event has occurred. Here you'll see four
tables that give the conditional probabilities of teams winning a game given
that we already know: the coach's historical winning % in the tournament,
strength of schedule ranking, the team's winning % from the regular season,
and top/bottom 32 seeds.
To read these tables, we must first understand the columns and rows. The columns are broken down into Top 32 and Bottom 32. For each table, we first listed the 64 tournament teams from highest to lowest on the variable in question, then broke them into two categories: the higher group, or "Top 32", and the lower group, or "Bottom 32". The breaking point between the top and bottom is stated in parentheses. The rows are simply broken down into "Wins" and "Losses".
Table P-1: Probability of Winning Given Coaches Historical Winning % in the Tournament
| | Top 32 (Win% > .419) | Bottom 32 (Win% < .419) | Total |
| Wins | 23/32 = .72 | 9/32 = .28 | 32 |
| Losses | 9/32 = .28 | 23/32 = .72 | 32 |
| Total | 32 | 32 | 64 |
Interpretation:
Table P-1 gives us the conditional probability of a team winning its first
round game given that its coach's historical winning percentage in the
tournament ranked among the top 32 of the 64 teams. Of the teams whose coach
had a percentage greater than .419, 72% won their first game. The correlation
and covariance can be seen in Tables C-1 and CV-1, respectively. The
covariance between the two is positive, indicating a positive relationship
(the better the coach's winning %, the more likely a team is to win its first
game). The correlation between the two is .494, which isn't close to 1, yet it
still shows a moderate relationship between the two.
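As a rough illustration of how the entries in a table like P-1 are computed, the following minimal Python sketch divides the number of winners in each group by the size of that group. The counts are taken from Table P-1; the function name `conditional_probability` is ours and purely illustrative.

```python
def conditional_probability(wins_in_group, group_size):
    """Return P(win | group): wins inside the group divided by the group size."""
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    return wins_in_group / group_size

# Counts from Table P-1 (coach's tournament winning % above / below .419).
top_wins, top_total = 23, 32
bottom_wins, bottom_total = 9, 32

p_win_given_top = conditional_probability(top_wins, top_total)
p_win_given_bottom = conditional_probability(bottom_wins, bottom_total)

print(f"P(win | top 32)    = {p_win_given_top:.2f}")
print(f"P(win | bottom 32) = {p_win_given_bottom:.2f}")
```

The same function applies to Tables P-2 through P-4 by swapping in their counts.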
Table P-2: Probability of Winning Given Strength of Schedule
| | Top 32 (SOS Ranking < 59) | Bottom 32 (SOS Ranking > 59) | Total |
| Wins | 19/32 = .59 | 13/32 = .41 | 32 |
| Losses | 13/32 = .41 | 19/32 = .59 | 32 |
| Total | 32 | 32 | 64 |
Interpretation:
Table P-2 gives us the conditional probability of a team winning its first
round game given that its S.O.S. (strength of schedule) ranking is better than
59 (1 is the best possible ranking). We found that teams with a ranking number
below 59 won 59% of their first round games. The covariance between the two
(also found in Table CV-1) is positive, indicating that the harder the
schedule you play, the better your chance of winning the first game. Our
correlation for S.O.S. and winning the first game is .358, which shows a
modest positive relationship and suggests that your strength of schedule does
matter when you get into the tournament.
Table P-3: Probability of Winning Given Winning % from Regular Season
| | Top 32 (Win% > .71428) | Bottom 32 (Win% < .71875) | Total |
| Wins | 21/32 = .66 | 11/32 = .34 | 32 |
| Losses | 11/32 = .34 | 21/32 = .66 | 32 |
| Total | 32 | 32 | 64 |
Interpretation:
Table P-3 gives us the conditional probability of a team winning its first
round game given its winning percentage from the regular season. The top 32
teams are those with a winning percentage greater than .71428. Teams with a
percentage greater than .71428 won 66% of their first round games. The
correlation and covariance between the two, found in Tables C-1 and CV-1, are
.345 and .039, respectively. From these statistics, we conclude that there is
a positive relationship between regular season winning percentage and winning
the first game in the tournament (the better your winning percentage in the
regular season, the better chance you have of winning your first game).
Table P-4: Probability of Winning Given a Higher Seed for Game 1
| | Top 32 (32 top seeds) | Bottom 32 (32 bottom seeds) | Total |
| Wins | 19/32 = .59 | 13/32 = .41 | 32 |
| Losses | 13/32 = .41 | 19/32 = .59 | 32 |
| Total | 32 | 32 | 64 |
Interpretation:
Table P-4 gives us the conditional probability of a team winning its first
round game given that it was the higher seed (the highest seed being a 1, the
lowest a 16) in that game. Given that a team had the higher seed in game one,
we found that 59% of these teams won. We found the covariance to be -3.38 and
the correlation to be -.547, both negative. The negative sign does not mean
that better seeds do worse; it comes from the way seeds are coded. The
computer treats the seed as a number, so a 16 seed is a "higher" value than a
1 seed. As the seed number increases toward 16, the chance of winning that
first game decreases, so this output makes sense. In magnitude, .547 is the
largest of the four correlations discussed in this section.
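The covariance and correlation quoted above are tied together by the standard formula: the correlation is the covariance divided by the product of the two standard deviations. A minimal Python sketch, using the entries from Table CV-1 (the function name is ours, for illustration only):

```python
import math

def correlation_from_covariance(cov_xy, var_x, var_y):
    """Pearson correlation: covariance over the product of standard deviations."""
    return cov_xy / math.sqrt(var_x * var_y)

# Values read from Table CV-1.
cov_seed_wins = -3.3828  # Tournament Seed vs. NCAA Tournament Wins
var_seed = 21.25         # diagonal entry for Tournament Seed
var_wins = 1.7966        # diagonal entry for NCAA Tournament Wins

r = correlation_from_covariance(cov_seed_wins, var_seed, var_wins)
print(f"correlation(seed, tournament wins) = {r:.4f}")
```

The result can be compared against the Tournament Seed entry in the NCAA Tournament Wins row of Table C-1.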
Table SS-1: Summary Statistics for all 64 Teams
| Statistic | SOS Rank | Wins | Loss | Winning % | RPI | Coaches Winning % |
| Mean | 91.9219 | 21.5313 | 8.2031 | .7241 | 47 | .3485 |
| Median | 59.5 | 21 | 8.5 | .7165 | 32.5 | .4095 |
| Mode | 33 | 21 | 9 | .75 | N/A | 0 |
| StdDev | 88.6352 | 2.817 | 2.5646 | .0854 | 43.877 | .3035 |
| Min | 1 | 15 | 2 | .5333 | 1 | 0 |
| Max | 316 | 29 | 14 | .9286 | 188 | .857 |
| Count | 64 | 64 | 64 | 64 | 64 | 64 |
Table SS-2: Summary Statistics for 32 First Round Winners
| Statistic | SOS Rank | Wins | Loss | Winning % | RPI | Coaches Winning % |
| Mean | 69.0938 | 22.5938 | 7.4063 | .7541 | 30.7813 | .4787 |
| Median | 41 | 23 | 7 | .7626 | 23 | .5975 |
| Mode | 33 | 21 | 6 | .8065 | N/A | 0 |
| StdDev | 72.3479 | 2.4078 | 2.5255 | .0806 | 29.8639 | .2859 |
| Min | 3 | 18 | 2 | .6 | 1 | 0 |
| Max | 316 | 29 | 12 | .9285 | 132 | .857 |
| Count | 32 | 32 | 32 | 32 | 32 | 32 |
These two tables provide us with summary statistics. The statistic that may be
most interesting to look at is simply the mean for each category in Tables
SS-1 and SS-2. Let's see what we can find by comparing the means from both
tables.
The mean SOS ranking for all 64 teams is 91.9219, and the mean SOS ranking for
the 32 first round winners is 69.0938. With 1 being the best rank a team could
have, the difference is in the favorable direction: the mean for the 32 first
round winners is 22.8281 ranking places better than the mean for all 64 teams.
This is a fairly large difference. It could be a helpful bit of information
for future predictions: on average the winning teams from round one had an SOS
ranking of about 69, so teams ranked much worse than about 70 may be less
likely to win.
Moving on to the mean winning %: since winning % is basically a summary of
wins and losses, let's compare the mean winning % from the tables. In Table
SS-1 the mean winning % for all 64 teams is .7241, and in Table SS-2 the mean
winning % is .7541. Though the mean for the winners is higher, the difference
between the two means is small, only .03. This statistic doesn't differ much
between all 64 teams and the 32 first round winners, and so may not be very
helpful for future predictions.
The mean RPIs for both tables closely resemble the pattern of the SOS ranking,
because the RPI ratings take SOS into consideration. Our mean RPI in Table
SS-1 is 47, and our mean RPI in Table SS-2 is 30.7813. The best RPI ranking a
team can have is 1; with this in mind, the 32 first round winners have a mean
that is 16.2187 places better. Taking this into consideration may be useful in
making future predictions about first round tournament games: on average the
winners' RPI was around a ranking of 30. So in the future, you may consider
teams with an RPI ranking of 30 or better to have a greater chance of winning.
Finally we take a look at the coach's historical winning % in the tournament.
The mean in SS-1 is .3485, and the mean in SS-2 is .4787. In terms of winning
percentage, the mean for the 32 first round winners is about 13 percentage
points better. Across all 64 coaches, the average is winning about 1/3 of
their tournament games; for the 32 first round winners, the coaches improve to
a winning % close to .5, or half of their games. This could be helpful to take
into consideration when predicting the outcome of future first round games,
because in our data coaches with a higher winning % were more likely to win
than coaches with a lower winning %.
These stats give us some interesting feedback that may be helpful in making
future predictions. Something to keep in mind, though, is that these stats are
only averages, and averages from only one year's tournament. So we can't put
full stock in what has been produced, but we can use this info as a starting
point, something to consider for the future.
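The comparison of means above can be reproduced with a few lines of Python. This is only a sketch: the means are copied from Tables SS-1 and SS-2, and the helper name `mean_differences` is ours. Keep in mind that for SOS Rank and RPI a negative difference is an improvement, because lower rankings are better.

```python
# Means copied from Tables SS-1 (all 64 teams) and SS-2 (32 first round winners).
means_all = {
    "SOS Rank": 91.9219, "Wins": 21.5313, "Loss": 8.2031,
    "Winning %": 0.7241, "RPI": 47.0, "Coaches Winning %": 0.3485,
}
means_winners = {
    "SOS Rank": 69.0938, "Wins": 22.5938, "Loss": 7.4063,
    "Winning %": 0.7541, "RPI": 30.7813, "Coaches Winning %": 0.4787,
}

def mean_differences(all_teams, winners):
    """Return the winners' mean minus the overall mean for each column."""
    return {col: winners[col] - all_teams[col] for col in all_teams}

diffs = mean_differences(means_all, means_winners)
for col, d in diffs.items():
    print(f"{col:18s} {d:+.4f}")
```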
Regression and Significance of the Regression Output:
SUMMARY OUTPUT

Regression Statistics
| Multiple R | 0.591993 |
| R Square | 0.3504557 |
| Adjusted R Square | 0.3179785 |
| Standard Error | 1.1157019 |
| Observations | 64 |

ANOVA
| | df | SS | MS | F | Significance F |
| Regression | 3 | 40.29692663 | 13.43230888 | 10.79081626 | 9.11E-06 |
| Residual | 60 | 74.68744837 | 1.244790806 | | |
| Total | 63 | 114.984375 | | | |

| | Coefficients | Standard Error | t Stat | P-value |
| Intercept | 5.1842585 | 3.369993832 | 1.538358465 | 0.129218432 |
| Coaches Winning % | 0.9141747 | 0.620327506 | 1.473696822 | 0.145790666 |
| Tournament Seed | -0.171975 | 0.066074836 | -2.602733186 | 0.011635596 |
| Strength Of Schedule | -5.699141 | 5.246348565 | -1.086306246 | 0.281688599 |
Interpretation of Coefficients:
Our intercept is our constant. Its coefficient is 5.1842585, which means that
if all other variables equal 0, the predicted value of y, our measure of
winning games in the tournament, is equal to 5.1842585. This really doesn't
make any sense as a prediction, because no team has a tournament seed of 0, so
the intercept lies outside the range of our data.
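The coefficient of coaches winning % (historical winning percentage in the
tournament) is equal to 0.9141747. Because this variable is recorded as a
proportion (for example .419), the coefficient means that an increase of 1.0,
i.e. 100 percentage points, raises our y by 0.9141747; each one percentage
point increase in the coach's winning percentage raises y by about 0.009.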
The coefficient of tournament seed is -0.171975. This means that as your
tournament seed number increases by 1, your y will decrease by 0.171975. This
coefficient may not make sense at a glance, but it becomes clear once you look
at how seeds are coded. The computer reads a seed as a number, so an
"increase" in seed means moving from, say, a 1 seed to a 2 seed, which is
actually a worse seed. So when the computer says that the chances of winning
decrease as the seed number increases, it agrees with what we know to be true:
better seeds (a 1 seed is high and a 16 seed is considered low) should have a
better chance of winning.
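The coefficient of SOS is -5.699141. This means that as your SOS rating increases by one unit, your y will decrease by 5.699141. If SOS is read as a ranking, this makes sense for the same reason the coefficient of tournament seed did: although we think of a high SOS rating as being 1, the computer reads a larger number, e.g. 100, as a higher SOS rating, so as the SOS number increases (moves towards larger numbers) the chances of winning a game decrease. Note, however, that Table C-1 lists "Strength Of Schedule" (the percentage) separately from "S.O.S Rank"; if the regression used the percentage, the negative sign would run against that reasoning and may reflect the variable's high correlation with tournament seed. In either case, the P-value of 0.28 means this coefficient is not statistically different from zero at the usual significance levels.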
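To make the coefficient interpretations concrete, here is a minimal Python sketch that evaluates the fitted equation y = b0 + b1(coaches winning %) + b2(seed) + b3(SOS) with the coefficients from the output. The input values are hypothetical and do not describe any team in the data set, and we assume the SOS input is on the same scale as the "Strength Of Schedule" column used in the regression, which the output does not spell out.

```python
# Coefficients copied from the regression output.
INTERCEPT = 5.1842585
B_COACH = 0.9141747
B_SEED = -0.171975
B_SOS = -5.699141

def predict_y(coach_win_pct, seed, sos):
    """Evaluate the fitted linear equation for one team."""
    return INTERCEPT + B_COACH * coach_win_pct + B_SEED * seed + B_SOS * sos

# Hypothetical inputs for illustration only (not a team from the data set).
example = predict_y(coach_win_pct=0.50, seed=4, sos=0.55)
print(f"predicted y = {example:.3f}")

# Moving one seed line (a 4 seed to a 5 seed) with everything else fixed
# changes the prediction by exactly the seed coefficient.
step = predict_y(0.50, 5, 0.55) - predict_y(0.50, 4, 0.55)
print(f"change per seed line = {step:.6f}")
```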
Significance of our Model:
To assess the overall significance of our model, we look at the "Significance F" provided in our output. This figure is the P-value of the overall F test, so we can use it to see at what level our model is significant. Our "Significance F" is 9.11E-06, which is about .000009. Since this is less than an alpha of .01, we can conclude that our model is significant at the 99% confidence level. This means we reject the null hypothesis that all of the slope coefficients are zero (b1=0, b2=0, b3=0) at 99%.
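The same kind of P-value comparison can be applied to each coefficient individually. The sketch below recomputes each t statistic as coefficient divided by standard error and lists the terms whose reported P-value falls below a chosen alpha. The figures are copied from the regression output; the 5% level is our own assumption for illustration, not a level chosen in the analysis.

```python
# (coefficient, standard error, reported P-value) copied from the regression output.
terms = {
    "Intercept": (5.1842585, 3.369993832, 0.129218432),
    "Coaches Winning %": (0.9141747, 0.620327506, 0.145790666),
    "Tournament Seed": (-0.171975, 0.066074836, 0.011635596),
    "Strength Of Schedule": (-5.699141, 5.246348565, 0.281688599),
}

ALPHA = 0.05  # assumed significance level for the individual tests

# t statistic = coefficient / standard error
t_stats = {name: coef / se for name, (coef, se, _) in terms.items()}
# Terms whose reported P-value is below ALPHA
significant = [name for name, (_, _, p) in terms.items() if p < ALPHA]

for name, t in t_stats.items():
    print(f"{name:22s} t = {t:+.4f}")
print("significant at the assumed level:", significant)
```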
Interpretation of R squared:
R squared is 0.3504557. This tells us our regression line explains about 35% of the variation in y. Since an R squared closer to 1 is better, this is not a very informative model.
Our adjusted R squared is equal to .3179785. This tells us the same kind of information as R squared, but it is penalized for the number of variables in the model (coaches winning %, SOS rating, and tournament seed). Our adjusted R squared is lower than R squared, although not by a large amount. Some drop is expected whenever variables are added, but it may still mean that one of the variables is providing information that is not very useful.
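The R squared, adjusted R squared, F statistic, and standard error in the output can all be recovered from the ANOVA table. A short Python sketch using the standard formulas, with the sums of squares and degrees of freedom copied from the output (variable names are ours):

```python
import math

# Sums of squares and degrees of freedom from the ANOVA table.
ss_regression, df_regression = 40.29692663, 3
ss_residual, df_residual = 74.68744837, 60
ss_total = ss_regression + ss_residual
n = df_regression + df_residual + 1  # 64 observations

# Share of the total variation explained by the regression.
r_squared = ss_regression / ss_total
# Adjusted R squared penalizes for the number of predictors.
adjusted_r_squared = 1 - (1 - r_squared) * (n - 1) / df_residual
# F statistic: mean square regression over mean square residual.
f_statistic = (ss_regression / df_regression) / (ss_residual / df_residual)
# Standard error of the regression: square root of mean square residual.
standard_error = math.sqrt(ss_residual / df_residual)

print(f"R squared          = {r_squared:.7f}")
print(f"Adjusted R squared = {adjusted_r_squared:.7f}")
print(f"F                  = {f_statistic:.5f}")
print(f"Standard error     = {standard_error:.7f}")
```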
Multicollinearity Test:
Since our adjusted R squared fell below our R squared, though not by much, we decided to look at our correlation table (Table C-1) to see whether any pair of our variables had a correlation greater than .7 in absolute value. Sure enough, the correlation between SOS rating and tournament seed was almost .82 in magnitude. This means the two are much too similar, and one of them should be dropped from our model. If we ran the regression again, we would drop SOS rating, because the tournament seeds already take SOS and many other factors into consideration.
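The .7 screening rule described above can be sketched in Python as follows. The pairwise correlations are read from Table C-1; we use the "Strength Of Schedule" row here, although the ".82" in the text could equally refer to the "S.O.S Rank" row (0.81988). The helper name `collinear_pairs` is ours.

```python
from itertools import combinations

# Pairwise correlations among the three regressors, read from Table C-1.
correlations = {
    frozenset(("Tournament Seed", "Strength Of Schedule")): -0.8247,
    frozenset(("Tournament Seed", "Coaches Winning %")): -0.625,
    frozenset(("Strength Of Schedule", "Coaches Winning %")): 0.38662,
}
predictors = ["Coaches Winning %", "Tournament Seed", "Strength Of Schedule"]
THRESHOLD = 0.7  # the cutoff mentioned in the text

def collinear_pairs(names, corr, threshold):
    """Return predictor pairs whose absolute correlation exceeds the threshold."""
    flagged = []
    for a, b in combinations(names, 2):
        r = corr[frozenset((a, b))]
        if abs(r) > threshold:
            flagged.append((a, b, r))
    return flagged

flagged = collinear_pairs(predictors, correlations, THRESHOLD)
for a, b, r in flagged:
    print(f"{a} / {b}: r = {r}")
```

Comparing absolute values matters here, since the flagged correlation is negative.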
Conclusion:
With all the data that has been collected and analyzed, what can we really say we found from doing all this? From the information we have produced for this project, we believe we have come across some interesting statistics. We have found that some of our variables, tournament seed in particular, have a clear relationship with winning in the tournament. Other times, we found that data we originally thought would have a strong impact didn't have the effect we expected. So now we can take this information, especially the information that provided us with strong results, and use it for our future predictions. Perhaps we could test the strength of our results on next year's tournament; we could even fill out a 2001-2002 bracket based on our data analysis and see how true to form our results remain. Overall, our findings should be taken seriously but with some caution: we probably could have found another variable with more significance, but our model should give a useful starting point for predicting future first round winners.
Table C-1:
| Correlations | Tournament Seed | Strength Of Schedule | S.O.S Rank | Wins | Loss | Winning % | RPI | RPI % | Coaches Winning % | NCAA Tourney Wins |
| Tournament Seed | 1 | | | | | | | | | |
| Strength Of Schedule | -0.8247 | 1 | | | | | | | | |
| S.O.S Rank | 0.81988 | -0.989 | 1 | | | | | | | |
| Wins | -0.4063 | 0.0652 | -0.096 | 1 | | | | | | |
| Loss | 0.23645 | 0.2035 | -0.166 | -0.793 | 1 | | | | | |
| Winning % | -0.3035 | -0.1248 | 0.0879 | 0.8939 | -0.9795 | 1 | | | | |
| RPI | 0.85339 | -0.8639 | 0.8902 | -0.4532 | 0.23664 | -0.3247 | 1 | | | |
| RPI % | -0.9165 | 0.85214 | -0.862 | 0.5292 | -0.3299 | 0.41227 | -0.9646 | 1 | | |
| Coaches Winning % | -0.625 | 0.38662 | -0.398 | 0.4543 | -0.3286 | 0.38365 | -0.5027 | 0.5537 | 1 | |
| NCAA Tournament Wins | -0.5475 | 0.35822 | -0.353 | 0.3901 | -0.306 | 0.3451 | -0.4064 | 0.5111 | 0.4942 | 1 |
Table CV-1:
| | Tournament Seed | Strength Of Schedule | S.O.S Rank | Wins | Loss | Winning % | RPI | RPI % | Coaches Winning % | NCAA Tourney Wins |
| Tournament Seed | 21.25 | | | | | | | | | |
| Strength Of Schedule | -0.1868 | 0.0024 | | | | | | | | |
| S.O.S Rank | 332.37 | -4.2741 | 7733.4 | | | | | | | |
| Wins | -5.2344 | 0.009 | -23.662 | 7.8115 | | | | | | |
| Loss | 2.7734 | 0.0254 | -37.219 | -5.639 | 6.4744 | | | | | |
| Winning % | -0.1185 | -0.0005 | 0.6546 | 0.2116 | -0.2111 | 0.0072 | | | | |
| RPI | 171.3 | -1.8485 | 3408.8 | -55.16 | 26.219 | -1.197 | 1896 | | | |
| RPI % | -0.1691 | 0.0017 | -3.0331 | 0.0592 | -0.0336 | 0.0014 | -1.681 | 0.0016 | | |
| Coaches Winning % | -0.8675 | 0.0057 | -10.539 | 0.3823 | -0.2517 | 0.0098 | -6.591 | 0.0067 | 0.0907 | |
| NCAA Tournament Wins | -3.3828 | 0.0236 | -41.595 | 1.4614 | -1.0437 | 0.0392 | -23.72 | 0.0274 | 0.1995 | 1.7966 |