Every year, right before the weather breaks from winter to spring, college basketball fans near and far face 63 of the toughest decisions they will make all year: how to fill out their tournament brackets for March Madness.  With so much riding on the first round picks, these selections cannot be taken lightly.  By taking a closer look at what a team and its coach have done in the past and in the season just prior to the tournament, we attempt a statistical analysis of the probability that a team wins its first round game.  Because the outcomes of last year's tournament games are already known, we use that tournament as our data source for estimating these first round winning probabilities.  With these probabilities in hand, we can analyze questions such as how likely a team is to win, or to upset a higher seed.  We can then take the analysis a step further and see whether any strong trends emerge that could help in making future tournament predictions.

 

Data Base:

            Our data was found on the World Wide Web.  While the majority of the information came from www.ncaa.com and www.collegerpi.com, we also visited many of the colleges' and universities' men's basketball web sites.  The database we produced consists of all 64 teams from the 2000-2001 NCAA basketball tournament.  Along with this list of teams we include the following information: tournament seed (where a 1 seed is high and a 16 seed is low), strength of schedule percentage and rank from the 2000-2001 regular season (where higher percentages and lower-numbered rankings indicate a better rating), wins, losses, and winning percentage from the 2000-2001 regular season, the coach's all-time winning percentage in the NCAA tournament, number of wins in the 2000-2001 NCAA tournament, and RPI percentage and ranking from the 2000-2001 regular season (where higher percentages and lower-numbered rankings indicate a better rating).

 

Conditional Probabilities:

            Conditional probabilities tell us the chance of one event happening given that we already know that another related event has occurred.  The four tables below give the conditional probabilities of a team winning a game given what we already know about one of the following: the coach's historical winning % in the tournament, strength of schedule ranking, the team's winning % from the regular season, and whether the team is among the top or bottom 32 seeds.

            To read these tables, first note how the columns and rows are laid out.  The columns are "Top 32" and "Bottom 32".  For the variable in each table, we ranked the 64 tournament teams from highest to lowest and then split them into two groups, the higher group ("Top 32") and the lower group ("Bottom 32").  The cutoff between the two groups is stated in parentheses in each column heading.  The rows are simply "Wins" and "Losses".

Table P-1: Probability of Winning Given Coaches Historical Winning % in the Tournament

          Top 32 (Win% > .419)    Bottom 32 (Win% < .419)    Total
Wins      23/32 = .72             9/32 = .28                 32
Losses    9/32 = .28              23/32 = .72                32
Total     32                      32                         64

Interpretation: 

Table P-1 gives the conditional probability of a team winning its first round game given that its coach's historical winning percentage in the tournament ranked among the top 32 of the 64 teams.  Of the teams whose coach had a percentage greater than .419, 72% won their first game.  The correlation and covariance can be seen in Tables C-1 and CV-1, respectively.  The covariance has a positive sign, indicating a positive relationship (the better the coach's winning %, the more likely a team is to win its first game).  The correlation of .494, reported in Table C-1 between coach's winning % and NCAA tournament wins, is not close to 1, but it still indicates a moderate positive relationship.
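The arithmetic behind a table like P-1 can be sketched in a few lines of Python.  This is only an illustration, not the spreadsheet work used for the project: the function name conditional_win_table is made up for the example, and the only inputs are the win counts printed in Table P-1 (23 and 9 out of 32).

```python
from fractions import Fraction


def conditional_win_table(wins_top, wins_bottom, group_size=32):
    """Return P(win | group) and P(loss | group) for a top/bottom split.

    Hypothetical helper for illustration; counts come from the table.
    """
    table = {}
    for name, wins in (("top", wins_top), ("bottom", wins_bottom)):
        p_win = Fraction(wins, group_size)  # exact ratio, e.g. 23/32
        table[name] = {"win": float(p_win), "loss": float(1 - p_win)}
    return table


# Win counts as printed in Table P-1 (coach's tournament winning %).
p1 = conditional_win_table(wins_top=23, wins_bottom=9)
for group, probs in p1.items():
    print(f"{group:>6}: P(win)={probs['win']:.2f}  P(loss)={probs['loss']:.2f}")
```

The same call with the counts from Tables P-2 through P-4 (19 and 13, 21 and 11, 19 and 13) gives the proportions shown in those tables.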

 

Table P-2: Probability of Winning Given Strength of Schedule

          Top 32 (SOS Ranking < 59)    Bottom 32 (SOS Ranking > 59)    Total
Wins      19/32 = .59                  13/32 = .41                     32
Losses    13/32 = .41                  19/32 = .59                     32
Total     32                           32                              64

Interpretation:

Table P-2 gives the conditional probability of a team winning its first round game given that its S.O.S. (strength of schedule) ranking is better than 59, that is, numerically lower (1 is the best possible ranking).  Teams with a ranking better than 59 won 59% of their first round games.  The covariance between strength of schedule percentage and tournament wins (also found in Table CV-1) is positive, indicating that the harder the schedule a team plays, the better its chance of winning the first game.  The correlation between S.O.S. and tournament wins is .358, a modest positive relationship, which suggests that strength of schedule does matter once a team gets into the tournament.

 

Table P-3: Probability of Winning Given Winning % from Regular Season

          Top 32 (Win% > .71428)    Bottom 32 (Win% < .71875)    Total
Wins      21/32 = .66               11/32 = .34                  32
Losses    11/32 = .34               21/32 = .66                  32
Total     32                        32                           64

Interpretation:

Table P-3 gives the conditional probability of a team winning its first round game given its winning percentage from the regular season.  The top 32 teams are those with a winning percentage greater than .71428.  Teams with a percentage greater than .71428 won 66% of their first round games.  The correlation and covariance between the two, found in Tables C-1 and CV-1, are .345 and .039, respectively.  From these statistics, we conclude that there is a positive relationship between regular season winning percentage and winning the first game in the tournament (the better a team's winning percentage in the regular season, the better its chance of winning its first game).

 

Table P-4: Probability of Winning Given a Higher Seed for Game 1

          Top 32 (32 top seeds)    Bottom 32 (32 bottom seeds)    Total
Wins      19/32 = .59              13/32 = .41                    32
Losses    13/32 = .41              19/32 = .59                    32
Total     32                       32                             64

Interpretation:

Table P-4 gives the conditional probability of a team winning its first round game given that it was the higher seed in that game (the highest seed being a 1, the lowest a 16).  Given that a team was the higher seed in game one, 59% of these teams won.  The covariance is -3.38 and the correlation is -.547.  The negative sign does not mean the relationship is weak; in magnitude this is the strongest of the four correlations reported in this section.  The sign is negative because seeds are numbered so that better teams get smaller numbers: as the seed number increases toward 16, the chance of winning the first game decreases, so this output makes sense.
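The link between the covariance and the correlation quoted above can be checked directly, since a correlation is a covariance divided by the product of the two standard deviations.  The short Python sketch below is an illustration only; the helper name correlation is made up, and the inputs are the variances and covariances as printed in Table CV-1.

```python
import math


def correlation(cov_xy, var_x, var_y):
    """Correlation = covariance / (std dev of x * std dev of y)."""
    return cov_xy / math.sqrt(var_x * var_y)


# Values as printed in Table CV-1.
VAR_SEED = 21.25           # Tournament Seed variance
VAR_TOURNEY_WINS = 1.7966  # NCAA Tournament Wins variance
VAR_COACH = 0.0907         # Coaches Winning % variance

r_seed = correlation(-3.3828, VAR_SEED, VAR_TOURNEY_WINS)
r_coach = correlation(0.1995, VAR_COACH, VAR_TOURNEY_WINS)

print(f"seed vs tournament wins:    {r_seed:.4f}")
print(f"coach % vs tournament wins: {r_coach:.4f}")
```

Up to the rounding of the printed covariances, these reproduce the -.5475 and .4942 entries in Table C-1.  The sign of the correlation always matches the sign of the covariance; only the magnitude says how strong the relationship is.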

 

Summary Statistics:

Table SS-1:  Summary Statistics for all 64 Teams

Statistic    SOS Rank    Wins       Loss      Winning %    RPI       Coaches Winning%
Mean         91.9219     21.5313    8.2031    .7241        47        .3485
Median       59.5        21         8.5       .7165        32.5      .4095
Mode         33          21         9         .75          N/A       0
StdDev       88.6352     2.817      2.5646    .0854        43.877    .3035
Min          1           15         2         .5333        1         0
Max          316         29         14        .9286        188       .857
Count        64          64         64        64           64        64

 

Table SS-2:  Summary Statistics for 32 First Round Winners

Statistic    SOS Rank    Wins       Loss      Winning %    RPI        Coaches Winning%
Mean         69.0938     22.5938    7.4063    .7541        30.7813    .4787
Median       41          23         7         .7626        23         .5975
Mode         33          21         6         .8065        N/A        0
StdDev       72.3479     2.4078     2.5255    .0806        29.8639    .2859
Min          3           18         2         .6           1          0
Max          316         29         12        .9285        132        .857
Count        32          32         32        32           32         32

 

These two tables provide summary statistics.  The statistic that may be most interesting to look at is simply the mean for each category, in both Table SS-1 and Table SS-2.  Let's see what we can find by comparing the means from the two tables.

            The mean SOS ranking for all 64 teams is 91.9219, and the mean SOS ranking for the 32 first round winners is 69.0938.  With 1 being the best rank a team can have, the difference is in the favorable direction: the mean for the 32 first round winners is 22.8281 ranking places better than the mean for all 64 teams.  This is a fairly large difference.  It could be a helpful bit of information for future predictions: on average the winning teams from round one had an SOS ranking of about 69, which suggests that teams with SOS rankings much worse than about 70 are less likely to win.

            Moving on to the mean winning %: since winning % is basically a summary of wins and losses, we compare the mean winning % rather than wins and losses separately.  In Table SS-1 the mean winning % for all 64 teams is .7241, and in Table SS-2 the mean winning % for the first round winners is .7541.  Though the mean for the winners is higher, the difference between the two means is small, only .03.  This statistic does not differ much between all 64 teams and the 32 first round winners, and so may not be very helpful for future predictions.

            The mean RPIs in the two tables follow much the same pattern as the SOS rankings, because the RPI ratings take SOS into consideration.  The mean RPI in Table SS-1 is 47, and the mean RPI in Table SS-2 is 30.7813.  The best RPI ranking a team can have is 1; with this in mind, the 32 first round winners have a mean that is 16.2187 ranking places better.  This may be useful in making future predictions about first round tournament games: on average the winners' RPI danced around a ranking of 30.  So in the future, you may consider teams with an RPI ranking of 30 or better to have a greater chance of winning.

            Finally, we take a look at the coach's historical winning % in the tournament.  The mean in SS-1 is .3485, and the mean in SS-2 is .4787, so the mean for the 32 first round winners is about 13 percentage points higher.  Across all 64 coaches, the average is winning about one third of their tournament games; for the 32 first round winners, the coaches improve to a winning % close to .5, or half of their games.  This could be helpful to take into consideration when predicting the outcome of future first round games, because coaches with a higher winning % are more likely to win than coaches with a lower winning %.

These stats give us some interesting feedback that may be helpful in making future predictions.  Something to keep in mind, though, is that these stats are only averages, and averages from only one year's tournament.  So we can't put full stock in what has been produced, but we can use this info as a starting point, something to consider for the future.
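The comparison of means above can be reproduced with a short Python sketch using the values as printed in Tables SS-1 and SS-2.  The helper name mean_gaps is made up for the example.  The second number it reports, the gap divided by the all-teams standard deviation, is not part of the original analysis; it is added here only as one rough way to put the four gaps on a common scale.

```python
# Means and standard deviations as printed in Tables SS-1 and SS-2.
MEAN_ALL = {"SOS Rank": 91.9219, "Winning %": 0.7241,
            "RPI": 47.0, "Coaches Winning %": 0.3485}
MEAN_WINNERS = {"SOS Rank": 69.0938, "Winning %": 0.7541,
                "RPI": 30.7813, "Coaches Winning %": 0.4787}
STD_ALL = {"SOS Rank": 88.6352, "Winning %": 0.0854,
           "RPI": 43.877, "Coaches Winning %": 0.3035}


def mean_gaps(mean_all, mean_winners, std_all):
    """Return {name: (raw gap, gap in all-team standard deviations)}."""
    gaps = {}
    for name in mean_all:
        raw = mean_winners[name] - mean_all[name]
        gaps[name] = (raw, raw / std_all[name])
    return gaps


gaps = mean_gaps(MEAN_ALL, MEAN_WINNERS, STD_ALL)
for name, (raw, scaled) in gaps.items():
    print(f"{name:<18} raw gap {raw:+9.4f}   in std devs {scaled:+.2f}")
```

For SOS Rank and RPI a negative gap is the favorable direction, since 1 is the best ranking.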

 

Regression and Significance of the Regression Output:

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.591993
R Square             0.3504557
Adjusted R Square    0.3179785
Standard Error       1.1157019
Observations         64

ANOVA
              df    SS             MS             F              Significance F
Regression    3     40.29692663    13.43230888    10.79081626    9.11E-06
Residual      60    74.68744837    1.244790806
Total         63    114.984375

                        Coefficients    Standard Error    t Stat          P-value
Intercept               5.1842585       3.369993832       1.538358465     0.129218432
Coaches Winning %       0.9141747       0.620327506       1.473696822     0.145790666
Tournament Seed         -0.171975       0.066074836       -2.602733186    0.011635596
Strength Of Schedule    -5.699141       5.246348565       -1.086306246    0.281688599

Interpretation of Coefficients:

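            The intercept is the constant term: the predicted value of y when every explanatory variable equals 0.  Here y is the number of games a team won in the 2000-2001 tournament, not the chance of winning a game (the Total SS of 114.984375 in the ANOVA table is 64 times the NCAA Tournament Wins variance of 1.7966 in Table CV-1).  The intercept of 5.1842585 therefore says that a team with a coach's winning % of 0, a tournament seed of 0, and a strength of schedule of 0 would be predicted to win about 5.18 games.  No such team exists (seeds run from 1 to 16), so the intercept has no practical meaning on its own.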

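            The coefficient of coach's winning % (historical winning percentage in the tournament) is 0.9141747.  Because this variable is recorded as a proportion (it ranges from 0 to .857 in Table SS-1), a one-unit increase means going from 0 to 1.  Holding the other variables fixed, each additional percentage point (.01) in the coach's winning % raises y by about 0.009.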

            The coefficient of tournament seed is -0.171975.  This means that as the seed number increases by 1, y decreases by 0.171975, holding the other variables fixed.  The negative sign may not make sense at a glance, but it reflects how seeds are numbered: the best seed is 1 and the worst is 16, so an increase in the seed number (for example, from a 1 seed to a 2 seed) is a move to a worse seed.  The model therefore says what we know to be true: better-seeded teams do better.  This is also the only explanatory variable whose P-value (0.0116) is below .05.

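            The coefficient of SOS is -5.699141.  The variable entered in the regression is labeled Strength Of Schedule, which in Tables C-1 and CV-1 appears to be the SOS percentage (higher is better) rather than the S.O.S Rank (where 1 is best).  Read that way, a one-unit increase again spans the whole range from 0 to 1, and an increase of .01 in SOS percentage lowers y by about 0.057, holding the other variables fixed.  This negative sign is the opposite of the positive simple correlation (.358) between strength of schedule and tournament wins.  Two things temper it: the P-value of 0.28 means the coefficient is not statistically distinguishable from 0, and strength of schedule is highly correlated with tournament seed (see the Multicollinearity Test below), which makes the individual coefficients unreliable.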
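To see how the coefficients combine, the fitted equation can be written as a small Python function.  This is a sketch for illustration only: the function name predict_y and the example team are hypothetical, and the sketch assumes that coach's winning % and strength of schedule are entered as proportions (for example .50 and .60), which is how they appear in the summary tables.

```python
# Coefficients as printed in the regression output.
COEF = {
    "intercept": 5.1842585,
    "coach_win_pct": 0.9141747,
    "seed": -0.171975,
    "sos": -5.699141,
}


def predict_y(coach_win_pct, seed, sos):
    """Fitted value of the regression's dependent variable y."""
    return (COEF["intercept"]
            + COEF["coach_win_pct"] * coach_win_pct
            + COEF["seed"] * seed
            + COEF["sos"] * sos)


# Hypothetical team (not from the data set): coach wins half of his
# tournament games, the team is a 1 seed, and its SOS proportion is .60.
y_one_seed = predict_y(coach_win_pct=0.50, seed=1, sos=0.60)
# Same team moved to a 2 seed, everything else held fixed.
y_two_seed = predict_y(coach_win_pct=0.50, seed=2, sos=0.60)

print(f"1 seed: y = {y_one_seed:.3f}")
print(f"2 seed: y = {y_two_seed:.3f}")
print(f"change for one seed line: {y_two_seed - y_one_seed:+.3f}")
```

The change between the two calls is exactly the seed coefficient, which is what "holding the other variables fixed" means when reading a single coefficient.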

 

Significance of our Model:

            To judge the overall significance of our model we look at the "Significance F" provided in the output.  Treating this figure like a P-value, which it basically is, we can see at what significance level the model can be accepted.  Our "Significance F" is 9.11E-06, which is about .000009.  Since this is less than an alpha of .01, we conclude that the model as a whole is significant at 99% confidence.  In other words, we can reject the null hypothesis that all of the slope coefficients are zero (b1=0, b2=0, b3=0) at 99%.
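The F statistic behind "Significance F" comes straight from the ANOVA table: it is the regression mean square divided by the residual mean square.  The Python sketch below recomputes it from the printed sums of squares and applies the decision rule; it takes the Significance F value from the output as given rather than recomputing it, and the variable names are made up for the example.

```python
# Sums of squares and degrees of freedom as printed in the ANOVA table.
ss_regression, df_regression = 40.29692663, 3
ss_residual, df_residual = 74.68744837, 60

ms_regression = ss_regression / df_regression  # mean square, regression
ms_residual = ss_residual / df_residual        # mean square, residual
f_stat = ms_regression / ms_residual

# Significance F is copied from the output, not recomputed here.
significance_f = 9.11e-06
alpha = 0.01
reject_null = significance_f < alpha  # null: all slope coefficients are 0

print(f"F = {f_stat:.4f}")
print(f"reject null at alpha={alpha}: {reject_null}")
```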

 

Interpretation of R squared:

            R squared is 0.3504557.  This tells us our regression line explains about 35% of the variation in y.  Since the closer this number is to 1 the better, and about 65% of the variation is left unexplained, this is not a very informative model.

            Our adjusted R squared is equal to .3179785.  This tells us the same kind of information as R squared, but it is adjusted for the number of explanatory variables in the model (coach's winning %, SOS, and tournament seed).  Our adjusted R squared is lower than R squared, although not by a large amount, which suggests that at least one of the variables may be providing information that is not very useful.
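Both figures can be recovered from the ANOVA table.  The sketch below uses the standard textbook formulas, R squared = SS regression / SS total and adjusted R squared = 1 - (1 - R squared)(n - 1)/(n - k - 1), with n = 64 observations and k = 3 explanatory variables; the variable names are made up for the example.

```python
# Values as printed in the regression output.
ss_regression = 40.29692663
ss_total = 114.984375
n_obs = 64        # observations
k_predictors = 3  # coach's winning %, tournament seed, SOS

r_squared = ss_regression / ss_total
# The adjustment penalizes each added explanatory variable.
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - k_predictors - 1)

print(f"R squared          = {r_squared:.7f}")
print(f"adjusted R squared = {adj_r_squared:.7f}")
```

Because of the (n - 1)/(n - k - 1) factor, adjusted R squared is below R squared whenever there is at least one explanatory variable and the fit is not perfect, so a small gap by itself is expected.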

 

Multicollinearity Test:

            Since our adjusted R-squared fell below our R-squared, though not by much, we decided to look at our correlation table (Table C-1) and see whether any pair of explanatory variables had a correlation greater than .7 in absolute value.  Sure enough, the correlation between SOS and tournament seed is about .82 in magnitude (-.8247 with the SOS percentage, .8199 with the SOS rank).  This means the two carry much of the same information, and one of them should be dropped from our model.  If we ran the regression again, we would drop SOS, because the tournament seeds already take SOS and many other factors into consideration.
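The screening rule used here (flag any pair of explanatory variables whose correlation exceeds .7 in absolute value) can be sketched in Python.  The helper name flag_collinear is made up for the example; the three correlations are the entries from Table C-1 for the variables in the regression, and the .7 cutoff is the rule of thumb used in this report.

```python
from itertools import combinations

# Pairwise correlations among the three regressors, from Table C-1.
CORR = {
    frozenset(("Tournament Seed", "Strength Of Schedule")): -0.8247,
    frozenset(("Tournament Seed", "Coaches Winning %")): -0.625,
    frozenset(("Strength Of Schedule", "Coaches Winning %")): 0.38662,
}


def flag_collinear(names, corr, threshold=0.7):
    """Return (a, b, r) for every pair with |r| above the threshold."""
    flagged = []
    for a, b in combinations(names, 2):
        r = corr[frozenset((a, b))]
        if abs(r) > threshold:
            flagged.append((a, b, r))
    return flagged


regressors = ["Coaches Winning %", "Tournament Seed", "Strength Of Schedule"]
flagged = flag_collinear(regressors, CORR)
for a, b, r in flagged:
    print(f"{a} vs {b}: r = {r}")
```

Only the seed and strength of schedule pair is flagged; the seed and coach's winning % pair (-.625) falls just under the cutoff.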

 

Conclusion:

            With all the data that has been collected and analyzed, what can we really say we found?  We believe we have come across some interesting statistics.  Some of our variables show a clear relationship with winning in the tournament; tournament seed and coach's historical winning % had the strongest correlations with tournament wins (-.55 and .49).  Other variables that we originally thought would have a strong impact, such as regular season winning %, did not have the effect we expected.  We can now take this information, especially the stronger results, and use it for future predictions.  One test of the strength of our results would be to fill out a 2001-2002 bracket based on our data analysis and see how well the results hold up.  Overall, our findings are worth taking seriously as a starting point.  We probably could have found another variable with more significance, but our model should give some help in predicting future first round winners.


Table C-1: Correlations

                            (1)       (2)       (3)       (4)       (5)       (6)       (7)       (8)       (9)       (10)
(1)  Tournament Seed        1
(2)  Strength Of Schedule   -0.8247   1
(3)  S.O.S Rank             0.81988   -0.989    1
(4)  Wins                   -0.4063   0.0652    -0.096    1
(5)  Loss                   0.23645   0.2035    -0.166    -0.793    1
(6)  Winning %              -0.3035   -0.1248   0.0879    0.8939    -0.9795   1
(7)  RPI                    0.85339   -0.8639   0.8902    -0.4532   0.23664   -0.3247   1
(8)  RPI %                  -0.9165   0.85214   -0.862    0.5292    -0.3299   0.41227   -0.9646   1
(9)  Coaches Winning %      -0.625    0.38662   -0.398    0.4543    -0.3286   0.38365   -0.5027   0.5537    1
(10) NCAA Tournament Wins   -0.5475   0.35822   -0.353    0.3901    -0.306    0.3451    -0.4064   0.5111    0.4942    1

Table CV-1: Covariances

                            (1)       (2)       (3)       (4)       (5)       (6)       (7)       (8)       (9)       (10)
(1)  Tournament Seed        21.25
(2)  Strength Of Schedule   -0.1868   0.0024
(3)  S.O.S Rank             332.37    -4.2741   7733.4
(4)  Wins                   -5.2344   0.009     -23.662   7.8115
(5)  Loss                   2.7734    0.0254    -37.219   -5.639    6.4744
(6)  Winning %              -0.1185   -0.0005   0.6546    0.2116    -0.2111   0.0072
(7)  RPI                    171.3     -1.8485   3408.8    -55.16    26.219    -1.197    1896
(8)  RPI %                  -0.1691   0.0017    -3.0331   0.0592    -0.0336   0.0014    -1.681    0.0016
(9)  Coaches Winning %      -0.8675   0.0057    -10.539   0.3823    -0.2517   0.0098    -6.591    0.0067    0.0907
(10) NCAA Tournament Wins   -3.3828   0.0236    -41.595   1.4614    -1.0437   0.0392    -23.72    0.0274    0.1995    1.7966