Udacity Intro to Data Analysis: Project
Version 2.2, April 2016 (version update - prose edits)
The Ex-Cubs factor is a belief set deep into baseball culture that the Chicago Cubs are so bad that even having their former players on a team is fatal to that team's World Series chances - that Cub-ness is infectious. The theory was proposed on October 15, 1981 by Ron Berler, a freelance journalist and Cubs fan. Berler suggested in an article that "it is utterly impossible for a team with three or more ex-Cubs to win the series." Berler based this on a pattern that he observed in the post-1945 era. (1945 was the last year the Chicago Cubs made it to a World Series).[1]
Since that article, there have been notable exceptions to this 'rule,' particularly the 2001 Arizona Diamondbacks and the 2008 Phillies, who triumphed with as many as six Ex-Cubs on the roster. This analysis attempts to see whether the pattern itself has any merit.
Our null hypothesis is that: the odds of winning a World Series are not significantly different in light of the number of former Chicago Cubs on the team roster.
In more precise terms, the mean number of Ex-Cubs is the same for both winning and losing World Series teams.
I'd like to acknowledge this article by May and Santen in the Fall 2014 Baseball Research Journal, which is both hysterically funny and more rigorous than the analysis below. I made a point of not reading it in order to avoid plagiarism until I finished here but want to note that it's worth a read.
We use Sean Lahman's baseball database at http://baseball1.com. Although the database is a cornucopia of data delights, we will restrict ourselves to three tables:
We import the basic Python analysis libraries and proceed to load and cut down the data.
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats
master = pd.read_csv('baseballdatabank-master/core/Master.csv')
teams = pd.read_csv('baseballdatabank-master/core/Teams.csv')
appearances = pd.read_csv('baseballdatabank-master/core/Appearances.csv')
#cut teams down to post-1945 World Series contenders
teams = teams.loc[teams.yearID > 1945,:]
wsteams = teams.loc[(teams.LgWin == 'Y')][['teamID', 'yearID', 'name', 'WSWin','lgID']]
We are choosing to trust this data set's detail completeness - it was, after all, assembled by a baseball fan and statistician. For the sake of due diligence, however, let's look for any missing years in the data we're proposing to analyze:
# Hat-tip to http://stackoverflow.com/questions/16974047/efficient-way-to-find-missing-elements-in-an-integer-sequence
# for this little gem:
def missing_elements(L):
start, end = L[0], L[-1]
return sorted(set(range(start, end + 1)).difference(L))
years = wsteams.yearID.unique()
missing_elements(years)
Aha: The Great Baseball Strike of 1994. We can proceed, as the missing data is to be expected.
Now we have to sort out where each member of each team played in previous years. We must exclude records for people who played previous seasons with their current team (because if you're playing for the Cubs right now, you aren't an Ex-Cub). We also must, year-by-year, make sure we only get the records for players previous to that year. Although the possibility that Cub-ness is so pervasive that the contagion impacts teams whose players are doomed to be future Cubs is intriguing, this is out of the scope of this analysis.
We built a simple function that steps through the teams from 1945-present and assembles their rosters from the Appearances table. It then builds a data frame of counts per player of all former teams and returns that data frame to the calling code.
# This function looks at a given team and retrieves the previous years' appearances for
# other teams, returning a data frame that is concatenated to a larger history one.
def getplayers(team, year, outcome):
# pull all the members of this team this year
thisteam = appearances.ix[(appearances['yearID'] == year) & (appearances['teamID'] == team)]
#get all their appearances in the past except for with this team
previousteams = appearances[(appearances['playerID'].isin(thisteam['playerID'])) & (appearances['yearID'] < year) & (appearances['teamID'] != team)]
#build the new data frame
teamcounts = pd.DataFrame({'seasons' : previousteams.groupby( [ "playerID", "teamID"] ).size()}).reset_index()
teamcounts['seriesTeam'] = team
teamcounts['seriesYear'] = year
teamcounts['WSWin'] = outcome
return teamcounts
#initialize the history table
history = pd.DataFrame(columns=['playerID', 'teamID', 'seasons', 'seriesTeam', 'seriesYear'])
# collect the old team memberships
# I desperately wanted to use apply to build out this table but couldn't quite make it work
for index, row in wsteams.iterrows():
history = pd.concat([history, getplayers(row['teamID'], row['yearID'], row['WSWin'])])
Here's an example of the table as it's built: For each of the players of the 1946 Boston Red Sox (who lost that year), we gathered their pre-Sox careers with one new row per previous team. Mace Brown (brownma01) played one season for the Dodgers and seven (poor slob) for the Pirates.
history.head()
Now, we can build out the summaries from the big history table.
The function below gathers up the numbers of ex-players for each team in each year's World Series contenders. The function takes a single team and builds a table listing their alumni's appearances in future World Series. We made early experiments comparing the number of Ex-Cubs and their impact to the number of Ex-Cubs multiplied by their total seasons with the team, wondering whether Cubness is magnified by the amount of time they spent exposed to the sadness. Without further cluttering this already too-long document, it just appears to be more noise, and although the -sum fields are calculated in the function, they aren't used later. I've kept them around in case I want to poke at this more, however.
When this function groups up appearances per player per team per year, however, it doesn't catch the player-team-years when the count is zero. We must go back to add any missing years when a team's alumni didn't even make the World Series, or the year-over-year averages will be wrong.
def sliceoneexteam(team):
teamplayersums = history.loc[history.teamID == team,:].groupby(['seriesTeam', 'seriesYear','WSWin'])['seasons'].sum()
teamplayercounts = history.loc[history.teamID == team,:].groupby(['seriesTeam', 'seriesYear','WSWin'])['playerID'].count()
flatplayers = pd.DataFrame(teamplayercounts.reset_index())
flatplayersums = pd.DataFrame(teamplayersums.reset_index())
flatplayers['playerseasons'] = flatplayersums['seasons']
flatplayers['teamID'] = flatplayers['seriesTeam']
flatplayers['yearID'] = flatplayers['seriesYear']
#having to put these back in order to merge the missing years in
flatplayers = pd.merge(wsteams, flatplayers, how = 'left', on = ['teamID', 'yearID'])
del flatplayers['seriesTeam']
del flatplayers['seriesYear']
del flatplayers['WSWin_y']
flatplayers = flatplayers.rename(columns = {
'playerID':team,
'teamID' : 'seriesTeam',
'yearID' : 'seriesYear',
'WSWin_x' : 'WSWin'
})
flatplayers[team].fillna(0, inplace=True)
flatplayers['playerseasons'].fillna(0, inplace=True)
return flatplayers
Now, we can make a data frame that shows, year-by-year, team-by-team, the average number of each team's alumni who appeared in winning and losing World Series.
# Make the empty stats data frame
summarystats = pd.DataFrame(columns = [
'teamname', 'diffmean', 'diffse', 't', 'pval',
'winsmean', 'winsmedian', 'winssd', 'winsmax',
'lossesmean', 'lossesmedian', 'lossessd', 'lossesmax'
], index = teams.teamID.unique())
# This bugs me, too. I really wanted to be able to use apply() to step through the empty
# summary dataframe, pick up each row's index, and generate out the statistical columns,
# but kept hitting my head against the wall. Fortunately, this doesn't take that long.
for exteam in teams.teamID.unique():
flatplayers = sliceoneexteam(exteam)
wins = flatplayers[flatplayers.WSWin == "Y"]
losses = flatplayers[flatplayers.WSWin == "N"]
diffs = (wins[exteam].values - losses[exteam].values)
summarystats.loc[exteam].diffmean = diffs.mean() #mean of the differences
summarystats.loc[exteam].diffse = diffs.std()/np.sqrt(float(len(flatplayers))/2) #standard error
summarystats.loc[exteam].t = summarystats.loc[exteam].diffmean / summarystats.loc[exteam].diffse #t statistic
summarystats.loc[exteam].pval = stats.t.sf(np.abs(summarystats.loc[exteam].t), 36)*2 #p value
summarystats.loc[exteam].teamname = teams.loc[teams.teamID == exteam].iloc[0]['name']
summarystats.loc[exteam].winsmean = wins[exteam].mean()
summarystats.loc[exteam].winsmedian = wins[exteam].median()
summarystats.loc[exteam].winssd = wins[exteam].std(ddof = 0)
summarystats.loc[exteam].winsmax = wins[exteam].max()
summarystats.loc[exteam].lossesmean = losses[exteam].mean()
summarystats.loc[exteam].lossesmedian = losses[exteam].median()
summarystats.loc[exteam].lossessd = losses[exteam].std(ddof = 0)
summarystats.loc[exteam].lossesmax = losses[exteam].max()
So, let's do some exploratory analysis of the impact Cubness appears to have on a team's performance. First, let's examine some basic descriptive statistics - we got summary statistics for every team as we built out the summarystats table, but for the sake of specifically analyzing the Cubs (teamID = 'CHN'), we reload their data into the detail frame (flatplayers).
# Go back and manually get the details for Chicago and hold them for plotting.
slicedteam = "CHN"
flatplayers = sliceoneexteam(slicedteam)
summarystats.loc['CHN']
Here's what the detailed data looks like (one row per World Series team per year, with the number of Ex-Cubs, indicating whether that team won or lost):
flatplayers.head()
So: There have been an average of 2.06 Ex-Cubs on losing World Series teams; whereas winning teams have only an average of 1.8. The difference of these means is -0.26.
Unfortunately for the original prediction that "a team with more than three Ex-Cubs cannot win the World Series," there have been some standout Cubs alumni in recent years, with as many as six appearing on a single winning team - the original rule is exploded. The recent high-Ex-Cub games appear as outliers in the boxplot below.
g = sns.boxplot(x="WSWin", y=slicedteam, data=flatplayers)
g.axes.set_title("Number of Ex-Cubs on World Series Teams 1945-2015\n", fontsize = 20)
g.axes.set_xlabel("World Series Win")
g.axes.set_ylabel("Number of Ex-Cubs")
However, all is not lost. Perhaps the nature of Cub-ness is still a drag on the performance of a World Series contender. Let's examine the distribution of the numbers of Ex-Cubs on winning and losing teams over the years.
Although winning teams have more than three Cubs much less often, the possession of two Ex-Cubs by successful clubs appears to be very common.
g = sns.FacetGrid(flatplayers, col="WSWin",margin_titles=True)
g.map(plt.hist, slicedteam, color="steelblue", bins=6, lw=0)
plt.subplots_adjust(top=0.8)
g.set_axis_labels("Number of Ex-Cubs", "n()")
g.fig.suptitle('Distribution of Ex-Cub presence on World Series Teams', fontsize = 18)
Finally, let's compare scatter charts of Ex-Cub presence in the World Series. The chart doesn't really give us much to go on. Losing teams do seem to have more Cubs more frequently, and you can see that the original Ex-Cubs Factor more or less held up until the 2000s.
g = (sns.lmplot(x="seriesYear", y=slicedteam, data=flatplayers, col = 'WSWin', fit_reg=False, hue = 'lgID'))
plt.subplots_adjust(top=0.8)
g.fig.suptitle('Number of Ex-Cubs on Losing and Winning WS Teams', fontsize = 18)
g.set_axis_labels("World Series Year", "Number of Ex-Cubs")
Out of curiosity, we plotted the American League teams in blue and the National League in green, wondering whether a striking pattern would emerge. It didn't.
American League (total wins: 38) teams with more Ex-Cubs do better in the World Series. The National League (total wins: 31), which perhaps naturally has more (National League) Cubs players, pays a penalty for Cubness. Go figure.
wsleagues = flatplayers.groupby(['lgID', 'WSWin'])['CHN'].mean()
print "Mean number of Ex-Cubs on World Series Teams by League\n\n", wsleagues
Is there any correlation between the average number of Ex-Cubs on a team and the number of World Series wins it enjoyed? Let's plot the number of wins (or losses) each team had against the mean number of Ex-Cubs they had when they were winning (or losing).
winloss = flatplayers.groupby(['WSWin', 'seriesTeam'])['CHN'].agg([np.sum, np.mean])
winloss = pd.DataFrame(winloss.reset_index())
#g = sns.FacetGrid(winloss, col="WSWin")
g = (sns.lmplot(x= 'sum', y= 'mean', data=winloss, col = 'WSWin'))
plt.subplots_adjust(top=0.8)
g.fig.suptitle('Mean Number of Ex-Cubs vs. Number of WS Wins (or Losses)', fontsize = 18)
g.set_axis_labels("Games Won (or lost)", "Number of Ex-Cubs")
winlm = stats.linregress(winloss.loc[winloss.WSWin == "Y"]['sum'], winloss.loc[winloss.WSWin == "Y"]['mean'])
loselm = stats.linregress(winloss.loc[winloss.WSWin == "N"]['sum'], winloss.loc[winloss.WSWin == "N"]['mean'])
print 'winning correlation to Cubness = %6.3f : winning pvalue = %6.4f' % (winlm.rvalue, winlm.pvalue)
print 'losing correlation to Cubness = %6.3f : losing pvalue = %6.4f' % (loselm.rvalue, loselm.pvalue)
If there were much to the Ex-Cubs factor, we might have expected a negative correlation between the number of World Series won and the number of Ex-cubs on the roster. Instead, it's just impossibly weak but still positive. I think we can conclude that the relationship between the average number of Ex-Cubs a team carries when it plays in the World Series and how many World Series a team has won is weak at best and not statistically significant.
p.s. That outlier way off to the right in the "Wins" plot is the New York Yankees, who appear to hold no truck with the Cubs. Stoopid Yankees.
In order to be able to reject the null hypothesis (there is no difference in the number of Ex-Cubs playing for winning and losing World Series teams), we apply a significance test of p less than 0.05. In other words, there must be a less than 5% chance that the difference between the number of Ex-Cubs on winning and losing teams is not zero.
To review:
Cohort | Mean Number of Ex-Cubs | SD | Differences of the Means |
---|---|---|---|
Winning Teams | 1.797 | 1.246 | -0.260 |
Losing Teams | 2.068 | 1.596 | -- |
The Standard Error of the Differences of the Means is the Standard Deviation of the differences / sqrt(n), or:
print 'Difference of the Means = %6.3f Standard Error = %6.4f' % \
(summarystats.loc['CHN'].diffmean, summarystats.loc['CHN'].diffse)
Let's perform a T-Test:
The T value is the Cubs' difference from the means divided by the Standard Error of the Differences. The p-value is the probability that the mean difference is actually zero. We calculated t-scores for all the clubs when we built the summary tables.
print 't-statistic = %6.3f pvalue = %6.4f' % (summarystats.loc['CHN'].t, summarystats.loc['CHN'].pval)
There is actually a 30% probability that the difference between teams with and without Ex-Cubs is actually zero. The differences we see are not statistically significant.
So, in addition to it having been proven that the original theory - "No team with more than three Ex-Cubs on its roster can win the World Series," is wrong, the number of Ex-Cubs on a team does not appear to have a significant relationship to their hosts' performances. Baseball, loaded with its love of statistics, odd superstition, and an attachment to fan martyrdom, told a good story with the Ex-Cub factor, but it appears that Cubness' impact on World Series performance is, in fact, just noise.
We fail reject the null hypothesis, that the number of Ex-Cubs makes no differences to the likelihood that their team will win the World Series.
The myth arose in an effort to relate the misery of being a fan of Chicago Cubs and their infamous World Series-denying powers, which might even extend to other teams who have much to do with them. Let's have a look, just for fun, and bearing in mind that at no time are we stating that correlation is the same thing as causation, at whether other teams' alumni are related to World Series performance. Is there an ex-anybody factor? Are any of those factors actually significant?
The difference in the number of Ex-Cubs on winning teams and Ex-Cubs on losing teams is -0.26 players. In other words, there are 0.26 more Ex-Cubs on the average WS losing team than on the winning team. That is 'Cubness.' A perfect zero would indicate no difference at all between the numbers of a team's winning and losing alumni. We went ahead and calculated the difference for all 45 teams who had players in post-1945 World Series. That is to say, we calculated "Tigerness," "Yankeeness," etc.
Here is the distribution of those 45 scores, tidily centered around zero. It's notable that the Cubs, at -.26, appear to harm their hosts less than other teams, and that yet other teams seem to provide marginal benefit.
plt.hist(summarystats['diffmean'])
plt.title("Distribution of Net Number of Team's Ex-Players \n on Winning vs. Losing World Series Teams \n")
...and here's the list of teams sorted by the probability that their 'impact' is statistically significant, strictly for your entertainment. You'll note that only one (the Texas Rangers) meets the p =0.05 minimum standard, and their appearance on team roster is strongly associated with winning performances. We will not attempt to fudge our original criteria to include the apparently terrible impact of the Pirates, who just miss the threshold with a pval of 0.055.
In conclusion, for all the misery attributed to the presence of Ex-Cubs on your baseball team, the team whose imaginary contagious awfulness is actually the greatest is the Pittsburgh Pirates. On the other hand, it appears that if you can't get your hands on some Rangers, the presence of Ex-Brooklyn Dodgers could make your team happier, even if they do not materially improve your World Series outcome.
Except they're all dead.
summarystats.sort_values(by = 'pval')