Motivation

The goal here is to use widely available baseball data to make a priori predictions of FanDuel scores (https://www.fanduel.com/rules). If I can predict FanDuel scores more accurately than the average player, the hope is that I can build superior lineups and thus win money.

An initial corpus is created by scraping results from https://www.fantasycruncher.com/lineup-rewind/fanduel/MLB. Using the Table Capture Chrome extension, the FanDuel results are saved to local CSV files, yielding ~4200 data points. Minor data cleaning is performed manually on each CSV.

The CSVs contain major batting statistics for each player, the opposing team, the opposing pitcher, the handedness of both batter and pitcher, and a basic pitching statistic (ERA).

The response variable in this analysis is the ‘actual score’ for each player as calculated by FanDuel.

The CSV files for each of the 18 days are read and used to construct the training, test, and cross-validation data sets.
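
A rough sketch of this assembly step, assuming the daily CSVs live in a directory named data and the response column is called actual_score (the directory name, column name, and 60/20/20 split proportions are all assumptions, not taken from the original workflow):

library(caret)

# Read the 18 daily CSVs into a single data frame (directory name is a placeholder)
files <- list.files("data", pattern = "\\.csv$", full.names = TRUE)
all_days <- do.call(rbind, lapply(files, read.csv))

# Split into training, test, and cross-validation sets (assumed 60/20/20)
set.seed(1)
idx_train <- createDataPartition(all_days$actual_score, p = 0.6, list = FALSE)
train_set <- all_days[idx_train, ]
holdout   <- all_days[-idx_train, ]
idx_test  <- createDataPartition(holdout$actual_score, p = 0.5, list = FALSE)
test_set  <- holdout[idx_test, ]
cv_set    <- holdout[-idx_test, ]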

Initial features

Initially, 12 features are modeled:

* ISO
* wOBA
* opposing pitcher ERA
* batting average
* lineup order
* park factor
* batter/pitcher handedness (1 is assigned if batter and pitcher share handedness, 0 if different; switch hitters are assumed different; see the sketch after this list)
* fielding percentage
* team WHIP
* team DIPS
* team K/9
* team K/BB
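
The handedness flag can be encoded with a simple comparison. A minimal sketch, assuming columns named batter_hand and pitcher_hand coded as 'L', 'R', or 'S' for switch hitters (these column names and codes are assumptions):

# 1 if batter and pitcher share handedness, 0 otherwise;
# switch hitters ('S') never count as a match, so they fall through to 0
all_days$match_hand <- ifelse(all_days$batter_hand != "S" &
                              all_days$batter_hand == all_days$pitcher_hand, 1, 0)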

The feature set is fed into the R caret package and fit with the ‘lm’ linear-model method, with model error estimated by 10-fold cross-validation.
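
A sketch of the fitting step and of how the plotted vectors len, error, and CV_error might be built, reusing the placeholder names train_set, test_set, cv_set, and actual_score from above (the subset sizes are assumptions):

library(caret)

# 10-fold cross-validation as caret's resampling scheme
ctrl <- trainControl(method = "cv", number = 10)

# Fit an ordinary linear model on the full training set;
# this is the fit summarized further below
fit <- train(actual_score ~ ., data = train_set,
             method = "lm", trControl = ctrl)

# Learning curves: refit on growing subsets of the training data and
# record RMSE on the held-out test and cross-validation sets
len      <- seq(200, nrow(train_set), by = 200)
error    <- numeric(length(len))
CV_error <- numeric(length(len))

for (i in seq_along(len)) {
  fit_i <- train(actual_score ~ ., data = train_set[1:len[i], ],
                 method = "lm", trControl = ctrl)
  error[i]    <- RMSE(predict(fit_i, test_set), test_set$actual_score)
  CV_error[i] <- RMSE(predict(fit_i, cv_set), cv_set$actual_score)
}

The test-set and cross-validation RMSE are then plotted against training-set size: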

plot(x = len, y = error, xlab = "Data set size", ylab = "Model Error",
     col = "blue", pch = 14, ylim = c(1, 15), main = "Non-Penalized Regression")  # test-set error
points(x = len, y = CV_error, col = "red", pch = 16)  # cross-validation-set error
legend("topright", c("Test Set", "Cross Val Set"), pch = c(14, 16),
       bty = "n", cex = 0.75, col = c("blue", "red"))

RMSE for both the test set and the cross-validation set is shown. After ~1000 data points, increasing the training-set size does not appreciably improve the model's accuracy. Both learning curves flatten out at a similarly high error, which suggests high bias (underfitting) rather than high variance, so additional features seem to be needed.

Next, some principal component analysis is conducted:

## Importance of components:
##                            PC1     PC2     PC3     PC4     PC5     PC6
## Standard deviation     38.2117 5.32515 0.49642 0.34164 0.21529 0.14755
## Proportion of Variance  0.9807 0.01905 0.00017 0.00008 0.00003 0.00001
## Cumulative Proportion   0.9807 0.99969 0.99986 0.99994 0.99997 0.99998
##                            PC7     PC8     PC9   PC10    PC11     PC12
## Standard deviation     0.11767 0.07269 0.06172 0.0339 0.01952 0.004324
## Proportion of Variance 0.00001 0.00000 0.00000 0.0000 0.00000 0.000000
## Cumulative Proportion  0.99999 1.00000 1.00000 1.0000 1.00000 1.000000

Using prcomp, the first two principal components explain the vast majority of the variance (>99%), and they appear to correspond primarily to ISO and wOBA. This will require some further analysis.
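
A sketch of a prcomp call that would produce output of this shape, again using the placeholder names from above:

# Keep only the 12 predictors
feature_df <- train_set[, setdiff(names(train_set), "actual_score")]

# PCA on the raw features; the standard deviations above suggest the
# variables were not scaled to unit variance (scale. = FALSE is prcomp's default)
pca <- prcomp(feature_df)
summary(pca)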

Peeking at the summary of the linear fit:

summary(fit)
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -21.520  -7.439  -2.866   4.360  89.265 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -1.074e+02  5.189e+01  -2.070   0.0385 *
## ISO         -2.069e+00  3.478e+00  -0.595   0.5520  
## wOBA         1.887e+01  7.932e+00   2.378   0.0175 *
## BAvg        -1.310e+01  7.034e+00  -1.862   0.0628 .
## OppERA      -3.350e-02  3.716e-02  -0.902   0.3673  
## V5           1.421e-03  9.503e-03   0.149   0.8812  
## park_factor -7.114e-01  6.173e-01  -1.152   0.2493  
## match_hand  -1.190e-01  4.140e-01  -0.287   0.7739  
## field_per    1.164e+02  4.767e+01   2.442   0.0147 *
## team_WHIP    3.319e+00  5.246e+00   0.633   0.5270  
## team_DIPS   -2.003e+00  2.836e+00  -0.706   0.4801  
## team_K9     -1.649e+00  3.121e+00  -0.528   0.5974  
## team_KBB     8.032e-01  3.099e+00   0.259   0.7955  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.27 on 2509 degrees of freedom
## Multiple R-squared:  0.01118,    Adjusted R-squared:  0.006452 
## F-statistic: 2.364 on 12 and 2509 DF,  p-value: 0.005053

The summary tells a different tale. The fit has a very weak R-squared (~0.011), and the features that appear relevant fluctuate from fit to fit. Ultimately this is not a satisfactory model.

For a benchmark, the average RMSE on the test set is 10.7177086. Future development will seek to improve on this error.
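
For reference, a test-set RMSE of this kind can be computed with caret's RMSE helper (placeholder names as above; how the reported value was averaged is not shown here):

# RMSE of the fitted linear model on the held-out test set
RMSE(predict(fit, test_set), test_set$actual_score)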