How To Quickly Import Fangraphs Stats With R | Astromets Mind

Friday, April 1, 2016

How To Quickly Import Fangraphs Stats With R

Let RSelenium scrape the Fangraphs data you want for you

As Spring Training wraps up, more and more baseball fans turn to their favorite stat pages to get ready for fantasy baseball season, and Fangraphs has become the standard source for stats. If you take your fantasy baseball seriously and/or play in a daily fantasy league, you may want to import fresh data into R every day for research before you make your lineup choices. Unfortunately, while R has several useful packages for scraping data off the web, the ones that seemed obvious to me weren’t ideal for this situation. I found a solution with the help of the RSelenium package. Below I review how to set up a Firefox profile that makes working with RSelenium easier, and then share a basic function for scraping complete datasets from Fangraphs directly into R.
Every Fangraphs leaderboard is served from a predictable URL, so identifying the page you want to download stats from is no problem. What we really want, though, is the URL behind the ‘Export Data’ link that appears on the right side above the stats. If you hover over that link, you’ll notice it says ‘javascript:…’ in the little URL box that pops up in the left-hand corner. Basically, this means there is no fixed download URL behind the ‘Export Data’ button, so you have to actually click it to download the data. That’s where RSelenium comes in. You can set up a script using RSelenium commands to drive Firefox (it works with some other browsers too, but Firefox was easiest for me), and quickly import the stats of interest into R as an R object, or save them to a .CSV file if you prefer using Excel (or another stats program).
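To make the "predictable URL" point concrete, here is a minimal sketch that builds the leaderboard URLs from their query parameters. The helper name fg_leaders_url and the fg_types vector are my own illustrative names; the URL pattern and the type numbers are copied from the URLs used in the function later in this post, and the site may have changed its URL scheme since.

```r
# Hypothetical helper: assemble a leaderboard URL from its query parameters.
# The parameter order mirrors the URLs in the scraping function in this post.
fg_leaders_url <- function(type, Year, borp = "bat", Pos = "all", qual = 0) {
  paste0("http://www.fangraphs.com/leaders.aspx?pos=", Pos,
         "&stats=", borp, "&lg=all&qual=", qual,
         "&type=", type, "&season=", Year, "&team=0&players=")
}

# Page types and their 'type' numbers, as used in this post (8 is not used).
fg_types <- c(standard = 0, advanced = 1, battedball = 2, winprobability = 3,
              pitchtype = 4, platediscipline = 5, value = 6, pitchvalue = 7,
              fxtype = 9, fxvelocity = 10, fxhmovement = 11, fxvmovement = 12,
              fxvalue = 13, fxvaluep100 = 14, fxdiscipline = 15)

# One URL per page type; the names of fg_types carry over to the result.
urls <- vapply(fg_types, fg_leaders_url, character(1),
               Year = 2015, borp = "bat")
```

Indexing the result by name, for example urls[["advanced"]], gives the URL for that page.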
First, you’ll need to set up a Firefox profile with automatic downloads enabled, which your script will point RSelenium to. You’ll probably want to install an ad blocker extension in that profile too: you may want to scrape several Fangraphs pages in one run, and the video ads slow things down and could interrupt R. Even if your everyday Firefox is set to your preferences, RSelenium will use the default Firefox settings unless you’ve saved those settings to a Firefox profile. Mozilla’s instructions for setting up a Firefox profile include pictures, but I sum up the steps below (the command shown is for a Mac):

       1)   Quit Firefox.
       2)   Open Terminal.
       3)   Enter the following and press return:
            /Applications/Firefox.app/Contents/MacOS/firefox-bin -P
            Note: You can use -P, -p, or -ProfileManager (any of them should work).
       4)   The Firefox Profile Manager should now be open, so choose Create Profile.
       5)   Choose a profile name and copy the path to the folder being created before clicking Done, because you’ll need that path in the R code.
       6)   Start Firefox with that profile, go to settings, and turn on automatic downloads. It doesn’t matter which folder you choose (I used the Downloads folder in the code below); just make sure there is no previous ‘FanGraphs Leaderboard.csv’ file in there already. Optionally, install the Adblock Plus add-on.
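Steps 5 and 6 are easy to get wrong, so a quick check from R before launching the browser can save a confusing failure later. The sketch below is only an illustration: check_fg_setup is a hypothetical helper name, and the paths in the usage comment are placeholders for your own profile and download folders.

```r
# Hypothetical pre-flight check for the profile folder (step 5) and the
# download folder (step 6). It removes a leftover CSV so that a later
# read.csv() cannot pick up data from a previous run.
check_fg_setup <- function(profile_dir, download_dir,
                           csv_name = "FanGraphs Leaderboard.csv") {
  if (!dir.exists(profile_dir)) {
    stop("Firefox profile folder not found: ", profile_dir)
  }
  if (!dir.exists(download_dir)) {
    stop("Download folder not found: ", download_dir)
  }
  stale <- file.path(download_dir, csv_name)
  if (file.exists(stale)) {
    file.remove(stale)
    message("Removed stale file: ", stale)
  }
  invisible(stale)
}

# Usage (placeholder paths, replace with your own):
# check_fg_setup("/Your/Path/Here", "~/Downloads")
```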

With that set up, all that’s left is to paste the profile folder path from step 5 into the code below, in the getFirefoxProfile() call that creates the variable fprof. Also, check that the folder Firefox downloads into matches the path in the read.csv commands. I’m using a Mac from 2010 and tend to have a bunch of things open, so I had to add the Sys.sleep commands; if your computer is fast enough, you can get rid of them, or at least lower the numbers. I have it set so that RSelenium is required inside the function, but it doesn’t work for me unless I call ‘library(RSelenium)’ before running the function. You can easily remove certain pages, or add an argument to control which split you download. As noted in the code, I didn't remove all of the extra symbols from the data, but I show how you can. I figure you'll edit the function to fit your needs anyway, so I just removed most of them.
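If tuning the Sys.sleep numbers by hand gets tedious, one alternative is to poll for the downloaded file instead of waiting a fixed time. This is a sketch rather than part of my function: wait_for_file is a hypothetical helper, and the "size has stopped changing" test is my assumption about how to tell that the browser has finished writing the file.

```r
# Hypothetical helper: block until `path` exists, is non-empty, and has the
# same size on two consecutive checks; give up after `timeout` seconds.
wait_for_file <- function(path, timeout = 30, interval = 0.5) {
  deadline <- Sys.time() + timeout
  last_size <- -1
  repeat {
    size <- if (file.exists(path)) file.info(path)$size else NA
    if (!is.na(size) && size > 0 && size == last_size) {
      return(invisible(path))
    }
    if (Sys.time() > deadline) {
      stop("Timed out waiting for ", path)
    }
    last_size <- if (is.na(size)) -1 else size
    Sys.sleep(interval)
  }
}

# Usage after clicking 'Export Data' (path as used in this post):
# wait_for_file("~/Downloads/FanGraphs Leaderboard.csv")
# sdata <- read.csv("~/Downloads/FanGraphs Leaderboard.csv")
```

The trade-off is a slightly longer wait on fast machines (at least one polling interval) in exchange for not guessing how slow a slow machine will be.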




fetch_FGmaj <- function(Year, borp, Pos="all", qual=0) {
 
  require(RSelenium)
  require(dplyr)
 
  # Validate the inputs before starting the browser
  if(!(borp %in% c("bat", "pit"))){stop("borp must be 'bat' or 'pit'")}
  yr <- as.numeric(format(Sys.Date(), "%Y"))
  if(as.numeric(Year) > yr){stop("Year is out of bounds")}
 
  # Create the URLs
  standardurl <- paste0("http://www.fangraphs.com/leaders.aspx?pos=", Pos, "&stats=", borp, "&lg=all", "&qual=", qual, "&type=0&season=", Year, "&team=0&players=")
  advancedurl <- paste0("http://www.fangraphs.com/leaders.aspx?pos=", Pos, "&stats=", borp, "&lg=all", "&qual=", qual, "&type=1&season=", Year, "&team=0&players=")
  battedballurl <- paste0("http://www.fangraphs.com/leaders.aspx?pos=", Pos, "&stats=", borp, "&lg=all", "&qual=", qual, "&type=2&season=", Year, "&team=0&players=")
  winprobabilityurl <- paste0("http://www.fangraphs.com/leaders.aspx?pos=", Pos, "&stats=", borp, "&lg=all", "&qual=", qual, "&type=3&season=", Year, "&team=0&players=")
  pitchtypeurl <- paste0("http://www.fangraphs.com/leaders.aspx?pos=", Pos, "&stats=", borp, "&lg=all", "&qual=", qual, "&type=4&season=", Year, "&team=0&players=")
  platedisciplineurl <- paste0("http://www.fangraphs.com/leaders.aspx?pos=", Pos, "&stats=", borp, "&lg=all", "&qual=", qual, "&type=5&season=", Year, "&team=0&players=")
  valueurl <- paste0("http://www.fangraphs.com/leaders.aspx?pos=", Pos, "&stats=", borp, "&lg=all", "&qual=", qual, "&type=6&season=", Year, "&team=0&players=")
  pitchvalueurl <- paste0("http://www.fangraphs.com/leaders.aspx?pos=", Pos, "&stats=", borp, "&lg=all", "&qual=", qual, "&type=7&season=", Year, "&team=0&players=")
  fxtypeurl <- paste0("http://www.fangraphs.com/leaders.aspx?pos=", Pos, "&stats=", borp, "&lg=all", "&qual=", qual, "&type=9&season=", Year, "&team=0&players=")
  fxvelocityurl <- paste0("http://www.fangraphs.com/leaders.aspx?pos=", Pos, "&stats=", borp, "&lg=all", "&qual=", qual, "&type=10&season=", Year, "&team=0&players=")
  fxhmovementurl <- paste0("http://www.fangraphs.com/leaders.aspx?pos=", Pos, "&stats=", borp, "&lg=all", "&qual=", qual, "&type=11&season=", Year, "&team=0&players=")
  fxvmovementurl <- paste0("http://www.fangraphs.com/leaders.aspx?pos=", Pos, "&stats=", borp, "&lg=all", "&qual=", qual, "&type=12&season=", Year, "&team=0&players=")
  fxvalueurl <- paste0("http://www.fangraphs.com/leaders.aspx?pos=", Pos, "&stats=", borp, "&lg=all", "&qual=", qual, "&type=13&season=", Year, "&team=0&players=")
  fxvaluep100url <- paste0("http://www.fangraphs.com/leaders.aspx?pos=", Pos, "&stats=", borp, "&lg=all", "&qual=", qual, "&type=14&season=", Year, "&team=0&players=")
  fxdisciplineurl <- paste0("http://www.fangraphs.com/leaders.aspx?pos=", Pos, "&stats=", borp, "&lg=all", "&qual=", qual, "&type=15&season=", Year, "&team=0&players=")
 
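  # Scrape the data
  # Note: startServer() was the way to launch Selenium when this post was written;
  # newer RSelenium releases may require rsDriver() instead.
  RSelenium::startServer()
  # Replace "/Your/Path/Here" with the profile folder path you copied in step 5
  fprof <- getFirefoxProfile("/Your/Path/Here", useBase = TRUE)
  remDr <- remoteDriver(extraCapabilities = fprof)
  remDr$open()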
  remDr$navigate(standardurl)
  Sys.sleep(2)
  webElem <- remDr$findElement(using = 'id', value = "LeaderBoard1_cmdCSV")
  webElem$clickElement()
  Sys.sleep(3)
  remDr$navigate(advancedurl)
  sdata <- read.csv("~/Downloads/FanGraphs Leaderboard.csv")
  file.remove("~/Downloads/FanGraphs Leaderboard.csv")
  Sys.sleep(2)
  webElem <- remDr$findElement(using = 'id', value = "LeaderBoard1_cmdCSV")
  webElem$clickElement()
  Sys.sleep(5)
  remDr$navigate(battedballurl)
  adata <- read.csv("~/Downloads/FanGraphs Leaderboard.csv")
  file.remove("~/Downloads/FanGraphs Leaderboard.csv")
  Sys.sleep(2)
  webElem <- remDr$findElement(using = 'id', value = "LeaderBoard1_cmdCSV")
  webElem$clickElement()
  Sys.sleep(5)
  remDr$navigate(winprobabilityurl)
  bbdata <- read.csv("~/Downloads/FanGraphs Leaderboard.csv")
  file.remove("~/Downloads/FanGraphs Leaderboard.csv")
  Sys.sleep(2)
  webElem <- remDr$findElement(using = 'id', value = "LeaderBoard1_cmdCSV")
  webElem$clickElement()
  Sys.sleep(5)
  remDr$navigate(pitchtypeurl)
  wpdata <- read.csv("~/Downloads/FanGraphs Leaderboard.csv")
  file.remove("~/Downloads/FanGraphs Leaderboard.csv")
  Sys.sleep(2)
  webElem <- remDr$findElement(using = 'id', value = "LeaderBoard1_cmdCSV")
  webElem$clickElement()
  Sys.sleep(5)
  remDr$navigate(platedisciplineurl)
  ptdata <- read.csv("~/Downloads/FanGraphs Leaderboard.csv")
  file.remove("~/Downloads/FanGraphs Leaderboard.csv")
  Sys.sleep(2)
  webElem <- remDr$findElement(using = 'id', value = "LeaderBoard1_cmdCSV")
  webElem$clickElement()
  Sys.sleep(5)
  remDr$navigate(valueurl)
  pddata <- read.csv("~/Downloads/FanGraphs Leaderboard.csv")
  file.remove("~/Downloads/FanGraphs Leaderboard.csv")
  Sys.sleep(2)
  webElem <- remDr$findElement(using = 'id', value = "LeaderBoard1_cmdCSV")
  webElem$clickElement()
  Sys.sleep(5)
  remDr$navigate(pitchvalueurl)
  vdata <- read.csv("~/Downloads/FanGraphs Leaderboard.csv")
  file.remove("~/Downloads/FanGraphs Leaderboard.csv")
  Sys.sleep(2)
  webElem <- remDr$findElement(using = 'id', value = "LeaderBoard1_cmdCSV")
  webElem$clickElement()
  Sys.sleep(5)
  remDr$navigate(fxtypeurl)
  pvdata <- read.csv("~/Downloads/FanGraphs Leaderboard.csv")
  file.remove("~/Downloads/FanGraphs Leaderboard.csv")
  Sys.sleep(2)
  webElem <- remDr$findElement(using = 'id', value = "LeaderBoard1_cmdCSV")
  webElem$clickElement()
  Sys.sleep(5)
  remDr$navigate(fxvelocityurl)
  fxtdata <- read.csv("~/Downloads/FanGraphs Leaderboard.csv")
  file.remove("~/Downloads/FanGraphs Leaderboard.csv")
  Sys.sleep(2)
  webElem <- remDr$findElement(using = 'id', value = "LeaderBoard1_cmdCSV")
  webElem$clickElement()
  Sys.sleep(5)
  remDr$navigate(fxhmovementurl)
  fxvedata <- read.csv("~/Downloads/FanGraphs Leaderboard.csv")
  file.remove("~/Downloads/FanGraphs Leaderboard.csv")
  Sys.sleep(2)
  webElem <- remDr$findElement(using = 'id', value = "LeaderBoard1_cmdCSV")
  webElem$clickElement()
  Sys.sleep(5)
  remDr$navigate(fxvmovementurl)
  fxhmdata <- read.csv("~/Downloads/FanGraphs Leaderboard.csv")
  file.remove("~/Downloads/FanGraphs Leaderboard.csv")
  Sys.sleep(2)
  webElem <- remDr$findElement(using = 'id', value = "LeaderBoard1_cmdCSV")
  webElem$clickElement()
  Sys.sleep(5)
  remDr$navigate(fxvalueurl)
  fxvmdata <- read.csv("~/Downloads/FanGraphs Leaderboard.csv")
  file.remove("~/Downloads/FanGraphs Leaderboard.csv")
  Sys.sleep(2)
  webElem <- remDr$findElement(using = 'id', value = "LeaderBoard1_cmdCSV")
  webElem$clickElement()
  Sys.sleep(5)
  remDr$navigate(fxvaluep100url)
  fxvdata <- read.csv("~/Downloads/FanGraphs Leaderboard.csv")
  file.remove("~/Downloads/FanGraphs Leaderboard.csv")
  Sys.sleep(2)
  webElem <- remDr$findElement(using = 'id', value = "LeaderBoard1_cmdCSV")
  webElem$clickElement()
  Sys.sleep(5)
  remDr$navigate(fxdisciplineurl)
  fxv1data <- read.csv("~/Downloads/FanGraphs Leaderboard.csv")
  file.remove("~/Downloads/FanGraphs Leaderboard.csv")
  Sys.sleep(2)
  webElem <- remDr$findElement(using = 'id', value = "LeaderBoard1_cmdCSV")
  webElem$clickElement()
  Sys.sleep(5)
  fxddata <- read.csv("~/Downloads/FanGraphs Leaderboard.csv")
  file.remove("~/Downloads/FanGraphs Leaderboard.csv")
  remDr$closeWindow()
 
  # Join the data (fxv1data is downloaded above but is not included in the joins)
  data <- full_join(sdata, adata, by=c("Name", "Team", "playerid"))
  data <- full_join(data, bbdata, by=c("Name", "Team", "playerid"))
  data <- full_join(data, wpdata, by=c("Name", "Team", "playerid"))
  data <- full_join(data, ptdata, by=c("Name", "Team", "playerid"))
  data <- full_join(data, pddata, by=c("Name", "Team", "playerid"))
  data <- full_join(data, vdata, by=c("Name", "Team", "playerid"))
  data <- full_join(data, pvdata, by=c("Name", "Team", "playerid"))
  data <- full_join(data, fxtdata, by=c("Name", "Team", "playerid"))
  data <- full_join(data, fxvedata, by=c("Name", "Team", "playerid"))
  data <- full_join(data, fxhmdata, by=c("Name", "Team", "playerid"))
  data <- full_join(data, fxvmdata, by=c("Name", "Team", "playerid"))
  data <- full_join(data, fxvdata, by=c("Name", "Team", "playerid"))
  data <- full_join(data, fxddata, by=c("Name", "Team", "playerid"))
 
  data$Year <- Year
 
  # Removes extra symbols (incomplete)
  data$K. <- as.numeric(gsub(" %", "", data$K.))
  data$BB. <- as.numeric(gsub(" %", "", data$BB.))
  data$LD. <- as.numeric(gsub(" %", "", data$LD.))
  data$GB. <- as.numeric(gsub(" %", "", data$GB.))
  data$FB..x <- as.numeric(gsub(" %", "", data$FB..x))
  data$IFFB. <- as.numeric(gsub(" %", "", data$IFFB.))
  data$HR.FB <- as.numeric(gsub(" %", "", data$HR.FB))
  data$IFH. <- as.numeric(gsub(" %", "", data$IFH.))
  data$BUH. <- as.numeric(gsub(" %", "", data$BUH.))
  data$Pull. <- as.numeric(gsub(" %", "", data$Pull.))
  data$Cent. <- as.numeric(gsub(" %", "", data$Cent.))
  data$Oppo. <- as.numeric(gsub(" %", "", data$Oppo.))
  data$Soft. <- as.numeric(gsub(" %", "", data$Soft.))
  data$Med. <- as.numeric(gsub(" %", "", data$Med.))
  data$Hard. <- as.numeric(gsub(" %", "", data$Hard.))
  data$FB..y <- as.numeric(gsub(" %", "", data$FB..y))
  data$SL..x <- as.numeric(gsub(" %", "", data$SL..x))
  data$CT. <- as.numeric(gsub(" %", "", data$CT.))
  data$CB. <- as.numeric(gsub(" %", "", data$CB.))
  data$CH..x <- as.numeric(gsub(" %", "", data$CH..x))
  data$SF. <- as.numeric(gsub(" %", "", data$SF.))
  data$KN..x <- as.numeric(gsub(" %", "", data$KN..x))
  data$XX. <- as.numeric(gsub(" %", "", data$XX.))
  data$O.Swing..x <- as.numeric(gsub(" %", "", data$O.Swing..x))
  data$Z.Swing..x <- as.numeric(gsub(" %", "", data$Z.Swing..x))
  data$Swing..x <- as.numeric(gsub(" %", "", data$Swing..x))
  data$O.Contact..x <- as.numeric(gsub(" %", "", data$O.Contact..x))
  data$Z.Contact..x <- as.numeric(gsub(" %", "", data$Z.Contact..x))
  data$Contact..x <- as.numeric(gsub(" %", "", data$Contact..x))
  data$Zone..x <- as.numeric(gsub(" %", "", data$Zone..x))
  data$F.Strike. <- as.numeric(gsub(" %", "", data$F.Strike.))
  data$SwStr. <- as.numeric(gsub(" %", "", data$SwStr.))
 
  # Pitcher-only columns (incomplete)
  if(borp=="pit"){
    data$LOB. <- as.numeric(gsub(" %", "", data$LOB.))
    data$K.BB. <- as.numeric(gsub(" %", "", data$K.BB.))
  }
 
  data
}
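Since the symbol clean-up in the function is incomplete, here is one way to handle every percent column at once instead of listing them by name. It is a sketch under assumptions: strip_percent is a hypothetical helper, it treats a column as a percent column only when every non-empty value ends in "%", and the example data frame is made up for illustration.

```r
# Hypothetical helper: convert every column whose non-empty values all end
# in "%" to numeric, leaving all other columns untouched.
strip_percent <- function(df) {
  is_pct <- vapply(df, function(col) {
    vals <- as.character(col)
    vals <- vals[!is.na(vals) & nzchar(vals)]
    length(vals) > 0 && all(grepl("%\\s*$", vals))
  }, logical(1))
  if (any(is_pct)) {
    df[is_pct] <- lapply(df[is_pct], function(col) {
      as.numeric(gsub("\\s*%", "", as.character(col)))
    })
  }
  df
}

# Made-up example rows, not real Fangraphs data
example <- data.frame(Name = c("Player A", "Player B"),
                      K. = c("20.1 %", "15.0 %"),
                      HR = c(10, 5),
                      stringsAsFactors = FALSE)
cleaned <- strip_percent(example)
str(cleaned)
```

Calling strip_percent(data) just before the final line of the function would be one place to use it, in place of the individual gsub lines.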