Let RSelenium scrape the Fangraphs data you want for you
As Spring Training wraps up, more and more baseball fans turn to their favorite
stat pages to get ready for fantasy baseball season, and Fangraphs has become
the standard source for stats. If you take your fantasy baseball seriously or
play in a daily fantasy league, you may want to import new data into R every
day for research before you make your lineup choices. R has several useful
packages for scraping data off the web, but the ones that seemed obvious to me
weren’t ideal for this situation. I found a solution with the help of the
RSelenium package.
Below I review how to set up a Firefox profile to make interacting with
RSelenium easier, and then share a basic function for scraping complete
datasets from Fangraphs directly into R.
Each Fangraphs leaderboard you could want is generated from a predictable URL,
so identifying the page you want to download stats from is no problem. What we
really want, though, is the URL behind the ‘Export Data’ link that appears on
the right side above the stats. If you hover over that link, you’ll notice that
the little URL box that pops up in the left-hand corner says ‘javascript:…’.
This means there is no fixed download URL for each ‘Export Data’ button, so you
have to actually click the link to download the data. That’s where RSelenium
comes in. You can write a script of RSelenium commands that drives Firefox (it
works with some other browsers, but Firefox was easiest for me) and quickly
imports the stats of interest into R as an R object, or saves them to a .csv
file if you prefer using Excel (or another stats program).
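To make the "predictable URL" point concrete, here is a minimal sketch of building a leaderboard URL from its parts. The helper name build_fg_url is made up for illustration; the query string is copied from the URLs used in the function later in this post, where type=0 is the standard page, type=1 the advanced page, and so on.

```r
# Hypothetical helper (not part of any package): assemble a leaderboard URL.
# Year, borp, Pos and qual mirror the arguments of the function in this post;
# `type` selects the stat page (0 = standard, 1 = advanced, 2 = batted ball).
build_fg_url <- function(Year, borp, type, Pos = "all", qual = 0) {
  paste0("http://www.fangraphs.com/leaders.aspx?pos=", Pos,
         "&stats=", borp, "&lg=all", "&qual=", qual,
         "&type=", type, "&season=", Year, "&team=0&players=")
}

# Standard batting leaderboard for 2015, all positions, no playing-time minimum
build_fg_url(2015, "bat", type = 0)
```

Whether Fangraphs still serves pages at this address is an assumption; the sketch only shows how the pieces fit together.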
First, you’ll need to set up a Firefox profile with auto-download enabled and
point RSelenium to it in your script. You’ll probably want to install an ad
blocker extension too: you may want to scrape several Fangraphs pages in one
run, and the video ads will slow things down and could interrupt R. Even if
Firefox is set to your preferences, the RSelenium package will use the default
Firefox settings unless you’ve saved those settings to a Firefox profile.
Mozilla has illustrated instructions for setting up a Firefox profile, but I
sum up the steps below:
1) Quit Firefox.
2) Open Terminal.
3) Enter the following and press return:
/Applications/Firefox.app/Contents/MacOS/firefox-bin -P
Note: You can use -P, -p, or -ProfileManager (any of them should work).
4) The Firefox Profile Manager should now be open, so choose Create Profile.
5) Choose a profile name and copy the path to the folder being created before
clicking Done, because you’ll need that path in the R code.
6) Start Firefox with that profile, go to settings, and turn on auto-download.
It doesn’t matter which folder you choose (I used the Downloads folder in the
code below); just make sure there is no previous ‘FanGraphs Leaderboard.csv’
file in there already. Optionally, install the Adblock Plus add-on.
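The check for a leftover export in step 6 can also be done from R before each run. This is a minimal sketch with a made-up helper name, clear_stale_export; the default folder and file name are the ones used in this post. The concern, as I understand it, is that a browser will typically save a second download under a different name, so read.csv would then pick up the old file.

```r
# Hypothetical helper: delete a leftover export so the next download keeps
# the expected file name. Returns TRUE if a file was removed, FALSE otherwise.
clear_stale_export <- function(dir = "~/Downloads",
                               file = "FanGraphs Leaderboard.csv") {
  path <- file.path(path.expand(dir), file)
  if (file.exists(path)) {
    file.remove(path)
  } else {
    FALSE
  }
}
```

Calling clear_stale_export() once before starting a scrape should be enough, since the function in this post removes each file after reading it.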
With that set up, all that’s left is to paste the profile folder path from
above into the getFirefoxProfile call that sets the variable fprof in the
fetch_FGmaj function below. Also check that the path to the folder you download
files into matches the path in the read.csv and file.remove commands. I’m using
a Mac from 2010 and tend to have a bunch of things open, so I had to add the
Sys.sleep commands; if your computer is fast enough, you can remove them or at
least lower the numbers. The function calls require(RSelenium) itself, but it
doesn’t work for me unless I call ‘library(RSelenium)’ before running the
function. You can easily remove certain pages or add an argument to control the
split you’re interested in downloading. As noted in the code, I didn’t remove
all of the extra symbols from the data, but I show how you can. I figure you’ll
edit the function to fit your needs anyway, so I just removed most of them.
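Before the full function, one possible alternative to tuning the Sys.sleep numbers by hand: poll for the downloaded file and continue as soon as it shows up. This is only a sketch. The helper name wait_for_file and its arguments are made up, and it checks only that the file exists, not that the browser has finished writing it.

```r
# Hypothetical helper: wait until `path` exists or `timeout` seconds pass.
# Returns TRUE if the file is there, FALSE if the wait timed out.
wait_for_file <- function(path, timeout = 10, interval = 0.25) {
  deadline <- Sys.time() + timeout
  while (Sys.time() < deadline) {
    if (file.exists(path)) return(TRUE)
    Sys.sleep(interval)  # short pause between checks
  }
  file.exists(path)      # one last check at the deadline
}
```

In fetch_FGmaj this could stand in for the Sys.sleep call that follows each click, for example wait_for_file(path.expand("~/Downloads/FanGraphs Leaderboard.csv")), with a small extra pause if the file turns out to be read before it is complete.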
fetch_FGmaj <- function(Year, borp, Pos="all", qual=0) {
  require(RSelenium)
  require(dplyr)

  # Validate the arguments
  if(borp != "bat" && borp != "pit"){return("Must be 'bat' or 'pit'")}
  yr <- as.integer(format(Sys.Date(), "%Y"))
  if(Year > yr){return("Year is out of bounds")}

  # Create the URLs (one per stat page; only the type number changes)
  standardurl <- paste0("http://www.fangraphs.com/leaders.aspx?pos=", Pos, "&stats=", borp,
                        "&lg=all", "&qual=", qual, "&type=0&season=", Year, "&team=0&players=")
  advancedurl <- paste0("http://www.fangraphs.com/leaders.aspx?pos=", Pos, "&stats=", borp,
                        "&lg=all", "&qual=", qual, "&type=1&season=", Year, "&team=0&players=")
  battedballurl <- paste0("http://www.fangraphs.com/leaders.aspx?pos=", Pos, "&stats=", borp,
                          "&lg=all", "&qual=", qual, "&type=2&season=", Year, "&team=0&players=")
  winprobabilityurl <- paste0("http://www.fangraphs.com/leaders.aspx?pos=", Pos, "&stats=", borp,
                              "&lg=all", "&qual=", qual, "&type=3&season=", Year, "&team=0&players=")
  pitchtypeurl <- paste0("http://www.fangraphs.com/leaders.aspx?pos=", Pos, "&stats=", borp,
                         "&lg=all", "&qual=", qual, "&type=4&season=", Year, "&team=0&players=")
  platedisciplineurl <- paste0("http://www.fangraphs.com/leaders.aspx?pos=", Pos, "&stats=", borp,
                               "&lg=all", "&qual=", qual, "&type=5&season=", Year, "&team=0&players=")
  valueurl <- paste0("http://www.fangraphs.com/leaders.aspx?pos=", Pos, "&stats=", borp,
                     "&lg=all", "&qual=", qual, "&type=6&season=", Year, "&team=0&players=")
  pitchvalueurl <- paste0("http://www.fangraphs.com/leaders.aspx?pos=", Pos, "&stats=", borp,
                          "&lg=all", "&qual=", qual, "&type=7&season=", Year, "&team=0&players=")
  fxtypeurl <- paste0("http://www.fangraphs.com/leaders.aspx?pos=", Pos, "&stats=", borp,
                      "&lg=all", "&qual=", qual, "&type=9&season=", Year, "&team=0&players=")
  fxvelocityurl <- paste0("http://www.fangraphs.com/leaders.aspx?pos=", Pos, "&stats=", borp,
                          "&lg=all", "&qual=", qual, "&type=10&season=", Year, "&team=0&players=")
  fxhmovementurl <- paste0("http://www.fangraphs.com/leaders.aspx?pos=", Pos, "&stats=", borp,
                           "&lg=all", "&qual=", qual, "&type=11&season=", Year, "&team=0&players=")
  fxvmovementurl <- paste0("http://www.fangraphs.com/leaders.aspx?pos=", Pos, "&stats=", borp,
                           "&lg=all", "&qual=", qual, "&type=12&season=", Year, "&team=0&players=")
  fxvalueurl <- paste0("http://www.fangraphs.com/leaders.aspx?pos=", Pos, "&stats=", borp,
                       "&lg=all", "&qual=", qual, "&type=13&season=", Year, "&team=0&players=")
  fxvaluep100url <- paste0("http://www.fangraphs.com/leaders.aspx?pos=", Pos, "&stats=", borp,
                           "&lg=all", "&qual=", qual, "&type=14&season=", Year, "&team=0&players=")
  fxdisciplineurl <- paste0("http://www.fangraphs.com/leaders.aspx?pos=", Pos, "&stats=", borp,
                            "&lg=all", "&qual=", qual, "&type=15&season=", Year, "&team=0&players=")

  # Scrape the data
  # Each page follows the same pattern: open the page, click 'Export Data',
  # wait for the download, then read the CSV and delete it so the next
  # download gets the same file name.
  # Note: startServer() comes from older RSelenium releases; later releases
  # replace it with rsDriver().
  RSelenium::startServer()
  fprof <- getFirefoxProfile("/Your/Path/Here", useBase = TRUE)
  remDr <- remoteDriver(extraCapabilities = fprof)
  remDr$open()

  remDr$navigate(standardurl)
  Sys.sleep(2)
  webElem <- remDr$findElement(using = 'id', value = "LeaderBoard1_cmdCSV")
  webElem$clickElement()
  Sys.sleep(3)

  remDr$navigate(advancedurl)
  sdata <- read.csv("~/Downloads/FanGraphs Leaderboard.csv")
  file.remove("~/Downloads/FanGraphs Leaderboard.csv")
  Sys.sleep(2)
  webElem <- remDr$findElement(using = 'id', value = "LeaderBoard1_cmdCSV")
  webElem$clickElement()
  Sys.sleep(5)

  remDr$navigate(battedballurl)
  adata <- read.csv("~/Downloads/FanGraphs Leaderboard.csv")
  file.remove("~/Downloads/FanGraphs Leaderboard.csv")
  Sys.sleep(2)
  webElem <- remDr$findElement(using = 'id', value = "LeaderBoard1_cmdCSV")
  webElem$clickElement()
  Sys.sleep(5)

  remDr$navigate(winprobabilityurl)
  bbdata <- read.csv("~/Downloads/FanGraphs Leaderboard.csv")
  file.remove("~/Downloads/FanGraphs Leaderboard.csv")
  Sys.sleep(2)
  webElem <- remDr$findElement(using = 'id', value = "LeaderBoard1_cmdCSV")
  webElem$clickElement()
  Sys.sleep(5)

  remDr$navigate(pitchtypeurl)
  wpdata <- read.csv("~/Downloads/FanGraphs Leaderboard.csv")
  file.remove("~/Downloads/FanGraphs Leaderboard.csv")
  Sys.sleep(2)
  webElem <- remDr$findElement(using = 'id', value = "LeaderBoard1_cmdCSV")
  webElem$clickElement()
  Sys.sleep(5)

  remDr$navigate(platedisciplineurl)
  ptdata <- read.csv("~/Downloads/FanGraphs Leaderboard.csv")
  file.remove("~/Downloads/FanGraphs Leaderboard.csv")
  Sys.sleep(2)
  webElem <- remDr$findElement(using = 'id', value = "LeaderBoard1_cmdCSV")
  webElem$clickElement()
  Sys.sleep(5)

  remDr$navigate(valueurl)
  pddata <- read.csv("~/Downloads/FanGraphs Leaderboard.csv")
  file.remove("~/Downloads/FanGraphs Leaderboard.csv")
  Sys.sleep(2)
  webElem <- remDr$findElement(using = 'id', value = "LeaderBoard1_cmdCSV")
  webElem$clickElement()
  Sys.sleep(5)

  remDr$navigate(pitchvalueurl)
  vdata <- read.csv("~/Downloads/FanGraphs Leaderboard.csv")
  file.remove("~/Downloads/FanGraphs Leaderboard.csv")
  Sys.sleep(2)
  webElem <- remDr$findElement(using = 'id', value = "LeaderBoard1_cmdCSV")
  webElem$clickElement()
  Sys.sleep(5)

  remDr$navigate(fxtypeurl)
  pvdata <- read.csv("~/Downloads/FanGraphs Leaderboard.csv")
  file.remove("~/Downloads/FanGraphs Leaderboard.csv")
  Sys.sleep(2)
  webElem <- remDr$findElement(using = 'id', value = "LeaderBoard1_cmdCSV")
  webElem$clickElement()
  Sys.sleep(5)

  remDr$navigate(fxvelocityurl)
  fxtdata <- read.csv("~/Downloads/FanGraphs Leaderboard.csv")
  file.remove("~/Downloads/FanGraphs Leaderboard.csv")
  Sys.sleep(2)
  webElem <- remDr$findElement(using = 'id', value = "LeaderBoard1_cmdCSV")
  webElem$clickElement()
  Sys.sleep(5)

  remDr$navigate(fxhmovementurl)
  fxvedata <- read.csv("~/Downloads/FanGraphs Leaderboard.csv")
  file.remove("~/Downloads/FanGraphs Leaderboard.csv")
  Sys.sleep(2)
  webElem <- remDr$findElement(using = 'id', value = "LeaderBoard1_cmdCSV")
  webElem$clickElement()
  Sys.sleep(5)

  remDr$navigate(fxvmovementurl)
  fxhmdata <- read.csv("~/Downloads/FanGraphs Leaderboard.csv")
  file.remove("~/Downloads/FanGraphs Leaderboard.csv")
  Sys.sleep(2)
  webElem <- remDr$findElement(using = 'id', value = "LeaderBoard1_cmdCSV")
  webElem$clickElement()
  Sys.sleep(5)

  remDr$navigate(fxvalueurl)
  fxvmdata <- read.csv("~/Downloads/FanGraphs Leaderboard.csv")
  file.remove("~/Downloads/FanGraphs Leaderboard.csv")
  Sys.sleep(2)
  webElem <- remDr$findElement(using = 'id', value = "LeaderBoard1_cmdCSV")
  webElem$clickElement()
  Sys.sleep(5)

  remDr$navigate(fxvaluep100url)
  fxvdata <- read.csv("~/Downloads/FanGraphs Leaderboard.csv")
  file.remove("~/Downloads/FanGraphs Leaderboard.csv")
  Sys.sleep(2)
  webElem <- remDr$findElement(using = 'id', value = "LeaderBoard1_cmdCSV")
  webElem$clickElement()
  Sys.sleep(5)

  remDr$navigate(fxdisciplineurl)
  fxv1data <- read.csv("~/Downloads/FanGraphs Leaderboard.csv")
  file.remove("~/Downloads/FanGraphs Leaderboard.csv")
  Sys.sleep(2)
  webElem <- remDr$findElement(using = 'id', value = "LeaderBoard1_cmdCSV")
  webElem$clickElement()
  Sys.sleep(5)

  fxddata <- read.csv("~/Downloads/FanGraphs Leaderboard.csv")
  file.remove("~/Downloads/FanGraphs Leaderboard.csv")
  remDr$closeWindow()

  # Join the data
  # Columns that appear on more than one page get .x/.y suffixes from dplyr.
  # Note: fxv1data is read above but is not part of the joins below.
  data <- full_join(sdata, adata, by=c("Name", "Team", "playerid"))
  data <- full_join(data, bbdata, by=c("Name", "Team", "playerid"))
  data <- full_join(data, wpdata, by=c("Name", "Team", "playerid"))
  data <- full_join(data, ptdata, by=c("Name", "Team", "playerid"))
  data <- full_join(data, pddata, by=c("Name", "Team", "playerid"))
  data <- full_join(data, vdata, by=c("Name", "Team", "playerid"))
  data <- full_join(data, pvdata, by=c("Name", "Team", "playerid"))
  data <- full_join(data, fxtdata, by=c("Name", "Team", "playerid"))
  data <- full_join(data, fxvedata, by=c("Name", "Team", "playerid"))
  data <- full_join(data, fxhmdata, by=c("Name", "Team", "playerid"))
  data <- full_join(data, fxvmdata, by=c("Name", "Team", "playerid"))
  data <- full_join(data, fxvdata, by=c("Name", "Team", "playerid"))
  data <- full_join(data, fxddata, by=c("Name", "Team", "playerid"))
  data$Year <- Year

  # Removes extra symbols (incomplete)
  data$K. <- as.numeric(gsub(" %", "", data$K.))
  data$BB. <- as.numeric(gsub(" %", "", data$BB.))
  data$LD. <- as.numeric(gsub(" %", "", data$LD.))
  data$GB. <- as.numeric(gsub(" %", "", data$GB.))
  data$FB..x <- as.numeric(gsub(" %", "", data$FB..x))
  data$IFFB. <- as.numeric(gsub(" %", "", data$IFFB.))
  data$HR.FB <- as.numeric(gsub(" %", "", data$HR.FB))
  data$IFH. <- as.numeric(gsub(" %", "", data$IFH.))
  data$BUH. <- as.numeric(gsub(" %", "", data$BUH.))
  data$Pull. <- as.numeric(gsub(" %", "", data$Pull.))
  data$Cent. <- as.numeric(gsub(" %", "", data$Cent.))
  data$Oppo. <- as.numeric(gsub(" %", "", data$Oppo.))
  data$Soft. <- as.numeric(gsub(" %", "", data$Soft.))
  data$Med. <- as.numeric(gsub(" %", "", data$Med.))
  data$Hard. <- as.numeric(gsub(" %", "", data$Hard.))
  data$FB..y <- as.numeric(gsub(" %", "", data$FB..y))
  data$SL..x <- as.numeric(gsub(" %", "", data$SL..x))
  data$CT. <- as.numeric(gsub(" %", "", data$CT.))
  data$CB. <- as.numeric(gsub(" %", "", data$CB.))
  data$CH..x <- as.numeric(gsub(" %", "", data$CH..x))
  data$SF. <- as.numeric(gsub(" %", "", data$SF.))
  data$KN..x <- as.numeric(gsub(" %", "", data$KN..x))
  data$XX. <- as.numeric(gsub(" %", "", data$XX.))
  data$O.Swing..x <- as.numeric(gsub(" %", "", data$O.Swing..x))
  data$Z.Swing..x <- as.numeric(gsub(" %", "", data$Z.Swing..x))
  data$Swing..x <- as.numeric(gsub(" %", "", data$Swing..x))
  data$O.Contact..x <- as.numeric(gsub(" %", "", data$O.Contact..x))
  data$Z.Contact..x <- as.numeric(gsub(" %", "", data$Z.Contact..x))
  data$Contact..x <- as.numeric(gsub(" %", "", data$Contact..x))
  data$Zone..x <- as.numeric(gsub(" %", "", data$Zone..x))
  data$F.Strike. <- as.numeric(gsub(" %", "", data$F.Strike.))
  data$SwStr. <- as.numeric(gsub(" %", "", data$SwStr.))
  # (incomplete)

  # Pitching-only percentage columns
  if(borp=="pit"){
    data$LOB. <- as.numeric(gsub(" %", "", data$LOB.))
    data$K.BB. <- as.numeric(gsub(" %", "", data$K.BB.))
  }

  data
}
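Since the symbol cleanup above is marked incomplete, here is one way it could be generalized: scan every column and convert the ones whose values all look like "12.3 %". This is a minimal sketch, not part of the original function. The helper name strip_percent and the demo data frame are made up, and the pattern assumes percentages are exported as a number followed by " %", as the gsub calls above imply.

```r
# Hypothetical helper: convert every column whose non-empty values all look
# like "12.3 %" to numeric, leaving all other columns untouched.
strip_percent <- function(df) {
  is_pct <- function(x) {
    x <- as.character(x)                 # also handles factor columns
    x <- x[!is.na(x) & x != ""]
    length(x) > 0 && all(grepl("^-?[0-9.]+ ?%$", x))
  }
  for (col in names(df)) {
    if (is_pct(df[[col]])) {
      df[[col]] <- as.numeric(gsub(" ?%", "", as.character(df[[col]])))
    }
  }
  df
}

# Made-up example data, only to show the effect
demo <- data.frame(Name = c("A", "B"),
                   K. = c("20.1 %", "15.0 %"),
                   HR = c(10, 20),
                   stringsAsFactors = FALSE)
str(strip_percent(demo))
```

Applied to the joined data inside fetch_FGmaj, a single call such as data <- strip_percent(data) could stand in for the long run of gsub lines, at the cost of being less explicit about which columns change.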