In this article, I will show you how to parse web pages with R. As an example, I will use an option quotes page from Yahoo Finance.
Here is the page with AAPL options: https://finance.yahoo.com/quote/AAPL/options?p=AAPL
To parse it I will use the “rvest” package. It’s really nice and really easy to use.
First, we have to load the libraries needed for the task. In addition to rvest, I will use the tidyverse package for basic data crunching.
library(tidyverse)
library(rvest)
Next, let’s save the URL in a variable and load the entire web page with the read_html function:
url <- "https://finance.yahoo.com/quote/AAPL/options?p=AAPL&date=1597968000"
page <- read_html(url)
page
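The URL above carries a date parameter, 1597968000, which is a Unix timestamp for midnight UTC on 2020-08-21. I’m assuming here that it selects which expiration date the page shows; the helper name options_url is made up for illustration. A minimal sketch of building such a URL:

```r
# Hypothetical helper (not part of rvest): build the options URL
# for a ticker and a calendar date.
# Assumption: the `date` query parameter is midnight UTC of the chosen
# day, expressed as a Unix timestamp.
options_url <- function(ticker, date) {
  timestamp <- as.numeric(as.POSIXct(date, tz = "UTC"))
  sprintf(
    "https://finance.yahoo.com/quote/%s/options?p=%s&date=%.0f",
    ticker, ticker, timestamp
  )
}

aapl_url <- options_url("AAPL", "2020-08-21")
aapl_url
```

For "AAPL" and "2020-08-21" this gives the same URL as the one stored in url above.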
After that, we have to find our tables on the page and parse them into data frames in R. The page has two separate tables, one for calls and one for puts. To extract a table, we need a unique CSS selector or XPath expression that identifies its node on the page. You can use the developer tools in Google Chrome for that: right-click the table and choose “Inspect”.
You’ll then see the HTML code for the page. Locate the “table” tag and work out a unique path to it. In my case it’s really simple: the calls table has a unique CSS class, “calls”.
Next, pass this selector to the html_node function to find the table, and then use the html_table function to turn it into a data frame:
calls <- page %>%
  html_node(".calls") %>%
  html_table()
head(calls)
As you can see, it works, and we already have our option quotes in R.
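If you want to try the selector step without hitting the network, here is a minimal sketch on a made-up HTML snippet. The toy tables and their values are invented for illustration; only the class names “calls” and “puts” come from the page discussed above. It also shows an XPath expression that finds the same node as the CSS selector.

```r
library(rvest)

# Made-up HTML standing in for the real page
toy_html <- '<html><body>
<table class="calls">
  <tr><th>Strike</th><th>Last Price</th></tr>
  <tr><td>100</td><td>12.5</td></tr>
  <tr><td>105</td><td>9.1</td></tr>
</table>
<table class="puts">
  <tr><th>Strike</th><th>Last Price</th></tr>
  <tr><td>100</td><td>1.2</td></tr>
</table>
</body></html>'

toy_page <- read_html(toy_html)

# Select the calls table by CSS class
toy_calls_css <- toy_page %>%
  html_node(".calls") %>%
  html_table()

# Select the same table with an XPath expression
toy_calls_xpath <- toy_page %>%
  html_node(xpath = "//table[contains(@class, 'calls')]") %>%
  html_table()

toy_calls_css
```

Both calls return the two-row calls table, so you can use whichever path style is easier to read off the developer tools.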
Some columns have small formatting issues, so you can use the following code to parse “Last Trade Date” into a proper date-time format, cast a few columns from character to numeric, and keep only today’s quotes:
calls <- calls %>%
  mutate(
    # Parse the timestamp; any trailing text after AM/PM is ignored
    `Last Trade Date` = as.POSIXct(`Last Trade Date`, format = "%Y-%m-%d %I:%M%p"),
    # Drop the percent sign before converting to numeric
    `Implied Volatility` = as.numeric(str_replace(`Implied Volatility`, "%", "")),
    `% Change` = as.numeric(str_replace(`% Change`, "%", "")),
    Volume = as.numeric(Volume),
    `Open Interest` = as.numeric(`Open Interest`)
  ) %>%
  # Keep only quotes last traded today
  filter(as.Date(`Last Trade Date`) == Sys.Date())
head(calls)
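One thing to watch for: as.numeric returns NA for strings such as "1,234". If the scraped columns contain thousands separators or a dash placeholder for missing values (my assumption about what such a page may contain, not something shown above), a small helper can make the conversion more forgiving. The name to_number is made up for this sketch.

```r
library(stringr)

# Hypothetical helper: convert strings like "1,234" or "45.31%" to numbers.
# Assumption: a lone "-" marks a missing value and should become NA.
to_number <- function(x) {
  x <- str_remove_all(as.character(x), "[%,]")  # strip "%" and ","
  x[x %in% "-"] <- NA_character_                # dash placeholder -> NA
  as.numeric(x)
}

converted <- to_number(c("1,234", "45.31%", "-", "7"))
converted
```

You could then write Volume = to_number(Volume) inside mutate instead of calling as.numeric directly.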
That’s it for calls. You can repeat the same steps to get puts:
puts <- page %>%
  html_node(".puts") %>%
  html_table()
puts <- puts %>%
  mutate(
    `Last Trade Date` = as.POSIXct(`Last Trade Date`, format = "%Y-%m-%d %I:%M%p"),
    `Implied Volatility` = as.numeric(str_replace(`Implied Volatility`, "%", "")),
    `% Change` = as.numeric(str_replace(`% Change`, "%", "")),
    Volume = as.numeric(Volume),
    `Open Interest` = as.numeric(`Open Interest`)
  ) %>%
  filter(as.Date(`Last Trade Date`) == Sys.Date())
head(puts)
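Since the calls and puts code is identical apart from the selector, you may prefer to wrap it in a function. Here is a minimal sketch: read_option_table and its today_only argument are names I made up, and the toy page below is invented so the example runs without a network connection.

```r
library(tidyverse)
library(rvest)

# Hypothetical helper (not part of rvest): extract one options table by
# CSS selector and apply the same clean-up steps used above.
read_option_table <- function(page, css, today_only = TRUE) {
  quotes <- page %>%
    html_node(css) %>%
    html_table() %>%
    mutate(
      `Last Trade Date` = as.POSIXct(`Last Trade Date`, format = "%Y-%m-%d %I:%M%p"),
      `Implied Volatility` = as.numeric(str_replace(`Implied Volatility`, "%", "")),
      `% Change` = as.numeric(str_replace(`% Change`, "%", "")),
      Volume = as.numeric(Volume),
      `Open Interest` = as.numeric(`Open Interest`)
    )
  if (today_only) {
    # Keep only quotes last traded today
    quotes <- dplyr::filter(quotes, as.Date(`Last Trade Date`) == Sys.Date())
  }
  quotes
}

# Made-up one-row table standing in for the real page
toy_page <- read_html('<table class="puts">
  <tr><th>Last Trade Date</th><th>Strike</th><th>% Change</th>
      <th>Volume</th><th>Open Interest</th><th>Implied Volatility</th></tr>
  <tr><td>2020-08-14 3:59PM EDT</td><td>100</td><td>-1.50%</td>
      <td>12</td><td>340</td><td>45.31%</td></tr>
</table>')

# today_only = FALSE because the toy quote is dated in the past
toy_puts <- read_option_table(toy_page, ".puts", today_only = FALSE)
toy_puts
```

With the real page loaded, the two tables would then come from read_option_table(page, ".calls") and read_option_table(page, ".puts").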
That’s it: you now have call and put quotes in R. As you can see, working with R and the rvest library is really simple, and you can parse a complicated web page in minutes.