class: logo-slide

---

class: title-slide

## Intro to Web Scraping

### Applications of Data Science - Class Bonus

### Giora Simchoni

#### `gsimchoni@gmail.com and add #dsapps in subject`

### Stat. and OR Department, TAU
### 2023-03-26

---

layout: true

<div class="my-footer">
  <span>
    <a href="https://dsapps-2023.github.io/Class_Slides/" target="_blank">Applications of Data Science</a>
  </span>
</div>

---

class: section-slide

# The Three Rules of Web Scraping

---

### Rule 1: Do you *really* need web scraping?

There are data APIs for just about anything, you know...

<img src="images/apis.png" style="width: 100%" />

---

#### R API Packages

Many of them are already accessible through an R/Python package...

```r
library(tidyverse) # for %>%, filter() and ggplot()
library(wbstats)

female_labor <- wb_data(
  indicator = c("women_lab_share" = "SL.TLF.TOTL.FE.ZS"),
  start_date = 1990,
  end_date = 2020
)

female_labor %>%
  filter(country %in% c("Israel", "United States")) %>%
  ggplot(aes(date, women_lab_share, color = country)) +
  geom_line(lwd = 2) +
  labs(title = "Share of women in labor force") +
  theme_light() +
  theme(text = element_text(size=16))
```

.font80percent[
From: https://cfss.uchicago.edu/notes/application-program-interface/
]

---

<img src="images/WB-Stats-1.png" width="100%" />

---

#### The `datapasta` package

My gift to you.

<div class = "no_shadow">
<p align="center">
<img class = "no_shadow" src="images/demo.jpg"/>
</p>
</div>

---

### Rule 2: Learn some HTML first!

HTML is a set (or tree) of *elements*, marked by *HTML tags*:

.pull-left[
<img src="images/html1.png" style="width: 100%" />
]

.pull-right[
<img src="images/webpage1.png" style="width: 95%" />
]

- First children in the tree: `head` and `body`
- View any page's HTML (in Chrome) with right-click and "View page source" (or Ctrl + U)

---

#### Useful elements and attributes to know

- `<p>` for paragraph `</p>`
- `<h1>` for headings `</h1>`
- `<br>`, `<hr>` for breaks
- `<a href = "http://www.google.com">` for links `</a>`
- `<b><i>` for bold, italic etc. `</i></b>`
- `<img src="img_name.jpg" alt="Alternative text">`
- `<p style="color:DodgerBlue;">` for font color `</p>`

---

#### HTML Tables

A big thing when it comes to data as you can imagine...

.pull-left[
<img src="images/html2.png" style="width: 60%" />
]

.pull-right[
<img src="images/webpage2.png" style="width: 120%" />
]

---

#### HTML Classes

A class attribute refers to a style defined in a style sheet, and lets you reuse that style across elements.

.pull-left[
<img src="images/html3.png" style="width: 80%" />
]

.pull-right[
<img src="images/webpage3.png" style="width: 120%" />
]

---

### Rule 3: Be polite!

With great power comes great responsibility.

See e.g. the [polite](https://dmi3kno.github.io/polite/) package.

<img src="images/polite_logo.png" style="width: 30%" />

---

class: section-slide

# rvest

---

### `read_html()`

You're now an NLP expert, and you've just developed a SOTA Q&A model.

How would you demonstrate your model's performance?

How about [triviaquestionsnow.com](https://www.triviaquestionsnow.com/)? Let's scrape a few Q&As. Politely.

```r
library(rvest)

url <- "https://www.triviaquestionsnow.com/for/sports-trivia"

html_obj <- read_html(url)
```

`read_html()` is usually where you'd start. What did you get?

```r
class(html_obj)
```

```
## [1] "xml_document" "xml_node"
```

---

### View page source

With time, you'll become friendly with this weird object. Right now it is better than...

<img src="images/view_page_source.png" style="width: 100%" />

---

### `html_children()` and `html_node()`

Our tree has two children: `head` and `body`

```r
html_obj %>% html_children()
```

```
## {xml_nodeset (2)}
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body>\n <div class="title-bar" data-responsive-toggle="nav" data-hide ...
```

Again, notice that the object returned might not be familiar ("`xml_nodeset`").

And each of the children has children of its own:

```r
html_obj %>%
  html_node("body") %>%
  html_children()
```

```
## {xml_nodeset (6)}
## [1] <div class="title-bar" data-responsive-toggle="nav" data-hide-for="medium ...
## [2] <div id="nav" class="top-bar">\n <div class="row">\n <d ...
## [3] <div class="wrap bg-grey-light t-pad-20 b-pad-20">\n <div clas ...
## [4] <script src="https://www.triviaquestionsnow.com/js/all.js?v=1"></script>
## [5] <script async src="https://www.googletagmanager.com/gtag/js?id=UA-1150690 ...
## [6] <script>\n window.dataLayer = window.dataLayer || [];\n fun ...
```

---

### `html_nodes()`

Usually we'd figure out a rule and want a list of all relevant nodes:

```r
html_obj %>% html_nodes("img")
```

```
## {xml_nodeset (8)}
## [1] <img src="https://www.triviaquestionsnow.com/img/trivia-questions.png" al ...
## [2] <img src="https://www.triviaquestionsnow.com/img/trivia-questions.png" al ...
## [3] <img src="https://www.triviaquestionsnow.com/img/category/360x130/-catego ...
## [4] <img src="https://www.triviaquestionsnow.com/img/category/360x130/-catego ...
## [5] <img src="https://www.triviaquestionsnow.com/img/category/360x130/-catego ...
## [6] <img src="https://www.triviaquestionsnow.com/img/category/360x130/apologe ...
## [7] <img src="https://www.triviaquestionsnow.com/img/category/360x130/-catego ...
## [8] <img src="https://www.triviaquestionsnow.com/img/category/360x130/-catego ...
```

```r
html_obj %>% html_nodes("a")
```

```
## {xml_nodeset (44)}
## [1] <a href="/">\n <img src="https://www.triviaquestionsnow.c ...
## [2] <a href="https://www.triviaquestionsnow.com" class="no-pad">\n ...
## [3] <a href="https://www.triviaquestionsnow.com/easy-trivia-questions">Easy ...
## [4] <a href="https://www.triviaquestionsnow.com/for/sports-trivia">Sports Tr ...
## [5] <a href="https://www.triviaquestionsnow.com/for/music-trivia">Music Triv ...
## [6] <a href="https://www.triviaquestionsnow.com/for/math-trivia">Math Trivia ...
## [7] <a href="https://www.triviaquestionsnow.com/categories">Categories</a>
## [8] <a href="https://www.triviaquestionsnow.com/all">All Trivia</a>
## [9] <a href="https://www.triviaquestionsnow.com/question/which-country-won-t ...
## [10] <a href="#" class="click-to-show bold" ng-click="question.clickShow($eve ...
## [11] <a href="https://www.triviaquestionsnow.com/question/former-nba-player-w ...
## [12] <a href="#" class="click-to-show bold" ng-click="question.clickShow($eve ...
## [13] <a href="https://www.triviaquestionsnow.com/question/super-bowl-to-have- ...
## [14] <a href="#" class="click-to-show bold" ng-click="question.clickShow($eve ...
## [15] <a href="https://www.triviaquestionsnow.com/question/knocked-roger-feder ...
## [16] <a href="#" class="click-to-show bold" ng-click="question.clickShow($eve ...
## [17] <a href="https://www.triviaquestionsnow.com/question/first-country-to-wi ...
## [18] <a href="#" class="click-to-show bold" ng-click="question.clickShow($eve ...
## [19] <a href="https://www.triviaquestionsnow.com/question/in-the-sport-of-ten ...
## [20] <a href="#" class="click-to-show bold" ng-click="question.clickShow($eve ...
## ...
```

---

### `html_attr()`

Getting a specific attribute from those nodes:

```r
html_obj %>%
  html_nodes("img") %>%
  html_attr("src")
```

```
## [1] "https://www.triviaquestionsnow.com/img/trivia-questions.png"
## [2] "https://www.triviaquestionsnow.com/img/trivia-questions.png"
## [3] "https://www.triviaquestionsnow.com/img/category/360x130/-category-18-1485571402.jpg"
## [4] "https://www.triviaquestionsnow.com/img/category/360x130/-category-17-1485571416.jpg"
## [5] "https://www.triviaquestionsnow.com/img/category/360x130/-category-14-1485571440.jpg"
## [6] "https://www.triviaquestionsnow.com/img/category/360x130/apologeticsbooksforkids_igh35k.jpg"
## [7] "https://www.triviaquestionsnow.com/img/category/360x130/-category-12-1485571483.jpg"
## [8] "https://www.triviaquestionsnow.com/img/category/360x130/-category-20-1485571383.jpg"
```

```r
html_obj %>%
  html_nodes("a") %>%
  html_attr("href")
```

```
## [1] "/"
## [2] "https://www.triviaquestionsnow.com"
## [3] "https://www.triviaquestionsnow.com/easy-trivia-questions"
## [4] "https://www.triviaquestionsnow.com/for/sports-trivia"
## [5] "https://www.triviaquestionsnow.com/for/music-trivia"
## [6] "https://www.triviaquestionsnow.com/for/math-trivia"
## [7] "https://www.triviaquestionsnow.com/categories"
## [8] "https://www.triviaquestionsnow.com/all"
## [9] "https://www.triviaquestionsnow.com/question/which-country-won-the-2015-davis-cup"
## [10] "#"
## [11] "https://www.triviaquestionsnow.com/question/former-nba-player-with-starring-role-in-the-1980-comedy-movie-airplane"
## [12] "#"
## [13] "https://www.triviaquestionsnow.com/question/super-bowl-to-have-most-points-scored-ever"
## [14] "#"
## [15] "https://www.triviaquestionsnow.com/question/knocked-roger-federer-out-of-quarterfinals-of-mens-singles-of-2018-wimbledon-championship"
## [16] "#"
## [17] "https://www.triviaquestionsnow.com/question/first-country-to-win-fed-davis-and-hopman-cups-in-a-single-calendar-year"
## [18] "#"
## [19] "https://www.triviaquestionsnow.com/question/in-the-sport-of-tennis-the-acronym-atp-stands-for-what"
## [20] "#"
## [21] "https://www.triviaquestionsnow.com/question/the-final-of-the-mens-doubles-of-the-2012-us-open-was-won-by-which-set-of-two-players"
## [22] "#"
## [23] "https://www.triviaquestionsnow.com/question/which-of-the-following-is-not-a-possible-outcome-in-an-attempt-to-serve-in-tennis"
## [24] "#"
## [25] "https://www.triviaquestionsnow.com/question/former-nfl-player-starring-in-film-reggies-prayer"
## [26] "#"
## [27] "https://www.triviaquestionsnow.com/question/the-ground-on-which-the-game-of-golf-is-played-is-known-as"
## [28] "#"
## [29] "https://www.triviaquestionsnow.com/for/sports-trivia?page=2"
## [30] "https://www.triviaquestionsnow.com/for/sports-trivia?page=3"
## [31] "https://www.triviaquestionsnow.com/for/sports-trivia?page=4"
## [32] "https://www.triviaquestionsnow.com/for/sports-trivia?page=5"
## [33] "https://www.triviaquestionsnow.com/for/sports-trivia?page=6"
## [34] "https://www.triviaquestionsnow.com/for/sports-trivia?page=7"
## [35] "https://www.triviaquestionsnow.com/for/sports-trivia?page=2"
## [36] "https://www.triviaquestionsnow.com/for/food-drink-trivia"
## [37] "https://www.triviaquestionsnow.com/for/science-trivia"
## [38] "https://www.triviaquestionsnow.com/for/bible-trivia"
## [39] "https://www.triviaquestionsnow.com/for/kids-trivia"
## [40] "https://www.triviaquestionsnow.com/for/history-trivia"
## [41] "https://www.triviaquestionsnow.com/for/video-games-trivia"
## [42] "https://www.triviaquestionsnow.com/categories"
## [43] "javascript:void(0);"
## [44] "https://www.triviaquestionsnow.com/privacy"
```

---

### `html_text()`

Getting the text from whatever set of elements we have:

```r
html_obj %>%
  html_nodes("a") %>%
  html_text()
```

```
## [1] "\n "
## [2] "\n "
## [3] "Easy Trivia"
## [4] "Sports Trivia"
## [5] "Music Trivia"
## [6] "Math Trivia"
## [7] "Categories"
## [8] "All Trivia"
## [9] "\n Which Country won the 2015 Davis Cup?\n "
## [10] "Show Answer"
## [11] "\n Which former NBA player had a starring role in the 1980 comedy movie \"Airplane!\"?\n "
## [12] "Show Answer"
## [13] "\n Which Super Bowl had the most points ever scored?\n "
## [14] "Show Answer"
## [15] "\n Who knocked Roger Federer out of the quarterfinals of the men's singles of the 2018 Wimbledon Championship?\n "
## [16] "Show Answer"
## [17] "\n Which country was the first to win the Fed Cup, the Davis Cup, and the Hopman Cup in a single calendar year?\n "
## [18] "Show Answer"
## [19] "\n In the sport of tennis, the acronym ATP stands for what?\n "
## [20] "Show Answer"
## [21] "\n The final of the men's doubles of the 2012 U.S. Open was won by which set of two players?\n "
## [22] "Show Answer"
## [23] "\n Which of the following is not a possible outcome in an attempt to serve in tennis?\n "
## [24] "Show Answer"
## [25] "\n \"Reggie's Prayer\" is a film starring with former NFL player?\n "
## [26] "Show Answer"
## [27] "\n The ground on which the game of golf is played is known as?\n "
## [28] "Show Answer"
## [29] "2"
## [30] "3"
## [31] "4"
## [32] "5"
## [33] "6"
## [34] "7"
## [35] "\r\n Next\r\n "
## [36] "\n \n Food & Drink Trivia\n \n "
## [37] "\n \n Science Trivia\n \n "
## [38] "\n \n Bible Trivia\n \n "
## [39] "\n \n Kids Trivia\n \n "
## [40] "\n \n History Trivia\n \n "
## [41] "\n \n Video Games Trivia\n \n "
## [42] "See All Categories"
## [43] "feedback box"
## [44] "Privacy Policy"
```

---

### How to get to those questions? Option 1

Look at the page source and pick out an identifier yourself (class, ID, link):

.pull-left[
<img src="images/question.png" style="width: 100%" />
]

.pull-right[
<img src="images/question_html.png" style="width: 100%" />
]

```r
html_obj %>%
  html_nodes(".question") %>%
  .[[1]]
```

```
## {html_node}
## <div class="question callout" ng-controller="QuestionController as question">
## [1] <input type="hidden" ng-model="question.data.id" ng-init="question.data.i ...
## [2] <span class="float-right light-grey bold l-cush-10">Easy</span>
## [3] <h3 class="fs-1 bold">\n <a href="https://www.triviaquestionsn ...
## [4] <div class="t-pad-10">\n <a href="#" class="click-to-show bold ...
```

---

After some trial and error...

```r
html_obj %>%
  html_nodes(".question") %>%
  html_nodes(".fs-1") %>%
  html_text() %>%
  str_trim()
```

```
## [1] "Which Country won the 2015 Davis Cup?"
## [2] "Which former NBA player had a starring role in the 1980 comedy movie \"Airplane!\"?"
## [3] "Which Super Bowl had the most points ever scored?"
## [4] "Who knocked Roger Federer out of the quarterfinals of the men's singles of the 2018 Wimbledon Championship?"
## [5] "Which country was the first to win the Fed Cup, the Davis Cup, and the Hopman Cup in a single calendar year?"
## [6] "In the sport of tennis, the acronym ATP stands for what?"
## [7] "The final of the men's doubles of the 2012 U.S. Open was won by which set of two players?"
## [8] "Which of the following is not a possible outcome in an attempt to serve in tennis?"
## [9] "\"Reggie's Prayer\" is a film starring with former NFL player?"
## [10] "The ground on which the game of golf is played is known as?"
```

---

### How to get to those questions? Option 2

[SelectorGadget](https://selectorgadget.com/)!

<div class = "no_shadow">
<p align="center">
<img class = "no_shadow" src="images/demo.jpg"/>
</p>
</div>

---

### From here it's a function fest!
```r
# Scrape one page and return a tibble with one row per question
extract_questions_and_answers_from_page <- function(url) {
  html_obj <- read_html(url)

  # difficulty label shown next to each question (e.g. "Easy")
  levels <- html_obj %>%
    html_nodes(".question") %>%
    html_nodes(".l-cush-10") %>%
    html_text()

  questions <- html_obj %>%
    html_nodes(".question") %>%
    html_nodes(".fs-1") %>%
    html_text() %>%
    str_trim()

  # keep only the text that follows "Answer: "
  answers <- html_obj %>%
    html_nodes(".question") %>%
    html_nodes(".answer") %>%
    html_text() %>%
    str_extract(., "Answer:.*") %>%
    str_replace("Answer: ", "")

  tibble(level = levels, question = questions, answer = answers)
}

extract_questions_and_answers_from_page(url)
```

```
## # A tibble: 10 × 3
##    level    question                                                     answer
##    <chr>    <chr>                                                        <chr>
##  1 Medium   Which tennis player won the 2016 Laurens World Sports Award? Novak…
##  2 Easy     How many teams are there currently in the NFL? 32
##  3 Hard     The Doak Walker Award is given annually to the top college p… Runni…
##  4 VeryHard Who was the first foreign-born NBA player to be selected num… Mycha…
##  5 Medium   Which NBA player once costarred alongside Jean-Claude Van Da… Denni…
##  6 Easy     Who was the winner of the men's singles of the 2006 Tennis M… Roger…
##  7 VeryHard When was the first year the three-point shot was introduced … 1979
##  8 Medium   Which player was the MVP of the 2014 NBA Finals? Kawhi…
##  9 Medium   In 1999, the Chicago Bulls traded established superstar Scot… Houst…
## 10 VeryHard Within the first 70 years of NBA history (1947-2017), who wa… Scott…
```

---

### Pagination

```r
# Build the URL of page `page_num` for a given topic
create_page_url <- function(topic, page_num) {
  str_c("https://www.triviaquestionsnow.com/for/", topic, "-trivia?page=", page_num)
}

# Scrape the first `n` pages of a topic and stack them into one tibble
extract_multiple_pages_single_topic <- function(topic, n = 5) {
  cat(topic, "\n")
  res <- map_dfr(
    1:n,
    function(i) {
      cat(" ", i)
      url <- create_page_url(topic, i)
      extract_questions_and_answers_from_page(url)
    }
  )
  res$topic <- topic
  cat("\n")
  res
}
```

---

```r
extract_multiple_pages_single_topic("sports")
```

```
## # A tibble: 50 × 4
##    level    question                                                answer topic
##    <chr>    <chr>                                                   <chr>  <chr>
##  1 VeryHard "In 1948, which NBA basketball team did the Harlem Glo… Minne… spor…
##  2 Medium   "The Jacksonville Jaguars and Carolina Panthers entere… 1995 spor…
##  3 Hard     "An automatic progression by a player to the next stag… Bye spor…
##  4 VeryHard "In 2016, Giants' wide receiver Odell Beckham, Jr. app… Code … spor…
##  5 Hard     "Which NBA player broke the record for most points sco… Jerem… spor…
##  6 Medium   "Before relocating to Foxborough, Massachusetts, what … Boston spor…
##  7 Easy     "What is the term for the historic jerseys today worn … Throw… spor…
##  8 Medium   "Who served as the starting center of the Golden State… Andre… spor…
##  9 Easy     "In what year was the 4 minute mile achieved?" 1954 … spor…
## 10 Hard     "Who was the first tennis player to complete a \"Grand… Don B… spor…
## # … with 40 more rows
```

---

### Magic!
```r
topics <- c("sports", "kids", "science", "bible", "food-drink", "history", "geography", "video-games")

df_all <- map_dfr(
  topics,
  extract_multiple_pages_single_topic
)

df_all %>% count(topic)
```

```
## # A tibble: 8 × 2
##   topic           n
##   <chr>       <int>
## 1 bible          50
## 2 food-drink     50
## 3 geography      50
## 4 history        50
## 5 kids           50
## 6 science        50
## 7 sports         50
## 8 video-games    50
```

---

class: section-slide

# BeautifulSoup

---

### Almost always start with

```python
import requests
from bs4 import BeautifulSoup

html_obj = requests.get('https://en.wikipedia.org/wiki/List_of_The_Real_Housewives_cast_members')
soup = BeautifulSoup(html_obj.content, 'html.parser')
type(soup)
```

```
## <class 'bs4.BeautifulSoup'>
```

This object has all sorts of attributes and methods:

```python
soup.get_text()
soup.prettify()
soup.attrs
soup.children
soup.title
```

---

### `find()` a tag, `find_all()`

```python
link_objs = soup.find_all('a', href=True)
type(link_objs)
```

```
## <class 'bs4.element.ResultSet'>
```

```python
type(link_objs[3])
```

```
## <class 'bs4.element.Tag'>
```

```python
link_objs[3].text
```

```
## 'Current events'
```

```python
link_objs[3].attrs
```

```
## {'href': '/wiki/Portal:Current_events', 'title': 'Articles related to current events'}
```

See the actual link in the [page](https://en.wikipedia.org/wiki/List_of_The_Real_Housewives_cast_members).
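---

### Relative links

The `href` above is relative (`/wiki/...`), so it is not a usable URL on its own. A minimal sketch of one way to resolve such links, using only the standard library's `urljoin`. The helper name `to_absolute` and the image path are hypothetical and only serve the illustration:

```python
from urllib.parse import urljoin

# The page the links were scraped from (same URL as in the slides).
PAGE_URL = "https://en.wikipedia.org/wiki/List_of_The_Real_Housewives_cast_members"


def to_absolute(href, base=PAGE_URL):
    """Resolve an href relative to the page it was found on (hypothetical helper)."""
    return urljoin(base, href)


# Root-relative link, as in link_objs[3].attrs['href']
print(to_absolute("/wiki/Portal:Current_events"))

# Protocol-relative link ("//host/path"); this image path is made up for the example
print(to_absolute("//upload.wikimedia.org/example.jpg"))

# An absolute link is returned unchanged
print(to_absolute("https://www.example.com/a"))
```

Compared with gluing strings together, `urljoin` copes with root-relative, protocol-relative and already-absolute links through the same call.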
---

### Getting that `table`

```python
table = soup.find('table', attrs={'class':'wikitable'})
table_body = table.find('tbody')

rows = table_body.find_all('tr')

print(len(rows))
```

```
## 155
```

```python
print(rows[0])
```

```
## <tr>
## <th rowspan="2">Installment
## </th>
## <th rowspan="2">Housewives
## </th>
## <th rowspan="2">First season<br/>starred
## </th>
## <th rowspan="2">Last season<br/>starred
## </th>
## <th colspan="4">Number of seasons
## </th></tr>
```

---

### Getting a Housewife name

<img src="images/housewife_row_html.png" style="width: 100%" />

```python
import re

print(rows[3].find('span', attrs = {'data-sort-value': re.compile(r'.*')}))
```

```
## <span data-sort-value="De La Rosa, Jo !">Jo De La Rosa</span>
```

---

### Getting only HWives with Wiki pages

```python
housewives_with_links = []
for row in rows:
    # the name sits in a <span> carrying a data-sort-value attribute
    housewife = row.find('span', attrs = {'data-sort-value': re.compile(r'.*')})
    if housewife is not None:
        link = housewife.find('a')
        if link is not None:
            housewives_with_links.append((housewife.text, link['href']))

import pandas as pd

h_df = pd.DataFrame(housewives_with_links, columns=['name', 'link'])
h_df.head()
```

```
##                     name                         link
## 0        Vicki Gunvalson        /wiki/Vicki_Gunvalson
## 1           Jeana Keough           /wiki/Jeana_Keough
## 2            Tamra Judge            /wiki/Tamra_Judge
## 3         Heather Dubrow         /wiki/Heather_Dubrow
## 4  Shannon Storms Beador  /wiki/Shannon_Storms_Beador
```

---

### (Though if your table is simple, try:)

```python
from io import StringIO

# Wrap the HTML in StringIO: newer pandas versions deprecate passing a raw HTML string
l = pd.read_html(StringIO(html_obj.text))
l[0].head()
```

```
##      Installment       Housewives  ... Number of seasons
##      Installment       Housewives  ...             Guest Ultimate Girls Trip
## 0  Orange County  Kimberly Bryant  ...                 3                   0
## 1  Orange County    Jo De La Rosa  ...                 2                   0
## 2  Orange County  Vicki Gunvalson  ...                 0                   2
## 3  Orange County     Jeana Keough  ...                 5                   0
## 4  Orange County   Lauri Peterson  ...                 1                   0
##
## [5 rows x 8 columns]
```

---

### Following HWives Links

```python
def get_housewife_img_ref(housewife_link):
    html_obj = requests.get('https://en.wikipedia.org' + housewife_link)
    soup = BeautifulSoup(html_obj.content, 'html.parser')
    # the infobox is the table with class "vcard"; take its first image, if any
    infobox = soup.find('table', attrs = {'class': 'vcard'})
    if infobox is not None:
        img_obj = infobox.find('img', src=True)
        if img_obj is not None:
            return img_obj['src']
    return None

h_df['img_ref'] = h_df['link'].apply(get_housewife_img_ref)
h_df.dropna(inplace=True)
h_df.head()
```

```
##                       name  ...                                            img_ref
## 0          Vicki Gunvalson  ...  //upload.wikimedia.org/wikipedia/commons/thumb...
## 1         Luann de Lesseps  ...  //upload.wikimedia.org/wikipedia/commons/thumb...
## 2         Bethenny Frankel  ...  //upload.wikimedia.org/wikipedia/commons/thumb...
## 3  Kelly Killoren Bensimon  ...  //upload.wikimedia.org/wikipedia/commons/thumb...
## 4         Carole Radziwill  ...  //upload.wikimedia.org/wikipedia/commons/0/08/...
##
## [5 rows x 3 columns]
```

---

### Downloading HWives Images

```python
import os

# open() below fails if the target folder is missing, so create it first
os.makedirs('data/housewives', exist_ok=True)

def make_img_filename(hf_name):
    return 'data/housewives/' + hf_name.lower().strip(',.-').replace(' ', '_') + '.jpg'

def download_hw_img(hf_name, hf_img_ref):
    img_file = make_img_filename(hf_name)
    # img_ref is protocol-relative ("//upload..."), so prepend a scheme
    img_data = requests.get('http:' + hf_img_ref).content
    with open(img_file, 'wb') as handler:
        handler.write(img_data)

h_df.apply(lambda row: download_hw_img(row['name'], row['img_ref']), axis=1)
```
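---

### Being polite in Python

In the spirit of Rule 3, a loop that fires many requests could first check `robots.txt` and pause between calls. The following is only a sketch built on the standard library's `urllib.robotparser`: the `robots.txt` content is invented and inlined so the example runs offline, the example.com URLs are placeholders, and `polite_urls` is a hypothetical helper, not part of `requests` or BeautifulSoup.

```python
import time
from urllib.robotparser import RobotFileParser

# Invented robots.txt content, inlined so the sketch needs no network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_robots(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    robots = RobotFileParser()
    robots.parse(robots_txt.splitlines())
    return robots


def polite_urls(urls, robots, user_agent="*", delay_seconds=1.0):
    """Yield only the URLs robots.txt allows, sleeping between consecutive ones."""
    first = True
    for url in urls:
        if not robots.can_fetch(user_agent, url):
            continue  # skip anything the site asked crawlers to avoid
        if not first:
            time.sleep(delay_seconds)
        first = False
        yield url


robots = build_robots(ROBOTS_TXT)
candidates = [
    "https://example.com/images/a.jpg",
    "https://example.com/private/b.jpg",
    "https://example.com/images/c.jpg",
]
for url in polite_urls(candidates, robots, delay_seconds=0):
    print("would fetch:", url)
```

In a real run the parser would be pointed at the site's own `robots.txt` (for example via `set_url()` and `read()`), and `delay_seconds` would be left at a second or more.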