
Web scraping for drug safety

R-thritis Computing Group

David A. Selby

5th November 2021

1 / 20

Structure

1. Why Web scraping?

2. Intro to HTML/CSS

3. Web scraping with rvest

2 / 20

Why Web scraping?

3 / 20

Why Web scraping?

  • There's lots of useful information online

  • Not everything is a CSV file!

  • Faster / less error-prone than copying data manually

  • Fun

4 / 20

Motivating example

BNF logo

British National Formulary

BNF webpage screenshot

5 / 20

HTML for dummies

6 / 20

Example HTML document

<HTML>
<HEAD>
<TITLE>The title of my Web page</TITLE>
</HEAD>
<BODY>
<H1>A heading</H1>
<P>A paragraph about something.</P>
<P>A second paragraph about something <em>else</em></P>
<IMG SRC="logo.jpg" ALT="CfE logo">
<UL> <!-- This is an unordered list -->
<LI>A <A HREF="https://cfe.manchester.ac.uk">hyperlink</A>.</LI>
<LI>Another list item</LI>
</UL>
</BODY>
</HTML>
7 / 20

Example HTML document

8 / 20

Example HTML document

<HTML>
<HEAD>
<TITLE>The title of my Web page</TITLE>
</HEAD>
<BODY>
<H1 ID="headline">A heading</H1>
<P CLASS="intro">A paragraph about something.</P>
<P>A second paragraph about something <em>else</em></P>
<IMG SRC="logo.jpg" ALT="CfE logo" CLASS="logo">
<UL> <!-- This is an unordered list -->
<LI>A <A HREF="https://cfe.manchester.ac.uk">hyperlink</A>.</LI>
<LI>Another list item</LI>
</UL>
</BODY>
</HTML>
9 / 20

Cascading style sheets (CSS)

Use tags, classes and ids to identify objects in the DOM.

e.g. Select the headline text:

  • h1
  • h1#headline  (or  #headline)
  • body > :first-child

e.g. Select the introduction paragraph:

  • p.intro  (or  .intro)
  • p:first-of-type
  • h1+p
  • body > :nth-child(2)
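
These selectors can be checked from R before they are used for scraping. The sketch below rebuilds the body of the example page as an in-memory string (an assumption made so that it runs without `example.html` on disk) and confirms that each selector finds the intended element. The positional selectors are written with an explicit child combinator, e.g. `body > :first-child`, so that they match children of `<body>` rather than `<body>` itself. The helper `pick()` is hypothetical.

```r
library(rvest)

# In-memory copy of the example page body (same ids and classes as the slide)
page <- read_html('<html><head><title>The title of my Web page</title></head><body>
  <h1 id="headline">A heading</h1>
  <p class="intro">A paragraph about something.</p>
  <p>A second paragraph about something <em>else</em></p>
  <img src="logo.jpg" alt="CfE logo" class="logo">
</body></html>')

headline_selectors <- c('h1', '#headline', 'h1#headline', 'body > :first-child')
intro_selectors <- c('.intro', 'p.intro', 'p:first-of-type', 'h1 + p',
                     'body > :nth-child(2)')

# Hypothetical helper: text of the first element matched by a selector
pick <- function(selector) html_text2(html_element(page, selector))

headline_text <- vapply(headline_selectors, pick, character(1))
intro_text <- vapply(intro_selectors, pick, character(1))
```

Every entry of `headline_text` should be the heading text and every entry of `intro_text` the introduction paragraph, which is a quick way to compare alternative selectors in the element inspector against what rvest sees.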
10 / 20

Cascading style sheets (CSS)

Style:

  1. change the typeface
  2. centre the headline
  3. highlight the intro paragraph
  4. shrink the logo image

Add the following in <style> </style> tags:

body { font-family: 'Comic Sans MS'; }
h1#headline { text-align: center; }
.intro { background-color: yellow; }
.logo { width: 100px; }
11 / 20

Example HTML document with CSS

12 / 20

The element inspector

Explore the document object model (DOM) of any Web page:

13 / 20

Web scraping with rvest

15 / 20

Web scraping with rvest

library(rvest)
example <- read_html('example.html')
example
# {html_document}
# <html>
# [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
# [2] <body>\r\n <h1 id="headline">A heading</h1>\r\n <p class="intro">A ...
example %>% html_element('.intro')
# {html_node}
# <p class="intro">
example %>% html_element('.intro') %>% html_text()
# [1] "A paragraph about something."
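
Beyond the text of a single element, the same page can be queried for several elements at once or for attribute values. A minimal sketch, assuming an in-memory copy of part of the example document rather than the `example.html` file:

```r
library(rvest)

# In-memory copy of part of the example document (assumption: no file on disk)
page <- read_html('<html><body>
  <h1 id="headline">A heading</h1>
  <img src="logo.jpg" alt="CfE logo" class="logo">
  <ul>
    <li>A <a href="https://cfe.manchester.ac.uk">hyperlink</a>.</li>
    <li>Another list item</li>
  </ul>
</body></html>')

# html_elements() returns every match, not just the first
items <- page %>% html_elements('li') %>% html_text2()

# html_attr() reads an attribute instead of the text content
link_url <- page %>% html_element('a') %>% html_attr('href')
logo_alt <- page %>% html_element('img.logo') %>% html_attr('alt')
```

Here `items` holds one string per list item, while `link_url` and `logo_alt` hold the link target and the image's alternative text.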
16 / 20

Web scraping with rvest

library(tibble)
drug_index <- read_html('https://bnf.nice.org.uk/drug/')
drug_links <- drug_index %>% html_elements('.row ul li a')
drugs <- tibble(name = html_text2(drug_links),
                path = html_attr(drug_links, 'href'))
head(drugs)
# # A tibble: 6 x 2
#   name                                       path
#   <chr>                                      <chr>
# 1 ABACAVIR                                   abacavir.html
# 2 ABACAVIR WITH DOLUTEGRAVIR AND LAMIVUDINE  abacavir-with-dolutegravir-and-lami~
# 3 ABACAVIR WITH LAMIVUDINE                   abacavir-with-lamivudine.html
# 4 ABACAVIR WITH LAMIVUDINE AND ZIDOVUDINE    abacavir-with-lamivudine-and-zidovu~
# 5 ABATACEPT                                  abatacept.html
# 6 ABEMACICLIB                                abemaciclib.html
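
The `path` column holds links relative to the index page. One way to turn them into full addresses, sketched here with `xml2::url_absolute()` and two paths copied from the table above, is:

```r
library(xml2)

base_url <- 'https://bnf.nice.org.uk/drug/'
paths <- c('abacavir.html', 'abatacept.html')

# Resolve each relative path against the index page URL
urls <- url_absolute(paths, base_url)
```

Resolving against the base URL, rather than pasting strings together, also copes with links that are already absolute or that begin with a slash.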
17 / 20

Web scraping with rvest

library(rvest)
library(tidyverse)

# Scrape indications and doses from one BNF drug page into a tidy tibble
scrape_drug <- function(path) {
  # paste0() avoids the doubled slash that file.path() would insert here
  webpage <- read_html(paste0('https://bnf.nice.org.uk/drug/', path))
  name_of_drug <- webpage %>% html_element('h1') %>% html_text2()
  # One group per indication (condition)
  condition_grp <- webpage %>% html_elements('.indicationAndDoseGroup')
  condition_name <- map(condition_grp, ~ html_element(.x, '.indications') %>% html_text2())
  tibble(name_of_drug,
         condition = map_chr(condition_name, paste, collapse = '\n'),
         route_grp = map(condition_grp, html_elements, '.dosage-group') %>% map_depth(2, as.list)) %>%
    unnest(route_grp) %>%  # one row per route of administration
    mutate(route = map_chr(route_grp, ~ html_elements(.x, 'span.routesOfAdministration') %>% html_text2()),
           patient_grp = map(route_grp, html_elements, 'li.dose') %>% map_depth(2, as.list)) %>%
    unnest(patient_grp) %>%  # one row per patient group
    mutate(patient = map_chr(patient_grp, ~ html_element(.x, '.patientGroup span') %>% html_text2()),
           dose = map_chr(patient_grp, ~ html_elements(.x, 'p') %>% html_text2())) %>%
    select(-ends_with('_grp'))  # drop the intermediate node columns
}
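
The function follows a general pattern: select the outer groups, extract the inner items for each group into a list column, then `unnest()` so that each inner item gets its own row. A simplified sketch of that pattern on a toy fragment follows. The markup and the class name `group` are invented for illustration and are not the BNF page structure.

```r
library(rvest)
library(purrr)
library(tibble)
library(tidyr)

# Toy fragment: each group has one heading and one or more list items
toy <- read_html('<html><body>
  <div class="group"><h2>Pain</h2><ul><li>Adult</li><li>Child</li></ul></div>
  <div class="group"><h2>Fever</h2><ul><li>Child</li></ul></div>
</body></html>')

groups <- html_elements(toy, '.group')

nested <- tibble(
  # One heading per group
  condition = html_text2(html_element(groups, 'h2')),
  # A character vector of items per group, stored as a list column
  patient = map(groups, ~ html_text2(html_elements(.x, 'li')))
)

# Expand the list column: one row per (condition, patient) pair
flat <- unnest(nested, patient)
```

`flat` has three rows, with the condition repeated for each of its patient groups, which is the same shape as the drug table built by `scrape_drug()`.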
18 / 20

Ibuprofen example

scrape_drug('ibuprofen.html')
# # A tibble: 24 x 5
# name_of_drug condition route patient dose
# <chr> <chr> <chr> <chr> <chr>
# 1 IBUPROFEN "Pain and inflammatio~ By mouth us~ Adult Initially 300–400~
# 2 IBUPROFEN "Pain and inflammatio~ By mouth us~ Adult 1.6 g once daily,~
# 3 IBUPROFEN "Acute migraine\n" By mouth us~ Adult 400–600 mg for 1 ~
# 4 IBUPROFEN "Mild to moderate pai~ By mouth us~ Child 3–~ 50 mg 3 times a d~
# 5 IBUPROFEN "Mild to moderate pai~ By mouth us~ Child 6–~ 50 mg 3–4 times a~
# 6 IBUPROFEN "Mild to moderate pai~ By mouth us~ Child 1–~ 100 mg 3 times a ~
# 7 IBUPROFEN "Mild to moderate pai~ By mouth us~ Child 4–~ 150 mg 3 times a ~
# 8 IBUPROFEN "Mild to moderate pai~ By mouth us~ Child 7–~ 200 mg 3 times a ~
# 9 IBUPROFEN "Mild to moderate pai~ By mouth us~ Child 10~ 300 mg 3 times a ~
# 10 IBUPROFEN "Mild to moderate pai~ By mouth us~ Child 12~ Initially 300–400~
# # ... with 14 more rows
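
To scrape many drugs rather than one, a scraper can be mapped over a vector of paths, pausing between requests and skipping pages that fail. The sketch below is one possible approach, not part of the original slides: `scrape_many()` and `fake_scraper()` are hypothetical names, and the stand-in scraper returns placeholder values so that the example runs offline. In practice a function such as `scrape_drug` would be passed as `scraper`.

```r
library(purrr)
library(dplyr)

# Hypothetical helper: apply `scraper` to each path and combine the results
scrape_many <- function(paths, scraper, delay = 1) {
  # possibly() turns errors into NULL so one bad page does not stop the run
  safe_scraper <- possibly(scraper, otherwise = NULL)
  results <- map(paths, function(path) {
    Sys.sleep(delay)  # pause between requests to avoid hammering the server
    safe_scraper(path)
  })
  bind_rows(results)  # NULL entries from failed pages are dropped
}

# Stand-in scraper with placeholder data (not real BNF content)
fake_scraper <- function(path) {
  if (path == 'missing.html') stop('page not found')
  tibble(name_of_drug = toupper(sub('\\.html$', '', path)),
         dose = 'placeholder')
}

demo <- scrape_many(c('ibuprofen.html', 'missing.html', 'abatacept.html'),
                    fake_scraper, delay = 0)
```

`demo` contains one row for each page that was scraped successfully; the failing path is silently omitted, so it may be worth comparing the result against the input paths afterwards.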
19 / 20

More information

Upcoming R-thritis meetings

19 November
Topic/presenter to be confirmed
3 December
‘Advent of Code’ discussion
20 / 20
