Preliminaries

Choose a directory to hold the work you are going to do on this course: I suggest P:/R_course. Create a variable containing the name of your chosen directory:

mydir <- "P:/R_course"

Of course, the directory you are going to use must exist before you can save any files in it. If it does not exist yet, you can use

dir.create(mydir)

to create it, provided that the new directory is only one level below an already existing directory. To make sure that it exists, type

setwd(mydir)

to change to it, and make sure you don’t get an error message in response.
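
If the parent directory might not exist either, dir.create() accepts a recursive argument. The following is a minimal sketch rather than part of the original practical: it uses a hypothetical nested path under tempdir() (the name demo_dir is made up for illustration) so that it can be run safely on any machine, and it checks for the directory with dir.exists() before creating it.

```r
# Hypothetical nested directory under tempdir(), used purely for illustration
demo_dir <- file.path(tempdir(), "R_course", "practical")

# recursive = TRUE creates any missing parent directories as well
if (!dir.exists(demo_dir)) {
  dir.create(demo_dir, recursive = TRUE)
}

# TRUE if the directory now exists
dir.exists(demo_dir)
```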

Creating and Modifying data

We are now going to practice manipulating a data table. In order to do this, we will read in a dataset. Since this practical is based on the practical from the Statistical Modelling with Stata course, we will use the same dataset. This will require us to load the haven package, which contains the read_stata() function. We will also load the data.table package so that we can use data tables rather than data frames, the expss package so we can use labels, and the stringr package for manipulating string variables.

library(data.table)
library(expss)
library(haven)
library(stringr)

First we will read in the auto dataset. This contains a variable called weight, which is the vehicle’s weight in pounds (it’s an unashamedly American dataset).

auto <- read_stata("https://personalpages.manchester.ac.uk/staff/mark.lunt/data/auto.dta")

Since a kg is 2.2046 pounds, we create a variable containing the weight of the vehicle in kilos by dividing the weight in pounds by 2.2046. It’s a good idea to label the variable so you remember what it means.

auto$wtkg <- auto$weight/2.2046
var_lab(auto$wtkg) = "Weight in kg"
summary(auto$wtkg)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   798.3  1020.6  1447.0  1369.6  1632.9  2195.4
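
Because the data.table package is loaded, the same calculation could also be written in data.table syntax. This is only a sketch on a small made-up table (toy_dt and its three weights are illustrative, not read from the auto data); the := operator adds the new column in place.

```r
library(data.table)

# Illustrative weights in pounds (not read from the auto dataset)
toy_dt <- data.table(weight = c(1760, 3000, 4840))

# := adds the new column by reference, without copying the table
toy_dt[, wtkg := weight / 2.2046]

round(toy_dt$wtkg, 1)
```

A data frame returned by read_stata() would first need converting, for example with as.data.table(auto), before this syntax could be used on it.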

Creating Indicator variables

Very often, you want to generate a variable that can only take two values, representing “true” and “false”. As an example, we will create a variable short, which takes the value TRUE (treated as 1 in arithmetic) for all cars less than 190 inches long and FALSE (treated as 0) for longer cars. This can be done with the command

auto$short <- (auto$length < 190)
var_lab(auto$short) = "Vehicle is less than 190 inches long"
table(auto$short)
## 
## FALSE  TRUE 
##    38    36
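
A comparison in R returns a logical vector, and R treats TRUE as 1 and FALSE as 0 in arithmetic. The sketch below illustrates this on a made-up vector of lengths (toy_length is hypothetical, not the auto data), including how to obtain an explicit 0/1 variable if one is needed.

```r
# Illustrative lengths in inches (not taken from the auto dataset)
toy_length <- c(170, 186, 190, 222)

# The comparison gives a logical vector: TRUE TRUE FALSE FALSE
toy_short <- toy_length < 190

# Convert to an explicit 0/1 integer variable: 1 1 0 0
as.integer(toy_short)

# Summing counts the TRUE values; the mean is the proportion that are TRUE
sum(toy_short)   # 2
mean(toy_short)  # 0.5
```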

Creating Tertiles

Suppose that you want to divide the cars into tertiles by weight. There is a cut() function in R, but it requires the actual weight values at which you wish to split the data. These can be calculated using the quantile() function, if we pass it the proportions corresponding to the quantiles at which we want the data to be split. So we end up with

breakpoints <- quantile(auto$weight, probs=c(0, 0.33, 0.67, 1))
auto$wtt <- cut(auto$weight, breaks=breakpoints, include.lowest=TRUE, labels = c("Lowest tertile", "Mid tertile", "Upper tertile"))
var_lab(auto$wtt) <- "Tertiles of weight"
table(auto$wtt)
## 
## Lowest tertile    Mid tertile  Upper tertile 
##             25             24             25
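
The include.lowest=TRUE argument matters because cut() builds intervals that are open on the left by default, so the smallest observation would otherwise fall outside the first interval and become NA. The sketch below shows this on the made-up values 1 to 9 (toy_x is hypothetical), using the proportions 1/3 and 2/3 rather than the rounded 0.33 and 0.67.

```r
# Illustrative data: the integers 1 to 9
toy_x <- 1:9
toy_breaks <- quantile(toy_x, probs = c(0, 1/3, 2/3, 1))

# Without include.lowest, the minimum value is not in any interval
without_lowest <- cut(toy_x, breaks = toy_breaks)
sum(is.na(without_lowest))   # 1

# With include.lowest = TRUE, every value is assigned to a tertile
with_lowest <- cut(toy_x, breaks = toy_breaks, include.lowest = TRUE,
                   labels = c("Lowest tertile", "Mid tertile", "Upper tertile"))
table(with_lowest)           # 3 values in each tertile
```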

Creating a string variable

The package stringr contains a huge number of functions for manipulating strings. We will just try a single example here. The function word() splits a string (or a vector of strings, which is the R equivalent of a string variable) into separate words. You can specify which word or words you wish to keep. In our example, we will extract the make (the first word) from the variable make (which, despite its name, contains both the make and the model: look at the label for that variable if you don’t believe me).

auto$company <- word(auto$make, 1)
table(auto$company)
## 
##     AMC    Audi     BMW   Buick    Cad.   Chev.  Datsun   Dodge    Fiat    Ford 
##       3       2       1       7       3       6       4       4       1       2 
##   Honda   Linc.   Mazda   Merc.    Olds Peugeot   Plym.   Pont. Renault  Subaru 
##       2       3       1       6       7       1       5       6       1       1 
##  Toyota   Volvo      VW 
##       3       1       4
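
The word() function also takes an end position, and negative positions count from the end of the string, so the remaining words can be extracted in the same way. This is a sketch on three made-up strings in the style of the make variable (toy_make is hypothetical), each containing at least two words.

```r
library(stringr)

# Illustrative strings in the same "make model" style as auto$make
toy_make <- c("AMC Concord", "Buick Le Sabre", "VW Rabbit")

# First word only: the company
toy_company <- word(toy_make, 1)
toy_company   # "AMC" "Buick" "VW"

# From the second word to the last word: the model
toy_model <- word(toy_make, 2, -1)
toy_model     # "Concord" "Le Sabre" "Rabbit"
```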

Manipulating datasets

To practice manipulating data frames, we will use the bplong data. This consists of 2 blood pressure measurements for each of 120 subjects, one taken before an intervention (identified by the variable when = 1) and one taken after the intervention (when = 2). We will split that into two separate data frames, one with the observations before the intervention and one with the observations after it.

bplong <- read_stata("https://personalpages.manchester.ac.uk/staff/mark.lunt/data/bplong.dta")
dim(bplong)
## [1] 240   5
bpbefore <- bplong[bplong$when == 1,]
bpafter  <- bplong[bplong$when == 2,]
dim(bpbefore)
## [1] 120   5
dim(bpafter)
## [1] 120   5
summary(bpbefore$bp)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   138.0   147.0   154.5   156.4   164.0   185.0
summary(bpafter$bp)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   125.0   140.8   149.5   151.4   161.0   185.0

Append

If we wish to append these datasets, we can simply use the rbind() function, which binds rows from all of the data frames passed to it into a single data frame.

new_bp   <- rbind(bpbefore, bpafter)
summary(new_bp[new_bp$when==1,"bp"])
##        bp       
##  Min.   :138.0  
##  1st Qu.:147.0  
##  Median :154.5  
##  Mean   :156.4  
##  3rd Qu.:164.0  
##  Max.   :185.0
summary(new_bp[new_bp$when==2,"bp"])
##        bp       
##  Min.   :125.0  
##  1st Qu.:140.8  
##  Median :149.5  
##  Mean   :151.4  
##  3rd Qu.:161.0  
##  Max.   :185.0
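
For rbind() to work on data frames, the data frames must have the same column names, although the columns need not be in the same order. A minimal sketch on made-up data (toy_before, toy_after and their values are hypothetical, not the bplong data):

```r
# Two illustrative data frames with the same columns
toy_before <- data.frame(patient = 1:3, when = 1, bp = c(143, 163, 153))
toy_after  <- data.frame(patient = 1:3, when = 2, bp = c(153, 170, 168))

# Stack the rows of the second data frame below those of the first
toy_long <- rbind(toy_before, toy_after)

nrow(toy_long)        # 6: three rows from each data frame
table(toy_long$when)  # number of rows for each value of when
```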

Merge

We can also merge those two datasets, to get a data frame with 120 observations, one for each subject. Each record will need to have two blood pressure measurements, one taken before the intervention, the other taken after it. It would make sense to give the blood pressure measurements different names in the two datasets, to make it easier to distinguish them after merging: we will call them bp_before and bp_after. Also, the values of sex and agegrp are the same for both records for each individual, so it makes sense to keep only one copy of them. We can define a vector of variable names that we want to keep, and select only those variables from bpafter.

names(bpbefore)[names(bpbefore)=="bp"] <- "bp_before"
names(bpafter)[names(bpafter)=="bp"] <- "bp_after"
keeps = c("patient", "bp_after")
bp_merged <- merge(bpbefore, bpafter[keeps], by="patient")
summary(bp_merged$bp_before)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   138.0   147.0   154.5   156.4   164.0   185.0
summary(bp_merged$bp_after)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   125.0   140.8   149.5   151.4   161.0   185.0
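
The merge() function matches rows on the value of the by variable rather than on their position, so the two data frames do not need to be sorted in the same order. The sketch below illustrates this on made-up data (toy_before, toy_after and their values are hypothetical, not the bplong data), with the rows of the second data frame deliberately shuffled.

```r
# Illustrative "before" data, one row per patient
toy_before <- data.frame(patient   = c(1, 2, 3),
                         sex       = c("M", "F", "F"),
                         bp_before = c(143, 163, 153))

# Illustrative "after" data, with patients in a different order
toy_after <- data.frame(patient  = c(3, 1, 2),
                        sex      = c("F", "M", "F"),
                        bp_after = c(168, 153, 170))

# Keep only the key and the new measurement from the second data frame
keeps <- c("patient", "bp_after")
toy_merged <- merge(toy_before, toy_after[keeps], by = "patient")

nrow(toy_merged)      # 3: one row per patient
toy_merged$bp_after   # 153 170 168, matched to patients 1, 2, 3
```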

Further Exercises

Here are a few more things you can do to get used to working with data in R.

  1. Calculate the length of each car in metres, using the fact that 1 inch is 0.0254 metres.
  2. Create a variable heavy, which takes the value 0 for cars weighing less than 3000 lbs and 1 for cars weighing 3000 lbs or more.
  3. Create tertiles of wtkg from the auto dataset, and check that it produces exactly the same results as creating tertiles of weight.
  4. Using the bp_merged data frame, calculate the change in blood pressure between the “before” and “after” visits.
  5. Using the same dataset, create a variable that takes 6 different values, depending on which age group and sex the patient belongs to. Label the values so that you can tell which group is which.