Choose a directory to hold the work you are going to do on this
course: I suggest P:/R_course. Create a variable containing the name of
your chosen directory:
mydir <- "P:/R_course"
Of course, the directory you are going to use must exist before you can save any files in it. You can use
dir.create(mydir)
to do this, provided that the new directory is only one level below an already existing directory. To make sure that it exists, type
setwd(mydir)
to change to it, and make sure you don’t get an error message in response.
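If the directory you want is more than one level below an existing directory, or if it may already exist, the following sketch may help. It uses a nested path inside R's temporary directory, which is an assumption made purely so that the example is safe to run anywhere; substitute your own path for real work.

```r
# Hypothetical nested path inside R's temporary directory, for illustration only
mydir <- file.path(tempdir(), "R_course", "practical1")

# recursive = TRUE creates any missing parent directories as well;
# the dir.exists() check skips creation if the directory is already there
if (!dir.exists(mydir)) {
  dir.create(mydir, recursive = TRUE)
}

# Confirm that the directory now exists before trying to change to it
dir.exists(mydir)
```

Checking with dir.exists() first means the code can be rerun without complaint at the start of every session.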
We are now going to practice manipulating a data table. In order to
do this, we will read in a dataset. Since this practical is based on the
practical from the Statistical Modelling with Stata course, we will use
the same dataset. This will require us to load the haven package, which
contains the read_stata() function. We will also load the data.table
package so that we can use data tables rather than data frames, the
expss package so we can use labels, and the stringr package for
manipulating string variables.
library(data.table)
library(expss)
library(haven)
library(stringr)
First we will read in the auto dataset. This contains a variable called
weight, which is the vehicle’s weight in pounds (it’s an unashamedly
American dataset).
auto <- read_stata("https://personalpages.manchester.ac.uk/staff/mark.lunt/data/auto.dta")
Since a kg is 2.2046 pounds, we create a variable containing the weight of the vehicle in kilos by dividing the weight in pounds by 2.2046. It’s a good idea to label the variable so you remember what it means:
auto$wtkg <- auto$weight/2.2046
var_lab(auto$wtkg) <- "Weight in kg"
summary(auto$wtkg)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 798.3 1020.6 1447.0 1369.6 1632.9 2195.4
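The division above works on the whole column at once because arithmetic in R is vectorised. Here is a minimal sketch of the same conversion on a few made-up weights (the values and the variable names weight_lb and weight_kg are hypothetical, not taken from the auto dataset):

```r
# Hypothetical weights in pounds, not taken from the auto dataset
weight_lb <- c(2000, 3000, 4000)

# Arithmetic on a vector is applied to every element at once
weight_kg <- weight_lb / 2.2046

# Converting back should recover the original values (up to rounding error)
all.equal(weight_kg * 2.2046, weight_lb)
```

A round-trip check like this is a cheap way to confirm that you divided rather than multiplied.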
Very often, you want to generate a variable that can only take two
values, representing “true” and “false”. As an example, we will create a
variable short, which takes the value TRUE (which R treats as 1 in
arithmetic) for all cars less than 190 inches long and FALSE (treated
as 0) for cars 190 inches or longer. This can be done with the command
auto$short <- (auto$length < 190)
var_lab(auto$short) <- "Vehicle is less than 190 inches long"
table(auto$short)
##
## FALSE TRUE
## 38 36
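A comparison such as this one gives a logical vector, which R converts to 0 and 1 whenever it is used as a number. The sketch below uses a handful of made-up lengths (the vector len is hypothetical) to show how that conversion lets you count and summarise the result:

```r
# Hypothetical vehicle lengths in inches
len <- c(170, 185, 190, 200, 212)

# The comparison returns TRUE or FALSE for each element
short <- len < 190

# Converted to integers, TRUE becomes 1 and FALSE becomes 0
as.integer(short)   # 1 1 0 0 0

# Summing a logical vector counts the TRUE values;
# the mean gives the proportion that are TRUE
sum(short)          # 2
mean(short)         # 0.4
```

Note that the car of exactly 190 inches is not counted as short, because the comparison is strictly less than.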
Suppose that you want to divide the cars into tertiles by weight.
There is a cut() function in R, but that requires the actual weight
values at which you wish to split the data. They can be calculated
using the quantile() function, if we pass to it the proportions
corresponding to the quantiles at which we want the data to be split.
So we end up with
breakpoints <- quantile(auto$weight, probs=c(0, 0.33, 0.67, 1))
auto$wtt <- cut(auto$weight, breaks=breakpoints, include.lowest=TRUE, labels = c("Lowest tertile", "Mid tertile", "Upper tertile"))
var_lab(auto$wtt) <- "Tertiles of weight"
table(auto$wtt)
##
## Lowest tertile Mid tertile Upper tertile
## 25 24 25
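The proportions 0.33 and 0.67 used above are approximations to one third and two thirds. The following sketch, on made-up values rather than the auto data, assumes you would prefer to pass the exact fractions, and also shows why include.lowest=TRUE is needed:

```r
# Hypothetical data: the integers 1 to 9
x <- 1:9

# Breakpoints at the minimum, the two tertile boundaries and the maximum
breakpoints <- quantile(x, probs = c(0, 1/3, 2/3, 1))

# include.lowest = TRUE puts the minimum value into the first group
grp <- cut(x, breaks = breakpoints, include.lowest = TRUE,
           labels = c("Lowest tertile", "Mid tertile", "Upper tertile"))
table(grp)

# Without include.lowest, the intervals are open on the left,
# so the smallest value falls outside them and becomes NA
sum(is.na(cut(x, breaks = breakpoints)))   # 1
```

With nine distinct values, each group contains three of them.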
The package stringr contains a huge number of functions for
manipulating strings. We will just try a single example here. The
function word() splits a string (or a vector of strings, which is the R
equivalent of a string variable) into separate words. You can specify
which word or words you wish to keep. In our example, we will extract
the make (the first word) from the variable make (which, despite its
name, contains the make and the model: look at the label for that
variable if you don’t believe me).
auto$company <- word(auto$make, 1)
table(auto$company)
##
## AMC Audi BMW Buick Cad. Chev. Datsun Dodge Fiat Ford
## 3 2 1 7 3 6 4 4 1 2
## Honda Linc. Mazda Merc. Olds Peugeot Plym. Pont. Renault Subaru
## 2 3 1 6 7 1 5 6 1 1
## Toyota Volvo VW
## 3 1 4
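word() can also return a range of words. As a sketch, assuming a few made-up strings in the same "make model" form as the variable make (the names company and model are hypothetical), a negative end position counts from the end of the string, so everything after the first word can be kept as the model:

```r
library(stringr)

# Hypothetical strings in the same "make model" layout as auto$make
make <- c("AMC Concord", "Buick Le Sabre", "VW Rabbit")

# First word only: the manufacturer
company <- word(make, 1)

# Second word through to the last word (-1): the model name
model <- word(make, 2, -1)

company   # "AMC" "Buick" "VW"
model     # "Concord" "Le Sabre" "Rabbit"
```

This only behaves sensibly when every string contains at least two words, so it is worth checking the result for missing values on real data.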
To practice manipulating data frames, we will use the bplong data. This
consists of 2 blood pressure measurements for each of 120 subjects, one
taken before an intervention (identified by having the variable
when = 1) and one taken after the intervention (having when = 2). We
will split that into two separate data frames, one with the observations
before the intervention and one with the observations after it.
bplong <- read_stata("https://personalpages.manchester.ac.uk/staff/mark.lunt/data/bplong.dta")
dim(bplong)
## [1] 240 5
bpbefore <- bplong[bplong$when == 1,]
bpafter <- bplong[bplong$when == 2,]
dim(bpbefore)
## [1] 120 5
dim(bpafter)
## [1] 120 5
summary(bpbefore$bp)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 138.0 147.0 154.5 156.4 164.0 185.0
summary(bpafter$bp)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 125.0 140.8 149.5 151.4 161.0 185.0
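The row selection used above can be tried out on a tiny made-up data frame in the same long layout (the data frame toy and its values are hypothetical). As a sketch, the base R function split() gives the same pieces in a single call:

```r
# Hypothetical toy data in the same long layout as bplong
toy <- data.frame(
  patient = c(1, 1, 2, 2, 3, 3),
  when    = c(1, 2, 1, 2, 1, 2),
  bp      = c(150, 142, 160, 158, 145, 147)
)

# Logical indexing: keep the rows where the condition is TRUE, and all columns
before <- toy[toy$when == 1, ]
after  <- toy[toy$when == 2, ]

# split() returns a list with one data frame per value of when
parts <- split(toy, toy$when)
names(parts)       # "1" "2"

# The pieces hold the same measurements as the manual subsets
all.equal(parts[["1"]]$bp, before$bp)
```

Do not forget the comma after the condition: it is what tells R that you are selecting rows rather than columns.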
If we wish to append these datasets, we can simply use the rbind()
function, which binds rows from all of the data frames passed to it
into a single data frame.
new_bp <- rbind(bpbefore, bpafter)
summary(new_bp[new_bp$when==1,"bp"])
## bp
## Min. :138.0
## 1st Qu.:147.0
## Median :154.5
## Mean :156.4
## 3rd Qu.:164.0
## Max. :185.0
summary(new_bp[new_bp$when==2,"bp"])
## bp
## Min. :125.0
## 1st Qu.:140.8
## Median :149.5
## Mean :151.4
## 3rd Qu.:161.0
## Max. :185.0
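For data frames, rbind() matches columns by name rather than by position, so the columns need the same names but not the same order. A minimal sketch with made-up data (all object names and values here are hypothetical):

```r
# Hypothetical before and after measurements for three patients
before <- data.frame(patient = 1:3, when = 1, bp = c(150, 160, 145))
after  <- data.frame(patient = 1:3, when = 2, bp = c(142, 158, 147))

# Stack the rows of the two data frames
stacked <- rbind(before, after)
nrow(stacked)   # 6

# The same data with the columns in a different order
after_reordered <- after[, c("bp", "when", "patient")]

# Columns are matched by name, so the result uses the order of the first argument
stacked2 <- rbind(before, after_reordered)
names(stacked2)   # "patient" "when" "bp"
```

If the column names differ between the data frames, rbind() stops with an error rather than guessing.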
We can also merge those two datasets, to get a data frame with 120
observations, one for each subject. Each record will need to have two
blood pressure measurements, one taken before the intervention, the
other taken after it. It would make sense to give the blood pressure
measurements different names in the two datasets, to make it easier to
distinguish them after merging: we will call them bp_before and
bp_after. Also, the values of sex and agegrp are the same for both
records for each individual, so it makes sense to keep only one copy of
them. We can define a vector of variable names that we want to keep,
and select only those variables from bpafter.
names(bpbefore)[names(bpbefore)=="bp"] <- "bp_before"
names(bpafter)[names(bpafter)=="bp"] <- "bp_after"
keeps <- c("patient", "bp_after")
bp_merged <- merge(bpbefore, bpafter[keeps], by="patient")
summary(bp_merged$bp_before)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 138.0 147.0 154.5 156.4 164.0 185.0
summary(bp_merged$bp_after)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 125.0 140.8 149.5 151.4 161.0 185.0
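In bplong every subject has both measurements, so nothing is lost in the merge. As a sketch of what happens otherwise, here is a made-up example (all names and values are hypothetical) in which one patient has no after measurement:

```r
# Hypothetical data: patient 4 has no "after" measurement
before <- data.frame(patient = c(1, 2, 3, 4),
                     bp_before = c(150, 160, 145, 170))
after  <- data.frame(patient = c(3, 1, 2),
                     bp_after = c(147, 142, 158))

# By default merge() keeps only patients present in both data frames;
# rows are matched on patient, whatever order they appear in
both <- merge(before, after, by = "patient")
nrow(both)         # 3

# all.x = TRUE keeps every row of the first data frame,
# filling in NA where there is no match in the second
all_before <- merge(before, after, by = "patient", all.x = TRUE)
nrow(all_before)   # 4
```

Comparing the number of rows before and after a merge is a quick way to spot unmatched records.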
Here are a few more things you can do to get used to working with data in R.
Create a variable heavy, which takes the value 0 for cars weighing less than 3000 lbs and 1 for cars weighing more than 3000 lbs.
Create tertiles of wtkg from the auto dataset, and check that this produces exactly the same results as creating tertiles of weight.
In bp_merged, calculate the change in blood pressure between the “before” and “after” visits.