This document largely follows the structure of session 1 of the Introduction to Statistical Modelling with Stata course, which gives an overview of the Stata language and IDE (Integrated Development Environment, a fancy name for all of the Stata windows). For each action described there, I describe the equivalent action in R here.
Stata and R are similar in that they are both command-line driven programs. That is, you enter a command and the program performs the action that the command specifies.
However, there are some fundamental differences between R and Stata, so some things we do in Stata would not make sense to do in R (and vice versa). Stata is sold as a complete package for data analysis, and most people could go through an entire data analysis career without needing to add anything to Stata. R is distributed in a more modular way: the base system is installed by default, but additional modules (called libraries or packages) need to be installed and loaded into R in order to perform many basic data manipulation and analysis tasks. I’ve therefore added an extra section to this document to explain how that works.
Another big difference is that Stata expects you to be working on a single dataset at a time. There are other kinds of objects, like matrices and vectors, but your primary focus is a single dataset. Stata may create other objects as it carries out your commands, but you rarely want to manipulate them, so they just get quietly deleted unless you take steps to keep them.
In R, datasets are just objects like anything else. In fact, a dataset can be stored as a matrix, with rows as observations and columns as variables. However, they are more commonly stored as a `data.frame` or `data.table` (if the `data.table` package has been installed). All objects exist in the same environment, and are as easy to manipulate as your primary dataset.
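As a minimal sketch of this idea, the code below holds two small datasets, a matrix and a summary in memory at the same time. All of the object names and values are invented for illustration.

```r
# Two hypothetical datasets held in memory at the same time
baseline_small <- data.frame(id = 1:3, age = c(34, 51, 47))
followup_small <- data.frame(id = 1:3, weight = c(70.2, 82.5, 64.0))

# The same data could be held as a numeric matrix instead
baseline_matrix <- as.matrix(baseline_small)

# A result derived from one dataset is just another object
age_summary <- summary(baseline_small$age)

is.data.frame(baseline_small)   # TRUE
is.matrix(baseline_matrix)      # TRUE
ls()                            # lists every object in the current environment
```

None of these objects is "the" dataset: each can be printed, changed or removed independently of the others.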
Also, in R, you tend to keep your objects from one run of R to the next. The collection of objects currently in memory is referred to as the environment (or workspace). There is no need to load and save files in R to nearly the extent that you need to in Stata, since all of your data frames will be available to you all of the time. The environment is stored in a file called `.RData` in the directory from which you started R (provided you choose to save it when you quit), so it makes sense to start analyses for different projects in different directories so that the objects for each analysis are kept separate.
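The sketch below illustrates the mechanism on a small scale, under the assumption that a temporary file is an acceptable stand-in for the workspace file. The object name `study_n` and the file name are hypothetical. `save()` writes only the named objects, whereas `save.image()` would write the whole environment.

```r
study_n <- 120                                   # an object worth keeping

# Write it to a file in R's temporary directory
workspace_file <- file.path(tempdir(), "demo_workspace.RData")
save(study_n, file = workspace_file)

rm(study_n)                                      # remove it from the environment
exists("study_n")                                # no longer there

load(workspace_file)                             # restore it from the file
print(study_n)
```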
The screenshot below shows what RStudio may look like when you open it. The left hand column may have 1 or 2 windows: the top window is a file editor, like the do-file editor in Stata, but it only appears when you open a file. The bottom window is a console: you can enter R commands directly into this window, or run them from the file editor in the same way as in Stata (although the actual keypresses to use differ).
Rather than commands, R uses functions, but they work in a very similar way. You enter the name of the function, followed by parentheses: `function()`. If you want to give parameters to the function (the equivalent of giving variable names and options to a Stata command), you need to put them inside the parentheses. Parameters are generally passed in the form “param=value”, with commas separating every parameter (in contrast to Stata, which requires a single comma before the options, and where adding a second comma would be a syntax error). Results appear in the console, as well as the commands you enter to create them.
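For instance, here is a small sketch of the “param=value” form using the base functions `mean()` and `round()`. The data values are made up.

```r
values <- c(2, 4, NA, 10)

# Arguments are passed as name = value pairs, separated by commas
mean_all   <- mean(x = values)                 # NA, because one value is missing
mean_valid <- mean(x = values, na.rm = TRUE)   # missing value dropped first

# Names can be omitted when arguments are given in the documented order
rounded <- round(mean_valid, 1)
print(rounded)
```

Here `na.rm = TRUE` plays much the same role as an option after the comma in a Stata command.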
The right hand column contains two windows: the Browser and the Environment. The browser by default is a file browser for the current directory, but can also be used for browsing installed packages, graphics and help pages. The environment window lists all of the objects and functions that have been created so far under the “Environment” tab, and the commands used so far in the “History” tab. Just as in Stata, a double click in the history window will add that line from the history window to the console window.
R has a help function just like Stata’s help command. To get help about the `lm` function, you would type `help(lm)`. This will open a help page in your default browser, divided into sections. After a brief Description, the Usage section corresponds to the Syntax section in a Stata help page, and the Arguments section corresponds to the Options section in Stata.
An R help page will generally then have a Details section, covering material that would usually appear in the Remarks and Examples manual entry for a Stata command (although an R help page generally has a separate Examples section at the end for examples). Then there is a Value section, which is far more important in R than the equivalent Stored Results section in Stata. This is because R functions generally create objects, and the Value section tells you what is stored in this object. Stata also creates objects, but generally it prints out all of the values of interest in the log file, so you don’t need to know their names to manipulate them later. Understanding the members of an object in R is essential.
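To make this concrete, the sketch below fits a linear model and then looks inside the object that `lm()` returns. It uses `mtcars`, a dataset that ships with R, purely as a stand-in for your own data.

```r
# mtcars is a built-in dataset, used here only for illustration
fit <- lm(mpg ~ wt, data = mtcars)

names(fit)          # the components described under "Value" in help(lm)
fit$coefficients    # the estimated intercept and slope
fit$df.residual     # the residual degrees of freedom

# summary() returns another object, with members of its own
fit_summary <- summary(fit)
fit_summary$r.squared
```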
If you don’t know the name of the function you need to use in R, you can use the `help.search` function. This function expects a string as its argument, so to find all help pages that contain the word “regress”, you would enter the command `help.search("regress")`. The `help()` function will accept either an object or the name of the object as a string. In other words, `help(lm)` and `help("lm")` both work, but `help.search(lm)` produces an error. An alternative syntax for `help.search("lm")` is `??lm`, which has the same effect. Try `help("??")` for details of all of the options available with this function.
There is no real need for a `webseek` function in R, since R code and information are largely centralised in CRAN (the Comprehensive R Archive Network). All packages and huge amounts of documentation can be found there. Also, StackExchange has a data science site, and many of the questions on it relate to R (although many do not: Python seems just as popular).
Much of the work of R is farmed out to separate packages, which must be installed and loaded before they can be used. For example, suppose you want to use the “data.table” package (hint: you do). The following code will both install it and load it into R’s memory, ready for use:
```r
install.packages("data.table")
library(data.table)
```
We can now get help on `data.table` with the command `help(data.table)`: this will open the help page for `data.table` in your default browser.
As with Stata, the only way to do any serious analysis with R is to create a file of commands (referred to as a script in R). A script can be edited in the file editor, and parts or the whole of the script can be run in the console window. R also has a default script that it runs every time you start R, called `.Rprofile`.
Log-files are of much less importance in R than they are in Stata. This is because all of your objects are retained in the environment, so if you want to check what the results of a linear regression model are, you just print out the model object (assuming that you’ve kept it, but not keeping it would be insane). In Stata, you can only print out the results of the most recent regression again, so to look up the results of an earlier model, you need to either check the log-file or rerun the regression (which can take a while).
However, it is possible, and useful, to create documents very like log-files using a language called R Markdown. R Markdown documents contain code (i.e. your R script), the output produced by that code (i.e. your log-file) and an explanation of what you are doing. In fact, I used R Markdown to produce this document. The main advantage of an R Markdown document over a simple log file is the formatting available for the explanatory text. It is even possible to write papers in R Markdown, so that they will automatically update their results if the data changes (and the file is rerun).
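A short sketch of what keeping model objects looks like in practice, again using the built-in `mtcars` data as a stand-in. The model names are my own invention.

```r
model_simple <- lm(mpg ~ wt, data = mtcars)
model_full   <- lm(mpg ~ wt + hp, data = mtcars)

# The earlier model is still available after fitting the later one
print(model_simple)
coef(model_full)

# Stored models can also be compared directly
anova(model_simple, model_full)
```

Because both models are ordinary objects, neither needs to be refitted in order to look at its results again.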
The table below lists the commands that you may wish to use to interact with the operating system, and how you would do that in Stata and in R. By and large, Stata names its commands after the commands that would be called in the OS. R needs to create its own functions, to call the OS command and return the results of that command in an object, and so it creates its own names for these functions.
| Task | Stata Command | R Function |
|---|---|---|
| Change directory | `cd` | `setwd()` |
| Show current directory | `pwd` | `getwd()` |
| Create new directory | `mkdir` | `dir.create()` |
| List files | `dir` | `list.files()` |
| Run another program | `shell` | `shell()` (Windows only; `system()` works on any platform) |
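The following sketch runs through the first four rows of the table. It works inside R’s temporary directory so that none of your own files are touched. The directory and file names are hypothetical.

```r
old_dir <- getwd()                           # show current directory (pwd)

# Create a new directory (mkdir) inside R's temporary directory
demo_dir <- file.path(tempdir(), "myproject_demo")
dir.create(demo_dir, showWarnings = FALSE)

setwd(demo_dir)                              # change directory (cd)
file.create("notes.txt")                     # an empty file, so there is something to list
files_found <- list.files()                  # list files (dir)
print(files_found)

setwd(old_dir)                               # go back to where we started
```

Note that `list.files()` returns its result as a character vector, which can be stored and used later, rather than simply printing it.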
Whilst using “/” rather than “\” in filenames in do-files is recommended, in R scripts it is essential. That is because R, like many other programming languages, uses “\” as an escape character. That is, it changes the meaning of the following character. For example, “\t” is replaced by a tab, so it would be impossible to open a file using the string “C:\temp”, but “C:/temp” would work.
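The effect of the escape character can be checked by counting characters. This is a small sketch. The paths are only strings here and are never opened.

```r
tab_string <- "a\tb"       # backslash-t is a single tab character
nchar(tab_string)          # 3 characters: a, tab, b

bad_path  <- "C:\temp"     # actually "C:", a tab, then "emp"
good_path <- "C:/temp"
nchar(bad_path)            # 6
nchar(good_path)           # 7

# A literal backslash has to be doubled
escaped_path <- "C:\\temp"
nchar(escaped_path)        # 7
```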
Macros are unnecessary in R: anything you would do with a macro in Stata you can do with a variable in R. For example, to change to the directory “C:/work/myproject”, you could enter the commands
```r
mydir <- "C:/work/myproject"
setwd(mydir)
```
In that example, the variable `mydir` plays the role of a macro in Stata.
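Two further uses that a Stata macro would typically cover can be sketched in the same way: building a file path and building part of a command. The file name `baseline.csv` and the variable names `titre` and `age` are assumptions made for the example.

```r
mydir    <- "C:/work/myproject"
datafile <- "baseline.csv"            # hypothetical file name

# file.path() joins the pieces with "/"
full_path <- file.path(mydir, datafile)
print(full_path)

# paste0() builds a string, much as a macro is expanded inside a Stata command
outcome <- "titre"
model_formula <- as.formula(paste0("log(", outcome, ") ~ age"))
print(model_formula)
```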
In R, everything you work with is an object, but there are many different types of object. A dataset consists of a series of records, each record containing the same type of information. Mathematically, this can be thought of as a matrix, in which the rows represent records and the columns represent the data stored in each record. This is just like the “browse” window in Stata: each row in this window represents a record or observation, and each column represents a Stata variable.
A Stata variable contains as many values as there are observations in the dataset. In mathematics, such an object is referred to as a vector, and the same is true in R. Each column in a data frame represents a Stata variable, and the column can be given a name which functions in the same way as a Stata variable name. However, since R can work with many datasets at the same time, and each dataset can contain columns with the same names, when referring to a column within a data frame you must give both the data frame name and the column name, separated by a dollar sign. For example, the `age` column in the `baseline` data frame would be referred to as `baseline$age`.
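A small sketch of why the data frame name is needed: the two invented data frames below both contain a column called `age`.

```r
# Two made-up data frames that share a column name
baseline <- data.frame(id = 1:4, age = c(23, 35, 41, 58))
followup <- data.frame(id = 1:4, age = c(24, 36, 42, 59))

baseline$age            # the age vector from baseline
mean(baseline$age)      # 39.25
mean(followup$age)      # 40.25

# with() saves repeating the data frame name
with(baseline, mean(age))
```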
You can create an object of any type in R with a command of the form
```r
x <- myfunction()
```
This will ensure that `x` contains whatever object was returned by the function `myfunction()`. This could be any type of object: a single number, a vector of numbers, a matrix or data frame, a list of strings. It could be that `x` was a vector before running `myfunction()` and is now a single boolean value (either TRUE or FALSE).
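The sketch below shows the same name holding three different types of object in turn. The base functions `all()` and `data.frame()` stand in for `myfunction()`.

```r
x <- c(3, 8, 12)          # x is a numeric vector of length 3
class_before <- class(x)

x <- all(x > 0)           # x is now a single logical value
class_after <- class(x)
print(x)                  # TRUE

x <- data.frame(a = 1:2, b = c("u", "v"))   # and now a data frame
class(x)
```

R does not complain about any of these reassignments, so it is worth checking with `class()` or `str()` when you are unsure what an object currently holds.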
This means that R does not have separate `generate` and `replace` commands: assigning a new value to an object will always replace the old one. And there is no need for a separate `egen` command, since you can simply write your own function to create the object you want and assign the result of that function to any object.
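As an illustration, the sketch below writes a small function of its own and uses it where Stata would need `egen`. The data, the function name `centre` and the column names are all invented. `ave()` is a base R function that applies a function within groups.

```r
baseline <- data.frame(
  group = c("a", "a", "b", "b"),
  titre = c(10, 20, 40, 80)
)

# A user-written function playing the role of an egen function
centre <- function(v) v - mean(v)

# Applied to the whole column
baseline$titre_c <- centre(baseline$titre)

# Applied within groups with ave(), comparable to "by group: egen"
baseline$group_mean <- ave(baseline$titre, baseline$group, FUN = mean)

# Assigning again simply overwrites the existing column, as replace would
baseline$titre_c <- baseline$titre_c / 10
print(baseline)
```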
Transforming variables, i.e. creating new columns in a data frame that are a function of existing columns, can be performed using the assignment operator “<-”. For example, if we want to have the log of the titre from the baseline data frame available to us, we can enter
```r
baseline$ltitre <- log(baseline$titre)
```
Labelling variables is not as straightforward in R as it is in Stata. The easiest way is to add an additional package, `expss`, and use the functions from that package.
Creating and saving datasets is far less important in R than in Stata. This is because every object created during an R session is kept in the environment, and that can be saved between sessions. Therefore, a data frame created a week ago will still exist within `.RData`, even if it does not exist as a specific file that you can address directly. The only reason you would really want to save a file is so that you could share it with someone else.

## Appending
If you have two data frames containing the same information about two distinct groups of people, appending the frames

## Merging