Introduction

This document largely follows the structure of session 1 of the Introduction to Statistical Modelling with Stata course, which gives an overview of the Stata language and IDE (Integrated Development Environment, a fancy name for all of the Stata windows). For each action described there, I describe the equivalent action in R here.

Stata and R are similar in that they are both command-line-driven programs. That is, you enter a command and the program performs the action that the command specifies.

However, there are some fundamental differences between R and Stata, so some things we do in Stata would not make sense to do in R (and vice versa). Stata is sold as a complete package for data analysis, and most people could go through an entire data analysis career without needing to add anything to Stata. R is distributed in a more modular way: the base system is installed by default, but additional modules (called libraries or packages) need to be installed and loaded into R in order to perform many basic data manipulation and analysis tasks. I’ve therefore added an extra section to this document to explain how that works.

Another big difference is that Stata expects you to be working on a single dataset at a time. There are other kinds of objects, like matrices and vectors, but your primary focus is a single dataset. Stata may create other objects as it carries out your commands, but you rarely want to manipulate them, so they just get quietly deleted unless you take steps to keep them.

In R, datasets are just objects like anything else. In fact, a dataset can be stored as a matrix, with rows as observations and columns as variables (provided every variable is of the same type). However, datasets are more commonly stored as a data.frame or data.table (if the data.table package has been installed). All objects exist in the same environment, and are as easy to manipulate as your primary dataset.

Also, in R, you tend to keep your objects from one run of R to the next. The collection of objects you have created is referred to as the environment (or workspace). There is no need to load and save files in R to nearly the extent that you need to in Stata, since all of your data frames will be available to you all of the time. If you choose to save it when you quit, the environment is stored in a file called .RData in the working directory (by default, the directory from which you started R), so it makes sense to start analyses for different projects in different directories so that the objects for each analysis are kept separate.

RStudio

The screenshot below shows what RStudio may look like when you open it. The left hand column may have one or two windows: the top window is a file editor, like the do-file editor in Stata, but it only appears when you open a file. The bottom window is a console: you can enter R commands directly into this window, or run them from the file editor in the same way as in Stata (although the actual keypresses to use differ).

Rather than commands, R uses functions, but they work in a very similar way. You enter the name of the function, followed by parentheses: function(). If you want to give parameters to the function (the equivalent of giving variable names and options to a Stata command), you need to put them inside the parentheses. Parameters are generally passed in the form “param=value”, with commas separating every parameter (in contrast to Stata, which requires a single comma before the options, and in which adding a second comma would be a syntax error). Results appear in the console, along with the commands you entered to create them.
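
As a quick illustration of that calling convention (my own sketch, using only functions from base R), the lines below call round() and seq() with positional and with named parameters:

```r
# Positional parameters are matched by their order in the function definition.
a <- round(3.14159, 2)

# Named parameters use the form param = value.
b <- round(3.14159, digits = 2)

# Several named parameters, each separated from the next by a comma.
s <- seq(from = 1, to = 10, by = 3)

print(a)   # same value as b
print(s)   # 1 4 7 10
```

Naming the parameters is optional here, but it makes a script much easier to read back later, in the same way that spelling out option names does in a do-file.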

The right hand column contains two windows: the Browser and the Environment. The browser by default is a file browser for the current directory, but can also be used for browsing installed packages, graphics and help pages. The environment window lists all of the objects and functions that have been created so far under the “Environment” tab, and the commands used so far in the “History” tab. Just as in Stata, a double click in the history window will add that line from the history window to the console window.

Getting Help

Help

R has a help function just like Stata’s help command. To get help about the lm function, you would type help(lm). This will open a help page in your default browser, divided into sections. After a brief Description, the Usage section corresponds to the Syntax section in a Stata help page, and the Arguments section corresponds to the Options section in Stata.

An R help page will generally then have a Details section, covering material that would usually appear in the Remarks and Examples manual entry for a Stata command (although an R help page generally has a separate Examples section at the end for examples). Then there is a Value section, which is far more important in R than the equivalent Stored Results section in Stata. This is because R functions generally create objects, and the Value section tells you what is stored in this object. Stata also creates objects, but generally it prints out all of the values of interest in the log file, so you don’t need to know their names to manipulate them later. Understanding the members of an object in R is essential.
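
To make the point about the Value section concrete, here is a small sketch of my own (the model and the mtcars dataset, which ships with R, are chosen purely for illustration). It fits a linear model, lists the members of the returned object, and pulls one of them out:

```r
# mtcars is a small example dataset that ships with R.
fit <- lm(mpg ~ wt, data = mtcars)

# The members of the object; these are the names described
# in the Value section of help(lm).
names(fit)

# A member can be extracted with $, or with an accessor function.
fit$coefficients
coef(fit)
```

Nothing is printed when the model is fitted: the results live in fit until you ask for them.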

Internet

There is no real need for a webseek function in R, since R code and information are largely centralised in CRAN (the Comprehensive R Archive Network). Most packages and huge amounts of documentation can be found there. Also, StackExchange has a data science site, and many of the questions on it relate to R (although many do not; Python seems just as popular).

Installing packages

Much of the work of R is farmed out to separate packages, which must be installed and loaded before they can be used. For example, suppose you want to use the “data.table” package (hint: you do). The following code will both install it and load it into R’s memory, ready for use:

install.packages("data.table")
library(data.table)

We can now get help on data.table with the command help(data.table): this will open the help page for data.table in your default browser.
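
Since install.packages() only needs to be run once per machine, while library() is needed in every session, a script often installs a package only when it is missing. The sketch below is one way to do that; use_package() is a hypothetical helper name of my own, not a function from R or from any package. It is demonstrated on stats, which is always installed, so that running it does not need an internet connection.

```r
# Hypothetical helper: install a package only if it is missing, then load it.
use_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)                # only reached if pkg is not installed
  }
  library(pkg, character.only = TRUE)    # pkg holds the name as a string
}

# Demonstrated with a package that is part of every R installation.
use_package("stats")
```

Calling use_package("data.table") would behave like the two lines above on the first run, and skip the installation on later runs.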

Basic Concepts

Scripts

As with Stata, the only way to do any serious analysis with R is to create a file of commands (a do-file in Stata, referred to as a script in R). A script can be edited in the file editor, and parts or the whole of the script can be run in the console window. R also has a default script that it runs every time you start R, called .Rprofile.

log-files

Log-files are of much less importance in R than they are in Stata. This is because all of your objects are retained in the environment, so if you want to check what the results of a linear regression model are, you just print out the model object (assuming that you’ve kept it, but not keeping it would be insane). In Stata, you can only print out the results of the most recent regression again, so to look up results of an earlier model, you need to either check the log-file or rerun the regression (which can take a while).
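
A minimal sketch of that workflow, with two illustrative models of my own on the built-in mtcars data: both are kept as objects, so the earlier one can be printed again at any point without refitting it.

```r
# Fit two models and keep both as named objects.
m1 <- lm(mpg ~ wt, data = mtcars)
m2 <- lm(mpg ~ wt + hp, data = mtcars)

# ... any amount of other work later ...

# The earlier model is still available in full.
print(m1)
summary(m1)
```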

However, it is possible, and useful, to create documents very like log-files using a language called R Markdown. R Markdown documents contain code (i.e. your R script), the output produced by that code (i.e. your log-file) and an explanation of what you are doing. In fact, I used R Markdown to produce this document. The main advantage of an R Markdown document over a simple log file is the formatting available for the explanatory text. It is even possible to write papers in R Markdown, so that they will automatically update their results if the data changes (and the file is rerun).

OS

The table below lists the commands that you may wish to use to interact with the operating system, and how you would do that in Stata and in R. By and large, Stata names its commands after the commands that would be called in the OS. R needs to create its own functions, to call the OS command and return the results of that command in an object, and so it creates its own names for these functions.

Task                    Stata command   R function
Change directory        cd              setwd()
Show current directory  pwd             getwd()
Create new directory    mkdir           dir.create()
List files              dir             list.files()
Run another program     shell           shell()
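
The functions in the table can be tried out safely inside R’s temporary directory. This is a sketch of my own: the directory name myproject and the file name notes.txt are made up for the example.

```r
old <- getwd()                                  # pwd: remember where we started

newdir <- file.path(tempdir(), "myproject")     # hypothetical directory name
dir.create(newdir, showWarnings = FALSE)        # mkdir
setwd(newdir)                                   # cd

file.create("notes.txt")                        # make an empty file to list
files <- list.files()                           # dir: returns a character vector
print(files)

setwd(old)                                      # cd back to the starting directory
```

Note that list.files() returns its result as an object rather than just printing it, which is why the example can store it in files. Also, shell() is only available in R for Windows; system() is the base R function that runs an OS command on any platform.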

Whilst using “/” rather than “\” in filenames in do-files is recommended, in R scripts it is essential. That is because R, like many other programming languages, uses “\” as an escape character. That is, it changes the meaning of the following character. For example, “\t” is replaced by a tab, so it would be impossible to open a file using the string “C:\temp”, but “C:/temp” would work.
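
The effect of the escape character is easy to see by counting characters. In this small sketch of my own, the three strings look similar but only two of them name the temp directory:

```r
bad  <- "C:\temp"     # \t is read as a single tab character
good <- "C:/temp"     # the forward slash is taken literally
also <- "C:\\temp"    # a doubled backslash stands for one literal backslash

nchar(bad)    # 6: C, :, tab, e, m, p
nchar(good)   # 7
nchar(also)   # 7
```

Doubling every backslash works, but forward slashes are easier to type and to read.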

Macros

Macros are unnecessary in R: anything you would do with a macro in Stata you can do with a variable in R. For example, to change to the directory “C:/work/myproject”, you could enter the commands

mydir <- "C:/work/myproject"
setwd(mydir)

In that example, the variable mydir plays the role of a macro in Stata.
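
Two other common uses of macros, building file names and looping over a list of variable names, work the same way. The sketch below is my own; the directory and file names are hypothetical, and mtcars is a dataset that ships with R.

```r
# A variable holding a directory, reused to build other paths.
datadir <- file.path(tempdir(), "myproject")    # hypothetical directory
outfile <- file.path(datadir, "results.csv")    # hypothetical file name
print(outfile)

# A variable holding a list of column names, as a macro
# holding a varlist would in a foreach loop.
vars <- c("mpg", "wt", "hp")
for (v in vars) {
  cat(v, ":", mean(mtcars[[v]]), "\n")
}
```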

Manipulating objects

In R, everything you work with is an object, but there are many different types of object. A dataset consists of a series of records, each record containing the same type of information. Mathematically, this can be thought of as a matrix, in which the rows represent records and the columns represent the data stored in each record. This is just like the “browse” window in Stata: each row in this window represents a record or observation, and each column represents a Stata variable.

A Stata variable contains as many values as there are observations in the dataset. In mathematics, such an object is referred to as a vector, and the same is true in R. Each column in a data frame represents a Stata variable, and the column can be given a name which functions in the same way as a Stata variable name. However, since R can work with many datasets at the same time, and each dataset can contain columns with the same names, when referring to a column within a data frame you must give both the data frame name and the column name, separated by a dollar sign. For example, the age column in the baseline data frame would be referred to as baseline$age.
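
The sketch below shows why the data frame name is needed. The two data frames and their values are invented for the example: both have a column called age, and the dollar sign says which one is meant.

```r
# Hypothetical data: two small data frames that share a column name.
baseline <- data.frame(id = 1:4, age = c(34, 51, 47, 29))
followup <- data.frame(id = 1:4, age = c(35, 52, 48, 30))

baseline$age          # the age column of baseline: a vector of 4 values
followup$age          # a different vector with the same column name

mean(baseline$age)    # 40.25
```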

generating objects

You can create an object of any type in R with a command of the form

x <- myfunction()

This will ensure that x contains whatever object was returned by the function myfunction(). This could be any type of object: a single number, a vector of numbers, a matrix or data frame, a list of strings. It could be that x was a vector before running myfunction() and is now a single boolean value (either TRUE or FALSE).

This means that R does not have separate generate and replace commands: assigning a new value to an object will always replace the old one. There is also no need for a separate egen command, since you can simply write your own function to create the object you want and assign the result of that function to any object.
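
Both points can be seen in a few lines. This is a sketch of my own: standardise() is a hypothetical function written for the example, doing roughly what egen’s std() function does in Stata.

```r
x <- c(2, 5, 9)       # x is a numeric vector
x <- all(x > 0)       # x is now a single logical value; the vector is gone
print(x)              # TRUE

# A home-made replacement for an egen function:
# subtract the mean and divide by the standard deviation.
standardise <- function(v) {
  (v - mean(v)) / sd(v)
}

z <- standardise(c(2, 5, 9))
print(z)
```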

generating columns in a data frame

Transforming variables, i.e. creating new columns in a data frame that are a function of existing columns, can be performed using the assignment operator “<-”. For example, if we want to have the log of the titre from the baseline data frame available to us, we can enter

baseline$ltitre <- log(baseline$titre)
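
For anyone used to generate and replace, the sketch below lines the two up against their R equivalents. The baseline data frame and its values are invented for the example, with -1 standing in for a missing titre; the Stata commands in the comments are there for comparison only.

```r
# Hypothetical data: -1 is used as a code for "missing".
baseline <- data.frame(id = 1:4, titre = c(10, 40, 160, -1))

# Stata: replace titre = . if titre < 0
baseline$titre[baseline$titre < 0] <- NA

# Stata: generate ltitre = log(titre)
baseline$ltitre <- log(baseline$titre)

# Stata: replace ltitre = log10(titre)
# The same operator is used whether or not the column already exists.
baseline$ltitre <- log10(baseline$titre)

print(baseline)
```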

Labels

Labelling variables is not as straightforward in R as it is in Stata. The easiest way is to add an additional package, expss, and use the functions from that package.
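
The expss functions are not shown here. As a package-free alternative (a swapped-in technique, not the expss approach recommended above), base R can attach a label to a column as an attribute. This sketch is my own, with an invented data frame and label text, and the attribute name "label" is simply a convention I have assumed.

```r
# Hypothetical data frame.
baseline <- data.frame(id = 1:3, age = c(34, 51, 47))

# Attach a label to the age column as an attribute called "label".
attr(baseline$age, "label") <- "Age at baseline (years)"

# Read the label back.
attr(baseline$age, "label")
```

A label stored this way is not used automatically by most functions, and some operations (such as subsetting the column) drop it, which is part of why a dedicated package is the easier route.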

Manipulating datasets

Create and Save

Creating and saving datasets is far less important in R than in Stata. This is because every object created during an R session is kept in the environment, which can be saved between sessions. Therefore, a data frame created a week ago will still exist within the saved environment (the .RData file), even if it does not exist as a specific file that you can address directly. The only reason you would really want to save a file is so that you could share it with someone else.

Appending

If you have two data frames containing the same information about two distinct groups of people, appending the frames

Merging