00. A revised and updated version of this post can be found here.
0. Readers of this post may find the Glossary of Statistical Terms helpful.
1. R is a free, open-source, cross-platform statistical and graphical environment that can be obtained from www.r-project.org.
2. Some reasons for using R:
R is world-class software for data analysis and visualization.
R is free. There are no license fees and no restrictions on how many times and where you can install the software.
R runs on a variety of computer platforms including Windows, MacOS, and Unix.
R provides an unparalleled platform for programming new statistical methods in an easy and straightforward manner.
R contains advanced statistical routines not yet available in other packages.
R has state-of-the-art graphics capabilities.
More reasons for using R can be found here. And here. And here. See also Why you should learn R first for data science, The rise of R as the language of analytics, and Programming tools: Adventures with R.
One may also benefit from reading Why R has A Steep Learning Curve.
3. There are many excellent sources of R documentation available on the Internet. These include An Introduction to R, the gigantic R Reference Index (PDF), and my personal favorite, Quick-R. There is also an R Journal, an R FAQ (Frequently Asked Questions), many books, the R Reference Card (PDF), various R cheat sheets (Google it), and other assorted R Web sites.
4. To install R, go to the R Web site. Under “Getting Started:”, click download R. Choose a CRAN (Comprehensive R Archive Network) mirror site that is closest to you. Then, under “Download and Install R,” download the installer (“Precompiled binary distribution”) for your particular computer platform. Run the downloaded installer application and you should be off and running. Extensive additional gory details can be found at R Installation and Administration.
5. R is not like other statistical software. Rather than being a “software package” in the usual sense of SPSS, SAS, Systat, etc., where the user fills in the blanks in a series of menus and dialog boxes and then has to sift through reams of canned output, R is actually a programming language customized for statistical analysis and graphics. R does have a graphical user interface (GUI) that makes the language easier to use, but the true power of R lies in its programmability.
6. Getting a feel for R. The impact of this introduction will be greater if you follow along in R, entering the commands as shown. You can copy and paste R commands from this introduction directly into the R console instead of typing them by hand.
When you start R you will see a command console window, into which you type R commands. On the Mac it looks like this:
The command prompt is the > symbol.
There are also more sophisticated, menu-driven GUIs and development environments available that run on top of R, such as R Commander and RStudio. More on these later.
R can evaluate expressions and manipulate variables, e.g.,
> x = 2
This assigns the value of 2 to the variable x. But there is also another way to assign values to variables in R, which looks like this:
> x <- 2
Instead of using an equals sign, you type a less-than symbol immediately followed by a dash, making a left-pointing arrow that points towards the variable to which you want to assign a value. R purists insist that this latter method is preferable, but in this introduction we will use the equals sign for assignment statements because it is probably more familiar to users who are new to R.
To print the contents of a variable, simply type the variable’s name:
> x
[1] 2
The 1 in square brackets is part of R’s way of printing lists of numbers. This convention becomes more useful when there is a longer list of numbers to print. The number in brackets is the index of the first number on that line. So if we were to generate 15 random numbers from a normal distribution, our output display might look like:
> rnorm(15)
 [1] -1.20251243 -0.93915306 -0.58314349 -0.28205304 -0.72031211  1.12303378
 [7]  1.60557581  1.30062736  1.06739881 -2.09506242 -0.04172139 -1.66868780
[13]  0.87027623  0.43993863 -0.07720584
Here, for example, the [7] indicates that 1.60557581 is the seventh element in the list of numbers (called a vector in R parlance).
The View function [NOTE: Capital V!!] provides a prettier way to view a variable, e.g.,
> x = rnorm(100)
> View(x, 'This is x')
This will place the contents of x in a data viewer window, which can be scrolled using the keyboard page-up, page-down, and arrow keys.
Much of R involves function calls, arguments, and return values. As in your high school algebra class, when you wrote f(x) (“the f of x”), the f stands for the name of the function and the x stands for the arguments passed into the function. In algebra and in computer programming, a function is an equation or a piece of computer code that takes in one or more values (arguments), does some sort of computation with them, and then returns a result.
For example, in R we can generate a vector of 1000 random values from a normal distribution having a mean = 0 and standard deviation = 1 (the default arguments) by typing:
> y = rnorm(1000)
Here, rnorm is the name of the function and 1000 is the argument that we pass into the function. The argument tells the function that we want it to generate 1000 values. The rnorm function generates the values and returns them for storage in the variable that we have named y. If you type the name of the variable y at the R console you will get a display of those 1000 values. Here are the first and last few lines of that output:
   [1]  1.3395720217 -0.0115032420 -0.6823326244  0.2613036332  0.1040746632  0.0454721524  0.3640455956  0.7205450723  0.4994317717
  [10]  1.0467334635  0.0941044374  0.0470323976  1.6612519966 -1.2653165072  0.2894176277  0.2140272012 -0.1166290364  1.9260137594
  [19] -0.1623506191  0.2429738277 -0.0908777982 -1.0126380527 -1.4146009241  0.6927008684  1.5923438893  0.7920350474 -0.3419451639
   ...
 [982]  0.2657230634  0.4844520987  1.9939724168 -0.3426382437 -2.6953913082 -0.7822534469  1.0135964703  0.7795363595  0.1870213245
 [991] -1.4046173709 -0.3727388529 -0.2606101406 -0.9251060911  0.1755509390 -2.5640188283  0.5750884848  0.5416196143 -0.5890929375
[1000] -1.4114665861
Instead of using the default, unspecified argument values of mean = 0 and standard deviation = 1, we can invoke rnorm like this, specifying a particular mean and sd:
> y = rnorm(1000,mean=42,sd=2.5)
If you type help(rnorm) you will get a help screen that explains the various arguments that can be used with rnorm and its relatives.
To see if we got what we wanted, we can produce a few summary statistics of the variable y:
> summary(y)
Min. 1st Qu. Median Mean 3rd Qu. Max.
34.09 40.29 41.91 41.91 43.63 50.73
and we see that the Mean is 41.91, which is “close enough” to 42, given that rnorm generates a different batch of random values each time it is invoked.
To visualize this batch of numbers, we can produce a histogram:
> hist(y)
This graphic shows us the shape of the distribution of the variable y. The peak of the histogram is centered on the mean value of 42 and tapers off symmetrically on both sides of the mean. This is because the data we generated using rnorm come from a normal distribution, which by definition has the shape of the famous bell curve.
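If you want to see how closely the histogram matches the theoretical bell curve, one option (a quick sketch, assuming y was generated with mean = 42 and sd = 2.5 as above) is to draw the histogram on a density scale and overlay the corresponding normal density:
> hist(y, freq=FALSE)                         # plot densities instead of counts
> curve(dnorm(x, mean=42, sd=2.5), add=TRUE)  # overlay the theoretical normal curve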
H I N T You can get help on any R command by typing help(command), e.g., help(plot). Typing ?(command) works the same way, e.g., ?(plot).
You can also search for all help topics that contain a particular command by typing two question marks immediately followed by the command, without parentheses, e.g.:
??plot
The plot we just produced, with only one line of R code, can be easily saved to a graphics file in various formats for importing into other applications such as word processors and presentations. For example, to export the plot to a PDF file, select the graphics window and then invoke the R File|Save As… menu option to save the file.
Plots can be saved to other graphics formats as well, but you need to do it from the R console. To save as a JPEG graphics file:
> jpeg('rplot.jpg')
> hist(y)
> dev.off()
Note that the plot will be saved in the current working directory (see the Hint, below) unless otherwise specified. You also may get a “null device” message after the dev.off() statement; this just means that the JPEG file (the “graphics device”) has been closed.
You can use similar commands to export R graphics to other popular graphics formats, such as bmp, tiff, and png.
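For example, a PNG export works exactly like the JPEG example above (the file name here is just a placeholder):
> png('rplot.png')  # open a PNG graphics device (file)
> hist(y)           # draw the plot into the file
> dev.off()         # close the device, writing the file to disk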
[TEASER: Much of this is a lot easier to manage in RStudio. Read on.]
To summarize thus far, R works differently than other statistical software you may have used:
“Rather than setting up a complete analysis at once, the process is highly interactive. You run a command (say fit a model), take the results and process it through another command (say a set of diagnostic plots), take those results and process it through another command (say cross-validation), etc. The cycle may include transforming the data, and looping back through the whole process again. You stop when you feel that you have fully analyzed the data.”
H I N T R maintains a history of all of your commands, which you can access using the up- and down-arrow keys. Use the left- and right-arrow keys to move through and edit a retrieved command, after which you can re-submit the edited command by hitting the Enter key.
Some handy workspace commands:
Print the current working directory: getwd()
Change to mydirectory: setwd(mydirectory)
List the objects in the current workspace: ls()
Remove one or more objects from the workspace: rm(), e.g., rm(y)
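Putting these together, a little workspace housekeeping might look like this (the directory path is only a placeholder):
> getwd()                # print the current working directory
> setwd("~/R/sparrows")  # change to a project directory
> ls()                   # list the objects in the workspace
> rm(y)                  # remove the object y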
[For more help on managing your R workspace, go here.]
7. Getting data into R
H I N T At this point, you may want to peruse section 12, RStudio, and begin using R from inside of RStudio. You can follow along the same as before, entering commands into the R console window of RStudio. It will look exactly the same as R’s own console window, except that graphical plots will appear in their own Plots window inside of RStudio (except for 3D plots, which will be rendered in their own, floating graphics window). RStudio is a very helpful tool for organizing and managing your R projects and workflow.
For a short and very genteel video on the basics of RStudio, see RStudio Introduction.
R uses an internal structure called a dataframe to store data in a row-column, spreadsheet-like format, where the rows are the objects of interest (your units of analysis) and the columns are variables measured on each object. In R, a dataframe can be created by reading in raw data from an external file. A dataframe can be saved by exporting it to an external file or to R’s internal data format.
To illustrate this process we will use a famous data set from biology, the Bumpus House Sparrow data:
Bumpus sparrow data: bumpus.txt
Bumpus sparrow metadata: bumpus.met
More information on Hermon Bumpus and House Sparrows (PDF)
To create an R dataframe of the sparrow data:
a. Download the plain text data file bumpus.txt to a readily-accessible location on your computer.
b. Issue the following command in R:
> bumpus = read.table("d:/empty/bumpus.txt",sep = "",header = TRUE)
replacing the “d:/empty/bumpus.txt” with the path to the data file on your particular computer.
In the above read.table command, sep = "" means that the data items are separated by white space (one or more spaces or tabs), and header = TRUE means that the first line of the data file is a header containing the names of the variables.
Once your data are in a dataframe, you can then begin to manipulate, analyze, and visualize the data. Typing names(bumpus) lists the names of the variables. Typing the name of the dataframe (bumpus) will scroll all of the data to the console window. To display individual variables, use syntax like bumpus$survive. Better still, enter attach(bumpus). Once a dataframe is “attached,” you can type the name of a variable by itself to display that variable’s contents. Typing fix(bumpus) will place the dataframe into an editing window.
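For example, a quick first look at the imported data might go like this (head and str are standard base R functions for peeking at a dataframe):
> names(bumpus)   # list the variable names
> head(bumpus)    # display the first six rows
> str(bumpus)     # show the structure: each variable's name, type, and first few values
> attach(bumpus)  # make the variables accessible by name
> survive         # display a single variable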
H I N T A digression on data management in R and RStudio.
Actually, you don’t need to download the data file to your computer. You can read it into an R dataframe directly if you know the URL that points to the file location on the server:
> bumpus = read.table("http://college.holycross.edu/faculty/rlent/data/bumpus.txt",sep = "",header = TRUE)
This will create the bumpus dataframe exactly as if you had first downloaded the source data to your computer.
Importing data from a local file or URL is even easier in RStudio. From RStudio’s Tools menu, choose Import Dataset | From Web URL, and then enter the URL. If instead you choose to import data from a local file, you will be given a dialog box from which you can choose the file from your computer’s filesystem. RStudio can usually figure out the structure of your data as long as you follow standard practice, such as organizing your data into a comma-separated values (CSV) file. A CSV file might look like this, in a plain-text file we could call cars.csv:
Year,Make,Model,Length
1997,Ford,E350,2.34
2000,Mercury,Cougar,2.38
In the CSV format, the first line of the data file contains field names, and subsequent lines contain the actual data, with all values separated by commas.
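To read such a file into R yourself, without going through RStudio’s menus, you could use read.csv (a sketch, assuming the file above was saved as cars.csv in your working directory):
> cars = read.csv("cars.csv")  # read.csv assumes header = TRUE and sep = ","
> cars                         # print the imported dataframe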
Once imported, RStudio will place your data into a Data Viewer window. (See Using the Data Viewer.) You can’t edit data directly in the Data Viewer; it’s just a viewer. To edit an R dataframe, type, for example, fix(cars) into the console window. This places the dataframe into an editor pane in which you can change values. When you quit the editor, the changes will be saved into the dataframe. You can then save the modified dataframe using the save command, e.g.:
save(cars,file="cars.rdata")
This creates a data file in R’s own binary format, which can later be re-loaded into your workspace with the load command:
load("cars.rdata")
In RStudio you can also load an R data file by choosing it from the Files pane.
An R dataframe can be exported to a plain-text CSV file using the write.table function:
write.table(cars, "cars.csv", sep=",")
Here the three parameters passed to write.table are (1) the name of the R dataframe to export, (2) the name of the exported file, and (3) the data separator.
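One optional refinement: by default write.table also writes out R’s row names (1, 2, 3, …) as an extra leading column. If you don’t want that, you can add row.names = FALSE; write.csv is an equivalent shortcut for comma-separated output:
> write.table(cars, "cars.csv", sep=",", row.names=FALSE)
> write.csv(cars, "cars.csv", row.names=FALSE)  # same result with less typing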
If we had an SPSS file, we could read the data into an R dataframe like this (the read.spss function is in R’s foreign package, which must be loaded first):
> library(foreign)
> cars = read.spss("d:/empty/cars.sav", to.data.frame = TRUE)
Go here to see how to import other popular data formats.
The following sequence of commands creates a new dataframe from scratch:
age = c(25, 30, 56)
gender = c("male", "female", "male")
weight = c(160, 110, 220)
mydata = data.frame(age,gender,weight)
The c function combines values into a vector, and the data.frame function packages the vectors as columns into a dataframe.
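A quick check that the new dataframe looks the way we intended:
> mydata           # print the new dataframe
> str(mydata)      # check the type of each column
> summary(mydata)  # summary statistics for each column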
8. Exploratory data analysis
The principles of exploratory data analysis (EDA) were pioneered by one of my favorite statisticians, John Tukey. An important tenet of EDA is that, before you statistically analyze your data, you should actually look at it.
Tukey stressed the contrasting approaches of exploratory versus confirmatory data analysis. Classical parametric statistical techniques like analysis of variance and linear regression are confirmatory techniques requiring that the data adhere to some fairly rigid assumptions, such as normality and linearity. Modern software makes it easy to crank out sophisticated statistical analyses with a couple of mouse clicks, but it also makes it easy to ignore the underlying assumptions. Performing some exploratory analysis of your data will allow you to assess those assumptions, and will also allow you to detect things like outliers and errors in data entry. (See Tukey’s classic 1980 American Statistician paper We Need Both Exploratory and Confirmatory.)
What follows is an EDA session in R using the Bumpus dataset.
A handy graphical tool for EDA is the scatterplot matrix. This is simple to produce in R: just give the plot function the name of a dataframe:
> plot(bumpus)
A scatterplot matrix shows individual scatterplots of all pairwise combinations of variables in the dataframe and is an excellent way to search for odd patterns in your data, such as groupings of data points, nonlinearities, etc.
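If the full matrix is too crowded, you can restrict it to a few columns of interest by subsetting the dataframe (a sketch using some of the Bumpus variable names):
> plot(bumpus[, c("length", "alar", "weight", "lhum", "lfem")])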
H I N T R packages.
R organizes all of its routines into packages. For example, the plot routine is part of R’s graphics package. The packages that are installed by default can be listed as follows:
> getOption("defaultPackages")
[1] "datasets" "utils" "grDevices" "graphics" "stats" "methods"
And as you may have noticed by now, R commands are case-sensitive.
The datasets package contains example datasets. The utils package contains functions used for programming and developing other packages. The grDevices package provides support for graphics hardware and software. Common statistical routines are found in R’s stats package. The methods package supports R’s formal (S4) classes and methods.
Users of R will probably interact most with the graphics and stats packages. There are many other R packages that provide a way to expand the functionality of the base software. Contributed packages are written by members of the R user community. CRAN Task Views present collections of packages organized by topic and provide tools to automatically install all packages for particular areas of interest. Another way to find R packages is to do a Google search, e.g. “How do I do principal coordinates analysis in R?”
If you want to use an R function that is not part of the base packages, you must first install that function’s package before you can use it. A Package Installer is included as part of the R GUI and can be used to download and install contributed packages. In RStudio, the Packages pane and Tools menu (Install Packages) can be used to facilitate installation and management of packages.
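You can also install and load a contributed package directly from the console. For example, to get the rgl package used later in this introduction:
> install.packages("rgl")  # download and install from CRAN (needed only once)
> library(rgl)             # load the package into the current session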
Summary statistics for all variables in dataframe bumpus:
> summary(bumpus)
    survive           length          alar           weight           lbh
 Min.   :0.0000   Min.   :153.0   Min.   :39.00   Min.   :3.000   Min.   :30.00
 1st Qu.:0.0000   1st Qu.:158.8   1st Qu.:44.75   1st Qu.:4.875   1st Qu.:30.30
 Median :1.0000   Median :160.0   Median :47.00   Median :5.750   Median :31.00
 Mean   :0.5893   Mean   :160.0   Mean   :46.98   Mean   :5.764   Mean   :30.96
 3rd Qu.:1.0000   3rd Qu.:161.0   3rd Qu.:50.00   3rd Qu.:6.500   3rd Qu.:31.50
 Max.   :1.0000   Max.   :166.0   Max.   :53.00   Max.   :8.300   Max.   :31.90
      lhum             lfem            ltibio          wskull           lkeel
 Min.   :0.6590   Min.   :0.6580   Min.   :1.000   Min.   :0.5700   Min.   :0.7880
 1st Qu.:0.7080   1st Qu.:0.7000   1st Qu.:1.103   1st Qu.:0.5940   1st Qu.:0.8300
 Median :0.7360   Median :0.7090   Median :1.117   Median :0.6000   Median :0.8500
 Mean   :0.7291   Mean   :0.7105   Mean   :1.123   Mean   :0.6012   Mean   :0.8506
 3rd Qu.:0.7460   3rd Qu.:0.7180   3rd Qu.:1.150   3rd Qu.:0.6082   3rd Qu.:0.8795
 Max.   :0.7800   Max.   :0.7650   Max.   :1.197   Max.   :0.6330   Max.   :0.9160
Previously we used the summary function for a single variable. Here we give summary the name of our dataframe and get summary statistics for all of the variables. survive is a binary variable, taking on only two values: 1 for survived, 0 for dead. With a binary variable the mean is not very useful: a bird can’t be 0.58 dead. However, the other variables are morphological measures that vary continuously along their scale of measurement (see bumpus.met), and therefore the summary statistics can be very informative. Look carefully at the minimum and maximum values for errors, such as a misplaced decimal point. Would a House Sparrow really weigh 3000 grams (6.61 pounds)? Many researchers skip this elementary data-screening step, which is especially important if humans are typing in the data from mud-splattered field notebooks. Also, laboratory instruments can go out of calibration, batteries can become weak, power surges can fry delicate circuitry, and so on. Check your data!
The mean and median both measure central tendency, but in very different ways. Most people are familiar with the mean, or arithmetic average, calculated by summing the values and dividing the sum by the number of observations. The median, however, is the middle value of ranked data. In other words, if you sorted the values from low to high, the median would be the one in the middle. The mean is very sensitive to outliers (extreme values), whereas the median is not. The following R experiment will illustrate this.
> x = c(1,2,3,4,5)
> summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      1       2       3       3       4       5
Note that the median and mean are equal. But if we add an extreme value (outlier):
> x = c(1,2,3,4,3000)
> summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      1       2       3     602       4    3000
The median is unchanged by the huge outlier (the median is still the middle value, 3) but the mean has increased enormously. When the data are (at least approximately) normally distributed, the mean and median will be nearly equal (see Measures of Central Tendency). So another exploratory diagnostic is to compare the mean and the median: if they are very different, you may have some outliers that are skewing the distribution.
The summary statistics 1st Qu. and 3rd Qu. are the first and third quartiles. Along with the median, the quartiles divide the data set into four equal groups, each comprising a quarter of the observations.
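You can also ask for the quartiles (or any other quantiles) directly with the quantile function; for example, for the attached Bumpus variable length:
> quantile(length)               # minimum, quartiles, and maximum
> quantile(length, probs = 0.9)  # or any other quantile, here the 90th percentile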
This five-number summary (minimum, first quartile, median, third quartile, maximum) is better visualized using another invention of John Tukey, the box plot:
> boxplot(length)
The line running through the central rectangle is the median. The top and bottom of the rectangle mark the third and first quartiles, respectively, so the rectangle itself spans the distance from the first to the third quartile (the interquartile range, or IQR). Above and below the rectangle are dotted lines (called “whiskers”) that usually terminate at the minimum and maximum values. However, if outliers are present (as in this example), the whiskers extend only as far as the most extreme values within 1.5 times the IQR of the box, and the outlying values are shown as individual points.
A box plot is a nice way to visually assess the distribution of your data. For example, if the distribution deviates from a smooth bell-shaped curve, a box plot will make this obvious. Let’s make a skewed set of numbers using the log-normal distribution:
> q = rlnorm(1000)
> hist(q)
> boxplot(q)
The severely squished box plot instantly reveals the highly skewed nature of the data.
Yet another exploratory tool popularized by Tukey is the stem-and-leaf display:
> stem(weight)
The decimal point is at the |
  3 | 089
  4 | 0133566677899
  5 | 001456667777899
  6 | 00000013555567799
  7 | 01569
  8 | 033
The stem-and-leaf plot is best explained by looking at the variable weight as a sorted vector from low to high:
> sort(weight)
 [1] 3.0 3.8 3.9 4.0 4.1 4.3 4.3 4.5 4.6 4.6 4.6 4.7 4.7 4.8 4.9 4.9 5.0 5.0 5.1 5.4 5.5 5.6 5.6
[24] 5.6 5.7 5.7 5.7 5.7 5.8 5.9 5.9 6.0 6.0 6.0 6.0 6.0 6.0 6.1 6.3 6.5 6.5 6.5 6.5 6.6 6.7 6.7
[47] 6.9 6.9 7.0 7.1 7.5 7.6 7.9 8.0 8.3 8.3
The vertical line in the stem-and-leaf plot separates the “stems” from the “leaves,” with the stems to the left of the line. The vertical line itself is interpreted as the decimal point. So, in our sorted vector of weight, the first value is 3.0: the 3 is the stem, to the left of the vertical line; the line is the decimal point; and the first leaf to the right of the vertical line is the zero. The next two values are 3.8 and 3.9, filling out the first row of the stem-and-leaf plot. A stem-and-leaf plot is like a histogram lying on its side, with the bars made up of the actual data values. This gives you a much more detailed view of the distribution, right down to individual values, unlike a traditional histogram in which the individual values are thrown into “bins” that become the bars of the histogram. A histogram is based on frequencies, whereas a stem-and-leaf plot is based on individual values; a histogram summarizes (and thus obscures) the original data, while a stem-and-leaf plot displays each and every value.
In our Bumpus dataframe we have the variable survive, a categorical variable taking on only two values: a 1 indicating that the bird survived the storm and a 0 indicating that it did not survive. R calls such categorical variables factors. We can generate multiple box plots according to factor levels, such as:
> boxplot(length~survive)
A variant of the box plot is the notched box plot:
> boxplot(length~survive, notch=TRUE)
The notches show a 95% confidence interval around the medians. If the notches do not overlap, there is “strong evidence” that the medians differ. Our notched box plot indicates that larger birds (males in this subset of Bumpus’ data) did not survive the storm, potential evidence for natural selection. [Note, however, that the Bumpus dataset has been subjected to multiple analyses and interpretations over the years. See, for example, Hermon Bumpus and Natural Selection in the House Sparrow Passer domesticus (PDF) and Differential overnight survival by Bumpus’ House Sparrows: An alternate interpretation (PDF)].
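If you want the two group medians as numbers rather than notches, tapply will apply a function (here median) to a variable split by the levels of a factor:
> tapply(length, survive, median)  # median body length for non-survivors (0) and survivors (1)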
At this point we are beginning to stray into confirmatory analysis. A couple more examples will suffice.
The classical t-test of the difference in mean body length between surviving and non-surviving male sparrows is as follows:
> t.test(length~survive)
	Welch Two Sample t-test

data:  length by survive
t = 3.2576, df = 50.852, p-value = 0.002005
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.9007984 3.7948537
sample estimates:
mean in group 0 mean in group 1
       161.3478        159.0000
And your basic one-way analysis of variance (ANOVA) for the same data is:
> summary(aov(length~survive))
Note that, in the above R command, we are nesting the ANOVA function (aov) inside of the summary function. This is common practice in R, and in many other programming languages, where the result returned by one function is used as the input to another. The summary function, like many functions in R, adapts its output to the kind of input it receives.
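Equivalently, you can store the fitted model in a variable first and summarize it in a second step, which is handy when you want to inspect the same fit in several different ways:
> fit = aov(length~survive)  # fit the one-way ANOVA and keep the result
> summary(fit)               # produces the ANOVA table shown below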
Here are the ANOVA results:
            Df Sum Sq Mean Sq F value  Pr(>F)
survive      1   74.7   74.71   10.16 0.00239 **
Residuals   54  397.2    7.36
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Both tests seem to indicate that there is a significant difference between the two means, with survivors being smaller than non-survivors.
However, before you believe this, read Jacob Cohen’s Things I Have Learned (So Far).
9. Some more graphics
Again, the following series of commands will be more instructive if you follow along and execute them in R.
The R language makes it possible to create impressive graphics with only a few lines of code. For example, to make a scatterplot of humerus versus femur length using the Bumpus data:
> plot(lhum,lfem)
This yields a simple plot of the data points. However, if we instead enter:
> scatter.smooth(lhum,lfem)
we get the same scatterplot of points but with a smoothed line fit to the data.
Another way, which needs a prior call to plot(lhum,lfem):
> lines(loess.smooth(lhum,lfem))
For a straight-line fit instead of a lowess/loess smooth (lowess stands for LOcally WEighted Scatterplot Smoothing), note that the variables appear in reverse order inside the call to abline, because the model formula lfem~lhum means “lfem as a function of lhum.” This also needs a prior call to plot(lhum,lfem):
> abline(lm(lfem~lhum))
[lines() connects dots to produce a curve; abline() plots a single straight line.]
Change the line color and thickness.
> abline(lm(lfem~lhum),col='blue',lwd=10)
Here again is our notched box plot, but we now add a title and some axis labels:
> boxplot(length~survive, notch=TRUE, main="Bumpus' Sparrows", xlab="Survival", ylab="Body length")
And now for something a little fancier:
Invoke the rgl 3D graphics package.
> library(rgl)
Rotate your data in 3D! (Drag your mouse cursor over the plot to rotate.)
> plot3d(length, alar, lfem, col="red", size=6)
Create a new variable for labeling points:
> bumpus$survcat = ifelse(bumpus$survive == 0, c('N'), c('Y'))
If you now type names(bumpus) you will see that the categorical variable survcat has been added to the dataframe.
Next, re-plot with smaller plotting symbols:
> plot3d(length,alar,lfem,size=2)
and then add text labels to the plotted data points. Ponder the effects of body size on survival:
> text3d(length,alar,lfem,texts=bumpus$survcat)
You can even skip creating a separate labeling variable by inserting the recoding statement inside of the call to text3d():
> text3d(length, alar, lfem, texts=ifelse(bumpus$survive == 0, c('N'), c('Y')))
(My thanks to Holy Cross Mathematics Professor John Little for showing me this trick.)
10. Command files
R statements can be stored in a text file and submitted together as a batch. Thus we could produce our 3D rotating scatterplot by storing the following R commands in a plain text file:
library(rgl)
bumpus = read.table("http://college.holycross.edu/faculty/rlent/data/bumpus.txt",sep = "",header = TRUE)
attach(bumpus)
plot3d(length,alar,lfem,size=2)
text3d(length, alar, lfem, texts=ifelse(bumpus$survive == 0, c('N'), c('Y')))
To create command files, use whatever plain text editor you like as long as it produces nothing more than plain text. DO NOT use a word processor, because word processors insert invisible formatting symbols that will mess up your R commands. Plain text editors include TextEdit for the Mac (just be sure you use plain text, not rich text), Notepad for Windows, Emacs for Unix, etc. RStudio has a plain text editor built in; access it from the File menu (choose New File, then Text File).
The command file contains R commands written exactly as you would have entered them interactively at the R console. If we named the command file plot.R (the .R suffix is standard practice for R command files, but you could use whatever you want), we could then submit the command file to R with the following statement:
source("plot.R")
You can also submit command files using the R GUI.
Command files are an excellent way to maintain complicated batches of R statements without having to constantly retype them.
11. An R menu system: R-commander
R Commander runs on top of R and enables users to access a selection of commonly-used R commands with a simple, menu-driven interface. R Commander is a standard R package and can be installed using the R Package Installer that is part of the standard R GUI. R Commander does depend on a number of other contributed packages, which must be present in order for it to work correctly. (Be sure to click the “Install Dependencies” checkbox in the Package Installer when you are installing R Commander.) If you are getting error messages when you try to load R Commander, see R Commander Installation Notes.
After installing the Rcmdr package, enter library(Rcmdr) at the R prompt and R Commander should appear. It looks like this:
The beauty of R Commander is that it generates the R code that corresponds to your various menu selections, which makes it a great tool for learning the R language. You can interactively change the code and resubmit it. This is often necessary because no point-and-click menu system could possibly cover all of the options in a system as powerful and complex as R.
For more information on using R Commander, see Getting Started With the R Commander (PDF).
12. RStudio
RStudio is a free IDE (integrated development environment) for R that runs on Windows, Mac, or Linux. Some of the features of RStudio that facilitate working with R include:
Workspace browser and data viewer
Plot history, zooming, and flexible image and PDF export
Integrated R help and documentation
Searchable command history
H I N T See Install R, RStudio, and R Commander in Windows and OS X for a nice set of instructions on getting all of these tools installed and working together.
You can run R Commander from inside of RStudio, either by typing library(Rcmdr) in the R console window of RStudio or by selecting Rcmdr in the Packages tab of RStudio’s plots window. In this scenario you can use R Commander’s menus and dialog boxes to construct your analyses and graphics, and R Commander will direct its output to the appropriate areas of RStudio.
You must have R installed before you can use RStudio, but once RStudio is installed you do not need to have R running, as RStudio can access the R system as needed.
Here is a screenshot of RStudio running on an iMac:
In this particular setup (which can be customized) there are four windows, showing the current dataset, the R command console, the local environment (datasets, file system, etc.), and an output window for graphics. RStudio is a very helpful tool for organizing and managing your R projects and workflow.
For an excellent introduction to installing and using R and RStudio, including explanations, examples and exercises, see A (very) short introduction to R (PDF).
13. R Markdown
R Markdown is a plain-text document format for writing reproducible, dynamic reports with R. It is based on standard Markdown, which is a simple, plain-text formatting syntax. R code can be embedded directly into a Markdown document, which can then be translated into a variety of standard document formats including HTML, PDF, and Microsoft Word. The combination of R and Markdown enables the writing of documents containing not just text and figures but also built-in data analysis and graphics (see Introductory R Markdown: dynamic documents and reproducible research for beginners).
RStudio makes it easy to create R Markdown documents, as described in the R Markdown Cheat sheet (PDF).
An example of an R Markdown file, slightly modified from here.
This plain text file can be copied and pasted into an editor window in RStudio, saved as an Rmd file (R Markdown), and rendered as HTML by tapping the Knit button. The Knit button invokes knitr, an R package that enables integration of R code into Markdown and a variety of other plain-text document formats. This workflow can also be accomplished in the regular R console, but RStudio is usually easier because it integrates all of the necessary tools into one convenient environment.
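For reference, a minimal R Markdown file might look something like this (only a sketch; the title and the code chunk, which reuses the Bumpus data from earlier, are just illustrative):
---
title: "Bumpus sparrows"
output: html_document
---

Body length of surviving versus non-surviving sparrows:

```{r}
bumpus = read.table("http://college.holycross.edu/faculty/rlent/data/bumpus.txt",
                    sep = "", header = TRUE)
boxplot(length ~ survive, data = bumpus, notch = TRUE)
```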
You can publish your rendered R Markdown documents directly from RStudio via Rpubs.com, a free R Markdown publishing site maintained by RStudio. You can also publish your R Markdown documents on your own web server, e.g.:
http://college.holycross.edu/faculty/rlent/test/RMarkdowntest.html