Reproducible Research Revisited
If we were creating a journal article back in the olden days (i.e., prior to 2011, the year of the first public beta release of RStudio), we would start writing our manuscript in a word processor, say Microsoft Word. If it were a science manuscript, the text would contain Introduction, Methods, Results, Discussion, and Literature Cited or References sections. The data, say from a laboratory experiment or from field observations, might reside in an Excel spreadsheet, a database application, or preferably in one or more plain text files. To produce data summaries, statistical analyses, and graphics, we would have to bring the data into a statistics package like SPSS, SAS, or one of many others. Reduction and manipulation of the original data might continue in the statistical software. Tabular statistical output such as regression and ANOVA tables would need to be copied and pasted back into Word, where it might then be wrangled into a pretty table. Graphical output from the statistical software, or from a separate graphics package, would need to be saved to a graphics file and then imported into Word. If we needed a map of study sites, we might have to produce it in a geographic information system, which requires even more data files, and then export the map to an external graphics file so that it could be brought into Word. Bibliographic references might be stored in reference management software like Zotero or RefWorks, and via a Word plugin, we could produce our literature citations and formatted bibliographies. At some point, a final manuscript would be produced.
And then the revisions would begin.
The cut-and-paste approach to producing a scholarly manuscript is tedious, slow, and error-prone, to say the least. Moving data back and forth between applications makes it difficult to retrace the steps taken to produce a given result, even if careful notes are taken every step of the way. If a project involves multiple researchers, each working on different parts of the analysis, and each keeping their own set of notes, this process becomes even more complicated.
Production of an analysis, publication-quality graphics, and a final manuscript can be greatly streamlined by keeping everything in one R Notebook. With R code chunks embedded in R Markdown text, you can fully document how you arrived at your results, while simultaneously producing the statistical output, graphics, and references for your paper. R Notebooks can be easily archived and shared among collaborators, using cloud storage technologies such as Dropbox and Google Drive, or version control systems such as git, and can be rendered into publication-quality documents in a variety of formats. And because everything is plain text, the R Markdown manuscript can be edited on any computing device that has a text editor, including smartphones and other mobile gadgetry.
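To make the idea of code chunks embedded in R Markdown text concrete, here is a minimal sketch of what such a file can look like. It is not taken from our example manuscript: the title, the section text, and the chunk label are invented for illustration, and the built-in cars data set stands in for real data.

````markdown
---
title: "A Hypothetical Manuscript"
output: html_notebook
---

# Methods

Narrative text is written in Markdown, and the analysis lives in code chunks.

```{r regression}
# cars ships with R and is used here only as a stand-in data set
fit <- lm(dist ~ speed, data = cars)
summary(fit)
```

# Results

The fitted slope was `r round(coef(fit)[["speed"]], 2)`.
````

Because the slope in the Results sentence is computed by inline R code rather than typed by hand, it is recalculated every time the document is rendered, so the text cannot drift out of step with the analysis.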
We illustrate this workflow with a small example, involving the analysis of the raw data file sites.csv. This is a comma-separated values file containing ecological data from 11 grassland sites in Massachusetts, New Hampshire, and Vermont. The companion metadata file sites.metadata.txt describes the variables (columns) of sites.csv. The data for each site consist of measures of site vegetation structure, morphological measures on individuals of the butterfly species Coenonympha tullia (the Common Ringlet) inhabiting each site, and the geographic location of each site in both UTM and decimal degree coordinates. The aim of the study was to examine relationships between habitat structure and morphological variation in the butterfly populations at each site.
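A first code chunk in a notebook like this one would typically read the data file and check it against its description. The following is a minimal sketch under stated assumptions, not the code used in sites.Rmd: the helper name read_sites and its expected_sites argument are our own inventions for illustration, and no column names are assumed, since those are documented in sites.metadata.txt.

```r
# Hypothetical helper (not part of sites.Rmd): read the site data and
# warn if the number of rows differs from the expected number of sites.
read_sites <- function(path = "sites.csv", expected_sites = 11) {
  sites <- read.csv(path, stringsAsFactors = FALSE)
  if (nrow(sites) != expected_sites) {
    warning("Expected ", expected_sites, " sites but found ", nrow(sites))
  }
  sites
}

# Typical use; the guard lets the sketch run even where sites.csv
# has not been downloaded yet.
if (file.exists("sites.csv")) {
  sites <- read_sites("sites.csv")
  str(sites)      # variable names and types, to compare with the metadata
  summary(sites)  # quick numerical summary of every column
}
```

Checking the row count at the point where the data enter the analysis is a cheap way to catch a truncated or mismatched file before any statistics are computed.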
You can view the complete example in sites.nb.html, which is an HTML notebook created in RStudio from the corresponding R Notebook file sites.Rmd. The HTML notebook contains the rendered text of a scientific paper originally written in R Markdown with embedded R code that creates all of the statistical analyses, tables, and graphics. The same sites.Rmd file was used to produce PDF and Microsoft Word versions of the paper. The manuscript in sites.Rmd was written to be self-contained and self-documenting, essentially a “paper-within-a-paper.” It includes comments that document both the main text and the embedded R code. Also in sites.nb.html is a link from which you can download a copy of the complete R Markdown source file sites.Rmd. You can also download sites.zip, a zip archive file that contains all of the document files, data, metadata, R code, and other associated files (such as external images and bibliographic data) needed to completely replicate our analysis and document production.
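Producing several output formats from one source file can be done from the RStudio interface, or scripted with the rmarkdown package. The sketch below shows the scripted route; it is one possible approach, not necessarily how our own versions were built. The helper name render_formats is hypothetical, while rmarkdown::render() and the three format names belong to the rmarkdown package. PDF output additionally assumes a working LaTeX installation.

```r
# Hypothetical helper: render one R Markdown file to several formats
# and return the paths of the files that were written.
render_formats <- function(input = "sites.Rmd",
                           formats = c("html_notebook",
                                       "pdf_document",
                                       "word_document")) {
  if (!file.exists(input)) stop("Cannot find ", input)
  vapply(formats, function(fmt) {
    # render() returns the path of the output file it produced
    rmarkdown::render(input, output_format = fmt, quiet = TRUE)
  }, character(1))
}

# Typical use from the folder unpacked from sites.zip; the guard keeps
# the sketch from failing where the file or the package is missing.
if (file.exists("sites.Rmd") && requireNamespace("rmarkdown", quietly = TRUE)) {
  print(render_formats())
}
```

Scripting the rendering step means the documents themselves, and not only the analysis, can be regenerated with a single command after every revision.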
Recalling the three criteria for reproducible research, the files comprising our example satisfy the requirement that “All data and files used for the analysis are publicly available.” The data file and its companion metadata file would be placed in a publicly accessible digital repository so that other researchers wishing to replicate the analysis could get the data and know what they were working with. At a minimum, the metadata file needs to describe what the variables are and their units of measurement. The requirement that “All methods are fully reported” should be satisfied by the Methods section of the article. Because the article is written in R Markdown and contains embedded R code showing exactly how the data were analyzed, we have also satisfied the third requirement of reproducible research, that “The process of analyzing raw data is well reported and preserved.” The R Markdown manuscript with its embedded R code and accompanying data and metadata files, all residing in sites.zip, is a self-contained package of reproducible research.
Coda I: Python
We note briefly here that you can insert code chunks from other programming languages besides R into an R Markdown document. See knitr Language Engines for more details. There is an example of an R Notebook that includes both R and Python code here.
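As a small illustration of mixing languages, the fragment below sketches an R Markdown document body with one Python chunk between two R chunks. It is our own minimal example, not the notebook mentioned above, and it assumes that the reticulate package and a Python installation are available; the variable name squares is invented for illustration.

````markdown
```{r setup}
# Assumes the reticulate package and a Python installation are available
library(reticulate)
```

```{python}
# A Python chunk: build a list of squares
squares = [n ** 2 for n in range(1, 6)]
```

```{r}
# Back in R: objects created in Python chunks are reached through `py`
sum(unlist(py$squares))
```
````

The py object supplied by reticulate is what lets the later R chunk see a variable created in the Python chunk, so the two languages can share intermediate results within one notebook.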
Coda II: Inspirational Quotes About Data
“Data! Data! Data!” he cried impatiently. “I can’t make bricks without clay.” — Sherlock Holmes, The Adventure of the Copper Beeches
“War is 90% information.” — Napoleon Bonaparte
“Everybody gets so much information all day long that they lose their common sense.” — Gertrude Stein
“Statistics are no substitute for judgment.” — Henry Clay
“Information is not knowledge.” — Albert Einstein
“Facts are stubborn, but statistics are more pliable.” — Mark Twain