3 R Markdown

This module consists of text, example code, and two exercises to gently introduce to the world of R Markdown.

3.1 Motivation

Say you were sent some tree data from a colleague, consisting of lengths and diameters of some trees repeatedly measured over 30 years. You then spent a month writing lots of R code to analyse the data. Your code also produces beautiful figures and tables, that you then manually copied into a Word document. You also had to write lots of text, including calculations such as the simple

The average tree height was 10.1 m.

Then your colleague sheepishly tells you that someone introduced an error when extracing the data from their tree-size database, and so you need to redo everything. Your heart sinks with the prospect of re-running all your code and making sure you manually copy the correct new figures into your document. Oh, and you need to redo the tables and check all the numbers in your text.

The alternative modern approach is to use R Markdown. The idea is that you can generate a ‘dynamic report.’ You write code that contains a mixture of your R code and your write up all in one place. You can use this for short analyses, scientific manuscripts, or even a complete thesis. For shorter documents you will have a single R Markdown file, and for more complex documents it’s better to break your work up into individual files that are linked together, such as chapters for a thesis.

This module is aimed to get biologists started with the basics without too much extraneous information. Having gone through this module you will then be in a good position to learn more details from the online Definitive Guide to R Markdown and the RStudio introduction, which are aimed at a general statistical audience (rather than specifically biologists).

A key concept is that everything is written as code. You do not have to manually point and click anything, or copy and paste figures between directories. So once you understand how something works or have figured out some formatting that you like, you can just copy that code and use it elsewhere.

Example application

A recent 328-page document we wrote in R Markdown is A reproducible data synopsis for over 100 species of British Columbia groundfish. For each of 113 species, we produced two pages of figures:

Screenshot of a page with a complex layout of 23 figures.

Figure 3.1: First example page showing multiple figures of data.

and

Second example page showing further figures of data.

Figure 3.2: Second example page showing multiple figures of data.

Do not worry about the details – the point here is that for each species the layout of the figures is identical (even to show no data when none are available). Producing each figure and manually inserting them into a Word document would be extremely tedious and time-consuming. Instead, the production of the document is automated using R Markdown.

Furthermore, the work is transparent and traceable. Because the code produces the figures (they are not pasted in from somewhere), we can trace back from the R Markdown code to see the R code that:

  • pulled the data from databases
  • fit models
  • generated plots.

In particular, we intend to periodically update the document as new data become available. While still a lot of work, it is less daunting knowing that the code is already available.

On a practical level, the report has allowed anyone to see the data that are available, and has consequently increased data transparency between Fisheries and Oceans Canada, the fishing industry, non-governmental organizations, and the public.

This is admittedly a very advanced example with a ton of code (several new R packages) and work behind it, but the idea is to show you what is possible and what you can work towards.

3.2 Basic idea

In the above tree example, if using Word, for example, you would have a sentence that says

The average tree height was 10.1 m.

The “10.1” is hard-wired into your text, and you typed it in from the value “10.1” that your R code calculated (in a variable you calculated, let’s say you called it avge_height).

In your R Markdown document you have your R code and your text write up. You would equivalently have:

The average tree height was `r avge_height` m.

Instead of “10.1” you refer directly to the variable avge_height that you have already calculated. When you ‘render’ your R Markdown document (convert it from code into .pdf, .html or other formats), it automatically fills in the avge_height value as “10.1.” The

`r

means that the next bit of code (until the next backtick) should be evaluated using R, and the result inserted.

This is the basic idea. Then, when your colleague mentions the error (or, say, provides you with extra data) you can just re-run your code and the “10.1” will be automatically updated in your document. This concept extends to your tables and figures – they can all be automatically updated.

3.3 Simple example

Here we will generate some data, show some of it in a table, plot it, and show the results of fitting a simple linear regression. Read through this and then you will download and run the code in the exercise.

Generate data

First we’ll need some libraries:

library(kableExtra)
library(dplyr)
library(xtable)

Now generate some data:

set.seed(42)
n <- 50  # sample size
x <- 1:n
y <- 10 * x + rnorm(n, 0, 10)
n
[1] 50

So we are showing our R code here (we can choose to hide it if we like), and it has been executed, yielding the printed output of the value of n (because of the final line of the code).

We can also embed results from R within sentences, for example:

We have a sample size of 50.

This is done (as mentioned above) by the code:

We have a sample size of `r n`.

Note that you need the space straight after the r.

We can also automatically say that the maximum value of the data is 506.5564788, or round it to a whole number: the maximum value of the data is 507. These were done by:

the maximum value of the data is `r max(y)`

the maximum value of the data is `r round(max(y))`

Show some of the data

Let’s combine the data in a tibble (think of it as a data frame if you don’t know what that is):

data <- tibble(x, y)
data
# A tibble: 50 x 2
       x     y
   <int> <dbl>
 1     1  23.7
 2     2  14.4
 3     3  33.6
 4     4  46.3
 5     5  54.0
 6     6  58.9
 7     7  85.1
 8     8  79.1
 9     9 110. 
10    10  99.4
# ... with 40 more rows

(only the first 10 rows get printed here thanks to dplyr).

To have a proper table, we can do

kable(data[1:10, ], caption = "The first rows of my data.")
Table 3.1: The first rows of my data.
x y
1 23.70958
2 14.35302
3 33.63128
4 46.32863
5 54.04268
6 58.93875
7 85.11522
8 79.05341
9 110.18424
10 99.37286

(If you’re running the code separately the exact style may look different because of settings we have, but pretty much everything is tweakable with the kable and kableExtra packages).

Plot then fit a regression data

Now to plot the data:

plot(x, y)
Simple plot of the 50 simulated data points.

Figure 3.3: Simple plot of the 50 simulated data points.

To fit and then print the summary regression output from R:

fit <- lm(y ~ x)
print(summary(fit))

Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max 
-27.403  -4.366  -1.193   8.319  21.072 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   3.7071     3.2719   1.133    0.263    
x             9.8406     0.1117  88.124   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.39 on 48 degrees of freedom
Multiple R-squared:  0.9939,    Adjusted R-squared:  0.9937 
F-statistic:  7766 on 1 and 48 DF,  p-value: < 2.2e-16

And for a report we can produce a simple table (including a caption) of output and the regression fit:

kable(coefficients(summary(fit)), caption = "Linear regression fit.")
Table 3.2: Linear regression fit.
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.707115 3.2719100 1.133012 0.2628368
x 9.840634 0.1116686 88.123530 0.0000000

And create a plot:

plot(x, y)
abline(fit, col = "red")
Same simple plot with simple regression line added.

Figure 3.4: Simple plot of the 50 simulated data points with the fitted regression added in red.

Now, let’s go back and change the data

The big feature of dynamically generating reports is when you go back and change or update the input data. For example, changing the data in the above example and then re-running it to redo the report. The best way to demonstrate this is for you to do it in Exercise 1.

3.3.1 Exercise 1

  1. In R do library(rmarkdown) to make sure you have the rmarkdown package. If not then install it with install.packages("rmarkdown").

  2. Download rmark-exercise.Rmd onto your computer and put it where you want to work on this exercise. Make sure it is saved as rmark-exercise.Rmd and not rmark-exercise.txt.

  3. The file is an R Markdown (.Rmd) file that you can run by either doing rmarkdown::render("rmark-exercise.Rmd") in R, or clicking the knitr button in RStudio and selecting the .html option. See the RStudio section at the end of this module for help using RStudio.

  4. Check that this has produced (‘rendered’) an .html file document that looks similar to what you see above (don’t worry if the styling is not identical, but the important content should be). Note that the .Rmd file is not repeating all the explanations that we gave above, but is a simplified version containing the commands, which is why the output won’t be identical. (The exact R Markdown file, module-2-rmarkdown.Rmd, used to create what you are reading here is available on the GitHub repository for this module).

  5. Carefully read through the .Rmd file and compare it with what you see in the resulting .html. There are some comments in there, denoted by <!-- comment --> to help you, but you should be able to get the idea of how commands in the .Rmd file translate into output in the .html file.

  6. Copy the resulting rmark-exercise.html to rmark-exercise-orig.html, change n to 30 in rmark-exercise.Rmd, and re-render it.

  7. Compare the two .html files. You have done the same analyses but on different data. You changed one value, n, and consequently all the resulting calculations changed, and these are all updated in your new .html document! You have performed exactly the same analysis on your ‘new’ data, including creating tables and figures. No copy-and-pasteing of tables, or manually keeping track of which figure corresponds to which analysis.

That last bit is the crux of R Markdown. Once you understand that then you should incorporate it into your workflow. You can easily run identical analyses on two data sets, or do 10 runs of a model with different parameters, and easily compare the output.

You can even get a bit clever with your writing by including an R ifelse statement to somewhat automate the text. Here is the results of some code (that will be included in the file):

So the maximum value of \(y\) is 507, which is greater than the special value of 400. The “greater than” or “less than” part is given by

`r ifelse(max(y)>400, paste("greater than"), paste("less than"))

But you have to be careful and think about all possibilities – what if \(y=399.9\)?

3.4 Output format

A feature of R Markdown is that you can create output in various different formats, such as .html, .pdf, and .docx (if you are so inclined). This is often touted as a great feature – you can simply switch an argument in the start of your code to switch output formats.

However, in practice, doing biological research and analyses, we have not found the need to have the same piece of code switch from .html to .pdf output. If you are writing a manuscript you are going to want it as .pdf, whereas if you are writing a vignette for an R package (these are done in R Markdown) you will probably want it as .html. The trouble with switching is that subtle formatting tweaks may render differently in the different formats, and you may end up spending a lot of time trying to make things work perfectly for two outputs – but in practice you may only really need, say, a .pdf.

So we would recommend choosing which output you are going to use (depending on what you are writing your particular piece of code for), and then sticking with that. For example, I prefer to do exploratory analyses as .pdf while others may prefer .html.

To keep it simple, in Exercise 1 we set the output to be .html. For scientific documents such as manuscripts and theses, a .pdf is preferable. We will now demonstrate this, for this you will need LaTeX installed.

For decades, LaTeX has been the standard typesetting system for writing mathematical papers. In 2002, Friedrich Leisch created the R package ‘Sweave,’ to weave together R code for calculations and LaTeX code for writing up results into a .pdf file. (It was called Sweave because S was the precursor language to R, plus I expect Sweave sounded better than Rweave). Sweave motivated the R package knitr (to knit together R code with text write ups) by Yihui Xie, which also allows html and other output formats. R Markdown was then created to allow simpler use of knitr, without learning lots of LaTeX commands, though you need LaTeX installed to use R Markdown.

3.4.1 Exercise 2

Now download rmark-pdf-example.Rmd onto your computer and put it where you want to run it. As before, make sure it is saved as rmark-pdf-example.Rmd and not rmark-pdf-example.txt.

Try running it as earlier, either using rmarkdown::render("rmark-pdf-example.Rmd") in R, or clicking the knitr button in RStudio but selecing the knit-as-pdf option. If it doesn’t run then you probably need to install LaTeX (you should get a helpful error message). See the simple instructions in another book by Yhiui Xie et al., the R Markdown cookbook.

Go through rmark-pdf-example.Rmd carefully, reading the notes in the comments, and comparing the .Rmd code to the rendered .pdf. I get three warnings (not errors) like Package xcolor Warning: Incompatible color definition on input line 283. It seems to work so don’t worry if you get them also.

Technial aside: Using those warnings as a teachable moment, this is why I set keep_tex = TRUE in the YAML header of rmark-pdf-example.Rmd. When you render the document, the rmarkdown::render() R function uses knitr::knit() to run the R code in your .Rmd file, and then build a LaTeX .tex file, which is then run by Latex to build a .pdf. It’s good to be aware that there are several steps going on here, especially when something doesn’t work. The warning in the previous paragraph is referring to line 283 of the rmark-pdf-example.tex file (the .tex file, not the original .Rmd file). The default R Markdown setting seems to be to delete the .tex file when everything builds okay, and so the .tex file was getting deleted. So keeping the .tex file can be helpful to try and debug warnings or errors that need fixing. Generally you want to fix warnings when you first see them occurring, but sometimes you just have to leave them in as everything seems to be working as intended (as is the case here).

Also, if you are using Git and GitHub then you would track the .Rmd file using Git, but add the .tex file to your .gitignore file since it is an output from the .Rmd file (there is no need for Git to track each version of it).

3.5 Caching (advanced)

The two example files ran fairly quickly. So making a small edit to the text and then re-running the code is not onerous. But if you are producing a large complex document and you make some minor text edits, then you do not want to have to wait for it to build (and calculate results that have not changed) to just check your edits in the resulting .pdf or .html.

You can add the cache=TRUE option to a chunk. The results will then be saved (cached) and the code in that chunk not rerun when you render the document again, unless any part of the code in that chunk has changed (e.g. you changed a + to a -). This is very useful, but you have to be careful and understand how it works.

To do so, see the Cache section of the R Markdown Cookbook. While the authors “do not recommend that you set the chunk option cache = TRUE globally in a document,” in our own work we have sometimes done this (as often once the results are finalised we are just editing text). But often we will clear the cache (remove the directory where all the cached results are saved) to re-run the document afresh, to guarantee that all results are reproducible.

Certainly practice on individual chunks first and play around with caching if you wish to use it in your own work. To get started, in rmark-pdf-example.Rmd change the cache = FALSE option in the knitr::opts_chunk$set() call to cache = TRUE. Re-running the code will create two directories (as named in knitr::opts_chunk$set()) that will saved the cached versions of chunks and of figures. Re-running the code a second time will then use those cached values (because of the cache=TRUE setting). Then change a piece of code and see which cached files change (by seeing when they were saved).

Another workflow is to have one exploratory R Markdown file that includes the analyses and calculations (and saves the results). Then have a write-up R Markdown file that loads those results (but does no time-consuming calculations) and uses them to quickly build tables and figures. If you do have to re-run the analyses then you can do so using the first file, and the results will get updated when you then re-run the second file. This would be a good approach for writing a manuscript or a thesis.

3.6 Further reading

As we said, the idea of this module was to give you a simple introduction to using R Markdown. The Definitive Guide to R Markdown is highly recommended (and often where you end up when Googling for how to do something), but you do not need to understand all of it. For example, I know that Pandoc is doing stuff behind the scenes, but have managed to not need to understand much more than that. Chapter 2 goes through the basics, including many things that we glossed over above, such as more details about YAML, the code between the --- at the top of rmark-exercise.Rmd and rmark-pdf-example.Rmd. It also explains how R Markdown, despite the ‘R’ in its name, can also be used for computer languages other than R, such as Python, SQL, Julia, C, and Fortran.

Also see the related R Markdown Cookbook, which provides “practical and relatively short examples to show the interesting and useful usage of R Markdown,” and supplements the Definitive Guide by making it less daunting to figure out how to achieve certain tasks by providing a range of examples. Also see RStudio introduction if you use RStudio.

Finally, there is an R package rticles that includes a suite of custom R Markdown LaTeX formats for various journals and publishers. We haven’t tried it, but it looks like it provides some useful templates. At Fisheries and Oceans Canada we have built our own R package csasdown to conform to stylistic requirements for certain documents.

In practice, once you have something working in the style you like, you can just copy those settings for each new project. It can be handy to keep a readme file of containing commands/options/tips that you use a lot.

3.7 Using RStudio

Here are some screenshots for using RStudio (thanks to Joe Watson).

To knit a document:

Screenshot highlighting the knit button in RStudio.

Figure 3.5: Screenshot highlighting the knit button in RStudio.

and then select the format with

Screenshot showing knit options in RStudio.

Figure 3.6: Screenshot showing how to select knit output in RStudio.

Be sure to click the correct format, though RStudio does not seem to over-ride the output option set in the .Rmd file (in the YAML at the top).

You need to view the .html output in Exercise 1 in a browser, either clicking on the resulting file or doing:

Screenshot highlighting the open in browser button in RStudio.

Figure 3.7: Screenshot highlighting the open in browser button in RStudio.

To run a line of code you can click Ctrl-Enter, and to run a chunk you can click on the green triangle:

Screenshot highlighting the green triangle run a chunk in RStudio.

Figure 3.8: Screenshot showing how to run a chunk in RStudio.

and click the green rectangle to run all the preceding chunks also (useful if variables have changed):

Screenshot showing the green rectangle to click to run all code preceding a chunk.

Figure 3.9: Screenshot showing how to run all code preceding a chunk.

3.8 Using Emacs for R (and everything else)

If you have issues with using a mouse too much, or would rather just type commands (especially if you have vision issues) and prefer a less cluttered interface, then I’d recommend Emacs Speaks Statistics, which lets you run R from the text editor Emacs. You can also use Emacs for editing all types of files (it highlights code very well), so you only need to learn one editor for all your needs.

3.9 Feedback

We value your input! Please take the time to let us know how we might improve these materials in this survey.