Introduction to R and RStudio

Chapter 1

Today’s investigation. Transparency and reproducibility are key attributes of scientific studies. In an effort to advance science across fields, primary literature now needs to be accompanied by the data analyzed and the analysis carried out. Because of this, programming languages have become popular in science. They make it easy to work with data and allow us to reproduce a statistical analysis in its entirety. Today, we will explore the popular R programming language. We will be using R and its integrated development environment, RStudio, to manage biological data and address research questions, with the goal of practicing transparency and reproducibility in statistical analyses.


Introduction

In this lab, we will introduce R, a programming language and free software environment for statistical computing and graphics. First developed in the 1990s, R has become a popular tool among biologists in recent years and a nearly essential skill for biology students, in line with national calls to promote and enhance quantitative literacy. The popularity of R in biology is partly due to its quality as statistical software and its ability to manage large datasets. Besides being free, R is open to contributions: anyone can write new functions and other tools and share them with the public. One of these developments is RStudio, an integrated development environment for R that makes it easier to navigate certain aspects of the language. So, let’s explore both R and RStudio and learn some of their basic features and code.

Upon completion of this lab, you should be able to:

Reference:

Materials


1. Creating and setting a working directory

In this course, you will primarily work in R Markdown (.Rmd) using a class-provided template. When working in R Markdown, you generally do not need to manually set a working directory. Instead, your .Rmd file and your data files (for example, anole.csv) should be saved in the same folder. R Markdown will then be able to find the files automatically.

For this activity, create a folder on your computer named working_directory (or another clearly named folder for this class or week). Save the downloaded file anole.csv in this folder. This folder will also be where you save your class template .Rmd file.

Although you will not usually set a working directory when using R Markdown, it is important to understand how working directories function—especially when working with regular R scripts (.R).

To manually set a working directory in RStudio:

  1. Open RStudio. In the Files / Plots / Packages / Help pane (usually in the bottom-right), click the Files tab and navigate to your working_directory folder.
  2. Once inside the folder, click the gear icon and select Set As Working Directory.

Note: Setting a working directory tells R where to look for files on your computer. This step is typically unnecessary when using R Markdown, as long as your .Rmd file and data files are stored in the same location.

Alternative method: You can also set the working directory by going to Session > Set Working Directory > Choose Directory..., and then selecting your working_directory folder.
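The same steps can also be done by typing commands in the Console. The sketch below uses the base R functions getwd(), setwd(), and list.files(). The folder path in the setwd() line is a placeholder (an assumption, not a path from this lab), so it is left commented out until you replace it with the real location of your working_directory folder.

```r
# Print the folder R is currently using as its working directory
getwd()

# Set the working directory by typing its path.
# "~/working_directory" is a placeholder path (an assumption); edit it first.
# setwd("~/working_directory")

# List the files R can see in the working directory;
# anole.csv should appear here once the folder is set correctly
list.files()
```

If anole.csv is missing from the list, R is looking in a different folder than the one where you saved the file.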

2. Creating and saving your R Markdown file

In this course, you will complete your work using an R Markdown (.Rmd) file, which combines R code, text, and output in a single document. You will be provided with a class template R Markdown file to use for this activity.

Open the class template .Rmd file in RStudio and immediately save it in your working_directory folder (the same folder that contains anole.csv). Use a clear and informative file name.

As you work, you can click the Knit button at the top of the editor to preview how your document will look when rendered. Knitting allows you to check that your code runs correctly and that your text, figures, and tables appear as expected.

Later in the course, you will knit your completed R Markdown files to PDF format for submission. For now, think of knitting as a way to frequently preview and troubleshoot your work.

3. Annotating your R Markdown file

Annotating your work is a very important step that helps you keep track of your analysis and makes your work reproducible. In this course, your R Markdown (.Rmd) file serves as your lab notebook.

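An R Markdown file contains two main components: code chunks, which hold R code, and the Markdown text written outside the chunks.

Annotating an R Markdown file therefore happens in two ways: with comments inside code chunks, and with written text outside them.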

For example, inside a code chunk you might write:

# Load the anole dataset
anole <- read.csv("anole.csv")

Outside of code chunks, you might add a section header or description such as:

# Chapter 1: Intro to R and RStudio

In this section, we load the dataset and begin exploring its structure.

Note: In R code chunks, # indicates a comment. Outside of code chunks, # indicates a header in Markdown. Be sure you are typing in the appropriate location.

Click the Knit button frequently to check how your document is rendering and to ensure that your code, text, and annotations appear as expected.

4. Installing and loading packages

R uses packages to add new tools and functions. In this course, we will use the tidyverse package.

Installing a package

You only need to install a package one time per computer. In the Console in RStudio (not in a script or R Markdown file), run:

install.packages("tidyverse")

Do not include install.packages() inside an R Markdown file that you will knit and submit. Installation only needs to happen once.

Loading a package

Each time you start working in R, you must load the packages you want to use. At the top of your script or R Markdown file, run:

library(tidyverse)

This makes tidyverse functions (such as select(), filter(), mutate(), and the pipe operator %>%) available for use. If you see an error like:

could not find function "select"

it usually means the package has not been loaded.
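If you are not sure whether tidyverse is installed on your computer, one possible check is sketched below. It uses base R's requireNamespace(), which reports whether a package is installed without loading it. The wording of the message is only an illustration.

```r
# requireNamespace() returns TRUE if the package is installed and FALSE if not
if (requireNamespace("tidyverse", quietly = TRUE)) {
  # Installed: load it for this session
  library(tidyverse)
} else {
  # Not installed: remind yourself to install it from the Console
  message('tidyverse is not installed; run install.packages("tidyverse") in the Console')
}
```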

5. Importing data files to R using RStudio

There are several ways to import data into R using RStudio, and the right one depends on the file type (e.g., .csv, .RData). You may use the one that works best for you. Below are three ways to import data:

# importing the anole data
anole <- read.csv("anole.csv", header = TRUE, stringsAsFactors = TRUE)

To run the line of code, copy it and paste it into your R script. You may highlight it and click “Run”, which is located at the top of the script editor. If it worked, you should see the line of code in your Console and a new object named anole in the RStudio Global Environment window.
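A common reason for an import error is that R is looking in a folder that does not contain the file. The following sketch, which assumes the file is named anole.csv as in this lab, checks for the file with base R's file.exists() before reading it, and then confirms that the object was created.

```r
csv_path <- "anole.csv"  # file name used in this lab

if (file.exists(csv_path)) {
  # The file is in the working directory, so read it in
  anole <- read.csv(csv_path, header = TRUE, stringsAsFactors = TRUE)
  # exists() confirms that an object named anole was created
  print(exists("anole"))
} else {
  # The file was not found; report where R was looking
  message("Cannot find ", csv_path, " in ", getwd())
}
```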

6. Exploring the imported dataset

Before any analysis, we first need to check the data and understand its attributes.

Let’s start by looking at the structure of the “anole.csv” data using the function str(). This function gives us information about the class of the data object (here, anole), what variables are in the data object, and how many observations it contains.

# data structure
str(anole)
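To see how the output of str() relates to other base R functions, here is a small sketch on a toy data frame. The object name toy and its values are made up for illustration and are not the anole data; only the column names Sex and Femur are borrowed from this lab.

```r
# Toy data frame with made-up values (not the anole data)
toy <- data.frame(
  Sex = factor(c("female", "male", "female", "male")),
  Femur = c(8.2, 10.5, 12.1, 9.7)
)

str(toy)            # full structure in one call
class(toy)          # class of the object
dim(toy)            # number of rows (observations) and columns (variables)
sapply(toy, class)  # class of each column
```

Each of the last three calls reports one piece of what str() prints all at once.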

Formatting questions in your R Markdown (.Rmd) report

In this course, format each question as a header in your .Rmd file, and then write your answer as plain text immediately below it. You do not have to re-type the full question, but including it is recommended so your report is easy to understand later without needing to look back at the lab manual.

## Question Set 1: Descriptive title

### Question 1a
*Optional: write the question text here.*

Write your answer here as plain text. You can include R code chunks, figures,
and brief interpretation as needed.

### Question 1b
*Optional: write the question text here.*

Write your answer here.

Question Set 1

  1a. What is the class of the dataset “anole.csv” (e.g., data frame, table, tibble)?

  1b. How many observations and variables does the dataset “anole.csv” have?

  1c. What is the class of each variable in the dataset “anole.csv” (e.g., character, factor, numeric, integer)?

Other useful exploratory functions are levels(), which returns the unique values (levels) of a factor variable; summary(), which summarizes each variable in the dataset; head(), which returns the first rows of the dataset so you can explore the variables; and View(), which opens the dataset as a spreadsheet. Note that R is case-sensitive: View() must be typed with a capital V.

# levels of the variable Sex; the $ sign means "within", so anole$Sex is the column Sex within anole
levels(anole$Sex)

# summary of each variable
summary(anole)

# first rows of anole
head(anole)

# viewing anole
View(anole)
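Because R is case-sensitive, a column name typed with the wrong capitalization is treated as a different name. The sketch below shows this on a toy data frame with made-up values (toy is a hypothetical object, not the anole data).

```r
# Toy data frame with made-up values (not the anole data)
toy <- data.frame(
  Sex = factor(c("female", "male", "female", "male")),
  Femur = c(8.2, 10.5, 12.1, 9.7)
)

levels(toy$Sex)   # correct capitalization: returns the levels of Sex
levels(toy$sex)   # wrong capitalization: there is no column "sex", so the result is NULL
head(toy, n = 2)  # head() shows six rows by default; n changes how many
```

Note that the misspelled column does not raise an error. R quietly returns NULL, so check your spelling when a result looks empty.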

Question Set 2

  2a. How many levels does the variable Sex have?

  2b. What is the mean Femur length of “anole.csv”?

  2c. What are the first three variables in “anole.csv”?

7. Managing data

Many functions, such as str(), come built into R. However, additional functionality is provided through R packages. In this course, you will use functions from the tidyverse package to manage and manipulate data (e.g., selecting columns and filtering rows).

Installing and loading packages in R Markdown

You only need to install a package once on your computer. Do this by running the installation command in the Console (not inside an .Rmd code chunk). In your R Markdown document, you should then load the package using library() in the first (or an early) code chunk so it is available for the rest of the report.

# run this ONCE in the Console (not in an .Rmd code chunk)
install.packages("tidyverse")

# load tidyverse (after installing it in the Console)
library(tidyverse)

Now, suppose you are interested only in the column Femur from the dataset anole, and you want to store this column in a new object named femurs. To select columns from a data frame, use the select() function. The first argument is the data object, and the second argument specifies the column(s) to keep. Note that select() is part of the tidyverse.

# selecting the Femur column
femurs <- select(anole, Femur)

# checking the new object
femurs

Question Set 3

  3a. Is the new object in the Global Environment?

Now, say you are interested only in femurs longer than 10 mm. To filter by row, use filter(). The first argument in this function is the data object, and each subsequent argument is a condition on a column that a row must meet to be kept.

# filtering femurs by femur length > 10 mm
femurs_10mm <- filter(femurs, Femur > 10)

# checking the new object created
femurs_10mm
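The select() and filter() steps can also be chained with the pipe operator %>%, which passes the result of one step to the next as its first argument. The sketch below compares both styles on a toy data frame with made-up values. The object names toy, toy_femurs, toy_long, and toy_long_piped are hypothetical, and the example loads dplyr, the tidyverse package that provides select() and filter().

```r
library(dplyr)  # also loaded automatically by library(tidyverse)

# Toy data frame with made-up values (not the anole data)
toy <- data.frame(
  Sex = factor(c("female", "male", "female", "male")),
  Femur = c(8.2, 10.5, 12.1, 9.7)
)

# Step-by-step version, as in the lab
toy_femurs <- select(toy, Femur)
toy_long <- filter(toy_femurs, Femur > 10)

# Pipe version: each result is handed to the next function as its first argument
toy_long_piped <- toy %>%
  select(Femur) %>%
  filter(Femur > 10)

# Both versions should produce the same data frame
identical(toy_long, toy_long_piped)
```

The pipe version avoids creating the intermediate object, which can keep the Global Environment less cluttered.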

Question Set 4

  4a. What’s the longest femur length in femurs_10mm?

  4b. How many observations does femurs_10mm have?

8. Descriptive statistics

Finally, keep in mind that R is a language in which we can define our own mathematical functions, and it also works as a calculator. Let’s apply descriptive statistics functions to “Femur” in anole, including mean(), median(), min(), max(), and range(), to compute the mean, median, minimum, maximum, and range of femur length values, respectively. Note that range() returns two values, the minimum and the maximum. Note the use of the $ sign!

# mean femur length
mean(anole$Femur)

# median femur length
median(anole$Femur)

# minimum femur length
min(anole$Femur)

# maximum femur length
max(anole$Femur)

# range of values for femur length
range(anole$Femur)
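To illustrate R as a calculator and as a language for defining functions, here is a minimal sketch using a short vector of made-up lengths. The names femur_toy and my_mean are hypothetical and appear only in this example.

```r
# Made-up femur lengths in mm (not the anole data)
femur_toy <- c(8.2, 10.5, 12.1)

# R as a calculator: the mean computed by hand
(8.2 + 10.5 + 12.1) / 3

# A user-defined function that does the same calculation as mean()
my_mean <- function(x) {
  sum(x) / length(x)
}
my_mean(femur_toy)
mean(femur_toy)

# range() returns two values, the minimum and the maximum;
# diff() of that pair gives the spread (maximum minus minimum)
range(femur_toy)
diff(range(femur_toy))
```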

Question Set 5

  5a. Do you get the same results from the summary() function?

Stop, Think, Do:

Now it is your turn to practice what we have done, this time with the dataset “lizards”. Stop and review the steps you just completed. Think about how you could adapt the code to do the same analysis for “lizards”. Hint: give appropriate names to the new objects you create for “lizards”; these names should not overwrite the ones used for “anole.csv” in your script. Do the analysis, answer the questions, and be ready to present it!

Discussion Questions

  1. Now that you are more familiar with RStudio, can you describe its layout, including its four principal components (windows)?
  2. Mention three benefits of annotating your script.
  3. How do you set a path between R and the location of your files on your computer?
  4. Why would you use the str() function?

9. How to knit your R Markdown document

Once you have written some code and added text to your .Rmd file, you can “knit” the document to produce a formatted output (usually PDF in this course). Knitting runs all of your R code, captures the results, and combines them with your text into a polished document.

To knit your document:

  1. Make sure your R Markdown file is the active tab in RStudio.
  2. Click the Knit button near the top of the editor (it looks like a ball of yarn with a knitting needle).
  3. Choose the output format if prompted (for this course, choose PDF).
  4. RStudio will run your code chunks in order and generate a new document that shows both your text and the results of your code.

If there are errors in your code, RStudio will stop knitting and show you messages that explain what went wrong. Use those messages to fix issues in your R code or text and then try knitting again.

Knit your document often as you work. This helps you:

You will turn in knitted PDFs of your lab each week, so the sooner you're comfortable with knitting the better.
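Knitting can also be started from the Console with the render() function from the rmarkdown package, which is one way to knit without clicking the button. In the sketch below the file name lab1.Rmd is hypothetical, so replace it with the name of your own file. Knitting to PDF also requires a LaTeX installation on your computer.

```r
rmd_file <- "lab1.Rmd"  # hypothetical file name; use your own

if (requireNamespace("rmarkdown", quietly = TRUE) && file.exists(rmd_file)) {
  # Runs every code chunk in order and writes a PDF next to the .Rmd file
  rmarkdown::render(rmd_file, output_format = "pdf_document")
} else {
  # Either the package or the file is missing; report where R was looking
  message("Install rmarkdown and check that ", rmd_file, " is in ", getwd())
}
```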

Great Work!


This activity was adapted by Jenna T. B. Ekwealor for San Francisco State University from Biostatistics using R: A Laboratory Manual by Raisa Hernández-Pacheco and Alexis A Diaz of California State University, Long Beach.

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.