Today’s investigation. Transparency and reproducibility are key attributes of scientific studies. To advance science across fields, primary literature now needs to be accompanied by the data analyzed and the analysis carried out. Because of this, programming languages have become popular in science. They make it easy to work with data and allow us to reproduce a statistical analysis in its entirety. Today, we will explore the popular R programming language. We will use R and its integrated development environment, RStudio, to manage biological data and address research questions, with the goal of practicing transparency and reproducibility in statistical analyses.
In this lab, we will introduce R, a programming language and free software environment for statistical computing and graphics. First developed in the 1990s, R has become a popular tool among biologists in recent years and a nearly essential skill for biology students, supporting national calls to promote and enhance quantitative literacy. The popularity of R in biology is partly due to its quality as statistical software and its ability to manage large datasets. Besides being free, R is open to contributions: anyone can write new functions and other tools and share them with the public. One of these developments is RStudio, an integrated development environment for R that makes it easier to navigate certain aspects of the language. So, let’s explore both R and RStudio and learn some of their basic features and code.
Upon completion of this lab, you should be able to:
Reference:
In this course, you will primarily work in R Markdown (.Rmd) using a
class-provided template. When working in R Markdown, you generally
do not need to manually set a working directory. Instead, your
.Rmd file and your data files (for example, anole.csv) should
be saved in the same folder. R Markdown will then be able to find the
files automatically.
For this activity, create a folder on your computer named
working_directory (or another clearly named folder for this class or
week). Save the downloaded file anole.csv in this folder. This folder
will also be where you save your class template .Rmd file.
Although you will not usually set a working directory when using R Markdown, it is important to understand how working directories function, especially when working with regular R scripts (.R).
To manually set a working directory in RStudio:
Note: Setting a working directory tells R where to look for files on
your computer. This step is typically unnecessary when using R Markdown, as
long as your .Rmd file and data files are stored in the same location.
Alternative method: You can also set the working directory by going to Session → Set Working Directory → Choose Directory..., and then selecting your working_directory folder.
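The same idea can be expressed in code with getwd() and setwd(). The sketch below is only an illustration: it uses a temporary folder as a stand-in for your working_directory folder, so the path (demo_dir) is an assumption and not a real location on your computer.

```r
# Remember where R is currently looking for files
old_wd <- getwd()

# Hypothetical stand-in for your working_directory folder
demo_dir <- file.path(tempdir(), "working_directory")
dir.create(demo_dir, showWarnings = FALSE)

# Point R at the folder, then confirm the change
setwd(demo_dir)
getwd()

# List the files R can now find without a full path
list.files()

# Return to the original working directory
setwd(old_wd)
```

In a regular R script you would replace demo_dir with the actual path to your own folder.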
In this course, you will complete your work using an R Markdown (.Rmd) file, which combines R code, text, and output in a single document. You will be provided with a class template R Markdown file to use for this activity.
Open the class template .Rmd file in RStudio and immediately
save it in your working_directory folder (the same folder
that contains anole.csv). Use a clear and informative file name.
As you work, you can click the Knit button at the top of the editor to preview how your document will look when rendered. Knitting allows you to check that your code runs correctly and that your text, figures, and tables appear as expected.
Later in the course, you will knit your completed R Markdown files to PDF format for submission. For now, think of knitting as a way to frequently preview and troubleshoot your work.
Annotating your work is a very important step that helps you keep track of your analysis and makes your work reproducible. In this course, your R Markdown (.Rmd) file serves as your lab notebook.
An R Markdown file contains two main components:
Annotating an R Markdown file therefore happens in two ways:
Inside code chunks, use # to add brief comments
that explain what the code is doing. These comments are for clarifying
specific lines or blocks of code.
For example, inside a code chunk you might write:
# Load the anole dataset
anole <- read.csv("anole.csv")
Outside of code chunks, you might add a section header or description such as:
# Chapter 1: Intro to R and RStudio
In this section, we load the dataset and begin exploring its structure.
Note: In R code chunks, # indicates a comment.
Outside of code chunks, # indicates a header in Markdown. Be sure
you are typing in the appropriate location.
Click the Knit button frequently to check how your document is rendering and to ensure that your code, text, and annotations appear as expected.
R uses packages to add new tools and functions. In this course, we will use the tidyverse package.
You only need to install a package one time per computer. In the Console in RStudio (not in a script or R Markdown file), run:
install.packages("tidyverse")
Do not include install.packages() inside an R Markdown file that you will knit and submit. Installation only needs to happen once.
Each time you start working in R, you must load the packages you want to use. At the top of your script or R Markdown file, run:
library(tidyverse)
This makes tidyverse functions (such as filter(), mutate(), and the pipe operator %>%) available for use. If you see an error like:
could not find function "mutate"
it usually means the package has not been loaded.
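If you are unsure whether a package is installed on the computer you are using, you can check before loading it. The following is a minimal sketch using requireNamespace(), which reports whether a package is available without attaching it; the wording of the message is our own.

```r
# requireNamespace() returns TRUE if the package is installed, FALSE otherwise
if (requireNamespace("tidyverse", quietly = TRUE)) {
  # Installed: attach it so its functions can be used by name
  library(tidyverse)
} else {
  # Not installed: remind yourself to install it from the Console
  message("tidyverse is not installed; run install.packages(\"tidyverse\") in the Console")
}
```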
There are several ways to import data into R using RStudio, and the best one depends on the type of file the data is stored in (e.g., .csv, .RData). You may use the one that works best for you. Below are two ways to import data:
A. Importing a .csv file using the RStudio drop-down menu: In the Files/Plots/Packages/Help window of RStudio, click Files and navigate to your working_directory folder. Once there, click on the anole.csv file and import it using the drop-down menu. If it worked, you should see anole in the RStudio Global Environment window.
B. Importing a .csv file using code: After
setting up your working directory (Step 1), use the function
read.csv() to import the anole.csv file. See the example
below. Note the arguments header=TRUE, which treats the
first row of the file as column names, and
stringsAsFactors=TRUE, which converts text columns in the data
frame to factor variables.
# importing the anole data
anole <- read.csv("anole.csv",header=TRUE,stringsAsFactors=TRUE)
To run the line of code, copy and paste it into your R script. Then highlight it and click “Run”, located at the top of the script window. If it worked, you should see the line of code in your Console and anole in the RStudio Global Environment window.
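To see what stringsAsFactors changes, it can help to read the same file both ways. The sketch below writes a tiny made-up file (toy_anole.csv, with invented values that are not from anole.csv) to a temporary folder and compares the class of the Sex column.

```r
# Made-up rows for illustration only; these are not values from anole.csv
toy_path <- file.path(tempdir(), "toy_anole.csv")
writeLines(c("Sex,Femur",
             "female,9.5",
             "male,11.2",
             "female,10.4"),
           toy_path)

# header = TRUE: the first row of the file holds the column names
# stringsAsFactors = TRUE: text columns such as Sex become factors
toy_anole <- read.csv(toy_path, header = TRUE, stringsAsFactors = TRUE)

# Same file, but text columns are left as character
toy_chr <- read.csv(toy_path, header = TRUE, stringsAsFactors = FALSE)

class(toy_anole$Sex)  # "factor"
class(toy_chr$Sex)    # "character"
```

Functions such as levels() expect a factor, which is why the import in this lab sets stringsAsFactors=TRUE.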
Before any analysis, we first need to check the data and understand its attributes.
Let’s start by looking at the structure of anole using the
function str(). This function tells us the class of the
data object (i.e., anole), which variables it contains and
their classes, and how many observations it has.
# data structure
str(anole)
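str() reports everything at once. If you want to check one piece of information at a time, base R also has a function for each. The sketch below uses a small made-up data frame (toy_anole, with invented values) rather than the real dataset.

```r
# Made-up data frame for illustration only
toy_anole <- data.frame(
  Sex   = factor(c("female", "male", "female")),
  Femur = c(9.5, 11.2, 10.4)
)

class(toy_anole)          # class of the whole object
dim(toy_anole)            # number of observations (rows), then variables (columns)
sapply(toy_anole, class)  # class of each variable
```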
Formatting questions in your R Markdown (.Rmd) report
In this course, format each question as a header in your .Rmd file, and then write your answer as plain text immediately below it. You do not have to re-type the full question, but including it is recommended so your report is easy to understand later without needing to look back at the lab manual.
## Question Set 1: Descriptive title
### Question 1a
*Optional: write the question text here.*
Write your answer here as plain text. You can include R code chunks, figures,
and brief interpretation as needed.
### Question 1b
*Optional: write the question text here.*
Write your answer here.
1a. What is the class of the dataset “anole.csv” (e.g., data frame, table, tibble)?
1b. How many observations and variables does the dataset “anole.csv” have?
1c. What is the class of each variable in the dataset “anole.csv” (e.g., character, factor, numeric, integer)?
Other useful exploratory functions are levels(), which
returns the unique values (levels) of a factor variable;
summary(), which summarizes each variable in the dataset;
head(), which returns the first rows of the dataset so we can
explore the variables; and View(), which opens the dataset
as a spreadsheet. Note that R is case-sensitive: anole and Anole are different names.
# levels of the variable Sex; the $ sign means "within", so anole$Sex is the column Sex within anole
levels(anole$Sex)
# summary of each variable
summary(anole)
# first rows of anole
head(anole)
# viewing anole
View(anole)
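One thing to watch for: levels() only works on factors. If a column was imported as plain text (character), levels() returns NULL. A minimal sketch with made-up values:

```r
# Made-up character vector for illustration only
sex_chr <- c("female", "male", "female")
levels(sex_chr)    # NULL, because a character vector has no levels

# Convert to a factor to get levels
sex_fct <- factor(sex_chr)
levels(sex_fct)    # "female" "male"
nlevels(sex_fct)   # 2

# unique() works on character vectors and factors alike
unique(sex_chr)
```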
2a. How many levels does the variable Sex have?
2b. What is the mean Femur length of “anole.csv”?
2c. What are the first three variables in “anole.csv”?
Many functions, such as str(), come built into R. However, additional
functionality is provided through R packages. In this course, you will use
functions from the tidyverse package to manage and manipulate data
(e.g., selecting and filtering columns).
Installing and loading packages in R Markdown
You only need to install a package once on your computer. Do this
by running the installation command in the Console (not inside an
.Rmd code chunk). In your R Markdown document, you should then
load the package using library() in the first (or an early)
code chunk so it is available for the rest of the report.
# run this ONCE in the Console (not in an .Rmd code chunk)
install.packages("tidyverse")
# load tidyverse (after installing it in the Console)
library(tidyverse)
Now, suppose you are interested only in the column Femur from the
dataset anole, and you want to store this column in a new object named
femurs. To select columns from a data frame, use the
select() function. The first argument is the data object, and the second
argument specifies the column(s) to keep. Note that select() is part
of the tidyverse.
# selecting the Femur column
femurs <- select(anole, Femur)
# checking the new object
femurs
3a. Is the new object in the Global Environment?
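To keep only the rows that meet a condition, use the function
filter(). The first argument in
this function is the data object, and the subsequent arguments are
conditions on its columns (here, the column name Femur and the condition > 10).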
# filtering femurs by femur length > 10mm
femurs_10mm <- filter(femurs, Femur>10)
# checking the new object created
femurs_10mm
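The select() and filter() steps can also be chained with the pipe operator %>%, which passes the result of one step to the next. The sketch below assumes the tidyverse (which includes dplyr) is installed and uses a made-up data frame, toy_anole, with invented values.

```r
library(dplyr)  # part of the tidyverse; library(tidyverse) also loads it

# Made-up data frame for illustration only
toy_anole <- data.frame(
  Sex   = c("female", "male", "female", "male"),
  Femur = c(9.5, 11.2, 10.4, 8.7)
)

# Two-step version, storing the intermediate object
toy_femurs <- select(toy_anole, Femur)
toy_femurs_10mm <- filter(toy_femurs, Femur > 10)

# One-step version with the pipe
toy_femurs_10mm_piped <- toy_anole %>%
  select(Femur) %>%
  filter(Femur > 10)

# Both versions keep the same rows
toy_femurs_10mm
toy_femurs_10mm_piped
```

The piped version avoids creating intermediate objects, while the two-step version makes it easier to inspect each stage.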
4a. What’s the longest femur length in femurs_10mm?
4b. How many observations does femurs_10mm have?
Finally, keep in mind that R is a language in which we can define our
own mathematical functions, so it also works as a calculator.
Let’s use descriptive statistics functions on Femur in anole, including
mean(), median(), min(),
max(), and range(), to compute the mean,
median, minimum, maximum, and range of femur length values, respectively.
Note the use of the $ sign!
# mean femur length
mean(anole$Femur)
# median femur length
median(anole$Femur)
# minimum femur length
min(anole$Femur)
# max femur length
max(anole$Femur)
# range of values for femur length
range(anole$Femur)
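Two details of these functions are worth knowing. First, if a variable contains missing values (NA), these functions return NA unless you add na.rm = TRUE. Second, range() returns two numbers (the minimum and the maximum) rather than their difference. A minimal sketch with made-up values:

```r
# Made-up measurements for illustration only; NA marks a missing value
toy_femur <- c(9.5, 11.2, NA, 10.4)

mean(toy_femur)                  # NA, because one value is missing
mean(toy_femur, na.rm = TRUE)    # mean of the three non-missing values

range(toy_femur, na.rm = TRUE)        # 9.5 11.2 (minimum, then maximum)
diff(range(toy_femur, na.rm = TRUE))  # maximum minus minimum
```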
5a. Do you get the same results from the summary() function?
Now, it is your turn to practice what we have done with the dataset “lizards”. Stop and review the steps 1-7 you just did. Think about how you could adapt the code to do the same analysis for “lizards”. Hint: give appropriate names to the new objects you create for “lizards”. These names should not overwrite the ones used for anole in your script. Do the analysis, answer the questions, and be ready to present it!
Once you have written some code and added text to your
.Rmd file, you can “knit” the document to
produce a formatted output (usually PDF in this course). Knitting runs all
of your R code, captures the results, and combines them with your text into
a polished document.
To knit your document:
If there are errors in your code, RStudio will stop knitting and show you messages that explain what went wrong. Use those messages to fix issues in your R code or text and then try knitting again.
Knit your document often as you work. This helps you:
You will turn in knitted PDFs of your lab each week, so the sooner you're comfortable with knitting the better.
Great Work!
This activity was adapted by Jenna T. B. Ekwealor for San Francisco State University from Biostatistics using R: A Laboratory Manual by Raisa Hernández-Pacheco and Alexis A. Diaz of California State University, Long Beach.
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.