Today’s investigation. Transparency and reproducibility are key attributes of scientific studies. In an effort to advance science across fields, primary literature is now expected to be accompanied by the data analyzed and the analysis carried out. Because of this, programming languages have become popular in science. They make it easy to work with data and allow us to reproduce a statistical analysis in its entirety. Today, we will explore the popular R programming language. We will use R and its integrated development environment, RStudio, to manage biological data and address research questions, with the goal of practicing transparency and reproducibility in statistical analyses.
In this lab, we will introduce R, a programming language and free software environment for statistical computing and graphics. First developed in the 1990s, R has become a popular tool among biologists in recent years and nearly an essential skill among biology students, supporting national calls to promote and enhance quantitative literacy. The popularity of R in biology is partly due to its quality as statistical software and its ability to manage large datasets. Besides being free, R is open to contributions: anyone can write new functions and other tools and share them with the public. One of these developments is RStudio, an integrated development environment for R that makes it easier to navigate certain aspects of the language. So, let’s explore both R and RStudio and learn some of their basic attributes and code.
Upon completion of this lab, you should be able to:
Reference:
This example should work as a step-by-step introduction to some basic skills you will be implementing in the following chapters. For this activity, we will be using the dataset “anole.csv” from Donihue et al. 2018.
On your computer, create a folder named “working_directory” and save the downloaded file “anole.csv” in it. Open RStudio. Go to the Files/Plots/Packages/Help window (usually in the bottom right pane) and find your working_directory folder under Files. This window in RStudio allows you to navigate the folders on your computer. Once in your working_directory folder, click the gear icon -> Set As Working Directory. Note: for R to read the files of interest, you need to tell it where on your computer those files are (i.e., your working directory).
Alternatively, you can go to Session (at the top) -> Set Working Directory -> Choose Directory... and navigate to working_directory.
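The working directory can also be set with code instead of the menus. The sketch below is only an illustration: it creates a temporary folder so that it runs on any computer, and lab_dir is a hypothetical name. In practice, you would pass setwd() the path to your own working_directory folder.

```r
# Hypothetical folder used only for illustration; replace it with the path
# to your own working_directory folder.
lab_dir <- file.path(tempdir(), "working_directory")
dir.create(lab_dir, showWarnings = FALSE)

# setwd() changes the working directory and returns the previous one
old_dir <- setwd(lab_dir)

# getwd() reports where R is currently looking for files
getwd()

# list.files() shows the files R can see in the working directory
list.files()

# go back to where we started
setwd(old_dir)
```

Running getwd() at the start of a session is a quick way to confirm that R is pointed at the folder you expect before you try to read a file.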
An R script is where you will save your code and the progress of your work. The first step when carrying out a new analysis is to open a new script and save it. We are going to do this in our custom “Lab Report” format. Download the file from Canvas and save it somewhere handy on your computer. This copy will serve as your template for each Lab this semester. Then, open it and use File -> Save As to save it in your working_directory folder under a useful name, with a .Rmd file extension. This is the copy you will work with for today’s Lab Activity.
Annotating your script is a very important step that helps you keep track of your work and facilitates reproducibility. The R script serves as your notebook. There, you save your code together with annotations indicating what that code is for. In your script, write a description of the code you will be using in this activity. Inside a code chunk, put a “#” before any line of text you want R to interpret as an annotation and not as code. In Markdown, a “#” outside a code chunk indicates a header instead. Click the “Knit” button at the top frequently to see how your document looks.
For example, in the beginning of this activity’s script, you may add:
# Chapter 1: Intro to R and RStudio
Note: Remember to add an annotation before each line of code so that you or others can understand what the code does.
There are several ways to import data into R using RStudio, and the best choice depends on the type of file the data is stored in (e.g., .csv, .RData). You may use the one that works best for you. Below are two ways to import a .csv file:
A. Importing a .csv file using the RStudio drop-down menu: In the Files/Plots/Packages/Help window of RStudio, click Files and navigate to your working_directory folder. Once there, click on the anole.csv file and import it using the drop-down menu. If it worked, you should see anole in the RStudio Global Environment window.
B. Importing a .csv file using code: After setting up your working directory (Step 1), use the function read.csv() to import the anole.csv file. See the example below. Note the use of the arguments header=TRUE, to treat the first row of the file as column names, and stringsAsFactors=TRUE, to indicate that strings in the data frame should be treated as factor variables.
# importing the anole data
anole <- read.csv("anole.csv",header=TRUE,stringsAsFactors=TRUE)
To run the line of code, copy it and paste it into your R script. Then highlight it and click “Run”, which is located at the top of the script window. If it worked, you should see the line of code in your console and anole in the RStudio Global Environment window.
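To see what header and stringsAsFactors actually change, it can help to try read.csv() on a tiny file. The following is a minimal sketch, not part of the lab data: the file is written to a temporary location, and toy_path, toy, toy_chr, and the values in the file are all made up for illustration.

```r
# Write a tiny, made-up CSV to a temporary file (values are hypothetical,
# not taken from anole.csv)
toy_path <- tempfile(fileext = ".csv")
writeLines(c("Sex,Femur",
             "Female,9.5",
             "Male,11.2",
             "Male,10.4"), toy_path)

# header = TRUE: the first row becomes the column names
# stringsAsFactors = TRUE: text columns are stored as factors
toy <- read.csv(toy_path, header = TRUE, stringsAsFactors = TRUE)
names(toy)       # "Sex" "Femur"
class(toy$Sex)   # "factor"

# With stringsAsFactors = FALSE the same column stays as plain text
toy_chr <- read.csv(toy_path, header = TRUE, stringsAsFactors = FALSE)
class(toy_chr$Sex)   # "character"
```

Storing a grouping variable such as Sex as a factor is what lets functions like levels() list its categories later on.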
Before any analysis, we first need to check the data and understand its attributes.
Let’s start by looking into the structure of anole (the object we read from “anole.csv”) with the function str(). This function gives us information about the class of the data object, what variables are in the data object, and how many observations it contains.
# data structure
str(anole)
What is the class of the dataset “anole.csv” (e.g., data frame, table, tibble)?
How many observations and variables does the dataset “anole.csv” have?
What is the class of each variable in the dataset “anole.csv” (e.g., character, factor, numeric, integer)?
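If the output of str() feels dense, each piece of information can also be requested separately. The sketch below uses a small, made-up data frame (toy_anole, its Id column, and its values are hypothetical) so that it runs on its own; the same functions can be applied to anole.

```r
# A small, made-up data frame for illustration only
toy_anole <- data.frame(
  Id    = 1:3,
  Sex   = factor(c("Female", "Male", "Male")),
  Femur = c(9.5, 11.2, 10.4)
)

# everything at once
str(toy_anole)

# class of the whole object
class(toy_anole)          # "data.frame"

# number of observations (rows) and variables (columns)
dim(toy_anole)            # 3 3

# class of each variable
sapply(toy_anole, class)
```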
Other useful exploratory functions are levels(), which returns the unique values (levels) of a factor variable; summary(), which summarizes each variable in the dataset; head(), which returns the first rows in our dataset to explore the variables; and View(), which opens the dataset as a spreadsheet. Note that the R language is case-sensitive.
# levels of the variable Sex, the $ sign means "within"
levels(anole$Sex)
# summary of each variable
summary(anole)
# first rows of anole
head(anole)
# viewing anole
View(anole)
How many levels does the variable Sex have?
What is the mean Femur length of “anole.csv”?
What are the first three variables in “anole.csv”?
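A few companion functions make these questions quicker to answer, and a short test shows what case sensitivity means in practice. This is a sketch on a made-up data frame (toy_anole and its values are hypothetical), so the numbers are not those of anole.

```r
# A small, made-up data frame for illustration only
toy_anole <- data.frame(
  Sex   = factor(c("Female", "Male", "Male", "Female")),
  Femur = c(9.5, 11.2, 10.4, 9.9)
)

# the levels themselves, and how many there are
levels(toy_anole$Sex)    # "Female" "Male"
nlevels(toy_anole$Sex)   # 2

# head() accepts the number of rows to show
head(toy_anole, n = 2)

# names() lists the variables in order
names(toy_anole)

# R is case-sensitive: there is a column Femur but no column femur,
# so asking for the lowercase name returns NULL
toy_anole$femur
```

If a function call returns NULL or an "object not found" error, checking the capitalization of the name is a good first step.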
Many functions, like str(), come built into R. However, R packages give you additional useful functions. You will be using some functions from the tidyverse collection of packages to manage your data (e.g., filtering rows, selecting columns). Before you use a package for the first time, you need to install it in R. After installation, load it using library() in every new R session in which you need it.
# installing tidyverse
install.packages("tidyverse",repos="http://cran.us.r-project.org")
# loading tidyverse
library(tidyverse)
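Because installation is only needed once, it can be convenient to check whether a package is already installed before installing it again, especially in a document that is knit repeatedly. The following is a hedged sketch of one way to do that; is_installed is a hypothetical helper name, not a built-in function.

```r
# Hypothetical helper: TRUE if the package can be loaded, FALSE otherwise
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

# Only remind ourselves to install when the package is missing
if (!is_installed("tidyverse")) {
  message("tidyverse is not installed yet; run install.packages(\"tidyverse\") once.")
}

# packages that ship with R, such as stats, are always available
is_installed("stats")   # TRUE
```

Once the package is installed, library(tidyverse) is still required at the start of each session.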
Now, let’s say you are interested only in the column “Femur” within anole and you want to store that column in a new object named femurs. To select columns of a data frame, use select(). The first argument in this function is the data object, and the subsequent argument is the column to keep. Keep in mind that select() comes from the dplyr package, which is loaded as part of tidyverse.
# selecting the column femur
femurs <- select(anole, Femur)
# checking the new object created
femurs
Now, say you are interested only in femurs longer than 10 mm. To filter by row, use filter(). The first argument in this function is the data object, and the subsequent argument is a condition on a column (here, Femur>10). Only the rows that meet the condition are kept.
# filtering femurs by femur length > 10mm
femurs_10mm <- filter(femurs, Femur>10)
# checking the new object created
femurs_10mm
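One way to check that select() and filter() did what you expected is to repeat the same two steps in base R, which needs no package. This is a cross-check sketch, not the method used in this lab, and it runs on a made-up data frame (toy_anole, toy_femurs, toy_femurs_10mm, and the values are hypothetical).

```r
# A small, made-up data frame for illustration only
toy_anole <- data.frame(
  Sex   = factor(c("Female", "Male", "Male", "Female")),
  Femur = c(9.5, 11.2, 10.4, 9.9)
)

# keep only the Femur column; single brackets with a column name
# return a data frame, much like select(toy_anole, Femur)
toy_femurs <- toy_anole["Femur"]

# keep only the rows where Femur is greater than 10,
# much like filter(toy_femurs, Femur > 10)
toy_femurs_10mm <- subset(toy_femurs, Femur > 10)

# count the rows that passed the condition
nrow(toy_femurs_10mm)   # 2
```

Comparing the number of rows before and after filtering, with nrow(), is a simple sanity check in either approach.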
Finally, keep in mind that R also works as a calculator, and that it lets us define our own mathematical functions. Let’s use descriptive statistics functions on “Femur” in anole, including mean(), median(), min(), max(), and range(), to estimate the mean, median, minimum, maximum, and range of femur length values, respectively. Note that range() returns two numbers, the minimum and the maximum. Note the use of the $ sign!
# mean femur length
mean(anole$Femur)
# median femur length
median(anole$Femur)
# minimum femur length
min(anole$Femur)
# max femur length
max(anole$Femur)
# range of values for femur length
range(anole$Femur)
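These functions return NA when the data contain a missing value, which is common in real datasets. Whether anole has missing values is not stated here, so the sketch below uses a made-up vector (toy_femur is hypothetical) to show the na.rm argument and how to turn the output of range() into a single number.

```r
# A made-up vector with one missing value, for illustration only
toy_femur <- c(9.5, 11.2, NA, 10.4)

# with a missing value present, the mean is NA
mean(toy_femur)

# na.rm = TRUE drops missing values before calculating
mean(toy_femur, na.rm = TRUE)
median(toy_femur, na.rm = TRUE)   # 10.4

# range() returns the minimum and maximum as two numbers
femur_range <- range(toy_femur, na.rm = TRUE)
femur_range

# diff() gives the distance between them as one number
diff(femur_range)
```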
Now it is your turn to practice what we have done, this time with the dataset “lizards”. Stop and review steps 1-7 that you just completed. Think about how you could modify that code to do the same analysis for “lizards”. Hint: give appropriate names to the new objects you create for “lizards”, so that they do not overwrite the ones used for “anole.csv” in your script. Do the analysis, answer the questions, and be ready to present it!
Great Work!
This activity was adapted by Jenna T. B. Ekwealor for San Francisco State University from Biostatistics using R: A Laboratory Manual by Raisa Hernández-Pacheco and Alexis A Diaz of California State University, Long Beach.
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.