Chapter 3: Visualizing data

Today’s investigation. Displaying data with appropriate graphing tools is one of the most important jobs we have as scientists. Good data visualization lets us show newly discovered trends and relationships between variables, reach different audiences, and drive a clear message home. Today, we will learn how to choose and generate graphs based on the type of data we have and the research question we want to answer, using two datasets: marine invertebrate data from the CSULB Marine Ecology Lab and microbial genomics data from the CSULB Comparative Microbial Genomics Lab.

Introduction

In this lab, we will generate multiple graphs to identify good data visualization practices, which include making data displays that are clear, easy-to-interpret, and unbiased. When we carry out an experiment, a field study, or a computer-based analysis, we are evaluating variables. These variables can take different forms; they can be categorical (e.g., males vs females; climatic seasons) or numerical (e.g., weight; temperature), and our main interest may be to show a range of values for one variable or an association between two variables. So, how do we choose between the many data visualization tools available?

Today, we will explore different biological datasets in order to generalize about good data visualization practices. Specifically, we will practice identifying variable types and explore effective ways to display them using anatomical data from the crab Uca pugnax, collected by Dr. Bengt Allen from the CSULB Marine Ecology Lab, and microbial genomics data extracted from the public database PATRICbrc by Dr. Renaud Berlemont from the CSULB Comparative Microbial Genomics Lab (Figure 1). The R programming language offers multiple ways to produce high-quality graphs; however, it is up to us to decide which one to use for our data. So, let’s first explore the different datasets and decide on the most appropriate graphs to display them.

Figure 1. Left panel: Escherichia coli under a high magnification of 21674X. Image: Centers for Disease Control and Prevention. Right panel: Uca species. Image: Dominique DeLambert

Upon completion of this lab, you should be able to:

Differentiate between variable types;
Identify appropriate data visualizations given a set of variables;
Read and interpret graphs.

Worked example

To get started, let’s remind ourselves of the different variable types:

categorical - describes membership in a category or group. These variables do not correspond to a numerical magnitude.
numerical - describes a quantitative measure. These variables are numbers, and thus have magnitude in a numerical scale. Numerical variables can also be continuous or discrete. A continuous numerical variable can take any value within a range, while a discrete numerical variable can take integer values only.
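As a quick illustration of how these variable types look inside R, here is a minimal sketch. The data frame `crabs` and its values are made up for this example and are not taken from uca.csv; the point is only that categorical variables are usually stored as factors, continuous numerical variables as numeric, and discrete numerical variables as integer.

```r
# Hypothetical example data: values are invented for illustration only
crabs <- data.frame(
  sex    = factor(c("male", "female", "male", "female")),  # categorical
  mass_g = c(2.31, 1.87, 2.95, 2.10),                      # continuous numerical
  n_legs = c(8L, 7L, 8L, 6L)                               # discrete numerical
)

# class() reports how R stores each column
sapply(crabs, class)

# levels() lists the categories of a categorical variable
levels(crabs$sex)
```

Checking the class of each column right after importing a dataset is a simple way to confirm that R is treating every variable the way you intend.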

When we are exploring associations between variables, they are also categorized into explanatory or response variables:

explanatory - the variable that predicts or affects the other. This variable is also known as the independent variable.
response - the variable predicted or affected by the explanatory variable. This variable is also known as the dependent variable.

Our choice of data visualization depends on both the types of variables and our research question. Many aesthetic choices, however, come down to personal preference, so we often have more than one option. Say that we are exploring data from the National Climatic Data Center (NCDC). The NCDC provides daily temperature normals (average daily temperatures over a 30-year window) for different US sites. Say we are interested in knowing which periods of the year show less variability in temperature across four US sites (Chicago, Death Valley, Houston, and San Diego). In this case, temperature is a continuous numerical variable and day of the year is a discrete numerical variable (it takes only integer values from 1 to 366, counting leap years). Because we are interested in how temperature changes across days, day is the explanatory variable predicting temperature, and temperature is the response variable being predicted by the day of the year.

Let’s explore three different visualizations for the same temperature data and how they are useful to explore three different research questions.

Visualization 1: How does site-specific temperature change across time?

A common way to display trends in time, such as temporal changes in temperature, is the line graph. Usually, we plot one value of the response variable (temperature) on the y-axis for each value of the explanatory variable (day) on the x-axis. Adjacent points along the x-axis are then connected by a line, displaying the temporal trend (here labeled by month, Figure 2).

Figure 2. Daily temperature normals for four locations in the U.S. Data source: NOAA.

Interpretation. From this line graph, we can observe general patterns in temperature across time: cooler temperatures during winter months and warmer temperatures during summer months. We can also observe site-specific patterns because the data include a third, categorical variable: location. According to the displayed data, Chicago has more variability in daily temperature normals, while San Diego has less. This simple data visualization exercise opens up more interesting questions about latitudinal patterns in temperature and the drivers of variability along such a latitudinal gradient.
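A line graph of this kind can be sketched with geom_line(). The NOAA temperature normals are not among this lab’s files, so the sketch below assumes the built-in `airquality` dataset (daily readings for New York, May to September 1973) as a stand-in; the object names `aq` and `p_line` are only illustrative.

```r
library(ggplot2)

# airquality ships with R; it stands in here for the NOAA normals
aq <- airquality
aq$date <- as.Date(sprintf("1973-%02d-%02d", aq$Month, aq$Day))

# time (explanatory) on the x-axis, temperature (response) on the y-axis
p_line <- ggplot(aq, aes(x = date, y = Temp)) +
  geom_line() +
  xlab("Date") +
  ylab("Temperature (degrees F)") +
  theme_classic(18)
p_line
```

With data from several sites, mapping a categorical site column to color inside aes() would draw one line per site, as in Figure 2.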

Visualization 2: What month exhibits the highest mean temperature variation across sites?

Another visual technique that we can use to show the magnitude and pattern of a phenomenon is the heat map (Figure 3). In this case, we display a variation in color that corresponds to the magnitude of the response variable. This serves as a visual cue to the reader about how the variable changes and/or is clustered over time.

Figure 3. Monthly normal mean temperatures for four locations in the U.S. Data source: NOAA

Interpretation. From this heat map, we can infer that spring and fall months exhibit the highest variation in mean temperature normals (they span a larger range of colors representing mean monthly temperature). Although we could have inferred the same message from Figure 2, the heat map makes it easier to see. Another message we can get from this graph is the repetitive pattern of mean monthly temperature across these sites. That is, the figure depicts a mirror image.
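A heat map can be sketched with geom_tile(), which draws one colored tile per combination of the x and y variables. As before, the example assumes the built-in `airquality` dataset as a stand-in for the NOAA data, and the color choices are arbitrary.

```r
library(ggplot2)

# one tile per day and month; fill color encodes the temperature
p_heat <- ggplot(airquality, aes(x = Day, y = factor(Month), fill = Temp)) +
  geom_tile() +
  scale_fill_gradient(low = "blue", high = "red", name = "Temp (F)") +
  xlab("Day of month") +
  ylab("Month") +
  theme_classic(18)
p_heat
```

Here the response variable is mapped to the fill aesthetic rather than to an axis, which is what distinguishes a heat map from the line graph.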

Visualization 3: How periodic is daily temperature across sites?

A final visualization that is useful for periodic data, such as daily temperature across years, is the polar coordinate plot. Here, data values at one end of the scale can be logically joined to data values at the other end. Let’s consider our explanatory variable: days. December 31st is the last day of the year, but it is also one day before the first day of the year. If we want to show how some quantity varies over the year, we can use polar coordinates with the angle coordinate specifying each day (Figure 4). Here, the radial distance from the center point indicates the daily temperature, and the days of the year are arranged counter-clockwise starting with January. By plotting the temperature normals in a polar coordinate system, we emphasize their cyclical nature.

Figure 4. Daily temperature normals for four selected locations in the U.S. Data source: NOAA

Interpretation. From this polar plot, we can directly address the repetitiveness, and thus the periodicity, of temperature through time. If temperature were not periodic, we would not observe closed, roughly circular curves in this plot. As in Figure 2 and Figure 3, we can also see that Death Valley presents the warmest temperatures during summer, while Chicago exhibits the coldest temperatures during winter. We can also see that Death Valley, Houston, and San Diego present similar temperatures during winter.
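In ggplot2, a polar plot is an ordinary plot with coord_polar() added. The sketch below assumes the built-in `nottem` time series (monthly air temperatures at Nottingham, 1920 to 1939, in degrees Fahrenheit) as a stand-in for the NOAA data; the names `temps`, `monthly`, and `p_polar` are illustrative.

```r
library(ggplot2)

# convert the built-in monthly time series into a data frame
temps <- data.frame(
  month = as.integer(cycle(nottem)),  # 1 = January, ..., 12 = December
  temp  = as.numeric(nottem)
)

# average each calendar month over the 20 years
monthly <- aggregate(temp ~ month, data = temps, FUN = mean)

p_polar <- ggplot(monthly, aes(x = month, y = temp)) +
  geom_line() +
  geom_point() +
  # limits of 0.5 to 12.5 give each month its own equal slice of the circle
  scale_x_continuous(breaks = 1:12, labels = month.abb,
                     limits = c(0.5, 12.5)) +
  coord_polar(direction = -1) +       # -1 arranges months counter-clockwise
  xlab("Month") +
  ylab("Mean temperature (F)") +
  theme_minimal(18)
p_polar
```

Removing the coord_polar() line turns the same code back into a regular line graph, which makes it easy to compare the two displays.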

Materials and Methods

R and RStudio
R packages ggplot2 and tidyverse
uca.csv
micro.txt

Today’s activity, Visualizing data, is organized into five main exercises using the relatively small Uca crab dataset, followed by a challenge exercise using the larger microbial genomics dataset.

Visualizing data

1. Import the data

Import both datasets to RStudio and check them out. Don’t forget to use the metadata file for reference.

# importing the uca data
uca <- read.csv("uca.csv",header=TRUE,stringsAsFactors = TRUE)
View(uca)

# importing the microbial genomics data
micro <- read.table("micro.txt",stringsAsFactors = TRUE)
View(micro)
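Besides View(), a few base R functions summarize the size and structure of a data frame. The sketch below uses a small made-up data frame called `demo` so that it runs on its own; the same calls should work on `uca` and `micro` once they are imported.

```r
# Stand-in data frame with invented values; swap in uca or micro after importing
demo <- data.frame(
  id     = 1:4,
  group  = factor(c("a", "b", "a", "b")),
  length = c(10.2, 11.5, 9.8, 12.1)
)

dim(demo)    # number of rows (observations) and columns (variables)
nrow(demo)   # observations only
ncol(demo)   # variables only
str(demo)    # type of every variable plus a preview of its values
```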

Questions

How many variables and observations does “uca” have? Hint: review past R scripts!
How many variables and observations does “micro” have?

Info-Box! To plot the data, we will use R package ggplot2. To have a preview of plots, click on “Plots” in the bottom right window of RStudio. In ggplot2, we use the function ggplot() to generate the graphs. The basic template for ggplots is

ggplot(data = DATA, mapping = aes(MAPPINGS)) + GEOM_FUNCTION(),

where DATA is our data, MAPPINGS are the x-axis and y-axis variables, and GEOM_FUNCTION is the graph type. ggplot2 offers many different geoms, including:

geom_bar(), for bar graphs;
geom_histogram(), for histograms;
geom_point(), for scatter plots;
geom_boxplot(), for boxplots;
geom_line(), for trend lines and time series.

We will explore some ggplots, but many free online resources exist for your reference, such as the Cookbook for R.
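To see the template filled in, here is a minimal sketch that assumes the built-in `iris` dataset rather than the lab data. The object names `p_template` and `p_box` are only illustrative.

```r
library(ggplot2)

# DATA = iris, MAPPINGS = x and y variables, GEOM_FUNCTION = geom_point()
p_template <- ggplot(data = iris,
                     mapping = aes(x = Sepal.Length, y = Petal.Length)) +
  geom_point()
p_template

# Swapping the mappings and the geom changes the graph type; the data stay the same
p_box <- ggplot(data = iris,
                mapping = aes(x = Species, y = Petal.Length)) +
  geom_boxplot()
p_box
```

Storing a plot in an object (here `p_template`) and then typing its name is the same pattern used for p1 to p5 in the exercises.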

2. Displaying data for one variable

To examine data for a single variable, we use a frequency distribution, which shows the number of occurrences of each value in the data.

A. When displaying categorical data, use a bar graph.

A bar graph uses the height of the bars to display the frequency distribution of categorical variables. Say we are interested in exploring the claw size distribution of the Uca crab. Are there size classes with more representation in the population? To answer this question, we can generate a frequency distribution for the different claw size classes (categorical variables; i.e., small, medium, large) using a bar graph. As we did in Chapter 1, before you use a package for the first time you need to install it, then you should load it using library() in every subsequent R session as needed.

# installing ggplot2
install.packages("ggplot2",repos="http://cran.us.r-project.org")

# loading package ggplot2
library(ggplot2)

# checking the first rows of the dataframe 
head(uca)

# frequency of claw size classes
p1 <- ggplot(data=uca,aes(x=claw_size)) +
  geom_bar() 
p1

A good practice is to order the bars by magnitude, as it helps the reader find important patterns in the data. Let’s reorder the frequency bars from highest magnitude to lowest. For this, we first need to build a frequency table for claw size classes using table().

# frequency table of claw size classes
freq_t <- table(uca$claw_size)
freq_t

# converting freq_t into a data frame to plot it
freq_t <- as.data.frame(freq_t)
freq_t

# reordering the bars per magnitude using the new dataframe
p1 <- ggplot(data=freq_t,aes(x=reorder(Var1,-Freq),y=Freq)) +
   geom_bar(stat="identity")
p1
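To see what reorder() does on its own, the sketch below applies it to a small made-up frequency table with the same column names (Var1 and Freq) that as.data.frame(table(...)) produces. The counts in `freq_demo` are invented for illustration.

```r
# Made-up frequency table in the same shape as as.data.frame(table(...))
freq_demo <- data.frame(
  Var1 = factor(c("large", "medium", "small")),
  Freq = c(5, 12, 8)
)

# Default level order is alphabetical
levels(freq_demo$Var1)

# reorder() sorts the levels by the second argument in increasing order,
# so the minus sign puts the category with the largest frequency first
levels(reorder(freq_demo$Var1, -freq_demo$Freq))
```

ggplot2 draws the bars in level order, which is why reordering the factor levels is enough to reorder the bars.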

Finally, let’s add some important aesthetics to p1.

# adding y- and x-axis labels using ylab() and xlab()
p1 <- ggplot(data=freq_t,aes(x=reorder(Var1,-Freq),y=Freq)) +
  geom_bar(stat="identity") +
  ylab("Number of individuals") +
  xlab("Claw size class")
p1

# deleting the background color using theme_classic()
p1 <- ggplot(data=freq_t,aes(x=reorder(Var1,-Freq),y=Freq)) +
  geom_bar(stat="identity") +
  ylab("Number of individuals") +
  xlab("Claw size class") +
  theme_classic()
p1

# increasing the font size to "18"
p1 <- ggplot(data=freq_t,aes(x=reorder(Var1,-Freq),y=Freq)) +
  geom_bar(stat="identity") +
  ylab("Number of individuals") +
  xlab("Claw size class") +
  theme_classic(18)
p1

Questions

What type of variable is “claw_size”?
Are claw size classes represented equally in the population?
Which size class is least represented in the population?

Challenge 1. Say we are interested in exploring biases in available microbial sequenced genomes. Are there particular microbes that have been sequenced by scientists more than others? To answer this question, we can generate a frequency distribution for different microbial classifications (categorical variables; e.g., kingdom, phylum), again using a bar graph. Reproduce Figure 5 below showing the number of sequenced genomes per kingdom (categorical variable; i.e., Archaea, Bacteria, Viruses) using p1 as an example.

Figure 5. Proportion of microbial sequenced genomes per kingdom. Data source: PATRICbrc dataset.

B. When displaying numerical data, use a histogram.

In contrast to bar graphs, histograms use the area of the bars to display the frequency distribution of numerical variables. In this way, histograms split the data values into intervals or bins, showing the shape of frequency distributions.

Let’s explore the claw length distribution of the Uca crab.

# histogram for claw length distribution
p2 <- ggplot(uca,aes(x=claw_length)) +
  geom_histogram() 
p2

# changing the bin width
p2 <- ggplot(uca,aes(x=claw_length)) +
  geom_histogram(binwidth = 2)
p2

# adding aesthetics
p2 <- ggplot(uca,aes(x=claw_length)) +
  geom_histogram(binwidth = 2) +
  ylab("Number of individuals") +
  xlab("Claw length (mm)") +
  theme_classic(18)
p2
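If you want to inspect the bins that geom_histogram() computes, ggplot_build() exposes them as a data frame with one row per bar. The sketch below assumes the built-in `faithful` dataset (waiting time between eruptions, in minutes) as a stand-in for the crab data.

```r
library(ggplot2)

# same variable, two different bin widths
narrow <- ggplot(faithful, aes(x = waiting)) + geom_histogram(binwidth = 2)
wide   <- ggplot(faithful, aes(x = waiting)) + geom_histogram(binwidth = 10)

# ggplot_build() returns the computed bins: one row per bar
bins_narrow <- ggplot_build(narrow)$data[[1]]
bins_wide   <- ggplot_build(wide)$data[[1]]

nrow(bins_narrow)        # number of bars with binwidth = 2
nrow(bins_wide)          # number of bars with binwidth = 10
sum(bins_narrow$count)   # every observation is counted in exactly one bin
```

Comparing the two plots side by side is a good way to judge whether a bin width hides or exaggerates the shape of the distribution.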

Questions

What type of variable is “claw_length”?
What happens as you increase the bin width?
What shape does the frequency distribution of claw length have?

Challenge 2. Now, let’s say we are interested in exploring the genome length distribution of the available microbial sequenced genomes. Reproduce Figure 6 below showing the microbial genome length distribution across kingdoms using p2 as an example. Hint: check the variable “genome_length” in “micro” and think about an appropriate bin width.

Figure 6. Microbial sequenced genomes per kingdom. Data source: PATRICbrc dataset.

3. Displaying associations between two variables

Here, we are interested in graphing two different variables to explore associations between them, or a difference between groups.

A. When displaying categorical data, use a mosaic plot.

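A mosaic plot is similar to a bar graph, but the bars within each group are stacked on top of one another. Strictly speaking, a mosaic plot shows relative frequencies, with bar widths proportional to group size; the version we draw here is the simpler stacked bar graph of counts.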

Let’s explore associations between claw size class and crab mass using a mosaic plot.

# mosaic plot for claw size and crab body mass
p3 <- ggplot(data=uca,aes(x=claw_size,fill=mass_class)) +
  geom_bar() +
  ylab("Number of individuals") +
  xlab("Claw size class") +
  theme_classic(18)
p3
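The graph above stacks counts. A closely related display rescales every bar to the same height so that the segments show relative frequencies, which makes groups of different sizes easier to compare. The sketch below assumes the built-in `mtcars` dataset as a stand-in for the crab data; the object names `mt` and `p_fill` are illustrative.

```r
library(ggplot2)

# cyl and gear are stored as numbers in mtcars, so convert them to factors
mt <- transform(mtcars, cyl = factor(cyl), gear = factor(gear))

# contingency table: counts for every combination of the two categories
table(mt$cyl, mt$gear)

# position = "fill" rescales each bar to 1, so segments show relative frequencies
p_fill <- ggplot(mt, aes(x = cyl, fill = gear)) +
  geom_bar(position = "fill") +
  ylab("Relative frequency") +
  xlab("Number of cylinders") +
  theme_classic(18)
p_fill
```

The contingency table and the plot carry the same information; the table is handy for checking exact counts behind each segment.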

Questions

What type of variable are “claw_size” and “mass_class”?
What associations between claw size and crab body mass can you infer from the graph?
What distribution of mass classes would you expect in a population of crabs with large claw size?

Challenge 3. Say we are interested in exploring associations between available sequenced genomes per kingdom and phylum. Reproduce Figure 7 below showing the number of microbial sequenced genomes per kingdom and phylum using p3 as an example.

Figure 7. Microbial sequenced genomes per kingdom and phylum. Data source: PATRICbrc dataset.

B. When displaying numerical data, use scatter plots.

A scatter plot is used whenever we want to explore the association between two numerical variables. Here, the explanatory variable is positioned on the x-axis, while the response variable is positioned on the y-axis.

Let’s explore the association between Uca’s claw length and claw mass using geom_point().

# scatter plot for claw length and claw mass 
p4 <- ggplot(uca,aes(x=claw_mass,y=claw_length)) +
  geom_point()
p4

# differentiating by mass class adding the aesthetic of color=mass_class
p4 <- ggplot(uca,aes(x=claw_mass,y=claw_length,color=mass_class)) +
  geom_point()
p4

# adding all aesthetics
p4 <- ggplot(uca,aes(x=claw_mass,y=claw_length,color=mass_class)) +
  geom_point() +
  ylab("Claw length (mm)") +
  xlab("Claw mass (g)") +
  theme_classic(18)
p4

Questions

What type of variable are “claw_mass”, “claw_length”, and “mass_class”?
What is the response variable and the explanatory variable?
What associations can you infer from the graph?

Challenge 4. Say we are interested in exploring whether longer microbial genomes are associated with a larger number of genes in the genome. Reproduce Figure 8 below showing the association between the number of genes and the genome length for microbes across phyla using p4 as an example. Hint: You may plot the data without a legend by adding theme(legend.position = "none") to the plot code. Be aware of the x-axis!

Figure 8. Number of genes as a function of genome length (kbp). Data source: PATRICbrc dataset.

C. When displaying numerical and categorical data, use box plots.

A box plot uses lines and rectangular boxes to display a summary of the frequency distribution of the variable of interest. Specifically, box plots show the median, quartiles, range, and extreme measurements of the data (Figure 9). Let’s first review the definitions of each component in the summary:

median - the middle measurement of the observations. That is, half of the observations are located below the median and the other half are located above the median.
quartile - one of the three values that divide the ordered data into four equal groups. That is, one-fourth (25%) of the data lie below the first quartile and one-fourth (25%) of the data lie above the third quartile.
range - the difference between the maximum and minimum value.
extreme measure - a data point that differs significantly from the rest.
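Each of these components can be computed directly in base R. The sketch below uses a short vector of made-up measurements (not the crab data) so that the numbers are easy to check by hand.

```r
# Made-up measurements for illustration only
x <- c(2, 4, 4, 5, 6, 7, 8, 9, 30)

median(x)                    # middle value of the ordered data
quantile(x, c(0.25, 0.75))   # first and third quartiles
diff(range(x))               # range = maximum - minimum
IQR(x)                       # interquartile range = third quartile - first quartile

# boxplot.stats() flags points lying more than 1.5 * IQR beyond the box
boxplot.stats(x)$out
```

The 1.5 * IQR rule shown in the last line is also the default that geom_boxplot() uses to decide which points are drawn individually as extreme measurements.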

Figure 9. Anatomy of a box plot.

Let’s explore the summaries of the frequency distributions of Uca crab body mass across claw size classes.

# box plot for body mass across claw size using geom_boxplot
p5 <- ggplot(uca,aes(x=claw_size,y=crab_mass)) +
  geom_boxplot()
p5

# adding aesthetics
p5 <- ggplot(uca,aes(x=claw_size,y=crab_mass,fill=claw_size)) +
  geom_boxplot() +
  ylab("Body mass (g)") +
  xlab("Claw size class") +
  theme_classic(18)
p5

Question:

What shape does the frequency distribution of body mass have across claw size classes?

Challenge 5. Say we are interested in exploring the frequency distribution of gene density (number of genes per 1000 nucleotides) across classes in the phylum Actinobacteria. Reproduce Figure 10 below showing the frequency distribution of gene density across Actinobacteria using p5 as an example. Hints: Review Chapter 1’s script on how to filter data (you will need to create a new data object with Actinobacteria data only). You may rotate the x-axis labels by 90 degrees by adding theme(axis.text.x = element_text(angle = 90)) to the plot code. The y-axis limits can be manipulated with the function coord_cartesian(). For this figure, adding coord_cartesian(ylim=c(0.75,1.25)) works fine!

Figure 10. Gene density in Actinobacteria.
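The hints for Challenge 5 combine three steps: filtering, rotating axis labels, and zooming the y-axis. The sketch below shows the same pattern on the built-in `iris` dataset, which is only a stand-in; the column names and axis limits here do not apply to `micro`.

```r
library(ggplot2)

# keep two of the three species; droplevels() removes the unused category
two_species <- droplevels(subset(iris, Species != "setosa"))

p_zoom <- ggplot(two_species,
                 aes(x = Species, y = Sepal.Width, fill = Species)) +
  geom_boxplot() +
  theme_classic(18) +
  theme(axis.text.x = element_text(angle = 90)) +  # rotate x-axis labels
  coord_cartesian(ylim = c(2.5, 3.5))              # zoom in without dropping data
p_zoom
```

Two details are worth noting. The theme() call comes after theme_classic(), because a complete theme added later would overwrite it. And coord_cartesian() only zooms the view, so the boxes are still computed from all the data.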

Info-Box! Exporting and presenting graphs:

Figure export - RStudio provides an easy way to export images using different formats, including .pdf files. This feature is located in the Plots window (bottom right window of RStudio). Note that here you may manipulate the size of the image.
Figure caption - All graphs in a report must be accompanied by a caption (a short title and description below the figure). Note that all the figures in this module have their own name and description. In this way, graphs are stand-alone figures in the report. The name of the figure is useful when referencing it in the main text of a report.
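Figures can also be exported from code with ggplot2’s ggsave() function, which makes the export reproducible. The sketch below assumes the built-in `iris` dataset and writes to a temporary folder; the file name figure_demo.pdf is only an example.

```r
library(ggplot2)

p_demo <- ggplot(iris, aes(x = Sepal.Length)) +
  geom_histogram(binwidth = 0.25) +
  theme_classic(18)

# tempdir() keeps the example from cluttering the working directory;
# in a real project you would use a path inside your project folder
out_file <- file.path(tempdir(), "figure_demo.pdf")
ggsave(out_file, plot = p_demo, width = 6, height = 4, units = "in")

file.exists(out_file)
```

Setting width and height explicitly means the exported figure has the same size every time, regardless of how the Plots window happens to be resized.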

Discussion questions

If you are interested in exploring the frequency distribution of one numerical variable, which visualization would you use? Explain.
Explain the components of a box plot and why it is a useful visualization.
Choose an image from a challenge and write up a caption for it.

Great Work!
