Population Genetics

Chapter 13

Today’s Investigation. In past labs, we learned how to describe variation in biological data, test hypotheses, and build models to explain patterns. Today, we shift our focus to the genetic variation that underlies many of those patterns. Every population carries a story written in its DNA, shaped by history, isolation, and chance. We will use real genomic data from dusky-footed woodrats (Neotoma fuscipes) across California, collected by Dr. Robert Boria’s lab at San Francisco State University, to explore how geography and history leave signatures in genetic diversity. You will calculate basic population genetic statistics and interpret what they reveal about how populations are structured across the landscape.


Introduction

Population genetics studies how genetic variation is distributed within and between populations. It helps us understand processes like gene flow, drift, and adaptation. Today you will explore genetic diversity across populations of dusky-footed woodrats, an important ecological species in California ecosystems.

Dusky-footed woodrat emerging from hole
A dusky-footed woodrat (Neotoma fuscipes) emerges from its hole. (Photo by Jon Klein via Bay Nature Magazine.)

Upon completion of this lab, you should be able to:

References:


Worked example

In population genetics, a single nucleotide polymorphism (SNP) is a single base-pair position in the genome where individuals may differ. SNPs are the most common type of genetic marker and are widely used to study genetic diversity, population structure, and evolutionary history.

Suppose you are studying populations of an endangered species, the mountain gorilla (Gorilla beringei beringei). You genotype individuals from three isolated parks at a single SNP locus. The genotype counts for each location are shown below:

LocationAAAGGG
Park A10155
Park B6816
Park C12108

At this SNP locus:

1. Calculating heterozygosity

We want to calculate two key measures:

Let's walk through the calculation for Park A.


First, calculate the total number of individuals sampled:

Total individuals sampled: 10 + 15 + 5 = 30
# Total number of individuals
total_individuals <- 10 + 15 + 5

Next, calculate the observed heterozygosity:

Hobs = 15 ÷ 30 = 0.5
# Observed heterozygosity
H_obs <- 15 / total_individuals

Now, calculate the allele frequencies. Each AA individual contributes two copies of allele A, each AG individual contributes one copy of A. So:

Copies of A: (2 × 10) + (1 × 15) = 35
Copies of A: (2 × 10) + (1 × 15) = 35
p = 35 ÷ 60 ≈ 0.583
# Frequency of allele A
p <- (2*10 + 15) / (2*total_individuals)

The frequency of allele G is simply:

q = 1 − p = 0.417
# Frequency of allele G
q <- 1 - p

Finally, calculate the expected heterozygosity under Hardy-Weinberg equilibrium:

Hexp = 2 × p × q ≈ 2 × 0.583 × 0.417 ≈ 0.486
# Expected heterozygosity
H_exp <- 2 * p * q

Thus, in Park A, the observed heterozygosity is 0.5 and the expected heterozygosity is about 0.486. These values are very close, suggesting the population may be near Hardy-Weinberg equilibrium at this locus.

2. Spatial analysis

In the worked example above, you calculated genetic diversity at a single SNP. Now, we will explore how genetic similarity might change with geographic distance, by constructing two real distance matrices: one based on genotype differences, and one based on location.

What is a distance matrix? A distance matrix is a table showing how different each pair of individuals is. Rows and columns represent individuals, and each cell shows their distance: 0 if identical, larger if different. Matrices are symmetric, and distances along the diagonal are always 0 (an individual is identical to itself).

A. Genotype data

Suppose we have genotypes for four gorillas at one SNP locus:

GorillaGenotype
G1AA
G2AG
G3GG
G4AG

We can calculate genetic distance between individuals using a simple mismatch method:

# Create genotype data
genotypes <- c("AA", "AG", "GG", "AG")
names(genotypes) <- c("G1", "G2", "G3", "G4")

# Function to calculate pairwise genetic distance
genetic_distance <- function(g1, g2) {
  if (g1 == g2) {
    return(0)
  } else if (g1 == "AG" || g2 == "AG") {
    return(0.5)
  } else {
    return(1)
  }
}

# Build genetic distance matrix
gen_dist_matrix <- outer(genotypes, genotypes, Vectorize(genetic_distance))
gen_dist_matrix

This creates a genetic distance matrix:

G1G2G3G4
G100.510.5
G20.500.50
G310.500.5
G40.500.50

On larger datasets with many SNPs, we can also use other genetic distance metrics, such as FST or Nei’s distance, which compare allele frequencies between populations more formally.


B. Location data

Suppose these four gorillas were sampled at the following locations (in kilometers):

GorillaXY
G100
G2010
G3100
G41010

We can calculate their pairwise geographic distances:

# Create coordinates
coords <- data.frame(
  X = c(0, 0, 10, 10),
  Y = c(0, 10, 0, 10)
)
rownames(coords) <- c("G1", "G2", "G3", "G4")

# Calculate geographic distance matrix
geo_dist_matrix <- dist(coords)
as.matrix(geo_dist_matrix)

C. Testing for Isolation by Distance

Now we can test whether gorillas who are farther apart geographically are also more genetically different.

What is the Mantel test? The Mantel test formally compares two distance matrices to test whether they are correlated. In our case, one matrix represents genetic distances between individuals and the other represents geographic distances. A significant p-value suggests that individuals farther apart geographically are also more genetically different; a pattern expected under Isolation by Distance.

# Load the vegan package for the Mantel test
library(vegan)

# Perform Mantel test between genetic and geographic distances
mantel_result <- mantel(gen_dist_matrix, geo_dist_matrix)

# View the result
mantel_result

The output shows a correlation coefficient (r) and a p-value. If the p-value is small (typically < 0.05), it suggests a significant association between geographic and genetic distances.

Important: The Mantel test is specifically designed for comparing two distance matrices, where each entry represents a pairwise comparison between individuals or populations.



Materials and methods

File descriptions: In this lab, we focus on two datasets. Neo_fus.vcf contains SNP genotype data while Neo_fus_locations.csv contains latitude, longitude, and location information for dusky-footed woodrats sampled across California while.


Today’s activity on genetic variation and geographic structure is organized into one main exercise that explores how isolation and movement shape genetic diversity among woodrat populations. This exercise will help us apply basic population genetic concepts, visualize genetic structure, and test hypotheses about isolation by distance.

Genetic diversity and population structure in woodrats

Research question 1: Does genetic differentiation increase with geographic distance among dusky-footed woodrat populations?

Before we begin analyzing genetic structure, let’s load the dataset. Neo_fus.csv contains SNP genotype, latitude, longitude, and location information for data for dusky-footed woodrats sampled across California. We will use these data to calculate measures of genetic diversity and explore spatial patterns.

1. Import the data

# Load metadata (locations, coordinates)
neo_fus_loc <- read.csv("Neo_fus.csv")

# Load SNP genotype data from VCF file
neo_fus_gen <- read.vcfR("Neo_fus.vcf")

2. Create a genind object

# Convert VCF data to a genind object
genind_obj <- vcfR2genind(neo_fus_gen)

# View summary of the genetic data
summary(genind_obj)

Questions:

  1. How many woodrat individuals are included in the genotype dataset?
  2. What information is included in the dataset?

3. Calculate heterozygosity

# Summarize the genind object to get basic statistics
genind_summary <- summary(genind_obj)

# Extract observed and expected heterozygosity per SNP locus
observed_het <- genind_summary$Hobs
expected_het <- genind_summary$Hexp

# Combine into a tidy table
het_df <- tibble(
  Locus = names(observed_het),
  Hobs = observed_het,
  Hexp = expected_het
)

# View first few rows
head(het_df)

4. Plot observed and expected heterozygosity

# Plot observed vs expected heterozygosity
ggplot(het_df, aes(x = Hexp, y = Hobs)) +
  geom_point(color = "gray60") +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "red") +
  theme_minimal() +
  labs(x = "Expected Heterozygosity", 
       y = "Observed Heterozygosity",
       title = "Observed vs Expected Heterozygosity")

Questions:

  1. Do most loci have observed heterozygosity greater than, less than, or close to expected heterozygosity?
  2. What might cause loci to deviate from Hardy-Weinberg expectations?

3. Calculate FST among populations

Now we want to quantify genetic differentiation among populations using FST statistics. FST measures how much allele frequencies differ between populations.

# Assign populations based on metadata
pop(genind_obj) <- as.factor(neo_fus_loc$location)

# Convert genind object to hierfstat format
hf_data <- genind2hierfstat(genind_obj)

# Calculate basic statistics, including per-locus Fst values
basic_stats <- basic.stats(hf_data)

# View per-locus values, including Fst
basic_stats_perloc <- basic_stats$perloc

# Turn locus IDs (rownames) into a column
basic_stats_perloc$locusID <- rownames(basic_stats_perloc)

Questions:

  1. Which SNP loci show the highest FST values? What do these FST values suggest about genetic differentiation among woodrat populations?
  2. Are there loci with very low or near-zero FST values? What biological or statistical factors might explain that?

Next, let's visualize the variation in FST values across SNPs using a bar plot.

Challenge 1. Create a plot that shows FST values for each SNP. Hint: use ggplot2::geom_col().

5. Spatial patterns of genetic structure

In the worked example, you thought about how genetic similarity might change with distance. Here, we calculate two types of distances across real populations: geographic distance (based on coordinates) and genetic distance (based on allele frequencies).

# Calculate pairwise geographic distances
dist_geo <- neo_fus_loc %>%
  select(longitude, latitude) %>%
  dist()

# Calculate pairwise genetic distances between individuals
dist_gen <- dist(genind_obj)

Challenge 2. Create a plot comparing genetic and geographic distance Add a regression line to your scatterplot to help visualize the trend. Hint: use ggplot2::geom_smooth(method = "lm").

Stop and Think: Stop and consider: What would you expect the relationship to look like if there is isolation by distance?

Questions:

  1. Does your plot suggest a positive or negative relationship between geographic distance and genetic distance?
  2. What biological processes could create a pattern where genetic distance increases with geographic distance?

Challenge 3. What kind of statistical methods could you use to formally test whether genetic distance increases with geographic distance? Justify the best method and statistically test the relationship.


Discussion questions

  1. What does heterozygosity measure, and how is it calculated from genotype data?
  2. Are the populations genetically distinct? Revisit basic_stats and include an appropriate statistic to support your answer.
  3. What does FST quantify statistically? What does a high FST indicate about variation among groups?
  4. When using a Mantel test, what does a significant p-value tell you about the association between two distance matrices?

  5. Look Ahead: Next week, we will extend what you learned here to build phylogenetic trees. Instead of simply comparing distances, we will use those distances to infer the evolutionary relationships among populations and species. Start thinking about how patterns of genetic similarity and difference might be shaped by common ancestry over time!

    Great work!