Population Genetics

Today’s Investigation. In past labs, we learned how to describe variation in biological data, test hypotheses, and build models to explain patterns. Today, we shift our focus to the genetic variation that underlies many of those patterns. Every population carries a story written in its DNA, shaped by history, isolation, and chance. We will use real genomic data from dusky-footed woodrats (Neotoma fuscipes) across California, collected by Dr. Robert Boria’s lab at San Francisco State University, to explore how geography and history leave signatures in genetic diversity. You will calculate basic population genetic statistics and interpret what they reveal about how populations are structured across the landscape.

Introduction

Population genetics studies how genetic variation is distributed within and between populations. It helps us understand processes like gene flow, genetic drift, and adaptation. Today you will explore genetic diversity across populations of dusky-footed woodrats, an ecologically important species in California.

Worked example

In population genetics, a single nucleotide polymorphism (SNP) is a single base-pair position in the genome where individuals may differ. SNPs are the most common type of genetic marker and are widely used to study genetic diversity, population structure, and evolutionary history.

Suppose you are studying populations of an endangered species, the mountain gorilla (Gorilla beringei beringei). You genotype individuals from three isolated parks at a single SNP locus. The genotype counts for each location are shown below:

Location	AA	AG	GG
Park A	10	15	5
Park B	6	8	16
Park C	12	10	8

1. Calculating heterozygosity

Start with Park A, which has 10 + 15 + 5 = 30 individuals. The observed heterozygosity is the proportion of individuals that are heterozygous (AG):

H_O = 15 / 30 = 0.5

Now, calculate the allele frequencies. Each AA individual contributes two copies of allele A, and each AG individual contributes one copy of A, out of 2 × 30 = 60 allele copies in total. So:

p = (2 × 10 + 15) / 60 = 35 / 60 ≈ 0.583
q = 1 − p = 25 / 60 ≈ 0.417

Finally, calculate the expected heterozygosity under Hardy-Weinberg equilibrium:

H_E = 2pq = 2 × (35 / 60) × (25 / 60) ≈ 0.486

Thus, in Park A, the observed heterozygosity is 0.5 and the expected heterozygosity is about 0.486. These values are very close, suggesting the population may be near Hardy-Weinberg equilibrium at this locus.

2. Spatial analysis

In step 1, you calculated genetic diversity at a single SNP. Now we will explore how genetic similarity might change with geographic distance by constructing two distance matrices: one based on genotype differences and one based on location.

What is a distance matrix? A distance matrix is a table showing how different each pair of individuals is. Rows and columns represent individuals, and each cell shows their distance: 0 if identical, larger if different. Matrices are symmetric, and distances along the diagonal are always 0 (an individual is identical to itself).

A. Genotype data

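We can calculate genetic distance between individuals using a simple mismatch method. For a single SNP, the distance between two individuals is the difference in the number of A alleles they carry, divided by 2: identical genotypes score 0, AA vs. AG or AG vs. GG scores 0.5, and AA vs. GG scores 1. The first table lists the genotypes of four gorillas; the second gives their pairwise distances.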

Gorilla	Genotype
G1	AA
G2	AG
G3	GG
G4	AG

	G1	G2	G3	G4
G1	0	0.5	1	0.5
G2	0.5	0	0.5	0
G3	1	0.5	0	0.5
G4	0.5	0	0.5	0
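
The matrix above can be reproduced with a few lines of code. The sketch below is written in Python purely for illustration and is not part of the lab's workflow. It assumes the mismatch distance is the difference in A-allele counts divided by 2, which matches the values in the matrix; the function names are hypothetical.

```python
# Genotypes from the worked example.
genotypes = {"G1": "AA", "G2": "AG", "G3": "GG", "G4": "AG"}


def mismatch_distance(g1, g2):
    """Difference in the number of A alleles, scaled to the range 0-1.

    Assumed definition: it reproduces the matrix in the text.
    """
    return abs(g1.count("A") - g2.count("A")) / 2


def distance_matrix(genos):
    """Return (names, matrix) with one row and one column per individual."""
    names = list(genos)
    matrix = [[mismatch_distance(genos[a], genos[b]) for b in names] for a in names]
    return names, matrix


names, genetic_dist = distance_matrix(genotypes)
for name, row in zip(names, genetic_dist):
    print(name, row)
```

Because the distance depends only on the two genotypes being compared, the result is symmetric and has zeros on the diagonal, as a distance matrix should.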

On larger datasets with many SNPs, we can also use other genetic distance metrics, such as F_ST or Nei’s distance, which compare allele frequencies between populations more formally.
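
To make the idea of comparing allele frequencies concrete, here is a minimal Python sketch that uses the genotype counts for the three parks. It computes observed and expected heterozygosity for each park, then a single-locus F_ST as (H_T − H_S) / H_T, where H_S is the mean expected heterozygosity within parks and H_T is the expected heterozygosity at the mean allele frequency. This is the simple, uncorrected form with parks weighted equally; packaged estimators typically apply sample-size corrections, so their values will differ somewhat. All function names are hypothetical.

```python
# Genotype counts (AA, AG, GG) from the worked example.
counts = {
    "Park A": (10, 15, 5),
    "Park B": (6, 8, 16),
    "Park C": (12, 10, 8),
}


def allele_freq_a(aa, ag, gg):
    """Frequency of allele A: two copies per AA, one per AG."""
    n = aa + ag + gg
    return (2 * aa + ag) / (2 * n)


def observed_het(aa, ag, gg):
    """Proportion of individuals that are heterozygous."""
    return ag / (aa + ag + gg)


def expected_het(p):
    """Hardy-Weinberg expected heterozygosity, 2pq."""
    return 2 * p * (1 - p)


freqs = {park: allele_freq_a(*c) for park, c in counts.items()}
h_obs = {park: observed_het(*c) for park, c in counts.items()}
h_exp = {park: expected_het(p) for park, p in freqs.items()}

# H_S: mean expected heterozygosity within parks.
h_s = sum(h_exp.values()) / len(h_exp)
# H_T: expected heterozygosity at the mean allele frequency across parks.
p_bar = sum(freqs.values()) / len(freqs)
h_t = expected_het(p_bar)
# Uncorrected F_ST (assumption: no sample-size correction).
f_st = (h_t - h_s) / h_t

for park in counts:
    print(f"{park}: H_O = {h_obs[park]:.3f}, H_E = {h_exp[park]:.3f}")
print(f"F_ST = {f_st:.3f}")
```

For Park A this reproduces the observed heterozygosity of 0.5 and the expected heterozygosity of about 0.486 from step 1.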

B. Location data

Suppose these four gorillas were sampled at the following locations (coordinates in kilometers):

Gorilla	X	Y
G1	0	0
G2	0	10
G3	10	0
G4	10	10

C. Testing for Isolation by Distance

Now we can test whether gorillas that are farther apart geographically are also more genetically different.

What is the Mantel test? The Mantel test formally compares two distance matrices to test whether they are correlated. In our case, one matrix represents genetic distances between individuals and the other represents geographic distances. A positive correlation with a significant p-value suggests that individuals farther apart geographically are also more genetically different, a pattern expected under Isolation by Distance.

The output shows a correlation coefficient (r) and a p-value. If the p-value is small (typically < 0.05), it suggests a significant association between geographic and genetic distances.

Important: The Mantel test is specifically designed for comparing two distance matrices, where each entry represents a pairwise comparison between individuals or populations.
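
For illustration, the logic of a Mantel test can be sketched in a short Python script. The version below is a simplified, hand-rolled permutation test, not the routine used in this lab. It computes Euclidean distances from the gorilla coordinates, correlates them with the mismatch distances from part A, and builds a null distribution by shuffling the individual labels of one matrix. The helper names, the number of permutations, and the random seed are arbitrary choices.

```python
import math
import random
from itertools import combinations

# Coordinates (km) and mismatch distances from the worked example.
coords = [(0, 0), (0, 10), (10, 0), (10, 10)]  # G1..G4
genetic = [
    [0.0, 0.5, 1.0, 0.5],
    [0.5, 0.0, 0.5, 0.0],
    [1.0, 0.5, 0.0, 0.5],
    [0.5, 0.0, 0.5, 0.0],
]
n = len(coords)
geographic = [[math.dist(a, b) for b in coords] for a in coords]


def upper_triangle(matrix, order):
    """Pairwise values above the diagonal, with individuals taken in `order`."""
    return [matrix[order[i]][order[j]] for i, j in combinations(range(n), 2)]


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector has no variance."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0


def mantel(gen, geo, permutations=999, seed=1):
    """One-sided Mantel test: is r larger than expected by chance?"""
    rng = random.Random(seed)
    identity = list(range(n))
    geo_vec = upper_triangle(geo, identity)
    r_obs = pearson(upper_triangle(gen, identity), geo_vec)
    hits = 0
    for _ in range(permutations):
        order = identity[:]
        rng.shuffle(order)  # relabel individuals in the genetic matrix
        if pearson(upper_triangle(gen, order), geo_vec) >= r_obs:
            hits += 1
    # Count the observed value as one of the permutations.
    return r_obs, (hits + 1) / (permutations + 1)


r, p_value = mantel(genetic, geographic)
print(f"Mantel r = {r:.3f}, p = {p_value:.3f}")
```

Rows and columns are shuffled together because the entries of a distance matrix are not independent: each individual appears in several pairs. In this toy example the observed correlation is zero, since the most different pair (G1 and G3) and the identical pair (G2 and G4) are both 10 km apart, and four individuals allow only 24 distinct relabelings. Real analyses need many more samples.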

Materials and methods

File descriptions: In this lab, we focus on two datasets. Neo_fus.vcf contains SNP genotype data, while Neo_fus_locations.csv contains latitude, longitude, and location information for dusky-footed woodrats sampled across California.

Today’s activity on genetic variation and geographic structure is organized into one main exercise that explores how isolation and movement shape genetic diversity among woodrat populations. This exercise will help us apply basic population genetic concepts, visualize genetic structure, and test hypotheses about isolation by distance.

Before we begin analyzing genetic structure, let’s load the dataset. Neo_fus.csv contains SNP genotypes along with latitude, longitude, and location information for dusky-footed woodrats sampled across California. We will use these data to calculate measures of genetic diversity and explore spatial patterns.

1. Import the data

2. Create a genind object

3. Calculate heterozygosity

4. Plot observed and expected heterozygosity

5. Calculate F_ST among populations

Now we want to quantify genetic differentiation among populations using F_ST statistics. F_ST measures how much allele frequencies differ between populations.

Next, let's visualize the variation in F_ST values across SNPs using a bar plot.

Challenge 1. Create a plot that shows F_ST values for each SNP. Hint: use ggplot2::geom_col().

6. Spatial patterns of genetic structure

In the worked example, you thought about how genetic similarity might change with distance. Here, we calculate two types of distances across real populations: geographic distance (based on coordinates) and genetic distance (based on allele frequencies).
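
Geographic distance between sampling sites is usually measured along the Earth's surface rather than as a straight line on a flat map. A common approximation is the haversine (great-circle) formula, sketched here in Python. The coordinates are illustrative city locations, not values from the woodrat dataset. The function name and the mean Earth radius of 6371 km are assumptions of this sketch, and the lab's own workflow may compute distances differently.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Illustrative coordinates only (not from the dataset).
sites = {
    "San Francisco": (37.7749, -122.4194),
    "Los Angeles": (34.0522, -118.2437),
}
d = haversine_km(*sites["San Francisco"], *sites["Los Angeles"])
print(f"{d:.0f} km")
```

Applying the same function to every pair of sampling sites fills in a geographic distance matrix with the same layout as the genetic one.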

Challenge 2. Create a plot comparing genetic and geographic distance. Add a regression line to your scatterplot to help visualize the trend. Hint: use ggplot2::geom_smooth(method = "lm").

Stop and Think: What would you expect the relationship to look like if there is isolation by distance?

Questions:

Does your plot suggest a positive or negative relationship between geographic distance and genetic distance?
What biological processes could create a pattern where genetic distance increases with geographic distance?

Challenge 3. What kind of statistical methods could you use to formally test whether genetic distance increases with geographic distance? Justify the best method and statistically test the relationship.

Discussion questions

What does heterozygosity measure, and how is it calculated from genotype data?
Are the populations genetically distinct? Revisit basic_stats and include an appropriate statistic to support your answer.
What does F_ST quantify statistically? What does a high F_ST indicate about variation among groups?
When using a Mantel test, what does a significant p-value tell you about the association between two distance matrices?

Look Ahead: Next week, we will extend what you learned here to build phylogenetic trees. Instead of simply comparing distances, we will use those distances to infer the evolutionary relationships among populations and species. Start thinking about how patterns of genetic similarity and difference might be shaped by common ancestry over time!

Great work!
