Ch4_html.knit

Today’s investigation. Biologists are interested in understanding patterns in nature and how specific factors shape those patterns. However, it takes more than interest to advance our understanding; it takes knowing how to deal with uncertainty. That is, we need to make the best out of the sample under study to approximate the true population value or parameter of interest. Parameters are fixed values but our estimates from random samples are variables, and thus are characterized by a probability distribution. Today, we will learn how to estimate a sample statistic from random samples and measure the uncertainty - precision - around such statistic using a dataset of rhesus macaques from the CSULB Quantitative Ecology Lab.

Introduction

In this lab, we will estimate sample statistics from probability distributions to approximate population parameters. For this, we will explore the distribution of age at death in Cayo Santiago rhesus macaques during major hurricanes (Figure 1). Cayo Santiago is one of the world’s oldest monkey colonies where rhesus macaques free range from birth to death. Cayo Santiago monkeys are all uniquely identified allowing for demographic data collection. Thus, the Cayo Santiago rhesus macaque population presents unique opportunities to address ecological drivers shaping survival and reproduction.

Today, we will estimate probability distributions of mean age at death in order to explore whether major hurricanes are important drivers of mortality in Cayo Santiago rhesus macaques. For this, we will use data from the CSULB Quantitative Ecology Lab lead by Dr. Raisa Hernández Pacheco. In her study, Dr. Hernández Pacheco and colleagues used 45 years of data and evaluated annual survival and reproductive rates in order to determine if these rates changed during hurricane years (Figure 1; Morcillo et al. 2020). So, let’s explore these changes using probability distributions of age at death to test whether mortality patterns change with the occurrence of an extreme climatic event.

$Figure 1. Tracks of major hurricanes Hugo, Georges, and María affecting Puerto Rico (left panel; Morcillo et al. 2020) and Cayo Santiago rhesus macaques (right panel). Image: Raisa Hernández Pacheco.$

Figure 1. Tracks of major hurricanes Hugo, Georges, and María affecting Puerto Rico (left panel; Morcillo et al. 2020) and Cayo Santiago rhesus macaques (right panel). Image: Raisa Hernández Pacheco.

Upon completion of this lab, you should be able to:

Differentiate between a population distribution and a sampling distribution;
Estimate the standard error (precision) of a sample statistic;
Estimate the probability of an event.

References:

Morcillo et al. 2020

Worked example

As we discussed in Chapter 2, we usually are not able to measure an entire population and get a true population parameter. We rely on a sampling process - ideally with reduced bias and increased measurement accuracy and precision - to instead get a sample statistic (Figure 2). Thus, we need to know how to infer a population parameter from sample data. To get started, let’s differentiate a population parameter and its distribution from a sample statistic and its sampling distribution:

population parameter - a quantity describing a population.
population distribution - the whole set of values defining parameters. The population distribution is characterized by two parameters; the mean or location measurement ($\mu$) and the standard deviation or spread measurement ($\sigma$).
sample statistic - a quantity describing a sample.
sampling distribution - probability distribution of all values for a sample statistic we might obtain when sampling a population. Here the sample mean $\overline{y}$ and standard deviation s are the sample statistics characterizing the sampling distribution.
probability distribution - a list of probabilities of all mutually exclusive outcomes of a random trial.
mutually exclusive events - two events that cannot occur at the same time.

Figure 2. Visual depiction of the population and sampling distributions. Say that we are interested in a monkey trait here represented by color. From this example, the mean population parameter should be close to the blue color. However, when random samples are drawn from the population, a different sample statistics might result just by chance producing a distribution of different values (sampling distribution). Note that the larger the sample size (Sample 3), the closer we can get to the population parameter.

1. Inferring population parameters from random samples

Let’s follow the monkey example, but use real data this time. The Cayo Santiago rhesus macaque population has an extraordinary demographic dataset that includes the age at which each female gave birth to an offspring. Let’s call that the “age at delivery”. Because we have data for all adult females in the population, we can plot the population distribution of age at delivery:

Figure 3. Frequency distribution of age at delivery in Cayo Santiago rhesus macaque females as counts (left panel) and as proportions (right panel).

The histograms above show the age at delivery of all females in the population during the period of 1984 to 2020 (n = 1,486). Because this is the population distribution, the relative frequency of age at delivery (Figure 3, right panel) represents the probability of observing a female giving birth at that particular age when sampling a random female from the population. For example, if we sample a female giving birth from this population at random, there is about 9.8% chance she will have 5 years of age.

From this population distribution we get the following population parameters (n = 1,486):

Name	Parameter	Value (age at delivery in years)
Mean	μ	9.732
Standard deviation	σ	4.482

However, we usually do not have the population distribution. Instead, we take a sample of the population and estimate a sampling distribution. Let’s say we don’t have access to the demographic dataset of females at Cayo Santiago so we decide to go to Cayo Santiago and sample the population instead. Given constrains involving logistics, time and funding, we just sample 500 females at random out of the total 1,486 females in the population represented in Figure 3. By doing this, we get the following sampling distribution in Figure 4:

Figure 4. Frequency distribution of age at delivery in Cayo Santiago rhesus macaque females from a random sample of n = 500 as counts (left panel) and as proportions (right panel).

By comparing the population distribution in Figure 3 (n = 1,486) with the frequency distribution of the random sample in Figure 4 (n = 500), we can see that they are not identical but they share some features. For example, both distributions indicate that most of the females have offspring at younger ages, relative to older ages.

From this sample distribution we get the following sample statistics (n = 500):

Name	Statistic	Sample value (age at delivery in years)
Mean	ȳ	9.696
Standard deviation	s	4.555

By estimating the mean age at delivery from a random sample of 500 females, we see that the sample statistics differ from the population parameters by chance. Given that 500 out of 1,486 females represents a good sample size for field work, our approximation to the true population parameter is good. However, sometimes we cannot yield large sample sizes, thus our ability to describe population parameters is constrained.

Take the following example in Figure 5 where we compare the population distribution to a random sample of n = 500 and n = 50 females:

Figure 5. Age at delivery of Cayo Santiago rhesus macaques females across samples.

2. Estimating the sampling distribution of ȳ

From the random sample n = 500, we estimated the sample statistics ȳ = 9.696 years. If we continue randomly sampling 500 females and estimating the sample statistic ȳ each time, we will get a probability distribution of the sample statistic ȳ, known as the sampling distribution.

With the help of a computer, we can easily draw 500 females randomly from our 1,486 female population 10,000 times and estimate the mean age at delivery ȳ for each of the 10,000 samples. The following histogram in Figure 6 presents the frequency distribution of the 10,000 estimated ȳ:

Figure 6. Sampling distribution of the sample statistic (mean age at delivery) in Cayo Santiago rhesus macaque females from random samples of n = 500.

From this sampling distribution of ȳ, we can see that the sample statistic is a variable (ranges from ~8.5 to 9.8 years) while the parameter of a population is a constant ($\mu=$ 9.732 years). For example, there is ~1% chance that the mean age at delivery estimated from a random sample of 500 females results in ~8.6 years, while there is a much higher chance that the mean age at delivery estimated from a random sample of 500 females results in 9 years. Thus, ȳ varies from sample to sample. From this sampling distribution, we can also see that the distribution of ȳ is centered on the true mean ($\mu$). Therefore, we say that ȳ is an accurate or unbiased estimate of $\mu$.

3. Estimating the standard error of ȳ

The standard deviation of a sampling distribution is called the standard error. The standard error serves as our measure of precision. The smaller the standard error, the more precise and representative the sample will be of the overall population. As you may infer, the standard error decreases as the sample size increases.

The formula for the standard error of ȳ is:

$\sigma_\overline{y}=\frac{\sigma}{\sqrt{n}}$

where $\sigma$ is the standard deviation of the parameter.

For our example,

\[ \begin{aligned} \sigma_\overline{y}&=\frac{4.482}{\sqrt{500}}=0.200\\\\ \end{aligned} \]

However, if we do not have information about the true population parameters, we use s instead of $\sigma$ as an approximation. Let’s estimate it for our random sample (N = 500):

$SE_\overline{y}=\frac{s}{\sqrt{n}}$

\[ \begin{aligned} SE_\overline{y}&=\frac{4.555}{\sqrt{500}}=0.204\\\\ \end{aligned} \]

For our sample mean, we say that the age at delivery of females is 9.696 ± 0.204 years (mean ± SE).

Info-Box! When estimating with uncertainty, that is estimating from a sample, we need to report the precision of our sample statistic ȳ. Results need to include the sample mean and standard error in the following format:

mean ± SE

Note. We are interested in high precision (low SE) but keep in mind that our precision also depends on the type of data collected. In control settings such as laboratories, we might expect relatively low SE. In uncontrolled settings such as nature, SE can be higher as many variables can be working together to affect our data. Thus, the high need for large sample sizes to approach the true population parameter.

4. Estimating probability distributions

We have use proportion and probability interchangeably. This is because the probability of an event is the proportion of times the event would occur if we repeat a random trial many times. Thus, the sampling distribution is a type of probability distribution; it gives us the proportion, or probability, of all values for a sample statistic. Here, values represent the different events and the sample represent the random trials.

As variables come in different types, so do probability distributions. Distributions can be discrete if they are describing a discrete variable. This is the case of all the age distributions describe with histograms above. Here age is taking discrete values. Distributions can also be continuous if they are describing the values a continuous variable can take. In this case, we use a curve whose height represents a probability density, rather than histograms (Figure 7). The probability density tells us the probability of obtaining a value within a certain range.

Let’s plot age at delivery for our random sample n = 500 females, using age as a continuous variable.

Figure 7. Probability density curve of age at delivery of Cayo Santiago rhesus macaque females (n = 500).

For our own interest, we can ignore the quantitative aspect of the y-axis of a probability density distribution, just focus on the shape and relative height of the curve. Rather than asking ourselves what is the probability that a female gives birth at age 15.5, for example, here we ask ourselves what is the probability that a female gives birth within the ages of say 15.5 to 17.1 (area in purple). Think about it, the probability of a particular age in a continuous probability density curve is almost 0 as age can take infinite values. Thus, here we talk about ranges.

5. Estimating the probability of two or more events

Before carrying out today’s activity, let’s make sure we understand the basics of probability. The addition rule states that when two events are mutually exclusive, the probability of getting either of the two events is the sum of the probabilities of each of those events separately. For example, the probability of having a monkey offspring with black or gray hair, assuming these are mutually exclusive, is the sum of the probability of black hair and the probability of gray hair. This is intuitively and widely use in biology. However, we often are interested in more complex patterns. This involves combinations of events and thus, the multiplication of probabilities.

Say that we have the hypothesis that female offspring rhesus require more parental care, and thus they are more “energetically costly” than male offspring, specially for inexperienced females having their first and second offspring. Thus, we want to get insights into how many females undergo this burden early in life. For this, we want to estimate the probability that a female rhesus produces two daughters in a row during her first and second offspring deliveries. For this, we can use probability trees.

According to our rhesus macaque data, 51.5% of offspring are females. Assuming the sex of consecutive offspring are independent events, there is a 0.515 probability to give birth to a daughter every year and 1-0.515 = 0.485 probability to give birth to a son. We follow assumptions of independence and the multiplication rule and build a probability tree:

Figure 8. Probability tree for all possible values of a female having two consecutive births.

From this tree, we can see that there is 26.5% chance of having two consecutive daughters, a relatively higher probability and an extra challenge to our original hypothesis (Figure 8). Of course, many more analyses and extra data is needed to actually answer the research question! But for now, let’s explore how we can infer other Cayo Santiago rhesus macaque demographic rates using these statistical tools!

Materials and Methods

R and RStudio
packages tidyverse and ggplot2
cayo.csv

Today’s activity Primate survival in a tropical island is organized into two main exercises exploring the distribution of age at death in Cayo Santiago rhesus macaque females. These exercises will also motivate inferences about the role of hurricanes on primate mortality.

Primate survival in a tropical island

Research question 1: Does age at death vary across age in females?

1. Import the data Let’s start by importing the “cayo” dataset to RStudio and exploring it. Hint: review past R scripts and don’t forget to use the metadata file for reference.

Questions:

How many variables and observations does “cayo” have?
Considering the research question, what are our variables of interest and what type of variables are they?

2. Infer population parameters from random samples

Let’s first plot the population distribution of age at death for all individuals independent of sex. For this, let’s round age at death to the smallest value using the function floor() and deal with it as a discrete variable.

# loading the package
library(ggplot2)

# plotting counts
p1 <- ggplot(cayo,aes(x=floor(age_at_death))) +
  geom_histogram(binwidth = 1) +
  xlab("Age at death (years)") +
  ylab("Frequency")
p1

# plotting proportions
p2 <- ggplot(cayo,aes(x=floor(age_at_death))) +
   geom_histogram(aes(y=..count../sum(..count..)),binwidth = 1) +
   xlab("Age at death (years)") +
   ylab("Frequency")
p2

Let’s now estimate the population parameters $\mu$ and $\sigma$ for age at death.

# mean age at death
mu <- mean(cayo$age_at_death)
mu

# standard deviation of age at death
sigma <- sd(cayo$age_at_death)
sigma

Questions:

What ages experience more death?
What do $\mu$ and $\sigma$ represent and what are their values?

Now, let’s explore the distribution of age at death for females. But first, do we have information for all females in the population? Let’s check the levels of the variable “sex”.

# checking the levels in the variable sex
levels(as.factor(cayo$sex))

Questions:

What sex categories do we have?
Considering the fact that we have unknown categories, what distribution (population or sampling) can we estimate and what average and spread measure can we get?

Let’s estimate the mean age at death and its standard deviation for females.

# loading packages
library(tidyverse)

# filtering by females
f <- filter(cayo,sex=="f")
f

# mean age at death of females
m_fem <- mean(f$age_at_death)
m_fem

# standard deviation for females
s_fem <- sd(f$age_at_death)
s_fem

Challenge 1. Plot the distribution of age at death for females.

Questions:

What female ages experience more deaths?
Is this pattern different from the entire population distribution?

3. Estimate the sampling distribution

From the random sample above (n = 737), we estimated a mean age at death $\overline{y}=5.64$ years. Because there will always be some uncertainty in the sex variable (some individuals may die before sex can be determined), let’s estimate a sampling distribution using a random sample of n = 500 females. For this, we will employ the function sample() which allows us to randomly sample a dataframe. Because we want to repeat this process 10,000 times, we will use a loop in our coding.

# creating a new object to add the 10,000 sample statistics
r500 <- NULL 

# loop for sampling 500 females at random and estimate the mean age at death
for(i in 1:10000){
  temporarySample <- sample(floor(f$age_at_death), size = 500,
                                   replace = FALSE)
  r500[i] <- mean(temporarySample)
}

r500 <- as.data.frame(r500)

# viewing the new data frame created
View(r500)

# plotting the sampling distribution
p3 <- ggplot(r500,aes(x=r500)) +
  geom_histogram(aes(y=..count../sum(..count..)),binwidth = .1) +
  xlab("Female age at death (years)") +
  ylab("Probability")

p3

Questions:

Is the sampling distribution centered around $\overline{y}$?
What is the relationship between s_fem and the shape of the sampling distribution?
What is your expectation for $\overline{y}$ and $s$ if you re-run the exercise with $n$ = 50?

4. Estimate the standard error

Let’s use the formula in the worked example to estimate the standard error using the function sqrt() to estimate the square root.

# standard error of the mean
se_fem <- s_fem/sqrt(737)
se_fem

Questions

How precise is your sample mean?

5. Plot the probability distribution of the mean age at death

We have used age at death as a discrete variable but let’s learn how to plot its probability density distribution using geom_density().

# probability density distribution for females
p4 <- ggplot(f,aes(age_at_death)) +
   geom_density() +
   xlab("Female age at death (years)") +
    ylab("Probability density")
p4

Questions:

Is there any pattern in the figure? Interpret it “biologically”.

6. Estimate the probability of two or more (survival) events

According to our analysis, a large proportion of females die early in life. In fact, we can estimate the age-specific probability of surviving from birth to a certain age. First, dying and surviving in a particular age are two mutually exclusive events; we either die or survive. Thus, if we know the probability of dying at a particular age, we can define the probability of not dying (i.e., surviving) at such age. Let’s estimate the probability of surviving for the most vulnerable ages; from birth to 2 years of age.

# frequency table of age at death of females
table(floor(f$age_at_death))

# proportion of deaths during the first year of life
m0 <- 249/737
m0

# survival during the first year of life
s0 <- 1-m0
s0

# proportion of deaths during the second year of life
m1 <- 120/(737-249)
m1

# survival during the second year of life
s1 <- 1-m1
s1

# probability of surviving from birth to 2 years of age 
s0*s1

Questions:

Why did you multiply each probability s0 and s1 in the last step?
What is the probability that a female will survive to 2 years of age?

Research question 2: Do hurricanes affect survival from birth to 2 years of age?

At this point, we have knowledge about mortality patterns in female rhesus macaques. Now, let’s explore a potential driver of these patterns.

Stop, Think, Do: Now, it is your turn to explore whether the probability of survival from birth to 2 years of age in females during a hurricane year is different from that of females during a normal (control) year. Stop and review step 6 you just did. Think about how to manage the “cayo” dataframe to obtain the data needed to address the research question (check your past scripts!). Do the analysis following the demonstration codes and answer the research question. Hint: Create two new dataframes; one for treatment == “hurricane” and another for treatment == “control” and follow step 6.

Discussion questions

Mention one example of a population distribution and one of a sampling distribution?
Compare the measurement of precision carried out in Chapter 2 and today’s measurement of precision.
What assumptions do we need to make to apply the multiplication rule in our survival exercise?

Great Work!

Probability distributions

Chapter 4

1. Inferring population parameters from random samples

2. Estimating the sampling distribution of ȳ

3. Estimating the standard error of ȳ

4. Estimating probability distributions

5. Estimating the probability of two or more events