The Sample Variance: Why Divide by n-1?

jbstatistics

27 Jan 2014 06:53

TLDR: The video explains the rationale for dividing by n-1 rather than n in the sample variance formula. Because the population mean is usually unknown, the sample mean is used in its place, and this tends to make the estimate of the population variance too small. Dividing by n-1 corrects for this and gives an unbiased estimator of the population variance, a result that is also linked to the concept of degrees of freedom. The video uses examples to show how estimating the population mean costs one degree of freedom and how that affects the calculation of the sample variance.

Takeaways

📊 The sample variance formula divides by n minus 1, not n, to correct for the underestimation of the population variance when the true mean is unknown.
🔢 n represents the sample size; the sample is used to estimate population parameters such as the mean (mu) and variance (sigma squared).
🌟 The true mean (mu) is often unknown, so the sample mean (x-bar) is used as the best estimate in the calculation of sample variance.
📉 Using the sample mean in place of the true mean gives a sum of squared deviations that is never larger, and usually smaller, than the sum taken about the true mean, which leads to an underestimation of the population variance.
🔄 Dividing by n minus 1 compensates for this underestimation, giving an estimator that equals the population variance on average (an unbiased estimator).
🚫 The script does not delve into why n minus 2, n minus 3, or n minus 0.5 would not be appropriate divisors for unbiased estimation.
📐 The concept of degrees of freedom is introduced as the number of independent values in a calculation that can vary freely.
🌰 Two examples are provided: one where the population mean is known and one where it is estimated from the sample, showing the impact on degrees of freedom.
🔢 In the three-observation example with an unknown population mean, once the sample mean and two observations are known, the third observation and its deviation are fixed, leaving only two degrees of freedom.
📊 The script emphasizes that dividing the sum of squared deviations by the degrees of freedom (n minus 1) rather than the sample size provides a better estimator for the population variance.
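
The underestimation described in the takeaways can be checked with a small simulation. The sketch below is not from the video: the normal population, its standard deviation of 2 (so a true variance of 4), the sample size of 5, and the function name `average_variance_estimates` are all assumptions chosen for illustration.

```python
import random

def average_variance_estimates(n=5, sigma=2.0, trials=50_000, seed=0):
    """Average the two candidate estimators over many simulated samples.

    Hypothetical setup: samples of size n from a normal population with
    mean 0 and standard deviation sigma (true variance sigma ** 2).
    """
    rng = random.Random(seed)
    total_div_n = 0.0
    total_div_n_minus_1 = 0.0
    for _ in range(trials):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        x_bar = sum(sample) / n
        # Sum of squared deviations about the sample mean.
        ss = sum((x - x_bar) ** 2 for x in sample)
        total_div_n += ss / n
        total_div_n_minus_1 += ss / (n - 1)
    return total_div_n / trials, total_div_n_minus_1 / trials

avg_n, avg_n_minus_1 = average_variance_estimates()
print("average when dividing by n:    ", avg_n)
print("average when dividing by n - 1:", avg_n_minus_1)
```

Under these assumptions the first average should land near 3.2, which is (n-1)/n times the true variance, while the second should land near the true variance of 4.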

Q & A

Why do we divide by n minus 1 instead of n in the sample variance formula?
-Dividing by n minus 1 in the sample variance formula compensates for the underestimation of the population variance that occurs when the sample mean is used as an estimate of the population mean. This adjustment, known as Bessel's correction, results in an unbiased estimator of the population variance.
What are the parameters of a population that we often want to estimate with a sample?
-The two parameters of a population that are commonly estimated with a sample are the population mean (mu) and the population variance (Sigma squared).
What would be the ideal estimator for the population variance if the population mean were known?
-The ideal estimator for the population variance, if the population mean (mu) was known, would be the sum of the squared differences between each observation and the true population mean, divided by the sample size (n).
Why can't we use the true mean in the sample variance formula?
-We cannot use the true mean in the sample variance formula because it is usually unknown. Instead, we use the sample mean as the best estimate of the population mean.
What is the effect of substituting the true mean with the sample mean in the variance estimation?
-Substituting the true mean with the sample mean in the variance estimation tends to make the sum of squared differences smaller, leading to an underestimation of the population variance.
How does the concept of degrees of freedom relate to the sample variance?
-The degrees of freedom in the context of the sample variance refers to the number of independent values in the calculation that can vary freely. When estimating the population mean with the sample mean, one degree of freedom is lost, leading to the division by n minus 1 in the sample variance formula.
What happens to the degrees of freedom when we estimate the population mean with the sample mean?
-When we estimate the population mean with the sample mean, we lose one degree of freedom. This is because the deviations from the sample mean must sum to zero. In the video's three-observation example, knowing the sample mean and any two observations therefore determines the third observation.
Why is the sample variance a better estimator when we divide by the degrees of freedom rather than the sample size?
-Dividing by the degrees of freedom, which is n minus 1, provides a better estimator of the population variance because it accounts for the loss of one degree of freedom when estimating the population mean. This adjustment results in an unbiased estimator of the population variance.
What is an unbiased estimator?
-An unbiased estimator is one that, on average, equals the true value of the parameter it is estimating. In the context of the sample variance, dividing by n minus 1 results in an unbiased estimator of the population variance.
How does the sample mean affect the sum of squared deviations from the true mean?
-The sample mean is computed from the observations themselves, so it falls at their center; it is in fact the value that minimizes the sum of squared deviations. As a result, the sum of squared deviations from the sample mean is never larger, and usually smaller, than the sum from the true mean, which leads to an underestimation of the population variance.
Why don't we divide by n minus 2, n minus 3, or n minus 0.5 in the sample variance formula?
-Dividing by n minus 1 specifically results in an unbiased estimator of the population variance. Dividing by other values, such as n minus 2, n minus 3, or n minus 0.5, would not compensate correctly for the bias introduced by estimating the population mean with the sample mean and would not yield an unbiased estimator.
What is the significance of the deviations from the sample mean always summing to zero?
-The deviations from the sample mean (not their squares) always sum to zero, so they are not all free to vary. In a sample of three, once the sample mean and any two observations are known, the third observation is determined and cannot be freely varied. One degree of freedom is therefore lost, which is why the sample variance divides by n minus 1.
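
Two points from the answers above can be checked numerically: deviations from the sample mean sum to zero, and the sum of squared deviations about the sample mean is no larger than the sum about the true mean. The four observations and the "true" mean of 10 below are hypothetical values chosen for illustration, not data from the video.

```python
def sum_of_squares(values, center):
    """Sum of squared deviations of the values about a chosen center."""
    return sum((x - center) ** 2 for x in values)

sample = [4.0, 7.0, 10.0, 15.0]  # hypothetical observations
mu = 10.0                         # hypothetical true population mean
n = len(sample)
x_bar = sum(sample) / n           # sample mean

deviations = [x - x_bar for x in sample]
ss_about_x_bar = sum_of_squares(sample, x_bar)
ss_about_mu = sum_of_squares(sample, mu)

print("sample mean:", x_bar)
print("sum of deviations about x-bar:", sum(deviations))
print("sum of squares about x-bar:", ss_about_x_bar)
print("sum of squares about mu:", ss_about_mu)
# The gap between the two sums is exactly n * (x_bar - mu) ** 2,
# so the sum about x-bar can never exceed the sum about mu.
print("n * (x-bar - mu)^2:", n * (x_bar - mu) ** 2)
```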

Outlines

00:00

📊 Introduction to Sample Variance and Degrees of Freedom

This paragraph introduces the concept of sample variance in the context of an introductory statistics course. It explains why we divide by n minus 1 instead of n when calculating sample variance. The discussion begins with the premise that we often estimate population parameters like mean (mu) and variance (Sigma squared) using sample statistics. The paragraph outlines the process of estimating the population variance using the sum of squared deviations from the sample mean (x-bar) and highlights the issue of underestimation when the true mean (mu) is unknown. It then explains how using n minus 1 as the divisor compensates for this underestimation, providing an unbiased estimator of the population variance. The concept of degrees of freedom is introduced as a way to understand the number of independent values that can vary in a calculation, with an example illustrating how estimating the population mean reduces the degrees of freedom.

05:02

🔢 Degrees of Freedom and Sample Variance Calculation

The second paragraph delves deeper into the concept of degrees of freedom and how it relates to the calculation of sample variance. It contrasts two scenarios for a sample of three observations: one where the population mean (mu) is known and another where it is estimated from the sample data. When mu is known, each observation can vary independently, giving three degrees of freedom. When mu is estimated by the sample mean (x-bar), the observations are no longer completely free, because the deviations from the sample mean must always sum to zero. Once x-bar and two observations are known, the third is fixed, which reduces the degrees of freedom to two. The paragraph emphasizes that dividing by the degrees of freedom (n minus 1) when estimating the population variance leads to a better estimator, as demonstrated in previous mathematical discussions.
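
The loss of a degree of freedom can be sketched for a sample of three. The numbers below are hypothetical (the video's own values are not reproduced here), and `third_observation` is a helper name introduced only for this illustration.

```python
def third_observation(x_bar, x1, x2):
    """Recover the third of three observations from the sample mean.

    With n = 3, x1 + x2 + x3 = 3 * x_bar, so x3 is fixed once
    x_bar, x1 and x2 are known.
    """
    return 3 * x_bar - x1 - x2

x_bar, x1, x2 = 10.0, 8.0, 13.0  # hypothetical values
x3 = third_observation(x_bar, x1, x2)
deviations = [x - x_bar for x in (x1, x2, x3)]

print("third observation:", x3)
print("deviations from x-bar:", deviations)
print("sum of deviations:", sum(deviations))
```

Only two of the three deviations can be chosen freely; the third is whatever value makes the deviations sum to zero.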

Keywords

💡sample variance

Sample variance is a statistical measure that estimates the population variance based on a sample of data. In the video, it is explained that the sample variance is calculated by dividing the sum of the squared differences from the sample mean by the degrees of freedom (n-1), rather than the sample size (n), to compensate for the estimation of the population mean from the sample data. This adjustment ensures that the estimator is unbiased, meaning that on average, it will equal the true population variance.
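
As a rough illustration of the calculation described above, the sketch below computes the sum of squared deviations by hand and compares both divisors with Python's standard `statistics` module, where `statistics.variance` divides by n-1 and `statistics.pvariance` divides by n. The data values are hypothetical.

```python
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical sample
n = len(data)
x_bar = statistics.mean(data)

# Sum of squared deviations about the sample mean.
ss = sum((x - x_bar) ** 2 for x in data)

by_hand_n_minus_1 = ss / (n - 1)  # sample variance (divisor n - 1)
by_hand_n = ss / n                # divisor n

print(by_hand_n_minus_1, statistics.variance(data))
print(by_hand_n, statistics.pvariance(data))
```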

💡population mean (mu)

The population mean, denoted as mu (μ), is the average value of a characteristic for the entire population. In the context of the video, it is noted that the population mean is often unknown, and therefore the sample mean (x̄) is used as an estimate. Substituting the sample mean for the unknown population mean biases the sum of squared deviations downward, which is why the divisor is adjusted to n-1.

💡sample mean (x-bar)

The sample mean, represented as x-bar (x̄), is the average of the observed values in a sample and is used as an estimate for the population mean (mu). In the video, it is explained that because the true population mean is typically unknown, the sample mean is used in the calculation of the sample variance, leading to a potential underestimation of the population variance.

💡population variance (Sigma squared)

The population variance, denoted as Sigma squared (σ^2), is the average of the squared differences from the mean for the entire population. It measures the spread or dispersion of the data points around the population mean. The video emphasizes the importance of estimating the population variance and how the sample variance serves as an estimator for it.

💡degrees of freedom

Degrees of freedom refer to the number of independent values in a data set that are free to vary. In the context of the video, it is used to describe the number of values that can change independently when calculating the sample variance. When the sample mean is used as an estimate for the population mean, one degree of freedom is lost, hence the division by (n-1) instead of n in the sample variance formula.

💡biased estimator

A biased estimator is one whose average value over repeated samples differs from the true parameter value. In the video, it is explained that dividing the sum of squared deviations from the sample mean by the sample size (n) would give a biased estimator of the population variance, because it tends to underestimate the true variance. Dividing by (n-1) instead of n corrects this bias.

💡unbiased estimator

An unbiased estimator is one that, on average, estimates the true parameter value correctly. The video explains that dividing the sum of squared deviations by the degrees of freedom (n-1) instead of the sample size (n) results in an unbiased estimator of the population variance. This means that the average of the sample variances from many samples would equal the true population variance.

💡sum of squared deviations

The sum of squared deviations is the aggregate of the squared differences between each data point and the mean value. This is a key component in the calculation of variance, both for a population and a sample. In the video, it is used to illustrate how the sample variance is derived and why the denominator must be adjusted to (n-1) to achieve an unbiased estimator of the population variance.

💡independent observations

Independent observations refer to data points that are collected in such a way that the value of one does not influence the value of another. In the video, the concept is used to describe the nature of the data drawn from a population when estimating parameters like the mean and variance.

💡estimation

Estimation in statistics involves using sample data to make inferences about population parameters. The video discusses how the sample mean and sample variance are used to estimate the population mean (mu) and population variance (Sigma squared), respectively. The process of estimation is crucial in statistics as it allows us to draw conclusions about a population based on a sample.

💡mathematical proof

A mathematical proof is a logical demonstration that shows a statement or theorem to be true. In the context of the video, it is mentioned that there is a mathematical proof that supports why dividing by (n-1) instead of n results in an unbiased estimator for the population variance. The proof is not shown in the video but is referenced to justify the use of (n-1) in the sample variance formula.
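
For readers who want the outline of such an argument, one standard derivation is sketched below. It is not taken from the video, and it assumes independent observations x_1, ..., x_n with common mean \mu and finite variance \sigma^2.

```latex
% Split each deviation from \mu into a deviation from \bar{x} plus (\bar{x} - \mu);
% the cross term vanishes because \sum_{i=1}^{n} (x_i - \bar{x}) = 0.
\sum_{i=1}^{n} (x_i - \mu)^2
  = \sum_{i=1}^{n} (x_i - \bar{x})^2 + n(\bar{x} - \mu)^2

% Take expectations, using E[(x_i - \mu)^2] = \sigma^2 and
% E[(\bar{x} - \mu)^2] = \operatorname{Var}(\bar{x}) = \sigma^2 / n.
n\sigma^2
  = E\!\left[\sum_{i=1}^{n} (x_i - \bar{x})^2\right] + n \cdot \frac{\sigma^2}{n}

% Rearranging gives the expected sum of squares and hence unbiasedness.
E\!\left[\sum_{i=1}^{n} (x_i - \bar{x})^2\right] = (n - 1)\sigma^2
\quad\Longrightarrow\quad
E\!\left[\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2\right] = \sigma^2
```

The first line also shows why the sum of squares about x-bar can never exceed the sum about mu: the two differ by the non-negative term n(\bar{x} - \mu)^2.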

Highlights

The discussion is pitched at the level of an introductory statistics course for non-majors.

The concept of estimating population parameters with sample statistics is introduced.

The true mean (mu) is usually unknown, which complicates the estimation of population variance (Sigma squared).

The sample mean (x-bar) is used as an estimate for the population mean (mu).

Subtracting the sample mean from each observation tends to underestimate the true population variance.

Dividing by n minus 1 compensates for the underestimation and provides an unbiased estimator of the population variance.

The rationale behind using n minus 1 instead of n is explained through the concept of degrees of freedom.

Degrees of freedom refer to the number of values in a calculation that are free to vary.

In the context of sample variance, degrees of freedom equal the number of independent observations minus one.

A scenario with known population mean illustrates the concept of degrees of freedom.

When the population mean is known, each observation can vary independently.

In contrast, when estimating the population mean, the degrees of freedom are reduced by one.

The deviations from the sample mean always sum to zero.

The third observation's value and deviation are determined by the first two observations and the sample mean.

Dividing the sum of squared deviations by the degrees of freedom (n minus 1) results in a better estimator of population variance.

The mathematical proof of unbiased estimation by dividing by n minus 1 is mentioned but not elaborated in this video.

The video aims to provide motivation for the use of n minus 1 in the sample variance formula.
