Checking assumptions of the linear model

Drew Tyre
11 Aug 2017 09:05

TLDR: The video script discusses the assumptions of linear models and methods to check them, focusing on the examination of residuals. The key assumptions covered include the absence of measurement error in independent variables, independence of errors, normal distribution of errors, constant variance, and linearity of the relationship between X and Y. The script emphasizes the use of residual plots, such as histograms, normal quantile plots, residual vs. fitted values, scale-location plots, and leverage plots, to visually assess these assumptions. It highlights the importance of identifying and addressing deviations, particularly extreme points, to ensure the validity and reliability of the model.

Takeaways
  • The assumptions of a linear model are that independent variables are measured without error, errors are independent, errors are normally distributed with constant variance, and the relationship between X and the average Y is linear.
  • To check these assumptions, one can use graphical methods such as examining plots of residuals, which are the deviations between the actual and predicted values.
  • A histogram of residuals provides a rough visual test for normality, showing whether residuals are concentrated near zero and roughly symmetric around it.
  • A quantile-quantile (Q-Q) plot helps verify whether residuals follow a normal distribution by comparing sample quantiles to theoretical quantiles.
  • A residuals vs. fitted values plot shows whether the points scatter evenly above and below the horizontal zero line, as expected when there is no systematic error.
  • The scale-location plot, which rescales the residuals by their estimated error, can detect departures from the assumption of constant variance.
  • Large outliers in the scale-location plot can indicate points that do not fit the model well and may need further investigation.
  • A plot of standardized residuals against leverage can identify influential points that have a significant impact on the model fit.
  • High leverage points with large residuals are fit poorly by the model and may influence the outcome significantly.
  • Removing extreme points can change the fitted model; in the video's example the fitted line becomes flatter.
  • After removing influential points, the Q-Q plot and the residual vs. independent variable plot can show improvements, indicating a better fit to the data.
Q & A
  • What are the assumptions of a linear model?

    -The assumptions of a linear model include that the independent variables are measured without error, the errors are independent, the errors are normally distributed, have a constant variance, and the relationship between the X variable and the average Y is linear.

  • How can we check if the independent variables are measured without error?

    -The assumption that independent variables are measured without error is almost always violated in practice. Because the violation is so common in real-world data, it is usually ignored rather than checked.

  • What is the role of residual plots in assessing the assumptions of a linear model?

    -Residual plots are used to graphically examine the deviations between the average value of Y predicted by the model and a particular observation. They help in checking if the assumptions of the linear model are met, particularly the normality, constant variance, and linearity of the relationship between X and Y.

  • How can a histogram of residuals help in assessing normality?

    -A histogram of residuals can visually test whether the residuals are normally distributed. If there are more residuals close to zero and it is roughly symmetrical, it suggests that the residuals could be normally distributed, although it is not easy to judge with certainty.

  • What is a quantile-quantile (Q-Q) plot and how does it help in assessing the normality of residuals?

    -A quantile-quantile plot is a graphical tool where the theoretical quantiles are plotted on the x-axis and the sample quantiles for the data points are plotted on the y-axis. If the data follows a normal distribution, the points should fall along a straight line. Deviations from this line, especially in the tails, indicate that the residuals may not be normally distributed.

  • What does a plot of residuals versus fitted values show?

    -A plot of residuals versus fitted values shows the residual (error) on the y-axis and the predicted mean value for each observation on the x-axis. If the assumptions are correct, the points should be scattered evenly above and below a horizontal line, indicating no systematic pattern.

  • What is the significance of a scale-location plot?

    -A scale-location plot is similar to a residuals versus fitted values plot, but the y-axis is rescaled: the residuals are divided by the estimated error in the model, and the square root of their absolute value is plotted. This plot helps in assessing whether the variance is constant. A flat trend with no upward or downward drift across the fitted values indicates that the assumption of constant variance is met.

  • How can we identify influential points in a linear model?

    -Influential points can be identified by examining a plot with standardized residuals on the y-axis and leverage on the x-axis. High leverage points with large residuals indicate that the model fit is significantly impacted by these points, and they may need to be considered for removal.

  • What is the impact of removing extreme points from a linear model?

    -Removing extreme points can change the fitted line of the model. For example, after removing point 141 and three other extreme points from the dataset, the fitted line becomes flatter, indicating a change in the slope and suggesting that the model fit was previously influenced by those points.

  • What is the purpose of a Cook's distance plot?

    -A Cook's distance plot helps identify points that have a large influence on the model. Cook's distance measures how much the fitted model would change if a given observation were left out. Points far out in the corners of the residuals versus leverage plot have large Cook's distances and may be considered for removal, because they could be driving the model fit.

  • How does the removal of influential points affect the quantile-quantile plot?

    -The removal of influential points can significantly improve the quantile-quantile plot. After removing point 141 and three other extreme points, the plot shows a much straighter line, indicating a better fit to the normal distribution assumption.
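
The effect of removing an extreme point, as described in the last few answers, can be illustrated with a small sketch. The data below are made up for illustration and are not the dataset from the video; `fit_line` is a hypothetical helper that fits a one-predictor least-squares line using only the Python standard library.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

# Made-up data: seven points near a gentle line plus one extreme point
# that sits far to the right and far above the others.
x = [1, 2, 3, 4, 5, 6, 7, 15]
y = [2.0, 2.6, 2.9, 3.6, 4.1, 4.4, 5.0, 20.0]

_, slope_all = fit_line(x, y)

# Refit without the last (extreme) observation.
_, slope_trimmed = fit_line(x[:-1], y[:-1])

print(f"slope with all points:     {slope_all:.3f}")
print(f"slope without extreme one: {slope_trimmed:.3f}")
```

In this toy example the single extreme point pulls the line upward, so the refitted line is flatter. Whether dropping a point is justified is a separate judgment that the sketch does not address.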

Outlines
00:00
Linear Model Assumptions and Residual Analysis

This paragraph discusses the assumptions underlying a linear model and the importance of checking whether these assumptions are met. The key assumptions include the absence of measurement error in independent variables, independence of errors, normal distribution of errors, constant variance, and linearity of the relationship between the X variable and the average Y. The speaker emphasizes the common practice of ignoring minor violations of these assumptions in real-world scenarios. The primary method for examining these assumptions is through graphical analysis of residuals, which are the deviations between the predicted Y values and the actual observations. The speaker describes the use of histograms, quantile-quantile plots, and residual versus fitted value plots to assess the normality, variance, and linearity of the data. The paragraph highlights the significance of these diagnostic plots in identifying potential issues with the model, such as extreme values or non-linear relationships.
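
For reference, the assumptions listed above can be written compactly for the one-predictor case. The notation is an assumption of this summary (the video may use different symbols): $\beta_0$ and $\beta_1$ are the intercept and slope, $\varepsilon_i$ is the error for observation $i$, and $e_i$ is the residual that the diagnostic plots examine.

```latex
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i,
\qquad
\varepsilon_i \overset{\text{iid}}{\sim} \mathcal{N}(0, \sigma^2),
\qquad
e_i = y_i - \hat{y}_i = y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)
```

Here "iid" encodes independence, $\mathcal{N}$ encodes normality, the single $\sigma^2$ encodes constant variance, and the form $\beta_0 + \beta_1 x_i$ encodes linearity. The $x_i$ are treated as known without error.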

05:04
Advanced Residual Analysis and Outlier Detection

The second paragraph delves deeper into the analysis of residuals and the detection of outliers. It introduces additional diagnostic plots such as the scale-location plot and leverage plot. The scale-location plot, which uses the absolute value of residuals and scales them by the estimated error, helps detect departures from the assumption of constant variance. The leverage plot, on the other hand, measures the influence of individual observations on the model fit, highlighting points with high leverage and large residuals that may indicate a poor fit and significant impact on the model. The speaker also discusses the concept of Cook's distance, another measure of influence, and the importance of examining points with large Cook's distance. The paragraph concludes with an example of how removing extreme points can alter the fitted model, leading to a flatter and potentially more accurate linear fit, as evidenced by improvements in the quantile-quantile plot and the residual versus independent variable plot.

Keywords
Assumptions
In the context of the video, 'assumptions' refer to the underlying conditions that must be met for a linear model to be valid. These include the error terms being independent, normally distributed, having constant variance, and the relationship between the independent variable (X) and the dependent variable (Y) being linear. The video discusses how to check these assumptions using various plots of residuals, which are the deviations between the observed values and those predicted by the model.
Residuals
Residuals are the differences between the actual observed values of the dependent variable and the values predicted by the model. In the video, residuals are used to graphically examine the validity of the assumptions made for the linear model. By plotting residuals, one can assess whether the assumptions about normality, constant variance, and linearity hold true. For instance, a histogram of residuals can help determine if they are normally distributed, while a plot of residuals versus fitted values can reveal patterns that might suggest non-linear relationships.
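As a minimal sketch of this definition, the snippet below fits a one-predictor line by least squares and computes the fitted values and residuals. The data are made up for illustration, and the function names are hypothetical, not taken from the video.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

def fitted_and_residuals(x, y):
    """Return (fitted values, residuals), where residual = observed - fitted."""
    b0, b1 = fit_line(x, y)
    fitted = [b0 + b1 * xi for xi in x]
    resid = [yi - fi for yi, fi in zip(y, fitted)]
    return fitted, resid

# Made-up illustrative data.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]
fitted, resid = fitted_and_residuals(x, y)
```

With an intercept in the model, least-squares residuals always sum to zero (up to rounding), so it is their pattern and spread, not their average, that the plots examine.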
Quantile-Quantile Plot
A quantile-quantile plot, or Q-Q plot, is a graphical tool used to compare the distribution of a set of data with a reference probability distribution, typically the normal distribution. In the context of the video, a Q-Q plot is used to check the assumption that the residuals from a linear model are normally distributed. If the data points fall approximately along a straight line on the Q-Q plot, it suggests that the residuals are normally distributed. Deviations from this line, especially in the tails of the distribution, can indicate that the residuals are not normal.
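The coordinates behind a normal Q-Q plot can be sketched as follows. The residual values are made up, `qq_points` is a hypothetical name, and the plotting positions `(i - 0.5) / n` are one common convention among several; plotting software may use a slightly different one.

```python
from statistics import NormalDist, mean, stdev

def qq_points(resid):
    """Return (theoretical quantile, sample quantile) pairs for a normal Q-Q plot."""
    n = len(resid)
    m, s = mean(resid), stdev(resid)
    # Sample quantiles: the standardized residuals in increasing order.
    sample = sorted((r - m) / s for r in resid)
    # Theoretical quantiles of the standard normal at positions (i - 0.5) / n.
    std_normal = NormalDist()
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, sample))

# Made-up residuals for illustration.
resid = [-1.2, -0.4, -0.1, 0.0, 0.3, 0.5, 0.9]
points = qq_points(resid)
```

If the residuals are close to normal, the pairs in `points` lie near the line y = x; points that bend away at either end correspond to the tail deviations described above.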
Normal Distribution
A normal distribution, also known as Gaussian distribution, is a probability distribution that is symmetric around its mean and characterized by its bell shape. In the context of the video, the normal distribution is important because it is one of the key assumptions for the linear model. The residuals of the model are expected to be normally distributed, which can be assessed using graphical tools like the Q-Q plot.
Constant Variance
Constant variance, also known as homoscedasticity, is an assumption of linear models where the variance of the residuals is the same across all levels of the independent variable(s). This means that the spread or dispersion of the error terms is uniform and does not change as the predicted values change. The video discusses the importance of this assumption and how to check for it using plots such as the scale-location plot, which can reveal patterns that suggest the presence of heteroscedasticity (the opposite of constant variance).
Linear Relationship
A linear relationship is one in which the relationship between two variables can be described by a straight line. In the context of the video, the assumption of a linear relationship means that the average value of the dependent variable (Y) changes by a constant amount for each unit change in the independent variable (X). This is a fundamental assumption of linear models, and the video discusses how to assess it using plots of residuals against the fitted values or the independent variable, looking for systematic patterns that might suggest a non-linear relationship.
Fitted Values
Fitted values are the predicted values of the dependent variable (Y) obtained from a statistical model, in this case, a linear model. They represent what the model estimates the response variable to be for each level of the independent variable(s). In the video, the fitted values are used as a reference when examining plots of residuals to assess the validity of the model's assumptions, such as constant variance and linearity of the relationship.
Leverage
Leverage in regression analysis measures how much potential a particular data point has to influence the fit of the model. High leverage points are those that fall far from the mean of the independent variable(s) and can potentially have a large impact on the regression coefficients. In the video, leverage is plotted against the standardized residuals to identify observations that might be exerting undue influence on the model fit.
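For a model with a single predictor, leverage has a simple closed form, h_i = 1/n + (x_i - mean(x))^2 / Sxx, which depends only on the x values. The sketch below uses that formula on made-up data; the function name is hypothetical.

```python
from statistics import mean

def leverage(x):
    """Leverage h_i of each observation in a one-predictor linear model."""
    n = len(x)
    x_bar = mean(x)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]

# Made-up x values: the last one lies far from the others.
x = [1, 2, 3, 4, 5, 20]
h = leverage(x)
highest = max(range(len(x)), key=h.__getitem__)  # index of the highest-leverage point
```

The leverages always sum to the number of fitted coefficients (two here, intercept and slope), so one point with leverage near 1 is taking up a large share of the total.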
Cook's Distance
Cook's Distance is a measure of the influence of each data point in a regression analysis. It quantifies how much the regression coefficients would change if a particular observation were removed from the dataset. A large Cook's Distance indicates that the observation has a significant influence on the model estimates. In the video, Cook's Distance is used as another measure to identify influential points that might need to be considered for removal to improve the model's assumptions and fit.
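One standard way to compute Cook's distance combines the standardized residual r_i and the leverage h_i as D_i = (r_i^2 / p) * h_i / (1 - h_i), where p is the number of coefficients. The sketch below applies this to a one-predictor model on made-up data; the function name is hypothetical and the data are not from the video.

```python
from statistics import mean

def cooks_distance(x, y):
    """Cook's distance for each point of a one-predictor least-squares fit."""
    n, p = len(x), 2  # p = number of coefficients (intercept and slope)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    s2 = sum(e ** 2 for e in resid) / (n - p)  # residual variance estimate
    lev = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    # D_i = (squared standardized residual) / p * h_i / (1 - h_i)
    return [(e ** 2 / (s2 * (1 - h))) / p * h / (1 - h) for e, h in zip(resid, lev)]

# Made-up data: the last point has high leverage and falls well off the trend.
x = [1, 2, 3, 4, 5, 6, 7, 15]
y = [1.1, 1.9, 3.2, 3.8, 5.1, 6.0, 6.9, 5.0]
d = cooks_distance(x, y)
most_influential = max(range(len(d)), key=d.__getitem__)
```

A point needs both a sizeable residual and sizeable leverage to get a large distance, which is why the measure is read alongside the residuals versus leverage plot.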
Scale-Location Plot
A scale-location plot is a diagnostic tool used in regression analysis to assess the assumption of constant variance, or homoscedasticity. In this plot, the y-axis shows the square root of the absolute value of the residuals after they have been rescaled by the estimated error. The x-axis shows the fitted values, as in the residuals versus fitted values plot. Ideally, the points form a roughly horizontal band with no trend, indicating constant variance. An upward or downward trend in the scale-location plot suggests that the variance is not constant, pointing to heteroscedasticity.
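The y-values of such a plot can be sketched as below for a one-predictor model. This assumes that "rescaled by the estimated error" means the usual standardized residual, e_i / (s * sqrt(1 - h_i)); the data are made up and the function name is hypothetical.

```python
from math import sqrt
from statistics import mean

def scale_location(x, y):
    """Return (fitted values, sqrt(|standardized residual|)) for a one-predictor fit."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    fitted = [b0 + b1 * xi for xi in x]
    resid = [yi - fi for yi, fi in zip(y, fitted)]
    s = sqrt(sum(e ** 2 for e in resid) / (n - 2))  # residual standard error
    lev = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    std_resid = [e / (s * sqrt(1 - h)) for e, h in zip(resid, lev)]
    return fitted, [sqrt(abs(r)) for r in std_resid]

# Made-up data whose scatter widens as x grows.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.0, 2.1, 2.9, 4.3, 4.5, 6.9, 6.0, 9.5]
fitted, sl = scale_location(x, y)
```

Plotting `sl` against `fitted` and looking for a trend gives the check described above; taking the square root of the absolute value folds positive and negative residuals together so that a change in spread shows up as a change in level.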
Influential Points
Influential points are data points that have a significant impact on the results of a statistical analysis, particularly on the estimates of the regression coefficients in a linear model. These points can affect the fit of the model and the interpretation of the results. In the video, influential points are identified using measures such as leverage and Cook's Distance, and their potential removal is discussed as a way to improve the model's assumptions and overall fit.
Highlights

Assumptions of the linear model are crucial for its validity and need to be checked.

Independent variables on the x-axis are assumed to be measured without error, which is often violated in practice.

Errors are assumed to be independent, meaning one observation's error does not inform about the next observation's error.

Residuals, the deviations between the predicted and actual values, are assumed to be normally distributed.

The variance of residuals is assumed to be constant across all levels of the independent variable.

The relationship between the X variable and the average Y is assumed to be linear.

Graphical examination of plots is a common method to check the assumptions of the linear model.

Histograms of residuals can provide a visual test for normal distribution.

Quantile-quantile (Q-Q) plots are used to check if residuals follow a normal distribution.

Residuals versus fitted values plot can reveal if the assumptions hold true.

Scale-location plot can detect departures from the assumption of constant variance.

Systematic deviation in residuals may indicate a nonlinear relationship between X and Y.

Leverage plot helps identify observations that have a large impact on the model fit.

High leverage points with large residuals suggest a poor fit by the model and significant influence on the result.

Cook's distance is another measure to assess the influence of individual observations on the model fit.

Removing extreme points can improve the fit and assumptions of the linear model.

After removing influential points, the Q-Q plot and residual versus independent variable plot can show significant improvement.
