Correlations and Covariance in R with Example | R Tutorial 4.12 | MarinStatsLectures

MarinStatsLectures-R Programming & Statistics
30 Sept 2013 · 06:36
Educational, Learning

TLDR: In this educational video, Mike Marin explains how to calculate Pearson's, Spearman's, and Kendall's correlation coefficients and covariance using the R programming language. He demonstrates these calculations with the Lung Capacity dataset, guiding viewers through scatterplot creation, correlation analysis, and hypothesis testing. The video also covers confidence intervals, handling ties in data, and creating correlation and covariance matrices, with a focus on numeric variables.

Takeaways
  • The video discusses calculating correlation and covariance using the R programming language, focusing on different types of correlation measures.
  • It explains the difference between Pearson's, Spearman's, and Kendall's correlation, highlighting that Pearson's is a parametric measure while Spearman's and Kendall's are nonparametric.
  • The video uses the Lung Capacity dataset to demonstrate how to explore the relationship between the Age and Lung Capacity variables.
  • It shows how to use the 'cor', 'cov', and 'cor.test' commands/functions in R for calculating correlation and covariance.
  • The script provides guidance on accessing help menus in R, suggesting the use of 'help' or '?' for command assistance.
  • A scatterplot is created using the 'plot' command to visualize the relationship between Age and Lung Capacity, indicating a positive association.
  • The 'cor' function is used to calculate Pearson's correlation with the 'method' argument set to 'pearson', and it's noted that the order of variables does not affect the result.
  • For nonparametric measures, the 'method' argument can be set to 'spearman' for Spearman's correlation or 'kendall' (lowercase) for Kendall's rank correlation.
  • The 'cor.test' function is introduced to calculate a confidence interval for the correlation and to test the hypothesis that the correlation is equal to zero, including handling ties with the 'exact' argument.
  • The video also covers changing the alternative hypothesis and confidence level in the correlation test using the 'alternative' argument (written as 'alt' in the video, which R accepts through partial argument matching) and the 'conf.level' argument.
  • The 'cov' command is mentioned for calculating covariance, and the 'pairs' command is used to produce all possible pair-wise plots, with a focus on numeric variables.
  • A correlation matrix can be produced for numeric variables using the 'cor' function, and the video explains how to subset data to avoid errors with categorical variables.
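
The 'cor' calls described in the takeaways can be sketched as follows. This is a minimal illustration, not the video's own script: the Lung Capacity data is not bundled with R, so the vectors `Age` and `LungCap` below are simulated stand-ins with made-up values.

```r
# Simulated stand-in for the Lung Capacity data (values are invented
# for illustration; they are not the dataset used in the video).
set.seed(1)
Age <- sample(3:19, 200, replace = TRUE)
LungCap <- 1 + 0.5 * Age + rnorm(200, sd = 1.5)

# Pearson's correlation; "pearson" is also the default method
r_pearson <- cor(Age, LungCap, method = "pearson")

# Swapping the order of the variables gives the same value
r_swapped <- cor(LungCap, Age)

# Nonparametric alternatives; the method names are lowercase
r_spearman <- cor(Age, LungCap, method = "spearman")
r_kendall <- cor(Age, LungCap, method = "kendall")

print(c(pearson = r_pearson, spearman = r_spearman, kendall = r_kendall))
```

On real data the three coefficients generally differ, since each measures association in a different way.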
Q & A
  • What is the main topic of the video by Mike Marin?

    -The main topic of the video is calculating correlation and covariance using the R programming language, specifically focusing on Pearson's, Spearman's, and Kendall's rank correlation measures.

  • What is Pearson's correlation in statistics?

    -Pearson's correlation is a parametric measure of the linear association between two numeric variables.

  • What are Spearman's and Kendall's rank correlations?

    -Spearman's rank correlation is a nonparametric measure of the monotonic association between two numeric variables, while Kendall's rank correlation is another nonparametric measure based on concordance or discordance of x-y pairs.

  • What dataset does Mike Marin use in his video?

    -Mike Marin uses the Lung Capacity dataset in the video to demonstrate the calculations.

  • How does one access help menus in R for specific commands or functions?

    -To access help menus in R, you can type 'help' followed by the command name in parentheses, for example help(cor), or simply place a question mark (?) in front of the command/function, for example ?cor.

  • What command in R is used to produce a scatterplot?

    -The 'plot' command in R is used to produce a scatterplot, where you can specify variables for the x and y axes.

  • How can one calculate the correlation between Age and Lung Capacity using the 'cor' function in R?

    -You can calculate the correlation between Age and Lung Capacity using the 'cor' function in R by setting the 'method' argument to 'pearson' or leaving it out as it is the default.

  • What does the 'cor.test' function in R provide for Pearson's correlation?

    -The 'cor.test' function in R provides the estimate of the correlation, a 95% confidence interval for the correlation, the test statistic, and the p-value for the hypothesis that the correlation is equal to zero.

  • How can the 'pairs' command in R be used to produce all possible pair-wise plots for a dataset?

    -The 'pairs' command in R can be used by passing the dataset name as an argument to produce all possible pair-wise plots. For numeric variables only, you can subset the data to specific columns.

  • What is the issue with calculating a correlation matrix for the entire LungCap dataset in R?

    -The issue is that R will not calculate a correlation for categorical variables or factors, which are present in the LungCap dataset.

  • What command can be used to calculate the covariance between Age and Lung Capacity in R?

    -The 'cov' command can be used to calculate the covariance between Age and Lung Capacity in R.

  • How can one change the alternative hypothesis in the 'cor.test' function in R?

    -You can change the alternative hypothesis in the 'cor.test' function by using the 'alternative' argument (abbreviated to 'alt' in the video) and setting it to 'greater' or 'less' for one-sided tests, or leaving it as the default, 'two.sided', for a two-sided test.

  • What is the purpose of the 'exact' argument in the 'cor.test' function when there are ties in the data?

    -The 'exact' argument in the 'cor.test' function specifies whether R should compute an exact p-value. With ties in the data an exact p-value cannot be computed, so setting 'exact = FALSE' tells R to approximate the p-value instead.

  • How can one produce a correlation matrix for only numeric variables in R?

    -To produce a correlation matrix for only numeric variables in R, you can subset the data to include only the numeric columns and then use the 'cor' function with the 'method' argument set to the desired correlation type.
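
The last few answers can be illustrated with a short sketch. The data frame `LungCapData` here is simulated, and its column names (`LungCap`, `Age`, `Height`, `Smoke`) are assumptions made for this example. Selecting columns with `is.numeric` is one possible way to subset; the video may select columns differently.

```r
# Simulated stand-in data frame with one categorical column
# (all values and column names are assumptions for illustration).
set.seed(3)
n <- 100
Age <- sample(3:19, n, replace = TRUE)
Height <- 40 + 1.5 * Age + rnorm(n, sd = 3)
LungCap <- -5 + 0.1 * Height + 0.2 * Age + rnorm(n, sd = 1)
LungCapData <- data.frame(
  LungCap = LungCap,
  Age = Age,
  Height = Height,
  Smoke = factor(sample(c("no", "yes"), n, replace = TRUE))
)

# cor() on the whole data frame fails because Smoke is a factor;
# tryCatch() captures the error message instead of stopping the script
full_attempt <- tryCatch(cor(LungCapData),
                         error = function(e) conditionMessage(e))
print(full_attempt)

# Keep only the numeric columns, then compute the correlation matrix
numeric_cols <- sapply(LungCapData, is.numeric)
cor_matrix <- cor(LungCapData[, numeric_cols], method = "spearman")
print(round(cor_matrix, 2))
```

The resulting matrix is symmetric with ones on the diagonal, because each variable is perfectly correlated with itself.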

Outlines
00:00
Introduction to Correlation and Covariance in R

In this segment, Mike Marin introduces the concepts of Pearson, Spearman, and Kendall rank correlations, which are statistical measures used to assess the linear or monotonic association between two numeric variables. He also explains how to use R programming language to calculate these correlations and covariance using the 'cor', 'cov', and 'cor.test' functions. The data set used for demonstration is the Lung Capacity data, and the focus is on the relationship between Age and Lung Capacity. The video also covers how to create a scatterplot using the 'plot' function in R, and how to access help menus for these commands. A step-by-step guide on calculating Pearson's correlation with the 'method' argument set to 'pearson' is provided, along with how to calculate Spearman's and Kendall's correlations by setting the 'method' argument accordingly. The segment concludes with a discussion on hypothesis testing for the correlation being equal to zero using 'cor.test', including handling ties in the data and adjusting the confidence interval and alternative hypothesis.
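
The hypothesis-testing workflow from this segment might look like the sketch below. The data is simulated (the `Age` and `LungCap` values are invented), and the 99% confidence level is an arbitrary choice for illustration. The plot is drawn only in an interactive session so that the script also runs in batch mode.

```r
# Simulated stand-in for the Age and Lung Capacity variables
set.seed(2)
Age <- sample(3:19, 150, replace = TRUE)
LungCap <- 1 + 0.5 * Age + rnorm(150, sd = 1.5)

# Scatterplot of Lung Capacity against Age
if (interactive()) {
  plot(Age, LungCap, main = "Lung Capacity vs Age")
}

# Pearson: estimate, 95% confidence interval, and p-value for
# the null hypothesis that the correlation equals zero
res <- cor.test(Age, LungCap, method = "pearson")
print(res$estimate)
print(res$conf.int)
print(res$p.value)

# One-sided alternative with a 99% confidence level
res_greater <- cor.test(Age, LungCap,
                        alternative = "greater", conf.level = 0.99)
print(res_greater$conf.int)

# Spearman with tied values: exact = FALSE requests an approximate p-value
res_spearman <- cor.test(Age, LungCap, method = "spearman", exact = FALSE)
print(res_spearman$p.value)
```

Note that `cor.test` reports a confidence interval only for Pearson's correlation, so `res_spearman$conf.int` is `NULL`.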

05:01
Advanced Correlation Analysis and Data Visualization in R

This paragraph delves deeper into advanced correlation analysis and data visualization techniques in R. Mike Marin discusses how to handle ties in the data when calculating Spearman's correlation and the limitations of nonparametric confidence intervals. He also explains how to adjust the alternative hypothesis and confidence level in the correlation test using the 'alt' and 'conf.level' arguments. The concept of covariance is briefly introduced, and its calculation is demonstrated using the 'cov' function. Furthermore, the video script covers the creation of a correlation matrix for numeric variables using the 'cor' function, and the error handling when attempting to calculate correlations for categorical variables. The 'pairs' command is introduced for generating pair-wise plots, including scatterplots, and the importance of subsetting data for appropriate visualization is emphasized. The segment ends with a preview of the next video in the series, which will focus on fitting a simple linear regression in R.
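
A rough sketch of the covariance and pair-wise plotting steps follows. The data frame `lung` and its columns are simulated stand-ins, and the column positions 1 to 3 are an assumption of this example, not necessarily those of the real dataset.

```r
# Simulated stand-in data: three numeric columns and one factor
set.seed(4)
n <- 100
Age <- sample(3:19, n, replace = TRUE)
Height <- 40 + 1.5 * Age + rnorm(n, sd = 3)
LungCap <- -5 + 0.1 * Height + 0.2 * Age + rnorm(n, sd = 1)
lung <- data.frame(LungCap, Age, Height,
                   Gender = factor(sample(c("female", "male"), n,
                                          replace = TRUE)))

# Covariance between two variables
cov_age_lung <- cov(lung$Age, lung$LungCap)
print(cov_age_lung)

# Covariance matrix for the numeric columns only
cov_matrix <- cov(lung[, 1:3])
print(round(cov_matrix, 2))

# All pair-wise scatterplots for the numeric columns
if (interactive()) {
  pairs(lung[, 1:3])
}
```

The diagonal of the covariance matrix holds each variable's variance, and the off-diagonal entries match the two-variable `cov` calls.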

Keywords
Correlation
Correlation is a statistical measure that expresses the strength and direction of the association between two variables. In the video, Mike Marin discusses calculating different types of correlation using the R programming language, specifically Pearson's, Spearman's, and Kendall's rank correlation, to understand the relationship between variables like Age and Lung Capacity.
Covariance
Covariance is a measure that quantifies the joint variability of two random variables. It is used to understand how much two variables change together. In the script, Mike explains that while covariance is less commonly of interest, it can be calculated using the 'cov' command in R to measure the relationship between numeric variables such as Age and Lung Capacity.
Pearson's correlation
Pearson's correlation is a parametric measure that assesses the linear association between two numeric variables. It is mentioned in the script as the default method for calculating correlation in R, where Mike demonstrates its use to find the correlation between Age and Lung Capacity, indicating a positive association.
Spearman's rank correlation
Spearman's rank correlation is a nonparametric measure that evaluates the monotonic relationship between two variables. Mike Marin explains in the video that this type of correlation can be calculated in R by setting the 'method' argument to 'spearman', which is useful when the data does not meet the assumptions of parametric tests.
Kendall's rank correlation
Kendall's rank correlation is another nonparametric measure of association that is based on the concordance or discordance of pairs of data points. The script mentions that this can be calculated in R by setting the 'method' argument to 'kendall', providing an alternative to Pearson's and Spearman's correlation for certain types of data.
Scatterplot
A scatterplot is a type of plot that uses Cartesian coordinates to display values for two variables for a set of data. In the video, Mike uses the 'plot' command in R to create a scatterplot of Lung Capacity versus Age, which visually demonstrates the positive association between these two variables.
R programming language
R is a programming language and environment commonly used for statistical computing and graphics. Throughout the script, Mike Marin uses R to demonstrate various statistical analyses, including calculating correlations, covariance, and creating plots to visualize data.
cor.test
The 'cor.test' function in R is used to perform a hypothesis test on the correlation coefficient. Mike Marin explains how to use this function to test the null hypothesis that the correlation is equal to zero and to obtain a confidence interval for the correlation.
Confidence interval
A confidence interval provides a range of values that are likely to contain a population parameter with a certain level of confidence. In the script, Mike demonstrates how to calculate a 95% confidence interval for the correlation coefficient using the 'cor.test' function in R.
Pairs plot
A pairs plot is a grid of scatterplots that display the relationships between all pairs of variables in a dataset. Mike Marin mentions using the 'pairs' command in R to produce all possible pair-wise plots for the Lung Capacity data, illustrating the relationships between numeric variables.
Correlation matrix
A correlation matrix is a table that shows the correlation coefficients between pairs of variables in a dataset. In the script, Mike attempts to calculate a correlation matrix for the LungCap data in R, but encounters an error due to the presence of categorical variables: 'cor' requires numeric input, so the data must first be subset to its numeric columns.
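
The definitions above are linked by two standard identities, which can be checked numerically. Pearson's correlation is the covariance divided by the product of the two standard deviations, and Spearman's correlation is Pearson's correlation computed on the ranks. The vectors `x` and `y` below are simulated and purely illustrative.

```r
# Simulated data with tied x values, for illustration only
set.seed(5)
x <- sample(3:19, 80, replace = TRUE)
y <- 1 + 0.5 * x + rnorm(80, sd = 1.5)

# Pearson's correlation rebuilt from covariance and standard deviations
pearson_by_hand <- cov(x, y) / (sd(x) * sd(y))

# Spearman's correlation rebuilt as Pearson's correlation of the ranks
spearman_by_hand <- cor(rank(x), rank(y))

print(all.equal(pearson_by_hand, cor(x, y)))
print(all.equal(spearman_by_hand, cor(x, y, method = "spearman")))
```

This also shows why correlation is unit-free while covariance depends on the measurement scale of both variables.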
Highlights

Introduction to calculating correlation and covariance using R programming language.

Explanation of Pearson's, Spearman's, and Kendall's rank correlation measures.

Use of the Lung Capacity dataset for demonstrating statistical analysis.

Importing and attaching data in R for analysis.

Utilizing 'cor', 'cov', and 'cor.test' commands/functions in R.

Accessing help menus in R for command assistance.

Creating a scatterplot to visualize the relationship between Age and Lung Capacity.

Calculating Pearson's correlation with the 'cor' function in R.

Calculating Spearman's and Kendall's rank correlations in R.

Using 'cor.test' for hypothesis testing and confidence intervals.

Handling ties in data with the 'exact' argument in R.

Modifying the alternative hypothesis and confidence level in correlation tests.

Calculating covariance between Age and Lung Capacity using the 'cov' command.

Generating all possible pair-wise plots with the 'pairs' command.

Subsetting data for pair-wise plots to exclude categorical variables.

Creating a correlation matrix for numeric variables in the dataset.

Producing a covariance matrix in R for the dataset.

Preview of the next video: fitting a simple linear regression in R.

Encouragement to subscribe to the marinstatslectures channel for more R programming and statistics videos.
