StatQuest: Random Forests Part 2: Missing data and clustering

StatQuest with Josh Starmer

15 Jan 2020 · 11:53

Educational, Learning

TLDR: In this StatQuest episode, host Josh Starmer explores random forests, focusing on handling missing data and sample clustering. He demonstrates how to make an initial guess for missing values in patient data, such as blocked arteries and weight, using the most common value and the median, and then how to refine those guesses. The video delves into building a proximity matrix to determine sample similarity and iteratively improving guesses by leveraging the forest's trees. Josh also highlights the creative use of proximity matrices for visualizing data relationships through heat maps and MDS plots. Finally, he addresses classifying new samples with missing data by comparing guesses against the forest's predictions.

Takeaways

🌳 The video is part of a series on random forests, focusing on handling missing data and sample clustering.
🔍 Random forests can deal with two types of missing data: missing data in the original dataset used to create the forest and missing data in new samples to be categorized.
📊 The initial approach to missing data is to make an educated guess and then refine it iteratively.
🏥 For categorical missing data, the initial guess is the most common value among the other samples with the same class label (e.g., patients without heart disease).
📈 For numeric data, the initial guess is the median value of the other samples with the same class label.
🌐 The similarity between samples is determined by how often they end up in the same leaf node across multiple trees in the forest.
🗂 A proximity matrix tracks which samples are similar: each time two samples end up in the same leaf node of a tree, 1 is added to their entry in the matrix.
🔄 The process of refining guesses involves recalculating proximities and missing values multiple times until they converge.
📊 Proximity values can be used to create a distance matrix, which can be visualized through a heat map or an MDS plot to show sample relationships.
🎯 For new samples with missing data, a copy of the sample is made for each possible class, the iterative method fills in the missing values in each copy, and the copy that the forest labels correctly most often is chosen.
🎉 The video concludes with a call to action for viewers to subscribe and support the channel through various means.

Q & A

What is the main topic of the video 'Random Forests Part Two'?
-The main topic of the video is dealing with missing data and sample clustering in the context of random forests.
What are the two types of missing data that random forests consider?
-Random forests consider missing data in the original dataset used to create the random forest and missing data in a new sample that we want to categorize.
How does the video suggest dealing with missing data in the creation of a random forest?
-The video suggests making an initial guess for the missing data, which could be bad, and then gradually refining the guess until it becomes a good guess.
What is the initial guess for the blocked arteries value in the case of patient number four?
-The initial guess for the blocked arteries value is 'No', which is the most common value for blocked arteries found in other samples that do not have heart disease.
How is the initial guess for the weight of patient number four determined?
-The initial guess for the weight is the median value of the patients that did not have heart disease, which in this case is 167.5.
What is a proximity matrix in the context of random forests?
-A proximity matrix is used to keep track of similar samples in random forests. It has a row and a column for each sample, and entries are filled based on which samples end up in the same leaf node.
How does the video describe the process of refining guesses for missing data?
-The process involves determining which samples are similar to the one with missing data by running all data down all the trees, updating the proximity matrix, and using the proximity values to make better guesses about the missing data.
What does the video suggest for revising the guess for blocked arteries based on the proximity matrix?
-The video suggests using the weighted frequency of 'yes' and 'no' based on proximity values as weights, and choosing the option with the higher weighted frequency, which in this case is 'no'.
How is the weighted average for the missing weight calculated in the video?
-The weighted average for the missing weight is calculated using the proximities to determine the weights for each sample's weight, and then averaging these weighted values.
What is the iterative method described in the video for refining guesses?
-The iterative method involves building a random forest, running the data through the trees, recalculating the proximities, and recalculating the missing values multiple times until they converge and no longer change.
How can the proximity matrix be used beyond just refining guesses for missing data?
-The proximity matrix can be used to create a distance matrix, which can then be visualized through a heat map or an MDS (Multidimensional Scaling) plot to show how the samples are related to each other.
What is the second method for dealing with missing data when classifying a new sample?
-The second method involves creating two copies of the new sample, one for each possible heart disease status, filling in the missing values in each copy with the iterative method, running both copies through the existing random forest, and choosing the copy that the forest labels correctly more often.

Outlines

00:00

🌳 Random Forests and Missing Data Handling

This paragraph introduces the topic of the video: handling missing data and sample clustering in the context of random forests. The speaker, Josh Starmer, expresses his enthusiasm for the sample clustering aspect of random forests. The data set consists of information from four patients, with patient number four having missing data. The video addresses two types of missing data: missing values in the original data set used to create the random forest, and missing values in a new sample to be categorized. The method for dealing with missing data involves making an initial guess and then refining it iteratively. For example, if a patient's blocked arteries status is unknown, the initial guess is the most common value among the other samples with the same heart disease status (here, patients without heart disease). For numeric data like weight, the median value from those same samples is used as the initial guess. Refining the guesses involves building a random forest, running all of the data down all of the trees, and recording in a proximity matrix which samples end up in the same leaf node. The matrix is updated after each tree, and the resulting proximity values are used to make better guesses about the missing data.

05:04

🔍 Refining Guesses and Proximity Matrix Analysis

The second paragraph delves into the process of refining initial guesses for missing data using the proximity matrix. The weighted frequency for categorical data like 'blocked arteries' is calculated using the proximity values as weights. For example, if 'no' for blocked arteries occurs more frequently in similar samples, it will have a higher weighted frequency and is chosen as the revised guess. For numeric data like weight, a weighted average is calculated based on the proximity values. The paragraph also explains how the proximity matrix can be transformed into a distance matrix, enabling the creation of a heat map or a Multidimensional Scaling (MDS) plot to visualize the relationship between samples. This demonstrates the versatility of random forests in analyzing various types of data. The iterative process of refining guesses is repeated several times until the missing values stabilize.

10:04

🏥 Classifying New Samples with Missing Data

The final paragraph discusses classifying a new sample that has missing data. The scenario involves a pre-built random forest and a new patient whose heart disease status is unknown and whose 'blocked arteries' value is missing. The process starts by creating two copies of the sample, one assuming the patient has heart disease and the other assuming the patient does not. The iterative method described earlier is then used to guess the missing value in each copy. Both copies are run down the trees in the forest to see which one the random forest labels correctly more often. The copy that is correctly labeled more often wins, which fills in the missing data and classifies the new sample at the same time. The video concludes with a call to action for viewers to subscribe and support the channel through various means.
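The selection step for a new sample can be sketched in a few lines of Python. This is only an illustration: the per-tree predictions below are made-up values standing in for the output of a real forest, and `count_correct_votes` is a hypothetical helper name.

```python
def count_correct_votes(tree_predictions, assumed_label):
    """Count the trees whose prediction matches the label assumed for this copy."""
    return sum(1 for prediction in tree_predictions if prediction == assumed_label)

# Hypothetical per-tree predictions for the two copies of the new patient.
# Key = heart disease status assumed for that copy (missing value already filled in).
predictions_per_copy = {
    "Yes": ["Yes", "Yes", "Yes"],
    "No": ["Yes", "No", "Yes"],
}

correct = {
    label: count_correct_votes(predictions, label)
    for label, predictions in predictions_per_copy.items()
}

# The copy that the forest labels correctly most often wins.
best_label = max(correct, key=correct.get)
print(correct, best_label)
```

With these made-up votes, the copy that assumes heart disease is labeled correctly by more trees, so its label and its filled-in value would be kept.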

Keywords

💡Random Forests

Random Forests is a machine learning algorithm that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes of the individual trees. In the video, it is used to handle missing data and for sample clustering, which is a method of grouping similar samples together based on their proximity in the forest.

💡Missing Data

Missing data refers to the absence of values in a dataset, which can occur for various reasons. In the context of the video, missing data is addressed in two scenarios: within the original dataset used to create the random forest and in a new sample that needs to be categorized. The script discusses strategies for dealing with missing data, such as making initial guesses and refining them.

💡Sample Clustering

Sample clustering is the process of grouping similar samples together. In the video, the host expresses excitement about this aspect of random forests, as it allows for the identification of similarities between data points. The script explains how sample clustering is performed using a proximity matrix to track similarity based on leaf nodes in the trees.

💡Proximity Matrix

A proximity matrix is a tool used in random forests to measure the similarity between samples. It is a square matrix where each cell represents the similarity between two samples, with higher values indicating greater similarity. In the script, the proximity matrix is used to refine guesses about missing data by weighting the frequency of certain values based on the proximity of samples.
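A minimal Python sketch of how such a matrix could be filled in, assuming we already know which leaf each sample reaches in each tree. The leaf assignments are made-up values, `proximity_matrix` is a hypothetical helper, and dividing by the number of trees is included here so that the values fall between 0 and 1.

```python
def proximity_matrix(leaf_ids):
    """Build a proximity matrix from leaf assignments.

    leaf_ids[t][i] is the leaf that sample i reaches in tree t.
    """
    n_trees = len(leaf_ids)
    n_samples = len(leaf_ids[0])
    counts = [[0.0] * n_samples for _ in range(n_samples)]
    for leaves in leaf_ids:
        for i in range(n_samples):
            for j in range(n_samples):
                # Add 1 whenever two different samples share a leaf in this tree.
                if i != j and leaves[i] == leaves[j]:
                    counts[i][j] += 1.0
    # Normalize by the number of trees.
    return [[count / n_trees for count in row] for row in counts]

# Hypothetical leaf assignments for 4 samples in 3 trees.
leaf_ids = [
    [0, 1, 2, 2],
    [0, 0, 1, 1],
    [1, 2, 1, 0],
]
prox = proximity_matrix(leaf_ids)
for row in prox:
    print([round(value, 2) for value in row])
```

In this toy example, samples 3 and 4 (indices 2 and 3) share a leaf in two of the three trees, so they get the highest proximity.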

💡Initial Guess

An initial guess in the context of the video refers to the starting point for estimating missing values. For categorical data like 'blocked arteries', the initial guess is the most common value among the other samples with the same heart disease status (in the example, those without heart disease). For numeric data like 'weight', the initial guess is the median value of those same samples. The script illustrates how these initial guesses are made and then refined.
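As a rough sketch, the initial guess could be computed like this in Python. The four-patient table holds toy values chosen for illustration, and `initial_guess` and the column names are hypothetical.

```python
from statistics import median, mode

# Toy data (assumed values); None marks a missing entry.
patients = [
    {"blocked": "No", "weight": 125.0, "heart_disease": "No"},
    {"blocked": "Yes", "weight": 180.0, "heart_disease": "Yes"},
    {"blocked": "No", "weight": 210.0, "heart_disease": "No"},
    {"blocked": None, "weight": None, "heart_disease": "No"},
]

def initial_guess(rows, target, column, label="heart_disease"):
    """Guess rows[target][column] from the other rows that share the target's label."""
    peers = [
        row[column]
        for index, row in enumerate(rows)
        if index != target
        and row[label] == rows[target][label]
        and row[column] is not None
    ]
    if all(isinstance(value, (int, float)) for value in peers):
        return median(peers)  # numeric column: median
    return mode(peers)  # categorical column: most common value

blocked_guess = initial_guess(patients, 3, "blocked")
weight_guess = initial_guess(patients, 3, "weight")
print(blocked_guess, weight_guess)
```

Only the rows that share the fourth patient's label (no heart disease) contribute, so the second patient is ignored for both guesses.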

💡Weighted Frequency

Weighted frequency is a statistical measure that takes into account the relative importance of different observations. In the video, it is used to refine the guess for missing categorical data by calculating the frequency of a value (like 'yes' or 'no' for blocked arteries) and weighting it by its proximity value from the proximity matrix.
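One way to read this calculation, sketched in Python: each value's plain frequency is multiplied by a weight equal to the share of total proximity held by the samples with that value. The proximities and category values are assumed toy numbers, and `weighted_frequencies` is a hypothetical helper.

```python
from collections import Counter

def weighted_frequencies(values, proximities):
    """Score each category as frequency * (proximity share of samples with that category).

    values and proximities describe the other samples, not the one with the gap.
    """
    counts = Counter(values)
    total_proximity = sum(proximities)
    scores = {}
    for value, count in counts.items():
        frequency = count / len(values)
        weight = sum(p for v, p in zip(values, proximities) if v == value) / total_proximity
        scores[value] = frequency * weight
    return scores

# Assumed toy inputs: blocked-arteries values of three other samples and
# their proximities to the sample with the missing value.
values = ["No", "Yes", "No"]
proximities = [0.1, 0.1, 0.8]

scores = weighted_frequencies(values, proximities)
revised_guess = max(scores, key=scores.get)
print(scores, revised_guess)
```

Here "No" is both more frequent and backed by the closest sample, so it has the larger weighted frequency and becomes the revised guess.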

💡Weighted Average

A weighted average is an average in which each data point contributes to the final sum differently, based on a given weight. In the script, the weighted average is used to estimate missing numeric data (like 'weight') by calculating the average with weights derived from the proximity matrix, thus giving more importance to samples that are more similar.
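A small Python sketch of the proximity-weighted average, using assumed toy weights and proximities; `weighted_average` is a hypothetical helper name.

```python
def weighted_average(values, proximities):
    """Average the values, weighting each by its share of the total proximity."""
    total_proximity = sum(proximities)
    return sum(value * proximity / total_proximity
               for value, proximity in zip(values, proximities))

# Assumed toy inputs: weights of three other samples and their proximities
# to the sample whose weight is missing.
weights = [125.0, 180.0, 210.0]
proximities = [0.1, 0.1, 0.8]

revised_weight = weighted_average(weights, proximities)
print(round(revised_weight, 2))
```

Because the third sample is by far the closest, the revised value is pulled toward its weight rather than toward the plain mean.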

💡Convergence

Convergence in this context refers to the process where the recalculated missing values stabilize and no longer change with each iteration. The script describes an iterative process of refining guesses for missing data through multiple cycles until the values converge, indicating that the guesses have become reliable.
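The stopping rule can be sketched as a simple loop in Python. In the real procedure each round would rebuild the forest and recompute the proximities; here `toy_refine` is a made-up stand-in for that step, and the tolerance and round limit are assumptions.

```python
def refine_until_stable(guess, refine, tolerance=1e-6, max_rounds=50):
    """Apply refine() repeatedly until the guess changes by at most tolerance."""
    for round_number in range(1, max_rounds + 1):
        new_guess = refine(guess)
        if abs(new_guess - guess) <= tolerance:
            return new_guess, round_number
        guess = new_guess
    return guess, max_rounds

def toy_refine(current_guess):
    """Hypothetical stand-in for one forest rebuild; it pulls the guess toward 198.5."""
    return 0.5 * current_guess + 99.25

final_guess, rounds_used = refine_until_stable(167.5, toy_refine)
print(round(final_guess, 3), rounds_used)
```

The loop stops as soon as one more round no longer moves the value by more than the tolerance, which is the sense of "converge" used above.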

💡Distance Matrix

A distance matrix is a matrix that represents distances between pairs of objects. In the video, it is derived from the proximity matrix by subtracting the proximity values from one, transforming the measure of similarity into a measure of distance. This allows for visualizing the relationships between samples using techniques like heat maps or MDS plots.
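The conversion is a one-liner per entry. The sketch below uses an assumed toy proximity matrix, and setting the diagonal to 0 (a sample is at distance zero from itself) is an assumption made here for convenience.

```python
def distance_matrix(prox):
    """Turn proximities into distances: distance = 1 - proximity, with 0 on the diagonal."""
    n = len(prox)
    return [
        [0.0 if i == j else 1.0 - prox[i][j] for j in range(n)]
        for i in range(n)
    ]

# Assumed toy proximity matrix for three samples.
prox = [
    [0.0, 0.2, 0.9],
    [0.2, 0.0, 0.1],
    [0.9, 0.1, 0.0],
]

dist = distance_matrix(prox)
for row in dist:
    print([round(value, 2) for value in row])
```

Samples 1 and 3, which have the highest proximity, end up with the smallest distance, which is what a heat map or MDS plot would then display.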

💡Heat Map

A heat map is a graphical representation of data where individual values are represented as colors. In the script, the host mentions that a heat map can be drawn using the distance matrix derived from the proximity matrix, providing a visual way to understand the relationships and similarities between different samples.

💡MDS Plot

MDS (Multidimensional Scaling) plot is a technique used to visualize the level of similarity or dissimilarity between different items. In the video, it is mentioned as another way to represent the relationships between samples, similar to a heat map, but using a different graphical approach.

Highlights

Introduction to Random Forests Part Two focusing on missing data and sample clustering.

Explanation of two types of missing data: in the original dataset and in a new sample for categorization.

Initial guess for missing data based on the most common value in similar samples without heart disease.

Median value used as an initial guess for missing numeric data like weight.

Building a Random Forest and running data through trees to determine sample similarity.

Utilization of a proximity matrix to track and quantify sample similarity.

Process of refining guesses for missing data using proximity values as weights.

Iterative method to improve guesses by recalculating missing values multiple times.

Application of proximity matrix for creating a heat map or MDS plot to visualize sample relationships.

Cool feature of using the proximity matrix to determine the closest samples in the dataset.

Dealing with missing data in a new sample by creating copies with different assumptions.

Using the iterative method to make educated guesses for missing values in new samples.

Classifying new samples based on which option is more frequently correctly labeled by the Random Forest.

Finalizing missing data and classification by running the samples through the forest multiple times.

Encouragement to subscribe for more content and support through Patreon, channel memberships, or donations.

Invitation to check out StatQuest for more information on heat maps and MDS plots.
