StatQuest: Random Forests Part 2: Missing data and clustering
TLDR: In this StatQuest episode, host Josh Starmer explores random forests, focusing on handling missing data and sample clustering. He demonstrates how to make initial guesses for missing values, such as a patient's blocked-arteries status and weight, using the most common value and the median, and then refine those guesses iteratively. The video delves into building a proximity matrix to determine sample similarity and iteratively improving the guesses by leveraging the forest's trees. Josh also highlights the creative use of proximity matrices for visualizing data relationships through heat maps and MDS plots. Finally, he addresses classifying new samples with missing data by comparing guesses against the forest's predictions.
Takeaways
- The video is part of a series on random forests, focusing on handling missing data and sample clustering.
- Random forests deal with two types of missing data: missing values in the original dataset used to create the forest, and missing values in new samples to be categorized.
- The initial approach to missing data is to make an educated guess and then refine it iteratively.
- For categorical missing data, the initial guess is the most common value among the other samples with the same label (e.g., patients without heart disease).
- For numeric data, the initial guess is the median value of the other samples with the same label.
- Similarity between samples is determined by how often they end up in the same leaf node across the trees in the forest.
- A proximity matrix tracks which samples are similar: an entry is incremented each time two samples end up in the same leaf node.
- Refining the guesses involves recalculating the proximities and the missing values several times until they converge.
- Proximity values can be turned into a distance matrix, which can be visualized as a heat map or an MDS plot to show how the samples are related.
- For new samples with missing data, the iterative method fills in educated guesses under each possible label, and the forest's classification accuracy determines the most likely option.
- The video concludes with a call to action for viewers to subscribe and support the channel.
Q & A
What is the main topic of the video 'Random Forests Part Two'?
-The main topic of the video is dealing with missing data and sample clustering in the context of random forests.
What are the two types of missing data that random forests consider?
-Random forests consider missing data in the original dataset used to create the random forest and missing data in a new sample that we want to categorize.
How does the video suggest dealing with missing data in the creation of a random forest?
-The video suggests making an initial guess for the missing data, which could be bad, and then gradually refining the guess until it becomes a good guess.
What is the initial guess for the blocked arteries value in the case of patient number four?
-The initial guess for the blocked arteries value is 'No', which is the most common value for blocked arteries found in other samples that do not have heart disease.
How is the initial guess for the weight of patient number four determined?
-The initial guess for the weight is the median weight of the patients that did not have heart disease, which in this case is 160.75.
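The initial-guess step can be sketched in Python; the table and the column names below are illustrative stand-ins for the video's four-patient dataset, not its actual values.

```python
# Sketch of the initial-guess step: mode for a categorical gap, median for a
# numeric gap, using only the samples that share the missing sample's label.
from collections import Counter
from statistics import median

patients = [
    {"blocked_arteries": "No",  "weight": 125.0, "heart_disease": "No"},
    {"blocked_arteries": "Yes", "weight": 180.0, "heart_disease": "Yes"},
    {"blocked_arteries": "No",  "weight": 210.0, "heart_disease": "Yes"},
    {"blocked_arteries": None,  "weight": None,  "heart_disease": "No"},  # patient 4
]

def initial_guesses(rows, target_class):
    """Most common blocked-arteries value and median weight among the
    rows whose heart_disease label matches target_class."""
    same_class = [r for r in rows if r["heart_disease"] == target_class]
    mode_arteries = Counter(
        r["blocked_arteries"] for r in same_class if r["blocked_arteries"] is not None
    ).most_common(1)[0][0]
    med_weight = median(
        r["weight"] for r in same_class if r["weight"] is not None
    )
    return mode_arteries, med_weight
```

With this toy table, patient 4 (no heart disease) gets "No" for blocked arteries and the median weight of the other no-heart-disease patients.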
What is a proximity matrix in the context of random forests?
-A proximity matrix is used to keep track of similar samples in random forests. It has a row and a column for each sample, and entries are filled based on which samples end up in the same leaf node.
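A common way to approximate this with scikit-learn (an assumption; the video itself shows no code) is the forest's `apply` method, which reports the leaf each sample lands in for every tree:

```python
# Sketch: build a proximity matrix from shared leaf nodes (toy data).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.array([[0, 125], [1, 180], [0, 210], [0, 170]])  # illustrative features
y = np.array([0, 1, 1, 0])                              # illustrative labels

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
leaves = forest.apply(X)                  # shape: (n_samples, n_trees)

# proximity[i, j] = fraction of trees in which samples i and j share a leaf
proximity = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)
```

Dividing the shared-leaf counts by the number of trees (done implicitly by `mean`) gives proximities between 0 and 1, with each sample's proximity to itself equal to 1.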
How does the video describe the process of refining guesses for missing data?
-The process involves determining which samples are similar to the one with missing data by running all data down all the trees, updating the proximity matrix, and using the proximity values to make better guesses about the missing data.
What does the video suggest for revising the guess for blocked arteries based on the proximity matrix?
-The video suggests computing a weighted frequency for 'yes' and 'no', using the proximity values as weights, and choosing the option with the higher weighted frequency, which in this case is 'no'.
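A minimal sketch of that weighted vote, with made-up proximity values (the video's exact numbers are not reproduced here):

```python
# Sketch: weighted frequency for a categorical value, using proximities as weights.
def weighted_frequency(values, proximities):
    """Sum each candidate value's proximity weights, normalized by the
    total proximity, and return the value with the largest share."""
    total = sum(proximities)
    scores = {}
    for v, p in zip(values, proximities):
        scores[v] = scores.get(v, 0.0) + p / total
    return max(scores, key=scores.get), scores

arteries = ["No", "Yes", "No"]   # blocked-arteries values in the other samples
prox = [0.8, 0.1, 0.6]           # illustrative proximities to the missing sample
best, scores = weighted_frequency(arteries, prox)
```

Here "No" carries far more proximity weight than "Yes", so it becomes the revised guess.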
How is the weighted average for the missing weight calculated in the video?
-The weighted average for the missing weight is calculated using the proximities to determine the weights for each sample's weight, and then averaging these weighted values.
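The weighted average can be sketched the same way, again with illustrative numbers:

```python
# Sketch: proximity-weighted average for a missing numeric value.
def weighted_average(values, proximities):
    # each sample's value contributes in proportion to its proximity
    total = sum(proximities)
    return sum(v * p for v, p in zip(values, proximities)) / total

weights = [125.0, 180.0, 210.0]  # the other patients' weights (illustrative)
prox = [0.8, 0.1, 0.6]           # illustrative proximities to the missing sample
revised = weighted_average(weights, prox)
```

The result always lies between the smallest and largest observed value, pulled toward the samples most similar to the one with the gap.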
What is the iterative method described in the video for refining guesses?
-The iterative method involves building a random forest, running the data through the trees, recalculating the proximities, and recalculating the missing values multiple times until they converge and no longer change.
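The loop might be sketched as follows, assuming scikit-learn and a single missing numeric value; the toy data, the `row`/`col` indices, and the convergence tolerance are all assumptions, not the video's code:

```python
# Sketch: rebuild the forest, recompute proximities, and re-impute until
# the guessed value stops changing (or a pass limit is reached).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.array([[0.0, 125.0], [1.0, 180.0], [0.0, 210.0], [0.0, 160.75]])
y = np.array([0, 1, 1, 0])
row, col = 3, 1                  # patient 4's weight holds the current guess

for _ in range(10):              # the video suggests 6-7 passes usually suffice
    forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
    leaves = forest.apply(X)     # leaf index per sample per tree
    proximity = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)
    w = proximity[row].copy()
    w[row] = 0.0                 # the guess must not vote for itself
    if w.sum() == 0:             # no shared leaves: keep the current guess
        break
    new_val = float(w @ X[:, col]) / w.sum()
    if abs(new_val - X[row, col]) < 1e-6:   # converged: guess stopped changing
        break
    X[row, col] = new_val
```

Each pass refits the forest on the current guesses, so better guesses yield better proximities, which in turn yield better guesses.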
How can the proximity matrix be used beyond just refining guesses for missing data?
-The proximity matrix can be used to create a distance matrix, which can then be visualized through a heat map or an MDS (Multidimensional Scaling) plot to show how the samples are related to each other.
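A sketch of that transformation, assuming scikit-learn's MDS and an illustrative proximity matrix; distance is defined as 1 minus proximity, so highly similar samples end up close together:

```python
# Sketch: proximity -> distance matrix -> 2-D MDS coordinates for plotting.
import numpy as np
from sklearn.manifold import MDS

proximity = np.array([           # illustrative proximity matrix
    [1.0, 0.1, 0.2, 0.8],
    [0.1, 1.0, 0.7, 0.1],
    [0.2, 0.7, 1.0, 0.2],
    [0.8, 0.1, 0.2, 1.0],
])
distance = 1.0 - proximity       # similar samples get small distances

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(distance)   # 2-D coordinates for the MDS plot
```

The same distance matrix could instead be drawn directly as a heat map; the MDS plot is just a second view of it.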
What is the second method for dealing with missing data when classifying a new sample?
-The second method involves creating two copies of the data with different assumptions for the missing values, running these through the existing random forest, and choosing the option that is correctly labeled more times by the forest.
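A rough sketch of this second method, with toy data; here the fraction of trees voting for each label is approximated by scikit-learn's `predict_proba`, which averages tree probabilities rather than counting raw votes:

```python
# Sketch: duplicate the new sample under each assumed label, impute each copy,
# and keep the label the forest supports more strongly.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.array([[0, 125.0], [1, 180.0], [0, 210.0], [1, 190.0]])  # illustrative
y = np.array([0, 1, 1, 1])
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# New patient: blocked_arteries missing, weight known. One copy per assumed
# label, each missing value filled in under that label's assumption.
copy_no  = np.array([[0, 168.0]])  # assume no heart disease -> arteries "No"
copy_yes = np.array([[1, 168.0]])  # assume heart disease   -> arteries "Yes"

votes_no  = forest.predict_proba(copy_no)[0, 0]   # support for label 0
votes_yes = forest.predict_proba(copy_yes)[0, 1]  # support for label 1
label = 0 if votes_no > votes_yes else 1
```

Whichever assumption the forest confirms more often wins, which simultaneously fills in the missing value and classifies the new sample.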
Outlines
Random Forests and Missing Data Handling
This paragraph introduces the topic of the video, which is handling missing data and sample clustering in the context of random forests. The speaker, Josh Starmer, expresses his enthusiasm for the sample clustering aspect of random forests. The dataset consists of information from four patients, with patient number four having missing data. The video addresses two types of missing data: missing values in the original dataset used to create the random forest, and missing values in a new sample to be categorized. The method for dealing with missing data involves making an initial guess and then refining it iteratively. For example, if a patient's blocked-arteries status is unknown, the initial guess is the most common value found in similar samples without heart disease. For numeric data like weight, the median value from similar samples is used as the initial guess. Refining the guesses involves building a random forest, running the data down all the trees to determine similarity, and using a proximity matrix to track similar samples. The matrix is updated after each tree, and the proximity values are then used to make better guesses about the missing data.
Refining Guesses and Proximity Matrix Analysis
The second paragraph delves into the process of refining initial guesses for missing data using the proximity matrix. The weighted frequency for categorical data like 'blocked arteries' is calculated using the proximity values as weights. For example, if 'no' for blocked arteries occurs more frequently in similar samples, it will have a higher weighted frequency and is chosen as the revised guess. For numeric data like weight, a weighted average is calculated based on the proximity values. The paragraph also explains how the proximity matrix can be transformed into a distance matrix, enabling the creation of a heat map or a Multidimensional Scaling (MDS) plot to visualize the relationship between samples. This demonstrates the versatility of random forests in analyzing various types of data. The iterative process of refining guesses is repeated several times until the missing values stabilize.
Classifying New Samples with Missing Data
The final paragraph discusses using random forests to classify new samples with missing data. The scenario involves a pre-built random forest and a new patient whose heart disease status is unknown and whose 'blocked arteries' value is missing. The process involves creating two copies of the data: one assuming the patient has heart disease, the other assuming they do not. Using the iterative method described earlier, guesses for the missing values are made for each copy. Both versions are then run through the forest to see which is correctly labeled more often; the option that wins is chosen, effectively filling in the missing data and classifying the new sample at the same time. The video concludes with a call to action for viewers to subscribe and support the channel through various means.
Keywords
Random Forests
Missing Data
Sample Clustering
Proximity Matrix
Initial Guess
Weighted Frequency
Weighted Average
Convergence
Distance Matrix
Heat Map
MDS Plot
Highlights
Introduction to Random Forests Part Two focusing on missing data and sample clustering.
Explanation of two types of missing data: in the original dataset and in a new sample for categorization.
Initial guess for missing data based on the most common value in similar samples without heart disease.
Median value used as an initial guess for missing numeric data like weight.
Building a Random Forest and running data through trees to determine sample similarity.
Utilization of a proximity matrix to track and quantify sample similarity.
Process of refining guesses for missing data using proximity values as weights.
Iterative method to improve guesses by recalculating missing values multiple times.
Application of proximity matrix for creating a heat map or MDS plot to visualize sample relationships.
Cool feature of using the proximity matrix to determine the closest samples in the dataset.
Dealing with missing data in a new sample by creating copies with different assumptions.
Using the iterative method to make educated guesses for missing values in new samples.
Classifying new samples based on which option is more frequently correctly labeled by the Random Forest.
Finalizing missing data and classification by running the samples through the forest multiple times.
Encouragement to subscribe for more content and support through Patreon, channel memberships, or donations.
Invitation to check out StatQuest for more information on heat maps and MDS plots.