Box Plots with Two Factors (Stratified Boxplots) in R | R Tutorial 2.3 | MarinStatsLectures
TLDR
In this instructional video, Mike Marin demonstrates the creation of stratified boxplots to explore the relationship between smoking and lung capacity within different age groups. Using the LungCapacity dataset, he shows how to convert numeric age data into categorical age groups and visualizes the impact of smoking on lung capacity across these strata. The video highlights the importance of considering confounding variables like age when analyzing health data and concludes with tips on refining plot aesthetics.
Takeaways
- The video discusses the creation of stratified boxplots to examine the relationship between a categorical variable and a numeric variable within groups formed by a third categorical variable.
- It uses the example of examining the relationship between smoking and lung capacity within different age groups or strata.
- The 'LungCapacity' dataset is used, which includes a new 'AgeGroup' variable categorized into less than 13, 14-15, 16-17, and 18-plus age groups.
- The script demonstrates how to make a basic boxplot for a single variable, such as lung capacity, and how to modify it with titles and axis labels.
- The concept of side-by-side boxplots is introduced to compare lung capacities between smokers and non-smokers, revealing a counter-intuitive initial observation.
- The video highlights the importance of considering confounding variables, such as age, which affects both smoking habits and lung capacity.
- The script shows how to subset data to examine relationships within specific age groups, such as focusing only on individuals aged 18 or older.
- It explains how to create stratified boxplots by crossing smoking categories with age group categories to produce multiple boxplots on a single plot.
- The video suggests using colors to differentiate groups within the boxplots for better visual separation and understanding.
- The script includes tips for improving plot aesthetics, such as rotating axis labels, adding a box around the plot, and including a legend.
- The final message emphasizes the importance of stratified analysis for drawing accurate conclusions, as opposed to ignoring potential confounders.
Q & A
What are stratified boxplots used for?
-Stratified boxplots are used for examining the relationship between a categorical variable and a numeric variable within strata or groups formed by a third categorical variable.
What is the purpose of creating an AgeGroup variable in the video?
-The AgeGroup variable is created to categorize the Age variable into groups such as less than 13, 14-15, 16-17, and 18-plus, allowing for a more detailed analysis within these age categories.
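As a rough sketch of how such a grouping can be built, R's cut() function bins a numeric vector into a factor. The break points, the labels, and the short Age vector below are assumptions for illustration, not values confirmed by this summary. With cut()'s default right-closed intervals, age 13 falls into the first group.

```r
# Hypothetical ages standing in for the Age column of the LungCapacity data
Age <- c(8, 13, 14, 15, 16, 17, 18, 19)

# Assumed break points give the intervals (0,13], (13,15], (15,17], (17,25]
AgeGroup <- cut(Age,
                breaks = c(0, 13, 15, 17, 25),
                labels = c("<13", "14/15", "16/17", "18+"))

# Count how many observations land in each age group
table(AgeGroup)
```

The result is a factor, so it can be used directly as a grouping variable in later plots.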
How does the video demonstrate making a box plot for a single variable?
-The video demonstrates making a box plot for Lung Capacity by setting up a y-label, adding a title, and using the 'las' argument to rotate the y-axis values for better readability.
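A minimal sketch of such a single-variable boxplot is shown here. The LungCap values are made up for illustration and the title text is an assumption; main, ylab, and las are standard arguments of base R's boxplot().

```r
# Hypothetical lung capacity values (not taken from the real dataset)
LungCap <- c(5.1, 6.0, 6.8, 7.2, 8.1, 8.6, 9.4, 9.9, 11.2, 11.5)

# main = plot title, ylab = y-axis label,
# las = 1 draws the axis tick values horizontally
bp <- boxplot(LungCap,
              main = "Boxplot of Lung Capacity",
              ylab = "Lung Capacity",
              las = 1)

# boxplot() invisibly returns the five-number summary it drew
bp$stats
```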
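What does the tilde (~) symbol do in the context of creating boxplots in R?
-In R, the tilde (~) appears in a formula such as LungCap ~ Smoke, where it separates the numeric variable on the left from the grouping variable on the right. Passed to the boxplot command, such a formula produces side-by-side boxplots, for example Lung Capacity for Smokers and Non-Smokers separately.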
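The formula interface can be sketched as follows. The six observations are invented for illustration; only the LungCap ~ Smoke pattern reflects what the summary describes.

```r
# Hypothetical data: smoking status and lung capacity for six people
Smoke   <- factor(c("no", "no", "no", "yes", "yes", "yes"))
LungCap <- c(6.2, 7.8, 9.1, 8.4, 9.0, 9.7)

# Numeric variable on the left of ~, grouping factor on the right:
# one box is drawn per level of Smoke
bp <- boxplot(LungCap ~ Smoke,
              ylab = "Lung Capacity",
              main = "Lung Capacity by Smoking Status",
              las = 1)

# The group labels used on the x-axis
bp$names
```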
Why is the average Lung Capacity of Smokers larger than that of Non-Smokers in the initial observation?
-The initial observation is counter-intuitive because the Smoking effect is confounded with the Age effect; on average, Smokers are older than Non-smokers, and older children have bigger lung capacities.
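One way to check for this kind of confounding is to compare the groups on the suspected confounder. The sketch below uses a small toy data frame that was deliberately constructed to mimic the pattern described in the video; it is not the real LungCapacity data, and the numbers carry no empirical meaning.

```r
# Toy data built so that smokers are older on average (an assumption
# chosen to illustrate confounding, not real measurements)
lung <- data.frame(
  Age     = c(8, 10, 12, 13, 14, 15, 15, 16, 17, 17, 18, 19, 19, 18),
  LungCap = c(5.1, 6.0, 6.8, 7.2, 8.1, 8.6, 7.9, 9.4, 9.9, 9.0,
              11.2, 11.5, 10.1, 10.4),
  Smoke   = factor(c("no", "no", "no", "no", "no", "no", "yes",
                     "no", "no", "yes", "no", "no", "yes", "yes"))
)

# Mean age per smoking group: a large gap hints that Age may confound
mean_age <- tapply(lung$Age, lung$Smoke, mean)

# Mean lung capacity per smoking group, ignoring age
mean_cap <- tapply(lung$LungCap, lung$Smoke, mean)

mean_age
mean_cap
```

In this toy data the smokers are both older and, when age is ignored, show the larger mean lung capacity, which is the same misleading pattern the video describes.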
How does examining the relationship within Age Strata help to clarify the relationship between Smoking and Lung Capacity?
-Examining the relationship within Age Strata allows for a more nuanced view, revealing that within the 18-plus Age Stratum, Smokers have a lower Lung Capacity than Non-smokers, which is a more intuitive relationship.
What is the significance of using the square brackets in the R code shown in the video?
-The square brackets are used for subsetting data in R, allowing the user to focus on specific subsets of data, such as Lung Capacity and Smoking status for individuals aged 18 or older.
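A small sketch of this bracket subsetting is given below. The vectors are toy values constructed for illustration, so the plotted difference is by design and not evidence about real lung capacity.

```r
# Toy vectors (hypothetical values, not the real dataset)
Age     <- c(8, 10, 12, 13, 14, 15, 15, 16, 17, 17, 18, 19, 19, 18)
LungCap <- c(5.1, 6.0, 6.8, 7.2, 8.1, 8.6, 7.9, 9.4, 9.9, 9.0,
             11.2, 11.5, 10.1, 10.4)
Smoke   <- factor(c("no", "no", "no", "no", "no", "no", "yes",
                    "no", "no", "yes", "no", "no", "yes", "yes"))

# Age >= 18 yields a logical vector; inside [ ] it keeps only the
# elements where the condition is TRUE, applied to both variables
bp <- boxplot(LungCap[Age >= 18] ~ Smoke[Age >= 18],
              xlab = "Smoke",
              ylab = "Lung Capacity",
              main = "Lung Capacity vs Smoking, Age 18+",
              las = 1)

# Number of observations in each box after subsetting
bp$n
```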
How does the video address the issue of overlapping x-axis labels in the stratified boxplot?
-The video suggests setting the 'las' argument to 2 to rotate the x-axis labels, making them more visible and preventing overlap.
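The crossed plot with rotated labels might look like the sketch below. The data are a balanced toy set with two made-up observations per cell, and the cut() break points and labels are assumptions; Smoke * AgeGroup on the right of the formula asks boxplot() for one box per combination.

```r
# Toy data: two hypothetical observations per Smoke x AgeGroup cell
Age     <- c(9, 11, 12, 13, 14, 15, 14, 15, 16, 17, 16, 17, 18, 19, 18, 19)
Smoke   <- factor(rep(c("no", "no", "yes", "yes"), times = 4))
LungCap <- c(6.0, 6.8, 5.6, 6.2, 8.1, 8.6, 7.5, 7.9,
             9.4, 9.9, 8.8, 9.0, 11.2, 11.5, 10.1, 10.4)
AgeGroup <- cut(Age, breaks = c(0, 13, 15, 17, 25),
                labels = c("<13", "14/15", "16/17", "18+"))

# las = 2 draws axis labels perpendicular to the axis, so the eight
# long x-axis labels no longer run into each other
bp <- boxplot(LungCap ~ Smoke * AgeGroup,
              ylab = "Lung Capacity",
              main = "Lung Capacity vs Smoking, by Age Group",
              las = 2)

# Labels combine the two factors, with Smoke varying fastest
bp$names
```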
What is the purpose of using different colors in the stratified boxplot?
-Different colors are used to visually distinguish between different groups, such as Non-smokers (blue) and Smokers (red), making it easier to compare Lung Capacity within each age stratum.
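Coloring can be sketched with the col argument, which boxplot() recycles across the boxes. Because the boxes alternate Non-smoker, Smoker within each age group, a two-element color vector is enough. The data are again invented for illustration.

```r
# Toy data: two hypothetical observations per Smoke x AgeGroup cell
Smoke    <- factor(rep(c("no", "no", "yes", "yes"), times = 4))
AgeGroup <- factor(rep(c("<13", "14/15", "16/17", "18+"), each = 4),
                   levels = c("<13", "14/15", "16/17", "18+"))
LungCap  <- c(6.0, 6.8, 5.6, 6.2, 8.1, 8.6, 7.5, 7.9,
              9.4, 9.9, 8.8, 9.0, 11.2, 11.5, 10.1, 10.4)

# Two colors, recycled over the eight boxes: blue, red, blue, red, ...
box_cols <- c("blue", "red")

bp <- boxplot(LungCap ~ Smoke * AgeGroup,
              ylab = "Lung Capacity",
              main = "Lung Capacity vs Smoking, by Age Group",
              las = 2,
              col = box_cols)
```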
How can viewers learn more about modifying plots in R as shown in the video?
-Viewers are encouraged to watch additional instructional videos on modifying plots, re-labeling axes, adding legends, and other plot enhancements.
Outlines
Introduction to Stratified Boxplots
In the first paragraph, Mike Marin introduces the concept of stratified boxplots, which are a statistical tool for examining the relationship between a categorical and a numeric variable within groups defined by a third categorical variable. The video specifically explores the relationship between smoking and lung capacity within different age groups, using the LungCapacity dataset. Mike has already prepared the dataset in R, creating an AgeGroup variable with categories for different age ranges. He also briefly reviews the previous video's content on creating basic and side-by-side boxplots, and notes an unexpected finding where smokers seem to have a larger lung capacity on average, which is later attributed to the confounding effect of age. The paragraph concludes with a demonstration of how to examine the smoking-lung capacity relationship within the 18-plus age group, revealing a more intuitive negative association.
Visualizing Stratified Relationships with Boxplots
The second paragraph delves into the visualization of the smoking-lung capacity relationship across different age strata using boxplots. Mike explains how to modify the boxplot command to create separate boxplots for each combination of smoking status and age group, resulting in eight distinct boxplots. He addresses the issue of overlapping x-axis labels by adjusting the 'las' argument to better display the categories. The paragraph emphasizes the visual comparison of smokers and non-smokers within each age stratum and introduces the use of colors to differentiate between the two groups. Mike demonstrates how to apply colors programmatically in R, resulting in a clear visualization that shows smokers generally having lower lung capacity within their respective age groups. The paragraph concludes with suggestions for further improving the plot's aesthetics and references additional resources for learning about plot modifications.
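One possible version of the cleaned-up plot is sketched below, assuming the cleanup means suppressing the default axes, drawing them by hand, and adding a legend. The data are toy values, and the label text, tick positions, and legend placement are choices made for this illustration.

```r
# Toy data: two hypothetical observations per Smoke x AgeGroup cell
Smoke    <- factor(rep(c("no", "no", "yes", "yes"), times = 4))
AgeGroup <- factor(rep(c("<13", "14/15", "16/17", "18+"), each = 4),
                   levels = c("<13", "14/15", "16/17", "18+"))
LungCap  <- c(6.0, 6.8, 5.6, 6.2, 8.1, 8.6, 7.5, 7.9,
              9.4, 9.9, 8.8, 9.0, 11.2, 11.5, 10.1, 10.4)

# axes = FALSE suppresses the default axes and frame so they can be redrawn
bp <- boxplot(LungCap ~ Smoke * AgeGroup,
              xlab = "Age Strata",
              ylab = "Lung Capacity",
              main = "Lung Capacity vs Smoking, Stratified by Age",
              col = c(4, 2),   # palette colors 4 and 2, recycled per pair
              axes = FALSE)

box()              # draw the frame around the plot
axis(2, las = 1)   # y-axis with horizontal tick labels

# One x-axis label per age group, centered between each pair of boxes
axis(1, at = c(1.5, 3.5, 5.5, 7.5), labels = levels(AgeGroup))

# Legend explaining the two fill colors
legend("topleft", legend = c("Non-Smoker", "Smoker"),
       fill = c(4, 2), cex = 0.8)
```

Labeling each pair of boxes once keeps the x-axis readable without rotating the labels, and the legend carries the smoking status that the default labels used to encode.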
Keywords
Stratified Boxplots
Categorical Variable
Numeric Variable
LungCapacity Data
AgeGroup Variable
Boxplot
Confound
Subsetting Data
R (Statistical Software)
Plot Customization
Highlights
Introduction to stratified Boxplots for examining relationships between categorical and numeric variables within strata.
Use of LungCapacity data to explore the relationship between Smoking and Lung Capacity within age groups.
Creation of an AgeGroup variable categorizing age into less than 13, 14-15, 16-17, and 18-plus.
Demonstration of converting a numeric variable to a categorical variable using the 'cut' command.
Basic box plot creation for a single variable, such as Lung Capacity.
Creating side-by-side boxplots to compare groups, like Smokers vs. Non-Smokers.
Observation of a counter-intuitive larger average Lung Capacity for Smokers.
Introduction of the confound concept where Smoking effect is confounded with Age effect.
Proposal to examine the relationship within Age Strata to avoid confounding.
Technique to subset data using square brackets for specific age groups.
Visualization of the relationship between Smoking and Lung Capacity within the 18-plus Age Stratum.
Method to produce boxplots for each combination of Smoking category and AgeGroup category.
Improvement of plot readability by rotating x-axis labels and using color coding for groups.
Observation that within age strata, Smokers tend to have a lower Lung capacity than Non-Smokers.
Cleaning up and improving the plot with a box, re-labeling axes, and adding a legend.
Encouragement to watch additional videos for learning more about modifying plots.