Box Plots with Two Factors (Stratified Boxplots) in R | R Tutorial 2.3 | MarinStatsLectures

MarinStatsLectures-R Programming & Statistics
4 Jan 2016 | 07:32

TLDR: In this instructional video, Mike Marin demonstrates the creation of stratified boxplots to explore the relationship between smoking and lung capacity within different age groups. Using the LungCapacity dataset, he shows how to convert numeric age data into categorical age groups and visualizes the impact of smoking on lung capacity across these strata. The video highlights the importance of considering confounding variables like age when analyzing health data and concludes with tips on refining plot aesthetics.

Takeaways
  • 📊 The video discusses the creation of stratified boxplots to examine the relationship between a categorical variable and a numeric variable within groups formed by a third categorical variable.
  • 🚬 It uses the example of examining the relationship between smoking and lung capacity within different age groups or strata.
  • 🗂️ The 'LungCapacity' dataset is used, which includes a new 'AgeGroup' variable categorized into less than 13, 14-15, 16-17, and 18-plus age groups.
  • 📈 The script demonstrates how to make a basic boxplot for a single variable, such as lung capacity, and how to modify it with titles and axis labels.
  • 🔄 The concept of side-by-side boxplots is introduced to compare lung capacities between smokers and non-smokers, revealing a counter-intuitive initial observation.
  • 🔑 The video highlights the importance of considering confounding variables, such as age, which affects both smoking habits and lung capacity.
  • 🧩 The script shows how to subset data to examine relationships within specific age groups, such as focusing only on individuals aged 18 or older.
  • 📈 It explains how to create stratified boxplots by crossing smoking categories with age group categories to produce multiple boxplots on a single plot.
  • 🎨 The video suggests using colors to differentiate groups within the boxplots for better visual separation and understanding.
  • 🛠️ The script includes tips for improving plot aesthetics, such as rotating axis labels, adding a box around the plot, and including a legend.
  • 🔍 The final message emphasizes the importance of stratified analysis for drawing accurate conclusions, as opposed to ignoring potential confounders.
Q & A
  • What are stratified boxplots used for?

    -Stratified boxplots are used for examining the relationship between a categorical variable and a numeric variable within strata or groups formed by a third categorical variable.

  • What is the purpose of creating an AgeGroup variable in the video?

    -The AgeGroup variable is created to categorize the Age variable into groups such as less than 13, 14-15, 16-17, and 18-plus, allowing for a more detailed analysis within these age categories.

  • How does the video demonstrate making a box plot for a single variable?

    -The video demonstrates making a box plot for Lung Capacity by setting up a y-label, adding a title, and using the 'las' argument to rotate the y-axis values for better readability.

  • What does the tilde (~) symbol do in the context of creating boxplots in R?

    -In R, the tilde separates the numeric variable on its left from the grouping variable on its right in a formula. Passing such a formula to the boxplot command produces side-by-side boxplots, one per group, such as Lung Capacity for Smokers and Non-Smokers separately.

  • Why is the average Lung Capacity of Smokers larger than that of Non-Smokers in the initial observation?

    -The initial observation is counter-intuitive because the Smoking effect is confounded with the Age effect; on average, Smokers are older than Non-smokers, and older children have bigger lung capacities.

  • How does examining the relationship within Age Strata help to clarify the relationship between Smoking and Lung Capacity?

    -Examining the relationship within Age Strata allows for a more nuanced view, revealing that within the 18-plus Age Strata, Smokers have a lower Lung Capacity than Non-smokers, which is a more intuitive relationship.

  • What is the significance of using the square brackets in the R code shown in the video?

    -The square brackets are used for subsetting data in R, allowing the user to focus on specific subsets of data, such as Lung Capacity and Smoking status for individuals aged 18 or older.

  • How does the video address the issue of overlapping x-axis labels in the stratified boxplot?

    -The video suggests setting the 'las' argument to 2 to rotate the x-axis labels, making them more visible and preventing overlap.

  • What is the purpose of using different colors in the stratified boxplot?

    -Different colors are used to visually distinguish between different groups, such as Non-smokers (blue) and Smokers (red), making it easier to compare Lung Capacity within each age stratum.

  • How can viewers learn more about modifying plots in R as shown in the video?

    -Viewers are encouraged to watch additional instructional videos on modifying plots, re-labeling axes, adding legends, and other plot enhancements.

Outlines
00:00
📊 Introduction to Stratified Boxplots

In the first paragraph, Mike Marin introduces stratified boxplots, a tool for examining the relationship between a categorical and a numeric variable within groups defined by a third categorical variable. The video explores the relationship between smoking and lung capacity within different age groups, using the LungCapacity dataset. Mike has already prepared the dataset in R, creating an AgeGroup variable with categories for different age ranges. He briefly recaps the previous video's content on basic and side-by-side boxplots and notes an unexpected finding: smokers appear to have a larger lung capacity on average, which he attributes to confounding with age. The paragraph concludes with a demonstration of how to examine the smoking-lung capacity relationship within the 18-plus age group, where smokers show the more intuitive lower lung capacity.

05:02
πŸ–ΌοΈ Visualizing Stratified Relationships with Boxplots

The second paragraph delves into the visualization of the smoking-lung capacity relationship across different age strata using boxplots. Mike explains how to modify the boxplot command to create separate boxplots for each combination of smoking status and age group, resulting in eight distinct boxplots. He addresses the issue of overlapping x-axis labels by adjusting the 'las' argument to better display the categories. The paragraph emphasizes the visual comparison of smokers and non-smokers within each age stratum and introduces the use of colors to differentiate between the two groups. Mike demonstrates how to apply colors programmatically in R, resulting in a clear visualization that shows smokers generally having lower lung capacity within their respective age groups. The paragraph concludes with suggestions for further improving the plot's aesthetics and references additional resources for learning about plot modifications.
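
The crossing of smoking status with age group described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the video's actual script: the data are simulated, and the variable names (`Smoke`, `AgeGroup`, `LungCap`), group labels, and colors are placeholders.

```r
# Simulated stand-in for the LungCapacity data (all values are made up).
set.seed(3)
d <- expand.grid(Smoke    = c("no", "yes"),
                 AgeGroup = c("<13", "14/15", "16/17", "18+"),
                 id       = 1:10)
# Assumed pattern: capacity rises with age group and is lower for smokers.
d$LungCap <- rnorm(nrow(d),
                   mean = 4 + 2 * as.integer(d$AgeGroup) - (d$Smoke == "yes"),
                   sd   = 1)

# Smoke * AgeGroup asks for one box per combination (2 x 4 = 8 boxes).
# las = 2 turns the axis labels perpendicular to the axis so they do not overlap.
# The two colors are recycled across the boxes: blue = non-smoker, red = smoker.
bp <- boxplot(LungCap ~ Smoke * AgeGroup, data = d,
              las = 2, col = c("blue", "red"), xlab = "",
              ylab = "Lung Capacity",
              main = "Lung Capacity vs Smoking, by Age Strata")

bp$names  # box labels in plotting order, e.g. "no.<13", "yes.<13", ...
```

Because the smoking variable is listed first in the formula, it varies fastest, so each adjacent pair of boxes compares non-smokers and smokers within one age stratum.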

Keywords
💡 Stratified Boxplots
Stratified boxplots are a type of statistical visualization used to examine the relationship between two variables while controlling for a third variable. In the context of this video, stratified boxplots help to analyze the relationship between smoking and lung capacity within different age groups. The script demonstrates how to create these boxplots to better understand the impact of smoking on lung capacity when age is taken into account.
💡 Categorical Variable
A categorical variable is a variable that can take on one of a limited, and usually fixed, number of possible values, assigning each individual or case to a particular group or category. In the video, age groups and smoking status (smoker or non-smoker) are categorical variables that are used to stratify the analysis of lung capacity.
💡 Numeric Variable
A numeric variable, also known as a quantitative variable, is a variable that can take on any numerical value within a range. In this video, lung capacity is the numeric variable being analyzed in relation to the categorical variables of smoking status and age groups.
💡 LungCapacity Data
The LungCapacity data is the dataset used in the video to demonstrate the creation of stratified boxplots. It contains information about individuals' lung capacities, their smoking status, and their ages, which are used to explore the relationship between smoking and lung capacity within different age brackets.
💡 AgeGroup Variable
The AgeGroup variable is a categorical representation of the age data, created to facilitate the stratified analysis. In the script, age categories are defined as less than 13, 14-15, 16-17, and 18-plus, allowing for the examination of lung capacity across these distinct age groups.
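
As a minimal sketch of how such a grouping might be built in R, the example below applies `cut()` to a handful of made-up ages. The break points and labels are assumptions chosen to mirror the four groups named above; this summary does not confirm the exact values used in the video.

```r
# Hypothetical ages; the real values live in the LungCapacity data.
Age <- c(3, 9, 13, 14, 15, 16, 17, 18, 19)

# Assumed breaks and labels. With the default right = TRUE, each interval
# includes its upper break, so an age of exactly 13 lands in the first group
# even though that group is labelled "<13".
AgeGroup <- cut(Age,
                breaks = c(0, 13, 15, 17, 25),
                labels = c("<13", "14/15", "16/17", "18+"))

table(AgeGroup)  # counts per group: 3, 2, 2, 2
```

The result is a factor, which is what lets it serve as a grouping variable in later plots.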
💡 Boxplot
A boxplot, or box-and-whisker plot, is a standardized way of displaying the distribution of a dataset based on a five-number summary: minimum, first quartile (Q1), median, third quartile (Q3), and maximum. The video script describes how to create boxplots for a single variable and for comparing groups, such as smokers versus non-smokers.
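
The two kinds of boxplot mentioned here might look like the sketch below. The data are simulated and the variable names and titles are placeholders, not taken from the video. One detail worth knowing: by default R's `boxplot()` draws each whisker only as far as the most extreme point within 1.5 times the box length, and plots anything beyond that as an individual point, so the whisker ends are not always the minimum and maximum.

```r
# Simulated stand-in for the LungCapacity data (values are made up).
set.seed(1)
Smoke   <- factor(rep(c("no", "yes"), times = c(60, 20)))
LungCap <- rnorm(80, mean = 8, sd = 2)

# One boxplot for the whole sample; las = 1 keeps the axis values horizontal.
single <- boxplot(LungCap,
                  main = "Boxplot of Lung Capacity",
                  ylab = "Lung Capacity", las = 1)

# LungCap ~ Smoke reads "LungCap by Smoke": one box per level of Smoke.
by_smoke <- boxplot(LungCap ~ Smoke,
                    main = "Lung Capacity by Smoking Status",
                    ylab = "Lung Capacity", las = 1)

# boxplot() invisibly returns the numbers it drew.
single$stats    # lower whisker, lower hinge, median, upper hinge, upper whisker
by_smoke$names  # group labels in plotting order
```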
💡 Confound
A confound in statistics refers to a variable that influences both the independent variable and the dependent variable, making it difficult to determine the true relationship between them. In the video, the effect of smoking on lung capacity is said to be confounded with age, as smokers tend to be older children, and older children have larger lung capacities.
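
A tiny constructed example can make the confounding pattern concrete. The numbers below are invented purely for illustration (they are not from the LungCapacity data): lung capacity is set to rise with age and to drop by one unit for smokers, and most smokers are placed in the older group.

```r
# Invented toy data: 8 children aged 10 and 8 aged 18.
Age   <- rep(c(10, 18), each = 8)
Smoke <- factor(c(rep("no", 7), "yes",        # age 10: 7 non-smokers, 1 smoker
                  rep("no", 2), rep("yes", 6)))  # age 18: 2 non-smokers, 6 smokers

# Assumed relationship: capacity grows with age, smoking lowers it by 1.
LungCap <- 0.5 * Age - 1 * (Smoke == "yes")

tapply(Age, Smoke, mean)                          # smokers are older on average
overall <- tapply(LungCap, Smoke, mean)           # ignoring age
within  <- tapply(LungCap, list(Age, Smoke), mean)  # within each age stratum

overall  # smokers look better: about 5.89 (no) vs 7.43 (yes)
within   # yet within each age, smokers are lower by exactly 1
```

Ignoring age reverses the apparent direction of the smoking effect, which is the situation the stratified boxplots are meant to expose.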
💡 Subsetting Data
Subsetting data involves selecting a subset of data from a larger dataset based on specific criteria. In the script, the use of square brackets is demonstrated to subset the data for individuals aged 18 or older, allowing for a focused analysis of the relationship between smoking and lung capacity within this age group.
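
A rough sketch of this kind of subsetting, using a few made-up observations and assumed variable names, could be:

```r
# Made-up observations standing in for the LungCapacity data.
Age     <- c(12, 15, 18, 19, 18, 14, 19, 18)
Smoke   <- factor(c("no", "no", "no", "yes", "yes", "no", "no", "yes"))
LungCap <- c(6.1, 8.0, 11.2, 9.5, 9.9, 7.4, 11.6, 10.1)

# A logical vector inside square brackets keeps only the TRUE positions.
adult <- Age >= 18

# Apply the same subset to both sides of the formula so the rows stay aligned.
bp <- boxplot(LungCap[adult] ~ Smoke[adult],
              xlab = "Smoke", ylab = "Lung Capacity",
              main = "Lung Capacity vs Smoking, ages 18+", las = 1)

bp$n  # observations per box: 2 non-smokers, 3 smokers
```

Storing the condition in a separate vector (here the hypothetical `adult`) avoids typing it twice and makes it harder to subset the two variables inconsistently.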
💡 R (Statistical Software)
R is a programming language and environment commonly used for statistical computing and graphics. The script refers to the use of R for importing data, creating variables, and generating stratified boxplots to analyze the relationship between smoking and lung capacity.
💡 Plot Customization
Plot customization refers to the process of modifying the appearance of a plot to improve its readability, aesthetics, or to convey information more effectively. The video script includes steps for customizing the boxplot, such as adding a box around the plot, re-labeling axes, and adding a legend to distinguish between smokers and non-smokers.
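
One possible way to carry out these clean-up steps with base R graphics is sketched below. The data are simulated, and the labels, tick positions, colors, and legend placement are assumptions chosen for illustration rather than the exact values used in the video.

```r
# Simulated stand-in for the LungCapacity data (all values are made up).
set.seed(4)
d <- expand.grid(Smoke    = c("no", "yes"),
                 AgeGroup = c("<13", "14/15", "16/17", "18+"),
                 id       = 1:10)
d$LungCap <- rnorm(nrow(d),
                   mean = 4 + 2 * as.integer(d$AgeGroup) - (d$Smoke == "yes"),
                   sd   = 1)

# axes = FALSE suppresses the default axes and frame so they can be redrawn.
bp <- boxplot(LungCap ~ Smoke * AgeGroup, data = d, axes = FALSE,
              col = c("blue", "red"),
              xlab = "Age Strata", ylab = "Lung Capacity",
              main = "Lung Capacity vs Smoking, by Age Strata")

box()              # draw the frame around the plot
axis(2, las = 1)   # y-axis with horizontal tick labels
# One x-axis label per age stratum, centred between each pair of boxes.
axis(1, at = c(1.5, 3.5, 5.5, 7.5), labels = levels(d$AgeGroup))
legend("topleft", legend = c("Non-smoker", "Smoker"),
       fill = c("blue", "red"), cex = 0.8)
```

Labelling each pair of boxes once and moving the smoker/non-smoker distinction into the legend removes the long combined labels that made rotation necessary in the first place.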
Highlights

Introduction to stratified Boxplots for examining relationships between categorical and numeric variables within strata.

Use of LungCapacity data to explore the relationship between Smoking and Lung Capacity within age groups.

Creation of an AgeGroup variable categorizing age into less than 13, 14-15, 16-17, and 18-plus.

Demonstration of converting a numeric variable to a categorical variable using the 'cut' command.

Basic box plot creation for a single variable, such as Lung Capacity.

Creating side-by-side boxplots to compare groups, like Smokers vs. Non-Smokers.

Observation of a counter-intuitive larger average Lung Capacity for Smokers.

Introduction of the confound concept where Smoking effect is confounded with Age effect.

Proposal to examine the relationship within Age Strata to avoid confounding.

Technique to subset data using square brackets for specific age groups.

Visualization of the relationship between Smoking and Lung Capacity within the 18-plus Age Strata.

Method to produce boxplots for each combination of Smoking category and AgeGroup category.

Improvement of plot readability by rotating x-axis labels and using color coding for groups.

Observation that within age strata, Smokers tend to have a lower Lung capacity than Non-Smokers.

Cleaning up and improving the plot with a box, re-labeling axes, and adding a legend.

Encouragement to watch additional videos for learning more about modifying plots.
