The Shape of Data: Distributions: Crash Course Statistics #7

CrashCourse
7 Mar 2018 · 11:23

TLDR: This video explains how the shape and distribution of data samples can provide insights into the larger data sets they represent. It discusses different types of distributions, including normal, skewed, multimodal, and uniform. Comparing distribution shapes helps us determine whether samples come from the same generative process and lets us make inferences about the world. Statistics allows us to examine samples, with their uncertainty, and guess at the true underlying distributions that created them.

Takeaways
  • 😀 Samples give us a glimpse of the bigger picture and tell us something about the shape of all the data
  • 👨‍🏫 Distributions represent all possible values for a set of data and how often those values occur
  • 📊 The shape of a normal distribution is set by the mean and standard deviation
  • ✏️ Data is often skewed with extreme values on one side rather than symmetric
  • 😮 Bimodal or multimodal data has two or more peaks and may come from two hidden distributions
  • 🎲 Uniform distributions have the same frequency for each value, like numbers on a die
  • 🔍 The shape of sample data gives us clues about the true underlying distribution
  • 🤔 We use statistics to make decisions when we are uncertain, based on patterns in sample data
  • 🚦 Comparing sample shapes helps us figure out if they come from the same distribution
  • 🎯 The shape of data gives us insight into what's really happening in the world
Q & A
  • What is a distribution in statistics?

    -A distribution represents all possible values for a set of data and how often those values occur. Distributions show the shape and spread of data.

  • How does a sample relate to a distribution?

    -A sample gives us a glimpse of what the distribution might look like for the full data set. We collect samples because we think they will tell us something about the shape of all the data.

  • What are some common shapes of distributions?

    -Some common distribution shapes are normal (symmetric, bell curve), skewed (with a long tail on one side), bimodal (two peaks), and uniform (all values equally likely).

  • What does the normal distribution look like?

    -The normal distribution is symmetric and bell-shaped. It has a single peak at the mean, with about 68% of values within 1 standard deviation of the mean.

  • How can you identify skewed distributions?

    -Skewed distributions have a long tail on one side. In a box plot, the median will not split the box in half. There may also be more outliers on the skewed side.

  • What causes bimodal distributions?

    -Bimodal distributions have two peaks and often occur when there are two groups being measured together, possibly with two underlying distributions.

  • What is a uniform distribution?

    -In a uniform distribution, each value has an equal chance of occurring, like the numbers on a die roll. Samples may not look perfectly uniform but we know the underlying distribution is.

  • Why do we care about distribution shapes?

    -The shape of a distribution gives us information about what generated the data. Different shapes imply different data generating processes.

  • What determines the shape of a normal distribution?

    -A normal distribution is determined by its mean, which defines the center, and its standard deviation, which defines how spread out it is.

  • What does comparing distribution shapes allow us to do?

    -By comparing the shapes of samples, we can make inferences about whether the underlying distributions that generated them are different or the same.
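
The box-plot clue for skew mentioned in the answers above can be checked numerically. The following is a minimal sketch, not something shown in the video: the two samples are made up for illustration (an exponential sample standing in for skewed data, a normal sample for symmetric data), and `box_halves` is a hypothetical helper name. It measures how far the median sits from each edge of the box.

```python
import random
import statistics


def box_halves(data):
    """Return the widths of the box below and above the median (Q1..median, median..Q3)."""
    q1, median, q3 = statistics.quantiles(data, n=4)
    return median - q1, q3 - median


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Assumed example data: a right-skewed sample and a symmetric sample.
skewed = [rng.expovariate(1.0) for _ in range(5000)]
symmetric = [rng.gauss(0.0, 1.0) for _ in range(5000)]

for name, sample in (("skewed", skewed), ("symmetric", symmetric)):
    lower, upper = box_halves(sample)
    print(f"{name:>9}: lower half {lower:.2f}, upper half {upper:.2f}")
```

For the skewed sample the upper half of the box is clearly wider than the lower half, and the mean is pulled above the median by the long tail. For the symmetric sample the two halves come out nearly equal.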

Outlines
00:00
😃 Introducing Data Distributions and Their Shapes

This section introduces the concept of data distributions, which represent all possible values for a data set and how often those values occur. It explains that a distribution can be visualized as a histogram whose bins are made narrower and narrower until the outline becomes a smooth curve, and it discusses discrete vs. continuous distributions. The distribution acts as the instructions for a data-generating machine, specifying how the data is shaped.
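
The idea of shrinking histogram bins can be tried directly. This is a rough sketch under assumptions of my own: the "machine" is Python's `random.gauss`, the heights (mean 170 cm, standard deviation 10 cm) are invented numbers, and `histogram` is a hypothetical helper rather than anything from the video.

```python
import math
import random
from collections import Counter


def histogram(data, bin_width):
    """Map each bin's left edge to the number of values that fall in that bin."""
    counts = Counter(math.floor(x / bin_width) for x in data)
    return {index * bin_width: counts[index] for index in sorted(counts)}


rng = random.Random(7)  # fixed seed so the sketch is repeatable

# Assumed data-generating machine: heights in cm drawn from a normal distribution.
heights = [rng.gauss(170.0, 10.0) for _ in range(2000)]

for width in (10, 5, 1):
    bins = histogram(heights, width)
    print(f"bin width {width:>2} cm -> {len(bins)} bins, tallest bar {max(bins.values())}")
```

As the bin width shrinks, the same 2000 values are spread over more and thinner bars, which is the first step toward the smooth curve described above.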

05:00
😊 Exploring Common Distribution Shapes

This section explores some of the most common distribution shapes, starting with the normal distribution (the bell curve). It explains that the normal distribution is symmetric and unimodal, and that its shape is determined by the mean and standard deviation. It then discusses skewed distributions, comparing the shapes of two sample test score distributions. Finally, it introduces bimodal and multimodal distributions, as well as the uniform distribution.
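
To make the role of the mean and standard deviation concrete, here is a small sketch using `statistics.NormalDist` from the Python standard library. The two distributions (mean 100, standard deviations 15 and 5) are arbitrary choices for illustration, and `share_within` is a hypothetical helper name.

```python
from statistics import NormalDist

# Two assumed normal distributions with the same center but different spread.
wide = NormalDist(mu=100, sigma=15)
narrow = NormalDist(mu=100, sigma=5)


def share_within(dist, k=1.0):
    """Fraction of the distribution within k standard deviations of the mean."""
    return dist.cdf(dist.mean + k * dist.stdev) - dist.cdf(dist.mean - k * dist.stdev)


for name, dist in (("wide", wide), ("narrow", narrow)):
    print(
        f"{name:>6}: peak height {dist.pdf(dist.mean):.4f}, "
        f"share within 1 sd {share_within(dist):.4f}"
    )
```

The smaller standard deviation gives a taller, skinnier curve, yet the share of data within one standard deviation of the mean stays at roughly 68% in both cases.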

10:05
🤓 Using Statistics to Understand Distributions

This concluding section explains how statistics allows us to make inferences about the true underlying distribution that generated a sample of data, despite randomness and uncertainty. It gives examples of using distribution shapes for real-world tasks, such as determining whether a die is loaded. It emphasizes that the goal is to glimpse the true nature of what is happening in the world.
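
One common way to put the loaded-die question into numbers is a chi-square goodness-of-fit statistic. The video summary does not name this technique, so treat what follows as an assumed illustration: the die weights, the number of rolls, and the helper names are all hypothetical. The threshold of 11.07 is the standard table value for 5 degrees of freedom at the 5% level.

```python
import random
from collections import Counter

FACES = [1, 2, 3, 4, 5, 6]
CRITICAL_5_PERCENT = 11.07  # chi-square table value, 5 degrees of freedom, 5% level


def chi_square_uniform(counts):
    """Chi-square statistic comparing observed counts to equal expected counts."""
    expected = sum(counts) / len(counts)
    return sum((observed - expected) ** 2 / expected for observed in counts)


def roll_counts(weights, n_rolls, rng):
    """Roll a die with the given face weights and tally how often each face appears."""
    tally = Counter(rng.choices(FACES, weights=weights, k=n_rolls))
    return [tally[face] for face in FACES]


rng = random.Random(42)  # fixed seed so the sketch is repeatable

fair = roll_counts([1, 1, 1, 1, 1, 1], 600, rng)
loaded = roll_counts([1, 1, 1, 1, 1, 5], 600, rng)  # assumed: six comes up half the time

for name, counts in (("fair", fair), ("loaded", loaded)):
    stat = chi_square_uniform(counts)
    verdict = "looks loaded" if stat > CRITICAL_5_PERCENT else "consistent with uniform"
    print(f"{name:>6}: counts {counts}, statistic {stat:.1f} -> {verdict}")
```

A fair die will still produce uneven counts in any one sample, so the statistic is rarely exactly zero. The point is that a heavily loaded die produces a shape far outside what random variation around a uniform distribution would explain.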

Keywords
💡distribution
A distribution represents all possible values for a set of data and how often those values occur. It tells us about the shape and spread of data. The video discusses different types of distributions like normal, skewed, bimodal, and uniform.
💡sample
A sample is a subset of data collected from a larger population. The video says we collect samples because they give us a glimpse of the bigger picture and tell us something about the shape of all the data.
💡histogram
A histogram is a graph showing the frequency distribution of continuous data using bars. As the bins get smaller, a histogram starts to look like a smooth curve representing the distribution.
💡normal distribution
Also called a bell curve. It is symmetric, with a single peak, and defined by its mean and standard deviation. Many natural processes generate data that is approximately normally distributed.
💡skewed distribution
A skewed distribution has extreme values on one side, making it asymmetrical. This stretches one side of the boxplot. Skew tells us where most of the data is concentrated.
💡bimodal distribution
A distribution with two peaks. This often happens when two groups are measured together. Bimodal data may actually come from two overlapping distributions.
💡uniform distribution
A distribution in which all values have an equal probability of occurring, like the numbers on a die. Samples may not look perfectly uniform, but we can still infer the underlying distribution.
💡statistics
The video says statistics allows us to make decisions when we're not sure: to look at the shape of samples and guess the true distribution. Statistics helps us understand patterns in noisy real-world data.
💡shape
The video emphasizes how the shape of data gives us information about the underlying distribution - whether it's normal, skewed, bimodal etc. Shape helps us make comparisons between samples.
💡random
Data is often generated randomly based on a distribution. So samples vary randomly as well. Statistics helps us determine if differences between samples are significant or random.
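
The bimodal entry above, where two groups measured together produce two peaks, can be sketched by pooling draws from two separate distributions. Everything specific here is an assumption for illustration: the two "machines" are normal distributions centered at 2 and 8 with a standard deviation of 0.5, and the helper names are hypothetical. These are not real eruption or measurement data.

```python
import random


def mixture_sample(n_per_machine, rng):
    """Pool draws from two assumed 'machines' that follow different normal distributions."""
    machine_a = [rng.gauss(2.0, 0.5) for _ in range(n_per_machine)]
    machine_b = [rng.gauss(8.0, 0.5) for _ in range(n_per_machine)]
    return machine_a + machine_b


def count_in(data, low, high):
    """Number of values in the half-open interval [low, high)."""
    return sum(1 for x in data if low <= x < high)


rng = random.Random(1)  # fixed seed so the sketch is repeatable
pooled = mixture_sample(1000, rng)

# Crude text histogram with bins one unit wide.
for left in range(0, 10):
    bar = "#" * (count_in(pooled, left, left + 1) // 20)
    print(f"{left:>2}-{left + 1:<2} {bar}")
```

The pooled sample shows two clear peaks with an empty valley between them, even though each machine on its own produces a single-peaked shape.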
Highlights

Samples and the shapes they give us are shadows of what all the data would look like.

We collect samples because we think they’ll give us a glimpse of the bigger picture.

It turns out we can learn almost everything we need to know about data from its shape.

A distribution represents all possible values for a set of data and how often those values occur.

The shape of a normal distribution is set by two familiar statistics: the mean and standard deviation.

Skew can be a useful way to compare data.

Answering whether one distribution could have produced two samples gets complicated, but we’ll get there.

Often when you see multimodal data in the world it’s because there are two different machines with two different distributions that are both generating data.

While we don’t know for sure that bimodal data is secretly two distributions disguised as one, it is a good reason to look at things more closely.

There’s a difference between the shape of all the data, and the shape of a sample of the data.

Using statistics allows us to take the shape of samples that have some randomness and uncertainty, and make a guess about the true distribution that created that sample of data.

Whether it’s finding the true distribution of eruption times at Old Faithful, or showing evidence that a company is discriminating based on age, gender, or race, the shape of data gives us a glimpse into the true nature of what is happening in the world.

Picture a histogram of every single person’s height. Now imagine the bars getting thinner and thinner as the bins get smaller and smaller, until they are so thin that the outline of our histogram looks like a smooth line, since there’s an infinite possibility of heights.

If we let our bars be infinitely small, we get a smooth curve, also known as the distribution, of the data.

We’ll have a skinnier normal distribution. Most of the data in the normal distribution—about 68%—is within 1 standard deviation of the mean on either side.
