Khan Academy - Statistics and Probability - Unit 4 MODELING DATA DISTRIBUTIONS

MODELING DATA DISTRIBUTIONS

PART 1 Percentiles

PART 2 Z-scores

PART 3 Effects of linear transformation

PART 4 Density curves

PART 5 Normal distribution and the empirical rule

PART 6 Normal distribution calculations

PART 7 More on normal distributions


PART 1 Percentiles

1. Calculating percentile

(1) Method 1: the percent of the data that is below the amount in question

(2) Method 2: the percent of the data that is at or below the amount in question
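
The two methods can give different answers when the amount in question appears in the data. A minimal sketch, assuming a made-up list of quiz scores (the data and the function names are illustrative, not from the course):

```python
def percentile_below(data, value):
    """Method 1: percent of data points strictly below `value`."""
    return 100 * sum(x < value for x in data) / len(data)


def percentile_at_or_below(data, value):
    """Method 2: percent of data points at or below `value`."""
    return 100 * sum(x <= value for x in data) / len(data)


# Hypothetical quiz scores, made up for illustration.
scores = [4, 5, 6, 6, 7, 7, 7, 8, 9, 10]

print(percentile_below(scores, 7))        # 40.0 (four scores are below 7)
print(percentile_at_or_below(scores, 7))  # 70.0 (seven scores are at or below 7)
```

Which method applies depends on the wording of the problem, so it is worth checking whether "below" or "at or below" is intended.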

2. Frequency, Relative frequency, Cumulative relative frequency

(1) Frequency: the number of times a given data value occurs in a data set.

(2) Relative frequency: the fraction or proportion of times a given data value occurs.

(3) Cumulative relative frequency: the sum of the relative frequencies of all values up to and including the current one.
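
The three quantities can be tabulated together. A small sketch, assuming a made-up data set (the variable names are illustrative):

```python
from collections import Counter
from itertools import accumulate

# Hypothetical data set, made up for illustration.
data = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4]

freq = Counter(data)                      # frequency of each value
values = sorted(freq)                     # distinct values in increasing order
n = len(data)
rel = [freq[v] / n for v in values]       # relative frequencies
cum = list(accumulate(rel))               # cumulative relative frequencies

for v, r, c in zip(values, rel, cum):
    print(v, freq[v], round(r, 2), round(c, 2))
```

The last cumulative relative frequency should always come out as 1 (up to rounding), since it accumulates the whole data set.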

 


PART 2 Z-scores

1. Z-score / Standard score: gives you an idea of how far from the mean a data point is. More technically, it measures how many standard deviations above or below the mean a data point is.

2. Formula: $z = \frac{x - \mu}{\sigma}$, where $x$ is the data point, $\mu$ is the mean, and $\sigma$ is the standard deviation

3. Important facts about z-scores:

(1) a positive z-score means the data point is above average

(2) a negative z-score means the data point is below average

(3) a z-score close to 0 means the data point is close to average

(4) a data point can be considered unusual if its z-score is above 3 or below -3

(5) Z-score can be applied to any distribution. It just means how many standard deviations you are away from the mean.

[NOTE] The “unusual” guideline is not an absolute rule. Some may say that a z-score beyond $\pm 2$ is unusual, while beyond $\pm 3$ is highly unusual. Some may use $\pm 2.5$ as the cutoff.

4. Comparing the z-scores

[Figure: worked example comparing one test-taker's z-scores on the LSAT and on a second exam]

He did slightly better on the LSAT because his LSAT score was more standard deviations above the mean. Because the two z-scores, 2.1 and 1.86, are close, the two scores are comparable.

5. Z-scores allow us to calculate the probability of a score occurring within a normal distribution and to compare two scores that come from different normal distributions.
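
A short sketch of the z-score calculation and of comparing scores from two different scales. All numbers below are hypothetical and chosen only to make the arithmetic easy; `z_score` is an illustrative helper name.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd


# Hypothetical exams on different scales (made-up means and standard deviations).
z_exam_a = z_score(85, mean=70, sd=10)     # 1.5 standard deviations above the mean
z_exam_b = z_score(620, mean=500, sd=100)  # 1.2 standard deviations above the mean

print(z_exam_a, z_exam_b)
print("better relative result:", "exam A" if z_exam_a > z_exam_b else "exam B")
```

The raw scores 85 and 620 cannot be compared directly, but the z-scores can, because both are expressed in standard deviations from their own mean.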

 


PART 3 Effects of linear transformation


1. If the data is shifted (adding a constant to each data point in the dataset) by 1 unit:

  • the measures of the center will increase by 1
  • the measures of the spread will not be affected

2. If the data is scaled (multiplying each data point in the dataset by a constant) by a factor of 10:

  • both the measures of center and the measures of spread will be multiplied by 10

[eg] The amount of cleaning solution a company fills its bottles with has a mean of 33 fl oz and a standard deviation of 1.5 fl oz. The company advertises that these bottles have 32 fl oz of cleaning solution. What will be the mean and standard deviation of the distribution of excess cleaning solution, in milliliters? (1 fl oz is approximately 30 mL.)

Solution: the excess is (amount − 32) fl oz, and converting to milliliters multiplies by 30, so excess in mL = 30 × (amount − 32).

  • Mean: 30 × (33 − 32) = 30 mL
  • Standard deviation: 30 × 1.5 = 45 mL (the shift by −32 does not change the spread; the scaling by 30 does)
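
The bottle example can be checked with a few lines of code. This is a sketch: `transform_summary` is a hypothetical helper, and the three fill amounts are made-up values chosen so that their mean is 33 and their sample standard deviation is 1.5.

```python
import statistics


def transform_summary(mean, sd, shift=0.0, scale=1.0):
    """Mean and SD of y = scale * (x + shift), given the mean and SD of x."""
    new_mean = scale * (mean + shift)  # the center is shifted, then scaled
    new_sd = abs(scale) * sd           # the spread ignores the shift
    return new_mean, new_sd


# Excess in mL = 30 * (amount in fl oz - 32).
mean_ml, sd_ml = transform_summary(33.0, 1.5, shift=-32.0, scale=30.0)
print(mean_ml, sd_ml)

# Cross-check on made-up fill amounts with mean 33 and sample SD 1.5.
fills_oz = [31.5, 33.0, 34.5]
excess_ml = [30.0 * (x - 32.0) for x in fills_oz]
print(statistics.mean(excess_ml), statistics.stdev(excess_ml))
```

Both approaches should agree: transforming the summary statistics directly gives the same result as transforming every data point and recomputing.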

 


PART 4 Density curves

1. Density curve: a graph that shows how probability is distributed over the possible values. The total area under a density curve is equal to 1 (100%).

[eg] [Figure: density curve showing how body weights are distributed]

2. Mean, median, and skew from density curves

(1) Density curves can describe skewed distributions

(2) Right- or left-skew refers to the side on which the longer tail lies, not to the side where the peak of the graph sits

(3) In a right-skewed distribution the mean usually lies to the right of the median; in a left-skewed distribution it usually lies to the left.

3. Properties of continuous probability density functions:

(1) The outcomes are measured, not counted

(2) The entire area under the curve and above the x-axis is equal to one

(3) Probability is found for intervals of $x$ values, rather than for individual $x$ values

(4) $P(c < x < d)$ is the probability that the random variable $X$ is in the interval between the values $c$ and $d$. It is the area under the curve, above the x-axis, to the right of $c$ and the left of $d$

(5) $P(x = c) = 0$. The probability that $x$ takes on any single individual value is zero.

(6) $P(c < x < d)$ is the same as $P(c \le x \le d)$ because the probability is equal to the area.
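
These properties can be checked numerically on a simple density. The sketch below assumes a made-up triangular density $f(x) = x/2$ on $[0, 2]$ and approximates areas with a midpoint rule; both the density and the helper `area` are illustrative choices, not part of the course material.

```python
def f(x):
    """Made-up density: f(x) = x / 2 on [0, 2], zero elsewhere."""
    return x / 2 if 0 <= x <= 2 else 0.0


def area(pdf, a, b, steps=10_000):
    """Approximate the area under pdf between a and b with the midpoint rule."""
    width = (b - a) / steps
    return sum(pdf(a + (i + 0.5) * width) for i in range(steps)) * width


total = area(f, 0.0, 2.0)      # whole area under the curve, should be about 1
middle = area(f, 0.5, 1.5)     # P(0.5 < x < 1.5), should be about 0.5
single = area(f, 1.0, 1.0)     # P(x = 1): an interval of zero width has zero area

print(total, middle, single)
```

Because a single point has zero width, including or excluding the endpoints does not change the area, which is why $<$ and $\le$ give the same probability for a continuous variable.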

 


PART 5 Normal distribution and the empirical rule

1. Normal distribution: a continuous probability distribution that describes how the values of a variable are distributed.

2. Features of normal distribution:

(1) symmetric bell shape

(2) mean and median are equal; both located at the center of the distribution

(3) most of the observations cluster around the central peak

(4) extreme values in both tails of the distribution are similarly unlikely.

3. Empirical rule for the normal distribution

(1) around 68% of the data falls within 1 standard deviation of the mean

(2) around 95% of the data falls within 2 standard deviations of the mean

(3) around 99.7% of the data falls within 3 standard deviations of the mean

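
The empirical-rule percentages can be reproduced from the standard normal distribution. A minimal sketch using Python's `statistics.NormalDist`; the helper name `within_k_sd` is illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1


def within_k_sd(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)


for k in (1, 2, 3):
    print(k, round(100 * within_k_sd(k), 1))  # 68.3, 95.4, 99.7
```

The rounded figures 68%, 95%, and 99.7% are approximations of these values.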

4. Standard normal distribution

(1) The normal distribution has many different shapes depending on the parameter values (mean and standard deviation)

(2) The standard normal distribution is a special case of the normal distribution where the mean is zero and the standard deviation is 1.

 


PART 6 Normal distribution calculations

1. Z table: A z-table, also called the standard normal table, is a mathematical table that allows us to know the percentage of values below a z-score in a standard normal distribution.

2. Standard normal table for proportion below: convert the value to a z-score and look it up; the table entry is the proportion of values below that z-score.

3. Standard normal table for proportion above: look up the z-score and subtract the table entry from 1.

4. Standard normal table for proportion between values: look up both z-scores and subtract the smaller table entry from the larger one.
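
The three kinds of table lookup (below, above, between) can be sketched in code by computing the standard normal proportion directly instead of reading a printed table. The function names are illustrative, and the z-scores in the usage lines are arbitrary examples.

```python
from math import erf, sqrt


def phi(z):
    """Proportion of the standard normal distribution below z (what a z-table lists)."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def proportion_below(z):
    return phi(z)


def proportion_above(z):
    return 1.0 - phi(z)


def proportion_between(z_low, z_high):
    return phi(z_high) - phi(z_low)


print(round(proportion_below(1.0), 4))          # 0.8413
print(round(proportion_above(1.0), 4))          # 0.1587
print(round(proportion_between(-1.0, 2.0), 4))  # 0.8186
```

For a value from a non-standard normal distribution, convert it to a z-score first and then apply the same functions.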

5. Finding z-score for percentile

Pay attention to the cutoff value:

  • When looking for a probability of less than 10%, find the cell in the z-table whose value is closest to 10% while still being less than or equal to 10%.
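
The table-lookup rule above can be compared with an exact inverse calculation. A sketch, assuming a z-table with steps of 0.01 between −3.49 and 3.49 (a common layout, but an assumption here); `table_z_at_most` is a hypothetical helper that mimics scanning such a table.

```python
from statistics import NormalDist

std_normal = NormalDist()

# Exact z-score with 10% of the distribution below it.
exact_z = std_normal.inv_cdf(0.10)


def table_z_at_most(p):
    """Largest z on a 0.01 grid whose proportion below is still <= p."""
    best = None
    for hundredths in range(-349, 350):
        z = hundredths / 100
        if std_normal.cdf(z) <= p:
            best = z       # still at or under the target proportion
        else:
            break          # cdf is increasing, so no later z can qualify
    return best


print(round(exact_z, 4))      # -1.2816
print(table_z_at_most(0.10))  # -1.29
```

The table answer is slightly more extreme than the exact one because z = −1.28 corresponds to a proportion just above 10%.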

 


PART 7 More on normal distributions

1. Why is the normal distribution important?

(1) Some statistical hypothesis tests assume that the data follow a normal distribution.

  • [Figure: hypothesis test hierarchy]

(2) Linear and nonlinear regression both assume that the residuals follow a normal distribution

(3) The central limit theorem states that as the sample size increases, the sampling distribution of the mean approaches a normal distribution, even when the underlying distribution of the original variable is non-normal.

  • Central limit theorem: if you have a population with mean $\mu$ and standard deviation $\sigma$ and take sufficiently large random samples from the population with replacement, then the distribution of the sample means will be approximately normal.
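
A small simulation can make the central limit theorem concrete. This sketch assumes an exponential population (mean 1, standard deviation 1), a sample size of 50, and 2000 repetitions; all of these are arbitrary illustrative choices. The comparison with $\sigma/\sqrt{n}$ uses the standard result for the spread of sample means, which these notes do not state explicitly.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

SAMPLE_SIZE = 50
REPETITIONS = 2000


def sample_mean(n):
    """Mean of n draws from a skewed (exponential) population with mean 1 and SD 1."""
    return statistics.fmean(rng.expovariate(1.0) for _ in range(n))


means = [sample_mean(SAMPLE_SIZE) for _ in range(REPETITIONS)]

mean_of_means = statistics.fmean(means)
sd_of_means = statistics.stdev(means)
# Share of sample means within one SD of their center; about 0.68 if roughly normal.
share_within_1sd = sum(abs(m - mean_of_means) <= sd_of_means for m in means) / len(means)

print(mean_of_means)     # close to the population mean, 1
print(sd_of_means)       # close to 1 / sqrt(50), about 0.14
print(share_within_1sd)
```

Even though the population is strongly right-skewed, the sample means cluster symmetrically around the population mean.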

2. Normal distribution plays an important role in inferential statistics. Inferential statistics use a random sample to draw conclusions about a population because it is not practical to obtain data from all population members.

3. Probability density function (PDF) vs. probability mass function (PMF)

(1) PMF: a function that gives the probability that a discrete random variable is exactly equal to some value, denoted as $p(x)$.

  • $p(x) = P(X = x)$

(2) PDF: the continuous analog of the PMF, denoted as $f(x)$. The probability density function of a continuous random variable describes how likely the variable is to take a value near a given point.

  • $P(a \le X \le b) = \int_a^b f(x)\,dx$

4. Probability density function (PDF) vs. cumulative distribution function (CDF)

(1) CDF: the CDF of a random variable $X$ is the probability that $X$ will take a value less than or equal to $x$, denoted as $F(x)$

  • $F(x) = P(X \le x)$
  • $P(a < X \le b) = F(b) - F(a)$

(2) Denote $f(x)$ as the probability density function, and $F(x)$ as the cumulative distribution function

  • $F(x) = \int_{-\infty}^{x} f(t)\,dt$
  • $f(x) = F'(x)$
  • The value of the distribution function of the random variable $X$ at a point $x$ is the definite integral of the probability density function from $-\infty$ to $x$, that is, the area under the density curve from $-\infty$ to $x$
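
The relationship between the PDF and the CDF (the CDF is the accumulated area under the PDF, and the PDF is the slope of the CDF) can be checked numerically. A sketch for the standard normal distribution; the lower limit of −8 stands in for $-\infty$, and the step counts are arbitrary choices.

```python
from statistics import NormalDist

dist = NormalDist(mu=0.0, sigma=1.0)


def cdf_by_integration(x, lower=-8.0, steps=20_000):
    """Approximate F(x) as the area under the PDF from `lower` to x (midpoint rule)."""
    width = (x - lower) / steps
    return sum(dist.pdf(lower + (i + 0.5) * width) for i in range(steps)) * width


def pdf_by_differencing(x, h=1e-5):
    """Approximate f(x) as the slope of the CDF at x (central difference)."""
    return (dist.cdf(x + h) - dist.cdf(x - h)) / (2 * h)


print(cdf_by_integration(1.0), dist.cdf(1.0))   # both about 0.8413
print(pdf_by_differencing(1.0), dist.pdf(1.0))  # both about 0.2420
```

Both comparisons should agree to several decimal places, which illustrates that integrating the density gives the cumulative function and differentiating the cumulative function gives back the density.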

5. Normal distribution vs. Binomial distribution vs. Poisson distribution

(1) Normal distribution:

  • describes continuous data which have a symmetric distribution, with a characteristic ‘bell’ shape. $X \sim N(\mu, \sigma^2)$
  • probability density function of the normal distribution: $f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}$
  • In a normal distribution, a probability is not read off the height of the curve; it is the area under the curve over an interval.
  • $P(a \le X \le b) = \int_a^b f(x)\,dx$

(2) Binomial distribution:

  • describes the distribution of binary data from a finite sample. Thus it gives the probability of getting $r$ events out of $n$ trials (and the probability of success in each trial is $p$). $X \sim B(n, p)$
  • probability mass function of the binomial distribution: the probability of getting exactly $r$ successes in $n$ independent binomial trials is given by $P(X = r) = \binom{n}{r} p^r (1 - p)^{n - r}$
  • The normal distribution can be used as an approximation to the binomial distribution, under certain circumstances: if $X \sim B(n, p)$ and if $n$ is large and/or $p$ is close to ½, then $X$ is approximately $N(np, npq)$, where $q = 1 - p$
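
The binomial probability mass function and its normal approximation can be compared directly. A sketch with $n = 100$ and $p = 0.5$, both arbitrary illustrative values. The half-unit widening of the interval (a continuity correction) is a common refinement that is an assumption here and not part of the notes above.

```python
from math import comb, sqrt
from statistics import NormalDist


def binomial_pmf(r, n, p):
    """Probability of exactly r successes in n independent trials."""
    return comb(n, r) * p**r * (1 - p) ** (n - r)


n, p = 100, 0.5
q = 1 - p

# Exact binomial probability that X is between 45 and 55 inclusive.
exact = sum(binomial_pmf(r, n, p) for r in range(45, 56))

# Normal approximation N(np, npq); NormalDist takes the standard deviation, sqrt(npq).
approx_dist = NormalDist(mu=n * p, sigma=sqrt(n * p * q))
# Continuity correction (assumption): widen the interval by 0.5 on each side.
approx = approx_dist.cdf(55.5) - approx_dist.cdf(44.5)

print(exact, approx)
```

With a large $n$ and $p$ at ½ the two numbers should be very close; for small $n$ or $p$ far from ½ the approximation gets noticeably worse.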

(3) Poisson distribution:

  • describes the distribution of binary data from an infinite sample. Thus it gives the probability of getting $r$ events in a population.
  • probability mass function: $P(X = r) = \frac{\lambda^r e^{-\lambda}}{r!}$, where $\lambda$ is the mean number of events
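
A short sketch of the Poisson probability mass function, assuming the standard form $P(X = r) = \lambda^r e^{-\lambda} / r!$ with $\lambda$ as the mean number of events. The rate of 2 events is a made-up value for illustration.

```python
from math import exp, factorial


def poisson_pmf(r, lam):
    """Probability of exactly r events when the mean number of events is lam."""
    return lam**r * exp(-lam) / factorial(r)


lam = 2.0  # hypothetical mean number of events

for r in range(5):
    print(r, round(poisson_pmf(r, lam), 4))

# The probabilities over all possible counts add up to 1 (checked here up to r = 49).
print(sum(poisson_pmf(r, lam) for r in range(50)))
```

Unlike the binomial case, there is no fixed number of trials, so the count $r$ has no upper limit; in practice the probabilities become negligible quickly.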