Statistics Summary
7. Parameter Estimation
- Model and parameters
- Properties of good estimators
- Unbiasedness, consistency
- UMVUE, efficiency
- MLE
- Bayesian Estimation
- why?
- Prior and Posterior
- Conjugate distribution
- Limitations
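Reason: statistical estimation is not a general estimation problem; it is posed within a probability model whose parameters are unknown.
Formulation:
- The samples $X_1, \dots, X_n$ are drawn from a model $f(x; \theta)$ with unknown parameter $\theta$ in the parameter space $E$; an estimator $\hat\theta_n = \hat\theta(X_1, \dots, X_n)$ is a function of the samples only.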
Properties of Good Estimators:
Correctness:
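- Unbiasedness: the expectation of the estimator's sampling distribution equals the population parameter being estimated, $E[\hat\theta_n] = \theta$.
- Consistency: as the sample size grows, the estimator converges to the population parameter being estimated, $\hat\theta_n \to \theta$ in probability as $n \to \infty$.
- Example: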
- Accurate:
- Efficient:
- UMVUE (uniformly minimum-variance unbiased estimator) is a very restrictive requirement. Efficiency is a weaker condition.
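The two correctness properties above can be checked by simulation. The sketch below is only an illustration under assumed settings: the normal population, the parameter values, and all variable names are hypothetical and not taken from the notes. It compares the variance estimator that divides by n - 1 (unbiased) with the one that divides by n (biased), and shows the sample mean closing in on the true mean as n grows.

```r
set.seed(1)
mu <- 2; sigma <- 3          # hypothetical true parameters (assumption)

# Unbiasedness: average each estimator over many repeated samples of size n.
n <- 10; reps <- 20000
var_unb <- numeric(reps)     # divides by n - 1
var_mle <- numeric(reps)     # divides by n
for (r in 1:reps) {
  x <- rnorm(n, mean = mu, sd = sigma)
  var_unb[r] <- var(x)
  var_mle[r] <- var(x) * (n - 1) / n
}
mean_unb <- mean(var_unb)    # close to sigma^2 = 9
mean_mle <- mean(var_mle)    # close to sigma^2 * (n - 1) / n = 8.1

# Consistency: the error of the sample mean shrinks as n grows.
err_small <- abs(mean(rnorm(100, mu, sigma)) - mu)
err_large <- abs(mean(rnorm(1e6, mu, sigma)) - mu)
```

The biased estimator is still consistent, since the factor (n - 1)/n tends to 1; unbiasedness and consistency are separate properties.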
Maximum Likelihood Estimation
Why?
MLE is a framework for designing consistent and efficient estimators under very general conditions.
Formulation
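- The likelihood function: $L(\theta; x) = \prod_{i=1}^{n} f(x_i; \theta)$ for i.i.d. samples with density (or probability mass function) $f(\cdot; \theta)$.
- MLE: for given data samples $X = x$, $\hat\theta_{\mathrm{MLE}} = \arg\max_{\theta} L(\theta; x)$, where the maximum is taken over the parameter space $E$.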
Limitations:
- Solving the MLE problem, even numerically, can be very challenging.
- MLE does not guarantee good performance in finite samples.
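As a minimal sketch of solving an MLE numerically, the block below fits the rate of an exponential model by minimising the negative log-likelihood with base R's `optimize`, then compares the result with the closed-form answer. The exponential model, the simulated data, and the search interval are assumptions chosen for illustration; in harder problems no closed form exists and the numerical search is the difficult part.

```r
set.seed(2)
x <- rexp(500, rate = 1.5)   # hypothetical data; true rate is 1.5 (assumption)

# Negative log-likelihood of an Exponential(rate) model for i.i.d. data.
neg_loglik <- function(rate) -sum(dexp(x, rate = rate, log = TRUE))

# One-dimensional numerical search over an assumed interval for the rate.
fit <- optimize(neg_loglik, interval = c(1e-6, 100))
rate_numeric <- fit$minimum
rate_closed  <- 1 / mean(x)  # closed-form MLE for this particular model
```

The two values agree up to the optimiser's tolerance, which is a useful sanity check before trusting a numerical MLE in a model without a closed form.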
Bayesian Estimation
With Bayesian estimation, we can easily update our estimator as samples are collected sequentially.
Formulation:
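- $\theta$ is treated as a random variable taking values in the parameter space $E$
- $f_0(\theta)$ as the prior of $\theta$
- $f_1(\theta) = f(\theta \mid x) = \dfrac{f(x \mid \theta)\, f_0(\theta)}{\int_E f(x \mid \theta')\, f_0(\theta')\, d\theta'}$, called the posterior, which gives the distribution of $\theta$ conditional on the data $X = x$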
Sequential Bayesian Estimation
Intuitively, if more data $X_{n+1}, \dots, X_{n+m}$ become available, we can take the previous posterior $f_1$ as the new prior and update the belief again using the new data only: $f_2(\theta) \propto f(x_{n+1}, \dots, x_{n+m} \mid \theta)\, f_1(\theta)$.
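A conjugate pair makes the sequential update concrete. The sketch below assumes Bernoulli data with a Beta prior, which is a standard conjugate example and not necessarily the one used in the course; the helper `update_beta` and the data are hypothetical. Updating on the first 20 points and then on the next 10 gives the same posterior as updating on all 30 at once.

```r
set.seed(3)
x <- rbinom(30, size = 1, prob = 0.7)   # hypothetical Bernoulli data (assumption)

# Conjugate update: Beta(a, b) prior plus Bernoulli data gives a Beta posterior.
update_beta <- function(prior, data) {
  c(a = prior[["a"]] + sum(data), b = prior[["b"]] + sum(1 - data))
}

prior      <- c(a = 1, b = 1)                    # uniform prior on (0, 1)
post_batch <- update_beta(prior, x)              # all 30 points at once
post_first <- update_beta(prior, x[1:20])        # first 20 points
post_seq   <- update_beta(post_first, x[21:30])  # then only the 10 new points

post_mean <- post_seq[["a"]] / (post_seq[["a"]] + post_seq[["b"]])
```

Because the posterior stays in the Beta family, each update only adds counts to the two parameters, so no integral has to be computed.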
Limitations:
- Its dependence on the prior, which can be any distribution on E. A very strong prior could lead to an inconsistent estimator.
- In the information-based trade example, what will happen if we pick p0 = 1?
On the other hand, a weak prior could lead to slow convergence.
- The computation of the posterior could be very costly when the parameter space E is large.
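The effect of an overly strong prior can be seen on a small discretised parameter space. This sketch is not the information-based trade example from the course; the Bernoulli model, the grid `theta`, and both priors are assumptions made for illustration. A prior that puts all its mass on one value can never move, whatever the data say, while a flat prior concentrates near the true value.

```r
set.seed(4)
x <- rbinom(200, size = 1, prob = 0.8)   # hypothetical data; true theta = 0.8
theta <- seq(0.05, 0.95, by = 0.05)      # discretised parameter space (assumption)

# Posterior on the grid: prior times likelihood, then normalise.
posterior <- function(prior) {
  k <- sum(x); n <- length(x)
  w <- prior * theta^k * (1 - theta)^(n - k)
  w / sum(w)
}

flat  <- rep(1 / length(theta), length(theta))   # weak prior
dogma <- as.numeric(abs(theta - 0.5) < 1e-9)     # all prior mass on theta = 0.5

map_flat  <- theta[which.max(posterior(flat))]   # lands near the true value
map_dogma <- theta[which.max(posterior(dogma))]  # stays at 0.5 regardless of data
```

The grid also hints at the cost issue: the posterior needs one likelihood evaluation per grid point, and the number of points grows quickly with the dimension of the parameter.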
8. Confidence Interval
- Three constructions of CI for i.i.d. samples:
- normal
- t
- bootstrap
- When and how?
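Central Limit Theorem
- Theorem: $X_1, X_2, \dots$ is a sequence of i.i.d. samples of $X$ with $E[X] = \mu$ and $\mathrm{Var}(X) = \sigma^2 < \infty$. Then, $\sqrt{n}\,(\bar X_n - \mu)/\sigma \Rightarrow N(0, 1)$ as $n \to \infty$, where $\bar X_n = \frac{1}{n}\sum_{i=1}^{n} X_i$.
- Therefore, when n is “large”, for any α > 0, $P(|\bar X_n - \mu| \le \alpha\,\sigma/\sqrt{n}) \approx P(|Z| \le \alpha)$, where Z is a standard normal r.v.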
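Confidence Interval (z-distribution)
- For any confidence level $1 - \alpha$, we simply choose $\varphi_\alpha$ such that $P(|Z| \le \varphi_\alpha) = 1 - \alpha$, then the $1 - \alpha$ confidence interval is $\bar X_n \pm \varphi_\alpha\, \hat\sigma / \sqrt{n}$, where $\hat\sigma$ is the sample standard deviation.
- A 95% CI means: if the sampling were repeated 100 times, roughly 95 of the resulting intervals would contain the true value and about 5 would not.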
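The repeated-sampling reading of a 95% CI can be checked directly. The sketch below assumes exponential data (deliberately non-normal) and hypothetical values for the mean, the sample size, and the number of repetitions; it builds the normal-based interval many times and records how often the interval contains the true mean.

```r
set.seed(5)
mu <- 10; n <- 50; reps <- 2000    # hypothetical settings (assumption)
z <- qnorm(0.975)                  # normal coefficient for a 95% interval

covered <- logical(reps)
for (r in 1:reps) {
  x <- rexp(n, rate = 1 / mu)      # non-normal data with mean mu
  half <- z * sd(x) / sqrt(n)      # half length of the interval
  covered[r] <- abs(mean(x) - mu) <= half
}
coverage <- mean(covered)          # share of intervals containing the true mean
```

For skewed data and moderate n the observed coverage tends to sit somewhat below the nominal 95%, because the normal approximation is only asymptotic.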
The Effect of Sample Size
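- The magnitude of the estimation error, measured by the half length of the CI, is $\varphi_\alpha\, \hat\sigma / \sqrt{n}$, where $\varphi_\alpha$ is the normal coefficient for the chosen confidence level and $\hat\sigma$ is the sample standard deviation.
- In order to have the estimation error ≈ ε, we need the sample size $n \approx (\varphi_\alpha\, \hat\sigma / \varepsilon)^2$.
Intuitively, to improve the estimation accuracy by 10 times, we need to enlarge the sample size by 100 times.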
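The square-law relation between accuracy and sample size is easy to tabulate. In this sketch the standard deviation estimate `sigma_hat` and the helper `n_for` are hypothetical; the function returns the smallest sample size whose 95% half length is at most `eps`.

```r
z <- qnorm(0.975)      # normal coefficient for 95% confidence
sigma_hat <- 2         # hypothetical standard deviation estimate (assumption)

# Smallest n with half length z * sigma_hat / sqrt(n) no larger than eps.
n_for <- function(eps) ceiling((z * sigma_hat / eps)^2)

n_coarse <- n_for(0.1)    # target half length 0.1
n_fine   <- n_for(0.01)   # ten times more accurate
ratio    <- n_fine / n_coarse
```

The ratio comes out at about 100, matching the rule that a tenfold gain in accuracy costs a hundredfold increase in data.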
CI for Small Samples
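- Theorem: (CI of t-distribution)
If $X_1, \dots, X_n$ are i.i.d. samples of a normal distribution $N(\mu, \sigma^2)$, then $\dfrac{\bar X_n - \mu}{S_n / \sqrt{n}} \sim t(n - 1)$, a t-distribution with $n - 1$ degrees of freedom, where $S_n$ is the sample standard deviation.
- Remark:
- The t-distribution is more dispersed than the normal.
- When n → ∞, t(n − 1) ⇒ N(0, 1).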
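Both remarks can be verified numerically. The sketch below uses a hypothetical small normal sample; it compares the t and normal coefficients, builds the t-based interval by hand, and checks it against the interval reported by base R's `t.test`.

```r
set.seed(6)
x <- rnorm(8, mean = 5, sd = 2)   # hypothetical small normal sample (assumption)
n <- length(x)

t_coef <- qt(0.975, df = n - 1)   # t coefficient, wider than the normal one
z_coef <- qnorm(0.975)

# t-based 95% interval built by hand, and the same interval from t.test.
ci_t       <- mean(x) + c(-1, 1) * t_coef * sd(x) / sqrt(n)
ci_builtin <- as.numeric(t.test(x, conf.level = 0.95)$conf.int)

# With many degrees of freedom the t coefficient approaches the normal one.
gap_large_n <- qt(0.975, df = 1e4) - z_coef
```

For n = 8 the t coefficient is noticeably larger than the normal one, so the small-sample interval is wider; the gap vanishes as the degrees of freedom grow.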
Bootstrap
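A minimal sketch of a bootstrap confidence interval for the mean, assuming the percentile method (one of several bootstrap constructions; the notes do not say which variant the course uses). The data, the number of resamples `B`, and the variable names are hypothetical.

```r
set.seed(7)
x <- rexp(40, rate = 0.5)   # hypothetical skewed sample (assumption)
B <- 5000                   # number of bootstrap resamples (assumption)

# Resample the data with replacement and recompute the mean each time.
boot_means <- replicate(B, mean(sample(x, replace = TRUE)))

# Percentile interval: the middle 95% of the bootstrap means.
ci_boot <- quantile(boot_means, c(0.025, 0.975))
```

The bootstrap needs no normality assumption, which is why it is useful when the sample is small and the t-based theorem does not apply.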
9. Significance Test
- Formulation of general hypothesis test
- Parameter space
- Hypothesis / Alternative
- Hypothesis testing
- Significance test
- 5 steps
- What is the intuition
- How to choose the hypothesis and alternative
- How to interpret the p-value
- Type I and II errors
Steps of a Significance Test
- Assumptions: underlying probability model for population
- Hypothesis: state the claim or prediction of your research problem as a statement about the population parameter.
- Test Statistic: the test statistic measures how “far” the point estimate of the parameter is from its null hypothesis value(s), assuming the null hypothesis is true.
- P-Value: the tail probability beyond the observed value of the test statistic, presuming the null hypothesis is true. It measures how improbable the observed outcome is under H0.
- Conclusion: report and interpret the p-value in the context of the study. Make a decision about H0 based on the p-value.
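The steps above can be traced on a one-sample t-test. This is only a sketch: the normal sample, the null value `mu0`, and the two-sided alternative are assumptions chosen for illustration. The statistic and p-value are computed by hand and compared with base R's `t.test`.

```r
set.seed(8)
x <- rnorm(25, mean = 0.6, sd = 1)   # hypothetical sample; normal model assumed
mu0 <- 0                              # H0: mu = 0 against Ha: mu != 0

# Test statistic: distance of the estimate from mu0 in standard-error units.
t_stat <- (mean(x) - mu0) / (sd(x) / sqrt(length(x)))

# P-value: two-sided tail probability under H0, t with n - 1 degrees of freedom.
p_value <- 2 * pt(-abs(t_stat), df = length(x) - 1)

# The built-in test performs the same computation.
builtin <- t.test(x, mu = mu0)
```

A small p-value says the observed statistic would be unusual if H0 were true; it is not the probability that H0 is true.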
Type I & Type II errors & Interpreting P-Value
Inference on Single Variables
Population proportion
- z-test
- Difference from CI
- Small sample: binomial test
Population mean
- t-test
- Relation with CI
- Small sample: bootstrap
Inference on Two Variables
- Independent samples
- Population proportion: z-test
- Population mean: t-test
- Small sample: permutation test
- Paired data: t-test for single variable
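- Standard error of z:
- Standard error of u:
The derivations are generally not required; the standard errors are given directly.
- Conclude CI:
Given our estimate of the standard error of the estimated mean or proportion difference, we can construct the confidence interval for the mean or proportion difference: $\hat\Delta \pm \varphi_\alpha \cdot \mathrm{se}(\hat\Delta)$, where $\hat\Delta$ is the estimated difference.
The coefficient $\varphi_\alpha$ is determined by α and the model assumptions (normal distribution for proportions, t distribution for means).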
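The construction above can be sketched for both cases. The standard-error formulas used here are the common unpooled ones, which is an assumption since the notes do not show their own formulas; the samples and counts are hypothetical. For the means, the result is checked against the Welch interval from base R's `t.test`.

```r
set.seed(9)
a <- rnorm(30, mean = 5.0, sd = 1.0)   # hypothetical independent samples
b <- rnorm(40, mean = 4.5, sd = 1.5)

# Difference in means: unpooled (Welch) standard error, t coefficient.
se_mean <- sqrt(var(a) / length(a) + var(b) / length(b))
welch   <- t.test(a, b)                            # unequal variances by default
coef_t  <- qt(0.975, df = unname(welch$parameter)) # Welch degrees of freedom
ci_mean <- (mean(a) - mean(b)) + c(-1, 1) * coef_t * se_mean

# Difference in proportions: unpooled standard error, normal coefficient.
p1 <- 45 / 100; p2 <- 30 / 100                     # hypothetical counts
se_prop <- sqrt(p1 * (1 - p1) / 100 + p2 * (1 - p2) / 100)
ci_prop <- (p1 - p2) + c(-1, 1) * qnorm(0.975) * se_prop
```

If an interval for the difference excludes 0, the corresponding two-sided test at the same level rejects equality, which is the link between the CI and the test.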
Permutation Test
Tests whether the two populations follow the same distribution.
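A minimal sketch of a permutation test, assuming the difference in sample means as the test statistic (other statistics are possible) and hypothetical data. Under the null hypothesis that both samples come from the same distribution, the group labels are exchangeable, so the observed difference is compared with differences obtained after random relabelling.

```r
set.seed(10)
a <- rnorm(12, mean = 0)      # hypothetical small samples (assumption)
b <- rnorm(15, mean = 1.5)

observed <- mean(a) - mean(b)
pooled <- c(a, b)
B <- 5000                     # number of random relabellings (assumption)

perm_diff <- replicate(B, {
  idx <- sample(length(pooled), length(a))   # pick a random "group a"
  mean(pooled[idx]) - mean(pooled[-idx])
})

# Two-sided p-value: share of relabellings at least as extreme as observed.
p_perm <- mean(abs(perm_diff) >= abs(observed))
```

No distributional form is assumed, which is why the outline lists this test for small samples.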
Paired data
10. Multiple Regression
- Assumptions
- Interpretation of estimation results
- Inference methods:
- t-test for single coefficient
- F-test for nested models
- Residual analysis
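Assumptions (linear regression model)
$Y = \beta_0 + \beta_1 f_1(x) + \dots + \beta_p f_p(x) + \varepsilon$, where the functions $f_1, \dots, f_p$ are known. Besides, we assume the following conditions on $\varepsilon$:
- Independence: the errors $\varepsilon_1, \dots, \varepsilon_n$ are independent.
- Zero mean: $E[\varepsilon \mid x] = 0$ for all possible values of $x$.
- Equal variance: $\mathrm{Var}(\varepsilon \mid x) = \sigma^2$.
- Normality: the errors $\varepsilon$ are normal conditional on $x$.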
T-test & F-test
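Both tests are available from base R's `lm`, `summary`, and `anova`. The sketch below uses simulated data with hypothetical predictors `x1`, `x2`, `x3`, where `x3` has no effect by construction. The coefficient table gives a t-test for each single coefficient, and `anova` on two nested models gives the F-test that the extra coefficients are all zero.

```r
set.seed(11)
n <- 100
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)   # hypothetical predictors
y <- 1 + 2 * x1 - 1 * x2 + rnorm(n)              # x3 has no effect (assumption)

full    <- lm(y ~ x1 + x2 + x3)
reduced <- lm(y ~ x1)                            # nested in the full model

# t-test per coefficient: estimate, standard error, t value, p-value.
coef_table <- summary(full)$coefficients

# F-test of H0: the coefficients of x2 and x3 are both zero.
f_test <- anova(reduced, full)
p_f <- f_test[["Pr(>F)"]][2]
```

Here the F-test rejects because `x2` matters, even though the t-test for `x3` alone would typically not reject; the F-test assesses the dropped terms jointly.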
Residual analysis
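- DW test (Durbin-Watson): checks independence; the null hypothesis is that the residuals are independent (uncorrelated).
- JB test (Jarque-Bera): checks normality; the null hypothesis is that the residuals are normally distributed.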
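To show what the two residual checks measure, the sketch below computes both statistics by hand in base R on simulated data that satisfies the assumptions. The data and variable names are hypothetical, and the Jarque-Bera p-value uses the asymptotic chi-square distribution with 2 degrees of freedom. Dedicated packages offer ready-made versions of these tests; this block only illustrates the formulas.

```r
set.seed(12)
x <- rnorm(200)
y <- 1 + 2 * x + rnorm(200)      # hypothetical data meeting the assumptions
e <- residuals(lm(y ~ x))
n <- length(e)

# Durbin-Watson statistic: near 2 when successive residuals are uncorrelated.
dw <- sum(diff(e)^2) / sum(e^2)

# Jarque-Bera statistic from the sample skewness and kurtosis of the residuals.
m2   <- mean((e - mean(e))^2)
skew <- mean((e - mean(e))^3) / m2^1.5
kurt <- mean((e - mean(e))^4) / m2^2
jb   <- n / 6 * (skew^2 + (kurt - 3)^2 / 4)
p_jb <- pchisq(jb, df = 2, lower.tail = FALSE)   # asymptotic p-value
```

A Durbin-Watson value well below 2 suggests positive autocorrelation, and a small Jarque-Bera p-value suggests the residuals are not normal.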
Assumptions (logistic regression)
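# Assumes `data` is a data frame with columns `prob` (predicted probability)
# and `obs` (observed label, 0 or 1).
n <- nrow(data)
tpr <- fpr <- rep(0, n)
# Compute TPR and FPR, using each predicted probability as the threshold
for (i in 1:n) {
  threshold <- data$prob[i]
  tp <- sum(data$prob > threshold & data$obs == 1)   # true positives
  fp <- sum(data$prob > threshold & data$obs == 0)   # false positives
  fn <- sum(data$prob <= threshold & data$obs == 1)  # false negatives
  tn <- sum(data$prob <= threshold & data$obs == 0)  # true negatives
  tpr[i] <- tp / (tp + fn)  # true positive rate
  fpr[i] <- fp / (fp + tn)  # false positive rate
}
# Plot the ROC curve; sort the points so the line is drawn from left to right
ord <- order(fpr, tpr)
plot(fpr[ord], tpr[ord], type = 'l', ylim = c(0, 1), xlim = c(0, 1),
     xlab = 'FPR', ylab = 'TPR', main = 'ROC')
abline(a = 0, b = 1)  # diagonal: a classifier no better than chance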