Statistics Summary

7. Parameter Estimation

Model and parameters
Properties of good estimators
- Unbiasedness, consistency
- UMVUE, efficiency
MLE
Bayesian Estimation
- why?
- Prior and Posterior
- Conjugate distribution
- Limitations

Reason: statistical estimation is not a general estimation problem; the data are assumed to be i.i.d. samples from a parametric model.

Formulation:
$X_1, X_2, \ldots, X_n \overset{\text{i.i.d.}}{\sim} f(x ; \theta)$, where $\theta \in E$ is unknown.
Estimator: $\hat \phi = \phi(X)$, with $\phi: \mathbb{R}^{n} \rightarrow E$.

Properties of Good Estimators:

Correctness:

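Unbiasedness: the expectation of the estimator's sampling distribution equals the population parameter being estimated.
$E[\phi(X)]=\theta \text { for } X \sim f(x ; \theta)$

Consistency: as the sample size grows, the estimator converges to the population parameter being estimated.
$\phi(X) \rightarrow \theta \text { in probability as } n \rightarrow \infty, \text { for } X \sim f(x ; \theta)$
Example:
$\begin{aligned} s^{2} &=\frac{1}{n-1} \sum_{i=1}^{n}\left(X_{i}-\overline{X}\right)^{2} \quad \text{(unbiased)}\\ \hat{\sigma}^{2} &=\frac{1}{n} \sum_{i=1}^{n}\left(X_{i}-\overline{X}\right)^{2} \quad \text{(consistent, but biased)}\end{aligned}$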
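
The difference between the two variance estimators can be checked by simulation. The following is a minimal sketch, assuming normal data with a small sample size; the values of `n`, `sigma2`, and the number of replications are illustrative choices, not part of the notes.

```r
# Compare the average of s^2 (divisor n - 1) and sigma_hat^2 (divisor n)
# over many simulated samples. All numeric settings are assumptions.
set.seed(123)
n <- 5          # small sample size, so the bias is visible
sigma2 <- 4     # true variance

sims <- replicate(20000, {
  x <- rnorm(n, mean = 0, sd = sqrt(sigma2))
  c(s2 = var(x),                        # var() uses the n - 1 divisor
    sigma2_hat = var(x) * (n - 1) / n)  # rescaled to the n divisor
})

avg <- rowMeans(sims)
print(avg)  # s2 should be near sigma2; sigma2_hat near (n - 1) / n * sigma2
```

As `n` grows, the factor `(n - 1) / n` tends to 1, which is why the biased estimator is still consistent.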
Accuracy:

Efficiency:
UMVUE (uniformly minimum-variance unbiased estimator) is a very restrictive requirement. Efficiency is a weaker condition.

Maximum Likelihood Estimation

Why?
MLE is a framework for designing consistent and efficient estimators under very general conditions.

Formulation

The likelihood function:
$L(X ; \theta)=\prod_{i=1}^{n} f\left(X_{i} ; \theta\right)$, where $X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} f(x ; \theta)$ and $\theta \in E$ is unknown.
MLE: for given data samples $X = x$,
$\hat\theta=\arg\max _{\theta \in E} L(x ; \theta)$, i.e. $L(x ; \hat{\theta})=\max _{\theta \in E} L(x ; \theta)$
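
As an illustration of the formulation above, the likelihood can be maximized numerically. This is a minimal sketch, assuming an exponential model with an unknown rate (a choice made only for this example), where the closed-form MLE `1 / mean(x)` is available for comparison.

```r
# Numerical MLE for the rate of an exponential distribution.
# The model, sample size, and search interval are illustrative assumptions.
set.seed(1)
x <- rexp(200, rate = 2)

# Negative log-likelihood: minimizing it is the same as maximizing L(x; theta)
negloglik <- function(rate) -sum(dexp(x, rate = rate, log = TRUE))

fit <- optimize(negloglik, interval = c(1e-6, 100))
theta_hat <- fit$minimum      # numerical maximizer of the likelihood
closed_form <- 1 / mean(x)    # known closed-form MLE for this model

print(c(numerical = theta_hat, closed_form = closed_form))
```

Working with the log-likelihood rather than the product itself avoids numerical underflow when `n` is large.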

Limitations:

Solving for the MLE, even numerically, can be very challenging.
MLE does not guarantee good performance in finite samples.

Bayesian Estimation

With Bayesian estimation, we can easily update the estimator as samples are collected sequentially.

Formulation:

$\theta$ is treated as a random variable taking values in $E$.
$f_{0}(\theta)$ is the prior of $\theta$.
$f_{1}(\theta)$ is called the posterior; it gives the distribution of $\theta$ conditional on the data:
$f_{1}(\theta)= f(\theta \mid X = x)=\frac{L(x ; \theta) f_{0}(\theta)}{\int_{E} L(x ; u) f_{0}(u) d u}$

Sequential Bayesian Estimation
Intuitively, if more data $x' = (x_{n+1}, \ldots, x_{n+m})$ become available, we can take the previous posterior $f_1$ as the new prior and update the belief again using the new data only:

$f_{2}(\theta)=\frac{L(x' ; \theta) f_{1}(\theta)}{\int_{E} L(x' ; u) f_{1}(u) d u}$
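
The sequential update can be sketched with a conjugate pair, where the posterior stays in the same family as the prior. The example below assumes Bernoulli data with a Beta prior; the helper `update_beta` and all numeric values are hypothetical and chosen only for illustration.

```r
# Sequential Bayesian update with a Beta prior and Bernoulli data (assumed model).
# A Beta(a, b) prior combined with Bernoulli observations gives a
# Beta(a + successes, b + failures) posterior.
update_beta <- function(prior, data) {
  c(a = prior[["a"]] + sum(data),
    b = prior[["b"]] + sum(1 - data))
}

set.seed(42)
x <- rbinom(30, size = 1, prob = 0.3)

f0 <- c(a = 1, b = 1)               # flat prior on [0, 1]
f1 <- update_beta(f0, x[1:20])      # posterior after the first 20 samples
f2 <- update_beta(f1, x[21:30])     # f1 reused as prior for the new data only
f_batch <- update_beta(f0, x)       # single update using all 30 samples

print(rbind(sequential = f2, batch = f_batch))
```

The two rows agree, which reflects that updating in stages and updating once with all the data lead to the same posterior.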

Limitations:

The estimate depends on the prior, which can be any distribution on E. A very strong prior can lead to an inconsistent estimator.
- In the information-based trade example, what happens if we pick $p_0 = 1$?
On the other hand, a weak prior can lead to slow convergence.
The computation of the posterior can be very costly when the parameter space E is large.

8. Confidence Interval

Three constructions of CI for i.i.d samples:
- normal
- t
- bootstrap
When and how?

Central Limit Theorem

Theorem: Let $\{X_i\}$ be a sequence of i.i.d. samples of $X$ with $E[X] = \mu$ and $Var(X) = \sigma^2$. Then,
$\frac{\sqrt{n}}{\sigma}\left(\overline{X}_{n}-\mu\right) \Rightarrow N(0,1)$
Therefore, when n is “large”, for any $a > 0$
$P\left(\left|\frac{\sqrt{n}}{\sigma}\left(\overline{X}_{n}-\mu\right)\right|>a\right) \approx P(|Z|>a)$
where Z is a standard normal r.v.

Confidence Interval (z-distribution)

For any confidence level $a$ (e.g. $a = 0.95$), we choose $\phi$ such that
$P(|Z|>\phi)=1-a$ ; then the level-$a$ confidence interval is
$\left[\overline{X}_{n}-\phi \frac{\sigma}{\sqrt{n}}, \overline{X}_{n}+\phi \frac{\sigma}{\sqrt{n}}\right]$
A 95% CI means: if we repeated the sampling 100 times, about 95 of the resulting intervals would contain the true value and about 5 would not.
The standard error (s.e.) of the sample mean is $\sigma_{\overline{x}}=\sigma / \sqrt{n}$
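
The repeated-sampling interpretation can be checked by simulation. This is a minimal sketch, assuming normal data with a known $\sigma$; the helper `z_ci` and the values of `mu`, `sigma`, the sample size, and the number of replications are hypothetical.

```r
# z-based confidence interval with known sigma, plus a coverage check.
z_ci <- function(x, sigma, level = 0.95) {
  phi <- qnorm(1 - (1 - level) / 2)       # P(|Z| > phi) = 1 - level
  half <- phi * sigma / sqrt(length(x))   # half-length of the interval
  c(lower = mean(x) - half, upper = mean(x) + half)
}

set.seed(7)
mu <- 5
sigma <- 2

# Repeat the sampling many times and record whether the CI contains mu
covered <- replicate(2000, {
  ci <- z_ci(rnorm(50, mean = mu, sd = sigma), sigma)
  ci[["lower"]] <= mu && mu <= ci[["upper"]]
})
coverage <- mean(covered)
print(coverage)  # should be close to the nominal level
```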

The Effect of Sample Size

The magnitude of the estimation error, measured by the half-length of the CI, is
$\phi \frac{\sigma}{\sqrt{n}}$
In order to have the estimation error ≈ ε, we need the sample size
$n \approx \frac{\phi^{2} \sigma^{2}}{\varepsilon^{2}}$
Intuitively, to improve the estimation accuracy by a factor of 10, we need to enlarge the sample size by a factor of 100.
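
The sample-size formula can be evaluated directly. A minimal sketch follows, where the helper `required_n` and the values of `sigma` and `eps` are hypothetical.

```r
# Sample size needed so that the CI half-length is about eps (known sigma).
required_n <- function(sigma, eps, level = 0.95) {
  phi <- qnorm(1 - (1 - level) / 2)
  ceiling(phi^2 * sigma^2 / eps^2)   # round up to a whole number of samples
}

n_coarse <- required_n(sigma = 2, eps = 0.1)
n_fine <- required_n(sigma = 2, eps = 0.01)   # 10 times more accurate

print(c(n_coarse = n_coarse, n_fine = n_fine, ratio = n_fine / n_coarse))
```

The ratio is close to 100, matching the $1/\varepsilon^2$ scaling.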

CI for Small Samples

Theorem: (CI of t-distribution)
If $X_1, X_2, \ldots, X_n$ are i.i.d. samples of a normal distribution $N(\mu, \sigma^2)$ and $s$ is the sample standard deviation, then
$\frac{\sqrt{n}}{s}\left(\overline{X}_{n}-\mu\right) \sim t(n-1)$ , a t-distribution with $n - 1$ degrees of freedom.
Remark:
- The t-distribution is more dispersed (heavier-tailed) than the normal.
- When n → ∞, t(n − 1) ⇒ N (0, 1).
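
The theorem gives a CI of the same form as the z-interval, with $\sigma$ replaced by $s$ and the normal quantile replaced by a t quantile. The following is a minimal sketch on simulated data; the helper `t_ci` and the sample itself are hypothetical, and the result is compared against R's built-in `t.test`.

```r
# t-based confidence interval for the mean of a small normal sample.
t_ci <- function(x, level = 0.95) {
  n <- length(x)
  phi <- qt(1 - (1 - level) / 2, df = n - 1)  # t quantile with n - 1 df
  half <- phi * sd(x) / sqrt(n)
  c(lower = mean(x) - half, upper = mean(x) + half)
}

set.seed(3)
x <- rnorm(10, mean = 1, sd = 2)

ci <- t_ci(x)
ref <- t.test(x, conf.level = 0.95)$conf.int  # built-in reference

print(rbind(manual = ci, builtin = as.numeric(ref)))
print(c(t_quantile = qt(0.975, df = 9), z_quantile = qnorm(0.975)))
```

The t quantile exceeds the normal quantile, so the t-interval is wider for the same data, reflecting the extra uncertainty from estimating $\sigma$.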

Bootstrap
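
A minimal sketch of one common bootstrap construction, the percentile interval for the mean. The helper `boot_ci`, the number of resamples `B`, and the simulated skewed sample are assumptions made for illustration; other bootstrap variants exist.

```r
# Percentile bootstrap CI for the mean: resample the data with replacement,
# recompute the mean each time, and take empirical quantiles.
boot_ci <- function(x, B = 2000, level = 0.95) {
  boot_means <- replicate(B, mean(sample(x, replace = TRUE)))
  alpha <- 1 - level
  quantile(boot_means, probs = c(alpha / 2, 1 - alpha / 2), names = FALSE)
}

set.seed(21)
x <- rexp(15, rate = 1)   # small, non-normal sample

ci <- boot_ci(x)
print(c(lower = ci[1], mean = mean(x), upper = ci[2]))
```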

9. Significance Test

Formulation of general hypothesis test
- Parameter space
- Hypothesis / Alternative
- Hypothesis testing
Significance test
- 5 steps
- What is the intuition
- How to choose the hypothesis and alternative
- How to interpret the p-value
- Type I and II errors

Steps of a Significance Test

Assumptions: the underlying probability model for the population.
Hypothesis: Formulate the statement or prediction in your research problem as a statement about the population parameter.
Test Statistic: the test statistic measures how “far” the point estimate of the parameter is from its null hypothesis value(s), assuming the null hypothesis is true.
P-Value: the tail probability beyond the observed value of the test statistic, if we presume the null hypothesis is true. It measures how unlikely the observed outcome would be under the null hypothesis.
Conclusion: Report and interpret the p-value in the context of the study. Make a decision about H0 based on the p-value.
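
The five steps can be traced in a one-sample t-test. This is a minimal sketch on simulated data; the null value `mu0`, the sample, and the two-sided alternative are assumptions made for the example, and the result is compared against R's built-in `t.test`.

```r
# Step 1 (assumptions): i.i.d. samples from a normal population.
set.seed(11)
x <- rnorm(25, mean = 0.4, sd = 1)
n <- length(x)

# Step 2 (hypothesis): H0: mu = mu0 against Ha: mu != mu0.
mu0 <- 0

# Step 3 (test statistic): distance of the sample mean from mu0 in s.e. units.
t_stat <- (mean(x) - mu0) / (sd(x) / sqrt(n))

# Step 4 (p-value): two-sided tail probability under H0, from t(n - 1).
p_value <- 2 * pt(-abs(t_stat), df = n - 1)

# Step 5 (conclusion): report the p-value; compare with the built-in test.
ref <- t.test(x, mu = mu0)
print(c(t_stat = t_stat, p_value = p_value, builtin_p = ref$p.value))
```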

Type I & Type II errors & Interpreting P-Value

Inference on Single Variables

Population proportion

z-test
Difference from CI
Small sample: binomial test

Population mean

t-test
Relation with CI
Small sample: bootstrap

Inference on Two Variables

Independent samples
- Population proportion: z-test
- Population mean: t-test
- Small sample: permutation test
Paired data: t-test for a single variable (applied to the paired differences)

z statistic for the difference of two proportions (the denominator is the standard error):
$z=\frac{\left(p_{1}-p_{2}\right)-\left(\pi_{1}-\pi_{2}\right)}{\sqrt{\frac{p_{1}\left(1-p_{1}\right)}{n_{1}}+\frac{p_{2}\left(1-p_{2}\right)}{n_{2}}}}$
Standard error for the difference of means:
Generally not required to derive; it is given directly.
Constructing the CI:
Given our estimate of the standard error for the estimated mean or proportion difference, we can construct the confidence interval for the mean or proportion difference:
$\left[(\overline{x}-\overline{y})-\phi_{\alpha} s e,(\overline{x}-\overline{y})+\phi_{\alpha} s e\right]$
The coefficient $\phi_{\alpha}$ is determined by $\alpha$ and the model assumptions (normal distribution for proportions, t distribution for means).
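
The z statistic and the CI for a difference of proportions can be computed as follows. This is a minimal sketch; the helper `two_prop` and the counts passed to it are hypothetical, and the z statistic is evaluated under the null value $\pi_1 - \pi_2 = 0$.

```r
# Two-sample comparison of proportions: standard error, z statistic, and CI.
two_prop <- function(x1, n1, x2, n2, level = 0.95) {
  p1 <- x1 / n1
  p2 <- x2 / n2
  se <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
  z <- (p1 - p2) / se                   # null value pi1 - pi2 = 0
  phi <- qnorm(1 - (1 - level) / 2)     # normal coefficient for proportions
  list(z = z, se = se,
       ci = c(lower = (p1 - p2) - phi * se, upper = (p1 - p2) + phi * se))
}

# Hypothetical counts: 45 of 100 successes versus 30 of 100
res <- two_prop(45, 100, 30, 100)
print(res)
```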

Permutation Test

Tests whether the two populations follow the same distribution.
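
A minimal sketch of a permutation test, assuming the difference in sample means as the test statistic (other statistics are possible). The helper `perm_test`, the number of permutations `B`, and the simulated samples are hypothetical.

```r
# Permutation test: under H0 the group labels are exchangeable, so shuffle
# them many times and see how extreme the observed difference is.
perm_test <- function(x, y, B = 5000) {
  observed <- mean(x) - mean(y)
  pooled <- c(x, y)
  n_x <- length(x)
  perm_diffs <- replicate(B, {
    idx <- sample(length(pooled), n_x)        # random relabelling
    mean(pooled[idx]) - mean(pooled[-idx])
  })
  # two-sided p-value: share of permuted differences at least as extreme
  list(observed = observed,
       p_value = mean(abs(perm_diffs) >= abs(observed)))
}

set.seed(5)
x <- rnorm(12, mean = 0)
y <- rnorm(12, mean = 3)
res <- perm_test(x, y)
print(res)
```

Because the reference distribution is built from the data itself, no normality assumption is needed, which is why the test suits small samples.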

Paired data

10. Multiple Regression

Assumptions
Interpretation of estimation results
Inference methods:
- t-test for single coefficient
- F-test for nested models
Residual analysis

Assumptions (linear regression model)

$y_{i}=\beta_{0}+\sum_{k=1}^{p} \beta_{k} g_{k}\left(x_{i k}\right)+\varepsilon_{i}$

where the functions $g_k$ are known. Besides, we assume the following conditions on $ε_i$ :

Independence: $\varepsilon_i$ are independent.
Zero mean: $E[\varepsilon \mid x] = 0$ for all possible values of $x = (x_1, \ldots, x_p)$ .
Equal variance: $Var(\varepsilon \mid x) = \sigma^2$ .
Normality: $\varepsilon_i$ are normal conditional on $x$.
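
A model of this form can be fitted with `lm`, with the known functions $g_k$ written directly in the formula. The following is a minimal sketch on simulated data; the variables `x1`, `x2`, the choice $g_1 = \log$, and the true coefficients are assumptions made for the example. It also shows the t-test for a single coefficient and the F-test for nested models.

```r
# Simulated data satisfying the assumptions: independent normal errors
# with zero mean and equal variance.
set.seed(2)
n <- 200
x1 <- runif(n, min = 1, max = 10)
x2 <- rnorm(n)
y <- 1 + 2 * log(x1) + 0 * x2 + rnorm(n, sd = 0.5)  # x2 has no true effect

full <- lm(y ~ log(x1) + x2)     # g_1 = log, g_2 = identity
reduced <- lm(y ~ log(x1))       # nested model without x2

# t-test for each single coefficient
coef_table <- summary(full)$coefficients
print(coef_table)

# F-test comparing the nested models
f_test <- anova(reduced, full)
print(f_test)
```

When the nested models differ by one coefficient, the F statistic equals the square of that coefficient's t statistic, so the two tests give the same p-value.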

T-test & F-test

Residual analysis

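DW-test (Durbin-Watson): tests independence; the null hypothesis is that the residuals are uncorrelated.
JB-test (Jarque-Bera): tests normality; the null hypothesis is that the residuals are normally distributed.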
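
Both statistics can be computed by hand from the residuals. This is a minimal sketch using base R only, on simulated data that satisfies the assumptions; the data and variable names are hypothetical. Packaged implementations exist in add-on libraries, but none is assumed here, and only the Jarque-Bera p-value (from its asymptotic chi-square distribution) is computed.

```r
set.seed(9)
x <- rnorm(300)
y <- 1 + 2 * x + rnorm(300)
e <- residuals(lm(y ~ x))
n <- length(e)

# Durbin-Watson statistic: values near 2 suggest no first-order autocorrelation
dw <- sum(diff(e)^2) / sum(e^2)

# Jarque-Bera statistic from sample skewness and kurtosis of the residuals
m2 <- mean((e - mean(e))^2)
skew <- mean((e - mean(e))^3) / m2^1.5
kurt <- mean((e - mean(e))^4) / m2^2
jb <- n / 6 * (skew^2 + (kurt - 3)^2 / 4)
jb_p <- pchisq(jb, df = 2, lower.tail = FALSE)  # asymptotic chi-square(2)

print(c(dw = dw, jb = jb, jb_p = jb_p))
```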

Assumptions(logistic regression)

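# `data` is assumed to have columns prob (predicted probability) and obs (0/1 label)
n <- nrow(data)
tpr <- fpr <- rep(0, n)
# compute TPR and FPR using each predicted probability as the threshold
for (i in 1:n) {
  threshold <- data$prob[i]
  tp <- sum(data$prob > threshold & data$obs == 1)   # true positives
  fp <- sum(data$prob > threshold & data$obs == 0)   # false positives
  fn <- sum(data$prob <= threshold & data$obs == 1)  # false negatives
  tn <- sum(data$prob <= threshold & data$obs == 0)  # true negatives
  tpr[i] <- tp / (tp + fn)  # true positive rate
  fpr[i] <- fp / (fp + tn)  # false positive rate
}
# order the points by decreasing threshold so the curve is drawn as one path
ord <- order(data$prob, decreasing = TRUE)
# plot ROC
plot(fpr[ord], tpr[ord], type = 'l', ylim = c(0, 1), xlim = c(0, 1),
     xlab = 'FPR', ylab = 'TPR', main = 'ROC')
abline(a = 0, b = 1)  # diagonal: performance of a random classifier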