`{r NHANES2006,echo = TRUE} 库(sm)库(“kdensity”)甘油三酯<-read.table(“C:/Users/thamo/OneDrive - 南非大学/UNISA/MODULES/2021-2022/STA4809/ASSIGNMENT 02/NHANES2006.csv”,head = TRUE,sep =“;”)
y <- (甘油三酯[, c("TRG"), drop = FALSE])
kde = kdensity(y, start = "gamma", kernel = "normal") plot(kde, main = "甘油三酯")
The kernel density estimate shows that triglyceride data is negatively skewed,has
sharper peak and have got a larger spread with a bandwidth of 10.16.
\item[(b)] Parametric density estimate for triglyceride.\\
```{r NHANES, echo = TRUE}
kde = kdensity((y), start = "gamma", kernel = "normal")
plot(kde, lwd = 2, col = "red",
main = "Normal distribution parametric density estimates")
lines(kde, plot_start = TRUE, lty = 2, lwd = 2)
带宽为 10.16 的参数密度估计(虚线)的峰值小于核密度估计,右尾的扩散也较大。 \end{enumerate}
\textbf{问题 2}
library(sm)
library(rpanel)
library(tkrplot)
par(mfrow=c(1,2))
n1 <- 50
mu1 <- 3
sigma1 <- 0.5
x1 <- rnorm(n1, mean = mu1, sd = sigma1)
sm.density(x1, model = "Normal", ylim = c(0,2),
xlab = "Normal(mean = 3, sd = 0.5) for n = 50")
n2 <- 200
mu2 <- 5
sigma2 <- 0.6
x2 <- rnorm(n2, mean = mu2, sd = sigma2)
sm.density(x2, model = "Normal", ylim = c(0,1.5),
xlab = "Normal(mean = 5, sd = 0.6) for n = 200")
n3 <- 100
mu3 <- 10
sigma3 <- 4
x3 <- rnorm(n3, mean = mu3, sd = sigma3)
sm.density(x3, model = "Normal", ylim = c(0,0.12),
xlab = "Normal(mean = 10, sd = 4) for n = 100")
从上图可以看出,给定平均值和标准差的密度估计值因样本大小而异。
\textbf{问题 3} \begin{enumerate} \item[(a)]
$$ \begin{aligned} \text{偏差 KDE 由下式给出。}\ &E[\hat f_n(y) - f(y)] = E(\frac{1}{nh}\sum^n_{i=1} W(\frac{Y_i-y}{h})) - f(y)\ &=\frac{1}{h}E(W(\frac{Y_i-y}{h})) - f(y)\ &=\frac{1}{h}\int W(\frac{Yy}{h})f(y)dy-f(y)\ \text{经过一些变量变化之后}\ &=\frac{1}{h}\int W(\frac{Yy}{h})f(y)\frac{dy}{h}-f(y)\ &=\int W(y)f(y + hy)dy-f(y)\ \text{当 h 较小时使用泰勒展开式}\ &f(y+h(y) = f(y)-h(y).f\prime(y)+\frac{1}{2} h^2y^2f\prime\prime(y)+o(h^2))\ &=\frac{1}{2}h^2f\prime\prime(y)\int y^2W(y)dy-f(Y)+o(h^2)\ &=\frac{1}{2}h^2f\prime\prime(y)\sigma^2_w+o(h^2)\ \text{其中}\ &\sigma^2_w = \int y^2W(y) dy\ \text{因此 KDE 的偏差为} &=\frac{1}{2}h^2f\prime\prime(y)\sigma^2_w+o(h^2) \end{aligned} $$
\item[(b)]
$$ \begin{aligned} \text{从上面的 (a),我们得到泰勒展开式}\ &f(y+h(y) = f(y)-h(y).f\prime(y)+\frac{1}{2} h^2y^2f\prime\prime(y)+o(h^2))\ \text{使用 Var 我们有}\ &{\operatorname{Var}}(\hat f_n(y)= {\operatorname{Var}}(\frac{1}{nh}\sum^n_{i=1} W\frac{Y_i-y}{h}))\ &=\frac{1}{nh^2}{\operatorname{Var}}(W(\frac{Y_i-y}{h}))\ &\le\frac{1}{nh^2}E(W^2\frac{(Y_i-y)}{h})\ &=\frac{1}{nh^2}\int W^2\frac{(Y_i-y)}{h}f(y)dy\ &=\frac{1}{nh}\int W^2(y)f(y+hy)dy\ &=\frac{1}{nh}\int W^2(y)[f(y)+hyp\prime(y)+o(h)]dy\ &=\frac{1}{nh}(p(y).\int W^2(y)dy+o(h))\ &=\frac{1}{nh}p(y)\int W^2(y)dy+o(\frac{1}{nh})\ &=\frac{1}{nh}f(y)\alpha^2_w+o(\frac{1}{nh})\ &=\frac{1}{nh}f(y)\alpha_w\ \text{where}\ \alpha2_w=\int w^2(u)du \end{aligned} $$
\item[(c)]
$$ \begin{aligned} \text{从上面的 (a) 和 (b) 可知 KDE 的 MISE 由以下公式给出:}\ &=\text{MISE}(f\prime_n) = \int MSE(f\prime_n(y))dy = E(\int (f\prime_n(y) - f(y))^2)dy\ &=\frac{1}{4}h^4\int \mid f\prime\prime(y)\mid^2dy\ \mu^2_w+\frac{1}{nh}\int f(y)dy\ \alpha^2_w+o(h^4)+o(\frac{1}{nh})\ &=\frac{\sigma^2_w}{4}h^4\int \mid f\prime\prime(y)\mid^2dy +\frac{\alpha^2_w}{nh}+o(h^4)+o(\frac{1}{nh})\ &=\frac{1}{4}h^4\sigma^4_w \int f\prime\prime^2 dy+\frac{1}{nh}\alpha(w)\ \text{其中}\ \sigma(w) = \int \mu^2w(u)du。\end{aligned} $$ \item[(d)]
$$ \text{从上面的 (c) 可知渐近均方误差 (AMSE) 为}\ \begin{aligned} &=\frac{\sigma^2_w}{4}.h^4.\int \mid f\prime\prime(y)\mid^2dy+\frac{\alpha^2_w}{nh}+o(h^4)+o(\frac{1}{nh})\ \text{因此,通过最小化 AMSE 来选择最佳平滑带宽}\ &h_{opt} = \left[ \frac{1}{n}.\frac{4}{\int \mid p(y)\prime\prime\mid^2dy}.\frac{\sigma^2_w}{\alpha^2_w})\right]^{\frac{1}{5}}\ &=\left[\frac{\psi(w)}{\beta(f)n}\right]^{\frac{1}{5}}\ &\text{其中}\ &\psi(w)=\frac{\alpha(w)}{\sigma^4_w} \text{并且}\ \beta(w)=\int f\prime\prime^2dy \end{aligned} $$ \item[(e)]
$$ \text{核密度估计量的近似 (1 -}\alpha) 100%\ \text{置信区间}\ \begin{aligned} &\hat f_n(y)=\frac{1}{nh}\sum^n_{i=1}W(\frac{Y_i-y}{h}) = \frac{1}{n}\sum^n_{i=1}Y_i\ \text{因此,在}\ y_0\ \text{处评估的 KDE 是 Y 的样本均值}\ &\sqrt n \left( \frac{ \hat f_n(y)-E(\hat f_n(y)}{Var\ Y_i}\right)\frac{D}{\rightarrow}N(0,1)\ &Var(Y_i)=Var \left( \frac{1}{h}W\left(\frac{Y_i-y}{h}\right) \right)=\frac{1}{h}f(y)\sigma^2_w+o(\frac{1}{h})\ &\text{当 h → 0 时发散,因此,当 h → 0 时,}\ \hat f_n(y)\ \text{的渐近分布为}\ &\sqrt {nh}(\hat f_n(y))-E(\hat f_n(y))\frac{D}{\rightarrow}N(0,f(y)\sigma^2_w)\ \text{因此,核密度估计量的置信区间可以写成}\ \hat f_n(y) +- Z_{1-\alpha/2}.\sqrt {\hat f_n(y)\sigma^2_w} \end{aligned} $$
\textbf{问题 4} \begin{枚举} \item[(a)]
library(sm)
data(mtcars)
x <- mtcars$disp
hnorm(x)
hcv(x)
hsj(x)
par(mfrow=c(1,1))
sm.density(x, h = hnorm(x), xlab = "Gross horsepower", lty=2)
sm.density(x, h = hcv(x), xlab = "Gross horsepower", add = T)
sm.density(x, h = hsj(x), xlab = "Gross horsepower", add = T, lty = 3)
title(main="Displacement (cu.in.) per Gross horsepower (hp)")
上述密度显示每总马力的排量,其值为正常最优值 = 65.63941、交叉验证 = 38.78734 和 Sheather-Jones = 42.01607。
library(sm)
data(mtcars)
y <- mtcars$hp
hnorm(y)
hsj(y)
par(mfrow=c(1,1))
sm.density(y, h = hnorm(x), xlab = "Gross horsepower", lty=2)
sm.density(y, h = hcv(x), xlab = "Gross horsepower", add = T)
sm.density(y, h = hsj(x), xlab = "Gross horsepower", add = T, lty = 3)
title(main="Displacement (cu.in.) per Gross horsepower (hp)")
该图仅给出了正常最优值 = 36.31171 和 Sheather-Jones = 26.19646 的值,并且该图看起来比位移更平滑。
\item[(b)]