R语言 Bootstrap置信区间

创作时间:

作者:

@小白创作中心

R语言 Bootstrap置信区间

引用

来源

https://geek-docs.com/r-language/r-tutorials/g_bootstrap-confidence-interval-with-r-programming.html

Bootstrap是一种常用的统计推断方法，特别适用于估计置信区间（CI）。本文将详细介绍如何使用R语言中的boot包来计算Bootstrap置信区间，并通过鸢尾花数据集进行演示。

Bootstrap方法简介

Bootstrapping是一种利用样本数据对总体进行推断的统计方法。它可以用来估计置信区间（CI），通过从样本数据中抽取样本并进行替换。Bootstrapping可以用来给各种没有封闭式或复杂解决方案的统计量分配CI。假设我们想用自举重抽法获得95%的置信区间，步骤如下：

从原始样本数据中抽出n个元素，并进行替换。
对每个样本计算所需的统计量，如平均数、中位数等。
重复第1步和第2步m次，并保存计算出的统计量。
将计算出的统计量绘制成引导分布图。
使用所需统计量的引导分布，我们可以计算出95%的CI。

从样本中生成引导分布的插图。

在R中的实现

在R编程中，boot包允许用户轻松生成几乎所有我们可以计算的统计量的bootstrap样本。我们可以从boot包的计算结果中生成偏差估计值、bootstrap置信区间或bootstrap分布图。

出于演示的目的，我们将使用鸢尾花数据集，因为它很简单，而且可以作为R中的一个内置数据集。每个样本都有四个特征：萼片和花瓣的长度和宽度，单位是厘米。我们可以用head命令查看鸢尾花的数据集，并注意到感兴趣的特征。

# View the first row
# of the iris dataset
head(iris, 1)

输出

 Sepal.Length    Sepal.Width    Petal.Length    Petal.Width    Species
     5.1             3.5              1.4             0.2      setosa

我们想估计花瓣长度和花瓣宽度之间的相关性。

在R中计算Bootstrap CI的步骤

导入用于计算bootstrap CI的boot库和用于绘图的ggplot2。

# Import library for bootstrap methods
library(boot)

# Import library for plotting
library(ggplot2)

创建一个函数，计算我们要使用的统计量，如平均数、中位数、相关度等。

# Custom function to find correlation
# between the Petal Length and Width
corr.fun <- function(data, idx)
{
  df <- data[idx, ]

  # Find the spearman correlation between
  # the 3rd and 4th columns of dataset
  c(cor(df[, 3], df[, 4], method = 'spearman'))
}

使用引导函数找到统计量的R引导值。

# Setting the seed for
# reproducability of results
set.seed(42)

# Calling the boot function with the dataset
# our function and no. of rounds
bootstrap <- boot(iris, corr.fun, R = 1000)

# Display the result of boot function
bootstrap

输出

ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = iris, statistic = corr.fun, R = 1000)
Bootstrap Statistics :
     original       bias    std. error
t1* 0.9376668 -0.002717295 0.009436212

我们可以用计算自举分布的绘图命令来绘制生成的自举分布。

# Plot the bootstrap sampling
# distribution using ggplot
plot(bootstrap)

输出

使用boot.ci()函数，得到置信区间。

# Function to find the
# bootstrap Confidence Intervals
boot.ci(boot.out = bootstrap,
        type = c("norm", "basic",
                 "perc", "bca"))

输出

BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates
CALL : 
boot.ci(boot.out = bootstrap, type = c("norm", "basic", "perc", 
    "bca"))
Intervals : 
Level      Normal              Basic         
95%   ( 0.9219,  0.9589 )   ( 0.9235,  0.9611 )  
Level     Percentile            BCa          
95%   ( 0.9142,  0.9519 )   ( 0.9178,  0.9535 )  
Calculations and Intervals on Original Scale

从输出结果中推断出Bootstrap CI。看一下正态法的区间（0.9219, 0.9589），我们可以95%确定花瓣长度和宽度之间的实际相关性在这个区间内有95%的时间。正如我们所看到的，输出由多个CI组成，根据函数boot.ci中的类型参数使用不同的方法。计算出的区间对应于（”Norm”、”Basic”、”Perc”、”BCa”）或Normal、Basic、Percentile和BCa，它们对同一水平的95%给出不同的区间。对任何变量使用的具体方法取决于各种因素，如其分布、同质性、偏差等。