A Complete RNA-seq Analysis Workflow: From Data Preparation to Functional Enrichment Analysis

Author: @小白创作中心

Source

CSDN

https://m.blog.csdn.net/Da_gan/article/details/143466797

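RNA-seq (RNA sequencing) is a powerful tool for studying gene expression and the transcriptome. It uses high-throughput sequencing to read the RNA in a biological sample and so yields quantitative measurements of gene expression. RNA-seq analysis is widely used in disease research, drug development, and functional genomics. This article walks through the complete workflow: data preparation, quality control, trimming, alignment, expression quantification, differential expression analysis with visualization, and functional enrichment analysis, with code for each step.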

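1. Data Preparation

Start with the raw reads delivered by the sequencing provider, usually as FASTQ files (.fastq or .fastq.gz).

File layout:

Paired-end sequencing produces two files per sample, one for each read of the pair: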
sample_1.fastq.gz
sample_2.fastq.gz
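
Before any analysis it is worth confirming that the two files really are in sync, since every paired-end tool downstream assumes that the n-th record of each file belongs to the same fragment. The following is a minimal Python sketch of that check, assuming standard four-line FASTQ records; the function names read_fastq, count_pairs, and count_pairs_in_files are illustrative and not part of any tool used in this article.

```python
import gzip
import io
from itertools import zip_longest


def read_fastq(handle):
    """Yield (read_name, sequence, quality) from an open FASTQ text handle."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file (or a trailing blank line)
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        # Keep only the read name: drop '@', any comment, and a trailing /1 or /2.
        name = header[1:].split()[0]
        if name.endswith(("/1", "/2")):
            name = name[:-2]
        yield name, sequence, quality


def count_pairs(handle_1, handle_2):
    """Count read pairs, raising ValueError if the two files are out of sync."""
    pairs = 0
    for rec_1, rec_2 in zip_longest(read_fastq(handle_1), read_fastq(handle_2)):
        if rec_1 is None or rec_2 is None:
            raise ValueError("files contain different numbers of reads")
        if rec_1[0] != rec_2[0]:
            raise ValueError(f"read name mismatch: {rec_1[0]} vs {rec_2[0]}")
        pairs += 1
    return pairs


def count_pairs_in_files(path_1, path_2):
    """Same check for gzip-compressed files such as sample_1.fastq.gz."""
    with gzip.open(path_1, "rt") as h1, gzip.open(path_2, "rt") as h2:
        return count_pairs(h1, h2)


# Tiny in-memory demonstration with two made-up read pairs.
demo_1 = io.StringIO("@r1/1\nACGT\n+\nIIII\n@r2/1\nGGCC\n+\nIIII\n")
demo_2 = io.StringIO("@r1/2\nTTAA\n+\nIIII\n@r2/2\nCCGG\n+\nIIII\n")
print(count_pairs(demo_1, demo_2))
```

On real data, count_pairs_in_files("sample_1.fastq.gz", "sample_2.fastq.gz") would perform the same check on the files listed above.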

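2. Quality Control (QC)

QC reveals problems such as low-quality bases and adapter contamination so that they can be removed in the trimming step. Common tools: FastQC and MultiQC.

Running FastQC

# Install FastQC
conda install -c bioconda fastqc
# FastQC does not create the output directory, so create it first
mkdir -p ./qc_output/
# Run FastQC
fastqc sample_1.fastq.gz sample_2.fastq.gz -o ./qc_output/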

Summarizing with MultiQC

# Install MultiQC
conda install -c bioconda multiqc
# Run MultiQC on the FastQC output directory
multiqc ./qc_output/ -o ./multiqc_report/
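
The per-base quality plots in these reports are built from the quality string of each FASTQ record. As a rough illustration of what is being measured, the sketch below decodes a Phred+33 quality string (the encoding that the -phred33 option in the trimming step refers to) into scores and error probabilities. The function names are hypothetical, and this is not how FastQC itself is implemented.

```python
def phred_scores(quality, offset=33):
    """Decode a FASTQ quality string into integer Phred scores."""
    return [ord(char) - offset for char in quality]


def error_probability(score):
    """Probability that a base call is wrong: P = 10^(-Q / 10)."""
    return 10 ** (-score / 10)


def mean_quality(quality, offset=33):
    """Average Phred score over one read."""
    scores = phred_scores(quality, offset)
    return sum(scores) / len(scores)


# 'I' encodes Q40, '5' encodes Q20, '#' encodes Q2.
for char in "I5#":
    score = phred_scores(char)[0]
    print(char, score, error_probability(score))
```

A score of 20 therefore corresponds to a 1% chance that the base is wrong, and a score of 30 to 0.1%.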

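3. Removing Low-Quality Bases and Adapter Sequences (Trimming)

Use Trimmomatic or Cutadapt to remove adapter sequences and low-quality bases, and to discard reads that end up too short.

Using Trimmomatic

# Install Trimmomatic
conda install -c bioconda trimmomatic
# Run Trimmomatic in paired-end mode.
# TruSeq3-PE.fa ships in Trimmomatic's adapters directory; give its full path
# if the file is not in the working directory.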
trimmomatic PE -phred33 \
  sample_1.fastq.gz sample_2.fastq.gz \
  sample_1_paired.fastq.gz sample_1_unpaired.fastq.gz \
  sample_2_paired.fastq.gz sample_2_unpaired.fastq.gz \
  ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 \
  LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
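
Trimmomatic applies these steps in the order they are written: LEADING and TRAILING remove bases below quality 3 from the two ends, SLIDINGWINDOW:4:15 cuts the read where the mean quality of a 4-base window drops below 15, and MINLEN:36 discards reads shorter than 36 bases. The sketch below illustrates the idea on a list of Phred scores. It is a simplified illustration under those assumptions, not a reimplementation of Trimmomatic, it leaves out adapter clipping, and its function names are hypothetical.

```python
def sliding_window_keep(scores, window=4, threshold=15):
    """Return how many leading bases to keep.

    Scans from the 5' end and cuts at the start of the first window whose
    mean quality is below the threshold. Reads shorter than one window are
    kept unchanged.
    """
    for start in range(len(scores) - window + 1):
        if sum(scores[start:start + window]) / window < threshold:
            return start
    return len(scores)


def trim_read(sequence, scores, leading=3, trailing=3,
              window=4, threshold=15, min_len=36):
    """Apply LEADING, TRAILING, SLIDINGWINDOW, and MINLEN in that order.

    Returns the trimmed sequence, or None if the read is dropped.
    """
    start, end = 0, len(scores)
    while start < end and scores[start] < leading:
        start += 1  # LEADING: drop low-quality bases at the 5' end
    while end > start and scores[end - 1] < trailing:
        end -= 1  # TRAILING: drop low-quality bases at the 3' end
    end = start + sliding_window_keep(scores[start:end], window, threshold)
    if end - start < min_len:
        return None  # MINLEN: read is too short to keep
    return sequence[start:end]


# A 50-base read whose last 10 bases have quality 2.
read = "A" * 50
quals = [30] * 40 + [2] * 10
print(trim_read(read, quals))
```

When one read of a pair is dropped and its mate survives, Trimmomatic writes the survivor to the corresponding unpaired output file, which is why the command names four output files.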

4. Alignment to the Reference Genome

Common tools: HISAT2 and STAR.

Using HISAT2

First download the reference genome and build an index from it.

# Install HISAT2
conda install -c bioconda hisat2
# Download the reference genome
wget ftp://ftp.ensembl.org/pub/release-104/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
gunzip Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
# Build the index (slow for a genome of this size)
hisat2-build Homo_sapiens.GRCh38.dna.primary_assembly.fa GRCh38_index
# Align the trimmed read pairs
hisat2 -x GRCh38_index -1 sample_1_paired.fastq.gz -2 sample_2_paired.fastq.gz -S sample_aligned.sam
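
The SAM output is tab-delimited text in which the second column of every alignment line is a bitwise FLAG describing the read. The sketch below shows how a few of those bits can be decoded to summarize how many reads mapped. It assumes an uncompressed SAM file, counts each read once by skipping secondary and supplementary lines, and uses illustrative function names; the summary that HISAT2 prints may group reads differently.

```python
from collections import Counter

# Bit values defined by the SAM format specification.
PROPER_PAIR = 0x2
UNMAPPED = 0x4
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def summarize_sam(lines):
    """Count mapped, unmapped, and properly paired reads in SAM text lines."""
    stats = Counter()
    for line in lines:
        if line.startswith("@") or not line.strip():
            continue  # header or blank line
        flag = int(line.split("\t")[1])
        if flag & (SECONDARY | SUPPLEMENTARY):
            continue  # extra alignment of a read that is already counted
        stats["total"] += 1
        if flag & UNMAPPED:
            stats["unmapped"] += 1
        else:
            stats["mapped"] += 1
            if flag & PROPER_PAIR:
                stats["properly_paired"] += 1
    return stats


def mapping_rate(stats):
    """Fraction of reads with a primary alignment."""
    return stats["mapped"] / stats["total"] if stats["total"] else 0.0
```

Calling summarize_sam on an open handle, for example with open("sample_aligned.sam") as handle, gives the counts for the file written by the alignment command.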

5. Converting SAM to BAM and Sorting

Use Samtools to convert the SAM file to BAM, then sort and index it.

# Install Samtools
conda install -c bioconda samtools
# Convert SAM to BAM
samtools view -b sample_aligned.sam -o sample_aligned.bam
# Sort the BAM file by coordinate
samtools sort sample_aligned.bam -o sample_aligned_sorted.bam
# Index the sorted BAM file
samtools index sample_aligned_sorted.bam

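6. Quantifying Gene Expression

Use featureCounts to count the reads in the sorted BAM file that overlap each annotated gene. For a differential expression analysis, list the sorted BAM file of every sample at the end of the command so that the output has one count column per sample.

# Install Subread (which provides featureCounts)
conda install -c bioconda subread
# Download the annotation that matches the genome release
wget ftp://ftp.ensembl.org/pub/release-104/gtf/homo_sapiens/Homo_sapiens.GRCh38.104.gtf.gz
gunzip Homo_sapiens.GRCh38.104.gtf.gz
# Count fragments per gene (-p: paired-end input; --countReadPairs needs Subread 2.0.2 or later)
featureCounts -p --countReadPairs -a Homo_sapiens.GRCh38.104.gtf -o gene_counts.txt sample_aligned_sorted.bam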
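
The featureCounts output is a tab-delimited table: a comment line starting with #, then a header row, then one row per gene with the columns Geneid, Chr, Start, End, Strand, and Length followed by one count column per BAM file. As a sketch of how that layout can be reduced to a plain count matrix, assuming exactly those six leading columns, the hypothetical helper below keeps only the gene IDs and the per-sample counts.

```python
import csv

# Annotation columns that featureCounts writes between Geneid and the counts.
ANNOTATION_COLUMNS = ("Chr", "Start", "End", "Strand", "Length")


def read_feature_counts(handle):
    """Parse featureCounts output into (sample_names, {gene_id: [counts]})."""
    data_lines = (line for line in handle if not line.startswith("#"))
    rows = csv.reader(data_lines, delimiter="\t")
    header = next(rows)
    first_sample = 1 + len(ANNOTATION_COLUMNS)
    samples = header[first_sample:]
    counts = {}
    for row in rows:
        if not row:
            continue  # skip blank lines
        counts[row[0]] = [int(value) for value in row[first_sample:]]
    return samples, counts
```

With the command above, read_feature_counts(open("gene_counts.txt")) would return one sample name, the BAM path, and a single count per gene; with several BAM files it returns one count per sample in the order they were listed.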

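7. Differential Expression Analysis

Differential expression analysis is usually done in R; the most common packages are DESeq2 and edgeR.

Differential expression analysis with DESeq2

First, read gene_counts.txt into R. The example assumes the table holds six samples, three controls followed by three treatments, because DESeq2 needs replicates in each condition to estimate dispersion. Adjust the condition vector to match your own design.

# Install DESeq2
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("DESeq2")
# Load DESeq2
library(DESeq2)
# Read the featureCounts table (the leading "#" comment line is skipped by default)
counts <- read.table("gene_counts.txt", header = TRUE, row.names = 1)
# Drop the annotation columns (Chr, Start, End, Strand, Length); keep only the counts
counts <- as.matrix(counts[, -(1:5), drop = FALSE])
# Assumed design: three control samples followed by three treatment samples
condition <- factor(rep(c("control", "treatment"), each = 3),
                    levels = c("control", "treatment"))
coldata <- data.frame(row.names = colnames(counts), condition = condition)
# Create the DESeq2 object
dds <- DESeqDataSetFromMatrix(countData = counts, colData = coldata, design = ~ condition)
# Run DESeq2
dds <- DESeq(dds)
res <- results(dds)
# Summarize the differential expression results
summary(res)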
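
The padj column in these results holds p-values adjusted for multiple testing; results() uses the Benjamini-Hochberg (BH) method by default. The sketch below shows that adjustment in plain Python to make the calculation concrete. It assumes a list with no missing values, and DESeq2 also filters out low-count genes before adjusting, so its padj values will not necessarily match an adjustment over every p-value.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the same order as the input."""
    n = len(pvalues)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, keeping the adjusted values monotone.
    for rank in range(n, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * n / rank)
        adjusted[index] = running_min
    return adjusted


print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

Each p-value is multiplied by the number of tests divided by its rank, which is why a gene with p = 0.01 can end up with padj = 0.02 or more when many genes are tested.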

Visualizing Differentially Expressed Genes

Common plots include the volcano plot and the heatmap.

Volcano plot

# Install the EnhancedVolcano package
BiocManager::install("EnhancedVolcano")
library(EnhancedVolcano)
# Draw the volcano plot
EnhancedVolcano(res,
                lab = rownames(res),
                x = 'log2FoldChange',
                y = 'pvalue',
                title = 'Differentially Expressed Genes')
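
A volcano plot places each gene at its log2 fold change on the x-axis and -log10 of its p-value on the y-axis, so genes with large and statistically strong changes land in the upper corners. The sketch below computes those coordinates and a label for one gene. The cutoffs of 1 and 0.05 are illustrative choices for this example rather than the defaults of EnhancedVolcano, and the function name is hypothetical.

```python
import math


def volcano_point(log2_fold_change, pvalue, lfc_cutoff=1.0, p_cutoff=0.05):
    """Return (x, y, label) for one gene on a volcano plot."""
    # Floor the p-value so that a reported 0 does not break the logarithm.
    y = -math.log10(max(pvalue, 1e-300))
    if pvalue < p_cutoff and log2_fold_change > lfc_cutoff:
        label = "up"
    elif pvalue < p_cutoff and log2_fold_change < -lfc_cutoff:
        label = "down"
    else:
        label = "not significant"
    return log2_fold_change, y, label


print(volcano_point(2.0, 0.001))
print(volcano_point(-0.3, 0.2))
```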

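Heatmap

# Install the pheatmap package
install.packages("pheatmap")
library(pheatmap)
# Prepare the data: variance-stabilize, then take the 20 most variable genes
vsd <- vst(dds, blind = FALSE)
topVarGenes <- head(order(matrixStats::rowVars(assay(vsd)), decreasing = TRUE), 20)
mat <- assay(vsd)[topVarGenes, ]
# Center each gene on its mean so that colors show deviation from the average
mat <- mat - rowMeans(mat)
# Draw the heatmap, annotating columns by condition
pheatmap(mat, annotation_col = as.data.frame(colData(vsd)[, "condition", drop = FALSE]))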
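
The data preparation for the heatmap consists of two small operations: ranking genes by their variance across samples, and subtracting each gene's mean from its values. The sketch below repeats them on a toy dictionary of made-up expression values so that the effect is easy to see. It is only an illustration with hypothetical function names and does not replace the R code.

```python
from statistics import mean, variance


def top_variable_genes(matrix, n):
    """Return the n gene names with the highest sample variance."""
    ranked = sorted(matrix, key=lambda gene: variance(matrix[gene]), reverse=True)
    return ranked[:n]


def center_rows(matrix, genes):
    """Subtract each selected gene's mean from its values."""
    centered = {}
    for gene in genes:
        gene_mean = mean(matrix[gene])
        centered[gene] = [value - gene_mean for value in matrix[gene]]
    return centered


# Made-up expression values for three genes across four samples.
toy = {
    "geneA": [1, 1, 1, 1],
    "geneB": [0, 10, 0, 10],
    "geneC": [4, 5, 4, 5],
}
selected = top_variable_genes(toy, 2)
print(selected)
print(center_rows(toy, selected))
```

Centering removes differences in overall expression level between genes, so the heatmap colors reflect which samples are above or below each gene's own average.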

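8. Functional Enrichment Analysis (Gene Ontology (GO) or KEGG)

The clusterProfiler package supports both GO and KEGG enrichment analysis.

GO analysis with clusterProfiler

# Install clusterProfiler and the human annotation database
BiocManager::install(c("clusterProfiler", "org.Hs.eg.db"))
library(clusterProfiler)
library(org.Hs.eg.db)
# Select significantly changed genes; which() drops genes whose padj is NA
sig_genes <- rownames(res)[which(res$padj < 0.05 & abs(res$log2FoldChange) > 1)]
# GO enrichment; counting against the Ensembl GTF yields Ensembl gene IDs
ego <- enrichGO(gene = sig_genes,
                OrgDb = org.Hs.eg.db,
                keyType = "ENSEMBL",
                ont = "BP",
                pAdjustMethod = "BH",
                pvalueCutoff = 0.05)
# Plot the 10 most enriched terms
barplot(ego, showCategory = 10)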
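
Over-representation analysis of this kind is based on the hypergeometric test: given a background of N genes of which K carry a GO term, it asks how likely it is that a list of n significant genes contains at least k genes with that term by chance. The sketch below computes that tail probability directly. The function and parameter names are illustrative, and the choice of background set in a real analysis affects the result.

```python
from math import comb


def enrichment_pvalue(population, annotated, selected, overlap):
    """P(X >= overlap) under the hypergeometric distribution.

    population: number of background genes (N)
    annotated:  background genes carrying the term (K)
    selected:   genes in the significant list (n)
    overlap:    significant genes carrying the term (k)
    """
    upper = min(annotated, selected)
    total = comb(population, selected)
    tail = sum(
        comb(annotated, hits) * comb(population - annotated, selected - hits)
        for hits in range(overlap, upper + 1)
    )
    return tail / total


# 10 background genes, 4 with the term, 3 significant genes, 2 of them with the term.
print(enrichment_pvalue(10, 4, 3, 2))
```

The resulting p-values, one per GO term, are then adjusted for multiple testing, which is the role of pAdjustMethod = "BH" in the enrichGO call.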
