<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>小生这厢有礼了(BioFaceBook Personal Blog) &#187; Bioinformatics</title>
	<atom:link href="https://www.biofacebook.com/?cat=12&#038;feed=rss2" rel="self" type="application/rss+xml" />
	<link>https://www.biofacebook.com</link>
	<description>Notes along my bioinformatics journey (NGS, Genome, Meta, Linux)</description>
	<lastBuildDate>Sun, 23 Aug 2020 03:28:53 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=4.1.41</generator>
	<item>
		<title>Metagenomics standard operating procedure v2</title>
		<link>https://www.biofacebook.com/?p=1527</link>
		<comments>https://www.biofacebook.com/?p=1527#comments</comments>
		<pubDate>Sun, 23 Aug 2020 03:28:53 +0000</pubDate>
		<dc:creator><![CDATA[szypanther]]></dc:creator>
				<category><![CDATA[Third-generation sequencing]]></category>
		<category><![CDATA[Next-generation sequencing]]></category>
		<category><![CDATA[Bioinformatics]]></category>

		<guid isPermaLink="false">http://www.biofacebook.com/?p=1527</guid>
		<description><![CDATA[<p>http://www.360doc.com/content/20/0823/10/71250389_931761136.shtml</p> <p>https://github.com/LangilleLab/microbiome_helper/wiki/Metagenomics-standard-operating-procedure-v2</p> <p>Note that this workflow is continually being updated. If you want to use the below commands be sure to keep track of them locally.</p> <p>Last updated: 9 Oct 2018 (see revisions above for earlier minor versions and Metagenomics SOPs on the SOP for earlier major versions)</p> <p>Important points to keep in mind:</p> The [...]]]></description>
				<content:encoded><![CDATA[<p><strong><em>http://www.360doc.com/content/20/0823/10/71250389_931761136.shtml</em></strong></p>
<p>https://github.com/LangilleLab/microbiome_helper/wiki/Metagenomics-standard-operating-procedure-v2</p>
<p><em>Note that this workflow is continually being updated. If you want to use the below commands be sure to keep track of them locally.</em></p>
<p><em>Last updated: 9 Oct 2018 (see <code>revisions</code> above for earlier minor versions and <code>Metagenomics SOPs</code> on the SOP for earlier major versions)</em></p>
<p>Important points to keep in mind:</p>
<ul>
<li>The below options are not necessarily best for your data; it is important to explore different options, especially at the read quality filtering stage.</li>
<li>If you are confused about a particular step be sure to check out the pages under <code>Metagenomic Resources</code> on the right side-bar.</li>
<li>You should check that only the FASTQs of interest are being specified at each step (e.g. with the <code>ls</code> command). If you are re-running commands multiple times or using optional steps the input filenames and/or folders could differ.</li>
<li>The tool <code>parallel</code> comes up several times in this workflow. Be sure to check out our tutorial on this tool <a href="https://github.com/LangilleLab/microbiome_helper/wiki/Quick-Introduction-to-GNU-Parallel">here</a>.</li>
<li>You can always run <code>parallel</code> with the <code>--dry-run</code> option to see what commands are being run to double-check they are doing what you want!</li>
</ul>
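<p>The checks above (listing inputs with <code>ls</code>, previewing <code>parallel</code> commands with <code>--dry-run</code>) can be sketched as follows; this is only an illustration, and the sample names are hypothetical:</p>

```shell
# Pre-flight sanity checks before launching a long-running step.
# (sample names below are made up for illustration)
mkdir -p cat_lanes
touch cat_lanes/sampleA_R1.fastq cat_lanes/sampleA_R2.fastq
# Confirm that only the FASTQs of interest match the glob you will pass on:
ls cat_lanes/*_R1.fastq
# With GNU parallel installed, preview the generated commands first:
# parallel --dry-run 'echo would process {}' ::: cat_lanes/*_R1.fastq
```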
<p>This workflow starts with raw paired-end MiSeq or NextSeq data in demultiplexed FASTQ format assumed to be located within a folder called <code>raw_data</code>.</p>
<h2><a id="user-content-1-first-steps" class="anchor" href="https://github.com/LangilleLab/microbiome_helper/wiki/Metagenomics-standard-operating-procedure-v2#1-first-steps"></a>1. First Steps</h2>
<h3><a id="user-content-11-join-lanes-together-all-illumina-except-miseq" class="anchor" href="https://github.com/LangilleLab/microbiome_helper/wiki/Metagenomics-standard-operating-procedure-v2#11-join-lanes-together-all-illumina-except-miseq"></a>1.1 Join lanes together (all Illumina, except MiSeq)</h3>
<p>Concatenate multiple lanes of sequencing together (e.g. for NextSeq data). If you do <strong>NOT</strong> do this step, remember to change <code>cat_lanes</code> to <code>raw_data</code> in the commands below, which assume you are working with the most common type of metagenome data.</p>
<pre><code>concat_lanes.pl raw_data/* -o cat_lanes -p 4
</code></pre>
<h3><a id="user-content-12-inspect-read-quality" class="anchor" href="https://github.com/LangilleLab/microbiome_helper/wiki/Metagenomics-standard-operating-procedure-v2#12-inspect-read-quality"></a>1.2 Inspect read quality</h3>
<p>Run <code>fastqc</code> to allow manual inspection of the quality of sequences.</p>
<pre><code>mkdir fastqc_out
fastqc -t 4 cat_lanes/* -o fastqc_out/
</code></pre>
<h3><a id="user-content-13-optional-stitch-fr-reads" class="anchor" href="https://github.com/LangilleLab/microbiome_helper/wiki/Metagenomics-standard-operating-procedure-v2#13-optional-stitch-fr-reads"></a>1.3 (Optional) Stitch F+R reads</h3>
<p>Stitch paired-end reads together (a summary of stitching results is written to &#8220;pear_summary_log.txt&#8221;). Note: it is important to check the % of reads assembled. It may be better to concatenate the forward and reverse reads together if the assembly % is too low (see step 2.2).</p>
<pre><code>run_pear.pl -p 4 -o stitched_reads cat_lanes/*
</code></pre>
<p><strong><em>If you don&#8217;t stitch your reads together at this step you will need to unzip your FASTQ files before continuing with the below commands.</em></strong></p>
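<p>A minimal sketch of that unzip step, using a tiny mock FASTQ (the file name is made up; on real data you would decompress the gzipped FASTQs from your run):</p>

```shell
# Create a tiny mock gzipped FASTQ to stand in for real sequencer output.
mkdir -p cat_lanes
printf '@read1\nACGT\n+\nIIII\n' > cat_lanes/sampleA_R1.fastq
gzip -f cat_lanes/sampleA_R1.fastq
# Decompress every gzipped FASTQ so the downstream commands can read them:
gunzip cat_lanes/*.fastq.gz
ls cat_lanes
```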
<h2><a id="user-content-2-read-quality-control-and-contaminant-screens" class="anchor" href="https://github.com/LangilleLab/microbiome_helper/wiki/Metagenomics-standard-operating-procedure-v2#2-read-quality-control-and-contaminant-screens"></a>2. Read Quality-Control and Contaminant Screens</h2>
<h3><a id="user-content-21-running-kneaddata" class="anchor" href="https://github.com/LangilleLab/microbiome_helper/wiki/Metagenomics-standard-operating-procedure-v2#21-running-kneaddata"></a>2.1 Running KneadData</h3>
<p>Use <code>kneaddata</code> to run pre-processing tools. First Trimmomatic is run to remove low-quality sequences, then Bowtie2 is run to screen out contaminant sequences. Below we are screening out reads that map to the human or PhiX genomes. Note that KneadData is run below on all unstitched FASTQ pairs with <code>parallel</code>; you can see our quick tutorial on this tool <a href="https://github.com/LangilleLab/microbiome_helper/wiki/Quick-Introduction-to-GNU-Parallel">here</a>. For a detailed breakdown of the options in the below command see <a href="https://github.com/LangilleLab/microbiome_helper/wiki/Metagenomics-Sequencing-Pre-processing">this page</a>. The forward and reverse reads will be specified by &#8220;_1&#8221; and &#8220;_2&#8221; in the output files; ignore the &#8220;R1&#8221; in each filename. Note that the <code>\</code> characters at the end of each line are just to split the command over multiple lines to make it easier to read.</p>
<pre><code>parallel -j 1 --link 'kneaddata -i {1} -i {2} -o kneaddata_out/ \
-db /home/shared/bowtiedb/GRCh38_PhiX --trimmomatic /usr/local/prg/Trimmomatic-0.36/ \
-t 4 --trimmomatic-options "SLIDINGWINDOW:4:20 MINLEN:50" \
--bowtie2-options "--very-sensitive --dovetail" --remove-intermediate-output' \
 ::: cat_lanes/*_R1.fastq ::: cat_lanes/*_R2.fastq
</code></pre>
<p>Clean up the output directory (helps downstream commands) by moving the discarded sequences to a subfolder:</p>
<pre><code>mkdir kneaddata_out/contam_seq

mv kneaddata_out/*_contam*.fastq kneaddata_out/contam_seq
</code></pre>
<p>You can produce a logfile summarizing the KneadData output with this command:</p>
<pre><code>kneaddata_read_count_table --input kneaddata_out --output kneaddata_read_counts.txt
</code></pre>
<h3><a id="user-content-22-concatenate-unstitched-output-omit-if-data-stitched-above" class="anchor" href="https://github.com/LangilleLab/microbiome_helper/wiki/Metagenomics-standard-operating-procedure-v2#22-concatenate-unstitched-output-omit-if-data-stitched-above"></a>2.2 Concatenate unstitched output (Omit if data stitched above)</h3>
<p>If you did not stitch your paired-end reads together with <code>pear</code>, then you can concatenate the forward and reverse FASTQs together now. Note this is done after quality filtering so that both reads in a pair are either discarded or retained. It&#8217;s important to only specify the &#8220;paired&#8221; FASTQs outputted by <code>kneaddata</code> in the below command.</p>
<pre><code>concat_paired_end.pl -p 4 --no_R_match -o cat_reads kneaddata_out/*_paired_*.fastq 
</code></pre>
<p>You should check over the commands that are printed to screen to make sure the correct FASTQs are being concatenated together.</p>
<h2><a id="user-content-3-determine-functions-with-humann2" class="anchor" href="https://github.com/LangilleLab/microbiome_helper/wiki/Metagenomics-standard-operating-procedure-v2#3-determine-functions-with-humann2"></a>3. Determine Functions with HUMAnN2</h2>
<h3><a id="user-content-31-run-humann2" class="anchor" href="https://github.com/LangilleLab/microbiome_helper/wiki/Metagenomics-standard-operating-procedure-v2#31-run-humann2"></a>3.1 Run HUMAnN2</h3>
<p>Run <code>humann2</code> with <code>parallel</code> to calculate the abundance of UniRef90 gene families and MetaCyc pathways. If you are processing environmental data (e.g. soil samples), the vast majority of the reads may not map using this approach. Instead, you can try mapping against the UniRef50 database (which you can point to with the <code>--protein-database</code> option).</p>
<pre><code>mkdir humann2_out

parallel -j 4 'humann2 --threads 1 --input {} --output humann2_out/{/.}' ::: cat_reads/*fastq
</code></pre>
<h3><a id="user-content-32-merge-individual-sample-data-together" class="anchor" href="https://github.com/LangilleLab/microbiome_helper/wiki/Metagenomics-standard-operating-procedure-v2#32-merge-individual-sample-data-together"></a>3.2 Merge individual sample data together</h3>
<p>Join HUMAnN2 output per sample into one table.</p>
<pre><code>mkdir humann2_final_out

humann2_join_tables -s --input humann2_out/ --file_name pathabundance --output humann2_final_out/humann2_pathabundance.tsv
humann2_join_tables -s --input humann2_out/ --file_name pathcoverage --output humann2_final_out/humann2_pathcoverage.tsv
humann2_join_tables -s --input humann2_out/ --file_name genefamilies --output humann2_final_out/humann2_genefamilies.tsv
</code></pre>
<h3><a id="user-content-33-table-output-normalization" class="anchor" href="https://github.com/LangilleLab/microbiome_helper/wiki/Metagenomics-standard-operating-procedure-v2#33-table-output-normalization"></a>3.3 Table output normalization</h3>
<p>Re-normalize gene family and pathway <strong>abundances</strong> (so that all samples are in units of copies per million).</p>
<pre><code>humann2_renorm_table --input humann2_final_out/humann2_pathabundance.tsv --units cpm --output humann2_final_out/humann2_pathabundance_cpm.tsv
humann2_renorm_table --input humann2_final_out/humann2_genefamilies.tsv --units cpm --output humann2_final_out/humann2_genefamilies_cpm.tsv
</code></pre>
<h3><a id="user-content-34-separate-out-taxonomic-contributions" class="anchor" href="https://github.com/LangilleLab/microbiome_helper/wiki/Metagenomics-standard-operating-procedure-v2#34-separate-out-taxonomic-contributions"></a>3.4 Separate out taxonomic contributions</h3>
<p>Split HUMAnN2 output abundance tables into stratified and unstratified tables (stratified tables include the taxa associated with a functional profile).</p>
<pre><code>humann2_split_stratified_table --input humann2_final_out/humann2_pathabundance_cpm.tsv --output humann2_final_out
humann2_split_stratified_table --input humann2_final_out/humann2_genefamilies_cpm.tsv --output humann2_final_out
humann2_split_stratified_table --input humann2_final_out/humann2_pathcoverage.tsv --output humann2_final_out
</code></pre>
<h3><a id="user-content-35-format-stamp-function-file" class="anchor" href="https://github.com/LangilleLab/microbiome_helper/wiki/Metagenomics-standard-operating-procedure-v2#35-format-stamp-function-file"></a>3.5 Format STAMP function file</h3>
<p>Convert unstratified HUMAnN2 abundance tables to STAMP format by changing the header line. These commands remove the comment character and the spaces in the name of the first column. Trailing descriptions of the abundance datatype are also removed from each sample&#8217;s column name.</p>
<pre><code>sed 's/_Abundance-RPKs//g' humann2_final_out/humann2_genefamilies_cpm_unstratified.tsv | sed 's/# Gene Family/GeneFamily/' &gt; humann2_final_out/humann2_genefamilies_cpm_unstratified.spf
sed 's/_Abundance//g' humann2_final_out/humann2_pathabundance_cpm_unstratified.tsv | sed 's/# Pathway/Pathway/' &gt; humann2_final_out/humann2_pathabundance_cpm_unstratified.spf
sed 's/_Coverage//g' humann2_final_out/humann2_pathcoverage_unstratified.tsv | sed 's/# Pathway/Pathway/' &gt; humann2_final_out/humann2_pathcoverage_unstratified.spf
</code></pre>
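<p>To make the effect of those <code>sed</code> commands concrete, here is a self-contained sketch on a mock two-sample table (the file and sample names are made up):</p>

```shell
# Build a mock unstratified gene-family table with the HUMAnN2-style header.
mkdir -p humann2_final_out
printf '# Gene Family\tS1_Abundance-RPKs\tS2_Abundance-RPKs\nUniRef90_X\t1.0\t2.0\n' \
  > humann2_final_out/demo_genefamilies_cpm_unstratified.tsv
# Strip the datatype suffix from sample columns and fix the first column name:
sed 's/_Abundance-RPKs//g' humann2_final_out/demo_genefamilies_cpm_unstratified.tsv \
  | sed 's/# Gene Family/GeneFamily/' \
  > humann2_final_out/demo_genefamilies_cpm_unstratified.spf
# The header is now tab-separated: GeneFamily, S1, S2
head -1 humann2_final_out/demo_genefamilies_cpm_unstratified.spf
```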
<h3><a id="user-content-36-extract-metaphlan2-taxonomic-compositions" class="anchor" href="https://github.com/LangilleLab/microbiome_helper/wiki/Metagenomics-standard-operating-procedure-v2#36-extract-metaphlan2-taxonomic-compositions"></a>3.6 Extract MetaPhlAn2 taxonomic compositions</h3>
<p>Since HUMAnN2 also runs MetaPhlAn2 as an initial step, we can use the output tables already created to get the taxonomic composition of our samples. First we need to gather all the output MetaPhlAn2 results per sample into a single directory and then merge them into a single table using MetaPhlAn2&#8217;s <code>merge_metaphlan_tables.py</code> command. After this file is created we can fix the header so that each column corresponds to a sample name without the trailing &#8220;_metaphlan_bugs_list&#8221; description. Note that MetaPhlAn2 works best for well-characterized environments, like the human gut, and has low sensitivity in other environments.</p>
<pre><code>mkdir metaphlan2_out
cp humann2_out/*/*/*metaphlan_bugs_list.tsv metaphlan2_out/
/usr/local/metaphlan2/utils/merge_metaphlan_tables.py metaphlan2_out/*metaphlan_bugs_list.tsv &gt; metaphlan2_merged.txt
sed -i 's/_metaphlan_bugs_list//g' metaphlan2_merged.txt
</code></pre>
<h3><a id="user-content-37-format-stamp-taxonomy-file" class="anchor" href="https://github.com/LangilleLab/microbiome_helper/wiki/Metagenomics-standard-operating-procedure-v2#37-format-stamp-taxonomy-file"></a>3.7 Format STAMP taxonomy file</h3>
<p>Lastly, we can convert this MetaPhlAn2 abundance table to STAMP format.</p>
<pre><code>metaphlan_to_stamp.pl metaphlan2_merged.txt &gt; metaphlan2_merged.spf</code></pre>
]]></content:encoded>
			<wfw:commentRss>https://www.biofacebook.com/?feed=rss2&#038;p=1527</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Learning enterotype analysis</title>
		<link>https://www.biofacebook.com/?p=1521</link>
		<comments>https://www.biofacebook.com/?p=1521#comments</comments>
		<pubDate>Wed, 01 Jul 2020 05:24:02 +0000</pubDate>
		<dc:creator><![CDATA[szypanther]]></dc:creator>
				<category><![CDATA[Linux]]></category>
		<category><![CDATA[Bioinformatics]]></category>

		<guid isPermaLink="false">http://www.biofacebook.com/?p=1521</guid>
		<description><![CDATA[<p>The enterotype concept was proposed in a 2011 paper, and in 2018 more than twenty leading gut-microbiome researchers revisited and reaffirmed it. I had long been curious how to analyze enterotypes with code; today I found this tutorial, so I am recording it here:</p> <p>This is the original paper: Arumugam, M., Raes, J., et al. (2011) Enterotypes of the human gut microbiome, Nature, doi:10.1038/nature09944. A Google search turned up a tutorial on enterotype analysis by the authors themselves: http://enterotyping.embl.de/enterotypes.html. Here is a Chinese translation of the 2018 consensus paper: http://blog.sciencenet.cn/blog-3334560-1096828.html. If you only need results for your own samples or project without running code, there is a newer web-based typing service, also mentioned in the translated article above: http://enterotypes.org/. Just upload a genus-level relative-abundance table to get results quickly.</p> <p>Below I try the analysis, learning as I go, and keep the code here for reference. The authors have already organized the code; I will study it and aim to apply it to my own data.</p> First, download the test data: wget http://enterotyping.embl.de/MetaHIT_SangerSamples.genus.txt wget http://enterotyping.embl.de/enterotypes_tutorial.sanger.R Run the example data and work out the errors [...]]]></description>
				<content:encoded><![CDATA[<p>The enterotype concept was proposed in a 2011 paper, and in 2018 more than twenty leading gut-microbiome researchers revisited and reaffirmed it. I had long been curious how to analyze enterotypes with code; today I found this tutorial, so I am recording it here:</p>
<blockquote><p><em>This is the original paper: Arumugam, M., Raes, J., et al. (2011) Enterotypes of the human gut microbiome, Nature, doi:10.1038/nature09944. A Google search turned up a tutorial on enterotype analysis by the authors themselves: http://enterotyping.embl.de/enterotypes.html. Here is the 2018 consensus paper from the field's leaders, and a Chinese translation of it: http://blog.sciencenet.cn/blog-3334560-1096828.html. Of course, if you only need results for your own samples or project and do not want to run code, there is a newer, easier web-based typing service, also mentioned in the translated article above: http://enterotypes.org/. Just upload a genus-level relative-abundance table and you will quickly get results.</em></p></blockquote>
<p>Below I will try the analysis, learning as I go, and keep the code here for reference. The authors have already organized the code; I will study it and aim to apply it to my own data.</p>
<h2 id="%E9%A6%96%E5%85%88%E4%B8%8B%E8%BD%BD%E6%B5%8B%E8%AF%95%E6%95%B0%E6%8D%AE%EF%BC%8C"><strong>First, download the test data</strong></h2>
<pre><code>wget http://enterotyping.embl.de/MetaHIT_SangerSamples.genus.txt
wget http://enterotyping.embl.de/enterotypes_tutorial.sanger.R
</code></pre>
<h2 id="%E8%B7%91%E8%B7%91%E7%A4%BA%E4%BE%8B%E6%95%B0%E6%8D%AE%EF%BC%8C%E6%8E%92%E6%8E%92%E9%94%99"><strong>Run the example data and work out the errors</strong></h2>
<p>My grasp of R is still superficial, so for now I am happy just to get the tutorial running and then use it on my own data as a tool. I am on a Hackintosh running OS X 10.11; running the script complained that XQuartz was missing, so I installed it (Windows and Linux should not need this). The original code also raised the error &#8220;could not find function &#8216;s.class&#8217;&#8221;; a Baidu search turned up a Sina blog post identifying the missing package, and adding <code>library(ade4)</code> fixed it. XQuartz download for Mac 10.6+: https://dl.bintray.com/xquartz/downloads/XQuartz-2.7.11.dmg</p>
<pre>#Uncomment next two lines if R packages are already installed
#install.packages("cluster")
#install.packages("clusterSim")
library(cluster)
library(clusterSim)
#BiocManager::install("genefilter")
library(ade4)

#Download the example data and set the working directory
#setwd('&lt;path_to_working_directory&gt;')
data=read.table("../MetaHIT_SangerSamples.genus.txt", header=T, row.names=1, dec=".", sep="\t")
data=data[-1,]

dist.JSD &lt;- function(inMatrix, pseudocount=0.000001, ...) {
 KLD &lt;- function(x,y) sum(x * log(x/y))
 JSD &lt;- function(x,y) sqrt(0.5 * KLD(x, (x+y)/2) + 0.5 * KLD(y, (x+y)/2))
 matrixColSize &lt;- length(colnames(inMatrix))
 matrixRowSize &lt;- length(rownames(inMatrix))
 colnames &lt;- colnames(inMatrix)
 resultsMatrix &lt;- matrix(0, matrixColSize, matrixColSize)

 inMatrix = apply(inMatrix, 1:2, function(x) ifelse(x==0, pseudocount, x))

 for(i in 1:matrixColSize) {
   for(j in 1:matrixColSize) {
     resultsMatrix[i,j]=JSD(as.vector(inMatrix[,i]),
                            as.vector(inMatrix[,j]))
   }
 }
 colnames -&gt; colnames(resultsMatrix) -&gt; rownames(resultsMatrix)
 as.dist(resultsMatrix) -&gt; resultsMatrix
 attr(resultsMatrix, "method") &lt;- "dist"
 return(resultsMatrix)
}

data.dist=dist.JSD(data)

pam.clustering=function(x,k) { # x is a distance matrix and k the number of clusters
 require(cluster)
 cluster = as.vector(pam(as.dist(x), k, diss=TRUE)$clustering)
 return(cluster)
}

data.cluster=pam.clustering(data.dist, k=3)

require(clusterSim)
nclusters = index.G1(t(data), data.cluster, d = data.dist, centrotypes = "medoids")

nclusters=NULL

for (k in 1:20) {
 if (k==1) {
   nclusters[k]=NA
 } else {
   data.cluster_temp=pam.clustering(data.dist, k)
   nclusters[k]=index.G1(t(data), data.cluster_temp, d = data.dist,
                         centrotypes = "medoids")
 }
}

plot(nclusters, type="h", xlab="k clusters", ylab="CH index", main="Optimal number of clusters")

obs.silhouette=mean(silhouette(data.cluster, data.dist)[,3])
<span class="token function">cat</span><span class="token punctuation">(</span>obs<span class="token punctuation">.</span>silhouette<span class="token punctuation">)</span> #<span class="token number">0.1899451</span>#data<span class="token operator">=</span>noise<span class="token punctuation">.</span><span class="token function">removal</span><span class="token punctuation">(</span>data<span class="token punctuation">,</span> percent<span class="token operator">=</span><span class="token number">0.01</span><span class="token punctuation">)</span>## plot <span class="token number">1</span>
obs<span class="token punctuation">.</span>pca<span class="token operator">=</span>dudi<span class="token punctuation">.</span><span class="token function">pca</span><span class="token punctuation">(</span>data<span class="token punctuation">.</span><span class="token function">frame</span><span class="token punctuation">(</span><span class="token function">t</span><span class="token punctuation">(</span>data<span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">,</span> scannf<span class="token operator">=</span>F<span class="token punctuation">,</span> nf<span class="token operator">=</span><span class="token number">10</span><span class="token punctuation">)</span>
obs<span class="token punctuation">.</span>bet<span class="token operator">=</span><span class="token function">bca</span><span class="token punctuation">(</span>obs<span class="token punctuation">.</span>pca<span class="token punctuation">,</span> fac<span class="token operator">=</span><span class="token keyword">as</span><span class="token punctuation">.</span><span class="token function">factor</span><span class="token punctuation">(</span>data<span class="token punctuation">.</span>cluster<span class="token punctuation">)</span><span class="token punctuation">,</span> scannf<span class="token operator">=</span>F<span class="token punctuation">,</span> nf<span class="token operator">=</span>k<span class="token number">-1</span><span class="token punctuation">)</span>
dev<span class="token punctuation">.</span><span class="token keyword">new</span><span class="token punctuation">(</span><span class="token punctuation">)</span>
s<span class="token punctuation">.</span><span class="token keyword">class</span><span class="token punctuation">(</span>obs<span class="token punctuation">.</span>bet$ls<span class="token punctuation">,</span> fac<span class="token operator">=</span><span class="token keyword">as</span><span class="token punctuation">.</span><span class="token function">factor</span><span class="token punctuation">(</span>data<span class="token punctuation">.</span>cluster<span class="token punctuation">)</span><span class="token punctuation">,</span> grid<span class="token operator">=</span>F<span class="token punctuation">,</span>sub<span class="token operator">=</span><span class="token string">"Between-class analysis"</span><span class="token punctuation">,</span> col<span class="token operator">=</span><span class="token function">c</span><span class="token punctuation">(</span><span class="token number">3</span><span class="token punctuation">,</span><span class="token number">2</span><span class="token punctuation">,</span><span class="token number">4</span><span class="token punctuation">)</span><span class="token punctuation">)</span>#plot <span class="token number">2</span>
obs<span class="token punctuation">.</span>pcoa<span class="token operator">=</span>dudi<span class="token punctuation">.</span><span class="token function">pco</span><span class="token punctuation">(</span>data<span class="token punctuation">.</span>dist<span class="token punctuation">,</span> scannf<span class="token operator">=</span>F<span class="token punctuation">,</span> nf<span class="token operator">=</span><span class="token number">3</span><span class="token punctuation">)</span>
dev<span class="token punctuation">.</span><span class="token keyword">new</span><span class="token punctuation">(</span><span class="token punctuation">)</span>
s<span class="token punctuation">.</span><span class="token keyword">class</span><span class="token punctuation">(</span>obs<span class="token punctuation">.</span>pcoa$li<span class="token punctuation">,</span> fac<span class="token operator">=</span><span class="token keyword">as</span><span class="token punctuation">.</span><span class="token function">factor</span><span class="token punctuation">(</span>data<span class="token punctuation">.</span>cluster<span class="token punctuation">)</span><span class="token punctuation">,</span> grid<span class="token operator">=</span>F<span class="token punctuation">,</span>sub<span class="token operator">=</span><span class="token string">"Principal coordiante analysis"</span><span class="token punctuation">,</span> col<span class="token operator">=</span><span class="token function">c</span><span class="token punctuation">(</span><span class="token number">3</span><span class="token punctuation">,</span><span class="token number">2</span><span class="token punctuation">,</span><span class="token number">4</span><span class="token punctuation">)</span><span class="token punctuation">)</span></pre>
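<p>The quantity <code>obs.silhouette</code> above is the mean silhouette width: for each sample, compare its average distance to its own cluster against its average distance to the nearest other cluster. The same statistic can be computed from any distance matrix and label vector; here is a minimal pure-Python sketch on a made-up, well-separated toy dataset (not the tutorial's data), assuming every cluster has at least two members:</p>

```python
def mean_silhouette(dist, labels):
    """Mean silhouette width from a full distance matrix and cluster labels."""
    n = len(labels)
    scores = []
    for i in range(n):
        # a: mean distance to the other members of i's own cluster
        same = [dist[i][j] for j in range(n) if j != i and labels[j] == labels[i]]
        a = sum(same) / len(same)
        # b: smallest mean distance to any other cluster
        b = min(
            sum(dist[i][j] for j in range(n) if labels[j] == c)
            / sum(1 for j in range(n) if labels[j] == c)
            for c in set(labels) if c != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / n

# Two tight, well-separated 1-D clusters -> mean silhouette close to 1
xs = [0.0, 0.1, 0.2, 10.0, 10.1, 10.2]
dist = [[abs(p - q) for q in xs] for p in xs]
labels = [0, 0, 0, 1, 1, 1]
print(round(mean_silhouette(dist, labels), 3))
```

A value near 1 means tight, well-separated clusters; the tutorial's 0.19 indicates much weaker structure, which is typical for enterotype data.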
<h2 id="%E4%B8%8A%E5%9B%BE%EF%BC%8C%E7%A8%8D%E5%BE%AE%E8%B0%83%E6%95%B4%E4%B8%8B"><strong>Tweaking the figure above</strong></h2>
<p>The <code>, col=c(3,2,4)</code> argument assigns a different color to each of the three clusters. I have not yet worked out how to color the enclosing ellipses for a nicer effect, though that should be trivial for anyone fluent in R. <code>, cell=0, cstar=0</code> hides the ellipses and star lines, leaving only the scatter points. Without these two parameters, using just the code above, the figure looks like this:</p>
<figure>
<div class="image-block"><img class="" src="https://ask.qcloudimg.com/http-save/yehe-1075469/j1jqq1y6uv.png?imageView2/2/w/1620" alt="" /></div>
</figure>
<p>With the two extra parameters added, the figure matches the last two figures in the tutorial:</p>
<figure>
<div class="image-block"><img class="" src="https://ask.qcloudimg.com/http-save/yehe-1075469/g6fnw7cmwb.png?imageView2/2/w/1620" alt="" /></div>
</figure>
]]></content:encoded>
			<wfw:commentRss>https://www.biofacebook.com/?feed=rss2&#038;p=1521</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Pandas and Sklearn</title>
		<link>https://www.biofacebook.com/?p=1512</link>
		<comments>https://www.biofacebook.com/?p=1512#comments</comments>
		<pubDate>Fri, 05 Jun 2020 03:02:14 +0000</pubDate>
		<dc:creator><![CDATA[szypanther]]></dc:creator>
				<category><![CDATA[生物信息]]></category>
		<category><![CDATA[统计学习]]></category>

		<guid isPermaLink="false">http://www.biofacebook.com/?p=1512</guid>
		<description><![CDATA[pandas isnull函数检查数据是否有缺失 pandas isnull sum with column headers <p>&#160;</p> for col in main_df: print(sum(pd.isnull(data[col]))) <p>I get a list of the null count for each column:</p> 0 1 100 <p>What I&#8217;m trying to do is create a new dataframe which has the column header alongside the null count, e.g.</p> col1 &#124; 0 col2 &#124; 1 col3 [...]]]></description>
				<content:encoded><![CDATA[<h1 class="title-article">Checking for missing values with pandas isnull</h1>
<h1 class="grid--cell fs-headline1 fl1 ow-break-word mb8"><a class="question-hyperlink" href="https://stackoverflow.com/questions/41681693/pandas-isnull-sum-with-column-headers">pandas isnull sum with column headers</a></h1>
<p>&nbsp;</p>
<pre><code>for col in main_df:
    print(sum(pd.isnull(main_df[col])))
</code></pre>
<p>I get a list of the null count for each column:</p>
<pre><code>0
1
100
</code></pre>
<p>What I&#8217;m trying to do is create a new dataframe which has the column header alongside the null count, e.g.</p>
<pre><code>col1 | 0
col2 | 1
col3 | 100</code></pre>
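<p>pandas produces exactly that table directly: <code>isnull().sum()</code> returns a Series indexed by column name. A minimal sketch with a made-up DataFrame:</p>

```python
import pandas as pd

df = pd.DataFrame({
    "col1": [1, 2, 3],
    "col2": [1, None, 3],
    "col3": [None, None, None],
})
null_counts = df.isnull().sum()  # Series: column name -> null count
print(null_counts)
```

Each index label is the column header, each value its null count, so no manual loop is needed.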
<p>Print every column's null count alongside its header using:</p>
<pre><code>nulls = df.isnull().sum().to_frame()
for index, row in nulls.iterrows():
    print(index, row[0])
</code></pre>
<p>And to inspect the unique values in each column:</p>
<pre><code>for col in df:
    print(df[col].unique())
</code></pre>
<h1 class="article-title J-articleTitle">Using pandas.get_dummies (one-hot encoding)</h1>
<p>get_dummies is the pandas way of doing one-hot encoding. See the <a href="https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html" target="_blank" rel="nofollow noopener noreferrer" data-from="10680">official documentation</a> for full parameter details.</p>
<blockquote><p>pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False)[source]</p></blockquote>
<p>Parameters:</p>
<ul class="ul-level-0">
<li>data : array-like, Series, or DataFrame. The input data.</li>
<li>prefix : string, list of strings, or dict of strings, default None. Prefix for the column names generated by get_dummies.</li>
<li>columns : list-like, default None. The columns to encode.</li>
<li>dummy_na : bool, default False. Add a column indicating missing values; if False, NaNs are ignored.</li>
<li>drop_first : bool, default False. Keep k-1 of the k category levels by removing the first.</li>
</ul>
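<p>The listed parameters can be seen together on a tiny made-up Series (hypothetical data, just for illustration):</p>

```python
import pandas as pd

s = pd.Series(["red", "blue", None])
# prefix controls the generated column names; dummy_na=True adds an
# indicator column for missing values; drop_first=False (the default)
# keeps all k category columns.
d = pd.get_dummies(s, prefix="color", dummy_na=True, drop_first=False)
print(list(d.columns))  # ['color_blue', 'color_red', 'color_nan']
```

With <code>drop_first=True</code> the first category column (<code>color_blue</code> here) would be dropped, since it is implied by the others.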
<p><strong>There are two cases when encoding discrete features:</strong></p>
<p>1. The values have no ordinal meaning, e.g. color: [red, blue]. Use one-hot encoding.</p>
<p>2. The values do have ordinal meaning, e.g. size: [X, XL, XXL]. Use a numeric mapping such as {X:1, XL:2, XXL:3}.</p>
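<p>The ordinal case (point 2) is just a dictionary lookup via <code>Series.map</code>; a quick sketch on made-up data:</p>

```python
import pandas as pd

sizes = pd.Series(["X", "XXL", "XL", "X"])
size_map = {"X": 1, "XL": 2, "XXL": 3}  # ordinal encoding, not one-hot
encoded = sizes.map(size_map)
print(encoded.tolist())  # [1, 3, 2, 1]
```

Values absent from the mapping would become NaN, so check coverage of the dictionary first.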
<p>Example:</p>
<pre class="prism-token token  language-javascript"><span class="token keyword">import</span> pandas <span class="token keyword">as</span> pd

df <span class="token operator">=</span> pd<span class="token punctuation">.</span><span class="token function">DataFrame</span><span class="token punctuation">(</span><span class="token punctuation">[</span>  
            <span class="token punctuation">[</span><span class="token string">'green'</span> <span class="token punctuation">,</span> <span class="token string">'A'</span><span class="token punctuation">]</span><span class="token punctuation">,</span>   
            <span class="token punctuation">[</span><span class="token string">'red'</span>   <span class="token punctuation">,</span> <span class="token string">'B'</span><span class="token punctuation">]</span><span class="token punctuation">,</span>   
            <span class="token punctuation">[</span><span class="token string">'blue'</span>  <span class="token punctuation">,</span> <span class="token string">'A'</span><span class="token punctuation">]</span><span class="token punctuation">]</span><span class="token punctuation">)</span>  

df<span class="token punctuation">.</span>columns <span class="token operator">=</span> <span class="token punctuation">[</span><span class="token string">'color'</span><span class="token punctuation">,</span>  <span class="token string">'class'</span><span class="token punctuation">]</span> 
pd<span class="token punctuation">.</span><span class="token function">get_dummies</span><span class="token punctuation">(</span>df<span class="token punctuation">)</span></pre>
<p>Before get_dummies:</p>
<figure>
<div class="image-block"><img class="" src="https://ask.qcloudimg.com/http-save/yehe-4908043/zzw0nx7acu.png?imageView2/2/w/1620" alt="" /></div>
</figure>
<p>After get_dummies:</p>
<figure>
<div class="image-block"><img class="" src="https://ask.qcloudimg.com/http-save/yehe-4908043/91815t6ls4.png?imageView2/2/w/1620" alt="" /></div>
</figure>
<p>After running the above, printing df still shows the data from before get_dummies, because the result was never assigned back:</p>
<pre class="prism-token token  language-javascript">df <span class="token operator">=</span> pd<span class="token punctuation">.</span><span class="token function">get_dummies</span><span class="token punctuation">(</span>df<span class="token punctuation">)</span></pre>
<p>You can also apply get_dummies to a single column:</p>
<pre class="prism-token token  language-javascript">pd<span class="token punctuation">.</span><span class="token function">get_dummies</span><span class="token punctuation">(</span>df<span class="token punctuation">.</span>color<span class="token punctuation">)</span></pre>
<figure>
<div class="image-block"><img class="" src="https://ask.qcloudimg.com/http-save/yehe-4908043/qoplyil41o.png?imageView2/2/w/1620" alt="" /></div>
</figure>
<p>To merge the encoded column back into the original data, join it:</p>
<pre class="prism-token token  language-javascript">df <span class="token operator">=</span> df<span class="token punctuation">.</span><span class="token function">join</span><span class="token punctuation">(</span>pd<span class="token punctuation">.</span><span class="token function">get_dummies</span><span class="token punctuation">(</span>df<span class="token punctuation">.</span>color<span class="token punctuation">)</span><span class="token punctuation">)</span></pre>
<figure>
<div class="image-block"><img class="" src="https://ask.qcloudimg.com/http-save/yehe-4908043/m0sb5qrqaz.png?imageView2/2/w/1620" alt="" /></div>
</figure>
<p>Reference: <a href="https://blog.csdn.net/maymay_/article/details/80198468" target="_blank" rel="nofollow noopener noreferrer" data-from="10680">https://blog.csdn.net/maymay_/article/details/80198468</a></p>
<pre> 
&gt;&gt;&gt; train_filter.info()
&lt;class 'pandas.core.frame.DataFrame'&gt;
RangeIndex: 1482 entries, 0 to 1481
Columns: 182 entries, SampleID to BS120
dtypes: float64(177), int64(2), object(3)
memory usage: 2.1+ MB
&gt;&gt;&gt; train_filter.dtypes
SampleID object
Streptococcus Infection float64
Duration_of_gestation object
Gestation_age float64
Gestation_age_G1 float64
Gestation_age_G2 float64
GDM_HDP float64
Age int64
Age_group int64
Blood_type float64
Medication_use float64
Progesterone_use float64
Pregnancy_mode float64
Native_place float64
Combined_disease float64
Infection float64
Scar_uterus float64
Risk_rating float64
Anamnesis float64
Thalassemia float64
Ovary_disease float64
Hepatopathy float64
Allergic_history float64
Thyroid_disease float64
Hysteromyoma float64
Breast_disease float64
Weight_at_delivery object
Weight_before_pregnancy float64
Height float64
BMI_before_pregnancy float64
 ... 
B_A/G float64
B_r_GT_G float64
B_r_GT float64
B_TBA_G float64
B_TBA float64
B_ALT_G float64
B_ALT float64
B_AST_G float64
B_AST float64
B_TBIL_G float64
B_TBIL float64
B_DBIL_G float64
B_DBIL float64
B_IBIL_G float64
B_IBIL float64
B_Crea_G float64
B_Crea float64
B_CysC_G float64
B_CysC float64
B_UA_G float64
B_UA float64
B_Urea_G float64
B_Urea float64
B_GLU_G float64
B_GLU float64
HbA1c_G float64
HbA1c float64
BS float64
BS60 float64
BS120 float64
Length: 182, dtype: object



</pre>
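<p>Before feeding a frame like <code>train_filter</code> into scikit-learn, the <code>object</code> columns (SampleID, Duration_of_gestation, Weight_at_delivery above) need to be dropped or encoded; <code>select_dtypes</code> makes the split easy. A sketch on a small made-up frame mimicking those dtypes:</p>

```python
import pandas as pd

df = pd.DataFrame({
    "SampleID": ["s1", "s2"],  # object, an identifier, not a feature
    "Age": [30, 32],           # int64
    "HbA1c": [5.1, 5.4],       # float64
})
numeric = df.select_dtypes(include="number")      # model-ready columns
non_numeric = df.select_dtypes(exclude="number")  # needs encoding/dropping
print(list(numeric.columns), list(non_numeric.columns))
```

The non-numeric columns can then be one-hot encoded with get_dummies as above, or dropped if they are pure identifiers.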
<p>scikit-learn is a machine learning toolkit built on Python.</p>
<p>http://www.scikitlearn.com.cn/</p>
]]></content:encoded>
			<wfw:commentRss>https://www.biofacebook.com/?feed=rss2&#038;p=1512</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>DBS model training</title>
		<link>https://www.biofacebook.com/?p=1499</link>
		<comments>https://www.biofacebook.com/?p=1499#comments</comments>
		<pubDate>Thu, 14 May 2020 08:39:21 +0000</pubDate>
		<dc:creator><![CDATA[szypanther]]></dc:creator>
				<category><![CDATA[生物信息]]></category>

		<guid isPermaLink="false">http://www.biofacebook.com/?p=1499</guid>
		<description><![CDATA[<p>phenotype gproNOG.annotations bitscores</p> <p>This is the complete workflow used to generate a random forest model using output data from an hmmsearch of your protein coding genes against eggNOG gamma proteobacterial protein HMMs. Before running this notebook, run the parse_hmmsearch.pl script to get a tab-delimited file containing bitscores for all isolates.</p> <p>&#8220;`{r, read in data} # [...]]]></description>
				<content:encoded><![CDATA[<p><a href="http://www.biofacebook.com/wp-content/uploads/2020/05/phenotype.tsv">phenotype</a> <a href="http://www.biofacebook.com/wp-content/uploads/2020/05/gproNOG.annotations.tsv">gproNOG.annotations</a> <a href="http://www.biofacebook.com/wp-content/uploads/2020/05/bitscores.tsv">bitscores</a></p>
<p>This is the complete workflow used to generate a random forest model using output data from an hmmsearch of your protein coding genes against eggNOG gamma proteobacterial protein HMMs. Before running this notebook, run the parse_hmmsearch.pl script to get a tab-delimited file containing bitscores for all isolates.</p>
<pre><code>```{r, read in data}
# library(gplots)
library(caret)
library(randomForest)
set.seed(1)
# set the directory you are working from
directory &lt;- "/home/shenzy/UNISED/CII_ECOR_update_final_164genes"
# Reading in the eggNOG model scores, checking to make sure that top eggNOG model hit for each protein in the orthogroup is the same
traindata &lt;- read.delim(paste(directory, "/bitscores.tsv", sep=""))
traindata &lt;- t(traindata)
phenotype &lt;- read.delim(paste(directory, "/phenotype.tsv", sep=""), header=F)
phenotype[,1] &lt;- make.names(phenotype[,1])
traindata &lt;- cbind.data.frame(traindata, phenotype=phenotype[match(row.names(traindata), phenotype[,1]),2])
traindata[is.na(traindata)] &lt;- 0
# traindata &lt;- na.roughfix(traindata)
traindata &lt;- traindata[,-nearZeroVar(traindata)]
names(traindata) &lt;- make.names(names(traindata))
```</code></pre>
<p>The following section is an optional step for picking the best values of mtry (the number of genes sampled per node) and ntree (the number of trees in your random forest) for building your model. Instead of running all of the code at once, proceed through each step of model building and examine the figures produced. These will give you an indication of the point at which the performance of your model starts to level off.</p>
<p>In general, greater values of ntree and mtry will give you better stability in the top genes that are identified by the model. You can alternatively skip this step and proceed immediately to the next one, where values of 10,000 trees and p/10 genes per node (where p is total number of genes in the training data) have been chosen as a good starting point.</p>
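<p>The same ntree/mtry scan can be sketched in Python, with scikit-learn's out-of-bag score standing in for R's OOB error rate (synthetic data here, not the eggNOG bitscore table; n_estimators plays the role of ntree and max_features of mtry):</p>

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 20))          # 100 isolates x 20 fake "genes"
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # phenotype driven by two genes

for n_trees in (50, 250, 1000):
    model = RandomForestClassifier(
        n_estimators=n_trees,
        max_features=max(1, X.shape[1] // 10),  # ~ p/10, as suggested above
        oob_score=True,
        random_state=1,
    ).fit(X, y)
    print(n_trees, round(1 - model.oob_score_, 3))  # OOB error rate
```

As in the R workflow, the OOB error typically stabilizes once the forest is large enough, which is the leveling-off point to look for in the plots.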
<pre><code>```{r, train model}
# this section is for picking out the best parameters for building your model
set.seed(1)
# varying ntree
error &lt;- vector()
sparsity &lt;- vector()
for(i in c(1, 10, 50, 250, 500, 1000, 1500, 2000, 5000, 10000)) {
  model &lt;- randomForest(phenotype ~ ., data=traindata, ntree=i)
  error &lt;- c(error, model$err.rate[length(model$err.rate)])
  sparsity &lt;- c(sparsity, (sum(model$importance[,1]&lt;=0))/(ncol(traindata)-1))
}
# varying mtry
error2 &lt;- vector()
sparsity2 &lt;- vector()
param &lt;- ncol(traindata)-1
for(i in c(1, round(param/10), round(param/5), round(param/3), round(param/2), param)) {
  model &lt;- randomForest(phenotype ~ ., data=traindata, ntree=10000, mtry=i)
  error2 &lt;- c(error2, model$err.rate[length(model$err.rate)])
  sparsity2 &lt;- c(sparsity2, (sum(model$importance[,1]&lt;=0))/(ncol(traindata)-1))
}
model &lt;- randomForest(phenotype ~ ., data=traindata, ntree=10000, mtry=param/10, na.action=na.roughfix)
png(paste(directory, "/model_training/m1_error_vs_ntree.png", sep=""), width=350, height=350)
plot(x=c(1, 10, 50, 250, 500, 1000, 1500, 2000, 5000, 10000), y=error, xlab="Number of trees", ylab="OOB error rate", pch=16)
dev.off()
png(paste(directory, "/model_training/m1_sparsity_vs_ntree.png", sep=""), width=350, height=350)
plot(x=c(1, 10, 50, 250, 500, 1000, 1500, 2000, 5000, 10000), y=sparsity, xlab="Number of trees", ylab="% genes uninformative", pch=16)
dev.off()
png(paste(directory, "/model_training/m1_error_vs_mtry.png", sep=""), width=350, height=350)
plot(x=c(1, round(param/10), round(param/5), round(param/3), round(param/2), param), y=error2, xlab="Number of genes sampled per tree", ylab="OOB error rate", pch=16)
dev.off()
png(paste(directory, "/model_training/m1_sparsity_vs_mtry.png", sep=""), width=350, height=350)
plot(x=c(1, round(param/10), round(param/5), round(param/3), round(param/2), param), y=sparsity2, xlab="Number of genes sampled per tree", ylab="% genes uninformative", pch=16)
dev.off()
train2 &lt;- traindata[,match(names(model$importance[model$importance[,1]&gt;0,]), colnames(traindata))]
train2 &lt;- cbind(train2, phenotype=traindata$phenotype)
error &lt;- vector()
sparsity &lt;- vector()
for(i in c(1, 10, 50, 250, 500, 1000, 1500, 2000, 5000, 10000)) {
  model &lt;- randomForest(phenotype ~ ., data=train2, ntree=i, na.action=na.roughfix)
  error &lt;- c(error, median(model$err.rate))
  sparsity &lt;- c(sparsity, (sum(model$importance[,1]&lt;=0))/(ncol(train2)-1))
}
error2 &lt;- vector()
sparsity2 &lt;- vector()
param &lt;- ncol(train2)-1
for(i in c(1, round(param/10), round(param/5), round(param/3), round(param/2), param)) {
  model &lt;- randomForest(phenotype ~ ., data=train2, ntree=10000, mtry=i, na.action=na.roughfix)
  error2 &lt;- c(error2, median(model$err.rate))
  sparsity2 &lt;- c(sparsity2, (sum(model$importance[,1]&lt;=0))/(ncol(train2)-1))
}
model &lt;- randomForest(phenotype ~ ., data=train2, ntree=10000, mtry=param/10, na.action=na.roughfix)
png(paste(directory, "/model_training/m2_error_vs_ntree.png", sep=""), width=350, height=350)
plot(x=c(1, 10, 50, 250, 500, 1000, 1500, 2000, 5000, 10000), y=error, xlab="Number of trees", ylab="OOB error rate", pch=16)
dev.off()
png(paste(directory, "/model_training/m2_sparsity_vs_ntree.png", sep=""), width=350, height=350)
plot(x=c(1, 10, 50, 250, 500, 1000, 1500, 2000, 5000, 10000), y=sparsity, xlab="Number of trees", ylab="% genes uninformative", pch=16)
dev.off()
png(paste(directory, "/model_training/m2_error_vs_mtry.png", sep=""), width=350, height=350)
plot(x=c(1, round(param/10), round(param/5), round(param/3), round(param/2), param), y=error2, xlab="Number of genes sampled per tree", ylab="OOB error rate", pch=16)
dev.off()
png(paste(directory, "/model_training/m2_sparsity_vs_mtry.png", sep=""), width=350, height=350)
plot(x=c(1, round(param/10), round(param/5), round(param/3), round(param/2), param), y=sparsity2, xlab="Number of genes sampled per tree", ylab="% genes uninformative", pch=16)
dev.off()
train3 &lt;- train2[,match(names(model$importance[model$importance[,1]&gt;quantile(model$importance[,1], 0.5),]), colnames(train2))]
train3 &lt;- cbind(train3, phenotype=train2$phenotype)
error &lt;- vector()
sparsity &lt;- vector()
param &lt;- ncol(train3)-1
for(i in c(1, 10, 50, 250, 500, 1000, 1500, 2000, 5000, 10000)) {
  model &lt;- randomForest(phenotype ~ ., data=train3, ntree=i, na.action=na.roughfix)
  error &lt;- c(error, median(model$err.rate))
  sparsity &lt;- c(sparsity, (sum(model$importance[,1]&lt;=0))/(ncol(train3)-1))
}
error2 &lt;- vector()
sparsity2 &lt;- vector()
for(i in c(1, round(param/10), round(param/5), round(param/3), round(param/2), param)) {
  model &lt;- randomForest(phenotype ~ ., data=train3, ntree=10000, mtry=i, na.action=na.roughfix)
  error2 &lt;- c(error2, median(model$err.rate))
  sparsity2 &lt;- c(sparsity2, (sum(model$importance[,1]&lt;=0))/(ncol(train3)-1))
}
model &lt;- randomForest(phenotype ~ ., data=train3, ntree=10000, mtry=param/10, na.action=na.roughfix)
png(paste(directory, "/model_training/m3_error_vs_ntree.png", sep=""), width=350, height=350)
plot(x=c(1, 10, 50, 250, 500, 1000, 1500, 2000, 5000, 10000), y=error, xlab="Number of trees", ylab="OOB error rate", pch=16)
dev.off()
png(paste(directory, "/model_training/m3_sparsity_vs_ntree.png", sep=""), width=350, height=350)
plot(x=c(1, 10, 50, 250, 500, 1000, 1500, 2000, 5000, 10000), y=sparsity, xlab="Number of trees", ylab="% genes uninformative", pch=16)
dev.off()
png(paste(directory, "/model_training/m3_error_vs_mtry.png", sep=""), width=350, height=350)
plot(x=c(1, round(param/10), round(param/5), round(param/3), round(param/2), param), y=error2, xlab="Number of genes sampled per tree", ylab="OOB error rate", pch=16)
dev.off()
png(paste(directory, "/model_training/m3_sparsity_vs_mtry.png", sep=""), width=350, height=350)
plot(x=c(1, round(param/10), round(param/5), round(param/3), round(param/2), param), y=sparsity2, xlab="Number of genes sampled per tree", ylab="% genes uninformative", pch=16)
dev.off()
train4 &lt;- train3[,match(names(model$importance[model$importance[,1]&gt;quantile(model$importance[,1], 0.5),]), colnames(train3))]
train4 &lt;- cbind(train4, phenotype=train3$phenotype)
error &lt;- vector()
sparsity &lt;- vector()
for(i in c(1, 10, 50, 250, 500, 1000, 1500, 2000, 5000, 10000)) {
  model &lt;- randomForest(phenotype ~ ., data=train4, ntree=i, na.action=na.roughfix)
  error &lt;- c(error, median(model$err.rate))
  sparsity &lt;- c(sparsity, (sum(model$importance[,1]&lt;=0))/(ncol(train4)-1))
}
error2 &lt;- vector()
sparsity2 &lt;- vector()
param &lt;- ncol(train4)-1
for(i in c(1, round(param/10), round(param/5), round(param/3), round(param/2), param)) {
  model &lt;- randomForest(phenotype ~ ., data=train4, ntree=10000, mtry=i, na.action=na.roughfix)
  error2 &lt;- c(error2, median(model$err.rate))
  sparsity2 &lt;- c(sparsity2, (sum(model$importance[,1]&lt;=0))/(ncol(train4)-1))
}
model &lt;- randomForest(phenotype ~ ., data=train4, ntree=10000, mtry=param/10, na.action=na.roughfix)
png(paste(directory, "/model_training/m4_error_vs_ntree.png", sep=""), width=350, height=350)
plot(x=c(1, 10, 50, 250, 500, 1000, 1500, 2000, 5000, 10000), y=error, xlab="Number of trees", ylab="OOB error rate", pch=16)
dev.off()
png(paste(directory, "/model_training/m4_sparsity_vs_ntree.png", sep=""), width=350, height=350)
plot(x=c(1, 10, 50, 250, 500, 1000, 1500, 2000, 5000, 10000), y=sparsity, xlab="Number of trees", ylab="% genes uninformative", pch=16)
dev.off()
png(paste(directory, "/model_training/m4_error_vs_mtry.png", sep=""), width=350, height=350)
plot(x=c(1, round(param/10), round(param/5), round(param/3), round(param/2), param), y=error2, xlab="Number of genes sampled per tree", ylab="OOB error rate", pch=16)
dev.off()
png(paste(directory, "/model_training/m4_sparsity_vs_mtry.png", sep=""), width=350, height=350)
plot(x=c(1, round(param/10), round(param/5), round(param/3), round(param/2), param), y=sparsity2, xlab="Number of genes sampled per tree", ylab="% genes uninformative", pch=16)
dev.off()
model$predicted
names(model$importance[order(model$importance, decreasing=T),][1:10])
save(model, train2, train3, train4, file=paste(directory, "/traindatanew.Rdata", sep=""))
```</code></pre>
<p>This section allows you to build a model through iterative feature selection, using parameters that we feel are sensible. You can substitute in your own parameters chosen from the process above if you prefer.</p>
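<p>The loop described above (fit, drop the features at or below the median importance, refit) can be sketched compactly in Python, with scikit-learn standing in for R's randomForest and synthetic data in place of the bitscore table:</p>

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(120, 40))           # 120 isolates x 40 fake "genes"
y = (X[:, 0] - X[:, 1] > 0).astype(int)  # only genes 0 and 1 are informative
cols = np.arange(X.shape[1])             # track surviving column indices

for step in range(3):
    model = RandomForestClassifier(n_estimators=500, random_state=1).fit(X, y)
    # keep features strictly above the median importance (roughly halves them)
    keep = model.feature_importances_ > np.median(model.feature_importances_)
    X, cols = X[:, keep], cols[keep]
    print(step, X.shape[1])
```

Each iteration roughly halves the feature set while the truly informative genes stay near the top, mirroring the train2/train3/train4 progression in the R chunk.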
<pre><code>```{r, quicker model building}
set.seed(1)
param &lt;- ncol(traindata)-1
model1 &lt;- randomForest(phenotype ~ ., data=traindata, ntree=10000, mtry=param/10, na.action=na.roughfix)
model1
png(paste(directory, "/VI_fullnew.png", sep=""), width=400, height=350)
plot(1:param, model1$importance[order(model1$importance, decreasing=T)], xlim=c(1,1000), ylab="Variable importance", xlab="Top genes")
dev.off()
pdf(paste(directory, "/VI_fullnew.pdf", sep=""), width=5, height=5)
plot(1:param, model1$importance[order(model1$importance, decreasing=T)], xlim=c(1,1000), ylab="Variable importance", xlab="Top genes")
dev.off()
train2 &lt;- traindata[,match(names(model1$importance[model1$importance[,1]&gt;0,]), colnames(traindata))]
train2 &lt;- cbind(train2, phenotype=traindata$phenotype)
param &lt;- ncol(train2)-1
model2 &lt;- randomForest(phenotype ~ ., data=train2, ntree=10000, mtry=param/10, na.action=na.roughfix)
model2
train3 &lt;- train2[,match(names(model2$importance[model2$importance[,1]&gt;quantile(model2$importance[,1], 0.5),]), colnames(train2))]
train3 &lt;- cbind(train3, phenotype=train2$phenotype)
param &lt;- ncol(train3)-1
model3 &lt;- randomForest(phenotype ~ ., data=train3, ntree=10000, mtry=param/10, na.action=na.roughfix)
model3
train4 &lt;- train3[,match(names(model3$importance[model3$importance[,1]&gt;quantile(model3$importance[,1], 0.5),]), colnames(train3))]
train4 &lt;- cbind(train4, phenotype=train3$phenotype)
param &lt;- ncol(train4)-1
model4 &lt;- randomForest(phenotype ~ ., data=train4, ntree=10000, mtry=param/10, na.action=na.roughfix)
model4
train5 &lt;- train4[,match(names(model4$importance[model4$importance[,1]&gt;quantile(model4$importance[,1], 0.5),]), colnames(train4))]
train5 &lt;- cbind(train5, phenotype=train4$phenotype)
param &lt;- ncol(train5)-1
model5 &lt;- randomForest(phenotype ~ ., data=train5, ntree=10000, mtry=param/10, na.action=na.roughfix, proximity=T)
model5
train6 &lt;- train5[,match(names(model5$importance[model5$importance[,1]&gt;quantile(model5$importance[,1], 0.5),]), colnames(train5))]
train6 &lt;- cbind(train6, phenotype=train5$phenotype)
param &lt;- ncol(train6)-1
model6 &lt;- randomForest(phenotype ~ ., data=train6, ntree=10000, mtry=param/10, na.action=na.roughfix, proximity=T)
model6
train7 &lt;- train6[,match(names(model6$importance[model6$importance[,1]&gt;quantile(model6$importance[,1], 0.5),]), colnames(train6))]
train7 &lt;- cbind(train7, phenotype=train6$phenotype)
param &lt;- ncol(train7)-1
model7 &lt;- randomForest(phenotype ~ ., data=train7, ntree=10000, mtry=param/10, na.action=na.roughfix, proximity=T)
model7
model7$predicted
names(model7$importance[order(model7$importance[,1], decreasing=T),])[1:10]
png(paste(directory, "/final_model_VInew.png", sep=""), width=400, height=350)</code></pre>
<p>plot(1:param, model7$importance[order(model7$importance, decreasing=T),], xlab=&#8221;&#8221;, ylab=&#8221;Variable importance&#8221;)<br />
dev.off()<br />
save(model1, model2, model3, model4, model5,model6, model7, traindata, train2, train3, train4, train5, train6, train7, file=paste(directory, &#8220;finalmodelnew.Rdata&#8221;, sep=&#8221;&#8221;))<br />
&#8220;`</p>
<p>The following section shows how the performance of your model has improved as you iterated through cycles of feature selection. It gives you an idea of whether you have performed enough cycles, or whether you need to carry on.</p>
<p>You can first look at the out-of-bag votes, which are votes cast by trees on data they were not trained on. These give you a good idea of how the model would score strains that are distantly related to your training data. The second plot shows votes cast by all of the trees on all of the serovars, which better reflects how your model would score similar strains.</p>
<p>```{r, looking at votes}<br />
votedata &lt;- rbind.data.frame(cbind.data.frame(model=rep("1", 9), Serovar=row.names(model1$votes), Invasive=model1$votes[,2]), cbind.data.frame(model=rep("2", 9), Serovar=row.names(model1$votes), Invasive=model2$votes[,2]), cbind.data.frame(model=rep("3", 9), Serovar=row.names(model1$votes), Invasive=model3$votes[,2]), cbind.data.frame(model=rep("4", 9), Serovar=row.names(model1$votes), Invasive=model4$votes[,2]), cbind.data.frame(model=rep("5", 9), Serovar=row.names(model1$votes), Invasive=model5$votes[,2]), cbind.data.frame(model=rep("6", 9), Serovar=row.names(model1$votes), Invasive=model6$votes[,2]), cbind.data.frame(model=rep("7", 9), Serovar=row.names(model1$votes), Invasive=model7$votes[,2]))<br />
votedata$Phenotype &lt;- phenotype[match(votedata$Serovar, phenotype[,1]),2]<br />
ggplot(votedata, aes(x=model, y=Invasive, col=Phenotype)) + geom_jitter(width=0.1) + theme_classic(8) + ylab("Proportion of votes for selected strain's phenotype") + xlab("Model iteration") + geom_hline(yintercept=0.5, lty=2, col="grey") + theme(legend.key.size = unit(0.3, "cm"))<br />
ggsave("votesnew.png", width=7, height=2.5)<br />
ggsave("votesnew.pdf", width=7, height=2.5)<br />
votedata$Phenotype &lt;- phenotype[match(votedata$Serovar, phenotype[,1]),2]<br />
ggplot(votedata, aes(x=model, y=Invasive, col=Serovar)) + geom_jitter(width=0.1) + theme_classic(8) + ylab("Proportion of votes for selected strain's phenotype") + xlab("Model iteration") + geom_hline(yintercept=0.5, lty=2, col="grey") + theme(legend.key.size = unit(0.3, "cm"))<br />
ggsave("votes.png", width=7, height=2.5)<br />
ggsave("votes.pdf", width=7, height=2.5)<br />
fullvotes &lt;- rbind.data.frame(cbind.data.frame(model=rep("1", 9), Serovar=row.names(model1$votes), Invasive=predict(model1, traindata, type="vote")[,2]), cbind.data.frame(model=rep("2", 9), Serovar=row.names(model1$votes), Invasive=predict(model2, train2, type="vote")[,2]), cbind.data.frame(model=rep("3", 9), Serovar=row.names(model1$votes), Invasive=predict(model3, train3, type="vote")[,2]), cbind.data.frame(model=rep("4", 9), Serovar=row.names(model1$votes), Invasive=predict(model4, train4, type="vote")[,2]), cbind.data.frame(model=rep("5", 9), Serovar=row.names(model1$votes), Invasive=predict(model5, train5, type="vote")[,2]), cbind.data.frame(model=rep("6", 9), Serovar=row.names(model1$votes), Invasive=predict(model6, train6, type="vote")[,2]), cbind.data.frame(model=rep("7", 9), Serovar=row.names(model1$votes), Invasive=predict(model7, train7, type="vote")[,2]))<br />
fullvotes$Phenotype &lt;- phenotype[match(fullvotes$Serovar, phenotype[,1]),2]<br />
ggplot(fullvotes, aes(x=model, y=Invasive, col=Phenotype)) + geom_jitter(width=0.1) + theme_classic(8) + ylab("Proportion of votes for invasive phenotype") + xlab("Model iteration") + geom_hline(yintercept=0.5, lty=2, col="grey") + theme(legend.key.size = unit(0.3, "cm"))<br />
ggsave("full_votesnew.png", width=7, height=2.5)<br />
ggsave("full_votesnew.pdf", width=7, height=2.5)<br />
```</p>
<p>This section shows you how frequently particular genes are identified as top predictors across different iterations of model building, and how much importance they are assigned. Because model building involves a lot of stochasticity, you may find that your top predictors change considerably across iterations.</p>
<p>```{r, assessing stability of predictors}<br />
set.seed(1)<br />
usefulgenes &lt;- data.frame()<br />
topgenes &lt;- data.frame()<br />
for(i in 1:10) {<br />
model &lt;- randomForest(phenotype ~ ., data=traindata, ntree=10000, mtry=param/10, na.action=na.roughfix)<br />
usefulgenes &lt;- rbind(usefulgenes, cbind(model=i, gene=names(model$importance[model$importance&gt;0,]), model$importance[model$importance&gt;0]))<br />
topgenes &lt;- rbind(topgenes, cbind(model=i, gene=names(model$importance[order(model$importance, decreasing=T),][1:20]), model$importance[order(model$importance, decreasing=T),][1:20]))<br />
}<br />
png(paste(directory, "/gene_usefulnessnew.png", sep=""))<br />
hist(table(usefulgenes$gene), col="grey", main="", xlab="Number of times each gene was useful (VI &gt; 0) in a model")<br />
dev.off()<br />
sum(table(usefulgenes$gene)==10)<br />
sum(table(usefulgenes$gene)&lt;10)<br />
topgenes$V3 &lt;- as.numeric(as.character(topgenes$V3))<br />
ggplot(topgenes, aes(x=model, y=gene, fill=as.numeric(V3))) + geom_tile() + ggtitle("Importance values for top genes across model iterations") + scale_fill_continuous("Importance")<br />
table(topgenes$gene)<br />
png(paste(directory, "/topgenesnew.png", sep=""))<br />
hist(table(topgenes$gene), col="grey", main="", xlab="Number of times each gene appeared in the\ntop 20 predictors")<br />
dev.off()<br />
save(usefulgenes, topgenes, file=paste(directory, "/allgenemodels.Rdata", sep=""))<br />
```</p>
<p>This section allows you to look at the COG categories of your top predictor genes, compared to the COG categories in your original training set, to see if any are over-represented.</p>
<p>```{r, COG analysis}<br />
annotations &lt;- read.delim(paste(directory, "/gproNOG.annotations.tsv", sep=""), header=F) # matching annotation file for your chosen model set can be downloaded from http://eggnogdb.embl.de/#/app/downloads<br />
nogs &lt;- read.delim(paste(directory, "/models_used_20ref.tsv", sep=""), header=F)<br />
nogs[,2] &lt;- sub(".*NOG\\.", "", nogs[,2])<br />
nogs[,2] &lt;- sub(".meta_raw", "", nogs[,2])<br />
info &lt;- annotations[match(nogs[,2], annotations[,2]),]<br />
info$V5 &lt;- as.character(info$V5)<br />
# proportion of each COG category that comes up in the top indicators<br />
background &lt;- nogs[match(colnames(traindata), make.names(nogs[,1])),2]<br />
background_COGs &lt;- annotations[match(background, annotations[,2]),5]<br />
library(plyr)<br />
bg_COGs2 &lt;- aaply(as.character(background_COGs), 1, function(i){<br />
if(!is.na(i)) {<br />
if(nchar(i)&gt;1) {<br />
char &lt;- substr(i,1,1)<br />
for(n in 2:nchar(i)) {<br />
char &lt;- paste(char,substr(i,n,n),sep=".")<br />
}<br />
return(char)<br />
} else {<br />
return(i)<br />
}<br />
} else {<br />
return(NA)<br />
}<br />
})</p>
<p>background_COGs2 &lt;- unlist(strsplit(bg_COGs2, "[.]"))<br />
predictors &lt;- nogs[match(colnames(train6), make.names(nogs[,1])),2]<br />
predictor_COGs &lt;- annotations[match(predictors, annotations[,2]),5]<br />
p_COGs2 &lt;- aaply(as.character(predictor_COGs), 1, function(i){<br />
if(!is.na(i)) {<br />
if(nchar(i)&gt;1) {<br />
char &lt;- substr(i,1,1)<br />
for(n in 2:nchar(i)) {<br />
char &lt;- paste(char,substr(i,n,n),sep=".")<br />
}<br />
return(char)<br />
} else {<br />
return(i)<br />
}<br />
} else {<br />
return(NA)<br />
}<br />
})<br />
predictor_COGs2 &lt;- unlist(strsplit(p_COGs2, "[.]"))<br />
barplot(rbind(table(background_COGs2), table(predictor_COGs2)[match(names(table(background_COGs2)), names(table(predictor_COGs2)))]))<br />
```</p>
<p>This section allows you to compare the performance of your model when predicting the true phenotype against a model trained on a randomly permuted phenotype.</p>
<p>```{r, control}<br />
set.seed(2)<br />
control &lt;- traindata<br />
control$phenotype &lt;- sample(traindata$phenotype)<br />
param &lt;- ncol(control)-1<br />
cmodel1 &lt;- randomForest(phenotype ~ ., data=control, ntree=10000, mtry=param/10, na.action=na.roughfix)<br />
cmodel1<br />
png(paste(directory, "/VI_full_control.png", sep=""), width=400, height=350)<br />
plot(1:param, cmodel1$importance[order(cmodel1$importance, decreasing=T)], xlim=c(1,1000), ylab="Variable importance", xlab="Top genes")<br />
dev.off()<br />
pdf(paste(directory, "/VI_full_control.pdf", sep=""), width=5, height=5)<br />
plot(1:param, cmodel1$importance[order(cmodel1$importance, decreasing=T)], xlim=c(1,1000), ylab="Variable importance", xlab="Top genes")<br />
dev.off()<br />
ctrain2 &lt;- control[,match(names(cmodel1$importance[cmodel1$importance[,1]&gt;0,]), colnames(control))]<br />
ctrain2 &lt;- cbind(ctrain2, phenotype=control$phenotype)<br />
param &lt;- ncol(ctrain2)-1<br />
cmodel2 &lt;- randomForest(phenotype ~ ., data=ctrain2, ntree=10000, mtry=param/10, na.action=na.roughfix)<br />
cmodel2<br />
ctrain3 &lt;- ctrain2[,match(names(cmodel2$importance[cmodel2$importance[,1]&gt;quantile(cmodel2$importance[,1], 0.5),]), colnames(ctrain2))]<br />
ctrain3 &lt;- cbind(ctrain3, phenotype=ctrain2$phenotype)<br />
param &lt;- ncol(ctrain3)-1<br />
cmodel3 &lt;- randomForest(phenotype ~ ., data=ctrain3, ntree=10000, mtry=param/10, na.action=na.roughfix)<br />
cmodel3<br />
ctrain4 &lt;- ctrain3[,match(names(cmodel3$importance[cmodel3$importance[,1]&gt;quantile(cmodel3$importance[,1], 0.5),]), colnames(ctrain3))]<br />
ctrain4 &lt;- cbind(ctrain4, phenotype=ctrain3$phenotype)<br />
param &lt;- ncol(ctrain4)-1<br />
cmodel4 &lt;- randomForest(phenotype ~ ., data=ctrain4, ntree=10000, mtry=param/10, na.action=na.roughfix)<br />
cmodel4<br />
ctrain5 &lt;- ctrain4[,match(names(cmodel4$importance[cmodel4$importance[,1]&gt;quantile(cmodel4$importance[,1], 0.5),]), colnames(ctrain4))]<br />
ctrain5 &lt;- cbind(ctrain5, phenotype=ctrain4$phenotype)<br />
param &lt;- ncol(ctrain5)-1<br />
cmodel5 &lt;- randomForest(phenotype ~ ., data=ctrain5, ntree=10000, mtry=param/10, na.action=na.roughfix, proximity=T)<br />
cmodel5<br />
ctrain6 &lt;- ctrain5[,match(names(cmodel5$importance[cmodel5$importance[,1]&gt;quantile(cmodel5$importance[,1], 0.5),]), colnames(ctrain5))]<br />
ctrain6 &lt;- cbind(ctrain6, phenotype=ctrain5$phenotype)<br />
param &lt;- ncol(ctrain6)-1<br />
cmodel6 &lt;- randomForest(phenotype ~ ., data=ctrain6, ntree=10000, mtry=param/10, na.action=na.roughfix, proximity=T)<br />
cmodel6</p>
<p>cmodel6$predicted<br />
names(cmodel6$importance[order(cmodel6$importance[,1], decreasing=T),])[1:10]<br />
# compare votes and phenotypes for the control and real datasets<br />
cbind(cmodel6$votes, control$phenotype, model6$votes, train6$phenotype)</p>
<p>```</p>
<p>This approach repeats the full model-building process 5 times to see how similar the top predictor genes are.</p>
<p>```{r, testing the robustness of the top result}<br />
topgenes &lt;- vector()<br />
# picking out the top predictors from the model<br />
for(run in 1:5) {<br />
set.seed(run)<br />
error &lt;- vector()<br />
sparsity &lt;- vector()<br />
for(i in c(1, 10, 50, 250, 500, 1000, 1500, 2000, 5000, 10000)) {<br />
model &lt;- randomForest(phenotype ~ ., data=traindata, ntree=i, na.action=na.roughfix)<br />
error &lt;- c(error, median(model$err.rate))<br />
sparsity &lt;- c(sparsity, (sum(model$importance[,1]&lt;=0))/(ncol(traindata)-1))<br />
}<br />
error2 &lt;- vector()<br />
sparsity2 &lt;- vector()<br />
param &lt;- ncol(traindata)-1<br />
for(i in c(1, param/10, param/5, param/3, param/2, param)) {<br />
model &lt;- randomForest(phenotype ~ ., data=traindata, ntree=1000, mtry=i, na.action=na.roughfix)<br />
error2 &lt;- c(error2, median(model$err.rate))<br />
sparsity2 &lt;- c(sparsity2, (sum(model$importance[,1]&lt;=0))/(ncol(traindata)-1))<br />
}<br />
train2 &lt;- traindata[,match(names(model$importance[model$importance[,1]&gt;0,]), colnames(traindata))]<br />
train2 &lt;- cbind(train2, phenotype=traindata$phenotype)<br />
error &lt;- vector()<br />
sparsity &lt;- vector()<br />
for(i in c(1, 10, 50, 250, 500, 1000, 1500, 2000, 5000, 10000)) {<br />
model &lt;- randomForest(phenotype ~ ., data=train2, ntree=i, na.action=na.roughfix)<br />
error &lt;- c(error, median(model$err.rate))<br />
sparsity &lt;- c(sparsity, (sum(model$importance[,1]&lt;=0))/(ncol(train2)-1))<br />
}<br />
error2 &lt;- vector()<br />
sparsity2 &lt;- vector()<br />
param &lt;- ncol(train2)-1<br />
for(i in c(1, param/10, param/5, param/3, param/2, param)) {<br />
model &lt;- randomForest(phenotype ~ ., data=train2, ntree=20000, mtry=i, na.action=na.roughfix)<br />
error2 &lt;- c(error2, median(model$err.rate))<br />
sparsity2 &lt;- c(sparsity2, (sum(model$importance[,1]&lt;=0))/(ncol(train2)-1))<br />
}<br />
train3 &lt;- train2[,match(names(model$importance[model$importance[,1]&gt;quantile(model$importance[,1], 0.5),]), colnames(train2))]<br />
train3 &lt;- cbind(train3, phenotype=train2$phenotype)<br />
error &lt;- vector()<br />
sparsity &lt;- vector()<br />
for(i in c(1, 10, 50, 250, 500, 1000, 1500, 2000, 5000, 10000)) {<br />
model &lt;- randomForest(phenotype ~ ., data=train3, ntree=i, na.action=na.roughfix)<br />
error &lt;- c(error, median(model$err.rate))<br />
sparsity &lt;- c(sparsity, (sum(model$importance[,1]&lt;=0))/(ncol(train3)-1))<br />
}<br />
error2 &lt;- vector()<br />
sparsity2 &lt;- vector()<br />
param &lt;- ncol(train3)-1<br />
for(i in c(1, param/10, param/5, param/3, param/2, param)) {<br />
model &lt;- randomForest(phenotype ~ ., data=train3, ntree=2000, mtry=i, na.action=na.roughfix)<br />
error2 &lt;- c(error2, median(model$err.rate))<br />
sparsity2 &lt;- c(sparsity2, (sum(model$importance[,1]&lt;=0))/(ncol(train3)-1))<br />
}<br />
train4 &lt;- train3[,match(names(model$importance[model$importance[,1]&gt;quantile(model$importance[,1], 0.5),]), colnames(train3))]<br />
train4 &lt;- cbind(train4, phenotype=train3$phenotype)<br />
topgenes &lt;- c(topgenes, colnames(train4))<br />
}<br />
table(topgenes)</p>
<p>```</p>
]]></content:encoded>
			<wfw:commentRss>https://www.biofacebook.com/?feed=rss2&#038;p=1499</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
<title>Microbial Diversity Research: Differential Analysis</title>
		<link>https://www.biofacebook.com/?p=1483</link>
		<comments>https://www.biofacebook.com/?p=1483#comments</comments>
		<pubDate>Fri, 27 Mar 2020 07:29:11 +0000</pubDate>
		<dc:creator><![CDATA[szypanther]]></dc:creator>
				<category><![CDATA[生物信息]]></category>
		<category><![CDATA[统计学习]]></category>

		<guid isPermaLink="false">http://www.biofacebook.com/?p=1483</guid>
<description><![CDATA[ <p>1. Random forest models</p> <p>Random forest is an efficient machine-learning algorithm based on decision trees. It can be used to classify samples (classification) and for regression analysis (regression).</p> <p>As a non-linear classifier, it can capture complex non-linear dependencies among variables. Random forest analysis can identify the key OTUs that distinguish two groups of samples.</p> <p>Feature importance scores table, taken from the random forest output</p> <p>It records each OTU's contribution to the between-group differences.</p> <p>Note: as a rule of thumb, OTUs with a Mean_decrease_in_accuracy value above 0.05 are selected for further analysis; for samples with smaller between-group differences this threshold may drop to 0.03.</p> <p>2. Cross-validation</p> <p>Cross-validation is a practical statistical method of splitting the data into smaller subsets. The analysis is first carried out on one subset, while the remaining subsets are used to confirm and validate it. The initial subset is called the training set; the others are called validation or test sets.</p> <p>The most common form is k-fold cross-validation: the data are divided into k subsets, each of which serves once as the test set while the rest form the training set. The procedure is repeated k times and the average accuracy over the k runs is reported.</p> <p>Every sample is used for both training and testing, and each sample is validated exactly once.</p> <p>The combinations of key OTUs selected by random forest are searched exhaustively, aiming to build an efficient classifier with the lowest error rate from the fewest OTUs.</p> <p>Typically, the key OTUs are put through 10-fold cross-validation over different combinations to find the smallest OTU set that separates the groups most accurately, which is then carried into further analyses such as ROC analysis.</p> <p>Note: the x-axis shows OTU combinations of different sizes and the y-axis the classification error rate for each; the combination with the fewest OTUs and the lowest error rate is taken as the minimal OTU set that distinguishes the groups.</p> <p>3. ROC curves</p> <p>The receiver operating characteristic (ROC) curve is another effective supervised-learning method. ROC analysis is a binary-classification technique for problems with exactly two classes, and can be used to choose the best discriminative model and the best diagnostic cut-off value.</p> <p>Guided by domain knowledge, the measurements from the disease and reference groups are analysed to set the upper and lower limits, the bin width and the cut-off points; a cumulative frequency table is built at the chosen intervals, and the sensitivity, specificity and false positive rate (1-specificity) are computed at every cut-off. Plotting sensitivity (the true positive rate) against 1-specificity (the false positive rate) gives the ROC curve; the closer it comes to the top-left corner, the more accurate the diagnosis. Tests can also be compared by the area under their ROC curves (AUC): the test with the largest AUC has the greatest diagnostic value.</p> <p></p> <p>Note: the x-axis is the false positive rate (FPR) and the y-axis the true positive rate (TPR). The point on the ROC curve closest to the top-left corner is the best threshold, minimising the combined false positives and false negatives. The AUC lies between 0.5 and 1.0; when AUC&#62;0.5, the closer to 1 the better: 0.5-0.7 is low accuracy, 0.7-0.9 moderate, above 0.9 high. AUC=0.5 means the method has no diagnostic value; AUC&#60;0.5 contradicts reality and is rarely seen in practice.</p> <p>4. Wilcoxon rank-sum test</p> <p>The Wilcoxon rank-sum test, also known as the Mann-Whitney U test, is a non-parametric test for two independent samples. Its null hypothesis is that the two populations have the same distribution; by comparing the mean ranks of the two samples it judges whether the distributions differ. It can be used to test taxa for significant differences between two groups, with false discovery rate (FDR) q-values computed from the p-values.</p> <p></p> <p>Note: mean is each group's average relative abundance and sd its standard deviation; the p-value is the probability of the data under the null hypothesis, with p&#60;0.05 indicating a difference and p&#60;0.01 a highly significant difference; the q-value is the false discovery rate.</p> <p>5. Heatmap analysis of differential taxa</p> <p>Ten-fold cross-validation is used to estimate the generalization error, with other parameters at their defaults. The output includes the expected "baseline" error, i.e. the probability of misclassifying every sample if all were assigned to the most abundant class. Each OTU's importance is scored by how much the prediction error increases when it is removed.</p> <p>The selected differential OTUs are clustered by their abundances across samples and drawn as a heatmap, showing which taxa are enriched or depleted in which samples.</p> <p></p> <p>Note: blue indicates lower abundance and orange-red higher abundance; the left tree clusters taxa by Spearman correlation distance and the top tree clusters samples by Bray-Curtis distance.</p> <p>6. Welch's t-test for two groups</p> <p>Welch's t-test compares two groups with unequal variances, identifying the taxa [or gene abundances, for (meta)genomes] that differ significantly between them.</p> <p> <p>Note: the figure shows abundance proportions in the two groups, the difference in proportions with its 95% confidence interval, and the p-values on the right, with p&#60;0.05 indicating significance.</p> <p>7. Box plots comparing Shannon diversity indices</p> <p>The Shannon diversity indices of sample groups from different classes or environments are summarised by quartiles to compare between-group differences, and the non-parametric Mann-Whitney test assesses the significance of those differences</p> [...]]]></description>
				<content:encoded><![CDATA[<div>
<div>
<p><b>1. Random forest models</b></p>
<p>Random forest is an efficient machine-learning algorithm based on decision trees. It can be used both to classify samples (classification) and for regression analysis (regression).</p>
<p>As a non-linear classifier, it can capture complex non-linear dependencies among variables. Random forest analysis can identify the key OTUs that distinguish two groups of samples.</p>
<p><i>Feature importance scores table, taken from the random forest output</i></p>
</div>
<p>It records each OTU&#8217;s contribution to the between-group differences.</p></div>
<div><a href="http://www.biofacebook.com/wp-content/uploads/2020/03/18585978-a855cbdb5a069bb1.png"><img class="aligncenter size-large wp-image-1484" src="http://www.biofacebook.com/wp-content/uploads/2020/03/18585978-a855cbdb5a069bb1-1024x337.png" alt="18585978-a855cbdb5a069bb1" width="640" height="211" /></a></p>
<div>
<div>
<p>Note: as a rule of thumb, OTUs with a Mean_decrease_in_accuracy value above 0.05 are selected for further analysis; for samples with smaller between-group differences this threshold may drop to 0.03.</p>
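<p>For readers working outside R, the importance-ranking idea can be sketched with scikit-learn. This is a minimal, hypothetical example on made-up OTU counts; note that scikit-learn&#8217;s <code>feature_importances_</code> reports mean decrease in impurity rather than the mean decrease in accuracy shown in the table above.</p>

```python
# Hypothetical sketch: rank OTUs by random-forest importance.
# The OTU table and group labels are invented for illustration.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.poisson(5, size=(40, 20)).astype(float)  # 40 samples x 20 OTUs
y = np.array([0] * 20 + [1] * 20)                # two groups of samples
X[y == 1, 0] += 10                               # OTU 0 separates the groups

model = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
importance = model.feature_importances_          # per-OTU contribution
top_otus = np.argsort(importance)[::-1]          # most informative first
```

<p>On this toy table OTU 0 comes out as the top predictor, playing the role of the high-scoring OTUs in the feature importance table.</p>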
<p><b>2. Cross-validation</b></p>
<p>Cross-validation is a practical statistical method of splitting the data into smaller subsets. The analysis is first carried out on one subset, while the remaining subsets are used to confirm and validate it. The initial subset is called the training set; the others are called validation or test sets.</p>
<p>The most common form is k-fold cross-validation: the data are divided into k subsets, each of which serves once as the test set while the rest form the training set. The procedure is repeated k times, once per subset, and the average classification accuracy across the k runs is reported as the result.</p>
<p>Every sample is thus used for both training and testing, and each sample is validated exactly once.</p>
<p>The combinations of key OTUs selected by random forest are searched exhaustively, with the aim of building an efficient classifier with the lowest error rate from the smallest possible number of OTUs.</p>
<p>Typically, the key OTUs selected by random forest are put through 10-fold cross-validation over different combinations to find the smallest OTU set that separates the groups most accurately; that set is then carried into further analyses such as ROC analysis.</p>
</div>
</div>
<div><a href="http://www.biofacebook.com/wp-content/uploads/2020/03/18585978-ffceaf913a867414.png"><br />
<img class="aligncenter size-full wp-image-1485" src="http://www.biofacebook.com/wp-content/uploads/2020/03/18585978-ffceaf913a867414.png" alt="18585978-ffceaf913a867414" width="773" height="723" /></a></div>
<div>
<div>
<p>Note: the x-axis shows OTU combinations of different sizes and the y-axis the classification error rate for each. The combination with the fewest OTUs and the lowest error rate is taken as the minimal OTU set that distinguishes the groups.</p>
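<p>The k-fold idea can be sketched as follows (toy data and a hypothetical candidate subset, not values from this post): every sample is tested exactly once, and the averaged accuracy scores each OTU combination.</p>

```python
# Hypothetical sketch: score a small OTU subset against the full table
# by 10-fold cross-validated accuracy. Toy data for illustration only.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.poisson(5, size=(40, 30)).astype(float)  # 40 samples x 30 OTUs
y = np.array([0] * 20 + [1] * 20)
X[y == 1, :3] += 8                               # only the first 3 OTUs carry signal

clf = RandomForestClassifier(n_estimators=200, random_state=1)
acc_subset = cross_val_score(clf, X[:, :3], y, cv=10).mean()  # 3 key OTUs
acc_full = cross_val_score(clf, X, y, cv=10).mean()           # all 30 OTUs
```

<p>If a 3-OTU combination scores as well as the full table, it is a candidate for the minimal discriminating OTU set described above.</p>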
<p><b>3. ROC curves</b></p>
<p>The receiver operating characteristic (ROC) curve is another effective supervised-learning method. ROC analysis is a binary-classification technique for problems with exactly two classes, and can be used to choose the best discriminative model and the best diagnostic cut-off value.</p>
<p>Guided by domain knowledge, the measurements from the disease and reference groups are analysed to set the upper and lower limits, the bin width, and the cut-off points. A cumulative frequency table is built at the chosen intervals, and the sensitivity, specificity, and false positive rate (1 - specificity) are computed at every cut-off. Plotting sensitivity (the true positive rate) on the y-axis against 1 - specificity (the false positive rate) on the x-axis gives the ROC curve. The closer the curve comes to the top-left corner, the more accurate the diagnosis. Tests can also be compared by the area under their ROC curves (AUC): the test with the largest AUC has the greatest diagnostic value.</p>
</div>
<p><a href="http://www.biofacebook.com/wp-content/uploads/2020/03/18585978-07a26b63fea21bf8.png"><img class="aligncenter size-full wp-image-1486" src="http://www.biofacebook.com/wp-content/uploads/2020/03/18585978-07a26b63fea21bf8.png" alt="18585978-07a26b63fea21bf8" width="685" height="688" /></a></p>
<div>
<div>
<p>Note: the x-axis is the false positive rate (FPR, 1 - specificity) and the y-axis the true positive rate (TPR, sensitivity). The point on the ROC curve closest to the top-left corner is the best threshold, minimising the combined number of false positives and false negatives. The area under the curve lies between 0.5 and 1.0. When AUC &gt; 0.5, the closer it is to 1 the better the diagnostic performance: 0.5-0.7 indicates low accuracy, 0.7-0.9 moderate accuracy, and above 0.9 high accuracy. AUC = 0.5 means the method has no diagnostic value at all, and AUC &lt; 0.5 contradicts reality and is rarely seen in practice.</p>
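<p>A minimal sketch of the ROC computation (the diagnostic scores below are invented; Youden&#8217;s J statistic is used as one common reading of the &#8220;closest to the top-left corner&#8221; rule):</p>

```python
# Hypothetical example: ROC curve, AUC, and a best cut-off for
# invented scores (1 = disease group, 0 = reference group).
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
scores = np.array([0.1, 0.3, 0.35, 0.8, 0.4, 0.6, 0.7, 0.9])

fpr, tpr, thresholds = roc_curve(y_true, scores)
auc = roc_auc_score(y_true, scores)             # area under the ROC curve
best_cutoff = thresholds[np.argmax(tpr - fpr)]  # maximises sensitivity + specificity - 1
```
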
<p><b>4. Wilcoxon rank-sum test</b></p>
<p>The Wilcoxon rank-sum test, also known as the Mann-Whitney U test, is a non-parametric test for two independent samples. Its null hypothesis is that the two samples come from populations with the same distribution; by comparing the mean ranks of the two samples it judges whether the two distributions differ. It can be used to test taxa for significant differences between two groups of samples, and false discovery rate (FDR) q-values are computed from the p-values.</p>
</div>
<p><a href="http://www.biofacebook.com/wp-content/uploads/2020/03/18585978-6714c4d277105abd.png"><img class="aligncenter size-large wp-image-1487" src="http://www.biofacebook.com/wp-content/uploads/2020/03/18585978-6714c4d277105abd-1024x218.png" alt="18585978-6714c4d277105abd" width="640" height="136" /></a></p>
<div>
<div>
<p>Note: "mean" is the average relative abundance of a taxon in each group and "sd" the corresponding standard deviation. The p-value is the probability of observing the data if the null hypothesis were true; p &lt; 0.05 indicates a difference and p &lt; 0.01 a highly significant difference. The q-value is the false discovery rate.</p>
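<p>As a hedged illustration (invented relative abundances for a single taxon), the test is available in Python as scipy.stats.mannwhitneyu:</p>

```python
# Hypothetical Mann-Whitney U test on one taxon's relative abundance
# in two groups of samples; the numbers are invented.
from scipy.stats import mannwhitneyu

group_a = [0.12, 0.15, 0.11, 0.18, 0.14, 0.16]
group_b = [0.31, 0.28, 0.35, 0.25, 0.30, 0.33]
stat, p = mannwhitneyu(group_a, group_b, alternative="two-sided")
# With many taxa, the resulting p-values would then be adjusted to
# FDR q-values (e.g. by the Benjamini-Hochberg procedure).
```
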
<p><b>5. Heatmap analysis of differential taxa</b></p>
<p>Ten-fold cross-validation is used to estimate the generalization error, with all other parameters left at their defaults. The model output also includes the expected "baseline" error, i.e. the probability of misclassifying every sample if all were assigned to the most abundant class. Each OTU&#8217;s importance is scored by how much the model&#8217;s prediction error increases when that OTU is removed; the higher the importance, the more the OTU contributes to prediction accuracy.</p>
<p>The selected differential OTUs are then clustered by their abundances across samples and drawn as a heatmap, making it easy to see which taxa are enriched or depleted in which samples.</p>
</div>
<p><a href="http://www.biofacebook.com/wp-content/uploads/2020/03/18585978-391e366b1420fa92.png"><img class="aligncenter size-full wp-image-1488" src="http://www.biofacebook.com/wp-content/uploads/2020/03/18585978-391e366b1420fa92.png" alt="18585978-391e366b1420fa92" width="660" height="650" /></a></p>
<div>
<div>
<p>Note: colours toward blue indicate lower abundance and colours toward orange-red higher abundance. The tree on the left clusters taxa by Spearman correlation distance; the tree on top clusters samples with Bray-Curtis distance, the most widely used between-sample distance metric.</p>
<p><b>6. Welch&#8217;s t-test for two groups</b></p>
<p>Welch&#8217;s t-test compares two groups of samples with unequal variances; it identifies the taxa [or, for (meta)genomes, the gene abundances] that differ significantly between the two groups.</p>
</div>
<p><a href="http://www.biofacebook.com/wp-content/uploads/2020/03/18585978-8453377e0336274c.png"><img class="aligncenter size-large wp-image-1489" src="http://www.biofacebook.com/wp-content/uploads/2020/03/18585978-8453377e0336274c-1024x531.png" alt="18585978-8453377e0336274c" width="640" height="332" /></a></div>
<div>
<div>
<div>
<p>Note: the figure shows the abundance proportions of genes (or taxa) in the two sample groups; the middle panel shows the difference in proportions with its 95% confidence interval; the values on the right are p-values, with p &lt; 0.05 indicating a significant difference.</p>
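<p>A minimal sketch (invented abundances): in Python, Welch&#8217;s variant is scipy&#8217;s ttest_ind with equal_var=False:</p>

```python
# Hypothetical Welch's t-test on one feature's abundance in two groups
# with clearly unequal variances; the numbers are invented.
from scipy.stats import ttest_ind

group_a = [0.10, 0.12, 0.11, 0.13, 0.12, 0.11]            # low variance
group_b = [0.30, 0.45, 0.25, 0.50, 0.35, 0.40]            # higher variance
t_stat, p = ttest_ind(group_a, group_b, equal_var=False)  # Welch's t-test
```
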
<p><b>7. Box plots comparing Shannon diversity indices</b></p>
<p>The Shannon diversity indices of sample groups from different classes or environments are summarised by quartiles to compare between-group differences in the Shannon index, and the non-parametric Mann-Whitney test is applied to assess the significance of those differences.</p>
</div>
<p><a href="http://www.biofacebook.com/wp-content/uploads/2020/03/18585978-5c097b33070b52e3.png"><img class="aligncenter size-full wp-image-1490" src="http://www.biofacebook.com/wp-content/uploads/2020/03/18585978-5c097b33070b52e3.png" alt="18585978-5c097b33070b52e3" width="338" height="350" /></a></div>
<div>
<div>
<p>Note: the x-axis shows the sample groups and the y-axis the corresponding alpha diversity index. The box plot displays five statistics (minimum, first quartile, median, third quartile, and maximum: the five lines from bottom to top). p &lt; 0.05 indicates a significant difference; p &lt; 0.01 a highly significant one.</p>
<p><b>8. Distance-based box plots</b></p>
<p>The pairwise distances among sample groups from different classes or environments are summarised by quartiles to compare the distributions of within-group and between-group distances, and multiple Student&#8217;s two-sample t-tests are used to assess the significance of between-group differences.</p>
<p>Box plots serve to identify outliers, give a rough view of the data&#8217;s characteristics, and compare the shapes of several datasets: drawn side by side on a common axis, their medians, tail lengths, outliers, and ranges can be taken in at a glance.</p>
<p>A box plot (also called a box-and-whisker plot) describes data using five statistics: the minimum, the first quartile, the median, the third quartile, and the maximum. It also gives a rough impression of the symmetry and spread of the data, and is particularly useful for comparing several groups of samples.</p>
</div>
<p><a href="http://www.biofacebook.com/wp-content/uploads/2020/03/18585978-1605130e5d641edc.png"><img class="aligncenter size-full wp-image-1491" src="http://www.biofacebook.com/wp-content/uploads/2020/03/18585978-1605130e5d641edc.png" alt="18585978-1605130e5d641edc" width="740" height="682" /></a></div>
<div>
<div>
<div>
<p>Note: the first quartile (Q1), or lower quartile, is the value below which 25% of the sorted data fall. The second quartile (Q2), the median, is the 50% value. The third quartile (Q3), or upper quartile, is the 75% value.</p>
<p><b>9. LEfSe analysis</b></p>
<p>LEfSe is a tool for discovering high-dimensional biomarkers and revealing genomic features, including genes, metabolic pathways, and taxa, that distinguish two or more biological conditions (or classes). The algorithm emphasises both statistical significance and biological relevance, letting researchers identify differentially abundant features together with the classes they are associated with.</p>
<p>LEfSe derives its discriminatory power from statistical tests of biological differences, followed by additional tests that assess whether those differences are consistent with expected biological behaviour.</p>
<p>Concretely, it first applies the non-parametric factorial Kruskal-Wallis (KW) sum-rank test to detect features with significantly different abundances and to find the significantly different taxa. Finally, LEfSe uses linear discriminant analysis (LDA) to estimate the effect size of each feature&#8217;s (taxon&#8217;s) abundance on the observed difference.</p>
</div>
<p><a href="http://www.biofacebook.com/wp-content/uploads/2020/03/18585978-9453e2fea01fb73b.png"><img class="aligncenter size-large wp-image-1492" src="http://www.biofacebook.com/wp-content/uploads/2020/03/18585978-9453e2fea01fb73b-1024x377.png" alt="18585978-9453e2fea01fb73b" width="640" height="236" /></a></div>
<div>
<div>
<div>
<p>Explanation: the left panel shows the LDA scores, obtained by linear discriminant analysis, of the taxa with a significant effect in the two groups. The right panel is a cladogram in which node size reflects abundance, arranged by default from phylum (inner) to genus (outer). The red and green areas mark the two groups: red nodes are taxa that play an important role in the red group, green nodes those important in the green group, and yellow nodes are taxa important in neither group. The letters in the figure denote taxon names, which are spelled out in the legend on the right.</p>
<p><b>10. ANOSIM analysis of similarities</b></p>
<p>Analysis of similarities (ANOSIM) is a non-parametric test of whether the differences between groups (two or more) are significantly larger than the differences within groups, and hence whether the grouping is meaningful. Pairwise distances between samples are first computed with the Bray-Curtis metric, all distances are then ranked from smallest to largest, and the R statistic is computed with the formula below. The samples are then permuted and R* recomputed; the probability that R* exceeds R is the p-value.</p>
</div>
<p><a href="http://www.biofacebook.com/wp-content/uploads/2020/03/18585978-e0960fa6644a18ff.png"><img class="aligncenter size-full wp-image-1493" src="http://www.biofacebook.com/wp-content/uploads/2020/03/18585978-e0960fa6644a18ff.png" alt="18585978-e0960fa6644a18ff" width="454" height="135" /></a></p>
<div>
<div>
<p>where</p>
<p>r&#772;_b: the mean rank of the between-group distances;</p>
<p>r&#772;_w: the mean rank of the within-group distances;</p>
<p>n: the total number of samples.</p>
<p><b>Table. ANOSIM analysis</b></p>
</div>
<p><a href="http://www.biofacebook.com/wp-content/uploads/2020/03/18585978-d456795bc0cddde8.png"><img class="aligncenter size-large wp-image-1494" src="http://www.biofacebook.com/wp-content/uploads/2020/03/18585978-d456795bc0cddde8-1024x344.png" alt="18585978-d456795bc0cddde8" width="640" height="215" /></a></div>
<div>
<div>
<div>
<p>Note: in theory R ranges from -1 to +1, though in practice it usually falls between 0 and 1. R close to 1 means the between-group differences are much larger than the within-group differences; R close to 0 means there is no clear difference between groups. The p-value reflects the statistical significance of the result: the smaller the p-value, the more significant the difference between the sample groups, with P &lt; 0.05 considered statistically significant. "Number of permutations" is the number of permutations performed.</p>
<p><b>11. Adonis multivariate analysis of variance</b></p>
<p>Adonis, also known as permutational multivariate analysis of variance (PERMANOVA) or non-parametric MANOVA, partitions the total variance using a semi-metric (e.g. Bray-Curtis) or metric (e.g. Euclidean) distance matrix, quantifies how much of the variation among samples each grouping factor explains, and assesses the statistical significance of the partition with a permutation test.</p>
<p><b>Table. Permutational MANOVA analysis</b></p>
</div>
<p><a href="http://www.biofacebook.com/wp-content/uploads/2020/03/18585978-7fa68baa52da9d13.png"><img class="aligncenter size-large wp-image-1495" src="http://www.biofacebook.com/wp-content/uploads/2020/03/18585978-7fa68baa52da9d13-1024x132.png" alt="18585978-7fa68baa52da9d13" width="640" height="83" /></a></p>
<div>
<div>
<p>Note:</p>
<p>Group: the grouping factor;</p>
<p>Df: degrees of freedom;</p>
<p>SumsOfSqs: the sum of squares (total variance);</p>
<p>MeanSqs: the mean square, i.e. SumsOfSqs/Df;</p>
<p>F.Model: the F statistic;</p>
<p>R2: the proportion of the variation among samples explained by the grouping, i.e. the group variance divided by the total variance; the larger R2 is, the more of the difference the grouping explains;</p>
<p>Pr(&gt;F): the p-value obtained from the permutation test; the smaller it is, the more significant the between-group difference.</p>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
<p>Author: JarySun<br />
Link: https://www.jianshu.com/p/87f24cceaa43<br />
Source: Jianshu (简书)<br />
Copyright belongs to the author. For commercial reuse, please contact the author for authorisation; for non-commercial reuse, please credit the source.</p></div>
]]></content:encoded>
			<wfw:commentRss>https://www.biofacebook.com/?feed=rss2&#038;p=1483</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>v2ray</title>
		<link>https://www.biofacebook.com/?p=1446</link>
		<comments>https://www.biofacebook.com/?p=1446#comments</comments>
		<pubDate>Fri, 21 Feb 2020 09:42:29 +0000</pubDate>
		<dc:creator><![CDATA[szypanther]]></dc:creator>
				<category><![CDATA[docker]]></category>
		<category><![CDATA[Linux相关]]></category>
		<category><![CDATA[生物信息]]></category>

		<guid isPermaLink="false">http://www.biofacebook.com/?p=1446</guid>
<description><![CDATA[<p>https://github.com/Jrohy/multi-v2ray</p> <p>&#160;</p> Running with Docker <p>By default this creates an mKCP configuration with a randomly chosen obfuscation header:</p> docker run -d --name v2ray --privileged --restart always --network host jrohy/v2ray <p>To use a custom v2ray config file:</p> docker run -d --name v2ray --privileged -v /path/config.json:/etc/v2ray/config.json --restart always --network host jrohy/v2ray <p>Show the v2ray configuration:</p> docker exec v2ray bash -c "v2ray info" <p>warning: on CentOS, stop and disable the firewall first</p> systemctl stop firewalld.service systemctl disable firewalld.service ]]></description>
				<content:encoded><![CDATA[<p>https://github.com/Jrohy/multi-v2ray</p>
<p>&nbsp;</p>
<h2>Running with Docker</h2>
<p>By default this creates an mKCP configuration with a randomly chosen obfuscation header:</p>
<pre><code>docker run -d --name v2ray --privileged --restart always --network host jrohy/v2ray
</code></pre>
<p>To use a custom v2ray config file:</p>
<pre><code>docker run -d --name v2ray --privileged -v /path/config.json:/etc/v2ray/config.json --restart always --network host jrohy/v2ray
</code></pre>
<p>Show the v2ray configuration:</p>
<pre><code>docker exec v2ray bash -c "v2ray info"
</code></pre>
<p><strong>Warning</strong>: on CentOS, stop and disable the firewall first:</p>
<pre><code>systemctl stop firewalld.service
systemctl disable firewalld.service
</code></pre>
]]></content:encoded>
			<wfw:commentRss>https://www.biofacebook.com/?feed=rss2&#038;p=1446</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
<title>Alpha Diversity</title>
		<link>https://www.biofacebook.com/?p=1436</link>
		<comments>https://www.biofacebook.com/?p=1436#comments</comments>
		<pubDate>Tue, 18 Feb 2020 03:24:22 +0000</pubDate>
		<dc:creator><![CDATA[szypanther]]></dc:creator>
				<category><![CDATA[生物信息]]></category>
		<category><![CDATA[统计学习]]></category>

		<guid isPermaLink="false">http://www.biofacebook.com/?p=1436</guid>
<description><![CDATA[ Amplicon data analysis, diversity indices: alpha diversity <p>Diversity indices and their formulas: see wikipedia</p> <p>Alpha diversity measures the species diversity within a single sample. It combines two components: richness (how many kinds of taxa the sample contains) and evenness (how evenly the community is distributed among them). Indices such as Richness, Chao1, Shannon, Simpson, Dominance and Equitability are commonly used to evaluate a sample's species diversity.</p> <p>Richness indices</p> <p>Richness, Chao1 and Shannon are the three commonly used richness metrics; the higher the value, the richer the sample.</p> Richness index: the number of OTUs detected in the sample; Chao1 index: extrapolates the total number of OTUs from the low-abundance OTUs; Shannon index: accounts for the OTUs and their relative abundances, using a logarithm (base 2 for shannon_2, base e for shannon_e, base 10 for shannon_10) to estimate the sample's taxonomic diversity. <p>Evenness indices</p> <p>Simpson, Dominance and Equitability are the three commonly used evenness metrics.</p> Simpson index: the probability that two randomly drawn sequences belong to the same class (e.g. the same OTU), so it lies between 0 and 1; the closer to 1, the less even the OTU abundance distribution; Dominance index: 1-Simpson, the probability that two randomly drawn sequences belong to different classes (e.g. OTUs); Equitability index: computed from the Shannon index; 1 means perfectly even abundances, and smaller values mean a more skewed distribution. <p>Summary table:</p> Index Units Formula richness OTUs the number of OTUs with at least one sequence in the sample chao1 OTUs N + S^2 / (2D), where N is the number of OTUs, S the number of OTUs with abundance 1 and D the number with abundance 2; shannon_2 bits the sum of -p*log(p,2) over all OTU frequencies p; shannon_e nats [...]]]></description>
				<content:encoded><![CDATA[<div class="post-headline">
<h1>Diversity indices in amplicon data analysis: alpha diversity</h1>
</div>
<div class="post-bodycopy clearfix">
<p>Diversity indices and their formulas are described on <a href="https://en.wikipedia.org/wiki/Diversity_index#Shannon_index">Wikipedia</a>.</p>
<p>Alpha diversity measures the species diversity within a single sample. It has two components: richness (how many kinds of taxa the sample contains) and evenness (how evenly individuals are distributed across those taxa). It is commonly assessed with indices such as Richness, Chao1, Shannon, Simpson, Dominance and Equitability.</p>
<p><strong>Richness indices</strong></p>
<p>Richness, Chao1 and Shannon are commonly used richness indices; the higher the value, the richer the sample.</p>
<pre><code>Richness指数: 指样本中被检测到的OTU量；
Chao1指数   : 通过低丰度OTUs来进一步预测样品中的OTUs数量；
Shannon指数 : 计算考虑到样品中的OTUs及其相对丰度信息，
             通过对数（如以2为底的shannon_2，以自然对数为底的shannon_e
             以10为底的shannon_10）转换来预测样品中的分类多样性。
</code></pre>
<p><strong>Evenness indices</strong></p>
<p>Simpson, Dominance and Equitability are commonly used evenness indices.</p>
<pre><code>Simpson指数     : 表示随机选取两条序列属于同一个分类（如OTUs）的概率（故数值在0~1之间），
                  数值越接近1表示表明OTUs的丰度分部越不均匀；
Dominancez指数  : 取值为1-Simpson，表示随机选取两条序列属于不同分类（如OTUs）的概率；
Equitability指数: 根据Shannon指数值计算，当其值为1时表明样品中的物种丰度分布绝对均匀，
                  而其值越小这表明物种丰度分布呈现出越高的偏向。
</code></pre>
<p><strong>Summary table:</strong></p>
<table>
<thead>
<tr class="alt">
<th>Index</th>
<th>Units</th>
<th>Definition</th>
</tr>
</thead>
<tbody>
<tr>
<td>richness</td>
<td>OTUs</td>
<td>the number of OTUs with at least one sequence in the sample</td>
</tr>
<tr class="alt">
<td>chao1</td>
<td>OTUs</td>
<td>N + S^2 / (2D), where N is the number of observed OTUs, S the number of singleton OTUs (abundance 1) and D the number of doubleton OTUs (abundance 2)</td>
</tr>
<tr>
<td>shannon_2</td>
<td>bits</td>
<td>-sum(p*log(p,2)) over all OTUs, where p is the frequency of each OTU</td>
</tr>
<tr class="alt">
<td>shannon_e</td>
<td>nats</td>
<td>-sum(p*log(p,e)) over all OTUs, where p is the frequency of each OTU</td>
</tr>
<tr>
<td>shannon_10</td>
<td>dits</td>
<td>-sum(p*log(p,10)) over all OTUs, where p is the frequency of each OTU</td>
</tr>
<tr class="alt">
<td>simpson</td>
<td>Probability</td>
<td>sum(p^2), where p is the frequency of each OTU</td>
</tr>
<tr>
<td>dominance</td>
<td>Probability</td>
<td>1-simpson</td>
</tr>
<tr class="alt">
<td>equitability</td>
<td></td>
<td>shannon_2 / log(N, 2), where N is the number of OTUs</td>
</tr>
</tbody>
</table>
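<p>The definitions in the table above can be sketched in a few lines of Python (an illustrative sketch, not code from any of the tools discussed below; the function name is mine):</p>
<pre><code>import math

def alpha_indices(counts, base=2):
    """Alpha-diversity indices for one sample, following the table above.
    counts: per-OTU read counts (zeros are dropped)."""
    counts = [c for c in counts if c > 0]
    total = sum(counts)
    p = [c / total for c in counts]                    # OTU frequencies
    richness = len(counts)
    singletons = sum(1 for c in counts if c == 1)      # S
    doubletons = sum(1 for c in counts if c == 2)      # D
    if doubletons:
        chao1 = richness + singletons ** 2 / (2 * doubletons)
    else:
        chao1 = float(richness)                        # no doubletons: no correction
    shannon = -sum(x * math.log(x, base) for x in p)
    simpson = sum(x * x for x in p)                    # prob. two reads share an OTU
    return {"richness": richness, "chao1": chao1, "shannon": shannon,
            "simpson": simpson, "dominance": 1 - simpson,
            "equitability": shannon / math.log(richness, base)}
</code></pre>
<p>Note that this follows the table&#8217;s naming, where simpson = sum(p^2) and dominance = 1 - simpson; some tools (e.g. scikit-bio) attach these two names the other way around.</p>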
<p><strong>Examples:</strong></p>
<p><strong>USEARCH alpha_div</strong></p>
<p>USEARCH provides the <code>alpha_div</code> command to compute these indices; the <code>-metrics</code> option selects which ones. Supported metrics include: berger_parker, buzas_gibson, chao1, dominance, equitability, jost, jost1, reads, richness, robbins, simpson, shannon_e, shannon_2 and shannon_10.</p>
<pre><code>usearch -alpha_div otutable.txt -output alpha.txt
usearch -alpha_div otutable.txt -output gini.txt  -metrics gini_simpson
usearch -alpha_div otutable.txt -output alpha.txt -metrics chao1
</code></pre>
<p><strong>QIIME diversity alpha</strong></p>
<p>The QIIME 2 workflow exposes alpha-diversity calculations through the <code>qiime diversity</code> plugin:</p>
<pre><code>--i-table  : FeatureTable
--p-metric : enspie|michaelis_menten_fit|strong|lladser_pe|fisher_alpha
             |goods_coverage|doubles|simpson|margalef|observed_otus|osd
             |shannon|pielou_e|chao1|brillouin_d|menhinick|simpson_e
             |kempton_taylor_q|robbins|dominance|lladser_ci|heip_e
             |singles|chao1_ci|mcintosh_d|ace|mcintosh_e|gini_index
             |berger_parker_d|esty_ci
--o-alpha-diversity: the output alpha-diversity artifact;
--output-dir : output directory (used when --o-alpha-diversity is not given);
</code></pre>
<p>Run:</p>
<pre><code>qiime diversity alpha          \
   --i-table  table.qza       \
   --p-metric  goods_coverage \
   --o-alpha-diversity  goods_coverage.qza

</code></pre>
<div>Species diversity is measured at three spatial scales: alpha (α), beta (β) and gamma (γ) diversity.</div>
<div><b>α diversity</b> concerns the number of species within a local, homogeneous habitat, and is therefore also called within-habitat diversity.</div>
<div><b>β diversity</b> describes the dissimilarity in species composition between communities along an environmental gradient, or the rate of species turnover along that gradient; it is also called between-habitat diversity. The main ecological factors controlling β diversity are soil, landform and disturbance.</div>
<div>
<div>The fewer species shared between communities, or between points along an environmental gradient, the higher the β diversity. Measuring <a href="http://baike.baidu.com/view/3815769.htm" target="_blank">β diversity</a> accurately matters because: (1) it indicates the degree to which habitats are partitioned by species; (2) it allows the habitat diversity of different sites to be compared; and (3) together with <a href="http://baike.baidu.com/view/3815768.htm" target="_blank">α diversity</a> it makes up the total diversity, i.e. the biological heterogeneity of an area.</div>
<div>γ diversity describes diversity at the regional or continental scale, i.e. the number of species in a region, and is also called regional diversity. The <a href="http://baike.baidu.com/view/3273923.htm" target="_blank">ecological processes</a> controlling <a href="http://baike.baidu.com/view/3815770.htm" target="_blank">γ diversity</a> are mainly water and heat dynamics, climate, and the history of <a href="http://baike.baidu.com/view/702554.htm" target="_blank">speciation</a> and evolution. Its main measure is the species count (S). Along elevational gradients, γ diversity shows one of two patterns: a hump-shaped distribution or a significantly negative correlation.</div>
</div>
<p><a href="https://rdrr.io/cran/otuSummary/man/alphaDiversity.html" target="_blank">https://rdrr.io/cran/otuSummary/man/alphaDiversity.html</a></p>
<div class="b_title">
<h2><a href="https://www.mothur.org/wiki/Invsimpson" target="_blank">Invsimpson &#8211; mothur</a></h2>
<div>The invsimpson calculator is the inverse of the classical Simpson diversity estimator. This parameter is preferred to other measures of <strong>alpha</strong>-diversity because it is an indication of the richness in a community with uniform evenness that would have the same level of diversity.</div>
</div>
<p><strong>Biological diversity &#8211; the great variety of life!</strong></p>
<div id="ct" class="ct2 wp cl">
<div class="mn">
<div class="bm">
<div class="bm_c">
<div class="vw mbm">
<div id="blog_article" class="d cl">
<p>Before exploring the Simpson index, we need to understand a few important concepts:</p>
<p>Biological diversity can be quantified in many ways; the two main components are richness and evenness.</p>
<p>1. Richness</p>
<p>Richness is the number of species in a sample: the more species it contains, the "richer" the sample.</p>
<p>Conceptually, species richness does not consider how many individuals of each species are present. It gives a species with few individuals the same weight as one with many. Thus one daisy has the same effect on the richness of an area as 1000 buttercups.</p>
<p>2. Evenness</p>
<p>Evenness is the relative abundance of the different species; it and richness complement each other.</p>
<p>[Translator&#8217;s note] Three concepts appear here: richness, evenness and abundance. For example, group A has 3 individuals of type 1, 5 of type 2 and 6 of type 3, while group B has 4 of each of the three types. Both groups contain 3 types, so their richness is the same. The three types in group A have different counts while those in group B are identical, so the two groups differ in evenness. Type 1 has 3 individuals in group A and 4 in group B, so group B has the higher abundance of type 1.</p>
<p>As an example, we sample the wildflowers in two fields. The first field contains 300 daisies, 335 dandelions and 365 buttercups. The second contains 20 daisies, 49 dandelions and 931 buttercups, as in the table below. The two samples have the same richness (3 species each) and the same total number of individuals (1000 flowers each). However, the first sample is more even than the second: the three species in the first field have similar counts, whereas the second field is mostly buttercups with only a few daisies and dandelions. Sample 2 is therefore considered less diverse than sample 1.</p>
<p><a href="http://image.sciencenet.cn/home/201708/04/164923bgmoicctp4kq90bl.png" target="_blank"><img src="http://image.sciencenet.cn/home/201708/04/164923bgmoicctp4kq90bl.png" alt="" /></a></p>
<p><em>A community dominated by one or two species is considered less diverse than one in which many species have similar abundances.</em></p>
<p>Diversity increases as species richness and evenness increase. The Simpson index takes both into account.</p>
<p>Simpson&#8217;s diversity index actually covers three closely related indices:</p>
<p><strong>Simpson&#8217;s Index (D) </strong></p>
<p>D is the probability that two individuals drawn at random from the same sample belong to the same type. There are two versions of the formula; they do not contradict each other and both are acceptable.</p>
<table border="3" width="60%" align="CENTER">
<tbody>
<tr>
<td><a href="http://image.sciencenet.cn/home/201708/04/165216cljy66u6cjthffh6.png" target="_blank"><img src="http://image.sciencenet.cn/home/201708/04/165216cljy66u6cjthffh6.png" alt="" /></a></td>
<td><a href="http://image.sciencenet.cn/home/201708/04/1652207ot61olwjjpf9o1k.png" target="_blank"><img src="http://image.sciencenet.cn/home/201708/04/1652207ot61olwjjpf9o1k.png" alt="" /></a></td>
</tr>
<tr>
<td colspan="2" rowspan="1"><strong>n = the total number of organisms of a particular species</strong><strong><br />
</strong><strong>N = the total number of organisms of all species</strong></td>
</tr>
</tbody>
</table>
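<p>The two formula versions shown above differ only in whether the two individuals are drawn with or without replacement. A quick sketch (my own function, for illustration) applied to the wildflower samples:</p>
<pre><code>def simpson_d(counts, with_replacement=True):
    """Simpson's D: the probability that two randomly drawn individuals
    belong to the same species (a larger D means lower diversity)."""
    total = sum(counts)
    if with_replacement:                        # D = sum((n/N)^2)
        return sum((n / total) ** 2 for n in counts)
    # D = sum(n*(n-1)) / (N*(N-1)): drawing without replacement
    return sum(n * (n - 1) for n in counts) / (total * (total - 1))

field1 = [300, 335, 365]    # daisies, dandelions, buttercups
field2 = [20, 49, 931]
</code></pre>
<p>For these samples D is about 0.335 for field 1 and about 0.870 for field 2: the buttercup-dominated field has the larger D, i.e. the lower diversity, and at N = 1000 the two formula versions agree to roughly two decimal places.</p>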
<p>D ranges from 0 to 1, where 0 means infinite diversity and 1 means no diversity: the larger D is, the lower the diversity. This runs against intuition and logic, so D is usually subtracted from 1:</p>
<p><strong>Simpson&#8217;s Index of Diversity 1-D</strong></p>
<p>This value also lies between 0 and 1, but now larger values mean higher diversity, which is more intuitive. In this form the index is the probability that two individuals drawn at random from the same sample belong to different types.</p>
<p>Another way to handle the counter-intuitive D is to divide 1 by it:</p>
<p><strong>Simpson&#8217;s Reciprocal Index 1 / D</strong></p>
<p>The minimum value of 1/D is 1; a value of 1 means the sample consists of a single species. Larger values mean higher diversity, and the maximum equals the number of species in the sample. For example, in a sample with 5 species the maximum of 1/D is 5.</p>
<p>[Translator&#8217;s note] 1/D reaches its maximum of 5 when the abundances of the 5 species are all equal. The extremum can be found via second-order partial derivatives; the proof is omitted as it is not the focus of this post.</p>
<p>Which of the three indices to use depends on the analysis, but a study must state which one it reports as the Simpson index! [Translator&#8217;s note: the original author stressed this point; please take note!]</p>
<p># ====================== End of translated text =======================</p>
<p>The example in that material is good, but unfortunately it only illustrates how the Simpson index relates to evenness. To compare a single factor, the author kept the richness of the two groups the same. What if richness differs? And is the Simpson index, like the Shannon index, independent of absolute abundance? Here is another example (the groups are independent, so I skip the biological interpretation and go straight to the numbers; see the Shannon-index post [2] for details):</p>
<p>Group A: 2, 4, 6, 8</p>
<p>Group B: 20, 40, 60, 80</p>
<p>Group C: 5, 5, 5, 5</p>
<p>Group D: 5, 5, 5, 5, 5</p>
<p>Plugging these into the 1-D formula (the scikit-bio library used by QIIME, the standard 16S rRNA pipeline, computes it this way [3]), we get:</p>
<p>Group A Simpson index: 1-((2/20)^2+(4/20)^2+(6/20)^2+(8/20)^2) = 0.7</p>
<p>Group A Shannon index: 1.846439 (see post [2] for the formula, likewise below)</p>
<p>Group B Simpson index: 1-((20/200)^2+(40/200)^2+(60/200)^2+(80/200)^2) = 0.7</p>
<p>Group B Shannon index: 1.846439</p>
<p>Group C Simpson index: 1-((5/20)^2)*4 = 0.75</p>
<p>Group C Shannon index: 2.0</p>
<p>Group D Simpson index: 1-((5/25)^2)*5 = 0.8</p>
<p>Group D Shannon index: 2.321928</p>
<p>These calculations show that groups A and B are equal, groups C and D are not, and neither are groups A and C.</p>
<p>That A and B agree shows that, with richness held constant, the Simpson index is independent of absolute abundance and depends only on relative abundance (evenness). The same holds for the Shannon index; ultimately this is because the variables in both formulas are the relative frequencies p<sub>i</sub>.</p>
<p>That C and D differ shows that, with evenness held constant, the Simpson index depends on richness: the higher the richness, the smaller D (and hence the larger the 1-D index). This again matches the Shannon index. The root cause is that both formulas are sums whose individual terms, (p<sub>i</sub>)<sup>2</sup> for Simpson and -p<sub>i</sub>*log2(p<sub>i</sub>) for Shannon, are positive on (0, 1] (for -x*log2(x) &gt; 0 on (0, 1], see the y = -x*log2(x) figure in post [2]). In sampling terms: when every species in a sample has the same count, the more species there are, the smaller the probability that two randomly drawn individuals belong to the same species.</p>
<p>Comparing A and C shows that, with richness held constant, the more even the sample, the smaller D, i.e. the smaller the probability that two randomly drawn individuals belong to the same species (and so the larger the 1-D index). The analysis parallels that in post [2]: corresponding to the Shannon term y = -x*log2(x), the Simpson term y = -x<sup>2</sup> is, on (0, 1], also a monotonically decreasing function whose slope keeps decreasing.</p>
<p>In summary, both the Simpson and Shannon indices combine evenness and richness into a single value.</p>
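<p>The arithmetic for groups A&#8211;D can be reproduced directly (a quick check in plain Python; the function names are mine):</p>
<pre><code>import math

def one_minus_d(counts):
    """Simpson index in its 1-D form, as used by scikit-bio/QIIME."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def shannon2(counts):
    """Shannon index with a base-2 logarithm."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts)

groups = {"A": [2, 4, 6, 8], "B": [20, 40, 60, 80],
          "C": [5, 5, 5, 5], "D": [5, 5, 5, 5, 5]}
for name, counts in groups.items():
    print(name, round(one_minus_d(counts), 6), round(shannon2(counts), 6))
</code></pre>
<p>This prints 0.7 and 1.846439 for both A and B, 0.75 and 2.0 for C, and 0.8 and 2.321928 for D, matching the values above.</p>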
<p>〔1〕 <a href="http://www.countrysideinfo.co.uk/simpsons.htm" target="_blank">http://www.countrysideinfo.co.uk/simpsons.htm</a></p>
<p>〔2〕 <a href="http://blog.sciencenet.cn/blog-2970729-1069399.html" target="_blank">http://blog.sciencenet.cn/blog-2970729-1069399.html</a></p>
<p>〔3〕 <a href="http://scikit-bio.org/docs/latest/generated/generated/skbio.diversity.alpha.simpson.html#skbio.diversity.alpha.simpson" target="_blank">http://scikit-bio.org/docs/latest/generated/generated/skbio.diversity.alpha.simpson.html#skbio.diversity.alpha.simpson</a></p>
<p><label>This article is from Lu Rui&#8217;s ScienceNet blog.<br />
Link: </label><a href="http://blog.sciencenet.cn/blog-2970729-1069539.html" target="_blank">http://blog.sciencenet.cn/blog-2970729-1069539.html </a></div>
</div>
</div>
</div>
</div>
</div>
</div>
]]></content:encoded>
			<wfw:commentRss>https://www.biofacebook.com/?feed=rss2&#038;p=1436</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>TORMES and gapFinisher</title>
		<link>https://www.biofacebook.com/?p=1429</link>
		<comments>https://www.biofacebook.com/?p=1429#comments</comments>
		<pubDate>Mon, 30 Dec 2019 05:34:47 +0000</pubDate>
		<dc:creator><![CDATA[szypanther]]></dc:creator>
				<category><![CDATA[生物信息]]></category>

		<guid isPermaLink="false">http://www.biofacebook.com/?p=1429</guid>
		<description><![CDATA[TORMES: an automated pipeline for whole bacterial genome analysis Narciso M Quijada, David Rodríguez-Lázaro, Jose María Eiros, Marta Hernández Bioinformatics, Volume 35, Issue 21, 1 November 2019, Pages 4207–4212, https://doi.org/10.1093/bioinformatics/btz220 gapFinisher: A reliable gap filling pipeline for SSPACE-LongRead scaffolder output Juhana I. Kammonen , Olli-Pekka Smolander, Lars Paulin, Pedro A. B. Pereira, Pia Laine, Patrik [...]]]></description>
				<content:encoded><![CDATA[<h1 class="wi-article-title article-title-main">TORMES: an automated pipeline for whole bacterial genome analysis<i class="" title=""></i></h1>
<div class="wi-authors">
<div class="al-authors-list"><span class="al-author-name-more"><a class="linked-name">Narciso M Quijada</a><span class="delimiter">,</span></span> <span class="al-author-name-more"><a class="linked-name">David Rodríguez-Lázaro</a><span class="delimiter">,</span></span> <span class="al-author-name-more"><a class="linked-name">Jose María Eiros</a><span class="delimiter">,</span></span> <span class="al-author-name-more"><a class="linked-name">Marta Hernández</a><i class="icon-general-mail"></i></span></div>
</div>
<div class="pub-history-wrap clearfix">
<div class="pub-history-row clearfix">
<div class="ww-citation-primary"><em>Bioinformatics</em>, Volume 35, Issue 21, 1 November 2019, Pages 4207–4212, <a href="https://doi.org/10.1093/bioinformatics/btz220">https://doi.org/10.1093/bioinformatics/btz220</a></div>
<div class="ww-citation-primary">
<div class="title-authors">
<h1 id="artTitle">gapFinisher: A reliable gap filling pipeline for SSPACE-LongRead scaffolder output</h1>
<ul id="author-list" class="author-list clearfix" data-js-tooltip="tooltip_container">
<li data-js-tooltip="tooltip_trigger"><a class="author-name" data-author-id="0">Juhana I. Kammonen ,</a></li>
<li data-js-tooltip="tooltip_trigger"><a class="author-name" data-author-id="1">Olli-Pekka Smolander,</a></li>
<li data-js-tooltip="tooltip_trigger"><a class="author-name" data-author-id="2">Lars Paulin,</a></li>
<li data-js-tooltip="tooltip_trigger"><a class="author-name" data-author-id="3">Pedro A. B. Pereira,</a></li>
<li data-js-tooltip="tooltip_trigger"><a class="author-name" data-author-id="4">Pia Laine,</a></li>
<li data-js-tooltip="tooltip_trigger"><a class="author-name" data-author-id="5">Patrik Koskinen,</a></li>
<li data-js-tooltip="tooltip_trigger"><a class="author-name" data-author-id="6">Jukka Jernvall,</a></li>
<li data-js-tooltip="tooltip_trigger"><a class="author-name" data-author-id="7">Petri Auvinen</a></li>
</ul>
</div>
<ul class="date-doi">
<li id="artPubDate">Published: September 9, 2019</li>
<li id="artDoi"><a href="https://doi.org/10.1371/journal.pone.0216885">https://doi.org/10.1371/journal.pone.0216885</a></li>
</ul>
</div>
</div>
</div>
]]></content:encoded>
			<wfw:commentRss>https://www.biofacebook.com/?feed=rss2&#038;p=1429</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>ClinVAP: A reporting strategy from variants to therapeutic options</title>
		<link>https://www.biofacebook.com/?p=1424</link>
		<comments>https://www.biofacebook.com/?p=1424#comments</comments>
		<pubDate>Mon, 16 Dec 2019 09:37:26 +0000</pubDate>
		<dc:creator><![CDATA[szypanther]]></dc:creator>
				<category><![CDATA[兴趣杂项]]></category>
		<category><![CDATA[生物信息]]></category>

		<guid isPermaLink="false">http://www.biofacebook.com/?p=1424</guid>
		<description><![CDATA[<p>https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz924/5674039</p> <p>&#160;</p> ClinVAP: A reporting strategy from variants to therapeutic options <p>&#160;</p> Abstract Motivation <p>Next-generation sequencing (NGS) has become routine in oncology and opens up new avenues of therapies, particularly in personalized oncology setting. An increasing number of cases also implies a need for a more robust, automated, and reproducible processing of long lists of [...]]]></description>
				<content:encoded><![CDATA[<p><a href="https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz924/5674039">https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz924/5674039</a></p>
<p>&nbsp;</p>
<h1 class="wi-article-title article-title-main">ClinVAP: A reporting strategy from variants to therapeutic options</h1>
<p>&nbsp;</p>
<h2 id="190031274" class="abstract-title">Abstract</h2>
<section class="sec">
<div class="title">Motivation</div>
<p>Next-generation sequencing (NGS) has become routine in oncology and opens up new avenues of therapies, particularly in a personalized oncology setting. An increasing number of cases also implies a need for a more robust, automated, and reproducible processing of long lists of variants for cancer diagnosis and therapy. While solutions for the large-scale analysis of somatic variants have been implemented, existing solutions often have issues with reproducibility, scalability, and interoperability.</p>
</section>
<section class="sec">
<div class="title">Results</div>
<p>ClinVAP is an automated pipeline which annotates, filters, and prioritizes somatic single nucleotide variants (SNVs) provided in variant call format. It augments the variant information with documented or predicted clinical effect. These annotated variants are prioritized based on driver gene status and druggability. ClinVAP is available as a fully containerized, self-contained pipeline maximizing reproducibility and scalability allowing the analysis of larger scale data. The resulting JSON-based report is suited for automated downstream processing, but ClinVAP can also automatically render the information into a user-defined template to yield a human-readable report.</p>
</section>
<section class="sec">
<div class="title">Availability and Implementation</div>
<p>ClinVAP is available at <a class="link link-uri" href="https://github.com/PersonalizedOncology/ClinVAP" target="">https://github.com/PersonalizedOncology/ClinVAP</a></p>
</section>
]]></content:encoded>
			<wfw:commentRss>https://www.biofacebook.com/?feed=rss2&#038;p=1424</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Vipie: web pipeline for parallel characterization of viral populations from multiple NGS samples</title>
		<link>https://www.biofacebook.com/?p=1401</link>
		<comments>https://www.biofacebook.com/?p=1401#comments</comments>
		<pubDate>Mon, 16 Sep 2019 06:12:35 +0000</pubDate>
		<dc:creator><![CDATA[szypanther]]></dc:creator>
				<category><![CDATA[二代测序]]></category>
		<category><![CDATA[生物信息]]></category>

		<guid isPermaLink="false">http://www.biofacebook.com/?p=1401</guid>
		<description><![CDATA[Background <p>Next generation sequencing (NGS) technology allows laboratories to investigate virome composition in clinical and environmental samples in a culture-independent way. There is a need for bioinformatic tools capable of parallel processing of virome sequencing data by exactly identical methods: this is especially important in studies of multifactorial diseases, or in parallel comparison of laboratory [...]]]></description>
				<content:encoded><![CDATA[<h3 class="c-article__sub-heading u-h3" data-test="abstract-sub-heading">Background</h3>
<p>Next generation sequencing (NGS) technology allows laboratories to investigate virome composition in clinical and environmental samples in a culture-independent way. There is a need for bioinformatic tools capable of parallel processing of virome sequencing data by exactly identical methods: this is especially important in studies of multifactorial diseases, or in parallel comparison of laboratory protocols.</p>
<p>&nbsp;</p>
<p><a href="https://sourceforge.net/projects/vipie/">https://sourceforge.net/projects/vipie/</a></p>
]]></content:encoded>
			<wfw:commentRss>https://www.biofacebook.com/?feed=rss2&#038;p=1401</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
