小生这厢有礼了(BioFaceBook Personal Blog) » work

Using prcomp/princomp for PCA in R (Part 2)

szypanther — Fri, 31 Aug 2012 04:08:34 +0000

###############################

PCA
###############################
install.packages("vegan")
library(vegan)

> STpcoa<-read.table(file="bactera_16s_final.subsample.phylip.tre1.weighted.phylip.pcoa.axes", header=T,row.names=1)
> STpcoa
              axis1     axis2     axis3 axis4
Cellulose -0.020878 -0.234601  0.167454     0
Foodwaste -0.234592  0.221741  0.085802     0
Sludge     0.368882  0.100725 -0.010570     0
Xylan     -0.113413 -0.087865 -0.242686     0
> pl.STpcoa<-princomp(STpcoa)
> summary(pl.STpcoa)
Importance of components:
                          Comp.1    Comp.2    Comp.3 Comp.4
Standard deviation     0.2260563 0.1746944 0.1536319      0
Proportion of Variance 0.4856521 0.2900347 0.2243133      0
Cumulative Proportion  0.4856521 0.7756867 1.0000000      1

> ls(pl.STpcoa)
[1] "call"     "center"   "loadings" "n.obs"    "scale"    "scores"   "sdev"
> class(pl.STpcoa)
[1] "princomp"
> nmds.col<-c(rep("green", 1), rep("blue", 1), rep("black",1), rep("red",1))
> plot(pl.STpcoa$scores, col=nmds.col, pch=20)
> legend(x=0.12, y=0.25, c("Cellulose","Foodwaste","Sludge","Xylan"),c("green","blue","black","red"),bty="n")

> biplot(pl.STpcoa)

##############################
> pl2.STpcoa<-prcomp(STpcoa)
> class(pl2.STpcoa)
[1] "prcomp"
> pl2.STpcoa$sdev^2
[1] 0.06813525 0.04069083 0.03147035 0.00000000
> summary(pl2.STpcoa)
Importance of components:
                          PC1    PC2    PC3 PC4
Standard deviation     0.2610 0.2017 0.1774   0
Proportion of Variance 0.4857 0.2900 0.2243   0
Cumulative Proportion  0.4857 0.7757 1.0000   1
> pl2.STpcoa$x
                  PC1        PC2         PC3 PC4
Cellulose -0.02087762 -0.2346017  0.16745306   0
Foodwaste -0.23459165  0.2217407  0.08580311   0
Sludge     0.36888225  0.1007250 -0.01056992   0
Xylan     -0.11341297 -0.0878640 -0.24268626   0
> pl2.STpcoa$rotation
               PC1           PC2           PC3 PC4
axis1 1.000000e+00 -9.352971e-08 -8.835157e-07   0
axis2 9.353330e-08  1.000000e+00  4.064575e-06   0
axis3 8.835153e-07 -4.064576e-06  1.000000e+00   0
axis4 0.000000e+00  0.000000e+00  0.000000e+00   1
> plot(pl2.STpcoa$x, col=nmds.col, pch=20)
> legend(x=0.12, y=0.25, c("Cellulose","Foodwaste","Sludge","Xylan"),c("green","blue","black","red"),bty="n")

> screeplot(pl2.STpcoa,type="lines",main="Scree Plot")
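
The two summaries above report the same proportions of variance but different standard deviations (0.2260563 from princomp, 0.2610 from prcomp). The usual explanation is that princomp divides by n when estimating variances while prcomp divides by n - 1. A minimal sketch to check this, assuming the STpcoa values are typed in by hand from the printed table (so they are rounded to six decimals):

```r
# STpcoa values copied by hand from the table printed above (an assumption:
# the real file holds more decimals than shown here).
STpcoa <- data.frame(
  axis1 = c(-0.020878, -0.234592,  0.368882, -0.113413),
  axis2 = c(-0.234601,  0.221741,  0.100725, -0.087865),
  axis3 = c( 0.167454,  0.085802, -0.010570, -0.242686),
  axis4 = c(0, 0, 0, 0),
  row.names = c("Cellulose", "Foodwaste", "Sludge", "Xylan")
)

n <- nrow(STpcoa)
sd_princomp <- unname(princomp(STpcoa)$sdev)  # variance divisor n
sd_prcomp   <- prcomp(STpcoa)$sdev            # variance divisor n - 1

# The first three components should differ by the factor sqrt((n - 1) / n).
ratio <- sd_princomp[1:3] / sd_prcomp[1:3]
print(ratio)
print(sqrt((n - 1) / n))
```

With only four samples the factor is sqrt(3/4), which is why the gap between the two outputs is so visible here; with large n it becomes negligible.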

bactera_16s_final.subsample.phylip.tre1.weighted.phylip.pcoa.axes

Using prcomp/princomp for PCA in R (Part 1)

szypanther — Fri, 31 Aug 2012 04:00:07 +0000

Difference between prcomp and princomp:

"'princomp' can only be used with more units than variables"

prcomp is based on singular value decomposition (the svd() function), whereas princomp is based on eigendecomposition (the eigen() function).
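
To make the SVD versus eigen distinction concrete, here is a small sketch that computes the component standard deviations both ways. It uses the built-in USArrests data purely as a stand-in (an assumption; any numeric matrix would do).

```r
X  <- as.matrix(USArrests)                    # built-in example data
Xc <- scale(X, center = TRUE, scale = FALSE)  # center the columns only
n  <- nrow(Xc)

# Route 1 (prcomp-style): SVD of the centered data matrix.
sdev_svd <- svd(Xc)$d / sqrt(n - 1)

# Route 2 (princomp-style): eigendecomposition of the covariance matrix.
# cov() divides by n - 1; princomp itself rescales to a divisor of n.
sdev_eigen <- sqrt(eigen(cov(X), symmetric = TRUE)$values)

print(all.equal(sdev_svd, prcomp(X)$sdev))
print(all.equal(sdev_eigen, sdev_svd))
```

Both routes give the same answer on well-behaved data; the SVD route avoids forming the covariance matrix explicitly, which is generally considered the numerically safer choice.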

Good video source:

http://www.youtube.com/watch?v=oZ2nfIPdvjY

http://www.youtube.com/watch?v=I5GxNzKLIoU&feature=relmfu

http://www.planta.cn/forum/viewtopic.php?t=16754&highlight=%D3%EF%D1%D4

###########################################

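All of the code below, including the practice data, can be run directly in R.

# Principal component analysis and principal component regression
The idea of principal component analysis was proposed by Pearson in 1901 and developed further by Hotelling in 1933.
In R, principal component analysis is done with the princomp() function.

Usage
princomp(x, cor = FALSE, scores = TRUE, covmat = NULL,
subset = rep(TRUE, nrow(as.matrix(x))), ...)

# x: the data to analyze
# cor: whether to use the sample correlation matrix (TRUE) or the sample covariance matrix (FALSE, the default) for the analysis
prcomp()
2. summary() function
3. loadings() function
4. predict() function
5. screeplot() function
6. biplot() function

Example
A middle school randomly sampled 30 students from one grade and measured their height, weight, chest circumference and sitting height. Run a principal component analysis on these four body measurements of the 30 students.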
student<-data.frame(
X1=c(148, 139, 160, 149, 159, 142, 153, 150, 151, 139,
140, 161, 158, 140, 137, 152, 149, 145, 160, 156,
151, 147, 157, 147, 157, 151, 144, 141, 139, 148 ),
X2=c(41, 34, 49, 36, 45, 31, 43, 43, 42, 31,
29, 47, 49, 33, 31, 35, 47, 35, 47, 44,
42, 38, 39, 30, 48, 36, 36, 30, 32, 38 ),
X3=c(72, 71, 77, 67, 80, 66, 76, 77, 77, 68,
64, 78, 78, 67, 66, 73, 82, 70, 74, 78,
73, 73, 68, 65, 80, 74, 68, 67, 68, 70),
X4=c(78, 76, 86, 79, 86, 76, 83, 79, 80, 74,
74, 84, 83, 77, 73, 79, 79, 77, 87, 85,
82, 78, 80, 75, 88, 80, 76, 76, 73, 78 )
)
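# Principal component analysis
student.pr <- princomp(student, cor = TRUE)
# Show the results
summary(student.pr, loadings=TRUE)
# Predict: compute the principal component scores of each sample
pre<-predict(student.pr)
# Draw the scree plot
screeplot(student.pr,type="lines")
# Biplot of the principal component analysis
biplot(student.pr)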
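
Two things in the example above are worth spelling out: what cor = TRUE does, and how the "Cumulative Proportion" row of summary() is derived. The sketch below shows both on the built-in USArrests data (used here only as a stand-in for the student table); the 85% threshold is an arbitrary illustrative choice, not something the example prescribes.

```r
X  <- USArrests
pr <- princomp(X, cor = TRUE)

# With cor = TRUE the component variances are the eigenvalues of the
# correlation matrix, i.e. the variables are standardized first.
sdev_from_cor <- sqrt(eigen(cor(X), symmetric = TRUE)$values)
print(all.equal(unname(pr$sdev), sdev_from_cor))

# Proportion and cumulative proportion of variance, as shown by summary(pr).
prop     <- pr$sdev^2 / sum(pr$sdev^2)
cum_prop <- cumsum(prop)
print(round(cum_prop, 4))

# Smallest number of components whose cumulative proportion reaches 85%.
k <- which(cum_prop >= 0.85)[1]
print(k)
```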

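Example 2
Body measurements were taken on 128 adult men, 16 indicators per person: height, sitting height, chest circumference, head height, trouser length, inseam, hand length, neck circumference, front chest, back, shoulder thickness, shoulder width, sleeve length, rib circumference, waist circumference and calf, denoted X1-X16. The correlation matrix R of the 16 indicators is given below. Run a principal component analysis starting from the correlation matrix and use it to group the 16 indicators.
Commands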
x<-c(
1.00,
0.79, 1.00,
0.36, 0.31, 1.00,
0.96, 0.74, 0.38, 1.00,
0.89, 0.58, 0.31, 0.90, 1.00,
0.79, 0.58, 0.30, 0.78, 0.79, 1.00,
0.76, 0.55, 0.35, 0.75, 0.74, 0.73, 1.00,
0.26, 0.19, 0.58, 0.25, 0.25, 0.18, 0.24, 1.00,
0.21, 0.07, 0.28, 0.20, 0.18, 0.18, 0.29,-0.04, 1.00,
0.26, 0.16, 0.33, 0.22, 0.23, 0.23, 0.25, 0.49,-0.34, 1.00,
0.07, 0.21, 0.38, 0.08,-0.02, 0.00, 0.10, 0.44,-0.16, 0.23, 1.00,
0.52, 0.41, 0.35, 0.53, 0.48, 0.38, 0.44, 0.30,-0.05, 0.50, 0.24, 1.00,
0.77, 0.47, 0.41, 0.79, 0.79, 0.69, 0.67, 0.32, 0.23, 0.31, 0.10, 0.62, 1.00,
0.25, 0.17, 0.64, 0.27, 0.27, 0.14, 0.16, 0.51, 0.21, 0.15, 0.31, 0.17, 0.26, 1.00,
0.51, 0.35, 0.58, 0.57, 0.51, 0.26, 0.38, 0.51, 0.15, 0.29, 0.28, 0.41, 0.50, 0.63, 1.00,
0.21, 0.16, 0.51, 0.26, 0.23, 0.00, 0.12, 0.38, 0.18, 0.14, 0.31, 0.18, 0.24, 0.50, 0.65, 1.00
)
names<-c("X1", "X2", "X3", "X4", "X5", "X6", "X7", "X8", "X9",
"X10", "X11", "X12", "X13", "X14", "X15", "X16")
R<-matrix(0, nrow=16, ncol=16, dimnames=list(names, names))
for (i in 1:16){
for (j in 1:i){
R[i,j]<-x[(i-1)*i/2+j]; R[j,i]<-R[i,j]
}
}
# Principal component analysis
pr<-princomp(covmat=R)
load<-loadings(pr)

# Plot the loadings on the first two components and label the points
plot(load[,1:2])
text(load[,1], load[,2], adj=c(-0.4, 0.3))
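
The double loop in Example 2 is meant to fill a symmetric matrix from a lower triangle that is listed row by row. The sketch below shows that intended fill on a 3 x 3 subset (X1 to X3 only, to keep it short) next to a vectorized alternative; the vectorized form is my own variation, not part of the original example.

```r
# First three rows of the lower triangle listed above (X1 to X3 only).
x <- c(1.00,
       0.79, 1.00,
       0.36, 0.31, 1.00)
p    <- 3
vars <- paste0("X", 1:p)

# Loop version with explicit [i, j] indexing.
R_loop <- matrix(0, nrow = p, ncol = p, dimnames = list(vars, vars))
for (i in 1:p) {
  for (j in 1:i) {
    R_loop[i, j] <- x[(i - 1) * i / 2 + j]
    R_loop[j, i] <- R_loop[i, j]
  }
}

# Vectorized version: a row-wise lower triangle has the same element order
# as a column-wise upper triangle, so fill the upper part and mirror it.
R_vec <- matrix(0, nrow = p, ncol = p, dimnames = list(vars, vars))
R_vec[upper.tri(R_vec, diag = TRUE)] <- x
R_vec[lower.tri(R_vec)] <- t(R_vec)[lower.tri(R_vec)]

print(identical(R_loop, R_vec))

# PCA directly from the correlation matrix, as in the example.
pr <- princomp(covmat = R_vec)
print(loadings(pr))
```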

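Principal component regression
Consider the relationship between total imports Y and three explanatory variables: gross domestic product, stocks, and total consumption. Data were collected for the 11 years from 1949 to 1959; fit both a linear regression and a principal component regression.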
conomy<-data.frame(
x1=c(149.3, 161.2, 171.5, 175.5, 180.8, 190.7, 202.1, 212.4, 226.1, 231.9, 239.0),
x2=c(4.2, 4.1, 3.1, 3.1, 1.1, 2.2, 2.1, 5.6, 5.0, 5.1, 0.7),
x3=c(108.1, 114.8, 123.2, 126.9, 132.1, 137.7, 146.0, 154.1, 162.3, 164.3, 167.6),
y=c(15.9, 16.4, 19.0, 19.1, 18.8, 20.4, 22.7, 26.5, 28.1, 27.6, 26.3)
)

Linear regression
lm.sol<-lm(y~x1+x2+x3, data=conomy)
summary(lm.sol)
Principal component regression

# Principal component analysis
conomy.pr<-princomp(~x1+x2+x3, data=conomy, cor=T)
summary(conomy.pr, loadings=TRUE)
pre<-predict(conomy.pr)
conomy$z1<-pre[,1]; conomy$z2<-pre[,2]
lm.sol<-lm(y~z1+z2, data=conomy)
summary(lm.sol)
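
The regression on z1 and z2 is expressed in component scores, which are hard to interpret. A common follow-up step, sketched here as my own addition rather than something the post does, is to convert the coefficients back to the original variables x1 to x3 using the loadings, centers and scales stored in the princomp object.

```r
conomy <- data.frame(
  x1 = c(149.3, 161.2, 171.5, 175.5, 180.8, 190.7, 202.1, 212.4, 226.1, 231.9, 239.0),
  x2 = c(4.2, 4.1, 3.1, 3.1, 1.1, 2.2, 2.1, 5.6, 5.0, 5.1, 0.7),
  x3 = c(108.1, 114.8, 123.2, 126.9, 132.1, 137.7, 146.0, 154.1, 162.3, 164.3, 167.6),
  y  = c(15.9, 16.4, 19.0, 19.1, 18.8, 20.4, 22.7, 26.5, 28.1, 27.6, 26.3)
)

pr  <- princomp(~ x1 + x2 + x3, data = conomy, cor = TRUE)
pre <- predict(pr)
conomy$z1 <- pre[, 1]
conomy$z2 <- pre[, 2]
fit <- lm(y ~ z1 + z2, data = conomy)

# Scores are z = ((x - center) / scale) %*% loadings, so the slopes on the
# original variables are (loadings %*% beta_z) / scale.
beta_z      <- coef(fit)[c("z1", "z2")]
A           <- unclass(loadings(pr))[, 1:2]
slope_x     <- as.vector(A %*% beta_z) / pr$scale
intercept_x <- unname(coef(fit)["(Intercept)"]) - sum(pr$center * slope_x)
print(c(intercept = intercept_x, slope_x))

# Check: the back-transformed equation reproduces the fitted values.
X     <- as.matrix(conomy[, c("x1", "x2", "x3")])
y_hat <- as.vector(intercept_x + X %*% slope_x)
print(all.equal(y_hat, unname(fitted(fit))))
```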

4sample CA RDA analysis

szypanther — Thu, 30 Aug 2012 04:14:47 +0000

> gtsdata_data=read.table("gtsdata.txt", header=T)
> gtsenv=read.table("gtsenv.txt", header=T)
> gtsdata_data_t<-t(gtsdata_data)
> decorana(gtsdata_data_t)

Call:
decorana(veg = gtsdata_data_t)

Detrended correspondence analysis with 26 segments.
Rescaling of axes with 4 iterations.

                  DCA1   DCA2    DCA3 DCA4
Eigenvalues     0.8634 0.4834 0.23788    0
Decorana values 0.8721 0.3793 0.07223    0
Axis lengths    5.3292 2.1115 1.80907    0

> gts.ca=cca(gtsdata_data_t)
> gts.ca
Call: cca(X = gtsdata_data_t)

              Inertia Rank
Total           1.653
Unconstrained   1.653    3
Inertia is mean squared contingency coefficient

Eigenvalues for unconstrained axes:
   CA1    CA2    CA3
0.8721 0.5037 0.2776

> plot(gts.ca,scaling=3)

> gtsdata_data_t_del<-gtsdata_data_t[1:3,]
> gtsdata_data_t_del

> gts.rda=rda(gtsdata_data_t_del,gtsenv)
> gts.rda
Call: rda(X = gtsdata_data_t_del, Y = gtsenv)

              Inertia Proportion Rank
Total          101790          1
Constrained    101790          1    2
Unconstrained       0          0    0
Inertia is variance
Some constraints were aliased because they were collinear (redundant)

Eigenvalues for constrained axes:
 RDA1  RDA2
81240 20549

> plot(gts.rda,display=c("sp","bp","si"),scaling=3)

gtsenv.txt

gtsdata.txt

Learning ordination analysis of ecological data with the Vegan package

szypanther — Tue, 28 Aug 2012 04:40:20 +0000

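"Ordination analysis of ecological data using the Vegan package
Lai Jiangshan, Mi Xiangcheng
(State Key Laboratory of Vegetation and Environmental Change, Institute of Botany, Chinese Academy of Sciences, Beijing 100093)
Abstract: Community data are usually multidimensional, for example species attributes or environmental factors. Multivariate statistics are commonly used in community ecology, and ordination is one of the most common multivariate methods. CANOCO is a widely used ordination program, but it is expensive commercial software and is updated slowly. In recent years the R language, being flexible, open, easy to learn and free, has quickly won favor among researchers in ecology and biodiversity. The add-on R package "Vegan" is a tool designed for community ecology analysis. Vegan provides all the basic ordination methods, can produce polished ordination plots, and is updated frequently. We believe Vegan can fully replace CANOCO and become the statistical tool of choice for ordination analysis. This paper first outlines the principles and types of ordination, then introduces basic information about Vegan and how to download and install it, and finally uses 40 randomly selected 20 m × 20 m quadrats from the 24-hectare Gutianshan plot as an example to demonstrate the common ordination methods in Vegan (PCA, RDA, CA and CCA) and how ordination plots are produced, in the hope of helping R beginners become familiar with Vegan and use it for ordination analysis."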

gtsdata

gtsenv.txt

赖江山.pdf

> setwd("/winxp_disk2/shenzy/R/Vegan")
> gtsdata=read.table("gtsdata.txt", header=T)
> gtsenv=read.table("gtsenv.txt", header=T)
> install.packages("vegan")
Installing package(s) into ‘/home/shenzy/R/x86_64-pc-linux-gnu-library/2.15’
(as ‘lib’ is unspecified)
trying URL 'http://cran.csiro.au/src/contrib/vegan_2.0-4.tar.gz'
Content type 'application/x-gzip' length 1576584 bytes (1.5 Mb)
opened URL
==================================================
downloaded 1.5 Mb

* installing *source* package ‘vegan’ ...
** package ‘vegan’ successfully unpacked and MD5 sums checked
** libs
gfortran   -fpic  -O3 -pipe  -g  -c cepin.f -o cepin.o
gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG      -fpic  -O3 -pipe  -g  -c data2hill.c -o data2hill.o
gfortran   -fpic  -O3 -pipe  -g  -c decorana.f -o decorana.o
gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG      -fpic  -O3 -pipe  -g  -c goffactor.c -o goffactor.o
gfortran   -fpic  -O3 -pipe  -g  -c monoMDS.f -o monoMDS.o
gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG      -fpic  -O3 -pipe  -g  -c nestedness.c -o nestedness.o
gfortran   -fpic  -O3 -pipe  -g  -c ordering.f -o ordering.o
gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG      -fpic  -O3 -pipe  -g  -c pnpoly.c -o pnpoly.o
gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG      -fpic  -O3 -pipe  -g  -c stepacross.c -o stepacross.o
gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG      -fpic  -O3 -pipe  -g  -c vegdist.c -o vegdist.o
gcc -std=gnu99 -shared -o vegan.so cepin.o data2hill.o decorana.o goffactor.o monoMDS.o nestedness.o ordering.o pnpoly.o stepacross.o vegdist.o -lgfortran -lm -lquadmath -L/usr/lib/R/lib -lR
installing to /home/shenzy/R/x86_64-pc-linux-gnu-library/2.15/vegan/libs
** R
** data
** inst
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** installing vignettes
   ‘decision-vegan.Rnw’
   ‘diversity-vegan.Rnw’ using ‘UTF-8’
   ‘intro-vegan.Rnw’ using ‘UTF-8’
** testing if installed package can be loaded

* DONE (vegan)

The downloaded source packages are in
	‘/tmp/RtmpmtXtEK/downloaded_packages’
> library(vegan)
Loading required package: permute

Attaching package: ‘permute’

The following object(s) are masked from ‘package:gtools’:

    permute

This is vegan 2.0-4
> decorana(gtsdata)

Call:
decorana(veg = gtsdata) 

Detrended correspondence analysis with 26 segments.
Rescaling of axes with 4 iterations.

                  DCA1   DCA2    DCA3    DCA4
Eigenvalues     0.3939 0.2239 0.09555 0.06226
Decorana values 0.5025 0.1756 0.06712 0.03877
Axis lengths    3.2595 2.5130 1.21445 1.00854

> gts.pca=rda(gtsdata)
> gts.pca
Call: rda(X = gtsdata)

              Inertia Rank
Total           352.1
Unconstrained   352.1   22
Inertia is variance 

Eigenvalues for unconstrained axes:
    PC1     PC2     PC3     PC4     PC5     PC6     PC7     PC8
111.779  73.580  54.607  32.959  26.481  18.063  12.763   7.637
(Showed only 8 of all 22 unconstrained eigenvalues)

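Note: the commands above are used to choose the ordination model (linear models PCA and RDA, or unimodal models CA and CCA), because the Axis lengths row is equivalent to the gradient lengths of a DCA in CANOCO.
If the largest DCA axis length is > 4, choose a unimodal model; if it is < 3, choose a linear model; between 3 and 4, either type is acceptable.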
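
The rule of thumb in the note can be written as a tiny helper. This is a hypothetical function, not part of vegan, and treating the 3 to 4 range as "either" is an assumption based on the commonly quoted version of the rule. The axis lengths are copied from the decorana output above and from the four-sample run in the earlier post.

```r
# Hypothetical helper (not a vegan function): pick a model family from the
# "Axis lengths" row of a decorana() result.
choose_ordination <- function(axis_lengths) {
  longest <- max(axis_lengths)
  if (longest > 4) {
    "unimodal (CA/CCA)"
  } else if (longest < 3) {
    "linear (PCA/RDA)"
  } else {
    "either linear or unimodal"
  }
}

gts_lengths        <- c(3.2595, 2.5130, 1.21445, 1.00854)  # decorana(gtsdata)
foursample_lengths <- c(5.3292, 2.1115, 1.80907, 0)        # four-sample run

print(choose_ordination(gts_lengths))         # "either linear or unimodal"
print(choose_ordination(foursample_lengths))  # "unimodal (CA/CCA)"
```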
> plot(gts.pca)

 
 >biplot(gts.pca)

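The plot above is heavily overlapped. The reason is that species abundances differ greatly, so the unevenly distributed species occupy most of the ordination space. The species data can be standardized to unit variance through the scale argument, as follows: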
> gts.pca=rda(gtsdata, scale=T)
> biplot(gts.pca,scaling=3)

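Note: scaling=1 focuses on the relationships among sites (site scores are scaled by the eigenvalues)
scaling=2 focuses on the relationships among species (species scores are scaled by the eigenvalues)
scaling=3 is a symmetric scaling for looking at sites and species together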
> biplot(gts.pca,display="sp")
> biplot(gts.pca,display="si")
> biplot(gts.pca,display="sp", choices=c(1,3))


CA analysis:
> gts.ca=cca(gtsdata)
> gts.ca
Call: cca(X = gtsdata)

              Inertia Rank
Total           1.424
Unconstrained   1.424   21
Inertia is mean squared contingency coefficient 

Eigenvalues for unconstrained axes:
    CA1     CA2     CA3     CA4     CA5     CA6     CA7     CA8
0.50253 0.26564 0.14023 0.10502 0.09127 0.05540 0.05063 0.04204
(Showed only 8 of all 21 unconstrained eigenvalues)
> plot(gts.ca,scaling=3)

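Reading the CA plot: if a species lies close to a site, that species probably contributes strongly to the position of that site. The plot shows that site 20 is very close to QUESER (短柄枹). Sites 19 and 20 are also close to each other, which indicates that their species composition is similar. Species that occur in only a few sites, such as CASCAR, usually fall at the edge of the ordination space, indicating that they occur only occasionally; the values in that column are small or 0 for most sites. Species near the center of the ordination may have their optimum within the sampled area; the corresponding column (CASERY) has many large values.
RDA analysis (analysis of multiple matrices):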
> gts.rda=rda(gtsdata,gtsenv)
> gts.rda
Call: rda(X = gtsdata, Y = gtsenv)

               Inertia Proportion Rank
Total         352.0917     1.0000
Constrained   137.4026     0.3902    8
Unconstrained 214.6891     0.6098   22
Inertia is variance 

Eigenvalues for constrained axes:
   RDA1    RDA2    RDA3    RDA4    RDA5    RDA6    RDA7    RDA8
56.3864 42.7769 17.8270 13.5066  2.5020  2.1217  1.6616  0.6203 

Eigenvalues for unconstrained axes:
   PC1    PC2    PC3    PC4    PC5    PC6    PC7    PC8
72.287 54.891 26.618 17.959 12.730  9.918  5.659  5.349
(Showed only 8 of all 22 unconstrained eigenvalues)
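> plot(gts.rda,display=c("sp","bp","si"),scaling=3)

In an RDA ordination, the length of an arrow represents how strongly an environmental factor is correlated with the community and species distribution: the longer the arrow, the stronger the correlation.
The angle between an arrow and an ordination axis represents the correlation between that environmental factor and the axis: the smaller the angle, the stronger the correlation.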
> gts.prda=rda(gtsdata,gtsenv[,1:4], gtsenv[,5:8])
> gts.prda
Call: rda(X = gtsdata, Y = gtsenv[, 1:4], Z = gtsenv[, 5:8])

               Inertia Proportion Rank
Total         352.0917     1.0000
Conditional    95.0318     0.2699    4
Constrained    42.3708     0.1203    4
Unconstrained 214.6891     0.6098   22
Inertia is variance 

Eigenvalues for constrained axes:
  RDA1   RDA2   RDA3   RDA4
27.522  9.087  3.442  2.320 

Eigenvalues for unconstrained axes:
   PC1    PC2    PC3    PC4    PC5    PC6    PC7    PC8
72.287 54.891 26.618 17.959 12.730  9.918  5.659  5.349
(Showed only 8 of all 22 unconstrained eigenvalues)
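Note: gtsenv[,1:4] means that only the first 4 columns of the environmental matrix are used as constraints, i.e. the topographic factors. Constrained is 42.37 / 352.09 = 12.03%, the share of the total inertia explained by the topographic factors alone. Swapping Y and Z gives the share explained by the soil factors alone (15.53%). The total explained by both groups was computed earlier as 39.02%, so the share explained jointly by the two groups of environmental variables is 39.02% - 15.53% - 12.03% = 11.46%.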
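
The arithmetic in the note can be reproduced from the inertia values printed above. This is only a bookkeeping sketch: the 15.53% for soil alone is taken from the note as given, since the corresponding RDA run is not shown.

```r
total            <- 352.0917  # Total inertia
both_sets        <- 137.4026  # Constrained inertia of rda(gtsdata, gtsenv)
topo_alone       <- 42.3708   # Constrained inertia of the partial RDA
soil_incl_shared <- 95.0318   # Conditional inertia of the partial RDA
soil_alone_pct   <- 15.53     # quoted in the note; its run is not shown

pct <- function(v) 100 * v / total

both_pct   <- pct(both_sets)
topo_pct   <- pct(topo_alone)
shared_pct <- both_pct - soil_alone_pct - topo_pct

print(round(c(both = both_pct, topo_alone = topo_pct,
              soil_alone = soil_alone_pct, shared = shared_pct), 2))

# Consistency check: the Conditional share equals soil alone plus shared.
print(round(pct(soil_incl_shared), 2))
print(round(soil_alone_pct + shared_pct, 2))
```

The last two lines agree (about 26.99%), which matches the Conditional proportion of 0.2699 in the partial RDA output.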

CCA analysis is similar:
> gts.cca=cca(gtsdata,gtsenv)

R OTU heatmap2

szypanther — Thu, 23 Aug 2012 07:24:35 +0000

source("http://www.bioconductor.org/biocLite.R");
biocLite("affy");
biocLite("Biobase");
library(affy);
library(Biobase);
library(gplots);  # heatmap.2() and greenred() come from the gplots package

> bac_4sampledata=read.csv("/home/R_heatmap/4sample_R_cluster.csv", sep="\t")
> row.names(bac_4sampledata)<-bac_4sampledata$Group
> bac_4sample_Datamatrix<-data.matrix(bac_4sampledata[,2:5])
> heatmap.2(bac_4sample_Datamatrix, distfun=dist,col=greenred(256), scale="row", key=TRUE, symkey=FALSE, density.info="none", trace="none", cexRow=0.5, cexCol=0.7,margin=c(7,30), keysize=1.5);

4sample_R_cluster_stdtop100

 > heatmap.2(bac_4sample_Datamatrix, distfun = function(x) dist(x,method = 'euclidean'),hclustfun = function(x) hclust(x,method = 'centroid'),col=greenred(256), scale="row", key=TRUE, symkey=FALSE, density.info="none", trace="none", cexRow=0.5, cexCol=0.7,margin=c(7,30), keysize=1.5);
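
To see what scale="row" and the custom distfun/hclustfun do without drawing anything, the sketch below reproduces both steps in base R on a toy matrix. The counts are hypothetical. As far as I understand heatmap.2, the dendrogram is computed on the unscaled values and the row scaling only affects the colors, which is why the two steps are kept separate here.

```r
# Toy OTU count matrix (hypothetical values): 3 OTUs in 4 samples.
m <- matrix(c( 10, 20, 30, 40,
                5,  5, 10, 20,
              100, 50, 25, 25),
            nrow = 3, byrow = TRUE,
            dimnames = list(paste0("OTU", 1:3),
                            c("Cellulose", "Foodwaste", "Sludge", "Xylan")))

# Row z-scores: subtract each row's mean and divide by its standard
# deviation, which is the transformation scale="row" applies for coloring.
z <- t(scale(t(m)))
print(round(z, 2))

# Same distance and linkage choices as the second heatmap.2 call above.
hc <- hclust(dist(m, method = "euclidean"), method = "centroid")
print(hc$order)
```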

454 pyrosequencing analysis pipeline

szypanther — Thu, 16 Aug 2012 08:27:16 +0000

mothur > sffinfo(sff=454Reads_archaea.sff, flow=T)
Extracting info from 454Reads_archaea.sff …
10000
20000
30000
40000
50000
60000
70000
80000
90000
92115
It took 68 secs to extract 92115.
Output File Names:
454Reads_archaea.fasta
454Reads_archaea.qual
454Reads_archaea.flow

mothur > trim.flows(flow=454Reads_archaea.flow, oligos=oligos_LXY.txt, pdiffs=2, bdiffs=1, processors=2)
Appending files from process 15674

Output File Names:
454Reads_archaea.trim.flow
454Reads_archaea.scrap.flow
454Reads_archaea.GZ_ARC.flow
454Reads_archaea.GZ1122_ARC.flow
454Reads_archaea.GZ1122cellulose_ARC.flow
454Reads_archaea.GZ_xylan_ARC.flow
454Reads_archaea.GZ_cellulose55_ARC.flow
454Reads_archaea.SHX_xylan_ARC.flow
454Reads_archaea.GZ_xylose_ARC.flow
454Reads_archaea.Eric_ARC.flow
454Reads_archaea.Milk_D_ARC.flow
454Reads_archaea.Milk_E_ARC.flow
454Reads_archaea.ST1219_ARC.flow
454Reads_archaea.YL_ARC.flow
454Reads_archaea.SHX_xylose_ARC.flow
454Reads_archaea.SHX_cellulose55_ARC.flow
454Reads_archaea.TP_1201_ARC.flow
454Reads_archaea.ST_ARC.flow
454Reads_archaea.YL0203cellulose_ARC.flow
454Reads_archaea.TP_xylan_ARC.flow
454Reads_archaea.ST0303cellulose_ARC.flow
454Reads_archaea.SHX_ARC.flow
454Reads_archaea.ST_xylan_ARC.flow
454Reads_archaea.YL_xylan_ARC.flow
454Reads_archaea.SHX1219_ARC.flow
454Reads_archaea.SHX1125cellulose_ARC.flow
454Reads_archaea.flow.files

mothur > shhh.flows(file=454Reads_archaea.flow.files, processors=4)

mothur > trim.seqs(fasta=454Reads_archaea.shhh.fasta, name=454Reads_archaea.shhh.names, oligos=oligos_LXY.txt, pdiffs=2, bdiffs=1, maxhomop=8, minlength=150, flip=T, processors=2)

Total of all groups is 44091

Output File Names:
454Reads_archaea.shhh.trim.fasta
454Reads_archaea.shhh.scrap.fasta
454Reads_archaea.shhh.trim.names
454Reads_archaea.shhh.scrap.names
454Reads_archaea.shhh.groups

mothur > summary.seqs(fasta=454Reads_archaea.shhh.trim.fasta, name=454Reads_archaea.shhh.trim.names)
Using 2 processors.
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 218 218 0 3 1
2.5%-tile: 1 251 251 0 3 1103
25%-tile: 1 268 268 0 4 11023
Median: 1 274 274 0 4 22046
75%-tile: 1 281 281 0 4 33069
97.5%-tile: 1 297 297 0 5 42989
Maximum: 1 333 333 0 8 44091
Mean: 1 273.837 273.837 0 4.15944
# of unique seqs: 12780
total # of seqs: 44091

Output File Name:
454Reads_archaea.shhh.trim.summary

mothur > unique.seqs(fasta=454Reads_archaea.shhh.trim.fasta, name=454Reads_archaea.shhh.trim.names)

1000 959
2000 1691
3000 2431
4000 3358
5000 4352
6000 5335
7000 6328
8000 7261
9000 8187
10000 9082
11000 9963
12000 10859
12780 11449

Output File Names:
454Reads_archaea.shhh.trim.unique.fasta
454Reads_archaea.shhh.trim.unique.names

mothur > summary.seqs(fasta=454Reads_archaea.shhh.trim.unique.fasta, name=454Reads_archaea.shhh.trim.unique.names)
Using 2 processors.

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 218 218 0 3 1
2.5%-tile: 1 251 251 0 3 1103
25%-tile: 1 268 268 0 4 11023
Median: 1 274 274 0 4 22046
75%-tile: 1 281 281 0 4 33069
97.5%-tile: 1 297 297 0 5 42989
Maximum: 1 333 333 0 8 44091
Mean: 1 273.837 273.837 0 4.15944
# of unique seqs: 11449
total # of seqs: 44091

Output File Name:
454Reads_archaea.shhh.trim.unique.summary

Submit to RDP database, check and filter bacteria sequences!

http://rdp.cme.msu.edu/classifier/cl_status.jsp

domain Bacteria (1435 sequences)

shenzy@shenzy-ubuntu:/winxp_disk2/shenzy/xiaoying_work_archaea/16s_archaea_allsamples/shhh_pipe2$ wc allrank_454Reads_archaea.shhh.trim.unique.fasta_classified.txt
1435 1520 200784 allrank_454Reads_archaea.shhh.trim.unique.fasta_classified.txt

./filter_bacterseqs_for_align.py -i allrank_454Reads_archaea.shhh.trim.unique.fasta_classified.txt -f 454Reads_archaea.shhh.trim.unique.fasta -n 454Reads_archaea.shhh.trim.unique.names -g 454Reads_archaea.shhh.groups

shenzy@shenzy-ubuntu:/winxp_disk2/shenzy/xiaoying_work_archaea/16s_archaea_allsamples/shhh_pipe2$ wc 454Reads_archaea.shhh.groups.filter
42187 84374 1224728 454Reads_archaea.shhh.groups.filter
shenzy@shenzy-ubuntu:/winxp_disk2/shenzy/xiaoying_work_archaea/16s_archaea_allsamples/shhh_pipe2$ wc 454Reads_archaea.shhh.groups
44091 88182 1270431 454Reads_archaea.shhh.groups

mothur > summary.seqs(fasta=454Reads_archaea.shhh.trim.unique.fasta.filter, name=454Reads_archaea.shhh.trim.unique.names.filter)
Using 2 processors.

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 218 218 0 3 1
2.5%-tile: 1 256 256 0 3 1055
25%-tile: 1 268 268 0 4 10547
Median: 1 274 274 0 4 21094
75%-tile: 1 282 282 0 4 31641
97.5%-tile: 1 297 297 0 5 41133
Maximum: 1 333 333 0 8 42187
Mean: 1 274.543 274.543 0 4.15355
# of unique seqs: 10014
total # of seqs: 42187
mothur > screen.seqs(fasta=454Reads_archaea.shhh.trim.unique.fasta.align, name=454Reads_archaea.shhh.trim.unique.names.filter, group=454Reads_archaea.shhh.groups.filter, processors=2)
Output File Name:
454Reads_archaea.shhh.trim.unique.fasta.summary
###mothur > align.seqs(candidate=454Reads_archaea.shhh.trim.unique.fasta.filter, template=core_set_aligned.imputed.fasta, flip=T, ksize=9, align=needleman, gapopen=-1, processors=3)
###

mothur > align.seqs(fasta=454Reads_archaea.shhh.trim.unique.fasta.filter, reference=core_set_aligned.fasta.imputed, flip=T, processors=3)
Using 3 processors.

Reading in the core_set_aligned.fasta.imputed template sequences… DONE.
It took 1 to read 4938 sequences.
Aligning sequences from 454Reads_archaea.shhh.trim.unique.fasta.filter …
100
…
3338
Some of you sequences generated alignments that eliminated too many bases, a list is provided in 454Reads_archaea.shhh.trim.unique.fasta.flip.accnos. If the reverse compliment proved to be better it was reported.
It took 60 secs to align 10014 sequences.
Output File Names:
454Reads_archaea.shhh.trim.unique.fasta.align
454Reads_archaea.shhh.trim.unique.fasta.align.report
454Reads_archaea.shhh.trim.unique.fasta.flip.accnos

mothur > summary.seqs(fasta=454Reads_archaea.shhh.trim.unique.fasta.align, name=454Reads_archaea.shhh.trim.unique.names.filter)
Using 3 processors.

Start End NBases Ambigs Polymer NumSeqs
Minimum: 86 98 2 0 1 1
2.5%-tile: 132 1746 51 0 3 1055
25%-tile: 136 1822 268 0 4 10547
Median: 136 1834 274 0 4 21094
75%-tile: 136 1850 282 0 4 31641
97.5%-tile: 194 1887 297 0 5 41133
Maximum: 6858 6885 313 0 8 42187
Mean: 284.168 1920.46 266.145 0 4.10781
# of unique seqs: 10014
total # of seqs: 42187

Output File Name:
454Reads_archaea.shhh.trim.unique.fasta.summary
##mothur > screen.seqs(fasta=454Reads_archaea.shhh.trim.unique.fasta.align, name=454Reads_archaea.shhh.trim.unique.names.filter, group=454Reads_archaea.shhh.groups.filter, ##start=136, optimize=end, criteria=90, processors=2)
#The optimize and criteria parameters allow you to set the start, end, maxambig, maxhomop, minlength and maxlength parameters relative to your set of sequences.
#For example optimize=start-end, criteria=90, would set the start and end values to the position 90% of your sequences started and ended.

mothur > screen.seqs(fasta=454Reads_archaea.shhh.trim.unique.fasta.align, name=454Reads_archaea.shhh.trim.unique.names.filter, group=454Reads_archaea.shhh.groups.filter, optimize=start-end, criteria=90, processors=4)
…
Output File Names:
454Reads_archaea.shhh.trim.unique.fasta.good.align
454Reads_archaea.shhh.trim.unique.fasta.bad.accnos
454Reads_archaea.shhh.trim.unique.names.good.filter
454Reads_archaea.shhh.groups.good.filter
It took 4 secs to screen 10014 sequences.

mothur > summary.seqs(fasta=454Reads_archaea.shhh.trim.unique.fasta.good.align, name=454Reads_archaea.shhh.trim.unique.names.good.filter)

Using 4 processors.

Start End NBases Ambigs Polymer NumSeqs
Minimum: 107 1819 243 0 3 1
2.5%-tile: 133 1821 263 0 3 925
25%-tile: 136 1831 269 0 4 9242
Median: 136 1836 274 0 4 18484
75%-tile: 136 1853 283 0 4 27725
97.5%-tile: 136 1871 298 0 5 36042
Maximum: 136 1920 313 0 8 36966
Mean: 135.731 1840.24 276.401 0 4.14224
# of unique seqs: 7703
total # of seqs: 36966

Output File Name:
454Reads_archaea.shhh.trim.unique.fasta.good.summary

mothur > filter.seqs(fasta=454Reads_archaea.shhh.trim.unique.fasta.good.align, vertical=T, trump=., processors=2)
3700
3800
3851

Length of filtered alignment: 486
Number of columns removed: 7196
Length of the original alignment: 7682
Number of sequences used to construct filter: 7703

Output File Names:
454Reads_archaea.filter
454Reads_archaea.shhh.trim.unique.fasta.good.filter.fasta
mothur > unique.seqs(fasta=454Reads_archaea.shhh.trim.unique.fasta.good.filter.fasta, name=454Reads_archaea.shhh.trim.unique.names.good.filter)

1000 974
2000 1887
3000 2768
4000 3604
5000 4424
6000 5238
7000 6017
7703 6573

Output File Names:
454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.fasta
454Reads_archaea.shhh.trim.unique.fasta.good.filter.names

mothur > shhh.seqs(fasta=454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.fasta, name=454Reads_archaea.shhh.trim.unique.fasta.good.filter.names, group=454Reads_archaea.shhh.groups.good.filter, processors=3)
Output File Names:
454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.unique.fasta
454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.unique.names

/******************************************/

Output File Names:
454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh.Eric_ARC.map
454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh.GZ1122_ARC.map
454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh.GZ1122cellulose_ARC.map
…….

mothur > summary.seqs(fasta=454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.fasta, name=454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.names)
Using 2 processors.

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 484 219 0 3 1
2.5%-tile: 1 486 260 0 3 925
25%-tile: 1 486 261 0 4 9242
Median: 1 486 261 0 4 18484
75%-tile: 1 486 266 0 4 27725
97.5%-tile: 1 486 266 0 5 36042
Maximum: 3 486 282 0 7 36966
Mean: 1.00103 486 262.434 0 4.1387
# of unique seqs: 2911
total # of seqs: 36966

Output File Name:
454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.summary

mothur > chimera.uchime(fasta=454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.fasta, name=454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.names, group=454Reads_archaea.shhh.groups.good.filter, processors=3)
It took 0 secs to check 46 sequences from group YL_xylan_ARC.

It took 43 secs to check 3276 sequences. 362 chimeras were found.
The number of sequences checked may be larger than the number of unique sequences because some sequences are found in several samples.

Output File Names:
454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.uchime.chimeras
454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.uchime.accnos
##############################
shenzy@shenzy-ubuntu:/winxp_disk2/shenzy/xiaoying_work_archaea/16s_archaea_allsamples/shhh_pipe2$ mv 454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.uchime.accnos 454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.uchime.accnos.self
shenzy@shenzy-ubuntu:/winxp_disk2/shenzy/xiaoying_work_archaea/16s_archaea_allsamples/shhh_pipe2$ mv 454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.uchime.chimeras 454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.uchime.chimeras.self
################################

mothur >chimera.uchime(fasta=454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.fasta, reference=core_set_aligned.fasta.imputed, processors=3)
05:04 26Mb 100.0% 30/969 chimeras found (3.1%)
05:11 26Mb 100.0% 88/970 chimeras found (9.1%)

It took 311 secs to check 2911 sequences. 213 chimeras were found.

Output File Names:
454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.uchime.chimeras
454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.uchime.accnos

###################################
shenzy@shenzy-ubuntu:/winxp_disk2/shenzy/xiaoying_work_archaea/16s_archaea_allsamples/shhh_pipe2$ wc 454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.uchime.accnos
213 213 3195 454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.uchime.accnos
shenzy@shenzy-ubuntu:/winxp_disk2/shenzy/xiaoying_work_archaea/16s_archaea_allsamples/shhh_pipe2$ wc 454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.uchime.accnos.self
362 362 5430 454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.uchime.accnos.self
cat 454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.uchime.accnos 454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.uchime.accnos.self > 454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.uchime.accnos.sum

sort 454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.uchime.accnos.sum > 454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.uchime.accnos.sum.sort

Merge the two chimera prediction results and remove the duplicate names:
shenzy@shenzy-ubuntu:/winxp_disk2/shenzy/xiaoying_work_archaea/16s_archaea_allsamples/shhh_pipe2$ sed '$!N; /^\(.*\)\n\1$/!P; D' 454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.uchime.accnos.sum.sort > 454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.uchime.accnos.sum.sort.uniq
shenzy@shenzy-ubuntu:/winxp_disk2/shenzy/xiaoying_work_archaea/16s_archaea_allsamples/shhh_pipe2$ wc 454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.uchime.accnos.sum.sort.uniq
382 382 5730 454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.uchime.accnos.sum.sort.uniq
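
The cat, sort and sed steps above amount to taking the union of two name lists. A minimal R sketch of the same idea, using made-up sequence names in place of the two accnos files:

```r
# Hypothetical sequence names standing in for the two accnos files.
by_reference <- c("seqA", "seqB", "seqC")
by_self      <- c("seqB", "seqC", "seqD", "seqE")

# Concatenate, drop duplicates, and sort: the union of both predictions.
merged <- sort(unique(c(by_reference, by_self)))
print(merged)

# Inclusion-exclusion on the counts reported above: 213 and 362 names give
# 382 unique names, so 213 + 362 - 382 were flagged by both approaches.
overlap <- 213 + 362 - 382
print(overlap)  # 193
```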
####################################

get_fasta_from_seqname.py -i 454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.uchime.accnos.sum.sort.uniq -j 454Reads_archaea.fasta > 454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.uchime.accnos.sum.sort.uniq.fasta

Check the candidate chimera sequences with the RDP classifier (http://rdp.cme.msu.edu/classifier/classifier.jsp)
Look at the genus-level confidence (the last percentage on each line); if it is >= 90%, keep the sequence and merge it back into the non-chimera reads of its sample.
shenzy@shenzy-ubuntu:/winxp_disk2/shenzy/xiaoying_work_archaea/16s_archaea_allsamples/shhh_pipe2$ more allrank_454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.uchime.accnos.sum.sort.uniq.fasta_classified.txt
HQ93PQ301A0ZHW;;Root;100%;Archaea;100%;"Euryarchaeota";100%;"Methanomicrobia";100%;Methanomicrobiales;100%;Methanospirillaceae;100%;Methanospirillum;100%
HQ93PQ301A1JYN;;Root;100%;Archaea;100%;"Euryarchaeota";100%;"Methanomicrobia";100%;Methanosarcinales;100%;Methanosarcinaceae;100%;Methanosarcina;100%
shenzy@shenzy-ubuntu:/winxp_disk2/shenzy/xiaoying_work_archaea/16s_archaea_allsamples/shhh_pipe2$ check_real_chimera_seq.py -i allrank_454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.uchime.accnos.sum.sort.uniq.fasta_classified.txt -d 90 | wc
221 221 3315

These 221 sequences should be merged back into the non-chimera results, which leaves 382 - 221 = 161 names in chimera.seqs.name.

-rwxrwxrwx 1 root root 2.4K 2012-08-14 11:19 chimera.seqs.name
shenzy@shenzy-ubuntu:/winxp_disk2/shenzy/xiaoying_work_archaea/16s_archaea_allsamples/shhh_pipe2$ wc chimera.seqs.name
161 161 2415 chimera.seqs.name
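
The source of check_real_chimera_seq.py is not shown, so the following is only a guess at the kind of filter it applies: read the confidence attached to the last rank on each classifier line and keep names that reach the cutoff. The function name and the logic are assumptions for illustration; the two input lines are copied from the output above.

```r
# Two classifier lines copied from the output above.
rdp_lines <- c(
  'HQ93PQ301A0ZHW;;Root;100%;Archaea;100%;"Euryarchaeota";100%;"Methanomicrobia";100%;Methanomicrobiales;100%;Methanospirillaceae;100%;Methanospirillum;100%',
  'HQ93PQ301A1JYN;;Root;100%;Archaea;100%;"Euryarchaeota";100%;"Methanomicrobia";100%;Methanosarcinales;100%;Methanosarcinaceae;100%;Methanosarcina;100%'
)

# Hypothetical helper: confidence (in percent) of the last rank on a line.
last_rank_confidence <- function(line) {
  fields <- strsplit(line, ";", fixed = TRUE)[[1]]
  as.numeric(sub("%", "", fields[length(fields)], fixed = TRUE))
}

conf   <- vapply(rdp_lines, last_rank_confidence, numeric(1), USE.NAMES = FALSE)
seq_id <- sub(";.*$", "", rdp_lines)

# Names at or above the cutoff are treated as confidently classified and
# kept, mirroring the -d 90 option used above.
cutoff <- 90
keep   <- seq_id[conf >= cutoff]
print(keep)

# Bookkeeping from the reported counts: 382 candidates, 221 kept.
print(382 - 221)  # 161
```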

######################################################################
Removing chimeras (all sequences predicted as chimeras by the two approaches)
######################################################################
mothur > remove.seqs(accnos=chimera.seqs.name, fasta=454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.fasta, name=454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.names, group=454Reads_archaea.shhh.groups.good.filter)

Removed 1197 sequences from your name file.
Removed 161 sequences from your fasta file.
Removed 1197 sequences from your group file.

Output File Names:
454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.pick.names
454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.pick.fasta
454Reads_archaea.shhh.groups.good.pick.filter
mothur > summary.seqs(name=current)

Using 454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.pick.names as input file for the name parameter.
Using 454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.pick.fasta as input file for the fasta parameter.

Using 3 processors.

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 484 219 0 3 1
2.5%-tile: 1 486 260 0 3 895
25%-tile: 1 486 261 0 4 8943
Median: 1 486 261 0 4 17885
75%-tile: 1 486 266 0 4 26827
97.5%-tile: 1 486 266 0 5 34875
Maximum: 3 486 278 0 7 35769
Mean: 1.00106 486 262.504 0 4.16914
# of unique seqs: 2750
total # of seqs: 35769

Output File Name:
454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.pick.summary
#############################
chimera number
##############################
./compute_chimera_for_singlesample.py -i 454Reads_Bacteria.Eric_BAC.shhh.groups -j chimera.seqs.name

GZ1122_ARC: 10
GZ1122cellulose_ARC: 3
GZ_ARC: 1
GZ_cellulose55_ARC: 1
GZ_xylan_ARC: 7
GZ_xylose_ARC: 4
SHX1125cellulose_ARC: 0
SHX1219_ARC: 0
SHX_ARC: 5
SHX_cellulose55_ARC: 8
SHX_xylan_ARC: 3
SHX_xylose_ARC: 35
ST0303cellulose_ARC: 15
ST1219_ARC: 5
ST_ARC: 21
ST_xylan_ARC: 11
TP_1201_ARC: 0
TP_xylan_ARC: 19
YL0203cellulose_ARC: 7
YL_ARC: 2
YL_xylan_ARC: 3

#######################
Removing “contaminants”
#######################
wget http://www.mothur.org/w/images/5/59/Trainset9_032012.pds.zip
shenzy@shenzy-ubuntu:/winxp_disk2/shenzy/xiaoying_work_archaea/16s_archaea_allsamples/shhh_pipe2$ unzip Trainset9_032012.pds.zip
Archive: Trainset9_032012.pds.zip
inflating: trainset9_032012.pds.tax
inflating: trainset9_032012.pds.fasta

mothur > classify.seqs(fasta=454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.pick.fasta, name=454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.pick.names, group=454Reads_archaea.shhh.groups.good.pick.filter, template=trainset9_032012.pds.fasta, taxonomy=trainset9_032012.pds.tax, cutoff=80, processors=2)
….
Processing sequence: 1300
Processing sequence: 1300
[WARNING]: HQ93PQ301C4CTV could not be classified. You can use the remove.lineage command with taxon=unknown; to remove such sequences.
[WARNING]: HQ93PQ301CK6YP could not be classified. You can use the remove.lineage command with taxon=unknown; to remove such sequences.
[WARNING]: HQ93PQ301DJMTI could not be classified. You can use the remove.lineage command with taxon=unknown; to remove such sequences.
[WARNING]: HQ93PQ301ERRC6 could not be classified. You can use the remove.lineage command with taxon=unknown; to remove such sequences.
Processing sequence: 1372
Processing sequence: 1371
It took 25 secs to classify 2750 sequences.

Reading 454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.pick.names… Done.

It took 3 secs to create the summary file for 2750 sequences.
Output File Names:
454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.pick.pds.taxonomy
454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.pick.pds.flip.accnos
454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.pick.pds.tax.summary

mothur > remove.lineage(fasta=454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.pick.fasta, name=454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.pick.names, group=454Reads_archaea.shhh.groups.good.pick.filter, taxonomy=454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.pick.pds.taxonomy, taxon=Mitochondria-Cyanobacteria_Chloroplast-Eukarya-Bacteria-unknown)
Output File Names:
454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.pick.pds.pick.taxonomy
454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.pick.pick.names
454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.pick.pick.fasta
454Reads_archaea.shhh.groups.good.pick.pick.filter
mothur > summary.seqs(name=current)

Using 454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.pick.pick.names as input file for the name parameter.
Using 454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.pick.pick.fasta as input file for the fasta parameter.

Using 2 processors.

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 484 219 0 3 1
2.5%-tile: 1 486 260 0 3 852
25%-tile: 1 486 261 0 4 8519
Median: 1 486 261 0 4 17037
75%-tile: 1 486 266 0 4 25555
97.5%-tile: 1 486 266 0 5 33221
Maximum: 1 486 278 0 7 34072
Mean: 1 486 262.598 0 4.21437
# of unique seqs: 2644
total # of seqs: 34072

Output File Name:
454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.pick.pick.summary
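
To see how much remove.lineage actually took out, the two summary.seqs reports can be compared directly. This is only a small arithmetic check on the totals printed above; the variable names are my own.

```python
# Totals reported by summary.seqs before and after remove.lineage
before_total, before_unique = 35769, 2750
after_total, after_unique = 34072, 2644

removed_total = before_total - after_total
removed_unique = before_unique - after_unique
pct_removed = 100.0 * removed_total / before_total

print(f"removed {removed_total} reads ({pct_removed:.2f}%), "
      f"{removed_unique} unique sequences")
```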

############################################################################################
mothur > system(cp 454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.pick.pds.pick.taxonomy archaea_16s_final.taxonomy)
mothur > system(cp 454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.pick.pick.names archaea_16s_final.names)
mothur > system(cp 454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.pick.pick.fasta archaea_16s_final.fasta)
mothur > system(cp 454Reads_archaea.shhh.groups.good.pick.pick.filter archaea_16s_final.groups)
mothur > dist.seqs(fasta=archaea_16s_final.fasta, cutoff=0.1, processors=3)

Output File Name:
archaea_16s_final.dist

It took 25 to calculate the distances for 2644 sequences.

mothur > cluster(column=archaea_16s_final.dist, name=archaea_16s_final.names)
changed cutoff to 0.0392497

Output File Names:
archaea_16s_final.an.sabund
archaea_16s_final.an.rabund
archaea_16s_final.an.list

It took 9 seconds to cluster

mothur > make.shared(list=archaea_16s_final.an.list, group=archaea_16s_final.groups)

unique
0.01
0.02
0.03

Output File Names:
archaea_16s_final.an.shared
archaea_16s_final.an.Eric_ARC.rabund
archaea_16s_final.an.GZ1122_ARC.rabund
archaea_16s_final.an.GZ1122cellulose_ARC.rabund
archaea_16s_final.an.GZ_ARC.rabund
archaea_16s_final.an.GZ_cellulose55_ARC.rabund
archaea_16s_final.an.GZ_xylan_ARC.rabund
archaea_16s_final.an.GZ_xylose_ARC.rabund
archaea_16s_final.an.Milk_D_ARC.rabund
archaea_16s_final.an.Milk_E_ARC.rabund
archaea_16s_final.an.SHX1125cellulose_ARC.rabund
archaea_16s_final.an.SHX1219_ARC.rabund
archaea_16s_final.an.SHX_ARC.rabund
archaea_16s_final.an.SHX_cellulose55_ARC.rabund
archaea_16s_final.an.SHX_xylan_ARC.rabund
archaea_16s_final.an.SHX_xylose_ARC.rabund
archaea_16s_final.an.ST0303cellulose_ARC.rabund
archaea_16s_final.an.ST1219_ARC.rabund
archaea_16s_final.an.ST_ARC.rabund
archaea_16s_final.an.ST_xylan_ARC.rabund
archaea_16s_final.an.TP_1201_ARC.rabund
archaea_16s_final.an.TP_xylan_ARC.rabund
archaea_16s_final.an.YL0203cellulose_ARC.rabund
archaea_16s_final.an.YL_ARC.rabund
archaea_16s_final.an.YL_xylan_ARC.rabund
mothur > count.groups()

Using archaea_16s_final.an.shared as input file for the shared parameter.
Eric_ARC contains 14.
GZ1122_ARC contains 1780.
GZ1122cellulose_ARC contains 1063.
GZ_ARC contains 53.
GZ_cellulose55_ARC contains 1997.
GZ_xylan_ARC contains 1509.
GZ_xylose_ARC contains 1241.
Milk_D_ARC contains 19.
Milk_E_ARC contains 434.
SHX1125cellulose_ARC contains 2568.
SHX1219_ARC contains 2012.
SHX_ARC contains 1594.
SHX_cellulose55_ARC contains 2235.
SHX_xylan_ARC contains 944.
SHX_xylose_ARC contains 1932.
ST0303cellulose_ARC contains 1815.
ST1219_ARC contains 1597.
ST_ARC contains 774.
ST_xylan_ARC contains 1755.
TP_1201_ARC contains 1952.
TP_xylan_ARC contains 1849.
YL0203cellulose_ARC contains 1762.
YL_ARC contains 1154.
YL_xylan_ARC contains 2019.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^old^^^^^^^^^^^^^
Using archaea_16s_final.an.shared as input file for the shared parameter.
Eric_ARC contains 15.***********
GZ1122_ARC contains 1815.
GZ1122cellulose_ARC contains 1081.
GZ_ARC contains 54.
GZ_cellulose55_ARC contains 1997.
GZ_xylan_ARC contains 1547.
GZ_xylose_ARC contains 1245.
Milk_D_ARC contains 18. *********
Milk_E_ARC contains 434. ********
SHX1125cellulose_ARC contains 2570.
SHX1219_ARC contains 2012.
SHX_ARC contains 1593.
SHX_cellulose55_ARC contains 2236.
SHX_xylan_ARC contains 947.
SHX_xylose_ARC contains 1932.
ST0303cellulose_ARC contains 1810.
ST1219_ARC contains 1597.
ST_ARC contains 759.
ST_xylan_ARC contains 1755.
TP_1201_ARC contains 1952.
TP_xylan_ARC contains 1849.
YL0203cellulose_ARC contains 1762.
YL_ARC contains 1164.
YL_xylan_ARC contains 2019.

mothur > count.groups()

Using archaea_16s_final.an.shared as input file for the shared parameter.
Eric_ARC contains 569.
GZ1122_ARC contains 2103.
GZ1122cellulose_ARC contains 1594.
GZ_ARC contains 530.
GZ_cellulose55_ARC contains 2001.
GZ_xylan_ARC contains 1889.
GZ_xylose_ARC contains 2015.
Milk_D_ARC contains 598.
Milk_E_ARC contains 1753.
SHX1125cellulose_ARC contains 2831.
SHX1219_ARC contains 2247.
SHX_ARC contains 1660.
SHX_cellulose55_ARC contains 2249.
SHX_xylan_ARC contains 1213.
SHX_xylose_ARC contains 1991.
ST0303cellulose_ARC contains 1845.
ST1219_ARC contains 1621.
ST_ARC contains 1859.
ST_xylan_ARC contains 1769.
TP_1201_ARC contains 1969.
TP_xylan_ARC contains 1890.
YL0203cellulose_ARC contains 1785.
YL_ARC contains 1285.
YL_xylan_ARC contains 2025.

mothur > sub.sample(shared=archaea_16s_final.an.shared, size=759)

Eric_ARC contains 15. Eliminating.
GZ_ARC contains 54. Eliminating.
Milk_D_ARC contains 18. Eliminating.
Milk_E_ARC contains 434. Eliminating.
Sampling 759 from each group.
unique
0.01
0.02
0.03

Output File Names:
archaea_16s_final.an.uniquesubsample.shared
archaea_16s_final.an.0.01subsample.shared
archaea_16s_final.an.0.02subsample.shared
archaea_16s_final.an.0.03subsample.shared
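
The idea behind sub.sample can be sketched as follows: groups with fewer reads than the requested size are dropped, and every remaining group is randomly thinned to exactly that many reads, drawn without replacement. This is a simplified illustration and not mothur's implementation; the function names and the toy table are hypothetical.

```python
import random

def subsample_group(otu_counts, size, rng):
    """Draw `size` reads without replacement from one group's OTU abundance vector."""
    reads = [otu for otu, n in enumerate(otu_counts) for _ in range(n)]
    picked = rng.sample(reads, size)
    out = [0] * len(otu_counts)
    for otu in picked:
        out[otu] += 1
    return out

def subsample_shared(shared, size, seed=0):
    """Return (subsampled table, eliminated group names)."""
    rng = random.Random(seed)
    kept, eliminated = {}, []
    for group, counts in shared.items():
        if sum(counts) < size:
            eliminated.append(group)  # too few reads, like "Eliminating." in the log
        else:
            kept[group] = subsample_group(counts, size, rng)
    return kept, eliminated

# Toy shared table (hypothetical OTU counts per group)
shared = {"big": [500, 300, 200], "exact": [700, 59, 0], "small": [10, 5, 0]}
kept, eliminated = subsample_shared(shared, size=759)
print("eliminated:", eliminated)
for group, counts in kept.items():
    print(group, counts, "total =", sum(counts))
```

In the run above, size=759 equals the read count of the smallest group that was kept (ST_ARC), so that group passes through unchanged and the four smaller groups are eliminated.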
mothur > classify.otu(list=archaea_16s_final.an.list, name=archaea_16s_final.names, taxonomy=archaea_16s_final.taxonomy)

reftaxonomy is not required, but if given will keep the rankIDs in the summary file static.
unique 2636
0.01 1940
0.02 1033
0.03 634

Output File Names:
archaea_16s_final.an.uniquecons.taxonomy
archaea_16s_final.an.uniquecons.tax.summary
archaea_16s_final.an.0.01cons.taxonomy
archaea_16s_final.an.0.01cons.tax.summary
archaea_16s_final.an.0.02cons.taxonomy
archaea_16s_final.an.0.02cons.tax.summary
archaea_16s_final.an.0.03cons.taxonomy
archaea_16s_final.an.0.03cons.tax.summary

My PROJECT

szypanther — Wed, 30 May 2012 03:23:48 +0000

amoA project

Sequencing of amoA samples A2 and A3

(1) Use Geneious to process the raw sequence data, yielding 635 bp DNA sequences

(2) Use MAFFT to align the sequences

(3) mothur processing pipeline

mothur > dist.seqs(fasta=AMOA_A2A3.mafft.align.shortname.fasta, output=lt)

0 0
59 0

Output File Name:
AMOA_A2A3.mafft.align.shortname.phylip.dist

It took 0 to calculate the distances for 60 sequences.

mothur > cluster(phylip=AMOA_A2A3.mafft.align.shortname.phylip.dist, cutoff=0.05)

********************#****#****#****#****#****#****#****#****#****#****#
Reading matrix: ||||||||||||||||||||||||||||||||||||||||||||||||||||
***********************************************************************
unique 4 49 2 1 1
0.01 11 22 4 3 0 2 0 0 0 0 0 1
0.02 16 15 4 2 0 0 0 0 0 0 0 0 0 0 0 1 1
0.03 36 10 3 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
0.04 38 9 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
changed cutoff to 0.0352006

Output File Names:
AMOA_A2A3.mafft.align.shortname.phylip.an.sabund
AMOA_A2A3.mafft.align.shortname.phylip.an.rabund
AMOA_A2A3.mafft.align.shortname.phylip.an.list

It took 0 seconds to cluster
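
The lines that cluster prints per distance level follow the sabund layout: the label, the largest OTU size, then the number of OTUs containing 1, 2, 3, ... sequences. A short sketch, assuming that layout, that recovers the OTU and sequence totals from the "unique" and "0.01" lines above (the function names are my own):

```python
def parse_sabund(line):
    """Split a sabund line into (label, counts), where counts[i] is the number of OTUs of size i+1."""
    fields = line.split()
    label, max_rank = fields[0], int(fields[1])
    counts = [int(x) for x in fields[2:2 + max_rank]]
    return label, counts

def summarize(counts):
    """Return (number of OTUs, number of sequences) for one sabund line."""
    n_otus = sum(counts)
    n_seqs = sum(size * n for size, n in enumerate(counts, start=1))
    return n_otus, n_seqs

for line in ("unique 4 49 2 1 1", "0.01 11 22 4 3 0 2 0 0 0 0 0 1"):
    label, counts = parse_sabund(line)
    n_otus, n_seqs = summarize(counts)
    print(f"{label}: {n_otus} OTUs from {n_seqs} sequences")
```

Every level should add up to the same 60 sequences that dist.seqs reported, which makes this a handy sanity check on the clustering output.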

mothur > unique.seqs(fasta=AMOA_A2A3.mafft.align.shortname.fasta)

60 53

Output File Names:
AMOA_A2A3.mafft.align.shortname.unique.fasta
AMOA_A2A3.mafft.align.shortname.names
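
What unique.seqs does (60 sequences collapsed to 53 unique ones here) can be sketched as plain dereplication: identical sequences are merged under the first name seen, and a names mapping records which reads each representative stands for. This is a simplified illustration with hypothetical toy sequences, not mothur's code.

```python
def unique_seqs(records):
    """records: list of (name, sequence). Returns (unique_fasta, names_map)."""
    rep_for_seq = {}   # sequence -> representative name (first one seen)
    names = {}         # representative name -> all names sharing that sequence
    for name, seq in records:
        rep = rep_for_seq.setdefault(seq, name)
        names.setdefault(rep, []).append(name)
    unique_fasta = [(rep, seq) for seq, rep in rep_for_seq.items()]
    return unique_fasta, names

# Toy alignment (hypothetical reads)
records = [("s1", "ACGT"), ("s2", "ACGA"), ("s3", "ACGT")]
unique_fasta, names = unique_seqs(records)
print(len(records), len(unique_fasta))
for rep, members in names.items():
    # names-file style line: representative, tab, comma-separated members
    print(f"{rep}\t{','.join(members)}")
```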

mothur > parse.list(list=AMOA_A2A3.mafft.align.shortname.phylip.an.list, group=AMOA_A2A3.mafft.align.shortname.group)

unique
0.01
0.02
0.03
0.04

Output File Names:
AMOA_A2A3.mafft.align.shortname.phylip.an.A2.list
AMOA_A2A3.mafft.align.shortname.phylip.an.A3.list

mothur > rarefaction.single(list=AMOA_A2A3.mafft.align.shortname.phylip.an.A3.list, iters=10000, freq=0.10, calc=ace-sobs-chao, processors=3)

unique
0.01
0.02
0.03
0.04

Output File Names:
AMOA_A2A3.mafft.align.shortname.phylip.an.A3.r_ace
AMOA_A2A3.mafft.align.shortname.phylip.an.A3.rarefaction
AMOA_A2A3.mafft.align.shortname.phylip.an.A3.r_chao
mothur > rarefaction.single(list=AMOA_A2A3.mafft.align.shortname.phylip.an.A2.list, iters=10000, freq=0.10, calc=ace-sobs-chao, processors=3)

unique
0.01
0.02
0.03
0.04

Output File Names:
AMOA_A2A3.mafft.align.shortname.phylip.an.A2.r_ace
AMOA_A2A3.mafft.align.shortname.phylip.an.A2.rarefaction
AMOA_A2A3.mafft.align.shortname.phylip.an.A2.r_chao
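
For intuition about what the .rarefaction files contain: a rarefaction curve gives the expected number of OTUs seen when only n of the N sequences are sampled. mothur estimates this by repeated random resampling (the iters parameter); the sketch below instead uses the closed-form hypergeometric expectation, which is what that resampling converges to. The abundances are taken from the pooled A2+A3 "0.01" line printed by cluster above, so the numbers will not match the per-sample A2 and A3 files.

```python
from math import comb

def expected_richness(abundances, n):
    """Expected number of OTUs observed in a random subsample of n sequences."""
    total = sum(abundances)
    return sum(1 - comb(total - a, n) / comb(total, n) for a in abundances)

# Pooled OTU sizes at the 0.01 level: 22 singletons, 4 doubletons,
# 3 OTUs of size 3, 2 of size 5, and 1 of size 11 (60 sequences in total)
abundances = [1] * 22 + [2] * 4 + [3] * 3 + [5] * 2 + [11]

# Report every 10% of the sequences, in the spirit of freq=0.10
curve = {n: expected_richness(abundances, n) for n in range(0, 61, 6)}
for n, s in curve.items():
    print(f"{n}\t{s:.2f}")
```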

mothur > rarefaction.single(list=AMOA_A2A3.mafft.align.shortname.phylip.an.A3.list, iters=10000, freq=0.10, calc=shannon, processors=3)

unique
0.01
0.02
0.03
0.04

Output File Names:
AMOA_A2A3.mafft.align.shortname.phylip.an.A3.r_shannon
mothur > rarefaction.single(list=AMOA_A2A3.mafft.align.shortname.phylip.an.A2.list, iters=10000, freq=0.10, calc=shannon, processors=3)

unique
0.01
0.02
0.03
0.04

Output File Names:
AMOA_A2A3.mafft.align.shortname.phylip.an.A2.r_shannon
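
The Shannon index reported in the r_shannon files is H' = -sum(p_i * ln p_i), where p_i is the fraction of sequences in OTU i. A minimal sketch of that formula, applied to the pooled A2+A3 OTU sizes from the "0.01" line of the cluster output (so it is not the per-sample value mothur writes out):

```python
from math import log

def shannon(abundances):
    """Shannon diversity index using the natural logarithm."""
    total = sum(abundances)
    return -sum((a / total) * log(a / total) for a in abundances if a > 0)

# Pooled OTU sizes at the 0.01 level (60 sequences, 32 OTUs)
pooled = [1] * 22 + [2] * 4 + [3] * 3 + [5] * 2 + [11]
print(f"H' = {shannon(pooled):.3f}")
```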

mothur > get.oturep(phylip=AMOA_A2A3.mafft.align.shortname.phylip.dist, fasta=AMOA_A2A3.mafft.align.shortname.fasta, list=AMOA_A2A3.mafft.align.shortname.phylip.an.A2.list, group=AMOA_A2A3.mafft.align.shortname.group)

Output File Names:
AMOA_A2A3.mafft.align.shortname.phylip.an.A2.unique.rep.names
AMOA_A2A3.mafft.align.shortname.phylip.an.A2.0.01.rep.names
AMOA_A2A3.mafft.align.shortname.phylip.an.A2.0.02.rep.names
AMOA_A2A3.mafft.align.shortname.phylip.an.A2.0.03.rep.names
AMOA_A2A3.mafft.align.shortname.phylip.an.A2.0.04.rep.names
AMOA_A2A3.mafft.align.shortname.phylip.an.A2.0.01.rep.fasta
AMOA_A2A3.mafft.align.shortname.phylip.an.A2.0.02.rep.fasta
AMOA_A2A3.mafft.align.shortname.phylip.an.A2.0.03.rep.fasta
AMOA_A2A3.mafft.align.shortname.phylip.an.A2.0.04.rep.fasta
AMOA_A2A3.mafft.align.shortname.phylip.an.A2.unique.rep.fasta
mothur > get.oturep(phylip=AMOA_A2A3.mafft.align.shortname.phylip.dist, fasta=AMOA_A2A3.mafft.align.shortname.fasta, list=AMOA_A2A3.mafft.align.shortname.phylip.an.A3.list, group=AMOA_A2A3.mafft.align.shortname.group)

Output File Names:
AMOA_A2A3.mafft.align.shortname.phylip.an.A3.unique.rep.names
AMOA_A2A3.mafft.align.shortname.phylip.an.A3.0.01.rep.names
AMOA_A2A3.mafft.align.shortname.phylip.an.A3.0.02.rep.names
AMOA_A2A3.mafft.align.shortname.phylip.an.A3.0.03.rep.names
AMOA_A2A3.mafft.align.shortname.phylip.an.A3.0.04.rep.names
AMOA_A2A3.mafft.align.shortname.phylip.an.A3.0.01.rep.fasta
AMOA_A2A3.mafft.align.shortname.phylip.an.A3.0.02.rep.fasta
AMOA_A2A3.mafft.align.shortname.phylip.an.A3.0.03.rep.fasta
AMOA_A2A3.mafft.align.shortname.phylip.an.A3.0.04.rep.fasta
AMOA_A2A3.mafft.align.shortname.phylip.an.A3.unique.rep.fasta

Plotting in RStudio

> setwd("/home/shenzy/Desktop/amoA_work")
> data0<-read.table(file="AMOA_A2A3.mafft.align.shortname.phylip.an.A2.rarefaction",header=T)
> data1<-read.table(file="AMOA_A2A3.mafft.align.shortname.phylip.an.A3.rarefaction",header=T)
> plot(x=data0$numsampled, y=data0$X0.03, xlab="Number of clones analyzed",ylab="Number of OTUs observed", type="l", col="green", xlim=c(0,30),ylim=c(0,15))
> points(x=data1$numsampled, y=data1$X0.03, type="l", col="blue")
> legend(x=2, y=14, c("A2","A3"), c("green","blue"))