小生这厢有礼了(BioFaceBook Personal Blog) » My project

Molecular dynamics simulation with Amber (repost)

szypanther — Thu, 16 Oct 2014 09:23:45 +0000

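The steps for running MD with Amber are as follows. The PDB file first needs its crystallographic water, hydrogens and metal ions removed (optional).
Online tutorials: http://enzyme.fbb.msu.ru/Tutorials/

Simulating metalloenzymes: http://ambermd.org/tutorials/advanced/tutorial1_orig/
Simulating a protein with a ligand: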

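1: Preparing the molecular system

   The commands are as follows:
tleap                         //start leap
source  leaprc.ff99SB         //load the protein force field
source  leaprc.gaff           //load the small-molecule force field if there is a ligand
loadamberparams lig.frcmod    //load any additional special force-field parameters if needed
mol = loadpdb protein.pdb     //define the variable mol and load the protein
bond mol.1.SG mol.6.SG        //create a disulfide bond, if one is needed
check mol                     //check the protein
saveamberparm mol protein_vac.top protein_vac.crd //write the vacuum topology and coordinate files
solvatebox mol TIP3PBOX 10.0  //rectangular box; use solvateoct for a truncated octahedron
charge mol                    //report the net charge of the system
addions mol Na+ 1             //add positive charge (when the net charge is negative)
addions mol Cl- 1             //add negative charge (when the net charge is positive)
saveamberparm mol protein_sol.top protein_sol.crd //write the solvated topology and coordinate files
quit                          //exit leap
ambpdb -p protein_sol.top < protein_sol.crd > protein_sol.pdb //write a PDB of the system for inspection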
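How many counter-ions to pass to addions depends on the net charge that `charge mol` reports. The following is a minimal Python sketch of that bookkeeping, assuming monovalent Na+/Cl- ions and a net charge that is close to an integer. The helper name `counter_ions` is hypothetical and is not part of Amber.

```python
def counter_ions(net_charge):
    """Return (ion_name, count) needed to neutralize a system.

    Assumes monovalent ions: Na+ neutralizes a negative system,
    Cl- neutralizes a positive one. Hypothetical helper, not an Amber API.
    """
    n = round(net_charge)          # leap may report e.g. -3.000001
    if n < 0:
        return ("Na+", -n)
    if n > 0:
        return ("Cl-", n)
    return (None, 0)               # already neutral


if __name__ == "__main__":
    ion, count = counter_ions(-3.000001)
    print(f"addions mol {ion} {count}")
```

Only one of the two addions lines is needed for a given system: the one whose ion carries the opposite sign to the net charge.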

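Notes:
<1> When the PDB file specifies disulfide bonds, use the bond command to create them before Amber generates the parameter files. The SSBOND records look like this:
SSBOND 1 CME A  381 CME A  381
SSBOND 2 CME A  417 CME A  417
<2> Preparing parameter files for a non-standard ligand
antechamber -i lig.pdb -fi pdb -o lig.mol2 -fo mol2 -c bcc -s 2    //-c gas can be used instead of -c bcc
parmchk -i lig.mol2 -f mol2 -o lig.frcmod
source leaprc.gaff
LIG = loadmol2 lig.mol2
check LIG
loadamberparams lig.frcmod
check LIG
saveamberparm LIG lig.top lig.crd
<3> Unrecognized atoms
If leap prints a message such as "Created a new atom named : .R .A" while the force field is being applied, Amber probably does not recognize those atoms. They can be deleted from the PDB file so that Amber adds them back automatically.
<4> Amber force fields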
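ff10 (parm10.dat): a collection of the various parameter patches to ff99, equivalent to parm99.dat + frcmod.ff99SB + bsc0 + chi.OL3 + new ion parameters + atom and residue names changed to follow PDB format version 3. At the time of writing this is the best Amber force field.
ff03.r1 (parm99.dat + frcmod.ff03): a modified version of ff99. Charges are derived with a continuum dielectric model to represent solvent polarization, and the protein phi/psi backbone parameters were modified to reduce the bias towards helical conformations. Nucleic acid parameters are unchanged from ff99. ff03.r1 differs slightly from the ff03 shipped with Amber 9, which still used terminal-residue (C- and N-terminal) atomic charges derived by the ff94 method; to use that older ff03, load oldff/leaprc.ff03.
ff03ua (parm99.dat + frcmod.ff03 + frcmod.ff03ua): the united-atom version of ff03. Side-chain hydrogens are merged into their heavy atoms, while backbone and aromatic-ring hydrogens are retained. Because the backbone is still all-atom, the backbone potential parameters are unchanged; the side-chain parameters were refitted for the united-atom representation. Nucleic acid parameters are completely unchanged and remain all-atom.
ff02 (parm99.dat + frcmod.ff02pol.r1): the polarizable version of ff99, which adds polarizable dipoles to the atoms. frcmod.ff02pol.r1 is a correction to the torsion parameters of the original ff02.
ff02EP (parm99EP.dat + frcmod.ff02pol.r1): ff02 with off-centre point charges added to atoms such as oxygen, nitrogen and sulfur to represent lone pairs. Reportedly slightly better than ff02.
ff99 (parm99.dat): most parameters come from ff94, with many torsion parameters modified. The glycine backbone parameters are problematic and the balance between helical and extended conformations is wrong. For DNA, long ff99 simulations become dominated by a metastable state in which the alpha and gamma dihedrals tend towards gauche+ and trans, respectively. RNA has the same problem, but less severely. ff94 shares these flaws.
ff99SB (parm99.dat + frcmod.ff99SB): corrects the protein dihedral parameters of ff99, improving the relative populations of secondary structures and fixing the glycine backbone problem.
bsc0 (frcmod.parmbsc0): a patch that fixes the ff99 nucleic acid problem described above and also improves the glycosidic gamma torsion potential of RNA. See http://mmb.pcb.ub.es/PARMBSC0.
ff99SB+bsc0: applies the bsc0 patch to ff99SB, improving on ff99 for both proteins and nucleic acids. This combination biases the gamma dihedral too far away from trans; if the starting structure has many gamma angles in trans, ff99 is still the better choice.
ff99SBildn (frcmod.ff99SBildn): a patch on top of ff99SB that modifies amino acid side-chain parameters.
ff99SBnmr (frcmod.ff99SBnmr): a patch on top of ff99SB that modifies the backbone torsion parameters to agree better with NMR data.
ff98 (parm98.dat): improves the glycosidic torsion parameters of ff94.
ff96 (parm96.dat): differs from ff94 in its torsion parameters, giving energies closer to quantum-chemical results. From Beachy et al. Not widely used because of problems such as a marked bias towards beta conformations.
ff94 (parm94.dat): from Cornell, Kollman et al. Suited to simulations in solvent. Charges were obtained by RESP fitting at HF/6-31G*.
ff86 (parm91X.dat): extends ff84 to an all-atom force field. Like ff84 it uses a Lennard-Jones 10-12 potential for hydrogen bonds, so to use ff84/86 in sander you must recompile with the -DHAS_10_12 option. The file is called parm91X because some corrections were made to the original ff86. (parm91X.dat is a completed version of parm91.dat with some added non-bonded terms, but the non-bonded parameters for e.g. Mg and I were never properly tuned and are only approximate.)
ff84 (parm91X.ua.dat): the earliest AMBER force field, a united-atom force field for nucleic acids and proteins. Not recommended, but still useful for simulations in vacuum or with a distance-dependent dielectric constant.
parmAM1 and parmPM3 (parmAM1.dat / parmPM3.dat): minimizing a protein with these parameters gives the same optimized result as AM1/PM3. Of little value nowadays.
GAFF (gaff.dat) = General Amber Force Field: a general-purpose force field for small organic molecules, with the same functional form as the AMBER force fields and fully compatible with them.
GLYCAM-06 (GLYCAM_06g.dat): an improvement over earlier GLYCAM force fields that also includes parameters for a small number of lipids.
GLYCAM-04EP (GLYCAM04EP.dat): extends GLYCAM04 for use in simulations with the TIP5P model. Off-centre point charges are added to oxygen to represent lone pairs.
GLYCAM-04 (GLYCAM04.dat) = glycans and glycoconjugates in AMBER: dedicated to carbohydrate simulations and fully compatible with AMBER, so the two can be combined to simulate glycoproteins. Website: http://glycam.ccrc.uga.edu/ccrc/index.jsp
In addition, Amber supports the AMOEBA force field, and CHARMM force fields through the bundled CHAMBER tool; these are not covered here.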

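2: Minimizing the solvent
The input file min1.in is as follows:
protein: initial minimisation solvent + ions
&cntrl
imin = 1,                 //1 = minimization, 0 = molecular dynamics
maxcyc = 1000,     //total number of minimization cycles
ncyc = 500,            //first 500 cycles steepest descent, remaining 500 conjugate gradient
ntb = 1,                  //periodic boundary conditions: 0 = none, 1 = constant volume, 2 = constant pressure
ntr = 1,                   //restrain selected atoms during minimization (requires -ref)
cut = 10                  //non-bonded cutoff of 10 Å
/
Hold the protein fixed   //title of the restraint group
500.0                            //restraint force constant
RES 1 194                    //range of residues to restrain
END
END
    The last five lines (shown in red in the original post) restrain the protein. The value 500.0 is a force constant in kcal/(mol·Å²) that holds the peptide chain in place. "RES 1 194" is the range of residues in the chain; the protein used in this exercise has 194 residues.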
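To get a feel for what a force constant of 500.0 means, the sketch below evaluates the harmonic positional-restraint penalty for a single atom. It assumes the restraint energy has the form k·|Δr|², with k in kcal/(mol·Å²) and no factor of 1/2; check the AMBER manual for your version before relying on the exact prefactor. The function name is hypothetical.

```python
def restraint_energy(k, pos, ref):
    """Harmonic positional restraint energy in kcal/mol.

    k   : force constant in kcal/(mol*A^2)
    pos : current (x, y, z) in Angstrom
    ref : reference (x, y, z) in Angstrom (the -ref coordinates)
    Assumed form: E = k * |pos - ref|^2 (no factor 1/2).
    """
    d2 = sum((p - r) ** 2 for p, r in zip(pos, ref))
    return k * d2


if __name__ == "__main__":
    ref = (0.0, 0.0, 0.0)
    moved = (0.1, 0.0, 0.0)        # atom displaced by 0.1 A
    for k in (500.0, 10.0):
        print(k, restraint_energy(k, moved, ref))
```

Under that assumption a 0.1 Å displacement already costs fifty times more with k = 500 than with the k = 10 used later for restrained MD, which is why the large value effectively freezes the protein.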
    Run the minimization with:
sander -O -i min1.in -o min1.out -p protein_sol.top -c protein_sol.crd -r protein_min1.rst -ref protein_sol.crd &
3: Minimizing the protein (min2.in)
protein: initial minimisation whole system
&cntrl
imin = 1,
maxcyc = 2500,
ncyc = 1000,
ntb = 1,
ntr = 0,
cut = 10
/
The command is:
sander -O -i min2.in -o min2.out -p protein_sol.top -c protein_min1.rst -r protein_min2.rst &
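4: Restrained molecular dynamics
       In this step the protein is held by positional restraints while the solvent relaxes around it. Equilibrating the solvent takes at least 10 ps; here 20 ps are used to relax the solvent in the periodic box of the system prepared in the two previous steps.
The input file md1.in is as follows:
protein: 20ps MD with res on protein
&cntrl
imin = 0,  irest = 0,   ntx = 1,           //run molecular dynamics rather than minimization
nstlim = 10000, dt = 0.002,            //number of steps and time step in ps (10000 × 0.002 ps = 20 ps)
ntc = 2,             //SHAKE: 1 = not used, 2 = bonds involving hydrogen are constrained, 3 = all bonds are constrained
ntf = 2,
cut = 10,
ntb = 1,                 //constant volume
ntr = 1,
tautp = 0.1,           //heat-bath coupling time constant (default 1.0); smaller values give tighter coupling
ntpr = 100,
ntwx = 500,
ntwr = 1000,
ntt = 3,                      //temperature control; 3 = Langevin dynamics
gamma_ln = 1.0,    //collision frequency in ps^-1 when ntt = 3 (see the AMBER manual)
tempi = 0.0,          //initial temperature of the system
temp0 = 300.0,     //target temperature that the system reaches and maintains, in K
/
keep protein fixed with weak restraints
10.0
RES 1 194   //residue range of the protein
END
END
    A weaker force constant of 10 kcal/(mol·Å²) is used here. For molecular dynamics with ntr = 1, 5-10 kcal/(mol·Å²) is sufficient (the restraints need a reference coordinate file, supplied with the "-ref" option). Combining a much larger force constant with SHAKE and a 2 fs time step can make the system unstable, because stiff restraints cause high-frequency oscillations of the restrained atoms that the simulation does not need.

Run it with:

sander -O -i md1.in -o md1.out -p protein_sol.top -c protein_min2.rst -r protein_md1.rst -x protein_md1.mdcrd -ref protein_min2.rst -inf md1.info &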
5: Production molecular dynamics
This step runs molecular dynamics on the whole system without positional restraints on any atoms. The input file md2.in is as follows:
protein: 250ps MD
&cntrl
imin = 0, irest = 1, ntx = 7,
ntb = 2, pres0 = 1.0, ntp = 1,
taup = 2.0,
cut = 10, ntr = 0,
ntc = 2, ntf = 2,
tempi = 300.0, temp0 = 300.0,
ntt = 3, gamma_ln = 1.0,
nstlim = 125000, dt = 0.002,
ntpr = 100, ntwx = 500, ntwr = 1000
/
Run the MD with:
sander -O -i md2.in -o md2.out -p protein_sol.top -c protein_md1.rst -r protein_md2.rst -x protein_md2.mdcrd -ref protein_md1.rst -inf md2.info &
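The length of each MD run follows from nstlim and dt, and the number of trajectory frames from ntwx. The sketch below does that arithmetic for the values in md1.in and md2.in; the helper name `run_summary` is hypothetical.

```python
def run_summary(nstlim, dt, ntwx):
    """Return (simulated time in ps, number of trajectory frames).

    nstlim : number of MD steps
    dt     : time step in ps
    ntwx   : coordinates are written every ntwx steps
    """
    time_ps = nstlim * dt
    frames = nstlim // ntwx
    return time_ps, frames


if __name__ == "__main__":
    print("md1:", run_summary(10000, 0.002, 500))
    print("md2:", run_summary(125000, 0.002, 500))
```

This is a quick way to confirm that the titles "20ps" and "250ps" in the two input files match the step counts actually requested.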
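6: Analysing the Amber simulation data
Once the simulation has finished, the output needs to be analysed. The main data files are the .out file, which records the system energy, temperature, pressure and so on, and the .mdcrd file, the molecular dynamics trajectory, from which quantities such as the protein RMSD and radius of gyration can be computed.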
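As a starting point for analysing the .out file, the sketch below pulls step, time and temperature out of the energy records. It assumes the usual sander layout in which each record begins with a line of the form `NSTEP = ... TIME(PS) = ... TEMP(K) = ...`; the sample text is made up for illustration and the function name is hypothetical.

```python
import re

# Matches the first line of each energy record in a sander .out file
# (assumed layout, see the lead-in).
RECORD = re.compile(
    r"NSTEP\s*=\s*(\d+)\s+TIME\(PS\)\s*=\s*([-\d.]+)\s+TEMP\(K\)\s*=\s*([-\d.]+)"
)


def read_temperatures(text):
    """Return a list of (step, time_ps, temperature_K) tuples."""
    return [
        (int(step), float(time), float(temp))
        for step, time, temp in RECORD.findall(text)
    ]


# Made-up sample in the assumed layout, not real simulation output.
SAMPLE = """
 NSTEP =      100   TIME(PS) =       0.200  TEMP(K) =    45.10  PRESS =     0.0
 Etot   =   -60000.0000  EKtot   =     1000.0000  EPtot      =   -61000.0000
 NSTEP =      200   TIME(PS) =       0.400  TEMP(K) =    88.75  PRESS =     0.0
 Etot   =   -59000.0000  EKtot   =     2000.0000  EPtot      =   -61000.0000
"""

if __name__ == "__main__":
    for step, time_ps, temp in read_temperatures(SAMPLE):
        print(step, time_ps, temp)
```

On a real run the text would come from something like `open("md1.out").read()`. Any averages or fluctuation summary at the end of the file that uses the same line format would also match and may need to be dropped.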

R pheatmap

szypanther — Wed, 12 Feb 2014 04:04:41 +0000

> library(caTools);
> library(bitops);
> library(grid);
> data=read.csv("/home/shenzy/Desktop/R/Bac.heatmap1.csv")
> data=read.csv("/home/shenzy/Desktop/R/Bac.heatmap1.2.csv")
> View(data)
> data=read.csv("/home/shenzy/Desktop/R/Bac.heatmap1.2.csv",sep="\t")
> View(data)
> row.names(data) <- data$X.OTU.ID;
> View(data)
> data_matrix<-data[,2:15]
> View(data_matrix)
> data_matrix<-data[,2:14]
> View(data_matrix)
> View(data)
> data_matrix<-data[,1:14]
> View(data_matrix)
> library(pheatmap)
> data_matrix[is.na(data_matrix)]<-1
> View(data_matrix)
> data_log10<-log10(data_matrix)
> View(data_log10)
> data_log2<-log2(data_matrix)
> View(data_log2)
> pheatmap(data_log2,fontsize=9, fontsize_row=6)
> pheatmap(data_log2, color = colorRampPalette(c("white", "yellow", "firebrick3"))(50), fontsize=9, fontsize_row=6)
> pheatmap(data_log2, color = colorRampPalette(c("white", "blue", "firebrick3"))(50), fontsize=9, fontsize_row=6)
> help(pheatmap)
> pheatmap(data_log2, color = colorRampPalette(c("white", "blue", "firebrick3"))(50), fontsize=9, fontsize_row=6,cluster_rows=FALSE, cluster_cols=FALSE)
> pheatmap(data_log2, color = colorRampPalette(c("white", "blue", "firebrick3"))(50), fontsize=9, fontsize_row=6)
> data2=read.csv(/home/shenzy/Desktop/R/Bac.heatmap1.3_GZ.csv)
Error: unexpected '/' in "data2=read.csv(/"
> data2=read.csv("/home/shenzy/Desktop/R/Bac.heatmap1.3_GZ.csv",sep="\t")
> View(data2)
> row.names(data2) <- data2$X.OTU.ID;
> data3=read.csv("/home/shenzy/Desktop/R/Bac.heatmap1.4_SH.csv",sep="\t")
> row.names(data3) <- data3$X.OTU.ID;
> View(data3)
> data2_matrix<-data2[,1:7]
> View(data2_matrix)
> data3_matrix<-data3[,1:7]
> View(data3_matrix)
> data2_matrix[is.na(data2_matrix)]<-1
> data3_matrix[is.na(data3_matrix)]<-1
> data2_log2<-log2(data2_matrix)
> data3_log2<-log2(data3_matrix)
> pheatmap(data2_log2, color = colorRampPalette(c("white", "blue", "firebrick3"))(50), fontsize=9, fontsize_row=6,cluster_rows=FALSE, cluster_cols=FALSE)
> pheatmap(data2_log2, color = colorRampPalette(c("white", "blue", "firebrick3"))(50), fontsize=9, fontsize_row=6)
> pheatmap(data3_log2, color = colorRampPalette(c("white", "blue", "firebrick3"))(50), fontsize=9, fontsize_row=6,cluster_rows=FALSE, cluster_cols=FALSE)
> pheatmap(data3_log2, color = colorRampPalette(c("white", "blue", "firebrick3"))(50), fontsize=9, fontsize_row=6)
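The session above replaces missing values with 1 before taking log2, so that absent OTUs map to 0 instead of NA. A minimal Python sketch of the same transform on a made-up table follows; the OTU names and counts are illustrative only and the function name is hypothetical.

```python
import math


def log2_fill(rows, fill=1.0):
    """Replace missing (None) counts by `fill`, then take log2."""
    return {
        name: [math.log2(fill if v is None else v) for v in values]
        for name, values in rows.items()
    }


if __name__ == "__main__":
    table = {"OTU_1": [8, None, 2], "OTU_2": [None, 1024, 4]}
    for name, values in log2_fill(table).items():
        print(name, values)
```

Explicit zero counts are not covered by this sketch and would still need a pseudocount before the log transform.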

graphlan: a good tool for drawing circular figures with a tree inside

szypanther — Fri, 24 Jan 2014 06:54:53 +0000

#!/bin/sh
graphlan_annotate.py hmptree.xml hmptree.annot.xml --annot annot.txt
graphlan.py hmptree.annot.xml hmptree.png --dpi 150 --size 14

0.9.5.tar

Batch download protein sequences from CMR (comprehensive microbial resource)

szypanther — Thu, 28 Feb 2013 08:11:00 +0000

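Protein sequences batch-downloaded from NCBI are sometimes inconsistent; in that case they can be downloaded from the following resource instead (e.g. eth195):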

http://cmr.jcvi.org/cgi-bin/CMR/shared/MakeFrontPages.cgi?page=batchdownload

Circos for comparative genomes

szypanther — Tue, 15 Jan 2013 08:03:30 +0000

Circos:
nucmer --prefix=refBAV1_qryVScds Dehalococcoides_BAV1.fasta Dehalococcoides_VS.cds.fasta
show-tiling -i 80 -c refBAV1_qryVScds.delta > refBAV1_qryVScds.tiling
awk -F " " '{print "chr1" "\t" $1 "\t" $2}' refBAV1_qryVScds.tiling > Dehalococcoidessp.VScds.gene_tableBAV1.txt
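The awk step above turns each tiling record into a Circos row of chromosome, start and end. A Python sketch of the same conversion is given below. It assumes whitespace-separated records whose first two fields are the start and end in the reference, and, unlike the awk one-liner, it skips header lines that start with ">". The function name and the sample records are made up for illustration.

```python
def tiling_to_circos(lines, chrom="chr1"):
    """Convert show-tiling records to 'chrom<TAB>start<TAB>end' rows.

    Assumes whitespace-separated records whose first two fields are the
    start and end in the reference; lines starting with '>' are headers.
    """
    rows = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2 or line.startswith(">"):
            continue
        rows.append(f"{chrom}\t{fields[0]}\t{fields[1]}")
    return rows


if __name__ == "__main__":
    # Made-up records for illustration, not real show-tiling output.
    sample = [
        ">ref 1341892 bases",
        "1203\t2540\t15\t1338\t100.00\t98.50\t+\tgene_0001",
        "2556\t3100\t40\t545\t100.00\t95.20\t-\tgene_0002",
    ]
    print("\n".join(tiling_to_circos(sample)))
```

Passing a different `chrom` value lets the same function serve several reference sequences in one karyotype.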

shenzy@shenzy-ubuntu:/winxp_disk2/shenzy/circos/circos-tutorials-0.62/tutorials/8/1$ perl ../../../../circos-0.62-1/bin/circos -conf circos.conf

shenzy@shenzy-ubuntu:/winxp_disk2/shenzy/circos/circos-tutorials-0.62/tutorials/8/1$ ll
total 16001
-rwxrwxrwx 1 root root 528519 2012-12-27 16:11 circos1.png
-rwxrwxrwx 1 root root 1065925 2012-12-27 14:33 circos2.png
-rwxrwxrwx 1 root root 1230394 2013-01-15 14:04 circosBAV1_VS11a_genome.jpg
-rwxrwxrwx 1 root root 4654 2013-01-15 13:11 circos.conf
-rwxrwxrwx 1 root root 1119210 2013-01-15 13:36 circos.jpg
-rwxrwxrwx 1 root root 794858 2013-01-15 13:15 circos.png
-rwxrwxrwx 1 root root 2900177 2013-01-15 13:15 circos.svg
-rwxrwxrwx 1 root root 4242152 2013-01-15 13:32 circos.xcf
-rwxrwxrwx 1 root root 4242152 2013-01-15 13:30 circos.xcf.png
-rwxrwxrwx 1 root root 128 2013-01-14 15:01 highlights.conf
-rwxrwxrwx 1 root root 2075 2013-01-14 14:42 highlights.conf.org
-rwxrwxrwx 1 root root 892 2013-01-15 13:15 ideogram.conf
-rwxrwxrwx 1 root root 35132 2011-03-22 00:40 image-01.png
-rwxrwxrwx 1 root root 30883 2011-03-22 00:40 image-02.png
-rwxrwxrwx 1 root root 126878 2011-03-22 00:40 image-03.png
-rwxrwxrwx 1 root root 1305 2012-05-05 00:25 ticks.conf

circos.conf

ideogram.conf

Dehalococcoidessp.BAV1.gene_table2

Dehalococcoidessp.11a.gene_tableBAV1

Dehalococcoidessp.GTcds.gene_tableBAV1

karyotype.microbe

Dehalococcoidessp.CBDB1cds.gene_tableBAV1

Dehalococcoidessp.195cds.gene_tableBAV1

How to measure codon usage bias

szypanther — Tue, 15 Jan 2013 07:53:16 +0000

The codon adaptation index (CAI) is one such measure. To compute the CAI of a gene, a reference table of RSCU (relative synonymous codon usage) values is first compiled from highly expressed genes.
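The sketch below illustrates the calculation on a toy example. It assumes the standard definitions: RSCU is a codon's count divided by the mean count of its synonymous family, the relative adaptiveness w is RSCU divided by the highest RSCU in the family, and CAI is the geometric mean of w over the gene's codons. Only two codon families are included, the reference counts are made up, and codons with w = 0 are simply skipped, which is a simplification.

```python
import math

# Toy subset of the genetic code (assumption: only these two families
# are considered; a real analysis would use the full codon table).
FAMILIES = {
    "Lys": ["AAA", "AAG"],
    "Phe": ["TTT", "TTC"],
}


def rscu(ref_counts, families=FAMILIES):
    """RSCU = observed count / mean count over the synonymous family."""
    values = {}
    for codons in families.values():
        total = sum(ref_counts.get(c, 0) for c in codons)
        if total == 0:
            continue
        mean = total / len(codons)
        for c in codons:
            values[c] = ref_counts.get(c, 0) / mean
    return values


def relative_adaptiveness(rscu_values, families=FAMILIES):
    """w = RSCU of a codon / highest RSCU in its family."""
    w = {}
    for codons in families.values():
        present = [c for c in codons if c in rscu_values]
        if not present:
            continue
        best = max(rscu_values[c] for c in present)
        for c in present:
            w[c] = rscu_values[c] / best
    return w


def cai(sequence, w):
    """Geometric mean of w over the codons of `sequence` that have w > 0."""
    codons = [sequence[i:i + 3] for i in range(0, len(sequence) - 2, 3)]
    logs = [math.log(w[c]) for c in codons if w.get(c, 0) > 0]
    return math.exp(sum(logs) / len(logs)) if logs else float("nan")


if __name__ == "__main__":
    # Made-up reference counts standing in for highly expressed genes.
    reference = {"AAA": 30, "AAG": 10, "TTT": 5, "TTC": 15}
    w = relative_adaptiveness(rscu(reference))
    print(cai("AAATTCAAGTTT", w))
```

A gene that uses only the preferred codon of each family scores 1; the more it uses rare codons, the closer its CAI gets to 0.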

One program that computes it is CodonW, which can be downloaded from http://codonw.sourceforge.net/. There is also a PhD thesis associated with it.

shenzy@shenzy-ubuntu:~/Downloads/CondonW/codonW$ codonw input.dat -all_indices -c_type 2 -f_type 4 -nomenu

eg:

shenzy@shenzy-ubuntu:/winxp_disk2/shenzy/circos/work$ codonw Dehalococcoidessp.BAV1.cds.fasta.dat -all_indices -c_type 2 -f_type 4 -nomenu

Using prcomp/princomp for PCA in R (Part 2)

szypanther — Fri, 31 Aug 2012 04:08:34 +0000

###############################

PCA
###############################
install.packages("vegan")
library(vegan)

> STpcoa<-read.table(file="bactera_16s_final.subsample.phylip.tre1.weighted.phylip.pcoa.axes", header=T,row.names=1)
> STpcoa
axis1 axis2 axis3 axis4
Cellulose -0.020878 -0.234601 0.167454 0
Foodwaste -0.234592 0.221741 0.085802 0
Sludge 0.368882 0.100725 -0.010570 0
Xylan -0.113413 -0.087865 -0.242686 0
> pl.STpcoa<-princomp(STpcoa)
> summary(pl.STpcoa)
Importance of components:
Comp.1 Comp.2 Comp.3 Comp.4
Standard deviation 0.2260563 0.1746944 0.1536319 0
Proportion of Variance 0.4856521 0.2900347 0.2243133 0
Cumulative Proportion 0.4856521 0.7756867 1.0000000 1

> ls(pl.STpcoa)
[1] "call" "center" "loadings" "n.obs" "scale" "scores" "sdev"
> class(pl.STpcoa)
[1] "princomp"
> nmds.col<-c(rep("green", 1), rep("blue", 1), rep("black",1), rep("red",1))
> plot(pl.STpcoa$scores, col=nmds.col, pch=20)
> legend(x=0.12, y=0.25, c("Cellulose","Foodwaste","Sludge","Xylan"),c("green","blue","black","red"),bty="n")

> biplot(pl.STpcoa)

##############################
> pl2.STpcoa<-prcomp(STpcoa)
> class(pl2.STpcoa)
[1] "prcomp"
> pl2.STpcoa$sd^2
[1] 0.06813525 0.04069083 0.03147035 0.00000000
> summary(pl2.STpcoa)
Importance of components:
PC1 PC2 PC3 PC4
Standard deviation 0.2610 0.2017 0.1774 0
Proportion of Variance 0.4857 0.2900 0.2243 0
Cumulative Proportion 0.4857 0.7757 1.0000 1
> pl2.STpcoa$x
PC1 PC2 PC3 PC4
Cellulose -0.02087762 -0.2346017 0.16745306 0
Foodwaste -0.23459165 0.2217407 0.08580311 0
Sludge 0.36888225 0.1007250 -0.01056992 0
Xylan -0.11341297 -0.0878640 -0.24268626 0
> pl2.STpcoa$rotation
PC1 PC2 PC3 PC4
axis1 1.000000e+00 -9.352971e-08 -8.835157e-07 0
axis2 9.353330e-08 1.000000e+00 4.064575e-06 0
axis3 8.835153e-07 -4.064576e-06 1.000000e+00 0
axis4 0.000000e+00 0.000000e+00 0.000000e+00 1
> plot(pl2.STpcoa$x, col=nmds.col, pch=20)
> legend(x=0.12, y=0.25, c("Cellulose","Foodwaste","Sludge","Xylan"),c("green","blue","black","red"),bty="n")

> screeplot(pl2.STpcoa,type="lines",main="Scree Plot")
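The two summaries above report different standard deviations for the same data (0.2260563 from princomp, 0.2610 from prcomp). This is consistent with princomp dividing by n and prcomp dividing by n - 1 when computing variances. Because the rotation matrix above is essentially the identity, the first component is just axis1, so the difference can be checked directly on that column. A small sketch using only the Python standard library:

```python
import statistics

# axis1 coordinates of the four samples, copied from the STpcoa table above
axis1 = [-0.020878, -0.234592, 0.368882, -0.113413]

sd_n_minus_1 = statistics.stdev(axis1)    # divisor n - 1, matching prcomp
sd_n = statistics.pstdev(axis1)           # divisor n, matching princomp

if __name__ == "__main__":
    print(round(sd_n_minus_1, 4), round(sd_n, 4))
```

The proportions of variance are identical in both summaries because the divisor cancels when each variance is divided by the total.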

bactera_16s_final.subsample.phylip.tre1.weighted.phylip.pcoa.axes

Using prcomp/princomp for PCA in R (Part 1)

szypanther — Fri, 31 Aug 2012 04:00:07 +0000

Difference between prcomp and princomp:

"'princomp' can only be used with more units than variables"

prcomp is based on singular value decomposition (the svd() function), whereas princomp is based on eigendecomposition (the eigen() function).

Good video source:

http://www.youtube.com/watch?v=oZ2nfIPdvjY

http://www.youtube.com/watch?v=I5GxNzKLIoU&feature=relmfu

http://www.planta.cn/forum/viewtopic.php?t=16754&highlight=%D3%EF%D1%D4

###########################################

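All of the code below, including the practice data, can be run directly in R.

# Principal component analysis and principal component regression
The idea of principal component analysis was proposed by Pearson in 1901 and developed further by Hotelling in 1933.
In R, principal component analysis is performed with the princomp() function.

Usage
princomp(x, cor = FALSE, scores = TRUE, covmat = NULL,
subset = rep(TRUE, nrow(as.matrix(x))), ...)

# x: the data to analyse
# cor: whether to use the sample correlation matrix (TRUE) instead of the covariance matrix (FALSE, the default)
prcomp()
2. summary() function
3. loadings() function
4. predict() function
5. screeplot() function
6. biplot() function

Example
Thirty students were sampled at random from one grade of a secondary school, and their height, weight, chest circumference and sitting height were measured. Run a principal component analysis on these four body measurements.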
student<-data.frame(
X1=c(148, 139, 160, 149, 159, 142, 153, 150, 151, 139,
140, 161, 158, 140, 137, 152, 149, 145, 160, 156,
151, 147, 157, 147, 157, 151, 144, 141, 139, 148 ),
X2=c(41, 34, 49, 36, 45, 31, 43, 43, 42, 31,
29, 47, 49, 33, 31, 35, 47, 35, 47, 44,
42, 38, 39, 30, 48, 36, 36, 30, 32, 38 ),
X3=c(72, 71, 77, 67, 80, 66, 76, 77, 77, 68,
64, 78, 78, 67, 66, 73, 82, 70, 74, 78,
73, 73, 68, 65, 80, 74, 68, 67, 68, 70),
X4=c(78, 76, 86, 79, 86, 76, 83, 79, 80, 74,
74, 84, 83, 77, 73, 79, 79, 77, 87, 85,
82, 78, 80, 75, 88, 80, 76, 76, 73, 78 )
)
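# principal component analysis
student.pr <- princomp(student, cor = TRUE)
# show the results
summary(student.pr, loadings=TRUE)
# predict: show the principal component scores of each sample
pre<-predict(student.pr)
# scree plot
screeplot(student.pr,type="lines")
# biplot of the principal components
biplot(student.pr)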

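Example 2
The body dimensions of 128 adult men were measured, 16 measurements per person: height, sitting height, chest circumference, head height, trouser length, crotch length, hand length, collar circumference, front chest, back, shoulder thickness, shoulder width, sleeve length, rib circumference, waist circumference and calf, denoted X1-X16. The correlation matrix R of the 16 measurements is given below. Run a principal component analysis starting from the correlation matrix and classify the 16 measurements.
Commands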
x<-c(
1.00,
0.79, 1.00,
0.36, 0.31, 1.00,
0.96, 0.74, 0.38, 1.00,
0.89, 0.58, 0.31, 0.90, 1.00,
0.79, 0.58, 0.30, 0.78, 0.79, 1.00,
0.76, 0.55, 0.35, 0.75, 0.74, 0.73, 1.00,
0.26, 0.19, 0.58, 0.25, 0.25, 0.18, 0.24, 1.00,
0.21, 0.07, 0.28, 0.20, 0.18, 0.18, 0.29,-0.04, 1.00,
0.26, 0.16, 0.33, 0.22, 0.23, 0.23, 0.25, 0.49,-0.34, 1.00,
0.07, 0.21, 0.38, 0.08,-0.02, 0.00, 0.10, 0.44,-0.16, 0.23, 1.00,
0.52, 0.41, 0.35, 0.53, 0.48, 0.38, 0.44, 0.30,-0.05, 0.50, 0.24, 1.00,
0.77, 0.47, 0.41, 0.79, 0.79, 0.69, 0.67, 0.32, 0.23, 0.31, 0.10, 0.62, 1.00,
0.25, 0.17, 0.64, 0.27, 0.27, 0.14, 0.16, 0.51, 0.21, 0.15, 0.31, 0.17, 0.26, 1.00,
0.51, 0.35, 0.58, 0.57, 0.51, 0.26, 0.38, 0.51, 0.15, 0.29, 0.28, 0.41, 0.50, 0.63, 1.00,
0.21, 0.16, 0.51, 0.26, 0.23, 0.00, 0.12, 0.38, 0.18, 0.14, 0.31, 0.18, 0.24, 0.50, 0.65, 1.00
)
names<-c("X1", "X2", "X3", "X4", "X5", "X6", "X7", "X8", "X9",
"X10", "X11", "X12", "X13", "X14", "X15", "X16")
R<-matrix(0, nrow=16, ncol=16, dimnames=list(names, names))
for (i in 1:16){
for (j in 1:i){
R[i,j]<-x[(i-1)*i/2+j]; R[j,i]<-R[i,j]
}
}
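The loop above rebuilds a full symmetric matrix from a vector that stores only the lower triangle row by row, using the index (i-1)*i/2+j for element (i, j). The Python sketch below shows the same unpacking on the first three rows of the data; the function name is hypothetical.

```python
def unpack_lower_triangle(packed, n):
    """Build a full symmetric n x n matrix from row-wise lower-triangle values.

    Element (i, j) with j <= i (1-based) sits at position (i-1)*i/2 + j
    in the packed vector, the same index formula used in the R loop.
    """
    if len(packed) != n * (n + 1) // 2:
        raise ValueError("packed vector has the wrong length")
    full = [[0.0] * n for _ in range(n)]
    for i in range(1, n + 1):
        for j in range(1, i + 1):
            value = packed[(i - 1) * i // 2 + j - 1]   # -1: Python is 0-based
            full[i - 1][j - 1] = value
            full[j - 1][i - 1] = value
    return full


if __name__ == "__main__":
    # First three rows of the correlation data above.
    packed = [1.00, 0.79, 1.00, 0.36, 0.31, 1.00]
    for row in unpack_lower_triangle(packed, 3):
        print(row)
```

The length check catches the common mistake of dropping or duplicating a value when typing the triangle by hand: 16 variables need exactly 136 entries.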
# principal component analysis
pr<-princomp(covmat=R)
load<-loadings(pr)

#
plot(load[,1:2])
text(load[,1], load[,2], adj=c(-0.4, 0.3))

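Principal component regression
Consider the relationship between total imports Y and three explanatory variables: gross domestic product, stocks and total consumption. Data were collected for the 11 years 1949-1959; fit both a linear regression and a principal component regression.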
conomy<-data.frame(
x1=c(149.3, 161.2, 171.5, 175.5, 180.8, 190.7, 202.1, 212.4, 226.1, 231.9, 239.0),
x2=c(4.2, 4.1, 3.1, 3.1, 1.1, 2.2, 2.1, 5.6, 5.0, 5.1, 0.7),
x3=c(108.1, 114.8, 123.2, 126.9, 132.1, 137.7, 146.0, 154.1, 162.3, 164.3, 167.6),
y=c(15.9, 16.4, 19.0, 19.1, 18.8, 20.4, 22.7, 26.5, 28.1, 27.6, 26.3)
)

Linear regression
lm.sol<-lm(y~x1+x2+x3, data=conomy)
summary(lm.sol)
Principal component regression

# principal component analysis
conomy.pr<-princomp(~x1+x2+x3, data=conomy, cor=T)
summary(conomy.pr, loadings=TRUE)
pre<-predict(conomy.pr)
conomy$z1<-pre[,1]; conomy$z2<-pre[,2]
lm.sol<-lm(y~z1+z2, data=conomy)
summary(lm.sol)

4sample CA RDA analysis

szypanther — Thu, 30 Aug 2012 04:14:47 +0000

> gtsdata_data=read.table("gtsdata.txt", header=T)
> gtsenv=read.table("gtsenv.txt", header=T)
> gtsdata_data_t<-t(gtsdata_data)
> decorana(gtsdata_data_t)

Call:
decorana(veg = gtsdata_data_t)

Detrended correspondence analysis with 26 segments.
Rescaling of axes with 4 iterations.

DCA1 DCA2 DCA3 DCA4
Eigenvalues 0.8634 0.4834 0.23788 0
Decorana values 0.8721 0.3793 0.07223 0
Axis lengths 5.3292 2.1115 1.80907 0

> gts.ca=cca(gtsdata_data_t)
> gts.ca
Call: cca(X = gtsdata_data_t)

Inertia Rank
Total 1.653
Unconstrained 1.653 3
Inertia is mean squared contingency coefficient

Eigenvalues for unconstrained axes:
CA1 CA2 CA3
0.8721 0.5037 0.2776

> plot(gts.ca,scaling=3)

> gtsdata_data_t_del<-gtsdata_data_t[1:3,]
> gtsdata_data_t_del

> gts.rda=rda(gtsdata_data_t_del,gtsenv)
> gts.rda
Call: rda(X = gtsdata_data_t_del, Y = gtsenv)

Inertia Proportion Rank
Total 101790 1
Constrained 101790 1 2
Unconstrained 0 0 0
Inertia is variance
Some constraints were aliased because they were collinear (redundant)

Eigenvalues for constrained axes:
RDA1 RDA2
81240 20549

plot(gts.rda,display=c("sp","bp","si"),scaling=3)

gtsenv.txt

gtsdata.txt

Learning ordination analysis of ecological data with the Vegan package

szypanther — Tue, 28 Aug 2012 04:40:20 +0000

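"Ordination analysis of ecological data using the Vegan package
Lai Jiangshan, Mi Xiangcheng
(State Key Laboratory of Vegetation and Environmental Change, Institute of Botany, Chinese Academy of Sciences, Beijing 100093)
Abstract: Community data are generally multidimensional, for example species attributes or environmental factors. Multivariate statistics is the standard approach in community ecology, and ordination is one of its most widely used methods. CANOCO is a widely used ordination program, but it is expensive commercial software and is updated slowly. In recent years the R language, being flexible, open, easy to learn and free, has rapidly won over researchers in ecology and biodiversity science. The add-on R package "Vegan" is a tool dedicated to community ecology analysis. Vegan provides all the basic ordination methods, produces high-quality ordination plots and is updated frequently. We believe that Vegan can fully replace CANOCO and become the first choice for ordination analysis. This paper first outlines the principles and types of ordination, then introduces basic information about Vegan and how to download and install it, and finally uses 40 randomly selected 20 m × 20 m quadrats from the 24-hectare Gutianshan plot to demonstrate the common ordination methods in Vegan (PCA, RDA, CA and CCA) and how to produce ordination plots, in the hope of helping R beginners become familiar with Vegan for ordination analysis."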

gtsdata

gtsenv.txt

赖江山.pdf

> setwd("/winxp_disk2/shenzy/R/Vegan")
> gtsdata=read.table("gtsdata.txt", header=T)
> gtsenv=read.table("gtsenv.txt", header=T)
> install.packages("vegan")
Installing package(s) into ‘/home/shenzy/R/x86_64-pc-linux-gnu-library/2.15’
(as ‘lib’ is unspecified)
trying URL 'http://cran.csiro.au/src/contrib/vegan_2.0-4.tar.gz'
Content type 'application/x-gzip' length 1576584 bytes (1.5 Mb)
opened URL
==================================================
downloaded 1.5 Mb

* installing *source* package ‘vegan’ ...
** package ‘vegan’ successfully unpacked and MD5 sums checked
** libs
gfortran   -fpic  -O3 -pipe  -g  -c cepin.f -o cepin.o
gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG      -fpic  -O3 -pipe  -g  -c data2hill.c -o data2hill.o
gfortran   -fpic  -O3 -pipe  -g  -c decorana.f -o decorana.o
gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG      -fpic  -O3 -pipe  -g  -c goffactor.c -o goffactor.o
gfortran   -fpic  -O3 -pipe  -g  -c monoMDS.f -o monoMDS.o
gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG      -fpic  -O3 -pipe  -g  -c nestedness.c -o nestedness.o
gfortran   -fpic  -O3 -pipe  -g  -c ordering.f -o ordering.o
gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG      -fpic  -O3 -pipe  -g  -c pnpoly.c -o pnpoly.o
gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG      -fpic  -O3 -pipe  -g  -c stepacross.c -o stepacross.o
gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG      -fpic  -O3 -pipe  -g  -c vegdist.c -o vegdist.o
gcc -std=gnu99 -shared -o vegan.so cepin.o data2hill.o decorana.o goffactor.o monoMDS.o nestedness.o ordering.o pnpoly.o stepacross.o vegdist.o -lgfortran -lm -lquadmath -L/usr/lib/R/lib -lR
installing to /home/shenzy/R/x86_64-pc-linux-gnu-library/2.15/vegan/libs
** R
** data
** inst
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** installing vignettes
   ‘decision-vegan.Rnw’
   ‘diversity-vegan.Rnw’ using ‘UTF-8’
   ‘intro-vegan.Rnw’ using ‘UTF-8’
** testing if installed package can be loaded

* DONE (vegan)

The downloaded source packages are in
	‘/tmp/RtmpmtXtEK/downloaded_packages’
> library(vegan)
Loading required package: permute

Attaching package: ‘permute’

The following object(s) are masked from ‘package:gtools’:

    permute

This is vegan 2.0-4
> decorana(gtsdata)

Call:
decorana(veg = gtsdata) 

Detrended correspondence analysis with 26 segments.
Rescaling of axes with 4 iterations.

                  DCA1   DCA2    DCA3    DCA4
Eigenvalues     0.3939 0.2239 0.09555 0.06226
Decorana values 0.5025 0.1756 0.06712 0.03877
Axis lengths    3.2595 2.5130 1.21445 1.00854

> gts.pca=rda(gtsdata)
> gts.pca
Call: rda(X = gtsdata)

              Inertia Rank
Total           352.1
Unconstrained   352.1   22
Inertia is variance 

Eigenvalues for unconstrained axes:
    PC1     PC2     PC3     PC4     PC5     PC6     PC7     PC8
111.779  73.580  54.607  32.959  26.481  18.063  12.763   7.637
(Showed only 8 of all 22 unconstrained eigenvalues)

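Note: the decorana output above is used to choose the ordination model (linear models PCA and RDA, or unimodal models CA and CCA), because its Axis lengths row corresponds to the gradient lengths of a DCA in CANOCO.
If the longest DCA axis is greater than 4, choose a unimodal model; if it is less than 3, choose a linear model; between 3 and 4 either kind can be used.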
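The rule of thumb in the note can be written as a small helper. This is a sketch only: the function name is hypothetical, and treating the 3 to 4 range as "either" is an assumption of this sketch rather than something the output itself states.

```python
def choose_ordination_model(axis_lengths):
    """Suggest a model family from DCA axis lengths.

    Rule of thumb from the note above: longest axis > 4 -> unimodal
    (CA/CCA), < 3 -> linear (PCA/RDA). The 3-4 range is treated here
    as "either", which is an assumption of this sketch.
    """
    longest = max(axis_lengths)
    if longest > 4:
        return "unimodal (CA, CCA)"
    if longest < 3:
        return "linear (PCA, RDA)"
    return "either"


if __name__ == "__main__":
    # Axis lengths reported by decorana(gtsdata) above.
    print(choose_ordination_model([3.2595, 2.5130, 1.21445, 1.00854]))
```

For the four-sample data set in the earlier post, whose first axis length is 5.3292, the same rule points to a unimodal method.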
> plot(gts.pca)

 
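> biplot(gts.pca)

The plot above is heavily overplotted because species abundances differ greatly: unevenly distributed species take up most of the ordination space. The species data can be standardized to unit variance with the scale argument, as follows:
> gts.pca=rda(gtsdata, scale=T)
> biplot(gts.pca,scaling=3)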

Note: scaling=1 focuses on relationships among sites
scaling=2 focuses on relationships among species
scaling=3 focuses on relationships between sites and species
> biplot(gts.pca,display="sp")
> biplot(gts.pca,display="si")
> biplot(gts.pca,display="sp", choices=c(1,3))


CA analysis:
> gts.ca=cca(gtsdata)
> gts.ca
Call: cca(X = gtsdata)

              Inertia Rank
Total           1.424
Unconstrained   1.424   21
Inertia is mean squared contingency coefficient 

Eigenvalues for unconstrained axes:
    CA1     CA2     CA3     CA4     CA5     CA6     CA7     CA8
0.50253 0.26564 0.14023 0.10502 0.09127 0.05540 0.05063 0.04204
(Showed only 8 of all 21 unconstrained eigenvalues)
> plot(gts.ca,scaling=3)

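Reading the CA plot: if a species lies close to a site, that species probably contributes strongly to the position of that site. The plot shows that site 20 lies very close to the oak QUESER (短柄枹). Sites 19 and 20 are also close to each other, which indicates that their species composition is similar. Species that occur in only a few sites, such as CASCAR, usually sit at the edge of the ordination space, indicating that they occur only occasionally; the values in that species' column are small or zero for most sites. Species near the centre of the ordination may have their optimum within the sampled area; the corresponding column (CASERY) has many large values.
RDA analysis (analysis of more than one matrix):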
> gts.rda=rda(gtsdata,gtsenv)
> gts.rda
Call: rda(X = gtsdata, Y = gtsenv)

               Inertia Proportion Rank
Total         352.0917     1.0000
Constrained   137.4026     0.3902    8
Unconstrained 214.6891     0.6098   22
Inertia is variance 

Eigenvalues for constrained axes:
   RDA1    RDA2    RDA3    RDA4    RDA5    RDA6    RDA7    RDA8
56.3864 42.7769 17.8270 13.5066  2.5020  2.1217  1.6616  0.6203 

Eigenvalues for unconstrained axes:
   PC1    PC2    PC3    PC4    PC5    PC6    PC7    PC8
72.287 54.891 26.618 17.959 12.730  9.918  5.659  5.349
(Showed only 8 of all 22 unconstrained eigenvalues)
plot(gts.rda,display=c("sp","bp","si"),scaling=3)

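In an RDA ordination plot, the length of an arrow represents how strongly that environmental factor is correlated with the community and species distributions: the longer the arrow, the stronger the correlation.
The angle between an arrow and an ordination axis represents how strongly that environmental factor is correlated with the axis: the smaller the angle, the stronger the correlation.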
> gts.prda=rda(gtsdata,gtsenv[,1:4], gtsenv[,5:8])
> gts.prda
Call: rda(X = gtsdata, Y = gtsenv[, 1:4], Z = gtsenv[, 5:8])

               Inertia Proportion Rank
Total         352.0917     1.0000
Conditional    95.0318     0.2699    4
Constrained    42.3708     0.1203    4
Unconstrained 214.6891     0.6098   22
Inertia is variance 

Eigenvalues for constrained axes:
  RDA1   RDA2   RDA3   RDA4
27.522  9.087  3.442  2.320 

Eigenvalues for unconstrained axes:
   PC1    PC2    PC3    PC4    PC5    PC6    PC7    PC8
72.287 54.891 26.618 17.959 12.730  9.918  5.659  5.349
(Showed only 8 of all 22 unconstrained eigenvalues)
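Note: gtsenv[,1:4] takes only the first four columns of the environmental matrix, i.e. the topographic factors. The Constrained inertia is 42.37, and 42.37 / 352.09 = 12.03%, which is the share of the total inertia explained by the topographic factors alone. Swapping Y and Z gives the share explained by the soil factors alone (15.53%). The share explained by both sets together was computed earlier as 39.02%. The share explained jointly by the two sets of environmental variables is therefore 39.02% - 15.53% - 12.03% = 11.46%.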
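The variance partitioning in the note can be reproduced from the inertia values printed above. In the sketch below the function name is hypothetical, and the 15.53% for soil alone is taken from the note because the corresponding rda call is not shown.

```python
def partition(total, constrained_all, constrained_a_alone, fraction_b_alone):
    """Split explained variance into A alone, B alone and shared parts.

    total               : total inertia
    constrained_all     : inertia explained by A and B together
    constrained_a_alone : inertia explained by A after partialling out B
    fraction_b_alone    : fraction explained by B after partialling out A
    Returns fractions (all, a_alone, b_alone, shared).
    """
    frac_all = constrained_all / total
    frac_a = constrained_a_alone / total
    shared = frac_all - frac_a - fraction_b_alone
    return frac_all, frac_a, fraction_b_alone, shared


if __name__ == "__main__":
    # Inertia values from the rda output above; 0.1553 for soil alone is
    # the figure quoted in the note (its rda call is not shown).
    parts = partition(352.0917, 137.4026, 42.3708, 0.1553)
    print([round(100 * p, 2) for p in parts])
```

As a consistency check, soil alone plus the shared part (15.53% + 11.46% = 26.99%) equals the Conditional proportion 0.2699 reported by the partial RDA.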

CCA is analogous:
> gts.cca=cca(gtsdata,gtsenv)