小生这厢有礼了(BioFaceBook Personal Blog) » statistics

STAMP (Statistical Analysis of Metagenomic Profiles)

szypanther — Wed, 02 Apr 2014 07:17:27 +0000

STAMP

Using STAMP to identify SEED subsystems which are differentially abundant between CandidatusAccumulibacter phosphatis sequences obtained from a pair of enhanced biological phosphorus removal (EBPR) sludge metagenomes(data originally described in Parks and Beiko, 2010).

STAMP (Statistical Analysis of Metagenomic Profiles) is a software package for analyzing metagenomic profiles (e.g., a taxonomic profile indicating the number of marker genes assigned to different taxonomic units or a functional profile indicating the number of sequences assigned to different biological subsystems or pathways) that promotes ‘best practices’ in choosing appropriate statistical techniques and reporting results. It encourages the use of effect sizes and confidence intervals in assessing biological importance. A user friendly, graphical interface permits easy exploration of statistical results and generation of publication quality plots for inferring the biological relevance of features in a metagenomic profile. STAMP is open source, extensible via a plugin framework, and available for all major platforms.

Announcements

Mar. 6, 2014: STAMP v2.0.2 released. Adds support for BIOM profiles including those generated by PICRUSt andQIIME.
Feb. 9, 2014: STAMP v2.0.1 released. Adds heatmap plot. Fixes bug in multi-group plots causing DP values to be scaled by an 100.
Sept. 2, 2013: OS X user may be experiencing problems running STAMP due to a bug in the latest version of matplotlib. Click here for details.
May 24, 2012: STAMP v2.0.0 (release candidate 6) released. Improved handling of degenerate cases and unclassified taxa in STAMP profiles.
Previous announcements

Mailing list

Join our mailing list to keep informed about STAMP developments.

Documentation

Quick installation instructions (Microsoft Windows, Apple’s Mac OS X, Linux, Command-line interface)
User’s Guide
FAQs
Version history

Downloads

Please uninstall previous versions of STAMP before installing a new release.

STAMP v2.0.2 setup package for Microsoft Windows (~35MB)
OS X and Linux users can follow the instructions to install from source
STAMP GitHub Repository
Previous versions

Examples

Citing STAMP

If you use STAMP in your research, please cite:

Parks, D.H. and Beiko, R.G. (2010). Identifying biologically relevant differences between metagenomic communities. Bioinformatics, 26, 715-721. (Abstract)

Contact Information

STAMP is in active development and we are interested in discussing all potential applications of this software. We encourage you to send us suggestions for new features. Suggestions, comments, and bug reports can be sent to Rob Beiko (beiko [at] cs.dal.ca). If reporting a bug, please provide as much information as possible and a simplified version of the data set which causes the bug. This will allow us to quickly resolve the issue.

Funding

The development and deployment of STAMP has been supported by several organizations:

Genome Atlantic
The Dalhousie Centre for Comparative Genomics and Evolutionary Bioinformatics, and the Tula Foundation
Killam Trust
The Natural Sciences and Engineering Research Council of Canada
The Dalhousie Faculty of Computer Science

Reading the NCBI’s GEO microarray SOFT files in R/BioConductor

szypanther — Thu, 23 May 2013 02:47:38 +0000

http://www2.warwick.ac.uk/fac/sci/moac/people/students/peter_cock/r/geo/

This page discusses how to load GEO SOFT format microarray data from the Gene Expression Omnibus database (GEO) (hosted by the NCBI) into R/BioConductor. SOFT stands for Simple Omnibus Format in Text. There are actually four types of GEO SOFT file available:

GEO Platform (GPL)
These files describe a particular type of microarray. They are annotation files.

GEO Sample (GSM)
Files that contain all the data from the use of a single chip. For each gene there will be multiple scores including the main one, held in the VALUE column.

GEO Series (GSE)
Lists of GSM files that together form a single experiment.

GEO Dataset (GDS)
These are curated files that hold a summarised combination of a GSE file and its GSM files. They contain normalised expression levels for each gene from each sample (i.e. just the VALUE field from the GSM file).

As long as you just need the expression level then a GDS file will suffice. If you need to dig deeper into how the expression levels were calculated, you’ll need to get all the GSM files instead (which are listed in the GDS or GSE file).

To me, it was natural to ask: How can I turn a GEO DataSet (GDS file) into an R/BioConductor expression set object (exprSet)? (answer) And while we’re at it, how to load the GEO Platform annotation (GPL file) too? (answer)

In the MOAC Module 5 assignment, the approach taken was to sanitize the data by hand, allowing it to be loaded into R with a simple call to the read.table command. Its a good idea to look at the raw files to understand what you are dealing with, but surely there is a more elegant way…

It turns out there are several existing GEO parsers, but one stands out above all others: Sean Davis’GEOquery (released roughly December 2005).

Installing GEOquery

Assuming you are running a recent version of BioConductor (1.8 or later) you should be able to install it from within R as follows:

> source("http://www.bioconductor.org/biocLite.R") > biocLite("GEOquery") Running bioCLite version 0.1.6 with R version 2.3.1 ...

For those of you on an older version of BioConductor, you will have to download and install it by hand fromhere.

If you are using Windows, download GEOquery_1.6.0.zip (or similar) and save it. Then from within the R program, use the menu option “Packages”, “Install package(s) from local zip files…” and select the ZIP file.

On Linux, download GEOquery_1.6.0.tar.gz (or similar) and use sudo R CMD INSTALL GEOquery_1.6.0.tar.gz at the command prompt.

Loading a GDS file with GEOquery

Here is a quick introduction to how to load a GDS file, and turn it into an expression set object:

library(Biobase) library(GEOquery) #Download GDS file, put it in the current directory, and load it: gds858 <- getGEO('GDS858', destdir=".") #Or, open an existing GDS file (even if its compressed): gds858 <- getGEO(filename='GDS858.soft.gz')

I’m using GDS858 as input. The SOFT file is available in compressed form here GDS858.soft.gz, but GEOquery takes care of finding this file for you and unzipping it automatically.

Loading this file from the hard disk takes about two minutes on my laptop.

There are two main things the GDS object gives us, meta data (from the file header) and a table of expression data. These are extracted using the Meta and Table functions. First lets have a look at the metadata:

> Meta(gds858)$channel_count [1] "1" > Meta(gds858)$description [1] "Comparison of lung epithelial Calu-3 cells infected ..." > Meta(gds858)$feature_count [1] "22283" > Meta(gds858)$platform [1] "GPL96" > Meta(gds858)$sample_count [1] "19" > Meta(gds858)$sample_organism [1] "Homo sapiens" > Meta(gds858)$sample_type [1] "cDNA" > Meta(gds858)$title [1] "Mucoid and motile Pseudomonas aeruginosa infected lung epithelial cell comparison" > Meta(gds858)$type [1] "gene expression array-based"

Useful stuff, and now the expression data table:

> colnames(Table(gds858)) [1] "ID_REF" "IDENTIFIER" "GSM14498" "GSM14499" "GSM14500" [6] "GSM14501" "GSM14513" "GSM14514" "GSM14515" "GSM14516" [11] "GSM14506" "GSM14507" "GSM14508" "GSM14502" "GSM14503" [16] "GSM14504" "GSM14505" "GSM14509" "GSM14510" "GSM14511" [21] "GSM14512" > Table(gds858)[1:10,1:6] ID_REF IDENTIFIER GSM14498 GSM14499 GSM14500 GSM14501 1 1007_s_at U48705 3736.9 3811.0 3699.6 3897.6 2 1053_at M87338 343.0 500.3 288.3 341.3 3 117_at X51757 120.9 34.3 145.8 110.5 4 121_at X69699 1523.8 1281.1 1281.9 1493.4 5 1255_g_at L36861 51.6 15.9 45.9 8.1 6 1294_at L13852 253.2 164.8 200.0 205.2 7 1316_at X55005 199.6 250.7 290.3 218.6 8 1320_at X79510 81.7 13.4 13.9 88.7 9 1405_i_at M21121 18.9 5.6 11.0 9.5 10 1431_at J02843 99.7 74.5 72.6 114.8

Now, lets turn this GDS object into an expression set object (using base 2 logarithms) and have a look at it:

> eset <- GDS2eSet(gds858, do.log2=TRUE) > eset Expression Set (exprSet) with 22283 genes 19 samples phenoData object with 4 variables and 19 cases varLabels : sample : infection : genotype/variation : description > geneNames(eset)[1:10] [1] "1007_s_at" "1053_at" "117_at" "121_at" "1255_g_at" [6] "1294_at" "1316_at" "1320_at" "1405_i_at" "1431_at" > sampleNames(eset) [1] "GSM14498" "GSM14499" "GSM14500" "GSM14501" "GSM14513" [6] "GSM14514" "GSM14515" "GSM14516" "GSM14506" "GSM14507" [11] "GSM14508" "GSM14502" "GSM14503" "GSM14504" "GSM14505" [16] "GSM14509" "GSM14510" "GSM14511" "GSM14512"

GEOquery does an excellent job of extracting the phenotype data, as you can see:

> pData(eset)$infection [1] FRD1 FRD1 FRD1 FRD1 FRD440 [6] FRD440 FRD440 FRD440 FRD875 FRD875 [11] FRD875 FRD875 FRD1234 FRD1234 FRD1234 [16] uninfected uninfected uninfected uninfected Levels: FRD1 FRD1234 FRD440 FRD875 uninfected > pData(eset)$"genotype/variation" [1] control control [3] control control [5] mucoid mucoid [7] mucoid mucoid [9] motile motile [11] motile motile [13] non-mucoid, non-motile non-mucoid, non-motile [15] non-mucoid, non-motile non-mucoid, non-motile [17] non-mucoid, non-motile non-mucoid, non-motile [19] non-mucoid, non-motile Levels: control motile mucoid non-mucoid, non-motile

As with any expression set object, its easy to pull out a subset of the data:

> eset["1320_at","GSM14504"] Expression Set (exprSet) with 1 genes 1 samples phenoData object with 4 variables and 1 cases varLabels : sample : infection : genotype/variation : description > exprs(eset["1320_at","GSM14504"]) GSM14504 1320_at 6.70044

You should be able to produce a heatmap of differentially expressed genes easily enough using this page, especially as the phenotype/sub-sample information has been sorted out for you.

Loading a GPL (Annotation) file with GEOquery

In addition to loading a GDS file to get the expression levels, you can also load the associated platform annotation file. You can find this out from the GDS858 meta information:

> Meta(gds858)$platform [1] "GPL96"

So, for GDS858, the platform is GPL96, Affymetrix GeneChip Human Genome U133 Array Set HG-U133A.

Now let’s load up the GPL file and have a look at it (its a big file, about 12 MB, so this takes a while!):

library(Biobase) library(GEOquery) #Download GPL file, put it in the current directory, and load it: gpl96 <- getGEO('GPL96', destdir=".") #Or, open an existing GPL file: gpl96 <- getGEO(filename='GPL96.soft')

As with the GDS object, we can use the Meta and Table functions to extract information:

> Meta(gpl96)$title [1] "Affymetrix GeneChip Human Genome U133 Array Set HG-U133A" > colnames(Table(gpl96)) [1] "ID" "Species.Scientific.Name" [3] "Annotation.Date" "GB_LIST" [5] "SPOT_ID" "Sequence.Source" [7] "Representative.Public.ID" "Gene.Title" [9] "Gene.Symbol" "Entrez.Gene" [11] "RefSeq.Transcript.ID" "Gene.Ontology.Biological.Process" [13] "Gene.Ontology.Cellular.Component" "Gene.Ontology.Molecular.Function"

Lets look at the first four columns, for the first ten genes:

> Table(gpl96)[1:10,1:4] ID Species.Scientific.Name Annotation.Date GB_LIST 1 1007_s_at Homo sapiens 16-Sep-05 U48705 2 1053_at Homo sapiens 16-Sep-05 M87338 3 117_at Homo sapiens 16-Sep-05 X51757 4 121_at Homo sapiens 16-Sep-05 X69699 5 1255_g_at Homo sapiens 16-Sep-05 L36861 6 1294_at Homo sapiens 16-Sep-05 L13852 7 1316_at Homo sapiens 16-Sep-05 X55005 8 1320_at Homo sapiens 16-Sep-05 X79510 9 1405_i_at Homo sapiens 16-Sep-05 M21121 10 1431_at Homo sapiens 16-Sep-05 J02843

This shows a hand picked selection of the columns, again for the first ten genes:

> Table(gpl96)[1:10,c("ID","GB_LIST","Gene.Title","Gene.Symbol","Entrez.Gene")] ID GB_LIST Gene.Title Gene.Symbol Entrez.Gene 1 1007_s_at U48705 discoidin domain receptor family, member 1 DDR1 780 2 1053_at M87338 replication factor C (activator 1) 2, 40kDa RFC2 5982 3 117_at X51757 heat shock 70kDa protein 6 (HSP70B') HSPA6 3310 4 121_at X69699 paired box gene 8 PAX8 7849 5 1255_g_at L36861 guanylate cyclase activator 1A (retina) GUCA1A 2978 6 1294_at L13852 ubiquitin-activating enzyme E1-like UBE1L 7318 7 1316_at X55005 thyroid hormone receptor, alpha (erythroblastic...) THRA 7067 8 1320_at X79510 protein tyrosine phosphatase, non-receptor type 21 PTPN21 11099 9 1405_i_at M21121 chemokine (C-C motif) ligand 5 CCL5 6352 10 1431_at J02843 cytochrome P450, family 2, subfamily E, polypeptide 1 CYP2E1 1571

The above all used the 12MB file GPL96.soft, but you can also get a much smaller 3MB file GPL96.annot(compressed as GPL96.annot.gz) which has slightly different information in it… see here.

Using the BioConductor hgu133a package

Instead of loading the GEO annotation file for GPL96/HG-U133A, we could use an existing annotation package from the BioConductor annotation sets, hgu133a. These libraries exist for most of the popular microarray gene chips.

First of all, we need to install the package:

> source("http://www.bioconductor.org/biocLite.R") > biocLite("hgu133a") Running bioCLite version 0.1 with R version 2.1.1 ...

Then we can load the newly installed library:

> library(hgu133a)

There is any easy way to check when this was lasted updated, and what it can translate the Affy probe names into:

> hgu133a() Quality control information for hgu133a Date built: Created: Tue May 17 13:02:12 2005 Number of probes: 22277 Probe number missmatch: None Probe missmatch: None Mappings found for probe based rda files: hgu133aACCNUM found 22277 of 22277 hgu133aCHRLOC found 20195 of 22277 hgu133aCHR found 21283 of 22277 hgu133aENZYME found 2507 of 22277 hgu133aGENENAME found 18726 of 22277 hgu133aGO found 18647 of 22277 hgu133aLOCUSID found 21747 of 22277 hgu133aMAP found 21183 of 22277 hgu133aOMIM found 15109 of 22277 hgu133aPATH found 5067 of 22277 hgu133aPMID found 21004 of 22277 hgu133aREFSEQ found 21002 of 22277 hgu133aSUMFUNC found 0 of 22277 hgu133aSYMBOL found 21303 of 22277 hgu133aUNIGENE found 21128 of 22277 Mappings found for non-probe based rda files: hgu133aCHRLENGTHS found 25 hgu133aENZYME2PROBE found 663 hgu133aGO2ALLPROBES found 5912 hgu133aGO2PROBE found 4326 hgu133aORGANISM found 1 hgu133aPATH2PROBE found 142 hgu133aPMID2PROBE found 96291

And now lets test some of those mappings on the fourth gene 121_at in the GPL file:

> Table(gpl96)[4,c("ID","GB_LIST","Gene.Title","Gene.Symbol","Entrez.Gene")] ID GB_LIST Gene.Title Gene.Symbol Entrez.Gene 4 121_at X69699 paired box gene 8 PAX8 7849

Now, what does the annotation file have to say?

> mget("121_at",hgu133aACCNUM) $"121_at" [1] "X69699" > mget("121_at",hgu133aGENENAME) $"121_at" [1] "paired box gene 8" > mget("121_at",hgu133aSYMBOL) $"121_at" [1] "PAX8" > mget("121_at",hgu133aUNIGENE) $"121_at" [1] "Hs.469728"

You will notice that there is some overlap between the information in the GEO annotation table, and thehgu133a package (which compiles its data from a range of sources). See help(hgu133a) .

You should also read this introduction, Bioconductor: Annotation Package Overview

Meta分析（Meta-analysis）简介

szypanther — Sat, 11 May 2013 23:28:34 +0000

Meta分析的由来：
Meta分析（Meta-analysis）是心理学家Glass于1976年首次提出，原文是这样的：
Meta-analysis refers to the analysis of analyses. I use it to refer to the statistical analysis of a large collection of results from individual studies for the purpose of integrating the findings. It connotes a rigorous alternative to the casual, narrative discussions of research studies which typify our attempts to make sense of the rapidly expanding research literature. (Glass, 1976)

翻译过来可以得到以下Meta分析的概念。

Meta分析的概念：
Meta分析是指用统计学方法对收集的多个研究资料进行分析和概括，以提供量化的平均效果来回答研究的问题。具体的，Meta分析是以同一课题的多项独立研究的结果为研究对象，在严格设计的基础上，运用适当的统计学方法对多个研究结果进行系统、客观、定量的综合分析。其优点是通过增大样本含量来增加结论的可信度，解决研究结果之间的异质性或不一致性，当然一切的前提是Meta分析本身需要设计合理、严密。

Meta分析的基本步骤：

1. 明确简洁地提出需要解决的问题；
2. 制定检索策略，全面广泛地收集随机对照试验；
3. 确定纳入和排除标准，剔除不符合要求的文献；
4. 资料选择和提取，包括原文的结果数据、图表等；
5. 各试验的质量评估和特征描述；
6. 统计学处理，包括：
a．异质性检验（齐性检验）；
b．统计合并效应量（加权合并，计算效应尺度及95%的置信区间）并进行统计推断；
c．图示单个试验的结果和合并后的结果；
d．敏感性分析；
e．通过“失安全数”的计算或采用“倒漏斗图”了解潜在的发表偏倚。
7. 结果解释、作出结论及评价；
8. 维护和更新资料。
（乐研生物 Lovelyresearcher.com）

data analysis in 系统评价

真理在缩水，还是上帝在掷骰子？(转贴）

szypanther — Thu, 13 Dec 2012 09:45:13 +0000

最近在Google Reader中看见科学松鼠会有两篇文章被频繁分享，名为《真理在缩水——现代科学研究方法并不尽善尽美？》（上）与（下），下文简称《缩水》。文章很有意思，而实际上说的是我们的老本行——统计学，因此我在这里也发表一些我的想法和理解，包括这两年我在美帝学习的一些思考，部分内容受益于两位老师Kaiser和Nettleton教授，先向他们致谢（尽管他们永远都不会看到这篇文章）。同时我也要先说明一下，读这篇文章可能会很花时间（至少我花了大约二十小时写这篇文章），即使我的观点没有价值，我相信里面的引用文献是有价值的。

初读文章，我脑子里冒出的一句话是“上帝在跟我们掷骰子”，文中给出了大量的不可重复的试验，仿佛就像那些号称“具有统计学意义”（下文我再说这个所谓的“意义”）的试验结果只是若干次骰子中的一次运气好的结果而已。读完文章，我们可能不禁要问，到底是真理在缩水，还是它根本就不曾存在？下面我从四个方面来展开，分别说明人对随机性的认识、统计推断的基石、让无数英雄折腰的P值、以及可重复的统计研究。

一、感知随机

随机变量在统计分析中占据中心地位，数学上关于随机变量的定义只是一个“干巴巴的函数”，从样本空间映射到实数集，保证从实数集上的Borel域逆回去的集合仍然在原来的sigma域中即可。随机变量的性质由其分布函数刻画。写这段话的目的不是为了吓唬你，也不是为了作八股文，而是来说明我为什么不喜欢数学的理由，对我而言，我觉得有些数学工具只是为了让自己不要太心虚，相信某时某刻某个角落有个理论在支撑你，但后果是弱化了人的感知，当然，也有很多数学工具有很强的直觉性（如果可能，我想在未来下一篇文章里面总结这些问题）。我一直认为很多人对随机性的感知是有偏差的，对概率的解释也容易掉进陷阱（参见Casella & Berger的Statistical Inference第一章，例如条件概率的三囚徒问题）。

《缩水》一文发表了很多不可重复的试验案例，我们应该吃惊吗？我的回答是，未必。举两个简单的例子：

第一个例子：很多数据分析人员都很在意所谓的“离群点”，论坛上也隔三差五有人问到这样的问题（如何判断和处理离群点），而且也有很多人的做法就是粗暴地删掉，我从来都反对这种做法。除了基于“数据是宝贵的”这样简单的想法之外，另一点原因是，离群点也许并非“异类”。离群点是否真的不容易出现？请打开R或其它统计软件，生成30个标准正态分布N(0, 1)随机数看看结果，比如R中输入rnorm(30)，这是我运行一次的结果：

> rnorm(30)
 [1]  1.19062761 -0.85917341  2.90110515  0.59532402 -0.05081508 -0.06814796
 [7]  2.08899701  0.76423007  0.92587075 -1.16232929 -0.68074378 -1.40437532
[13] -0.17932604 -0.72980545 -0.53850923  0.21685537 -0.35650714 -1.32591808
[19] -0.88071526 -1.25832441  0.24001498 -0.41682799 -0.09576492 -0.17059052
[25] -0.99947485  0.25108253 -0.47566842 -0.28028786  0.79856649 -0.13250974

30在现实中是一个比较小的样本量，你看到了什么？2.901？它接近3倍标准差的位置了。还有2.089？……如果你不知道这批数据真的是从标准正态分布中生成出来的，现在你会有什么反应？把2.9删掉？标准正态分布是一个在我们眼中很“正常”的分布，而一个不太大的样本量一次试验足以生成几个“离群点”，那么要是成千上万的试验中没能产生几项震惊世界的结果，你会怎样想？（上帝的骰子坏掉了）

另一个例子和统计学结合紧密一点，我们谈谈残差的QQ图。QQ图是用来检查数据正态性的一种统计图形，与腾讯无关，细节此处略去，大意是图中的点若呈直线状（大致分布在对角线上），那么可以说明数据的正态性比较好，因此QQ图经常被用在对回归模型残差的正态性诊断上。我的问题是，即使数据真的是正态分布，你是否真的会看见一些分布在直线上的点？若答案是否定的，那么我们就得重新审视我们对分布和随机的认识了。下图是一幅教科书式的QQ图（仍然基于30个正态分布随机数）：

“正常的”QQ图（来自R语言qqnorm(rnorm(30))）

随机性并没有这么美好，即使数据真的来自正态分布，你也有可能很容易观察到歪歪扭扭的QQ图，尤其是小样本的情况下。比如下图是50次重复抽样的正态数据QQ图，它和你想象的QQ图本来的样子差多远？

library(animation)
set.seed(710)
ani.options(interval = 0.1, nmax = 50)
par(mar = c(3, 3, 2, 0.5), mgp = c(1.5, 0.5, 0), tcl = -0.3)
sim.qqnorm(n = 30, pch = 19, col = "red", last.plot = expression(abline(0, 1)))

真实的正态分布QQ图（图中直线为y = x）

正态分布是统计学中比较“正常”的一类分布（台湾学者甚至译为“常态分布”），我们尚不能很好感知它的随机性，就不必说其它“怪异”的分布了。

这是我想说的第一点，作为人类，我们对上帝的骰子至少在感知上就可能有问题（别误会，我不信教），接下来从理性角度分析一下。

二、统计推断

《缩水》一文中提到的基本上都是统计推断方法带来的结果，为了理解这些结果，我们必须三思统计推断本身的过程。一般说来，统计推断有两种途径：随机试验和（概率）统计模型，或者说基于试验的推断和基于模型的推断。前者的代表性方法为置换检验（Permutation test），不过它似乎被大多数人遗忘了，更多的人拿到数据首先想的就是用哪个统计模型和分布；实际上，置换检验是一种极具代表性的统计推理方法，可以用典型的“三段论”来说明它（参见去年江堂的文章）：

要么A，要么B
若有A，则有C
若非C，则非A，于是B

置换检验的场景是，一次试验中，我们为试验单元随机分配不同的处理（treatment），为了简单起见，假设这里只有两种处理水平A和B，我们想知道两种处理在试验单元的某项指标上是否有显著差异。逻辑是这样：假设处理毫无效果，那么某一个试验对象的指标将不受处理影响，不管我们给老鼠嗑的是A药还是B药，结果都一样，那么我们可以把处理的标签随机打乱（某些A、B随机互换），打乱后A组和B组的差异不应该会和原来的差异很不一样（因为药不管用嘛），否则，我们恐怕得说，药还是管用的。就这样，我们随机打乱标签很多次，形成一个“人工生成”的AB差异分布（因为我们生成了很多个差异值），看原来的差异在这个分布的什么位置上，如果在很靠近尾巴的位置上，那么就认为P值很小。当了个当，当了个当，P值出场了。对置换检验熟悉的人是否有想过，好像我们一直没谈到什么分布的假设，那这个概率值（P值）是从哪里生出来的？答案是最初的“随机分配处理到试验单元上”。这就涉及到试验设计的一大原则：随机化。为什么要随机化？因为试验单元之间可能本来就有差异，换句话说，如果你不随机化，那么你观察到的差异有可能来自试验单元本身。比如，你从笼子里抓前10只老鼠给嗑A药，后10只老鼠给B药，这就没有随机化，前10只老鼠可能很笨或是老弱病残，容易被你抓住，而后10只老鼠跑得贼快。随机化将这些个体差异转变为了随机的误差，例如两只老鼠之间的确本身有差异，但我要是把它们随机分配给处理，那么这种个体差异就会随机进入处理组，保证统计推断有根基。如果这一点没有理解透彻，试验人员很容易在数据收集阶段就已经收集了错误的数据。《缩水》一文中的试验都是怎么做的，我没空去细究。

基于模型的推断的一大特征就是开始对随机变量做一些分布上的假设，例如两样本t检验，假设样本来自独立同方差的正态分布。仔细想想这里面的问题，对建模和理解模型结果有至关重要的作用。一个最直接的问题就是，假设条件是否可靠？George Box大人很久很久以前就说了一句被引用了无数遍的话：所有的模型都是错的，但有些是有用的。统计学方法很“滑”（用麦兜的话说），它的科学与艺术双重身份，总让人感觉拿捏不准。学数学的人可能觉得艺术太不靠谱，其它外行人士可能觉得科学太神秘。这个问题我不想作答，也无法作答，搁在一边，先说一些我曾经考虑过的问题，它们与《缩水》一文可能并没有直接联系，但应该能或多或少启发我们从源头考虑统计模型，直接上手统计模型的隐患就在于你认为一切都理所当然，随着时间推移，假设渐渐变成了“公理”和“常识”，我觉得这是很可怕的。

第一个问题是似然函数（likelihood function），它是频率学派的命脉，而我们大多数人仍然都是频率学派的“教徒”。对于离散变量来说，基于似然函数的方法如极大似然估计很容易理解：我们要找一个参数值，使得观察到的数据发生的概率最大。这里的“概率”二字应该被重重划上记号！接下来我要提一个让我曾经觉得后背发凉的问题：

为什么对连续变量来说，似然函数是密度函数的乘积？

你是否想过这个问题？我们知道连续变量取任何一个值的概率都是0，也就是说，如果按照概率的方式解释似然函数，那么连续变量的似然函数应该是0才对，换句话说，你观察到的数据发生的概率是0。现在你觉得似然函数还是一个理所当然的统计工具吗？

有一位统计学家叫J. K. Lindsey，1998年在（英国）皇家统计学会宣读了一篇论文，叫Some statistical heresies（一些统计异端邪说），如果你没见过统计学家打仗，那么这篇论文将会让你看到一场超大规模的战争，后面的讨论者中包括Murray Aitkin、D. R. Cox和J. A. Nelder等老江湖。Lindsey的文章几乎是被大家围攻了（期待我这篇文章也能被围攻），不过他对于似然函数的解释倒是让我有点茅塞顿开。细节我不展开，大意是，似然函数也是一种近似（用积分中值定理去想）。

第二个问题是渐近统计（asymptotic statistics），同样，这也是统计学家的日常工具之一，因为太常见，所以也有一种理所当然的味道。比如我们看到列联表就想到卡方检验（检验行列变量的独立性），殊不知卡方检验只是一种大样本下的近似方法。渐近统计的基石是什么？如果你要挖，我想挖到头你一定会挖到泰勒展开。至少目前我认为渐近统计“基本上就是泰勒展开的第二项（或少数情况下第三项）”。理论上泰勒展开有无穷多项，我们往往根据一些假设条件，把后面的高阶项都消灭掉，剩下一次项或二次项。比如你展开似然函数，就得到了似然比检验的卡方分布；你展开极大似然估计的一个连续函数，你就有了Delta方法（当然，需要依分布收敛的前提）；就这样左展右展，展出了中心极限定理（对特征函数或母函数展开），展出了拉普拉斯近似（对对数密度函数展开）。如果你能看到这一点，就不会奇怪为什么正态分布在统计推断中有如此中心地位（谁叫正态分布的对数密度函数是个二次函数呢）。

第三个问题是，贝叶斯方法又如何？既然频率学派中几乎处处是近似，贝叶斯学派是否会好一些？我的回答是好不到哪儿去。贝叶斯的逻辑很简单，但由于灵活性太强，应用时非常摧残人的脑力，导致争议不断（先验分布怎么取、MCMC是否收敛等）。在《缩水》一文中，恐怕是没有基于贝叶斯方法的试验，因为在所谓的科学试验中，人们往往排斥“先验”这种想法，就好像先验一定代表主观、而客观一定正确一样。逻辑上，这是荒谬的。关于这些问题的重磅讨论，可参考Efron去年发表的The Future of Indirect Evidence以及文章后面Gelman他们三个人的讨论。我总感觉这不是我这个年龄应该看的东西，太哲学了。

我提到这些问题，本意是给统计学家看的，如果你是一个合格的统计学家，你自己首先应该意识到统计学的局限性，而不是拿到数据就分析。

三、万能的P值？

早些年当我还是个无知轻狂小子的时候，我曾戏称“统计软件就是为了放个P”，这里的P指的是P值，不是粗话。这话好像也不全然轻狂无知。使用统计方法的人，难道不是在追逐一个小于0.05的P值吗？如果你的结果不显著，那么肯定发表不了。换句话说，发表的结果很有可能是我们在自欺欺人。下面的漫画生动刻画了人们寻找P值的过程（来自xkcd）：

Significant

重大科学发现！吃绿色的软糖会让你长痘痘！95%置信度！

当你看到95%的时候，你看不到红色的、灰色的、褐色的、橙色的、黄色的软糖……这便是《缩水》一文中说的“发表偏见”（publication bias，“偏见”翻译似乎不妥），即发表的结果是经过人工选择的，你已经不能用正常的概率意义去解读它，或者说，概率已经变了样。

插一句“统计学意义”：这个概念本来的英文是statistical significance，但是被很多专业的人翻译为“统计学意义”，我一直认为这很不妥，给人一种错觉，仿佛统计学保证了什么东西有意义一样，我提倡的译法是“统计显著性”。尤其是“由于P值小于0.05，所以具有统计学意义”这种话，我觉得应该见一句删一句。

上面的软糖问题涉及到传统的多重比较（multiple comparison）与P值调整，这里“传统”指的是要控制族错误率（Familywise Error Rate，下文简称FWER）。所谓控制FWER，也就是要使得一族（多个）统计检验中，一个第一类错误都不犯的概率控制在一定水平之下，比如0.05。让多个检验全都不犯错和单个检验不犯错（指第一类错误）显然是有区别的，比如假设所有的检验都是独立的，一个检验不犯错的概率是95%，两个都不犯错的概率就变成了95% * 95% = 90.25%，检验越多，不犯错的概率就越小。把整体的第一类错误率控制在某个alpha值之下，就意味着单个检验必须更“严格”，因此我们不能再以0.05去衡量每个检验是否显著，而要以更小的值去衡量，这就是P值的调整，老办法有Bonferroni等方法。

控制FWER显得有些苛刻，比如有10000个基因都需要检验在不同处理下的表达差异，那么要是按照传统的P值调整方法，恐怕是很难得出某个基因显著的结论（因为alpha值会被调得极其小）。FWER的目标是一个错误都不能犯，但实际上我们也许可以容忍在那些我们宣称显著的结果中，有少数其实是犯错的，就看你是不是“宁愿错杀三千，也不放过一个”了。

Efron在前面我提到的文章中把Benjamini和Hochberg在1995年的论文称为“二战以来统计界最伟大的成果”（他把第二名给了James & Stein的有偏估计），那么B&H做了什么？答案就是虚假发现率（False Discovery Rate，下文简称FDR）。FDR要控制的是在宣称显著的结论中错误的结论的比例，比如10000个基因中有100个基因显著，但实际上有5个是虚假的发现（即本来这5个基因分别在两种处理下的表达并没有差异）。尽管有错误存在，但我们认为可以承受。

初学者到这里应该可以意识到了，通过FDR选出来的结果在理论上就已经存在错误了，当然这只是小问题，更大的问题在于，FDR的定义实际上是一个期望的形式：真实的零假设个数除以被拒绝的零假设个数的期望（零假设是没有差异）。凡是涉及到期望的东西，我们都应该冷静三秒想一下，期望意味着什么？

假设有一个游戏，你获胜的概率是70%，要是赢了，你得到一百万，要是输了，你付出一百万；获利的期望是40万。现在我请你去玩，一把定输赢，你玩不玩？我相信大多数人不会玩（除非你实在太有钱了），为什么期望是赚40万你也不玩？因为期望往往是“样本量无穷大”你才能隐约看到的东西，玩一次游戏，输掉一百万的概率是30%，恐怕你不敢。

FDR是个期望，也就是你做很多次试验，平均来看，FDR在某个数值附近。一次试验中它究竟在哪儿，谁都不知道。就像（频率学派的）置信区间一样，我给你一个区间，其实你并不知道参数在不在区间里，你也无法用概率的方式去描述单个区间，比如你不能说“参数落在这个区间内的概率是95%”（只有无数次抽样重新计算区间，这无数个区间覆盖真实参数的概率才是95%）。

所以，某种意义上，概率论源于赌博，而统计学在骨子里一直都是赌博。旁观者看赌徒，总觉得他在赚钱。当然，统计学家是“高级赌徒”，他们不是随机乱赌。

四、可重复的统计研究

看到这里，你大概也脑子有点抽筋了（如果你把我提到的Lindsey和Efron的文章都看过了的话），我们换个轻松点的话题：可重复的统计研究。这不是我第一次提它，我们一直都在号召可重复的统计研究（如《Sweave：打造一个可重复的统计研究流程》）。还是老话一句，不谈道德问题，因为这是不可控的因素，我们可控的只有制度和工具。源代码不会撒谎。

我们期待学术研究过程的透明化，至少统计之都在努力。