小生这厢有礼了(BioFaceBook Personal Blog)

Notes on my bioinformatics journey (NGS, Genome, Meta, Linux)

Quick guide for parameters in tophat-cufflinks in nematode RNA-seq analysis

The tophat-cufflinks protocol can be summarized as follows:

step1: generate a tophat_out folder with bam files

tophat -o sample1 -G genes.gtf <index> sample1_1.fq sample1_2.fq
tophat -o sample2 -G genes.gtf <index> sample2_1.fq sample2_2.fq

step2: generate new .gtf files (assemble isoforms)

cufflinks -o sample1 sample1/accepted_hits.bam
cufflinks -o sample2 sample2/accepted_hits.bam

step3: prepare a text file named assemblies.txt with following gtf files

cat << EOF > assemblies.txt
sample1/transcripts.gtf
sample2/transcripts.gtf
EOF

step4: run cuffmerge to generate merged.gtf

cuffmerge -g genes.gtf -s genome.fa assemblies.txt

step5: compare gene expressions of two samples

cuffdiff merged.gtf  sample1/accepted_hits.bam  sample2/accepted_hits.bam

The protocol specifically used for our data

step0: access to the data

Open the web server at , the password is

The result can be downloaded and viewed in ***

In the shell, type: 'cd ~/new2/RNAseq/trim'

step1: generate a tophat_out folder with bam files, using only JU1421-1 as example

The options "-N 8", "--read-gap-length 8" and "--read-edit-dist 8" together set the mismatch tolerance of the mapping: an alignment may contain up to 8 mismatches, a total gap length of up to 8 and an edit distance of up to 8. With these settings, only 69% of the JU1421 reads are mapped.

tophat2 -p 15 -i 20 -I 5000 -g 10 \
-N 8 \
--read-gap-length 8 \
--read-edit-dist 8 \
-o ./tophat_out/JU1421-1  \
-G ../genome/GENES.gff3 \
../genome/cb4_ws242 \
JU1421-1_S1_L001_R1_001_trimpair.fastq.gz JU1421-1_S1_L001_R2_001_trimpair.fastq.gz

All reads should be mapped using the same parameters. For AF16, the example is:

tophat2 -p 15 -i 20 -I 5000 -g 10 \
-N 8 \
--read-gap-length 8 \
--read-edit-dist 8 \
-o ./tophat_out/AF16-1  \
-G ../genome/GENES.gff3 \
../genome/cb4_ws242 \
AF16-1_S1_L001_R1_001_trimpair.fastq.gz AF16-1_S1_L001_R2_001_trimpair.fastq.gz

step2: generate new .gtf files (assemble isoform)

cufflinks -p 8 -o ./tophat_out/JU1421-1 ./tophat_out/JU1421-1/accepted_hits.bam     
cufflinks -p 8 -o ./tophat_out/JU1421-2 ./tophat_out/JU1421-2/accepted_hits.bam 
cufflinks -p 8 -o ./tophat_out/JU1421-3 ./tophat_out/JU1421-3/accepted_hits.bam
cufflinks -p 8 -o ./tophat_out/AF16-1 ./tophat_out/AF16-1/accepted_hits.bam     
cufflinks -p 8 -o ./tophat_out/AF16-2 ./tophat_out/AF16-2/accepted_hits.bam 
cufflinks -p 8 -o ./tophat_out/AF16-3 ./tophat_out/AF16-3/accepted_hits.bam

step3: prepare a text file named assemblies.txt with following gtf files

cat << EOF > assemblies.txt
./tophat_out/JU1421-1/transcripts.gtf
./tophat_out/JU1421-2/transcripts.gtf
./tophat_out/JU1421-3/transcripts.gtf
./tophat_out/AF16-1/transcripts.gtf
./tophat_out/AF16-2/transcripts.gtf
./tophat_out/AF16-3/transcripts.gtf
EOF
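The list file can also be generated rather than typed. The following is a minimal sketch, assuming cufflinks was run with -o ./tophat_out/<sample> as in step 2; the helper name assembly_lines is hypothetical.

```python
def assembly_lines(samples, out_dir="./tophat_out"):
    """Return one transcripts.gtf path per sample, in the given order."""
    return [f"{out_dir}/{sample}/transcripts.gtf" for sample in samples]


# Three replicates for each of the two strains used in this protocol.
samples = [f"{strain}-{rep}" for strain in ("JU1421", "AF16") for rep in (1, 2, 3)]
lines = assembly_lines(samples)

# Print the list; redirect the output to assemblies.txt when running the script.
print("\n".join(lines))
```

Generating the list from the sample names keeps it in step with the cufflinks output directories when replicates are added or removed.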

step4: run cuffmerge to generate merged.gtf

cuffmerge -g ../genome/GENES.gff3 -s ../genome/cb4_ws242.fa assemblies.txt

step5: compare gene expressions of two samples

cuffdiff -p 8 -L JU1421,AF16 merged_asm/merged.gtf \
./tophat_out/JU1421-1/accepted_hits.bam,\
./tophat_out/JU1421-2/accepted_hits.bam,\
./tophat_out/JU1421-3/accepted_hits.bam \
./tophat_out/AF16-1/accepted_hits.bam,\
./tophat_out/AF16-2/accepted_hits.bam,\
./tophat_out/AF16-3/accepted_hits.bam

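RNA-seq differential expression analysis workflow (reposted)

An earlier post, 《从RNA-seq结果到差异表达》 ("From RNA-seq results to differential expression"), described RNA-seq analysis in general terms. This post gives an example workflow.

Tools needed:
TopHat
Cufflinks
Samtools

Reference: http://vallandingham.me/RNA_seq_differential_expression.html

TopHat requires bowtie to be installed first. Installation itself is not covered here.

The pipeline is straightforward: Sequences → TopHat → Manual Check → Cufflinks → Analysis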

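The first question: should duplicates be removed, and if so, when? Before answering, let us look at what a duplicate is. In deep sequencing, reads with exactly identical sequence are collectively called duplicates. They usually come from three sources: (1) identical fragments in the sequencing template; (2) duplicates produced by PCR during sequencing; (3) the same PCR product being read more than once during signal acquisition. According to the discussion linked in the original post, duplicates should be removed for copy number detection, SV detection, ChIP-seq and RNA-seq. Removal greatly reduces computation and lowers false positives, but it also risks losing a large amount of data, that is, reducing true positives. One paper sequenced the same library twice, once single-end and once paired-end. The SE run showed 28% duplicates while the PE run showed only 8%; when the PE data were processed as SE, the duplicate rate rose to 28% again. Some informal discussions suggest that the real duplicate rate is only about 1%. This shows how much duplicate removal can affect data completeness.

Then why do people tend to remove duplicates for CN/SV/ChIP-seq/RNA-seq? The main rationale is that during library preparation all template fragments are sheared by sonication, and the probability that identical mRNA molecules break at exactly the same position is close to zero. In addition, when sequencing is very deep, the same template is inevitably sequenced several times; removing duplicates is then even more important because it eliminates saturation. For fragments generated by enzymatic digestion, such as CLIP-seq or REDseq (Restriction Enzyme digestion sequence), duplicates need not be removed.

Before removing duplicates, first inspect the mapped reads in a genome browser to see how prevalent duplicates are. Visual inspection has no fixed yardstick and is hard to summarize; its one benefit is that after looking at enough data you learn what a good sequencing result looks like.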

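When should duplicates be removed? The current consensus is after mapping. The advantage is that reads are not discarded merely because their sequences are identical, since the same sequence may map to different loci. After mapping, use samtools rmdup or Picard MarkDuplicates. Two caveats apply to both tools. First, each read should keep only one mapped position in the alignment file, not several. Second, for PE data the two mates must carry the same read name, otherwise the program may fail to pair them. Both tools assume PE data by default, so SE data may need an extra option.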

 java -Xmx2g -jar /path/to/MarkDuplicates.jar INPUT=accepted_hits_sorted.bam OUTPUT=duplicated.removed.bam METRICS_FILE=picard_info.txt REMOVE_DUPLICATES=true ASSUME_SORTED=true VALIDATION_STRINGENCY=LENIENT

or

 samtools rmdup
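To make the idea of position-based duplicate removal concrete, here is a toy sketch. It is not how samtools or Picard are implemented (the real tools also consider mate positions, clipping and base qualities); the Read tuple and the function name are hypothetical, and the sketch assumes each read already has a single mapped position.

```python
from collections import namedtuple

# Hypothetical, simplified alignment record.
Read = namedtuple("Read", "name chrom pos strand mapq")


def remove_position_duplicates(reads):
    """Keep one read per (chrom, pos, strand), preferring the highest mapq."""
    best = {}
    for read in reads:
        key = (read.chrom, read.pos, read.strand)
        if key not in best or read.mapq > best[key].mapq:
            best[key] = read
    return sorted(best.values(), key=lambda r: (r.chrom, r.pos, r.strand))


reads = [
    Read("a", "chr1", 100, "+", 30),
    Read("b", "chr1", 100, "+", 50),  # same position and strand as "a"
    Read("c", "chr1", 100, "-", 20),  # same position, other strand: kept
    Read("d", "chr2", 5, "+", 10),
]
kept = remove_position_duplicates(reads)
```

Because the key is the mapped position and not the sequence, two identical sequences that map to different loci are both kept, which is the reason given above for deduplicating after mapping.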

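The second question: where can bowtie index files be downloaded? The bowtie2 home page links to a very useful resource, iGenome, which collects index files and matching annotation files for most current genomes and makes them easy to download. This tutorial uses a bundle obtained from there.

tophat

tophat optimizes bowtie's mapping for mRNA-seq in two ways. First, it derives candidate mRNA junctions from the genome and realigns reads that failed to map. Second, it adjusts the handling of PE reads to the characteristics of mRNA sequencing. Its parameters are simple; only a few important ones need to be understood.

Average Mate-Pair Inner Distance: -r/--mate-inner-dist

This parameter sets the expected mean distance between the two reads of a mate pair. PE libraries typically have a mean fragment size of about 300 bp, which includes the adapters and barcodes at both ends as well as the insert itself. -r is the distance between the two reads, so it should be: total length - 2*(adapter + barcode + read length).

Note that many adapters are longer than one might expect, so find out the actual adapter length used in the experiment.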
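The arithmetic for -r can be written down as a small helper. This is only a sketch of the formula above; the function name and the example lengths (100 bp reads, 10 bp adapter, 5 bp barcode) are illustrative assumptions, not values from any particular library.

```python
def mate_inner_distance(total_len, read_len, adapter_len=0, barcode_len=0):
    """Inner distance = total length - 2 * (adapter + barcode + read length)."""
    return total_len - 2 * (adapter_len + barcode_len + read_len)


# Assumed example: 300 bp fragments including adapters and barcodes.
inner = mate_inner_distance(300, read_len=100, adapter_len=10, barcode_len=5)
print(inner)
```

A longer adapter shrinks the result quickly, which is why the actual adapter length matters.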

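Gene Model Annotations: -G/--GTF

We usually do not set this parameter. Setting it may improve efficiency but may also introduce bias. The usual practice is to pass the reference gene model in a later step.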

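Threads: -p/--num-threads

The number of threads. Multi-core processors are now common, so with four cores one might set -p 3, using three cores and leaving one for other tasks. One point needs care: tophat is not an MPI program, so on a cluster it should be run in exclusive mode, that is, request the -pe single parallel environment rather than openmpi, and request at least 16 GB of memory with -l mem_free=16G so that the job occupies a whole machine rather than a few cores. Otherwise the multithreading is not safe.

Output: -o/--output-dir

The output directory.

--library-type

The library type. There are three: fr-unstranded (Standard Illumina); fr-firststrand (dUTP, NSR, NNSR); and fr-secondstrand (Ligation, Standard SOLiD). The default is fr-unstranded.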

Here is an example of running tophat2:

export PATH=/share/bin/samtools:/share/bin/bowtie2/:/share/bin/tophat2/:/share/bin/python:$PATH
python /share/bin/tophat2/tophat --library-type fr-unstranded --mate-inner-dist 70 -p 8 \
        --output-dir tophat_output \
	/path/to/Genomes/UCSC/mm10/Sequence/Bowtie2Index/genome \
	R1_reads.fastq R2_reads.fastq

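After tophat finishes, check the mapping quality, for example with fastqc or samtools. The run log already records this information, so the quality metrics we need can also be found there.

At this point duplicates can be removed with the commands described above.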

Cufflinks

With Cufflinks, whether or not we want to discover new genes, it is best to run all three steps: cufflinks, cuffmerge and cuffdiff. Otherwise the results may be biased.

cufflinks must be run once for every bam file.

/share/bin/cufflinks/cufflinks -p 8 -o cufflinks_outputs accepted_hits.bam

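Next, write the paths of all generated transcripts.gtf files into an assemblies.txt file. cufflinks assembles the reads to determine all transcripts, so the gtf file it produces records the exons and which exons make up which transcripts. The following cuffmerge step merges all discovered transcripts with the known gene models.
A typical assemblies.txt looks like this: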

~/scratch/RNAseq/cufflinks_sample1/transcripts.gtf
~/scratch/RNAseq/cufflinks_sample2/transcripts.gtf

After cufflinks, run cuffmerge; the annotation file must be passed in at this step.

/share/bin/cufflinks/cuffmerge -g /path/to/Genomes/UCSC/mm10/Annotation/Genes/genes.gtf \
          -s /path/to/Genomes/UCSC/mm10/Sequence/WholeGenomeFasta/genome.fa \
          -o merged_asm -p 8 \
          /path/to/assemblies.txt

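Then run cuffdiff, passing in the bam files of the two groups to be compared. A few parameters deserve explanation. --min-reps-for-js-test is the minimum number of replicates required in each group; the default is 3 and it can be adjusted to the experiment. --labels takes the two group labels that name the groups passed to cuffdiff; be careful to keep them in the same order as the bam file lists. The gtf file that follows is the merged file produced by cuffmerge in the previous step.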

/share/bin/cufflinks/cuffdiff -o cuffdiff_outputs -p 8 \
        --labels group1label,group2label --min-reps-for-js-test 3 \
	merged_asm/merged.gtf \
	gp1rep1/accepted_hits.bam,gp1rep2/accepted_hits.bam,gp1rep3/accepted_hits.bam \
	gp2rep1/accepted_hits.bam,gp2rep2/accepted_hits.bam,gp2rep3/accepted_hits.bam
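Since the label order must match the order of the bam lists, it can help to build the command from one data structure. The sketch below only assembles the argument list and does not run anything; the helper name cuffdiff_argv is hypothetical, and it uses only the options shown in the command above.

```python
def cuffdiff_argv(merged_gtf, groups, out_dir="cuffdiff_outputs", threads=8):
    """Build a cuffdiff argument list.

    groups maps each label to its replicate bam files; dicts keep insertion
    order, so labels and bam lists always line up.
    """
    argv = ["cuffdiff", "-o", out_dir, "-p", str(threads),
            "--labels", ",".join(groups), merged_gtf]
    # One comma-separated bam list per group, in label order.
    argv += [",".join(bams) for bams in groups.values()]
    return argv


groups = {
    "group1label": [f"gp1rep{i}/accepted_hits.bam" for i in (1, 2, 3)],
    "group2label": [f"gp2rep{i}/accepted_hits.bam" for i in (1, 2, 3)],
}
argv = cuffdiff_argv("merged_asm/merged.gtf", groups)
print(" ".join(argv))
```

The resulting list could be passed to subprocess.run once the paths have been checked.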

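After these steps we obtain results such as gene_exp.diff and the isoform files; browse the output directory for details.

The results can then be filtered and analyzed in R, which is not covered here.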

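Fixing "/lib64/libc.so.6: version `GLIBC_2.14' not found" on Linux/CentOS (reposted)

The day before yesterday I compiled an executable on one CentOS release and copied it to another CentOS machine, where running it produced this error:

/lib64/libc.so.6: version `GLIBC_2.14' not found

This seems to be a very common error, so I collected and organized material from the web.

The error means the program requires GLIBC_2.14 but the system does not provide it. First use the strings command to check which GLIBC versions the system has: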

strings /lib64/libc.so.6 | grep GLIBC

The system supports at most GLIBC_2.12. There are several ways to solve this.
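Finding the highest version in that output by eye is easy to get wrong, because 2.9 sorts after 2.12 as text. A minimal sketch that compares the versions numerically is shown below; the function name is hypothetical and the input is assumed to be the text printed by the strings command above.

```python
import re


def max_glibc_version(strings_output):
    """Return the highest GLIBC_x.y[.z] version as a tuple of ints, or None."""
    versions = []
    for line in strings_output.splitlines():
        match = re.fullmatch(r"GLIBC_(\d+(?:\.\d+)*)", line.strip())
        if match:  # skips entries such as GLIBC_PRIVATE
            versions.append(tuple(int(part) for part in match.group(1).split(".")))
    return max(versions) if versions else None


sample = "GLIBC_2.2.5\nGLIBC_2.9\nGLIBC_2.12\nGLIBC_PRIVATE\n"
highest = max_glibc_version(sample)
print(highest)
```

Comparing tuples of integers orders 2.12 above 2.9, which a plain string sort would not.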

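Before upgrading the GLIBC library, think carefully:
Do you really need to upgrade GLIBC?
Do you know what you are doing?

http://baike.baidu.com/view/1323132.htm?fr=aladdin

glibc is the libc released by GNU, that is, the C runtime library. It is the lowest-level API on Linux, and almost every other runtime library depends on it. Besides wrapping the system services provided by the Linux kernel, it implements many other essential facilities...
In short, not only the applications running on Linux and the products you have deployed, but even basic commands such as cp, rm and ll depend on it.
Many people online have learned this the hard way; some could not even log back in after a failed upgrade.

Systems like CentOS favor stability (which is debatable), so library versions tend to be old: 6.5 ships glibc 2.12 and even 7.0 ships only 2.17, while Ubuntu 14.04 ships glibc 2.19.
Upgrading the base C runtime to a much newer version may affect CentOS itself. So if a CentOS base-library problem blocks your program, consider:
1. Build your product on the older system, if it does not really need features from the newer version.
2. Build on a newer system, such as Ubuntu or a newer CentOS; if you must deploy to the older one, consider tools such as mock to build a better package that bundles its dependencies.
3. Use container technology such as Docker to isolate a lightweight runtime environment on the older OS that suits your program.
Fortunately, glibc 2.15 was enough in my case and I have seen no problems since upgrading, so you may refer to my approach:
First check the current state, on CentOS 6.5: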

ll /lib64/libc.so.6

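libc.so.6 is a symbolic link and the current glibc is 2.12. My problem was a missing GLIBC_2.15, so I needed to upgrade to at least 2.15.
First, try to download a glibc 2.15 rpm. This is not easy, because rpm packages target CentOS and Red Hat, and packages for newer versions are rare. You can also copy a prebuilt libc.so.6 (glibc 2.15 or newer) from another system, but the safest way is to download the source and compile it locally (those who cannot get it to build have to get a copy elsewhere).
All glibc versions, including the glibc-ports add-on, are available from http://ftp.gnu.org/gnu/glibc/
The latest is 2.20; I conservatively chose 2.15.
Older glibc versions also require glibc-linuxthreads-2.x to be built (see the many documents online), but 2.15 does not, so it is skipped here.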


wget http://ftp.gnu.org/gnu/glibc/glibc-2.15.tar.gz
wget http://ftp.gnu.org/gnu/glibc/glibc-ports-2.15.tar.gz
tar -xvf glibc-2.15.tar.gz
tar -xvf glibc-ports-2.15.tar.gz
mv glibc-ports-2.15 glibc-2.15/ports
mkdir glibc-build-2.15
cd glibc-build-2.15
../glibc-2.15/configure --prefix=/usr --disable-profile --enable-add-ons --with-headers=/usr/include --with-binutils=/usr/bin
make
make install

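If install reports success, look at the shared libraries where glibc lives:

The old 2.12 library files are still there, the 2.15 files have been added, and the symbolic links all point to the 2.15 versions.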


-rwxr-xr-x 1 root root 1921096 Aug 30 02:16 /lib64/libc-2.12.so
-rwxr-xr-x 1 root root 9801632 Sep 25 13:46 /lib64/libc-2.15.so
lrwxrwxrwx. 1 root root 18 May 19 18:51 /lib64/libcap-ng.so.0 -> libcap-ng.so.0.0.0
-rwxr-xr-x. 1 root root 18672 Jun 25 2011 /lib64/libcap-ng.so.0.0.0
lrwxrwxrwx. 1 root root 14 May 19 18:51 /lib64/libcap.so.2 -> libcap.so.2.16
-rwxr-xr-x 1 root root 19016 Dec 8 2011 /lib64/libcap.so.2.16
lrwxrwxrwx. 1 root root 19 May 19 18:57 /lib64/libcgroup.so.1 -> libcgroup.so.1.0.40
-rwxr-xr-x 1 root root 97016 Dec 9 2013 /lib64/libcgroup.so.1.0.40
-rwxr-xr-x 1 root root 197064 Aug 30 02:16 /lib64/libcidn-2.12.so
-rwxr-xr-x 1 root root 267972 Sep 25 13:46 /lib64/libcidn-2.15.so
lrwxrwxrwx 1 root root 15 Sep 25 13:52 /lib64/libcidn.so.1 -> libcidn-2.15.so
lrwxrwxrwx. 1 root root 17 May 19 18:51 /lib64/libcom_err.so.2 -> libcom_err.so.2.1
-rwxr-xr-x 1 root root 17256 Nov 22 2013 /lib64/libcom_err.so.2.1
-rwxr-xr-x 1 root root 40400 Aug 30 02:16 /lib64/libcrypt-2.12.so
-rwxr-xr-x 1 root root 142947 Sep 25 13:46 /lib64/libcrypt-2.15.so
lrwxrwxrwx. 1 root root 22 May 19 18:57 /lib64/libcryptsetup.so.1 -> libcryptsetup.so.1.1.0
-rwxr-xr-x 1 root root 97072 Jun 22 2012 /lib64/libcryptsetup.so.1.1.0
lrwxrwxrwx 1 root root 16 Sep 25 13:52 /lib64/libcrypt.so.1 -> libcrypt-2.15.so
lrwxrwxrwx 1 root root 12 Sep 25 13:52 /lib64/libc.so.6 -> libc-2.15.so

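Some people get an error after make install. I did not dig into that error; check the system's GLIBC version as at the beginning. If it was not upgraded, we have to install manually:

make itself succeeded, so the build directory contains a newly compiled libc.so.6 (/glibc-build-2.15/libc.so.6). It is also a symbolic link; the real library file is libc.so. Output: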


$ ll libc.so.6
lrwxrwxrwx 1 root root 7 Sep 23 07:41 libc.so.6 -> libc.so
[usr@linux glibc-build-2.15]$ strings libc.so | grep GLIBC
GLIBC_2.2.5
GLIBC_2.2.6
GLIBC_2.3
GLIBC_2.3.2
GLIBC_2.3.3
GLIBC_2.3.4
GLIBC_2.4
GLIBC_2.5
GLIBC_2.6
GLIBC_2.7
GLIBC_2.8
GLIBC_2.9
GLIBC_2.10
GLIBC_2.11
GLIBC_2.12
GLIBC_2.13
GLIBC_2.14
GLIBC_2.15
GLIBC_PRIVATE

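This is the library we need; now update the system library.
Note that updating the system link (/lib64/libc.so.6 in my case) is error-prone. I do not know of a better way; usually the old link is deleted and a new one created.
After the old link is deleted, however, many commands stop working because the system can no longer find the glibc library. A glibc library then has to be specified temporarily, as follows (libc.so is renamed so that it can be told apart from other versions installed later):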


cp /****/glibc-build-2.15/libc.so /lib64/libc-2.15.so
rm -rf /lib64/libc.so.6
LD_PRELOAD=/lib64/libc-2.15.so ln -s /lib64/libc-2.15.so /lib64/libc.so.6

The link is updated. Then:


$ strings /lib64/libc.so.6 | grep GLIBC
GLIBC_2.2.5
GLIBC_2.2.6
GLIBC_2.3
GLIBC_2.3.2
GLIBC_2.3.3
GLIBC_2.3.4
GLIBC_2.4
GLIBC_2.5
GLIBC_2.6
GLIBC_2.7
GLIBC_2.8
GLIBC_2.9
GLIBC_2.10
GLIBC_2.11
GLIBC_2.12
GLIBC_2.13
GLIBC_2.14
GLIBC_2.15
GLIBC_PRIVATE

The link was updated successfully, and dependency problems on GLIBC_2.15 and lower versions no longer occur.

http://love.junzimu.com/archives/2269

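Analyzing RNA-seq data with tophat and cufflinks (reposted)

Link: http://blog.sciencenet.cn/blog-635619-884213.html

The human genome has a little over twenty thousand genes, but they are not all expressed all the time; expression differs between developmental stages and tissues. An effective way to measure it is RNA-seq, which uses next-generation sequencing to sequence the whole mRNA population of cells and thereby determines each gene's expression level and expressed regions. It is mainly used to analyze differential gene expression between conditions and alternative splicing.

RNA-seq analysis has roughly these steps: map the reads to the genome; build transcripts for each sample from the mapped regions; compare and merge the transcripts of the samples; finally quantify differences and alternative splicing and perform other analyses.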

Mapping

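The first step of any sequence analysis is to map the reads to the genome, which tells us where in the genome each read came from. Mappers are generally built on one of two fast indexing approaches. The first is hashing, used by MOSAIK, SOAP and SHRiMP: once a hash table of the reference genome is built, the position of a given sequence can be looked up in a constant number of operations. This is efficient, but because some genomic regions are highly repetitive, that constant can become very large and efficiency drops. The second is the Burrows-Wheeler transform, used by BWA, Bowtie and SOAP2. It is a more ingenious design than hashing: it began as a text compression algorithm whose compression ratio grows with the repetitiveness of the text, which neatly overcomes the repetitiveness of genomes, and an exact match is found in at most as many steps as the query is long. This is why Burrows-Wheeler based software is more widely used for mapping.

RNA-seq mapping has a further problem: it must allow for splicing. An RNA is not necessarily transcribed from a single exon; several exons may be joined, with the introns between them removed, and intron lengths range from about fifty to a hundred thousand bases. If reads are aligned to the genome with a DNA-seq method, reads that span an exon-exon junction will show mismatches. Moreover, some pseudogenes left behind during evolution have no introns, so some reads are mapped to pseudogenes and inflate their apparent expression. For these reasons plain bwa and bowtie are not the best choice for RNA-seq.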
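The claim that an exact lookup takes at most as many steps as the query is long can be illustrated with a toy Burrows-Wheeler index and backward search. This is a teaching sketch only: it builds the suffix array by naive sorting and counts characters by scanning, which real mappers such as Bowtie or BWA replace with compressed, sampled data structures. All function names are hypothetical.

```python
def build_index(text):
    """Return (suffix array, BWT string, C table) for text + '$'."""
    text += "$"  # sentinel, sorts before A/C/G/T
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)  # character preceding each suffix
    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1
    c_table, total = {}, 0
    for ch in sorted(counts):  # C[ch] = number of characters smaller than ch
        c_table[ch] = total
        total += counts[ch]
    return sa, bwt, c_table


def find(pattern, sa, bwt, c_table):
    """Backward search: one interval update per pattern character."""
    lo, hi = 0, len(bwt)
    for ch in reversed(pattern):
        if ch not in c_table:
            return []
        lo = c_table[ch] + bwt[:lo].count(ch)
        hi = c_table[ch] + bwt[:hi].count(ch)
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])  # start positions of all exact matches


sa, bwt, c_table = build_index("GATTACA")
print(find("TA", sa, bwt, c_table))
```

The loop in find runs once per query character regardless of how often the pattern occurs in the text, which is the property described above.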

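Software better suited to RNA mapping has to overcome these two problems. Tophat, subread, STAR, GSNAP, RUM and MapSplice were all developed for RNA sequencing. I have only used Tophat. Its newer version, Tophat2, maps in three stages: if a gene annotation file is provided, it first maps reads to the annotated transcriptome; it then maps the remaining reads to the genome with bowtie in the ordinary way; finally it performs spliced mapping using all the reads that went through bowtie. This gives it comparatively high accuracy. Before running tophat, the reference genome must be indexed, which samtools handles easily.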

samtools faidx hg38.fasta

For tophat, a bowtie2 index is also needed.

bowtie2-build genome.fa genome

hg38.fasta is the human reference genome; indexing it makes lookups during mapping faster.

(If you would rather skip building the index, prebuilt Bowtie index bundles can be downloaded instead.)

Then start mapping with Tophat:

path/tophat \
-p 8 \
-G $path_ref/Homo.GTF \
-o tophat_output \
$path_ref/hg38 \
single_end.fastq

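-p sets the number of threads, -G gives the location of the annotation file, -o gives the output directory, and the last two arguments are the reference genome index and the fastq file to map. Do not add the .fa suffix to the reference name, because the program looks for index files with other suffixes.

The running time depends on the size of the fastq file and the number of threads. Single-end data with 8 threads takes about one hour per GB; paired-end data of 4 GB per end with 16 threads takes five to six hours. When it finishes, the folder given by -o is created in the directory of the fastq file. It contains several bam and bed files and a summary; the file needed for the next step is accepted_hits.bam.

The annotation file has the GTF suffix and records the genomic positions of the exons of all known genes (the original post linked to the hg38 annotation download). For reads already mapped to the genome we can therefore look up directly in the annotation whether a position belongs to an exon or a transcript. Opinions differ on whether to supply the annotation at this step. I tried both ways on single-end reads and found some differences:

With annotation:

Without annotation:

Without annotation more reads are mapped, but multiply mapped reads also increase. Since the first of Tophat2's three mapping stages relies on the annotation file, I still think it is better to supply it when running tophat.

The last step of mapping is duplicate removal: when several reads map to exactly the same position, only one of them, the best one, is kept. (rmdup removes such positional duplicates; it does not filter reads that map to more than one location.) Such reads make up a large share of the data. The step is simple with rmdup from samtools: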

samtools rmdup -s input.bam output.bam

Lowercase -s tells samtools that the bam file comes from single-end sequencing; without -s, paired-end is assumed.

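2. Building transcripts

After mapping, cufflinks assembles the mapped reads into a transcriptome that in theory contains all the mRNAs present in the cells at the time. The assembled transcriptome includes possible splicing information and the expression level of every transcript. Expression is normalized by the total number of mapped reads and by the length of each transcript: it is the number of fragments per thousand bases of transcript per million successfully mapped fragments, that is, fragments per kilobase of transcript per million mapped fragments (FPKM).

FPKM is calculated as:

FPKM = 10^9 × C / (N × L)

In this formula, C is the number of fragments mapped to the transcript, N is the total number of successfully mapped fragments, and L is the length of the transcript in bases.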
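As a worked example of the standard FPKM definition with the symbols C, N and L above, here is a minimal sketch. The function name and the numbers are illustrative assumptions, and cufflinks itself estimates C with a statistical model rather than a plain count.

```python
def fpkm(c, n, l):
    """FPKM = 10^9 * C / (N * L).

    c: fragments mapped to the transcript
    n: total mapped fragments in the sample
    l: transcript length in bases
    """
    return c * 1e9 / (n * l)


# Assumed example: 500 fragments on a 2,000-base transcript, 20 million mapped fragments.
value = fpkm(500, 20_000_000, 2_000)
print(value)
```

Doubling the transcript length or the sequencing depth while keeping C fixed halves the value, which is exactly the normalization described above.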

Command:

~/software/cufflinks-2.2.1/cufflinks \
-g /wrk/ref/Homo_sapiens.GRCh38.78.gtf \
-o ../cufflinks_sample1 \
-p 8 \
accepted_hits.bam

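The output folder cufflinks_sample1 then contains four files: genes.fpkm_tracking, isoforms.fpkm_tracking, skipped.gtf and transcripts.gtf. The next step uses transcripts.gtf, which is the transcriptome of this sample.

3. Merging transcriptomes

To compare samples, the transcriptomes of the experimental and control groups have to be merged. cuffmerge can merge two or more transcriptomes and can also fold in the reference annotation, which helps find new genes and splice variants and improves the quality of the merged transcriptome. Some say cuffcompare must be run before merging, but according to the official site this is not necessary. The main difference between them is that cuffcompare leaves the transcripts of each sample unchanged and only compares their positions, so its combined output merely contains all the small transcript fragments, whereas cuffmerge looks at the differences between samples and tries to assemble the transcript fragments of all samples into longer, more complete transcripts. The merged file from cuffmerge is therefore more convincing than the combined file from cuffcompare. Both tools serve to produce a merged annotation file for cuffdiff, which comes next.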

Command:

~/software/cufflinks-2.2.1/cuffmerge \
-g /wrk/ref/Homo_sapiens.GRCh38.78.gtf \
-s /wrk/ref/hg38.fa \
-p 16 \
assembly_list.txt

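The assembly_list.txt given last must list the paths of all transcripts.gtf files to merge; two or more are fine. cuffmerge then creates a merged_asm folder containing merged.gtf, the merged transcriptome.

Differential gene expression analysis

The last step is to analyze alternative splicing and differential expression with cuffdiff. Its input is more complex: it needs the merged file from the previous step, the bam file with the mapping results of every sample, and a label for the sample of each bam file, given in the same order and separated by commas.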

~/software/cufflinks-2.2.1/cuffdiff \
-o diff22rv1/ \
--labels vap,vcap_neg_control \
-p 32 \
-u merged_asm/merged.gtf \
vcap/accepted_hits.bam vcapneg/accepted_hits.bam

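This example has only one case and one control, so two labels are enough; the bam files are given last, in order.

cuffdiff produces a lot of output. For every gene, transcript, coding sequence and splice variant it analyzes FPKM, counts and differences between samples, and writes several groups of files that can be explored according to the analysis needed.

At this point the results are mostly in place. What remains is mainly plotting: finding new splice variants, finding differentially expressed genes, running enrichment analysis on them and so on. Plotting matters a great deal, since even the best result needs a good figure. I have not worked on alternative splicing, but for differential expression cummeRbund is a plotting tool that works very well with cufflinks.

cummeRbund is an R package that makes analyzing cuffdiff results very convenient. Once it is installed, just change the working directory to the folder with the cuffdiff results and run these two lines in R.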

library(cummeRbund) # load the installed cummeRbund package in R

cuff <- readCufflinks() # read all cuffdiff results into the variable cuff

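Next, draw a heatmap of the differentially expressed genes; this is slightly more involved:

# Take the 100 most significant genes (or however many you like), ranked by p-value: the smaller the p-value, the more significant the difference.

gene.diff <- diffData(genes(cuff))

gene.diff.top <- gene.diff[order(gene.diff$p_value),][1:100,]

# Get the IDs of these top 100 differential genes

myGeneIds <- gene.diff.top$gene_id

# Then fetch the genes by ID

myGenes <- getGenes(cuff, myGeneIds)

# Draw the heatmap

csHeatmap(myGenes, cluster="both")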

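The resulting plot looks roughly like this:

Enrichment analysis can follow. There are many ways: export the gene names and upload them to DAVID, or use the goseq package from Bioconductor. Detailed goseq code is rather long; I will write about it later, and the original post linked to the code I used with goseq.

The Cuff workflow is described in this paper, which also contains very detailed commands and examples: Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks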

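Installing and configuring a Samba server on CentOS 6.3 (reposted)

I. Introduction

Samba is software that lets Linux systems speak Microsoft's network protocol. SMB stands for Server Message Block and is primarily Microsoft's network communication protocol; Samba brought SMB to Linux, which gave rise to today's Samba software. Microsoft later renamed SMB to CIFS (Common Internet File System) and added many new features, which in turn made Samba more capable.

Samba's main function is file and printer sharing between Linux and Windows. It can be used for sharing between Windows and Linux as well as between Linux and Linux, but since NFS (Network File System) already handles Linux-to-Linux data sharing well, Samba is mostly used between Linux and Windows.

SMB is a client/server protocol, so a Samba server can act as a file server and also as a Samba client. For example, Windows clients can access files on a Samba server set up on Linux through SMB, and the Samba server can in turn access files shared by other Windows or Linux systems on the network.
Samba talks to Windows over NetBIOS; to use files shared from Linux, make sure NetBIOS is installed on your Windows system.

Samba runs as two services, SMB and NMB. SMB is the core service: it sets up sessions between the Linux Samba server and Samba clients, authenticates users and provides access to files and printers. File sharing works only when the SMB service is running; it listens on TCP port 139 (and on 445 for SMB directly over TCP). NMB handles name resolution, similar to DNS: it maps the name shared by the Linux system to its IP address. If NMB is not running, shares can only be reached by IP. It listens on UDP ports 137 and 138.

For example, if a Samba server has IP address 10.0.0.163 and the NetBIOS name davidsamba, either of the following two paths entered in the Windows Explorer address bar reaches the shared files. This is how Linux Samba shares are browsed from Windows.
\\10.0.0.163\<share name>
\\davidsamba\<share name>

A Samba server can provide: WINS and DNS services; network browsing; authentication and authorization between Linux and Windows domains; UNICODE character set and domain name mapping; UNIX shares conforming to the CIFS protocol; and more.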

II. System environment

Platform: CentOS release 6.3 (Final)

Samba version: samba-3.5.10-125.el6.x86_64

Samba Server IP: 10.0.0.163

Firewall disabled / iptables: Firewall is not running.

SELINUX=disabled

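III. Installing Samba

1. On a networked machine install with yum; without network access, mount the installation disc and install from it.

# yum install samba samba-client samba-swat

The dependencies samba-common, samba-winbind-clients and libsmbclient are installed automatically.

2. Check the installation

3. Package descriptions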

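samba-common-3.5.10-125.el6.x86_64               // Samba configuration files and testparm, the configuration syntax checker
samba-client-3.5.10-125.el6.x86_64                    // client software: the tools needed when a Linux host acts as a client
samba-swat-3.5.10-125.el6.x86_64                    // web-based (HTTP) configuration interface for the Samba server
samba-3.5.10-125.el6.x86_64                            // server software: the Samba daemons, shared documentation, log rotation and boot-time defaults

Once the Samba server is installed, the configuration directory /etc/samba and a number of Samba command-line tools are created. /etc/samba/smb.conf is the main configuration file and /etc/init.d/smb is the start/stop script.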

4. Start the Samba server

Use /etc/init.d/smb start/stop/restart to start, stop or restart Samba. Starting the SMB service looks like this:

5. Check the Samba service status

# service smb status

6. Enable start at boot

# chkconfig --level 35 smb on // start Samba automatically in runlevels 3 and 5

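IV. Configuring Samba

The main Samba configuration file is /etc/samba/smb.conf

It consists of two parts

Global Settings (lines 55-245)

These options concern the overall runtime environment of the Samba service and apply to all shares.

Share Definitions (line 246 to the end)

These options configure individual shared directories and apply only to the share in which they appear.

Global parameters: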

#==================Global Settings ===================
[global]

config file = /usr/local/samba/lib/smb.conf.%m
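Note: config file lets you use another configuration file in place of the default one. If the file does not exist, the option has no effect. This parameter is useful because it makes the configuration more flexible and lets one Samba server behave like several differently configured servers. For example, to have the machine PC1 (its host name) use its own configuration when accessing the Samba server, create a file named smb.conf.pc1 for PC1 under /etc/samba/host/ and add to smb.conf: config file = /etc/samba/host/smb.conf.%m. When PC1 connects, smb.conf.%m is replaced by smb.conf.pc1, so PC1 is served according to smb.conf.pc1 while other machines still use smb.conf.

workgroup = WORKGROUP
Note: the workgroup or domain the Samba server joins.

server string = Samba Server Version %v
Note: a description of the Samba server; any string, or empty. The macro %v expands to the Samba version.

netbios name = smbserver
Note: the NetBIOS name of the Samba server. If omitted, the first component of the server's DNS name is used. Do not give netbios name and workgroup the same value.

interfaces = lo eth0 192.168.12.2/24 192.168.13.2/24
Note: the network interfaces Samba listens on, given as interface names or as IP addresses.

hosts allow = 127. 192.168.1. 192.168.10.1
Note: the clients allowed to connect to the Samba server; multiple entries are separated by spaces. An entry can be a single IP or a network. hosts deny is the exact opposite of hosts allow.
Examples: hosts allow = 172.17.2. EXCEPT 172.17.2.50
allows hosts from 172.17.2.* except 172.17.2.50
hosts allow = 172.17.2.0/255.255.0.0
allows all hosts in the 172.17.2.0/255.255.0.0 subnet
hosts allow = M1, M2
allows the two computers M1 and M2
hosts allow = @pega
allows all computers in the pega netgroup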

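max connections = 0
Note: the maximum number of connections to the Samba server; requests beyond it are refused. 0 means unlimited.

deadtime = 0
Note: the time in minutes after which a connection with no open files is dropped. 0 means Samba never disconnects automatically.

time server = yes/no
Note: makes nmbd act as a time server for Windows clients.

log file = /var/log/samba/log.%m
Note: the location and name of the log file. The macro %m (host name) gives every accessing machine its own log file; if pc1 and pc2 have accessed the server, log.pc1 and log.pc2 appear under /var/log/samba.

max log size = 50
Note: the maximum log file size in kB; 0 means unlimited.

security = user
Note: how users are authenticated. There are four modes.
1. share: no user name or password is required; low security.
2. user: shares can only be accessed by authorized users; the Samba server itself checks the account and password, which must be created on this server.
3. server: account and password are verified by another Windows NT/2000 or Samba server, a form of proxy authentication. The administrator can keep all Windows users and passwords on one NT system and use it for Samba authentication; the remote server authenticates all users and passwords, and if that fails Samba falls back to user-level security.
4. domain: domain security level; authentication is done by the primary domain controller (PDC).

passdb backend = tdbsam
Note: passdb backend is the user database backend. There are three: smbpasswd, tdbsam and ldapsam. sam presumably stands for security account manager.
1. smbpasswd: uses Samba's own smbpasswd tool to give a system user (real or virtual) a Samba password, which clients use to access Samba resources. The smbpasswd file is in /etc/samba by default but sometimes has to be created by hand.
2. tdbsam: uses a database file, passdb.tdb, by default in /etc/samba, as the user database. Samba users can be created in it with smbpasswd -a, but they must already be system users. The pdbedit command can also create Samba accounts; it has many options, of which the main ones are:
pdbedit -a username: create a Samba account.
pdbedit -x username: delete a Samba account.
pdbedit -L: list Samba users, reading the passdb.tdb database file.
pdbedit -Lv: list Samba users in detail.
pdbedit -c "[D]" -u username: disable the user's account.
pdbedit -c "[]" -u username: re-enable the user's account.
3. ldapsam: authenticates users against an LDAP directory. Set up the LDAP service first, then set "passdb backend = ldapsam:ldap://LDAP Server"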

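encrypt passwords = yes/no
Note: whether passwords are encrypted for authentication. Current Windows systems all use encrypted passwords, so this should normally be enabled; it is on by default.

smb passwd file = /etc/samba/smbpasswd
Note: the Samba password file. Create the smbpasswd file by hand if it does not exist.

username map = /etc/samba/smbusers
Note: user name mapping, for example mapping root to administrator or admin. The mapping must be defined in the smbusers file beforehand, for example: root = administrator admin. Then administrator or admin can be used instead of root to log in to the Samba server, which is closer to what Windows users expect.

guest account = nobody
Note: the guest user name.

socket options = TCP_NODELAY SO_RCVBUF=8192 SO_SNDBUF=8192
Note: socket options for sessions between server and client; can improve transfer speed.

domain master = yes/no
Note: whether the Samba server should become the domain master browser, which manages browsing across subnets.

local master = yes/no
Note: whether the Samba server tries to become the local master browser. With no it never will; with yes it still has to win the election.

preferred master = yes/no
Note: forces a master browser election as soon as Samba starts, which raises its chance of becoming the local master browser. If set to yes, it is best to set domain master to yes too. Caution: if other machines on the same subnet (Windows NT or other Samba servers) are also set as preferred master, they will flood the network with broadcasts competing for the role and hurt network performance.
If several Samba servers share one segment, set the three parameters above on only one of them.

os level = 200
Note: the os level of the Samba server, which decides whether it has a chance of becoming the local master browser. It ranges from 0 to 255; Windows NT has os level 32, Windows 95/98 has 1 and Windows 2000 has 64. With 0 the Samba server loses every browser election. To make the Samba server the PDC, set a higher value.

domain logons = yes/no
Note: whether the Samba server acts as a domain controller. Both primary and backup domain controllers need this enabled.

logon script = %u.bat
Note: the logon script Samba provides when a user logs in from a Windows client. With %u.bat a script must be provided for every user, which is tedious with many users. A fixed file name such as start.bat can be used instead, so that every user runs start.bat after login. The file must be placed in the directory given by path in [netlogon].

wins support = yes/no
Note: whether the Samba server provides the WINS service.

wins server = <WINS server IP address>
Note: the address of another WINS server for the Samba server to use.

wins proxy = yes/no
Note: whether the Samba server acts as a WINS proxy.

dns proxy = yes/no
Note: whether the Samba server acts as a DNS proxy.

load printers = yes/no
Note: whether printers are shared as soon as Samba starts.

printcap name = cups
Note: the configuration source for shared printers.

printing = cups
Note: the printing system used for shared printers. Supported systems include: cups, bsd, sysv, plp, lprng, aix, hpux, qnx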

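Share parameters:
#================== Share Definitions ==================
[share name]

comment = <any string>
Note: a description of the share; any string.

path = <shared directory path>
Note: the path of the shared directory. Macros such as %u and %m can stand for the UNIX user name and the client's NetBIOS name; they are mainly used in the [homes] share. For example, if instead of the homes section we create a directory named after each Linux user under /home/share/ as that user's share, path can be written as: path = /home/share/%u. When a user connects, the macro is replaced by the user name; the resulting directory must exist, otherwise the client cannot find the network path. Similarly, to split directories by client machine rather than by user, create a directory named after the NetBIOS name of every machine that may access Samba and write: path = /home/share/%m.

browseable = yes/no
Note: whether the share can be browsed.

writable = yes/no
Note: whether the shared path is writable.

available = yes/no
Note: whether the share is available.

admin users = <administrators of the share>
Note: the administrators of the share, who have full control over it. In samba 3.0 this has no effect when authentication is set to "security=share".
Example: admin users = david, sandy (separate multiple users with commas).

valid users = <users allowed to access the share>
Note: the users allowed to access the share.
Example: valid users = david, @dave, @tech (separate multiple users or groups with commas; write a group as "@group name").

invalid users = <users denied access to the share>
Note: the users not allowed to access the share.
Example: invalid users = root, @bob (separate multiple users or groups with commas).

write list = <users allowed to write to the share>
Note: the users who may write files in the share.
Example: write list = david, @dave

public = yes/no
Note: whether the guest account may access the share.

guest ok = yes/no
Note: same meaning as "public".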

Some special shares:
[homes]
comment = Home Directories
browseable = no
writable = yes
valid users = %S
; valid users = MYDOMAIN\%S

[printers]
comment = All Printers
path = /var/spool/samba
browseable = no
guest ok = no
writable = no
printable = yes

[netlogon]
comment = Network Logon Service
path = /var/lib/samba/netlogon
guest ok = yes
writable = no
share modes = no

[Profiles]
path = /var/lib/samba/profiles
browseable = no
guest ok = yes

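After installing Samba, use testparm to check that smb.conf is correct. testparm -v lists in detail the configuration parameters smb.conf supports.

The default smb.conf has many options and is rather tedious, so the options are explained here through examples. First back up smb.conf, then create a new one.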

# cp -p /etc/samba/smb.conf /etc/samba/smb.conf.orig

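Example 1: the company has a workgroup named workgroup and needs a Samba server as a file server publishing the directory /share under the share name public, accessible to all employees.

a. Edit the main Samba configuration file as follows: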

#======================= Global Settings =====================================

[global]
# Settings for the overall Samba runtime environment; they apply to all shares

# ----------------------- Network Related Options -------------------------
#
# workgroup = NT-Domain-Name or Workgroup-Name, eg: MIDEARTH
#
# server string is the equivalent of the NT Description field
#
# netbios name can be used to specify a server name not tied to the hostname

        # Workgroup, the same concept as a Windows workgroup
        workgroup = WORKGROUP
        # Short description of the Samba server
        server string = David Samba Server Version %v
        # Computer name shown in Windows
        netbios name = DavidSamba

# --------------------------- Logging Options -----------------------------
#
# Log File let you specify where to put logs and how to split them up.

        # Log file for Samba users; %m stands for the client host name.
        # Samba creates a separate log file in this directory for every connecting host.
        log file = /var/log/samba/log.%m
# ----------------------- Standalone Server Options ------------------------
#
# Security can be set to user, share(deprecated) or server(deprecated)

        # Share level: users need no account or password
        security = share

#============================ Share Definitions ==============================

# The settings below apply only to this share
[public]
        # Description of the share; free text
        comment = Public Stuff
        # Directory to share (required)
        path = /share
        # Visible to everyone; equivalent to guest ok = yes
        public = yes

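b. Create the shared directory

The share was set to /share above, so the /share directory has to be created:

Because anonymous users should be able to download and upload shared files, give ownership of /share to nobody.

c. Restart the smb service

d. Check that smb.conf is correct

e. Access the shared files on the Samba server

Accessing the Samba share from Linux

Accessing the Samba share from Windows

Example 2: the company has several departments. The TS department's material is to be kept in /ts on the Samba server for central management so that TS staff can browse it, and only TS staff may access the directory.

a. Add the TS group and users

To create a user and add it to a group at the same time: useradd -g <group> <user>

b. Create the /ts folder under the root directory

c. Add the two accounts just created to Samba

d. Edit the main configuration file as follows: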

#======================= Global Settings =====================================

[global]

# ----------------------- Network Related Options -------------------------
#
# workgroup = NT-Domain-Name or Workgroup-Name, eg: MIDEARTH
#
# server string is the equivalent of the NT Description field
#
# netbios name can be used to specify a server name not tied to the hostname

        workgroup = WORKGROUP
        server string = David Samba Server Version %v
        netbios name = DavidSamba

# --------------------------- Logging Options -----------------------------
#
# Log File let you specify where to put logs and how to split them up.

        log file = /var/log/samba/log.%m

# ----------------------- Standalone Server Options ------------------------
#
# Security can be set to user, share(deprecated) or server(deprecated)

        # User level: the Samba server providing the service checks account and password
        security = user

#============================ Share Definitions ==============================

# Users' home directories
[homes]
        comment = Home Directories
        browseable = no
        writable = yes
;       valid users = %S
;       valid users = MYDOMAIN\%S

[public]
        comment = Public Stuff
        path = /share
        public = yes

# ts group directory; only members of the ts group may access it
[ts]
        comment = TS
        path = /ts
        valid users = @ts
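How "valid users = @ts" restricts a share can be modelled with a few lines of code. The sketch below is a deliberately simplified model, not Samba's actual logic (it ignores invalid users, admin users, the other name prefixes Samba supports and system group lookup); the parser, the function names and the sample users are assumptions for illustration.

```python
def parse_smb_conf(text):
    """Parse smb.conf-style text into {section: {parameter: value}}."""
    sections, current = {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#;":  # blank lines and comment lines
            continue
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
            sections[current] = {}
        elif "=" in line and current is not None:
            key, value = line.split("=", 1)
            sections[current][key.strip()] = value.strip()
    return sections


def may_access(share, user, user_groups):
    """Simplified check: no 'valid users' means everyone may connect."""
    valid = share.get("valid users")
    if not valid:
        return True
    entries = valid.replace(",", " ").split()
    return any(e == user or (e.startswith("@") and e[1:] in user_groups)
               for e in entries)


SAMPLE = """
[global]
        security = user
[ts]
        comment = TS
        path = /ts
        valid users = @ts
"""
conf = parse_smb_conf(SAMPLE)
print(may_access(conf["ts"], "sandy", {"ts"}))
```

For checking a real configuration, testparm remains the authoritative tool; this sketch only mirrors the intent of the [ts] section above.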

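e. Reload the configuration

f. Verify from a Windows client: open \\10.0.0.163 and enter a user name and password when prompted, here sandy:

g. Access succeeds; the public share, sandy's home directory and the ts share that sandy is allowed to access are visible

h. Enter the ts directory; it contains the newyork.city file created earlier

Example 3: give different users different permissions on the same shared directory, for easier management and maintenance. This should meet the needs of some enterprise users. (Compiled from the web.)

a. Requirements

1. A company has 5 departments: HR & Admin Dept, Financial Management Dept, Technical Support Dept, Project Dept and Customer Service Dept.
2. Each department's folder may be accessed only by its own staff; files exchanged between departments go into a common folder.
3. Each department has an administrator account that manages the department folder and an ordinary account that can only create and view files.
4. The common folder contains a folder for tools and folders for each department's shared files.
5. In a department's own folder, the department administrator has full control. Ordinary users of the department can create files and folders there and have full control over what they created, but can only read, not modify or delete, files and folders created or uploaded by the administrator. Users of other departments cannot access the folder.
6. In each department's shared folder inside the common folder, the department administrator has full control. Ordinary users of the department can create files and folders there and have full control over what they created, but can only read, not modify or delete, what the administrator created or uploaded. Users of a department (administrator and ordinary user alike) can only view other departments' shared folders; they cannot modify, delete or create anything. In the tools folder only the administrator has write permission; other users have read access only.

b. Plan

Based on these requirements the plan is:
1. Create a separate partition named Company when partitioning. It contains the folders HR, FM, TS, PRO, CS and Share. Share in turn contains HR, FM, TS, PRO, CS and Tools.
2. Each department manages its own folder; the Tools folder is maintained by the administrator.
3. HR administrator account: hradmin; ordinary account: hruser.
FM administrator account: fmadmin; ordinary account: fmuser.
TS administrator account: tsadmin; ordinary account: tsuser.
PRO administrator account: proadmin; ordinary account: prouser.
CS administrator account: csadmin; ordinary account: csuser.
Tools administrator account: admin.

The relationship between the folders (shown as a figure in the original post):

c. Create users

Create the system accounts with useradd, then create the SMB accounts with smbpasswd -a.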

[root@TS-DEV ~]# useradd -s /sbin/nologin hradmin          
[root@TS-DEV ~]# useradd -g hradmin -s /sbin/nologin hruser
[root@TS-DEV ~]# useradd -s /sbin/nologin fmadmin            
[root@TS-DEV ~]# useradd -g fmadmin -s /sbin/nologin fmuser     
[root@TS-DEV ~]# useradd -s /sbin/nologin tsadmin
[root@TS-DEV ~]# useradd -g tsadmin -s /sbin/nologin tsuser
[root@TS-DEV ~]# useradd -s /sbin/nologin proadmin         
[root@TS-DEV ~]# useradd -g proadmin -s /sbin/nologin prouser 
[root@TS-DEV ~]# useradd -s /sbin/nologin csadmin
[root@TS-DEV ~]# useradd -g csadmin -s /sbin/nologin csuser
[root@TS-DEV ~]# useradd -s /sbin/nologin admin            
[root@TS-DEV ~]# 

[root@TS-DEV ~]# smbpasswd -a hradmin
New SMB password:
Retype new SMB password:
Added user hradmin.
[root@TS-DEV ~]# smbpasswd -a hruser
[root@TS-DEV ~]# smbpasswd -a fmadmin
[root@TS-DEV ~]# smbpasswd -a fmuser
[root@TS-DEV ~]# smbpasswd -a tsadmin
[root@TS-DEV ~]# smbpasswd -a tsuser
[root@TS-DEV ~]# smbpasswd -a proadmin
[root@TS-DEV ~]# smbpasswd -a prouser
[root@TS-DEV ~]# smbpasswd -a csadmin 
[root@TS-DEV ~]# smbpasswd -a csuser
[root@TS-DEV ~]# smbpasswd -a admin    
[root@TS-DEV ~]#

d. Create the directories

e. Change directory ownership and permissions

[root@TS-DEV Company]# chown hradmin.hradmin HR
[root@TS-DEV Company]# chown fmadmin.fmadmin FM
[root@TS-DEV Company]# chown tsadmin.tsadmin TS    
[root@TS-DEV Company]# chown proadmin.proadmin PRO    
[root@TS-DEV Company]# chown csadmin.csadmin CS      
[root@TS-DEV Company]# chown admin.admin Share

[root@TS-DEV Company]# cd Share/
[root@TS-DEV Share]# chown hradmin.hradmin HR && chown fmadmin.fmadmin FM && chown tsadmin.tsadmin TS && chown proadmin.proadmin PRO && chown csadmin.csadmin CS && chown admin.admin Tools
[root@TS-DEV Share]# chmod 1775 HR FM TS PRO CS
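
As a quick check of what mode 1775 means, the following Python sketch (not part of the original walkthrough; the helper names are made up for illustration) decodes it with the standard `stat` module. The leading 1 is the sticky bit: in a sticky directory a file can be deleted or renamed only by its owner, the directory's owner, or root, which is what lets ordinary users keep control of their own files while being unable to remove the administrator's.

```python
import stat


def describe_mode(octal_mode: int) -> str:
    """Return the ls-style permission string for a directory with this mode."""
    return stat.filemode(stat.S_IFDIR | octal_mode)


def has_sticky_bit(octal_mode: int) -> bool:
    """True if the sticky bit (the leading 1 in 1775) is set."""
    return bool(octal_mode & stat.S_ISVTX)


print(describe_mode(0o1775))   # drwxrwxr-t
print(has_sticky_bit(0o1775))  # True
print(has_sticky_bit(0o775))   # False
```

The trailing `t` in the permission string is how `ls -ld` shows the sticky bit on a directory.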

f. Configure Samba as follows:

#======================= Global Settings =====================================

[global]

# ----------------------- Network Related Options -------------------------
#
# workgroup = NT-Domain-Name or Workgroup-Name, eg: MIDEARTH
#
# server string is the equivalent of the NT Description field
#
# netbios name can be used to specify a server name not tied to the hostname

        workgroup = WORKGROUP
        server string = David Samba Server Version %v
        netbios name = DavidSamba

# --------------------------- Logging Options -----------------------------
#
# Log File let you specify where to put logs and how to split them up.

        log file = /var/log/samba/log.%m
        max log size = 50

# ----------------------- Standalone Server Options ------------------------
#
# Security can be set to user, share(deprecated) or server(deprecated)

        security = user
        passdb backend = tdbsam

#============================ Share Definitions ==============================

[HR]
     comment = This is a directory of HR.
     path = /Company/HR/
     public = no
     admin users = hradmin
     valid users = @hradmin
     writable = yes
     create mask = 0750
     directory mask = 0750
 
[FM]
     comment = This is a directory of FM.
     path = /Company/FM/
     public = no
     admin users = fmadmin
     valid users = @fmadmin
     writable = yes
     create mask = 0750
     directory mask = 0750
 
[TS]
     comment = This is a directory of TS.
     path = /Company/TS/
     public = no
     admin users = tsadmin
     valid users = @tsadmin
     writable = yes
     create mask = 0750
     directory mask = 0750
 
[PRO]
     comment = This is a PRO directory.
     path = /Company/PRO/
     public = no
     admin users = proadmin
     valid users = @proadmin
     writable = yes
     create mask = 0750
     directory mask = 0750
 
[CS]
     comment = This is a directory of CS.
     path = /Company/CS/
     public = no
     admin users = csadmin
     valid users = @csadmin
     writable = yes
     create mask = 0750
     directory mask = 0750
 
[Share]
     comment = This is a share directory.
     path = /Company/Share/
     public = no
     valid users = admin,@hradmin,@fmadmin,@tsadmin,@proadmin,@csadmin
     writable = yes
     create mask = 0755
     directory mask = 0755
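
The `create mask` and `directory mask` values above act as bit masks: Samba ANDs the mode it would otherwise give a new file or directory with the mask, so any permission bit missing from the mask is dropped. A minimal Python sketch of that arithmetic (the function name and the requested modes are illustrative assumptions, not Samba internals):

```python
def apply_mask(requested_mode: int, mask: int) -> int:
    """Keep only the permission bits that are present in the mask."""
    return requested_mode & mask


# A file requested with rwxrw-rw- in a share with "create mask = 0750"
print(oct(apply_mask(0o766, 0o750)))  # 0o740

# A directory requested with rwxrwxrwx in [Share] with "directory mask = 0755"
print(oct(apply_mask(0o777, 0o755)))  # 0o755
```

With mask 0750 the "other" bits are always cleared, which is why users outside a department's group get no access to files created in that department's share.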

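g. Test

Log in as hradmin

Try to access the TS department folder: a username and password are requested

Try to create a file under \\10.0.0.163\Share\TS

Creating a file in the user's own department folder succeeds

Run the remaining tests yourself.

Configuration is complete.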

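5. Mapping a shared directory to a Windows drive

Map the Samba public share to a Windows drive letter:

a. Right-click "Computer" -> "Map network drive"

b. Enter the share address and path in the Folder field, click "Finish", then enter the username and password

c. Once the mapping is done, the shared directory appears in Explorer

Tips:

When you access a file resource on Windows through "\\IP address", you usually have to enter the password only the first time; later connections log in directly without prompting. To switch to a different Samba user, run the following commands on Windows:
First open Start -> Run -> cmd and run "net use" to list the existing connections, then run "net use \\<Samba server IP or NetBIOS name>\ipc$ /del" to delete the connection already established with the Samba server, or run "net use * /del" to delete all current connections. The next time you open "\\IP address" you can log in as a different user.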

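6. Accessing shares from a Linux client

The sections above covered access from a Windows client. How does a Linux client view the files shared by another Linux Samba server?

This is done with the smbclient tool, which ships with the system by default. Its common uses are:

1. List the shares on a Samba server

# smbclient -L //<Samba server IP> -U <Samba username>

"-L" means list and "-U" means user. If the Samba server allows access without a password, "-U <Samba username>" can be omitted.

For example, when Samba requires a password, list the shares as follows:

# smbclient -L //10.0.0.163 -U david

When Samba allows passwordless access, run:

# smbclient -L //10.0.0.163

password: just press Enter.

2. Log in to the Samba server

To log in to a Samba server from a Linux client, use:

# smbclient //<Samba server IP>/<share name> -U <Samba username>

For example:

# smbclient //10.0.0.163/public -U david

smb: \> ? //entering ? here lists all the commands available at the smb prompt.

The procedure is much like logging in to an FTP server. Once logged in you can upload and download files and, with sufficient permissions, modify them.

In addition, files shared by a Samba server can be mounted on a Linux client with the mount command:

# mount -t cifs //10.0.0.163/public /mnt/samba/ -o username=david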

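7. SWAT, the Samba web administration tool

SWAT (Samba Web Administration Tool) is one of the tools for managing Samba through a browser. With SWAT, a client within the range that Samba allows can control the server's Samba from a browser: reading the online documentation, checking and editing smb.conf, changing passwords, restarting services and so on. Its visual interface makes Samba more approachable, and it is a powerful tool for those who dislike managing a server through a text interface.

swat runs under the xinetd super daemon and is enabled by enabling xinetd, so install the xinetd package first and then the swat package. samba-swat-3.5.10-125.el6.x86_64 was already installed above, so the installation is not repeated here.

1. Configure swat

Because swat is a service started by the xinetd super daemon, its configuration file lives in xinetd's directory, /etc/xinetd.d. Edit that file to enable the service so that swat becomes available when xinetd starts.

Open and edit /etc/xinetd.d/swat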

# default: off
# description: SWAT is the Samba Web Admin Tool. Use swat \
#              to configure your Samba server. To use SWAT, \
#              connect to port 901 with your favorite web browser.
service swat
{
        port            = 901                    //swat uses TCP port 901 by default; this can be changed
        socket_type     = stream
        wait            = no
        only_from       = 127.0.0.1              
        only_from       = 10.0.0.0               //add this line (changing "only_from=127.0.0.1" to "only_from=10.0.0.0") so that only the internal network may access SWAT
        user            = root                   //the web configuration runs as root by default; this can be changed to another system user
        server          = /usr/sbin/swat         //the swat executable is in /usr/sbin by default
        log_on_failure  += USERID
        disable         = yes                    //change "disable=yes" to "disable=no" so that swat starts together with the xinetd super daemon
}

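2. Start swat

Because swat runs under xinetd, starting xinetd also makes swat available.

3. Open swat

After swat is started on the server, open http://<server internal IP>:901 in a browser on a client within the allowed range to reach the server's swat, then enter the root username and password to reach the swat home page, shown below:

The swat home page

Managing Samba through swat is essentially the same as editing smb.conf directly, but browser access makes Samba administration more approachable and suits those who are not comfortable with a text interface or with editing configuration files by hand.

4. Configure Samba through swat

The swat page shows 8 options, each configuring a different part of Samba.

HOME: descriptions of the Samba programs and files.

GLOBALS: Samba's global parameters, i.e. the [global] section of smb.conf.

SHARES: Samba's share parameters.

PRINTERS: Samba's printer parameters.

WIZARD: the Samba configuration wizard.

STATUS: view and control the status of the Samba services.

VIEW: view Samba's text configuration file, smb.conf.

PASSWORD: manage Samba users: change passwords, create and delete users.

For detailed settings consult other references; the following is for reference only:

http://yuanbin.blog.51cto.com/363003/117105

This completes the configuration of the Samba server.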

Linux Basics: File-Sharing Services FTP, NFS and Samba

1. Set up a Samba share for the directory /data with the following requirements:

1) The share name is shared, the workgroup is MYDATA, and the share is browsable; 2) add the group develop and the users gentoo, centos and ubuntu, where gentoo and centos have develop as a supplementary group and ubuntu does not belong to develop; each user's system password is the same as the username; 3) add the Samba users gentoo, centos and ubuntu with the password "samba"; 4) only the develop group has write permission on the share; other users have read-only access; 5) the share only accepts hosts on the 192.168.1.0/24 network.

# Note: all of the following configuration is done on a single machine running CentOS 6.5

# 1. Install the software

yum install samba -y

# 2. Add the group and users

groupadd develop
useradd -G develop gentoo
useradd -G develop centos
useradd ubuntu
echo "gentoo" | passwd gentoo --stdin
echo "centos" | passwd centos --stdin
echo "ubuntu" | passwd ubuntu --stdin
echo -e "samba\nsamba" | smbpasswd  -a gentoo -s
echo -e "samba\nsamba" | smbpasswd  -a centos -s
echo -e "samba\nsamba" | smbpasswd  -a ubuntu -s

If that does not work, run smbpasswd -a USER and type the password (‘xxxx’) when prompted.

mkdir /data

setfacl -m g:develop:rwx /data/

Edit the main configuration file

vim /etc/samba/smb.conf

Restart the service

Test the configuration

Test the share

Test user access

vsftpd

2. The vsftpd service

vsftpd is an FTP server for Linux, used here to transfer data between Linux and Windows. FTP sends everything in plaintext, so it is not secure.

Installation

yum -y install vsftpd mysql-server
/etc/logrotate.d/vsftpd       log rotation settings
/etc/pam.d/vsftpd             PAM authentication settings
/etc/rc.d/init.d/vsftpd       startup script
/etc/vsftpd/ftpusers          users who are not allowed to log in over FTP
/etc/vsftpd/user_list         user list
/etc/vsftpd/vsftpd.conf       main configuration file
/var/ftp/pub                  public directory for anonymous users

By default anonymous users may download and nothing else. The service can be started as soon as installation finishes; the default anonymous user is ftp.

Create a file on the Linux server

View it from Windows

In cmd, download files and change directories; type help to list the available commands

Modifying the configuration file

Anonymous user settings:

anonymous_enable=YES          allow anonymous users to log in
anon_upload_enable=YES        allow anonymous uploads
anon_mkdir_write_enable=YES   allow anonymous users to create directories
anon_other_write_enable=YES   allow anonymous users other write operations, such as delete and rename
Local (system) user settings:
local_enable=YES              local users can log in
write_enable=YES              allow write commands
local_umask=022               umask for files created by local users
Confine all local FTP users to their home directories:
chroot_local_user=YES
Confine only the local FTP users listed in a file to their home directories:
chroot_list_enable=YES
chroot_list_file=/etc/vsftpd/chroot_list   write the usernames in this file to confine them in bulk
Logging:
xferlog_enable=YES
xferlog_std_format=YES
xferlog_file=/var/log/xferlog
Change the owner of uploaded files:
chown_uploads=YES             change the owner of anonymously uploaded files
chown_username=whoever
vsftpd uses PAM for user authentication; the PAM service name is set with:
pam_service_name=vsftpd
Whether to enable the user list control file:
userlist_enable=YES
userlist_deny=YES|NO
The default file is /etc/vsftpd/user_list
Connection limits:
max_clients: maximum number of concurrent connections
max_per_ip: maximum number of concurrent connections from a single IP
Transfer rate:
anon_max_rate: maximum transfer rate for anonymous users, in bytes per second
local_max_rate: maximum transfer rate for logged-in local users

Effective permissions = the intersection of the directory's filesystem permissions and the rights granted by vsftpd

Experiment

mkdir /var/ftp/pub/book
setfacl -m u:ftp:rwx /var/ftp/pub/book
vim /etc/vsftpd/vsftpd.conf
anonymous_enable=YES          allow anonymous users to log in
anon_upload_enable=YES        allow anonymous uploads
anon_mkdir_write_enable=YES   allow anonymous users to create directories
anon_other_write_enable=YES   allow anonymous users other write operations, such as delete and rename
useradd gentoo
passwd gentoo

The gentoo user can now upload, download and modify files in the shared book directory. For the security of the server it is best to

enable encrypted transfers or limit the number of connections.

NFS

3. NFS

yum install nfs-utils -y

Linux–Linux–Windows–UNIX

Virtual machine 1: NFS server, IP address 192.168.1.146

Virtual machine 2: client, IP address 192.168.1.143

VM 1

yum install nfs-utils -y
vim /etc/exports
User   192.168.1.163(rw)
showmount  -e  ip

vim /etc/exports

Set the shared directory, the clients it is shared with, and their permissions

mount -t nfs ip:/sharename /directory

VM 2

NFS identifies users by UID.

As a test, create a user with a specific UID on the server. On the client, create one user with the same UID but a different username, and another with the same username but a different UID.

Then grant access on the share to the server-side user. When we authenticate from the client as the user with the same name, access is denied,

but the user with the same UID and a different username can access the share, which shows that NFS controls user access by UID.

4. Experiment

Schematic diagram

Set up an NFS file-sharing service and DNS for two web servers that store the pages of a forum, so that when users land on different hosts

behind the same site address, their basic information stays unchanged

Set up the web1 environment

Build the LAMP environment

yum install httpd php php-mysql php-gd mysql-server -y

vim /etc/httpd/conf/httpd.conf

Notes

1. The main directory

2. Comment out the list

3. Add the new virtual host

4. service httpd restart

5. Install the database and grant privileges

service mysqld restart

Mount NFS

Check whether the directory is shared with you and whether you have permission

showmount -e 172.16.1.13

Mount

mount -t nfs 172.16.1.13: /share/data

View the shared files

Install WordPress

Download it from the official website yourself; the steps are not shown here.

The other web server is configured in the same way. Pay attention to database authorization: the user must be authorized to connect to the local MySQL

Configure the NFS server

Install NFS

yum install nfs-utils -y

service nfs restart

vim /etc/exports

showmount -e 172.16.1.140

service nfs restart

Put WordPress in /share/data, then check from the client whether it is visible

Enter the site editor

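The Beauty of Data Analysis: How to Do Regression Analysis

1. Determine whether the predictors are related to Y

To show: at least one of the predictors X1, X2, ..., Xp is related to the response Y

For any given value of n (the number of observations) and p (the number of predictors X), any statistical software package can be used to compute the p-value associated with the F-statistic using this distribution. Based on this p-value, we can determine whether or not to reject H0. (In other words, the hypothesis is tested with the software-computed p-value associated with the F-statistic.)

Example:

Is there a relationship between sales and advertising budget (TV, radio, and newspaper)?

The p-value corresponding to the F-statistic in Table 3.6 is very low, indicating clear evidence of a relationship between advertising and sales.

Background review:

The t-statistic (t-test) and the F-statistic

t-statistic = (estimate of the regression coefficient β − 0) / (standard error of that estimate), which measures the number of standard deviations that the estimate of β is away from 0. It is a statistic for testing a hypothesis about a single parameter in an econometric model.

The t-statistic is generally used to test whether a regression coefficient is 0. For example, in the linear regression Y = β0 + β1X, to check whether X and Y are related:

H0: X and Y are unrelated, i.e. β1 = 0

H1: X and Y are related, i.e. β1 ≠ 0

Compute the t-statistic; if it is far away from zero, X and Y are related. In practice the p-value is used to decide whether X and Y are related.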

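1) p-values (Probability, Pr)

1 Definition

Definition of the p-value: the probability, assuming the null hypothesis is true, of observing the current result or a more extreme one.

The p-value is a common measure of statistical significance.

The P value (P-Value, Probability, Pr) is a probability and reflects how likely an event is. By convention, a P value from a significance test below 0.05 is called significant and one below 0.01 highly significant, meaning that if only sampling error were at work, a difference at least this large would arise with probability below 0.05 or 0.01. The P value does not say how important a result is; it only describes how likely such data are.

Hypothesis testing is an important part of inferential statistics. The P value (P-Value, Probability, Pr) appears there frequently as another basis for the test decision.

A large p-value means there is not yet enough evidence to reject the null hypothesis.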

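2 Why the p-value exists

The idea behind the P value is to run an experiment and then see whether the result looks like what chance alone would produce. Researchers first state a "null hypothesis" that they want to overturn, for example that two sets of data are uncorrelated or do not differ significantly. Next they play devil's advocate: they assume the null hypothesis holds and compute the probability of obtaining a result at least as extreme as the one actually observed. That probability is the P value. According to Fisher, the smaller the P value, the stronger the evidence that the null hypothesis does not hold.

It is simple to understand; there are only two basic principles:

1) A proposition can only be falsified, not proven true

2) A small-probability event is assumed not to happen in a single trial

The logic of the proof is: I want to show that a proposition is true -> show that its negation is false -> under the assumption of the negation, a small-probability event is observed -> done.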

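3 Demo

Darts: suppose the target has ten rings, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1 (10 is the center). Define a qualified thrower as one whose true skill lands throws in rings 10~3, regardless of how they perform on the day. Suppose rings 10~3 cover 95% of the target area.

H0: A is not a qualified thrower

H1: A is a qualified thrower

Applied to this example: to show that A is a qualified thrower -> show that the proposition "A is not a qualified thrower" is false -> observe an event (say A hits ring 10 ten times in a row) whose probability p under the assumption "A is not a qualified thrower" is below 0.05 -> a small-probability event has occurred, so the negation is rejected.

So the smaller p is -> the rarer the event -> the stronger the grounds for rejecting the negation -> the more credible the original proposition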
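
To make "the probability of the observed result or something more extreme" concrete, here is a small Python sketch of an exact binomial p-value. It is a variant of the darts demo under assumptions of my own: the hypothesis being tested is that each throw misses rings 10~3 with probability 0.05, and the observation (3 misses in 10 throws) is made up for illustration.

```python
from math import comb


def binomial_tail(n: int, k: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p): k or more 'events' in n trials."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical observation: 3 misses in 10 throws, when the hypothesis under
# test says a single throw misses with probability 0.05.
p_value = binomial_tail(n=10, k=3, p=0.05)
print(f"p-value = {p_value:.4f}")

# Decision at the conventional 0.05 level
print("reject the hypothesis" if p_value < 0.05 else "not enough evidence to reject")
```

The p-value here is a tail probability: it adds up the chance of 3, 4, ..., 10 misses, not just the chance of exactly 3.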

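2) F-statistic

The t-test tests the significance of a single coefficient, i.e. whether one variable X is related to Y, for example whether TV advertising spend helps sales. Its null hypothesis is that the coefficient of that explanatory variable is 0.

The F-test assesses the effect of all the predictors together on the response; it is used when there are multiple predictors (X1, X2, X3, ...). Its null hypothesis is that all regression coefficients are 0.

In other words, the F-test is used to show that at least one of X1, X2, X3, ... is related to Y

The null hypothesis of the F-test is H0: all regression coefficients equal 0. If the F-test rejects H0, the model as a whole is meaningful. If it does not, there is no point in running the other tests, because none of the coefficients differs significantly from 0, which amounts to having no model (none of X1, X2, X3, ... has a relationship with Y).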
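
For reference, the F-statistic compares the variation explained by the model with the variation left in the residuals: F = ((TSS − RSS) / p) / (RSS / (n − p − 1)). The Python sketch below computes it for a simple regression (p = 1) on a tiny made-up data set; the data and function names are illustrative assumptions, and a real analysis would use a statistics package to get the associated p-value as well.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for simple linear regression."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def f_statistic(y, fitted, p):
    """F = ((TSS - RSS) / p) / (RSS / (n - p - 1)) for a model with p predictors."""
    n = len(y)
    mean_y = sum(y) / n
    tss = sum((yi - mean_y) ** 2 for yi in y)              # total sum of squares
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # residual sum of squares
    return ((tss - rss) / p) / (rss / (n - p - 1))


# Hypothetical data, for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
intercept, slope = fit_line(x, y)
fitted = [intercept + slope * xi for xi in x]
f_value = f_statistic(y, fitted, p=1)
print(round(f_value, 3))
```

An F value near 1 is what one would expect when no predictor is related to the response; values much larger than 1 count as evidence against H0.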

2. Determine the useful subset of predictors

Do all the predictors help to explain Y, or is only a subset of the predictors useful? (Identify the predictors that are useful for Y.)

The first step in a multiple regression analysis is to compute the F-statistic and to examine the associated pvalue. If we conclude on the basis of that p-value that at least one of the predictors is related to the response, then it is natural to wonder which are the guilty ones!

The task of determining which predictors are associated with the response, in order to fit a single model involving only those predictors, is referred to as variable selection.

There are three classical approaches for this task: forward selection, backward selection, and mixed selection.

1) Forward selection.

We begin with the null model, a model that contains an intercept but no predictors. We then fit p simple linear regressions and add to the null model the variable that results in the lowest RSS. We then add to that model the variable that results in the lowest RSS for the new two-variable model. This approach is continued until some stopping rule is satisfied.

2) Backward selection.

We start with all variables in the model, and remove the variable with the largest p-value, that is, the variable that is the least statistically significant. The new (p − 1)-variable model is fit, and the variable with the largest p-value is removed. This procedure continues until a stopping rule is reached. For instance, we may stop when all remaining variables have a p-value below some threshold.

3) Mixed selection.

This is a combination of forward and backward selection. We start with no variables in the model, and as with forward selection, we add the variable that provides the best fit. We continue to add variables one-by-one. Of course, as we noted with the Advertising example, the p-values for variables can become larger as new predictors are added to the model. Hence, if at any point the p-value for one of the variables in the model rises above a certain threshold, then we remove that variable from the model. We continue to perform these forward and backward steps until all variables in the model have a sufficiently low p-value, and all variables outside the model would have a large p-value if added to the model.

Compare:

Backward selection requires that the number of samples n is larger than the number of variables p (so that the full model can be fit). In contrast, forward stepwise can be used even when n < p, and so is the only viable subset method when p is very large.

How do we select the best model among a collection of models with different numbers of predictors?

Instead, we wish to choose a model with a low test error. As is evident here, the training error can be a poor estimate of the test error. Therefore, RSS and R2 are not suitable for selecting the best model among a collection of models with different numbers of predictors.

These approaches can be used to select among a set of models with different numbers of variables.

Cp, Akaike information criterion (AIC), Bayesian information criterion (BIC), and adjusted R2

In the past, performing cross-validation was computationally prohibitive for many problems with large p and/or large n, and so AIC, BIC, Cp, and adjusted R2 were more attractive approaches for choosing among a set of models. However, nowadays with fast computers, the computations required to perform cross-validation are hardly ever an issue. Thus, cross-validation is a very attractive approach for selecting from among a number of models under consideration.
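
Of the criteria listed above, adjusted R2 is the easiest to compute by hand: Adjusted R2 = 1 − (RSS / (n − d − 1)) / (TSS / (n − 1)), where d is the number of predictors. The Python sketch below uses made-up RSS and TSS values (assumptions for illustration, not results from any data set) to show how adding a nearly useless predictor can raise plain R2 while lowering adjusted R2.

```python
def r_squared(rss: float, tss: float) -> float:
    """Plain R^2 = 1 - RSS/TSS."""
    return 1 - rss / tss


def adjusted_r_squared(rss: float, tss: float, n: int, d: int) -> float:
    """Adjusted R^2 for a model with d predictors fitted to n observations."""
    return 1 - (rss / (n - d - 1)) / (tss / (n - 1))


# Hypothetical numbers: 20 observations, TSS = 100.
# Model A has 2 predictors; model B adds a third that barely reduces the RSS.
n, tss = 20, 100.0
model_a = {"d": 2, "rss": 30.0}
model_b = {"d": 3, "rss": 29.5}

for name, m in (("A", model_a), ("B", model_b)):
    print(name,
          round(r_squared(m["rss"], tss), 4),
          round(adjusted_r_squared(m["rss"], tss, n, m["d"]), 4))
```

Plain R2 prefers model B, but the adjusted value penalizes the extra predictor and prefers model A.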

TODO chapter6 <Linear Model Selection and Regularization>

3. Model error (RSE, R^2)

How well does the model fit the data?

An R2 value close to 1 indicates that the model explains a large portion of the variance in the response variable (Y).
It turns out that R2 will always increase when more variables are added to the model, even if those variables are only weakly associated with the response.

Example:

For the Advertising data, the RSE is 1,681 units while the mean value for the response is 14,022, indicating a percentage error of roughly 12% (RSE / mean value). Second, the R2 statistic records the percentage of variability in the response that is explained by the predictors. The predictors explain almost 90% of the variance in sales.

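Background:

RSE: the residual standard error

R2 statistic (R-squared): an important criterion for judging how well a model fits

R-squared lies between 0 and 1; the closer it is to 1, the better the regression fits the data.

R^2, the coefficient of determination, measures goodness of fit: it is the proportion of the variation in the response Y that is accounted for by the predictors X. In other words, it says how much of y can be explained by x (R2 measures the proportion of variability in Y that can be explained using X.); a value of 0.92 means that 92% of the variation in y can be explained by x.

R^2 = 1 means that all observations fall on the fitted line or curve; R^2 = 0 means that the fitted line or curve captures no relationship between the predictors and the response.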
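
Both quantities follow directly from the residuals: R^2 = 1 − RSS/TSS and RSE = sqrt(RSS / (n − p − 1)). A minimal Python sketch, using made-up observations and fitted values (the fitted values come from the least-squares line y = 2.2 + 0.6x for x = 1..5; all numbers are illustrative assumptions):

```python
from math import sqrt


def r_squared_and_rse(y, fitted, p):
    """Return (R^2, RSE) for a model with p predictors."""
    n = len(y)
    mean_y = sum(y) / n
    tss = sum((yi - mean_y) ** 2 for yi in y)               # total variation in y
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))   # unexplained variation
    return 1 - rss / tss, sqrt(rss / (n - p - 1))


# Hypothetical observations and their fitted values
y = [2, 4, 5, 4, 5]
fitted = [2.8, 3.4, 4.0, 4.6, 5.2]
r2, rse = r_squared_and_rse(y, fitted, p=1)
print(round(r2, 3), round(rse, 3))
```

R^2 is unit-free, while the RSE is in the units of Y, which is why the Advertising example divides it by the mean response to get a percentage error.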

How do we judge from R-squared whether a model is accurate?

However, it can still be challenging to determine what is a good R2 value, and in general, this will depend on the application. For instance, in certain problems in physics, we may know that the data truly comes from a linear model with a small residual error. In this case, we would expect to see an R2 value that is extremely close to 1, and a substantially smaller R2 value might indicate a serious problem with the experiment in which the data were generated. On the other hand, in typical applications in biology, psychology, marketing, and other domains, the linear model (3.5) is at best an extremely rough approximation to the data, and residual errors due to other unmeasured factors are often very large. In this setting, we would expect only a very small proportion of the variance in the response to be explained by the predictor, and an R2 value well below 0.1 might be more realistic!

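4. Applying the model: accuracy of Y (confidence level, confidence interval, prediction interval)

Given a set of predictor values, what response value should we predict, and how accurate is our prediction?

Once we have fit the multiple regression model, it is straightforward to apply in order to predict the response Y on the basis of a set of values for the predictors X1, X2, . . . , Xp.

We can compute a confidence interval in order to determine how close Y' (the value computed by the model) will be to f(X) (the true theoretical value).

To predict an individual response, use a prediction interval; to predict the average response, use a confidence interval.

Confidence interval and prediction interval

1 Confidence interval

The range in which the mean response is likely to fall for given settings of the predictors.

A confidence interval goes together with a confidence level. Put simply, the interval is constructed so that it contains the quantity being estimated with a certain probability; that probability is the confidence level, and the range is the corresponding confidence interval.

The true value usually cannot be known; we can only estimate it. The estimate is given as a pair of numbers, say from 1 to 1.5, with 95% confidence that the true value lies between 1 and 1.5 (and a 5% chance that it lies outside).

90% confidence interval (CI): when the 90% confidence interval of an estimate is given as [a, b], we can say with 90% confidence that the true value being estimated (for example the population mean) lies between a and b, with a 10% chance of being wrong.

2 Prediction interval

The range in which a single observation is likely to fall for given settings of the predictors.

A prediction interval (PI) is always wider than the corresponding confidence interval (CI), because predicting a single response involves more uncertainty than predicting the mean response.

The basic syntax is lm(y∼x,data), where y is the response (the value being predicted), x is the predictor (the influencing factors: x1, x2), and data is the data set in which these two variables are kept.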

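5. Refining the model

1) How strongly each predictor X1, X2, ... affects the response Y

Which media contribute to sales?

To answer this question, we can examine the p-values associated with each predictor's t-statistic. In the multiple linear regression displayed in Table 3.4, the p-values for TV and radio are low, but the p-value for newspaper is not. This suggests that only TV and radio are related to sales.

2) Handling collinearity

Multicollinearity means that the explanatory variables in a linear regression model are exactly or highly correlated with one another, which distorts the model estimates or makes them hard to estimate accurately. It generally arises when limitations of the economic data lead to a poorly designed model, so that the explanatory variables in the design matrix are broadly correlated.

How can collinearity be handled?

Variance inflation factor (VIF): the reciprocal of the tolerance; the larger the VIF, the more severe the collinearity. A common rule of thumb: when 0 < VIF < 10 there is no multicollinearity; when 10 ≤ VIF < 100 there is fairly strong multicollinearity; when VIF ≥ 100 there is severe multicollinearity.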
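
The VIF of predictor j is 1 / (1 − R_j^2), where R_j^2 comes from regressing predictor j on all the other predictors (1 − R_j^2 is the tolerance). With only two predictors, R_j^2 is simply their squared correlation, which the following Python sketch uses; the function name and data are illustrative assumptions.

```python
def vif_two_predictors(x1, x2):
    """VIF shared by both predictors when the model has exactly two of them."""
    n = len(x1)
    mean1, mean2 = sum(x1) / n, sum(x2) / n
    s11 = sum((a - mean1) ** 2 for a in x1)
    s22 = sum((b - mean2) ** 2 for b in x2)
    s12 = sum((a - mean1) * (b - mean2) for a, b in zip(x1, x2))
    r_squared = s12 ** 2 / (s11 * s22)  # squared correlation between x1 and x2
    return 1 / (1 - r_squared)          # reciprocal of the tolerance


# Hypothetical predictors, for illustration only
x1 = [1, 2, 3, 4, 5]
x2 = [2, 4, 5, 4, 5]
vif = vif_two_predictors(x1, x2)
print(round(vif, 3))

# Uncorrelated predictors give the smallest possible VIF, 1
print(vif_two_predictors([1, 2, 3], [1, 0, 1]))
```

With more than two predictors each R_j^2 needs its own auxiliary regression, which is what statistics packages do when they report VIFs.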

3) Interaction terms

An interaction coefficient measures how one variable changes the effect that another variable has on the response.

Is there synergy among the advertising media?

Perhaps spending $50,000 on television advertising and $50,000 on radio advertising results in more sales than allocating $100,000 to either television or radio individually. In marketing, this is known as a synergy effect, while in statistics it is called an interaction effect.

When is it appropriate to add an interaction term to the model?

4) Outlier detection

Residual plots can be used to identify outliers. Once outliers are detected, remove them from the data and refit to obtain the corrected model.
A residual is the difference between an observed value and the predicted (fitted) value, i.e. between the actual observation and the regression estimate. Residual analysis uses the information in the residuals to assess the reliability of the data, periodicity, or other disturbances. In linear regression, one important use of residuals is to flag outliers by the size of their absolute values.

But in practice, it can be difficult to decide how large a residual needs to be before we consider the point to be an outlier. To address this problem, instead of plotting the residuals, we can plot the studentized residuals, computed by dividing each residual ei by its estimated standard error. Observations whose studentized residuals are greater than 3 in absolute value are possible outliers.
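
A minimal Python sketch of that rule for simple linear regression follows. It assumes the internally studentized form, e_i / (s · sqrt(1 − h_i)) with leverage h_i = 1/n + (x_i − x̄)² / Sxx; the data are synthetic (a straight line with small alternating noise and one planted outlier), and the function name is made up for illustration.

```python
from math import sqrt


def studentized_residuals(x, y):
    """Internally studentized residuals of a simple least-squares line."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s = sqrt(sum(e * e for e in residuals) / (n - 2))         # residual standard error
    leverages = [1 / n + (xi - mean_x) ** 2 / sxx for xi in x]
    return [e / (s * sqrt(1 - h)) for e, h in zip(residuals, leverages)]


# Synthetic data: y = 2x + 1 with +/-0.5 alternating noise, outlier planted at index 9
x = list(range(1, 21))
y = [2 * xi + 1 + (0.5 if i % 2 == 0 else -0.5) for i, xi in enumerate(x)]
y[9] += 10

flagged = [i for i, t in enumerate(studentized_residuals(x, y)) if abs(t) > 3]
print(flagged)  # [9]
```

Dividing by s · sqrt(1 − h_i) puts every residual on the same scale, so a single cut-off such as 3 can be applied regardless of the units of Y.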

NFS server configuration on Linux and /etc/exports

File sharing and mounting between Linux CentOS (6.6) servers

Goal: the servers are load-balanced, so files uploaded to any of them must be kept in sync. Here server 1 is set up as the main file server

Server 1: 192.168.1.100

Server 2: 192.168.1.20

风来了.呆狐狸

Install the required packages [on every server]

1. nfs

yum install nfs-utils

2. Enable the services at boot

chkconfig rpcbind on
chkconfig nfs on

3. Start the services

service rpcbind start
service nfs start

[root@localhost database]# yum install portmap
Loaded plugins: fastestmirror, langpacks
Loading mirror speeds from cached hostfile
* base: mirrors.cug.edu.cn
* epel: ftp.cuhk.edu.hk
* extras: mirrors.sina.cn
* updates: mirrors.zju.edu.cn
Package rpcbind-0.2.0-33.el7_2.x86_64 already installed and latest version
Nothing to do
[root@localhost database]# /etc/init.d/portmap start
bash: /etc/init.d/portmap: No such file or directory
[root@localhost database]# /usr/sbin/showmount
clnt_create: RPC: Program not registered
[root@localhost database]# /etc/init.d/rpcbind start
bash: /etc/init.d/rpcbind: No such file or directory
[root@localhost database]# /etc/init.d/rpcbind start
bash: /etc/init.d/rpcbind: No such file or directory
[root@localhost database]# /etc/init.d/
iprdump iprinit iprupdate mysql mysql-proxy netconsole network
[root@localhost database]# service rpcbind start
Redirecting to /bin/systemctl start rpcbind.service
[root@localhost database]# service nfs start
Redirecting to /bin/systemctl start nfs.service
[root@localhost database]# exportfs -r
[root@localhost database]# show
show-changed-rco showconsolefont show-installed showkey showmount showrgb
[root@localhost database]# show
show-changed-rco showconsolefont show-installed showkey showmount showrgb
[root@localhost database]# showmount -e 143.89.115.4
Export list for 143.89.115.4:
/home/data *

Log in to the local server as root

mount -t nfs -o rw 143.89.115.4:/home/data /home/shenzy/database/

Local Blast2go database installation 2016

Congratulations, you are done. To test your installation, check the new database settings in Blast2GO again and click on the little green arrows in the main application window. If the green GO graphs pop up, your database is working correctly.

How to create a Fasta file database for local Blast and to import XML results successfully into Blast2GO

(This is now obsolete. The use of the Blast2GO Command Line is recommended for this task.)

This page describes how to create a local installation of the B2G MySQL database and is intended for users with knowledge in MySQL databases and general server administration tasks. If you are not able to follow the steps given here, we recommend you to ask for some help from your system administrator.

After this installation you will end up with a working DB to run Blast2GO with a DB name “b2gdb”, user “blast2go” and a default password “blast4it” being able to run Blast2GO nearly independent from our server. This does NOT include the graph layout generation, GO-Slim reduction and other web-services which are hosted at the Blast2GO or third party server sides which you will still need to access remotely.

B2G users should keep in mind that the B2G-DB schema can be subject of changes as consequence of the continuous efforts for improving the performance and quality of this annotation framework. We will promptly inform users about any major change, especially concerning the database, through the B2G mail list (http://groups.google.com/group/blast2go) and web site. In general, all datasets needed to create a local instance of the Blast2GO database are freely available at different public locations. Since the used resources are not maintained by us, third party changes can affect these instructions. We will try to keep this as up to date as possible and are happy for receiving any feedback and bug-reports. Thank you.

Step-by-Step Description

Note: For the whole installation you will need about 120 GB of free disk space and approx. 12 hours to import the different datasets depending on your hardware and network connection speed!

Download this zip file and unzip it: local_b2g_db.zip
Install a MySQL Database Server: Download and install a MySQL Database Server for example from http://dev.mysql.com/downloads or use e.g. apt-get in Linux/Ubuntu:
```
sudo apt-get install mysql-server
```
Download and unzip these files:
- From http://archive.geneontology.org/latest-full download the file go_YYYYMM-assocdb-data.gz where YYYYMM corresponds to the actual month and year
- ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz
- ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2accession.gz
- ftp://ftp.pir.georgetown.edu/databases/idmapping/idmapping.tb.gz

Execute the file b2g_db.sql to create a database and additional tables.
```
mysql -h dbhost -u dbuser -pdbpass < b2gdb.sql
```

Create the database user “blast2go” for local access:

mysql -h dbhost -u dbuser -pdbpass -e "GRANT ALL ON b2gdb.* TO 'blast2go'@'localhost' IDENTIFIED BY 'blast4it';"

mysql -h dbhost -u dbuser -pdbpass -e "FLUSH PRIVILEGES;"

Import the latest mysql database dump (first file downloaded in step 1) to the created database. Since the GO-DB-Dump is quite big, you should have the file local on the database server and import it directly, this speeds up the import.
You unzipped the file in step 3.
Execute:
```
mysql -s -h dbhost -u dbuser -pdbpass b2gdb < go_YYYYMM-assocdb-data
```
Note 1: Use the “silent mode” (-s) to produce less output and to speed up the import. (On a Windows system use also (-b) to suppress the beep when errors occur.)
Note 2: Alternatively, you can first open a mysql shell and then use the command “source” to import the data.

Import the next two files (already unzipped) that you downloaded from NCBI, gene2accession and gene_info, with:

mysql -h dbhost -u dbuser -pdbpass b2gdb -e "LOAD DATA LOCAL INFILE '/your/path/to/gene2accession' INTO TABLE gene2accession FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';"

mysql -h dbhost -u dbuser -pdbpass b2gdb -e "LOAD DATA LOCAL INFILE '/your/path/to/gene_info' INTO TABLE gene_info FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';"

Note: On a Windows system the single quotes around ‘\t’ have to be replaced by double quotes: “\t”.

To import the mapping file obtained from PIR we will use the following command:
```
java -cp .:mysql-connector-java-5.0.8-bin.jar: ImportIdMapping /your/path/to/idmapping.tb localhost b2gdb blast2go blast4it
```
Note: On a Windows system, replace the java classpath separator “:” with “;”.
This step can take several hours, please be patient.
