linux iptables导致httpd网页打不开(转帖)

linux iptables导致httpd网页打不开
原创作品,允许转载,转载时请务必以超链接形式标明文章 原始出处 、作者信息和本声明。否则将追究法律责任。http://taotao1240.blog.51cto.com/731446/611758

  问题:httpd 服务已启动,80端口已开,但是网页就是打不开,重启服务器还是不行

忽然想看看log,记录如下:

[Fri Jul 15 00:41:03 2011] [notice] SELinux policy enabled; httpd running as context root:system_r:httpd_t:s0

[Fri Jul 15 00:41:03 2011] [notice] suEXEC mechanism enabled (wrapper: /usr/sbin/suexec)

[Fri Jul 15 00:41:03 2011] [notice] Digest: generating secret for digest authentication …

[Fri Jul 15 00:41:03 2011] [notice] Digest: done

[Fri Jul 15 00:41:03 2011] [notice] Apache/2.2.3 (CentOS) configured — resuming normal operations

看到 suexec,明白了,果断iptables -L ,结果如下:

ACCEPT     all  —  anywhere             anywhere

ACCEPT     icmp —  anywhere             anywhere            icmp any

ACCEPT     esp  —  anywhere             anywhere

ACCEPT     ah   —  anywhere             anywhere

ACCEPT     udp  —  anywhere             224.0.0.251         udp dpt:mdns

ACCEPT     udp  —  anywhere             anywhere            udp dpt:ipp

ACCEPT     tcp  —  anywhere             anywhere            tcp dpt:ipp

ACCEPT     all  —  anywhere             anywhere            state RELATED,ESTABLISHED

ACCEPT     tcp  —  anywhere             anywhere            state NEW tcp dpt:ssh

REJECT     all  —  anywhere             anywhere            reject-with icmp-host-prohibited

 

全部干掉,iptables -F ,再打开网页,显示正常

linux 如何显示一个文件的某几行(中间几行)

linux 如何显示一个文件的某几行(中间几行)

【一】从第3000行开始,显示1000行。即显示3000~3999行

cat filename | tail -n +3000 | head -n 1000

 

【二】显示1000行到3000行

cat filename| head -n 3000 | tail -n +1000

 

*注意两种方法的顺序

 

分解:

tail -n 1000:显示最后1000行

tail -n +1000:从1000行开始显示,显示1000行以后的

head -n 1000:显示前面1000行

 

【三】用sed命令

 

sed -n ‘5,10p’ filename 这样你就可以只查看文件的第5行到第10行。

Bowtie2-manual (转贴)

getting started with Bowtie 2: Lambda phage example-从这里开始使用Bowtie2:λ噬菌体的例子

Bowtie2自带了一些入门级的示例文件,这些示例文件并不具有科学含义,我们用λ噬菌体的参考基因组只是因为它很短,并且例子里面的reads是由一个电脑程序生成的而不是测序的结果。但是,这些文件能让你立即开始运行Bowtie2和下游的程序。

首先按照获取Bowtie2的指导下载它。设置Bowtie2环境变量BT2_HOME,把它指向含有bowtie2, bowtie2-buildbowtie2-inspect二进制文件的新Bowtie2的文件夹。这一步很重要,因为在下面的命令当中,变量BT2_HOME被用来代表那个文件夹的位置。

Indexing a reference genome-对参考基因组建立索引

为了对Bowtie2内置的λ噬菌体的参考基因组建索引,先新建一个临时文件夹(建在哪里无所谓),进入那个文件夹,然后运行:

$BT2_HOME/bowtie2-build $BT2_HOME/example/reference/lambda_virus.fa lambda_virus

这条命令在结束前应该会打印很多行输出。当其运行完毕时,当前文件夹会产生4个新的文件,它们的文件名都以lambda_virus开始,分别以.1.bt2, .2.bt2, .3.bt2, .4.bt2, .rev.1.bt2.rev.2.bt2结束。这些文件构成了索引——你完成了!

你可以使用bowtie2-build对一组任意来源的FASTA文件构建索引,包括像UCSCNCBI,和Ensembl这些站点。当对多个FASTA文件建立索引时,你要在指定所有的文件,并用逗号隔开。更多关于如何用bowtie2-build建立索引的信息,请查看使用手册的建立索引部分。你可能也会直接获取一个已经建好的索引,从而绕过这一步。使用已建好的索引给出了例子。

Aligning example reads-比对示例reads

在上一步创建的文件夹中,现在含有lambda_virus的索引文件。接下来,运行:

$BT2_HOME/bowtie2 -x lambda_virus -U $BT2_HOME/example/reads/reads_1.fq -S eg1.sam

这个命令会运行Bowtie2的比对软件,它会使用上一步建立的索引,把一组非双端测序的reads比对到λ噬菌体的参考基因组上。这步比对的结果是SAM格式的,输出文件是eg1.sam,同时比对的总结会被输出到终端控制台。(事实上,总结是被写进了“standard error” 或 “stderr”,即标准错误句柄里面,通常它会被输出到终端。)

要查看SAM结果的前几行,运行:

head eg1.sam

你会看到类似于下面的东西:

@HD VN:1.0  SO:unsorted
@SQ SN:gi|9626243|ref|NC_001416.1|  LN:48502
@PG ID:bowtie2  PN:bowtie2  VN:2.0.1
r1  0   gi|9626243|ref|NC_001416.1| 18401   42  122M    *   0   0   TGAATGCGAACTCCGGGACGCTCAGTAATGTGACGATAGCTGAAAACTGTACGATAAACNGTACGCTGAGGGCAGAAAAAATCGTCGGGGACATTNTAAAGGCGGCGAGCGCGGCTTTTCCG  +"@6<:27(F&5)9)"B:%B+A-%5A?2$HCB0B+0=D<7E/<.03#!.F77@6B==?C"7>;))%;,3-$.A06+<-1/@@?,26">=?*@'0;$:;??G+:#+(A?9+10!8!?()?7C>  AS:i:-5 XN:i:0  XM:i:3  XO:i:0  XG:i:0  NM:i:3  MD:Z:59G13G21G26    YT:Z:UU
r2  0   gi|9626243|ref|NC_001416.1| 8886    42  275M    *   0   0   NTTNTGATGCGGGCTTGTGGAGTTCAGCCGATCTGACTTATGTCATTACCTATGAAATGTGAGGACGCTATGCCTGTACCAAATCCTACAATGCCGGTGAAAGGTGCCGGGATCACCCTGTGGGTTTATAAGGGGATCGGTGACCCCTACGCGAATCCGCTTTCAGACGTTGACTGGTCGCGTCTGGCAAAAGTTAAAGACCTGACGCCCGGCGAACTGACCGCTGAGNCCTATGACGACAGCTATCTCGATGATGAAGATGCAGACTGGACTGC (#!!'+!$""%+(+)'%)%!+!(&++)''"#"#&#"!'!("%'""("+&%$%*%%#$%#%#!)*'(#")(($&$'&%+&#%*)*#*%*')(%+!%%*"$%"#+)$&&+)&)*+!"*)!*!("&&"*#+"&"'(%)*("'!$*!!%$&&&$!!&&"(*"$&"#&!$%'%"#)$#+%*+)!&*)+(""#!)!%*#"*)*')&")($+*%%)!*)!('(%""+%"$##"#+(('!*(($*'!"*('"+)&%#&$+('**$$&+*&!#%)')'(+(!%+ AS:i:-14    XN:i:0  XM:i:8  XO:i:0  XG:i:0  NM:i:8  MD:Z:0A0C0G0A108C23G9T81T46 YT:Z:UU
r3  16  gi|9626243|ref|NC_001416.1| 11599   42  338M    *   0   0   GGGCGCGTTACTGGGATGATCGTGAAAAGGCCCGTCTTGCGCTTGAAGCCGCCCGAAAGAAGGCTGAGCAGCAGACTCAAGAGGAGAAAAATGCGCAGCAGCGGAGCGATACCGAAGCGTCACGGCTGAAATATACCGAAGAGGCGCAGAAGGCTNACGAACGGCTGCAGACGCCGCTGCAGAAATATACCGCCCGTCAGGAAGAACTGANCAAGGCACNGAAAGACGGGAAAATCCTGCAGGCGGATTACAACACGCTGATGGCGGCGGCGAAAAAGGATTATGAAGCGACGCTGTAAAAGCCGAAACAGTCCAGCGTGAAGGTGTCTGCGGGCGAT  7F$%6=$:9B@/F'>=?!D?@0(:A*)7/>9C>6#1<6:C(.CC;#.;>;2'$4D:?&B!>689?(0(G7+0=@37F)GG=>?958.D2E04C<E,*AD%G0.%$+A:'H;?8<72:88?E6((CF)6DF#.)=>B>D-="C'B080E'5BH"77':"@70#4%A5=6.2/1>;9"&-H6)=$/0;5E:<8G!@::1?2DC7C*;@*#.1C0.D>H/20,!"C-#,6@%<+<D(AG-).?&#0.00'@)/F8?B!&"170,)>:?<A7#1(A@0E#&A.*DC.E")AH"+.,5,2>5"2?:G,F"D0B8D-6$65D<D!A/38860.*4;4B<*31?6  AS:i:-22    XN:i:0  XM:i:8  XO:i:0  XG:i:0  NM:i:8  MD:Z:80C4C16A52T23G30A8T76A41   YT:Z:UU
r4  0   gi|9626243|ref|NC_001416.1| 40075   42  184M    *   0   0   GGGCCAATGCGCTTACTGATGCGGAATTACGCCGTAAGGCCGCAGATGAGCTTGTCCATATGACTGCGAGAATTAACNGTGGTGAGGCGATCCCTGAACCAGTAAAACAACTTCCTGTCATGGGCGGTAGACCTCTAAATCGTGCACAGGCTCTGGCGAAGATCGCAGAAATCAAAGCTAAGT(=8B)GD04*G%&4F,1'A>.C&7=F$,+#6!))43C,5/5+)?-/0>/D3=-,2/+.1?@->;)00!'3!7BH$G)HG+ADC'#-9F)7<7"$?&.>0)@5;4,!0-#C!15CF8&HB+B==H>7,/)C5)5*+(F5A%D,EA<(>G9E0>7&/E?4%;#'92)<5+@7:A.(BG@BG86@.G AS:i:-1 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:77C106 YT:Z:UU
r5  0   gi|9626243|ref|NC_001416.1| 48010   42  138M    *   0   0   GTCAGGAAAGTGGTAAAACTGCAACTCAATTACTGCAATGCCCTCGTAATTAAGTGAATTTACAATATCGTCCTGTTCGGAGGGAAGAACGCGGGATGTTCATTCTTCATCACTTTTAATTGATGTATATGCTCTCTT  9''%<D)A03E1-*7=),:F/0!6,D9:H,<9D%:0B(%'E,(8EFG$E89B$27G8F*2+4,-!,0D5()&=(FGG:5;3*@/.0F-G#5#3->('FDFEG?)5.!)"AGADB3?6(@H(:B<>6!>;>6>G,."?%  AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:138    YT:Z:UU
r6  16  gi|9626243|ref|NC_001416.1| 41607   42  72M2D119M   *   0   0   TCGATTTGCAAATACCGGAACATCTCGGTAACTGCATATTCTGCATTAAAAAATCAACGCAAAAAATCGGACGCCTGCAAAGATGAGGAGGGATTGCAGCGTGTTTTTAATGAGGTCATCACGGGATNCCATGTGCGTGACGGNCATCGGGAAACGCCAAAGGAGATTATGTACCGAGGAAGAATGTCGCT 1H#G;H"$E*E#&"*)2%66?=9/9'=;4)4/>@%+5#@#$4A*!<D=="8#1*A9BA=:(1+#C&.#(3#H=9E)AC*5,AC#E'536*2?)H14?>9'B=7(3H/B:+A:8%1-+#(E%&$$&14"76D?>7(&20H5%*&CF8!G5B+A4F$7(:"'?0$?G+$)B-?2<0<F=D!38BH,%=8&5@+ AS:i:-13    XN:i:0  XM:i:2  XO:i:1  XG:i:2  NM:i:4  MD:Z:72^TT55C15A47  YT:Z:UU
r7  16  gi|9626243|ref|NC_001416.1| 4692    42  143M    *   0   0   TCAGCCGGACGCGGGCGCTGCAGCCGTACTCGGGGATGACCGGTTACAACGGCATTATCGCCCGTCTGCAACAGGCTGCCAGCGATCCGATGGTGGACAGCATTCTGCTCGATATGGACANGCCCGGCGGGATGGTGGCGGGG -"/@*7A0)>2,AAH@&"%B)*5*23B/,)90.B@%=FE,E063C9?,:26$-0:,.,1849'4.;F>FA;76+5&$<C":$!A*,<B,<)@<'85D%C*:)30@85;?.B$05=@95DCDH<53!8G:F:B7/A.E':434> AS:i:-6 XN:i:0  XM:i:2  XO:i:0  XG:i:0  NM:i:2  MD:Z:98G21C22   YT:Z:UU

头几行(以@开始的)是SAM的头文件,余下的部分是SAM的比对结果,每行代表一个read或mate。更多细节,请查看Bowtie2使用手册的SAM输出部分SAM格式标准

Paired-end example-双端测序示例

要比对Bowtie2自带的双端测序reads,继续在同样的文件夹下,运行:

$BT2_HOME/bowtie2 -x lambda_virus -1 $BT2_HOME/example/reads/reads_1.fq -2 $BT2_HOME/example/reads/reads_2.fq -S eg2.sam

这个命令会比对一组双端测序的reads到参考基因组上,结果文件输出在了eg2.sam里面。

Local alignment example-局部对比示例

要使用局部对比比对Bowtie2自带的一些更长的reads,继续在同样的文件夹下,运行:

$BT2_HOME/bowtie2 --local -x lambda_virus -U $BT2_HOME/example/reads/longreads.fq -S eg3.sam

这会使用局部比对,把长reads比对到参考基因组上,结果文件在eg3.sam里面。

Using SAMtools/BCFtools downstream-使用SAMtools/BCFtools做下游分析

SAMtools是一个用来处理和分析SAM和BAM比对文件的集成工具包。BCFtools是一个用来识别变异和处理VCF和BCF文件的集成工具包,通常它会被集成在SAMtools里面。把这些工具合起来使用,可以让你从SAM格式的比对文件中获取VCF格式中的被识别的变异。下面的例子假设你已经安装了samtoolsbcftools,而且含有那些工具的二进制文件的文件夹地址已经在你的PATH环境变量里了。

运行双端测序的示例:

$BT2_HOME/bowtie2 -x $BT2_HOME/example/index/lambda_virus -1 $BT2_HOME/example/reads/reads_1.fq -2 $BT2_HOME/example/reads/reads_2.fq -S eg2.sam

使用samtools view把这个SAM文件转换成一个BAM文件。BAM是与SAM文字格式相对应的二进制格式。运行:

samtools view -bS eg2.sam > eg2.bam

使用samtools sort把这个BAM文件转换成排过序的BAM文件。

samtools sort eg2.bam eg2.sorted

现在我们有一个排过序的BAM文件了,它的名字是eg2.sorted.bam。排序后的BAM是一个有用的格式,因为这些比对都是(a)被压缩过的,这样有利于长期存储,(b)排过序的,这样有利于发现变异。要生成VCF格式的变异识别文件,运行:

samtools mpileup -uf $BT2_HOME/example/reference/lambda_virus.fa eg2.sorted.bam | bcftools view -bvcg - > eg2.raw.bcf

之后要查看这些变异,运行:

bcftools view eg2.raw.bcf

更多细节和这一流程的一些变化,请查阅SAMtools的官方指南:用SAMtools/BCFtools识别SNPs/INDELs(单核苷酸多态/插入删除)

关于PYTHON_EGG_CACHE无权限的问题

Perhaps your account does not have write access to this directory?  You can
change the cache directory by setting the PYTHON_EGG_CACHE environment
variable to point to an accessible directory.
Traceback (most recent call last):
  File "/data/wap/www/spider/picture.py", line 8, in ?
    import MySQLdb, datetime, time
  File "build/bdist.linux-x86_64/egg/MySQLdb/__init__.py", line 19, in ?
  File "build/bdist.linux-x86_64/egg/_mysql.py", line 7, in ?
  File "build/bdist.linux-x86_64/egg/_mysql.py", line 4, in __bootstrap__
  File "/usr/lib/python2.4/site-packages/setuptools-0.6c11-py2.4.egg/pkg_resources.py", line 881, in resource_filename
    return get_provider(package_or_requirement).get_resource_filename(
  File "/usr/lib/python2.4/site-packages/setuptools-0.6c11-py2.4.egg/pkg_resources.py", line 1351, in get_resource_filename
    self._extract_resource(manager, self._eager_to_zip(name))
  File "/usr/lib/python2.4/site-packages/setuptools-0.6c11-py2.4.egg/pkg_resources.py", line 1372, in _extract_resource
    real_path = manager.get_cache_path(
  File "/usr/lib/python2.4/site-packages/setuptools-0.6c11-py2.4.egg/pkg_resources.py", line 962, in get_cache_path
    self.extraction_error()
  File "/usr/lib/python2.4/site-packages/setuptools-0.6c11-py2.4.egg/pkg_resources.py", line 928, in extraction_error
    raise err
pkg_resources.ExtractionError: Can't extract file(s) to egg cache
The following error occurred while trying to extract file(s) to the Python egg
cache:
  [Errno 13] Permission denied: '/root/.python-eggs'
The Python egg cache directory is currently set to:
  /root/.python-eggs
Perhaps your account does not have write access to this directory?  You can
change the cache directory by setting the PYTHON_EGG_CACHE environment
variable to point to an accessible directory.

上面错误是我在php执行python脚本的时候出现的问题:

解决办法是在开始的文件前面加上了这段代码

import os,sys
os.environ['PYTHON_EGG_CACHE'] = '/tmp/.python-eggs'
abspath = os.path.dirname(__file__)
sys.path.append(abspath)
os.chdir(abspath)

参考链接:http://blog.blankyao.com/story/solve-python_egg_cache.html

ubuntu下搭建vpn历程(转贴)

1、卸载pptpd和iptables,重新安装pptpd

#卸载pptpd
$ apt-get autoremove pptpd
$ apt-get purge pptpd
#卸载iptables
$ apt-get autoremove iptables*
$ apt-get purge iptables*
#安装pptpd
$ apt-get install pptpd

2、配置pptpd

(1)首先,编辑pptpd.conf文件,设置localip和remoteip

$ vim /etc/pptpd.conf

查找到localip和remoteip,打开注释进行设置。将localip设置为你的vps的服务器公网ip,不知道可以通过ifconfig查看。remoteip是设置给VPN用户分配的IP段,我这里设置为10.100.0.2-100。

localip VPS_IP
remoteip 10.100.0.2-100

(2)修改dns设置,设置为google的DNS

$ vim /etc/ppp/pptpd-options

查找到ms-dns,配置dns如下:

ms-dns 8.8.8.8
ms-dns 8.8.4.4

(3)设置VPN的账号密码

编辑chap-secrets文件

$ vim /etc/ppp/chap-secrets

在chap-secrets文件中添加一行配置

#client server secret IP address
liuchungui pptpd 123456 *

其中,liuchungui是VPN的用户名,使用的VPN服务类型是pptpd,密码是123456,*代表不限制IP

3、启动pptpd服务

$ /etc/init.d/pptpd restart

输入上面命令,如果提示`

  • Restarting PoPToP Point to Point Tunneling Server pptpd [ OK ] `
    就说明启动成功了

4、设置系统的ipv4的转发开关

$ vim /etc/sysctl.conf

编辑/etc/sysctl.conf文件,找到net.ipv4.ip_forward=1,把这行的注释打开并保存。
运行:sysctl -p 让上面的修改立即生效。

5、配置iptables

(1)安装iptables

$ apt-get install iptables

(2)添加一个NAT,这里特别注意:eth1是vps的ip网卡接口,可以通过ifconfig查看

$ iptables -t nat -A POSTROUTING -s 10.100.0.0/24 -o eth1 -j MASQUERADE

(3)设置MTU,防止包过大

$ iptables -A FORWARD -s 10.100.0.0/24 -p tcp -m tcp –tcp-flags SYN,RST SYN -j TCPMSS –set-mss 1200

(4)再添加一个NAT,45.62.119.172就是你的vps的公网ip

$ iptables -t nat -A POSTROUTING -s 10.100.0.0/24 -j SNAT –to-source 45.62.119.172

(5)将iptables规则保存,令重启后规则不丢失:

$ iptables-save > /etc/iptables-rules

(6)编辑网卡文件,加载网卡时自动加载规则

$ vim /etc/network/interfaces

interfaces文件末尾加上:pre-up iptables-restore < /etc/iptables-rules

(7)安装iptables配置持久化

$ apt-get install iptables-persistent

(8)运行保存配置命令

$ service iptables-persistent start

RNA-seq :TopHat2 + Cufflinks分析流程 (转帖)

RNA-seq :TopHat2 + Cufflinks分析流程:
   1、测序数据质量控制:fastqc软件
      1)使用方法:/life/rjian/software/fastQC/FastQC/fastqc -o /life/rjian/data/liyan/filename_fastqc \filename.fq >>filename.log
      2)参数说明:-o:输出文件所在目录,并且是已经存在的目录,如:filename_fastqc
                   –noextract:不解压缩输出文件
                   最后加上fastq文件:filename.fq;重定向结果到日志文件:filename.log,以便查看。
           filename:表示是一个样品的一个生物学重复,一般有多个样品,每个样品有多个重复,如:C1_R1; 如果是双端测序则后面会加上数字,如:filename_1.fq和filename_2.fq

   2、reads trim工具——trimmomatic
      1)使用方法:java -jar  /life/rjian/software/Trimmomatic-0.32/trimmomatic-0.32.jar SE -threads 5 \-phred33 -trimlog filename_trimmomatic.log filename.fq filename_out.fq ILLUMINACLIP:adapter.fa:2:30:10 \SLIDINGWINDOW:4:15 MINLEN:36
      2)参数说明:SE:指定单端测序,PE:双端测序
                  -threads:指定线程数
                  -phred33:指定fastq文件的质量格式,或者:-phred64
                  -trimlog:指定日志文件,后加上输入和输出文件
                  ILLUMINACLIP:adapter.fa:2:30:10   :adapter.fa为adapter文件,2:允许的最大mismatch 数,30:palindrome模式下匹配碱基数阈值,10:simple模式下的匹配碱基数阈值
                  SLIDINGWINDOW:4:15 MINLEN:36  :滑动窗口的size是4个碱基,其平均碱基质量小于15,则切除。
                  MINLEN:36  :最低reads长度为36
     
3、bowtie2建立参考基因组的索引——bowtie2-build
 1)使用方法: bowtie2-build <要生成的索引文件前缀名>;比如:
 nohup /home/cuckoo/software/bowtie2-2.2.3/bowtie2-build genome.fa bowtie2index/genome>>bowtie2.log &
 2)参数说明:genome.fa是fasta文件;
     genome是要生成的索引文件的前缀名;
     bowtie2index是一个文件夹,用来存放索引文件,方便日后查看和使用;
     注意:程序运行完后genome.fa文件要放在bowtie2index索引目录中,tophat2软件才能正确运行。

4、reads mapping到参考基因组——tophat2软件:基于bowtie2
 1)使用方法:/home/cuckoo/software/tophat-2.0.12.Linux_x86_64/tophat2 -p 8 -G \/home/cuckoo/data/liyan/train/genes.gtf -o filename_thout/ \/home/cuckoo/data/liyan/train/bowtie2index/genome /home/cuckoo/data/liyan/train/filename.fq \>filenametophat.log
 2)参数说明:-p :指定线程数,默认为1
     -G :指定已有的基因组注释信息,gtf或gff文件;
     -o :指定输出目录,默认为”./tophat_out“;
     后面加上索引文件:与前面的bowtie2建立的索引相对应,只取前缀名。
     最后加上fastq文件:filename.fq;如果是双端测序则是filename_1.fq和filename_2.fq 两个文件。

5、转录本组装——Cufflinks:Cufflinks是一套拼接转录本,定量表达量。
    1)使用方法:/home/cuckoo/software/RNAseq/cufflinks-2.2.1.Linux_x86_64/cufflinks -p 8 -o \filename_clout filename_thout/accepted_hits.bam >filename_cufflinks.log
 2)参数说明:-p :指定线程数;
     -o :指定输出文件所在目录;
     后面跟上Tophat2中生成的bam文件:
    
6、转录本合并——Cuffmerge
 1)使用方法:/home/cuckoo/software/RNAseq/cufflinks-2.2.1.Linux_x86_64/cuffmerge -g genes.gtf -s \/home/cuckoo/data/liyan/train/bowtie2index/genome.fa -p 8 assemblies.txt
 2)参数说明:-g :参考基因组注释文件
     -s :参考基因组序列文件
     -p :指定线程数
     -o :指定输出文件merged.gtf所在目录,默认情况下是 merged_asm
     最后assemblies.txt :一个包含每个样品(重复)拼接后的gtf文件的列表;如下:两个文件分别是在上一步中生成的样品的转录本注释文件。
     ./s0924fb_clout/transcripts.gtf
     ./sCal27_clout/transcripts.gtf
    
7、基因和转录本表达定量——cuffquant
 1)使用方法:/home/cuckoo/software/RNAseq/cufflinks-2.2.1.Linux_x86_64/cuffquant -o sample_quant -p 8 \-u AT.gff sample_thout/accepted_hits.bam
 2)参数说明:-o :指定结果输出目录:包含结果文件abundances.cxb
     -p :指定线程数
     -u :指定对比对上基因组上多个位置的reads进行统计分析。
    
     加上参考基因组注释文件:AT.gff
    
     最后加上Tophat2产生的该样本的比对结果文件:accepted_hits.bam
    
8、 基因和转录本表达水平标准化——cuffnorm
 1)使用方法:/home/cuckoo/software/RNAseq/cufflinks-2.2.1.Linux_x86_64/cuffnorm -o cuffnorm_out -p 8 \-L 0h_1,12h_CK1,12h_E1 AT.gff /data/disk2/liyan/AT/0h_1_quant/abundances.cxb \/data/disk2/liyan/AT/12h_CK1_quant/abundances.cxb /data/disk2/liyan/AT/12h_E1_quant/abundances.cxb
 2)参数说明:-o :指定结果输出目录
     -p :指定线程数
     -L :为每个样本(处理)作标记
     –total-hits-norm :计算所有的fragments,包括与所有的参考转录本不容的,默认不激活。
     –compatible-hits-norm :只计算与一些参考转录本相容的fragments,默认激活。
     加上参考基因组注释文件:AT.gff
     最后加上每个样本(处理)的cuffquant产生的abundances.cxb文件,样本的每个重复之间用逗号”,“分割;样本之间则由空格分割。
    
9、转录本差异表达分析——Cuffdiff:分析差异表达基因的工具。
 1)使用方法:/home/cuckoo/software/RNAseq/cufflinks-2.2.1.Linux_x86_64/cuffdiff -o diff_out -b \bowtie2index/genome.fa -p 8 -L C1,C2 -u merged_asm/merged.gtf ./C1_thout/accepted_hits.bam \./C2_thout/accepted_hits.bam
 2)参数说明:-o :指定输出目录
     -b :参考基因组序列文件
     -p :指定线程数
     -L :为每个样本标上名称    -u:-u命令指cuffdiff对回帖的基因组中多个位置的read进行一个初步的估计,然后加权分配到各个基因组位置。而不是简单的平均分配,其功能与Cufflinks中的u命令相同。
    
     加上合并后的转录本:merged.gtf;由cuffmerge产生。
     最后是TopHat产生的样本的bam文件,如果一个样本有多个生物学重复,那么我们需要提供每个重复的bam文件,文件名之间以逗号隔开并且样本名应与-L参数相对应。 

10、转录本与参考基因组注释文件比较——Cuffcompare,发现新基因,转录本
    1)使用方法:cuffcompare -i gtf_out_list.txt -r genes.gtf
    2)参数说明:-i :输入文件,是cufflinks组装转录本的结果文件——transcripts.gtf的列表;
                   其中gtf_out_list.txt是由find . -name transcripts.gtf  > gtf_out_list.txt命令产生的集合了所有样本转录本文件的列表。
                 -o :指定输出文件的前缀,如果没有指定默认为cuffcmp。
                 -r :指定参考基因组注释文件。
    
     注:结果文件大部分位于cmp_out(自己先建立好)目录中,统计汇总所有转录本的比较情况;而单个样本转录本的比较结果文件:cuffcmp.transcripts.gtf.tmap 和 cuffcmp.transcripts.gtf.refmap 分别位于样本的cufflinks运行结果transcripts.gtf所在目录中,统计单个样本的比较情况以鉴定新转录本。

用DESeq进行差异分析的源代码

要保证当前文件夹下面有了742KO1.count等4个文件,就是用htseq对比对的bam文件进行处理后的输出文件

library(DESeq)
#加载数据
K1=read.table(“742KO1.count”,row.names=1)
K2=read.table(“743KO2.count”,row.names=1)
W1=read.table(“740WT1.count”,row.names=1)
W2=read.table(“741WT2.count”,row.names=1)
#列名
data=cbind(K1,K2,W1,W2)
#如果是htseq的结果,则删除data最后四行
n=nrow(data)
data=data

[c language=”(-n+4:-n),”][/c]

#如果是bedtools的结果,取出统计个数列和行名
kk1=cbind(K1$V5)
rownames(kk1)=rownames(K1)
K1=kk1

#差异分析
colnames(data)=c(“K1″,”K2″,”W1″,”W2″)
type=rep(c(“K”,”W”),c(2,2))
de=newCountDataSet(data,type)
de=estimateSizeFactors(de)
de=estimateDispersions(de)
res=nbinomTest(de,”K”,”W”)

#res就是我们的表达量检验结果

到这里,理论上差异基因的分析已经结束啦!后面只是关于R的bioconductor包的一些简单结合使用而已

library(org.Mm.eg.db)

tmp=select(org.Mm.eg.db, keys=res$id, columns=c(“ENTREZID”,”SYMBOL”), keytype=”ENSEMBL”)

#合并res和tmp
res=merge(tmp,res,by.x=”ENSEMBL”,by.y=”id”,all=TRUE)

#go
tmp=select(org.Mm.eg.db, keys=res$ENSEMBL, columns=”GO”, keytype=”ENSEMBL”)
ensembl_go=unlist(tapply(tmp[,2],as.factor(tmp[,1]),function(x) paste(x,collapse =”|”),simplify =F))

#为res加入go注释,
res$go=ensembl_go[res$ENSEMBL]#为res加入一列go

#写入all——data
all_res=res
write.csv(res,file=”all_data.csv”,row.names =F)

uniq=na.omit(res)#删除无效基因
sort_uniq=uniq[order(uniq$padj),]#按照矫正p值排序

#写入排序后的all_data
write.csv(res,file=”all_data.csv”,row.names =F)

#标记上下调基因
sort_uniq$up_down=ifelse(sort_uniq$baseMeanA>sort_uniq$baseMeanB,”up”,”down”)
#交换上下调基因列位置
final_res=sort_uniq[,c(12,1:11)]
#写出最后数据
write.csv(final_res,file=”final_annotation_gene_bedtools_sort_uniq.csv”,row.names =F)

#然后挑选出padj值小于0.05的数据来做富集
tmp=select(org.Mm.eg.db, keys=sort_uniq[sort_uniq$padj<0.05,1], columns=”ENTREZID”, keytype=”ENSEMBL”)
diff_ENTREZID=tmp$ENTREZID
require(DOSE)
require(clusterProfiler)
diff_ENTREZID=na.omit(diff_ENTREZID)
ego <- enrichGO(gene=diff_ENTREZID,organism=”mouse”,ont=”CC”,pvalueCutoff=0.05,readable=TRUE)
ekk <- enrichKEGG(gene=diff_ENTREZID,organism=”mouse”,pvalueCutoff=0.05,readable=TRUE)
write.csv(summary(ekk),”KEGG-enrich.csv”,row.names =F)
write.csv(summary(ego),”GO-enrich.csv”,row.names =F)

getfasta (bedtools)

getfasta

../../_images/getfasta-glyph.pngbedtools getfasta extracts sequences from a FASTA file for each of the intervals defined in a BED/GFF/VCF file.

Tip

1. The headers in the input FASTA file must exactly match the chromosome column in the BED file.

2. You can use the UNIX fold command to set the line width of the FASTA output. For example, fold -w 60 will make each line of the FASTA file have at most 60 nucleotides for easy viewing.

3. BED files containing a single region require a newline character at the end of the line, otherwise a blank output file is produced.

See also

maskfasta

Usage and option summary

Usage

$ bedtools getfasta [OPTIONS] -fi <input FASTA> -bed <BED/GFF/VCF>

(or):

$ getFastaFromBed [OPTIONS] -fi <input FASTA> -bed <BED/GFF/VCF>
Option Description
-fo Specify an output file name. By default, output goes to stdout.
-name Use the “name” column in the BED file for the FASTA headers in the output FASTA file.
-tab Report extract sequences in a tab-delimited format instead of in FASTA format.
-bedOut Report extract sequences in a tab-delimited BED format instead of in FASTA format.
-s Force strandedness. If the feature occupies the antisense strand, the sequence will be reverse complemented. Default: strand information is ignored.
-split Given BED12 input, extract and concatenate the sequences from the BED “blocks” (e.g., exons)

Default behavior

bedtools getfasta will extract the sequence defined by the coordinates in a BED interval and create a new FASTA entry in the output file for each extracted sequence. By default, the FASTA header for each extracted sequence will be formatted as follows: “<chrom>:<start>-<end>”.

$ cat test.fa
>chr1
AAAAAAAACCCCCCCCCCCCCGCTACTGGGGGGGGGGGGGGGGGG

$ cat test.bed
chr1 5 10

$ bedtools getfasta -fi test.fa -bed test.bed
>chr1:5-10
AAACC

# optionally write to an output file
$ bedtools getfasta -fi test.fa -bed test.bed -fo test.fa.out

$ cat test.fa.out
>chr1:5-10
AAACC

-name Using the BED “name” column as a FASTA header.

Using the -name option, one can set the FASTA header for each extracted sequence to be the “name” columns from the BED feature.

$ cat test.fa
>chr1
AAAAAAAACCCCCCCCCCCCCGCTACTGGGGGGGGGGGGGGGGGG

$ cat test.bed
chr1 5 10 myseq

$ bedtools getfasta -fi test.fa -bed test.bed -name
>myseq
AAACC

-tab Creating a tab-delimited output file in lieu of FASTA output.

Using the -tab option, the -fo output file will be tab-delimited instead of in FASTA format.

$ cat test.fa
>chr1
AAAAAAAACCCCCCCCCCCCCGCTACTGGGGGGGGGGGGGGGGGG

$ cat test.bed
chr1 5 10 myseq

$ bedtools getfasta -fi test.fa -bed test.bed -name -tab
myseq AAACC

-bedOut Creating a tab-delimited BED file in lieu of FASTA output.

Using the -tab option, the -fo output file will be tab-delimited instead of in FASTA format.

$ cat test.fa
>chr1
AAAAAAAACCCCCCCCCCCCCGCTACTGGGGGGGGGGGGGGGGGG

$ cat test.bed
chr1 5 10 myseq

$ bedtools getfasta -fi test.fa -bed test.bed -tab
chr1 5 10 AAACC

-s Forcing the extracted sequence to reflect the requested strand

bedtools getfasta will extract the sequence in the orientation defined in the strand column when the “-s” option is used.

$ cat test.fa
>chr1
AAAAAAAACCCCCCCCCCCCCGCTACTGGGGGGGGGGGGGGGGGG

$ cat test.bed
chr1 20 25 forward 1 +
chr1 20 25 reverse 1 -

$ bedtools getfasta -fi test.fa -bed test.bed -s -name
>forward
CGCTA
>reverse
TAGCG

-split Extracting BED “blocks”.

One can optionally request that FASTA records be extracting and concatenating each block in a BED12 record. For example, consider a BED12 record describing a transcript. By default, getfasta will extract the sequence representing the entire transcript (intons, exons, UTRs). Using the -split option, getfasta will instead produce separate a FASTA record representing a transcript that splices together each BED12 block (e.g., exons and UTRs in the case of genes described with BED12).

$ cat genes.bed12
chr1  164404  173864  ENST00000466557.1       0       -       173864  173864  0       6       387,59,66,216,132,112,  0,1479,3695,4644,8152,9348,
chr1  235855  267253  ENST00000424587.1       0       -       267253  267253  0       4       2100,150,105,158,       0,2562,23161,31240,
chr1  317810  328455  ENST00000426316.1       0       +       328455  328455  0       2       323,145,        0,10500,

$ bedtools getfasta -fi chr1.fa -bed genes.bed12 -split -name
>ENST00000466557.1
gaggcgggaagatcacttgatatcaggagtcgaggcgggaagatcacttgacgtcaggagttcgagactggcccggccaacatggtgaaaccgcatctccactaaaaatacaaaaattagcctggtatggtggtgggcacctgtaatcccagtgacttgggaggctaaggcaggagaatttcttgaacccaggaggcagaggttgcagtgaccagcaaggttgcgccattgcaccccagcctgggcgataagagtgaaaactccatctcaaaaaaaaaaaaaaaaaaaaaaTTCCTTTGGGAAGGCCTTCTACATAAAAATCTTCAACATGAGACTGGAAAAAAGGGTATGGGATCATCACCGGACCTTTGGCTTTTACAGCTCGAGCTGACAAAGTTGATTTATCAAGTTGTAAATCTTCACCTGTTGAATTCATAAGTTCATGTCATATTTTCTTTCAGACAATTCTTCAGTTTGTTTACGTAGATCAGCGATACGATGATTCCATTTCTtcggatccttgtaagagcagagcaggtgatggagagggtgggaggtgtagtgacagaagcaggaaactccagtcattcgagacgggcagcacaagctgcggagtgcaggccacctctacggccaggaaacggattctcccgcagagcctcggaagctaccgaccctgctcccaccttgactcagtaggacttactgtagaattctggccttcagacCTGAGCCTGGCAGCTCTCTCCAACTTTGGAAGCCCAGGGGCATGGCCCCTGTCCACAGATGCACCTGGCATGAGGCGTGCCCAGAGGGACAGAGGCAGATGAGTttcgtctcctccactggattgtgagggcCAGAGTTGAACTCCCTCATTTTCCGTTCCCCAGCATTGGCAGGTTCTGGGACTGGTGGCTGTGGTGGCTCGTTGGTCTTTGTCTCTTAGAAGGTGGGGAATAATCATCATCT
>ENST00000424587.1
ccaggaagtgaaaatgacactttactgttttaatttgcatttctctgcttacaagtggattacacacattttcgtgtgctgttggctacttatTCATTCAGAAAACATACTAAGTGCTGGCTCTTTTTCATGTCCTTTATCAAGTTTGGATCATGTCATTTGCTATTTTCTTTCTGATGTAAACTCTCAAAGTCTGAAGTGTATTGTCTTTTCCTGACACATATGTTGTAAATAATTTTCTGGCTTACATTTTGACTTTTAATTTCATTCACGATGTTTTTAATGAATAATTTTAATTTTTATGAATGCAAGTTAAAATAATTCTTTCATTGTGGTCTCTGACATGTCATGCCAATAAGGGTCTTCTCCTCCAAGAGCACAGAAATATTTGCCAATACTGTCCTTAAAATCGGTCACAGTTTCATTTTTTATATATGCATTTTACTTCAATTGGGGCTTCATTTTACTGAATGCCCTATTTGAAGCAAGTTTCTCAGTTAATTCTTTTCTCAAAGGGCTAAGTATGGTAGATTGCAAACATAAGTGGCCACATAATGCTCTCACCTCctttgcctcctctcccaggaggagatagcgtccatctttccactccttaatctgggcttggccgtgtgacttgcactggccaatgggatattaacaagtctgatgtgcacagaggctgtagaatgtgcacgggggcttggtctctcttgctgccctggagaccagctgccCCACGAAGGAACCAGAGCCAACCTGCTGCTTCCTGGAGGAAGACAGTCCCTCTGTCCCTCTGTCTCTGCCAACCAGTTAACCTGCTGCTTCCTGGAGGGAGACAGTCCCTCAGTCCCTCTGTCTCTGCCAACCAGTTAACCTGCTGCTTCCTGGAGGAAGACAGTCACTCTGTCTCTGccaacccagttgaccgcagacatgcaggtctgctcaggtaagaccagcacagtccctgccctgtgagccaaaccaaatggtccagccacagaatcgtgagcaaataagtgatgcttaagtcactaagatttgggCAAAAGCTGAGCATTTATCCCAATCCCAATACTGTTTGTCCTTCTGTTTATCTGTCTGTCCTTCCCTGCTCATTTAAAATGCCCCCACTGCATCTAGTACATTTTTATAGGATCAGGGATCTGCTCTTGGATTAATGTTGTGTTCCCACCTCGAGGCAGCTTTGTAAGCTTCTGAGCACTTCCCAATTCCGGGTGACTTCAGGCACTGGGAGGCCTGTGCATCAGCTGCTGCTGTCTGTAGCTGACTTCCTTCACCCCTCTGCTGTCCTCAGCTCCTTCACCCCTGGGCCTCAGGAAATCAATGTCATGCTGACATCACTCTAGATCTAAAAGTTGGGTTCTTGgaccaggcgtggtggctcacacctgtaatcccagcactttgggaggccgaggcgggtggatcacaaggtcaggagatcaagacgattctggctaacacggtgaaaccccgtctctactaaaaatacaaaaaaattagccgggtgtggtggcaggtgcctgtagccccagctacttgggaggctgaggcaggagaatggcttgaacctgggaggtggagcttgcagtgagccaagatcacgccactgcactccagaatgggagagagagcgagactttctcaaaaaaaaaaaaaaaaCTTAGGTTCTTGGATGTTCGGGAAAGGGGGTTATTATCTAGGATCCTTGAAGCACCCCCAAGGGCATCTTCTCAAAGTTGGATGTGTGCATTTTCCTGAGAGGAAAGCTTTCCCACATTATACAGCTTCTGAAAGGGTTGCTTGACCCACAGATGTGAAGCTGAGGCTGAAGGAGACTGATGTGGTTTCTCCTCAGTTTCTCTGTGCAGCACCAGGTGGCAGCAGAGGTCAGCAAGGCAAACCCGAGCCCGGGGATGCGGAGTGGGGGCAGCTACGTCCTCTCTTGAGCTACAGCAGATTCACTCTGTTCTGTTTCATTGTTGTTTAGTTTGCGTTGTGTTTCTCCAACTTTGTGCCTCATCAGGAAAAGCTTTGGATCACAATTCCCAGtgctgaagaaaaggccaaactcttggttgtgttctttgattAGTgcctgtgacgcagcttcaggaggtcctgagaacgtgtgcacagtttagtcggcagaaacttagggaaatgtaagaccaccatcagcacataggagttctgcattggtttggtctgcattggtttggtCTTTTCCTGGATACAGGTCTTGATAGGTCTCTTGATGTCATTTCACTTCAGATTCTTCTTTAGAAAACTTGGACAATAGCATTTGCTGTCTTGTCCAAATTGTTACTTCAAGTTTGCTCTTAGCAAGTAATTGTTTCAGTATCTATATCAAAAATGGCTTAAGCCTGCAACATGTTTCTGAATGATTAACAAGGTGATAGTCAGTTCTTCATTGAATCCTGGATGCTTTATTTTTCTTAATAAGAGGAATTCATATGGATCAG
>ENST00000426316.1
AATGATCAAATTATGTTTCCCATGCATCAGGTGCAATGGGAAGCTCTTctggagagtgagagaagcttccagttaaggtgacattgaagccaagtcctgaaagatgaggaagagttgtatgagagtggggagggaagggggaggtggagggaTGGGGAATGGGCCGGGATGGGATAGCGCAAACTGCCCGGGAAGGGAAACCAGCACTGTACAGACCTGAACAACGAAGATGGCATATTTTGTTCAGGGAATGGTGAATTAAGTGTGGCAGGAATGCTTTGTAGACACAGTAATTTGCTTGTATGGAATTTTGCCTGAGAGACCTCATTCCTCACGTCGGCCATTCCAGGCCCCGTTTTTCCCTTCCGGCAGCCTCTTGGCCTCTAATTTGTTTATCTTTTGTGTATAAATCCCAAAATATTGAATTTTGGAATATTTCCACCATTATGTAAATATTTTGATAGGTAA

# use the UNIX fold command to wrap the FASTA sequence such that each line
# has at most 60 characters
$ bedtools getfasta -fi chr1.fa -bed genes.bed12 -split -name | \
      fold -w 60
>ENST00000466557.1
gaggcgggaagatcacttgatatcaggagtcgaggcgggaagatcacttgacgtcaggag
ttcgagactggcccggccaacatggtgaaaccgcatctccactaaaaatacaaaaattag
cctggtatggtggtgggcacctgtaatcccagtgacttgggaggctaaggcaggagaatt
tcttgaacccaggaggcagaggttgcagtgaccagcaaggttgcgccattgcaccccagc
ctgggcgataagagtgaaaactccatctcaaaaaaaaaaaaaaaaaaaaaaTTCCTTTGG
GAAGGCCTTCTACATAAAAATCTTCAACATGAGACTGGAAAAAAGGGTATGGGATCATCA
CCGGACCTTTGGCTTTTACAGCTCGAGCTGACAAAGTTGATTTATCAAGTTGTAAATCTT
CACCTGTTGAATTCATAAGTTCATGTCATATTTTCTTTCAGACAATTCTTCAGTTTGTTT
ACGTAGATCAGCGATACGATGATTCCATTTCTtcggatccttgtaagagcagagcaggtg
atggagagggtgggaggtgtagtgacagaagcaggaaactccagtcattcgagacgggca
gcacaagctgcggagtgcaggccacctctacggccaggaaacggattctcccgcagagcc
tcggaagctaccgaccctgctcccaccttgactcagtaggacttactgtagaattctggc
cttcagacCTGAGCCTGGCAGCTCTCTCCAACTTTGGAAGCCCAGGGGCATGGCCCCTGT
CCACAGATGCACCTGGCATGAGGCGTGCCCAGAGGGACAGAGGCAGATGAGTttcgtctc
ctccactggattgtgagggcCAGAGTTGAACTCCCTCATTTTCCGTTCCCCAGCATTGGC
AGGTTCTGGGACTGGTGGCTGTGGTGGCTCGTTGGTCTTTGTCTCTTAGAAGGTGGGGAA
TAATCATCATCT
>ENST00000424587.1
ccaggaagtgaaaatgacactttactgttttaatttgcatttctctgcttacaagtggat
tacacacattttcgtgtgctgttggctacttatTCATTCAGAAAACATACTAAGTGCTGG
CTCTTTTTCATGTCCTTTATCAAGTTTGGATCATGTCATTTGCTATTTTCTTTCTGATGT
AAACTCTCAAAGTCTGAAGTGTATTGTCTTTTCCTGACACATATGTTGTAAATAATTTTC
TGGCTTACATTTTGACTTTTAATTTCATTCACGATGTTTTTAATGAATAATTTTAATTTT
TATGAATGCAAGTTAAAATAATTCTTTCATTGTGGTCTCTGACATGTCATGCCAATAAGG
GTCTTCTCCTCCAAGAGCACAGAAATATTTGCCAATACTGTCCTTAAAATCGGTCACAGT
TTCATTTTTTATATATGCATTTTACTTCAATTGGGGCTTCATTTTACTGAATGCCCTATT
TGAAGCAAGTTTCTCAGTTAATTCTTTTCTCAAAGGGCTAAGTATGGTAGATTGCAAACA
TAAGTGGCCACATAATGCTCTCACCTCctttgcctcctctcccaggaggagatagcgtcc
atctttccactccttaatctgggcttggccgtgtgacttgcactggccaatgggatatta
acaagtctgatgtgcacagaggctgtagaatgtgcacgggggcttggtctctcttgctgc
cctggagaccagctgccCCACGAAGGAACCAGAGCCAACCTGCTGCTTCCTGGAGGAAGA
CAGTCCCTCTGTCCCTCTGTCTCTGCCAACCAGTTAACCTGCTGCTTCCTGGAGGGAGAC
AGTCCCTCAGTCCCTCTGTCTCTGCCAACCAGTTAACCTGCTGCTTCCTGGAGGAAGACA
GTCACTCTGTCTCTGccaacccagttgaccgcagacatgcaggtctgctcaggtaagacc
agcacagtccctgccctgtgagccaaaccaaatggtccagccacagaatcgtgagcaaat
aagtgatgcttaagtcactaagatttgggCAAAAGCTGAGCATTTATCCCAATCCCAATA
CTGTTTGTCCTTCTGTTTATCTGTCTGTCCTTCCCTGCTCATTTAAAATGCCCCCACTGC
ATCTAGTACATTTTTATAGGATCAGGGATCTGCTCTTGGATTAATGTTGTGTTCCCACCT
CGAGGCAGCTTTGTAAGCTTCTGAGCACTTCCCAATTCCGGGTGACTTCAGGCACTGGGA
GGCCTGTGCATCAGCTGCTGCTGTCTGTAGCTGACTTCCTTCACCCCTCTGCTGTCCTCA
GCTCCTTCACCCCTGGGCCTCAGGAAATCAATGTCATGCTGACATCACTCTAGATCTAAA
AGTTGGGTTCTTGgaccaggcgtggtggctcacacctgtaatcccagcactttgggaggc
cgaggcgggtggatcacaaggtcaggagatcaagacgattctggctaacacggtgaaacc
ccgtctctactaaaaatacaaaaaaattagccgggtgtggtggcaggtgcctgtagcccc
agctacttgggaggctgaggcaggagaatggcttgaacctgggaggtggagcttgcagtg
agccaagatcacgccactgcactccagaatgggagagagagcgagactttctcaaaaaaa
aaaaaaaaaCTTAGGTTCTTGGATGTTCGGGAAAGGGGGTTATTATCTAGGATCCTTGAA
GCACCCCCAAGGGCATCTTCTCAAAGTTGGATGTGTGCATTTTCCTGAGAGGAAAGCTTT
CCCACATTATACAGCTTCTGAAAGGGTTGCTTGACCCACAGATGTGAAGCTGAGGCTGAA
GGAGACTGATGTGGTTTCTCCTCAGTTTCTCTGTGCAGCACCAGGTGGCAGCAGAGGTCA
GCAAGGCAAACCCGAGCCCGGGGATGCGGAGTGGGGGCAGCTACGTCCTCTCTTGAGCTA
CAGCAGATTCACTCTGTTCTGTTTCATTGTTGTTTAGTTTGCGTTGTGTTTCTCCAACTT
TGTGCCTCATCAGGAAAAGCTTTGGATCACAATTCCCAGtgctgaagaaaaggccaaact
cttggttgtgttctttgattAGTgcctgtgacgcagcttcaggaggtcctgagaacgtgt
gcacagtttagtcggcagaaacttagggaaatgtaagaccaccatcagcacataggagtt
ctgcattggtttggtctgcattggtttggtCTTTTCCTGGATACAGGTCTTGATAGGTCT
CTTGATGTCATTTCACTTCAGATTCTTCTTTAGAAAACTTGGACAATAGCATTTGCTGTC
TTGTCCAAATTGTTACTTCAAGTTTGCTCTTAGCAAGTAATTGTTTCAGTATCTATATCA
AAAATGGCTTAAGCCTGCAACATGTTTCTGAATGATTAACAAGGTGATAGTCAGTTCTTC
ATTGAATCCTGGATGCTTTATTTTTCTTAATAAGAGGAATTCATATGGATCAG
>ENST00000426316.1
AATGATCAAATTATGTTTCCCATGCATCAGGTGCAATGGGAAGCTCTTctggagagtgag
agaagcttccagttaaggtgacattgaagccaagtcctgaaagatgaggaagagttgtat
gagagtggggagggaagggggaggtggagggaTGGGGAATGGGCCGGGATGGGATAGCGC
AAACTGCCCGGGAAGGGAAACCAGCACTGTACAGACCTGAACAACGAAGATGGCATATTT
TGTTCAGGGAATGGTGAATTAAGTGTGGCAGGAATGCTTTGTAGACACAGTAATTTGCTT
GTATGGAATTTTGCCTGAGAGACCTCATTCCTCACGTCGGCCATTCCAGGCCCCGTTTTT
CCCTTCCGGCAGCCTCTTGGCCTCTAATTTGTTTATCTTTTGTGTATAAATCCCAAAATA
TTGAATTTTGGAATATTTCCACCATTATGTAAATATTTTGATAGGTAA

 

How to Compare Regression Slopes

How to Compare Regression Slopes

If you perform linear regression analysis, you might need to compare different regression lines to see if their constants and slope coefficients are different. Imagine there is an established relationship between X and Y. Now, suppose you want to determine whether that relationship has changed. Perhaps there is a new context, process, or some other qualitative change, and you want to determine whether that affects the relationship between X and Y.

For example, you might want to assess whether the relationship between the height and weight of football players is significantly different than the same relationship in the general population.

You can graph the regression lines to visually compare the slope coefficients and constants. However, you should also statistically test the differences. Hypothesis testing helps separate the true differences from the random differences caused by sampling error so you can have more confidence in your findings.

In this blog post, I’ll show you how to compare a relationship between different regression models and determine whether the differences are statistically significant. Fortunately, these tests are easy to do using Minitab statistical software.

In the example I’ll use throughout this post, there is an input variable and an output variable for a hypothetical process. We want to compare the relationship between these two variables under two different conditions. Here is the Minitab project file with the data.

Comparing Constants in Regression Analysis

When the constants (or y intercepts) in two different regression equations are different, this indicates that the two regression lines are shifted up or down on the Y axis. In the scatterplot below, you can see that the Output from Condition B is consistently higher than Condition A for any given Input value. We want to determine whether this vertical shift is statistically significant.

Scatterplot with two regression lines that have different constants.

To test the difference between the constants, we just need to include a categorical variable that identifies the qualitative attribute of interest in the model. For our example, I have created a variable for the condition (A or B) associated with each observation.

To fit the model in Minitab, I’ll use: Stat > Regression > Regression > Fit Regression Model. I’ll include Output as the response variable, Input as the continuous predictor, and Condition as the categorical predictor.

In the regression analysis output, we’ll first check the coefficients table.

Coefficients table that shows that the constants are different

This table shows us that the relationship between Input and Output is statistically significant because the p-value for Input is 0.000.

The coefficient for Condition is 10 and its p-value is significant (0.000). The coefficient tells us that the vertical distance between the two regression lines in the scatterplot is 10 units of Output. The p-value tells us that this difference is statistically significant—you can reject the null hypothesis that the distance between the two constants is zero. You can also see the difference between the two constants in the regression equation table below.

Regression equation table that shows constants that are different

Comparing Coefficients in Regression Analysis

When two slope coefficients are different, a one-unit change in a predictor is associated with different mean changes in the response. In the scatterplot below, it appears that a one-unit increase in Input is associated with a greater increase in Output in Condition B than in Condition A. We can see that the slopes look different, but we want to be sure this difference is statistically significant.

Scatterplot that shows two slopes that are different

How do you statistically test the difference between regression coefficients? It sounds like it might be complicated, but it is actually very simple. We can even use the same Condition variable that we did for testing the constants.

We need to determine whether the coefficient for Input depends on the Condition. In statistics, when we say that the effect of one variable depends on another variable, that’s an interaction effect. All we need to do is include the interaction term for Input*Condition!

In Minitab, you can specify interaction terms by clicking the Model button in the main regression dialog box. After I fit the regression model with the interaction term, we obtain the following coefficients table:

Coefficients table that shows different slopes

The table shows us that the interaction term (Input*Condition) is statistically significant (p = 0.000). Consequently, we reject the null hypothesis and conclude that the difference between the two coefficients for Input (below, 1.5359 and 2.0050) does not equal zero. We also see that the main effect of Condition is not significant (p = 0.093), which indicates that difference between the two constants is not statistically significant.

Regression equation table that shows different slopes

It is easy to compare and test the differences between the constants and coefficients in regression models by including a categorical variable. These tests are useful when you can see differences between regression models and you want to defend your conclusions with p-values.

If you’re learning about regression, read my regression tutorial!

Size Matters: Metabolic Rate and Longevity (Regression analysis sample)

Size Matters: Metabolic Rate and Longevity

John Tukey once said, “The best thing about being a statistician is that you get to play in everyone’s backyard.” I enthusiastically agree!

I frequently enjoy reading and watching science-related material. This invariably raises questions, involving other “backyards,” that I can better understand using statistics. For instance, see my post about the statistical analysis of dolphin sounds.

The latest topic that grabbed my attention was an apparent error in the BBC program Wonders of Life. In the episode “Size Matters,” Professor Brian Cox presents a graph with a linear regression line that illustrates the relationship between the size of mammals and their metabolic rate.

How Does the Size of Mammals Affect Their Lives?

Brian Cox, a theoretical physicist, is a really smart guy and one of my favorite science presenters. So, I was surprised when his interpretation of the linear regression model seemed incorrect. Below is a closer look at the graph he presents, and his claim.

Wonder of Life plot of Mass and Energy

Cox points out the straight line and states, “That implies, gram-for-gram, large animals use less energy than small animals . . . because the slope is less than one.”

For linear regression, the slope being less than 1 is irrelevant. Instead, the fact that it is a straight line indicates that the same relationship applies for both small and large mammals. If you increase mass by 1 unit for a small mammal and for a large mammal, metabolism increases by the same average amount for both sizes. In other words, gram-for-gram, size doesn’t seem to matter!

However, it’s unlikely that Cox would make such a fundamental mistake, so I conducted some research. It turns out that biologists use a log-log plot to model the relationship between the mass of mammals and their basal metabolic rate.

Perhaps Cox actually presented a log-log plot but glossed over the details?

If so, this dramatically changes the interpretation of this graph, because log-log plots transform both axes in order to model curvature while using linear regression. If the slope on a log-log plot of metabolic rate by mass is between 0 and 1, it indicates that the nonlinear effect of mass on metabolic rate lessens as mass increases.

This description fits Cox’s statements about the slope and how mass effects metabolic rate.

Let’s test Cox’s hypothesis ourselves! Thanks to the PanTHERIA database*, we can fit the same type of log-log plot using similar data.

Log-Log Plot of Mammal Mass and Basal Metabolic Rate

I’ll use the Fitted Line Plot in Minitab statistical software to fit a regression line to 572 mammals, ranging from the masked shrew (4.2 grams) to the common eland (562,000 grams). You can find the data here.

In Minitab, go to Stat > Regression > Fitted Line Plot. Metabolic rate is our response variable and adult mass is our predictor. Go to Options and check all four boxes under Transformations to produce the log-log plot.

Fitted line plot of mammal mass by metabolic rate

The slope (0.7063) is significant (p = 0.000) and its value is consistent with recently published estimates. Because the slope is between 0 and 1, it confirms Cox’s interpretation. Gram-for-gram, larger animals use less energy than smaller animals. In order to function, a cell in a larger animal requires less energy than a cell in a smaller animal.

I’m quite amazed that the R-squared is 94.3%! Usually you only see R-squared values this high for a low-noise physical process. Instead, these data were collected by a variety of researchers in different settings and cover a huge range of mammals that live in completely different habitats.

So Cox presented the correct interpretation after all: for mammals, size matters. However, he presented an oversimplified version of the underlying analysis by not mentioning the double-log transformations. This is television, after all!

There are important implications based on the fact that this model is curved rather than linear. If the increase in metabolic rate remained constant as mass increased (the linear model), we’d have to eat 16,000 calories a day to sustain ourselves. Further, mammals wouldn’t be able to grow larger than a goat due to overheating!

Basal Metabolic Rate and Longevity

Cox also presented the idea that mammals with a slower metabolism live longer than those with a faster metabolism. However, he didn’t present data or graphs to support this contention. Fortunately, we can test this hypothesis as well.

In Minitab, I used Calc > Calculator to divide metabolic rate by grams. This division allows us to make the gram-for-gram comparison. Time for another fitted line plot with a double-log transformation!

Fitted line plot of metabolic rate by longevity

The negative slope is significant (0.000) and tells us that as metabolic rate per gram increases, longevity decreases. The R-squared is 45.8%, which is pretty good considering that we’re looking at just one of many factors than can impact maximum lifespan!

However, it is not a linear relationship because this is a log-log plot. Maximum longevity asymptotically approaches a minimum value around 13 months as metabolism increases. The graph below shows the curvilinear relationship using the natural scale.

Fitted line for metabolic rate and longevity

A one-unit increase in the slower metabolic rates (left side of x-axis) produces much larger drops in longevity than a on-unit increase in the faster metabolic rates (right side of x-axis).

Once again, we’ve shown that size does matter! Larger mammals tend to have a slower metabolism. And animals that have a slower metabolism tend to live longer. That’s fortunate for us because without our slower metabolism, we’d only live about a year!