A step-by-step guide to differential expression analysis of transcriptomes with TopHat and Cufflinks

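Today a classmate recommended a Nature Protocols article on differential expression analysis of transcriptomes. As usual, I skimmed the figures and tables before reading it properly, and honestly the article struck me as rather unconventional. It is a hands-on protocol for differential expression analysis with Bowtie, TopHat and Cufflinks. It explains in detail what each step analyzes, which software to use, and the relevant commands and parameters.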

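According to the workflow described in the article, transcriptome analysis consists mainly of the following parts, whether the data are strand-specific RNA-seq or non-strand-specific:

1) Read mapping. Two programs are recommended here: Bowtie and TopHat (unlike Bowtie or BWA, TopHat can align reads across splice junctions and so can identify alternative splicing of transcripts).

2) Transcript assembly (with Cufflinks), comparison of the assembled transcripts with the existing genome annotation (with Cuffcompare), merging of assemblies (with Cuffmerge), and differential expression analysis of transcripts (with Cuffdiff).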
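To make the order of the steps above concrete, here is a minimal Python sketch that only builds the command lines and does not run anything. The sample names, file names and thread count are hypothetical, the optional Cuffcompare step is left out, and the options shown (-p, -G, -o, -g, -s, -b, -L, -u) should be checked against the protocol and the manuals of the installed versions before use.

```python
import shlex


def build_commands(samples, genome_index, genome_fasta, annotation_gtf, threads=8):
    """Return (commands, assembly_paths) for a TopHat/Cufflinks style run.

    samples maps a condition label to a list of (name, reads_1, reads_2)
    tuples, one per paired-end replicate. All names are hypothetical.
    """
    commands = []
    assembly_paths = []      # one transcripts.gtf per replicate, to be listed for Cuffmerge
    bams_by_condition = {}   # condition label -> alignment files of its replicates
    for condition, replicates in samples.items():
        for name, reads_1, reads_2 in replicates:
            tophat_dir = f"{name}_thout"
            cufflinks_dir = f"{name}_clout"
            bam = f"{tophat_dir}/accepted_hits.bam"
            # Step 1: spliced read mapping with TopHat.
            commands.append(["tophat", "-p", str(threads), "-G", annotation_gtf,
                             "-o", tophat_dir, genome_index, reads_1, reads_2])
            # Step 2a: transcript assembly with Cufflinks, per replicate.
            commands.append(["cufflinks", "-p", str(threads), "-o", cufflinks_dir, bam])
            assembly_paths.append(f"{cufflinks_dir}/transcripts.gtf")
            bams_by_condition.setdefault(condition, []).append(bam)
    # Step 2b: merge all assemblies; assemblies.txt must list assembly_paths, one per line.
    commands.append(["cuffmerge", "-g", annotation_gtf, "-s", genome_fasta,
                     "-p", str(threads), "assemblies.txt"])
    # Step 2c: differential expression; replicates of one condition are comma-separated.
    commands.append(["cuffdiff", "-o", "diff_out", "-b", genome_fasta,
                     "-p", str(threads), "-L", ",".join(bams_by_condition),
                     "-u", "merged_asm/merged.gtf"]
                    + [",".join(bams) for bams in bams_by_condition.values()])
    return commands, assembly_paths


if __name__ == "__main__":
    example_samples = {
        "C1": [("C1_R1", "C1_R1_1.fq", "C1_R1_2.fq")],
        "C2": [("C2_R1", "C2_R1_1.fq", "C2_R1_2.fq")],
    }
    cmds, assemblies = build_commands(example_samples, "genome", "genome.fa", "genes.gtf")
    for cmd in cmds:
        print(shlex.join(cmd))
```

Building the commands as lists keeps each argument separate, so they can later be passed to subprocess.run without shell quoting problems.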

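Two figures from the original article are attached below for a quick overview of the analysis. Figure 1 shows the software that may be used in transcriptome analysis and what each program does; Figure 2 shows the general workflow of transcript analysis.

Figure 1

Figure 2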

The commands and parameters used by each program during the analysis are not reproduced here; please read the original article directly.

Cole Trapnell, Adam Roberts, Loyal Goff, Geo Pertea, Daehwan Kim, David R Kelley, Harold Pimentel, Steven L Salzberg, John L Rinn & Lior Pachter. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nature Protocols 7, 562–578 […]

Cross-referencing sequence identifiers across the major sequence databases

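After coming to NCBI, one of my tasks has been to reanalyze data that other people produced years ago, in the hope of finding new patterns. This involves many problems of matching identifiers to sequences, protein sequences to nucleotide sequences, and records in one database to records in another. Here is a summary, which I hope is useful to others as well.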

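First, the numbering used in NCBI's sequence databases. Two kinds of identifier are in common use: the GI and the ACCESSION. An ACCESSION has the form CC_##### (as in RefSeq records), where CC is two letters whose combinations distinguish protein, nucleotide and genomic sequences, and # stands for a variable number of digits. A GI consists of digits only. A version number is appended to the ACCESSION in the form CC_####.#, and the trailing number is incremented whenever the sequence record is revised. ACCESSION plus version is therefore a unique identifier that stands for exactly one sequence and never changes. A GI is in principle also one number per sequence, but because sequences are revised often, GI numbers change too, and some GIs are deleted. NCBI's Sequence Revision History shows the history of each GI, including deleted ones. In principle the GI is not a very good sequence identifier, but because it is simple (digits only) it is easy to handle in scripts and is widely used, which causes many problems (for example, several GIs may correspond to one ACCESSION, and you may be unable to retrieve the corresponding sequence by GI through BioPerl).

The first problem of linking sequence data is the link between GI and ACCESSION, and between protein and nucleotide sequences. We often have a set of GI numbers in hand and want to know the corresponding protein sequences, and the nucleotide sequences that encode those proteins. In principle all of this information is in the database; the question is how to get it. For a handful of GIs it is easy: query the NCBI database with each GI, it returns the corresponding protein, and the links on the page lead to the corresponding nucleotide sequence. But what about a large number of GIs, which is common in comparative genomics? NCBI provides plain-text mapping files on its ftp site. For example, gene2refseq contains all of this information (and more, such as gene ID, tax ID and chromosome position).

The second problem is linking data across databases. We often receive a data set in which genes are named with one set of identifiers, while the sequence information we have uses another set. How do we link them (again, for large amounts of data)? How do EMBL data map to NCBI data, how do UniProt data map to NCBI, and how do species-specific databases (such as FlyBase and WormBase) map to NCBI? EBI has a dedicated FAQ that lists various services and data resources. NCBI also publishes its mappings to other databases on its ftp site and updates them regularly. Some dedicated websites offer this service as well. One that I have used is bioDBnet, which not only provides cross-reference queries between a large number of databases but also helps you find the database you need!

That is all for now; I will add updates later.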
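As a small illustration of the identifier scheme and the bulk lookup described above, the following Python sketch splits a CC_#####.# accession into prefix, number and version, and builds a protein GI to accession table from tab-separated rows. The column positions, the "-" placeholder for missing values, and the sample rows are assumptions made for illustration (the sample identifiers are invented); check the README on the NCBI ftp site for the real layout of gene2refseq before relying on them.

```python
import csv
import io
import re

# Two capital letters, an underscore, digits, and an optional ".version" suffix.
ACCESSION_RE = re.compile(r"^([A-Z]{2})_(\d+)(?:\.(\d+))?$")

# A few widely documented RefSeq prefixes; the list is not exhaustive.
MOLECULE_BY_PREFIX = {"NM": "mRNA", "NR": "non-coding RNA", "NP": "protein", "NC": "genomic"}


def parse_accession(text):
    """Split 'CC_#####.#' into (prefix, number, version); version is None if absent."""
    match = ACCESSION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a CC_#####[.#] accession: {text!r}")
    prefix, number, version = match.groups()
    return prefix, number, int(version) if version is not None else None


def protein_gi_to_accessions(handle, rna_col=3, protein_col=5, protein_gi_col=6):
    """Map protein GI -> (protein accession.version, RNA accession.version).

    The default column indices are an assumption about the file layout.
    Comment lines start with '#'; rows without a protein GI ('-') are skipped.
    """
    mapping = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        gi = row[protein_gi_col]
        if gi == "-":
            continue
        mapping[gi] = (row[protein_col], row[rna_col])
    return mapping


if __name__ == "__main__":
    # Invented rows in the assumed layout: tax_id, GeneID, status,
    # RNA accession, RNA GI, protein accession, protein GI.
    sample = (
        "#tax_id\tGeneID\tstatus\trna_acc\trna_gi\tprotein_acc\tprotein_gi\n"
        "9606\t1\tREVIEWED\tNM_000001.1\t222\tNP_000001.1\t111\n"
        "9606\t2\tREVIEWED\tNR_000002.1\t333\t-\t-\n"
    )
    table = protein_gi_to_accessions(io.StringIO(sample))
    for gi, (protein, rna) in table.items():
        prefix, _, version = parse_accession(protein)
        print(gi, protein, rna, MOLECULE_BY_PREFIX.get(prefix, "unknown"), version)
```

For the real file, the same function could be given a handle opened with gzip.open(path, "rt"), so the compressed download does not need to be unpacked first.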

Source URL of this post: http://blog.sciencenet.cn/blog-286438-424412.html

ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/

Name                 Size      Date modified
[parent directory]
ASN_BINARY/                    12-12-20 12:30:00 PM
ASN_OLD/                       05-8-2 8:00:00 AM
GENE_INFO/                     12-12-20 12:31:00 PM
README               0 B       12-2-10 8:00:00 AM
README_ensembl       1000 B    12-11-3 5:14:00 PM
gene2accession.gz    416 MB    12-12-20 11:02:00 AM
gene2ensembl.gz      5.5 MB    12-12-20 11:11:00 AM
gene2go.gz           13.1 MB   12-12-20 11:14:00 AM
gene2pubmed.gz       31.1 MB   12-12-20 11:14:00 AM
[…]

Batch download sequences from uniprot based on protein names

Ok, I’ll do mine in English:

Go to UniProt.org and click the “Retrieve” tab. Paste the list into the text box and click the Retrieve button. On the results page, click the FASTA download link [ Download (30 KB*) | Open ] (or click Open just to have a look).

Circos installation and learning (part 1)

http://circos.ca/documentation/ Tutorials and Course

The tutorials serve as a walkthrough through Circos. The course is a more structured set of materials that takes you through creating an image from scratch.

The tutorials act as documentation — each lesson presents a specific feature of Circos.

Example Image

Once you download and install Circos,

# install circos […]

Bio3D in R Utilities for the analysis of protein structure and sequence data

http://users.mccammon.ucsd.edu/~bgrant/bio3d/user_guide/user_guide.html#example

Some Beginner Examples

library(bio3d) # load the bio3d package
lbio3d() # list the functions within the package

## See the help pages of individual functions for full documentation and worked examples.
help(read.pdb) # type "q" to exit the help page and return to the R prompt
example(read.pdb)

## Read a PDB […]

Positive-Unlabeled Learning for Disease Gene Identification

Background: Identifying disease genes from human genome is an important but challenging task in biomedical research. Machine learning methods can be applied to discover new disease genes based on the known ones. Existing machine learning methods typically use the known disease genes as the positive training set P and the unknown genes as the […]

RazerS 3: Faster, fully sensitive read mapping

Motivation: During the last years NGS sequencing has become a key technology for many applications in the biomedical sciences. Throughput continues to increase and new protocols provide longer reads than currently available. In almost all applications, read mapping is a first step. Hence, it is crucial to have algorithms and implementations that perform fast, […]

FacPad: Bayesian Sparse Factor Modeling for the Inference of Pathways Responsive to Drug Treatment

Motivation: It is well recognized that the effects of drugs are far beyond targeting individual proteins, but rather influencing the complex interactions among many relevant biological pathways. Genome-wide expression profiling before and after drug treatment has become a powerful approach for capturing a global snapshot of cellular response to drugs, as well as to […]

Qualimap: evaluating next generation sequencing alignment data

Motivation: The sequence alignment/map (SAM) and the binary alignment/map (BAM) formats have become the standard method of representation of nucleotide sequence alignments for next-generation sequencing data. SAM/BAM files usually contain information from tens to hundreds of millions of reads. Often, the sequencing technology, protocol, and/or the selected mapping algorithm introduce some unwanted biases in […]

MEGA-CC: Computing Core of Molecular Evolutionary Genetics Analysis program for automated and iterative data analysis

Summary: There is a growing need in the research community to apply the Molecular Evolutionary Genetics Analysis (MEGA) software tool for batch processing a large number of datasets and to integrate it into analysis workflows. We now make available the computing core of the MEGA software as a stand-alone executable (MEGA-CC), along with an […]