小生这厢有礼了 (BioFaceBook Personal Blog) » Biochips

Notes on the Cytoscape Basic Tutorial

szypanther — Tue, 11 Jun 2013 05:29:32 +0000

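I started learning Cytoscape yesterday. Its tutorial comes in two parts, basic and advanced. The basic tutorial is further divided into four lessons: Getting Started, Filters & Editor, Fetching External Data, and Expression Analysis. I’m jotting down notes so I don’t forget.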

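Lesson 1 is for getting started; the address is http://goo.gl/FJLxp.
Cytoscape can be installed locally or launched via web start. The software runs on Java, so a JRE is required. I installed OpenJDK on Ubuntu and it runs fine. Because I had never associated jnlp files with Java, I had never managed to web start anything. I tried the link given in the “textbook” and it did not seem reliable; in any case, it failed to launch.
After starting Cytoscape, you need to download two sample files. The file with the sif extension holds the protein interaction network, in which proteins are identified by numbers. The file with the na extension holds the annotation for each numeric id. It seems the two files must share the same name to be linked.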
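To make the sif format concrete: in its simplest form, each line of a simple interaction file names a source node, an interaction type, and a target node, separated by whitespace. A minimal sketch in base R follows; the numeric ids and the interaction label pp are made-up placeholders, not the tutorial’s actual sample data.

```r
# Hypothetical SIF content (ids and the "pp" label are made up for illustration).
sif_lines <- c(
  "101 pp 102",
  "101 pp 103",
  "102 pp 104",
  "103 pp 105"
)

# Read the three whitespace-separated columns as text so the ids stay intact.
edges <- read.table(text = sif_lines, header = FALSE,
                    col.names = c("source", "interaction", "target"),
                    colClasses = "character")

# The node list is every id that appears at either end of an edge.
nodes <- unique(c(edges$source, edges$target))

print(edges)
print(nodes)
```

Reading the ids as character rather than numeric keeps them usable as join keys against an annotation table later.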
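There are two ways to open or import a sif file: File → Import → Network (Multiple File Types), or simply Ctrl+L. For the na file, use File → Import → Node Attributes. Once a network is imported it can be displayed in several styles. In the default style of version 2.8, the circles are proteins, called nodes, and the lines between them are edges, which represent interactions. Clicking a circle selects a node. To select several, either hold Shift while clicking, or first choose under Select → Mouse Drag Selects whether dragging selects nodes or edges, then drag the mouse to grab a whole swath at once.
You can also select with a purpose. For example, choose Select → Nodes → By Name and enter a protein id to select that node. Needle found in the haystack. The shortcut for this operation is Ctrl+F.
If nodes are already selected, Select → Nodes → First neighbors of selected nodes also selects the proteins that interact directly with them. Then File → New → Network → From selected nodes, all edges splits off a sub-network of the interaction network.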
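The two menu operations above, first neighbors and a new network from the selected nodes, amount to simple set operations on the edge list. A rough base R sketch, using a made-up edge table; the function names first_neighbors and induced_subnetwork are hypothetical and are not part of Cytoscape:

```r
# Made-up edge list standing in for an imported sif network.
edges <- data.frame(
  source = c("101", "101", "102", "103"),
  target = c("102", "103", "104", "105"),
  stringsAsFactors = FALSE
)

# Nodes that share an edge with any selected node (excluding the selection itself).
first_neighbors <- function(edges, selected) {
  touches <- edges$source %in% selected | edges$target %in% selected
  setdiff(c(edges$source[touches], edges$target[touches]), selected)
}

# Keep only the edges whose two endpoints are both in the kept node set.
induced_subnetwork <- function(edges, keep) {
  edges[edges$source %in% keep & edges$target %in% keep, , drop = FALSE]
}

selected  <- "101"
neighbors <- first_neighbors(edges, selected)
sub       <- induced_subnetwork(edges, c(selected, neighbors))

print(neighbors)
print(sub)
```

Here the selection 101 pulls in 102 and 103, and the sub-network keeps only the two edges that join those three nodes.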
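The Layout menu is rather flashy; it governs how the interaction network graph is arranged. You can try the options at random and drop the results one by one into a ppt to impress people, hehe.
Below the tangled mess of the network graph is the place for viewing the information that nodes or edges carry, called the data panel. Pressing the Attributes button pops up a small window where you choose which columns to list, such as the id or the corresponding gene name. The gene name, of course, is imported from the na file.
To the left of the tangle is a control panel with several tabs. The Network tab lets you switch among the imported sif networks. The VizMapper tab lets you customize the display style: turn circles into squares, make them bigger, change the background color, change edge width and color, and so on. If you somehow cannot find it, there are several shortcut buttons under Cytoscape’s menu bar, one of which opens VizMapper. After setting up the styles, be sure to click Apply. You can also create a new style or save one under another name, which is handy for stamping all your networks with your own look.
Cytoscape also supports importing interaction networks directly from the web, again under File → Import → Network (Multiple File Types): choose remote, then enter a url. Judging from the exercise, it should at least be compatible with the SBML format. But I still cannot get a proxy working under Linux, so I had better not daydream about such a “luxury” feature…
Finally, here is my favorite passage from the Lesson 1 text, loosely translated:
Congratulations! You actually survived the most boring part of the whole tutorial: “casing the joint”. Indulge yourself a little: get a jianbing with pickled mustard and enjoy it!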

Reading the NCBI’s GEO microarray SOFT files in R/BioConductor

szypanther — Thu, 23 May 2013 02:47:38 +0000

http://www2.warwick.ac.uk/fac/sci/moac/people/students/peter_cock/r/geo/

This page discusses how to load GEO SOFT format microarray data from the Gene Expression Omnibus database (GEO) (hosted by the NCBI) into R/BioConductor. SOFT stands for Simple Omnibus Format in Text. There are actually four types of GEO SOFT file available:

GEO Platform (GPL)
These files describe a particular type of microarray. They are annotation files.

GEO Sample (GSM)
Files that contain all the data from the use of a single chip. For each gene there will be multiple scores including the main one, held in the VALUE column.

GEO Series (GSE)
Lists of GSM files that together form a single experiment.

GEO Dataset (GDS)
These are curated files that hold a summarised combination of a GSE file and its GSM files. They contain normalised expression levels for each gene from each sample (i.e. just the VALUE field from the GSM file).

As long as you just need the expression level then a GDS file will suffice. If you need to dig deeper into how the expression levels were calculated, you’ll need to get all the GSM files instead (which are listed in the GDS or GSE file).
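Since the four record types are distinguished by their accession prefix, a small helper can tell them apart. This is an illustrative base R sketch; the function name geo_record_type is hypothetical and not part of any package.

```r
# Map a GEO accession to its record type using the three-letter prefix.
# Unknown prefixes come back as NA.
geo_record_type <- function(accession) {
  prefix <- toupper(substr(accession, 1, 3))
  types <- c(GPL = "platform", GSM = "sample", GSE = "series", GDS = "dataset")
  unname(types[prefix])
}

print(geo_record_type(c("GDS858", "GPL96", "GSM14498")))
```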

To me, it was natural to ask: How can I turn a GEO DataSet (GDS file) into an R/BioConductor expression set object (exprSet)? And while we’re at it, how can I load the GEO Platform annotation (GPL file) too?

In the MOAC Module 5 assignment, the approach taken was to sanitize the data by hand, allowing it to be loaded into R with a simple call to the read.table command. It’s a good idea to look at the raw files to understand what you are dealing with, but surely there is a more elegant way…
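For a sense of what the by-hand route involves, here is a minimal base R sketch. It works on a hand-made fragment that imitates a GDS SOFT file; the marker lines !dataset_table_begin and !dataset_table_end reflect my understanding of the format and should be treated as an assumption, and the fragment is not the real GDS858 file.

```r
# Hand-made fragment imitating a GDS SOFT file (not the real GDS858 content).
soft_lines <- c(
  "^DATASET = GDS858",
  "!dataset_platform = GPL96",
  "!dataset_table_begin",
  "ID_REF\tIDENTIFIER\tGSM14498\tGSM14499",
  "1007_s_at\tU48705\t3736.9\t3811.0",
  "1053_at\tM87338\t343.0\t500.3",
  "!dataset_table_end"
)

# Locate the table body between the two marker lines (assumed markers).
begin <- which(soft_lines == "!dataset_table_begin")
end   <- which(soft_lines == "!dataset_table_end")
table_lines <- soft_lines[(begin + 1):(end - 1)]

# What remains is plain tab-separated text that read.table can handle.
expr <- read.table(text = table_lines, header = TRUE, sep = "\t",
                   stringsAsFactors = FALSE, check.names = FALSE)
print(expr)
```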

It turns out there are several existing GEO parsers, but one stands out above all others: Sean Davis’ GEOquery (released roughly December 2005).

Installing GEOquery

Assuming you are running a recent version of BioConductor (1.8 or later) you should be able to install it from within R as follows:

> source("http://www.bioconductor.org/biocLite.R")
> biocLite("GEOquery")
Running bioCLite version 0.1.6 with R version 2.3.1
...

For those of you on an older version of BioConductor, you will have to download and install it by hand from here.

If you are using Windows, download GEOquery_1.6.0.zip (or similar) and save it. Then from within the R program, use the menu option “Packages”, “Install package(s) from local zip files…” and select the ZIP file.

On Linux, download GEOquery_1.6.0.tar.gz (or similar) and use sudo R CMD INSTALL GEOquery_1.6.0.tar.gz at the command prompt.
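On current Bioconductor releases the biocLite.R script has been retired in favour of the BiocManager package from CRAN. A sketch of the equivalent installation, assuming a reasonably recent R and a working network connection:

```r
# Install the BiocManager helper from CRAN if it is not already present.
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}

# Then install GEOquery from Bioconductor.
BiocManager::install("GEOquery")
```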

Loading a GDS file with GEOquery

Here is a quick introduction to how to load a GDS file, and turn it into an expression set object:

library(Biobase)
library(GEOquery)
#Download GDS file, put it in the current directory, and load it:
gds858 <- getGEO('GDS858', destdir=".")
#Or, open an existing GDS file (even if it's compressed):
gds858 <- getGEO(filename='GDS858.soft.gz')

I’m using GDS858 as input. The SOFT file is available in compressed form here GDS858.soft.gz, but GEOquery takes care of finding this file for you and unzipping it automatically.

Loading this file from the hard disk takes about two minutes on my laptop.

There are two main things the GDS object gives us: metadata (from the file header) and a table of expression data. These are extracted using the Meta and Table functions. First let’s have a look at the metadata:

> Meta(gds858)$channel_count
[1] "1"
> Meta(gds858)$description
[1] "Comparison of lung epithelial Calu-3 cells infected ..."
> Meta(gds858)$feature_count
[1] "22283"
> Meta(gds858)$platform
[1] "GPL96"
> Meta(gds858)$sample_count
[1] "19"
> Meta(gds858)$sample_organism
[1] "Homo sapiens"
> Meta(gds858)$sample_type
[1] "cDNA"
> Meta(gds858)$title
[1] "Mucoid and motile Pseudomonas aeruginosa infected lung epithelial cell comparison"
> Meta(gds858)$type
[1] "gene expression array-based"

Useful stuff, and now the expression data table:

> colnames(Table(gds858))
 [1] "ID_REF"     "IDENTIFIER" "GSM14498"   "GSM14499"   "GSM14500"
 [6] "GSM14501"   "GSM14513"   "GSM14514"   "GSM14515"   "GSM14516"
[11] "GSM14506"   "GSM14507"   "GSM14508"   "GSM14502"   "GSM14503"
[16] "GSM14504"   "GSM14505"   "GSM14509"   "GSM14510"   "GSM14511"
[21] "GSM14512"
> Table(gds858)[1:10,1:6]
      ID_REF IDENTIFIER GSM14498 GSM14499 GSM14500 GSM14501
1  1007_s_at     U48705   3736.9   3811.0   3699.6   3897.6
2    1053_at     M87338    343.0    500.3    288.3    341.3
3     117_at     X51757    120.9     34.3    145.8    110.5
4     121_at     X69699   1523.8   1281.1   1281.9   1493.4
5  1255_g_at     L36861     51.6     15.9     45.9      8.1
6    1294_at     L13852    253.2    164.8    200.0    205.2
7    1316_at     X55005    199.6    250.7    290.3    218.6
8    1320_at     X79510     81.7     13.4     13.9     88.7
9  1405_i_at     M21121     18.9      5.6     11.0      9.5
10   1431_at     J02843     99.7     74.5     72.6    114.8

Now, let’s turn this GDS object into an expression set object (using base 2 logarithms) and have a look at it:

> eset <- GDS2eSet(gds858, do.log2=TRUE)
> eset
Expression Set (exprSet) with
    22283 genes
    19 samples
    phenoData object with 4 variables and 19 cases
    varLabels
        : sample
        : infection
        : genotype/variation
        : description
> geneNames(eset)[1:10]
 [1] "1007_s_at" "1053_at"   "117_at"    "121_at"    "1255_g_at"
 [6] "1294_at"   "1316_at"   "1320_at"   "1405_i_at" "1431_at"
> sampleNames(eset)
 [1] "GSM14498" "GSM14499" "GSM14500" "GSM14501" "GSM14513"
 [6] "GSM14514" "GSM14515" "GSM14516" "GSM14506" "GSM14507"
[11] "GSM14508" "GSM14502" "GSM14503" "GSM14504" "GSM14505"
[16] "GSM14509" "GSM14510" "GSM14511" "GSM14512"

GEOquery does an excellent job of extracting the phenotype data, as you can see:

> pData(eset)$infection
 [1] FRD1       FRD1       FRD1       FRD1       FRD440
 [6] FRD440     FRD440     FRD440     FRD875     FRD875
[11] FRD875     FRD875     FRD1234    FRD1234    FRD1234
[16] uninfected uninfected uninfected uninfected
Levels: FRD1 FRD1234 FRD440 FRD875 uninfected
> pData(eset)$"genotype/variation"
 [1] control                control
 [3] control                control
 [5] mucoid                 mucoid
 [7] mucoid                 mucoid
 [9] motile                 motile
[11] motile                 motile
[13] non-mucoid, non-motile non-mucoid, non-motile
[15] non-mucoid, non-motile non-mucoid, non-motile
[17] non-mucoid, non-motile non-mucoid, non-motile
[19] non-mucoid, non-motile
Levels: control motile mucoid non-mucoid, non-motile

As with any expression set object, it’s easy to pull out a subset of the data:

> eset["1320_at","GSM14504"]
Expression Set (exprSet) with
    1 genes
    1 samples
    phenoData object with 4 variables and 1 cases
    varLabels
        : sample
        : infection
        : genotype/variation
        : description
> exprs(eset["1320_at","GSM14504"])
        GSM14504
1320_at  6.70044
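The log transform and the name-based subsetting can be tried without GEOquery at all. The sketch below rebuilds a tiny corner of the expression table shown earlier as a plain matrix; the matrix is a stand-in for the expression set, not an exprSet object.

```r
# Raw values for three probes and two samples, copied from the table above.
raw <- matrix(
  c(3736.9, 343.0, 120.9,    # GSM14498
    3811.0, 500.3,  34.3),   # GSM14499
  nrow = 3,
  dimnames = list(c("1007_s_at", "1053_at", "117_at"),
                  c("GSM14498", "GSM14499"))
)

# The base 2 log transform that do.log2=TRUE asks for.
logged <- log2(raw)

# Subset by probe id and sample name, keeping the matrix shape.
one_value <- logged["1053_at", "GSM14499", drop = FALSE]
print(one_value)

# Undoing the transform recovers the raw intensities.
print(all.equal(2^logged, raw))
```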

You should be able to produce a heatmap of differentially expressed genes easily enough using this page, especially as the phenotype/sub-sample information has been sorted out for you.

Loading a GPL (Annotation) file with GEOquery

In addition to loading a GDS file to get the expression levels, you can also load the associated platform annotation file. You can find out which platform was used from the GDS858 meta information:

> Meta(gds858)$platform
[1] "GPL96"

So, for GDS858, the platform is GPL96, Affymetrix GeneChip Human Genome U133 Array Set HG-U133A.

Now let’s load up the GPL file and have a look at it (it’s a big file, about 12 MB, so this takes a while!):

library(Biobase)
library(GEOquery)
#Download GPL file, put it in the current directory, and load it:
gpl96 <- getGEO('GPL96', destdir=".")
#Or, open an existing GPL file:
gpl96 <- getGEO(filename='GPL96.soft')

As with the GDS object, we can use the Meta and Table functions to extract information:

> Meta(gpl96)$title
[1] "Affymetrix GeneChip Human Genome U133 Array Set HG-U133A"
> colnames(Table(gpl96))
 [1] "ID"                               "Species.Scientific.Name"
 [3] "Annotation.Date"                  "GB_LIST"
 [5] "SPOT_ID"                          "Sequence.Source"
 [7] "Representative.Public.ID"         "Gene.Title"
 [9] "Gene.Symbol"                      "Entrez.Gene"
[11] "RefSeq.Transcript.ID"             "Gene.Ontology.Biological.Process"
[13] "Gene.Ontology.Cellular.Component" "Gene.Ontology.Molecular.Function"

Let’s look at the first four columns, for the first ten genes:

> Table(gpl96)[1:10,1:4]
          ID Species.Scientific.Name Annotation.Date GB_LIST
1  1007_s_at            Homo sapiens       16-Sep-05  U48705
2    1053_at            Homo sapiens       16-Sep-05  M87338
3     117_at            Homo sapiens       16-Sep-05  X51757
4     121_at            Homo sapiens       16-Sep-05  X69699
5  1255_g_at            Homo sapiens       16-Sep-05  L36861
6    1294_at            Homo sapiens       16-Sep-05  L13852
7    1316_at            Homo sapiens       16-Sep-05  X55005
8    1320_at            Homo sapiens       16-Sep-05  X79510
9  1405_i_at            Homo sapiens       16-Sep-05  M21121
10   1431_at            Homo sapiens       16-Sep-05  J02843

This shows a hand-picked selection of the columns, again for the first ten genes:

> Table(gpl96)[1:10,c("ID","GB_LIST","Gene.Title","Gene.Symbol","Entrez.Gene")]
   ID        GB_LIST Gene.Title                                            Gene.Symbol Entrez.Gene
1  1007_s_at U48705  discoidin domain receptor family, member 1            DDR1        780
2  1053_at   M87338  replication factor C (activator 1) 2, 40kDa           RFC2        5982
3  117_at    X51757  heat shock 70kDa protein 6 (HSP70B')                  HSPA6       3310
4  121_at    X69699  paired box gene 8                                     PAX8        7849
5  1255_g_at L36861  guanylate cyclase activator 1A (retina)               GUCA1A      2978
6  1294_at   L13852  ubiquitin-activating enzyme E1-like                   UBE1L       7318
7  1316_at   X55005  thyroid hormone receptor, alpha (erythroblastic...)   THRA        7067
8  1320_at   X79510  protein tyrosine phosphatase, non-receptor type 21    PTPN21      11099
9  1405_i_at M21121  chemokine (C-C motif) ligand 5                        CCL5        6352
10 1431_at   J02843  cytochrome P450, family 2, subfamily E, polypeptide 1 CYP2E1      1571
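Because the GDS table keys its rows by ID_REF and the GPL table by ID, the two can be joined on the probe id. A small base R sketch follows, using a few values copied from the tables above rather than the real Table(gds858) and Table(gpl96) objects:

```r
# A few expression rows (stand-in for Table(gds858)).
expr <- data.frame(
  ID_REF   = c("1007_s_at", "1053_at", "121_at"),
  GSM14498 = c(3736.9, 343.0, 1523.8),
  stringsAsFactors = FALSE
)

# A few annotation rows (stand-in for Table(gpl96)).
annot <- data.frame(
  ID          = c("1007_s_at", "1053_at", "121_at", "1320_at"),
  Gene.Symbol = c("DDR1", "RFC2", "PAX8", "PTPN21"),
  stringsAsFactors = FALSE
)

# Left join: keep every expression row, attach the symbol where one exists.
annotated <- merge(expr, annot, by.x = "ID_REF", by.y = "ID", all.x = TRUE)
print(annotated)
```

Using all.x = TRUE keeps probes that have no annotation row, with NA in the symbol column, instead of silently dropping them.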

The above all used the 12 MB file GPL96.soft, but you can also get a much smaller 3 MB file GPL96.annot (compressed as GPL96.annot.gz) which has slightly different information in it… see here.

Using the BioConductor hgu133a package

Instead of loading the GEO annotation file for GPL96/HG-U133A, we could use an existing annotation package from the BioConductor annotation sets, hgu133a. These libraries exist for most of the popular microarray gene chips.

First of all, we need to install the package:

> source("http://www.bioconductor.org/biocLite.R")
> biocLite("hgu133a")
Running bioCLite version 0.1 with R version 2.1.1
...

Then we can load the newly installed library:

> library(hgu133a)

There is an easy way to check when this was last updated, and what it can translate the Affy probe names into:

> hgu133a()
Quality control information for hgu133a
Date built: Created: Tue May 17 13:02:12 2005
Number of probes: 22277
Probe number missmatch: None
Probe missmatch: None
Mappings found for probe based rda files:
    hgu133aACCNUM found 22277 of 22277
    hgu133aCHRLOC found 20195 of 22277
    hgu133aCHR found 21283 of 22277
    hgu133aENZYME found 2507 of 22277
    hgu133aGENENAME found 18726 of 22277
    hgu133aGO found 18647 of 22277
    hgu133aLOCUSID found 21747 of 22277
    hgu133aMAP found 21183 of 22277
    hgu133aOMIM found 15109 of 22277
    hgu133aPATH found 5067 of 22277
    hgu133aPMID found 21004 of 22277
    hgu133aREFSEQ found 21002 of 22277
    hgu133aSUMFUNC found 0 of 22277
    hgu133aSYMBOL found 21303 of 22277
    hgu133aUNIGENE found 21128 of 22277
Mappings found for non-probe based rda files:
    hgu133aCHRLENGTHS found 25
    hgu133aENZYME2PROBE found 663
    hgu133aGO2ALLPROBES found 5912
    hgu133aGO2PROBE found 4326
    hgu133aORGANISM found 1
    hgu133aPATH2PROBE found 142
    hgu133aPMID2PROBE found 96291

And now let’s test some of those mappings on the fourth gene, 121_at, in the GPL file:

> Table(gpl96)[4,c("ID","GB_LIST","Gene.Title","Gene.Symbol","Entrez.Gene")]
  ID     GB_LIST Gene.Title        Gene.Symbol Entrez.Gene
4 121_at X69699  paired box gene 8 PAX8        7849

Now, what does the annotation file have to say?

> mget("121_at",hgu133aACCNUM)
$"121_at"
[1] "X69699"
> mget("121_at",hgu133aGENENAME)
$"121_at"
[1] "paired box gene 8"
> mget("121_at",hgu133aSYMBOL)
$"121_at"
[1] "PAX8"
> mget("121_at",hgu133aUNIGENE)
$"121_at"
[1] "Hs.469728"

You will notice that there is some overlap between the information in the GEO annotation table and the hgu133a package (which compiles its data from a range of sources). See help(hgu133a).
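The mget calls above look up keys in a mapping that behaves like an R environment keyed by probe id. The sketch below imitates that with a hand-built environment, so it runs without the hgu133a package; the environment symbol_map and its two entries are stand-ins, and the assumption is that the package mappings respond to mget in the same way.

```r
# Hand-built stand-in for a probe-to-symbol mapping such as hgu133aSYMBOL.
symbol_map <- new.env()
assign("121_at",  "PAX8",   envir = symbol_map)
assign("1320_at", "PTPN21", envir = symbol_map)

# Look up several probes at once; ifnotfound = NA avoids an error
# when a probe id has no entry in the mapping.
hits <- mget(c("121_at", "1320_at", "not_a_probe"),
             envir = symbol_map, ifnotfound = NA)

# Flatten the list into a named character vector.
symbols <- unlist(hits)
print(symbols)
```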

You should also read this introduction, Bioconductor: Annotation Package Overview

RSeQC: quality control of RNA-seq experiments

szypanther — Wed, 08 Aug 2012 02:16:29 +0000

Abstract

Motivation: RNA-seq has been extensively used for transcriptome study. Quality control (QC) is critical to ensure that RNA-seq data are of high quality and suitable for subsequent analyses. However, QC is a time-consuming and complex task, due to the massive size and versatile nature of RNA-seq data. Therefore, a convenient and comprehensive QC tool to assess RNA-seq quality is sorely needed.

Results: We developed the RSeQC package to comprehensively evaluate different aspects of RNA-seq experiments, such as sequence quality, GC bias, polymerase chain reaction bias, nucleotide composition bias, sequencing depth, strand specificity, coverage uniformity and read distribution over the genome structure. RSeQC takes both SAM and BAM files as input, which can be produced by most RNA-seq mapping tools as well as BED files, which are widely used for gene models. Most modules in RSeQC take advantage of R scripts for visualization, and they are notably efficient in dealing with large BAM/SAM files containing hundreds of millions of alignments.

Availability and implementation: RSeQC is written in Python and C. Source code and a comprehensive user’s manual are freely available at: http://code.google.com/p/rseqc/.

Contact: WL1@bcm.edu

Introduction

The RSeQC package provides a number of useful modules that can comprehensively evaluate high-throughput sequence data, especially RNA-seq data. “Basic modules” quickly inspect sequence quality, nucleotide composition bias, PCR bias and GC bias, while “RNA-seq specific modules” investigate sequencing saturation status of both splicing junction detection and expression estimation, mapped reads clipping profile, mapped reads distribution, coverage uniformity over gene body, reproducibility, strand specificity and splice junction annotation.

RSeQC Manual

Download RSeQC from BCM or go to Downloads page

Release history

RSeQC v2.3.1:

Add a normalization option to bam2wig.py. With this option, the user can normalize samples of different sequencing depth to the same scale when converting BAM into wiggle format.
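As an illustration of what putting different sequencing depths on the same scale can mean, the base R sketch below rescales each sample’s coverage so that all samples sum to a common target. This is a generic total-count scaling with made-up numbers and a hypothetical target_total; it is not necessarily the calculation bam2wig.py performs.

```r
# Made-up per-position coverage for two samples sequenced to different depths.
coverage <- list(
  sampleA = c(10, 20, 30, 40),
  sampleB = c(1, 2, 3, 4)
)

# Hypothetical common total that every sample is rescaled to.
target_total <- 1000

# Multiply each sample by target_total / (its own total).
normalized <- lapply(coverage, function(x) x * target_total / sum(x))

print(normalized)
print(sapply(normalized, sum))
```

After scaling, the two samples have the same total and, in this toy case, identical profiles, which is the property that makes their wiggle tracks directly comparable.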

Add another script, geneBody_coverage2.py. This script uses BigWig instead of BAM as input and requires much less memory (~200 MB).