小生这厢有礼了(BioFaceBook Personal Blog) » bioinformatics

BEReX : Biomedical Entity-Relation eXplorer

szypanther — Fri, 06 Dec 2013 02:14:20 +0000

BEReX is a new biomedical knowledge integration, search, and exploration tool. BEReX integrates eight popular databases (STRING, DrugBank, KEGG, PharmGKB, BioGRID, GO, HPRD, and MSigDB) and delineates an integrated network by combining the information available from these databases. Users search the integrated network by entering keywords and BEReX returns a sub-network matching the keywords. The resulting graph can be explored interactively. BEReX allows users to find the shortest paths between two remote nodes; find the most relevant drugs, diseases, pathways and so on, related to the current network; expand the network by particular types of entities and relations; and modify the network by removing or adding selected nodes. BEReX is implemented as a stand-alone Java application.

1. Program availability and requirements

Operating systems : Windows, Mac, Linux
Java runtime : JRE 6 or higher is needed to run the application

2. Installation

Install JRE (skip this step if you have JRE 6 or later, or JDK 1.6 or later)
Download BEReX (v.1.0) and unzip the package file:
1. Windows : berex-v1-windows.zip (184MB)
2. Mac : berex-v1-mac.zip (184MB)
3. Linux : berex-v1-linux.zip (184MB)
Run BEReX.bat (for Windows), BEReX.sh.command (for Mac), or BEReX.sh (for Linux)
Please cite the following article when using BEReX.
- Jeon, M., Lee, S., Lee, K., Tan, A., Kang, J. BEReX: Biomedical Entity-Relationship eXplorer. Bioinformatics (2013). doi: 10.1093/bioinformatics/btt598

3. Documentation

User Guide : BEReX v1.0 User Guide

4. Source Code

Source Code : berex_sourcecode.zip

5. License

BEReX is licensed under the GNU General Public License and is 100% freely available to both commercial and academic users. See the file LICENSE.txt in the BEReX distribution package or this URL for the full text of the license: http://www.gnu.org/licenses/gpl.html

6. Contact – for bugs, comments and questions

Minji Jeon: mjjeon@korea.ac.kr
Jaewoo Kang: kangj@korea.ac.kr

Last updated on Sep 25, 2013

multiple sequence alignment software

szypanther — Fri, 09 Aug 2013 03:48:54 +0000

PAGAN is a general-purpose method for the alignment of sequence graphs. Its features include, for example:

phylogenetic multiple sequence alignment
alignment extension by phylogenetic sequence placement
modelling of Roche 454 sequencing error
alignment and placement of NGS sequences
pileup alignment of similar/noisy NGS reads or sequences
inference of ancestral sequences

PAGAN documentation is available at the Wiki page. PAGAN source code is available with git.

PAGAN is under development. If you have questions, comments or suggestions on how to improve the method, please post them through the PAGAN discussion group. Bug reports can be entered through the Issues page.

Some simple awk usage

szypanther — Thu, 25 Jul 2013 02:53:01 +0000

# Print every line with the second column blanked out

awk '{ $2 = ""; print }' file1
awk '{ $2 = ""; $1 = ""; print }' test1

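# Print selected parts of the text

# Print the first ten lines of a file (emulates "head")

awk 'NR < 11' test1

# Print the last two lines of a file (emulates "tail -2")

awk '{y=x "\n" $0; x=$0};END{print y}'

# Print the last line of a file (emulates "tail -1")

awk 'END{print}'

# Print lines whose fifth column equals "abc123"

awk '$5 == "abc123"' file1

# Print a range of lines (lines 8 to 12, inclusive)

awk 'NR==8,NR==12'

# Print columns 2 and 3

awk '{print $2, $3}' file1 > file2

# Print the last column of each line

awk '{ print $NF }'

# Print the last column of the last line

awk '{ field = $NF }; END{ print field }'

# Print lines with more than 4 columns

awk 'NF > 4'

# Print lines whose last column is greater than 4

awk '$NF > 4'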

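Selectively deleting lines:

# Delete all blank lines (similar to "grep '.'")

awk NF

awk '/./'

# Delete consecutive duplicate lines (emulates "uniq")
# The often-quoted form 'a !~ $0; {a=$0}' treats each line as a regular expression; this version compares lines as strings.

awk 'NR == 1 || $0 != prev ""; { prev = $0 }'

# Delete duplicate, non-consecutive lines

awk '!a[$0]++' # most concise

awk '!($0 in a) {a[$0]; print}' # most efficient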
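
For readers less used to awk's associative arrays, the idea behind the two de-duplication one-liners can be sketched in Python: remember every line already seen and print a line only on its first appearance. The function name unique_lines is made up for this illustration.

```python
def unique_lines(lines):
    """Yield each distinct line once, in order of first appearance."""
    seen = set()                 # plays the role of the awk array "a"
    for line in lines:
        if line not in seen:     # same test as awk's !($0 in a)
            seen.add(line)
            yield line


sample = ["b", "a", "b", "c", "a"]
print(list(unique_lines(sample)))
```

Like the awk version, this keeps every distinct line in memory, so it suits files whose set of unique lines fits in RAM.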

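Line spacing:

# Add a blank line after every line

awk '1;{print ""}'

awk 'BEGIN{ORS="\n\n"};1'

# Add two blank lines after every line

awk '1;{print "\n"}'

# Number each line, restarting the count for every file (left-aligned)

# Using a tab (\t) instead of spaces preserves the margins.

awk '{print FNR "\t" $0}' files*

Here $0 prints all columns; use $1,$2 instead to print only columns 1 and 2:

awk '{print FNR "\t" $1,$2}' test1

# Number the lines of all files consecutively, using a tab (\t)

awk '{print NR "\t" $0}' files*

# Count lines (emulates "wc -l")

awk 'END{print NR}'

# Print the sum of the fields of each line

awk '{s=0; for (i=1; i<=NF; i++) s=s+$i; print s}' file1

## The statements inside { } follow C syntax.

# Print each line with every field replaced by its absolute value

awk '{for (i=1; i<=NF; i++) if ($i < 0) $i = -$i; print }'

awk '{for (i=1; i<=NF; i++) $i = ($i < 0) ? -$i : $i; print }'

# Print the total number of fields in all lines

awk '{ total = total + NF }; END {print total}' file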

# Substitution

# Replace "foo" with "bar" on each line

awk '{sub(/foo/,"bar")}; 1' # replaces only the first "foo" found

gawk '{$0=gensub(/foo/,"bar",4)}; 1' # replaces only the fourth "foo" found

awk '{gsub(/foo/,"bar")}; 1' # replaces all occurrences

# Replace "foo" with "bar" only on lines that contain "baz"

awk '/baz/{gsub(/foo/, "bar")}; 1'

# Replace "foo" with "bar" only on lines that do not contain "baz"

awk '!/baz/{gsub(/foo/, "bar")}; 1'

# Replace "scarlet", "ruby" or "puce" with "red"

awk '{gsub(/scarlet|ruby|puce/, "red")}; 1'

Cake: a bioinformatics pipeline for the integrated analysis of somatic variants in cancer genomes.

szypanther — Thu, 25 Jul 2013 01:42:57 +0000

Description

Cake is a bioinformatics tool to identify putative somatic mutations from cancer genome/exome data. Cake combines somatic calls from a number of publicly available SNP/somatic variant calling tools with an array of variant filtering modules to discard unwanted

http://sourceforge.net/projects/cakesomatic/

Reading the NCBI’s GEO microarray SOFT files in R/BioConductor

szypanther — Thu, 23 May 2013 02:47:38 +0000

http://www2.warwick.ac.uk/fac/sci/moac/people/students/peter_cock/r/geo/

This page discusses how to load GEO SOFT format microarray data from the Gene Expression Omnibus database (GEO) (hosted by the NCBI) into R/BioConductor. SOFT stands for Simple Omnibus Format in Text. There are actually four types of GEO SOFT file available:

GEO Platform (GPL)
These files describe a particular type of microarray. They are annotation files.

GEO Sample (GSM)
Files that contain all the data from the use of a single chip. For each gene there will be multiple scores including the main one, held in the VALUE column.

GEO Series (GSE)
Lists of GSM files that together form a single experiment.

GEO Dataset (GDS)
These are curated files that hold a summarised combination of a GSE file and its GSM files. They contain normalised expression levels for each gene from each sample (i.e. just the VALUE field from the GSM file).

As long as you just need the expression level then a GDS file will suffice. If you need to dig deeper into how the expression levels were calculated, you’ll need to get all the GSM files instead (which are listed in the GDS or GSE file).
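
To get a feel for what a GDS SOFT file looks like before handing it to a parser, here is a minimal Python sketch that splits one into header attributes and the expression table. It rests on assumptions not stated in this post: that attribute lines have the form "!key = value", and that the table is enclosed by "!dataset_table_begin" and "!dataset_table_end" lines with tab-separated columns. The attribute name dataset_platform and the function name parse_gds_soft are likewise illustrative, and the two data rows are a small excerpt of GDS858 values.

```python
import io


def parse_gds_soft(handle):
    """Split GDS SOFT text into (attributes, column names, data rows)."""
    meta, header, rows = {}, None, []
    in_table = False
    for raw in handle:
        line = raw.rstrip("\n")
        if line.startswith("!dataset_table_begin"):   # assumed table marker
            in_table = True
        elif line.startswith("!dataset_table_end"):   # assumed table marker
            in_table = False
        elif in_table:
            fields = line.split("\t")
            if header is None:
                header = fields          # first table line holds column names
            else:
                rows.append(fields)
        elif line.startswith("!") and " = " in line:
            key, value = line[1:].split(" = ", 1)
            meta[key] = value            # repeated keys overwrite earlier ones
    return meta, header, rows


# Tiny hand-made example in the assumed layout.
SAMPLE = (
    "^DATASET = GDS858\n"
    "!dataset_platform = GPL96\n"
    "!dataset_table_begin\n"
    "ID_REF\tIDENTIFIER\tGSM14498\tGSM14499\n"
    "1007_s_at\tU48705\t3736.9\t3811.0\n"
    "1053_at\tM87338\t343.0\t500.3\n"
    "!dataset_table_end\n"
)
meta, header, rows = parse_gds_soft(io.StringIO(SAMPLE))
print(meta["dataset_platform"], header, len(rows))
```

This is only meant for peeking at a file; for real work a dedicated parser handles compression, subsets and phenotype information as well.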

To me, it was natural to ask: how can I turn a GEO DataSet (GDS file) into an R/BioConductor expression set object (exprSet)? And while we’re at it, how can I load the GEO Platform annotation (GPL file) too? Both questions are answered in the sections below.

In the MOAC Module 5 assignment, the approach taken was to sanitize the data by hand, allowing it to be loaded into R with a simple call to the read.table command. It’s a good idea to look at the raw files to understand what you are dealing with, but surely there is a more elegant way…

It turns out there are several existing GEO parsers, but one stands out above all others: Sean Davis’ GEOquery (released roughly December 2005).

Installing GEOquery

Assuming you are running a recent version of BioConductor (1.8 or later) you should be able to install it from within R as follows:

> source("http://www.bioconductor.org/biocLite.R")
> biocLite("GEOquery")
Running bioCLite version 0.1.6 with R version 2.3.1
...

For those of you on an older version of BioConductor, you will have to download and install it by hand from here.

If you are using Windows, download GEOquery_1.6.0.zip (or similar) and save it. Then from within the R program, use the menu option “Packages”, “Install package(s) from local zip files…” and select the ZIP file.

On Linux, download GEOquery_1.6.0.tar.gz (or similar) and use sudo R CMD INSTALL GEOquery_1.6.0.tar.gz at the command prompt.

Loading a GDS file with GEOquery

Here is a quick introduction to how to load a GDS file, and turn it into an expression set object:

library(Biobase)
library(GEOquery)
#Download GDS file, put it in the current directory, and load it:
gds858 <- getGEO('GDS858', destdir=".")
#Or, open an existing GDS file (even if it's compressed):
gds858 <- getGEO(filename='GDS858.soft.gz')

I’m using GDS858 as input. The SOFT file is available in compressed form here GDS858.soft.gz, but GEOquery takes care of finding this file for you and unzipping it automatically.

Loading this file from the hard disk takes about two minutes on my laptop.

There are two main things the GDS object gives us: metadata (from the file header) and a table of expression data. These are extracted using the Meta and Table functions. First let’s have a look at the metadata:

> Meta(gds858)$channel_count
[1] "1"
> Meta(gds858)$description
[1] "Comparison of lung epithelial Calu-3 cells infected ..."
> Meta(gds858)$feature_count
[1] "22283"
> Meta(gds858)$platform
[1] "GPL96"
> Meta(gds858)$sample_count
[1] "19"
> Meta(gds858)$sample_organism
[1] "Homo sapiens"
> Meta(gds858)$sample_type
[1] "cDNA"
> Meta(gds858)$title
[1] "Mucoid and motile Pseudomonas aeruginosa infected lung epithelial cell comparison"
> Meta(gds858)$type
[1] "gene expression array-based"

Useful stuff, and now the expression data table:

> colnames(Table(gds858))
 [1] "ID_REF"     "IDENTIFIER" "GSM14498"   "GSM14499"   "GSM14500"
 [6] "GSM14501"   "GSM14513"   "GSM14514"   "GSM14515"   "GSM14516"
[11] "GSM14506"   "GSM14507"   "GSM14508"   "GSM14502"   "GSM14503"
[16] "GSM14504"   "GSM14505"   "GSM14509"   "GSM14510"   "GSM14511"
[21] "GSM14512"
> Table(gds858)[1:10,1:6]
      ID_REF IDENTIFIER GSM14498 GSM14499 GSM14500 GSM14501
1  1007_s_at     U48705   3736.9   3811.0   3699.6   3897.6
2    1053_at     M87338    343.0    500.3    288.3    341.3
3     117_at     X51757    120.9     34.3    145.8    110.5
4     121_at     X69699   1523.8   1281.1   1281.9   1493.4
5  1255_g_at     L36861     51.6     15.9     45.9      8.1
6    1294_at     L13852    253.2    164.8    200.0    205.2
7    1316_at     X55005    199.6    250.7    290.3    218.6
8    1320_at     X79510     81.7     13.4     13.9     88.7
9  1405_i_at     M21121     18.9      5.6     11.0      9.5
10   1431_at     J02843     99.7     74.5     72.6    114.8

Now, let’s turn this GDS object into an expression set object (using base 2 logarithms) and have a look at it:

> eset <- GDS2eSet(gds858, do.log2=TRUE)
> eset
Expression Set (exprSet) with
        22283 genes
        19 samples
phenoData object with 4 variables and 19 cases
        varLabels
                : sample
                : infection
                : genotype/variation
                : description
> geneNames(eset)[1:10]
 [1] "1007_s_at" "1053_at"   "117_at"    "121_at"    "1255_g_at"
 [6] "1294_at"   "1316_at"   "1320_at"   "1405_i_at" "1431_at"
> sampleNames(eset)
 [1] "GSM14498" "GSM14499" "GSM14500" "GSM14501" "GSM14513"
 [6] "GSM14514" "GSM14515" "GSM14516" "GSM14506" "GSM14507"
[11] "GSM14508" "GSM14502" "GSM14503" "GSM14504" "GSM14505"
[16] "GSM14509" "GSM14510" "GSM14511" "GSM14512"

GEOquery does an excellent job of extracting the phenotype data, as you can see:

> pData(eset)$infection
 [1] FRD1       FRD1       FRD1       FRD1       FRD440
 [6] FRD440     FRD440     FRD440     FRD875     FRD875
[11] FRD875     FRD875     FRD1234    FRD1234    FRD1234
[16] uninfected uninfected uninfected uninfected
Levels: FRD1 FRD1234 FRD440 FRD875 uninfected
> pData(eset)$"genotype/variation"
 [1] control                control
 [3] control                control
 [5] mucoid                 mucoid
 [7] mucoid                 mucoid
 [9] motile                 motile
[11] motile                 motile
[13] non-mucoid, non-motile non-mucoid, non-motile
[15] non-mucoid, non-motile non-mucoid, non-motile
[17] non-mucoid, non-motile non-mucoid, non-motile
[19] non-mucoid, non-motile
Levels: control motile mucoid non-mucoid, non-motile

As with any expression set object, it’s easy to pull out a subset of the data:

> eset["1320_at","GSM14504"]
Expression Set (exprSet) with
        1 genes
        1 samples
phenoData object with 4 variables and 1 cases
        varLabels
                : sample
                : infection
                : genotype/variation
                : description
> exprs(eset["1320_at","GSM14504"])
        GSM14504
1320_at  6.70044

You should be able to produce a heatmap of differentially expressed genes easily enough using this page, especially as the phenotype/sub-sample information has been sorted out for you.

Loading a GPL (Annotation) file with GEOquery

In addition to loading a GDS file to get the expression levels, you can also load the associated platform annotation file. You can find out which platform a dataset uses from its meta information, here for GDS858:

> Meta(gds858)$platform
[1] "GPL96"

So, for GDS858, the platform is GPL96, Affymetrix GeneChip Human Genome U133 Array Set HG-U133A.

Now let’s load up the GPL file and have a look at it (it’s a big file, about 12 MB, so this takes a while!):

library(Biobase)
library(GEOquery)
#Download GPL file, put it in the current directory, and load it:
gpl96 <- getGEO('GPL96', destdir=".")
#Or, open an existing GPL file:
gpl96 <- getGEO(filename='GPL96.soft')

As with the GDS object, we can use the Meta and Table functions to extract information:

> Meta(gpl96)$title
[1] "Affymetrix GeneChip Human Genome U133 Array Set HG-U133A"
> colnames(Table(gpl96))
 [1] "ID"                               "Species.Scientific.Name"
 [3] "Annotation.Date"                  "GB_LIST"
 [5] "SPOT_ID"                          "Sequence.Source"
 [7] "Representative.Public.ID"         "Gene.Title"
 [9] "Gene.Symbol"                      "Entrez.Gene"
[11] "RefSeq.Transcript.ID"             "Gene.Ontology.Biological.Process"
[13] "Gene.Ontology.Cellular.Component" "Gene.Ontology.Molecular.Function"

Let’s look at the first four columns, for the first ten genes:

> Table(gpl96)[1:10,1:4]
          ID Species.Scientific.Name Annotation.Date GB_LIST
1  1007_s_at            Homo sapiens       16-Sep-05  U48705
2    1053_at            Homo sapiens       16-Sep-05  M87338
3     117_at            Homo sapiens       16-Sep-05  X51757
4     121_at            Homo sapiens       16-Sep-05  X69699
5  1255_g_at            Homo sapiens       16-Sep-05  L36861
6    1294_at            Homo sapiens       16-Sep-05  L13852
7    1316_at            Homo sapiens       16-Sep-05  X55005
8    1320_at            Homo sapiens       16-Sep-05  X79510
9  1405_i_at            Homo sapiens       16-Sep-05  M21121
10   1431_at            Homo sapiens       16-Sep-05  J02843

This shows a hand-picked selection of the columns, again for the first ten genes:

> Table(gpl96)[1:10,c("ID","GB_LIST","Gene.Title","Gene.Symbol","Entrez.Gene")]
   ID         GB_LIST  Gene.Title                                             Gene.Symbol  Entrez.Gene
1  1007_s_at  U48705   discoidin domain receptor family, member 1             DDR1         780
2  1053_at    M87338   replication factor C (activator 1) 2, 40kDa            RFC2         5982
3  117_at     X51757   heat shock 70kDa protein 6 (HSP70B')                   HSPA6        3310
4  121_at     X69699   paired box gene 8                                      PAX8         7849
5  1255_g_at  L36861   guanylate cyclase activator 1A (retina)                GUCA1A       2978
6  1294_at    L13852   ubiquitin-activating enzyme E1-like                    UBE1L        7318
7  1316_at    X55005   thyroid hormone receptor, alpha (erythroblastic...)    THRA         7067
8  1320_at    X79510   protein tyrosine phosphatase, non-receptor type 21     PTPN21       11099
9  1405_i_at  M21121   chemokine (C-C motif) ligand 5                         CCL5         6352
10 1431_at    J02843   cytochrome P450, family 2, subfamily E, polypeptide 1  CYP2E1       1571

The above all used the 12MB file GPL96.soft, but you can also get a much smaller 3MB file, GPL96.annot (compressed as GPL96.annot.gz), which has slightly different information in it… see here.

Using the BioConductor hgu133a package

Instead of loading the GEO annotation file for GPL96/HG-U133A, we could use an existing annotation package from the BioConductor annotation sets, hgu133a. These libraries exist for most of the popular microarray gene chips.

First of all, we need to install the package:

> source("http://www.bioconductor.org/biocLite.R")
> biocLite("hgu133a")
Running bioCLite version 0.1 with R version 2.1.1
...

Then we can load the newly installed library:

> library(hgu133a)

There is an easy way to check when this was last updated, and what it can translate the Affy probe names into:

> hgu133a()
Quality control information for hgu133a
Date built: Created: Tue May 17 13:02:12 2005
Number of probes: 22277
Probe number missmatch: None
Probe missmatch: None
Mappings found for probe based rda files:
        hgu133aACCNUM found 22277 of 22277
        hgu133aCHRLOC found 20195 of 22277
        hgu133aCHR found 21283 of 22277
        hgu133aENZYME found 2507 of 22277
        hgu133aGENENAME found 18726 of 22277
        hgu133aGO found 18647 of 22277
        hgu133aLOCUSID found 21747 of 22277
        hgu133aMAP found 21183 of 22277
        hgu133aOMIM found 15109 of 22277
        hgu133aPATH found 5067 of 22277
        hgu133aPMID found 21004 of 22277
        hgu133aREFSEQ found 21002 of 22277
        hgu133aSUMFUNC found 0 of 22277
        hgu133aSYMBOL found 21303 of 22277
        hgu133aUNIGENE found 21128 of 22277
Mappings found for non-probe based rda files:
        hgu133aCHRLENGTHS found 25
        hgu133aENZYME2PROBE found 663
        hgu133aGO2ALLPROBES found 5912
        hgu133aGO2PROBE found 4326
        hgu133aORGANISM found 1
        hgu133aPATH2PROBE found 142
        hgu133aPMID2PROBE found 96291

And now let’s test some of those mappings on the fourth gene, 121_at, in the GPL file:

> Table(gpl96)[4,c("ID","GB_LIST","Gene.Title","Gene.Symbol","Entrez.Gene")]
  ID      GB_LIST  Gene.Title         Gene.Symbol  Entrez.Gene
4 121_at  X69699   paired box gene 8  PAX8         7849

Now, what does the annotation file have to say?

> mget("121_at",hgu133aACCNUM)
$"121_at"
[1] "X69699"
> mget("121_at",hgu133aGENENAME)
$"121_at"
[1] "paired box gene 8"
> mget("121_at",hgu133aSYMBOL)
$"121_at"
[1] "PAX8"
> mget("121_at",hgu133aUNIGENE)
$"121_at"
[1] "Hs.469728"

You will notice that there is some overlap between the information in the GEO annotation table and the hgu133a package (which compiles its data from a range of sources). See help(hgu133a).

You should also read this introduction, Bioconductor: Annotation Package Overview

MrBayes Tree

szypanther — Thu, 02 May 2013 08:25:11 +0000

Use ClustalW to generate a NEXUS-format file. Its header looks like this:

#NEXUS
BEGIN DATA;
dimensions ntax=55 nchar=534;
format missing=?
symbols="ABCDEFGHIKLMNPQRSTUVWXYZ"
interleave datatype=DNA gap= -;

Change the header as follows:
#NEXUS
BEGIN DATA;
dimensions ntax=55 nchar=534;
format datatype=dna interleave=yes gap=- missing=?;
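
If there are many alignments to convert, the header edit shown above can be scripted. The following Python sketch replaces the first multi-line "format ...;" statement with the single MrBayes-friendly line; the function name fix_clustalw_nexus is made up here, and the sketch assumes the ClustalW header has the shape shown in this post.

```python
import re

# Replacement line, copied from the corrected header shown in this post.
MRBAYES_FORMAT = "format datatype=dna interleave=yes gap=- missing=?;"


def fix_clustalw_nexus(text):
    """Replace the first 'format ...;' statement (possibly spanning lines)."""
    # [^;]* also matches newlines, so the whole statement up to ';' is consumed.
    pattern = re.compile(r"^[ \t]*format\b[^;]*;", re.IGNORECASE | re.MULTILINE)
    return pattern.sub(lambda m: MRBAYES_FORMAT, text, count=1)


SAMPLE = (
    "#NEXUS\n"
    "BEGIN DATA;\n"
    "dimensions ntax=55 nchar=534;\n"
    "format missing=?\n"
    'symbols="ABCDEFGHIKLMNPQRSTUVWXYZ"\n'
    "interleave datatype=DNA gap= -;\n"
)
fixed = fix_clustalw_nexus(SAMPLE)
print(fixed)
```

Only the format statement is touched; the dimensions line and the alignment matrix are passed through unchanged.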

Then type mb -i *.nex

MrBayes > lset nst=6 rates=invgamma

Setting Nst to 6
Setting Rates to Invgamma
Successfully set likelihood model parameters

MrBayes > mcmc ngen=20000 samplefreq=100 printfreq=100 diagnfreq=1000

Setting number of generations to 20000
Setting sample frequency to 100
Setting print frequency to 100
Setting diagnosing frequency to 1000
Running Markov chain
MCMC stamp = 4956565474
Seed = 1367482907
Swapseed = 1367482907
Model settings:

………
MrBayes > sump
Type sump to summarize the parameter values using the same burn-in as
the diagnostics in the mcmc command. The program will output a table with
summaries of the samples of the substitution model parameters, including the
mean, mode, and 95 % credibility interval (region of Highest Posterior Density,
HPD) of each parameter.

MrBayes > sumt
The program will output a cladogram with the posterior
probabilities for each split and a phylogram with mean branch lengths. Both
trees will also be printed to a file that can be read by FigTree and other
tree-drawing programs, such as TreeView and Mesquite.

Note:
If the standard deviation of split frequencies is below 0.01 after 20,000
generations, stop the run by answering no when the program asks Continue the
analysis? (yes/no). Otherwise, keep adding generations until the value falls
below 0.01. If you are interested mainly in the well-supported parts of the tree, a
standard deviation below 0.05 may be adequate.

Converting file formats
Having the proper data file format is essential as many programs can only input certain file types. The following are some of the input and output file formats for specific programs.

CLUSTAL file format:
Programs that input this file type: Clustal W/X
Programs that output this file type: Clustal W/X

FASTA file format:
Programs that input this file type: Clustal W/X, MAFFT
Programs that output this file type: Clustal W/X, MAFFT

NEXUS file format:
Programs that input this file type: BEAUti, GARLI, Modeltest, MrBayes, PAUP*
Programs that output this file type: Clustal W/X, PAUP*

PHYLIP file format:
Programs that input this file type: GARLI, LAMARC, Migrate-n, PAML, PHYLIP
Programs that output this file type: Clustal W/X, PAUP*, PHYLIP

PIR file format:
Programs that input this file type: Clustal W/X
Programs that output this file type: Clustal W/X

GC skew in microbial genomes (repost)

szypanther — Mon, 29 Apr 2013 03:11:09 +0000

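Given the two keywords "bioinformatics" and "GC", many people would probably first think of GC content or CpG islands. Over the past two weeks I have been working on non-coding RNA prediction (in Sinorhizobium meliloti) and came across a "GC theory" I had not heard of before: GC skew. I could find almost no detailed description of it in the Chinese-language literature, and there is no established Chinese translation of the term ("skew" means slanted or biased, so based on my understanding of the theory I render it as "GC偏移"). Here is my translation of a review box from Nature, shared for everyone's benefit.

GC skew in microbial genomes
In most bacterial genomes there is a clear difference in base composition between the leading strand and the lagging strand: the leading strand is richer in G and T, whereas the lagging strand contains more A and C. The departure of base frequencies from A=T and C=G is called AT skew and GC skew. Because GC skew is usually more pronounced than AT skew, it is mostly GC skew that is considered. One way to measure GC skew is to slide a window along the genome sequence, calculate (G-C)/(G+C) for each window, and plot the values. The formula gives the percentage excess of G over C: a positive value indicates the leading strand, a negative value the lagging strand.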

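(Image source: Nature.com)
What causes GC skew? This is still poorly understood. One possibility is that the leading and lagging strands spend different amounts of time in the single-stranded state during replication, so they are exposed to DNA damage to different degrees and are subject to different mutational pressures. Because T-G and G-T mispairs are more frequent than C-A and A-C mispairs, the more error-prone strand may become relatively enriched in G and T. Another theory rests on the hydrolytic deamination of cytosine, which occurs at a markedly higher rate in single-stranded DNA. The asymmetric structure of the replication fork leaves the lagging-strand template transiently single-stranded, making it more susceptible to cytosine deamination. Deamination of cytosine produces uracil, which pairs with adenine during replication, in effect causing a C-to-T mutation. C-to-T deamination therefore increases the proportion of G and T in that strand and the proportion of C and A in the complementary strand.
Why is GC-skew analysis important? Because GC skew is positive in the leading strand and negative in the lagging strand, it signals where the leading strand starts and ends and where it switches to the lagging strand, and vice versa. This makes GC skew a useful tool for marking the origin and terminus of replication in circular chromosomes. Pronounced local changes in the plot can mark, for example, recent rearrangements such as sequence inversions, or the integration of foreign DNA. Loss of DNA does not change the basic shape of the GC-skew curve, although recently acquired foreign DNA may affect the local variation.
In practice, plots of GC skew suffer from local fluctuations. It is therefore often better to use the cumulative GC skew, which is the sum of the GC-skew values of adjacent sliding windows from an arbitrary starting point to a given point in the sequence. The figure shows the GC skew and the cumulative GC skew for the Wolinella succinogenes DSM1740 genome, and illustrates how the GC skew changes sign at the origin and terminus of replication. The cumulative GC skew marks these positions with its maximum and minimum values.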
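
The windowed (G-C)/(G+C) calculation and its running sum can be sketched in a few lines of Python. This is a minimal illustration on a made-up toy sequence, not the method used to draw the figure; the function names, the window size and the step are all assumptions chosen for the example.

```python
def gc_skew(seq, window=1000, step=1000):
    """Return (window_start, skew) pairs, with skew = (G - C) / (G + C)."""
    seq = seq.upper()
    result = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        g, c = chunk.count("G"), chunk.count("C")
        # A window without any G or C has no defined skew; report 0.0 for it.
        skew = (g - c) / (g + c) if (g + c) else 0.0
        result.append((start, skew))
    return result


def cumulative_skew(skews):
    """Running sum of per-window skew values, as (window_start, total) pairs."""
    total, result = 0.0, []
    for start, skew in skews:
        total += skew
        result.append((start, total))
    return result


# Toy sequence: a G-rich half followed by a C-rich half.
toy = "GGGCAT" * 10 + "CCCGAT" * 10
skews = gc_skew(toy, window=12, step=12)
cumulative = cumulative_skew(skews)
turning_point = max(cumulative, key=lambda pair: pair[1])
print(skews)
print(turning_point)
```

On the toy sequence the per-window skew flips sign halfway through, and the cumulative curve peaks at exactly that switch, which is the behaviour the text describes for the origin and terminus. On a real genome a window of a few kilobases is a more sensible choice.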

Source: http://www.nature.com/nrmicro/journal/v2/n11/box/nrmicro1024_BX1.html

Install genometools

szypanther — Fri, 25 Jan 2013 04:27:52 +0000

The ‘new’ error message refers to a nonexistent Cairo library on your system, which is needed for the AnnotationSketch component of GenomeTools. If you do not need this, do a ‘make cleanup’ and recompile with the additional make option ‘cairo=no’, e.g. ‘make errorcheck=no cairo=no’. This will disable support for AnnotationSketch and remove the cairo dependency.

As for your other question, you can use the ‘gt suffixerator’ tool as described. However, the ‘gt’ binary is placed in the ‘bin/’ subdirectory of your GenomeTools source directory after compiling. Please keep that in mind and call ‘bin/gt’ if necessary.
You should be able to run the command line exactly as described if you install the ‘gt’ binary system-wide (‘make install’) or add its location to your PATH environment variable.

Note: Ruby and Cairo also need to be installed separately first!

DSK: k-mer counting with very low memory usage

szypanther — Wed, 23 Jan 2013 01:35:07 +0000

Summary: Counting all the k-mers (substrings of length k) in DNA/RNA sequencing reads is the preliminary step of many bioinformatics applications. However, state of the art k-mer counting methods require that a large data structure resides in memory. Such structure typically grows with the number of distinct k-mers to count.

We present a new streaming algorithm for k-mer counting, called DSK (disk streaming of k-mers), which only requires a fixed, user-defined amount of memory and disk space. This approach realizes a memory, time and disk trade-off. The multi-set of all k-mers present in the reads is partitioned and partitions are saved to disk. Then, each partition is separately loaded in memory in a temporary hash table. The k-mer counts are returned by traversing each hash table. Low-abundance k-mers are optionally filtered.
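
The partition-then-count idea in the abstract can be illustrated with a small Python sketch. It is a simplification under stated assumptions, not DSK's actual implementation: k-mers are written as text lines, the partition is chosen with a CRC32 checksum, and the names iter_kmer_counts, n_partitions and min_count are invented for this example.

```python
import os
import tempfile
import zlib
from collections import Counter


def iter_kmer_counts(reads, k, n_partitions=4, min_count=1):
    """Yield (kmer, count) pairs, holding one partition's table in memory at a time."""
    with tempfile.TemporaryDirectory() as tmp:
        paths = [os.path.join(tmp, f"part_{i}.txt") for i in range(n_partitions)]
        handles = [open(path, "w") for path in paths]
        try:
            # Pass 1: stream the reads and append every k-mer to its partition file.
            for read in reads:
                for i in range(len(read) - k + 1):
                    kmer = read[i:i + k]
                    part = zlib.crc32(kmer.encode("ascii")) % n_partitions
                    handles[part].write(kmer + "\n")
        finally:
            for handle in handles:
                handle.close()
        # Pass 2: load each partition separately into a temporary hash table.
        for path in paths:
            with open(path) as fh:
                table = Counter(line.rstrip("\n") for line in fh)
            for kmer, count in sorted(table.items()):
                if count >= min_count:      # optional low-abundance filter
                    yield kmer, count


reads = ["ACGTACGT", "CGTA"]
print(dict(iter_kmer_counts(reads, k=3)))
```

Because identical k-mers always land in the same partition, each partition can be counted independently, and more partitions mean a smaller hash table per step at the cost of more files on disk.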

DSK is the first approach that is able to count all the 27-mers of a human genome dataset using only 4.0 GB of memory and moderate disk space (160 GB), in 17.9 hours. DSK can replace a popular k-mer counting software (Jellyfish) on small-memory servers.

Availability: http://minia.genouest.org/dsk

Contact: rayan.chikhi@ens-cachan.org

How to measure codon usage bias

szypanther — Tue, 15 Jan 2013 07:53:16 +0000

The codon adaptation index (CAI) is one such measure. To compute the CAI value of a gene, a reference table of RSCU (relative synonymous codon usage) values for highly expressed genes is compiled first.
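
As a rough illustration of the calculation, the Python sketch below derives a relative adaptiveness value w for each codon from a reference set (its count divided by the count of the most used synonymous codon, which equals the ratio of the RSCU values) and then takes the geometric mean of w over the codons of a gene. The function names, the toy sequences and the 0.5 stand-in count for codons absent from the reference set are assumptions of this sketch; it does not reproduce the internals of any particular program.

```python
import math
from collections import Counter, defaultdict

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = dict(zip((a + b + c for a in BASES for b in BASES for c in BASES), AMINO))


def codons_of(seq):
    """Split a coding sequence into codons, dropping any incomplete tail."""
    seq = seq.upper().replace("U", "T")
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def relative_adaptiveness(reference_genes):
    """w[codon] = count / count of the most used synonymous codon."""
    counts = Counter(c for gene in reference_genes for c in codons_of(gene))
    synonyms = defaultdict(list)
    for codon, aa in CODON_TO_AA.items():
        synonyms[aa].append(codon)
    w = {}
    for aa, family in synonyms.items():
        if aa == "*" or len(family) == 1:
            continue                      # stop codons, Met and Trp are skipped
        top = max(counts[c] for c in family)
        if top == 0:
            continue                      # amino acid absent from the reference set
        for codon in family:
            # Assumption: unseen codons get a count of 0.5 so that log(w) stays finite.
            w[codon] = max(counts[codon], 0.5) / top
    return w


def cai(gene, w):
    """Geometric mean of w over the scorable codons of a gene."""
    logs = [math.log(w[c]) for c in codons_of(gene) if c in w]
    return math.exp(sum(logs) / len(logs)) if logs else float("nan")


# Toy reference set: alanine is encoded by GCT three times and GCC once.
w = relative_adaptiveness(["GCTGCTGCTGCC"])
print(cai("GCTGCT", w), cai("GCTGCC", w))
```

A gene that uses only the preferred codon of the reference set scores 1.0, and every less favoured codon pulls the geometric mean down.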

A program called CodonW can calculate it; you can download it from: http://codonw.sourceforge.net/. There is also a PhD thesis associated with it.

shenzy@shenzy-ubuntu:~/Downloads/CondonW/codonW$ codonw input.dat -all_indices -c_type 2 -f_type 4 -nomenu

e.g.:

shenzy@shenzy-ubuntu:/winxp_disk2/shenzy/circos/work$ codonw Dehalococcoidessp.BAV1.cds.fasta.dat -all_indices -c_type 2 -f_type 4 -nomenu