BEReX: Biomedical Entity-Relation eXplorer

BEReX is a new biomedical knowledge integration, search, and exploration tool. BEReX integrates eight popular databases (STRING, DrugBank, KEGG, PharmGKB, BioGRID, GO, HPRD, and MSigDB) and delineates an integrated network by combining the information available from these databases. Users search the integrated network by entering keywords and BEReX returns a sub-network matching the keywords. […]

multiple sequence alignment software

PAGAN is a general-purpose method for the alignment of sequence graphs. It includes e.g.:

phylogenetic multiple sequence alignment
alignment extension by phylogenetic sequence placement
modelling of Roche 454 sequencing error
alignment and placement of NGS sequences
pileup alignment of similar/noisy NGS reads or sequences
inference of ancestral sequences

 

PAGAN documentation is available at the […]

Some simple awk usage

# Print every line, deleting the second column

awk '{ $2 = ""; print }' file1
awk '{ $2 = ""; $1 = ""; print }' test1

# Print part of the text

bash-3.2$ # Print the first ten lines of a file (emulates "head")

bash-3.2$ awk 'NR < 11' test1

# Print the last two lines of a file (emulates "tail -2")

awk '{y=x "\n" $0; x=$0};END{print y}'

 

# Print the last line of a file (emulates "tail -1")

awk 'END{print}'

[…]

Cake: a bioinformatics pipeline for the integrated analysis of somatic variants in cancer genomes.

 

Description

 

Cake is a bioinformatics tool to identify putative somatic mutations from cancer genome/exome data. Cake combines somatic calls from a number of publicly available SNP/somatic variant calling tools with an array of variant filtering modules to discard unwanted

 

http://sourceforge.net/projects/cakesomatic/

Reading the NCBI’s GEO microarray SOFT files in R/BioConductor

http://www2.warwick.ac.uk/fac/sci/moac/people/students/peter_cock/r/geo/

This page discusses how to load GEO SOFT format microarray data from the Gene Expression Omnibus database (GEO) (hosted by the NCBI) into R/BioConductor. SOFT stands for Simple Omnibus Format in Text. There are actually four types of GEO SOFT file available:

GEO Platform (GPL) These files describe a particular type of microarray. They […]

MrBayes Tree

Use ClustalW to generate a NEXUS-format file:

#NEXUS
BEGIN DATA;
dimensions ntax=55 nchar=534;
format missing=? symbols="ABCDEFGHIKLMNPQRSTUVWXYZ" interleave datatype=DNA gap= -;

Change the header as follows:

#NEXUS
BEGIN DATA;
dimensions ntax=55 nchar=534;
format datatype=dna interleave=yes gap=- missing=?;
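If there are many alignments to fix, the header edit above could be scripted. The following is only a rough sketch: the function name fix_format_line is made up for illustration, and it assumes the first "format ...;" statement in the file is the one ClustalW wrote in the DATA block.

```python
import re

# Replacement statement, copied from the corrected header above.
NEW_FORMAT = "format datatype=dna interleave=yes gap=- missing=?;"


def fix_format_line(nexus_text, new_format=NEW_FORMAT):
    """Replace the first 'format ...;' statement in a NEXUS file.

    Assumption: the first such statement belongs to the DATA block
    produced by ClustalW. Everything else is left untouched.
    """
    pattern = re.compile(r"\bformat\b[^;]*;", re.IGNORECASE)
    # A function is used as the replacement so the text is inserted literally.
    return pattern.sub(lambda match: new_format, nexus_text, count=1)


if __name__ == "__main__":
    header = (
        "#NEXUS\n"
        "BEGIN DATA;\n"
        "dimensions ntax=55 nchar=534;\n"
        'format missing=? symbols="ABCDEFGHIKLMNPQRSTUVWXYZ" '
        "interleave datatype=DNA gap= -;\n"
    )
    print(fix_format_line(header))
```

To apply it to a real file, read the file into a string, pass it through fix_format_line, and write the result back out before starting MrBayes.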

Then run mb -i *.nex

MrBayes > lset nst=6 rates=invgamma

Setting Nst to 6
Setting Rates to Invgamma
Successfully […]

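GC-skew in microbial genomes (repost)

Given the two keywords "bioinformatics" and "GC", many people's first reaction is probably "GC content" or "CpG island". Over the past two weeks I have started working on non-coding RNA prediction (the target organism is Sinorhizobium meliloti) and came across a "GC theory" I had not heard of before: GC-skew. Searching the Chinese-language literature, I found almost no detailed introduction to it, and no established Chinese translation either. Since "skew" means slanted or off-center, I render GC-skew as "GC偏移" (GC offset), based on my understanding of the theory. Here I translate a Review from Nature to share with everyone.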

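GC-skew in microbial genomes

In most bacterial genomes there is a clear difference in base composition between the leading strand and the lagging strand: the leading strand is richer in G and T, whereas the lagging strand contains more A and C. The deviations from the base frequencies A=T and C=G are called AT-skew and GC-skew. Because GC-skew is usually more pronounced than AT-skew, mostly only GC-skew is considered. One way to measure GC-skew is to move a sliding window along the sequence, compute (G-C)/(G+C) in each window, and plot the values. This formula gives the percentage excess of G over C: a positive value indicates the leading strand, a negative value the lagging strand. (Image source: Nature.com)

What causes GC-skew? We still know very little about this. One possibility is that the leading and lagging strands spend different amounts of time in the single-stranded state during replication, so they are exposed to different DNA-damaging conditions and hence to different mutational pressures. Because T-G and G-T mispairs are more frequent than C-A and A-C mispairs, the more error-prone strand may become relatively enriched in G and T. Another theory relies on the hydrolytic deamination of cytosine, which occurs significantly more often in single-stranded DNA. The asymmetric structure of the replication fork leaves the lagging-strand template temporarily single-stranded, making it more vulnerable to cytosine deamination. Deamination of cytosine produces uracil, which pairs with adenine during replication and so in effect induces a C-to-T mutation. C-to-T deamination therefore increases the relative G and T content of that strand and the C and A content of its complementary strand.

Why is it important to analyse GC-skew? Because GC-skew is positive on the leading strand and negative on the lagging strand, a change in its sign signals where the leading strand begins and ends and switches to the lagging strand, and vice versa. This makes GC-skew a useful tool for marking the origin and terminus of replication in circular chromosomes. Clearly visible local changes in the plot can mark, for example, recent rearrangements such as sequence inversions, or the integration of foreign DNA. Loss of DNA does not change the basic shape of the GC-skew curve, although recently acquired foreign DNA may affect the local variation.

In practice, plots of GC-skew suffer from local fluctuations. It is therefore better to use the cumulative GC-skew, which is the sum of the GC-skew values of adjacent sliding windows from an arbitrary starting point to a given point in the sequence. The figure shows the GC-skew and the cumulative GC-skew of the Wolinella succinogenes DSM1740 genome and illustrates how the sign of the GC-skew changes at the origin and terminus of replication. The cumulative GC-skew reaches its maximum and minimum at these positions.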
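The sliding-window calculation described above is easy to try out. The following is a minimal sketch, not the method used for the figure in the Review: the function names, the window and step sizes, and the toy sequence are all my own choices for illustration.

```python
def gc_skew(seq, window=1000, step=1000):
    """Return (window_start, skew) pairs, where skew = (G - C) / (G + C)."""
    seq = seq.upper()
    result = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        g = chunk.count("G")
        c = chunk.count("C")
        # A window with no G or C at all is given a skew of 0 (assumption).
        skew = (g - c) / (g + c) if (g + c) else 0.0
        result.append((start, skew))
    return result


def cumulative_skew(skews):
    """Running sum of the per-window skew values."""
    total = 0.0
    out = []
    for start, skew in skews:
        total += skew
        out.append((start, total))
    return out


if __name__ == "__main__":
    # Toy sequence: a G-rich half followed by a C-rich half.
    toy = "GGGGTTGGAT" * 20 + "CCCCAACCAT" * 20
    skews = gc_skew(toy, window=50, step=50)
    cumulative = cumulative_skew(skews)
    # The extreme of the cumulative curve marks where the skew changes sign.
    print(max(cumulative, key=lambda pair: pair[1]))
```

On a real genome the sequence would be read from a FASTA file, and the window size would need tuning: windows that are too small give the noisy curve mentioned above, which is why the cumulative version is usually plotted.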

Source: http://www.nature.com/nrmicro/journal/v2/n11/box/nrmicro1024_BX1.html

[…]

Install genometools

The ‘new’ error message refers to a nonexistent Cairo library on your system, which is needed for the AnnotationSketch component of GenomeTools. If you do not need this, do a ‘make cleanup’ and recompile with the additional make option ‘cairo=no’, e.g. ‘make errorcheck=no cairo=no’. This will disable support for AnnotationSketch and remove the cairo […]

DSK: k-mer counting with very low memory usage

Summary: Counting all the k-mers (substrings of length k) in DNA/RNA sequencing reads is the preliminary step of many bioinformatics applications. However, state-of-the-art k-mer counting methods require that a large data structure resides in memory. Such a structure typically grows with the number of distinct k-mers to count.

We present a […]
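To make the memory problem concrete, here is a naive in-memory k-mer counter. This is explicitly not the DSK algorithm; it is a sketch of the straightforward baseline, in which the dictionary grows with the number of distinct k-mers. The function name and the choice to skip k-mers containing N are my own assumptions.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer (substring of length k) across all reads.

    Memory use grows with the number of distinct k-mers, which is
    exactly the limitation the summary above describes.
    """
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Skip k-mers with ambiguous bases (an assumption of this sketch).
            if "N" not in kmer:
                counts[kmer] += 1
    return counts


if __name__ == "__main__":
    for kmer, n in count_kmers(["ACGTACGT", "CGTAN"], 4).most_common():
        print(kmer, n)
```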

How to measure codon usage bias

The codon adaptation index (CAI) is one such measure. To compute the CAI of a gene, a reference table of RSCU (relative synonymous codon usage) values for highly expressed genes is first compiled.
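As a rough illustration of the idea (this is not CodonW's code), the sketch below builds relative adaptiveness weights from a set of reference genes and takes the geometric mean over the codons of a gene. Within one amino acid, dividing a codon's RSCU by the largest RSCU is the same as dividing its count by the largest count, so the counts are used directly. Assumptions on my part: the standard genetic code, codons for Met, Trp and stop are excluded, and a codon never seen in the reference set is given a count of 0.5.

```python
import math
from collections import Counter
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codons(seq):
    """Split a coding sequence into complete codons (trailing bases dropped)."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def relative_adaptiveness(reference_genes):
    """w = count / max count among synonymous codons, from the reference set."""
    counts = Counter(
        c for gene in reference_genes for c in codons(gene) if c in CODON_TABLE
    )
    weights = {}
    for aa in set(AMINO) - set("*MW"):  # skip stop, Met and Trp
        family = [c for c, a in CODON_TABLE.items() if a == aa]
        # Unseen codons get a count of 0.5 (assumption, avoids log(0)).
        observed = {c: counts[c] or 0.5 for c in family}
        best = max(observed.values())
        for c in family:
            weights[c] = observed[c] / best
    return weights


def cai(gene, weights):
    """Geometric mean of the weights of the gene's codons."""
    logs = [math.log(weights[c]) for c in codons(gene) if c in weights]
    return math.exp(sum(logs) / len(logs)) if logs else float("nan")


if __name__ == "__main__":
    w = relative_adaptiveness(["AAAAAAAAAAAG"])  # toy reference: 3x AAA, 1x AAG
    print(cai("AAAAAG", w))
```

A real reference set would be a list of highly expressed genes from the organism in question; the single toy sequence here only shows the mechanics.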

A program called CodonW can be downloaded from: http://codonw.sourceforge.net/. There is also a PhD thesis associated with it.

shenzy@shenzy-ubuntu:~/Downloads/CondonW/codonW$ codonw input.dat -all_indices […]