小生这厢有礼了(BioFaceBook Personal Blog) » NGS

gff3 to gtf method

szypanther — Fri, 25 Apr 2014 01:59:05 +0000

The easiest way is to use the gffread program that ships with the Cufflinks software suite (part of the Tuxedo tools):

gffread my.gff3 -T -o my.gtf

See gffread -h for more information
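To show what the conversion changes, here is a minimal Python sketch of the attribute column in the two formats. It is not a replacement for gffread. The function names and example IDs are invented for illustration, and it assumes the gene and transcript IDs have already been resolved.

```python
from urllib.parse import unquote

def parse_gff3_attributes(column9):
    """Split a GFF3 attribute column (key=value;key=value) into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if not field:
            continue
        key, _, value = field.partition("=")
        # GFF3 escapes reserved characters with URL-style percent encoding.
        attrs[key.strip()] = unquote(value.strip())
    return attrs

def format_gtf_attributes(gene_id, transcript_id):
    """Build a GTF attribute column (key "value"; key "value";)."""
    return f'gene_id "{gene_id}"; transcript_id "{transcript_id}";'

# Hypothetical exon record whose parent transcript is tx1 in gene1.
attrs = parse_gff3_attributes("ID=exon1;Parent=tx1")
print(format_gtf_attributes("gene1", attrs["Parent"]))
```

In a real file the gene ID has to be found by following the Parent links from exon to transcript to gene, which is the part gffread handles for you.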


KEGG annotation pipeline

szypanther — Wed, 16 Oct 2013 05:10:45 +0000

KEGG Pathway Pipeline:

blastall -p blastp -d KEGG -i Haiyan.Pep.fasta -m 7 -a 10 -o Haiyan.Pep.fasta.blastp.m7 &
./tBLASTnParser.pl Haiyan.Pep.fasta.blastp.m7 Haiyan.Pep.fasta.blastp.m8
sed '1d' Haiyan.Pep.fasta.blastp.m8 > Haiyan.Pep.fasta.blastp.m8.delhead

/home/zhouzh/lib/454-2.5/bin/runAssembly -m -cpu 16 -cdna -nobig -o Test sff/GV1NGBM02.sff

./draw_png.py -i ACYPIprot.KO.file -p /home/shenzy/KEGG/ko_org -o map_result5

step 1:
/home/soft/blast-2.2.23/bin/blastall -p blastp -d KEGG -i MBL_relation.fa -a 15 -b 30 -v 30 -m 7 -FF -o MBL_relation.fa.blastp2.m7 &

step 2:
/home/shenzy/work_python_script_bak/tBLASTnParser.pl MBL_relation.fa.blastp2.m7 MBL_relation.fa.blastp2.m8

sed -e '1d' G_seq_fkegg_Mix4.blastp.m8.result > G_seq_fkegg_Mix4.blastp.m8.result.nohead

./handle_KEGG_blast.py -i MBL_relation.fa.blastp2.m8 -j ../ko_gene -g anno_file2 -s anno_file_status2 > KO_list_file2

step 3:
Use anno_file_status as the input here, not KO_list_file. The BR:ko04091: …… and PATH:….. entries must be deleted first.

./draw_png2.py -i MBL2.KOFILE -p /home/shenzy/KEGG/ko_org/ -o MBLkeggMAP.result2

/home/soft/velvet_1.0.19/shuffleSequences_fastq.pl lane3_1209.read2.fq.t10l40.bowtie.file lane3_1209.read1.fq.t10l40.bowtie.file lane3_1209.t10l40.bowtie.pe12.fq
/home/soft/fastx_toolkit-0.0.13/src/fastq_to_fasta/fastq_to_fasta -n -i lane3_1209.t10l40.bowtie.pe12.fq -o lane3_1209.t10l40.bowtie.pe12.fa
cat lane3_1209.t10l40.bowtie.pe12.fa lane3_read12.fa lane4_read12.fa s_2_pe12.fasta > s_2343_pe12.fasta

blastall -p blastp -d ../KEGG -i AphisVelvet.pep -a 15 -b 30 -v 30 -m 7 -FF -o AphisVelvet.pep.blastp.result2 &
QueryName HSP QueryLength SubjctLength QueryAlignment SubjctAlignment Annotation Score BitScore EValue IdentityRate QueryFrame QueryStart QueryEnd SubjectFrame SubjectStart SubjectEnd
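The parsed m8 table with the header above can be reduced to one best hit per query, which is presumably what the ".top1" files further down contain. The following Python sketch is one way to do it, not the original scripts. It assumes the table is tab-delimited and that "best" means the highest BitScore; the demo rows are invented.

```python
import csv
import io

def top_hit_per_query(handle):
    """Keep the row with the highest BitScore for each QueryName."""
    best = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        name = row["QueryName"]
        if name not in best or float(row["BitScore"]) > float(best[name]["BitScore"]):
            best[name] = row
    return best

# Invented demo data using a subset of the header columns.
demo = io.StringIO(
    "QueryName\tAnnotation\tBitScore\tEValue\n"
    "seqA\thit one\t120.5\t1e-30\n"
    "seqA\thit two\t488.0\t1e-136\n"
    "seqB\thit three\t75.1\t1e-12\n"
)
hits = top_hit_per_query(demo)
print(hits["seqA"]["Annotation"])
```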

cdhit-cluster-consensus 1.GAC.454Reads.fna.cluster.clstr 1.GAC.454Reads.fna cdhit.result &
[2] 17965
read 379519 clusters from file "1.GAC.454Reads.fna.cluster.clstr"
read 737073 sequences from file "1.GAC.454Reads.fna"
write 293650 singleton clusters
finish 85869 clusters out of 85869 non-singleton clusters

ACYPI000002-PA RefSeq peptide NP_001119607 gi|187097094|ref|NP_001119607.1 sucrase [Acyrthosiphon pisum] 1 590 588 95.76 97.45 dme:Dmel_CG8690 CG8690 gene product from transcript CG8690-RA (EC:3.2.1.20); K01187 alpha-glucosidase [EC:3.2.1.20] 1255 488.034 7.63337e-136 46.34 1 19 583 1 15 587

######################################
KEGG hits carry protein names such as deb:DehaBAV1_0078; the related information can be looked up in COG.mappings.v8.3.txt
######################################

shenzy@shenzy-ubuntu:/winxp_disk2/shenzy/BGI/lla/gene_annotation$ more 11aRayScalf_all.fna.cds.faa.blastp_kegg.m8.top1
QueryName HSP QueryLength SubjctLength QueryAlignment SubjctAlignment Annotation Score BitScore EValue IdentityRate QueryFrame QueryStart QueryEnd SubjectFrame SubjectStart SubjectEnd
11aRayScalf10001 87 87 734 648 648 0.891375 0.891375 3 D 3 (translation) 1 215 319 100.00 67.40 deb:DehaBAV1_0078 phage integrase family protein 1085 422.55 5.04278e-117 94.42 1 1 215 1 66 280

shenzy@shenzy-ubuntu:/winxp_disk2/shenzy/BGI/lla/gene_annotation$ grep "DehaBAV1_0078" COG.mappings.v8.3.txt
216389.DehaBAV1_0078 32 296 COG4974 Phage integrase family protein
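The grep lookup can also be scripted when many KEGG hits need a COG assignment. This is a minimal Python sketch under two assumptions taken from the single output line above: the mapping file is whitespace-delimited, and its first column has the form "taxon id.locus tag". The function names are hypothetical.

```python
def locus_from_kegg_id(kegg_id):
    """'deb:DehaBAV1_0078' -> 'DehaBAV1_0078' (drop the organism prefix)."""
    return kegg_id.split(":", 1)[-1]

def find_cog(lines, locus):
    """Return (COG id, description) for the first mapping line matching locus."""
    for line in lines:
        fields = line.split(None, 4)
        if len(fields) < 4:
            continue
        # Column 1 looks like '<taxon id>.<locus tag>' in the grep output.
        if fields[0].split(".", 1)[-1] == locus:
            return fields[3], fields[4].strip() if len(fields) > 4 else ""
    return None

mapping = ["216389.DehaBAV1_0078 32 296 COG4974 Phage integrase family protein"]
print(find_cog(mapping, locus_from_kegg_id("deb:DehaBAV1_0078")))
```

For the real file, pass an open file handle instead of the one-line list.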

shenzy@shenzy-ubuntu:/winxp_disk2/shenzy/BGI/lla/gene_annotation/kegg$ ./draw_png_good.py -i anno_file_status -p /winxp_disk2/shenzy/KEGG/img/ -o test.out

#################################################################################################################

shenzy@shenzy-ubuntu:/winxp_disk2/shenzy/BGI/MB/gene_annotation/kegg$ blastall -p blastp -d kegg-Prokaryotes -i 47_acc_num_xiaoying.txt.fasta -a 15 -b 30 -v 30 -m 7 -FF -o 47_acc_num_xiaoying.txt.fasta.blastp.m7 &

shenzy@shenzy-ubuntu:/winxp_disk2/shenzy/BGI/MB/gene_annotation$ handle_KEGG_blast.py -i MBrayScalfALL.fna.cds.faa.blasp.kegg-P.m8.nohead -j ko -g anno_file2 -s anno_file_status2 > KO_list_file2

shenzy@shenzy-ubuntu:/winxp_disk2/shenzy/BGI/MB/gene_annotation/kegg$ handle_KEGG_blast_filterzero.py -i 2548N_stat.siggenes_102030min.filter.protein.fasta.blastp.m8.nohead -j ko.pep.fasta -g 162_anno_file -s 162_anno_file_status > 162_KO_list_file &
#################################################################################################################

blastall -p blastp -d kegg-Prokaryotes -i 176_protein.fasta -a 15 -b 30 -v 30 -m 7 -FF -o 176_acc_relation.fa.blastp.m7 &

handle_KEGG_blast.py -i 176_acc_relation.fa.blastp.m8.nohead -j ko -g 176_anno_file -s 176_anno_file_status > 176_KO_list_file &

draw_png_good.py -i 176_anno_file_status -p img -o 176_kegg_results &

Reordering contigs in draft genomes by MAUVE

szypanther — Wed, 18 Sep 2013 02:08:17 +0000

When to use Mauve Contig Mover (MCM)

The Mauve Contig Mover (MCM) can be used to order a draft genome relative to a related reference genome. The functionality of this software module has been described in Rissman et al. 2009, a publication in Bioinformatics. The Mauve Contig Mover can ease a comparative study between draft and reference sequences by ordering draft contigs according to the reference genome. In many cases, true rearrangements in the draft relative to the reference can be identified. The quality of the reorder is limited by the distance between the sequences, as indicated by the amount of gene content shared between the two organisms. A more distant reference will usually yield fewer ordered draft genome contigs, and may also induce erroneous placements of draft contigs. In addition to ordering contigs, MCM also orients them in the most likely orientation, and, if annotated sequence features are specified in an input file (e.g. with GenBank format input for the draft), MCM will output adjusted coordinate ranges for the features.

Using Mauve Contig Mover

Mauve Contig Mover can be launched from the Tools->Order Contigs Menu of The Mauve Viewer.

Once the Mauve Contig Mover has been launched, it starts by requesting the user to specify an output directory. MCM will create a series of mauve alignments in the output directory, structured into several subfolders, so creating a new, empty output folder is often best to minimize clutter. The output directory selection is shown below.

Once the output directory has been set, a window similar to the Progressive Mauve Alignment Window appears. Using the Progressive Mauve alignment window is described in more detail in the section "Constructing a genome alignment", but when used for reordering contigs, the following additional notes and constraints apply:
1.    Only two sequences should be entered. The first must always be the reference, the second the draft genome to reorder. The first may also be a draft, but only the second will be reordered.
2.    The reference genome may be in any of the allowable file formats; the draft must be in either a fasta or a genbank file. If the draft is in genbank format, a unique identifier must be specified in the LOCUS tag of each contig.
3.    The alignment parameters have already been adjusted for what is generally best for draft genomes. This may depend on the draft and reference, and can be adjusted if needed.

The Reordering Process

The reordering will begin when the start button is pressed. It is an iterative process, and may take anywhere from half an hour to several hours. It may be canceled at any point (intermediate results will be viewable). If it is canceled after the first reorder, the reordered draft genome will be available in a Multi-FastA file in the corresponding output directory, although an alignment of the canceled ordering step will not be present. If the ordering process is not manually ended, it will terminate when it finds that an order has repeated. Sometimes the order will cycle through several possibilities; this indicates that it cannot determine which of them is most likely. Alignment parameters may be changed before the reorder starts or at any time between alignments.
The following message will appear when the reorder process is complete:

The MCM Output Files

MCM will output a series of folders called alignment1-alignmentX, representing each iteration of the reorder. Each has the standard Mauve alignment files, as described in the section Mauve Output File Formats. Each folder also has an additional file called name_of_genome_contigs.tab, where "name_of_genome" is the draft genome's name. This file is included for ease of interpreting reorder results, and also acts as an index to the fasta as the contig orders and orientations change (even if the draft was originally input as a genbank file, it will be converted to a fasta after the first alignment, with annotation information preserved in a file described below). The file is divided into 3 sections, each containing a list of contigs. The data for each contig includes its label (name), its location in the genome (numbered in pseudocoordinates from the first to last contig; these coordinates can be entered into the View->Go To->Sequence Position menu option to jump to that contig using the Mauve Alignment Viewer), and whether it is oriented the same as originally input, or was complemented. The three sections are described below:

1. Contigs to reverse: This section contains contigs whose order is reversed with respect to the previous iteration. Note that contigs in this section may be oriented the same as originally input; this can be determined from the forward or complement designation.

2. Ordered Contigs: This is a list of all the contigs in the order and orientation they appear in the fasta for the draft of this iteration of the reorder. Since these include all the contigs in the original input, those with no ordering information (no aligned region) will be clustered at the end. These will appear as contigs with no LCBs at the end of the draft genome.

3. Contigs with Conflicting Order information: This is a list of contigs containing LCBs suggesting multiple possible locations. These may be of interest to verify positioning, or to look at points of potential rearrangement or misassembly.

If the draft was input as an annotated genbank file, a second file will appear in each alignment folder called name_of_genome_features.tab. This file will contain a line for each annotation, information about its current orientation and location (which will change if the contig is inverted), coordinates from the previous iteration (indicating relative orientation), and whether it is reversed from the original input. It will also have a label field used to identify each feature. The label is taken from the annotation, checked in the following order: db_xref, label, gene, and locus_tag.

Thus, the folder with the highest numbered alignment contains a fasta and one (or possibly two) descriptor files representing the final order of the draft genome.

Reordering contigs from the command-line (batch mode)

In situations where it is necessary to order contigs in a large number of draft genomes it is often more desirable to automate the process using command-line interfaces and scripts. Mauve Contig Mover supports command-line operation through the Mauve Java JAR file.

Given a reference genome file called “reference.gbk” and a draft genome called “draft.fasta”, one would invoke the reorder program with the following syntax:

java -Xmx500m -cp Mauve.jar org.gel.mauve.contigs.ContigOrderer -output results_dir -ref reference.gbk -draft draft.fasta

The file Mauve.jar is part of the Mauve distribution. On Windows systems it can usually be found in C:\Program Files\Mauve X\Mauve.jar where X is the version of Mauve. On Mac OS X it is located inside the Mauve application. For example, if Mauve has been placed in the OS X applications folder, Mauve.jar can be found at /Applications/Mauve.app/Contents/Resources/Java/Mauve.jar. On Linux, Mauve.jar is simply at the top level of the tar.gz archive. In the above example command, it will be necessary to specify the full path to the Mauve.jar file.
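For a folder of draft genomes, the command above can be wrapped in a short script. The Python sketch below is one possible way to do that, not part of Mauve. It assumes the drafts are files ending in .fasta, and the helper names build_mcm_command and reorder_all are made up for this example.

```python
import subprocess
from pathlib import Path

def build_mcm_command(mauve_jar, reference, draft, output_dir, heap="500m"):
    """Assemble the ContigOrderer command line shown above as an argument list."""
    return [
        "java", f"-Xmx{heap}", "-cp", str(mauve_jar),
        "org.gel.mauve.contigs.ContigOrderer",
        "-output", str(output_dir),
        "-ref", str(reference),
        "-draft", str(draft),
    ]

def reorder_all(mauve_jar, reference, draft_dir, results_root):
    """Run MCM once per *.fasta draft, each into its own output folder."""
    for draft in sorted(Path(draft_dir).glob("*.fasta")):
        out_dir = Path(results_root) / draft.stem
        cmd = build_mcm_command(mauve_jar, reference, draft, out_dir)
        subprocess.run(cmd, check=True)  # stops at the first failing draft

# Prints the same command as the example in the text.
print(" ".join(build_mcm_command("Mauve.jar", "reference.gbk", "draft.fasta", "results_dir")))
```

Giving each draft its own output folder keeps the alignment1-alignmentX subfolders of different genomes apart.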

http://asap.ahabs.wisc.edu/mauve-aligner/mauve-user-guide/reording-contigs-in-draft-genomes.html

PyroHMMvar: a sensitive and accurate method to call short INDELs and SNPs for Ion Torrent and 454 data

szypanther — Tue, 03 Sep 2013 06:08:43 +0000

Motivation: The identification of short indels and SNPs from Ion Torrent and 454 reads is a challenging problem, essentially because these techniques are prone to sequence erroneously at homopolymers and can, therefore, raise indels in reads. Most of the existing mapping programs do not model homopolymer errors when aligning reads against the reference. The resulting alignments will then contain various kinds of mismatches and indels that confound the accurate determination of variant loci and alleles.

Results: To address these challenges, we realign reads against the reference using our previously proposed hidden Markov model (HMM) that models homopolymer errors and then merges these pairwise alignments into a weighted alignment graph. Based on our weighted alignment graph and HMM, we develop a method called PyroHMMvar which can simultaneously detect short indels and SNPs, as demonstrated in human resequencing data. Specifically, by applying our methods to simulated diploid datasets, we demonstrate that PyroHMMvar produces more accurate results than state-of-the-art methods, such as Samtools and GATK, and is less sensitive to mapping parameter settings than the other methods. We also apply PyroHMMvar to analyze one human whole genome resequencing dataset and the results confirm that PyroHMMvar predicts SNPs and indels accurately.

Availability and Implementation: Source code freely available at the following URL: https://code.google.com/p/pyrohmmvar/, implemented in C++ and supported on Linux.

GC-skew in microbial genomes (repost)

szypanther — Mon, 29 Apr 2013 03:11:09 +0000

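Given the two keywords "bioinformatics" and "GC", many people will probably think first of GC-content or CpG islands. Over the past two weeks I started working on non-coding RNA prediction (in Sinorhizobium meliloti) and ran into a "GC theory" I had not heard of before: GC-skew. Searching the Chinese-language literature turned up almost no detailed description of it, nor an established Chinese term. Since "skew" means slanted or biased, I render it as "GC偏移". Here is a translation of a review published by Nature, shared for everyone.

GC-skew in microbial genomes
In most bacterial genomes there is a clear difference in base composition between the leading strand and the lagging strand: the leading strand is enriched in G and T, whereas the lagging strand has more A and C. Deviations of the base frequencies from A=T and C=G are called AT-skew and GC-skew. Because GC-skew is usually more pronounced than AT-skew, usually only GC-skew is considered. One way to measure GC-skew is to slide a window along the sequence, compute (G-C)/(G+C) in each window and plot the values. The formula gives the percentage excess of G over C: a positive value indicates the leading strand and a negative value the lagging strand.

(Image source: Nature.com)
What causes GC-skew? This is still poorly understood. One possibility is that the leading and lagging strands spend different amounts of time in the single-stranded state during replication, so they are exposed to different DNA-damaging conditions and experience different mutational pressures. Because T-G and G-T mispairs are more frequent than C-A and A-C mispairs, the more error-prone strand may become relatively enriched in G and T. Another theory rests on the hydrolytic deamination of cytosine, which occurs predominantly in single-stranded DNA. The asymmetric structure of the replication fork leaves the lagging-strand template transiently single-stranded, making it more susceptible to cytosine deamination. Deamination of cytosine yields uracil, which pairs with adenine during replication, in effect causing a C-to-T mutation. C-to-T deamination therefore increases the percentage of G and T in that strand and of C and A in the complementary strand.
Why is GC-skew analysis important? Because GC-skew is positive on the leading strand and negative on the lagging strand, the skew value signals where the leading strand begins and ends and where it switches to the lagging strand, and vice versa. This makes GC-skew a useful tool for marking the origin and terminus on circular chromosomes. Local changes visible in the plot can mark events such as recent recombination of inverted sequence or integration of foreign DNA. Loss of DNA does not change the basic shape of the GC-skew curve, although recent acquisition of foreign DNA may affect the local variance.
In practice, visualisation of GC-skew suffers from local fluctuations. It is therefore better to use the cumulative GC-skew, which is the sum of the GC-skew values of adjacent sliding windows from an arbitrary start point to a given point in the sequence. The figure shows the GC-skew and cumulative GC-skew for the genome of Wolinella succinogenes DSM1740 and how the GC-skew changes sign at the origin and terminus of replication. The cumulative GC-skew has its maximum and minimum at these positions.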
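The sliding-window calculation and the cumulative sum can be sketched in a few lines of Python. This is a minimal illustration rather than the method used for the published figure. The function names and the toy sequence are made up, and windows with no G or C are given a skew of 0 by assumption.

```python
def gc_skew(seq, window, step=None):
    """Return (G-C)/(G+C) for successive windows along seq."""
    step = step or window  # default: adjacent, non-overlapping windows
    seq = seq.upper()
    values = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        g, c = chunk.count("G"), chunk.count("C")
        # Assumption: a window without G or C contributes a skew of 0.
        values.append((g - c) / (g + c) if g + c else 0.0)
    return values

def cumulative(values):
    """Running sum of the per-window skew values."""
    total, out = 0.0, []
    for v in values:
        total += v
        out.append(total)
    return out

skews = gc_skew("GGGGCCGCCCCG", window=4)
print(skews)              # [1.0, -0.5, -0.5]
print(cumulative(skews))  # [1.0, 0.5, 0.0]
```

On a real chromosome the window would be thousands of bases wide, and the extremes of the cumulative curve are the candidate positions for the origin and terminus.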

Source: http://www.nature.com/nrmicro/journal/v2/n11/box/nrmicro1024_BX1.html

Glossary of common terms in Solexa and HiSeq sequencing

szypanther — Tue, 15 Jan 2013 07:34:28 +0000

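Among second-generation sequencing technologies, Solexa and its successor HiSeq are currently the most widely used. To help PLoB readers better understand Solexa-related concepts, here is an article found online, "Explanation of common terms in Solexa sequencing"; a link to the source is given at the end. For more, you are welcome to join the PLoB 2000-member bioinformatics QQ group (group number: 235461986) to discuss sequencing and bioinformatics questions. The explanations follow. They can be read alongside the schematic above to understand the basic structure of Solexa and HiSeq.

SBS: sequencing by synthesis. Each SBS cycle extends the strand by one base and takes about 70 minutes.

Run: a single sequencing run on the instrument, yielding anywhere from 4 Gb to 75 Gb of data.

Lane: a single lane. Each lane physically separates sequencing samples, and one run can load at most 8 lanes.

Channel: synonym for Lane.

Tile: each lane contains 2 columns of tiles, 120 tiles in total. Each tile carries a large number of cluster binding sites.

Cluster: in Solexa sequencing, bridge PCR is used to generate DNA clusters; only a whole DNA cluster produces a fluorescent spot bright enough for the CCD to resolve.

Index: a tag. In Solexa multiplexed sequencing, an index is used to distinguish samples. After the regular sequencing is complete, 7 extra cycles are run to read the index. By identifying the index, 12 different samples can be distinguished within one lane.

Barcode: synonym for Index.

Fasta: a sequence storage format. In a FASTA file, the first line of each record starts with ">" followed by the sequence ID (a unique identifier) and a description of the sequence; the sequence itself starts on the second line. Sequences shorter than 61 nt occupy a single line; longer sequences are stored 61 nt per line, with the remainder on the last line. The next record starts on a new line, again with ">" and the sequence ID, and so on.

Fastq: a file format used in Solexa sequencing that records the base qualities of each read. The first line starts with "@" followed by a description of the sequence; the second line is the sequence; the third line starts with "+", optionally followed by the same description as the first line; the fourth line gives the quality value for each base of the sequence on the second line.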
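The four-line record layout described above can be read with a small Python generator. This is a minimal sketch that assumes unwrapped records (exactly four lines each). It also assumes a Phred offset of 33, which is a guess about the file at hand: older Solexa/Illumina pipelines wrote qualities with an offset of 64. The function names are made up for this example.

```python
import io

def read_fastq(handle):
    """Yield (name, sequence, quality string) from 4-line FASTQ records."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file (or a blank line)
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record: " + header)
        if len(seq) != len(qual):
            raise ValueError("sequence/quality length mismatch: " + header)
        yield header[1:], seq, qual

def phred_scores(qual, offset=33):
    """Decode an ASCII quality string; offset=33 is an assumption."""
    return [ord(ch) - offset for ch in qual]

# Invented one-record demo.
demo = io.StringIO("@read1 demo\nACGT\n+\nIIII\n")
for name, seq, qual in read_fastq(demo):
    print(name, seq, phred_scores(qual))
```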

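PF%: the percentage of clusters that pass the sequencing quality filter; it is directly related to sequencing throughput.

Read: Solexa reactions take place in clusters. Each cluster corresponds to one DNA fragment, which becomes one read.

Source of the glossary and figure: http://www.igenomics.com.cn:7001/ajgene/jsp/ajweb/News.jsp?cid=C47825F27EC00001B8BF8B8D11C01D10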

Performance comparison of Illumina MiSeq with GS FLX/Junior and Ion Torrent PGM

szypanther — Tue, 08 Jan 2013 02:35:24 +0000

Illumina MiSeq vs. GS FLX/Junior performance comparison

	Illumina MiSeq	GS FLX/Junior
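Workflow and turnaround time	Offers the fastest second-generation sequencing workflow: from DNA sample to analyzed data in 8 hours, 5 times faster than GS FLX/Junior. The workflow includes: • Library preparation: 1.5 hours, using the fast, transposon-based Nextera method • Automated cluster generation through sequencing on a single instrument in under 4.5 hours (1 x 36 bp) • Primary and secondary data analysis on the same instrument in under 2 hours • A 2 x 150 bp run takes about 27 hours*	The complete GS FLX/Junior workflow takes several days: • Library construction: 1 day • emPCR: off-instrument and labor-intensive, 2-3 days of manual work • Sequencing: 10 hours • Primary and secondary data analysis: 8 hours (GS FLX), >2 hours (GS Junior)
Throughput	The highest-throughput personal sequencer: • 1-1.5 Gb of data and 6.8 M paired-end reads per run • Throughput and read count suit a broad range of important applications, such as targeted/amplicon sequencing, small-genome sequencing, ChIP-Seq and small RNA	• 0.5 Gb / 1 M reads per run (GS FLX) or 35 Mb / 100 k reads per run (GS Junior)
Read length	• MiSeq supports 2 x 150 bp paired-end reads, enabling important applications such as de novo assembly of small genomes, structural-variant detection and mapping reads to repetitive regions • Overlapping 2 x 150 bp reads can be merged into highly accurate, highly sensitive "contiguous" reads of up to 275 bp	• 400 bp
Chemistry	• Illumina SBS chemistry delivers the highest yield of perfect reads and of bases above Q30 (99.9% accuracy) • Read quality is not affected by homopolymer regions • Among second-generation technologies, Illumina SBS chemistry is the most widely validated and has the most literature support	• Pyrosequencing chemistry, with a high error rate in homopolymer regions • Cannot accurately call runs of more than 8 identical bases
System value and running costs	• The instrument price includes all hardware needed for sequencing runs and data analysis, with no hidden costs • Depending on read length, running cost is under $1000 • Lowest cost per Mb of data compared with GS FLX/Junior	Besides the price of the GS FLX/Junior instrument itself, the following accessory equipment is required: • BioAnalyzer 2100: $20-30K • TBS 380 Fluorometer: $10K • Coulter Counter: $18K • emPCR Preparation System: >$30K • Running cost $1000-10000 • GS FLX additionally requires a separately purchased compute server
Applications	• Amplicon sequencing; targeted resequencing • Small genome sequencing (<20 Mb de novo per run) • ChIP-Seq; RNA-Seq; metagenomics (16S) • cDNA/transcriptome sequencing	• Amplicon sequencing • Small genome sequencing (<10 Mb de novo per run for GS FLX; <2 Mb de novo per run for GS Junior) • Metagenomics/pathogen detection • cDNA/transcriptome sequencing
Targeted resequencing	Illumina offers a broad, optimized range of targeted resequencing solutions: • PCR amplicons: fast Nextera-based library preparation (90 minutes) and 36 bp sequencing suit the sequencing of dozens of regions; regions and samples can be multiplexed • Highly multiplexed amplicons and samples: a workflow as short as 8 hours with the TruSeq Custom Amplicon Kit (hundreds of targeted regions per sample), tens of times cheaper than capillary electrophoresis sequencing • TruSeq Custom Enrichment by hybridization capture (1-10 Mb): currently the most cost-effective method, enriching from pooled samples up front for an efficient workflow and shorter run time • Illumina supplies kits supporting all of the above	• Requires third-party reagents and instruments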

Illumina MiSeq vs. Ion Torrent PGM performance comparison

	Illumina MiSeq	Ion Torrent PGM
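Turnaround time	Offers the fastest second-generation sequencing workflow: from DNA sample to analyzed data in 8 hours, 5 times faster than Ion Torrent. The workflow includes: • Library preparation: 1.5 hours, using the fast, transposon-based Nextera method • Automated cluster generation through sequencing on a single instrument in under 4.5 hours (1 x 36 bp) • Primary and secondary analysis on a single instrument in under 2 hours • A 2 x 150 bp run takes about 27 hours*	The complete workflow takes several days: • Library preparation: 1 day, with extensive manual work • ePCR: performed off-instrument with extensive manual work; 3 days manually, 1-2 days semi-automated • Sequencing: 2 hours • Off-instrument primary and secondary data analysis: 1 day, using third-party software and a $16.5k server
Workflow	• Instrument operation is controlled conveniently via a touch screen • The simplest sample preparation: the Nextera library preparation method needs only 15 minutes of hands-on time • Plug-and-play reagents: pre-mixed and pre-filled; thaw, load into the instrument and run • Cluster generation completed on the same instrument within 1 hour, with no complex emulsion PCR • Base calling, alignment and variant calling on the same instrument • RFID-tracked consumables: information is loaded into the instrument automatically • Only 20 minutes of hands-on time for the whole workflow from cluster generation to data analysis	• Complex workflow: spans several instruments, with more opportunities for error • Time-consuming library preparation: limited options • Emulsion PCR is a complex and difficult amplification step: complicated to perform and error-prone • No paired-end sequencing, which greatly limits the instrument's usability and range of applications • Manual data analysis: users must convert output file formats before variant calling or other biologically meaningful analysis
Throughput	The highest-throughput personal sequencer: • >1 Gb of data and >6 M paired-end reads per run • Throughput and read count suit a broad range of important applications, such as targeted/amplicon sequencing, small-genome sequencing, ChIP-Seq and small RNA	• Planned to reach 100 Mb throughput in Q1 2011 (currently 10 Mb); the limited throughput rules out many important applications • Limited read count, unsuitable for counting-based applications such as small RNA and ChIP-Seq
Chemistry	• Illumina SBS chemistry delivers the highest yield of perfect reads and of bases above Q30 • Read quality is not affected by homopolymer regions • Among second-generation technologies, Illumina SBS chemistry is the most widely validated and has the most literature support	• Chemistry similar to pyrosequencing and 454, with a high error rate in homopolymer regions • Unvalidated technology with no literature support
Paired-end support	• MiSeq supports 2 x 150 bp paired-end reads, enabling important applications such as de novo assembly of small genomes, structural-variant detection and mapping reads to repetitive regions • Overlapping 2 x 150 bp reads can be merged into highly accurate, highly sensitive "contiguous" reads of up to 275 bp	No paired-end sequencing, which severely limits the instrument's usability and range of applications
Range of applications	Flexible, supporting a broad range of traditional capillary electrophoresis and second-generation sequencing applications: • Amplicon sequencing • Targeted resequencing • Clone checking • Small-genome sequencing (de novo/resequencing, <20 Mb) • ChIP-Seq • Small RNA sequencing • Metagenomics (16S) • And more	Limited data output, low accuracy and no paired-end support make its applications very restricted
System value	• The instrument price includes all hardware needed for sequencing runs and data analysis, with no hidden costs • Depending on read length, running cost is under $1000 • Lowest cost per Mb of data compared with Ion Torrent	A series of accessory instruments is required besides the main unit: • Ion Torrent PGM: $50k • Torrent Server: $16.5k • Bioruptor: $20K, for DNA fragmentation • Sample Preparation Library Builder System (EZ Bead; exact price unknown) • Running cost above $8000 for the same amount of data
Instrument footprint	• The whole system fits comfortably on a standard lab bench: about 2 square feet • Integrated system for cluster generation, paired-end sequencing, and primary and secondary analysis • No emulsion PCR: no separate clean room or fume hood required	• Although the main unit is small, emulsion PCR needs considerable space • Based on emulsion PCR workflows, several rooms are needed • SOLiD 4 (2 required; 3 recommended) (SPG, 4/10): Room 1: library preparation (amplicon-free room); Room 2: emulsion PCR, bead, and slide preparation; Room 3: sequencing room • Roche FLX (4 rooms) (SPG, 4/09): Room 1: library preparation (amplicon-free room); Room 2: DNA library capture, emulsification, emulsion dispense; Room 3: ePCR amplification, bead recovery and enrichment, primer annealing; Room 4: sequencing
Targeted resequencing	Illumina offers a broad, optimized range of targeted resequencing solutions: • PCR amplicons: fast Nextera-based library preparation (90 minutes) and 36 bp sequencing suit the sequencing of dozens of regions; regions and samples can be multiplexed • Highly multiplexed amplicons and samples: a workflow as short as 8 hours with the TruSeq Custom Amplicon Kit (hundreds of targeted regions per sample), more than 10 times cheaper than capillary electrophoresis sequencing • TruSeq Custom Enrichment by hybridization capture (1-10 Mb): currently the most cost-effective method, enriching from pooled samples up front for an efficient workflow and shorter run time • Illumina supplies kits supporting all of the above	• Requires third-party reagent support, making optimized results hard to guarantee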

RazerS 3: Faster, fully sensitive read mapping

szypanther — Thu, 30 Aug 2012 05:37:42 +0000

Motivation: During the last years NGS sequencing has become a key technology for many applications in the biomedical sciences. Throughput continues to increase and new protocols provide longer reads than currently available. In almost all applications, read mapping is a first step. Hence, it is crucial to have algorithms and implementations that perform fast, with high sensitivity, and are able to deal with long reads and a large absolute number of indels.

Results: RazerS is a read mapping program with adjustable sensitivity based on counting q-grams. In this work we propose the successor RazerS 3 which now supports shared-memory parallelism, an additional seed-based filter with adjustable sensitivity, a much faster, banded version of the Myers’ bit-vector algorithm for verification, memory saving measures and support for the SAM output format. This leads to a much improved performance for mapping reads, in particular long reads with many errors. We extensively compare RazerS 3 with other popular read mappers and show that its results are often superior to them in terms of sensitivity while exhibiting practical and often competitive run times. In addition, RazerS 3 works without a precomputed index.
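As background on "counting q-grams", here is a generic Python sketch of a q-gram counting filter built on the q-gram lemma: a read of length n with at most k errors shares at least n + 1 - q(k + 1) q-grams with its true match. This is a textbook illustration and not the RazerS 3 implementation. The function names and sequences are invented.

```python
from collections import Counter

def qgrams(s, q):
    """All overlapping substrings of length q."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def qgram_threshold(n, q, k):
    """q-gram lemma: minimum shared q-grams for length n with at most k errors."""
    return n + 1 - q * (k + 1)

def shared_qgrams(read, window, q):
    """Count q-grams common to both strings, respecting multiplicity."""
    a, b = Counter(qgrams(read, q)), Counter(qgrams(window, q))
    return sum((a & b).values())

def passes_filter(read, window, q, k):
    """True if the window cannot be ruled out and needs full verification."""
    return shared_qgrams(read, window, q) >= qgram_threshold(len(read), q, k)

read = "ACGTACGTAC"
print(passes_filter(read, "ACGTACGAAC", q=3, k=1))  # one mismatch: kept
print(passes_filter(read, "TTTTTTTTTT", q=3, k=1))  # unrelated: discarded
```

Windows that pass still have to be checked with a real alignment. The filter only discards regions that cannot contain a match within k errors.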

Availability and Implementation: Source code and binaries are freely available for download at http://www.seqan.de/projects/razers. RazerS 3 is implemented in C++ and OpenMP under a GPL license using the SeqAn library and supports Linux, Mac OS X, and Windows.

Contact: david.weese@fu-berlin.de

Qualimap: evaluating next generation sequencing alignment data

szypanther — Thu, 30 Aug 2012 05:27:12 +0000

Motivation: The sequence alignment/map (SAM) and the binary alignment/map (BAM) formats have become the standard method of representation of nucleotide sequence alignments for next-generation sequencing data. SAM/BAM files usually contain information from tens to hundreds of millions of reads. Often, the sequencing technology, protocol, and/or the selected mapping algorithm introduce some unwanted biases in these data. The systematic detection of such biases is a non-trivial task that is crucial to drive appropriate downstream analyses.

Results: We have developed Qualimap, a Java application that supports user-friendly quality control of mapping data, by considering sequence features and their genomic properties. Qualimap takes sequence alignment data and provides graphical and statistical analyses for the evaluation of data. Such quality-control data are vital for highlighting problems in the sequencing and/or mapping processes, which must be addressed prior to further analyses.

Availability: Qualimap is freely available from http://www.qualimap.org

Third-generation sequencing technology

szypanther — Thu, 23 Aug 2012 01:50:50 +0000

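If someone told you that a microscope could watch a single DNA polymerase molecule replicate DNA in real time, and that this could be used for sequencing, you would surely think they were dreaming and had no sense of biology at all. That is what I thought at first, yet it is not only possible, it has already been done. This is what is called third-generation sequencing: Pacific Biosciences' "Single Molecule Real Time (SMRT™) DNA Sequencing". I was fortunate to attend a talk at the NIH by Dr. Stephen Turner, the inventor of the technology, and these are notes based on my own limited understanding.

Single-molecule real-time sequencing rests on three key technologies.
The first is fluorescently labeled deoxynucleotides. However good microscopes get, they cannot literally see a "single molecule" in real time, but they can record changes in fluorescence intensity in real time. When a fluorescently labeled deoxynucleotide is incorporated into the DNA strand, its fluorescence can be detected on the strand. When it forms the chemical bond with the DNA strand, its fluorophore is cleaved off by the DNA polymerase and the fluorescence disappears. These labeled nucleotides do not impair polymerase activity, and once the fluorophore is removed the synthesized DNA strand is identical to natural DNA.
The second is nanoscale wells. While the microscope records fluorescence on the DNA strand in real time, the many labeled nucleotides surrounding the strand create a very strong fluorescent background that makes single-molecule detection impossible. Pacific Biosciences invented a nanowell only tens of nanometers in diameter, the zero-mode waveguide (ZMW), with a single DNA polymerase molecule immobilized inside. In such a small well few labeled nucleotides surround the DNA strand, and because the four labeled nucleotides (A, T, C, G) diffuse in and out of the well very rapidly, they produce a very stable background signal. When one labeled nucleotide is incorporated into the strand, fluorescence of that specific color persists for a short time, until the new bond forms and the fluorophore is cleaved off by the polymerase (see figure).
The third is a confocal microscope that records the countless nanowells integrated on a chip simultaneously, rapidly and in real time. Since my knowledge of the physics of microscopy is limited, and Pacific Biosciences did not place much emphasis on its inventions in this area, I will not go further.

They have also optimized the technology further.
The first optimization is to circularize double-stranded DNA and sequence it repeatedly. Hairpin DNA adaptors are ligated to both ends of the double-stranded DNA, circularizing it. The polymerase then uses the circular DNA as a template for rolling-circle replication, reading the same stretch of DNA over and over. Repeated reads correct the occasional replication error, giving very high accuracy.
The second is sequencing with interrupted excitation light. Although the DNA polymerase is quite stable, it has a limited lifetime under intense excitation light. If the excitation light is switched off for a while, the polymerase keeps replicating during that time, and when the light is switched back on, sequence further along a long DNA molecule can be read.

Third-generation sequencing is formidable. 1. It runs at the intrinsic speed of the DNA polymerase: 10 bases per second, 20,000 times faster than chemical sequencing. 2. It exploits the intrinsic processivity of the polymerase (that is, its ability to synthesize a very long fragment in one go), so a single reaction can read a very long sequence. Second-generation sequencing can now read over a hundred bases, whereas third-generation sequencing can already read several thousand. This greatly helps the assembly of repetitive regions of the genome. 3. Its accuracy is very high, reportedly reaching 99.9999%.
In addition, it has two applications that second-generation sequencing lacks.
The first is direct sequencing of RNA. Since a DNA polymerase can be observed in real time, so can a reverse transcriptase that copies DNA from an RNA template. Direct RNA sequencing would greatly reduce the systematic error introduced by in vitro reverse transcription.
The second is direct sequencing of methylated DNA. The polymerase actually copies A, T, C and G at different speeds, and it pauses for a different length of time depending on whether the template C is unmodified or methylated. From this difference one can tell whether the template C is methylated.

Pacific Biosciences expects to release a commercial instrument in 2010 or 2011. In the not-too-distant future, if they can integrate one million nanowells as second-generation platforms do, a single instrument could accurately sequence a person's genome in 15 minutes. The cost of sequencing each person's genome would then fall to 100 US dollars, affordable for everyone. Considering that the Human Genome Project cost 3 billion US dollars, took more than a decade and involved countless scientists, how significant technological innovation is!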
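The 15-minute estimate can be checked with some quick arithmetic in Python. The speed, well count and run time are the figures quoted in the post. The genome size of about 3.2 Gb is an added assumption, and the calculation idealizes by treating every well as productive for the whole run.

```python
BASES_PER_SECOND = 10        # per polymerase, figure quoted in the post
WAVEGUIDES = 1_000_000       # hypothetical chip size from the post
RUN_SECONDS = 15 * 60        # 15-minute run
HUMAN_GENOME_BASES = 3.2e9   # assumption: approximate haploid genome size

raw_bases = BASES_PER_SECOND * WAVEGUIDES * RUN_SECONDS
coverage = raw_bases / HUMAN_GENOME_BASES
print(f"{raw_bases:.1e} bases, about {coverage:.1f}x coverage")
```

Under these assumptions a 15-minute run yields roughly 9 Gb of raw sequence, a little under 3x coverage of a human genome, so the quoted figure corresponds to a low-coverage pass.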

Company link: http://www.pacificbiosciences.com/