小生这厢有礼了(BioFaceBook Personal Blog) » metagenome

The SILVA and “All-species Living Tree Project (LTP)” taxonomic frameworks

szypanther — Fri, 06 Dec 2013 07:45:24 +0000

SILVA (from Latin silva, forest, http://www.arb-silva.de) is a comprehensive resource for up-to-date quality-controlled databases of aligned ribosomal RNA (rRNA) gene sequences from the Bacteria, Archaea and Eukaryota domains and supplementary online services. SILVA provides a manually curated taxonomy for all three domains of life, based on representative phylogenetic trees for the small- and large-subunit rRNA genes. This article describes the improvements the SILVA taxonomy has undergone in the last 3 years. Specifically, we focus on the curation process, the various resources used for curation, and the comparison of the SILVA taxonomy with the Greengenes and RDP-II taxonomies. Our comparisons not only revealed a reasonable overlap between the taxa names, but also pointed to significant differences in both names and numbers of taxa between the three resources.

GC skew in microbial genomes (repost)

szypanther — Mon, 29 Apr 2013 03:11:09 +0000

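Given the two keywords "bioinformatics" and "GC", many people would probably think first of GC content or CpG islands. Over the past two weeks I have been working on non-coding RNA prediction (in Sinorhizobium meliloti) and came across a "GC theory" I had not heard of before: GC skew. Searching the Chinese-language literature, I found almost no detailed introduction to it, and there is no established Chinese term either. Since "skew" means slanted or biased, and based on my understanding of the idea, I render GC-skew as "GC偏移" (GC offset). Here I translate a Review from Nature to share with everyone.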

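GC skew in microbial genomes
In most bacterial genomes, the leading strand and the lagging strand differ markedly in base composition: the leading strand is enriched in G and T, whereas the lagging strand contains more A and C. These departures from the base frequencies A=T and C=G are called "AT skew" and "GC skew". Because GC skew is usually more pronounced than AT skew, it is usually the only one considered. One way to measure GC skew is to slide a window along the genome sequence, compute (G-C)/(G+C) in each window, and plot the values. The formula gives the excess of G over C as a fraction of the total G+C: a positive value indicates the leading strand and a negative value the lagging strand.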
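The sliding-window calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not code from the review: the function name `gc_skew`, the window and step sizes, and the toy sequence are all assumptions made for the example.

```python
def gc_skew(seq, window=1000, step=1000):
    """Return (window_start, skew) pairs, where skew = (G - C) / (G + C)."""
    seq = seq.upper()
    result = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        g = chunk.count("G")
        c = chunk.count("C")
        # A window with no G or C has an undefined skew; report 0.0 here.
        skew = (g - c) / (g + c) if (g + c) else 0.0
        result.append((start, skew))
    return result


# Toy sequence: a G-rich half followed by a C-rich half.
toy = "GGGGTTGGAT" * 3 + "CCCCAACCTA" * 3
print(gc_skew(toy, window=10, step=10))
# [(0, 1.0), (10, 1.0), (20, 1.0), (30, -1.0), (40, -1.0), (50, -1.0)]
```

On a real genome the window would be thousands of bases wide, and the sign change in the second column is what marks the switch between strands.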

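(Image source: Nature.com)
What causes GC skew? We still know very little about it. One possibility is that the leading and lagging strands spend different amounts of time in single-stranded form during replication, so they are exposed to different DNA-damaging conditions and hence to different mutational pressures. Because T-G and G-T mispairs are more frequent than C-A and A-C mispairs, the more error-prone strand may become relatively enriched in G and T. Another theory rests on the hydrolytic deamination of cytosine, which occurs predominantly in single-stranded DNA. The asymmetric structure of the replication fork leaves the lagging-strand template transiently single-stranded, making it more prone to cytosine deamination. Deamination of cytosine yields uracil, which pairs with adenine during replication, in effect causing a C-to-T mutation. C-to-T deamination therefore increases the relative G and T content of that strand and the relative C and A content of its complementary strand.
Why is GC-skew analysis important? Because GC skew is positive on the leading strand and negative on the lagging strand, a change in its sign marks where the leading strand begins or ends and switches to the lagging strand, and vice versa. This makes GC skew a useful tool for locating the origin and terminus of replication in circular chromosomes. Conspicuous local changes in the plot can flag events such as recent inversions by recombination or the integration of foreign DNA. Loss of DNA does not change the basic shape of the GC-skew curve, although recently acquired foreign DNA may affect the local variation.
In practice, plots of GC skew suffer from local fluctuations, so it is better to use the cumulative GC skew: the sum of the GC-skew values of adjacent sliding windows from an arbitrary start point up to a given position. The figure shows the GC skew and the cumulative GC skew for the genome of Wolinella succinogenes DSM1740, and illustrates how the GC skew changes sign at the replication origin and terminus. The cumulative GC skew reaches its extremes at these positions: a minimum at the origin and a maximum at the terminus.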
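As a rough illustration of the cumulative GC skew, the sketch below sums per-window skew values and reports where the running total is lowest and highest. The helper names (`window_skews`, `cumulative_skew`, `extrema`) and the toy sequence are assumptions for the example; a real analysis would use a whole chromosome and much larger windows.

```python
from itertools import accumulate


def window_skews(seq, window):
    """GC skew (G - C) / (G + C) for consecutive non-overlapping windows."""
    seq = seq.upper()
    skews = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        g, c = chunk.count("G"), chunk.count("C")
        skews.append((g - c) / (g + c) if (g + c) else 0.0)
    return skews


def cumulative_skew(skews):
    """Running sum of the per-window skew values."""
    return list(accumulate(skews))


def extrema(cumulative):
    """Indices of the windows where the cumulative skew is lowest and highest."""
    lowest = min(range(len(cumulative)), key=cumulative.__getitem__)
    highest = max(range(len(cumulative)), key=cumulative.__getitem__)
    return lowest, highest


# Toy "chromosome": C-rich, then G-rich, then C-rich again.
toy = "CCCA" * 3 + "GGGT" * 6 + "CCCA" * 3
cum = cumulative_skew(window_skews(toy, window=4))
print(cum)           # falls to -3.0, rises to 3.0, then falls back to 0.0
print(extrema(cum))  # (2, 8)
```

In this toy case the minimum (window 2) sits just before the skew turns positive and the maximum (window 8) just before it turns negative again, which mirrors how the origin and terminus are read off a real cumulative plot.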

Source: http://www.nature.com/nrmicro/journal/v2/n11/box/nrmicro1024_BX1.html

RDP Tutorials (16S Analysis)

szypanther — Wed, 12 Sep 2012 07:43:23 +0000

Workflows:

Processing 16S rRNA data using an unsupervised method

Processing 16S rRNA data using a supervised method

Processing functional gene data using a supervised method

Individual tools:

Using the Pipeline Initial Process

Align 16S rRNA sequences using Infernal Aligner

Using the RDP Classifier

Using the RDP MultiClassifier

Performing Complete Linkage Clustering

–Using the .clust File Results (for abundance stats, diversity stats, OTU matrix or rarefaction)

Performing statistical analysis (coming soon)

Align protein using HMMER3 Aligner

Frameshift-correction and closest match assignment by RDP FrameBot

454 pyrosequencing analysis pipeline

szypanther — Thu, 16 Aug 2012 08:27:16 +0000

mothur > sffinfo(sff=454Reads_archaea.sff, flow=T)
Extracting info from 454Reads_archaea.sff …
10000
20000
30000
40000
50000
60000
70000
80000
90000
92115
It took 68 secs to extract 92115.
Output File Names:
454Reads_archaea.fasta
454Reads_archaea.qual
454Reads_archaea.flow

mothur > trim.flows(flow=454Reads_archaea.flow, oligos=oligos_LXY.txt, pdiffs=2, bdiffs=1, processors=2)
Appending files from process 15674

Output File Names:
454Reads_archaea.trim.flow
454Reads_archaea.scrap.flow
454Reads_archaea.GZ_ARC.flow
454Reads_archaea.GZ1122_ARC.flow
454Reads_archaea.GZ1122cellulose_ARC.flow
454Reads_archaea.GZ_xylan_ARC.flow
454Reads_archaea.GZ_cellulose55_ARC.flow
454Reads_archaea.SHX_xylan_ARC.flow
454Reads_archaea.GZ_xylose_ARC.flow
454Reads_archaea.Eric_ARC.flow
454Reads_archaea.Milk_D_ARC.flow
454Reads_archaea.Milk_E_ARC.flow
454Reads_archaea.ST1219_ARC.flow
454Reads_archaea.YL_ARC.flow
454Reads_archaea.SHX_xylose_ARC.flow
454Reads_archaea.SHX_cellulose55_ARC.flow
454Reads_archaea.TP_1201_ARC.flow
454Reads_archaea.ST_ARC.flow
454Reads_archaea.YL0203cellulose_ARC.flow
454Reads_archaea.TP_xylan_ARC.flow
454Reads_archaea.ST0303cellulose_ARC.flow
454Reads_archaea.SHX_ARC.flow
454Reads_archaea.ST_xylan_ARC.flow
454Reads_archaea.YL_xylan_ARC.flow
454Reads_archaea.SHX1219_ARC.flow
454Reads_archaea.SHX1125cellulose_ARC.flow
454Reads_archaea.flow.files

mothur > shhh.flows(file=454Reads_archaea.flow.files, processors=4)

mothur > trim.seqs(fasta=454Reads_archaea.shhh.fasta, name=454Reads_archaea.shhh.names, oligos=oligos_LXY.txt, pdiffs=2, bdiffs=1, maxhomop=8, minlength=150, flip=T, processors=2)

Total of all groups is 44091

Output File Names:
454Reads_archaea.shhh.trim.fasta
454Reads_archaea.shhh.scrap.fasta
454Reads_archaea.shhh.trim.names
454Reads_archaea.shhh.scrap.names
454Reads_archaea.shhh.groups

mothur > summary.seqs(fasta=454Reads_archaea.shhh.trim.fasta, name=454Reads_archaea.shhh.trim.names)
Using 2 processors.
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 218 218 0 3 1
2.5%-tile: 1 251 251 0 3 1103
25%-tile: 1 268 268 0 4 11023
Median: 1 274 274 0 4 22046
75%-tile: 1 281 281 0 4 33069
97.5%-tile: 1 297 297 0 5 42989
Maximum: 1 333 333 0 8 44091
Mean: 1 273.837 273.837 0 4.15944
# of unique seqs: 12780
total # of seqs: 44091

Output File Name:
454Reads_archaea.shhh.trim.summary

mothur > unique.seqs(fasta=454Reads_archaea.shhh.trim.fasta, name=454Reads_archaea.shhh.trim.names)

1000 959
2000 1691
3000 2431
4000 3358
5000 4352
6000 5335
7000 6328
8000 7261
9000 8187
10000 9082
11000 9963
12000 10859
12780 11449

Output File Names:
454Reads_archaea.shhh.trim.unique.fasta
454Reads_archaea.shhh.trim.unique.names

mothur > summary.seqs(fasta=454Reads_archaea.shhh.trim.unique.fasta, name=454Reads_archaea.shhh.trim.unique.names)
Using 2 processors.

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 218 218 0 3 1
2.5%-tile: 1 251 251 0 3 1103
25%-tile: 1 268 268 0 4 11023
Median: 1 274 274 0 4 22046
75%-tile: 1 281 281 0 4 33069
97.5%-tile: 1 297 297 0 5 42989
Maximum: 1 333 333 0 8 44091
Mean: 1 273.837 273.837 0 4.15944
# of unique seqs: 11449
total # of seqs: 44091

Output File Name:
454Reads_archaea.shhh.trim.unique.summary

Submit the sequences to the RDP Classifier, then check for and filter out bacterial sequences:

http://rdp.cme.msu.edu/classifier/cl_status.jsp

domain Bacteria (1435 sequences)

shenzy@shenzy-ubuntu:/winxp_disk2/shenzy/xiaoying_work_archaea/16s_archaea_allsamples/shhh_pipe2$ wc allrank_454Reads_archaea.shhh.trim.unique.fasta_classified.txt
1435 1520 200784 allrank_454Reads_archaea.shhh.trim.unique.fasta_classified.txt

./filter_bacterseqs_for_align.py -i allrank_454Reads_archaea.shhh.trim.unique.fasta_classified.txt -f 454Reads_archaea.shhh.trim.unique.fasta -n 454Reads_archaea.shhh.trim.unique.names -g 454Reads_archaea.shhh.groups

shenzy@shenzy-ubuntu:/winxp_disk2/shenzy/xiaoying_work_archaea/16s_archaea_allsamples/shhh_pipe2$ wc 454Reads_archaea.shhh.groups.filter
42187 84374 1224728 454Reads_archaea.shhh.groups.filter
shenzy@shenzy-ubuntu:/winxp_disk2/shenzy/xiaoying_work_archaea/16s_archaea_allsamples/shhh_pipe2$ wc 454Reads_archaea.shhh.groups
44091 88182 1270431 454Reads_archaea.shhh.groups
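The source of filter_bacterseqs_for_align.py is not shown here, so the following is only a guess at the kind of logic involved: find the IDs that the RDP Classifier assigned to Bacteria, expand them to all duplicate reads through the mothur names file, and drop those reads from the groups file. The function names are hypothetical, and the field layout is assumed from the allrank lines printed later in this post (id;;Root;conf;domain;conf;...).

```python
def bacterial_ids(classified_lines):
    """IDs whose domain-level assignment is Bacteria (assumed field layout)."""
    ids = set()
    for line in classified_lines:
        fields = line.rstrip("\n").split(";")
        # Assumed layout: id, '', 'Root', conf, domain, conf, phylum, ...
        if len(fields) > 4 and fields[4].strip('"') == "Bacteria":
            ids.add(fields[0])
    return ids


def expand_with_duplicates(ids, names_lines):
    """mothur names file: representative<TAB>comma-separated member reads."""
    removed = set()
    for line in names_lines:
        rep, members = line.rstrip("\n").split("\t")
        if rep in ids:
            removed.update(members.split(","))
    return removed


def filter_groups(groups_lines, removed):
    """Keep only group-file lines whose read name was not removed."""
    return [l for l in groups_lines if l.split("\t")[0] not in removed]


# Toy data with made-up read names.
classified = ['r1;;Root;100%;Bacteria;99%;"Firmicutes";90%',
              'r2;;Root;100%;Archaea;100%;"Euryarchaeota";100%']
names = ["r1\tr1,r3", "r2\tr2"]
groups = ["r1\tA", "r2\tA", "r3\tB"]

removed = expand_with_duplicates(bacterial_ids(classified), names)
print(filter_groups(groups, removed))  # ['r2\tA']
```

Expanding through the names file is what would explain the wc counts above: 1435 unique bacterial sequences correspond to 44091 - 42187 = 1904 reads in the groups file.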

mothur > summary.seqs(fasta=454Reads_archaea.shhh.trim.unique.fasta.filter, name=454Reads_archaea.shhh.trim.unique.names.filter)
Using 2 processors.

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 218 218 0 3 1
2.5%-tile: 1 256 256 0 3 1055
25%-tile: 1 268 268 0 4 10547
Median: 1 274 274 0 4 21094
75%-tile: 1 282 282 0 4 31641
97.5%-tile: 1 297 297 0 5 41133
Maximum: 1 333 333 0 8 42187
Mean: 1 274.543 274.543 0 4.15355
# of unique seqs: 10014
total # of seqs: 42187
mothur > screen.seqs(fasta=454Reads_archaea.shhh.trim.unique.fasta.align, name=454Reads_archaea.shhh.trim.unique.names.filter, group=454Reads_archaea.shhh.groups.filter, processors=2)
Output File Name:
454Reads_archaea.shhh.trim.unique.fasta.summary
###mothur > align.seqs(candidate=454Reads_archaea.shhh.trim.unique.fasta.filter, template=core_set_aligned.imputed.fasta, flip=T, ksize=9, align=needleman, gapopen=-1, processors=3)
###

mothur > align.seqs(fasta=454Reads_archaea.shhh.trim.unique.fasta.filter, reference=core_set_aligned.fasta.imputed, flip=T, processors=3)
Using 3 processors.

Reading in the core_set_aligned.fasta.imputed template sequences… DONE.
It took 1 to read 4938 sequences.
Aligning sequences from 454Reads_archaea.shhh.trim.unique.fasta.filter …
100
…
3338
Some of you sequences generated alignments that eliminated too many bases, a list is provided in 454Reads_archaea.shhh.trim.unique.fasta.flip.accnos. If the reverse compliment proved to be better it was reported.
It took 60 secs to align 10014 sequences.
Output File Names:
454Reads_archaea.shhh.trim.unique.fasta.align
454Reads_archaea.shhh.trim.unique.fasta.align.report
454Reads_archaea.shhh.trim.unique.fasta.flip.accnos

mothur > summary.seqs(fasta=454Reads_archaea.shhh.trim.unique.fasta.align, name=454Reads_archaea.shhh.trim.unique.names.filter)
Using 3 processors.

Start End NBases Ambigs Polymer NumSeqs
Minimum: 86 98 2 0 1 1
2.5%-tile: 132 1746 51 0 3 1055
25%-tile: 136 1822 268 0 4 10547
Median: 136 1834 274 0 4 21094
75%-tile: 136 1850 282 0 4 31641
97.5%-tile: 194 1887 297 0 5 41133
Maximum: 6858 6885 313 0 8 42187
Mean: 284.168 1920.46 266.145 0 4.10781
# of unique seqs: 10014
total # of seqs: 42187

Output File Name:
454Reads_archaea.shhh.trim.unique.fasta.summary
##mothur > screen.seqs(fasta=454Reads_archaea.shhh.trim.unique.fasta.align, name=454Reads_archaea.shhh.trim.unique.names.filter, group=454Reads_archaea.shhh.groups.filter, ##start=136, optimize=end, criteria=90, processors=2)
#The optimize and criteria parameters allow you to set the start, end, maxambig, maxhomop, minlength and maxlength parameters relative to your set of sequences.
#For example, optimize=start-end, criteria=90 would set the start and end values to the positions at which 90% of your sequences started and ended.

mothur > screen.seqs(fasta=454Reads_archaea.shhh.trim.unique.fasta.align, name=454Reads_archaea.shhh.trim.unique.names.filter, group=454Reads_archaea.shhh.groups.filter, optimize=start-end, criteria=90, processors=4)
…
Output File Names:
454Reads_archaea.shhh.trim.unique.fasta.good.align
454Reads_archaea.shhh.trim.unique.fasta.bad.accnos
454Reads_archaea.shhh.trim.unique.names.good.filter
454Reads_archaea.shhh.groups.good.filter
It took 4 secs to screen 10014 sequences.
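One way to read optimize=start-end, criteria=90 is as a pair of percentile cutoffs: the start cutoff is the alignment position by which 90% of sequences have started, and the end cutoff is the position that 90% of sequences reach. The sketch below implements that reading on toy coordinates. It is an interpretation, not mothur's code; the function names are hypothetical, and it ignores the weighting by the names file that mothur applies.

```python
def percentile_cutoffs(records, criteria=90):
    """records: non-empty list of (start, end) alignment coordinates.

    Returns (start_cutoff, end_cutoff) such that at least `criteria` percent
    of records start at or before start_cutoff and end at or after end_cutoff.
    """
    n = len(records)
    k = (n * criteria + 99) // 100           # ceil(n * criteria / 100)
    starts = sorted(s for s, _ in records)
    ends = sorted(e for _, e in records)
    return starts[k - 1], ends[n - k]


def screen(records, criteria=90):
    """Keep records that span the whole cutoff interval."""
    start_cut, end_cut = percentile_cutoffs(records, criteria)
    return [(s, e) for s, e in records if s <= start_cut and e >= end_cut]


# Toy coordinates loosely modelled on the summary table above.
toy = [(136, 1834)] * 8 + [(194, 1834), (136, 1746)]
print(percentile_cutoffs(toy))  # (136, 1834)
print(len(screen(toy)))         # 8
```

This reading is at least consistent with the summary that follows, where every retained sequence starts at or before position 136 and ends at or after 1819.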

mothur > summary.seqs(fasta=454Reads_archaea.shhh.trim.unique.fasta.good.align, name=454Reads_archaea.shhh.trim.unique.names.good.filter)

Using 4 processors.

Start End NBases Ambigs Polymer NumSeqs
Minimum: 107 1819 243 0 3 1
2.5%-tile: 133 1821 263 0 3 925
25%-tile: 136 1831 269 0 4 9242
Median: 136 1836 274 0 4 18484
75%-tile: 136 1853 283 0 4 27725
97.5%-tile: 136 1871 298 0 5 36042
Maximum: 136 1920 313 0 8 36966
Mean: 135.731 1840.24 276.401 0 4.14224
# of unique seqs: 7703
total # of seqs: 36966

Output File Name:
454Reads_archaea.shhh.trim.unique.fasta.good.summary

mothur > filter.seqs(fasta=454Reads_archaea.shhh.trim.unique.fasta.good.align, vertical=T, trump=., processors=2)
3700
3800
3851

Length of filtered alignment: 486
Number of columns removed: 7196
Length of the original alignment: 7682
Number of sequences used to construct filter: 7703

Output File Names:
454Reads_archaea.filter
454Reads_archaea.shhh.trim.unique.fasta.good.filter.fasta
mothur > unique.seqs(fasta=454Reads_archaea.shhh.trim.unique.fasta.good.filter.fasta, name=454Reads_archaea.shhh.trim.unique.names.good.filter)

1000 974
2000 1887
3000 2768
4000 3604
5000 4424
6000 5238
7000 6017
7703 6573

Output File Names:
454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.fasta
454Reads_archaea.shhh.trim.unique.fasta.good.filter.names

mothur > shhh.seqs(fasta=454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.fasta, name=454Reads_archaea.shhh.trim.unique.fasta.good.filter.names, group=454Reads_archaea.shhh.groups.good.filter, processors=3)
Output File Names:
454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.unique.fasta
454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.unique.names

/******************************************/

Output File Names:
454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh.Eric_ARC.map
454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh.GZ1122_ARC.map
454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh.GZ1122cellulose_ARC.map
…….

mothur > summary.seqs(fasta=454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.fasta, name=454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.names)
Using 2 processors.

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 484 219 0 3 1
2.5%-tile: 1 486 260 0 3 925
25%-tile: 1 486 261 0 4 9242
Median: 1 486 261 0 4 18484
75%-tile: 1 486 266 0 4 27725
97.5%-tile: 1 486 266 0 5 36042
Maximum: 3 486 282 0 7 36966
Mean: 1.00103 486 262.434 0 4.1387
# of unique seqs: 2911
total # of seqs: 36966

Output File Name:
454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.summary

mothur > chimera.uchime(fasta=454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.fasta, name=454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.names, group=454Reads_archaea.shhh.groups.good.filter, processors=3)
It took 0 secs to check 46 sequences from group YL_xylan_ARC.

It took 43 secs to check 3276 sequences. 362 chimeras were found.
The number of sequences checked may be larger than the number of unique sequences because some sequences are found in several samples.

Output File Names:
454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.uchime.chimeras
454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.uchime.accnos
################################
shenzy@shenzy-ubuntu:/winxp_disk2/shenzy/xiaoying_work_archaea/16s_archaea_allsamples/shhh_pipe2$ mv 454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.uchime.accnos 454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.uchime.accnos.self
shenzy@shenzy-ubuntu:/winxp_disk2/shenzy/xiaoying_work_archaea/16s_archaea_allsamples/shhh_pipe2$ mv 454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.uchime.chimeras 454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.uchime.chimeras.self
################################

mothur > chimera.uchime(fasta=454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.fasta, reference=core_set_aligned.fasta.imputed, processors=3)
05:04 26Mb 100.0% 30/969 chimeras found (3.1%)
05:11 26Mb 100.0% 88/970 chimeras found (9.1%)

It took 311 secs to check 2911 sequences. 213 chimeras were found.

Output File Names:
454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.uchime.chimeras
454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.uchime.accnos

###################################
shenzy@shenzy-ubuntu:/winxp_disk2/shenzy/xiaoying_work_archaea/16s_archaea_allsamples/shhh_pipe2$ wc 454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.uchime.accnos
213 213 3195 454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.uchime.accnos
shenzy@shenzy-ubuntu:/winxp_disk2/shenzy/xiaoying_work_archaea/16s_archaea_allsamples/shhh_pipe2$ wc 454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.uchime.accnos.self
362 362 5430 454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.uchime.accnos.self
cat 454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.uchime.accnos 454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.uchime.accnos.self > 454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.uchime.accnos.sum

sort 454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.uchime.accnos.sum > 454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.uchime.accnos.sum.sort

Merge the two chimera prediction results and remove duplicate names:
shenzy@shenzy-ubuntu:/winxp_disk2/shenzy/xiaoying_work_archaea/16s_archaea_allsamples/shhh_pipe2$ sed '$!N; /^\(.*\)\n\1$/!P; D' 454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.uchime.accnos.sum.sort > 454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.uchime.accnos.sum.sort.uniq
shenzy@shenzy-ubuntu:/winxp_disk2/shenzy/xiaoying_work_archaea/16s_archaea_allsamples/shhh_pipe2$ wc 454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.uchime.accnos.sum.sort.uniq
382 382 5730 454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.uchime.accnos.sum.sort.uniq
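The cat, sort and sed steps above amount to taking the union of two name lists. As an alternative sketch in Python (the function name `merge_accnos` and the toy names are assumptions), the same result comes from a set:

```python
def merge_accnos(*name_lists):
    """Union of several accnos lists, one sequence name per entry, sorted."""
    merged = set()
    for names in name_lists:
        merged.update(n.strip() for n in names if n.strip())
    return sorted(merged)


# Toy example with made-up names; real input would be the lines of the
# reference-based and self-reference (.self) accnos files.
by_reference = ["seqB", "seqA"]
by_self = ["seqA", "seqC\n"]
print(merge_accnos(by_reference, by_self))  # ['seqA', 'seqB', 'seqC']
```

With the counts reported above, 213 + 362 = 575 names collapse to 382 unique ones, so 193 sequences were flagged by both approaches.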
####################################

get_fasta_from_seqname.py -i 454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.uchime.accnos.sum.sort.uniq -j 454Reads_archaea.fasta > 454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.uchime.accnos.sum.sort.uniq.fasta

Check the predicted chimeric sequences with the RDP Classifier (http://rdp.cme.msu.edu/classifier/classifier.jsp).
For each sequence, look at the confidence of the last (genus-level) assignment; if it is >= 90%, keep the sequence and merge it back into the non-chimera reads of its sample.
shenzy@shenzy-ubuntu:/winxp_disk2/shenzy/xiaoying_work_archaea/16s_archaea_allsamples/shhh_pipe2$ more allrank_454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.uchime.accnos.sum.sort.uniq.fasta_classified.txt
HQ93PQ301A0ZHW;;Root;100%;Archaea;100%;"Euryarchaeota";100%;"Methanomicrobia";100%;Methanomicrobiales;100%;Methanospirillaceae;100%;Methanospirillum;100%
HQ93PQ301A1JYN;;Root;100%;Archaea;100%;"Euryarchaeota";100%;"Methanomicrobia";100%;Methanosarcinales;100%;Methanosarcinaceae;100%;Methanosarcina;100%
shenzy@shenzy-ubuntu:/winxp_disk2/shenzy/xiaoying_work_archaea/16s_archaea_allsamples/shhh_pipe2$ check_real_chimera_seq.py -i allrank_454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.uchime.accnos.sum.sort.uniq.fasta_classified.txt -d 90 | wc
221 221 3315

These 221 sequences pass the check and should be merged back into the non-chimera results, which leaves 382 - 221 = 161 sequences to remove as chimeras.
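The source of check_real_chimera_seq.py is not included in this post. A minimal sketch of the genus-confidence check it appears to perform might look like the following; the function names are hypothetical, the line layout is assumed from the classifier lines printed above, and the toy IDs and percentages are made up.

```python
def genus_confidence(line):
    """Return (sequence id, confidence of the last listed rank) for one line."""
    fields = line.strip().rstrip(";").split(";")
    # The last field is assumed to be the genus-level confidence, e.g. '100%'.
    return fields[0], float(fields[-1].rstrip("%"))


def confident_ids(lines, cutoff=90.0):
    """IDs whose last-rank confidence is at least `cutoff` percent."""
    kept = []
    for line in lines:
        if not line.strip():
            continue
        seq_id, conf = genus_confidence(line)
        if conf >= cutoff:
            kept.append(seq_id)
    return kept


toy = [
    's1;;Root;100%;Archaea;100%;"Euryarchaeota";100%;Methanospirillum;100%',
    's2;;Root;100%;Archaea;100%;"Euryarchaeota";95%;Methanosarcina;42%',
]
print(confident_ids(toy, cutoff=90))  # ['s1']
```

Sequences returned by `confident_ids` would be treated as real and kept; the rest stay on the chimera list.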

-rwxrwxrwx 1 root root 2.4K 2012-08-14 11:19 chimera.seqs.name
shenzy@shenzy-ubuntu:/winxp_disk2/shenzy/xiaoying_work_archaea/16s_archaea_allsamples/shhh_pipe2$ wc chimera.seqs.name
161 161 2415 chimera.seqs.name

######################################################################
Removing chimeras (sequences predicted as chimeric by the two approaches and not rescued by the RDP check)
######################################################################
mothur > remove.seqs(accnos=chimera.seqs.name, fasta=454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.fasta, name=454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.names, group=454Reads_archaea.shhh.groups.good.filter)

Removed 1197 sequences from your name file.
Removed 161 sequences from your fasta file.
Removed 1197 sequences from your group file.

Output File Names:
454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.pick.names
454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.pick.fasta
454Reads_archaea.shhh.groups.good.pick.filter
mothur > summary.seqs(name=current)

Using 454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.pick.names as input file for the name parameter.
Using 454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.pick.fasta as input file for the fasta parameter.

Using 3 processors.

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 484 219 0 3 1
2.5%-tile: 1 486 260 0 3 895
25%-tile: 1 486 261 0 4 8943
Median: 1 486 261 0 4 17885
75%-tile: 1 486 266 0 4 26827
97.5%-tile: 1 486 266 0 5 34875
Maximum: 3 486 278 0 7 35769
Mean: 1.00106 486 262.504 0 4.16914
# of unique seqs: 2750
total # of seqs: 35769

Output File Name:
454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.pick.summary
#############################
Number of chimeras per sample
##############################
./compute_chimera_for_singlesample.py -i 454Reads_Bacteria.Eric_BAC.shhh.groups -j chimera.seqs.name

GZ1122_ARC: 10
GZ1122cellulose_ARC: 3
GZ_ARC: 1
GZ_cellulose55_ARC: 1
GZ_xylan_ARC: 7
GZ_xylose_ARC: 4
SHX1125cellulose_ARC: 0
SHX1219_ARC: 0
SHX_ARC: 5
SHX_cellulose55_ARC: 8
SHX_xylan_ARC: 3
SHX_xylose_ARC: 35
ST0303cellulose_ARC: 15
ST1219_ARC: 5
ST_ARC: 21
ST_xylan_ARC: 11
TP_1201_ARC: 0
TP_xylan_ARC: 19
YL0203cellulose_ARC: 7
YL_ARC: 2
YL_xylan_ARC: 3
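A per-sample tally like the one above can be produced by joining the chimera name list against a mothur groups file. The sketch below is an assumption about how such a count could be made, not the contents of compute_chimera_for_singlesample.py; the function name and toy read names are made up.

```python
def chimeras_per_group(group_lines, chimera_names):
    """Count chimeric sequence names per sample.

    group_lines: lines of a mothur groups file, 'name<TAB>group'.
    chimera_names: iterable of sequence names flagged as chimeras.
    """
    chimeras = set(chimera_names)
    counts = {}
    for line in group_lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        name, group = parts
        counts.setdefault(group, 0)  # samples with no chimeras report 0
        if name in chimeras:
            counts[group] += 1
    return counts


toy_groups = ["r1\tGZ_ARC", "r2\tGZ_ARC", "r3\tYL_ARC"]
for group, n in sorted(chimeras_per_group(toy_groups, ["r2"]).items()):
    print(f"{group}: {n}")
# GZ_ARC: 1
# YL_ARC: 0
```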

#######################
Removing “contaminants”
#######################
wget http://www.mothur.org/w/images/5/59/Trainset9_032012.pds.zip
shenzy@shenzy-ubuntu:/winxp_disk2/shenzy/xiaoying_work_archaea/16s_archaea_allsamples/shhh_pipe2$ unzip Trainset9_032012.pds.zip
Archive: Trainset9_032012.pds.zip
inflating: trainset9_032012.pds.tax
inflating: trainset9_032012.pds.fasta

mothur > classify.seqs(fasta=454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.pick.fasta, name=454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.pick.names, group=454Reads_archaea.shhh.groups.good.pick.filter, template=trainset9_032012.pds.fasta, taxonomy=trainset9_032012.pds.tax, cutoff=80, processors=2)
….
Processing sequence: 1300
Processing sequence: 1300
[WARNING]: HQ93PQ301C4CTV could not be classified. You can use the remove.lineage command with taxon=unknown; to remove such sequences.
[WARNING]: HQ93PQ301CK6YP could not be classified. You can use the remove.lineage command with taxon=unknown; to remove such sequences.
[WARNING]: HQ93PQ301DJMTI could not be classified. You can use the remove.lineage command with taxon=unknown; to remove such sequences.
[WARNING]: HQ93PQ301ERRC6 could not be classified. You can use the remove.lineage command with taxon=unknown; to remove such sequences.
Processing sequence: 1372
Processing sequence: 1371
It took 25 secs to classify 2750 sequences.

Reading 454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.pick.names… Done.

It took 3 secs to create the summary file for 2750 sequences.
Output File Names:
454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.pick.pds.taxonomy
454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.pick.pds.flip.accnos
454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.pick.pds.tax.summary

mothur > remove.lineage(fasta=454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.pick.fasta, name=454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.pick.names, group=454Reads_archaea.shhh.groups.good.pick.filter, taxonomy=454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.pick.pds.taxonomy, taxon=Mitochondria-Cyanobacteria_Chloroplast-Eukarya-Bacteria-unknown)
Output File Names:
454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.pick.pds.pick.taxonomy
454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.pick.pick.names
454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.pick.pick.fasta
454Reads_archaea.shhh.groups.good.pick.pick.filter
mothur > summary.seqs(name=current)

Using 454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.pick.pick.names as input file for the name parameter.
Using 454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.pick.pick.fasta as input file for the fasta parameter.

Using 2 processors.

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 484 219 0 3 1
2.5%-tile: 1 486 260 0 3 852
25%-tile: 1 486 261 0 4 8519
Median: 1 486 261 0 4 17037
75%-tile: 1 486 266 0 4 25555
97.5%-tile: 1 486 266 0 5 33221
Maximum: 1 486 278 0 7 34072
Mean: 1 486 262.598 0 4.21437
# of unique seqs: 2644
total # of seqs: 34072

Output File Name:
454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.pick.pick.summary

############################################################################################
mothur > system(cp 454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.pick.pds.pick.taxonomy archaea_16s_final.taxonomy)
mothur > system(cp 454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.pick.pick.names archaea_16s_final.names)
mothur > system(cp 454Reads_archaea.shhh.trim.unique.fasta.good.filter.unique.shhh_seqs.pick.pick.fasta archaea_16s_final.fasta)
mothur > system(cp 454Reads_archaea.shhh.groups.good.pick.pick.filter archaea_16s_final.groups)
mothur > dist.seqs(fasta=archaea_16s_final.fasta, cutoff=0.1, processors=3)

Output File Name:
archaea_16s_final.dist

It took 25 to calculate the distances for 2644 sequences.

mothur > cluster(column=archaea_16s_final.dist, name=archaea_16s_final.names)
changed cutoff to 0.0392497

Output File Names:
archaea_16s_final.an.sabund
archaea_16s_final.an.rabund
archaea_16s_final.an.list

It took 9 seconds to cluster

mothur > make.shared(list=archaea_16s_final.an.list, group=archaea_16s_final.groups)

unique
0.01
0.02
0.03

Output File Names:
archaea_16s_final.an.shared
archaea_16s_final.an.Eric_ARC.rabund
archaea_16s_final.an.GZ1122_ARC.rabund
archaea_16s_final.an.GZ1122cellulose_ARC.rabund
archaea_16s_final.an.GZ_ARC.rabund
archaea_16s_final.an.GZ_cellulose55_ARC.rabund
archaea_16s_final.an.GZ_xylan_ARC.rabund
archaea_16s_final.an.GZ_xylose_ARC.rabund
archaea_16s_final.an.Milk_D_ARC.rabund
archaea_16s_final.an.Milk_E_ARC.rabund
archaea_16s_final.an.SHX1125cellulose_ARC.rabund
archaea_16s_final.an.SHX1219_ARC.rabund
archaea_16s_final.an.SHX_ARC.rabund
archaea_16s_final.an.SHX_cellulose55_ARC.rabund
archaea_16s_final.an.SHX_xylan_ARC.rabund
archaea_16s_final.an.SHX_xylose_ARC.rabund
archaea_16s_final.an.ST0303cellulose_ARC.rabund
archaea_16s_final.an.ST1219_ARC.rabund
archaea_16s_final.an.ST_ARC.rabund
archaea_16s_final.an.ST_xylan_ARC.rabund
archaea_16s_final.an.TP_1201_ARC.rabund
archaea_16s_final.an.TP_xylan_ARC.rabund
archaea_16s_final.an.YL0203cellulose_ARC.rabund
archaea_16s_final.an.YL_ARC.rabund
archaea_16s_final.an.YL_xylan_ARC.rabund
mothur > count.groups()

Using archaea_16s_final.an.shared as input file for the shared parameter.
Eric_ARC contains 14.
GZ1122_ARC contains 1780.
GZ1122cellulose_ARC contains 1063.
GZ_ARC contains 53.
GZ_cellulose55_ARC contains 1997.
GZ_xylan_ARC contains 1509.
GZ_xylose_ARC contains 1241.
Milk_D_ARC contains 19.
Milk_E_ARC contains 434.
SHX1125cellulose_ARC contains 2568.
SHX1219_ARC contains 2012.
SHX_ARC contains 1594.
SHX_cellulose55_ARC contains 2235.
SHX_xylan_ARC contains 944.
SHX_xylose_ARC contains 1932.
ST0303cellulose_ARC contains 1815.
ST1219_ARC contains 1597.
ST_ARC contains 774.
ST_xylan_ARC contains 1755.
TP_1201_ARC contains 1952.
TP_xylan_ARC contains 1849.
YL0203cellulose_ARC contains 1762.
YL_ARC contains 1154.
YL_xylan_ARC contains 2019.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^old^^^^^^^^^^^^^
Using archaea_16s_final.an.shared as input file for the shared parameter.
Eric_ARC contains 15.***********
GZ1122_ARC contains 1815.
GZ1122cellulose_ARC contains 1081.
GZ_ARC contains 54.
GZ_cellulose55_ARC contains 1997.
GZ_xylan_ARC contains 1547.
GZ_xylose_ARC contains 1245.
Milk_D_ARC contains 18. *********
Milk_E_ARC contains 434. ********
SHX1125cellulose_ARC contains 2570.
SHX1219_ARC contains 2012.
SHX_ARC contains 1593.
SHX_cellulose55_ARC contains 2236.
SHX_xylan_ARC contains 947.
SHX_xylose_ARC contains 1932.
ST0303cellulose_ARC contains 1810.
ST1219_ARC contains 1597.
ST_ARC contains 759.
ST_xylan_ARC contains 1755.
TP_1201_ARC contains 1952.
TP_xylan_ARC contains 1849.
YL0203cellulose_ARC contains 1762.
YL_ARC contains 1164.
YL_xylan_ARC contains 2019.

mothur > count.groups()

Using archaea_16s_final.an.shared as input file for the shared parameter.
Eric_ARC contains 569.
GZ1122_ARC contains 2103.
GZ1122cellulose_ARC contains 1594.
GZ_ARC contains 530.
GZ_cellulose55_ARC contains 2001.
GZ_xylan_ARC contains 1889.
GZ_xylose_ARC contains 2015.
Milk_D_ARC contains 598.
Milk_E_ARC contains 1753.
SHX1125cellulose_ARC contains 2831.
SHX1219_ARC contains 2247.
SHX_ARC contains 1660.
SHX_cellulose55_ARC contains 2249.
SHX_xylan_ARC contains 1213.
SHX_xylose_ARC contains 1991.
ST0303cellulose_ARC contains 1845.
ST1219_ARC contains 1621.
ST_ARC contains 1859.
ST_xylan_ARC contains 1769.
TP_1201_ARC contains 1969.
TP_xylan_ARC contains 1890.
YL0203cellulose_ARC contains 1785.
YL_ARC contains 1285.
YL_xylan_ARC contains 2025.

mothur > sub.sample(shared=archaea_16s_final.an.shared, size=759)

Eric_ARC contains 15. Eliminating.
GZ_ARC contains 54. Eliminating.
Milk_D_ARC contains 18. Eliminating.
Milk_E_ARC contains 434. Eliminating.
Sampling 759 from each group.
unique
0.01
0.02
0.03

Output File Names:
archaea_16s_final.an.uniquesubsample.shared
archaea_16s_final.an.0.01subsample.shared
archaea_16s_final.an.0.02subsample.shared
archaea_16s_final.an.0.03subsample.shared
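Conceptually, subsampling to a fixed size means drawing the same number of reads from every sample without replacement and dropping samples that have too few reads. The sketch below shows that idea for a single sample; it is an illustration rather than mothur's implementation, and the function name, seed parameter, and OTU labels are assumptions.

```python
import random


def subsample_counts(otu_counts, size, seed=0):
    """Draw `size` reads without replacement from one sample's OTU counts.

    Returns a new {otu: count} dict, or None if the sample has fewer than
    `size` reads (such samples would be eliminated).
    """
    pool = [otu for otu, n in otu_counts.items() for _ in range(n)]
    if len(pool) < size:
        return None
    rng = random.Random(seed)
    picked = rng.sample(pool, size)
    out = {otu: 0 for otu in otu_counts}
    for otu in picked:
        out[otu] += 1
    return out


sample = {"Otu1": 500, "Otu2": 300, "Otu3": 20}
sub = subsample_counts(sample, 759)
print(sum(sub.values()))                     # 759
print(subsample_counts({"Otu1": 15}, 759))   # None
```

Choosing the size as the read count of the smallest sample worth keeping, as done above with 759, trades a few shallow samples for equal sampling depth across the rest.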
mothur > classify.otu(list=archaea_16s_final.an.list, name=archaea_16s_final.names, taxonomy=archaea_16s_final.taxonomy)

reftaxonomy is not required, but if given will keep the rankIDs in the summary file static.
unique 2636
0.01 1940
0.02 1033
0.03 634

Output File Names:
archaea_16s_final.an.uniquecons.taxonomy
archaea_16s_final.an.uniquecons.tax.summary
archaea_16s_final.an.0.01cons.taxonomy
archaea_16s_final.an.0.01cons.tax.summary
archaea_16s_final.an.0.02cons.taxonomy
archaea_16s_final.an.0.02cons.tax.summary
archaea_16s_final.an.0.03cons.taxonomy
archaea_16s_final.an.0.03cons.tax.summary

MetaPhlAn: Metagenomic Phylogenetic Analysis

szypanther — Wed, 08 Aug 2012 03:58:38 +0000

MetaPhlAn is a computational tool for profiling the composition of microbial communities from metagenomic shotgun sequencing data. MetaPhlAn relies on unique clade-specific marker genes identified from 3,000 reference genomes, allowing:

up to 25,000 reads-per-second (on one CPU) analysis speed (orders of magnitude faster compared to existing methods);
unambiguous taxonomic assignments as the MetaPhlAn markers are clade-specific;
accurate estimation of organismal relative abundance (in terms of number of cells rather than fraction of reads);
species-level resolution for bacterial and archaeal organisms;
extensive validation of the profiling accuracy on several synthetic datasets and on thousands of real metagenomes.

DySC: software for greedy clustering of 16S rRNA reads

szypanther — Wed, 08 Aug 2012 02:56:15 +0000

Summary: Pyrosequencing technologies are frequently used for sequencing the 16S ribosomal RNA marker gene for profiling microbial communities. Clustering of the produced reads is an important but time-consuming task. We present Dynamic Seed-based Clustering (DySC), a new tool based on the greedy clustering approach that uses a dynamic seeding strategy. Evaluations based on the normalized mutual information (NMI) criterion show that DySC produces higher quality clusters than UCLUST and CD-HIT at a comparable runtime.
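For readers unfamiliar with greedy clustering, the sketch below shows the plain fixed-seed version of the idea: each read joins the first cluster whose seed it matches closely enough, otherwise it founds a new cluster. This is not DySC's dynamic seeding strategy, which the summary does not describe in detail. The identity function is a deliberately crude position-by-position comparison of equal-length strings, and all names and toy reads are assumptions for the example.

```python
def identity(a, b):
    """Fraction of matching positions; 0.0 for reads of different length."""
    if len(a) != len(b) or not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(reads, threshold=0.97):
    """Assign each read to the first seed it matches, else make it a seed."""
    clusters = []  # list of (seed, members)
    for read in reads:
        for seed, members in clusters:
            if identity(seed, read) >= threshold:
                members.append(read)
                break
        else:
            # No existing seed is similar enough: this read seeds a cluster.
            clusters.append((read, [read]))
    return clusters


reads = ["ACGTACGTAC", "ACGTACGTAA", "TTTTTTTTTT"]
for seed, members in greedy_cluster(reads, threshold=0.9):
    print(seed, len(members))
# ACGTACGTAC 2
# TTTTTTTTTT 1
```

Because each read is compared only against seeds, the result depends on input order, which is one reason the choice of seeding strategy matters for cluster quality.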

Availability and implementation: DySC, implemented in C, is available at http://code.google.com/p/dysc/ under GNU GPL license.

Contact: bertil.schmidt@uni-mainz.de