Using python and this GFF parser that mimics Biopython’s SeqIO parsers:

from BCBio import GFF

# Read the gff
for seq in GFF.parse('my_file.gff'):
    # only focus on the CDSs
    for feat in filter(lambda x: x.type == 'CDS',
        # extract the locus tag
        locus_tag = feat.qualifiers.get('locus_tag',
        # extract the sequence
        dna_seq = seq[......]

Read more

python and PBS script

#PBS -l walltime=48:00:00,nodes=8:ppn=4
#PBS -N bbmap_batch
#PBS -l walltime=48:00:00
#megahit -1 /disk/rdisk09/zhiyshen/combined_HKG_1.fastq.gz -2 /disk/rdisk09/zhiyshen/combined_HKG_2.fastq.gz -m 0.9 -o /disk/rdisk09/zhiyshen/HKG_coassembly_out –min-contig-len 2000 -t 28
source activate bbmap
cd /home/zhiyshen/zhiyshen/viral_bioaerosol/bbmap/HKG/test/
python2.7 /home/zhiyshen/zhiyshen/viral_bioaerosol/bbmap/HKG/test/


#!/usr/bin/env python
import os,re,sys,string,commands,getopt,subprocess,glob
from os import path


Read more

R heatmap for ANI

data<-read.table("fastani_matrix.txt", header=TRUE )

(base) zyshen@wyq-P310:~/work/deltaBS/fastaANI$ more fastani_matrix.txt

B1147 E4385 E4742 E4930 EC5350 ROAR019

B1147 100 94.681488 94.696358 94.648102 99.785301 94.803284

E4385 94.681488 100 96.50248 97.408295 94.496964 95.691849

E4742 94.696358 96.50248 100 96.831787 94.59523 95.332825

E4930 94.648102 97.408295 96.831787 100 94.446297 95.715088

EC5350 99.785301 94.496964 94.59523 94.446297 100 94.630646

ROAR019 94.803284 95.691849 95.332825 95.715088 94.630646 100

> data
            B1147     E4385     E4742[......]

Read more

Metagenomics Tutorial (HUMAnN2)

This tutorial is set-up to walk you through the process of determining the taxonomic and functional composition of several metagenomic samples. It covers the use of Metaphlan2 (for taxonomic classification), HUMAnN2 (for functional classification), and STAMP (for visualization and statistical evaluation).

Throughout the tutorial, there are several questions to ensure that you are understanding the process (and not just copy and pasting). You can check the answer sheet after you have answered them yourself[……]

Read more

Metagenomics standard operating procedure v2

Note that this workflow is continually being updated. If you want to use the below commands be sure to keep track of them locally.

Last updated: 9 Oct 2018 (see revisions above for earlier minor versions and Metagenomics SOPs on the SOP for earlier major versions)

Important points to keep in mind:

  • The below options are not necessarily best for your data; it is important to explore different options, especially at the r[……]

Read more



这是那篇原始的文章:Arumugam, M., Raes, J., et al. (2011) Enterotypes of the human gut microbiome, Nature,doi://10.1038/nature09944 在谷歌上一搜,作者竟然做了个分析肠型的教程在这,学习一下: 这是2018年大佬们的共识文章:这是国人翻译的这篇文章, 当然,如果你只需要获得自己的结果或者自己课题的结果,不需要跑代码的,有最新的网页版分型,更好用,网址也放在这,同样也是上面翻译的那篇文章里提到的网址: 只需要把菌属的含量比例文件上就能很快得到结果。



wget http://enterotypi[......]

Read more

Making a pairwise distance matrix in pandas

This is a somewhat specialized problem that forms part of a lot of data science and clustering workflows. It starts with a relatively straightforward question: if we have a bunch of measurements for two different things, how do we come up with a single number that represents the difference between the two things?

An example will make the question clearer. Let’s load our olympic medal dataset:

import pandas as pd
pd.options.display.max_rows = 10
pd.options.display.max_columns = 6
data = pd.read_csv("", sep="\t")


Read more

Pandas and Sklearn

pandas isnull函数检查数据是否有缺失

pandas isnull sum with column headers


for col in main_df:

I get a list of the null count for each column:


What I’m trying to do is create a new dataframe which has the column header alongside the null count, e.g.

col1 | 0
col2 | 1
col3 | 100



#print every column using:


nulls = df.isnull().sum().to_frame()
for index, row in nulls.iterrows():
    print(index, row[0])


for col in df:

pandas.get_dummies 的用法 (One-Hot Encoding)


Read more

conda 报错 Solving environment: failed

1.根据错误内容,安装失败的原因应该是这个网址 请求失败。


-2018-12-12 18:29:18–
Connecting to… failed: Connection refused.

3.用netstat -ntpl查看本地端口的使用情况

netstat -ntpl
(Not all processes could be identified, non-owned process info
will not be shown, you would have to be root to see it all.)[……]

Read more

DBS model training

phenotype gproNOG.annotations bitscores

This is the complete workflow used to generate a random forest model using output data from an hmmsearch of your protein coding genes against eggNOG gamma proteobacterial protein HMMs. Before running this notebook, run the script to get a tab-delimited file containing bitscores for all isolates.

“`{r, read in data}
# library(gplots)
# set the directory you are working from
directory <- “/home/shenzy/UNISED/CII_ECOR_update_final_164genes”
# Reading in the eggNOG model sco[……]

Read more