Python and PBS script

#!/bin/bash
#PBS -l walltime=48:00:00,nodes=8:ppn=4
#PBS -N bbmap_batch
#megahit -1 /disk/rdisk09/zhiyshen/combined_HKG_1.fastq.gz -2 /disk/rdisk09/zhiyshen/combined_HKG_2.fastq.gz -m 0.9 -o /disk/rdisk09/zhiyshen/HKG_coassembly_out --min-contig-len 2000 -t 28
source activate bbmap
cd /home/zhiyshen/zhiyshen/viral_bioaerosol/bbmap/HKG/test/
python2.7 /home/zhiyshen/zhiyshen/viral_bioaerosol/bbmap/HKG/test/batch_run_bbmap.py
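The batch driver is only referenced by path; a hypothetical sketch of what a batch_run_bbmap.py loop over paired-end samples might look like (the _1/_2 filename convention, reference name, and thread count are assumptions, and the command is printed rather than executed):

```python
#!/usr/bin/env python
# Hypothetical sketch of a batch driver for bbmap.sh: for every paired-end
# sample in the working directory, build a mapping command against a reference.
# The _1/_2 suffixes, reference path, and thread count are assumptions.
import glob
import os


def run_bbmap(r1, ref="contigs.fa", threads=4):
    r2 = r1.replace("_1.fastq.gz", "_2.fastq.gz")
    sample = os.path.basename(r1).replace("_1.fastq.gz", "")
    cmd = ["bbmap.sh",
           "in1=" + r1,
           "in2=" + r2,
           "ref=" + ref,
           "out=" + sample + ".sam",
           "threads=" + str(threads)]
    return cmd  # in the real script: subprocess.check_call(cmd)


if __name__ == "__main__":
    for r1 in sorted(glob.glob("*_1.fastq.gz")):
        print(" ".join(run_bbmap(r1)))
```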

 

#!/usr/bin/env python
# Python 2 script: the commands module was removed in Python 3.
import os, re, sys, string, commands, getopt, subprocess, glob
from os import path

#SL310935_adapter_remov[……]


R heatmap for ANI

data <- read.table("fastani_matrix.txt", header=TRUE)

(base) zyshen@wyq-P310:~/work/deltaBS/fastaANI$ more fastani_matrix.txt
        B1147      E4385      E4742      E4930      EC5350     ROAR019
B1147   100        94.681488  94.696358  94.648102  99.785301  94.803284
E4385   94.681488  100        96.50248   97.408295  94.496964  95.691849
E4742   94.696358  96.50248   100        96.831787  94.59523   95.332825
E4930   94.648102  97.408295  96.831787  100        94.446297  95.715088
EC5350  99.785301  94.496964  94.59523   94.446297  100        94.630646
ROAR019 94.803284  95.691849  95.332825  95.715088  94.630646  100

> data
            B1147     E4385     E4742[......]
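Although the post works in R, the same table can be prepared in Python as well; a small sketch that parses a subset of the ANI matrix with pandas and converts similarity to distance (the heatmap itself could then be drawn with seaborn.heatmap or R's heatmap functions):

```python
# Parse a FastANI all-vs-all matrix and turn it into a distance matrix
# (distance = 100 - ANI), the form clustering/heatmap functions expect.
# Only three of the six genomes are used here to keep the sketch short.
import io
import pandas as pd

matrix_txt = """\
\tB1147\tE4385\tEC5350
B1147\t100\t94.681488\t99.785301
E4385\t94.681488\t100\t94.496964
EC5350\t99.785301\t94.496964\t100
"""

ani = pd.read_csv(io.StringIO(matrix_txt), sep="\t", index_col=0)
dist = 100 - ani   # similar genomes get a small distance
print(dist)
```

With the full matrix loaded the same way, `dist` can be passed straight to a clustered heatmap routine.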


Metagenomics Tutorial (HUMAnN2)

https://github.com/LangilleLab/microbiome_helper/wiki/Metagenomics-Tutorial-(Humann2)


This tutorial is set up to walk you through the process of determining the taxonomic and functional composition of several metagenomic samples. It covers the use of MetaPhlAn2 (for taxonomic classification), HUMAnN2 (for functional classification), and STAMP (for visualization and statistical evaluation).

Throughout the tutorial, there are several questions to ensure that you are understanding the process (and not just copying and pasting). You can check the answer sheet after you have answered them yourself[……]


Metagenomics standard operating procedure v2

http://www.360doc.com/content/20/0823/10/71250389_931761136.shtml

https://github.com/LangilleLab/microbiome_helper/wiki/Metagenomics-standard-operating-procedure-v2

Note that this workflow is continually being updated. If you want to use the below commands be sure to keep track of them locally.

Last updated: 9 Oct 2018 (see the page revisions for earlier minor versions, and the earlier Metagenomics SOPs for earlier major versions)

Important points to keep in mind:

  • The below options are not necessarily best for your data; it is important to explore different options, especially at the r[……]


Learning enterotype analysis

Enterotypes were proposed in a 2011 paper, and in the almost-over year of 2018 more than twenty leading figures in gut-microbiome research revisited and reaffirmed the concept. I had long been curious how enterotypes are actually analyzed in code; today I found this tutorial, so I am putting it here:

This is the original paper: Arumugam, M., Raes, J., et al. (2011) Enterotypes of the human gut microbiome, Nature, doi:10.1038/nature09944. A quick Google search shows that the authors actually published a tutorial on enterotyping, which is worth studying: http://enterotyping.embl.de/enterotypes.html This is the 2018 consensus paper by the field's leaders, and here is a Chinese translation of it: http://blog.sciencenet.cn/blog-3334560-1096828.html Of course, if you only need results for your own samples or project and do not want to run code, there is a newer web version of the typing that is even easier to use; the URL, also mentioned in the translated article above, is http://enterotypes.org/. Just upload a file of genus-level relative abundances and you get results very quickly.

Below I will try the analysis while learning, and keep the code here as a reference. The author has in fact already organized the code; I will work through it and aim to apply it to my own data.

First, download the test data:

wget http://enterotypi[......]
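The EMBL tutorial clusters samples by Jensen-Shannon distance before PAM clustering; the core metric can be sketched in a few lines of numpy (the pseudocount value and the two toy genus profiles below are made up):

```python
# Jensen-Shannon distance between two genus-abundance profiles, the metric
# the enterotyping tutorial uses before PAM clustering of samples.
import numpy as np


def jsd(p, q, pseudocount=1e-9):
    # Add a tiny pseudocount so zero abundances do not break the log.
    p = np.asarray(p, dtype=float) + pseudocount
    q = np.asarray(q, dtype=float) + pseudocount
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kld = lambda a, b: np.sum(a * np.log(a / b))
    return np.sqrt(0.5 * kld(p, m) + 0.5 * kld(q, m))


sample_a = [0.6, 0.3, 0.1]  # toy Bacteroides-dominated profile (made up)
sample_b = [0.1, 0.2, 0.7]  # toy Prevotella-dominated profile (made up)
print(jsd(sample_a, sample_a))  # identical profiles -> 0.0
print(jsd(sample_a, sample_b))
```

Computing this for every pair of samples gives the distance matrix that the clustering step consumes.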


Making a pairwise distance matrix in pandas

This is a somewhat specialized problem that forms part of a lot of data science and clustering workflows. It starts with a relatively straightforward question: if we have a bunch of measurements for two different things, how do we come up with a single number that represents the difference between the two things?

An example will make the question clearer. Let’s load our olympic medal dataset:

import pandas as pd
pd.options.display.max_rows = 10
pd.options.display.max_columns = 6
data = pd.read_csv("https://raw.githubusercontent.com/mojones/binders/master/olympics.csv", sep="\t")
data

[……]
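The excerpt cuts off before the distance step, but the idea can be compressed into a short sketch on a made-up stand-in for the medal table (country names and counts invented): one row per thing, then all pairwise Euclidean distances via numpy broadcasting.

```python
# Pairwise Euclidean distance matrix from a feature table, via broadcasting.
import numpy as np
import pandas as pd

# Made-up stand-in for per-country medal counts (gold, silver, bronze).
counts = pd.DataFrame(
    [[10, 5, 3], [8, 6, 4], [1, 2, 9]],
    index=["USA", "GBR", "KEN"],
    columns=["gold", "silver", "bronze"],
)

X = counts.to_numpy(dtype=float)
# (n, 1, k) - (1, n, k) broadcasts to an (n, n, k) array of differences.
diff = X[:, None, :] - X[None, :, :]
dist = pd.DataFrame(np.sqrt((diff ** 2).sum(axis=-1)),
                    index=counts.index, columns=counts.index)
print(dist.round(2))
```

Each cell is then a single number summarizing how different two rows are, which is exactly what clustering workflows consume.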


Pandas and Sklearn

Using the pandas isnull function to check for missing data

pandas isnull sum with column headers

 

for col in main_df:
    print(sum(pd.isnull(main_df[col])))

I get a list of the null count for each column:

0
1
100

What I’m trying to do is create a new dataframe which has the column header alongside the null count, e.g.

col1 | 0
col2 | 1
col3 | 100

 

 

# Print the null count alongside every column name using:

 

nulls = df.isnull().sum().to_frame()
for index, row in nulls.iterrows():
    print(index, row.iloc[0])

 

# Print the unique values in every column:
for col in df:
    print(df[col].unique())
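The two-column layout asked for above can also be produced directly, without iterating (the toy frame here is illustrative):

```python
# Build a "column name | null count" DataFrame in one chained expression.
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 3],
                   "col2": [1, np.nan, 3],
                   "col3": [np.nan, np.nan, np.nan]})

nulls = (df.isnull().sum()
           .reset_index(name="null_count")
           .rename(columns={"index": "column"}))
print(nulls)
```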



How to use pandas.get_dummies (One-Hot Encoding)

[……]
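The get_dummies excerpt is truncated here; its basic use is a one-liner (the toy column is invented):

```python
# get_dummies turns one categorical column into 0/1 indicator columns.
import pandas as pd

df = pd.DataFrame({"city": ["HK", "SZ", "HK"]})
dummies = pd.get_dummies(df["city"], prefix="city")
print(dummies)
```

Each distinct value becomes its own indicator column, which is what sklearn estimators expect in place of raw category labels.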


conda error: Solving environment: failed

Here is my troubleshooting approach:
1. Judging from the error message, the installation failed because the request for https://repo.anaconda.com/pkgs/main/noarch/repodata.json.bz2 failed.
2. So I tried to download the file manually with

wget https://repo.anaconda.com/pkgs/main/noarch/repodata.json.bz2

which produced the following error:

--2018-12-12 18:29:18-- https://repo.anaconda.com/pkgs/main/noarch/repodata.json.bz2
Connecting to 127.0.0.1:33473... failed: Connection refused.

Time to find the cause of the failure. 127.0.0.1 is the local machine, which by itself should not be a problem, so could the issue be that port 33473 is occupied?
3. Check local port usage with netstat -ntpl:

netstat -ntpl
(Not all processes could be identified, non-owned process info
will not be shown, you would have to be root to see it all.)[……]


DBS model training

phenotype gproNOG.annotations bitscores

This is the complete workflow used to generate a random forest model using output data from an hmmsearch of your protein coding genes against eggNOG gamma proteobacterial protein HMMs. Before running this notebook, run the parse_hmmsearch.pl script to get a tab-delimited file containing bitscores for all isolates.

```{r, read in data}
# library(gplots)
library(caret)
library(randomForest)
set.seed(1)
# set the directory you are working from
directory <- "/home/shenzy/UNISED/CII_ECOR_update_final_164genes"
# Reading in the eggNOG model sco[……]
```
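The R workflow above trains the model with caret/randomForest; the same shape of analysis sketched in Python on a synthetic bitscore matrix (the dimensions, labels, and feature setup are all invented) would look roughly like:

```python
# Random forest on an isolates-by-HMM bitscore matrix, sketched with sklearn
# on synthetic data (the real workflow above does this in R with caret).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
n_isolates, n_hmms = 60, 20
X = rng.normal(size=(n_isolates, n_hmms))    # fake bitscore matrix
y = (X[:, 0] + X[:, 1] > 0).astype(int)      # fake binary phenotype

clf = RandomForestClassifier(n_estimators=200, random_state=1).fit(X, y)
# Feature importances play the role of mean-decrease-in-accuracy scores.
ranked = np.argsort(clf.feature_importances_)[::-1]
print("most important HMM feature:", ranked[0])
```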


Microbial diversity research: differential analysis

1. Random forest models

Random forest is an efficient machine-learning algorithm based on decision trees; it can be used to classify samples (classification) as well as for regression analysis (regression).

It is a non-linear classifier, so it can uncover complex non-linear interdependencies among variables. Random forest analysis can identify the key OTUs that separate two groups of samples.

The Feature Importance Scores table from the random forest output

records how much each OTU contributes to the between-group differences.

Note: as a rule of thumb, OTUs with a Mean_decrease_in_accuracy above 0.05 are selected for further analysis; for samples with small between-group differences the threshold may need to drop to 0.03.

2. Cross-validation analysis

Cross-validation is a practical statistical technique for splitting a data sample into smaller subsets: the analysis is first performed on one subset, while the other subsets are used afterwards to confirm and validate it. The initial subset is called the training set; the other subsets are called the validation or test sets.

The most common form is k-fold cross-validation, in which all the data are divided into k subsets and each subset serves as the test set exactly once, with the rest used for training. Cross-validation is thus repeated k times, choosing a different subset as the test set each time, and the average classification accuracy over the k rounds is reported as the result.

Every sample is used for both training and testing, and each sample is [……]
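The k-fold procedure described here is only a few lines with scikit-learn; a sketch on an invented OTU count table (the counts, group labels, and the discriminative OTU are all made up):

```python
# 5-fold cross-validation of a random forest, as described above: each of
# the k subsets is the test set once, and the k accuracies are averaged.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.poisson(5, size=(80, 30)).astype(float)  # fake OTU count table
y = np.repeat([0, 1], 40)                        # two sample groups
X[y == 1, 0] += 5                                # make OTU 0 discriminative

scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)
print("mean accuracy over 5 folds:", round(scores.mean(), 3))
```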
