MetaVelvet: a short read assember for metagenomics

MetaVelvet: a short read assember for metagenomics

Introduction

Motivation: An important step of “metagenomics” analysis is the assembly of multiple genomes from mixed sequence reads of multiple species in a microbial community. Most conventional pipelines employ a single-genome assembler with carefully optimized parameters and post-process the resulting scaffolds to correct assembly errors. Limitations of the use of a single-genome assembler for de novo metagenome assembly are that highly conserved sequences shared between different species often causes chimera contigs, and sequences of highly abundant species are likely mis-identified as repeats in a single genome.

Methods: We modified and extended a single-genome and de Bruijn-graph based assembler, Velvet, for de novo metagenome assembly. Our fundamental ideas are first decomposing de Bruijn graph constructed from mixed short reads into individual sub-graphs and second building scaffolds based on every decomposed de Bruijn sub-graph as isolate species genome.

 

What is improved?

Longer N50 size: On artificially generated simulation datasets of short sequence reads, it is proven that MetaVelvet assembled metegenomic read data with significantly longer N50 sizes than single-genome assemblers (Velvet and SOAPdenovo) and another meta-genome assembler (MetaIDBA) as shown in the table below.

 

Improved gene prediction: We applied Velvet and MetaVelvet to real metagenomic dataset (SRR041654 and SRR041655; human gut microbiome data downloaded from NCBI sequence read archive). Then, MetaGeneAnnotator was used to predict a gene set of the microbiome. From the results, it is demonstrated that MetaVelvet can efficiently avoid chimeric scaffolds, resulting in longer scaffolds, increased number of predicted genes, and decreased fraction of partial genes.

 

Widely applicable to matagenomic analyses: We also compared the results of metagenomic analyses, such as taxonomic content analysis, functional composition analysis, and pathway mapping analysis, between Velvet and MetaVelvet. The results demonstrate that taxonomic content can be estimated more clearly when based on the MetaVelvet scaffolds than when based on the Velvet scaffolds as shown in the figure below left (the number of genes that cannot be assigned to phylum-level taxonomy is substantially decreased). Moreover, the result of functional composition analysis also shows that MetaVelvet is more appropriate for metagenomics purpose than Velvet as shown in the figure below right (the number of genes assigned to any functional category is improved).

Source Code (Release Notes)

The following source codes is freely available under GNU General Public License.

    • MetaVelvet-v1.1.01 (Oct.19,2011)
      – this version is based upon Velvet 1.1.06.
      – fix a frequent bug: no peak detection error.
      – fix a frequent bug: infinite loop error.
      – modify the graph split algorithm: an exception for “knock-turn” loops.
      – modify the graph split algorithm: split priority is now based on peak IDs rather than node IDs.

 

  • MetaVelvet-v0.3 (Jan.14,2011)
    – this version is based upon Velvet 0.7.62.

Source codes at these versions can be downloaded from here.

 

Requirements (>= 1.1.01)

    • Required libraries:
      • zlib library (>= 1.2.3) – zlib library is also included in the Velvet distribution.

 

    • Required programs:
      • velveth (>= 1.1.06) – included in the Velvet distribution.
      • velvetg (>= 1.1.06) – included in the Velvet distribution.
      • g++ (>= 4.3.2) – for compile.

 

  • Recommened environment:
    • OS : standard 64bit Linux (It can in theory function on a 32bit environment, but not recommended).
    • RAM: at least 12GB (more is no luxury).

 

Install MetaVelvet

    1. Download source code from here, and decompress the tar ball.

      ~$ tar zxvf MataVelvet-v1.1.01.tgz

      Alternatively, up-to-date source code is available from github MetaVelvet project.

      ~$ git clone git://github.com/hacchy/MetaVelvet.git

 

    1. Change directory to the MetaVelvet directory.

      ~$ cd MetaVelvet-v1.1.01

 

    1. Compile MetaVelvet execution files.

      ~/MetaVelvet-v1.1.01$ make [‘MAXKMERLENGTH = k‘] [‘CATEGORIES = cat‘]

      k – the maximum k-mer size you like to use.
      cat – the maximum number of read categories (i.e., maximum number of libraries in different insert lengths)
      Then, an executable files, meta-velvegh will be created (>= 1.1.01).

 

  1. Copy the two executable files to /usr/binor a directory you like to install.

    ~/MetaVelvet-v1.1.01$ cp meta-velvetg /usr/bin/

 

Getting started (>= 1.1.01)

MetaVelvet (>= 1.1.01) uses velveth and velvetg outputs for metagenomics assembly. Three files (“Sequences”, “Roadmaps”, and “Graph2″) are needed for running meta-velvetg.

    1. Download a small dataset from here, and decompress the tar ball.

      ~$ tar zxvf HMP.small.tar.gz

 

    1. Execute velveth to import read sequences and construct k-mer hash table.

      ~$ velveth out-dir 51 \
      -fasta -shortPaired \
      HMP.small/SRR041654_shuffled.fasta \
      HMP.small/SRR041655_shuffled.fasta

      Please check that “out-dir/Sequences” and “out-dir/Roadmaps” files are created. Note that k-mer size has important effect on assembly results. If your short read data is longer than 65bp, we recommend k-mer longer than 51.

 

    1. Execute velvetg to construct an initial de Bruijingraph.

      ~$ velvetg out-dir -exp_cov auto -ins_length 260

      Please check that “out-dir/Graph2″ file is created. Please use -exp_cov auto option, otherwise meta-velvetg cannot be executed.

 

  1. Execute meta-velvetgto output scaffold sequences.

    ~$ meta-velvetg out-dir [-ins_length 260] | tee logfile

    Please check that “out-dir/meta-velvetg.contigs.fa” file is created. The FASTA file is the main assembling results.

Advanced topics 1: Manual setting of k-mer coverage peaks (>= 1.1.01)

Contrary to single-genome assemblers (e.g., Velvet and SOAPdenovo), MetaVelvet assumes metagenomics settings. Accordingly, k-mer coverage histogram might be multi-modal rather than uni-modal. Please note that the coverage peak parameters can largely affect MetaVelvet assembling results. Although simple and automatic peak detection algorithm is implemented in meta-velvetg, we strongly recommend manual inspection of k-mer coverage peaks and manual setting of the coverage peak parameters. For manually setting the coverage peak parameters, please execute the following procedures.

    1. Execute velveth, velvetg, and meta-velvetg as in the “Getting started” section. Please check that “out-dir/meta-velvetg.Graph2-stats.txt” is created.

 

    1. Draw a k-mer coverage histogram and manually determine the k-mer coverage peaks.

      ~$ R
      (R) > install.packages(“plotrix”)
      (R) > library(plotrix)
      (R) > data = read.table(“out-dir/meta-velvetg.Graph2-stats.txt”, header=TRUE)
      (R) > weighted.hist(data$short1_cov, data$lgth, breaks=seq(0, 200, by=1))

      Please determine the coverage peak values from the histogram.

      For example, from the above histogram, seven coverage peaks are observed (around 210x, 120x, 70x, 45x, 23x, 12x, and 6x).
      If errors are occurred in this step, please see the “Trouble shootings” section.

 

    1. Alternatively, a Python script “scripts/scriptEstimateCovMulti.py” included in the MetaVelvet package can be used to determine coverage peak values.

      ~$ scriptEstimateCovMulti.py out-dir/stats_EstimateCovMulti.txt

      When applying the script to the above histogram, six peaks (214x, 122x, 70x, 43x, 25x, and 13.5x) are detected by the script. 7th-peak around 6x is removed because such low coverage peak can be caused by sequencing errors.

 

  1. Run meta-velvetgwith manual setting of coverage peaks.

    ~$ meta-velvetg out-dir -ins_length 260 \
    -exp_covs 214_122_70_43_25_13.5

    Please note that meta-velvetg assumes that the peak values are sorted in a descending order.

Advanced topics 2: Use Bambus2 scaffolding module (>= 1.1.01)

Bambus2 is a scaffolding module that can be applicable to metagenomics settings. MetaVelvet uses a novel graph splitting algorithm during contiging process, and uses the scaffolding module of Velvet (RockBand and Pebble) during scaffolding process. Alternatively, users can output MetaVelvet contigs and uses Bambus2 scaffolding module instead of using RockBand and Pebble.

To install Bambus2 programs, please see their web site http://www.cbcb.umd.edu/software/bambus/.

    1. Execute velveth, velvetg as in the “Getting started” section. Please check that “out-dir/Sequences”, “out-dir/Roadmaps”, and “out-dir/Graph2″ files are created.

 

  1. Execute meta-velvetg with -amos_file and -scaffoldingoptions.

    ~$ meta-velvetg out-dir -ins_length 260 \
    -exp_covs 214_122_70_43_25_13.5 \
    -amos_file yes -scaffolding no

    Please check that “out-dir/meta-velvetg.asm.afg” file is created. The file is meta-velvetg contiging result in an AMOS format.

  2. Create amos bank.

    ~$ bank-transact -m out-dir/meta-velvetg.asm.afg -b bank-dir -cf

  3. Use the paired-end information to construct a collection of contig links

    ~$ clk -b bank-dir 2> /dev/null

  4. Bundle the contig links into a collection of contig edges

    ~$ Bundler -b bank-dir -t M 2> /dev/null

  5. Identify genomic repeats and output them

    ~$ MarkRepeats -b bank-dir \
    -redundancy num-pairs \
    2> /dev/null \
    | tee repeat-file

  6. Order and orient contigs according to repeat and link information

    ~$ OrientContigs -b bank-dir \
    -redundancy num-pairs \
    -prefix output-prefix \
    -repeats repeat-file \
    -all \
    2> /dev/null

  7. Linearize the scaffolds

    ~$ untangle -e output-prefix.evidence.xml \
    -s output-prefix.out.xml \
    -o output-prefix.untangle.xml \
    2> /dev/null

  8. Output contig FastA file

    ~$ bank2fasta -d -b bank-dir \
    | tee contig-file

  9. Output scaffold FastA file

    ~$ printScaff -merge -e output-prefix.evidence.xml \
    -o output-prefix \
    -s output-prefix.untangle.xml \
    -l output-prefix.library \
    -f scaffold-file \
    2> /dev/null

    Then, scaffold-file is the final scaffolding results in the FastA format.

 

Frequently asked questions

    • Q: meta-velveth is not generated in the new version (>= 1.1.01). Is it problem?
      A: This is not problem. The usage of MetaVelvet is changed when the version 1.1.01 is released, and the new version does not include meta-velveth. Instead, please use velveth, velvetg, and meta-velvetg.

 

    • Q: When only one coverage peak is detected (or manually input), is there any difference between MetaVelvet and Velvet algorithms?
      A: There is no substantial difference. In such cases, meta-velvetg moves to “single-peak mode” and graph splitting functions in meta-velvetg is not called. Instead of graph splitting functions, standard velvet functions are called in such cases.

 

    • Q: What’s the difference in working procedures between velvetg, meta-velvetg (<= 0.3.1), and meta-velvetg (>= 1.1.01)?
      A: The following is the working procedure of velvetg, meta-velvetg (<= 0.3.1), and meta-velvetg (>= 1.1.01):

      velvetg & meta-velvetg(<=0.3.1) :
      Load Sequences & Roadmaps file
      -> Generate PreGraph file
      -> Generate Graph or Graph2 file
      -> Generate contigs.fa and LastGraph

      meta-velvetg (>= 1.1.01):
      Load Sequences & Roadmaps & Graph2 file
      -> Generate meta-velvetg.contigs.fa and meta-velvetg.LastGraph

 

  • Q: Is version compatibility between Velvet and MetaVelvet fully tested?
    A: Version compatibility between Velvet-1.0.06 and MetaVelvet-1.1.01 is fully tested.

Trouble shootings

  • Trouble: When drawing k-mer coverage histogram (as in the “Advanced topics 1″section), the following warning messages is appeared:

    > weighted.hist(data$shot1_cov,data$lgth,breaks=seq(0,200,1))
    Warning messages:
    1: In min(x, na.rm = na.rm) :
    no non-missing arguments to min; returning Inf
    2: In max(x, na.rm = na.rm) :
    no non-missing arguments to max; returning -Inf
    3: In weighted.hist(data$shot1_cov, data$lgth, breaks = seq(0, 200, :
    Areas will not relate to frequencies

    Solution:This warning (error) is caused by “Inf” values in the Graph2 node stats. Accordingly, by removing “Inf” values from the Graph2 stats, the error is resolved:

    $ head -n 1 meta-velvetg.Graph2-stats.txt \
    > meta-velvetg.Graph2-stats.rmInf.txt
    $ perl -ne ‘{print $_ unless /Inf/;}’ \
    meta-velvetg.Graph2-stats.txt \
    >> meta-velvetg.Graph2-stats.rmInf.txt
    $ R
    > library(plotrix)
    > data = read.table(“meta-velvetg.Graph2-stats.rmInf.txt”, header=TRUE)
    > weighted.hist(data$shot1_cov,data$lgth,breaks=seq(0,200,1))

 

Contact us

    • Simple questions and bug reports, please feel free to mail to the following adress:

 

 

  • Requests for future releases, we would like to continuously develop MetaVelvet with hearing from user voices:

 

Publications

  • Namiki T*, Hachiya T*, Tanaka H, Sakakibara Y. (2011) MetaVelvet : An extension of Velvet assembler to de novo metagenome assembly from short sequence reads, submitted. (Summplementary figures and tables)
    *Equally contributed.

 

3 comments to MetaVelvet: a short read assember for metagenomics

  • I dugg some of you post as I cerebrated they were very beneficial very useful

    • szypanther

      I am very happy to receive some guys’s feedback, Just like u! Hope my personal blog can give some help for others:)

      • Prav ste za tova, che niama pravda v Bulgaria, niama zankoi, a uj sme v EU. No moga da kaja edno, a imenno che mnogo hora v Bulgaria se zablijdavat, che v Amerika ima pravda. Korupciata ne e samo v BG, a navsiakade po sveta, vkliuchitelno Amerika, samo che tuk e po-prikrita!Sajaliavam ako sam vi razbila iluziite za USA, no istinata e takava.

Leave a Reply to Kompa Cancel reply

  

  

  

You can use these HTML tags

<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>