GYRA : CCMAR Computational Cluster (NGS)

Welcome to the CCMAR Computational Cluster Facility: GYRA – gyra.ualg.pt

The GYRA cluster facility is administered and maintained by Cymon J. Cox of the Plant Systematics and Bioinformatics Research Group (PSB).

System:

The GYRA cluster facility consists of:

  • Frontend: 16-core 2.3GHz 32GB DELL PowerEdge R715
  • compute-0-0: 8-core 2.6GHz 8GB DELL PowerEdge SC1435
  • compute-0-1: 8-core 2.6GHz 8GB DELL PowerEdge SC1435
  • compute-0-2: 32-core 2.4GHz 64GB DELL PowerEdge R815
  • compute-0-3: 16-core 2.0GHz 32GB DELL PowerEdge R715
  • compute-0-4: 32-core 2.4GHz 64GB DELL PowerEdge R815
  • compute-0-5: 16-core 3.0GHz 128GB DELL PowerEdge R815

The clustering software is Rocks 5.3 (Rolled Tacos) with an Open Grid Scheduler/Sun Grid Engine queuing system. A total of 128 slots/cores are available in the queue. The cluster supports the mpi, mpich, and orte parallel computing environments.

The current system status can be viewed here.

Access/Support:

The cluster facility is available to members of CCMAR and collaborators. To request an account on gyra, or for general enquiries, email Cymon.

(If you are looking for a free-to-access online bioinformatics platform, you could try BioPortal; for phylogenetics, try CIPRES.)

If you are trying to use Microsoft Windows to connect to GYRA, you could try these instructions for installing the necessary software.

Submitting jobs to the cluster:

Jobs can only be submitted from ‘gyra’ and not the nodes.

Jobs may be run on the cluster in Interactive (on compute-0-3 or compute-0-4 only) or Batch mode – batch job submission being the most often used.

Interactive job submission:

An interactive session can be requested by issuing the command ‘qrsh’ at the prompt. If a slot is available in the queue, an interactive session will be started on an available node. The session will remain active, consuming a single slot, until the exit command is issued at the prompt. Do not leave interactive sessions idle, as other users will be deprived of resources.

Batch submission:

This is the usual job submission procedure – the advantage being that jobs can be queued and run when resources are available. Jobs are submitted to the queue using the command ‘qsub’, which submits a small shell script describing the requested resources and job configuration. Typically these scripts are very simple; for all available options, see the SGE Users Manual.

The following describes a simple job submission script and typical configuration options:

#!/bin/bash

# Give the job a name that will appear in the queue
#$ -N theNameOfMyJob

# Request all output be placed in the current working directory
#$ -cwd

#Re-direct the 'standard out' and 'standard error' messages to the single file named myAnalysis.out
#$ -o myAnalysis.out
#$ -j y

#Request a bash shell
#$ -S /bin/bash

#Send email notification when job (b)egins, (e)nds, or is (a)borted
#$ -M myEmail@my.account.domain
#$ -m bea

# IMPORTANT: source your bash shell profile
source ~/.bash_profile

#Run this command:
paup myAnalysis.nex

All lines beginning with #$ are interpreted as commands by the SGE queue, those beginning with # are comments (except the first line which is special).

The above script, if in the file named ‘mySub.sh’, would be submitted to the queue using the following command:

  • [user@gyra ~]$ qsub mySub.sh
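
When several jobs are submitted from a script, it can help to record the job number that ‘qsub’ reports. The following is a minimal sketch, assuming the usual SGE confirmation message of the form: Your job 5214 ("theNameOfMyJob") has been submitted. The helper name job_id_from_qsub is hypothetical and not part of SGE.

```shell
#!/bin/bash
# Hypothetical helper: extract the job number from qsub's confirmation line.
# Assumes the message looks like: Your job 5214 ("name") has been submitted
job_id_from_qsub() {
    sed -n 's/^Your job \([0-9][0-9]*\) .*/\1/p'
}

# Example with a canned message, so the sketch runs without a queue:
echo 'Your job 5214 ("theNameOfMyJob") has been submitted' | job_id_from_qsub

# On gyra this could be used as:
#   jobid=$(qsub mySub.sh | job_id_from_qsub)
#   qstat -j "$jobid"
```

The canned example prints 5214. If the message format on gyra differs, the pattern would need adjusting.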

There is also a graphical user interface for job submission called ‘qmon’. This requires that you are running the X Window System on your local machine. The ‘qmon’ submission system is not supported and is not recommended. Full details can be found in the SGE Users Manual.

Submitting parallel jobs:

Some software is parallelised and able to run a single job on multiple processors. To make this work correctly you need to indicate the number of CPUs to the software AND tell the queue how many slots the job will use.

Submission script requesting 2 CPUs (-np 2) with the software command “mb runfile.nex”:

#! /bin/bash
#$ -cwd
[...]
mpirun -np 2 mb runfile.nex

If the above submission script is called “mbsub.sh”, when submitting to the queue request the “orte” parallel environment (“-pe orte”) and 2 slots:

  • [user@gyra ~]$ qsub -pe orte 2 mbsub.sh
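
One way to keep the two numbers in step is to let the script read the slot count from the queue. The following is a sketch only. It assumes that SGE sets the NSLOTS and JOB_ID environment variables inside a running job (standard SGE behaviour, but worth confirming on gyra), and the function name run_mpi is hypothetical. Outside the queue it only prints the command it would run.

```shell
#!/bin/bash
#$ -cwd
#$ -S /bin/bash

# Hypothetical helper: build an mpirun command whose -np value matches the
# number of slots granted by the queue (NSLOTS; falls back to 1 if unset).
run_mpi() {
    np="${NSLOTS:-1}"
    echo "mpirun -np $np $*"
    # Only execute inside a queue job, where SGE is assumed to set JOB_ID.
    if [ -n "${JOB_ID:-}" ]; then
        mpirun -np "$np" "$@"
    fi
}

run_mpi mb runfile.nex
```

Submitted with ‘qsub -pe orte 2 mbsub.sh’, such a script would start “mb” on 2 processes without the number being repeated inside the script.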

Mira assemblies: if you indicate SK:not=4 in your Mira command line (i.e. 4 threads for the SKIM algorithm), submit the job as follows:

  • [user@gyra ~]$ qsub -pe orte 4 my_mira.sh
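
The slot count can also be derived from the Mira options rather than typed twice. The sketch below uses a hypothetical helper, mira_slots, which looks for an SK:not=N setting in the arguments it is given and falls back to 2, the default SKIM thread count mentioned in the Mira entry under Software. It only prints the ‘qsub’ command; it does not submit anything.

```shell
#!/bin/bash
# Hypothetical helper: work out how many slots to request for a Mira job
# from its SK:not setting; falls back to 2 when no setting is given.
mira_slots() {
    n=$(printf '%s\n' "$*" | sed -n 's/.*SK:not=\([0-9][0-9]*\).*/\1/p')
    echo "${n:-2}"
}

slots=$(mira_slots --project=run1 --job=denovo,genome,accurate,454 -SK:not=4)
echo "qsub -pe orte $slots my_mira.sh"
```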

Monitoring the queue:

The command ‘qstat’ describes the current status of the queue:

[cymon@gyra ~]$ qstat
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
all.q@compute-0-0.local        BP    0/3/8          3.02     lx26-amd64
    hl:mem_free=5.309G
   5214 0.55500 l34iCV26   cymon        r     04/11/2011 14:17:20     1
   5215 0.55500 l34iCV24   cymon        r     04/11/2011 14:28:35     1
   5216 0.55500 l34iCV22   cymon        r     04/11/2011 14:29:20     1
---------------------------------------------------------------------------------
all.q@compute-0-1.local        BP    0/1/8          1.01     lx26-amd64
    hl:mem_free=6.454G
   5218 0.55500 lpe34free  cymon        r     04/11/2011 14:40:20     1
---------------------------------------------------------------------------------
all.q@compute-0-2.local        BP    0/4/32         4.00     lx26-amd64
    hl:mem_free=60.593G
   3934 0.55500 q-run1     cymon        r     03/13/2011 16:39:22     1
   3935 0.55500 q-run2     cymon        r     03/13/2011 16:55:37     1
   3936 0.55500 q-run3     cymon        r     03/13/2011 16:55:52     1
   3937 0.55500 q-run4     cymon        r     03/13/2011 16:55:52     1
---------------------------------------------------------------------------------
all.q@compute-0-3.local        BIP   0/0/16         0.06     lx26-amd64
    hl:mem_free=31.122G
---------------------------------------------------------------------------------
all.q@compute-0-4.local        BIP   0/1/32         1.01     lx26-amd64
    hl:mem_free=61.376G
   5176 0.55500 l34iCV28   cymon        r     04/05/2011 16:57:37     1
---------------------------------------------------------------------------------
all.q@gyra.local               BP    0/0/16         0.05     lx26-amd64
    hl:mem_free=29.710G

The output can be narrowed with the following options:

  • [user@gyra ~]$ qstat -u <username> – display only those jobs for ‘username’
  • [user@gyra ~]$ qstat -j <jobnumber> – display details of a particular ‘jobnumber’ in the queue
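
For a quick summary, the job lines of the ‘qstat’ output can be tallied by state. The following sketch assumes the column layout shown in the listing above (job number first, state fifth); the helper name count_job_states and job 5300 in the canned input are made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: tally jobs by state (e.g. r = running, qw = waiting)
# from qstat output on standard input. Job lines are recognised by a
# numeric job number in the first column; the state is the fifth column.
count_job_states() {
    awk '$1 ~ /^[0-9]+$/ { n[$5]++ } END { for (s in n) print s, n[s] }' | sort
}

# Canned example lines in the layout shown above (job 5300 is made up):
count_job_states <<'EOF'
all.q@compute-0-0.local        BP    0/3/8          3.02     lx26-amd64
   5214 0.55500 l34iCV26   cymon        r     04/11/2011 14:17:20     1
   5215 0.55500 l34iCV24   cymon        r     04/11/2011 14:28:35     1
   5300 0.00000 waiting    cymon        qw    04/11/2011 14:30:00     1
EOF

# On gyra this could be used as:
#   qstat -u "$USER" | count_job_states
```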

Deleting jobs from the queue:

Issuing the commands:

  • [user@gyra ~]$ qdel -u <username> – will delete all jobs of the user ‘username’
  • [user@gyra ~]$ qdel <jobnumber> – will delete ‘jobnumber’ from the queue
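
Because ‘qdel -u’ removes every job of a user, it can be safer to select job numbers by name first. The sketch below is an illustration under the assumption that the job name is the third column of the ‘qstat’ output, as in the listing earlier on this page; job_ids_by_name is a hypothetical helper. Note that ‘qstat’ may shorten long job names, so a short prefix is the safer thing to match.

```shell
#!/bin/bash
# Hypothetical helper: print the job numbers of jobs whose name (third
# column of qstat output) starts with the given prefix.
job_ids_by_name() {
    awk -v prefix="$1" '$1 ~ /^[0-9]+$/ && index($3, prefix) == 1 { print $1 }'
}

# Canned example, so the sketch runs without a queue:
job_ids_by_name q-run <<'EOF'
   3934 0.55500 q-run1     cymon        r     03/13/2011 16:39:22     1
   3935 0.55500 q-run2     cymon        r     03/13/2011 16:55:37     1
   5176 0.55500 l34iCV28   cymon        r     04/05/2011 16:57:37     1
EOF

# On gyra, after checking the list, the numbers could be passed to qdel:
#   qdel $(qstat -u "$USER" | job_ids_by_name q-run)
```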

Next-generation sequence assembly:

Due to the large volumes of data, NGS assembly can require substantial computational resources, especially RAM. Consequently, NGS assembly jobs run within a restricted environment in a special queue called assembly, which requires the total amount of memory needed to be specified before the job is run.

Note: To use the assembly queue, a user must first request to be added to the access group for the queue.

---------------------------------------------------------------------------------
assembly@compute-0-5.local     BP    0/1/4          0.01     lx26-amd64
        hl:mem_free=125.824G
    5980 0.50500 pMira1     peter        r     06/15/2011 09:41:36     1

Note that assembly jobs run in the *all.q* will be summarily killed.

Each job submitted to the assembly queue must request an amount of virtual_free memory – this is the largest amount of memory that the job will require to run. If the job exceeds the amount of specified memory during execution, the job will be suspended indefinitely, rather than causing the compute node to become unresponsive.

The amount of available and request-able virtual-free memory for the assembly queue can be obtained by issuing the following command:

[user@gyra admin]$ qstat -F vf -q assembly
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
assembly@compute-0-5.local     BP    0/0/4          0.00     lx26-amd64
        hc:virtual_free=126.797G

Note that mem-free is not the same as virtual-free memory. The amount of available virtual-free memory is the total amount available on the node (128GB) minus the amount needed by each other job on the node running in the all.q queue (by default 2GB / job).
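
As a rough worked example of that subtraction, the following sketch estimates the virtual-free memory left on the node. It is an approximation only: the example output above reports 126.797G rather than a full 128G on an empty node, so the real figure will be somewhat lower. The function name vf_estimate is hypothetical.

```shell
#!/bin/bash
# Rough estimate only: node total minus the default reservation of each
# all.q job on the same node. All three values are whole numbers of GB.
vf_estimate() {
    total_gb="$1"; per_job_gb="$2"; n_jobs="$3"
    echo $(( total_gb - per_job_gb * n_jobs ))
}

# A 128GB node with 3 all.q jobs at the default 2GB each:
vf_estimate 128 2 3    # prints 122
```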

More virtual-free memory than the currently available amount may be requested, but the job will not run until the requested amount becomes available.
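
Whether a request would start immediately or wait can be checked against the reported value before submitting. The following is a minimal sketch, assuming the ‘hc:virtual_free=…G’ line format shown in the example output above; the helper name vf_fits is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: read 'qstat -F vf -q assembly' output on standard
# input and report whether a request (in GB) fits the available amount.
# Assumes the value is reported in G, as in the example output above.
vf_fits() {
    avail=$(sed -n 's/.*hc:virtual_free=\([0-9.]*\)G.*/\1/p' | head -n 1)
    if [ -z "$avail" ]; then
        echo "no virtual_free value found" >&2
        return 1
    fi
    awk -v want="$1" -v avail="$avail" 'BEGIN {
        if (want + 0 <= avail + 0) { print "fits" } else { print "would wait" }
    }'
}

# Canned example, so the sketch runs without a queue:
printf '%s\n' \
    'assembly@compute-0-5.local     BP    0/0/4          0.00     lx26-amd64' \
    '        hc:virtual_free=126.797G' | vf_fits 64

# On gyra this could be used as:
#   qstat -F vf -q assembly | vf_fits 64
```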

To submit a job to the assembly queue you need to request the amount of virtual-free memory using the -l flag:

  • [user@gyra ~]$ qsub -q assembly -l vf=64G mira --project=run1 --job=denovo,genome,accurate,454

Here -l vf=64G requests 64GB of virtual-free memory for a mira assembly in the assembly queue.
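
The same request can also be kept in a submission script, in the style of the batch example earlier on this page. The sketch below prints such a script; it assumes that ‘-q’ and ‘-l’ are accepted on ‘#$’ lines in the same way as on the command line, and the function name write_assembly_script and the file name run1_mira.sh are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print a submission script for a Mira assembly in the
# assembly queue. Arguments: virtual-free request (e.g. 64G), project name.
write_assembly_script() {
    vf="$1"; project="$2"
    cat <<EOF
#!/bin/bash
#\$ -N ${project}
#\$ -cwd
#\$ -S /bin/bash
#\$ -q assembly
#\$ -l vf=${vf}
source ~/.bash_profile
mira --project=${project} --job=denovo,genome,accurate,454
EOF
}

write_assembly_script 64G run1

# On gyra the output could be saved and submitted:
#   write_assembly_script 64G run1 > run1_mira.sh
#   qsub run1_mira.sh
```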

Software:

The following software is available on the cluster (in no particular order):

Motif and pattern searching:

NCBI BLAST (2.2.21(legacy blastall) and 2.2.25+)
Other BLAST databases and custom databases on request. BLAST+ documentation.
  • nt
  • nr
  • refseq_genomic
  • refseq_protein
  • refseq_rna
  • swissprot
  • taxdb
  • est

    Last update: 5th March 2012

MpiBLAST See Submitting parallel jobs:
mpiBLAST is a freely available, open-source, parallel implementation of NCBI BLAST. By efficiently utilizing distributed computational resources through database fragmentation, query segmentation, intelligent scheduling, and parallel I/O, mpiBLAST improves NCBI BLAST performance by several orders of magnitude while scaling to hundreds of processors.
HMMER (3.0 and 2.3.2)
HMMER is used for searching sequence databases for homologs of protein sequences, and for making protein sequence alignments. It implements methods using probabilistic models called “profile hidden Markov models” (profile HMMs). Documentation

Multiple and pair-wise sequence alignment:

ClustalW (1.1.18 (clustalw) and 2.0.12 (clustalw2))
ClustalW2 is a general purpose multiple sequence alignment program for DNA or proteins.
T_Coffee (8.14)
A collection of tools for computing, evaluating and manipulating multiple alignments of DNA, RNA, protein sequences and structures.
Muscle (3.8.31)
“Faster and more accurate than CLUSTALW”…
Uclust (1.0.50 and 1.2.22q (Qiime))
“Search and clustering hundreds of times faster than BLAST”…
Usearch (5.2.32)
USEARCH is a unique high-throughput sequence analysis tool. It supports a variety of algorithms for sequence searching, clustering, and filtering
Mafft (6.833b)
MAFFT is a multiple sequence alignment program for unix-like operating systems. It offers a range of multiple alignment methods, L-INS-i (accurate; for alignment of up to about 200 sequences), FFT-NS-2 (fast; for alignment of up to about 10,000 sequences), etc.
GBlocks (0.91b)
Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis.
Exonerate (2.2.0)
Exonerate is a generic tool for pairwise sequence comparison.
TranslatorX
A perl script for nucleotide sequence alignment and alignment cleaning based on amino acid information. Uses ReadSeq and GBlocks. Alignments via Muscle, Clustalw, MAFFT, and T-Coffee.
Blat
BLAT (the BLAST-Like Alignment Tool) is a software program developed by Jim Kent at UCSC to identify similarities between DNA sequences and protein sequences.
SEED
SEED is software for clustering large sets of Next Generation Sequences (NGS) with hundreds of millions of reads in a time and memory efficient manner. Its algorithm joins highly similar sequences into clusters that can differ by up to three mismatches and three overhanging residues. Article.

Population genetics / coalescent:

PopABC (1.0)
PopABC is an Approximate Bayesian Computation (ABC) method to estimate historical demographic parameters (e.g. population size, migration rate, mutation rate, recombination rate, splitting events) within an Isolation with Migration (IM) population model.
Migrate-n
Estimation of population sizes and gene flow using the coalescent.
Lamarc (2.1.6)
LAMARC is a program which estimates population-genetic parameters such as population size, population growth rate, recombination rate, and migration rates. It approximates a summation over all possible genealogies that could explain the observed sample, which may be sequence, SNP, microsatellite, or electrophoretic data. LAMARC and its sister program Migrate are successor programs to the older programs Coalesce, Fluctuate, and Recombine, which are no longer being supported. Documentation
IMa2 (8/26/2011)
The program implements a method for generating posterior probabilities for complex demographic population genetic models. IMa2 works similarly to the older IMa program, with some important additions. IMa2 can handle data and implement a model for multiple populations (for numbers of sampled populations between one and ten) – not just two populations (as was the case with the original IM and IMa programs).

Phylogenetic analyses:

MrBayes MPI (3.1.2) and MrBayes v3.2-svn (r517, development version) with Beagle-lib
MrBayes is a program for the Bayesian estimation of phylogeny. (See Submitting parallel jobs:) Documentation
Phyml (2.4.4)
A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Guindon S., Gascuel O. Systematic Biology, 52(5):696-704, 2003.
RAxML-HPC, RAxML-MPI (7.0.4) and 7.2.8a
Maximum likelihood estimation of phylogenies. (See Submitting parallel jobs:). Documentation
qmmraxmlHPC (1.0)
Uses a class-frequency (cF) mixture model to model site-specific distributions for phylogenetic inference.
Phylip (3.6.8)
All things Felsenstein.
Phylobayes (3.3)
PhyloBayes is a Bayesian Monte Carlo Markov Chain (MCMC) sampler for phylogenetic reconstruction using protein alignments. Compared to other phylogenetic MCMC samplers, the main distinguishing feature of PhyloBayes is the underlying probabilistic model, CAT (Lartillot and Philippe, 2004). CAT is a mixture model especially devised to account for site-specific features of protein evolution. It is particularly well suited for large multigene alignments, such as those used in phylogenomics. Documentation
NH Phylobayes (0.2.3)
A Bayesian compound stochastic process for modeling nonstationary and nonhomogeneous sequence evolution. Some notes on running nhpb
p4 (python2.7 : numpy-1.5.0 : gsl-1.14)
P4 is a Python package that does maximum likelihood and Bayesian phylogenetic analyses on molecular sequences. Its specialty is that you can use heterogeneous models, where the model parameters can differ in different parts of the tree, or over different parts of the data. Includes Qdist module – Installation Documentation

DendroPy (3.11.0)
DendroPy is a Python library for phylogenetic computing. It provides classes and functions for the simulation, processing, and manipulation of phylogenetic trees and character matrices, and supports the reading and writing of phylogenetic data in a range of formats, such as NEXUS, NEWICK, NeXML, Phylip, FASTA, etc. Application scripts for performing some useful phylogenetic operations, such as data conversion and tree posterior distribution summarization, are also distributed and installed as part of the library. DendroPy can thus function as a stand-alone library for phylogenetics, a component of more complex multi-library phyloinformatic pipelines, or as a scripting “glue” that assembles and drives such pipelines. Tutorial
BayesTraits
BayesTraits is a computer package for performing analyses of trait evolution among groups of species for which a phylogeny or sample of phylogenies is available. This new package incorporates our earlier and separate programs Multistate, Discrete and Continuous. BayesTraits can be applied to the analysis of traits that adopt a finite number of discrete states, or to the analysis of continuously varying traits. Hypotheses can be tested about models of evolution, about ancestral states and about correlations among pairs of traits.
BayesPhylogenies (1.0)
BayesPhylogenies is a general package for inferring phylogenetic trees using Bayesian Markov Chain Monte Carlo (MCMC) or Metropolis-coupled Markov chain Monte Carlo (MCMCMC) methods. The program allows a range of models of gene sequence evolution, models for morphological traits, models for rooted trees, gamma and beta distributed rate-heterogeneity, and implements a ‘mixture model’ (Pagel and Meade, 2004) that allows the user to fit more than one model of sequence evolution, without partitioning the data.
Beast (1.6.1) with Beagle-lib
BEAST is a cross-platform program for Bayesian MCMC analysis of molecular sequences. It is entirely orientated towards rooted, time-measured phylogenies inferred using strict or relaxed molecular clock models. It can be used as a method of reconstructing phylogenies but is also a framework for testing evolutionary hypotheses without conditioning on a single tree topology. BEAST uses MCMC to average over tree space, so that each tree is weighted proportional to its posterior probability. Installation and use of Beagle-lib with Beast
Modelgenerator (85)
ModelGenerator is a model selection program that selects optimal amino acid and nucleotide substitution models from Fasta or Phylip alignments. ModelGenerator supports 56 nucleotide and 96 amino acid substitution models.
Jmodeltest (0.1.1)
jModelTest is a tool to carry out statistical selection of best-fit models of nucleotide substitution. It implements five different model selection strategies: hierarchical and dynamical likelihood ratio tests (hLRT and dLRT), Akaike and Bayesian information criteria (AIC and BIC), and a decision theory method (DT). It also provides estimates of model selection uncertainty, parameter importances and model-averaged parameter estimates, including model-averaged phylogenies.
MrModeltest2 (2.3)
C program for selecting DNA substitution models using PAUP*.
Garli (vers 1.0 and vers 2.0 MPI)
GARLI (Genetic Algorithm for Rapid Likelihood Inference) performs phylogenetic searches on aligned nucleotide, codon and amino acid data sets using the maximum likelihood criterion. On a practical level, the program is able to perform maximum-likelihood tree searches on large data sets in a number of hours.
Prottest (2.4)
PROTTEST (ModelTest’s relative) is a program for selecting the model of protein evolution that best fits a given set of sequences (alignment). This java program is based on the Phyml program (for maximum likelihood calculations and optimization of parameters) and uses the PAL library as well. Models included are empirical substitution matrices (such as WAG, LG, mtREV, Dayhoff, DCMut, JTT, VT, Blosum62, CpREV, RtREV, MtMam, MtArt, HIVb, and HIVw) that indicate relative rates of amino acid replacement, and specific improvements (+I: invariable sites, +G: rate heterogeneity among sites, +F: observed amino acid frequencies) to account for the evolutionary constraints imposed by conservation of protein structure and function. ProtTest uses the Akaike Information Criterion (AIC) and other statistics (AICc and BIC) to find which of the candidate models best fits the data at hand.
PAUP (4.0b10)
Needs no introduction. – 4.0 final release expected any day now…
PAML (4.4)
PAML is a package of programs for phylogenetic analyses of DNA or protein sequences using maximum likelihood. It is maintained and distributed for academic use free of charge by Ziheng Yang. ANSI C source codes are distributed for UNIX/Linux/Mac OSX, and executables are provided for MS Windows. PAML is not good for tree making. It may be used to estimate parameters and test hypotheses to study the evolutionary process, when you have reconstructed trees using other programs such as PAUP*, PHYLIP, MOLPHY, PhyML, RaxML, etc.
Tree-Puzzle (5.2)
TREE-PUZZLE is a computer program to reconstruct phylogenetic trees from molecular sequence data by maximum likelihood. It implements a fast tree search algorithm, quartet puzzling, that allows analysis of large data sets and automatically assigns estimations of support to each internal branch. TREE-PUZZLE also computes pairwise maximum likelihood distances as well as branch lengths for user specified trees. Branch lengths can be calculated with and without the molecular-clock assumption. In addition, TREE-PUZZLE offers likelihood mapping, a method to investigate the support of a hypothesized internal branch without computing an overall tree and to visualize the phylogenetic content of a sequence alignment. TREE-PUZZLE also conducts a number of statistical tests on the data set (chi-square test for homogeneity of base composition, likelihood ratio to test the clock hypothesis, one and two-sided Kishino-Hasegawa test, Shimodaira-Hasegawa test, Expected Likelihood Weights).
Consel (0.1.k)
CONSEL is a program package consisting of small programs written in C. It calculates the probability value (i.e., p-value) to assess the confidence in the selection problem. Although CONSEL is applicable to any selection problem, it is mainly designed for phylogenetic tree selection. CONSEL calculates the p-value using several testing procedures: the bootstrap probability, the Kishino-Hasegawa test, the Shimodaira-Hasegawa test, and the weighted Shimodaira-Hasegawa test. In addition to these conventional tests, CONSEL calculates the p-value based on the approximately unbiased test using the multi-scale bootstrap technique. Documentation
Tree Congruence Tester (tct.py) – (Requires p4)
Tree Congruence Test(er): reads two rooted trees (NEXUS and/or PHYLIP format), reciprocally prunes each tree of missing taxa (automatic), deletes any taxa passed to the programme (via -d), and tests topological congruence among the remaining clades supported by a value greater than the set threshold.
combineNexus.py – (Requires p4 and >= Python 2.7)
Reads Nexus formatted matrices and combines them into a single matrix with blank sequences for genes that are missing from individual matrices.
calculatePhylogeneticDiversity.py – (Requires p4 and >= Python 2.7)
Calculate the Phylogenetic Diversity (PD: Faith 1992) of a group of taxa on a tree. PD is the minimum total length of all the phylogenetic branches required to span a given set of taxa on the phylogenetic tree (and does not include the stem branch of a clade).
makeConsensusTree.py – (Requires p4 and >= Python 2.7)
Write a majority rule consensus tree from 1 or more Nexus or Newick formatted tree files. Each tree file can have a specified burnin and/or step count. The data file (in Phylip or Nexus format) from which the trees are derived must be supplied.
Concaterpillar (1.4) – (Requires SciPy and pyMPI)
A hierarchical likelihood ratio test for phylogenetic congruence. Documentation (See Submitting parallel jobs:)
minmax-chisq
Reduced amino acid alphabets for phylogenetic inference. Documentation
Crux (1.2.0)
Crux is a software toolkit for molecular phylogenetic inference. Incl: Bayesian Markov chain Monte Carlo (MCMC) methods (with Metropolis coupling and MPI support for parallel computation) can sample among non-nested models using reversible model jumps. Polytomous trees can be sampled, also via reversible jumps. In fact, every non-essential model parameter that Crux’s MCMC implementation estimates can be expunged via reversible jumps. Notes on running Crux.

Figtree (1.3.1)
FigTree is designed as a graphical viewer of phylogenetic trees and as a program for producing publication-ready figures.
Misfits (1.0)
MISFITS is a program to evaluate the goodness of fit of a model to an alignment in phylogeny reconstruction. Documentation
Compass (1.0)
COMPASS is a UNIX program for identifying and removing fast evolving sites from morphological or molecular data matrices using a number of compatibility-based methods. Documentation
AIS: Almost Invariant Sets
The goal is to identify sets of amino acids with a high probability of change between elements of the set but small probability of change between different sets by using amino acid replacement matrices and their eigenvectors. After identification of the subsets the quality of the partition is assessed with a conductance measure. Documentation
TreeFinder
TREEFINDER computes phylogenetic trees from molecular sequences. The program infers even large trees by maximum likelihood under a variety of models of sequence evolution. Documentation

Genome assembly, mapping, and annotation:

Galaxy
Galaxy is an open, web-based platform for accessible, reproducible, and transparent computational biomedical research. Installing Galaxy on ROCKS.

For an account to use the local installation of Galaxy, email Cymon.

Glimmer (3.02)
Glimmer is a system for finding genes in microbial DNA, especially the genomes of bacteria, archaea, and viruses. Glimmer (Gene Locator and Interpolated Markov ModelER) uses interpolated Markov models (IMMs) to identify the coding regions and distinguish them from noncoding DNA. (See NGS assembly)
TIGR Assembler (v2)
TIGR Assembler: A new tool for assembling large shotgun sequencing projects. Genome Science and Technology. 1995 Jan 01; 1(1): 9-19. (See NGS assembly)
Mira (3.4.0)(+3rd party)
The mira genome fragment assembler is a specialised assembler for sequencing projects classified as ‘hard’ due to a high number of similar repeats. For EST transcripts, miraEST is specialised on reconstructing pristine mRNA transcripts while detecting and classifying single nucleotide polymorphisms (SNP) occurring in different variations thereof. Online wiki Documentation (See NGS assembly) The Definitive Guide to Mira3. Note that by default Mira uses 2 threads in the SKIM algorithm, so if no more threads are requested, you would typically submit a job requesting 2 slots in the queue (-pe orte 2) – see Submitting parallel jobs:

MUMmer (3.23)
MUMmer is a system for rapidly aligning entire genomes, whether in complete or draft form. Dependency for AMOS.
Roche 454 Data Analysis suite (2.5.3)
Obtain biologically meaningful results from your sequence data quickly and affordably with the powerful suite of analysis tools provided with the Genome Sequencer FLX System, updated for the GS FLX Titanium series. AKA Newbler et al. (See NGS assembly) Documentation:

sff_extract (0.2.8)
454 sequence reads are usually stored in sff files. These files hold the information about the reads: sequence, quality, and quality and adapter clips. sff_extract extracts the reads from the sff files and stores them in fasta and xml or caf text files.
Blast2GO (2.3.5)
Command line version only: b2g4pipe, no visualisation. Blast2GO is an ALL in ONE tool for functional annotation of (novel) sequences and the analysis of annotation data. Documentation Configuration
Octupus (0.1.1)
OCTUPUS uses a novel method of sequence clustering and pairwise comparisons which reduces the influence of chimeras and intraspecific diversity on cluster generation. The clustering approach used to generate OCTUs is designed with the intent to reflect the expected pattern of diversity of rDNA genes. Additionally, OCTUPUS provides a method to screen clusters for evidence of chimera formation without the use of a reference database. OCTUPUS is optimized for speed, and does not require a computing cluster to analyze typical large scale datasets.
Smalt (0.4.1)
SMALT is a pairwise sequence alignment program designed for the efficient mapping of DNA sequencing reads onto genomic reference sequences. Reads from a range of sequencing platforms, for example Illumina-Solexa, Roche-454 or ABI-Sanger, can be processed including paired-end reads. Documentation
Lucy (1.20)
Lucy is a program for DNA sequence quality trimming and vector removal. Its purpose is to process DNA sequence data acquired from DNA sequencers to prepare the data for downstream processing applications such as genome assembly.
Amplicon Noise (1.2.1)
AmpliconNoise is a collection of programs for the removal of noise from 454 sequenced PCR amplicons. It involves two steps: the removal of noise from the sequencing itself and the removal of PCR point errors. This project also includes the Perseus algorithm for chimera removal.
ABySS (1.3.2)
(MPI with sparsehash – max kmer 96) ABySS is a de novo, parallel, paired-end sequence assembler that is designed for short reads. The single-processor version is useful for assembling genomes up to 100 Mbases in size. The parallel version is implemented using MPI and is capable of assembling larger genomes. (See NGS assembly)
Trans-ABySS (1.1.0)
Trans-ABySS is a software pipeline for analyzing ABySS-assembled contigs from shotgun transcriptome data. The pipeline accepts assemblies that were generated across a wide range of k values in order to address variable transcript expression levels. It first filters and merges the multi-k assemblies, generating a much smaller set of nonredundant contigs. It contains scripts that map assembled contigs to known transcripts, currently supporting Blat and Exonerate contig-to-genome aligners. It identifies novel splicing events like exon-skipping, novel exons, retained introns, novel introns, and alternative splicing. Its scripts can also estimate gene expression levels, identify candidate polyadenylation sites, and identify candidate gene-fusion events. Documentation (See NGS assembly)
Ray (1.6.0)
Ray is a parallel genome assembler utilizing MPI. Ray is a single-executable program (the executable is Ray). Its aim is to assemble sequences on MPI-enabled computers or clusters. Ray assembles reads obtained with new sequencing technologies (Illumina, 454, SOLiD) using MPI 2.2 – a message passing interface standard. (See NGS assembly) MAXKMERLENGTH=32
Velvet (1.1.03)
Velvet is a de novo genomic assembler specially designed for short read sequencing technologies, such as Solexa or 454, developed by Daniel Zerbino and Ewan Birney at the European Bioinformatics Institute (EMBL-EBI), near Cambridge, in the United Kingdom. Velvet currently takes in short read sequences, removes errors then produces high quality unique contigs. It then uses paired-end read and long read information, when available, to retrieve the repeated areas between contigs. Documentation (See NGS assembly)
Oases (0.1.21)
Oases is a de novo transcriptome assembler designed to produce transcripts from short read sequencing technologies, such as Illumina, SOLiD, or 454 in the absence of any genomic assembly. Documentation (See NGS assembly)
TopHat (1.2.0)
TopHat is a fast splice junction mapper for RNA-Seq reads. It aligns RNA-Seq reads to mammalian-sized genomes using the ultra high-throughput short read aligner Bowtie, and then analyzes the mapping results to identify splice junctions between exons. Documentation Installation
Bowtie (0.12.7)
Bowtie is an ultrafast, memory-efficient short read aligner. It aligns short DNA sequences (reads) to the human genome at a rate of over 25 million 35-bp reads per hour. Bowtie indexes the genome with a Burrows-Wheeler index to keep its memory footprint small: typically about 2.2 GB for the human genome (2.9 GB for paired-end). Documentation (See NGS assembly)
Cufflinks (0.9.3beta)
Transcript assembly, differential expression, and differential regulation for RNA-Seq. Cufflinks assembles transcripts, estimates their abundances, and tests for differential expression and regulation in RNA-Seq samples. It accepts aligned RNA-Seq reads and assembles the alignments into a parsimonious set of transcripts. Cufflinks then estimates the relative abundances of these transcripts based on how many reads support each one. Documentation
clean_reads (0.2)
clean_reads cleans NGS (Sanger, 454, Illumina and SOLiD) reads. It can trim:

  • bad quality regions,
  • adaptors,
  • vectors, and
  • regular expressions.

It also filters out the reads that do not meet minimum quality criteria based on the sequence length and the mean quality. It uses several algorithms and third party tools to carry out the cleaning. The third party tools used are: lucy, blast, mdust and trimpoly. The functionality offered by clean_reads is similar to the cleaning capabilities of the ngs_backbone pipeline. In fact, both tools use the same code base and are just different interfaces on top of a Python library called franklin. Can be parallelised with psubprocess.

prinseq-lite (0.14.4)
PRINSEQ is a publicly available tool that is able to filter, reformat and trim your sequences and to provide you summary statistics for your sequence data. Documentation
DeconSeq
Sequences obtained from impure nucleic acid preparations may contain DNA from sources other than the sample. Those sequence contaminations are a serious concern to the quality of the data used for downstream analysis, possibly causing misassembly of sequence contigs and erroneous conclusions. Therefore, the removal of sequence contaminants presents a necessary step for all metagenomic projects.
Fastqc (0.9.1)
A quality control tool for high throughput sequence data.
SOAP
SOAPaligner/soap2 is a member of SOAP (Short Oligonucleotide Analysis Package). It is an updated version of the SOAP software for short oligonucleotide alignment. The new program features super-fast and accurate alignment for huge amounts of short reads generated by the Illumina/Solexa Genome Analyzer.

  • soap2sam.pl
  • 2bwt-builder
  • SOAPdenovo31mer (1.05): A de novo short reads assembler.
  • soap (SOAPaligner/soap2) (2.20): Short Oligonucleotide Analysis Package.
SOLiD SAGE Analysis Software (v1.10)
SOLiD™ SAGE™ Analysis Software v1.10 is a Linux-based program that takes the raw data files from SOLiD™ SAGE™ sequencing reads and matches them to known sequences in your reference database of choice. It is designed for use with the SOLiD™ SAGE™ Kit or the SOLiD™ SAGE™ Kit with Barcoding Adaptor Module, which generates libraries of 27-bp tags for all transcripts in a cell. Documentation
martyr.py (1.0)
A Python version of MARTA. It basically annotates BLAST (x or n) XML output with the NCBI taxonomy. Requires: Biopython, blastdbcmd, the NCBI taxonomy dump and the target BLAST DB in $BLASTDB, and a BioSQL database in PostgreSQL with the NCBI taxonomy loaded.
RDP Classifier
The RDP Classifier is a naive Bayesian classifier that can rapidly and accurately provide taxonomic assignments from domain to genus, with confidence estimates for each assignment. More information can be found at the Ribosomal Database Project.
ACEAssemblySplitter.py (1.0)
Split an ACE format assembly file into multiple single ACE files each with a single contig. The resulting ACE contig files are formatted so that they can be read by CodonCode.
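The splitting step can be pictured with the following Python sketch. It is not the ACEAssemblySplitter.py source: it assumes the standard ACE layout (an "AS" header followed by contig blocks that each begin with a "CO" line), the function name is hypothetical, and any assembly-level tags at the end of the file would simply stay attached to the last contig. Whether the output matches what CodonCode expects has not been checked.

```python
from typing import Dict, List, Optional


def split_ace(text: str) -> Dict[str, str]:
    """Map each contig name to a standalone single-contig ACE text."""
    blocks: Dict[str, List[str]] = {}
    name: Optional[str] = None
    lines: List[str] = []
    for line in text.splitlines():
        if line.startswith("CO "):
            # A new contig starts: store the previous one, if any.
            if name is not None:
                blocks[name] = lines
            name, lines = line.split()[1], [line]
        elif name is not None:
            lines.append(line)
    if name is not None:
        blocks[name] = lines

    result = {}
    for contig, contig_lines in blocks.items():
        # CO line fields: CO <name> <bases> <reads> <segments> <U|C>
        n_reads = contig_lines[0].split()[3]
        body = "\n".join(contig_lines).rstrip()
        # Each output gets its own header: one contig, n_reads reads.
        result[contig] = "AS 1 {}\n\n{}\n".format(n_reads, body)
    return result
```

Each value in the returned dictionary could then be written to its own file, for example one named after the contig.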
PyCogent (1.5.1)
PyCogent: A toolkit for making sense from sequence. PyCogent includes connectors to remote databases, built-in generalized probabilistic techniques for working with biological sequences, and controllers for 3rd party applications.
PYNAST (1.1)
PyNAST is a python implementation of the NAST sequence alignment tool.
ChimeraSlayer (MicrobiomeUtilities-r20110519)
A set of software utilities for processing and analyzing 16S rRNA genes including generating NAST alignments, chimera checking, and assembling paired 16S rRNA reads according to reference sequence homology.
Qiime (svn repository: r2753)
QIIME (pronounced chime) stands for Quantitative Insights Into Microbial Ecology. QIIME is an open source software package for comparison and analysis of microbial communities, primarily based on high-throughput amplicon sequencing data (such as SSU rRNA) generated on a variety of platforms, but also supporting analysis of other types of data (such as shotgun metagenomic data). QIIME takes users from their raw sequencing output through initial analyses such as OTU picking, taxonomic assignment, and construction of phylogenetic trees from representative sequences of OTUs, and through downstream statistical analysis, visualization, and production of publication-quality graphics. QIIME has been applied to single studies based on billions of sequences from thousands of samples. Notes on installation.

AMOS (3.1.0)

Includes:

The AMOS consortium is committed to the development of open-source whole genome assembly software. The project acronym (AMOS) represents our primary goal – to produce A Modular, Open-Source whole genome assembler. Open-source so that everyone is welcome to contribute and help build outstanding assembly tools, and modular in nature so that new contributions can be easily inserted into an existing assembly pipeline. This modular design will foster the development of new assembly algorithms and allow the AMOS project to continually grow and improve in hopes of eventually becoming a widely accepted and deployed assembly infrastructure. In this sense, AMOS is both a design philosophy and a software system. Notes on installation.

BWA (0.5.8c)
BWA is a fast, lightweight tool that aligns relatively short sequences (queries) to a sequence database (target), such as the human reference genome.
BFAST (0.6.4e)
BFAST facilitates the fast and accurate mapping of short reads to reference sequences. Some advantages of BFAST include:

  • Speed: enables billions of short reads to be mapped quickly.
  • Accuracy: A priori probabilities for mapping reads with defined set of variants.
  • An easy way to measurably tune accuracy at the expense of speed.

Specifically, BFAST was designed to facilitate whole-genome resequencing, where mapping billions of short reads with variants is of utmost importance. BFAST supports both Illumina and ABI SOLiD data, as well as any other Next-Generation Sequencing Technology (454, Helicos), with particular emphasis on sensitivity towards errors, SNPs and especially indels. Other algorithms take short-cuts by ignoring errors, certain types of variants (indels), and even require further alignment, all to be the “fastest” (but still not complete). BFAST is able to be tuned to find variants regardless of the error-rate, polymorphism rate, or other factors.
Maq (0.7.1)
Maq is software that builds mapping assemblies from short reads generated by the next-generation sequencing machines. It is particularly designed for Illumina-Solexa 1G Genetic Analyzer, and has preliminary functions to handle ABI SOLiD data. Documentation
Integrated Genome Browser (6.7.0)
The Integrated Genome Browser (IGB, pronounced Ig-Bee) is an interactive, zoomable, scrollable software program you can use to visualize and explore genome-scale data sets, such as tiling array data, next-generation sequencing results, genome annotations, microarray designs, and the sequence itself. Documentation
Tablet – Next Generation Sequence Assembly Visualization
Tablet is a lightweight, high-performance graphical viewer for next generation sequence assemblies and alignments.
Trinity (r2012-03-17) – RNA-Seq De novo Assembly
A novel method for the efficient and robust de novo reconstruction of transcriptomes from RNA-seq data. Trinity combines three independent software modules: Inchworm, Chrysalis, and Butterfly, applied sequentially to process large volumes of RNA-seq reads. Trinity partitions the sequence data into many individual de Bruijn graphs, each representing the transcriptional complexity at a given gene or locus, and then processes each graph independently to extract full-length splicing isoforms and to tease apart transcripts derived from paralogous genes.
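For readers unfamiliar with de Bruijn graphs, the following Python sketch builds one from a handful of reads. It only illustrates the data structure; it is not how Inchworm, Chrysalis or Butterfly are implemented, and the function name and toy reads are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def de_bruijn_graph(reads: Iterable[str], k: int) -> Dict[str, List[str]]:
    """Map each (k-1)-mer to the (k-1)-mers that follow it in some read.

    Every k-mer in a read contributes one edge from its prefix to its
    suffix; repeated edges are kept, so list length reflects coverage.
    """
    graph: Dict[str, List[str]] = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


if __name__ == "__main__":
    for node, successors in de_bruijn_graph(["ACGT", "CGTA"], k=3).items():
        print(node, "->", successors)
```

Paths through such a graph spell out candidate sequences, and branches in it correspond to the kind of alternative isoforms or paralogues an assembler has to resolve.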
Mauve (2.3.1) – Multiple Genome Alignment
Mauve is a system for efficiently constructing multiple genome alignments in the presence of large-scale evolutionary events such as rearrangement and inversion. Multiple genome alignment provides a basis for research into comparative genomics and the study of evolutionary dynamics. Aligning whole genomes is a fundamentally different problem than aligning short sequences. Documentation

Bioinformatics suites and libraries:

Emboss (6.4.0)
EMBOSS is “The European Molecular Biology Open Software Suite”.
FASTA
The FASTA programs find regions of local or global (new) similarity between Protein or DNA sequences, either by searching Protein or DNA databases, or by identifying local duplications within a sequence. Other programs provide information on the statistical significance of an alignment. Like BLAST, FASTA can be used to infer functional and evolutionary relationships between sequences as well as help identify members of gene families.
Biopython (1.59) – (Python2.7)
Biopython is a set of freely available tools for biological computation written in Python. Documentation
Wise2 (2.2.0)
Wise2 is a package focused on comparisons of biopolymers, commonly DNA sequence and protein sequence. There are many other packages which do this, probably the best known being the BLAST package (from NCBI) and the FASTA package (from Bill Pearson).
NCL (2.1.14)
The NEXUS Class Library (NCL) is an integrated collection of C++ classes designed to allow the user to quickly write a program that reads NEXUS-formatted data files. It also allows easy extension of the NEXUS format to include new blocks of your own design.

R statistics (2.12.0)

R is a language and environment for statistical computing and graphics. Bioconductor provides tools for the analysis and comprehension of high-throughput genomic data. EdgeR – differential expression analysis of RNA-seq and digital gene expression profiles with biological replication. Uses empirical Bayes estimation and exact tests based on the negative binomial distribution. Also useful for differential signal analysis with other types of genome-scale count data.

BioPerl
Bioinformatics from the Dark Side.
Pysam (0.3.1)
Pysam is a python module for reading and manipulating Samfiles. It’s a lightweight wrapper of the samtools C-API.
Samtools (0.1.12a)
SAM (Sequence Alignment/Map) format is a generic format for storing large nucleotide sequence alignments. SAM Tools provide various utilities for manipulating alignments in the SAM format, including sorting, merging, indexing and generating alignments in a per-position format.
SciPy (0.9.0)
SciPy (pronounced “Sigh Pie”) is open-source software for mathematics, science, and engineering. The SciPy library depends on NumPy, which provides convenient and fast N-dimensional array manipulation. The SciPy library is built to work with NumPy arrays, and provides many user-friendly and efficient numerical routines such as routines for numerical integration and optimization. Installation with ATLAS and complete LAPACK library
NCBI Sequence Read Archive (SRA) Toolkit (2.0rc5)
Stuff from NCBI to manipulate SRAs. Documentation
FASTX-toolkit
The FASTX-Toolkit is a collection of command line tools for Short-Reads FASTA/FASTQ files preprocessing.

  • FASTQ-to-FASTA converter Convert FASTQ files to FASTA files.
  • FASTQ Information Chart Quality Statistics and Nucleotide Distribution
  • FASTQ/A Collapser Collapsing identical sequences in a FASTQ/A file into a single sequence (while maintaining reads counts)
  • FASTQ/A Trimmer Shortening reads in a FASTQ or FASTA file (removing barcodes or noise).
  • FASTQ/A Renamer Renames the sequence identifiers in FASTQ/A file.
  • FASTQ/A Clipper Removing sequencing adapters / linkers
  • FASTQ/A Reverse-Complement Producing the Reverse-complement of each sequence in a FASTQ/FASTA file.
  • FASTQ/A Barcode splitter Splitting a FASTQ/FASTA file containing multiple samples
  • FASTA Formatter Changes the width of sequence lines in a FASTA file
  • FASTA Nucleotide Changer Converts FASTA sequences from/to RNA/DNA
  • FASTQ Quality Filter Filters sequences based on quality
  • FASTQ Quality Trimmer Trims (cuts) sequences based on quality
  • FASTQ Masker Masks nucleotides with ‘N’ (or other character) based on quality
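To make two of the operations listed above concrete (collapsing identical sequences while keeping read counts, and reverse-complementing), here is a small Python sketch. It does not reproduce the FASTX-Toolkit code; the function names and the "rank-count" naming of collapsed records are assumptions made for the example.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Translation table for DNA complements; N stays N.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def collapse(sequences: Iterable[str]) -> List[Tuple[str, str]]:
    """Collapse identical sequences into (name, sequence) pairs.

    The name encodes rank and read count as 'rank-count', most
    abundant sequence first.
    """
    counts = Counter(sequences)
    return [
        ("{}-{}".format(rank, count), seq)
        for rank, (seq, count) in enumerate(counts.most_common(), start=1)
    ]


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


if __name__ == "__main__":
    for name, seq in collapse(["ACGT", "TTTT", "ACGT"]):
        print(">" + name)
        print(seq)
    print(reverse_complement("AACG"))
```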
pyfasta (0.4.3)
Fast, memory-efficient, pythonic (and command-line) access to fasta sequence files.
Seqmagick (0.3.1)
Seqmagick is a kickass little utility built in the spirit of imagemagick to expose the file format conversion in Biopython in a convenient way.
Bio++ (core-2.0.1)
Bio++ is a set of C++ libraries for Bioinformatics, including sequence analysis, phylogenetics, molecular evolution and population genetics. Bio++ is fully Object Oriented and is designed to be both easy to use and computer efficient. Installation Documentation
cdbfasta
Use cdbfasta to create the index file for a multi-FASTA file and cdbyank to pull records based on that index file.
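The index-then-fetch idea can be sketched in Python as below. This is not cdbfasta's on-disk index format: it keeps a plain in-memory dictionary of byte offsets, and the function names are hypothetical.

```python
import io
from typing import BinaryIO, Dict


def build_index(handle: BinaryIO) -> Dict[str, int]:
    """Map each record id to the byte offset of its '>' header line."""
    index: Dict[str, int] = {}
    offset = handle.tell()
    for line in iter(handle.readline, b""):
        if line.startswith(b">"):
            fields = line[1:].split()
            if fields:
                index[fields[0].decode("ascii")] = offset
        offset = handle.tell()
    return index


def fetch(handle: BinaryIO, index: Dict[str, int], record_id: str) -> str:
    """Seek straight to a record and return it as FASTA text."""
    handle.seek(index[record_id])
    lines = [handle.readline()]  # the header line
    for line in iter(handle.readline, b""):
        if line.startswith(b">"):
            break  # next record begins
        lines.append(line)
    return b"".join(lines).decode("ascii")


if __name__ == "__main__":
    fasta = io.BytesIO(b">a first\nACGT\nAC\n>b\nTTTT\n")
    idx = build_index(fasta)
    print(fetch(fasta, idx, "b"), end="")
```

Because only the offsets are kept, a single record can be pulled from a very large multi-FASTA file without reading the whole file again.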

Molecular modeling:

MODELLER (9v8)
MODELLER is used for homology or comparative modeling of protein three-dimensional structures (1,2). The user provides an alignment of a sequence to be modeled with known related structures and MODELLER automatically calculates a model containing all non-hydrogen atoms. MODELLER implements comparative protein structure modeling by satisfaction of spatial restraints (3,4), and can perform many additional tasks, including de novo modeling of loops in protein structures, optimization of various models of protein structure with respect to a flexibly defined objective function, multiple alignment of protein sequences and/or structures, clustering, searching of sequence databases, comparison of protein structures, etc.
NAMD (2.7 MPI)
A parallel molecular dynamics code designed for high-performance simulation of large biomolecular systems. Documentation (See Submitting parallel jobs:)
Amber11 and AmberTools
Assisted Model Building with Energy Refinement. Documentation Amber11 and Documentation AmberTools
Gromacs (4.5.2)
GROMACS is a versatile package to perform molecular dynamics, i.e. simulate the Newtonian equations of motion for systems with hundreds to millions of particles. It is primarily designed for biochemical molecules like proteins, lipids and nucleic acids that have a lot of complicated bonded interactions, but since GROMACS is extremely fast at calculating the nonbonded interactions (that usually dominate simulations) many groups are also using it for research on non-biological systems, e.g. polymers. Documentation
PSIPRED V32
The PSIPRED Protein Structure Prediction Server aggregates several of our structure prediction methods into one location.

