<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>小生这厢有礼了(BioFaceBook Personal Blog) &#187; software</title>
	<atom:link href="https://www.biofacebook.com/?feed=rss2&#038;tag=software" rel="self" type="application/rss+xml" />
	<link>https://www.biofacebook.com</link>
	<description>Recording bits and pieces of bioinformatics work (NGS, Genome, Meta, Linux)</description>
	<lastBuildDate>Sun, 23 Aug 2020 03:28:53 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=4.1.41</generator>
	<item>
		<title>Commonly used bioinformatics tools</title>
		<link>https://www.biofacebook.com/?p=269</link>
		<comments>https://www.biofacebook.com/?p=269#comments</comments>
		<pubDate>Thu, 14 Jun 2012 08:02:17 +0000</pubDate>
		<dc:creator><![CDATA[szypanther]]></dc:creator>
		<category><![CDATA[next-generation sequencing]]></category>
		<category><![CDATA[bioinformatics]]></category>
		<category><![CDATA[software]]></category>

		<guid isPermaLink="false">http://www.biofacebook.com/?p=269</guid>
		<description><![CDATA[<p>http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/</p> This directory contains applications for stand-alone use, built specifically for a Linux 64-bit machine. For help on the bigBed and bigWig applications see: http://genome.ucsc.edu/goldenPath/help/bigBed.html http://genome.ucsc.edu/goldenPath/help/bigWig.html View the file 'FOOTER' to see the usage statement for each of the applications. Name Last modified Size Description Parent Directory - FOOTER 12-Jun-2012 18:01 65K bedClip 12-Jun-2012 18:01 [...]]]></description>
				<content:encoded><![CDATA[<p>http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/</p>
<pre>This directory contains applications for stand-alone use, 
built specifically for a Linux 64-bit machine.

For help on the bigBed and bigWig applications see:

http://genome.ucsc.edu/goldenPath/help/bigBed.html


http://genome.ucsc.edu/goldenPath/help/bigWig.html

View the file 'FOOTER' to see the usage statement for 
each of the applications.</pre>
<pre>      <a href="http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/?C=N;O=D">Name</a>                    <a href="http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/?C=M;O=A">Last modified</a>      <a href="http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/?C=S;O=A">Size</a>  <a href="http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/?C=D;O=A">Description</a></pre>
<hr />
<pre>      <a href="http://hgdownload.cse.ucsc.edu/admin/exe/">Parent Directory</a>                             -   
      <a href="http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/FOOTER">FOOTER</a>                  12-Jun-2012 18:01   65K  
      <a href="http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/bedClip">bedClip</a>                 12-Jun-2012 18:01  243K  
      <a href="http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/bedExtendRanges">bedExtendRanges</a>         12-Jun-2012 18:01  2.7M  
      <a href="http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/bedGraphToBigWig">bedGraphToBigWig</a>        12-Jun-2012 18:01  251K  
      <a href="http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/bedItemOverlapCount">bedItemOverlapCount</a>     12-Jun-2012 18:01  2.7M  
      <a href="http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/bedSort">bedSort</a>                 12-Jun-2012 18:01  224K  
      <a href="http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/bedToBigBed">bedToBigBed</a>             12-Jun-2012 18:01  334K  
      <a href="http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/bigBedInfo">bigBedInfo</a>              12-Jun-2012 18:01  327K  
      <a href="http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/bigBedSummary">bigBedSummary</a>           12-Jun-2012 18:01  330K  
      <a href="http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/bigBedToBed">bigBedToBed</a>             12-Jun-2012 18:01  326K  
      <a href="http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/bigWigAverageOverBed">bigWigAverageOverBed</a>    12-Jun-2012 18:01  334K  
      <a href="http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/bigWigInfo">bigWigInfo</a>              12-Jun-2012 18:01  251K  
      <a href="http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/bigWigSummary">bigWigSummary</a>           12-Jun-2012 18:01  251K  
      <a href="http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/bigWigToBedGraph">bigWigToBedGraph</a>        12-Jun-2012 18:01  251K  
      <a href="http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/bigWigToWig">bigWigToWig</a>             12-Jun-2012 18:01  251K  
      <a href="http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/blat/">blat/</a>                   12-Jun-2012 18:01    -   
      <a href="http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/faCount">faCount</a>                 12-Jun-2012 18:01  163K  
      <a href="http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/faFrag">faFrag</a>                  12-Jun-2012 18:01  160K  
      <a href="http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/faOneRecord">faOneRecord</a>             12-Jun-2012 18:01  137K  
      <a href="http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/faPolyASizes">faPolyASizes</a>            12-Jun-2012 18:01  159K  
      <a href="http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/faRandomize">faRandomize</a>             12-Jun-2012 18:01  160K  
      <a href="http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/faSize">faSize</a>                  12-Jun-2012 18:01  163K  
      <a href="http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/faSomeRecords">faSomeRecords</a>           12-Jun-2012 18:01  142K  
      <a href="http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/faToNib">faToNib</a>                 12-Jun-2012 18:01  166K  
      <a href="http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/faToTwoBit">faToTwoBit</a>              12-Jun-2012 18:01  256K  
      <a href="http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/fetchChromSizes">fetchChromSizes</a>         12-Jun-2012 18:01  2.6K  
      <a href="http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/genePredToGtf">genePredToGtf</a>           12-Jun-2012 18:01  2.7M  
      <a href="http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/gff3ToGenePred">gff3ToGenePred</a>          12-Jun-2012 18:01  2.7M  
      <a href="http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/gtfToGenePred">gtfToGenePred</a>           12-Jun-2012 18:01  2.7M  
      <a href="http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/hgWiggle">hgWiggle</a>                12-Jun-2012 18:01  2.7M  
      <a href="http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/htmlCheck">htmlCheck</a>               12-Jun-2012 18:01  235K  
      <a href="http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/hubCheck">hubCheck</a>                12-Jun-2012 18:01  2.7M  
      <a href="http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/liftOver">liftOver</a>                12-Jun-2012 18:01  2.7M  
      <a href="http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/liftOverMerge">liftOverMerge</a>           12-Jun-2012 18:01  228K  
      <a href="http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/liftUp">liftUp</a>                  12-Jun-2012 18:01  2.7M  
      <a href="http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/mafSpeciesSubset">mafSpeciesSubset</a>        12-Jun-2012 18:01  159K  
      <a href="http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/mafsInRegion">mafsInRegion</a>            12-Jun-2012 18:01  236K  
      <a href="http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/makeTableList">makeTableList</a>           12-Jun-2012 18:01  2.7M  
      <a href="http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/nibFrag">nibFrag</a>                 12-Jun-2012 18:01  167K  
      <a href="http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/overlapSelect">overlapSelect</a>           12-Jun-2012 18:01  2.7M  
      <a href="http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/paraFetch">paraFetch</a>               12-Jun-2012 18:01  210K  
      <a href="http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/paraSync">paraSync</a>                12-Jun-2012 18:01  210K  
      <a href="http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/pslCDnaFilter">pslCDnaFilter</a>           12-Jun-2012 18:01  232K  
      <a href="http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/pslPretty">pslPretty</a>               12-Jun-2012 18:01  1.2M  
      <a href="http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/pslReps">pslReps</a>                 12-Jun-2012 18:01  803K  
      <a href="http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/pslSort">pslSort</a>                 12-Jun-2012 18:01  804K  
      <a href="http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/sizeof">sizeof</a>                  12-Jun-2012 18:01  5.3K  
      <a href="http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/stringify">stringify</a>               12-Jun-2012 18:01  142K  
      <a href="http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/textHistogram">textHistogram</a>           12-Jun-2012 18:01  149K  
      <a href="http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/twoBitInfo">twoBitInfo</a>              12-Jun-2012 18:01  248K  
      <a href="http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/twoBitToFa">twoBitToFa</a>              12-Jun-2012 18:01  330K  
      <a href="http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/validateFiles">validateFiles</a>           12-Jun-2012 18:01  2.7M  
      <a href="http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/wigCorrelate">wigCorrelate</a>            12-Jun-2012 18:01  267K  
      <a href="http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/wigToBigWig">wigToBigWig</a>             12-Jun-2012 18:01  1.0M</pre>
<hr />
<pre>================================================================
========   bedClip   ====================================
================================================================
bedClip - Remove lines from bed file that refer to off-chromosome places.
usage:
   bedClip input.bed chrom.sizes output.bed
options:
   -verbose=2 - set to get list of lines clipped and why
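
The clipping rule can be sketched with standard awk on toy files; bedClip itself does more validation, so this is only an illustration (file names are made up):

```shell
# Toy chrom.sizes (chrom<TAB>size) and a bed file with one off-chromosome entry.
printf 'chr1\t1000\n' > toy.chrom.sizes
printf 'chr1\t0\t100\nchr1\t900\t1100\n' > in.bed
# Rough awk stand-in for bedClip: keep entries that end within the chromosome.
awk 'NR==FNR {size[$1] = $2; next} $3 <= size[$1]' toy.chrom.sizes in.bed > out.bed
cat out.bed   # only chr1 0 100 survives
```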

================================================================
========   bedExtendRanges   ====================================
================================================================
bedExtendRanges - extend length of entries in bed 6+ data to be at least the given length,
taking strand directionality into account.

usage:
   bedExtendRanges database length file(s)

options:
   -host	mysql host
   -user	mysql user
   -password	mysql password
   -tab		Separate by tabs rather than space
   -verbose=N - verbose level for extra information to STDERR

example:

   bedExtendRanges hg18 250 stdin

   bedExtendRanges -user=genome -host=genome-mysql.cse.ucsc.edu hg18 250 stdin

will transform:
    chr1 500 525 . 100 +
    chr1 1000 1025 . 100 -
to:
    chr1 500 750 . 100 +
    chr1 775 1025 . 100 -
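
The strand-aware extension above can be reproduced on the same toy input with a small awk sketch (not the real tool, which also checks coordinates against the database's chrom sizes):

```shell
# Extend each bed6 interval to at least 250 bases: anchor the start for
# '+' strand entries and the end for '-' strand entries.
printf 'chr1\t500\t525\t.\t100\t+\nchr1\t1000\t1025\t.\t100\t-\n' |
awk -v L=250 'BEGIN{OFS="\t"}
  $6 == "+" && $3 - $2 < L {$3 = $2 + L}
  $6 == "-" && $3 - $2 < L {$2 = $3 - L}
  {print}'
# prints the transform shown above: 500-750 (+) and 775-1025 (-)
```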

================================================================
========   bedGraphToBigWig   ====================================
================================================================
bedGraphToBigWig v 4 - Convert a bedGraph file to bigWig.
usage:
   bedGraphToBigWig in.bedGraph chrom.sizes out.bw
where in.bedGraph is a four column file in the format:
      &lt;chrom&gt; &lt;start&gt; &lt;end&gt; &lt;value&gt;
and chrom.sizes is two column: &lt;chromosome name&gt; &lt;size in bases&gt;
and out.bw is the output indexed big wig file.
Use the script fetchChromSizes to obtain the actual chrom.sizes information
from UCSC; please do not make up chrom sizes from your own information.
The input bedGraph file must be sorted, use the unix sort command:
  sort -k1,1 -k2,2n unsorted.bedGraph &gt; sorted.bedGraph
options:
   -blockSize=N - Number of items to bundle in r-tree.  Default 256
   -itemsPerSlot=N - Number of data points bundled at lowest level. Default 1024
   -unc - If set, do not use compression.
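
The required sort order can be checked on a toy bedGraph before converting (file names are made up):

```shell
# Four-column bedGraph (chrom start end value), deliberately out of order.
printf 'chr2\t0\t100\t1.0\nchr1\t500\t600\t2.0\nchr1\t0\t100\t3.0\n' > unsorted.bedGraph
# Sort by chromosome name, then numerically by start, as bedGraphToBigWig requires.
sort -k1,1 -k2,2n unsorted.bedGraph > sorted.bedGraph
cat sorted.bedGraph
```

After sorting, running bedGraphToBigWig sorted.bedGraph chrom.sizes out.bw would produce the indexed file.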
================================================================
========   bedItemOverlapCount   ====================================
================================================================
bedItemOverlapCount - count number of times a base is overlapped by the
	items in a bed file.  Output is bedGraph 4 to stdout.
usage:
 sort bedFile.bed | bedItemOverlapCount [options] &lt;database&gt; stdin
To create a bigWig file from this data to use in a custom track:
 sort -k1,1 bedFile.bed | bedItemOverlapCount [options] &lt;database&gt; stdin \
         &gt; bedFile.bedGraph
 bedGraphToBigWig bedFile.bedGraph chrom.sizes bedFile.bw
   where the chrom.sizes is obtained with the script: fetchChromSizes
   See also:

http://genome-test.cse.ucsc.edu/~kent/src/unzipped/utils/userApps/fetchChromSizes

options:
   -zero      add blocks with zero count; normally these are omitted
   -bed12     expect bed12 and count based on blocks
              Without this option, only the first three fields are used.
   -max       if counts per base overflows set to max (4294967295) instead of exiting
   -outBounds output min/max to stderr
   -chromSize=sizefile	Read chrom sizes from file instead of database
             sizefile contains two white space separated fields per line:
		chrom name and size
   -host=hostname	mysql host used to get chrom sizes
   -user=username	mysql user
   -password=password	mysql password

Notes:
 * You may want to separate your + and - strand
   items before sending into this program as it only looks at
   the chrom, start and end columns of the bed file.
 * Program requires a &lt;database&gt; connection to lookup chrom sizes for a sanity
   check of the incoming data.  Even when the -chromSize argument is used
   the &lt;database&gt; must be present, but it will not be used.

 * The bed file *must* be sorted by chrom
 * Maximum count per base is 4294967295. Recompile with new unitSize to increase this
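
Splitting by strand before counting, as the first note suggests, is a one-liner with awk on the bed strand column (toy file names):

```shell
# Toy bed6 input with mixed strands.
printf 'chr1\t10\t20\tA\t0\t+\nchr1\t15\t25\tB\t0\t-\n' > items.bed
# Column 6 holds the strand; each split file can then be piped through
# 'sort | bedItemOverlapCount' separately.
awk '$6 == "+"' items.bed > plus.bed
awk '$6 == "-"' items.bed > minus.bed
```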
================================================================
========   bedSort   ====================================
================================================================
bedSort - Sort a .bed file by chrom,chromStart
usage:
   bedSort in.bed out.bed
in.bed and out.bed may be the same.
================================================================
========   bedToBigBed   ====================================
================================================================
bedToBigBed v. 2.0 - Convert bed file to bigBed. (BigBed version: 4)
usage:
   bedToBigBed in.bed chrom.sizes out.bb
Where in.bed is in one of the ascii bed formats, but not including track lines
and chrom.sizes is two column: &lt;chromosome name&gt; &lt;size in bases&gt;
and out.bb is the output indexed big bed file.
Use the script fetchChromSizes to obtain the actual chrom.sizes information
from UCSC; please do not make up chrom sizes from your own information.
The in.bed file must be sorted by chromosome,start,
  to sort a bed file, use the unix sort command:
     sort -k1,1 -k2,2n unsorted.bed &gt; sorted.bed

options:
   -type=bedN[+[P]] : 
                      N is between 3 and 15, 
                      optional (+) if extra "bedPlus" fields, 
                      optional P specifies the number of extra fields. Not required, but preferred.
                      Examples: -type=bed6 or -type=bed6+ or -type=bed6+3 
                      (see http://genome.ucsc.edu/FAQ/FAQformat.html#format1)
   -as=fields.as - If you have non-standard "bedPlus" fields, it's great to put a definition
                   of each field in a row in AutoSql format here.
   -blockSize=N - Number of items to bundle in r-tree.  Default 256
   -itemsPerSlot=N - Number of data points bundled at lowest level. Default 512
   -unc - If set, do not use compression.
   -tab - If set, expect fields to be tab separated, normally
           expects white space separator.

================================================================
========   bigBedInfo   ====================================
================================================================
bigBedInfo - Show information about a bigBed file.
usage:
   bigBedInfo file.bb
options:
   -udcDir=/dir/to/cache - place to put cache for remote bigBed/bigWigs
   -chroms - list all chromosomes and their sizes
   -zooms - list all zoom levels and their sizes
   -as - get autoSql spec

================================================================
========   bigBedSummary   ====================================
================================================================
bigBedSummary - Extract summary information from a bigBed file.
usage:
   bigBedSummary file.bb chrom start end dataPoints
Get summary data from bigBed for indicated region, broken into
dataPoints equal parts.  (Use dataPoints=1 for simple summary.)
options:
   -type=X where X is one of:
         coverage - % of region that is covered (default)
         mean - average depth of covered regions
         min - minimum depth of covered regions
         max - maximum depth of covered regions
   -fields - print out information on fields in file.
      If fields option is used, the chrom, start, end, dataPoints
      parameters may be omitted
   -udcDir=/dir/to/cache - place to put cache for remote bigBed/bigWigs

================================================================
========   bigBedToBed   ====================================
================================================================
bigBedToBed - Convert from bigBed to ascii bed format.
usage:
   bigBedToBed input.bb output.bed
options:
   -chrom=chr1 - if set restrict output to given chromosome
   -start=N - if set, restrict output to only that over start
   -end=N - if set, restrict output to only that under end
   -maxItems=N - if set, restrict output to first N items
   -udcDir=/dir/to/cache - place to put cache for remote bigBed/bigWigs

================================================================
========   bigWigAverageOverBed   ====================================
================================================================
bigWigAverageOverBed - Compute average score of big wig over each bed, which may have introns.
usage:
   bigWigAverageOverBed in.bw in.bed out.tab
The output columns are:
   name - name field from bed, which should be unique
   size - size of bed (sum of exon sizes)
   covered - # bases within exons covered by bigWig
   sum - sum of values over all bases covered
   mean0 - average over bases with non-covered bases counting as zeroes
   mean - average over just covered bases
Options:
   -bedOut=out.bed - Make output bed that is echo of input bed but with mean column appended
   -sampleAroundCenter=N - Take sample at region N bases wide centered around bed item, rather
                     than the usual sample in the bed item.
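
The relationship between the output columns can be verified on a made-up out.tab row: mean0 is sum/size and mean is sum/covered.

```shell
# Hypothetical out.tab row: name size covered sum mean0 mean
printf 'exonA\t100\t80\t160\t1.6\t2.0\n' > out.tab
# Recompute mean0 (sum/size) and mean (sum/covered) from the raw columns.
awk '{printf "%.1f %.1f\n", $4 / $2, $4 / $3}' out.tab   # prints 1.6 2.0
```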

================================================================
========   bigWigInfo   ====================================
================================================================
bigWigInfo - Print out information about bigWig file.
usage:
   bigWigInfo file.bw
options:
   -udcDir=/dir/to/cache - place to put cache for remote bigBed/bigWigs
   -chroms - list all chromosomes and their sizes
   -zooms - list all zoom levels and their sizes
   -minMax - list the min and max on a single line

================================================================
========   bigWigSummary   ====================================
================================================================
bigWigSummary - Extract summary information from a bigWig file.
usage:
   bigWigSummary file.bigWig chrom start end dataPoints
Get summary data from bigWig for indicated region, broken into
dataPoints equal parts.  (Use dataPoints=1 for simple summary.)

NOTE:  start and end coordinates are in BED format (0-based)

options:
   -type=X where X is one of:
         mean - average value in region (default)
         min - minimum value in region
         max - maximum value in region
         std - standard deviation in region
         coverage - % of region that is covered
   -udcDir=/dir/to/cache - place to put cache for remote bigBed/bigWigs

================================================================
========   bigWigToBedGraph   ====================================
================================================================
bigWigToBedGraph - Convert from bigWig to bedGraph format.
usage:
   bigWigToBedGraph in.bigWig out.bedGraph
options:
   -chrom=chr1 - if set restrict output to given chromosome
   -start=N - if set, restrict output to only that over start
   -end=N - if set, restrict output to only that under end
   -udcDir=/dir/to/cache - place to put cache for remote bigBed/bigWigs

================================================================
========   bigWigToWig   ====================================
================================================================
bigWigToWig - Convert bigWig to wig.  This will keep more of the same structure of the
original wig than bigWigToBedGraph does, but still will break up large stepped sections
into smaller ones.
usage:
   bigWigToWig in.bigWig out.wig
options:
   -chrom=chr1 - if set restrict output to given chromosome
   -start=N - if set, restrict output to only that over start
   -end=N - if set, restrict output to only that under end
   -udcDir=/dir/to/cache - place to put cache for remote bigBed/bigWigs

================================================================
========   blat   ====================================
================================================================
blat - Standalone BLAT v. 34x12 fast sequence search command line tool
usage:
   blat database query [-ooc=11.ooc] output.psl
where:
   database and query are each either a .fa, .nib or .2bit file,
   or a list of these files, one file name per line.
   -ooc=11.ooc tells the program to load over-occurring 11-mers from
               an external file.  This will increase the speed
               by a factor of 40 in many cases, but is not required
   output.psl is where to put the output.
   Subranges of .nib and .2bit files may be specified using the syntax:
      /path/file.nib:seqid:start-end
   or
      /path/file.2bit:seqid:start-end
   or
      /path/file.nib:start-end
   With the second form, a sequence id of file:start-end will be used.
options:
   -t=type     Database type.  Type is one of:
                 dna - DNA sequence
                 prot - protein sequence
                 dnax - DNA sequence translated in six frames to protein
               The default is dna
   -q=type     Query type.  Type is one of:
                 dna - DNA sequence
                 rna - RNA sequence
                 prot - protein sequence
                 dnax - DNA sequence translated in six frames to protein
                 rnax - DNA sequence translated in three frames to protein
               The default is dna
   -prot       Synonymous with -t=prot -q=prot
   -ooc=N.ooc  Use overused tile file N.ooc.  N should correspond to 
               the tileSize
   -tileSize=N sets the size of match that triggers an alignment.  
               Usually between 8 and 12
               Default is 11 for DNA and 5 for protein.
   -stepSize=N spacing between tiles. Default is tileSize.
   -oneOff=N   If set to 1 this allows one mismatch in tile and still
               triggers an alignment.  Default is 0.
   -minMatch=N sets the number of tile matches.  Usually set from 2 to 4
               Default is 2 for nucleotide, 1 for protein.
   -minScore=N sets minimum score.  This is the matches minus the 
               mismatches minus some sort of gap penalty.  Default is 30
   -minIdentity=N Sets minimum sequence identity (in percent).  Default is
               90 for nucleotide searches, 25 for protein or translated
               protein searches.
   -maxGap=N   sets the size of maximum gap between tiles in a clump.  Usually
                set from 0 to 3.  Default is 2. Only relevant for minMatch &gt; 1.
   -noHead     suppress .psl header (so it's just a tab-separated file)
   -makeOoc=N.ooc Make overused tile file. Target needs to be complete genome.
   -repMatch=N sets the number of repetitions of a tile allowed before
               it is marked as overused.  Typically this is 256 for tileSize
               12, 1024 for tile size 11, 4096 for tile size 10.
               Default is 1024.  Typically only comes into play with makeOoc.
               Also affected by stepSize. When stepSize is halved repMatch is
               doubled to compensate.
   -mask=type  Mask out repeats.  Alignments won't be started in masked region
               but may extend through it in nucleotide searches.  Masked areas
               are ignored entirely in protein or translated searches. Types are
                 lower - mask out lower cased sequence
                 upper - mask out upper cased sequence
                 out   - mask according to database.out RepeatMasker .out file
                 file.out - mask database according to RepeatMasker file.out
   -qMask=type Mask out repeats in query sequence.  Similar to -mask above but
               for query rather than target sequence.
   -repeats=type Type is same as mask types above.  Repeat bases will not be
               masked in any way, but matches in repeat areas will be reported
               separately from matches in other areas in the psl output.
   -minRepDivergence=NN - minimum percent divergence of repeats to allow 
               them to be unmasked.  Default is 15.  Only relevant for 
               masking using RepeatMasker .out files.
   -dots=N     Output dot every N sequences to show program's progress
   -trimT      Trim leading poly-T
   -noTrimA    Don't trim trailing poly-A
   -trimHardA  Remove poly-A tail from qSize as well as alignments in 
               psl output
   -fastMap    Run for fast DNA/DNA remapping - not allowing introns, 
               requiring high %ID. Query sizes must not exceed 5000.
   -out=type   Controls output file format.  Type is one of:
                   psl - Default.  Tab separated format, no sequence
                   pslx - Tab separated format with sequence
                   axt - blastz-associated axt format
                   maf - multiz-associated maf format
                   sim4 - similar to sim4 format
                   wublast - similar to wublast format
                   blast - similar to NCBI blast format
                   blast8- NCBI blast tabular format
                   blast9 - NCBI blast tabular format with comments
   -fine       For high quality mRNAs look harder for small initial and
               terminal exons.  Not recommended for ESTs
   -maxIntron=N  Sets maximum intron size. Default is 750000
   -extendThroughN - Allows extension of alignment through large blocks of N's
================================================================
========   faCount   ====================================
================================================================
faCount - count base statistics and CpGs in FA files.
usage:
   faCount file(s).fa
     -summary  show only summary statistics
     -dinuc    include statistics on dinucleotide frequencies
     -strands  count bases on both strands

================================================================
========   faFrag   ====================================
================================================================
faFrag - Extract a piece of DNA from a .fa file.
usage:
   faFrag in.fa start end out.fa
options:
   -mixed - preserve mixed-case in FASTA file

================================================================
========   faOneRecord   ====================================
================================================================
faOneRecord - Extract a single record from a .FA file
usage:
   faOneRecord in.fa recordName

================================================================
========   faPolyASizes   ====================================
================================================================
faPolyASizes - get poly A sizes
usage:
   faPolyASizes in.fa out.tab

output file has four columns:
   id seqSize tailPolyASize headPolyTSize

options:

================================================================
========   faRandomize   ====================================
================================================================
faRandomize - Program to create random fasta records using
same base frequency as seen in original fasta records.
Use optional -seed flag to specify seed for random number
generator.
usage:
   faRandomize in.fa randomized.fa

================================================================
========   faSize   ====================================
================================================================
faSize - print total base count in fa files.
usage:
   faSize file(s).fa
Command flags
   -detailed        outputs name and size of each record
                    has the side effect of printing nothing else
   -tab             output statistics in a tab separated format
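
For a quick sanity check without the binary, the total base count can be approximated with awk on a toy FASTA (faSize also reports N counts and other statistics that this ignores):

```shell
printf '>seq1\nACGT\n>seq2\nACGTACGT\n' > toy.fa
# Rough stand-in for faSize's total: sum the lengths of non-header lines.
awk '!/^>/ {n += length($0)} END {print n}' toy.fa   # prints 12
```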

================================================================
========   faSomeRecords   ====================================
================================================================
faSomeRecords - Extract multiple fa records
usage:
   faSomeRecords in.fa listFile out.fa
options:
   -exclude - output sequences not in the list file.

================================================================
========   faToNib   ====================================
================================================================
faToNib - Convert from .fa to .nib format
usage:
   faToNib [options] in.fa out.nib
options:
   -softMask - create nib that soft-masks lower case sequence
   -hardMask - create nib that hard-masks lower case sequence

================================================================
========   faToTwoBit   ====================================
================================================================
faToTwoBit - Convert DNA from fasta to 2bit format
usage:
   faToTwoBit in.fa [in2.fa in3.fa ...] out.2bit
options:
   -noMask       - Ignore lower-case masking in fa file.
   -stripVersion - Strip off version number after . for genbank accessions.
   -ignoreDups   - only convert first sequence if there are duplicates

================================================================
========   fetchChromSizes   ====================================
================================================================
usage: fetchChromSizes &lt;db&gt; &gt; &lt;db&gt;.chrom.sizes
   used to fetch chrom.sizes information from UCSC for the given &lt;db&gt;
&lt;db&gt; - name of UCSC database, e.g.: hg18, mm9, etc ...

This script expects to find one of the following commands:
   wget, mysql, or ftp in order to fetch information from UCSC.
Route the output to the file &lt;db&gt;.chrom.sizes as indicated above.

Example:   fetchChromSizes hg18 &gt; hg18.chrom.sizes
================================================================
========   genePredToGtf   ====================================
================================================================
genePredToGtf - Convert genePred table or file to gtf.
usage:
   genePredToGtf database genePredTable output.gtf
If database is 'file' then track is interpreted as a file
rather than a table in database.
options:
   -utr - Add 5UTR and 3UTR features
   -honorCdsStat - use cdsStartStat/cdsEndStat when defining start/end
    codon records
   -source=src - set source name to use
   -addComments - Add comments before each set of transcript records.
    allows for easier visual inspection
Note: use a refFlat table or extended genePred table or file to include
the gene_name attribute in the output.  This will not work with a refFlat
table dump file. If you are using a genePred file that starts with a numeric
bin column, drop it using the UNIX cut command:
    cut -f 2- in.gp | genePredToGtf file stdin out.gp
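A hedged illustration of the cut step above (file names and record contents are made up, not real genePred data): dropping a leading numeric bin column from a tab-separated dump before feeding it on.

```shell
# Stand-in 3-column file mimicking a table dump with a leading bin column.
printf '585\tNM_0001\tchr1\n585\tNM_0002\tchr2\n' > in.gp
# Drop the first (bin) column, keeping the rest of each record intact.
cut -f 2- in.gp > nobin.gp
cat nobin.gp
```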

================================================================
========   gfClient   ====================================
================================================================
gfClient v. 34x12 - A client for the genomic finding program that produces a .psl file
usage:
   gfClient host port seqDir in.fa out.psl
where
   host is the name of the machine running the gfServer
   port is the same as you started the gfServer with
   seqDir is the path of the .nib or .2bit files relative to the current dir
       (note these are needed by the client as well as the server)
   in.fa is a fasta format file.  May contain multiple records
   out.psl where to put the output
options:
   -t=type     Database type.  Type is one of:
                 dna - DNA sequence
                 prot - protein sequence
                 dnax - DNA sequence translated in six frames to protein
               The default is dna
   -q=type     Query type.  Type is one of:
                 dna - DNA sequence
                 rna - RNA sequence
                 prot - protein sequence
                 dnax - DNA sequence translated in six frames to protein
                 rnax - DNA sequence translated in three frames to protein
   -prot       Synonymous with -t=prot -q=prot
   -dots=N   Output a dot every N query sequences
   -nohead   Suppresses psl five line header
   -minScore=N sets minimum score.  This is twice the matches minus the 
               mismatches minus some sort of gap penalty.  Default is 30
   -minIdentity=N Sets minimum sequence identity (in percent).  Default is
               90 for nucleotide searches, 25 for protein or translated
               protein searches.
   -out=type   Controls output file format.  Type is one of:
                   psl - Default.  Tab separated format without actual sequence
                   pslx - Tab separated format with sequence
                   axt - blastz-associated axt format
                   maf - multiz-associated maf format
                   sim4 - similar to sim4 format
                   wublast - similar to wublast format
                   blast - similar to NCBI blast format
                   blast8- NCBI blast tabular format
                   blast9 - NCBI blast tabular format with comments
   -maxIntron=N  Sets maximum intron size. Default is 750000
================================================================
========   gfServer   ====================================
================================================================
gfServer v 34x12 - Make a server to quickly find where DNA occurs in genome.
To set up a server:
   gfServer start host port file(s)
   Where the files are in .nib or .2bit format
To remove a server:
   gfServer stop host port
To query a server with DNA sequence:
   gfServer query host port probe.fa
To query a server with protein sequence:
   gfServer protQuery host port probe.fa
To query a server with translated dna sequence:
   gfServer transQuery host port probe.fa
To query server with PCR primers
   gfServer pcr host port fPrimer rPrimer maxDistance
To process one probe fa file against a .nib format genome (not starting server):
   gfServer direct probe.fa file(s).nib
To test pcr without starting server:
   gfServer pcrDirect fPrimer rPrimer file(s).nib
To figure out usage level
   gfServer status host port
To get input file list
   gfServer files host port
Options:
   -tileSize=N size of n-mers to index.  Default is 11 for nucleotides, 4 for
               proteins (or translated nucleotides).
   -stepSize=N spacing between tiles. Default is tileSize.
   -minMatch=N Number of n-mer matches that trigger detailed alignment
               Default is 2 for nucleotides, 3 for proteins.
   -maxGap=N   Number of insertions or deletions allowed between n-mers.
               Default is 2 for nucleotides, 0 for proteins.
   -trans  Translate database to protein in 6 frames.  Note: it is best
           to run this on RepeatMasked data in this case.
   -log=logFile keep a log file that records server requests.
   -seqLog    Include sequences in log file (not logged with -syslog)
   -ipLog     Include user's IP in log file (not logged with -syslog)
   -syslog    Log to syslog
   -logFacility=facility log to the specified syslog facility - default local0.
   -mask      Use masking from nib file.
   -repMatch=N Number of occurrences of a tile (nmer) that trigger repeat masking the tile.
               Default is 1024.
   -maxDnaHits=N Maximum number of hits for a dna query that are sent from the server.
               Default is 100.
   -maxTransHits=N Maximum number of hits for a translated query that are sent from the server.
               Default is 200.
   -maxNtSize=N Maximum size of untranslated DNA query sequence
               Default is 40000
   -maxAaSize=N Maximum size of protein or translated DNA queries
               Default is 8000
   -canStop If set then a quit message will actually take down the
            server

================================================================
========   gff3ToGenePred   ====================================
================================================================
gff3ToGenePred - convert a GFF3 file to a genePred file
usage:
   gff3ToGenePred inGff3 outGp
options:
  -maxParseErrors=50 - Maximum number of parsing errors before aborting. A negative
   value will allow an unlimited number of errors.  Default is 50.
  -maxConverErrors=50 - Maximum number of conversion errors before aborting. A negative
   value will allow an unlimited number of errors.  Default is 50.
  -honorStartStopCodons - only set CDS start/stop status to complete if there are
   corresponding start_stop codon records
This converts:
   - top-level gene records with mRNA records
   - top-level mRNA records
   - mRNA records that contain:
       - exon and CDS
       - CDS, five_prime_UTR, three_prime_UTR
       - only exon for non-coding
The first step is to parse the GFF3 file; up to 50 errors are reported before
aborting.  If the GFF3 file is successfully parsed, it is converted to genePred
annotation.  Up to 50 conversion errors are reported before aborting.

Input file must conform to the GFF3 specification:

http://www.sequenceontology.org/gff3.shtml

================================================================
========   gtfToGenePred   ====================================
================================================================
gtfToGenePred - convert a GTF file to a genePred
usage:
   gtfToGenePred gtf genePred

options:
     -genePredExt - create an extended genePred, including frame
      information and gene name
     -allErrors - skip groups with errors rather than aborting.
      Useful for getting information about as many errors as possible.
     -infoOut=file - write a file with information on each transcript
     -sourcePrefix=pre - only process entries where the source name has the
      specified prefix.  May be repeated.
     -impliedStopAfterCds - implied stop codon is after CDS
     -simple    - just check column validity, not hierarchy, resulting genePred may be damaged
     -geneNameAsName2 - if specified, use gene_name for the name2 field
      instead of gene_id.

================================================================
========   hgWiggle   ====================================
================================================================
hgWiggle - fetch wiggle data from data base or file
usage:
   hgWiggle [options] &lt;track names ...&gt;
options:
   -db=&lt;database&gt; - use specified database
   -chr=chrN - examine data only on chrN
   -chrom=chrN - same as -chr option above
   -position=[chrN:]start-end - examine data in window start-end (1-relative)
             (the chrN: is optional)
   -chromLst=&lt;file&gt; - file with list of chroms to examine
   -doAscii - perform the default ascii output, in addition to other outputs
            - Any of the other -do outputs turn off the default ascii output
            - ***WARNING*** this ascii output is 0-relative offset which
            - *** is *not* the normal wiggle input format.  Use the -lift
            - *** argument -lift=1 to get 1-relative offset:
   -lift=&lt;D&gt; - lift ascii output positions by D (0 default)
   -rawDataOut - output just the data values, nothing else
   -htmlOut - output stats or histogram in HTML instead of plain text
   -doStats - perform stats measurement, default output text, see -htmlOut
   -doBed - output bed format
   -bedFile=&lt;file&gt; - constrain output to ranges specified in bed &lt;file&gt;
   -dataConstraint='DC' - where DC is one of &lt; = &gt;= &lt;= == != 'in range'
   -ll=&lt;F&gt; - lowerLimit compare data values to F (float) (all but 'in range')
   -ul=&lt;F&gt; - upperLimit compare data values to F (float)
		(need both ll and ul when 'in range')

   -help - display more examples and extra options (to stderr)

   When no database is specified, track names will refer to .wig files

   example using the file chrM.wig:
	hgWiggle chrM
   example using the database table hg17.gc5Base:
	hgWiggle -chr=chrM -db=hg17 gc5Base
================================================================
========   htmlCheck   ====================================
================================================================
htmlCheck - Do a little reading and verification of html file
usage:
   htmlCheck how url
where how is:
   ok - just check for 200 return.  Print error message and exit -1 if no 200
   getAll - read the url (header and html) and print to stdout
   getHeader - read the header and print to stdout
   getCookies - print list of cookies
   getHtml - print the html, but not the header to stdout
   getForms - print the form structure to stdout
   getVars - print the form variables to stdout
   getLinks - print links
   getTags - print out just the tags
   checkLinks - check links in page
   checkLinks2 - check links in page and all subpages in same host
             (Just one level of recursion)
   checkLocalLinks - check local links in page
   checkLocalLinks2 - check local links in page and connected local pages
             (Just one level of recursion)
   submit - submit first form in page if any using 'GET' method
   validate - do some basic validations including TABLE/TR/TD nesting
options:
   cookies=cookie.txt - Cookies is a two column file
           containing &lt;cookieName&gt;&lt;space&gt;&lt;value&gt;&lt;newLine&gt;
note: url will need to be in quotes if it contains an ampersand.
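A minimal cookies file in the two-column format described above can be written like this (the cookie names and values here are made up for illustration):

```shell
# One cookie per line: <cookieName><space><value><newLine>
printf 'sessionid abc123\ntheme dark\n' > cookie.txt
cat cookie.txt
```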
================================================================
========   hubCheck   ====================================
================================================================
hubCheck - Check a track data hub for integrity.
usage:
   hubCheck http://yourHost/yourDir/hub.txt
options:
   -udcDir=/dir/to/cache - place to put cache for remote bigBed/bigWigs.
                           Will create this directory if not existing
   -verbose=2            - output verbosely
   -clear=browserMachine - clear hub status, no checking
   -noTracks             - don't check each track, just trackDb

================================================================
========   liftOver   ====================================
================================================================
liftOver - Move annotations from one assembly to another
usage:
   liftOver oldFile map.chain newFile unMapped
oldFile and newFile are in bed format by default, but can be in GFF and
maybe eventually others with the appropriate flags below.
The map.chain file has the old genome as the target and the new genome
as the query.

***********************************************************************
WARNING: liftOver was only designed to work between different
         assemblies of the same organism. It may not do what you want
         if you are lifting between different organisms. If there has
         been a rearrangement in one of the species, the size of the
         region being mapped may change dramatically after mapping.
***********************************************************************

options:
   -minMatch=0.N Minimum ratio of bases that must remap. Default 0.95
   -gff  File is in gff/gtf format.  Note that the gff lines are converted
         separately.  It would be good to have a separate check after this
         that the lines that make up a gene model still make a plausible gene
         after liftOver
   -genePred - File is in genePred format
   -sample - File is in sample format
   -bedPlus=N - File is bed N+ format
   -positions - File is in browser "position" format
   -hasBin - File has bin value (used only with -bedPlus)
   -tab - Separate by tabs rather than space (used only with -bedPlus)
   -pslT - File is in psl format, map target side only
   -minBlocks=0.N Minimum ratio of alignment blocks or exons that must map
                  (default 1.00)
   -fudgeThick    (bed 12 or 12+ only) If thickStart/thickEnd is not mapped,
                  use the closest mapped base.  Recommended if using 
                  -minBlocks.
   -multiple               Allow multiple output regions
   -minChainT, -minChainQ  Minimum chain size in target/query, when mapping
                           to multiple output regions (default 0, 0)
   -minSizeT               deprecated synonym for -minChainT (ENCODE compat.)
   -minSizeQ               Min matching region size in query with -multiple.
   -chainTable             Used with -multiple, format is db.tablename,
                               to extend chains from net (preserves dups)
   -errorHelp              Explain error messages

================================================================
========   liftOverMerge   ====================================
================================================================
liftOverMerge - Merge multiple regions in BED 5 files
                   generated by liftOver -multiple
usage:
   liftOverMerge oldFile newFile
options:
   -mergeGap=N    Max size of gap to merge regions (default 0)

================================================================
========   liftUp   ====================================
================================================================
liftUp - change coordinates of .psl, .agp, .gap, .gl, .out, .gff, .gtf .bscore 
.tab .gdup .axt .chain .net, genePred, .wab, .bed, or .bed8 files to parent
coordinate system.

usage:
   liftUp [-type=.xxx] destFile liftSpec how sourceFile(s)
The optional -type parameter tells what type of files to lift
If omitted the type is inferred from the suffix of destFile
Type is one of the suffixes described above.
DestFile will contain the merged and lifted source files,
with the coordinates translated as per liftSpec.  LiftSpec
is tab-delimited with each line of the form:
   offset oldName oldSize newName newSize
LiftSpec may optionally have a sixth column specifying + or - strand,
but strand is not supported for all input types.
The 'how' parameter controls what the program will do with
items which are not in the liftSpec.  It must be one of:
   carry - Items not in liftSpec are carried to dest without translation
   drop  - Items not in liftSpec are silently dropped from dest
   warn  - Items not in liftSpec are dropped.  A warning is issued
   error - Items not in liftSpec generate an error
If the destination is a .agp file then a 'large inserts' file
also needs to be included in the command line:
   liftUp dest.agp liftSpec how inserts sourceFile(s)
This file describes where large inserts due to heterochromatin
should be added. Use /dev/null and set -gapsize if there is no inserts file.

options:
   -nohead  No header written for .psl files
   -dots=N Output a dot every N lines processed
   -pslQ  Lift query (rather than target) side of psl
   -axtQ  Lift query (rather than target) side of axt
   -chainQ  Lift query (rather than target) side of chain
   -netQ  Lift query (rather than target) side of net
   -wabaQ  Lift query (rather than target) side of waba alignment
   	(waba lifts only work with query side at this time)
   -nosort Don't sort bed, gff, or gdup files, to save memory
   -gapsize change contig gapsize from default
   -ignoreVersions - Ignore NCBI-style version number in sequence ids of input files
   -extGenePred lift extended genePred
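The liftSpec translation above (offset oldName oldSize newName newSize) can be sketched with a toy awk one-liner; this is an illustration of the coordinate shift liftUp performs, not the tool itself, and all names and offsets here are invented:

```shell
# Toy liftSpec: contigA sits at offset 1000 on chr1.
printf '1000\tcontigA\t500\tchr1\t2000\n' > spec.lft
# Toy BED record in contig coordinates.
printf 'contigA\t10\t60\tfeat1\n' > in.bed
# Shift start/end by the contig's offset and rename to the parent sequence.
awk 'BEGIN{FS=OFS="\t"}
     NR==FNR{off[$2]=$1; new[$2]=$4; next}
     {print new[$1], $2+off[$1], $3+off[$1], $4}' spec.lft in.bed
```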

================================================================
========   mafSpeciesSubset   ====================================
================================================================
mafSpeciesSubset - Extract a maf that just has a subset of species.
usage:
   mafSpeciesSubset in.maf species.lst out.maf
Where:
    in.maf is a file where the sequence sources are either simple species
           names or species.something.  In practice the part before the dot
           is usually a genome database name rather than a species name.
    species.lst is a file with a list of species to keep
    out.maf is the output.  It will have columns that are all - or . in
           the reduced species set removed, as well as the lines representing
           species not in species.lst removed.
options:
   -keepFirst - If set, keep the first 'a' line in a maf no matter what
                Useful for mafFrag results where we use this for the gene name

================================================================
========   mafsInRegion   ====================================
================================================================
mafsInRegion - Extract MAFS in a genomic region
usage:
    mafsInRegion regions.bed out.maf|outDir in.maf(s)
options:
    -outDir - output separate files named by bed name field to outDir
    -keepInitialGaps - keep alignment columns at the beginning and end of a block that are gapped in all species

================================================================
========   makeTableList   ====================================
================================================================
makeTableList - create/recreate tableList tables (cache of SHOW TABLES)
usage:
   makeTableList [assemblies]
options:
   -all               recreate tableList for all assemblies
================================================================
========   nibFrag   ====================================
================================================================
nibFrag - Extract part of a nib file as .fa (all bases/gaps lower case by default)
usage:
   nibFrag [options] file.nib start end strand out.fa
where strand is + (plus) or m (minus)
options:
   -masked - use lower case characters for bases meant to be masked out
   -hardMasked - use upper case for not masked-out and 'N' characters for masked-out bases
   -upper - use upper case characters for all bases
   -name=name Use given name after '&gt;' in output sequence
   -dbHeader=db Add full database info to the header, with or without -name option
   -tbaHeader=db Format header for compatibility with tba, takes database name as argument

================================================================
========   overlapSelect   ====================================
================================================================
overlapSelect [options] selectFile inFile outFile

Select records based on overlapping chromosome ranges.  The ranges are
specified in the selectFile, with each block specifying a range.
Records are copied from the inFile to outFile based on the selection
criteria.  Selection is based on blocks or exons rather than entire
range.

Options starting with -select* apply to selectFile and those starting
with -in* apply to inFile.

Options:
  -selectFmt=fmt - specify selectFile format:
          psl - PSL format (default for *.psl files).
          pslq - PSL format, using query instead of target
          genePred - genePred format (default for *.gp or
                     *.genePred files).
          bed - BED format (default for *.bed files).
                If BED doesn't have blocks, the bed range is used. 
          chain - chain file format (default from .chain files)
          chainq - chain file format, using query instead of target
  -selectCoordCols=spec - selectFile is tab-separated with coordinates
       as described by spec, which is one of:
            o chromCol - chrom in this column followed by start and end.
            o chromCol,startCol,endCol,strandCol,name - chrom, start, end, and
              strand in specified columns. Columns can be omitted from the end
              or left empty to not specify.
          NOTE: column numbers are zero-based
  -selectCds - Use only CDS in the selectFile
  -selectRange - Use entire range instead of blocks from records in
          the selectFile.
  -inFmt=fmt - specify inFile format, same values as -selectFmt.
  -inCoordCols=spec - inFile is tab-separated with coordinates specified by
      spec, in format described above.
  -inCds - Use only CDS in the inFile
  -inRange - Use entire range instead of blocks of records in the inFile.
  -nonOverlapping - select non-overlapping instead of overlapping records
  -strand - must be on the same strand to be considered overlapping
  -oppositeStrand - must be on the opposite strand to be considered overlapping
  -excludeSelf - don't compare records with the same coordinates and name.
      Warning: using only one of -inCds or -selectCds will result in different
      coordinates for the same record.
  -idMatch - only select overlapping records if they have the same id
  -aggregate - instead of computing overlap bases on individual select entries, 
      compute it based on the total number of inFile bases overlapped by selectFile
      records. -overlapSimilarity and -mergeOutput will not work with
      this option.
  -overlapThreshold=0.0 - minimum fraction of an inFile record that
      must be overlapped by a single select record to be considered
      overlapping.  Note that this is only coverage by a single select
      record, not total coverage.
  -overlapThresholdCeil=1.1 - select only inFile records with less than
      this amount of overlap with a single record, provided they are selected
      by other criteria.
  -overlapSimilarity=0.0 - minimum bidirectional overlap: the inFile and
      selectFile records must each be overlapped by the other by this
      fraction.  A value of 1.0 will select identical records (or identical
      CDS if both CDS options are specified).  Not currently supported with
      -aggregate.
  -overlapSimilarityCeil=1.1 - select only inFile records with less than this
      amount of similarity with a single record, provided they are selected by
      other criteria.
  -overlapBases=-1 - minimum number of bases of overlap, &lt; 0 disables.
  -statsOutput - output overlap statistics instead of selected records. 
      If no overlap criteria is specified, all overlapping entries are
      reported.  Otherwise only the pairs passing the criteria are
      reported. This results in a tab-separated file with the columns:
         inId selectId inOverlap selectOverlap overBases
      Where inOverlap is the fraction of the inFile record overlapped by
      the selectFile record and selectOverlap is the fraction of the
      select record overlapped by inFile records.  With -aggregate, output
      is:
         inId inOverlap inOverBases inBases
  -statsOutputAll - like -statsOutput, however output all inFile records,
      including those that are not overlapped.
  -statsOutputBoth - like -statsOutput, however output all selectFile and
      inFile records, including those that are not overlapped.
  -mergeOutput - output file will be a merge of the input file with the
      selectFile records that selected it.  The format is
         inRec&lt;tab&gt;selectRec.
      if multiple select records hit, inRec is repeated. This will increase
      the memory required. Not supported with -nonOverlapping or -aggregate.
  -idOutput - output a tab-separated file of pairs of
         inId selectId
      with -aggregate, only a single column of inId is written
  -dropped=file  - output rows that were dropped to this file.
  -verbose=n - verbose &gt; 1 prints some details

================================================================
========   paraFetch   ====================================
================================================================
paraFetch - try to fetch url with multiple connections
usage:
   paraFetch N R URL {outPath}
   where N is the number of connections to use
         R is the number of retries
   outPath is optional. If not specified, it will attempt to parse URL to discover output filename.
options:
   -newer  only download a file if it is newer than the version we already have.
   -progress  Show progress of download.

================================================================
========   paraSync   ====================================
================================================================
paraSync 1.0
paraSync - uses paraFetch to recursively mirror url to given path
usage:
   paraSync {options} N R URL outPath
   where N is the number of connections to use
         R is the number of retries
options:
   -A='ext1,ext2'  means accept only files with ext1 or ext2
   -newer  only download a file if it is newer than the version we already have.
   -progress  Show progress of download.

================================================================
========   pslCDnaFilter   ====================================
================================================================
pslCDnaFilter [options] inPsl outPsl

Filter cDNA alignments in psl format.  Filtering criteria are
comparative, selecting near best in genome alignments for each
given cDNA and non-comparative, based only on the quality of an
individual alignment.

WARNING: comparative filters requires that the input is sorted by
query name.  The command: 'sort -k 10,10' will do the trick.
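The sort step can be shown on a stand-in tab-separated file where, as in PSL, column 10 holds the query name (the rows here are placeholders, not real alignments):

```shell
# Two stand-in rows with the query name in column 10, deliberately out of order.
printf 'a\tb\tc\td\te\tf\tg\th\ti\tmrna2\n' >  aln.psl
printf 'a\tb\tc\td\te\tf\tg\th\ti\tmrna1\n' >> aln.psl
# Sort on field 10 only, as the comparative filters require.
sort -k 10,10 aln.psl > aln.sorted.psl
cut -f 10 aln.sorted.psl
```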

Each alignment is assigned a score that is based on identity and
weighted towards longer alignments and those with introns.  This
can do either global or local best-in-genome selection.  Local
near best in genome keeps fragments of an mRNA that align in
discontinuous locations from other fragments.  It is useful for
unfinished genomes.  Global near best in genome keeps alignments
based on overall score.

Options:
   -algoHelp - print message describing the filtering algorithm.

   -localNearBest=-1.0 - local near best in genome filtering,
     keeping alignments within this fraction of the top score for
    each aligned portion of the mRNA. A value of zero keeps only
    the best for each fragment. A value of -1.0 disables
    (default).

   -globalNearBest=-1.0 - global near best in genome filtering,
     keeping alignments within this fraction of the top score.  A
    value of zero keeps only the best alignment.  A value of -1.0
    disables (default).

   -ignoreNs - don't include Ns (repeat masked) while calculating the
    score and coverage. That is treat them as unaligned rather than
     mismatches.  Ns are still counted as mismatches when calculating
    the identity.

   -ignoreIntrons - don't favor apparent introns when scoring.

   -minId=0.0 - only keep alignments with at least this fraction
    identity.

   -minCover=0.0 - minimum fraction of query that must be
    aligned.  If -polyASizes is specified and the query is in
     the file, the poly-A is not included in coverage
    calculation.

   -decayMinCover  -  the minimum coverage is calculated
    per alignment from the query size using the formula:
       minCoverage = 1.0 - qSize / 250.0
    and minCoverage is bounded between 0.25 and 0.9.

   -minSpan=0.0 - keep only alignments whose target length are
    at least this fraction of the longest alignment passing the
    other filters.  This can be useful for removing possible
    retroposed genes.

   -minQSize=0 - drop queries shorter than this size

   -minAlnSize=0 - minimum number of aligned bases.  This includes
    repeats, but excludes poly-A/poly-T bases if available.

   -minNonRepSize=0 - Minimum number of matching bases that are not repeats.
    This does not include mismatches.
    Must use -repeats on BLAT if doing unmasked alignments.

   -maxRepMatch=1.0 - Maximum fraction of matching bases
    that are repeats.  Must use -repeats on BLAT if doing
    unmasked alignments.

   -maxAligns=-1 - maximum number of alignments for a given query. If
    exceeded, then alignments are sorted by score and only this number
    will be saved.  A value of -1 disables (default)

   -polyASizes=file - tab-separated file with information about
    poly-A tails and poly-T heads.  Format is output by
    faPolyASizes:

        id seqSize tailPolyASize headPolyTSize

   -usePolyTHead - if a poly-T head was detected and is longer
    than the poly-A tail, it is used when calculating coverage
    instead of the poly-A head.

   -bestOverlap - filter overlapping alignments, keeping the best of
    alignments that are similar.  This is designed to be used with
    overlapping, windowed alignments, where one alignment might be truncated.
     Does not discard ones with weird overlap unless -filterWeirdOverlapped
     is specified.

   -hapRegions=psl - PSL format alignments of each haplotype pseudo-chromosome
    to the corresponding reference chromosome region.  This is used to map
    alignments between regions.

   -dropped=psl - save psls that were dropped to this file.

   -weirdOverlapped=psl - output weirdly overlapping PSLs to
    this file.

   -filterWeirdOverlapped - Filter weirdly overlapped alignments, keeping
    the single highest scoring one or an arbitrary one if multiple with
    the same high score.

   -alignStats=file - output the per-alignment statistics to this file

   -uniqueMapped - keep only cDNAs that are uniquely aligned after all
    other filters have been applied.

   -noValidate - don't run pslCheck validation.

   -verbose=1 - 0: quiet
                1: output stats
                2: list problem alignments (weird or invalid)
                3: list dropped alignments and reason for dropping
                4: list kept psl and info
                5: info about all PSLs

   -hapRefMapped=psl - output PSLs of haplotype to reference chromosome
    cDNA alignments mappings (for debugging purposes).

   -hapRefCDnaAlns=psl - output PSLs of haplotype cDNA to reference cDNA
    alignments (for debugging purposes).

   -alnIdQNameMode - add internal assigned alignment numbers to cDNA names
    on output.  Useful for debugging, as they are included in the verbose
    tracing as [#1], etc.  Will make a mess of normal production usage.

   -blackList=file.txt - adds a list of accession ranges to a black list.
    Any accession on this list is dropped. Black list file is two columns
    where the first column is the beginning of the range, and the second
    column is the end of the range, inclusive.

The default options don't do any filtering. If no filtering
criteria are specified, all PSLs will be passed through, except
those that are internally inconsistent.

THE INPUT MUST BE SORTED BY QUERY for the comparative filters.

================================================================
========   pslPretty   ====================================
================================================================
pslPretty - Convert PSL to human readable output
usage:
   pslPretty in.psl target.lst query.lst pretty.out
options:
   -axt - save in something like Scott Schwartz's axt format
          Note gaps in both sequences are still allowed in the
          output, which not all axt readers will expect
   -dot=N Put out a dot every N records
   -long - Don't abbreviate long inserts
   -check=fileName - Output alignment checks to filename
It's a really good idea if the psl file is sorted by target
if it contains multiple targets.  Otherwise this will be
very, very slow.   The target and query lists can either be
fasta, 2bit or nib files, or a list of fasta, 2bit and/or nib files,
one per line.

================================================================
========   pslReps   ====================================
================================================================
pslReps - analyse repeats and generate genome wide best
alignments from a sorted set of local alignments
usage:
    pslReps in.psl out.psl out.psr
where in.psl is an alignment file generated by psLayout and
sorted by pslSort, out.psl is the best alignment output
and out.psr contains repeat info
options:
    -nohead don't add PSL header
    -ignoreSize Will not weigh in favor of larger alignments so much
    -noIntrons Will not penalize for not having introns when calculating
              size factor
    -singleHit  Takes single best hit, not splitting into parts
    -minCover=0.N minimum coverage to output.  Default is 0.
    -ignoreNs Ignore 'N's when calculating minCover.
    -minAli=0.N minimum alignment ratio
               default is 0.93
    -nearTop=0.N how much can deviate from top and be taken
               default is 0.01
    -minNearTopSize=N  Minimum size of alignment that is near top
               for alignment to be kept.  Default 30.
    -coverQSizes=file Tab-separated file with effective query sizes.
                     When used with -minCover, this allows polyAs
                     to be excluded from the coverage calculation

================================================================
========   pslSort   ====================================
================================================================
pslSort - merge and sort psCluster .psl output files
usage:
  pslSort dirs[1|2] outFile tempDir inDir(s)
This will sort all of the .psl files in the directories
inDirs in two stages - first into temporary files in tempDir
and second into outFile.  The device on tempDir needs to have
enough space (typically 15-20 gigabytes if processing whole genome)
  pslSort g2g[1|2] outFile tempDir inDir(s)
This will sort a genome to genome alignment, reflecting the
alignments across the diagonal.

Adding 1 or 2 after the dirs or g2g will limit the program to
only the first or second pass, respectively, of the sort

Options:
   -nohead - do not write psl header
   -verbose=N Set verbosity level, higher for more output. Default 1

================================================================
========   sizeof   ====================================
================================================================
     type   bytes    bits
     char	1	8
unsigned char	1	8
short int	2	16
u short int	2	16
      int	4	32
 unsigned	4	32
     long	8	64
unsigned long	8	64
long long	8	64
u long long	8	64
   size_t	8	64
   void *	8	64
    float	4	32
   double	8	64
long double	16	128
LITTLE ENDIAN machine detected
byte order: normal order: 0x12345678 in memory: 0x78563412
================================================================
========   stringify   ====================================
================================================================
stringify - Convert file to C strings
usage:
   stringify [options] in.txt
A stringified version of in.txt  will be printed to standard output.

Options:
  -var=varname - create a variable with the specified name containing
                 the string.
  -static - create the variable as a string array.

================================================================
========   textHistogram   ====================================
================================================================
textHistogram - Make a histogram in ascii
usage:
   textHistogram [options] inFile
Where inFile contains one number per line.
  options:
   -binSize=N - Size of bins, default 1
   -maxBinCount=N - Maximum # of bins, default 25
   -minVal=N - Minimum value to put in histogram, default 0
   -log - Do log transformation before plotting
   -noStar - Don't draw asterisks
   -col=N - Which column to use. Default 1
   -aveCol=N - A second column to average over. The averages
             will be output in place of counts of primary column.
   -real - Data input are real values (default is integer)
   -autoScale=N - autoscale to N # of bins
   -probValues - show prob-Values (density and cum.distr.) (sets -noStar too)
   -freq - show frequencies instead of counts
   -skip=N - skip N lines before starting, default 0

================================================================
========   twoBitInfo   ====================================
================================================================
twoBitInfo - get information about sequences in a .2bit file
usage:
   twoBitInfo input.2bit output.tab
options:
   -nBed   instead of seq sizes, output BED records that define 
           areas with N's in sequence
   -noNs   outputs the length of each sequence, but does not count Ns 
Output file has the columns:
   seqName size

The 2bit file may be specified in the form path:seq or path:seq1,seq2,seqN...
so that information is returned only on the requested sequence(s).
If the form path:seq:start-end is used, start-end is ignored.

================================================================
========   twoBitToFa   ====================================
================================================================
twoBitToFa - Convert all or part of .2bit file to fasta
usage:
   twoBitToFa input.2bit output.fa
options:
   -seq=name - restrict this to just one sequence
   -start=X  - start at given position in sequence (zero-based)
   -end=X - end at given position in sequence (non-inclusive)
   -seqList=file - file containing list of the desired sequence names 
                    in the format seqSpec[:start-end], e.g. chr1 or chr1:0-189
                    where coordinates are half-open zero-based, i.e. [start,end)
   -noMask - convert sequence to all upper case
   -bpt=index.bpt - use bpt index instead of built in one
   -bed=input.bed - grab sequences specified by input.bed. Will exclude introns

Sequence and range may also be specified as part of the input
file name using the syntax:
      /path/input.2bit:name
   or
      /path/input.2bit:name:start-end

================================================================
========   validateFiles   ====================================
================================================================
validateFiles - Validate format of different track input files
                Program exits with non-zero status if any errors detected
                  otherwise exits with zero status
                Use filename 'stdin' to read from stdin
                Files can be in .gz, .bz2, .zip, .Z format and are 
                  automatically decompressed
                Multiple input files of the same type can be listed
                Error messages are written to stderr
                OK or failing file lines can be optionally written to stdout
usage:
   validateFiles -type=FILE_TYPE file1 [file2 [...]]
options:
   -type=(a value from the list below)
         tagAlign|pairedTagAlign|broadPeak|narrowPeak|gappedPeak|bedGraph
                   : see http://genomewiki.cse.ucsc.edu/EncodeDCC/index.php/File_Formats
         fasta     : Fasta files (only one line of sequence, and no quality scores)
         fastq     : Fasta with quality scores (see http://maq.sourceforge.net/fastq.shtml)
         csfasta   : Colorspace fasta (implies -colorSpace) (see link below)
         csqual    : Colorspace quality (see link below)
                     (see http://marketing.appliedbiosystems.com/mk/submit/SOLID_KNOWLEDGE_RD?_JS=T&amp;rd=dm)
         BAM       : Binary Alignment/Map
                     (see http://samtools.sourceforge.net/SAM1.pdf)
         bigWig    : Big Wig
                     (see http://genome.ucsc.edu/goldenPath/help/bigWig.html)
         bigBedN[+[P]]: 
                     (see http://genome.ucsc.edu/goldenPath/help/bigBed.html)
         bedN[+[P]] : 
                     (see http://genome.ucsc.edu/FAQ/FAQformat.html#format1)
                         N is between 3 and 15, 
                         optional (+) if extra "bedPlus" fields, 
                         optional P specifies the number of extra fields. Not required, but preferred.
                      Examples: -type=bed6 or -type=bed6+ or -type=bed6+3 

   -as=fields.as                If you have extra "bedPlus" fields, it's great to put a definition
                                  of each field in a row in AutoSql format here. Applies to bed-related types.
   -tab - If set, expect fields to be tab separated, normally
           expects white space separator. Applies to bed-related types.
   -chromDb=db                  Specify DB containing chromInfo table to validate chrom names
                                  and sizes
   -chromInfo=file.txt          Specify chromInfo file to validate chrom names and sizes
   -colorSpace                  Sequences include colorspace values [0-3] (can be used 
                                  with formats such as tagAlign and pairedTagAlign)
   -genome=path/to/hg18.2bit    Validate tagAlign or pairedTagAlign sequences match genome
                                  in .2bit file
   -mismatches=n                Maximum number of mismatches in sequence (or read pair) if 
                                  validating tagAlign or pairedTagAlign files
   -mismatchTotalQuality=n      Maximum total quality score at mismatching positions
   -matchFirst=n                only check the first N bases of the sequence
   -mmPerPair                   Check that either read of the pair doesn't exceed the
                                   mismatch count when validating pairedTagAlign files
                                   (default is the total for the pair)
   -mmCheckOneInN=n             Check mismatches in only one in 'n' lines (default=1, all)
   -nMatch                      N's do not count as a mismatch
   -privateData                 Private data so empty sequence is tolerated
   -isSorted                    Input is sorted by chrom, only affects types tagAlign and pairedTagAlign
   -allowOther                  allow chromosomes that aren't native in BAMs
   -allowBadLength              allow chromosomes that have the wrong length in BAM
   -complementMinus             complement the query sequence on the minus strand (for testing BAM)
   -showBadAlign                show non-compliant alignments
   -bamPercent=N.N              percentage of BAM alignments that must be compliant

   -doReport                    output report in filename.report
   -version                     Print version

================================================================
========   wigCorrelate   ====================================
================================================================
wigCorrelate - Produce a table that correlates all pairs of wigs.
usage:
   wigCorrelate one.wig two.wig ... n.wig
This works on bigWig as well as wig files.
The output is to stdout
options:
   -clampMax=N - values larger than this are clipped to this value

================================================================
========   wigToBigWig   ====================================
================================================================
wigToBigWig v 4 - Convert ascii format wig file (in fixedStep, variableStep
or bedGraph format) to binary big wig format.
usage:
   wigToBigWig in.wig chrom.sizes out.bw
Where in.wig is in one of the ascii wiggle formats, but not including track lines
and chrom.sizes is two column: &lt;chromosome name&gt; &lt;size in bases&gt;
and out.bw is the output indexed big wig file.
Use the fetchChromSizes script to obtain the actual chrom.sizes information
from UCSC; please do not make up chrom.sizes entries on your own.
options:
   -blockSize=N - Number of items to bundle in r-tree.  Default 256
   -itemsPerSlot=N - Number of data points bundled at lowest level. Default 1024
   -clip - If set just issue warning messages rather than dying if wig
                  file contains items off end of chromosome.
   -unc - If set, do not use compression.
================================================================</pre>
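Several of the tools above (twoBitToFa's -start/-end, the seqList format, the bedN types in validateFiles) use half-open zero-based coordinates, i.e. [start,end). This is the same convention as Python slicing, which makes ranges easy to sanity-check; a small illustrative sketch (not part of the UCSC tool set):

```python
# Half-open zero-based coordinates, as in "chr1:0-189": the slice
# seq[start:end] includes position `start` and excludes position `end`,
# so the length of the extracted region is simply end - start.
seq = "ACGTACGTAC"
start, end = 2, 6
fragment = seq[start:end]
assert fragment == "GTAC"
assert len(fragment) == end - start
```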
]]></content:encoded>
			<wfw:commentRss>https://www.biofacebook.com/?feed=rss2&#038;p=269</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Installing glimmer3 fails on Ubuntu 9.10, and how to fix it</title>
		<link>https://www.biofacebook.com/?p=256</link>
		<comments>https://www.biofacebook.com/?p=256#comments</comments>
		<pubDate>Tue, 12 Jun 2012 01:51:35 +0000</pubDate>
		<dc:creator><![CDATA[szypanther]]></dc:creator>
				<category><![CDATA[服务器管理]]></category>
		<category><![CDATA[生物信息]]></category>
		<category><![CDATA[bioinformatics]]></category>
		<category><![CDATA[software]]></category>

		<guid isPermaLink="false">http://www.biofacebook.com/?p=256</guid>
		<description><![CDATA[<p>(1) Before installing, first change line 26 in src/Common from #include &#60;string&#62; to #include &#60;cstring&#62;, then run make; however, the following errors still occur * Make Target is all ##### Making Directory /usr/local/glimmer3.02/src/Common all ##### make[1]: Entering directory `/usr/local/glimmer3.02/src/Common&#8217; @@@@@@@@@@@@@@@@@@@ delcher.cc @@@@@@@@@@@@@@@@@@@@@ @@@@@@@@@@@@@@@@@@@ fasta.cc @@@@@@@@@@@@@@@@@@@@@ @@@@@@@@@@@@@@@@@@@ gene.cc @@@@@@@@@@@@@@@@@@@@@ gene.cc: In member function ‘void PWM_t::Print(FILE*)’: gene.cc:263: warning: deprecated conversion from string constant to ‘char*’ gene.cc: In function ‘int Char_Sub(char)’: gene.cc:448: error: invalid conversion from ‘const char*’ [...]]]></description>
				<content:encoded><![CDATA[<p><span style="color: #ff0000; font-size: large;"><strong>(1) Before installing, first change line 26 in src/Common from #include &lt;string&gt; to #include &lt;cstring&gt;, then run make; however, the following errors still occur</strong></span><br />
* Make Target is  all<br />
#####    Making Directory  /usr/local/glimmer3.02/src/Common   all  #####<br />
make[1]: Entering directory `/usr/local/glimmer3.02/src/Common&#8217;<br />
@@@@@@@@@@@@@@@@@@@  delcher.cc @@@@@@@@@@@@@@@@@@@@@<br />
@@@@@@@@@@@@@@@@@@@  fasta.cc @@@@@@@@@@@@@@@@@@@@@<br />
@@@@@@@@@@@@@@@@@@@  gene.cc @@@@@@@@@@@@@@@@@@@@@<br />
gene.cc: In member function ‘void PWM_t::Print(FILE*)’:<br />
gene.cc:263: warning: deprecated conversion from string constant to ‘char*’<br />
gene.cc: In function ‘int Char_Sub(char)’:<br />
gene.cc:448: error: invalid conversion from ‘const char*’ to ‘char*’<br />
make[1]: *** [gene.o] Error 1<br />
make[1]: Leaving directory `/usr/local/glimmer3.02/src/Common&#8217;<br />
#####    Making Directory  /usr/local/glimmer3.02/src/ICM   all  #####<br />
make[1]: Entering directory `/usr/local/glimmer3.02/src/ICM&#8217;<br />
@@@@@@@@@@@@@@@@@@@  icm.cc @@@@@@@@@@@@@@@@@@@@@<br />
icm.cc: In function ‘int Subscript(char)’:<br />
icm.cc:1986: error: invalid conversion from ‘const char*’ to ‘char*’<br />
make[1]: *** [icm.o] Error 1<br />
make[1]: Leaving directory `/usr/local/glimmer3.02/src/ICM&#8217;<br />
#####    Making Directory  /usr/local/glimmer3.02/src/Glimmer   all  #####<br />
make[1]: Entering directory `/usr/local/glimmer3.02/src/Glimmer&#8217;<br />
@@@@@@@@@@@@@@@@@@@  anomaly.cc @@@@@@@@@@@@@@@@@@@@@<br />
anomaly.cc: In function ‘int main(int, char**)’:<br />
anomaly.cc:82: warning: suggest parentheses around ‘&amp;&amp;’ within ‘||’<br />
@@@@@@@@@@@@@@@@@@@  glimmer3.cc @@@@@@@@@@@@@@@@@@@@@<br />
@@@@@@@@@@@@@@@@@@@  long-orfs.cc @@@@@@@@@@@@@@@@@@@@@<br />
@@@@@@@@@@@@@@@@@@@  test.cc @@@@@@@@@@@@@@@@@@@@@<br />
make[1]: *** No rule to make target `libGLMcommon.a&#8217;, needed by `anomaly&#8217;.  Stop.<br />
make[1]: Leaving directory `/usr/local/glimmer3.02/src/Glimmer&#8217;<br />
#####    Making Directory  /usr/local/glimmer3.02/src/Util   all  #####<br />
make[1]: Entering directory `/usr/local/glimmer3.02/src/Util&#8217;<br />
@@@@@@@@@@@@@@@@@@@  entropy-profile.cc @@@@@@@@@@@@@@@@@@@@@<br />
@@@@@@@@@@@@@@@@@@@  entropy-score.cc @@@@@@@@@@@@@@@@@@@@@<br />
@@@@@@@@@@@@@@@@@@@  extract.cc @@@@@@@@@@@@@@@@@@@@@<br />
@@@@@@@@@@@@@@@@@@@  multi-extract.cc @@@@@@@@@@@@@@@@@@@@@<br />
@@@@@@@@@@@@@@@@@@@  start-codon-distrib.cc @@@@@@@@@@@@@@@@@@@@@<br />
@@@@@@@@@@@@@@@@@@@  uncovered.cc @@@@@@@@@@@@@@@@@@@@@<br />
@@@@@@@@@@@@@@@@@@@  window-acgt.cc @@@@@@@@@@@@@@@@@@@@@<br />
make[1]: *** No rule to make target `libGLMcommon.a&#8217;, needed by `entropy-profile&#8217;.  Stop.<br />
make[1]: Leaving directory `/usr/local/glimmer3.02/src/Util&#8217;<br />
<span style="color: #ff0000; font-size: large;"><br />
(2) Don&#8217;t worry: following the error messages, open the two failing files, gene.cc and icm.cc, and add const before char at the reported locations; then run make again and everything is OK.</span><br />
* Make Target is  all<br />
#####    Making Directory  /usr/local/glimmer3.02/src/Common   all  #####<br />
make[1]: Entering directory `/usr/local/glimmer3.02/src/Common&#8217;<br />
make[1]: Leaving directory `/usr/local/glimmer3.02/src/Common&#8217;<br />
make[1]: Entering directory `/usr/local/glimmer3.02/src/Common&#8217;<br />
@@@@@@@@@@@@@@@@@@@  delcher.cc @@@@@@@@@@@@@@@@@@@@@<br />
@@@@@@@@@@@@@@@@@@@  fasta.cc @@@@@@@@@@@@@@@@@@@@@<br />
@@@@@@@@@@@@@@@@@@@  gene.cc @@@@@@@@@@@@@@@@@@@@@<br />
###################  libGLMcommon.a #####################<br />
ar: creating /usr/local/glimmer3.02/lib/libGLMcommon.a<br />
a &#8211; delcher.o<br />
a &#8211; fasta.o<br />
a &#8211; gene.o<br />
make[1]: Leaving directory `/usr/local/glimmer3.02/src/Common&#8217;<br />
#####    Making Directory  /usr/local/glimmer3.02/src/ICM   all  #####<br />
make[1]: Entering directory `/usr/local/glimmer3.02/src/ICM&#8217;<br />
make[1]: Leaving directory `/usr/local/glimmer3.02/src/ICM&#8217;<br />
make[1]: Entering directory `/usr/local/glimmer3.02/src/ICM&#8217;<br />
@@@@@@@@@@@@@@@@@@@  icm.cc @@@@@@@@@@@@@@@@@@@@@<br />
@@@@@@@@@@@@@@@@@@@  build-icm.cc @@@@@@@@@@@@@@@@@@@@@<br />
@@@@@@@@@@@@@@@@@@@  build-fixed.cc @@@@@@@@@@@@@@@@@@@@@<br />
@@@@@@@@@@@@@@@@@@@  score-fixed.cc @@@@@@@@@@@@@@@@@@@@@<br />
###################  libGLMicm.a #####################<br />
ar: creating /usr/local/glimmer3.02/lib/libGLMicm.a<br />
a &#8211; icm.o<br />
a &#8211; build-icm.o<br />
a &#8211; build-fixed.o<br />
a &#8211; score-fixed.o<br />
++++++++++++++++++++  build-icm ++++++++++++++++++++++<br />
++++++++++++++++++++  build-fixed ++++++++++++++++++++++<br />
++++++++++++++++++++  score-fixed ++++++++++++++++++++++<br />
make[1]: Leaving directory `/usr/local/glimmer3.02/src/ICM&#8217;<br />
#####    Making Directory  /usr/local/glimmer3.02/src/Glimmer   all  #####<br />
make[1]: Entering directory `/usr/local/glimmer3.02/src/Glimmer&#8217;<br />
@@@@@@@@@@@@@@@@@@@  anomaly.cc @@@@@@@@@@@@@@@@@@@@@<br />
anomaly.cc: In function ‘int main(int, char**)’:<br />
anomaly.cc:82: warning: suggest parentheses around ‘&amp;&amp;’ within ‘||’<br />
@@@@@@@@@@@@@@@@@@@  glimmer3.cc @@@@@@@@@@@@@@@@@@@@@<br />
@@@@@@@@@@@@@@@@@@@  long-orfs.cc @@@@@@@@@@@@@@@@@@@@@<br />
@@@@@@@@@@@@@@@@@@@  test.cc @@@@@@@@@@@@@@@@@@@@@<br />
++++++++++++++++++++  anomaly ++++++++++++++++++++++<br />
++++++++++++++++++++  glimmer3 ++++++++++++++++++++++<br />
++++++++++++++++++++  long-orfs ++++++++++++++++++++++<br />
++++++++++++++++++++  test ++++++++++++++++++++++<br />
make[1]: Leaving directory `/usr/local/glimmer3.02/src/Glimmer&#8217;<br />
#####    Making Directory  /usr/local/glimmer3.02/src/Util   all  #####<br />
make[1]: Entering directory `/usr/local/glimmer3.02/src/Util&#8217;<br />
@@@@@@@@@@@@@@@@@@@  entropy-profile.cc @@@@@@@@@@@@@@@@@@@@@<br />
@@@@@@@@@@@@@@@@@@@  entropy-score.cc @@@@@@@@@@@@@@@@@@@@@<br />
@@@@@@@@@@@@@@@@@@@  extract.cc @@@@@@@@@@@@@@@@@@@@@<br />
@@@@@@@@@@@@@@@@@@@  multi-extract.cc @@@@@@@@@@@@@@@@@@@@@<br />
@@@@@@@@@@@@@@@@@@@  start-codon-distrib.cc @@@@@@@@@@@@@@@@@@@@@<br />
@@@@@@@@@@@@@@@@@@@  uncovered.cc @@@@@@@@@@@@@@@@@@@@@<br />
@@@@@@@@@@@@@@@@@@@  window-acgt.cc @@@@@@@@@@@@@@@@@@@@@<br />
++++++++++++++++++++  entropy-profile ++++++++++++++++++++++<br />
++++++++++++++++++++  entropy-score ++++++++++++++++++++++<br />
++++++++++++++++++++  extract ++++++++++++++++++++++<br />
++++++++++++++++++++  multi-extract ++++++++++++++++++++++<br />
++++++++++++++++++++  start-codon-distrib ++++++++++++++++++++++<br />
++++++++++++++++++++  uncovered ++++++++++++++++++++++<br />
++++++++++++++++++++  window-acgt ++++++++++++++++++++++<br />
make[1]: Leaving directory `/usr/local/glimmer3.02/src/Util&#8217;<br />
* Make Target is  all<br />
#####    Making Directory  /usr/local/glimmer3.02/src/Common   all  #####<br />
make[1]: Entering directory `/usr/local/glimmer3.02/src/Common&#8217;<br />
make[1]: Nothing to be done for `all&#8217;.<br />
make[1]: Leaving directory `/usr/local/glimmer3.02/src/Common&#8217;<br />
#####    Making Directory  /usr/local/glimmer3.02/src/ICM   all  #####<br />
make[1]: Entering directory `/usr/local/glimmer3.02/src/ICM&#8217;<br />
make[1]: Nothing to be done for `all&#8217;.<br />
make[1]: Leaving directory `/usr/local/glimmer3.02/src/ICM&#8217;<br />
#####    Making Directory  /usr/local/glimmer3.02/src/Glimmer   all  #####<br />
make[1]: Entering directory `/usr/local/glimmer3.02/src/Glimmer&#8217;<br />
make[1]: Nothing to be done for `all&#8217;.<br />
make[1]: Leaving directory `/usr/local/glimmer3.02/src/Glimmer&#8217;<br />
#####    Making Directory  /usr/local/glimmer3.02/src/Util   all  #####<br />
make[1]: Entering directory `/usr/local/glimmer3.02/src/Util&#8217;<br />
make[1]: Nothing to be done for `all&#8217;.<br />
make[1]: Leaving directory `/usr/local/glimmer3.02/src/Util&#8217;<br />
<span style="color: #ff0000;"><strong><span style="font-size: large;">(3) You can safely ignore &#8220;make[1]: Nothing to be done for `all&#8217;.&#8221;!!</span></strong></span></p>
]]></content:encoded>
			<wfw:commentRss>https://www.biofacebook.com/?feed=rss2&#038;p=256</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Bpipe A tool for running and managing bioinformatics pipelines</title>
		<link>https://www.biofacebook.com/?p=178</link>
		<comments>https://www.biofacebook.com/?p=178#comments</comments>
		<pubDate>Thu, 17 May 2012 02:39:15 +0000</pubDate>
		<dc:creator><![CDATA[szypanther]]></dc:creator>
				<category><![CDATA[生物信息]]></category>
		<category><![CDATA[bioinformatics]]></category>
		<category><![CDATA[software]]></category>

		<guid isPermaLink="false">http://www.biofacebook.com/?p=178</guid>
		<description><![CDATA[Starting from a Shell Script <p>In this tutorial we will develop a Bpipe pipeline script for a realistic (but simplified) analysis pipeline used for variant calling on NGS data. In this pipeline the following stages are executed:</p> sequence alignment using bwa sorting and indexing output files using samtools PCR duplicate removal using Picard calling variants [...]]]></description>
				<content:encoded><![CDATA[<h2>Starting from a Shell Script</h2>
<p>In this tutorial we will develop a Bpipe pipeline script for a realistic (but simplified) analysis pipeline used for variant calling on NGS data. In this pipeline the following stages are executed:</p>
<ul>
<li>sequence alignment using <a href="http://sourceforge.net/projects/bio-bwa/files/" rel="nofollow">bwa</a></li>
<li>sorting and indexing output files using <a href="http://samtools.sourceforge.net/" rel="nofollow">samtools</a></li>
<li>PCR duplicate removal using <a href="http://sourceforge.net/projects/picard/files/" rel="nofollow">Picard</a></li>
<li>calling variants using <a href="http://samtools.sourceforge.net/" rel="nofollow">samtools</a></li>
</ul>
<p>For the pipeline to work you will need to have the above tools installed, or you can just follow the tutorial without running the examples. We assume that the Picard files are in /usr/local/picard-tools/, while the rest of the tools are assumed to be accessible via the PATH variable. We assume that the reference sequence to align to is in the local directory, named reference.fa, and has been indexed using the <tt>bwa index</tt> command, and that the reads to align are also in the local directory, named s_1.txt.</p>
<p>To show how we convert the pipeline to a Bpipe script we will start with a bash script that represents the pipeline above:</p>
<pre>#!/bin/bash
bwa aln -I -t 8 reference.fa s_1.txt &gt; out.sai
bwa samse reference.fa out.sai s_1.txt &gt; out.sam 

samtools view -bSu out.sam  | samtools sort -  out.sorted

java -Xmx1g -jar /usr/local/picard-tools/MarkDuplicates.jar \
                            MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1000\
                            METRICS_FILE=out.metrics \
                            REMOVE_DUPLICATES=true \
                            ASSUME_SORTED=true  \
                            VALIDATION_STRINGENCY=LENIENT \
                            INPUT=out.sorted.bam \
                            OUTPUT=out.dedupe.bam 

samtools index out.dedupe.bam 

samtools mpileup -uf reference.fa out.dedupe.bam | bcftools view -bvcg - &gt; out.bcf</pre>
<h3><a name="Step_1_-_Convert_Commands_to_Bpipe_Stages"></a>Step 1 &#8211; Convert Commands to Bpipe Stages</h3>
<p>To start out, all we need to do is decide which commands belong together logically and declare Bpipe stages for them. Inside each stage we will turn the original shell commands into Bpipe &#8220;exec&#8221; statements. Here is how it looks:</p>
<pre>align = {
    exec "bwa aln -I -t 8 reference.fa s_1.txt &gt; out.sai"
    exec "bwa samse reference.fa out.sai s_1.txt &gt; out.sam"
}

sort = {
    exec "samtools view -bSu out.sam  | samtools sort -  out.sorted"
}

dedupe = {
    exec """
      java -Xmx1g -jar /usr/local/picard-tools/MarkDuplicates.jar
                            MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1000
                            METRICS_FILE=out.metrics
                            REMOVE_DUPLICATES=true
                            ASSUME_SORTED=true  
                            VALIDATION_STRINGENCY=LENIENT
                            INPUT=out.sorted.bam
                            OUTPUT=out.dedupe.bam
    """
}

index = {
    exec "samtools index out.dedupe.bam"
}

call_variants = {
    exec "samtools mpileup -uf reference.fa out.dedupe.bam | bcftools view -bvcg - &gt; out.bcf"
}</pre>
<p><em>Note: in the original bash script the multiline command for running <tt>MarkDuplicates</tt> needed to have backslashes at the end of each line, however because Bpipe understands multiline commands natively we do not need to include those in the Bpipe version</em></p>
<h3><a name="Step_2_-_Add_Pipeline_Definition_and_Run_Pipeline"></a>Step 2 &#8211; Add Pipeline Definition and Run Pipeline</h3>
<p>In Step 1 we defined our pipeline stages but we did not define how they link together. The following lines added at the bottom of our file define our pipeline:</p>
<pre>Bpipe.run {
    align + sort + dedupe + index + call_variants
}</pre>
<p>So far we have not done very much, but even if we go no further we already have a functional Bpipe script with many advantages over the old shell script. If you save it as <tt>pipeline.txt</tt> you can run it with Bpipe:</p>
<pre>bpipe run pipeline.txt</pre>
<p>If the pipeline works you will get some output like so:</p>
<pre>====================================================================================================
|                                 Starting Pipeline at 2011-10-06                                  |
====================================================================================================

=========================================== Stage align ============================================
[bwa_aln] 17bp reads: max_diff = 2
[bwa_aln] 38bp reads: max_diff = 3
...</pre>
<p>Although it looks like the alignment is running in the foreground, Bpipe actually started it in the background: you can hit Ctrl-C at this point and Bpipe will give you the option to terminate the job or leave it running:</p>
<pre>Pipeline job running as process 33202.  Terminate? (y/n):</pre>
<p>If you answer no, the job keeps running even if you log out. You can type <tt>bpipe jobs</tt> to confirm it is still running, or <tt>bpipe log</tt> to resume watching its output.</p>
<h3><a name="Step_3_-_Define_Inputs_and_Outputs_and_add_Variables"></a>Step 3 &#8211; Define Inputs and Outputs and add Variables</h3>
<p>So far our pipeline runs, but we aren&#8217;t getting all the benefits of Bpipe because we have not defined our inputs and outputs in terms of Bpipe&#8217;s <em>input</em> and <em>output</em> variables, which let Bpipe add many useful features. To add them we literally replace the input and output file names with $input and $output. We also take the opportunity to make the script a little more general by extracting the locations of the reference sequence and Picard Tools into variables.</p>
<p>Our pipeline now looks like this:</p>
<pre>REFERENCE="reference.fa"
PICARD_HOME="/usr/local/picard-tools/"

align = {
    exec "bwa aln -I -t 8 $REFERENCE $input &gt; ${input}.sai"
    exec "bwa samse $REFERENCE ${input}.sai $input &gt; $output"
}

sort = {
    exec "samtools view -bSu $input  | samtools sort - $output"
}

dedupe = {
    exec """
      java -Xmx1g -jar $PICARD_HOME/MarkDuplicates.jar          
                            MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1000
                            METRICS_FILE=out.metrics
                            REMOVE_DUPLICATES=true
                            ASSUME_SORTED=true  
                            VALIDATION_STRINGENCY=LENIENT
                            INPUT=$input
                            OUTPUT=$output
    """
}

index = {
    exec "samtools index $input"
}

call_variants = {
    exec "samtools mpileup -uf $REFERENCE $input | bcftools view -bvcg - &gt; $output"
}

Bpipe.run {
    align + sort + dedupe + index + call_variants
}</pre>
<p>Our pipeline runs as before, but now Bpipe manages the names of the input and output files. This important property gives us much more flexibility in how the pipeline works. For example, if you decided to remove the dedupe stage you could simply delete it from your pipeline definition &#8211; and your pipeline would still work!</p>
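<p><em>For instance, dropping the dedupe stage is a one-line change to the pipeline definition (a sketch; the stage definitions themselves stay untouched):</em></p>
<pre>Bpipe.run {
    align + sort + index + call_variants
}</pre>
<p>Because every stage refers only to $input and $output rather than hard-coded file names, Bpipe simply wires the output of <em>sort</em> straight into <em>index</em>.</p>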
<p>Since the first pipeline stage now expects an input, you must provide one when you run bpipe, like so:</p>
<pre>bpipe run pipeline.txt s_1.txt</pre>
<p><em>Note: in the first command executed, the $input variable is used in the form ${input}.sai rather than $input.sai. This is because in Groovy &#8220;.&#8221; is a special character (an operator) that accesses a property of a variable. Just as in Bash scripting, the solution is to surround the variable name with curly braces to distinguish it from the neighboring characters.</em></p>
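<p><em>A minimal illustration of why the braces matter, in plain Groovy string interpolation (not Bpipe-specific):</em></p>
<pre>def input = "s_1.txt"
// "$input.sai" is parsed as ${input.sai} &#8211; Groovy looks for a
// property named "sai" on the string, which typically fails with
// a MissingPropertyException when the string is evaluated
println "${input}.sai"   // braces delimit the variable: s_1.txt.sai</pre>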
<h3><a name="Step_4_-_Name_Outputs"></a>Step 4 &#8211; Name Outputs</h3>
<p>Our script has one deficiency: the names of the output files are not very satisfactory because they lack the conventional file extensions. For example, the SAM file will be called <tt>s_1.txt.align</tt>. This happens because we did not give Bpipe any information about what kind of file comes out of the <em>align</em> stage. To make things work more naturally we can give Bpipe some hints about how to name things, using the <a href="http://code.google.com/p/bpipe/wiki/Filter">filter</a> and <a href="http://code.google.com/p/bpipe/wiki/Transform">transform</a> keywords to annotate the script.</p>
<pre>REFERENCE="reference.fa"
PICARD_HOME="/usr/local/picard-tools/"

@Transform("sam")
align = {
      exec "bwa aln -I -t 8 $REFERENCE $input &gt; ${input}.sai"
      exec "bwa samse $REFERENCE ${input}.sai $input &gt; $output"
}

@Transform("bam")
sort = {
        exec "samtools view -bSu $input  | samtools sort - $output"
}

@Filter("dedupe")
dedupe = {
        exec """
            java -Xmx1g -jar $PICARD_HOME/MarkDuplicates.jar
                            MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1000
                            METRICS_FILE=out.metrics
                            REMOVE_DUPLICATES=true
                            ASSUME_SORTED=true  
                            VALIDATION_STRINGENCY=LENIENT
                            INPUT=$input
                            OUTPUT=$output
        """
}

@Transform("bai")
index = {
        exec "samtools index $input"
        return input
}

@Transform("bcf")
call_variants = {
        exec "samtools mpileup -uf $REFERENCE $input | bcftools view -bvcg - &gt; $output"
}

Bpipe.run {
    align + sort + dedupe + index + call_variants
}</pre>
<p>Here we have added <em>annotations</em> that tell Bpipe what kind of operation happens in each stage. <em>Transform</em> operations produce a new type of file from the input: they change the file extension to a new one. <em>Filter</em> operations modify a file without changing its type: they keep the file extension the same but add a component to the body of the name. Notice that we are not dictating the <em>full</em> name of the file to Bpipe; we are just helping it understand how the form of the file name should change. This means our pipeline stays flexible while producing sensible, recognizable file names.</p>
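<p><em>To make the effect of the annotations concrete, here is roughly how the file names would evolve for an input called <tt>s_1.txt</tt> (a sketch based on the transform/filter rules described above; exact names can vary by Bpipe version):</em></p>
<pre>s_1.txt              # input reads
  -&gt; s_1.sam         # align:  @Transform("sam") replaces the extension
  -&gt; s_1.bam         # sort:   @Transform("bam")
  -&gt; s_1.dedupe.bam  # dedupe: @Filter("dedupe") keeps the .bam extension
  -&gt; s_1.dedupe.bcf  # call_variants: @Transform("bcf")</pre>
<p>The <em>index</em> stage forwards its input unchanged (the <tt>return input</tt> line), so <em>call_variants</em> still receives the deduplicated BAM file.</p>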
]]></content:encoded>
			<wfw:commentRss>https://www.biofacebook.com/?feed=rss2&#038;p=178</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
