Tutorial: Piping with samtools, bwa and bedtools

In this tutorial I hope to introduce some of the concepts for using unix piping. Piping is a very useful feature to avoid creation of intermediate use once files.

Lets begin with a typical command to do paired end mapping with bwa:

#-t 4 is for using 4 threads/cores bwa aln -t 4 ./hg19.fasta ./s1_1.fastq > ./s1_1.sai bwa aln -t 4 ./hg19.fasta ./s1_2.fastq > ./s1_2.sai bwa sampe ./hg19.fasta ./s1_1.sai ./s1_2.sai ./s1_1.fastq ./s1_2.fastq > s1.sam

Supposed we wish to now convert to bam, sort, remove duplicates, and lastly create a bed file.

samtools view -Shu s1.sam > s1.bam samtools sort s1.bam s1_sorted samtools rmdup -s s1_sorted.bam s1_sorted_nodup.bam bamToBed -i s1_sorted_nodup.bam > s1_sorted_nodup.bed

This workflow above creates many files that are only used once (such as s1.bam) and we can use the linux pipe utility to reduce the number intermediate files created. The pipe function is the character | and what it does is take the output from one command and sets it as input for next command (assuming next command knows how to deal with this input).

For example, when you type in head myfile.txt the command ‘head’ will read first 10 lines of myfile.txt and output it (by default) into the display (stdout) so you will see the first 10 lines printed on your screen. However, if you were to do something like this:

head myfile.txt | wc -l

you will instead get 10 printed out on your screen. What the pipe command did was take the output of head command and used it as input for the wordcount (wc) command. wc -l command will count the number of lines passed in (which is 10 since head by default prints the first 10 lines). An equivalent command would be

head myfile.txt > myfile_top10_lines.txt wc -l myfile_top10_lines.txt

By using the pipe command we are able to eliminate the intermediate file that was generated.

Below would be an example of how pipe command can be used in samtools example

bwa aln -t 4 ./hg19.fasta ./s1_1.fastq > ./s1_1.sai bwa aln -t 4 ./hg19.fasta ./s1_2.fastq > ./s1_2.sai bwa sampe ./hg19.fasta ./s1_1.sai ./s1_2.sai ./s1_1.fastq ./s1_2.fastq | \ samtools view -Shu - | \ samtools sort - - | \ samtools rmdup -s - - | \ tee s1_sorted_nodup.bam | \ bamToBed > s1_sorted_nodup.bed

A couple of things I wanted to point out here:

  • The \ symbol after the pipe symbol tells the terminal that I want to keep writing command on the next line, that way you can have multiline commands instead of super-long command on one line.
  • The tee command takes the information being piped and creates a file while passing the data along the pipe, this command is useful if you want to save an intermediate file but still need to do additional processing.
  • The – symbol in samtools it to tell the samtools program to take the input from pipe
  • It is not recommended you run this command all at once like I showed above if you have a giant bam file because the sort will take forever. Running things in a pipe can be more space efficient since you are not storing intermediate files but can create issues/bottlenecks elsewhere. If there are problems, run the pieces without piping.

Other common uses of pipes that I have not shown above it to pipe output into snp calling, you can find some examples here: http://sourceforge.net/apps/mediawiki/samtools/index.php?title=SAM_FAQ

samtools merge - *.bam | tee merged.bam | samtools rmdup - - | tee rmdup.bam | samtools pileup - vcf ref.fa - | gzip > raw-var.txt.gz

Compressing files with gzip can save a lot of space, you can also view these files with the zless and zcat command.

In case you are wondering if there is a way to run bwa sampe without storing intermediate sai files the command below will do this:

bwa sampe ./hg19.fasta <(bwa aln -t 4 ./hg19.fasta ./s1_1.fastq) <(bwa aln -t 4 ./hg19.fasta ./s1_2.fastq > ./s1_2.sai) ./s1_1.fastq ./s1_2.fastq > s1.sam

56 comments to Tutorial: Piping with samtools, bwa and bedtools

Leave a Reply

  

  

  

You can use these HTML tags

<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>