<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>小生这厢有礼了(BioFaceBook Personal Blog) &#187; 二代测序</title>
	<atom:link href="http://www.biofacebook.com/?cat=5&#038;feed=rss2" rel="self" type="application/rss+xml" />
	<link>http://www.biofacebook.com</link>
	<description>记录生物信息学点滴足迹（NGS,Genome,Meta,Linux)</description>
	<lastBuildDate>Sun, 23 Aug 2020 03:28:53 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=4.1.41</generator>
	<item>
		<title>Metagenomics standard operating procedure v2</title>
		<link>http://www.biofacebook.com/?p=1527</link>
		<comments>http://www.biofacebook.com/?p=1527#comments</comments>
		<pubDate>Sun, 23 Aug 2020 03:28:53 +0000</pubDate>
		<dc:creator><![CDATA[szypanther]]></dc:creator>
				<category><![CDATA[三代测序]]></category>
		<category><![CDATA[二代测序]]></category>
		<category><![CDATA[生物信息]]></category>

		<guid isPermaLink="false">http://www.biofacebook.com/?p=1527</guid>
		<description><![CDATA[<p>http://www.360doc.com/content/20/0823/10/71250389_931761136.shtml</p> <p>https://github.com/LangilleLab/microbiome_helper/wiki/Metagenomics-standard-operating-procedure-v2</p> <p>Note that this workflow is continually being updated. If you want to use the below commands be sure to keep track of them locally.</p> <p>Last updated: 9 Oct 2018 (see revisions above for earlier minor versions and Metagenomics SOPs on the SOP for earlier major versions)</p> <p>Important points to keep in mind:</p> The [...]]]></description>
				<content:encoded><![CDATA[<p><strong><em>http://www.360doc.com/content/20/0823/10/71250389_931761136.shtml</em></strong></p>
<p>https://github.com/LangilleLab/microbiome_helper/wiki/Metagenomics-standard-operating-procedure-v2</p>
<p><em>Note that this workflow is continually being updated. If you want to use the below commands be sure to keep track of them locally.</em></p>
<p><em>Last updated: 9 Oct 2018 (see <code>revisions</code> above for earlier minor versions and <code>Metagenomics SOPs</code> on the SOP for earlier major versions)</em></p>
<p>Important points to keep in mind:</p>
<ul>
<li>The below options are not necessarily best for your data; it is important to explore different options, especially at the read quality filtering stage.</li>
<li>If you are confused about a particular step be sure to check out the pages under <code>Metagenomic Resources</code> on the right side-bar.</li>
<li>You should check that only the FASTQs of interest are being specified at each step (e.g. with the <code>ls</code> command). If you are re-running commands multiple times or using optional steps the input filenames and/or folders could differ.</li>
<li>The tool <code>parallel</code> comes up several times in this workflow. Be sure to check out our tutorial on this tool <a href="https://github.com/LangilleLab/microbiome_helper/wiki/Quick-Introduction-to-GNU-Parallel">here</a>.</li>
<li>You can always run <code>parallel</code> with the <code>--dry-run</code> option to see what commands are being run to double-check they are doing what you want!</li>
</ul>
<p>This workflow starts with raw paired-end MiSeq or NextSeq data in demultiplexed FASTQ format assumed to be located within a folder called <code>raw_data</code>.</p>
<h2><a id="user-content-1-first-steps" class="anchor" href="https://github.com/LangilleLab/microbiome_helper/wiki/Metagenomics-standard-operating-procedure-v2#1-first-steps"></a>1. First Steps</h2>
<h3><a id="user-content-11-join-lanes-together-all-illumina-except-miseq" class="anchor" href="https://github.com/LangilleLab/microbiome_helper/wiki/Metagenomics-standard-operating-procedure-v2#11-join-lanes-together-all-illumina-except-miseq"></a>1.1 Join lanes together (all Illumina, except MiSeq)</h3>
<p>Concatenate multiple lanes of sequencing together (e.g. for NextSeq data). If you do <strong>NOT</strong> do this step, remember to change <code>cat_lanes</code> to <code>raw_data</code> in the further commands below that have assumed you are working with the most common type of metagenome data.</p>
<pre><code>concat_lanes.pl raw_data/* -o cat_lanes -p 4
</code></pre>
<h3><a id="user-content-12-inspect-read-quality" class="anchor" href="https://github.com/LangilleLab/microbiome_helper/wiki/Metagenomics-standard-operating-procedure-v2#12-inspect-read-quality"></a>1.2 Inspect read quality</h3>
<p>Run <code>fastqc</code> to allow manual inspection of the quality of sequences.</p>
<pre><code>mkdir fastqc_out
fastqc -t 4 cat_lanes/* -o fastqc_out/
</code></pre>
<h3><a id="user-content-13-optional-stitch-fr-reads" class="anchor" href="https://github.com/LangilleLab/microbiome_helper/wiki/Metagenomics-standard-operating-procedure-v2#13-optional-stitch-fr-reads"></a>1.3 (Optional) Stitch F+R reads</h3>
<p>Stitch paired-end reads together (summary of stitching results are written to &#8220;pear_summary_log.txt&#8221;). Note: it is important to check the % of reads assembled. It may be better to concatenate the forward and reverse reads together if the assembly % is too low (see step 2.2).</p>
<pre><code>run_pear.pl -p 4 -o stitched_reads cat_lanes/*
</code></pre>
<p><strong><em>If you don&#8217;t stitch your reads together at this step you will need to unzip your FASTQ files before continuing with the below commands.</em></strong></p>
<h2><a id="user-content-2-read-quality-control-and-contaminant-screens" class="anchor" href="https://github.com/LangilleLab/microbiome_helper/wiki/Metagenomics-standard-operating-procedure-v2#2-read-quality-control-and-contaminant-screens"></a>2. Read Quality-Control and Contaminant Screens</h2>
<h3><a id="user-content-21-running-kneaddata" class="anchor" href="https://github.com/LangilleLab/microbiome_helper/wiki/Metagenomics-standard-operating-procedure-v2#21-running-kneaddata"></a>2.1 Running KneadData</h3>
<p>Use <code>kneaddata</code> to run pre-processing tools. First Trimmomatic is run to remove low quality sequences. Then Bowtie2 is run to screen out contaminant sequences. Below we are screening out reads that map to the human or PhiX genomes. Note KneadData is being run below on all unstitched FASTQ pairs with <code>parallel</code>, you can see our quick tutorial on this tool <a href="https://github.com/LangilleLab/microbiome_helper/wiki/Quick-Introduction-to-GNU-Parallel">here</a>. For a detailed breakdown of the options in the below command see <a href="https://github.com/LangilleLab/microbiome_helper/wiki/Metagenomics-Sequencing-Pre-processing">this page</a>. The forward and reverse reads will be specified by &#8220;_1&#8243; and &#8220;_2&#8243; in the output files, ignore the &#8220;R1&#8243; in each filename. Note that the <code>\</code> characters at the end of each line are just to split the command over multiple lines to make it easier to read.</p>
<pre><code>parallel -j 1 --link 'kneaddata -i {1} -i {2} -o kneaddata_out/ \
-db /home/shared/bowtiedb/GRCh38_PhiX --trimmomatic /usr/local/prg/Trimmomatic-0.36/ \
-t 4 --trimmomatic-options "SLIDINGWINDOW:4:20 MINLEN:50" \
--bowtie2-options "--very-sensitive --dovetail" --remove-intermediate-output' \
 ::: cat_lanes/*_R1.fastq ::: cat_lanes/*_R2.fastq
</code></pre>
<p>Clean up the output directory (helps downstream commands) by moving the discarded sequences to a subfolder:</p>
<pre><code>mkdir kneaddata_out/contam_seq

mv kneaddata_out/*_contam*.fastq kneaddata_out/contam_seq
</code></pre>
<p>You can produce a logfile summarizing the kneadData output with this command:</p>
<pre><code>kneaddata_read_count_table --input kneaddata_out --output kneaddata_read_counts.txt
</code></pre>
<h3><a id="user-content-22-concatenate-unstitched-output-omit-if-data-stitched-above" class="anchor" href="https://github.com/LangilleLab/microbiome_helper/wiki/Metagenomics-standard-operating-procedure-v2#22-concatenate-unstitched-output-omit-if-data-stitched-above"></a>2.2 Concatenate unstitched output (Omit if data stitched above)</h3>
<p>If you did not stitch your paired-end reads together with <code>pear</code>, then you can concatenate the forward and reverse FASTQs together now. Note this is done after quality filtering so that both reads in a pair are either discarded or retained. It&#8217;s important to only specify the &#8220;paired&#8221; FASTQs outputted by <code>kneaddata</code> in the below command.</p>
<pre><code>concat_paired_end.pl -p 4 --no_R_match -o cat_reads kneaddata_out/*_paired_*.fastq 
</code></pre>
<p>You should check over the commands that are printed to screen to make sure the correct FASTQs are being concatenated together.</p>
<h2><a id="user-content-3-determine-functions-with-humann2" class="anchor" href="https://github.com/LangilleLab/microbiome_helper/wiki/Metagenomics-standard-operating-procedure-v2#3-determine-functions-with-humann2"></a>3. Determine Functions with HUMAnN2</h2>
<h3><a id="user-content-31-run-humann2" class="anchor" href="https://github.com/LangilleLab/microbiome_helper/wiki/Metagenomics-standard-operating-procedure-v2#31-run-humann2"></a>3.1 Run HUMAnN2</h3>
<p>Run <code>humann2</code> with <code>parallel</code> to calculate abundance of UniRef90 gene families and MetaCyc pathways. If you are processing environment data (e.g. soil samples) the vast majority of the reads may not map using this approach. Instead, you can try mapping against the UniRef50 database (which you can point to with the <code>--protein-database</code> option).</p>
<pre><code>mkdir humann2_out

parallel -j 4 'humann2 --threads 1 --input {} --output humann2_out/{/.}' ::: cat_reads/*fastq
</code></pre>
<h3><a id="user-content-32-merge-individual-sample-data-together" class="anchor" href="https://github.com/LangilleLab/microbiome_helper/wiki/Metagenomics-standard-operating-procedure-v2#32-merge-individual-sample-data-together"></a>3.2 Merge individual sample data together</h3>
<p>Join HUMAnN2 output per sample into one table.</p>
<pre><code>mkdir humann2_final_out

humann2_join_tables -s --input humann2_out/ --file_name pathabundance --output humann2_final_out/humann2_pathabundance.tsv
humann2_join_tables -s --input humann2_out/ --file_name pathcoverage --output humann2_final_out/humann2_pathcoverage.tsv
humann2_join_tables -s --input humann2_out/ --file_name genefamilies --output humann2_final_out/humann2_genefamilies.tsv
</code></pre>
<h3><a id="user-content-33-table-output-normalization" class="anchor" href="https://github.com/LangilleLab/microbiome_helper/wiki/Metagenomics-standard-operating-procedure-v2#33-table-output-normalization"></a>3.3 Table output normalization</h3>
<p>Re-normalize gene family and pathway <strong>abundances</strong> (so that all samples are in units of copies per million).</p>
<pre><code>humann2_renorm_table --input humann2_final_out/humann2_pathabundance.tsv --units cpm --output humann2_final_out/humann2_pathabundance_cpm.tsv
humann2_renorm_table --input humann2_final_out/humann2_genefamilies.tsv --units cpm --output humann2_final_out/humann2_genefamilies_cpm.tsv
</code></pre>
<h3><a id="user-content-34-separate-out-taxonomic-contributions" class="anchor" href="https://github.com/LangilleLab/microbiome_helper/wiki/Metagenomics-standard-operating-procedure-v2#34-separate-out-taxonomic-contributions"></a>3.4 Separate out taxonomic contributions</h3>
<p>Split HUMAnN2 output abundance tables in stratified and unstratified tables (stratified tables include the taxa associated with a functional profile).</p>
<pre><code>humann2_split_stratified_table --input humann2_final_out/humann2_pathabundance_cpm.tsv --output humann2_final_out
humann2_split_stratified_table --input humann2_final_out/humann2_genefamilies_cpm.tsv --output humann2_final_out
humann2_split_stratified_table --input humann2_final_out/humann2_pathcoverage.tsv --output humann2_final_out
</code></pre>
<h3><a id="user-content-35-format-stamp-function-file" class="anchor" href="https://github.com/LangilleLab/microbiome_helper/wiki/Metagenomics-standard-operating-procedure-v2#35-format-stamp-function-file"></a>3.5 Format STAMP function file</h3>
<p>Convert unstratified HUMAnN2 abundance tables to STAMP format by changing header-line. These commands remove the comment character and the spaces in the name of the first column. Trailing descriptions of the abundance datatype are also removed from each sample&#8217;s column name.</p>
<pre><code>sed 's/_Abundance-RPKs//g' humann2_final_out/humann2_genefamilies_cpm_unstratified.tsv | sed 's/# Gene Family/GeneFamily/' &gt; humann2_final_out/humann2_genefamilies_cpm_unstratified.spf
sed 's/_Abundance//g' humann2_final_out/humann2_pathabundance_cpm_unstratified.tsv | sed 's/# Pathway/Pathway/' &gt; humann2_final_out/humann2_pathabundance_cpm_unstratified.spf
sed 's/_Coverage//g' humann2_final_out/humann2_pathcoverage_unstratified.tsv | sed 's/# Pathway/Pathway/' &gt; humann2_final_out/humann2_pathcoverage_unstratified.spf
</code></pre>
<h3><a id="user-content-36-extract-metaphlan2-taxonomic-compositions" class="anchor" href="https://github.com/LangilleLab/microbiome_helper/wiki/Metagenomics-standard-operating-procedure-v2#36-extract-metaphlan2-taxonomic-compositions"></a>3.6 Extract MetaPhlAn2 taxonomic compositions</h3>
<p>Since HUMAnN2 also runs MetaPhlAn2 as an initial step, we can use the output tables already created to get the taxonomic composition of our samples. First we need to gather all the output MetaPhlAn2 results per sample into a single directory and then merge them into a single table using MetaPhlAn2&#8217;s <code>merge_metaphlan_tables.py</code> command. After this file is created we can fix the header so that each column corresponds to a sample name without the trailing &#8220;_metaphlan_bugs_list&#8221; description. Note that MetaPhlAn2 works best for well-characterized environments, like the human gut, and has low sensitivity in other environments.</p>
<pre><code>mkdir metaphlan2_out
cp humann2_out/*/*/*metaphlan_bugs_list.tsv metaphlan2_out/
/usr/local/metaphlan2/utils/merge_metaphlan_tables.py metaphlan2_out/*metaphlan_bugs_list.tsv &gt; metaphlan2_merged.txt
sed -i 's/_metaphlan_bugs_list//g' metaphlan2_merged.txt
</code></pre>
<h3><a id="user-content-37-format-stamp-taxonomy-file" class="anchor" href="https://github.com/LangilleLab/microbiome_helper/wiki/Metagenomics-standard-operating-procedure-v2#37-format-stamp-taxonomy-file"></a>3.7 Format STAMP taxonomy file</h3>
<p>Lastly we can convert this MetaPhlAn2 abundance table to STAMP format</p>
<pre><code>metaphlan_to_stamp.pl metaphlan2_merged.txt &gt; metaphlan2_merged.spf</code></pre>
]]></content:encoded>
			<wfw:commentRss>http://www.biofacebook.com/?feed=rss2&#038;p=1527</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Vipie: web pipeline for parallel characterization of viral populations from multiple NGS samples</title>
		<link>http://www.biofacebook.com/?p=1401</link>
		<comments>http://www.biofacebook.com/?p=1401#comments</comments>
		<pubDate>Mon, 16 Sep 2019 06:12:35 +0000</pubDate>
		<dc:creator><![CDATA[szypanther]]></dc:creator>
				<category><![CDATA[二代测序]]></category>
		<category><![CDATA[生物信息]]></category>

		<guid isPermaLink="false">http://www.biofacebook.com/?p=1401</guid>
		<description><![CDATA[Background <p>Next generation sequencing (NGS) technology allows laboratories to investigate virome composition in clinical and environmental samples in a culture-independent way. There is a need for bioinformatic tools capable of parallel processing of virome sequencing data by exactly identical methods: this is especially important in studies of multifactorial diseases, or in parallel comparison of laboratory [...]]]></description>
				<content:encoded><![CDATA[<h3 class="c-article__sub-heading u-h3" data-test="abstract-sub-heading">Background</h3>
<p>Next generation sequencing (NGS) technology allows laboratories to investigate virome composition in clinical and environmental samples in a culture-independent way. There is a need for bioinformatic tools capable of parallel processing of virome sequencing data by exactly identical methods: this is especially important in studies of multifactorial diseases, or in parallel comparison of laboratory protocols.</p>
<p>&nbsp;</p>
<p>https://sourceforge.net/projects/vipie/</p>
]]></content:encoded>
			<wfw:commentRss>http://www.biofacebook.com/?feed=rss2&#038;p=1401</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Utilities / Create consensus sequence from a BAM file</title>
		<link>http://www.biofacebook.com/?p=1396</link>
		<comments>http://www.biofacebook.com/?p=1396#comments</comments>
		<pubDate>Mon, 09 Sep 2019 01:34:13 +0000</pubDate>
		<dc:creator><![CDATA[szypanther]]></dc:creator>
				<category><![CDATA[二代测序]]></category>

		<guid isPermaLink="false">http://www.biofacebook.com/?p=1396</guid>
		<description><![CDATA[Utilities / Create consensus sequence from a BAM file Description <p>Given an indexed BAM file and corresponding reference genome (in fasta format), this tool constructs a consensus sequence based on the alignment.</p> Parameters <p>None</p> Details <p>This tool uses SAMtools, bcftools and vcfutils.pl script to create a consensus sequence for the given alignment file. The actual [...]]]></description>
				<content:encoded><![CDATA[<h2>Utilities / Create consensus sequence from a BAM file</h2>
<h3>Description</h3>
<p>Given an indexed BAM file and corresponding reference genome (in fasta format), this tool constructs a consensus sequence based on the alignment.</p>
<h3>Parameters</h3>
<p>None</p>
<h3>Details</h3>
<p>This tool uses SAMtools, bcftools and vcfutils.pl script to create a consensus sequence for the given alignment file. The actual command line executed is:</p>
<pre>  samtools mpileup -uf reference.fa aligment.bam | bcftools view -cg - | vcfutils vcf2fq  
</pre>
<p>Note that the input BAM file must be sorted before it can be used by this tool.</p>
<h3>Output</h3>
<p>Output is a fasta formatted sequence file.</p>
<h3>References</h3>
<p>This tool is based on the <a href="http://samtools.sourceforge.net/">SAMtools package</a>. Please cite the article <a href="http://www.ncbi.nlm.nih.gov/pubmed/19505943">The Sequence alignment/map (SAM) format and SAMtools</a> by Li H., Handsaker B., Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R. and 1000 Genome Project Data Processing Subgroup (2009) Bioinformatics, 25, 2078-9. [PMID: 19505943].</p>
]]></content:encoded>
			<wfw:commentRss>http://www.biofacebook.com/?feed=rss2&#038;p=1396</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Good software</title>
		<link>http://www.biofacebook.com/?p=1385</link>
		<comments>http://www.biofacebook.com/?p=1385#comments</comments>
		<pubDate>Wed, 24 Jul 2019 06:46:13 +0000</pubDate>
		<dc:creator><![CDATA[szypanther]]></dc:creator>
				<category><![CDATA[Linux相关]]></category>
		<category><![CDATA[二代测序]]></category>
		<category><![CDATA[兴趣杂项]]></category>

		<guid isPermaLink="false">http://www.biofacebook.com/?p=1385</guid>
		<description><![CDATA[<p>multiqc ranger ployly</p> <p>https://github.com/MultiQC</p> <p>https://github.com/ranger/ranger</p> <p>https://zhuanlan.zhihu.com/p/34369349</p> ]]></description>
				<content:encoded><![CDATA[<p>multiqc   ranger    ployly</p>
<p>https://github.com/MultiQC</p>
<p>https://github.com/ranger/ranger</p>
<p>https://zhuanlan.zhihu.com/p/34369349</p>
]]></content:encoded>
			<wfw:commentRss>http://www.biofacebook.com/?feed=rss2&#038;p=1385</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Data Binning and Plotting</title>
		<link>http://www.biofacebook.com/?p=1361</link>
		<comments>http://www.biofacebook.com/?p=1361#comments</comments>
		<pubDate>Tue, 26 Feb 2019 02:27:48 +0000</pubDate>
		<dc:creator><![CDATA[szypanther]]></dc:creator>
				<category><![CDATA[二代测序]]></category>
		<category><![CDATA[生物信息]]></category>

		<guid isPermaLink="false">http://www.biofacebook.com/?p=1361</guid>
		<description><![CDATA[<p>In statistics, data binning is a way to categorize a number of continuous values into a smaller number of buckets (bins). Each bucket defines an numerical interval. For example, if there is a variable about house-based education levels which are measured by continuous values ranged between 0 and 19, data binning will place each value [...]]]></description>
				<content:encoded><![CDATA[<p>In statistics, data binning is a way to categorize a number of continuous values into a smaller number of buckets (bins). Each bucket defines an numerical interval. For example, if there is a variable about house-based education levels which are measured by continuous values ranged between 0 and 19, data binning will place each value into one bucket if the value falls into the interval that the bucket covers. This post shows data binning in R as well as visualizing the bins.</p>
<p>The dataset contains 32038 observations for mean education level per house. Load the data into R.</p>
<div class="language-R highlighter-rouge">
<pre class="highlight"><code><span class="n">data</span> <span class="o">&lt;-</span> <span class="n">read.csv</span><span class="p">(</span><span class="s2">"data/meaneducation.csv"</span><span class="p">)</span>
<span class="n">x</span> <span class="o">&lt;-</span> <span class="n">data</span><span class="p">[,</span><span class="m">1</span><span class="p">]</span> <span class="c1">#take 1st column of data frame to a vector
</span></code></pre>
</div>
<p>After loading the data, inspect the data by running</p>
<div class="language-R highlighter-rouge">
<pre class="highlight"><code><span class="n">summary</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
</code></pre>
</div>
<div class="highlighter-rouge">
<pre class="highlight"><code>   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00   11.88   12.43   12.52   13.10   19.00 
</code></pre>
</div>
<p>The summary data shows the range of the variable, <span id="MathJax-Element-1-Frame" class="MathJax" style="display: inline; font-style: normal; font-weight: normal; line-height: normal; font-size: 19.2px; text-indent: 0px; text-align: left; text-transform: none; letter-spacing: normal; word-spacing: normal; word-wrap: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px; border: 0px; padding: 0px; margin: 0px; position: relative;" tabindex="0" data-mathml="&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;mo stretchy=&quot;false&quot;&gt;[&lt;/mo&gt;&lt;mn&gt;0&lt;/mn&gt;&lt;mo&gt;,&lt;/mo&gt;&lt;mn&gt;19&lt;/mn&gt;&lt;mo stretchy=&quot;false&quot;&gt;]&lt;/mo&gt;&lt;/math&gt;"><span id="MathJax-Span-1" class="math"><span id="MathJax-Span-2" class="mrow"><span id="MathJax-Span-3" class="mo">[</span><span id="MathJax-Span-4" class="mn">0</span><span id="MathJax-Span-5" class="mo">,</span><span id="MathJax-Span-6" class="mn">19</span><span id="MathJax-Span-7" class="mo">]</span></span></span><span class="MJX_Assistive_MathML">[0,19]</span></span>. The median education level per house is <span id="MathJax-Element-2-Frame" class="MathJax" style="display: inline; font-style: normal; font-weight: normal; line-height: normal; font-size: 19.2px; text-indent: 0px; text-align: left; text-transform: none; letter-spacing: normal; word-spacing: normal; word-wrap: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px; border: 0px; padding: 0px; margin: 0px; position: relative;" tabindex="0" data-mathml="&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;mn&gt;12.43&lt;/mn&gt;&lt;/math&gt;"><span id="MathJax-Span-8" class="math"><span id="MathJax-Span-9" class="mrow"><span id="MathJax-Span-10" class="mn">12.43</span></span></span><span class="MJX_Assistive_MathML">12.43</span></span></p>
<p>As the attribute is continuous, the data distribution can be visualized in a probability distribution graph. The following script creates a density plot showing the probability distribution of x.</p>
<div class="language-R highlighter-rouge">
<pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="n">ggplot2</span><span class="p">)</span>
<span class="n">ggplot</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="n">data</span><span class="p">,</span> <span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="n">data</span><span class="p">[,</span><span class="m">1</span><span class="p">]))</span> <span class="o">+</span>
  <span class="n">geom_density</span><span class="p">(</span><span class="n">fill</span><span class="o">=</span><span class="s1">'lightblue'</span><span class="p">)</span> <span class="o">+</span>
  <span class="n">geom_rug</span><span class="p">()</span> <span class="o">+</span>
  <span class="n">labs</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="s1">'mean education per house'</span><span class="p">)</span>

</code></pre>
</div>
<p><img src="http://www.jdatalab.com/assets/binning-plot/output_7_1.png" alt="png" /></p>
<p>While the density plot is informative, it might be too technical for less technical people to read. In this case, the better data visualization is binning the data into discrete categories and plotting the count of each bin. A sample script is provided below for creating bins and plot bin counts in a bar chart.</p>
<h2 id="creating-bins">Creating bins</h2>
<div class="language-R highlighter-rouge">
<pre class="highlight"><code><span class="c1"># set up boundaries for intervals/bins
</span><span class="n">breaks</span> <span class="o">&lt;-</span> <span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="m">2</span><span class="p">,</span><span class="m">5</span><span class="p">,</span><span class="m">8</span><span class="p">,</span><span class="m">10</span><span class="p">,</span><span class="m">13</span><span class="p">,</span><span class="m">15</span><span class="p">,</span><span class="m">19</span><span class="p">,</span><span class="m">21</span><span class="p">)</span>
<span class="c1"># specify interval/bin labels
</span><span class="n">labels</span> <span class="o">&lt;-</span> <span class="nf">c</span><span class="p">(</span><span class="s2">"&lt;2"</span><span class="p">,</span> <span class="s2">"2-5)"</span><span class="p">,</span> <span class="s2">"5-8)"</span><span class="p">,</span> <span class="s2">"8-10)"</span><span class="p">,</span> <span class="s2">"10-13)"</span><span class="p">,</span> <span class="s2">"13-15)"</span><span class="p">,</span> <span class="s2">"15-19)"</span><span class="p">,</span> <span class="s2">"&gt;=19"</span><span class="p">)</span>
<span class="c1"># bucketing data points into bins
</span><span class="n">bins</span> <span class="o">&lt;-</span> <span class="n">cut</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">breaks</span><span class="p">,</span> <span class="n">include.lowest</span> <span class="o">=</span> <span class="nb">T</span><span class="p">,</span> <span class="n">right</span><span class="o">=</span><span class="kc">FALSE</span><span class="p">,</span> <span class="n">labels</span><span class="o">=</span><span class="n">labels</span><span class="p">)</span>
<span class="c1"># inspect bins
</span><span class="n">summary</span><span class="p">(</span><span class="n">bins</span><span class="p">)</span>
</code></pre>
</div>
<div class="highlighter-rouge">
<pre class="highlight"><code>    &lt;2   2-5)   5-8)  8-10) 10-13) 13-15) 15-19)    19- 
   120      0     24    330  22595   7760   1200      9 
</code></pre>
</div>
<p>The script above results in a factor <code class="highlighter-rouge">bins</code> with 8 levels. Each level is marked by a string in the vector <code class="highlighter-rouge">labels</code>. Each value in <code class="highlighter-rouge">bins</code> indicates the level that the observation falls into. The summary of <code class="highlighter-rouge">bins</code> lists the count for each bin.</p>
<h2 id="plotting-the-bins">Plotting the bins</h2>
<p>After generating the bins, simply plot the bins by the following script:</p>
<div class="language-R highlighter-rouge">
<pre class="highlight"><code><span class="n">plot</span><span class="p">(</span><span class="n">bins</span><span class="p">,</span> <span class="n">main</span><span class="o">=</span><span class="s2">"Frequency"</span><span class="p">,</span> <span class="n">ylab</span><span class="o">=</span><span class="s2">"Number of Houses"</span><span class="p">,</span><span class="n">col</span><span class="o">=</span><span class="s2">"bisque"</span><span class="p">)</span>
</code></pre>
</div>
<p><img src="http://www.jdatalab.com/assets/binning-plot/output_14_0.png" alt="png" /></p>
<h3 id="using-ggplot">Using ggplot</h3>
<div class="language-R highlighter-rouge">
<pre class="highlight"><code><span class="n">y</span> <span class="o">&lt;-</span> <span class="n">cbind</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">bins</span><span class="p">)</span>
<span class="n">ggplot</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="n">y</span><span class="p">,</span> <span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="n">y</span><span class="o">$</span><span class="n">bins</span><span class="p">,</span><span class="n">fill</span><span class="o">=</span><span class="n">..count..</span><span class="p">))</span> <span class="o">+</span> 
  <span class="n">geom_bar</span><span class="p">(</span><span class="n">color</span><span class="o">=</span><span class="s1">'black'</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="m">0.9</span><span class="p">)</span> <span class="o">+</span> 
  <span class="n">stat_count</span><span class="p">(</span><span class="n">geom</span><span class="o">=</span><span class="s2">"text"</span><span class="p">,</span> <span class="n">aes</span><span class="p">(</span><span class="n">label</span><span class="o">=</span><span class="n">..count..</span><span class="p">),</span> <span class="n">hjust</span><span class="o">=</span><span class="m">-0.1</span><span class="p">)</span> <span class="o">+</span>
  <span class="n">theme_bw</span><span class="p">()</span> <span class="o">+</span> 
  <span class="n">labs</span><span class="p">(</span><span class="n">y</span><span class="o">=</span><span class="s2">"Number of Houses"</span><span class="p">,</span><span class="n">x</span><span class="o">=</span><span class="s2">"Mean Education Values"</span><span class="p">)</span> <span class="o">+</span>
  <span class="n">coord_flip</span><span class="p">()</span> <span class="o">+</span>
  <span class="n">ylim</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="m">25000</span><span class="p">)</span> <span class="o">+</span>
  <span class="n">scale_x_discrete</span><span class="p">(</span><span class="n">drop</span><span class="o">=</span><span class="kc">FALSE</span><span class="p">)</span> <span class="c1"># include the bins of length zero
</span></code></pre>
</div>
<p><img src="http://www.jdatalab.com/assets/binning-plot/output_16_1.png" alt="png" /></p>
]]></content:encoded>
			<wfw:commentRss>http://www.biofacebook.com/?feed=rss2&#038;p=1361</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>ChIPseeker for ChIP peak annotation (转贴）</title>
		<link>http://www.biofacebook.com/?p=1299</link>
		<comments>http://www.biofacebook.com/?p=1299#comments</comments>
		<pubDate>Tue, 28 Nov 2017 06:35:13 +0000</pubDate>
		<dc:creator><![CDATA[szypanther]]></dc:creator>
				<category><![CDATA[二代测序]]></category>
		<category><![CDATA[生物信息]]></category>

		<guid isPermaLink="false">http://www.biofacebook.com/?p=1299</guid>
		<description><![CDATA[<p>https://guangchuangyu.github.io/2014/04/chipseeker-for-chip-peak-annotation/</p> <p>ChIPpeakAnno WAS the only R package for ChIP peak annotation. I used it for annotating peak in my recent study.</p> <p>I found it does not consider the strand information of genes. I reported the bug to the authors, but they are reluctant to change.</p> <p>So I decided to develop my own package, ChIPseeker, and [...]]]></description>
				<content:encoded><![CDATA[<p>https://guangchuangyu.github.io/2014/04/chipseeker-for-chip-peak-annotation/</p>
<p><a href="http://www.bioconductor.org/packages/2.14/bioc/html/ChIPpeakAnno.html" target="_blank">ChIPpeakAnno</a> WAS the only R package for ChIP peak annotation. I used it for annotating peak in my recent study.</p>
<p>I found it does not consider the strand information of genes. I <a href="http://guangchuangyu.github.io/2014/01/bug-of-r-package-chippeakanno/" target="_blank">reported the bug</a> to the authors, but they are reluctant to change.</p>
<p>So I decided to develop my own package, <a href="http://www.bioconductor.org/packages/2.14/bioc/html/ChIPseeker.html" target="_blank">ChIPseeker</a>, and it’s now available in Bioconductor.</p>
<pre><code class="hljs ruby">&gt; <span class="hljs-keyword">require</span>(TxDb.Hsapiens.UCSC.hg19.knownGene)
&gt; txdb &lt;- TxDb.Hsapiens.UCSC.hg19.knownGene
&gt; <span class="hljs-keyword">require</span>(ChIPseeker)
&gt; peakfile = getSampleFiles()[[<span class="hljs-number">4</span>]]
&gt; peakfile
[<span class="hljs-number">1</span>] <span class="hljs-string">"/usr/local/Cellar/r/3.1.0/R.framework/Versions/3.1/Resources/library/ChIPseeker/extdata//GEO_sample_data/GSM1295076_CBX6_BF_ChipSeq_mergedReps_peaks.bed.gz"</span>
&gt; peakAnno &lt;- annotatePeak(peakfile, tssRegion=c(-<span class="hljs-number">3000</span>, <span class="hljs-number">1000</span>), TxDb=txdb, annoDb=<span class="hljs-string">"org.Hs.eg.db"</span>)
<span class="hljs-meta">&gt;&gt;</span> loading peak file...              <span class="hljs-number">2014</span>-<span class="hljs-number">04</span>-<span class="hljs-number">13</span> <span class="hljs-number">12</span><span class="hljs-symbol">:</span><span class="hljs-number">03</span><span class="hljs-symbol">:</span><span class="hljs-number">45</span> PM 
<span class="hljs-meta">&gt;&gt;</span> preparing features information...         <span class="hljs-number">2014</span>-<span class="hljs-number">04</span>-<span class="hljs-number">13</span> <span class="hljs-number">12</span><span class="hljs-symbol">:</span><span class="hljs-number">03</span><span class="hljs-symbol">:</span><span class="hljs-number">45</span> PM 
<span class="hljs-meta">&gt;&gt;</span> identifying nearest features...       <span class="hljs-number">2014</span>-<span class="hljs-number">04</span>-<span class="hljs-number">13</span> <span class="hljs-number">12</span><span class="hljs-symbol">:</span><span class="hljs-number">03</span><span class="hljs-symbol">:</span><span class="hljs-number">45</span> PM 
<span class="hljs-meta">&gt;&gt;</span> calculating distance from peak to TSS...  <span class="hljs-number">2014</span>-<span class="hljs-number">04</span>-<span class="hljs-number">13</span> <span class="hljs-number">12</span><span class="hljs-symbol">:</span><span class="hljs-number">03</span><span class="hljs-symbol">:</span><span class="hljs-number">46</span> PM 
<span class="hljs-meta">&gt;&gt;</span> assigning genomic annotation...       <span class="hljs-number">2014</span>-<span class="hljs-number">04</span>-<span class="hljs-number">13</span> <span class="hljs-number">12</span><span class="hljs-symbol">:</span><span class="hljs-number">03</span><span class="hljs-symbol">:</span><span class="hljs-number">46</span> PM 
<span class="hljs-meta">&gt;&gt;</span> adding gene annotation...             <span class="hljs-number">2014</span>-<span class="hljs-number">04</span>-<span class="hljs-number">13</span> <span class="hljs-number">12</span><span class="hljs-symbol">:</span><span class="hljs-number">03</span><span class="hljs-symbol">:</span><span class="hljs-number">50</span> PM 
<span class="hljs-meta">&gt;&gt;</span> assigning chromosome lengths          <span class="hljs-number">2014</span>-<span class="hljs-number">04</span>-<span class="hljs-number">13</span> <span class="hljs-number">12</span><span class="hljs-symbol">:</span><span class="hljs-number">03</span><span class="hljs-symbol">:</span><span class="hljs-number">50</span> PM 
<span class="hljs-meta">&gt;&gt;</span> done...                   <span class="hljs-number">2014</span>-<span class="hljs-number">04</span>-<span class="hljs-number">13</span> <span class="hljs-number">12</span><span class="hljs-symbol">:</span><span class="hljs-number">03</span><span class="hljs-symbol">:</span><span class="hljs-number">50</span> PM 
&gt; head(as.GRanges(peakAnno))
GRanges object with <span class="hljs-number">6</span> ranges <span class="hljs-keyword">and</span> <span class="hljs-number">14</span> metadata <span class="hljs-symbol">columns:</span>
      seqnames             ranges strand <span class="hljs-params">|          V4        V5
                      |</span>     
  [<span class="hljs-number">1</span>]     chr1 [ <span class="hljs-number">815092</span>,  <span class="hljs-number">817883</span>]      * <span class="hljs-params">| MACS_peak_1    295.76
  [2]     chr1 [1243287, 1244338]      * |</span> MACS_peak_2     <span class="hljs-number">63.19</span>
  [<span class="hljs-number">3</span>]     chr1 [<span class="hljs-number">2979976</span>, <span class="hljs-number">2981228</span>]      * <span class="hljs-params">| MACS_peak_3    100.16
  [4]     chr1 [3566181, 3567876]      * |</span> MACS_peak_4    <span class="hljs-number">558.89</span>
  [<span class="hljs-number">5</span>]     chr1 [<span class="hljs-number">3816545</span>, <span class="hljs-number">3818111</span>]      * <span class="hljs-params">| MACS_peak_5     57.57
  [6]     chr1 [6304864, 6305704]      * |</span> MACS_peak_6     <span class="hljs-number">54.57</span>
             annotation  geneChr geneStart   geneEnd geneLength geneStrand
                    
  [<span class="hljs-number">1</span>] Distal Intergenic     chr1    <span class="hljs-number">803451</span>    <span class="hljs-number">812182</span>       <span class="hljs-number">8732</span>          -
  [<span class="hljs-number">2</span>]  Promoter (&lt;=<span class="hljs-number">1</span>kb)     chr1   <span class="hljs-number">1243994</span>   <span class="hljs-number">1247057</span>       <span class="hljs-number">3064</span>          +
  [<span class="hljs-number">3</span>]  Promoter (&lt;=<span class="hljs-number">1</span>kb)     chr1   <span class="hljs-number">2976181</span>   <span class="hljs-number">2980350</span>       <span class="hljs-number">4170</span>          -
  [<span class="hljs-number">4</span>]  Promoter (&lt;=<span class="hljs-number">1</span>kb)     chr1   <span class="hljs-number">3547331</span>   <span class="hljs-number">3566671</span>      <span class="hljs-number">19341</span>          -
  [<span class="hljs-number">5</span>]  Promoter (&lt;=<span class="hljs-number">1</span>kb)     chr1   <span class="hljs-number">3816968</span>   <span class="hljs-number">3832011</span>      <span class="hljs-number">15044</span>          +
  [<span class="hljs-number">6</span>]            <span class="hljs-number">3</span>  UTR     chr1   <span class="hljs-number">6304252</span>   <span class="hljs-number">6305638</span>       <span class="hljs-number">1387</span>          +
           geneId transcriptId distanceToTSS         ENSEMBL      SYMBOL
                   
  [<span class="hljs-number">1</span>]      <span class="hljs-number">284593</span>   uc001abt.<span class="hljs-number">4</span>         -<span class="hljs-number">5701</span> ENSG00000230368      FAM41C
  [<span class="hljs-number">2</span>]      <span class="hljs-number">126789</span>   uc001aed.<span class="hljs-number">3</span>             <span class="hljs-number">0</span> ENSG00000169972       PUSL1
  [<span class="hljs-number">3</span>]      <span class="hljs-number">440556</span>   uc001aka.<span class="hljs-number">3</span>             <span class="hljs-number">0</span> ENSG00000177133   LINC00982
  [<span class="hljs-number">4</span>]       <span class="hljs-number">49856</span>   uc001ako.<span class="hljs-number">3</span>             <span class="hljs-number">0</span> ENSG00000116213      WRAP73
  [<span class="hljs-number">5</span>]   <span class="hljs-number">100133612</span>   uc001alg.<span class="hljs-number">3</span>             <span class="hljs-number">0</span> ENSG00000236423   LINC01134
  [<span class="hljs-number">6</span>]      <span class="hljs-number">390992</span>   uc009vly.<span class="hljs-number">2</span>          <span class="hljs-number">1452</span> ENSG00000173673        HES3
                                          GENENAME
                                       
  [<span class="hljs-number">1</span>] family with sequence similarity <span class="hljs-number">41</span>, member C
  [<span class="hljs-number">2</span>]              pseudouridylate synthase-like <span class="hljs-number">1</span>
  [<span class="hljs-number">3</span>]   long intergenic non-protein coding RNA <span class="hljs-number">982</span>
  [<span class="hljs-number">4</span>]      WD repeat containing, antisense to TP73
  [<span class="hljs-number">5</span>]  long intergenic non-protein coding RNA <span class="hljs-number">1134</span>
  [<span class="hljs-number">6</span>]       hes family bHLH transcription factor <span class="hljs-number">3</span>
  ---
  <span class="hljs-symbol">seqlengths:</span>
        chr1     chr1<span class="hljs-number">0</span>     chr11     chr12 ...      chr8      chr9      chrX
   <span class="hljs-number">249250621</span> <span class="hljs-number">135534747</span> <span class="hljs-number">135006516</span> <span class="hljs-number">133851895</span> ... <span class="hljs-number">146364022</span> <span class="hljs-number">141213431</span> <span class="hljs-number">155270560</span>
</code></pre>
<p><em>annotatePeak</em> function can accept peak file or GRanges object that contained peak information. The sample peakfile shown above is a sample output from MACS. All the information contained in peakfile will be kept in the output of annotatePeak.</p>
<p>The <em>annotation</em> column annotates genomic features of the peak, that is whether peak is located in Promoter, Exon, UTR, Intron or Intergenic. If the peak is annotated by Exon or Intron, more detail information will be provided. For instance, Exon (38885 exon 3 of 11) indicates that the peak is located at the 3th exon of gene 38885 (entrezgene ID) which contain 11 exons in total.</p>
<p>All the names of column that contains information of nearest gene annotated will start with gene in camelCaps style.</p>
<p>If parameter annoDb=“org.Hs.eg.db” is provided, extra columns of ENSEMBL, SYMBOL and GENENAME will be provided in the final output.</p>
<p>For more detail, please refer to the package <a href="http://www.bioconductor.org/packages/release/bioc/vignettes/ChIPseeker/inst/doc/ChIPseeker.pdf" target="_blank">vignette</a>. Comments or suggestions are welcome.</p>
<p>——- below was added at April 30, 2014 ———–</p>
<p>After two weeks of ChIPseeker release in Bioconductor, I am very glad to receive an email from <a href="http://www.flychip.org.uk/" target="_blank">Cambridge Systems Biology Centre</a> that they are using ChIPseeker in annotating Drosophila ChIP data.</p>
<p>Although I only illustrate examples in human, ChIPseeker do supports all species only if genomic information was available. <img src="http://www.biofacebook.com/wp-includes/images/smilies/icon_mrgreen.gif" alt=":mrgreen:" class="wp-smiley" /> <img src="http://www.biofacebook.com/wp-includes/images/smilies/icon_mrgreen.gif" alt=":mrgreen:" class="wp-smiley" /></p>
<pre><code class="hljs ruby">&gt; <span class="hljs-keyword">require</span>(TxDb.Dmelanogaster.UCSC.dm3.ensGene)
&gt; <span class="hljs-keyword">require</span>(ChIPseeker)
&gt; txdb &lt;- TxDb.Dmelanogaster.UCSC.dm3.ensGene
&gt;
&gt; chr &lt;- c(<span class="hljs-string">"chrX"</span>, <span class="hljs-string">"chrX"</span>, <span class="hljs-string">"chrX"</span>)
&gt; start &lt;- c(<span class="hljs-number">71349</span>, <span class="hljs-number">77124</span>, <span class="hljs-number">86364</span>)
&gt; <span class="hljs-keyword">end</span> &lt;- c(<span class="hljs-number">72835</span>, <span class="hljs-number">77836</span>, <span class="hljs-number">88378</span>)
&gt; peak &lt;- GRanges(seqnames = chr, ranges = IRanges(start, <span class="hljs-keyword">end</span>))
&gt; peakAnno &lt;- annotatePeak(peak,
+                          tssRegion=c(-<span class="hljs-number">3000</span>,<span class="hljs-number">100</span>),
+                          TxDb = txdb,
+                          annoDb=<span class="hljs-string">"org.Dm.eg.db"</span>)
<span class="hljs-meta">&gt;&gt;</span> preparing features information...         <span class="hljs-number">2014</span>-<span class="hljs-number">04</span>-<span class="hljs-number">30</span> <span class="hljs-number">21</span><span class="hljs-symbol">:</span><span class="hljs-number">28</span><span class="hljs-symbol">:</span><span class="hljs-number">27</span>
<span class="hljs-meta">&gt;&gt;</span> identifying nearest features...       <span class="hljs-number">2014</span>-<span class="hljs-number">04</span>-<span class="hljs-number">30</span> <span class="hljs-number">21</span><span class="hljs-symbol">:</span><span class="hljs-number">28</span><span class="hljs-symbol">:</span><span class="hljs-number">30</span>
<span class="hljs-meta">&gt;&gt;</span> calculating distance from peak to TSS...  <span class="hljs-number">2014</span>-<span class="hljs-number">04</span>-<span class="hljs-number">30</span> <span class="hljs-number">21</span><span class="hljs-symbol">:</span><span class="hljs-number">28</span><span class="hljs-symbol">:</span><span class="hljs-number">31</span>
<span class="hljs-meta">&gt;&gt;</span> assigning genomic annotation...       <span class="hljs-number">2014</span>-<span class="hljs-number">04</span>-<span class="hljs-number">30</span> <span class="hljs-number">21</span><span class="hljs-symbol">:</span><span class="hljs-number">28</span><span class="hljs-symbol">:</span><span class="hljs-number">31</span>
<span class="hljs-meta">&gt;&gt;</span> adding gene annotation...             <span class="hljs-number">2014</span>-<span class="hljs-number">04</span>-<span class="hljs-number">30</span> <span class="hljs-number">21</span><span class="hljs-symbol">:</span><span class="hljs-number">28</span><span class="hljs-symbol">:</span><span class="hljs-number">42</span>
Loading required <span class="hljs-symbol">package:</span> org.Dm.eg.db
Loading required <span class="hljs-symbol">package:</span> DBI

<span class="hljs-meta">&gt;&gt;</span> assigning chromosome lengths           <span class="hljs-number">2014</span>-<span class="hljs-number">04</span>-<span class="hljs-number">30</span> <span class="hljs-number">21</span><span class="hljs-symbol">:</span><span class="hljs-number">28</span><span class="hljs-symbol">:</span><span class="hljs-number">43</span>
<span class="hljs-meta">&gt;&gt;</span> done...                   <span class="hljs-number">2014</span>-<span class="hljs-number">04</span>-<span class="hljs-number">30</span> <span class="hljs-number">21</span><span class="hljs-symbol">:</span><span class="hljs-number">28</span><span class="hljs-symbol">:</span><span class="hljs-number">43</span>
&gt; as.GRanges(peakAnno)
GRanges object with <span class="hljs-number">3</span> ranges <span class="hljs-keyword">and</span> <span class="hljs-number">12</span> metadata <span class="hljs-symbol">columns:</span>
      seqnames         ranges strand <span class="hljs-params">|
                  |</span>
  [<span class="hljs-number">1</span>]     chrX [<span class="hljs-number">71349</span>, <span class="hljs-number">72835</span>]      * <span class="hljs-params">|
  [2]     chrX [77124, 77836]      * |</span>
  [<span class="hljs-number">3</span>]     chrX [<span class="hljs-number">86364</span>, <span class="hljs-number">88378</span>]      * <span class="hljs-params">|
                                           annotation  geneChr geneStart
                                            
  [1]                                Promoter (&lt;=1kb)     chrX     72458
  [2] Intron (FBtr0100135/FBgn0029128, intron 1 of 7)     chrX     72458
  [3] Intron (FBtr0100135/FBgn0029128, intron 2 of 7)     chrX     72458
        geneEnd geneLength geneStrand      geneId transcriptId distanceToTSS
                   
  [1]     96056      23599          + FBgn0029128  FBtr0100135             0
  [2]     96056      23599          + FBgn0029128  FBtr0100135          5378
  [3]     96056      23599          + FBgn0029128  FBtr0100135         15920
         ENTREZID      SYMBOL    GENENAME
        
  [1]     3354987         tyn     trynity
  [2]     3354987         tyn     trynity
  [3]     3354987         tyn     trynity
  ---
  seqlengths:
       chrX
   22422827
&gt;

</span></code></pre>
<h2 id="citation">Citation</h2>
<p><strong><em>Yu G</em></strong>, Wang LG and He QY<sup>*</sup>. <a href="http://bioinformatics.oxfordjournals.org/content/31/14/2382" target="_blank">ChIPseeker: an R/Bioconductor package for ChIP peak annotation, comparison and visualization</a>. <strong><em>Bioinformatics</em></strong>2015, 31(14):2382-2383.</p>
<pre><code class="hljs ruby"><span class="hljs-params"> </span></code></pre>
]]></content:encoded>
			<wfw:commentRss>http://www.biofacebook.com/?feed=rss2&#038;p=1299</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Mapping reads with bwa and bowtie</title>
		<link>http://www.biofacebook.com/?p=1289</link>
		<comments>http://www.biofacebook.com/?p=1289#comments</comments>
		<pubDate>Tue, 07 Nov 2017 09:00:20 +0000</pubDate>
		<dc:creator><![CDATA[szypanther]]></dc:creator>
				<category><![CDATA[Linux相关]]></category>
		<category><![CDATA[二代测序]]></category>

		<guid isPermaLink="false">http://www.biofacebook.com/?p=1289</guid>
		<description><![CDATA[<p>In this tutorial, we’re going to take a set of Illumina reads from an inbred Drosophila melanogaster line, and map them back to the reference genome. (After these steps, we could do things like generate a list of SNPs at which this line differs from the reference strain, or generate a genome sequence for this [...]]]></description>
				<content:encoded><![CDATA[<p>In this tutorial, we’re going to take a set of Illumina reads from an inbred Drosophila melanogaster line, and map them back to the reference genome. (After these steps, we could do things like generate a list of SNPs at which this line differs from the reference strain, or generate a genome sequence for this fly strain, but we’ll get to that later on in the course.) We are also going to use two different (but popular) mapping tools, bwa and bowtie. Among their differences is that bowtie (while smokin’ fast) does not deal with “gapped” alignments, i.e. it does not handle insertion/deletions well.</p>
<div id="getting-the-data" class="section">
<h2>Getting the data</h2>
<p>If you just finished the FastQC tutorial, you can keep working on the same machine. Otherwise, launch an EC2 instance, make a volume out of the same snapshot that we did before, and mount it. The data for this tutorial are in /data/drosophila:</p>
<div class="highlight-python">
<div class="highlight">
<pre>%% cd /data/drosophila/
</pre>
</div>
</div>
<p>For this example, you’ll also need the Drosophila reference genome:</p>
<div class="highlight-python">
<div class="highlight">
<pre>%% curl -O ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/dmel_r5.37_FB2011_05/fasta/dmel-all-chromosome-r5.37.fasta.gz
%% gunzip dmel-all-chromosome-r5.37.fasta.gz
</pre>
</div>
</div>
</div>
<div id="installing-and-running-bwa" class="section">
<h2>Installing and running bwa</h2>
<p>To actually do the mapping, we need to download and install bwa:</p>
<p>First we are going to grab the source files for bwa from sourceforge, using curl. It is important to know that we need to specify a few flags to let the program know that we want to leep the name (and filetype) for the file (-O) and that curl should follow relative hyperlinks (-L), to deal with redirection of the file site (or else curl won’t work with sourceforge).:</p>
<div class="highlight-python">
<div class="highlight">
<pre>%% curl -O -L http://sourceforge.net/projects/bio-bwa/files/bwa-0.5.9.tar.bz2
</pre>
</div>
</div>
<p>Now we want to uncompress the tarball file using “tar”. x extracts, v is verbose (telling you what it is doing), f skips prompting for each individual file, and j tells it to unzip .bz2 files.:</p>
<div class="highlight-python">
<div class="highlight">
<pre>%% tar xvfj bwa-0.5.9.tar.bz2
%% cd bwa-0.5.9
</pre>
</div>
</div>
<p>The make command calls a program that helps to automate the compiling process for the program.:</p>
<div class="highlight-python">
<div class="highlight">
<pre>%% make
</pre>
</div>
</div>
<p>(Note: If your system doesn’t have make installed already, you’ll need to run “apt-get install make” before you can build bwa – but if you are using the right AMI, it should be pre-installed.)</p>
<p>Copy the executable for bwa to a directory for binaries which is in your shell search path:</p>
<div class="highlight-python">
<div class="highlight">
<pre>%% cp bwa /usr/local/bin
%% cd ..
</pre>
</div>
</div>
<p>If you want to see what is in your shell search path (which can be modified later), run:</p>
<div class="highlight-python">
<div class="highlight">
<pre>%% echo $PATH
</pre>
</div>
</div>
<p>Now there are several steps involved in mapping our sequence reads and getting the output into a usable form. First we need to tell bwa to make an index of the reference genome; this will take a few minutes, so we’ve already got the index already generated in the data directory, but if you want to try it yourself, you can run (but see the note below first!!):</p>
<div class="highlight-python">
<div class="highlight">
<pre>%% bwa index dmel-all-chromosome-r5.37.fasta
</pre>
</div>
</div>
<p>Note: This step takes several minutes. If you run it, it will overwrite the index files we have already made, so don’t run the above line exactly; instead, create a copy of the reference genome and then index the copy instead, so that we can preserve our pre-computed reference index:</p>
<div class="highlight-python">
<div class="highlight">
<pre>%% cp dmel-all-chromosome-r5.37.fasta dmel-all-chromosome-r5.37.copy.fasta
%% bwa index dmel-all-chromosome-r5.37.copy.fasta
</pre>
</div>
</div>
<p>Next, we do the actual mapping. These were paired-end reads, which means that for each DNA fragment, we have sequence data from both ends. The sequences are therefore stored in two separate files (one for the data from each end), so we have two mapping steps to perform. For now, we’ll use bwa’s default settings. The files you’ll be running this on are datasets that have been trimmed down to just the first 1 million sequence reads to speed things up, but at the end you’ll be able to work with the final product from an analysis of the full dataset that we ran earlier (some of these steps take upwards of an hour on the full dataset, but just a couple minutes on the trimmed dataset). Run:</p>
<div class="highlight-python">
<div class="highlight">
<pre>%% bwa aln dmel-all-chromosome-r5.37.fasta RAL357_1.fastq &gt; RAL357_1.sai
%% bwa aln dmel-all-chromosome-r5.37.fasta RAL357_2.fastq &gt; RAL357_2.sai
</pre>
</div>
</div>
<p>These .sai files aren’t very useful to us, so we need to convert them into SAM files. In this step, bwa takes the information from the two separate ends of each sequence and combines everything together. Here’s how you do it (this may take around 10 minutes):</p>
<div class="highlight-python">
<div class="highlight">
<pre>%% bwa sampe dmel-all-chromosome-r5.37.fasta RAL357_1.sai RAL357_2.sai RAL357_1.fastq RAL357_2.fastq &gt; RAL357_bwa.sam
</pre>
</div>
</div>
<p>The SAM file is technically human-readable; take a look at it with:</p>
<div class="highlight-python">
<div class="highlight">
<pre>%% more RAL357_bwa.sam
</pre>
</div>
</div>
<p>&#8230;It’s not very easy to understand (if you are really curious about the SAM format, there is a 12-page manual at <a class="reference external" href="http://samtools.sourceforge.net/SAM1.pdf">http://samtools.sourceforge.net/SAM1.pdf</a>). For now we’ll use bowtie to map the same reads, and we’ll use another tool to visualize these mappings in a more intuitive way.</p>
</div>
<div id="bwa-options" class="section">
<h2>bwa options</h2>
<p>There are several options you can configure in bwa. Probably one of the most important is how many mismatches you will allow between a read and a potential mapping location for that location to be considered a match. The default is 4% of the read length, but you can set this to be either another proportion of the read length, or a fixed integer. For example, if you ran:</p>
<div class="highlight-python">
<div class="highlight">
<pre>%% bwa aln -n 4 dmel-all-chromosome-r5.37.fasta RAL357_1.fastq &gt; RAL357_1.sai
</pre>
</div>
</div>
<p>This would do almost the same thing as above, except this time, all locations in the reference genome that contain four or fewer mismatches to a given sequence read would be considered a match to that read.</p>
<p>Alternatively, you could do:</p>
<div class="highlight-python">
<div class="highlight">
<pre>%% bwa aln -n 0.01 dmel-all-chromosome-r5.37.fasta RAL357_1.fastq &gt; RAL357_1.sai
</pre>
</div>
</div>
<p>This would only allow reads to be mapped to locations at which the reference genome differs by 1% or less from a given read fragment.</p>
<p>If you want to speed things up, you can tell it to run the alignment on multiple threads at once (this will only work if your computer has a multi-core processor, which our Amazon image does). To do so, use the -t option to specify the number of threads. For example, the following line would run in two simultaneous threads:</p>
<div class="highlight-python">
<div class="highlight">
<pre>%% bwa aln -t 2 dmel-all-chromosome-r5.37.fasta RAL357_1.fastq &gt; RAL357_1.sai
</pre>
</div>
</div>
<p>bwa can also handle single-end reads. The only difference is that you would use samse instead of sampe to generate your SAM file:</p>
<div class="highlight-python">
<div class="highlight">
<pre>%% bwa samse dmel-all-chromosome-r5.37.fasta RAL357_1.sai RAL357_1.fastq &gt; RAL357_1.sam
</pre>
</div>
</div>
</div>
<div id="now-let-us-align-our-reads-using-bowtie" class="section">
<h2>Now let us align our reads using bowtie</h2>
<p>(Note: For simplicity we are going to put all of the bowtie related files into the same directory. For your own work, you may want to organize your file structure better than we have).</p>
<p>Let’s get bowtie from Sourceforge:</p>
<div class="highlight-python">
<div class="highlight">
<pre>%% curl -O -L http://sourceforge.net/projects/bowtie-bio/files/bowtie/0.12.7/bowtie-0.12.7-linux-x86_64.zip
</pre>
</div>
</div>
<p>unzip the file, and create a directory for bowtie. In this case, the program is precompiled so it comes as a binary executable:</p>
<div class="highlight-python">
<div class="highlight">
<pre>%% unzip bowtie-0.12.7-linux-x86_64.zip
</pre>
</div>
</div>
<p>Change directory:</p>
<div class="highlight-python">
<div class="highlight">
<pre>%% cd bowtie-0.12.7
</pre>
</div>
</div>
<p>Copy the bowtie files to a directory in you shell search path, and then move back to the parent directory (/data/drosophila):</p>
<div class="highlight-python">
<div class="highlight">
<pre>%% cp bowtie bowtie-build bowtie-inspect /usr/local/bin
</pre>
</div>
</div>
<p>Let’s create a new directory, “drosophila_bowtie” where we are going to place all the bowtie results:</p>
<div class="highlight-python">
<div class="highlight">
<pre>%% cd ..
%% mkdir drosophila_bowtie
%% cd drosophila_bowtie
</pre>
</div>
</div>
<p>Now we are going to build an index of the Drosophila genome using bowtie just like we did with bwa. The original Drosophila reference genome is in the same location as we used before. Again, we have already performed the indexing step (it takes about 7 minutes), so if you want to try it yourself, index a copy so you don’t over-write the one we’ve pre-run for you:</p>
<div class="highlight-python">
<div class="highlight">
<pre>%%  bowtie-build /data/drosophila/dmel-all-chromosome-r5.37.fasta  drosophila_bowtie
</pre>
</div>
</div>
<p>Now we get to map! We are going to use the default options for bowtie for the moment. Let’s go through this. there are a couple of flags that we have set, since We have paired end reads for these samples, and multiple processors. The general format for bowtie is (don’t run this):</p>
<div class="highlight-python">
<div class="highlight">
<pre>%% bowtie indexFile fastqFile outputFile
</pre>
</div>
</div>
<p>However we have some more details we want to include, so there are a couple of flags that we have to set. -S means that we want the output in SAM format. -p 2 is for multithreading (using more than one processor). In this case we have two to use. -1 -2 tells bowtie that these are paired end reads (the .fastq), and specifies which one is which.</p>
<p>This should take 35-40 minutes to run on the full dataset so we’ll run it on a trimmed version (should take about 3 minutes; later we’ll give you pre-computed results for the full set.):</p>
<div class="highlight-python">
<div class="highlight">
<pre>%% bowtie -S -p 2 drosophila_bowtie -1 /data/drosophila/RAL357_1.fastq -2 /data/drosophila/RAL357_2.fastq RAL357_bowtie.sam
</pre>
</div>
</div>
<p>You may see warning messages like:</p>
<div class="highlight-python">
<div class="highlight">
<pre>Warning: Exhausted best-first chunk memory for read SRR018286.1830915 USI-EAS034_2_PE_FC304DDAAXX:8:21:450:1640 length=45/1 (patid 1830914); skipping read
</pre>
</div>
</div>
<p>We will talk about some options you can set to deal with this.</p>
<p>The bowtie manual can be found here: <a class="reference external" href="http://bowtie-bio.sourceforge.net/manual.shtml">http://bowtie-bio.sourceforge.net/manual.shtml</a></p>
<p>Some additional useful arguments/options (at least for me) -m # Suppresses all alignments for a particular read if more than m reportable alignments exist. -v # no more than v mismatches in the entire length of the read -n -l # max number of mismatches in the high quality “seed”, which is the the first l base pairs of a read. -chunkmbs # number of mb of memory a thread is given to store path. Useful when you get warnings like above –best # make Bowtie “guarantee” that reported singleton alignments are “best” given the options –tryhard # try hard to find valid alignments, when they exit. VERY SLOW.</p>
</div>
<div id="processing-the-output-for-use-with-samtools" class="section">
<h2>Processing the output for use with Samtools</h2>
<p>Even the SAM file isn’t very useful unless we can get it into a program that generates more readable output or lets us visualize things in a more intuitive way. For now, we’ll get the output into a sorted BAM file so we can look at it using Samtools later.</p>
<p>Download and install Samtools:</p>
<div class="highlight-python">
<div class="highlight">
<pre>%% cd /data/drosophila
%% curl -O -L http://sourceforge.net/projects/samtools/files/samtools/0.1.16/samtools-0.1.16.tar.bz2
%% tar xvfj samtools-0.1.16.tar.bz2
%% cd samtools-0.1.16
%% make
%% cp samtools /usr/local/bin
%% cd misc/
%% cp *.pl maq2sam-long maq2sam-short md5fa md5sum-lite wgsim /usr/local/bin/
%% cd /data/drosophila
</pre>
</div>
</div>
<p>Like bwa, Samtools also requires us to go through several steps before we have our data in usable form. First, we need to have Samtools generate its own index of the reference genome:</p>
<div class="highlight-python">
<div class="highlight">
<pre>%% samtools faidx dmel-all-chromosome-r5.37.fasta
</pre>
</div>
</div>
<p>Next, we need to convert the SAM file into a BAM file. (A BAM file is just a binary version of a SAM file.):</p>
<div class="highlight-python">
<div class="highlight">
<pre>%% samtools import dmel-all-chromosome-r5.37.fasta.fai RAL357_bwa.sam RAL357_bwa.bam
</pre>
</div>
</div>
<p>Now, we need to sort the BAM file:</p>
<div class="highlight-python">
<div class="highlight">
<pre>%% samtools sort RAL357_bwa.bam RAL357_bwa.sorted
</pre>
</div>
</div>
<p>And last, we need Samtools to index the BAM file:</p>
<div class="highlight-python">
<div class="highlight">
<pre>%% samtools index RAL357_bwa.sorted.bam
</pre>
</div>
</div>
<p>Let us do this again for the bowtie output. We move back to the drosophila_bowtie directory (you could do this all from the other directory, but it gets harder to read the command with long pathnames):</p>
<div class="highlight-python">
<div class="highlight">
<pre>%% cd drosophila_bowtie
%% samtools import ../dmel-all-chromosome-r5.37.fasta.fai RAL357_bowtie.sam RAL357_bowtie.bam
</pre>
</div>
</div>
<p>Now, we need to sort the BAM file (also slow):</p>
<div class="highlight-python">
<div class="highlight">
<pre>%% samtools sort RAL357_bowtie.bam RAL357_bowtie.sorted
</pre>
</div>
</div>
<p>And last, we need Samtools to index the BAM file:</p>
<div class="highlight-python">
<div class="highlight">
<pre>%% samtools index RAL357_bowtie.sorted.bam
</pre>
</div>
</div>
<p>All done! Now we can use the sorted BAM file in Samtools to visualize our mappings, generate lists of SNPs, and call consensus sequences. We’ll get to all of that later on today and in the rest of the course.</p>
</div>
<div id="viewing-the-output-with-tview" class="section">
<h2>Viewing the output with TView</h2>
<p>Before we can use TView to compare Bowtie and BWA mappings, we need to sort the Bowtie BAM file, and generate an index for it.:</p>
<div class="highlight-python">
<div class="highlight">
<pre>%% cd drosophila_bowtie
%% samtools sort RAL357_full_bowtie.bam RAL357_full_bowtie.sorted
%% samtools index RAL357_full_bowtie.sorted.bam
</pre>
</div>
</div>
<p>Now that we’ve generated the files, we can view the output with TView. We’ll compare two different sorted:</p>
<div class="highlight-python">
<div class="highlight">
<pre>%% cd ..
%% samtools tview RAL357_full_bwa.sorted.bam
</pre>
</div>
</div>
<p>Now open an additional terminal window, and load the Bowtie mapping file there as well.:</p>
<div class="highlight-python">
<div class="highlight">
<pre>%% cd /data/drosophila/drosophila_bowtie
%% samtools tview RAL357_full_bowtie.sorted.bam ../dmel-all-chromosome-r5.37.fasta</pre>
</div>
</div>
</div>
]]></content:encoded>
			<wfw:commentRss>http://www.biofacebook.com/?feed=rss2&#038;p=1289</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>BACTERIAL GENOMICS TUTORIAL （repost）</title>
		<link>http://www.biofacebook.com/?p=1276</link>
		<comments>http://www.biofacebook.com/?p=1276#comments</comments>
		<pubDate>Thu, 08 Jun 2017 04:00:22 +0000</pubDate>
		<dc:creator><![CDATA[szypanther]]></dc:creator>
				<category><![CDATA[二代测序]]></category>
		<category><![CDATA[兴趣杂项]]></category>
		<category><![CDATA[生物信息]]></category>

		<guid isPermaLink="false">http://www.biofacebook.com/?p=1276</guid>
		<description><![CDATA[<p>[Originally posted by Kat on her BacPathGenomics blog, April 2013]</p> <p>This is a shameless plug for an article and accompanying tutorial I’ve just published together with David Edwards, my excellent MSc Bioinformatics student from the University of Melbourne. It’s currently available as a PDF pre-pub from BMC Microbial Informatics and Experimentation, but the web version [...]]]></description>
				<content:encoded><![CDATA[<p><em>[Originally posted by Kat on her <a href="https://bacpathgenomics.wordpress.com/2011/04/12/clarifier-bacterial-populations-and-communities/" target="_blank">BacPathGenomics blog</a>, April 2013]</em></p>
<p>This is a shameless plug for an article and accompanying tutorial I’ve just published together with David Edwards, my excellent MSc Bioinformatics student from the University of Melbourne. It’s currently available as a PDF pre-pub from <em>BMC Microbial Informatics and Experimentation</em>, but the web version will be available soon. The accompanying tutorial is available <a href="http://www.microbialinformaticsj.com/imedia/1627516282963968/supp1.pdf">here</a>.</p>
<p>The idea for this came from discussions at last year’s ASM <em>(Australian Society of Microbiology) </em>meeting, where it was highlighted that there was a lack of courses and tutorials available for biologists to learn the basics of genomic analysis so that they can make use of next gen sequencing. Michael Wise, a founding editor of <em>BMC Microbial Informatics and Experimentation</em> based at UWA in Perth, suggested the new journal would be an ideal home for such a tutorial… so here we are:</p>
<h1><a href="http://www.microbialinformaticsj.com/content/3/1/2/" target="_blank">Beginner’s guide to comparative bacterial genome analysis using next-generation sequence data</a></h1>
<p><a href="http://www.microbialinformaticsj.com/content/3/1/2/" target="_blank">http://www.microbialinformaticsj.com/content/3/1/2/</a></p>
<p>High throughput sequencing is now fast and cheap enough to be considered part of the toolbox for investigating bacteria, and there are thousands of bacterial genome sequences available for comparison in the public domain. Bacterial genome analysis is increasingly being performed by diverse groups in research, clinical and public health labs alike, who are interested in a wide array of topics related to bacterial genetics and evolution. Examples include outbreak analysis and the study of pathogenicity and antimicrobial resistance. In this beginner’s guide, we aim to provide an entry point for individuals with a biology background who want to perform their own bioinformatics analysis of bacterial genome data, to enable them to answer their own research questions. We assume readers will be familiar with genetics and the basic nature of sequence data, but do not assume any computer programming skills. The main topics covered are assembly, ordering of contigs, annotation, genome comparison and extracting common typing information. Each section includes worked examples using publicly available <em>E. coli</em> data and free software tools, all which can be performed on a desktop computer.</p>
<h2>Four great tools</h2>
<p>In the paper and tutorial, we introduce the four tools which we rely on most for basic analysis of bacterial genome assemblies: Velvet, ACT, Mauve and BRIG. All except ACT were developed as part of a PhD project, and have endured well beyond the original PhD to become well-known bioinformatics tools. New students take note!</p>
<p>In the paper, each tool is highlighted in its own figure, which includes some basic instructions. This is reproduced below, but is covered in much more detail in the tutorial that comes with the paper (link at the bottom).</p>
<p><strong>1. <em>Velvet</em> for genome assembly</strong></p>
<p>Possibly the most popular and widely used short read assembler, developed by the amazing Dan Zerbino during his PhD at EBI in Cambridge. Quite a PhD project!</p>
<p>[ <a href="http://www.ebi.ac.uk/~zerbino/velvet/">Download</a> | <a href="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2336801/">Paper</a> | <a href="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2952100/">Protocol</a> ]</p>
<p><a href="http://bacpathgenomics.files.wordpress.com/2013/04/figure1_velvet.png"><img class="alignnone size-full wp-image-635" src="https://bacpathgenomics.files.wordpress.com/2013/04/figure1_velvet.png?w=620" alt="Figure1_Velvet" /></a> <a href="http://bacpathgenomics.files.wordpress.com/2013/04/bacterial_genomics_tutorial.pdf"><br />
</a></p>
<p>Reads are assembled into contigs using <i>Velvet</i> and <i>VelvetOptimiser</i> in two steps, (1) <i>velveth</i> converts reads to <i>k</i>-mers using a hash table, and (2) <i>velvetg</i> assembles overlapping <i>k</i>-mers into contigs via a de Bruijn graph. <i>VelvetOptimiser</i> can be used to automate the optimisation of parameters for <i>velveth</i> and <i>velvetg</i> and generate an optimal assembly. To generate an assembly of <i>E. coli</i> O104:H4 using the command-line tool<i>Velvet</i>:</p>
<p>• Download <i>Velvet</i> [23] (we used version 1.2.08 on Mac OS X, compiled with a maximum <i>k</i>-mer length of 101 bp)</p>
<p>• Download the paired-end Illumina reads for <i>E. coli</i> O104:H4 strain TY-2482 (ENA accession SRR292770)</p>
<p>• Convert the reads to <i>k</i>-mers using this command:</p>
<p>velveth out_data_35 35 -<span class="skimlinks-unlinked">fastq.gz</span> -shortPaired -separate SRR292770_1.<span class="skimlinks-unlinked">fastq.gz</span> SRR292770_2.<span class="skimlinks-unlinked">fastq.gz</span></p>
<p>• Then, assemble overlapping <i>k</i>-mers into contigs using this command:</p>
<p>velvetg out_data_35 -clean yes -exp_cov 21 -cov_cutoff 2.81 -min_contig_lgth 200</p>
<p>This will produce a set of contigs in multifasta format for further analysis. See Additional file 1: Tutorial for further details, including help with downloading reads and using <i>VelvetOptimiser</i>.</p>
<p><strong>2. <em>ACT</em> for pairwise genome comparison</strong></p>
<p>Part of the Sanger Institute’s Artemis suite of tools. Also look at Artemis (single genome viewer), DNA Plotter (which can draw circular diagrams of your genomes) and BAMView (which can display mapped reads overlaid on a reference genome), they are all available <a href="http://www.sanger.ac.uk/resources/software/act/">here</a>.</p>
<p>[ <a href="http://www.sanger.ac.uk/resources/software/act/">Download</a> | <a href="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2606163/">Paper</a> | <a href="ftp://ftp.sanger.ac.uk/pub4/resources/software/act/act.pdf">Manual</a> ]</p>
<p><a href="http://bacpathgenomics.files.wordpress.com/2013/04/figure2_act.png"><img class="alignnone size-full wp-image-633" src="https://bacpathgenomics.files.wordpress.com/2013/04/figure2_act.png?w=620" alt="Figure2_ACT" /></a></p>
<p><i>Artemis</i> and <i>ACT</i> are free, interactive genome browsers (we used <i>ACT</i> 11.0.0 on Mac OS X).</p>
<p>• Open the assembled <i>E. coli</i> O104:H4 contigs in <i>Artemis</i> and write out a single, concatenated sequence using File -&gt; Write -&gt; All Bases -&gt; FASTA Format.</p>
<p>• Generate a comparison file between the concatenated contigs and 2 alternative reference genomes using the website <a href="http://www.webact.org/"><i>WebACT</i></a>.</p>
<p>• Launch <i>ACT</i> and load in the reference sequences, contigs and comparison files, to get a 3-way comparison like the one shown here.</p>
<p>Here, the <i>E. coli</i> O104:H4 contigs are in the middle row, the enteroaggregative <i>E. coli </i>strain Ec55989 is on top and the enterohaemorrhagic <i>E. coli</i> strain EDL933 is below. Details of the comparison can be viewed by zooming in, to the level of genes or DNA bases.</p>
<p><strong>3. <em>Mauve</em> for </strong><strong>contig ordering and </strong>multiple genome comparison</p>
<p>Developed by the wonderful Aaron Darling during his PhD, he is now Associate Professor at University of Technology Sydney. Also see <a href="http://code.google.com/p/ngopt/wiki/How_To_Score_Genome_Assemblies_with_Mauve">Mauve Assembly Metrics</a>, an optional plugin for assessing assembly quality which was developed for the Assemblathon.</p>
<p>[ <a href="http://asap.ahabs.wisc.edu/mauve/">Download</a> | <a href="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2892488/">Paper</a> | <a href="http://asap.ahabs.wisc.edu/mauve/mauve-user-guide/">User Guide</a> ]</p>
<p><a href="http://bacpathgenomics.files.wordpress.com/2013/04/fig3_mauve.jpeg"><img class="alignnone size-full wp-image-634" src="https://bacpathgenomics.files.wordpress.com/2013/04/fig3_mauve.jpeg?w=620" alt="Fig3_Mauve" /></a></p>
<p>Mauve is a free alignment tool with an interactive browser for visualising results (we used Mauve 2.3.1 on Mac OS X).</p>
<p>• Launch Mauve and select File -&gt; Align with progressiveMauve</p>
<p>• Click ‘Add Sequence…’ to add your genome assembly (e.g. annotated <i>E. coli </i>O104:H4 contigs) and other reference genomes for comparison.</p>
<p>• Specify a file for output, then click ‘Align…’</p>
<p>• When the alignment is finished, a visualization of the genome blocks and their homology will be displayed, as shown here. <i>E. coli </i>O104:H4 is on the top, red lines indicate contig boundaries within the assembly. Sequences outside coloured blocks do not have homologs in the other genomes.</p>
<p><strong>4. <em>BRIG</em> (BLAST Ring Image Generator) for multiple genome comparison</strong></p>
<p>From Nabil-Fareed Alikhan at the University of Queensland, also as part of a graduate project, which I believe is still in progress…</p>
<p>[ <a href="http://brig.sourceforge.net/">Download</a> | <a href="ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/">Download BLAST</a> | <a href="http://www.biomedcentral.com/1471-2164/12/402">Paper</a> | <a href="http://brig.sourceforge.net/">Tutorial</a> ]</p>
<p><img class="alignnone size-full wp-image-636" src="https://bacpathgenomics.files.wordpress.com/2013/04/fig4_brig.png?w=620" alt="Fig4_BRIG" /></p>
<p>BRIG is a free tool that requires a local installation of BLAST (we used BRIG 0.95 on Mac OS X). The output is a static image.</p>
<p>• Launch BRIG and set the reference sequence (EHEC EDL933 chromosome) and the location of other <i>E. coli</i> sequences for comparison. If you include reference sequences for the Stx2 phage and LEE pathogenicity island, it will be easy to see where these sequences are located.</p>
<p>• Click ‘Next’ and specify the sequence data and colour for each ring to be displayed in comparison to the reference.</p>
<p>• Click ‘Next’ and specify a title for the centre of the image and an output file, then click ‘Submit’ to run BRIG.</p>
<p>• BRIG will create an output file containing a circular image like the one shown here. It is easy to see that the Stx2 phage is present in the EHEC chromosomes (purple) and the outbreak genome (black), but not the EAEC or EPEC chromosomes.</p>
<h3>Tutorial</h3>
<p><strong>The tutorial accompanying the article</strong> is available <a href="http://www.microbialinformaticsj.com/imedia/1627516282963968/supp1.pdf">here</a>. To give you an idea of what’s covered, here is the table of contents:</p>
<p><strong>1. Genome assembly and annotation…………………………………………………………… 2</strong></p>
<p>1.1 Downloading <i>E. coli</i> sequences for assembly…………………………………………….. 2</p>
<p>1.2 Examining quality of reads (FastQC)………………………………………………………… 2</p>
<p>1.3 Velvet – assembling reads into contigs………………………………………………………. 4</p>
<p>1.3.1 Using VelvetOptimiser to optimise <i>de novo </i>assembly with Velvet………….. 6</p>
<p>1.4 Ordering contigs against a reference using Mauve………………………………………. 7</p>
<p>1.4.1 Viewing the ordered contigs (Mauve)………………………………………………… 10</p>
<p>1.4.2 Viewing the ordered contigs (ACT)……………………………………………………. 13</p>
<p>1.5 Mauve Assembly Metrics – Statistical View of the Contigs………………………… 15</p>
<p>1.6 Annotation with RAST……………………………………………………………………………. 15</p>
<p>1.6.1 Alternatives to RAST………………………………………………………………………. 19</p>
<p><strong>2. Comparative genome analysis……………………………………………………………….. 20</strong></p>
<p>2.1 Downloading <i>E. coli</i> genome sequences for comparative analysis………………. 20</p>
<p>2.2 Mauve – for multiple genome alignment……………………………………………………. 21</p>
<p>2.3 ACT – for detailed pairwise genome comparisons……………………………………… 24</p>
<p>2.3.1 Generating comparison files for ACT…………………………………………………. 24</p>
<p>2.3.2 Viewing genome comparisons in ACT……………………………………………….. 27</p>
<p>2.4 BRIG – Visualizing reference-based comparisons of multiple sequences……… 29</p>
<p><strong>3. Typing and specialist tools……………………………………………………………………. 34</strong></p>
<p>3.1 PHAST – for identification of phage sequences…………………………………………. 34</p>
<p>3.2 ResFinder – for identification of resistance gene sequences………………………… 34</p>
<p>3.3 Multilocus sequence typing…………………………………………………………………….. 34</p>
<p>3.4 PATRIC – online genome comparison tool………………………………………………… 34</p>
]]></content:encoded>
			<wfw:commentRss>http://www.biofacebook.com/?feed=rss2&#038;p=1276</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>POPULATION GENOMICS OF KLEBSIELLA (Repost)</title>
		<link>http://www.biofacebook.com/?p=1274</link>
		<comments>http://www.biofacebook.com/?p=1274#comments</comments>
		<pubDate>Thu, 08 Jun 2017 03:58:56 +0000</pubDate>
		<dc:creator><![CDATA[szypanther]]></dc:creator>
				<category><![CDATA[二代测序]]></category>
		<category><![CDATA[兴趣杂项]]></category>

		<guid isPermaLink="false">http://www.biofacebook.com/?p=1274</guid>
		<description><![CDATA[<p>https://holtlab.net/2015/06/23/population-genomics-of-klebsiella/</p> <p>Well, after almost 6 years, our Klebsiella pneumoniae genomics paper is finally out!</p> <p>It’s a beast of a thing and there are still a million and one questions to address just from this one data set. For those interested in looking at the data for themselves, the raw reads are available under accessionERP000165, the [...]]]></description>
				<content:encoded><![CDATA[<p>https://holtlab.net/2015/06/23/population-genomics-of-klebsiella/</p>
<p>Well, after almost 6 years, our <em>Klebsiella pneumoniae</em> genomics paper is finally out!</p>
<p>It’s a beast of a thing and there are still a million and one questions to address just from this one data set. For those interested in looking at the data for themselves, the raw reads are available under accession<a href="http://www.ebi.ac.uk/ena/data/view/ERP000165">ERP000165</a>, the assemblies are in Sylvain Brisse’s <a href="http://bigsdb.web.pasteur.fr/"><em>Klebsiella pneumoniae</em> BIGSdb</a> at the Pasteur Institute, and the tree + metadata are available for your interactive viewing pleasure in <a href="http://goo.gl/X6msv9">MicroReact</a>.</p>
<p>The paper itself is open access in PNAS, you can read it here.</p>
<p><strong><a href="http://www.pnas.org/content/early/2015/06/17/1501049112.abstract">Genomic analysis of diversity, population structure, virulence, and antimicrobial resistance in Klebsiella pneumoniae, an urgent threat to public health</a></strong></p>
<p><a href="https://katholtlab.files.wordpress.com/2015/06/screen-shot-2015-06-18-at-4-10-42-pm.png"><img class="wp-image-176 size-medium" src="https://katholtlab.files.wordpress.com/2015/06/screen-shot-2015-06-18-at-4-10-42-pm.png?w=300&amp;h=263" srcset="https://katholtlab.files.wordpress.com/2015/06/screen-shot-2015-06-18-at-4-10-42-pm.png?w=300&amp;h=263 300w, https://katholtlab.files.wordpress.com/2015/06/screen-shot-2015-06-18-at-4-10-42-pm.png?w=600&amp;h=526 600w, https://katholtlab.files.wordpress.com/2015/06/screen-shot-2015-06-18-at-4-10-42-pm.png?w=150&amp;h=131 150w" alt="Screen Shot 2015-06-18 at 4.10.42 pm" width="300" height="263" data-attachment-id="176" data-permalink="https://holtlab.net/2015/06/23/population-genomics-of-klebsiella/screen-shot-2015-06-18-at-4-10-42-pm/" data-orig-file="https://katholtlab.files.wordpress.com/2015/06/screen-shot-2015-06-18-at-4-10-42-pm.png" data-orig-size="688,602" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="Screen Shot 2015-06-18 at 4.10.42 pm" data-image-description="" data-medium-file="https://katholtlab.files.wordpress.com/2015/06/screen-shot-2015-06-18-at-4-10-42-pm.png?w=300&amp;h=263" data-large-file="https://katholtlab.files.wordpress.com/2015/06/screen-shot-2015-06-18-at-4-10-42-pm.png?w=620" /></a></p>
<p>There have been lots of really nice Klebs genomics papers out in the last 18 months or so, examining the evolution of the ST258 clone that carries the KPC gene (<em>K. pneumoniae</em> carbapaenemase) and is wreaking havoc in hospitals all over the place (including recently in Melbourne), and also several hospital-based studies tracking transmission and evolution of local drug-resistant outbreaks.</p>
<p>But that is just the tip of the <em>K. pneumoniae</em> iceberg.</p>
<p>Our paper asks a completely different set of questions, which you could basically sum up as “what the hell is <em>Klebsiella pneumoniae</em> anyway?”</p>
<p>To do this, we sequenced ~300 genomes of really diverse <em>K. pneumoniae</em> strains. We didn’t have much information about genetic diversity to go on, so we chose strains with different phenotypes (antimicrobial resistance patterns, capsular serotypes or sequence types where known), from different sources (human and animal, asymptomatic carriage and infections of various kinds), and from different geographical locations.</p>
<p>This was done by an international group of collaborators who pooled their resources, not only sharing their precious strain collections but also digging through hospital and other records to find as much information about the strains as possible.</p>
<p>You can view the tree and associated metadata, including geographical origin and source information, over on <a href="http://goo.gl/X6msv9">Microreact</a>.<img class=" size-full wp-image-177 aligncenter" src="https://katholtlab.files.wordpress.com/2015/06/screen-shot-2015-06-18-at-4-09-33-pm.png?w=620" srcset="https://katholtlab.files.wordpress.com/2015/06/screen-shot-2015-06-18-at-4-09-33-pm.png 503w, https://katholtlab.files.wordpress.com/2015/06/screen-shot-2015-06-18-at-4-09-33-pm.png?w=150 150w, https://katholtlab.files.wordpress.com/2015/06/screen-shot-2015-06-18-at-4-09-33-pm.png?w=300 300w" alt="Screen Shot 2015-06-18 at 4.09.33 pm" data-attachment-id="177" data-permalink="https://holtlab.net/2015/06/23/population-genomics-of-klebsiella/screen-shot-2015-06-18-at-4-09-33-pm/" data-orig-file="https://katholtlab.files.wordpress.com/2015/06/screen-shot-2015-06-18-at-4-09-33-pm.png?w=620" data-orig-size="503,306" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="Screen Shot 2015-06-18 at 4.09.33 pm" data-image-description="" data-medium-file="https://katholtlab.files.wordpress.com/2015/06/screen-shot-2015-06-18-at-4-09-33-pm.png?w=620?w=300" data-large-file="https://katholtlab.files.wordpress.com/2015/06/screen-shot-2015-06-18-at-4-09-33-pm.png?w=620?w=503" /></p>
<p>We found out some pretty interesting things about <em>Klebsiella pneumoniae</em>, including the fact that <strong>what’s identified as <em>K. pneumoniae</em> using standard tests is actually a mixed bag of three related species</strong>, that now have their own names: <em>K. pneumoniae</em> (KpI group, which includes the majority of clinical isolates and all the stuff you might have heard of like the clone that causes rhinoscleromatis, and the KPC clone ST258, and the hypervirulent clone ST23); <em>K. quasipneumoniae</em>; and <em>K. variicola </em>(plant associated and usually nitrogen-fixing).</p>
<p>By now, this species stuff has been nutted out (mainly by co-author Sylvain Brisse from Institut Pasteur) by analysing marker gene sequences, but it’s really important to be able to show that those patterns hold at the whole-genome level, and we found some interesting things about the distribution of the rarer species (see the paper for details).</p>
<p><a href="https://katholtlab.files.wordpress.com/2015/06/kp_pop_network.png"><img class="aligncenter size-large wp-image-201" src="https://katholtlab.files.wordpress.com/2015/06/kp_pop_network.png?w=620&amp;h=247" srcset="https://katholtlab.files.wordpress.com/2015/06/kp_pop_network.png?w=620&amp;h=247 620w, https://katholtlab.files.wordpress.com/2015/06/kp_pop_network.png?w=150&amp;h=60 150w, https://katholtlab.files.wordpress.com/2015/06/kp_pop_network.png?w=300&amp;h=120 300w, https://katholtlab.files.wordpress.com/2015/06/kp_pop_network.png?w=768&amp;h=306 768w, https://katholtlab.files.wordpress.com/2015/06/kp_pop_network.png 953w" alt="Kp_pop_network" width="620" height="247" data-attachment-id="201" data-permalink="https://holtlab.net/2015/06/23/population-genomics-of-klebsiella/kp_pop_network/" data-orig-file="https://katholtlab.files.wordpress.com/2015/06/kp_pop_network.png" data-orig-size="953,380" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="Kp_pop_network" data-image-description="" data-medium-file="https://katholtlab.files.wordpress.com/2015/06/kp_pop_network.png?w=300" data-large-file="https://katholtlab.files.wordpress.com/2015/06/kp_pop_network.png?w=620&amp;h=247" /></a></p>
<p>Importantly, we did the whole pan-genome analysis thing and found that <strong>as a population, <em>K. pneumoniae</em>has more genes than humans.</strong> Almost 30,000 in fact. Each individual strain has ~5,500 genes, but &lt;2,000 of those are core genes that are common to all <em>K. pneumoniae.</em> The rest are accessory genes that can come and go, helping the bug to adapt to new environments.</p>
<p><a href="https://katholtlab.files.wordpress.com/2015/06/kp_pan_genome.png"><img class="aligncenter size-full wp-image-202" src="https://katholtlab.files.wordpress.com/2015/06/kp_pan_genome.png?w=620" srcset="https://katholtlab.files.wordpress.com/2015/06/kp_pan_genome.png 491w, https://katholtlab.files.wordpress.com/2015/06/kp_pan_genome.png?w=150 150w, https://katholtlab.files.wordpress.com/2015/06/kp_pan_genome.png?w=300 300w" alt="Kp_pan_genome" data-attachment-id="202" data-permalink="https://holtlab.net/2015/06/23/population-genomics-of-klebsiella/kp_pan_genome/" data-orig-file="https://katholtlab.files.wordpress.com/2015/06/kp_pan_genome.png?w=620" data-orig-size="491,387" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="Kp_pan_genome" data-image-description="" data-medium-file="https://katholtlab.files.wordpress.com/2015/06/kp_pan_genome.png?w=620?w=300" data-large-file="https://katholtlab.files.wordpress.com/2015/06/kp_pan_genome.png?w=620?w=491" /></a></p>
<p>One of the cool things we were able to do with our data set, which you just can’t do with genomic studies focused on specific clones or outbreaks, was to look at statistical associations between accessory genes and phenotypes. Admittedly our available phenotypes were pretty limited, but we found a few important things.</p>
<p><strong>1. VIRULENCE</strong></p>
<p>We screened for genes associated with <strong>virulence in humans</strong> by focusing in on invasive infections, and comparing gene frequencies in human isolates from invasive community-acquired infections (i.e. the kind of infections that <em>land you in hospital</em>) vs. those in human carriage isolates or hospital acquired infections (i.e. the kind of infections that get you when you are already in hospital for something else and are particularly vulnerable to infection).</p>
<p>The only genes that were significantly associated with invasive infection in humans were <em>rmpA</em> and <em>rmpA2</em>, which upregulate capsule production, and genes related to <strong>iron acquisition</strong> (specifically acquired siderophore systems that can help to steal iron from animal hosts – see paper for details). These genes have been known about for some time, based on mouse models and knowledge of other pathogens, however we were able to show that these genes are significantly associated with invasive <em>K. pneumoniae</em>disease in humans, which is not something that can be proven directly using experimental systems. (The siderophore story actually goes a bit deeper than the iron issue… it’s a bit too complex to go into here but I recommend reading Michael Bachman’s work e.g. <a href="http://dx.doi.org/10.1128/mBio.00224-11" target="_blank">“Interaction of lipocalin 2, transferrin, and siderophores determines the replicative niche of <em>Klebsiella pneumoniae</em> during pneumonia” in <em>MBio</em>, 2012</a>).</p>
<p><a href="https://katholtlab.files.wordpress.com/2015/06/siderophores_by_class.png"><img class=" size-full wp-image-189 aligncenter" src="https://katholtlab.files.wordpress.com/2015/06/siderophores_by_class.png?w=620" srcset="https://katholtlab.files.wordpress.com/2015/06/siderophores_by_class.png?w=620 620w, https://katholtlab.files.wordpress.com/2015/06/siderophores_by_class.png?w=150 150w, https://katholtlab.files.wordpress.com/2015/06/siderophores_by_class.png?w=300 300w, https://katholtlab.files.wordpress.com/2015/06/siderophores_by_class.png?w=768 768w, https://katholtlab.files.wordpress.com/2015/06/siderophores_by_class.png 903w" alt="siderophores_by_class" data-attachment-id="189" data-permalink="https://holtlab.net/2015/06/23/population-genomics-of-klebsiella/siderophores_by_class/" data-orig-file="https://katholtlab.files.wordpress.com/2015/06/siderophores_by_class.png?w=620" data-orig-size="903,430" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="siderophores_by_class" data-image-description="" data-medium-file="https://katholtlab.files.wordpress.com/2015/06/siderophores_by_class.png?w=620?w=300" data-large-file="https://katholtlab.files.wordpress.com/2015/06/siderophores_by_class.png?w=620?w=620" /></a></p>
<p>Interestingly, doing the same test in bovine isolates showed that the story is very different: we had a lot of isolates from dairy herds, including clinical and subclinical mastitis; asymptomatic carriage isolates and strains from the farm environment… and found that an acquired lactose operon was almost perfectly associated with mastitis in cows! Something similar has been observed before in <em><a href="http://www.ncbi.nlm.nih.gov/pubmed/21536150" target="_blank">Streptococcus agalactiae</a></em>.</p>
<p><strong>2. ANTIBIOTIC RESISTANCE</strong></p>
<p><strong>Resistance genes</strong> were associated with human hospital isolates and human carriage isolates. This is far from an ideal study design to test this, as we had different types of collections from different geographical regions; however, even when you look within different local collections you see the same patterns: (a) comparing bovine and human isolates from NY state, the resistance genes were all in human isolates not cow isolates; (b) comparing human carriage and infection isolates (both nosocomial and community acquired) in Vietnam, the resistance genes were mainly in human carriage and hospital isolates, not in community infections; (c) in the remaining countries, isolates from infections acquired in hospital had more resistance genes than those that were considered nosocomial (diagnosed within 48 hours of admission).</p>
<p><a href="https://katholtlab.files.wordpress.com/2015/06/screen-shot-2015-06-18-at-5-12-24-pm.png"><img class="aligncenter wp-image-190 size-large" src="https://katholtlab.files.wordpress.com/2015/06/screen-shot-2015-06-18-at-5-12-24-pm.png?w=620&amp;h=298" srcset="https://katholtlab.files.wordpress.com/2015/06/screen-shot-2015-06-18-at-5-12-24-pm.png?w=620&amp;h=298 620w, https://katholtlab.files.wordpress.com/2015/06/screen-shot-2015-06-18-at-5-12-24-pm.png?w=150&amp;h=72 150w, https://katholtlab.files.wordpress.com/2015/06/screen-shot-2015-06-18-at-5-12-24-pm.png?w=300&amp;h=144 300w, https://katholtlab.files.wordpress.com/2015/06/screen-shot-2015-06-18-at-5-12-24-pm.png?w=768&amp;h=369 768w, https://katholtlab.files.wordpress.com/2015/06/screen-shot-2015-06-18-at-5-12-24-pm.png 933w" alt="Screen Shot 2015-06-18 at 5.12.24 pm" data-attachment-id="190" data-permalink="https://holtlab.net/2015/06/23/population-genomics-of-klebsiella/screen-shot-2015-06-18-at-5-12-24-pm/" data-orig-file="https://katholtlab.files.wordpress.com/2015/06/screen-shot-2015-06-18-at-5-12-24-pm.png" data-orig-size="933,448" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="Screen Shot 2015-06-18 at 5.12.24 pm" data-image-description="" data-medium-file="https://katholtlab.files.wordpress.com/2015/06/screen-shot-2015-06-18-at-5-12-24-pm.png?w=300" data-large-file="https://katholtlab.files.wordpress.com/2015/06/screen-shot-2015-06-18-at-5-12-24-pm.png?w=620" /></a></p>
<p>What’s really interesting is that <strong>while resistance genes and virulence genes are both highly mobile components of the accessory genome, they were essentially orthogonal </strong>in their distribution. The resistance genes were mainly in hospital acquired infections and carriage isolates, whereas the virulence strains were mainly found in isolates from community acquired infections.</p>
<p><a href="https://katholtlab.files.wordpress.com/2015/06/resistance-virulence-axis-2.png"><img class="aligncenter size-medium wp-image-207" src="https://katholtlab.files.wordpress.com/2015/06/resistance-virulence-axis-2.png?w=300&amp;h=253" srcset="https://katholtlab.files.wordpress.com/2015/06/resistance-virulence-axis-2.png?w=300&amp;h=253 300w, https://katholtlab.files.wordpress.com/2015/06/resistance-virulence-axis-2.png?w=600&amp;h=506 600w, https://katholtlab.files.wordpress.com/2015/06/resistance-virulence-axis-2.png?w=150&amp;h=126 150w" alt="resistance-virulence-axis-2" width="300" height="253" data-attachment-id="207" data-permalink="https://holtlab.net/2015/06/23/population-genomics-of-klebsiella/resistance-virulence-axis-2/" data-orig-file="https://katholtlab.files.wordpress.com/2015/06/resistance-virulence-axis-2.png" data-orig-size="639,538" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="resistance-virulence-axis-2" data-image-description="" data-medium-file="https://katholtlab.files.wordpress.com/2015/06/resistance-virulence-axis-2.png?w=300&amp;h=253" data-large-file="https://katholtlab.files.wordpress.com/2015/06/resistance-virulence-axis-2.png?w=620" /></a>So far, this has resulted in the emergence of <strong>two very different kinds of <em>K. pneumoniae</em> clones of importance to human health: hypervirulent clones, and multidrug resistant clones</strong>. This is pretty lucky, as it means the hypervirulent clones are generally sensitive to antibiotics (although antimicrobial treatment is difficult for some conditions, like liver abscess), and the problem of untreatable highly drug resistant <em>Klebs</em> infections has not spread outside of hospitals.</p>
<p>Unfortunately, our luck appears to be runnning out and <strong>we are already starting to see the convergence of virulence and resistance.</strong> Hypervirulent ST23 strains, which have all four of the acquired siderophore systems, are accumulating antibiotic resistance genes. And about half of the KPC <em>Klebs</em> ST258 strains causing problems in hospitals globally have one of the siderophore gene clusters, yersiniabactin, which has been shown in clinical ST258 isolates to <a href="http://www.ncbi.nlm.nih.gov/pmc/articles/pmid/21576334/" target="_blank">confer enhanced ability to cause pneumonia</a>. How long till the other virulence genes creep in? We need to be watching!</p>
<p>Also, our data indicates that <strong>there are plenty of other hypervirulent or multidrug resistant <em>Klebs</em>clones emerging out there…</strong> convergence of virulence and resistance could happen in any one of them, so we need to be thinking and monitoring beyond the well-known ST23 and ST258 strains.</p>
<p>In any case, genomic surveillance is going to become really important for <em>Klebsiella</em>…</p>
]]></content:encoded>
			<wfw:commentRss>http://www.biofacebook.com/?feed=rss2&#038;p=1274</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>TOOLS FOR BACTERIAL COMPARATIVE GENOMICS</title>
		<link>http://www.biofacebook.com/?p=1272</link>
		<comments>http://www.biofacebook.com/?p=1272#comments</comments>
		<pubDate>Thu, 08 Jun 2017 03:51:17 +0000</pubDate>
		<dc:creator><![CDATA[szypanther]]></dc:creator>
				<category><![CDATA[二代测序]]></category>
		<category><![CDATA[生物信息]]></category>

		<guid isPermaLink="false">http://www.biofacebook.com/?p=1272</guid>
		<description><![CDATA[<p class="p1">Yesterday I spoke at a workshop for JAMS TOAST (Sydney’s Joint Academic Microbiology Seminars – bioinformatics workshop)… I was asked to cover tools for comparative genomics, so I put together a list of the tried and tested programs that I find most useful for this kind of analysis. So here is the list.</p> <p [...]]]></description>
				<content:encoded><![CDATA[<p class="p1">Yesterday I spoke at a workshop for JAMS TOAST (Sydney’s Joint Academic Microbiology Seminars – bioinformatics workshop)… I was asked to cover tools for comparative genomics, so I put together a list of the tried and tested programs that I find most useful for this kind of analysis. So here is the list.</p>
<p class="p1"><em><strong>First, a few caveats…</strong></em></p>
<p class="p1">These are mostly tools with a graphical user interface (mostly Java based)… this means they should be pretty accessible to most users, however if you want to do analyses that are a bit more custom or niche, you will have to get your hands dirty and use the commandline (which you should learn to do anyway!!)</p>
<p class="p1">These tools are useful for small-ish scale genomic comparisons, in the order of 2-20 genomes.</p>
<p class="p1">Most of these tools are for assembled data, hence we start with how to assemble your data… this will become less of an issue as we move to long read sequencing with PacBio and MinION etc, but for the moment most of the data I work with is from large scale sequencing projects with Illumina (100s-1000s) so we use mapping-based approaches for a lot of tasks… so I have included a few comments about this at the end.</p>
<p class="p1"><em><strong>Beginner’s guide with walk-through tutorial</strong></em></p>
<p class="p1">Some of these tools, particularly the visualisation of whole genome comparisons (using Artemis &amp; ACT, Mauve, and BRIG) are covered the in the tutorial from our 2013 “<a href="http://www.microbialinformaticsj.com/content/3/1/2/">Beginner’s guide to comparative bacterial genome analysis using next-generation sequence data</a>“. So if you want a walk-through, that’s a good place to start. Please note that the <strong>links to the read sets used in the tutorial have changed</strong>, but you can still find the reads at NCBI or ENA under the accession provided (SRR292770).</p>
<hr />
<p class="p1"><b>First things first – Are my reads good quality?</b></p>
<p class="p3"><b><a href="http://www.bioinformatics.babraham.ac.uk/projects/fastqc/">FastQC</a> – </b>Generate graphical reports of read quality from the fastq files.</p>
<hr />
<p class="p1"><b>Assembly</b></p>
<p class="p3"><b><a href="http://bioinf.spbau.ru/spades">SPAdes</a> – </b>de Bruijn graph assembly, incorporating multiple kmers and read pairing information in the building of the graph. Think of this as a more sophisticated version of Velvet… in my experience, it nearly always provides better assemblies than Velvet, except on the rare occasion (1-5% of read sets) where it fails to get a good assembly at all. In which case, try Velvet!</p>
<p class="p3"><b><a href="http://www.ebi.ac.uk/~zerbino/velvet/">Velvet</a> – </b>The first and most widely used de Bruijn graph assembler built to tackle the problem of short reads. Graphs are built using a single kmer value, and read pairing information used for scaffolding only (unlike SPAdes, where multiple kmers are incorporated into a single graph and read pairing is also used directly in building the graph). How do you know what kmer to use? Use <a href="http://www.vicbioinformatics.com/software.velvetoptimiser.shtml">Velvet Optimiser</a>. Hate the command line? Try <a href="http://www.vicbioinformatics.com/software.vague.shtml">Vague</a>, a GUI wrapper for Velvet.</p>
<p class="p3"><i><strong>How do I judge if I have a good assembly?</strong> </i>Try <a href="http://quast.bioinf.spbau.ru/">QUAST</a></p>
<p class="p3"><i><strong>What other assemblers are there?</strong> </i>What’s best for what task? Take a look at <a href="http://nucleotid.es/results/">Nucelotid.es</a> and <a href="http://genome.cshlp.org/content/21/12/2224.long">Assemblathon</a>.</p>
<p class="p3"><em><b>How can I view my assembly graphs? </b></em>Try <a href="http://rrwick.github.io/Bandage/">Bandage</a> – freshly released from Ryan Wick, a MSc (Bioinformatics) student in my lab. Bandage allows you to view and manipulate de Druijn graphs output by Velvet or SPAdes… lots of super cool features and useful applications, see the <a href="http://rrwick.github.io/Bandage/">github site</a> for examples.</p>
<hr />
<p class="p1"><b>Working with assembled data</b></p>
<p class="p1">Now you have a nice set of assembled contigs – where are all the genes?</p>
<p class="p1"><b><i>Whole genome annotation</i></b></p>
<p class="p3"><b><a href="http://rast.nmpdr.org/">RAST</a> – </b>Web tool (upload contigs), uses the subsystems in the SEED database and provides detailed annotation and pathway analysis. Takes several hours per genome but I think this is the best way to get a high quality annotation (if you have only a few genomes to annotate).</p>
<p class="p3"><b><a href="http://www.vicbioinformatics.com/software.prokka.shtml">Prokka</a> – </b>Standalone command line tool, takes just a few minutes per genome. This is the best way to get good quality annotation in a flash, which is particularly useful if you have loads of genomes or need to annotate a pangenome or metagenome. Note however that the quality of functional information is not as good as RAST, and you will need several extra steps if you want to do functional profiling and pathway analysis of your genome(s)… which is in-built in RAST.</p>
<p class="p3"><b><i>Annotating specific types of features</i></b></p>
<p class="p3"><b>Resistance genes</b></p>
<ul>
<li class="p3"><a href="http://arpcard.mcmaster.ca/">CARD</a> – best combination of easy interface + pretty good database</li>
<li class="p3"><a href="http://www.mediterranee-infection.com/article.php?laref=282&amp;titer=arg-annot">ARG-Annot</a>  – best quality database (in my experience, focusing on Enterobacteriaceae)</li>
<li class="p3"><a href="http://cge.cbs.dtu.dk/services/ResFinder/">ResFinder</a> – easy interface, database needs ongoing development</li>
</ul>
<p class="p3"><b>Virulence genes</b></p>
<ul>
<li class="p3"><a href="http://patricbrc.org/">PATRIC</a> – for certain bugs only, but has good online tools for genome comparisons.</li>
<li class="p3"><a href="http://www.mgc.ac.cn/VFs/">VFDB</a> – broader range of species, but varying levels of comprehensiveness and you need to do more of the work yourself.</li>
</ul>
<p class="p3"><b>Insertion sequences</b></p>
<ul>
<li class="p3"><a href="http://www-genome.biotoul.fr/">IS saga</a> – Upload your genome and have IS saga find all the transposes in your genome using their IS finder database</li>
</ul>
<p class="p3"><b>Phage</b></p>
<ul>
<li class="p3"><a href="http://phast.wishartlab.com/">PHAST</a> – Upload your genome and this will identify likely prophage regions, summarising these at the level of whole phage and also individual genes.</li>
</ul>
<hr />
<p class="p1"><b>Viewing your genome – <a href="https://www.sanger.ac.uk/resources/software/artemis/">The Artemis Genome Browser</a></b></p>
<p class="p3">There are zillions of genome browsers out there, but I still love Artemis… and not just because I’m from the Sanger Institute. Unlike most genome browsers, Artemis was <strong>custom-built for bacterial genomes</strong>, which let’s face it are really quite different from humans and other eukaryotes.</p>
<p class="p3">The default view shows you your sequence and annotation, with 6 frame translation and allows you to easily edit or create features in the annotation, graph sequence-based functions like GC content and GC skew, and do all manner of other useful things. It’s been around for a zillion years (well, at least 10 or so) and is <strong>very well developed and supported</strong>.</p>
<p class="p3">Artemis has lots of cool features built in, including the ‘BamView’ feature that allows you to view BAM files that <strong>show the alignment of reads mapped to your genome</strong>, zoomed in to the base level or zoomed out to look at coverage and SNP distributions… this is also super handy for viewing RNAseq data, as you can easily see the stacks of reads derived from coding regions.</p>
<p class="p3">Artemis also has DNA Plotter built in, which you can use to <strong>generate those pretty circular figures</strong> of your genome sequences and their features.</p>
<p class="p3">Plus, when you’ve got used to using Artemis to get to know your shiny new genome, you can move on to<strong>viewing comparisons against other genomes</strong> using ACT – the Artemis Comparison Tool.</p>
<hr />
<p class="p1"><b>Comparing whole genome assemblies</b></p>
<p class="p1">NOTE: Walk-throughs of these tools, using examples from the 2011 <em>E. coli</em> outbreak in Germany, are covered in the “<a href="http://www.microbialinformaticsj.com/content/3/1/2/">Beginner’s guide to comparative bacterial genome analysis using next-generation sequence data</a>“.</p>
<p class="p3"><b><a href="https://www.sanger.ac.uk/resources/software/artemis/">ACT (Artemis Comparison Tool)</a> – </b>Visualises BLAST (or similar) comparisons of genomes. This is most useful for comparisons of two or a few genomes, and makes it easy to spot and zoom in to regions of difference.</p>
<p class="p3"><b><a href="http://gel.ahabs.wisc.edu/mauve/">Mauve</a> – </b>Whole genome alignment and viewer that can output SNPs, regions of difference, homologous blocks, etc. It can also be used to assess assembly quality against a reference, using Mauve Contig Metrics.</p>
<p class="p3"><b><a href="http://brig.sourceforge.net/">BRIG (BLAST Ring Image Generator)</a> – </b>Gives a global view of whole genome comparisons by visualising BLAST comparisons via<strong> pretty circular figures</strong>. This is suitable for comparing lots of genomes, although because you have to enter each one through the GUI, it’s tricky to do more than a dozen or so.</p>
<hr />
<p class="p1"><b>Whole genome SNP-based phylogenies (from assembled data)</b></p>
<p class="p3">You can’t go past Adam Phiippy’s<b> <a href="http://harvest.readthedocs.org/en/v1.1/">Harvest Suite</a></b></p>
<p class="p3">Parsnp – Compare genomes to a reference (using MUMmer) to identify core genome SNPs and build a phylogeny</p>
<p class="p3">Gingr – View the phylogeny and associated SNP calls (VCF format)… also useful for visualising tree + VCF that you have created in other ways, e.g. from mapping.</p>
<hr />
<p class="p3"><b>Detecting recombination in whole genome comparisons</b></p>
<p class="p3"><a href="https://github.com/sanger-pathogens/gubbins">Gubbins</a> – A new implementation of the approach first used in <a href="http://www.ncbi.nlm.nih.gov/pubmed/21273480">Nick Croucher’s 2011 Science paper on Streptococcus pneumoniae</a>. Command-line driven and runs pretty fast (&lt;2 hours usually on our data).</p>
<p class="p3"><a href="http://www.helsinki.fi/bsg/software/BRAT-NextGen/">BRAT NextGen</a> – Uses a similar idea to Gubbins but using Bayesian clustering is GUI-driven… sounds nice, but actually I find it less convenient than Gubbins as there are manual steps required and then you need to run lots of iterations to get significance values.</p>
<hr />
<hr />
<p class="p1"><b><i>Mapping based analyses</i></b></p>
<p class="p3"><b>Why?</b></p>
<p class="p3">If you have specific questions to answer, where precise variant detection is important (e.g. allele calling, MLST, SNP detection, typing, mutation detection), mapping provides greater sensitivity and specificity than assembled data. Basically, if you want to be really sure about a variant call, you should be using the full information available in the reads rather than relying on the assembler and consensus base caller to get things right every time. See our SRST2 paper if you don’t believe me.</p>
<p class="p3">Also, if you need quick answers to specific questions, this is almost always going to be achieved faster and more accurately if you work direct from reads without attempting to generate high quality assemblies first.</p>
<p class="p3"><b><i>The basics</i></b></p>
<p class="p3">For mapping our go-to is BWA or Bowtie2 (getting from fastq -&gt; BAM). For processing of BAMs we use: SAMtools and BAMtools for variant calling, and BAMstats and BEDtools for summarising coverage and other information from the alignments.</p>
<p class="p3"><em><b>Pipelines for specific tasks</b></em></p>
<p class="p3">There are loads of pipelines around the place that use the basic tools above to do specific tasks. A few of ours are:</p>
<ul>
<li class="p3"><strong><a href="http://katholt.github.io/srst2/">SRST2</a> </strong>– MLST, resistance genes, virulence genes</li>
<li class="p3"><strong><a href="https://github.com/jhawkey/IS_mapper">ISMapper</a> </strong>– IS (insertion sequence / tranposase) insertions</li>
<li class="p3"><strong><a href="http://https//github.com/katholt/RedDog">RedDog</a> </strong>– Whole genome SNP-based phylogenies</li>
</ul>
<p>&nbsp;</p>
<p>&nbsp;</p>
<p>&nbsp;</p>
<p>https://holtlab.net/2015/02/25/tools-for-bacterial-comparative-genomics/</p>
]]></content:encoded>
			<wfw:commentRss>http://www.biofacebook.com/?feed=rss2&#038;p=1272</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
