<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>小生这厢有礼了(BioFaceBook Personal Blog) &#187; 统计学习</title>
	<atom:link href="https://www.biofacebook.com/?cat=11&#038;feed=rss2" rel="self" type="application/rss+xml" />
	<link>https://www.biofacebook.com</link>
	<description>记录生物信息学点滴足迹（NGS,Genome,Meta,Linux)</description>
	<lastBuildDate>Sun, 23 Aug 2020 03:28:53 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=4.1.41</generator>
	<item>
		<title>Making a pairwise distance matrix in pandas</title>
		<link>https://www.biofacebook.com/?p=1517</link>
		<comments>https://www.biofacebook.com/?p=1517#comments</comments>
		<pubDate>Tue, 09 Jun 2020 03:31:54 +0000</pubDate>
		<dc:creator><![CDATA[szypanther]]></dc:creator>
				<category><![CDATA[统计学习]]></category>

		<guid isPermaLink="false">http://www.biofacebook.com/?p=1517</guid>
		<description><![CDATA[<p>This is a somewhat specialized problem that forms part of a lot of data science and clustering workflows. It starts with a relatively straightforward question: if we have a bunch of measurements for two different things, how do we come up with a single number that represents the difference between the two things?</p> <p>An example [...]]]></description>
				<content:encoded><![CDATA[<p>This is a somewhat specialized problem that forms part of a lot of data science and clustering workflows. It starts with a relatively straightforward question: if we have a bunch of measurements for two different things, how do we come up with a single number that represents the difference between the two things?</p>
<p>An example will make the question clearer. Let&#8217;s load our olympic medal dataset:</p>
<pre><code class="lang-python hljs"><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
pd.options.display.max_rows = <span class="hljs-number">10</span>
pd.options.display.max_columns = <span class="hljs-number">6</span>
data = pd.read_csv(<span class="hljs-string">"https://raw.githubusercontent.com/mojones/binders/master/olympics.csv"</span>, sep=<span class="hljs-string">"\t"</span>)
data</code></pre>
<div data-preserve-html-node="true">
<table class="dataframe" border="1" data-preserve-html-node="true">
<thead data-preserve-html-node="true">
<tr data-preserve-html-node="true">
<th data-preserve-html-node="true"></th>
<th data-preserve-html-node="true">City</th>
<th data-preserve-html-node="true">Year</th>
<th data-preserve-html-node="true">Sport</th>
<th data-preserve-html-node="true">&#8230;</th>
<th data-preserve-html-node="true">Medal</th>
<th data-preserve-html-node="true">Country</th>
<th data-preserve-html-node="true">Int Olympic Committee code</th>
</tr>
</thead>
<tbody data-preserve-html-node="true">
<tr data-preserve-html-node="true">
<th data-preserve-html-node="true">0</th>
<td data-preserve-html-node="true">Athens</td>
<td data-preserve-html-node="true">1896</td>
<td data-preserve-html-node="true">Aquatics</td>
<td data-preserve-html-node="true">&#8230;</td>
<td data-preserve-html-node="true">Gold</td>
<td data-preserve-html-node="true">Hungary</td>
<td data-preserve-html-node="true">HUN</td>
</tr>
<tr data-preserve-html-node="true">
<th data-preserve-html-node="true">1</th>
<td data-preserve-html-node="true">Athens</td>
<td data-preserve-html-node="true">1896</td>
<td data-preserve-html-node="true">Aquatics</td>
<td data-preserve-html-node="true">&#8230;</td>
<td data-preserve-html-node="true">Silver</td>
<td data-preserve-html-node="true">Austria</td>
<td data-preserve-html-node="true">AUT</td>
</tr>
<tr data-preserve-html-node="true">
<th data-preserve-html-node="true">2</th>
<td data-preserve-html-node="true">Athens</td>
<td data-preserve-html-node="true">1896</td>
<td data-preserve-html-node="true">Aquatics</td>
<td data-preserve-html-node="true">&#8230;</td>
<td data-preserve-html-node="true">Bronze</td>
<td data-preserve-html-node="true">Greece</td>
<td data-preserve-html-node="true">GRE</td>
</tr>
<tr data-preserve-html-node="true">
<th data-preserve-html-node="true">3</th>
<td data-preserve-html-node="true">Athens</td>
<td data-preserve-html-node="true">1896</td>
<td data-preserve-html-node="true">Aquatics</td>
<td data-preserve-html-node="true">&#8230;</td>
<td data-preserve-html-node="true">Gold</td>
<td data-preserve-html-node="true">Greece</td>
<td data-preserve-html-node="true">GRE</td>
</tr>
<tr data-preserve-html-node="true">
<th data-preserve-html-node="true">4</th>
<td data-preserve-html-node="true">Athens</td>
<td data-preserve-html-node="true">1896</td>
<td data-preserve-html-node="true">Aquatics</td>
<td data-preserve-html-node="true">&#8230;</td>
<td data-preserve-html-node="true">Silver</td>
<td data-preserve-html-node="true">Greece</td>
<td data-preserve-html-node="true">GRE</td>
</tr>
<tr data-preserve-html-node="true">
<th data-preserve-html-node="true">&#8230;</th>
<td data-preserve-html-node="true">&#8230;</td>
<td data-preserve-html-node="true">&#8230;</td>
<td data-preserve-html-node="true">&#8230;</td>
<td data-preserve-html-node="true">&#8230;</td>
<td data-preserve-html-node="true">&#8230;</td>
<td data-preserve-html-node="true">&#8230;</td>
<td data-preserve-html-node="true">&#8230;</td>
</tr>
<tr data-preserve-html-node="true">
<th data-preserve-html-node="true">29211</th>
<td data-preserve-html-node="true">Beijing</td>
<td data-preserve-html-node="true">2008</td>
<td data-preserve-html-node="true">Wrestling</td>
<td data-preserve-html-node="true">&#8230;</td>
<td data-preserve-html-node="true">Silver</td>
<td data-preserve-html-node="true">Germany</td>
<td data-preserve-html-node="true">GER</td>
</tr>
<tr data-preserve-html-node="true">
<th data-preserve-html-node="true">29212</th>
<td data-preserve-html-node="true">Beijing</td>
<td data-preserve-html-node="true">2008</td>
<td data-preserve-html-node="true">Wrestling</td>
<td data-preserve-html-node="true">&#8230;</td>
<td data-preserve-html-node="true">Bronze</td>
<td data-preserve-html-node="true">Lithuania</td>
<td data-preserve-html-node="true">LTU</td>
</tr>
<tr data-preserve-html-node="true">
<th data-preserve-html-node="true">29213</th>
<td data-preserve-html-node="true">Beijing</td>
<td data-preserve-html-node="true">2008</td>
<td data-preserve-html-node="true">Wrestling</td>
<td data-preserve-html-node="true">&#8230;</td>
<td data-preserve-html-node="true">Bronze</td>
<td data-preserve-html-node="true">Armenia</td>
<td data-preserve-html-node="true">ARM</td>
</tr>
<tr data-preserve-html-node="true">
<th data-preserve-html-node="true">29214</th>
<td data-preserve-html-node="true">Beijing</td>
<td data-preserve-html-node="true">2008</td>
<td data-preserve-html-node="true">Wrestling</td>
<td data-preserve-html-node="true">&#8230;</td>
<td data-preserve-html-node="true">Gold</td>
<td data-preserve-html-node="true">Cuba</td>
<td data-preserve-html-node="true">CUB</td>
</tr>
<tr data-preserve-html-node="true">
<th data-preserve-html-node="true">29215</th>
<td data-preserve-html-node="true">Beijing</td>
<td data-preserve-html-node="true">2008</td>
<td data-preserve-html-node="true">Wrestling</td>
<td data-preserve-html-node="true">&#8230;</td>
<td data-preserve-html-node="true">Silver</td>
<td data-preserve-html-node="true">Russia</td>
<td data-preserve-html-node="true">RUS</td>
</tr>
</tbody>
</table>
<p data-preserve-html-node="true">29216 rows × 12 columns</p>
</div>
<p>and measure, for each different country, the number of medals they&#8217;ve won in each different sport:</p>
<pre><code class="lang-python hljs">summary = data.groupby([<span class="hljs-string">'Country'</span>, <span class="hljs-string">'Sport'</span>]).size().unstack().fillna(<span class="hljs-number">0</span>)
summary</code></pre>
<div data-preserve-html-node="true">
<table class="dataframe" border="1" data-preserve-html-node="true">
<thead data-preserve-html-node="true">
<tr data-preserve-html-node="true">
<th data-preserve-html-node="true">Sport</th>
<th data-preserve-html-node="true">Aquatics</th>
<th data-preserve-html-node="true">Archery</th>
<th data-preserve-html-node="true">Athletics</th>
<th data-preserve-html-node="true">&#8230;</th>
<th data-preserve-html-node="true">Water Motorsports</th>
<th data-preserve-html-node="true">Weightlifting</th>
<th data-preserve-html-node="true">Wrestling</th>
</tr>
<tr data-preserve-html-node="true">
<th data-preserve-html-node="true">Country</th>
<th data-preserve-html-node="true"></th>
<th data-preserve-html-node="true"></th>
<th data-preserve-html-node="true"></th>
<th data-preserve-html-node="true"></th>
<th data-preserve-html-node="true"></th>
<th data-preserve-html-node="true"></th>
<th data-preserve-html-node="true"></th>
</tr>
</thead>
<tbody data-preserve-html-node="true">
<tr data-preserve-html-node="true">
<th data-preserve-html-node="true">Afghanistan</th>
<td data-preserve-html-node="true">0.0</td>
<td data-preserve-html-node="true">0.0</td>
<td data-preserve-html-node="true">0.0</td>
<td data-preserve-html-node="true">&#8230;</td>
<td data-preserve-html-node="true">0.0</td>
<td data-preserve-html-node="true">0.0</td>
<td data-preserve-html-node="true">0.0</td>
</tr>
<tr data-preserve-html-node="true">
<th data-preserve-html-node="true">Algeria</th>
<td data-preserve-html-node="true">0.0</td>
<td data-preserve-html-node="true">0.0</td>
<td data-preserve-html-node="true">6.0</td>
<td data-preserve-html-node="true">&#8230;</td>
<td data-preserve-html-node="true">0.0</td>
<td data-preserve-html-node="true">0.0</td>
<td data-preserve-html-node="true">0.0</td>
</tr>
<tr data-preserve-html-node="true">
<th data-preserve-html-node="true">Argentina</th>
<td data-preserve-html-node="true">3.0</td>
<td data-preserve-html-node="true">0.0</td>
<td data-preserve-html-node="true">5.0</td>
<td data-preserve-html-node="true">&#8230;</td>
<td data-preserve-html-node="true">0.0</td>
<td data-preserve-html-node="true">2.0</td>
<td data-preserve-html-node="true">0.0</td>
</tr>
<tr data-preserve-html-node="true">
<th data-preserve-html-node="true">Armenia</th>
<td data-preserve-html-node="true">0.0</td>
<td data-preserve-html-node="true">0.0</td>
<td data-preserve-html-node="true">0.0</td>
<td data-preserve-html-node="true">&#8230;</td>
<td data-preserve-html-node="true">0.0</td>
<td data-preserve-html-node="true">4.0</td>
<td data-preserve-html-node="true">4.0</td>
</tr>
<tr data-preserve-html-node="true">
<th data-preserve-html-node="true">Australasia</th>
<td data-preserve-html-node="true">11.0</td>
<td data-preserve-html-node="true">0.0</td>
<td data-preserve-html-node="true">1.0</td>
<td data-preserve-html-node="true">&#8230;</td>
<td data-preserve-html-node="true">0.0</td>
<td data-preserve-html-node="true">0.0</td>
<td data-preserve-html-node="true">0.0</td>
</tr>
<tr data-preserve-html-node="true">
<th data-preserve-html-node="true">&#8230;</th>
<td data-preserve-html-node="true">&#8230;</td>
<td data-preserve-html-node="true">&#8230;</td>
<td data-preserve-html-node="true">&#8230;</td>
<td data-preserve-html-node="true">&#8230;</td>
<td data-preserve-html-node="true">&#8230;</td>
<td data-preserve-html-node="true">&#8230;</td>
<td data-preserve-html-node="true">&#8230;</td>
</tr>
<tr data-preserve-html-node="true">
<th data-preserve-html-node="true">Virgin Islands*</th>
<td data-preserve-html-node="true">0.0</td>
<td data-preserve-html-node="true">0.0</td>
<td data-preserve-html-node="true">0.0</td>
<td data-preserve-html-node="true">&#8230;</td>
<td data-preserve-html-node="true">0.0</td>
<td data-preserve-html-node="true">0.0</td>
<td data-preserve-html-node="true">0.0</td>
</tr>
<tr data-preserve-html-node="true">
<th data-preserve-html-node="true">West Germany</th>
<td data-preserve-html-node="true">62.0</td>
<td data-preserve-html-node="true">0.0</td>
<td data-preserve-html-node="true">67.0</td>
<td data-preserve-html-node="true">&#8230;</td>
<td data-preserve-html-node="true">0.0</td>
<td data-preserve-html-node="true">7.0</td>
<td data-preserve-html-node="true">9.0</td>
</tr>
<tr data-preserve-html-node="true">
<th data-preserve-html-node="true">Yugoslavia</th>
<td data-preserve-html-node="true">91.0</td>
<td data-preserve-html-node="true">0.0</td>
<td data-preserve-html-node="true">2.0</td>
<td data-preserve-html-node="true">&#8230;</td>
<td data-preserve-html-node="true">0.0</td>
<td data-preserve-html-node="true">0.0</td>
<td data-preserve-html-node="true">16.0</td>
</tr>
<tr data-preserve-html-node="true">
<th data-preserve-html-node="true">Zambia</th>
<td data-preserve-html-node="true">0.0</td>
<td data-preserve-html-node="true">0.0</td>
<td data-preserve-html-node="true">1.0</td>
<td data-preserve-html-node="true">&#8230;</td>
<td data-preserve-html-node="true">0.0</td>
<td data-preserve-html-node="true">0.0</td>
<td data-preserve-html-node="true">0.0</td>
</tr>
<tr data-preserve-html-node="true">
<th data-preserve-html-node="true">Zimbabwe</th>
<td data-preserve-html-node="true">7.0</td>
<td data-preserve-html-node="true">0.0</td>
<td data-preserve-html-node="true">0.0</td>
<td data-preserve-html-node="true">&#8230;</td>
<td data-preserve-html-node="true">0.0</td>
<td data-preserve-html-node="true">0.0</td>
<td data-preserve-html-node="true">0.0</td>
</tr>
</tbody>
</table>
<p data-preserve-html-node="true">137 rows × 42 columns</p>
</div>
<p>Now we&#8217;ll pick two countries:</p>
<pre><code class="lang-python hljs">summary.loc[[<span class="hljs-string">'Germany'</span>, <span class="hljs-string">'Italy'</span>]]</code></pre>
<div data-preserve-html-node="true">
<table class="dataframe" border="1" data-preserve-html-node="true">
<thead data-preserve-html-node="true">
<tr data-preserve-html-node="true">
<th data-preserve-html-node="true">Sport</th>
<th data-preserve-html-node="true">Aquatics</th>
<th data-preserve-html-node="true">Archery</th>
<th data-preserve-html-node="true">Athletics</th>
<th data-preserve-html-node="true">&#8230;</th>
<th data-preserve-html-node="true">Water Motorsports</th>
<th data-preserve-html-node="true">Weightlifting</th>
<th data-preserve-html-node="true">Wrestling</th>
</tr>
<tr data-preserve-html-node="true">
<th data-preserve-html-node="true">Country</th>
<th data-preserve-html-node="true"></th>
<th data-preserve-html-node="true"></th>
<th data-preserve-html-node="true"></th>
<th data-preserve-html-node="true"></th>
<th data-preserve-html-node="true"></th>
<th data-preserve-html-node="true"></th>
<th data-preserve-html-node="true"></th>
</tr>
</thead>
<tbody data-preserve-html-node="true">
<tr data-preserve-html-node="true">
<th data-preserve-html-node="true">Germany</th>
<td data-preserve-html-node="true">175.0</td>
<td data-preserve-html-node="true">6.0</td>
<td data-preserve-html-node="true">99.0</td>
<td data-preserve-html-node="true">&#8230;</td>
<td data-preserve-html-node="true">0.0</td>
<td data-preserve-html-node="true">20.0</td>
<td data-preserve-html-node="true">24.0</td>
</tr>
<tr data-preserve-html-node="true">
<th data-preserve-html-node="true">Italy</th>
<td data-preserve-html-node="true">113.0</td>
<td data-preserve-html-node="true">12.0</td>
<td data-preserve-html-node="true">71.0</td>
<td data-preserve-html-node="true">&#8230;</td>
<td data-preserve-html-node="true">0.0</td>
<td data-preserve-html-node="true">14.0</td>
<td data-preserve-html-node="true">20.0</td>
</tr>
</tbody>
</table>
<p data-preserve-html-node="true">2 rows × 42 columns</p>
</div>
<p>Each country has 42 columns giving the total number of medals won in each sport. Our job is to come up with a single number that summarizes how different those two lists of numbers are. Mathematicians have figured out lots of different ways of doing that, many of which are implemented in the <code>scipy.spatial.distance</code> module.</p>
<p>If we just import <code>pdist</code> from the module, and pass in our dataframe of two countries, we&#8217;ll get a measuremnt:</p>
<pre><code class="lang-python hljs"><span class="hljs-keyword">from</span> scipy.spatial.distance <span class="hljs-keyword">import</span> pdist
pdist(summary.loc[[<span class="hljs-string">'Germany'</span>, <span class="hljs-string">'Italy'</span>]])</code></pre>
<pre><code>array([342.3024978])</code></pre>
<p>That&#8217;s the distance score using the default metric, which is called the <em>euclidian distance</em>. Think of it as the straight line distance between the two points in space defined by the two lists of 42 numbers.</p>
<p>Now, what happens if we pass in a dataframe with three countries?</p>
<pre><code class="lang-python hljs">pdist(summary.loc[[<span class="hljs-string">'Germany'</span>, <span class="hljs-string">'Italy'</span>, <span class="hljs-string">'France'</span>]])</code></pre>
<pre><code>array([342.3024978 , 317.98584874, 144.82403116])</code></pre>
<p>As we might expect, we have three measurements:</p>
<ul>
<li>Germany and Italy</li>
<li>Germnay and France</li>
<li>Italy and France</li>
</ul>
<p>But it&#8217;s not easy to figure out which belongs to which. Happily, <code>scipy</code> also has a helper function that will take this list of numbers and turn it back into a square matrix:</p>
<pre><code class="lang-python hljs"><span class="hljs-keyword">from</span> scipy.spatial.distance <span class="hljs-keyword">import</span> squareform

squareform(pdist(summary.loc[[<span class="hljs-string">'Germany'</span>, <span class="hljs-string">'Italy'</span>, <span class="hljs-string">'France'</span>]]))</code></pre>
<pre><code>array([[  0.        , 342.3024978 , 317.98584874],
       [342.3024978 ,   0.        , 144.82403116],
       [317.98584874, 144.82403116,   0.        ]])</code></pre>
<p>In order to make sense of this, we need to re-attach the country names, which we can just do by turning it into a DataFrame:</p>
<pre><code class="lang-python hljs">pd.DataFrame(
    squareform(pdist(summary.loc[[<span class="hljs-string">'Germany'</span>, <span class="hljs-string">'Italy'</span>, <span class="hljs-string">'France'</span>]])),
    columns = [<span class="hljs-string">'Germany'</span>, <span class="hljs-string">'Italy'</span>, <span class="hljs-string">'France'</span>],
    index = [<span class="hljs-string">'Germany'</span>, <span class="hljs-string">'Italy'</span>, <span class="hljs-string">'France'</span>]
)</code></pre>
<div data-preserve-html-node="true">
<table class="dataframe" border="1" data-preserve-html-node="true">
<thead data-preserve-html-node="true">
<tr data-preserve-html-node="true">
<th data-preserve-html-node="true"></th>
<th data-preserve-html-node="true">Germany</th>
<th data-preserve-html-node="true">Italy</th>
<th data-preserve-html-node="true">France</th>
</tr>
</thead>
<tbody data-preserve-html-node="true">
<tr data-preserve-html-node="true">
<th data-preserve-html-node="true">Germany</th>
<td data-preserve-html-node="true">0.000000</td>
<td data-preserve-html-node="true">342.302498</td>
<td data-preserve-html-node="true">317.985849</td>
</tr>
<tr data-preserve-html-node="true">
<th data-preserve-html-node="true">Italy</th>
<td data-preserve-html-node="true">342.302498</td>
<td data-preserve-html-node="true">0.000000</td>
<td data-preserve-html-node="true">144.824031</td>
</tr>
<tr data-preserve-html-node="true">
<th data-preserve-html-node="true">France</th>
<td data-preserve-html-node="true">317.985849</td>
<td data-preserve-html-node="true">144.824031</td>
<td data-preserve-html-node="true">0.000000</td>
</tr>
</tbody>
</table>
</div>
<p>Hopefully this agrees with our intuition; the numbers on the diagonal are all zero, because each country is identical to itself, and the numbers above and below are mirror images, because the distance between Germany and France is the same as the distance between France and Germany (remember that we are talking about distance in terms of their medal totals, not geographical distance!)</p>
<p>Finally, to get pairwise measurements for the whole input dataframe, we just pass in the complete object and get the country names from the index:</p>
<pre><code class="lang-python hljs">pairwise = pd.DataFrame(
    squareform(pdist(summary)),
    columns = summary.index,
    index = summary.index
)

pairwise</code></pre>
<div data-preserve-html-node="true">
<table class="dataframe" border="1" data-preserve-html-node="true">
<thead data-preserve-html-node="true">
<tr data-preserve-html-node="true">
<th data-preserve-html-node="true">Country</th>
<th data-preserve-html-node="true">Afghanistan</th>
<th data-preserve-html-node="true">Algeria</th>
<th data-preserve-html-node="true">Argentina</th>
<th data-preserve-html-node="true">&#8230;</th>
<th data-preserve-html-node="true">Yugoslavia</th>
<th data-preserve-html-node="true">Zambia</th>
<th data-preserve-html-node="true">Zimbabwe</th>
</tr>
<tr data-preserve-html-node="true">
<th data-preserve-html-node="true">Country</th>
<th data-preserve-html-node="true"></th>
<th data-preserve-html-node="true"></th>
<th data-preserve-html-node="true"></th>
<th data-preserve-html-node="true"></th>
<th data-preserve-html-node="true"></th>
<th data-preserve-html-node="true"></th>
<th data-preserve-html-node="true"></th>
</tr>
</thead>
<tbody data-preserve-html-node="true">
<tr data-preserve-html-node="true">
<th data-preserve-html-node="true">Afghanistan</th>
<td data-preserve-html-node="true">0.000000</td>
<td data-preserve-html-node="true">8.774964</td>
<td data-preserve-html-node="true">96.643675</td>
<td data-preserve-html-node="true">&#8230;</td>
<td data-preserve-html-node="true">171.947666</td>
<td data-preserve-html-node="true">1.732051</td>
<td data-preserve-html-node="true">17.492856</td>
</tr>
<tr data-preserve-html-node="true">
<th data-preserve-html-node="true">Algeria</th>
<td data-preserve-html-node="true">8.774964</td>
<td data-preserve-html-node="true">0.000000</td>
<td data-preserve-html-node="true">95.199790</td>
<td data-preserve-html-node="true">&#8230;</td>
<td data-preserve-html-node="true">171.688672</td>
<td data-preserve-html-node="true">7.348469</td>
<td data-preserve-html-node="true">19.519221</td>
</tr>
<tr data-preserve-html-node="true">
<th data-preserve-html-node="true">Argentina</th>
<td data-preserve-html-node="true">96.643675</td>
<td data-preserve-html-node="true">95.199790</td>
<td data-preserve-html-node="true">0.000000</td>
<td data-preserve-html-node="true">&#8230;</td>
<td data-preserve-html-node="true">148.128323</td>
<td data-preserve-html-node="true">96.348326</td>
<td data-preserve-html-node="true">89.810912</td>
</tr>
<tr data-preserve-html-node="true">
<th data-preserve-html-node="true">Armenia</th>
<td data-preserve-html-node="true">5.830952</td>
<td data-preserve-html-node="true">9.848858</td>
<td data-preserve-html-node="true">96.477977</td>
<td data-preserve-html-node="true">&#8230;</td>
<td data-preserve-html-node="true">171.604196</td>
<td data-preserve-html-node="true">5.744563</td>
<td data-preserve-html-node="true">18.384776</td>
</tr>
<tr data-preserve-html-node="true">
<th data-preserve-html-node="true">Australasia</th>
<td data-preserve-html-node="true">18.708287</td>
<td data-preserve-html-node="true">20.024984</td>
<td data-preserve-html-node="true">97.744565</td>
<td data-preserve-html-node="true">&#8230;</td>
<td data-preserve-html-node="true">166.991018</td>
<td data-preserve-html-node="true">18.627936</td>
<td data-preserve-html-node="true">22.360680</td>
</tr>
<tr data-preserve-html-node="true">
<th data-preserve-html-node="true">&#8230;</th>
<td data-preserve-html-node="true">&#8230;</td>
<td data-preserve-html-node="true">&#8230;</td>
<td data-preserve-html-node="true">&#8230;</td>
<td data-preserve-html-node="true">&#8230;</td>
<td data-preserve-html-node="true">&#8230;</td>
<td data-preserve-html-node="true">&#8230;</td>
<td data-preserve-html-node="true">&#8230;</td>
</tr>
<tr data-preserve-html-node="true">
<th data-preserve-html-node="true">Virgin Islands*</th>
<td data-preserve-html-node="true">1.414214</td>
<td data-preserve-html-node="true">8.774964</td>
<td data-preserve-html-node="true">96.457244</td>
<td data-preserve-html-node="true">&#8230;</td>
<td data-preserve-html-node="true">171.947666</td>
<td data-preserve-html-node="true">1.732051</td>
<td data-preserve-html-node="true">17.492856</td>
</tr>
<tr data-preserve-html-node="true">
<th data-preserve-html-node="true">West Germany</th>
<td data-preserve-html-node="true">153.052279</td>
<td data-preserve-html-node="true">150.306354</td>
<td data-preserve-html-node="true">142.537714</td>
<td data-preserve-html-node="true">&#8230;</td>
<td data-preserve-html-node="true">184.945938</td>
<td data-preserve-html-node="true">152.577849</td>
<td data-preserve-html-node="true">144.045132</td>
</tr>
<tr data-preserve-html-node="true">
<th data-preserve-html-node="true">Yugoslavia</th>
<td data-preserve-html-node="true">171.947666</td>
<td data-preserve-html-node="true">171.688672</td>
<td data-preserve-html-node="true">148.128323</td>
<td data-preserve-html-node="true">&#8230;</td>
<td data-preserve-html-node="true">0.000000</td>
<td data-preserve-html-node="true">171.874955</td>
<td data-preserve-html-node="true">169.103519</td>
</tr>
<tr data-preserve-html-node="true">
<th data-preserve-html-node="true">Zambia</th>
<td data-preserve-html-node="true">1.732051</td>
<td data-preserve-html-node="true">7.348469</td>
<td data-preserve-html-node="true">96.348326</td>
<td data-preserve-html-node="true">&#8230;</td>
<td data-preserve-html-node="true">171.874955</td>
<td data-preserve-html-node="true">0.000000</td>
<td data-preserve-html-node="true">17.521415</td>
</tr>
<tr data-preserve-html-node="true">
<th data-preserve-html-node="true">Zimbabwe</th>
<td data-preserve-html-node="true">17.492856</td>
<td data-preserve-html-node="true">19.519221</td>
<td data-preserve-html-node="true">89.810912</td>
<td data-preserve-html-node="true">&#8230;</td>
<td data-preserve-html-node="true">169.103519</td>
<td data-preserve-html-node="true">17.521415</td>
<td data-preserve-html-node="true">0.000000</td>
</tr>
</tbody>
</table>
<p data-preserve-html-node="true">137 rows × 137 columns</p>
</div>
<p>A nice way to visualize these is with a heatmap. 137 countries is a bit too much to show on a webpage, so let&#8217;s restrict it to just the countries that have scored at least 500 medals total:</p>
<pre><code class="lang-python hljs"><span class="hljs-keyword">import</span> seaborn <span class="hljs-keyword">as</span> sns
<span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt

<span class="hljs-comment"># make summary table for just top countries</span>
top_countries = (
    data
    .groupby(<span class="hljs-string">'Country'</span>)
    .filter(<span class="hljs-keyword">lambda</span> x : len(x) &gt; <span class="hljs-number">500</span>)
    .groupby([<span class="hljs-string">'Country'</span>, <span class="hljs-string">'Sport'</span>])
    .size()
    .unstack()
    .fillna(<span class="hljs-number">0</span>)
    )

<span class="hljs-comment"># make pairwise distance matrix</span>
pairwise_top = pd.DataFrame(
    squareform(pdist(top_countries)),
    columns = top_countries.index,
    index = top_countries.index
)

<span class="hljs-comment"># plot it with seaborn</span>
plt.figure(figsize=(<span class="hljs-number">10</span>,<span class="hljs-number">10</span>))
sns.heatmap(
    pairwise_top,
    cmap=<span class="hljs-string">'OrRd'</span>,
    linewidth=<span class="hljs-number">1</span>
)</code></pre>
<p><img src="https://github.com/mojones/datavis_resources/raw/master/pairwise%20distance/output_18_1.png" alt="png" /></p>
<p>Now that we have a plot to look at, we can see a problem with the distance metric we&#8217;re using. The US has won so many more medals than other countries that it distorts the measurement. And if we think about it, what we&#8217;re really interested in is not the exact number of medals in each category, but the relative number. In other words, we want two contries to be considered similar if they both have about twice as many medals in boxing as athletics, for example, regardless of the exact numbers.</p>
<p>Luckily for us, there is a distance measure already implemented in scipy that has that property &#8211; it&#8217;s called <em>cosine distance</em>. Think of it as a measurement that only looks at the relationships between the 44 numbers for each country, not their magnitude. We can switch to cosine distance by specifying the <code>metric</code> keyword argument in <code>pdist</code>:</p>
<pre><code class="lang-python hljs"><span class="hljs-comment"># make pairwise distance matrix</span>
pairwise_top = pd.DataFrame(
    squareform(pdist(top_countries, metric=<span class="hljs-string">'cosine'</span>)),
    columns = top_countries.index,
    index = top_countries.index
)

<span class="hljs-comment"># plot it with seaborn</span>
plt.figure(figsize=(<span class="hljs-number">10</span>,<span class="hljs-number">10</span>))
sns.heatmap(
    pairwise_top,
    cmap=<span class="hljs-string">'OrRd'</span>,
    linewidth=<span class="hljs-number">1</span>
)</code></pre>
<p><img src="https://github.com/mojones/datavis_resources/raw/master/pairwise%20distance/output_20_1.png" alt="png" /></p>
<p>And as you can see we spot some much more interstesting patterns. Notice, for example, that Russia and Soviet Union have a very low distance (i.e. their medal distributions are very similar).</p>
<p>When looking at data like this, remember that the shade of each cell is not telling us anything about how many medals a country has won &#8211; simply how different or similar each country is to each other. Compare the above heatmap with this one which displays the proportion of medals in each sport per country:</p>
<pre><code class="lang-python hljs">plt.figure(figsize=(<span class="hljs-number">10</span>,<span class="hljs-number">10</span>))
sns.heatmap(
    top_countries.apply(<span class="hljs-keyword">lambda</span> x : x / x.sum(), axis=<span class="hljs-number">1</span>),
    cmap=<span class="hljs-string">'BuPu'</span>,
    square=<span class="hljs-keyword">True</span>,
    cbar_kws = {<span class="hljs-string">'fraction'</span> : <span class="hljs-number">0.02</span>}
)</code></pre>
<p><img src="https://github.com/mojones/datavis_resources/raw/master/pairwise%20distance/output_22_1.png" alt="png" /></p>
<p>Finally, how might we find pairs of countries that have very similar medal distributions (i.e. very low numbers in the pairwise table)? By far the easiest way is to start of by reshaping the table into long form, so that each comparison is on a separate row:</p>
<pre><code class="lang-python hljs"><span class="hljs-comment"># create our pairwise distance matrix</span>
pairwise = pd.DataFrame(
    squareform(pdist(summary, metric=<span class="hljs-string">'cosine'</span>)),
    columns = summary.index,
    index = summary.index
)

<span class="hljs-comment"># move to long form</span>
long_form = pairwise.unstack()

<span class="hljs-comment"># rename columns and turn into a dataframe</span>
long_form.index.rename([<span class="hljs-string">'Country A'</span>, <span class="hljs-string">'Country B'</span>], inplace=<span class="hljs-keyword">True</span>)
long_form = long_form.to_frame(<span class="hljs-string">'cosine distance'</span>).reset_index()</code></pre>
<p>Now we can write our filter as normal, remembering to filter out the unintersting rows that tell us a country&#8217;s distance from itself!</p>
<pre><code class="lang-python hljs">long_form[
    (long_form[<span class="hljs-string">'cosine distance'</span>] &lt; <span class="hljs-number">0.05</span>) 
    &amp; (long_form[<span class="hljs-string">'Country A'</span>] != long_form[<span class="hljs-string">'Country B'</span>])
]</code></pre>
<div data-preserve-html-node="true">
<table class="dataframe" border="1" data-preserve-html-node="true">
<thead data-preserve-html-node="true">
<tr data-preserve-html-node="true">
<th data-preserve-html-node="true"></th>
<th data-preserve-html-node="true">Country A</th>
<th data-preserve-html-node="true">Country B</th>
<th data-preserve-html-node="true">cosine distance</th>
</tr>
</thead>
<tbody data-preserve-html-node="true">
<tr data-preserve-html-node="true">
<th data-preserve-html-node="true">272</th>
<td data-preserve-html-node="true">Algeria</td>
<td data-preserve-html-node="true">Zambia</td>
<td data-preserve-html-node="true">0.026671</td>
</tr>
<tr data-preserve-html-node="true">
<th data-preserve-html-node="true">1034</th>
<td data-preserve-html-node="true">Azerbaijan</td>
<td data-preserve-html-node="true">Mongolia</td>
<td data-preserve-html-node="true">0.045618</td>
</tr>
<tr data-preserve-html-node="true">
<th data-preserve-html-node="true">1105</th>
<td data-preserve-html-node="true">Bahamas</td>
<td data-preserve-html-node="true">Barbados</td>
<td data-preserve-html-node="true">0.021450</td>
</tr>
<tr data-preserve-html-node="true">
<th data-preserve-html-node="true">1111</th>
<td data-preserve-html-node="true">Bahamas</td>
<td data-preserve-html-node="true">British West Indies</td>
<td data-preserve-html-node="true">0.021450</td>
</tr>
<tr data-preserve-html-node="true">
<th data-preserve-html-node="true">1113</th>
<td data-preserve-html-node="true">Bahamas</td>
<td data-preserve-html-node="true">Burundi</td>
<td data-preserve-html-node="true">0.021450</td>
</tr>
<tr data-preserve-html-node="true">
<th data-preserve-html-node="true">&#8230;</th>
<td data-preserve-html-node="true">&#8230;</td>
<td data-preserve-html-node="true">&#8230;</td>
<td data-preserve-html-node="true">&#8230;</td>
</tr>
<tr data-preserve-html-node="true">
<th data-preserve-html-node="true">17033</th>
<td data-preserve-html-node="true">United Arab Emirates</td>
<td data-preserve-html-node="true">Haiti</td>
<td data-preserve-html-node="true">0.010051</td>
</tr>
<tr data-preserve-html-node="true">
<th data-preserve-html-node="true">17037</th>
<td data-preserve-html-node="true">United Arab Emirates</td>
<td data-preserve-html-node="true">Independent Olympic Participants</td>
<td data-preserve-html-node="true">0.000000</td>
</tr>
<tr data-preserve-html-node="true">
<th data-preserve-html-node="true">17051</th>
<td data-preserve-html-node="true">United Arab Emirates</td>
<td data-preserve-html-node="true">Kuwait</td>
<td data-preserve-html-node="true">0.000000</td>
</tr>
<tr data-preserve-html-node="true">
<th data-preserve-html-node="true">18164</th>
<td data-preserve-html-node="true">Virgin Islands</td>
<td data-preserve-html-node="true">Netherlands Antilles</td>
<td data-preserve-html-node="true">0.000000</td>
</tr>
<tr data-preserve-html-node="true">
<th data-preserve-html-node="true">18496</th>
<td data-preserve-html-node="true">Zambia</td>
<td data-preserve-html-node="true">Algeria</td>
<td data-preserve-html-node="true">0.026671</td>
</tr>
</tbody>
</table>
<p data-preserve-html-node="true">462 rows × 3 columns</p>
</div>
<p>https://www.drawingfromdata.com/making-a-pairwise-distance-matrix-with-pandas  dm = squareform(pdist(input_data,metric=&#8217;braycurtis&#8217;))</p>
]]></content:encoded>
			<wfw:commentRss>https://www.biofacebook.com/?feed=rss2&#038;p=1517</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Pandas and Sklearn</title>
		<link>https://www.biofacebook.com/?p=1512</link>
		<comments>https://www.biofacebook.com/?p=1512#comments</comments>
		<pubDate>Fri, 05 Jun 2020 03:02:14 +0000</pubDate>
		<dc:creator><![CDATA[szypanther]]></dc:creator>
				<category><![CDATA[生物信息]]></category>
		<category><![CDATA[统计学习]]></category>

		<guid isPermaLink="false">http://www.biofacebook.com/?p=1512</guid>
		<description><![CDATA[pandas isnull函数检查数据是否有缺失 pandas isnull sum with column headers <p>&#160;</p> for col in main_df: print(sum(pd.isnull(data[col]))) <p>I get a list of the null count for each column:</p> 0 1 100 <p>What I&#8217;m trying to do is create a new dataframe which has the column header alongside the null count, e.g.</p> col1 &#124; 0 col2 &#124; 1 col3 [...]]]></description>
				<content:encoded><![CDATA[<h1 class="title-article">pandas isnull函数检查数据是否有缺失</h1>
<h1 class="grid--cell fs-headline1 fl1 ow-break-word mb8"><a class="question-hyperlink" href="https://stackoverflow.com/questions/41681693/pandas-isnull-sum-with-column-headers">pandas isnull sum with column headers</a></h1>
<p>&nbsp;</p>
<pre><code>for col in main_df:
    print(sum(pd.isnull(data[col])))
</code></pre>
<p>I get a list of the null count for each column:</p>
<pre><code>0
1
100
</code></pre>
<p>What I&#8217;m trying to do is create a new dataframe which has the column header alongside the null count, e.g.</p>
<pre><code>col1 | 0
col2 | 1
col3 | 100</code></pre>
<p>&nbsp;</p>
<p>&nbsp;</p>
<p>#print every column using:</p>
<p>&nbsp;</p>
<pre><code>nulls = df.isnull().sum().to_frame()
for index, row in nulls.iterrows():
    print(index, row[0])
</code></pre>
<p>&nbsp;</p>
<pre><code>for col in df:
    print(df[col].unique())



</code></pre>
<h1 class="article-title J-articleTitle">pandas.get_dummies 的用法 (One-Hot Encoding)</h1>
<p>&nbsp;</p>
<p>get_dummies 是利用pandas实现one hot encode的方式。详细参数请查看<a href="https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html" target="_blank" rel="nofollow noopener noreferrer" data-from="10680">官方文档</a></p>
<blockquote><p>pandas.get_dummies(data, prefix=None, prefix_sep=’_’, dummy_na=False, columns=None, sparse=False, drop_first=False)[source]</p></blockquote>
<p>参数说明：</p>
<ul class="ul-level-0">
<li>data : array-like, Series, or DataFrame 输入的数据</li>
<li>prefix : string, list of strings, or dict of strings, default None get_dummies转换后，列名的前缀</li>
<li>columns : list-like, default None 指定需要实现类别转换的列名</li>
<li>dummy_na : bool, default False 增加一列表示空缺值，如果False就忽略空缺值</li>
<li>drop_first : bool, default False 获得k中的k-1个类别值，去除第一个</li>
</ul>
<p><strong>离散特征的编码分为两种情况：</strong></p>
<p>1、离散特征的取值之间没有大小的意义，比如color：[red,blue],那么就使用one-hot编码</p>
<p>2、离散特征的取值有大小的意义，比如size:[X,XL,XXL],那么就使用数值的映射{X:1,XL:2,XXL:3}</p>
<p>例子：</p>
<pre class="prism-token token  language-javascript"><span class="token keyword">import</span> pandas <span class="token keyword">as</span> pd

df <span class="token operator">=</span> pd<span class="token punctuation">.</span><span class="token function">DataFrame</span><span class="token punctuation">(</span><span class="token punctuation">[</span>  
            <span class="token punctuation">[</span><span class="token string">'green'</span> <span class="token punctuation">,</span> <span class="token string">'A'</span><span class="token punctuation">]</span><span class="token punctuation">,</span>   
            <span class="token punctuation">[</span><span class="token string">'red'</span>   <span class="token punctuation">,</span> <span class="token string">'B'</span><span class="token punctuation">]</span><span class="token punctuation">,</span>   
            <span class="token punctuation">[</span><span class="token string">'blue'</span>  <span class="token punctuation">,</span> <span class="token string">'A'</span><span class="token punctuation">]</span><span class="token punctuation">]</span><span class="token punctuation">)</span>  

df<span class="token punctuation">.</span>columns <span class="token operator">=</span> <span class="token punctuation">[</span><span class="token string">'color'</span><span class="token punctuation">,</span>  <span class="token string">'class'</span><span class="token punctuation">]</span> 
pd<span class="token punctuation">.</span><span class="token function">get_dummies</span><span class="token punctuation">(</span>df<span class="token punctuation">)</span></pre>
<p>get_dummies 前：</p>
<figure>
<div class="image-block"><img class="" src="https://ask.qcloudimg.com/http-save/yehe-4908043/zzw0nx7acu.png?imageView2/2/w/1620" alt="" /></div>
</figure>
<p>get_dummies 后：</p>
<figure>
<div class="image-block"><img class="" src="https://ask.qcloudimg.com/http-save/yehe-4908043/91815t6ls4.png?imageView2/2/w/1620" alt="" /></div>
</figure>
<p>上述执行完以后再打印df 出来的还是get_dummies 前的图，因为你没有写</p>
<pre class="prism-token token  language-javascript">df <span class="token operator">=</span> pd<span class="token punctuation">.</span><span class="token function">get_dummies</span><span class="token punctuation">(</span>df<span class="token punctuation">)</span></pre>
<p>可以对指定列进行get_dummies</p>
<pre class="prism-token token  language-javascript">pd<span class="token punctuation">.</span><span class="token function">get_dummies</span><span class="token punctuation">(</span>df<span class="token punctuation">.</span>color<span class="token punctuation">)</span></pre>
<figure>
<div class="image-block"><img class="" src="https://ask.qcloudimg.com/http-save/yehe-4908043/qoplyil41o.png?imageView2/2/w/1620" alt="" /></div>
</figure>
<p>将指定列进行get_dummies 后合并到元数据中</p>
<pre class="prism-token token  language-javascript">df <span class="token operator">=</span> df<span class="token punctuation">.</span><span class="token function">join</span><span class="token punctuation">(</span>pd<span class="token punctuation">.</span><span class="token function">get_dummies</span><span class="token punctuation">(</span>df<span class="token punctuation">.</span>color<span class="token punctuation">)</span><span class="token punctuation">)</span></pre>
<figure>
<div class="image-block"><img class="" src="https://ask.qcloudimg.com/http-save/yehe-4908043/m0sb5qrqaz.png?imageView2/2/w/1620" alt="" /></div>
</figure>
<p>参考：<a href="https://blog.csdn.net/maymay_/article/details/80198468" target="_blank" rel="nofollow noopener noreferrer" data-from="10680">https://blog.csdn.net/maymay_/article/details/80198468</a></p>
<pre> 
&gt;&gt;&gt; train_filter.info()
&lt;class 'pandas.core.frame.DataFrame'&gt;
RangeIndex: 1482 entries, 0 to 1481
Columns: 182 entries, SampleID to BS120
dtypes: float64(177), int64(2), object(3)
memory usage: 2.1+ MB
&gt;&gt;&gt; train_filter.dtypes
SampleID object
Streptococcus Infection float64
Duration_of_gestation object
Gestation_age float64
Gestation_age_G1 float64
Gestation_age_G2 float64
GDM_HDP float64
Age int64
Age_group int64
Blood_type float64
Medication_use float64
Progesterone_use float64
Pregnancy_mode float64
Native_place float64
Combined_disease float64
Infection float64
Scar_uterus float64
Risk_rating float64
Anamnesis float64
Thalassemia float64
Ovary_disease float64
Hepatopathy float64
Allergic_history float64
Thyroid_disease float64
Hysteromyoma float64
Breast_disease float64
Weight_at_delivery object
Weight_before_pregnancy float64
Height float64
BMI_before_pregnancy float64
 ... 
B_A/G float64
B_r_GT_G float64
B_r_GT float64
B_TBA_G float64
B_TBA float64
B_ALT_G float64
B_ALT float64
B_AST_G float64
B_AST float64
B_TBIL_G float64
B_TBIL float64
B_DBIL_G float64
B_DBIL float64
B_IBIL_G float64
B_IBIL float64
B_Crea_G float64
B_Crea float64
B_CysC_G float64
B_CysC float64
B_UA_G float64
B_UA float64
B_Urea_G float64
B_Urea float64
B_GLU_G float64
B_GLU float64
HbA1c_G float64
HbA1c float64
BS float64
BS60 float64
BS120 float64
Length: 182, dtype: object



</pre>
<p>scikit-learn 是基于 Python 语言的机器学习工具</p>
<p>&nbsp;</p>
<p>&nbsp;</p>
<p>http://www.scikitlearn.com.cn/</p>
]]></content:encoded>
			<wfw:commentRss>https://www.biofacebook.com/?feed=rss2&#038;p=1512</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>微生物多样研究—差异分析</title>
		<link>https://www.biofacebook.com/?p=1483</link>
		<comments>https://www.biofacebook.com/?p=1483#comments</comments>
		<pubDate>Fri, 27 Mar 2020 07:29:11 +0000</pubDate>
		<dc:creator><![CDATA[szypanther]]></dc:creator>
				<category><![CDATA[生物信息]]></category>
		<category><![CDATA[统计学习]]></category>

		<guid isPermaLink="false">http://www.biofacebook.com/?p=1483</guid>
		<description><![CDATA[ <p>1. 随机森林模型</p> <p>随机森林是一种基于决策树（Decisiontree）的高效的机器学习算法，可以用于对样本进行分类（Classification），也可以用于回归分析（Regression）。</p> <p>它属于非线性分类器，因此可以挖掘变量之间复杂的非线性的相互依赖关系。通过随机森林分析，可以找出能够区分两组样本间差异关键OTU。</p> <p>Feature Importance Scores表格-来源于随机森林结果</p> <p>记录了各OTU对组间差异的贡献值大小。</p> </p> <p>注：一般地，选取Mean_decrease_in_accuracy值大于0.05的OTU，作进一步分析；对于组间差异较小的样本，该值可能会降至0.03。</p> <p>2. 交叉验证分析</p> <p>交叉验证（Crossvalidation)，是一种统计学上将数据样本切割成较小子集的实用方法。先在一个子集上做分析，而其它子集则用来做后续对此分析的确认及验证。一开始的子集被称为训练集。而其它的子集则被称为验证集或测试集。</p> <p>其中最常见的为k-foldercross-validation，它指的是将所有数据分成k个子集，每个子集均做一次测试集，其余的作为训练集。交叉验证重复k次，每次选择一个子集作为测试集，并将k次的平均交叉验证识别正确率作为结果。</p> <p>所有的样本都被作为了训练集和测试集，每个样本都被验证一次。</p> <p>对随机森林方法筛选出的关键OTU的组合进行遍历，以期用最少的OTU数目组合构建一个错误率最低高效分类器。</p> <p>一般地，对随机森林分析筛选出的关键OTU，按照不同组合进行10倍交叉验证分析，找出能够最准确区分组间差异的最少的OTU组合，再做进一步的分析，如ROC分析等。</p> <p>注：图中横坐标表示不同数量的OTU组合，纵坐标表示该数量OTU组合下分类的错误率。OTU组合数越少，且错误率越低，则该OTU组合被认为是能够区分组间差异的最少的OTU组合。</p> <p>3. ROC曲线</p> <p>接收者操作特征曲线（Receiveroperating characteristic curve，ROC 曲线）也是一种有效的有监督学习方法。ROC分析属于二元分类算法，用来处理只有两种分类的问题，可以用于选择最佳的判别模型,选择最佳的诊断界限值。</p> <p>可依据专业知识，对疾病组和参照组测定结果进行分析，确定测定值的上下限、组距以及截断点(cut-offpoint)，按选择的组距间隔列出累积频数分布表，分别计算出所有截断点的敏感性(Sensetivity)、特异性和假阳性率(1-特异性:Specificity)。以敏感性为纵坐标代表真阳性率，(1-特异性)为横坐标代表假阳性率，作图绘成ROC曲线。ROC曲线越靠近左上角，诊断的准确性就越高。亦可通过分别计算各个试验的ROC曲线下的面积(AUC)进行比较，哪一种试验的AUC最大，则哪一种试验的诊断价值最佳。</p> <p></p> <p>注：图中横坐标为假阳性率false positive rate（FPR）：Specificity，纵坐标为真阳性率true positive rate（TPR）：Sensetivity。最靠近左上角的ROC曲线的点是错误最少的最好阈值，其假阳性和假阴性的总数最少。ROC曲线下的面积值在1.0和0.5之间。在AUC&#62;0.5的情况下，AUC越接近于1，说明诊断效果越好。AUC在 0.5~0.7时有较低准确性，AUC在0.7~0.9时有一定准确性，AUC在0.9以上时有较高准确性。AUC=0.5时，说明诊断方法完全不起作用，无诊断价值。AUC&#60;0.5不符合真实情况，在实际中极少出现。</p> <p>4. Wilcoxon秩和检验分析</p> <p>Wilcoxonrank-sum test，也叫曼-惠特尼U检验（Mann–WhitneyU test），是两组独立样本非参数检验的一种方法。其原假设为两组独立样本来自的两总体分布无显著差异，通过对两组样本平均秩的研究来实现判断两总体的分布是否存在差异，该分析可以对两组样品的物种进行显著性差异分析，并对p值计算假发现率（FDR）q值。</p> <p></p> <p>注：mean分别为两组样品物种的平均相对丰度，sd分别是两组样本物种相对丰度的标准差。P值为对两组检验原假设为真的概率值，p&#60;0.05表示存在差异，p&#60;0.01表示差异显著，q值为假发现率。</p> <p>5. 差异菌群Heatmap分析</p> <p>以10倍交叉验证（10-foldcross-validation）估计泛化误差（Generalizationerror）的大小，其余参数使用默认设置。建模结果同时包含“基线”误差（Baselineerror）的期望值，即数据集中属于最优势分类的样本全部被错误分类的概率。每个OTU根据其被移除后模型预报错误率增加的大小确定其重要度数值，重要度越高，该OTU对模型预报准确率的贡献越大。</p> <p>根据挑选出来的差异OTU，根据其在每个样品中的丰度信息，对物种进行聚类，绘制成热图，便于观察哪些物种在哪些样品中聚集较多或含量较低。</p> <p></p> <p>注：图中越接近蓝色表示物种丰度越低，越接近橙红色表示丰度越高。左边的聚类树是根据各物种间的spearman相关性距离进行聚类；上边的聚类树是采用样本间距离算法中最常用的Bray-Curtis算法进行聚类。</p> <p>6. 两组样本Welch’s t-test分析</p> <p>两组不同方差的样本可使用Welch’st-test进行差异比较分析，通过此分析可获得在两组中有显著性差异的物种[或差异基因丰度—适用于元（宏）基因组]。</p> <p> <p>注：上图所示为不同基因丰度（或物种）在两组样品中的丰度比例，中间所示为95%置信度区间内，物种丰度的差异比例，最右边的值为p值，p值＜0.05，表示差异显著。</p> <p>7. Shannon多样性指数比较盒状图</p> <p>将不同分类或环境的多组样本的Shannon多样性指数进行四分位计算，比较不同样本组的组间Shannon指数差异。同时进行非参数Mann-Whitney判断样本组间的显著性差异</p> [...]]]></description>
				<content:encoded><![CDATA[<div>
<div>
<p><b>1. 随机森林模型</b></p>
<p>随机森林是一种基于决策树（Decisiontree）的高效的机器学习算法，可以用于对样本进行分类（Classification），也可以用于回归分析（Regression）。</p>
<p>它属于非线性分类器，因此可以挖掘变量之间复杂的非线性的相互依赖关系。通过随机森林分析，可以找出能够区分两组样本间差异关键OTU。</p>
<p><i>Feature Importance Scores表格-来源于随机森林结果</i></p>
</div>
<p>记录了各OTU对组间差异的贡献值大小。</p></div>
<div><a href="http://www.biofacebook.com/wp-content/uploads/2020/03/18585978-a855cbdb5a069bb1.png"><img class="aligncenter size-large wp-image-1484" src="http://www.biofacebook.com/wp-content/uploads/2020/03/18585978-a855cbdb5a069bb1-1024x337.png" alt="18585978-a855cbdb5a069bb1" width="640" height="211" /></a></p>
<div>
<div>
<p>注：一般地，选取Mean_decrease_in_accuracy值大于0.05的OTU，作进一步分析；对于组间差异较小的样本，该值可能会降至0.03。</p>
<p><b>2. 交叉验证分析</b></p>
<p>交叉验证（Crossvalidation)，是一种统计学上将数据样本切割成较小子集的实用方法。先在一个子集上做分析，而其它子集则用来做后续对此分析的确认及验证。一开始的子集被称为训练集。而其它的子集则被称为验证集或测试集。</p>
<p>其中最常见的为k-foldercross-validation，它指的是将所有数据分成k个子集，每个子集均做一次测试集，其余的作为训练集。交叉验证重复k次，每次选择一个子集作为测试集，并将k次的平均交叉验证识别正确率作为结果。</p>
<p>所有的样本都被作为了训练集和测试集，每个样本都被验证一次。</p>
<p>对随机森林方法筛选出的关键OTU的组合进行遍历，以期用最少的OTU数目组合构建一个错误率最低高效分类器。</p>
<p>一般地，对随机森林分析筛选出的关键OTU，按照不同组合进行10倍交叉验证分析，找出能够最准确区分组间差异的最少的OTU组合，再做进一步的分析，如ROC分析等。</p>
</div>
</div>
<div><a href="http://www.biofacebook.com/wp-content/uploads/2020/03/18585978-ffceaf913a867414.png"><br />
<img class="aligncenter size-full wp-image-1485" src="http://www.biofacebook.com/wp-content/uploads/2020/03/18585978-ffceaf913a867414.png" alt="18585978-ffceaf913a867414" width="773" height="723" /></a></div>
<div>
<div>
<p>注：图中横坐标表示不同数量的OTU组合，纵坐标表示该数量OTU组合下分类的错误率。OTU组合数越少，且错误率越低，则该OTU组合被认为是能够区分组间差异的最少的OTU组合。</p>
<p><b>3. ROC曲线</b></p>
<p>接收者操作特征曲线（Receiveroperating characteristic curve，ROC 曲线）也是一种有效的有监督学习方法。ROC分析属于二元分类算法，用来处理只有两种分类的问题，可以用于选择最佳的判别模型,选择最佳的诊断界限值。</p>
<p>可依据专业知识，对疾病组和参照组测定结果进行分析，确定测定值的上下限、组距以及截断点(cut-offpoint)，按选择的组距间隔列出累积频数分布表，分别计算出所有截断点的敏感性(Sensetivity)、特异性和假阳性率(1-特异性:Specificity)。以敏感性为纵坐标代表真阳性率，(1-特异性)为横坐标代表假阳性率，作图绘成ROC曲线。ROC曲线越靠近左上角，诊断的准确性就越高。亦可通过分别计算各个试验的ROC曲线下的面积(AUC)进行比较，哪一种试验的AUC最大，则哪一种试验的诊断价值最佳。</p>
</div>
<p><a href="http://www.biofacebook.com/wp-content/uploads/2020/03/18585978-07a26b63fea21bf8.png"><img class="aligncenter size-full wp-image-1486" src="http://www.biofacebook.com/wp-content/uploads/2020/03/18585978-07a26b63fea21bf8.png" alt="18585978-07a26b63fea21bf8" width="685" height="688" /></a></p>
<div>
<div>
<p>注：图中横坐标为假阳性率false positive rate（FPR）：Specificity，纵坐标为真阳性率true positive rate（TPR）：Sensetivity。最靠近左上角的ROC曲线的点是错误最少的最好阈值，其假阳性和假阴性的总数最少。ROC曲线下的面积值在1.0和0.5之间。在AUC&gt;0.5的情况下，AUC越接近于1，说明诊断效果越好。AUC在 0.5~0.7时有较低准确性，AUC在0.7~0.9时有一定准确性，AUC在0.9以上时有较高准确性。AUC=0.5时，说明诊断方法完全不起作用，无诊断价值。AUC&lt;0.5不符合真实情况，在实际中极少出现。</p>
<p><b>4. Wilcoxon秩和检验分析</b></p>
<p>Wilcoxonrank-sum test，也叫曼-惠特尼U检验（Mann–WhitneyU test），是两组独立样本非参数检验的一种方法。其原假设为两组独立样本来自的两总体分布无显著差异，通过对两组样本平均秩的研究来实现判断两总体的分布是否存在差异，该分析可以对两组样品的物种进行显著性差异分析，并对p值计算假发现率（FDR）q值。</p>
</div>
<p><a href="http://www.biofacebook.com/wp-content/uploads/2020/03/18585978-6714c4d277105abd.png"><img class="aligncenter size-large wp-image-1487" src="http://www.biofacebook.com/wp-content/uploads/2020/03/18585978-6714c4d277105abd-1024x218.png" alt="18585978-6714c4d277105abd" width="640" height="136" /></a></p>
<div>
<div>
<p>注：mean分别为两组样品物种的平均相对丰度，sd分别是两组样本物种相对丰度的标准差。P值为对两组检验原假设为真的概率值，p&lt;0.05表示存在差异，p&lt;0.01表示差异显著，q值为假发现率。</p>
<p><b>5. 差异菌群Heatmap分析</b></p>
<p>以10倍交叉验证（10-foldcross-validation）估计泛化误差（Generalizationerror）的大小，其余参数使用默认设置。建模结果同时包含“基线”误差（Baselineerror）的期望值，即数据集中属于最优势分类的样本全部被错误分类的概率。每个OTU根据其被移除后模型预报错误率增加的大小确定其重要度数值，重要度越高，该OTU对模型预报准确率的贡献越大。</p>
<p>根据挑选出来的差异OTU，根据其在每个样品中的丰度信息，对物种进行聚类，绘制成热图，便于观察哪些物种在哪些样品中聚集较多或含量较低。</p>
</div>
<p><a href="http://www.biofacebook.com/wp-content/uploads/2020/03/18585978-391e366b1420fa92.png"><img class="aligncenter size-full wp-image-1488" src="http://www.biofacebook.com/wp-content/uploads/2020/03/18585978-391e366b1420fa92.png" alt="18585978-391e366b1420fa92" width="660" height="650" /></a></p>
<div>
<div>
<p>注：图中越接近蓝色表示物种丰度越低，越接近橙红色表示丰度越高。左边的聚类树是根据各物种间的spearman相关性距离进行聚类；上边的聚类树是采用样本间距离算法中最常用的Bray-Curtis算法进行聚类。</p>
<p><b>6. 两组样本Welch’s t-test分析</b></p>
<p>两组不同方差的样本可使用Welch’st-test进行差异比较分析，通过此分析可获得在两组中有显著性差异的物种[或差异基因丰度—适用于元（宏）基因组]。</p>
</div>
<p><a href="http://www.biofacebook.com/wp-content/uploads/2020/03/18585978-8453377e0336274c.png"><img class="aligncenter size-large wp-image-1489" src="http://www.biofacebook.com/wp-content/uploads/2020/03/18585978-8453377e0336274c-1024x531.png" alt="18585978-8453377e0336274c" width="640" height="332" /></a></div>
<div>
<div>
<div>
<p>注：上图所示为不同基因丰度（或物种）在两组样品中的丰度比例，中间所示为95%置信度区间内，物种丰度的差异比例，最右边的值为p值，p值＜0.05，表示差异显著。</p>
<p><b>7. Shannon多样性指数比较盒状图</b></p>
<p>将不同分类或环境的多组样本的Shannon多样性指数进行四分位计算，比较不同样本组的组间Shannon指数差异。同时进行非参数Mann-Whitney判断样本组间的显著性差异</p>
</div>
<p><a href="http://www.biofacebook.com/wp-content/uploads/2020/03/18585978-5c097b33070b52e3.png"><img class="aligncenter size-full wp-image-1490" src="http://www.biofacebook.com/wp-content/uploads/2020/03/18585978-5c097b33070b52e3.png" alt="18585978-5c097b33070b52e3" width="338" height="350" /></a></div>
<div>
<div>
<p>注：横坐标表示样本分组，纵坐标表示相对应的Alpha多样性指数值；图形可以显示5个统计量（最小值，第一个四分位，中位数，第三个中位数和最大值，及由下到上5条线）。p＜0.05，表示差异显著；P&lt;0.01，表示差异极显著。</p>
<p><b>8. 基于距离的箱式图</b></p>
<p>将不同分类或环境的多组样本的距离进行四分位计算，比较不同样本组的组内和组间的距离分布差异。同时进行multipleStudent&#8217;s two-sample t-tests判断样本组间差异的显著性。</p>
<p>箱式图的作用：识别数据异常值；粗略估计和判断数据特征；比较几批数据的形状，同一数轴上，几批数据的箱形图并行排列，几批数据的中位数、尾长、异常值、分布区间等形状信息一目了然。</p>
<p>箱线图（Boxplot）也称箱须图（Box-whiskerPlot），是利用数据中的五个统计量：最小值、第一四分位数、中位数、第三四分位数与最大值来描述数据的一种方法，它也可以粗略地看出数据是否具有对称性，分布的分散程度等信息，特别可以用于对几组样本的比较。简单箱线图由五部分组成，分别是最小值、中位数、最大值和两个四分位数。</p>
</div>
<p><a href="http://www.biofacebook.com/wp-content/uploads/2020/03/18585978-1605130e5d641edc.png"><img class="aligncenter size-full wp-image-1491" src="http://www.biofacebook.com/wp-content/uploads/2020/03/18585978-1605130e5d641edc.png" alt="18585978-1605130e5d641edc" width="740" height="682" /></a></div>
<div>
<div>
<div>
<p>注：第一四分位数 (Q1)，又称“下四分位数”，等于该样本中所有数值由小到大排列后第25%的数字。第二四分位数 (Q2)，又称“中位数”，等于该样本中所有数值由小到大排列后第50%的数字。 第三四分位数 (Q3)，又称“上四分位数”，等于该样本中所有数值由小到大排列后第75%的数字。</p>
<p><b>9. LEfSe分析</b></p>
<p>LEfSe是一种用于发现高维生物标识和揭示基因组特征的软件。包括基因，代谢和分类，用于区别两个或两个以上生物条件（或者是类群）。该算法强调的是统计意义和生物相关性。让研究人员能够识别不同丰度的特征以及相关联的类别。</p>
<p>LEfSe通过生物学统计差异使其具有强大的识别功能。然后，它执行额外的测试，以评估这些差异是否符合预期的生物学行为。</p>
<p>具体来说，首先使用non-parametric factorial Kruskal-Wallis (KW) sum-rank test（非参数因子克鲁斯卡尔—沃利斯和秩验检）检测具有显著丰度差异特征，并找到与丰度有显著性差异的类群。最后，LEfSe采用线性判别分析（LDA）来估算每个组分（物种）丰度对差异效果影响的大小。</p>
</div>
<p><a href="http://www.biofacebook.com/wp-content/uploads/2020/03/18585978-9453e2fea01fb73b.png"><img class="aligncenter size-large wp-image-1492" src="http://www.biofacebook.com/wp-content/uploads/2020/03/18585978-9453e2fea01fb73b-1024x377.png" alt="18585978-9453e2fea01fb73b" width="640" height="236" /></a></div>
<div>
<div>
<div>
<p>说明：左边的图为统计两个组别当中有显著作用的微生物类群通过LDA分析（线性回归分析）后获得的LDA分值。右边的图为聚类树，节点大小表示丰度，默认从门到属依次向外排列。红色区域和绿色区域表示不同分组，树枝中红色节点表示在红色组别中起到重要作用的微生物类群，绿色节点表示在绿色组别中起到重要作用的微生物类群，黄色节点表示的是在两组中均没有起到重要作用的微生物类群。图中英文字母表示的物种名称在右侧图例中进行展示。</p>
<p><b>10. ANOSIM相似性分析</b></p>
<p>相似性分析(ANOSIM)是一种非参数检验，用来检验组间（两组或多组）的差异是否显著大于组内差异，从而判断分组是否有意义。首先利用Bray-Curtis算法计算两两样品间的距离，然后将所有距离从小到大进行排序，按以下公式计算R值，之后将样品进行置换，重新计算R*值，R*大于R的概率即为P值。</p>
</div>
<p><a href="http://www.biofacebook.com/wp-content/uploads/2020/03/18585978-e0960fa6644a18ff.png"><img class="aligncenter size-full wp-image-1493" src="http://www.biofacebook.com/wp-content/uploads/2020/03/18585978-e0960fa6644a18ff.png" alt="18585978-e0960fa6644a18ff" width="454" height="135" /></a></p>
<div>
<div>
<p>其中，</p>
<p>r ̅ _b：表示组间（Between groups）距离排名的平均值；</p>
<p>r ̅ _w：表示组内（Within groups）距离排名的平均值；</p>
<p>n：表示样品总数。</p>
<p><b>Table. Anosimanalysis</b></p>
</div>
<p><a href="http://www.biofacebook.com/wp-content/uploads/2020/03/18585978-d456795bc0cddde8.png"><img class="aligncenter size-large wp-image-1494" src="http://www.biofacebook.com/wp-content/uploads/2020/03/18585978-d456795bc0cddde8-1024x344.png" alt="18585978-d456795bc0cddde8" width="640" height="215" /></a></div>
<div>
<div>
<div>
<p>注：理论上，R值范围为-1到+1，实际中R值一般从0到1，R值接近1表示组间差异越大于组内差异，R值接近0则表示组间和组内没有明显差异。P值则反映了分析结果的统计学显著性，P值越小，表明各样本分组之间的差异显著性越高，P&lt; 0.05表示统计具有显著性；Number of permutation表示置换次数。</p>
<p><b>11. Adonis多因素方差分析</b></p>
<p>Adonis又称置换多因素方差分析（permutationalMANOSVA）或非参数多因素方差分析（nonparametricMANOVA）。它利用半度量(如Bray-Curtis)或度量距离矩阵(如Euclidean)对总方差进行分解，分析不同分组因素对样品差异的解释度，并使用置换检验对划分的统计学意义进行显著性分析。</p>
<p><b>Table permutational MANOVA analysis</b></p>
</div>
<p><a href="http://www.biofacebook.com/wp-content/uploads/2020/03/18585978-7fa68baa52da9d13.png"><img class="aligncenter size-large wp-image-1495" src="http://www.biofacebook.com/wp-content/uploads/2020/03/18585978-7fa68baa52da9d13-1024x132.png" alt="18585978-7fa68baa52da9d13" width="640" height="83" /></a></p>
<div>
<div>
<p>注：</p>
<p>Group：表示分组；</p>
<p>Df：表示自由度；</p>
<p>SumsOfSqs：总方差，又称离差平方和；</p>
<p>MeanSqs：平均方差，即SumsOfSqs/Df；</p>
<p>F.Model：F检验值；</p>
<p>R2：表示不同分组对样品差异的解释度，即分组方差与总方差的比值，即分组所能解释的原始数据中差异的比例，R2越大表示分组对差异的解释度越高；</p>
<p>Pr(&gt;F)：通过置换检验获得的P值，P值越小，表明组间差异显著性越强。</p>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
<p>作者：JarySun<br />
链接：https://www.jianshu.com/p/87f24cceaa43<br />
来源：简书<br />
著作权归作者所有。商业转载请联系作者获得授权，非商业转载请注明出处。</p></div>
]]></content:encoded>
			<wfw:commentRss>https://www.biofacebook.com/?feed=rss2&#038;p=1483</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>ANOSIM，PERMANOVA/Adonis，MRPP　（转贴）</title>
		<link>https://www.biofacebook.com/?p=1474</link>
		<comments>https://www.biofacebook.com/?p=1474#comments</comments>
		<pubDate>Fri, 27 Mar 2020 07:13:55 +0000</pubDate>
		<dc:creator><![CDATA[szypanther]]></dc:creator>
				<category><![CDATA[统计学习]]></category>

		<guid isPermaLink="false">http://www.biofacebook.com/?p=1474</guid>
		<description><![CDATA[<p></p> 1. ANOSIM 组间相似性分析 相似性分析(ANOSIM)是一种非参数检验，用来检验组间(两组或多组)的差异是否显著大于组内差异，从而判断分组是否有意义。首先利用 Bray-Curtis 算法计算两两样品间的距离，然后将所有距离从小到大进行排序， 按以下公式计算 R 值，之后将样品进行置换，重新计算 R值，R大于 R 的概率即为 P 值。 <p></p> 注:图上总共有 N+1 个盒子，N 为分组数量。“Between”的盒子指代的是分组之间的差异，其他分别代表各自组 内差异。R 值范围为-1 到+1，实际中 R 值一般从 0 到 1。R 值接近 1 表示组间差异越大于组内差异，R 值接近 0 则表示组间和组内没有明显差异;此次统计分析的可信度用 P-value 表示，P&#60; 0.05 表示统计具有显著性。 2. PERMANOVA/Adonis 置换多元方差分析 PERMANOVA (Permutational multivariate analysis of variance，置换多元方差分析)，又称 Adonis 分析，可利用半度量(如 Bray-Curtis)或度量距离矩阵(如 Euclidean)对总方差进行分解，通过线性模型分析不同分组因素 或环境因子(如临床表型数据、土壤理化指标等)对样品差异的解释度，并使用置换检验进行显著性分析。 <p> 3. MRPP [...]]]></description>
				<content:encoded><![CDATA[<p><a href="http://www.biofacebook.com/wp-content/uploads/2020/03/6634703-6fde5e9bfb6489c7.png"><img class="aligncenter size-large wp-image-1475" src="http://www.biofacebook.com/wp-content/uploads/2020/03/6634703-6fde5e9bfb6489c7-1024x419.png" alt="6634703-6fde5e9bfb6489c7" width="640" height="262" /></a></p>
<div>
<div>
<h4>1. ANOSIM 组间相似性分析</h4>
<ul>
<li>相似性分析(ANOSIM)是一种非参数检验，用来检验组间(两组或多组)的差异是否显著大于组内差异，从而判断分组是否有意义。首先利用 Bray-Curtis 算法计算两两样品间的距离，然后将所有距离从小到大进行排序， 按以下公式计算 R 值，之后将样品进行置换，重新计算 R<em>值，R</em>大于 R 的概率即为 P 值。
<div class="image-package"></div>
</li>
</ul>
</div>
<p><img class="" src="https://upload-images.jianshu.io/upload_images/6634703-ec94fa34c56b542a.png?imageMogr2/auto-orient/strip|imageView2/2/w/1200" alt="" data-original-src="//upload-images.jianshu.io/upload_images/6634703-ec94fa34c56b542a.png" data-original-width="1358" data-original-height="610" data-original-format="image/png" data-original-filesize="204109" data-image-index="1" /><a href="http://www.biofacebook.com/wp-content/uploads/2020/03/6634703-ec94fa34c56b542a.png"><img class="aligncenter size-large wp-image-1476" src="http://www.biofacebook.com/wp-content/uploads/2020/03/6634703-ec94fa34c56b542a-1024x460.png" alt="6634703-ec94fa34c56b542a" width="640" height="288" /></a><a href="http://www.biofacebook.com/wp-content/uploads/2020/03/6634703-01bd752421e6028e.png"><img class="aligncenter size-large wp-image-1477" src="http://www.biofacebook.com/wp-content/uploads/2020/03/6634703-01bd752421e6028e-1024x819.png" alt="6634703-01bd752421e6028e" width="640" height="512" /></a></p>
<div>
<div>注:图上总共有 N+1 个盒子，N 为分组数量。“Between”的盒子指代的是分组之间的差异，其他分别代表各自组 内差异。R 值范围为-1 到+1，实际中 R 值一般从 0 到 1。R 值接近 1 表示组间差异越大于组内差异，R 值接近 0 则表示组间和组内没有明显差异;此次统计分析的可信度用 P-value 表示，P&lt; 0.05 表示统计具有显著性。</div>
<h4>2. PERMANOVA/Adonis 置换多元方差分析</h4>
<div>
<div>PERMANOVA (Permutational multivariate analysis of variance，置换多元方差分析)，又称 Adonis 分析，可利用半度量(如 Bray-Curtis)或度量距离矩阵(如 Euclidean)对总方差进行分解，通过线性模型分析不同分组因素 或环境因子(如临床表型数据、土壤理化指标等)对样品差异的解释度，并使用置换检验进行显著性分析。</div>
<p><a href="http://www.biofacebook.com/wp-content/uploads/2020/03/6634703-73376c3dc4dfb540.png"><img class="aligncenter size-large wp-image-1478" src="http://www.biofacebook.com/wp-content/uploads/2020/03/6634703-73376c3dc4dfb540-1024x288.png" alt="6634703-73376c3dc4dfb540" width="640" height="180" /></a><a href="http://www.biofacebook.com/wp-content/uploads/2020/03/6634703-cf2b6e7dd53c0843.png"><img class="aligncenter size-full wp-image-1479" src="http://www.biofacebook.com/wp-content/uploads/2020/03/6634703-cf2b6e7dd53c0843.png" alt="6634703-cf2b6e7dd53c0843" width="850" height="715" /></a></div>
<div>
<h4>3. MRPP 多响应置换过程分析</h4>
<div>
<div>MRPP 分析(Multiple Response Permutation Procedure，MRPP)用来检验组间(两组或多组)的差异是否显著大于组内差异。与 ANOSIM 分析类似，可利用半度量(如 Bray-Curtis)或度量距离矩阵(如 Euclidean)计算 A 值表示组间差异，使用置换检验对分组进行显著性分析。</div>
<p><a href="http://www.biofacebook.com/wp-content/uploads/2020/03/6634703-296b421d3f08ae04.png"><img class="aligncenter size-large wp-image-1480" src="http://www.biofacebook.com/wp-content/uploads/2020/03/6634703-296b421d3f08ae04-1024x262.png" alt="6634703-296b421d3f08ae04" width="640" height="164" /></a></div>
</div>
</div>
</div>
]]></content:encoded>
			<wfw:commentRss>https://www.biofacebook.com/?feed=rss2&#038;p=1474</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Adonis与ANOSIM检验究竟是什么？(转贴）</title>
		<link>https://www.biofacebook.com/?p=1461</link>
		<comments>https://www.biofacebook.com/?p=1461#comments</comments>
		<pubDate>Fri, 27 Mar 2020 06:51:55 +0000</pubDate>
		<dc:creator><![CDATA[szypanther]]></dc:creator>
				<category><![CDATA[统计学习]]></category>

		<guid isPermaLink="false">http://www.biofacebook.com/?p=1461</guid>
		<description><![CDATA[<p>做微生物16S测序的时候，公司的报告里经常会给到两种检验Adonis和ANOSIM，听过t.test、wilicox、anova各种检验，那么Adonis和ANOSIM检验是什么呢</p> Adonis 多元方差分析 <p>Adonis，多元方差分析，亦可称为非参数多元方差分析。其原理是利用距离矩阵（比如基于Bray-Curtis距离、Euclidean距离）对总方差进行分解，分析不同分组因素对样品差异的解释度，并使用置换检验对其统计学意义进行显著性分析。</p> <p>Adonis分析结果通常如下：</p> Index Df SumsOfSqs MeanSqs F.Model R2 Pr(&#62;F) GroupFactor 4 1.0899 0.27248 1.4862 0.14883 0.011 Residuals 34 6.2335 0.18334 0.85117 Total 38 7.3234 1.00000 <p>其中，GroupFactor表示实验中的分组方法 Df表示自由度 SumsOfSqs表示总方差即离差平方和 MeanSqs表示均方差（SumsOfSeqs/Df） F.Model表示检验值F R2表示该分组方式对样品间差异的解释度，R2越大说明该分组方案对差异的解释度越高 Pr表示P值，小于0.05时显著说明本次检验的可信度高。</p> <p>那么Adonis具体要如何使用呢？ 在微生物的分析中我们通常把Adonis和PCA分析结合在一起。进行完PCA分析后，我们想要检验不同的分组之间究竟是否有差异，差异是否显著，这时候我们就可以用Adonis检验。如下图，虽然我们可以看到三组被分开了，但是这种分开真的显著吗？这种分组又能对样本的差异解释多少呢？那么右侧的Adonis检验就告诉了我们明确的答案，这种分组时显著的，R2=0.11。</p> <p></p> <p>Sylvain I A, Adams R I, Taylor J W. A different suite: The assemblage of distinct fungal [...]]]></description>
				<content:encoded><![CDATA[<p>做微生物16S测序的时候，公司的报告里经常会给到两种检验Adonis和ANOSIM，听过t.test、wilicox、anova各种检验，那么Adonis和ANOSIM检验是什么呢</p>
<div>
<div>
<h4>Adonis 多元方差分析</h4>
<p>Adonis，多元方差分析，亦可称为非参数多元方差分析。其原理是利用距离矩阵（比如基于Bray-Curtis距离、Euclidean距离）对总方差进行分解，<strong>分析不同分组因素对样品差异的解释度</strong>，并使用置换检验对其统计学意义进行<strong>显著性分析</strong>。</p>
<p>Adonis分析结果通常如下：</p>
<table>
<thead>
<tr>
<th>Index</th>
<th>Df</th>
<th>SumsOfSqs</th>
<th>MeanSqs</th>
<th>F.Model</th>
<th>R2</th>
<th>Pr(&gt;F)</th>
</tr>
</thead>
<tbody>
<tr>
<td>GroupFactor</td>
<td>4</td>
<td>1.0899</td>
<td>0.27248</td>
<td>1.4862</td>
<td>0.14883</td>
<td>0.011</td>
</tr>
<tr>
<td>Residuals</td>
<td>34</td>
<td>6.2335</td>
<td>0.18334</td>
<td>0.85117</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Total</td>
<td>38</td>
<td>7.3234</td>
<td>1.00000</td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>
<p>其中，GroupFactor表示实验中的分组方法<br />
Df表示自由度<br />
SumsOfSqs表示总方差即离差平方和<br />
MeanSqs表示均方差（SumsOfSeqs/Df）<br />
F.Model表示检验值F<br />
R2表示该分组方式对样品间差异的解释度，R2越大说明该分组方案对差异的解释度越高<br />
Pr表示P值，小于0.05时显著说明本次检验的可信度高。</p>
<p><strong>那么Adonis具体要如何使用呢？</strong><br />
在微生物的分析中我们通常把Adonis和PCA分析结合在一起。进行完PCA分析后，我们想要检验不同的分组之间究竟是否有差异，差异是否显著，这时候我们就可以用Adonis检验。如下图，虽然我们可以看到三组被分开了，但是这种分开真的显著吗？这种分组又能对样本的差异解释多少呢？那么右侧的Adonis检验就告诉了我们明确的答案，这种分组时显著的，R2=0.11。</p>
<p><a href="http://www.biofacebook.com/wp-content/uploads/2020/03/8637066-0eb31addfe19bebe.png"><img class="aligncenter size-full wp-image-1465" src="http://www.biofacebook.com/wp-content/uploads/2020/03/8637066-0eb31addfe19bebe.png" alt="8637066-0eb31addfe19bebe" width="850" height="715" /></a></p>
<p>Sylvain I A, Adams R I, Taylor J W. A different suite: The assemblage of distinct fungal communities in water-damaged units of a poorly-maintained public housing building[J]. PloS one, 2019, 14(3): e0213355.</p>
<p>在R中我们可以使用Vegan包中的函数<code>adonis()</code>或<code>adonis2()</code>进行adonis检验。</p>
<div class="_2Uzcx_">
<pre class="line-numbers  language-kotlin"><code class="  language-kotlin"><span class="token function">adonis2</span><span class="token punctuation">(</span>formula<span class="token punctuation">,</span> <span class="token keyword">data</span><span class="token punctuation">,</span> permutations <span class="token operator">=</span> <span class="token number">999</span><span class="token punctuation">,</span> method <span class="token operator">=</span> <span class="token string">"bray"</span><span class="token punctuation">,</span>
    sqrt<span class="token punctuation">.</span>dist <span class="token operator">=</span> FALSE<span class="token punctuation">,</span> add <span class="token operator">=</span> FALSE<span class="token punctuation">,</span> <span class="token keyword">by</span> <span class="token operator">=</span> <span class="token string">"terms"</span><span class="token punctuation">,</span>
    parallel <span class="token operator">=</span> <span class="token function">getOption</span><span class="token punctuation">(</span><span class="token string">"mc.cores"</span><span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token operator">..</span><span class="token punctuation">.</span><span class="token punctuation">)</span>

<span class="token function">adonis</span><span class="token punctuation">(</span>formula<span class="token punctuation">,</span> <span class="token keyword">data</span><span class="token punctuation">,</span> permutations <span class="token operator">=</span> <span class="token number">999</span><span class="token punctuation">,</span> method <span class="token operator">=</span> <span class="token string">"bray"</span><span class="token punctuation">,</span>
    strata <span class="token operator">=</span> NULL<span class="token punctuation">,</span> contr<span class="token punctuation">.</span>unordered <span class="token operator">=</span> <span class="token string">"contr.sum"</span><span class="token punctuation">,</span>
    contr<span class="token punctuation">.</span>ordered <span class="token operator">=</span> <span class="token string">"contr.poly"</span><span class="token punctuation">,</span> parallel <span class="token operator">=</span> <span class="token function">getOption</span><span class="token punctuation">(</span><span class="token string">"mc.cores"</span><span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token operator">..</span><span class="token punctuation">.</span><span class="token punctuation">)</span>

</code></pre>
<div>
<div>
<h4>ANOSIM 相似性分析</h4>
<p>ANOSIM，相似性分析是一种非参数检验，<strong>用于检验高纬度数据间的相似性，比较组间和组内差异的大小</strong>，从而判断分组是否有意义，其可以用于检验两组的组间和组内差异，也可以用于多组。</p>
<p>ANOSIM的原理如下，以最基本的两个组为例：<br />
现在一共有6个样本，根据我们的实验方案将其分为两组Group1和Group2，每组含有3个样本。</p>
<p>1、首先我们基于组内样本间的距离计算组内的相似性。</p>
</div>
<p><a href="http://www.biofacebook.com/wp-content/uploads/2020/03/8637066-4c0d4a9ee6608c0d.png"><img class="aligncenter size-full wp-image-1466" src="http://www.biofacebook.com/wp-content/uploads/2020/03/8637066-4c0d4a9ee6608c0d.png" alt="8637066-4c0d4a9ee6608c0d" width="307" height="257" /></a></div>
<div>2、然后我们基于组间样本的距离计算组间的相似性。</div>
<div><a href="http://www.biofacebook.com/wp-content/uploads/2020/03/8637066-3b4b1374ba6a6743.png"><img class="aligncenter size-full wp-image-1467" src="http://www.biofacebook.com/wp-content/uploads/2020/03/8637066-3b4b1374ba6a6743.png" alt="8637066-3b4b1374ba6a6743" width="306" height="256" /></a></div>
<div>结合组内和组间，得到：</div>
<div><a href="http://www.biofacebook.com/wp-content/uploads/2020/03/8637066-8d5c1d0cd703fe77.png"><img class="aligncenter size-full wp-image-1468" src="http://www.biofacebook.com/wp-content/uploads/2020/03/8637066-8d5c1d0cd703fe77.png" alt="8637066-8d5c1d0cd703fe77" width="337" height="330" /></a></div>
<div>
<div class="image-package"></div>
<p>然后我们根据公式计算R值：</p>
<p><a href="http://www.biofacebook.com/wp-content/uploads/2020/03/8637066-b3c9bd5383539d8c.png"><img class="aligncenter size-full wp-image-1469" src="http://www.biofacebook.com/wp-content/uploads/2020/03/8637066-b3c9bd5383539d8c.png" alt="8637066-b3c9bd5383539d8c" width="855" height="363" /></a></p>
<div>
<div>
<p>其中，<br />
r0= mean rank of between group dissimilarities 即组间差异性秩的平均值<br />
rw= mean rank of within group dissimilarities 即组内差异性秩的平均值<br />
n=the number of samples 即样本总数量</p>
<p>因此根据公式可以知道，R的取值范围为[-1,1]：<br />
当R趋向于1时，说明组间差异大于组内差异<br />
当R=0时，说明组间没有差异，即分组无效，不同分组之间没有差异。<br />
当R趋向于-1时，说明组间差异小于组内差异。</p>
<p>当R大于0时，我们还要进一步检验这种差异是否显著具有可信度，ANOSIM中对其的检验方法也是使用Permutation Test即置换检验。</p>
<blockquote><p>置换检验：1、对原始样本进行随机分组，分为实验组和对照组<br />
2、计算随机分组的Ri值，并和R比较<br />
3、重复1000次<br />
4、计算p=Ri大于R的百分比，从而计算P值</p></blockquote>
<p>在我们做完PCoA、NMDS等降维分析的时候，我们也会遇到一同样的问题，数据看起来是分开的，但是不同的组之间差异真的显著吗？这个时候也可以选择ANOSIM进行检验。</p>
<p>R中Vegan包也提供了ANOSIM检验。下面用R中自带的鸢尾花数据集（iris）做一个示范：</p>
<pre class="line-numbers  language-csharp"><code class="  language-csharp"><span class="token function">library</span><span class="token punctuation">(</span>vegan<span class="token punctuation">)</span>
<span class="token function">library</span><span class="token punctuation">(</span>ggplot2<span class="token punctuation">)</span>
<span class="token preprocessor property">#Delete Species Infor</span>
dat<span class="token operator">&lt;</span><span class="token operator">-</span><span class="token function">subset</span><span class="token punctuation">(</span>iris<span class="token punctuation">,</span><span class="token keyword">select</span> <span class="token operator">=</span> <span class="token operator">-</span>Species<span class="token punctuation">)</span>
<span class="token preprocessor property">#Calculate Distance</span>
iris<span class="token punctuation">.</span>dist<span class="token operator">&lt;</span><span class="token operator">-</span><span class="token function">vegdist</span><span class="token punctuation">(</span>dat<span class="token punctuation">)</span>
<span class="token preprocessor property">#MDS analysis</span>
m<span class="token operator">&lt;</span><span class="token operator">-</span><span class="token function">monoMDS</span><span class="token punctuation">(</span>iris<span class="token punctuation">.</span>dist<span class="token punctuation">)</span>
MDS<span class="token operator">&lt;</span><span class="token operator">-</span><span class="token keyword">as</span><span class="token punctuation">.</span>data<span class="token punctuation">.</span><span class="token function">frame</span><span class="token punctuation">(</span>m$points<span class="token punctuation">)</span>
<span class="token preprocessor property">#Gain group information</span>
MDS$<span class="token keyword">group</span><span class="token operator">&lt;</span><span class="token operator">-</span>iris$Species
<span class="token preprocessor property">#Plot</span>
p<span class="token operator">&lt;</span><span class="token operator">-</span><span class="token function">ggplot</span><span class="token punctuation">(</span>MDS<span class="token punctuation">,</span><span class="token function">aes</span><span class="token punctuation">(</span>MDS1<span class="token punctuation">,</span>MDS2<span class="token punctuation">,</span>col<span class="token operator">=</span><span class="token keyword">group</span><span class="token punctuation">,</span>shape<span class="token operator">=</span><span class="token keyword">group</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token operator">+</span>
  <span class="token function">geom_point</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token operator">+</span>
  <span class="token function">theme_bw</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token operator">+</span>
  <span class="token function">theme</span><span class="token punctuation">(</span>legend<span class="token punctuation">.</span>title<span class="token operator">=</span><span class="token function">element_blank</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span>
<a href="http://www.biofacebook.com/wp-content/uploads/2020/03/8637066-6be283f0b46d2616.png"><img class="aligncenter size-large wp-image-1470" src="http://www.biofacebook.com/wp-content/uploads/2020/03/8637066-6be283f0b46d2616-1024x821.png" alt="8637066-6be283f0b46d2616" width="640" height="513" /></a>
从上图我们可以直观地看出，组间差异大于组内差异，三组样本明显可以分开。
 那么进一步我们用ANOSIM检验来验证我们从图中得到的结论。
</code></pre>
<pre class="line-numbers  language-bash"><code class="  language-bash">#ANOSIM
anosim_result&lt;-anosim(dat,iris$Species,permutations = 999)
summary(anosim_result)
plot(anosim_result, col = c('#FFD700','#FF7F00','#EE2C2C','#D02090'))
<a href="http://www.biofacebook.com/wp-content/uploads/2020/03/8637066-402a68df1c328084.png"><img class="aligncenter size-full wp-image-1471" src="http://www.biofacebook.com/wp-content/uploads/2020/03/8637066-402a68df1c328084.png" alt="8637066-402a68df1c328084" width="968" height="756" /></a>
<a href="http://www.biofacebook.com/wp-content/uploads/2020/03/8637066-8e1374dca1a6ff46.png"><img class="aligncenter size-large wp-image-1472" src="http://www.biofacebook.com/wp-content/uploads/2020/03/8637066-8e1374dca1a6ff46-1024x817.png" alt="8637066-8e1374dca1a6ff46" width="640" height="511" /></a>
从上图可以直观看到组间差异大于组内差异，R=0.858，接近于1，P值为0.001，小于0.05，说明该不同的分组之间差异明显，该分组是有意义的。</code></pre>
<pre class="line-numbers  language-csharp"></pre>
</div>
</div>
</div>
</div>
</div>
<p>作者：jlyq617<br />
链接：https://www.jianshu.com/p/dfa689f7cafd<br />
来源：简书</p>
</div>
]]></content:encoded>
			<wfw:commentRss>https://www.biofacebook.com/?feed=rss2&#038;p=1461</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>alpha多样性</title>
		<link>https://www.biofacebook.com/?p=1436</link>
		<comments>https://www.biofacebook.com/?p=1436#comments</comments>
		<pubDate>Tue, 18 Feb 2020 03:24:22 +0000</pubDate>
		<dc:creator><![CDATA[szypanther]]></dc:creator>
				<category><![CDATA[生物信息]]></category>
		<category><![CDATA[统计学习]]></category>

		<guid isPermaLink="false">http://www.biofacebook.com/?p=1436</guid>
		<description><![CDATA[ 扩增子数据分析之多样性指数： alpha多样性 <p>多样性指数（Diversity index）和计算公式可以见： wikipedia</p> <p>Alpha多样性（Alpha Diversity）是对某个样品中物种多样性的分析，包含样品中的物种类别的多样性——丰富度（Richness）和物种组成多少的整体分布——均匀度（Evenness）两个因素，通常用Richness,Chao1，Shannon，Simpson，Dominance和Equitability等指数来评估样本的物种多样性。</p> <p>丰富度指数</p> <p>Richness, Chao1，Shannon三个指数是常用的评估丰富度的指标，数值越高表明样品包含的物种丰富度就越高。</p> Richness指数: 指样本中被检测到的OTU量； Chao1指数 : 通过低丰度OTUs来进一步预测样品中的OTUs数量； Shannon指数 : 计算考虑到样品中的OTUs及其相对丰度信息， 通过对数（如以2为底的shannon_2，以自然对数为底的shannon_e 以10为底的shannon_10）转换来预测样品中的分类多样性。 <p>均匀度指数</p> <p>Simpson，Dominance和Equitability三个指数是常用的评估均匀度的指标。</p> Simpson指数 : 表示随机选取两条序列属于同一个分类（如OTUs）的概率（故数值在0~1之间）， 数值越接近1表示表明OTUs的丰度分部越不均匀； Dominancez指数 : 取值为1-Simpson，表示随机选取两条序列属于不同分类（如OTUs）的概率； Equitability指数: 根据Shannon指数值计算，当其值为1时表明样品中的物种丰度分布绝对均匀， 而其值越小这表明物种丰度分布呈现出越高的偏向。 <p>汇总表：</p> 指数 单位 计算方式 richness OTUs 样本中至少包含一条序列的OTU数目 chao1 OTUs N + S^2 / (2D^2)，其中N为OTU个数, S为丰度为1的OTUs个数，D为丰度为2的OTUs数目； shannon_2 bits sum(f), 对所有OTU频率计算p*log(p,2)和, p为OTU的频率； shannon_e nats [...]]]></description>
				<content:encoded><![CDATA[<div class="post-headline">
<h1>扩增子数据分析之多样性指数： alpha多样性</h1>
</div>
<div class="post-bodycopy clearfix">
<p>多样性指数（Diversity index）和计算公式可以见： <a href="https://en.wikipedia.org/wiki/Diversity_index#Shannon_index">wikipedia</a></p>
<p>Alpha多样性（Alpha Diversity）是对某个样品中物种多样性的分析，包含样品中的物种类别的多样性——丰富度（Richness）和物种组成多少的整体分布——均匀度（Evenness）两个因素，通常用Richness,Chao1，Shannon，Simpson，Dominance和Equitability等指数来评估样本的物种多样性。</p>
<p><strong>丰富度指数</strong></p>
<p>Richness, Chao1，Shannon三个指数是常用的评估丰富度的指标，数值越高表明样品包含的物种丰富度就越高。</p>
<pre><code>Richness指数: 指样本中被检测到的OTU量；
Chao1指数   : 通过低丰度OTUs来进一步预测样品中的OTUs数量；
Shannon指数 : 计算考虑到样品中的OTUs及其相对丰度信息，
             通过对数（如以2为底的shannon_2，以自然对数为底的shannon_e
             以10为底的shannon_10）转换来预测样品中的分类多样性。
</code></pre>
<p><strong>均匀度指数</strong></p>
<p>Simpson，Dominance和Equitability三个指数是常用的评估均匀度的指标。</p>
<pre><code>Simpson指数     : 表示随机选取两条序列属于同一个分类（如OTUs）的概率（故数值在0~1之间），
                  数值越接近1表示表明OTUs的丰度分部越不均匀；
Dominancez指数  : 取值为1-Simpson，表示随机选取两条序列属于不同分类（如OTUs）的概率；
Equitability指数: 根据Shannon指数值计算，当其值为1时表明样品中的物种丰度分布绝对均匀，
                  而其值越小这表明物种丰度分布呈现出越高的偏向。
</code></pre>
<p><strong>汇总表：</strong></p>
<table>
<thead>
<tr class="alt">
<th>指数</th>
<th>单位</th>
<th>计算方式</th>
</tr>
</thead>
<tbody>
<tr>
<td>richness</td>
<td>OTUs</td>
<td>样本中至少包含一条序列的OTU数目</td>
</tr>
<tr class="alt">
<td>chao1</td>
<td>OTUs</td>
<td>N + S^2 / (2D^2)，其中N为OTU个数, S为丰度为1的OTUs个数，D为丰度为2的OTUs数目；</td>
</tr>
<tr>
<td>shannon_2</td>
<td>bits</td>
<td>sum(f), 对所有OTU频率计算p*log(p,2)和, p为OTU的频率；</td>
</tr>
<tr class="alt">
<td>shannon_e</td>
<td>nats</td>
<td>sum(f), 对所有OTU频率计算p*log(p,e)和, p为OTU的频率；</td>
</tr>
<tr>
<td>shannon_10</td>
<td>dits</td>
<td>sum(f), 对所有OTU频率计算p*log(p,10)和, p为OTU的频率；</td>
</tr>
<tr class="alt">
<td>simpson</td>
<td>Probability</td>
<td>sum(f^2)， f为所有OTU频率的和</td>
</tr>
<tr>
<td>dominance</td>
<td>Probability</td>
<td>1-simpson</td>
</tr>
<tr class="alt">
<td>equitability</td>
<td></td>
<td>shannon/log(N), N为OTU数(logs to base 2)</td>
</tr>
</tbody>
</table>
<p><strong>实例：</strong></p>
<p><strong>USEARCH alpha_div</strong></p>
<p>USEARCH 提供了alpha_div函数进行计算各种指数, 可通<code>·-metrics</code> 指定需要计算指数，支持的指数有： berger_parker、buzas_gibson、chao1、dominance、equitability、jost、jost1、reads、richness、robbins、simpson shannon_e、shannon_2、shannon_10</p>
<pre><code>usearch -alpha_div otutable.txt -output alpha.txt
usearch -alpha_div otutable.txt -output gini.txt  -metrics gini_simpson
usearch -alpha_div otutable.txt -output alpha.txt -metrics chao1,
</code></pre>
<p><strong>QIIME diversity alpha</strong></p>
<p>qiime2 数据分析流程通过 <code>qiime diversity</code>接口提供了分析`alpha多样性·的各种命令：</p>
<pre><code>--i-table  : FeatureTable
--p-metric : enspie|michaelis_menten_fit|strong|lladser_pe|fisher_alpha
             |goods_coverage|doubles|simpson|margalef|observed_otus|osd
             |shannon|pielou_e|chao1|brillouin_d|menhinick|simpson_e
             |kempton_taylor_q|robbins|dominance|lladser_ci|heip_e
             |singles|chao1_ci|mcintosh_d|ace|mcintosh_e|gini_index
             |berger_parker_d|esty_ci
--o-alpha-diversity: 输出alpha多样性；
--output-dir： 输出目录（如不指定--o-distance-matrix）；
</code></pre>
<p>执行：</p>
<pre><code>qiime diversity alpha          \
   --i-table  table.qza       \
   --p-metric  goods_coverage \
   --o-alpha-diversity  goods_coverage.qza

</code></pre>
<div>物多样性测定主要有三个空间尺度：α多样性，β多样性，γ多样性。</div>
<div></div>
<div> <wbr></wbr>  <wbr></wbr>  <wbr></wbr><b>α多样性</b>主要关注局域均匀生境下的物种数目，因此也被称为生境内的多样性（within-habitat diversity）</div>
<div></div>
<div> <wbr></wbr>  <wbr></wbr>  <wbr></wbr> <b>β多样性</b>指沿环境梯度不同生境群落之间物种组成的的相异性或物种沿环境梯度的更替速率也被称为生境间的多样性（between-habitat diversity），控制β多样性的主要生态因子有土壤、地貌及干扰等。</div>
<div>
<div>不同群落或某环境梯度上不同点之间的共有种越少，β多样性越大。精确地测定<a href="http://baike.baidu.com/view/3815769.htm" target="_blank">β多样性</a>具有重要的意义。这是因为：①它可以指示生境被物种隔离的程度；②β多样性的测定值可以用来比较不同地段的生境多样性；③β多样性与<a href="http://baike.baidu.com/view/3815768.htm" target="_blank">α多样性</a>一起构成了总体多样性或一定地段的生物异质性。</div>
<div>γ多样性描述区域或大陆尺度的多样性，是指区域或大陆尺度的物种数量，也被称为区域多样性（regional diversity）。控制<a href="http://baike.baidu.com/view/3815770.htm" target="_blank">γ多样性</a>的<a href="http://baike.baidu.com/view/3273923.htm" target="_blank">生态过程</a>主要为水热动态，气候和<a href="http://baike.baidu.com/view/702554.htm" target="_blank">物种形成</a>及演化的历史。主要指标为物种数（S）。γ多样性测定沿海拔梯度具有两种分布格局：偏锋分布和显著的负相关格局。</div>
</div>
<pre><code>
<span style="font-family: Consolas,Monaco,monospace;">https://rdrr.io/cran/otuSummary/man/alphaDiversity.html

</span></code></pre>
<div class="b_title">
<h2><a href="https://www.mothur.org/wiki/Invsimpson" target="_blank">Invsimpson &#8211; mothur</a></h2>
<div class="b_suffix b_secondaryText nowrap">
<div>The invsimpson calculator is the inverse of the classical Simpson diversity estimator. This parameter is preferred to other measures of <strong>alpha</strong>-diversity because it is an indication of the richness in a community with uniform evenness that would have the same level of diversity.</div>
</div>
</div>
<div class="b_caption">
<div class="b_attribution"><cite>https://www.mothur.org/wiki/<strong>Invsimpson</strong></cite></div>
</div>
<pre><strong>Biological diversity - the great variety of life !</strong></pre>
<div id="ct" class="ct2 wp cl">
<div class="mn">
<div class="bm">
<div class="bm_c">
<div class="vw mbm">
<div id="blog_article" class="d cl">
<p>在探索simpson指数之前，我们需要理解几个很重要的概念：</p>
<p>生物多样性可以用很多种方式定量，其中两个主要的因素是丰富度（richness）和均匀度（evenness）。</p>
<p>1. Richness</p>
<p>丰富度即每个样本的物种数，样本中物种越多，样本越“丰富”。</p>
<p>物种丰富度从概念上讲，并不考虑（样本中）每个物种有多少个个体。它给于个体数少的物种与个体数多的数种相同的权重。因此，在某地区1朵雏菊与1000朵金凤花对丰富度的影响是一样的。</p>
<p>2. Evenness</p>
<p>均匀度即不同物种的相对丰度（abundance）,它与丰富度互相补充，相辅相成（make up）。</p>
<p>[译者注] 这里其实有三个概念：Richness, Evennes 和abundance。例如A组：类1有3个，类2有5个，类3有6个；B组：类1有4个，类2有4个，类3有4个。那么A组有3类，B组也有3类，所以它们的richness是一样的；A组中3个类所含个体数均不相同，而B组中3个类所含个体数相同，因此A组和B组的evennes不同；A组类1有3个，B组类1有4个，所以就类1而言B组的abundance更高。</p>
<p>我们对两个地区不同的野花进行取样，以此为例。第1个地区包括300朵雏菊，335朵蒲公英和365朵金凤花。第2个地区包括20朵雏菊，49朵蒲公英和931朵金凤花，如下表。两个样本丰富度相同（均有3个物种），总的个体数也相同（均为1000朵）。然而第1个地区样本的均匀度比第2个地区样本的均匀度更高。这是因为（在第1个地区）3个物种个体分布较均匀，第2个地区大多数是金凤花，仅有少数雏菊和蒲公英。因此认为样本2比样本1的多样性更低。</p>
<p><a href="http://image.sciencenet.cn/home/201708/04/164923bgmoicctp4kq90bl.png" target="_blank"><img src="http://image.sciencenet.cn/home/201708/04/164923bgmoicctp4kq90bl.png" alt="" /></a></p>
<p><em>相比于由相</em><em>似</em><em>丰度的许多物种组成的群落</em><em>，由一两个优势物种组成的群落具有更低的多样性。</em></p>
<p>多样性随物种丰富度和均匀度的增加而增加。Simpson指数兼顾丰富度和均匀度。</p>
<p>Simpson多样性指数实际上涉及三个相似的指数：</p>
<p><strong>Simpson’s Index (D) </strong></p>
<p>它反映的是在同一个样本中随机的抽取2个个体，这两个个体来自同一个类的概率。有以下两个版本的公式来计算simpson指数。两者不矛盾，均可接受。</p>
<table border="3" width="60%" align="CENTER">
<tbody>
<tr>
<td><a href="http://image.sciencenet.cn/home/201708/04/165216cljy66u6cjthffh6.png" target="_blank"><img src="http://image.sciencenet.cn/home/201708/04/165216cljy66u6cjthffh6.png" alt="" /></a></td>
<td><a href="http://image.sciencenet.cn/home/201708/04/1652207ot61olwjjpf9o1k.png" target="_blank"><img src="http://image.sciencenet.cn/home/201708/04/1652207ot61olwjjpf9o1k.png" alt="" /></a></td>
</tr>
<tr>
<td colspan="2" rowspan="1"><strong>n = the total number of organisms of a particular species</strong><strong><br />
</strong><strong>N = the total number of organisms of all species</strong></td>
</tr>
</tbody>
</table>
<p>D值在0-1之间。0表示无限多样，1表示没有多样性。也就是说D值越大，多样性越低。这与直觉和逻辑不符，为了解决这个问题，通常会用1减去D：</p>
<p><strong>Simpson’s Index of Diversity 1-D</strong></p>
<p>这个值也在0-1之间，但是此时，值越大多样性越高，这就变得更直观了。这种情况下，指数代表的意义是在同一个样本中随机的抽取2个个体，这两个个体来自不同类的概率。</p>
<p>对于违背直觉的D值，还有另一种处理办法，即用1除以D:</p>
<p><strong>Simpson&#8217;s Reciprocal Index 1 / D</strong></p>
<p>1/D的最小值为1。当它为1时表示样本仅由1个物种组成。值越大，多样性越高。最大值是样本中的物种数。例如，假设一个样本中有5个物种，则1/D的最大值为5。</p>
<p>[译者注] 当样本中这5个物种的丰度都相等时1/D达到最大值5。大家可以通过求二阶偏导来求出极值，因非本文重点，证明从略。</p>
<p>以上三个指数想用哪一个取决于使用者的分析需求，但是在研究中需指明使用哪一个指标作为simpson指数！[译者注：该文作者着重强调了这一点，请注意！]</p>
<p># ====================== 译文结束 =======================</p>
<p>这篇材料提供的案例很好，但是遗憾的是仅说明了simpson指数与evennes关系。为了进行单因素比较，作者将两组丰富度设为相同。那么如果丰富度不同呢？而且simpson指数是否与shannon指数一样与丰度无关呢？这里再举一个例子(因为各组相互独立，这里就不给生物学意义，直接上数字了，具体可查看另一篇shannon指数博文[2])：</p>
<p>A组：2, 4, 6, 8</p>
<p>B组：20, 40, 60, 80</p>
<p>C组：5, 5, 5, 5</p>
<p>D组：5, 5, 5, 5, 5</p>
<p>代入公式1-D计算（因为微生物16SrRNA经典流程QIIME使用的scikit库是利用这个公式计算的〔3〕），我们可以得出：</p>
<p>A组simpson指数为： 1-((2/20)^2+(4/20)^2+(6/20)^2+(8/20)^2) = 0.7</p>
<p>A组shannon指数为 1.846439（计算公式见博文[2]，下同）</p>
<p>B组simpson指数为： 1-((20/200)^2+(40/200)^2+(60/200)^2+(80/200)^2) = 0.7</p>
<p>B组shannon指数为 1.846439</p>
<p>C组simpson指数为： 1-((5/20)^2)*4 = 0.75</p>
<p>C组shannon指数为 2.0</p>
<p>D组simpson指数为： 1-((5/25)^2)*5 = 0.8</p>
<p>D组shannon指数为 2.321928</p>
<p>从上面的计算过程很明显看出A组和B组相等，C组和D组不相等，A组和C组也不相等。</p>
<p>AB组结果相同显示出在丰富度一致时simpson指数与丰度无关，它只与相对丰度（均匀度）有关。这和shannon指数一致，归根结底是因为公式中自变量都是相对丰度pi！</p>
<p>CD组结果不同显示出在均匀度一致时simpson指数与丰富度有关，丰富度越大，simpson指数越小。这一点也和shannon指数的情况一致，归根结底，原因在于公式中都有加和项，而且加和部分无论是simpson指数的(pi)<sup>2</sup>还是shannon指数的x*log2(x)在区间（0，1〕上均大于0（有关x*log2(x)&gt;0, x∈（0，1〕可以查看博文〔2〕中的y= &#8211; x*log2(x)那张图）。因此，无论是shannon指数还是simpson指数每多加一项（即丰富度增加），值都会越来越小。回到抽样上来讲，当样本中每种个体数都相同时，在一个样本中随机抽取两个个体，种类越多抽到的这两个个体来自同一个种类的概率越大。</p>
<p>AC组显示出当丰富度相同时，样本中种类越均一，simpson指数越大，即种类越均一，随机抽取两个个体属于同一个种类的概率越大。这一点可以查看博文〔2〕中的分析过程。对应shannon指数的y = &#8211; x*log2(x)， simpson指数的y = &#8211; x<sup>2 </sup>在（0，1〕间区上，也是一个斜率逐渐减小的单调递减函数。</p>
<p>综上，simpson和shannon指数都是均匀度和丰富度的综合指标。</p>
<p>〔1〕 <a href="http://www.countrysideinfo.co.uk/simpsons.htm" target="_blank">http://www.countrysideinfo.co.uk/simpsons.htm</a></p>
<p>〔2〕 <a href="http://blog.sciencenet.cn/blog-2970729-1069399.html" target="_blank">http://blog.sciencenet.cn/blog-2970729-1069399.html</a></p>
<p>〔3〕 <a href="http://scikit-bio.org/docs/latest/generated/generated/skbio.diversity.alpha.simpson.html#skbio.diversity.alpha.simpson" target="_blank">http://scikit-bio.org/docs/latest/generated/generated/skbio.diversity.alpha.simpson.html#skbio.diversity.alpha.simpson</a></p>
<p><label>本文来自卢锐科学网博客。<br />
链接地址：</label><a href="http://blog.sciencenet.cn/blog-2970729-1069539.html" target="_blank">http://blog.sciencenet.cn/blog-2970729-1069539.html </a></div>
</div>
</div>
</div>
</div>
</div>
<pre><code><span style="font-family: Consolas,Monaco,monospace;"> </span></code></pre>
</div>
]]></content:encoded>
			<wfw:commentRss>https://www.biofacebook.com/?feed=rss2&#038;p=1436</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Multivariate analyses in R (PERMANOVA )</title>
		<link>https://www.biofacebook.com/?p=1420</link>
		<comments>https://www.biofacebook.com/?p=1420#comments</comments>
		<pubDate>Sat, 07 Dec 2019 06:08:55 +0000</pubDate>
		<dc:creator><![CDATA[szypanther]]></dc:creator>
				<category><![CDATA[Linux相关]]></category>
		<category><![CDATA[统计学习]]></category>

		<guid isPermaLink="false">http://www.biofacebook.com/?p=1420</guid>
		<description><![CDATA[<p>https://rpubs.com/collnell/manova</p> Multivariate analyses in R <p>By C Nell</p> Types of questions <p>Do groups differ in composition? Does community structure vary among regions or over time? Do environmental variables explain community patterns? Which species are responsible for differences among groups?</p> <p>Multivariate analysis of ecological communities with vegan</p> install.packages('vegan') library(vegan) ##Community ecology: ordination, disversity &#38; dissimilarities Dataset [...]]]></description>
				<content:encoded><![CDATA[<p>https://rpubs.com/collnell/manova</p>
<div id="multivariate-analyses-in-r" class="section level2">
<h2>Multivariate analyses in R</h2>
<p>By C Nell</p>
<div id="types-of-questions" class="section level4">
<h4>Types of questions</h4>
<p>Do groups differ in composition?<br />
Does community structure vary among regions or over time?<br />
Do environmental variables explain community patterns?<br />
Which species are responsible for differences among groups?</p>
<p><a href="http://cc.oulu.fi/~jarioksa/opetus/metodi/vegantutor.pdf">Multivariate analysis of ecological communities with vegan</a></p>
<pre class="r"><code class="r"><span class="identifier">install.packages</span><span class="paren">(</span><span class="string">'vegan'</span><span class="paren">)</span>
<span class="keyword">library</span><span class="paren">(</span><span class="identifier">vegan</span><span class="paren">)</span> <span class="comment">##Community ecology: ordination, disversity &amp; dissimilarities</span></code></pre>
</div>
<div id="dataset" class="section level4">
<h4>Dataset</h4>
<p>Bird abundances from 32 different plots (rows), 12 of which have 1 tree species (DIVERSITY = M) and 20 with 4 tree species (DIVERSITY = P).<br />
Tree composition: there are a total of 6 possible tree species (treecomp), each signified with a letter A to F. Bird abundances are totalled according to their feeding guild (columns).</p>
<p>Get data from internet:</p>
<pre class="r"><code class="r"><span class="identifier">birds</span><span class="operator">&lt;-</span><span class="identifier">read.csv</span><span class="paren">(</span><span class="string">'https://raw.githubusercontent.com/collnell/lab-demo/master/bird_by_fg.csv'</span><span class="paren">)</span>
<span class="identifier">trees</span><span class="operator">&lt;-</span><span class="identifier">read.csv</span><span class="paren">(</span><span class="string">'https://raw.githubusercontent.com/collnell/lab-demo/master/tree_comp.csv'</span><span class="paren">)</span></code></pre>
<p>Or from your computer:</p>
<pre class="r"><code class="r"><span class="identifier">setwd</span><span class="paren">(</span><span class="string">"/Users/colleennell/Dropbox/Projects/Mexico/R"</span><span class="paren">)</span> <span class="comment">#change to data folder</span>
<span class="identifier">birds</span><span class="operator">&lt;-</span><span class="identifier">read.csv</span><span class="paren">(</span><span class="string">'bird_by_fg.csv'</span><span class="paren">)</span>
<span class="identifier">head</span><span class="paren">(</span><span class="identifier">birds</span><span class="paren">)</span></code></pre>
<pre><code>##   DIVERSITY PLOT CA FR GR HE IN NE OM
## 1         M    3  0  0  0  0  2  0  0
## 2         M    9  0  0  2  0  6  0  4
## 3         M   12  0  0  0  0  2  0  2
## 4         M   17  0  0  0  0  7  0  4
## 5         M   20  0  0  0  0  1  0  4
## 6         M   21  0  0  3  0 14  0  7</code></pre>
<pre class="r"><code class="r"><span class="identifier">trees</span><span class="operator">&lt;-</span><span class="identifier">read.csv</span><span class="paren">(</span><span class="string">'tree_comp.csv'</span><span class="paren">)</span>
<span class="identifier">head</span><span class="paren">(</span><span class="identifier">trees</span><span class="paren">)</span></code></pre>
<pre><code>##   PLOT comp A B C D E F row col
## 1    3    D 0 0 0 1 0 0   3   1
## 2    9    A 1 0 0 0 0 0   2   2
## 3   12    E 0 0 0 0 1 0   5   2
## 4   17    F 0 0 0 0 0 1   3   3
## 5   20    A 1 0 0 0 0 0   6   3
## 6   21    B 0 1 0 0 0 0   7   3</code></pre>
<p>Questions: Is <em>C. pentandara</em> (B) associated with variation in bird species composition? Does feeding guild composition differ between monoculture and polyculture plots?</p>
</div>
<div id="manova-multivariate-analysis-of-variance" class="section level3">
<h3>MANOVA (<strong>M</strong>ultivariate <strong>an</strong>alysis <strong>o</strong>f <strong>va</strong>riance)</h3>
<p>Parametric test for differences between independent groups for multiple continuous dependent variables. Like ANOVA for many response variables. Requires variables to be fewer than number of smaples.</p>
<p>Is <em>C. pentandara</em> (B) associated with variation in bird species composition? Or D &amp; F (both Fabaceae)?</p>
<pre class="r"><code class="r"><span class="identifier">bird.matrix</span><span class="operator">&lt;-</span><span class="identifier">as.matrix</span><span class="paren">(</span><span class="identifier">birds</span><span class="paren">[</span>,<span class="number">3</span><span class="operator">:</span><span class="number">9</span><span class="paren">]</span><span class="paren">)</span><span class="comment">##response variables in a sample x species matrix</span>
<span class="identifier">trees</span><span class="operator">$</span><span class="identifier">B</span><span class="operator">&lt;-</span><span class="identifier">as.factor</span><span class="paren">(</span><span class="identifier">trees</span><span class="operator">$</span><span class="identifier">B</span><span class="paren">)</span>

<span class="identifier">bird.manova</span><span class="operator">&lt;-</span><span class="identifier">manova</span><span class="paren">(</span><span class="identifier">bird.matrix</span><span class="operator">~</span><span class="identifier">as.factor</span><span class="paren">(</span><span class="identifier">B</span><span class="paren">)</span>, <span class="identifier">data</span><span class="operator">=</span><span class="identifier">trees</span><span class="paren">)</span> <span class="comment">##manova test</span>
<span class="identifier">summary</span><span class="paren">(</span><span class="identifier">bird.manova</span><span class="paren">)</span> </code></pre>
<pre><code>##              Df  Pillai approx F num Df den Df  Pr(&gt;F)  
## as.factor(B)  1 0.39147   2.2056      7     24 0.07027 .
## Residuals    30                                         
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1</code></pre>
<p>Show univariate results:</p>
<pre class="r"><code class="r"><span class="identifier">summary.aov</span><span class="paren">(</span><span class="identifier">bird.manova</span><span class="paren">)</span></code></pre>
<div id="assumptions-of-manova" class="section level4">
<h4>Assumptions of MANOVA</h4>
<ol>
<li>Normal distribution</li>
<li>Linearity</li>
<li>Homogeneity of variances</li>
<li>Homogeneity of covariances</li>
</ol>
<p><strong>Problem:</strong> Most ecological data is overdispersed, has many 0’s or rare species, unequal sample sizes.<br />
<strong>Solution:</strong> Dissimilarity coefficients, permutation tests</p>
</div>
</div>
<div id="permanova-permutational-multivariate-analysis-of-variance" class="section level3">
<h3>PERMANOVA: Permutational multivariate analysis of variance</h3>
<p>Non-paramentric, based on dissimilarities. Allows for partitioning of variability, similar to ANOVA, allowing for complex design (multiple factors, nested design, interactions, covariates). Uses permutation to compute F-statistic (pseudo-F).<br />
<a href="http://rosetta.ahmedmoustafa.io/permutation/">Interactive app demonstrating permutation tests</a></p>
<p>&nbsp;</p>
<p>Based on Legendre &amp; Anderson (1999, Ecological Monographs) and Anderson (2001, Austral Ecology).</p>
<p><strong>Null hypothesis: Groups do not differ in spread or positioni n multivaraite space.</strong></p>
<div id="transform-or-standardize-data" class="section level4">
<h4>1. Transform or standardize data</h4>
<p>Use square root or proportions to minimize influence of most abundant groups.</p>
<pre class="r"><code class="r"><span class="identifier">bird.mat</span><span class="operator">&lt;-</span><span class="identifier">sqrt</span><span class="paren">(</span><span class="identifier">bird.matrix</span><span class="paren">)</span><span class="comment">#square root transform</span>
<span class="comment">#bird.prop&lt;-decostand(bird.matrix, method="total")</span></code></pre>
</div>
<div id="calculate-ecological-resemblance" class="section level4">
<h4>2. Calculate ecological resemblance</h4>
<p>Quantify pairwise compositional dissimilarity between sites based on species occurances.<br />
&#8211; Bray-Curtis dissimilarity (abundance weighted)<br />
&#8211; Jaccard (presence/absence)<br />
&#8211; Gower’s (non-continuous variables)</p>
<p><strong>Dissimilarity: 0 = sites are indentical, 1 = sites do not share any species</strong></p>
<p>Create a dissimilarity matrix:</p>
<pre class="r"><code class="r"><span class="identifier">bird.dist</span><span class="operator">&lt;-</span><span class="identifier">vegdist</span><span class="paren">(</span><span class="identifier">bird.mat</span>, <span class="identifier">method</span><span class="operator">=</span><span class="string">'bray'</span><span class="paren">)</span></code></pre>
</div>
<div id="permanova" class="section level4">
<h4>3. perMANOVA</h4>
<p>Do monoculture and polyculture plots differ in feeding guild composition?</p>
<pre class="r"><code class="r"><span class="identifier">set.seed</span><span class="paren">(</span><span class="number">36</span><span class="paren">)</span> <span class="comment">#reproducible results</span>

<span class="identifier">bird.div</span><span class="operator">&lt;-</span><span class="identifier">adonis2</span><span class="paren">(</span><span class="identifier">bird.dist</span><span class="operator">~</span><span class="identifier">DIVERSITY</span>, <span class="identifier">data</span><span class="operator">=</span><span class="identifier">birds</span>, <span class="identifier">permutations</span> <span class="operator">=</span> <span class="number">999</span>, <span class="identifier">method</span><span class="operator">=</span><span class="string">"bray"</span>, <span class="identifier">strata</span><span class="operator">=</span><span class="string">"PLOT"</span><span class="paren">)</span>
<span class="identifier">bird.div</span></code></pre>
<pre><code>## Permutation test for adonis under reduced model
## Terms added sequentially (first to last)
## Permutation: free
## Number of permutations: 999
## 
## adonis2(formula = bird.dist ~ DIVERSITY, data = birds, permutations = 999, method = "bray", strata = "PLOT")
##           Df SumOfSqs      F Pr(&gt;F)   
## DIVERSITY  1  0.32857 4.1585  0.008 **
## Residual  30  2.37033                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1</code></pre>
<p>strata = ‘exchangeable units’ for permutation. Important for nested design.</p>
</div>
<div id="multivariate-dispersion" class="section level4">
<h4>4. Multivariate dispersion</h4>
<p>The average distance to group centroid. Used as a measure of multivariate beta diversity.</p>
<pre class="r"><code class="r"><span class="identifier">dispersion</span><span class="operator">&lt;-</span><span class="identifier">betadisper</span><span class="paren">(</span><span class="identifier">bird.dist</span>, <span class="identifier">group</span><span class="operator">=</span><span class="identifier">birds</span><span class="operator">$</span><span class="identifier">DIVERSITY</span><span class="paren">)</span>
<span class="identifier">permutest</span><span class="paren">(</span><span class="identifier">dispersion</span><span class="paren">)</span></code></pre>
<pre><code>## 
## Permutation test for homogeneity of multivariate dispersions
## Permutation: free
## Number of permutations: 999
## 
## Response: Distances
##           Df  Sum Sq   Mean Sq      F N.Perm Pr(&gt;F)
## Groups     1 0.00369 0.0036924 0.2231    999  0.638
## Residuals 30 0.49659 0.0165530</code></pre>
<pre class="r"><code class="r"><span class="identifier">plot</span><span class="paren">(</span><span class="identifier">dispersion</span>, <span class="identifier">hull</span><span class="operator">=</span><span class="literal">FALSE</span>, <span class="identifier">ellipse</span><span class="operator">=</span><span class="literal">TRUE</span><span class="paren">)</span> <span class="comment">##sd ellipse</span></code></pre>
<p>&nbsp;</p>
</div>
</div>
<div id="nmds" class="section level3">
<h3>NMDS</h3>
<p>Non-metric multi-dimensional scaling. Unconstrained ordination. See (<a class="uri" href="https://jonlefcheck.net/2012/10/24/nmds-tutorial-in-r/">https://jonlefcheck.net/2012/10/24/nmds-tutorial-in-r/</a>).<br />
The goal of NMDS is to represent the original position of communities in multidimensional space as accurately as possible using a reduced number of dimensions that can be easily visualized. NMDS uses rank orders to preserve distances among objects thus can accomodate a variety of data types.</p>
<p>&nbsp;</p>
<p>Configure samples in 2-dimensional space:</p>
<pre class="r"><code class="r"><span class="identifier">birdMDS</span><span class="operator">&lt;-</span><span class="identifier">metaMDS</span><span class="paren">(</span><span class="identifier">bird.mat</span>, <span class="identifier">distance</span><span class="operator">=</span><span class="string">"bray"</span>, <span class="identifier">k</span><span class="operator">=</span><span class="number">2</span>, <span class="identifier">trymax</span><span class="operator">=</span><span class="number">35</span>, <span class="identifier">autotransform</span><span class="operator">=</span><span class="literal">TRUE</span><span class="paren">)</span> <span class="comment">##k is the number of dimensions</span>
<span class="identifier">birdMDS</span> <span class="comment">##metaMDS takes eaither a distance matrix or your community matrix (then requires method for 'distance=')</span>

<span class="identifier">stressplot</span><span class="paren">(</span><span class="identifier">birdMDS</span><span class="paren">)</span></code></pre>
<p>Stress: similarity of observed distance to ordination distance. &lt; 0.15 to indidates acceptable fit.</p>
<pre class="r"><code class="r"><span class="identifier">install.packages</span><span class="paren">(</span><span class="string">'ggplot2'</span><span class="paren">)</span> <span class="comment">##plotting package</span>
<span class="keyword">library</span><span class="paren">(</span><span class="identifier">ggplot2</span><span class="paren">)</span>

<span class="comment">##pull points from MDS</span>
<span class="identifier">NMDS1</span> <span class="operator">&lt;-</span> <span class="identifier">birdMDS</span><span class="operator">$</span><span class="identifier">points</span><span class="paren">[</span>,<span class="number">1</span><span class="paren">]</span> <span class="comment">##also found using scores(birdMDS)</span>
<span class="identifier">NMDS2</span> <span class="operator">&lt;-</span> <span class="identifier">birdMDS</span><span class="operator">$</span><span class="identifier">points</span><span class="paren">[</span>,<span class="number">2</span><span class="paren">]</span>
<span class="identifier">bird.plot</span><span class="operator">&lt;-</span><span class="identifier">cbind</span><span class="paren">(</span><span class="identifier">birds</span>, <span class="identifier">NMDS1</span>, <span class="identifier">NMDS2</span>, <span class="identifier">trees</span><span class="paren">)</span>

<span class="comment">#plot ordination</span>
<span class="identifier">p</span><span class="operator">&lt;-</span><span class="identifier">ggplot</span><span class="paren">(</span><span class="identifier">bird.plot</span>, <span class="identifier">aes</span><span class="paren">(</span><span class="identifier">NMDS1</span>, <span class="identifier">NMDS2</span>, <span class="identifier">color</span><span class="operator">=</span><span class="identifier">DIVERSITY</span><span class="paren">)</span><span class="paren">)</span><span class="operator">+</span>
  <span class="identifier">geom_point</span><span class="paren">(</span><span class="identifier">position</span><span class="operator">=</span><span class="identifier">position_jitter</span><span class="paren">(</span><span class="number">.1</span><span class="paren">)</span>, <span class="identifier">shape</span><span class="operator">=</span><span class="number">3</span><span class="paren">)</span><span class="operator">+</span><span class="comment">##separates overlapping points</span>
  <span class="identifier">stat_ellipse</span><span class="paren">(</span><span class="identifier">type</span><span class="operator">=</span><span class="string">'t'</span>,<span class="identifier">size</span> <span class="operator">=</span><span class="number">1</span><span class="paren">)</span><span class="operator">+</span> <span class="comment">##draws 95% confidence interval ellipses</span>
  <span class="identifier">theme_minimal</span><span class="paren">(</span><span class="paren">)</span>
<span class="identifier">p</span></code></pre>
<p>&nbsp;</p>
<div id="add-labels-for-tree-species-composition" class="section level4">
<h4>Add labels for tree species composition:</h4>
<pre class="r"><code class="r"><span class="comment">#plot ordination</span>
<span class="identifier">p</span><span class="operator">&lt;-</span><span class="identifier">ggplot</span><span class="paren">(</span><span class="identifier">bird.plot</span>, <span class="identifier">aes</span><span class="paren">(</span><span class="identifier">NMDS1</span>, <span class="identifier">NMDS2</span>, <span class="identifier">color</span><span class="operator">=</span><span class="identifier">DIVERSITY</span><span class="paren">)</span><span class="paren">)</span><span class="operator">+</span>
  <span class="identifier">stat_ellipse</span><span class="paren">(</span><span class="identifier">type</span><span class="operator">=</span><span class="string">'t'</span>,<span class="identifier">size</span> <span class="operator">=</span><span class="number">1</span><span class="paren">)</span><span class="operator">+</span>
  <span class="identifier">theme_minimal</span><span class="paren">(</span><span class="paren">)</span><span class="operator">+</span><span class="identifier">geom_text</span><span class="paren">(</span><span class="identifier">data</span><span class="operator">=</span><span class="identifier">bird.plot</span>,<span class="identifier">aes</span><span class="paren">(</span><span class="identifier">NMDS1</span>, <span class="identifier">NMDS2</span>, <span class="identifier">label</span><span class="operator">=</span><span class="identifier">comp</span><span class="paren">)</span>, <span class="identifier">position</span><span class="operator">=</span><span class="identifier">position_jitter</span><span class="paren">(</span><span class="number">.35</span><span class="paren">)</span><span class="paren">)</span><span class="operator">+</span>
  <span class="identifier">annotate</span><span class="paren">(</span><span class="string">"text"</span>, <span class="identifier">x</span><span class="operator">=</span><span class="identifier">min</span><span class="paren">(</span><span class="identifier">NMDS1</span><span class="paren">)</span>, <span class="identifier">y</span><span class="operator">=</span><span class="identifier">min</span><span class="paren">(</span><span class="identifier">NMDS2</span><span class="paren">)</span>, <span class="identifier">label</span><span class="operator">=</span><span class="identifier">paste</span><span class="paren">(</span><span class="string">'Stress ='</span>,<span class="identifier">round</span><span class="paren">(</span><span class="identifier">birdMDS</span><span class="operator">$</span><span class="identifier">stress</span>,<span class="number">3</span><span class="paren">)</span><span class="paren">)</span><span class="paren">)</span> <span class="comment">#add stress to plot</span>
<span class="identifier">p</span></code></pre>
<p>&nbsp;</p>
</div>
<div id="fit-vectors-to-ordination" class="section level4">
<h4>Fit vectors to ordination</h4>
<p>Which envornmental variables are correlated with the ordination?</p>
<pre class="r"><code class="r"><span class="identifier">fit</span><span class="operator">&lt;-</span><span class="identifier">envfit</span><span class="paren">(</span><span class="identifier">birdMDS</span>, <span class="identifier">bird.mat</span><span class="paren">)</span>
<span class="identifier">arrow</span><span class="operator">&lt;-</span><span class="identifier">data.frame</span><span class="paren">(</span><span class="identifier">fit</span><span class="operator">$</span><span class="identifier">vectors</span><span class="operator">$</span><span class="identifier">arrows</span>,<span class="identifier">R</span> <span class="operator">=</span> <span class="identifier">fit</span><span class="operator">$</span><span class="identifier">vectors</span><span class="operator">$</span><span class="identifier">r</span>, <span class="identifier">P</span> <span class="operator">=</span> <span class="identifier">fit</span><span class="operator">$</span><span class="identifier">vectors</span><span class="operator">$</span><span class="identifier">pvals</span><span class="paren">)</span>
<span class="identifier">arrow</span><span class="operator">$</span><span class="identifier">FG</span> <span class="operator">&lt;-</span> <span class="identifier">rownames</span><span class="paren">(</span><span class="identifier">arrow</span><span class="paren">)</span>
<span class="identifier">arrow.p</span><span class="operator">&lt;-</span><span class="identifier">filter</span><span class="paren">(</span><span class="identifier">arrow</span>, <span class="identifier">P</span> <span class="operator">&lt;=</span> <span class="number">0.05</span><span class="paren">)</span>

<span class="identifier">p</span><span class="operator">&lt;-</span><span class="identifier">ggplot</span><span class="paren">(</span><span class="identifier">data</span><span class="operator">=</span><span class="identifier">bird.plot</span>, <span class="identifier">aes</span><span class="paren">(</span><span class="identifier">NMDS1</span>, <span class="identifier">NMDS2</span><span class="paren">)</span><span class="paren">)</span><span class="operator">+</span>
  <span class="identifier">geom_point</span><span class="paren">(</span><span class="identifier">data</span><span class="operator">=</span><span class="identifier">bird.plot</span>, <span class="identifier">aes</span><span class="paren">(</span><span class="identifier">NMDS1</span>, <span class="identifier">NMDS2</span>, <span class="identifier">color</span><span class="operator">=</span><span class="identifier">DIVERSITY</span><span class="paren">)</span>,<span class="identifier">position</span><span class="operator">=</span><span class="identifier">position_jitter</span><span class="paren">(</span><span class="number">.1</span><span class="paren">)</span><span class="paren">)</span><span class="operator">+</span><span class="comment">##separates overlapping points</span>
  <span class="identifier">stat_ellipse</span><span class="paren">(</span><span class="identifier">aes</span><span class="paren">(</span><span class="identifier">fill</span><span class="operator">=</span><span class="identifier">DIVERSITY</span><span class="paren">)</span>, <span class="identifier">alpha</span><span class="operator">=</span><span class="number">.2</span>,<span class="identifier">type</span><span class="operator">=</span><span class="string">'t'</span>,<span class="identifier">size</span> <span class="operator">=</span><span class="number">1</span>, <span class="identifier">geom</span><span class="operator">=</span><span class="string">"polygon"</span><span class="paren">)</span><span class="operator">+</span> <span class="comment">##changes shading on ellipses</span>
  <span class="identifier">theme_minimal</span><span class="paren">(</span><span class="paren">)</span><span class="operator">+</span>
  <span class="identifier">geom_segment</span><span class="paren">(</span><span class="identifier">data</span><span class="operator">=</span><span class="identifier">arrow.p</span>, <span class="identifier">aes</span><span class="paren">(</span><span class="identifier">x</span><span class="operator">=</span><span class="number">0</span>, <span class="identifier">y</span><span class="operator">=</span><span class="number">0</span>, <span class="identifier">xend</span><span class="operator">=</span><span class="identifier">NMDS1</span>, <span class="identifier">yend</span><span class="operator">=</span><span class="identifier">NMDS2</span>, <span class="identifier">label</span><span class="operator">=</span><span class="identifier">FG</span>, <span class="identifier">lty</span><span class="operator">=</span><span class="identifier">FG</span><span class="paren">)</span>, <span class="identifier">arrow</span><span class="operator">=</span><span class="identifier">arrow</span><span class="paren">(</span><span class="identifier">length</span><span class="operator">=</span><span class="identifier">unit</span><span class="paren">(</span><span class="number">.2</span>, <span class="string">"cm"</span><span class="paren">)</span><span class="operator">*</span><span class="identifier">arrow.p</span><span class="operator">$</span><span class="identifier">R</span><span class="paren">)</span><span class="paren">)</span> <span class="comment">##add arrows (scaled by R-squared value)</span>

<span class="identifier">p</span></code></pre>
<p>&nbsp;</p>
<p>Show gradient of insectivores:</p>
<pre class="r"><code class="r"><span class="identifier">ordisurf</span><span class="paren">(</span><span class="identifier">birdMDS</span>, <span class="identifier">bird.mat</span><span class="paren">[</span>,<span class="string">'IN'</span><span class="paren">]</span>, <span class="identifier">bubble</span><span class="operator">=</span><span class="literal">TRUE</span><span class="paren">)</span><span class="comment">##bubble size reflects abundance of insectivores</span></code></pre>
<p>&nbsp;</p>
<pre><code>## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## y ~ s(x1, x2, k = 10, bs = "tp", fx = FALSE)
## 
## Estimated degrees of freedom:
## 5.6  total = 6.6 
## 
## REML score: 35.72947</code></pre>
<p>Resources:<br />
<a href="https://sites.google.com/site/mb3gustame/">GUSTA ME</a> &#8211; Provides several ‘wizards’ in choosing the correct statistical test, walkthrough examples of multivariate analyses, and guide to the major types of analyses.</p>
</div>
</div>
</div>
]]></content:encoded>
			<wfw:commentRss>https://www.biofacebook.com/?feed=rss2&#038;p=1420</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Correlation tests, correlation matrix, and corresponding visualization methods in R (forward)</title>
		<link>https://www.biofacebook.com/?p=1373</link>
		<comments>https://www.biofacebook.com/?p=1373#comments</comments>
		<pubDate>Wed, 26 Jun 2019 04:46:28 +0000</pubDate>
		<dc:creator><![CDATA[szypanther]]></dc:creator>
				<category><![CDATA[R语言]]></category>
		<category><![CDATA[统计学习]]></category>

		<guid isPermaLink="false">http://www.biofacebook.com/?p=1373</guid>
		<description><![CDATA[<p>https://rstudio-pubs-static.s3.amazonaws.com/240657_5157ff98e8204c358b2118fa69162e18.html</p> Correlation tests, correlation matrix, and corresponding visualization methods in R Igor Hut 12 January, 2017 Install and load required R packages Methods for correlation analyses Compute correlation in R R functions Preliminary considerations Preleminary test to check the test assumptions Pearson correlation test Kendall rank correlation test Spearman rank correlation coefficient How to interpret [...]]]></description>
				<content:encoded><![CDATA[<p>https://rstudio-pubs-static.s3.amazonaws.com/240657_5157ff98e8204c358b2118fa69162e18.html</p>
<div id="header" class="fluid-row">
<h1 class="title toc-ignore">Correlation tests, correlation matrix, and corresponding visualization methods in R</h1>
<h4 class="author"><em>Igor Hut</em></h4>
<h4 class="date"><em>12 January, 2017</em></h4>
</div>
<div id="TOC">
<ul>
<li><a href="https://rstudio-pubs-static.s3.amazonaws.com/240657_5157ff98e8204c358b2118fa69162e18.html#install-and-load-required-r-packages">Install and load required R packages</a></li>
<li><a href="https://rstudio-pubs-static.s3.amazonaws.com/240657_5157ff98e8204c358b2118fa69162e18.html#methods-for-correlation-analyses">Methods for correlation analyses</a></li>
<li><a href="https://rstudio-pubs-static.s3.amazonaws.com/240657_5157ff98e8204c358b2118fa69162e18.html#compute-correlation-in-r">Compute correlation in R</a>
<ul>
<li><a href="https://rstudio-pubs-static.s3.amazonaws.com/240657_5157ff98e8204c358b2118fa69162e18.html#r-functions">R functions</a></li>
</ul>
</li>
<li><a href="https://rstudio-pubs-static.s3.amazonaws.com/240657_5157ff98e8204c358b2118fa69162e18.html#preliminary-considerations">Preliminary considerations</a>
<ul>
<li><a href="https://rstudio-pubs-static.s3.amazonaws.com/240657_5157ff98e8204c358b2118fa69162e18.html#preleminary-test-to-check-the-test-assumptions">Preleminary test to check the test assumptions</a></li>
</ul>
</li>
<li><a href="https://rstudio-pubs-static.s3.amazonaws.com/240657_5157ff98e8204c358b2118fa69162e18.html#pearson-correlation-test">Pearson correlation test</a></li>
<li><a href="https://rstudio-pubs-static.s3.amazonaws.com/240657_5157ff98e8204c358b2118fa69162e18.html#kendall-rank-correlation-test">Kendall rank correlation test</a></li>
<li><a href="https://rstudio-pubs-static.s3.amazonaws.com/240657_5157ff98e8204c358b2118fa69162e18.html#spearman-rank-correlation-coefficient">Spearman rank correlation coefficient</a></li>
<li><a href="https://rstudio-pubs-static.s3.amazonaws.com/240657_5157ff98e8204c358b2118fa69162e18.html#how-to-interpret-correlation-coefficient">How to interpret correlation coefficient</a></li>
<li><a href="https://rstudio-pubs-static.s3.amazonaws.com/240657_5157ff98e8204c358b2118fa69162e18.html#what-is-a-correlation-matrix">What is a correlation matrix?</a></li>
<li><a href="https://rstudio-pubs-static.s3.amazonaws.com/240657_5157ff98e8204c358b2118fa69162e18.html#compute-correlation-matrix-in-r">Compute correlation matrix in R</a>
<ul>
<li><a href="https://rstudio-pubs-static.s3.amazonaws.com/240657_5157ff98e8204c358b2118fa69162e18.html#plain-correlation-matrix">Plain correlation matrix</a></li>
<li><a href="https://rstudio-pubs-static.s3.amazonaws.com/240657_5157ff98e8204c358b2118fa69162e18.html#correlation-matrix-with-significance-levels-p-value">Correlation matrix with significance levels (p-value)</a></li>
</ul>
</li>
<li><a href="https://rstudio-pubs-static.s3.amazonaws.com/240657_5157ff98e8204c358b2118fa69162e18.html#custom-function-for-convinient-formatting-of-the-correlation-matrix">Custom function for convinient formatting of the correlation matrix</a></li>
<li><a href="https://rstudio-pubs-static.s3.amazonaws.com/240657_5157ff98e8204c358b2118fa69162e18.html#visualization-of-a-correlation-matrix">Visualization of a correlation matrix</a>
<ul>
<li><a href="https://rstudio-pubs-static.s3.amazonaws.com/240657_5157ff98e8204c358b2118fa69162e18.html#use-symnum-function-symbolic-number-coding">Use <code>symnum()</code> function: Symbolic number coding</a></li>
<li><a href="https://rstudio-pubs-static.s3.amazonaws.com/240657_5157ff98e8204c358b2118fa69162e18.html#use-the-corrplot-function-draw-a-correlogram">Use the <code>corrplot()</code> function: Draw a correlogram</a>
<ul>
<li><a href="https://rstudio-pubs-static.s3.amazonaws.com/240657_5157ff98e8204c358b2118fa69162e18.html#correlogram-layouts">Correlogram layouts:</a></li>
<li><a href="https://rstudio-pubs-static.s3.amazonaws.com/240657_5157ff98e8204c358b2118fa69162e18.html#reordering-the-correlation-matrix">Reordering the correlation matrix</a></li>
<li><a href="https://rstudio-pubs-static.s3.amazonaws.com/240657_5157ff98e8204c358b2118fa69162e18.html#changing-the-color-and-direction-of-text-labels-in-the-correlogram">Changing the color and direction of text labels in the correlogram</a></li>
<li><a href="https://rstudio-pubs-static.s3.amazonaws.com/240657_5157ff98e8204c358b2118fa69162e18.html#combining-correlogram-with-the-significance-test">Combining correlogram with the significance test</a></li>
<li><a href="https://rstudio-pubs-static.s3.amazonaws.com/240657_5157ff98e8204c358b2118fa69162e18.html#fine-tuning-customization-of-the-correlogram">Fine tuning customization of the correlogram</a></li>
</ul>
</li>
<li><a href="https://rstudio-pubs-static.s3.amazonaws.com/240657_5157ff98e8204c358b2118fa69162e18.html#use-chart.correlation-draw-scatter-plots">Use <code>chart.Correlation()</code>: Draw scatter plots</a></li>
<li><a href="https://rstudio-pubs-static.s3.amazonaws.com/240657_5157ff98e8204c358b2118fa69162e18.html#use-heatmap">Use <code>heatmap()</code></a></li>
</ul>
</li>
</ul>
</div>
<p><strong>The following content is mostly compiled (with some original additions on my side) from the material that can be found at<a class="uri" href="http://www.sthda.com/">http://www.sthda.com/</a>, as well as in the vignette for the <code>corrplot</code> R package &#8211; <a href="https://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html">An Introduction to corrplot Package</a>. The sole purpose of this text is to put all the info into one document in an easy to search format. Since I’m a huge fan of Hadley Wickham’s work I’ll insist on solutions based in <a href="https://blog.rstudio.org/2016/09/15/tidyverse-1-0-0/">“tidyverse”</a> whenewer possible…</strong></p>
<div id="install-and-load-required-r-packages" class="section level3">
<h3>Install and load required R packages</h3>
<p>We’ll use the <code>ggpubr</code> R package for an easy ggplot2-based data visualization, <code>corrplot</code> package to plot correlograms, <code>Hmisc</code> to calculate correlation matrices containing both cor. coefs. and p-values,<code>corrplot</code> for plotting correlograms, and of course <code>tidyverse</code> for all the data wrangling, plotting and alike:</p>
<pre class="r"><code class="r"><span class="keyword">require</span><span class="paren">(</span><span class="identifier">ggpubr</span><span class="paren">)</span>
<span class="keyword">require</span><span class="paren">(</span><span class="identifier">tidyverse</span><span class="paren">)</span>
<span class="keyword">require</span><span class="paren">(</span><span class="identifier">Hmisc</span><span class="paren">)</span>
<span class="keyword">require</span><span class="paren">(</span><span class="identifier">corrplot</span><span class="paren">)</span></code></pre>
</div>
<div id="methods-for-correlation-analyses" class="section level2">
<h2>Methods for correlation analyses</h2>
<p>There are different methods to perform correlation analysis:</p>
<ul>
<li><strong>Pearson correlation (r)</strong>, which measures a linear dependence between two variables (x and y). It’s also known as a parametric correlation test because it depends to the distribution of the data. It can be used only when x and y are from normal distribution. The plot of <em>y = f(x)</em> is named the <em>linear regression curve</em>.</li>
<li><strong>Kendall <span class="math inline"><span id="MathJax-Element-1-Frame" class="MathJax" style="box-sizing: border-box; display: inline; font-style: normal; font-weight: normal; line-height: normal; font-size: 14px; text-indent: 0px; text-align: left; text-transform: none; letter-spacing: normal; word-spacing: normal; word-wrap: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px; border: 0px; padding: 0px; margin: 0px; position: relative;" tabindex="0" data-mathml="&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;mi&gt;&amp;#x03C4;&lt;/mi&gt;&lt;/math&gt;"><span id="MathJax-Span-1" class="math"><span id="MathJax-Span-2" class="mrow"><span id="MathJax-Span-3" class="mi">τ</span></span></span><span class="MJX_Assistive_MathML">τ</span></span></span></strong> and <strong>Spearman <span class="math inline"><span id="MathJax-Element-2-Frame" class="MathJax" style="box-sizing: border-box; display: inline; font-style: normal; font-weight: normal; line-height: normal; font-size: 14px; text-indent: 0px; text-align: left; text-transform: none; letter-spacing: normal; word-spacing: normal; word-wrap: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px; border: 0px; padding: 0px; margin: 0px; position: relative;" tabindex="0" data-mathml="&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;mi&gt;&amp;#x03C1;&lt;/mi&gt;&lt;/math&gt;"><span id="MathJax-Span-4" class="math"><span id="MathJax-Span-5" class="mrow"><span id="MathJax-Span-6" class="mi">ρ</span></span></span><span class="MJX_Assistive_MathML">ρ</span></span></span></strong>, which are rank-based correlation coefficients (non-parametric)</li>
<li><strong>The most commonly used method is the Pearson correlation method</strong></li>
</ul>
</div>
<div id="compute-correlation-in-r" class="section level2">
<h2>Compute correlation in R</h2>
<div id="r-functions" class="section level3">
<h3>R functions</h3>
<p>Correlation coefficients can be computed in R by using the functions <code>cor()</code> and <code>cor.test()</code>:</p>
<blockquote>
<ul>
<li><code>cor()</code> computes the correlation coefficient</li>
<li><code>cor.test()</code> test for association/correlation between paired samples. It returns both the correlation coefficient and the significance level(or p-value) of the correlation.</li>
</ul>
</blockquote>
<p><em>The simplified formats are:</em></p>
<pre class="r"><code class="r"><span class="identifier">cor</span><span class="paren">(</span><span class="identifier">x</span>, <span class="identifier">y</span>, <span class="identifier">method</span> <span class="operator">=</span> <span class="identifier">c</span><span class="paren">(</span><span class="string">"pearson"</span>, <span class="string">"kendall"</span>, <span class="string">"spearman"</span><span class="paren">)</span><span class="paren">)</span>
<span class="identifier">cor.test</span><span class="paren">(</span><span class="identifier">x</span>, <span class="identifier">y</span>, <span class="identifier">method</span><span class="operator">=</span><span class="identifier">c</span><span class="paren">(</span><span class="string">"pearson"</span>, <span class="string">"kendall"</span>, <span class="string">"spearman"</span><span class="paren">)</span><span class="paren">)</span></code></pre>
<p>where:</p>
<blockquote>
<ul>
<li>x, y: numeric vectors with the same length</li>
<li>method: correlation method</li>
</ul>
</blockquote>
<p><strong>If the data contain missing values, the following R code can be used to handle missing values by case-wise deletion:</strong></p>
<pre class="r"><code class="r"><span class="identifier">cor</span><span class="paren">(</span><span class="identifier">x</span>, <span class="identifier">y</span>,  <span class="identifier">method</span> <span class="operator">=</span> <span class="string">"pearson"</span>, <span class="identifier">use</span> <span class="operator">=</span> <span class="string">"complete.obs"</span><span class="paren">)</span></code></pre>
</div>
</div>
<div id="preliminary-considerations" class="section level2">
<h2>Preliminary considerations</h2>
<p>We’ll use the well known built-in <code>mtcars</code> R dataset.</p>
<pre class="r"><code class="r"><span class="identifier">head</span><span class="paren">(</span><span class="identifier">mtcars</span><span class="paren">)</span></code></pre>
<pre><code>##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1</code></pre>
<blockquote><p>We’d like to compute the correlation between <code>mpg</code> and <code>wt</code> variables.</p></blockquote>
<p><strong>First let’s visualise our data by the means of a scatter plot. We’ll be using <code>ggpubr</code> R package</strong></p>
<pre class="r"><code class="r"><span class="keyword">library</span><span class="paren">(</span><span class="identifier">ggpubr</span><span class="paren">)</span>

<span class="identifier">my_data</span> <span class="operator">&lt;-</span> <span class="identifier">mtcars</span>
<span class="identifier">my_data</span><span class="operator">$</span><span class="identifier">cyl</span> <span class="operator">&lt;-</span> <span class="identifier">factor</span><span class="paren">(</span><span class="identifier">my_data</span><span class="operator">$</span><span class="identifier">cyl</span><span class="paren">)</span>
<span class="identifier">str</span><span class="paren">(</span><span class="identifier">my_data</span><span class="paren">)</span></code></pre>
<pre><code>## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...</code></pre>
<pre class="r"><code class="r"><span class="identifier">ggscatter</span><span class="paren">(</span><span class="identifier">my_data</span>, <span class="identifier">x</span> <span class="operator">=</span> <span class="string">"wt"</span>, <span class="identifier">y</span> <span class="operator">=</span> <span class="string">"mpg"</span>,
          <span class="identifier">add</span> <span class="operator">=</span> <span class="string">"reg.line"</span>, <span class="identifier">conf.int</span> <span class="operator">=</span> <span class="literal">TRUE</span>,
          <span class="identifier">cor.coef</span> <span class="operator">=</span> <span class="literal">TRUE</span>, <span class="identifier">cor.method</span> <span class="operator">=</span> <span class="string">"pearson"</span>,
          <span class="identifier">xlab</span> <span class="operator">=</span> <span class="string">"Weight (1000 lbs)"</span>, <span class="identifier">ylab</span> <span class="operator">=</span> <span class="string">"Miles/ (US) gallon"</span><span class="paren">)</span></code></pre>
<p>&nbsp;</p>
<div id="preleminary-test-to-check-the-test-assumptions" class="section level3">
<h3>Preleminary test to check the test assumptions</h3>
<ol>
<li>Is the relation between variables linear? Yes, from the plot above, the relationship can be, closely enough, modeled as linear. In the situation where the scatter plots show curved patterns, we are dealing with nonlinear association between the two variables.</li>
<li>Are the data from each of the 2 variables (x, y) following a normal distribution?
<ul>
<li>Use <em>Shapiro-Wilk</em> normality test <span class="math inline"><span id="MathJax-Element-3-Frame" class="MathJax" style="box-sizing: border-box; display: inline; font-style: normal; font-weight: normal; line-height: normal; font-size: 14px; text-indent: 0px; text-align: left; text-transform: none; letter-spacing: normal; word-spacing: normal; word-wrap: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px; border: 0px; padding: 0px; margin: 0px; position: relative;" tabindex="0" data-mathml="&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;mo stretchy=&quot;false&quot;&gt;&amp;#x2192;&lt;/mo&gt;&lt;/math&gt;"><span id="MathJax-Span-7" class="math"><span id="MathJax-Span-8" class="mrow"><span id="MathJax-Span-9" class="mo">→</span></span></span><span class="MJX_Assistive_MathML">→</span></span></span> R function: <code>shapiro.test()</code></li>
<li>and look at the normality plot <span class="math inline"><span id="MathJax-Element-4-Frame" class="MathJax" style="box-sizing: border-box; display: inline; font-style: normal; font-weight: normal; line-height: normal; font-size: 14px; text-indent: 0px; text-align: left; text-transform: none; letter-spacing: normal; word-spacing: normal; word-wrap: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px; border: 0px; padding: 0px; margin: 0px; position: relative;" tabindex="0" data-mathml="&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;mo stretchy=&quot;false&quot;&gt;&amp;#x2192;&lt;/mo&gt;&lt;/math&gt;"><span id="MathJax-Span-10" class="math"><span id="MathJax-Span-11" class="mrow"><span id="MathJax-Span-12" class="mo">→</span></span></span><span class="MJX_Assistive_MathML">→</span></span></span> R function: <code>ggpubr::ggqqplot()</code></li>
</ul>
</li>
</ol>
<ul>
<li><em>Shapiro-Wilk</em> test can be performed as follow:
<ul>
<li>Null hypothesis: the data are normally distributed</li>
<li>Alternative hypothesis: the data are not normally distributed</li>
</ul>
</li>
</ul>
<pre class="r"><code class="r"><span class="comment"># Shapiro-Wilk normality test for mpg</span>
<span class="identifier">shapiro.test</span><span class="paren">(</span><span class="identifier">my_data</span><span class="operator">$</span><span class="identifier">mpg</span><span class="paren">)</span> <span class="comment"># =&gt; p = 0.1229</span></code></pre>
<pre><code>## 
##  Shapiro-Wilk normality test
## 
## data:  my_data$mpg
## W = 0.94756, p-value = 0.1229</code></pre>
<pre class="r"><code class="r"><span class="comment"># Shapiro-Wilk normality test for wt</span>
<span class="identifier">shapiro.test</span><span class="paren">(</span><span class="identifier">my_data</span><span class="operator">$</span><span class="identifier">wt</span><span class="paren">)</span> <span class="comment"># =&gt; p = 0.09</span></code></pre>
<pre><code>## 
##  Shapiro-Wilk normality test
## 
## data:  my_data$wt
## W = 0.94326, p-value = 0.09265</code></pre>
<p><em>As can be seen from the output, the two p-values are greater than the predetermined significance level of 0.05 implying that the distribution of the data are not significantly different from normal distribution. In other words, we can assume the normality.</em></p>
<ul>
<li>One more option for checking the normality of the data distribution is visual inspection of the Q-Q plots (quantile-quantile plots). Q-Q plot draws the correlation between a given sample and the theoretical normal distribution.</li>
</ul>
<p>Again, we’ll use the <code>ggpubr</code> R package to obtain “pretty”, i.e. publishing-ready, Q-Q plots.</p>
<pre class="r"><code class="r"><span class="keyword">library</span><span class="paren">(</span><span class="string">"ggpubr"</span><span class="paren">)</span>
<span class="comment"># Check for the normality of "mpg""</span>
<span class="identifier">ggqqplot</span><span class="paren">(</span><span class="identifier">my_data</span><span class="operator">$</span><span class="identifier">mpg</span>, <span class="identifier">ylab</span> <span class="operator">=</span> <span class="string">"MPG"</span><span class="paren">)</span></code></pre>
<p>&nbsp;</p>
<pre class="r"><code class="r"><span class="comment"># Check for the normality of "wt""</span>
<span class="identifier">ggqqplot</span><span class="paren">(</span><span class="identifier">my_data</span><span class="operator">$</span><span class="identifier">wt</span>, <span class="identifier">ylab</span> <span class="operator">=</span> <span class="string">"WT"</span><span class="paren">)</span></code></pre>
<p>&nbsp;</p>
<p><em>From the Q-Q normality plots, we can assume that both samples may come from populations that, closely enough, follow normal distributions.</em></p>
<p><strong>It is important to note that if the data does not follow the normal distribution, at least closely enough, it’s recommended to use the non-parametric correlation, including Spearman and Kendall rank-based correlation tests.</strong></p>
</div>
</div>
<div id="pearson-correlation-test" class="section level2">
<h2>Pearson correlation test</h2>
<p><strong>Example:</strong></p>
<pre class="r"><code class="r"><span class="identifier">res</span> <span class="operator">&lt;-</span> <span class="identifier">cor.test</span><span class="paren">(</span><span class="identifier">my_data</span><span class="operator">$</span><span class="identifier">wt</span>, <span class="identifier">my_data</span><span class="operator">$</span><span class="identifier">mpg</span>, <span class="identifier">method</span> <span class="operator">=</span> <span class="string">"pearson"</span><span class="paren">)</span>
<span class="identifier">res</span></code></pre>
<pre><code>## 
##  Pearson's product-moment correlation
## 
## data:  my_data$wt and my_data$mpg
## t = -9.559, df = 30, p-value = 1.294e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.9338264 -0.7440872
## sample estimates:
##        cor 
## -0.8676594</code></pre>
<p>So what’s happening here? First of all let’s clarify the meaning of this printout:</p>
<ul>
<li><code>t</code> is the <em>t-test</em> statistic value <code>(t = -9.559)</code>,</li>
<li><code>df</code> is the degrees of freedom <code>(df= 30)</code>,</li>
<li><code>p-value</code> is the significance level of the <em>t-test</em> <code>(p-value =</code> <span class="math inline"><span id="MathJax-Element-5-Frame" class="MathJax" style="box-sizing: border-box; display: inline; font-style: normal; font-weight: normal; line-height: normal; font-size: 14px; text-indent: 0px; text-align: left; text-transform: none; letter-spacing: normal; word-spacing: normal; word-wrap: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px; border: 0px; padding: 0px; margin: 0px; position: relative;" tabindex="0" data-mathml="&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;msup&gt;&lt;mn&gt;1.29410&lt;/mn&gt;&lt;mrow class=&quot;MJX-TeXAtom-ORD&quot;&gt;&lt;mo&gt;&amp;#x2212;&lt;/mo&gt;&lt;mn&gt;10&lt;/mn&gt;&lt;/mrow&gt;&lt;/msup&gt;&lt;/math&gt;"><span id="MathJax-Span-13" class="math"><span id="MathJax-Span-14" class="mrow"><span id="MathJax-Span-15" class="msubsup"><span id="MathJax-Span-16" class="mn">1.29410</span><span id="MathJax-Span-17" class="texatom"><span id="MathJax-Span-18" class="mrow"><span id="MathJax-Span-19" class="mo">−</span><span id="MathJax-Span-20" class="mn">10</span></span></span></span></span></span><span class="MJX_Assistive_MathML">1.29410−10</span></span></span><code>)</code>.</li>
<li><code>conf.int</code> is the <em>confidence interval</em> of the correlation coefficient at 95% <code>(conf.int = [-0.9338, -0.7441])</code>;</li>
<li><code>sample estimates</code> is the <em>correlation coefficient</em> <code>(Cor.coeff = -0.87)</code>.</li>
</ul>
<p><strong>Interpretation of the results:</strong> As can be see from the results above the <em>p-value</em> of the test is <span class="math inline"><span id="MathJax-Element-6-Frame" class="MathJax" style="box-sizing: border-box; display: inline; font-style: normal; font-weight: normal; line-height: normal; font-size: 14px; text-indent: 0px; text-align: left; text-transform: none; letter-spacing: normal; word-spacing: normal; word-wrap: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px; border: 0px; padding: 0px; margin: 0px; position: relative;" tabindex="0" data-mathml="&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;msup&gt;&lt;mn&gt;1.29410&lt;/mn&gt;&lt;mrow class=&quot;MJX-TeXAtom-ORD&quot;&gt;&lt;mo&gt;&amp;#x2212;&lt;/mo&gt;&lt;mn&gt;10&lt;/mn&gt;&lt;/mrow&gt;&lt;/msup&gt;&lt;/math&gt;"><span id="MathJax-Span-21" class="math"><span id="MathJax-Span-22" class="mrow"><span id="MathJax-Span-23" class="msubsup"><span id="MathJax-Span-24" class="mn">1.29410</span><span id="MathJax-Span-25" class="texatom"><span id="MathJax-Span-26" class="mrow"><span id="MathJax-Span-27" class="mo">−</span><span id="MathJax-Span-28" class="mn">10</span></span></span></span></span></span><span class="MJX_Assistive_MathML">1.29410−10</span></span></span>, which is less than the significance level<span class="math inline"><span id="MathJax-Element-7-Frame" class="MathJax" style="box-sizing: border-box; display: inline; font-style: normal; font-weight: normal; line-height: normal; font-size: 14px; text-indent: 0px; text-align: left; text-transform: none; letter-spacing: normal; word-spacing: normal; word-wrap: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px; border: 0px; padding: 0px; margin: 0px; position: relative;" tabindex="0" data-mathml="&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;mi&gt;&amp;#x03B1;&lt;/mi&gt;&lt;mo&gt;=&lt;/mo&gt;&lt;mn&gt;0.05&lt;/mn&gt;&lt;/math&gt;"><span id="MathJax-Span-29" class="math"><span id="MathJax-Span-30" class="mrow"><span id="MathJax-Span-31" class="mi">α</span><span id="MathJax-Span-32" class="mo">=</span><span id="MathJax-Span-33" class="mn">0.05</span></span></span><span class="MJX_Assistive_MathML">α=0.05</span></span></span>. We can conclude that <code>wt</code> and <code>mpg</code> are significantly correlated with a correlation coefficient of -0.87 and <em>p-value</em> of <span class="math inline"><span id="MathJax-Element-8-Frame" class="MathJax" style="box-sizing: border-box; display: inline; font-style: normal; font-weight: normal; line-height: normal; font-size: 14px; text-indent: 0px; text-align: left; text-transform: none; letter-spacing: normal; word-spacing: normal; word-wrap: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px; border: 0px; padding: 0px; margin: 0px; position: relative;" tabindex="0" data-mathml="&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;msup&gt;&lt;mn&gt;1.29410&lt;/mn&gt;&lt;mrow class=&quot;MJX-TeXAtom-ORD&quot;&gt;&lt;mo&gt;&amp;#x2212;&lt;/mo&gt;&lt;mn&gt;10&lt;/mn&gt;&lt;/mrow&gt;&lt;/msup&gt;&lt;/math&gt;"><span id="MathJax-Span-34" class="math"><span id="MathJax-Span-35" class="mrow"><span id="MathJax-Span-36" class="msubsup"><span id="MathJax-Span-37" class="mn">1.29410</span><span id="MathJax-Span-38" class="texatom"><span id="MathJax-Span-39" class="mrow"><span id="MathJax-Span-40" class="mo">−</span><span id="MathJax-Span-41" class="mn">10</span></span></span></span></span></span><span class="MJX_Assistive_MathML">1.29410−10</span></span></span>.</p>
<p><strong>Access to the values returned by <code>cor.test()</code> function</strong></p>
<p>The function <code>cor.test()</code> returns a list containing the following components:</p>
<pre class="r"><code class="r"><span class="identifier">str</span><span class="paren">(</span><span class="identifier">res</span><span class="paren">)</span></code></pre>
<pre><code>## List of 9
##  $ statistic  : Named num -9.56
##   ..- attr(*, "names")= chr "t"
##  $ parameter  : Named int 30
##   ..- attr(*, "names")= chr "df"
##  $ p.value    : num 1.29e-10
##  $ estimate   : Named num -0.868
##   ..- attr(*, "names")= chr "cor"
##  $ null.value : Named num 0
##   ..- attr(*, "names")= chr "correlation"
##  $ alternative: chr "two.sided"
##  $ method     : chr "Pearson's product-moment correlation"
##  $ data.name  : chr "my_data$wt and my_data$mpg"
##  $ conf.int   : atomic [1:2] -0.934 -0.744
##   ..- attr(*, "conf.level")= num 0.95
##  - attr(*, "class")= chr "htest"</code></pre>
<p>Of these we are most interested with:</p>
<ul>
<li><code>p.value</code>: the p-value of the test</li>
<li><code>estimate</code>: the correlation coefficient</li>
</ul>
<pre class="r"><code class="r"><span class="comment"># Extract the p.value</span>
<span class="identifier">res</span><span class="operator">$</span><span class="identifier">p.value</span></code></pre>
<pre><code>## [1] 1.293959e-10</code></pre>
<pre class="r"><code class="r"><span class="comment"># Extract the correlation coefficient</span>
<span class="identifier">res</span><span class="operator">$</span><span class="identifier">estimate</span></code></pre>
<pre><code>##        cor 
## -0.8676594</code></pre>
</div>
<div id="kendall-rank-correlation-test" class="section level2">
<h2>Kendall rank correlation test</h2>
<p><strong>The Kendall rank correlation coefficient</strong> or <strong>Kendall’s <span class="math inline"><span id="MathJax-Element-9-Frame" class="MathJax" style="box-sizing: border-box; display: inline; font-style: normal; font-weight: normal; line-height: normal; font-size: 14px; text-indent: 0px; text-align: left; text-transform: none; letter-spacing: normal; word-spacing: normal; word-wrap: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px; border: 0px; padding: 0px; margin: 0px; position: relative;" tabindex="0" data-mathml="&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;mi&gt;&amp;#x03C4;&lt;/mi&gt;&lt;/math&gt;"><span id="MathJax-Span-42" class="math"><span id="MathJax-Span-43" class="mrow"><span id="MathJax-Span-44" class="mi">τ</span></span></span><span class="MJX_Assistive_MathML">τ</span></span></span> statistic</strong> is used to estimate a rank-based measure of association. This test may be used if the data do not necessarily come from a bivariate normal distribution.</p>
<pre class="r"><code class="r"><span class="identifier">res2</span> <span class="operator">&lt;-</span> <span class="identifier">cor.test</span><span class="paren">(</span><span class="identifier">my_data</span><span class="operator">$</span><span class="identifier">mpg</span>, <span class="identifier">my_data</span><span class="operator">$</span><span class="identifier">wt</span>, <span class="identifier">method</span> <span class="operator">=</span> <span class="string">"kendall"</span><span class="paren">)</span></code></pre>
<pre><code>## Warning in cor.test.default(my_data$mpg, my_data$wt, method = "kendall"):
## Cannot compute exact p-value with ties</code></pre>
<pre class="r"><code class="r"><span class="identifier">res2</span></code></pre>
<pre><code>## 
##  Kendall's rank correlation tau
## 
## data:  my_data$mpg and my_data$wt
## z = -5.7981, p-value = 6.706e-09
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
##        tau 
## -0.7278321</code></pre>
<p>Here <code>tau</code> is the Kendall correlation coefficient, so The correlation coefficient between <code>mpg</code> and <code>wy</code> is -0.7278 and the p-value is <span class="math inline"><span id="MathJax-Element-10-Frame" class="MathJax" style="box-sizing: border-box; display: inline; font-style: normal; font-weight: normal; line-height: normal; font-size: 14px; text-indent: 0px; text-align: left; text-transform: none; letter-spacing: normal; word-spacing: normal; word-wrap: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px; border: 0px; padding: 0px; margin: 0px; position: relative;" tabindex="0" data-mathml="&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;msup&gt;&lt;mn&gt;6.70610&lt;/mn&gt;&lt;mrow class=&quot;MJX-TeXAtom-ORD&quot;&gt;&lt;mo&gt;&amp;#x2212;&lt;/mo&gt;&lt;mn&gt;9&lt;/mn&gt;&lt;/mrow&gt;&lt;/msup&gt;&lt;/math&gt;"><span id="MathJax-Span-45" class="math"><span id="MathJax-Span-46" class="mrow"><span id="MathJax-Span-47" class="msubsup"><span id="MathJax-Span-48" class="mn">6.70610</span><span id="MathJax-Span-49" class="texatom"><span id="MathJax-Span-50" class="mrow"><span id="MathJax-Span-51" class="mo">−</span><span id="MathJax-Span-52" class="mn">9</span></span></span></span></span></span><span class="MJX_Assistive_MathML">6.70610−9</span></span></span>.</p>
</div>
<div id="spearman-rank-correlation-coefficient" class="section level2">
<h2>Spearman rank correlation coefficient</h2>
<p><strong>Spearman’s <span class="math inline"><span id="MathJax-Element-11-Frame" class="MathJax" style="box-sizing: border-box; display: inline; font-style: normal; font-weight: normal; line-height: normal; font-size: 14px; text-indent: 0px; text-align: left; text-transform: none; letter-spacing: normal; word-spacing: normal; word-wrap: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px; border: 0px; padding: 0px; margin: 0px; position: relative;" tabindex="0" data-mathml="&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;mi&gt;&amp;#x03C1;&lt;/mi&gt;&lt;/math&gt;"><span id="MathJax-Span-53" class="math"><span id="MathJax-Span-54" class="mrow"><span id="MathJax-Span-55" class="mi">ρ</span></span></span><span class="MJX_Assistive_MathML">ρ</span></span></span> statistic</strong> is also used to estimate a rank-based measure of association. This test may be used if the data do not come from a bivariate normal distribution.</p>
<pre class="r"><code class="r"><span class="identifier">res3</span> <span class="operator">&lt;-</span> <span class="identifier">cor.test</span><span class="paren">(</span><span class="identifier">my_data</span><span class="operator">$</span><span class="identifier">wt</span>, <span class="identifier">my_data</span><span class="operator">$</span><span class="identifier">mpg</span>, <span class="identifier">method</span> <span class="operator">=</span> <span class="string">"spearman"</span><span class="paren">)</span></code></pre>
<pre><code>## Warning in cor.test.default(my_data$wt, my_data$mpg, method = "spearman"):
## Cannot compute exact p-value with ties</code></pre>
<pre class="r"><code class="r"><span class="identifier">res3</span></code></pre>
<pre><code>## 
##  Spearman's rank correlation rho
## 
## data:  my_data$wt and my_data$mpg
## S = 10292, p-value = 1.488e-11
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## -0.886422</code></pre>
<p>Here, <code>rho</code> is the Spearman’s correlation coefficient, so the correlation coefficient between <code>mpg</code> and <code>wt</code> is -0.8864 and the p-value is <span class="math inline"><span id="MathJax-Element-12-Frame" class="MathJax" style="box-sizing: border-box; display: inline; font-style: normal; font-weight: normal; line-height: normal; font-size: 14px; text-indent: 0px; text-align: left; text-transform: none; letter-spacing: normal; word-spacing: normal; word-wrap: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px; border: 0px; padding: 0px; margin: 0px; position: relative;" tabindex="0" data-mathml="&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;msup&gt;&lt;mn&gt;1.48810&lt;/mn&gt;&lt;mrow class=&quot;MJX-TeXAtom-ORD&quot;&gt;&lt;mo&gt;&amp;#x2212;&lt;/mo&gt;&lt;mn&gt;11&lt;/mn&gt;&lt;/mrow&gt;&lt;/msup&gt;&lt;/math&gt;"><span id="MathJax-Span-56" class="math"><span id="MathJax-Span-57" class="mrow"><span id="MathJax-Span-58" class="msubsup"><span id="MathJax-Span-59" class="mn">1.48810</span><span id="MathJax-Span-60" class="texatom"><span id="MathJax-Span-61" class="mrow"><span id="MathJax-Span-62" class="mo">−</span><span id="MathJax-Span-63" class="mn">11</span></span></span></span></span></span><span class="MJX_Assistive_MathML">1.48810−11</span></span></span>.</p>
</div>
<div id="how-to-interpret-correlation-coefficient" class="section level2">
<h2>How to interpret correlation coefficient</h2>
<p><strong>Value of the correlation coefficient can vary between -1 and 1:</strong></p>
<ul>
<li>-1 indicates a strong negative correlation : this means that every time x increases, y decreases</li>
<li>0 means that there is no association between the two variables (x and y)</li>
<li>1 indicates a strong positive correlation : this means that y increases with x</li>
</ul>
</div>
<div id="what-is-a-correlation-matrix" class="section level2">
<h2>What is a correlation matrix?</h2>
<p>Previously, we described how to perform correlation test between two variables. In the following sections we’ll see how a <strong>correlation matrix</strong> can be computed and visualized. The <strong>correlation matrix</strong> is used to investigate the dependence between multiple variables at the same time. The result is a table containing the correlation coefficients between each variable and the others.</p>
</div>
<div id="compute-correlation-matrix-in-r" class="section level2">
<h2>Compute correlation matrix in R</h2>
<p>We have already mentioned the <code>cor()</code> function, at the intoductory part of this document dealing with the correlation test for a bivariate case. It be used to compute a correlation matrix. A simplified format of the function is :</p>
<pre class="r"><code class="r"><span class="identifier">cor</span><span class="paren">(</span><span class="identifier">x</span>, <span class="identifier">method</span> <span class="operator">=</span> <span class="identifier">c</span><span class="paren">(</span><span class="string">"pearson"</span>, <span class="string">"kendall"</span>, <span class="string">"spearman"</span><span class="paren">)</span><span class="paren">)</span> </code></pre>
<p>Here:</p>
<ul>
<li><code>x</code> is numeric matrix or a data frame.</li>
<li><code>method</code>: indicates the correlation coefficient to be computed. The default is “pearson”&#8221; correlation coefficient which measures the linear dependence between two variables. As already explained “kendall” and “spearman” correlation methods are non-parametric rank-based correlation tests.</li>
</ul>
<p>If your data contain missing values, the following R code can be used to handle missing values by case-wise deletion:</p>
<pre class="r"><code class="r"><span class="identifier">cor</span><span class="paren">(</span><span class="identifier">x</span>, <span class="identifier">method</span> <span class="operator">=</span> <span class="string">"pearson"</span>, <span class="identifier">use</span> <span class="operator">=</span> <span class="string">"complete.obs"</span><span class="paren">)</span></code></pre>
<div id="plain-correlation-matrix" class="section level3">
<h3>Plain correlation matrix</h3>
<p><strong>Example:</strong></p>
<pre class="r"><code class="r"><span class="keyword">library</span><span class="paren">(</span><span class="identifier">dplyr</span><span class="paren">)</span>

<span class="identifier">my_data</span> <span class="operator">&lt;-</span> <span class="identifier">select</span><span class="paren">(</span><span class="identifier">mtcars</span>, <span class="identifier">mpg</span>, <span class="identifier">disp</span>, <span class="identifier">hp</span>, <span class="identifier">drat</span>, <span class="identifier">wt</span>, <span class="identifier">qsec</span><span class="paren">)</span>
<span class="identifier">head</span><span class="paren">(</span><span class="identifier">my_data</span><span class="paren">)</span></code></pre>
<pre><code>##                    mpg disp  hp drat    wt  qsec
## Mazda RX4         21.0  160 110 3.90 2.620 16.46
## Mazda RX4 Wag     21.0  160 110 3.90 2.875 17.02
## Datsun 710        22.8  108  93 3.85 2.320 18.61
## Hornet 4 Drive    21.4  258 110 3.08 3.215 19.44
## Hornet Sportabout 18.7  360 175 3.15 3.440 17.02
## Valiant           18.1  225 105 2.76 3.460 20.22</code></pre>
<pre class="r"><code class="r"><span class="comment">#Let's compute the correlation matrix</span>
<span class="identifier">cor_1</span> <span class="operator">&lt;-</span> <span class="identifier">round</span><span class="paren">(</span><span class="identifier">cor</span><span class="paren">(</span><span class="identifier">my_data</span><span class="paren">)</span>, <span class="number">2</span><span class="paren">)</span>
<span class="identifier">cor_1</span></code></pre>
<pre><code>##        mpg  disp    hp  drat    wt  qsec
## mpg   1.00 -0.85 -0.78  0.68 -0.87  0.42
## disp -0.85  1.00  0.79 -0.71  0.89 -0.43
## hp   -0.78  0.79  1.00 -0.45  0.66 -0.71
## drat  0.68 -0.71 -0.45  1.00 -0.71  0.09
## wt   -0.87  0.89  0.66 -0.71  1.00 -0.17
## qsec  0.42 -0.43 -0.71  0.09 -0.17  1.00</code></pre>
<p>Unfortunately, the function <code>cor()</code> returns only the correlation coefficients between variables. In the next section, we will use <code>Hmisc</code> R package to calculate the correlation <em>p-values</em>.</p>
</div>
<div id="correlation-matrix-with-significance-levels-p-value" class="section level3">
<h3>Correlation matrix with significance levels (p-value)</h3>
<p>The function <code>rcorr()</code> (in <code>Hmisc</code> package) can be used to compute the significance levels for pearson and spearman correlations. It returns both the <em>correlation coefficients</em> and the <em>p-value</em> of the correlation for all possible pairs of columns in the data table.</p>
<p><em>Simplified format:</em></p>
<pre class="r"><code class="r"><span class="identifier">rcorr</span><span class="paren">(</span><span class="identifier">x</span>, <span class="identifier">type</span> <span class="operator">=</span> <span class="identifier">c</span><span class="paren">(</span><span class="string">"pearson"</span>,<span class="string">"spearman"</span><span class="paren">)</span><span class="paren">)</span></code></pre>
<p><em>x</em> should be a matrix. The correlation type can be either <em>pearson</em> or <em>spearman</em>.</p>
<p><strong>Example:</strong></p>
<pre class="r"><code class="r"><span class="keyword">library</span><span class="paren">(</span><span class="string">"Hmisc"</span><span class="paren">)</span>

<span class="identifier">cor_2</span> <span class="operator">&lt;-</span> <span class="identifier">rcorr</span><span class="paren">(</span><span class="identifier">as.matrix</span><span class="paren">(</span><span class="identifier">my_data</span><span class="paren">)</span><span class="paren">)</span>
<span class="identifier">cor_2</span></code></pre>
<pre><code>##        mpg  disp    hp  drat    wt  qsec
## mpg   1.00 -0.85 -0.78  0.68 -0.87  0.42
## disp -0.85  1.00  0.79 -0.71  0.89 -0.43
## hp   -0.78  0.79  1.00 -0.45  0.66 -0.71
## drat  0.68 -0.71 -0.45  1.00 -0.71  0.09
## wt   -0.87  0.89  0.66 -0.71  1.00 -0.17
## qsec  0.42 -0.43 -0.71  0.09 -0.17  1.00
## 
## n= 32 
## 
## 
## P
##      mpg    disp   hp     drat   wt     qsec  
## mpg         0.0000 0.0000 0.0000 0.0000 0.0171
## disp 0.0000        0.0000 0.0000 0.0000 0.0131
## hp   0.0000 0.0000        0.0100 0.0000 0.0000
## drat 0.0000 0.0000 0.0100        0.0000 0.6196
## wt   0.0000 0.0000 0.0000 0.0000        0.3389
## qsec 0.0171 0.0131 0.0000 0.6196 0.3389</code></pre>
<p>The output of the function <code>rcorr()</code> is a list containing the following elements :</p>
<ul>
<li><strong>r</strong> : the correlation matrix</li>
<li><strong>n</strong> : the matrix of the number of observations used in analyzing each pair of variables</li>
<li><strong>P</strong> : the p-values corresponding to the significance levels of correlations.</li>
</ul>
<p>Extracting the p-values or the correlation coefficients from the output:</p>
<pre class="r"><code class="r"><span class="identifier">str</span><span class="paren">(</span><span class="identifier">cor_2</span><span class="paren">)</span></code></pre>
<pre><code>## List of 3
##  $ r: num [1:6, 1:6] 1 -0.848 -0.776 0.681 -0.868 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:6] "mpg" "disp" "hp" "drat" ...
##   .. ..$ : chr [1:6] "mpg" "disp" "hp" "drat" ...
##  $ n: int [1:6, 1:6] 32 32 32 32 32 32 32 32 32 32 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:6] "mpg" "disp" "hp" "drat" ...
##   .. ..$ : chr [1:6] "mpg" "disp" "hp" "drat" ...
##  $ P: num [1:6, 1:6] NA 9.38e-10 1.79e-07 1.78e-05 1.29e-10 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:6] "mpg" "disp" "hp" "drat" ...
##   .. ..$ : chr [1:6] "mpg" "disp" "hp" "drat" ...
##  - attr(*, "class")= chr "rcorr"</code></pre>
<pre class="r"><code class="r"><span class="comment"># As you can see "cor_2" is a list so extracting these values is quite simple...</span>

<span class="comment"># p-values</span>
<span class="identifier">cor_2</span><span class="operator">$</span><span class="identifier">P</span></code></pre>
<pre><code>##               mpg         disp           hp         drat           wt
## mpg            NA 9.380354e-10 1.787838e-07 1.776241e-05 1.293956e-10
## disp 9.380354e-10           NA 7.142686e-08 5.282028e-06 1.222311e-11
## hp   1.787838e-07 7.142686e-08           NA 9.988768e-03 4.145833e-05
## drat 1.776241e-05 5.282028e-06 9.988768e-03           NA 4.784268e-06
## wt   1.293956e-10 1.222311e-11 4.145833e-05 4.784268e-06           NA
## qsec 1.708199e-02 1.314403e-02 5.766250e-06 6.195823e-01 3.388682e-01
##              qsec
## mpg  1.708199e-02
## disp 1.314403e-02
## hp   5.766250e-06
## drat 6.195823e-01
## wt   3.388682e-01
## qsec           NA</code></pre>
<pre class="r"><code class="r"><span class="comment"># Correlation matrix</span>
<span class="identifier">cor_2</span><span class="operator">$</span><span class="identifier">r</span></code></pre>
<pre><code>##             mpg       disp         hp        drat         wt        qsec
## mpg   1.0000000 -0.8475513 -0.7761683  0.68117189 -0.8676594  0.41868404
## disp -0.8475513  1.0000000  0.7909486 -0.71021390  0.8879799 -0.43369791
## hp   -0.7761683  0.7909486  1.0000000 -0.44875914  0.6587479 -0.70822340
## drat  0.6811719 -0.7102139 -0.4487591  1.00000000 -0.7124406  0.09120482
## wt   -0.8676594  0.8879799  0.6587479 -0.71244061  1.0000000 -0.17471591
## qsec  0.4186840 -0.4336979 -0.7082234  0.09120482 -0.1747159  1.00000000</code></pre>
</div>
</div>
<div id="custom-function-for-convinient-formatting-of-the-correlation-matrix" class="section level2">
<h2>Custom function for convinient formatting of the correlation matrix</h2>
<p>This section provides a simple function for formatting a correlation matrix into a table with 4 columns containing :</p>
<ul>
<li>Column 1 : row names (variable 1 for the correlation test)</li>
<li>Column 2 : column names (variable 2 for the correlation test)</li>
<li>Column 3 : the correlation coefficients</li>
<li>Column 4 : the p-values of the correlations</li>
</ul>
<pre class="r"><code class="r"><span class="identifier">flat_cor_mat</span> <span class="operator">&lt;-</span> <span class="keyword">function</span><span class="paren">(</span><span class="identifier">cor_r</span>, <span class="identifier">cor_p</span><span class="paren">)</span><span class="paren">{</span>
  <span class="comment">#This function provides a simple formatting of a correlation matrix</span>
  <span class="comment">#into a table with 4 columns containing :</span>
    <span class="comment"># Column 1 : row names (variable 1 for the correlation test)</span>
    <span class="comment"># Column 2 : column names (variable 2 for the correlation test)</span>
    <span class="comment"># Column 3 : the correlation coefficients</span>
    <span class="comment"># Column 4 : the p-values of the correlations</span>
  <span class="keyword">library</span><span class="paren">(</span><span class="identifier">tidyr</span><span class="paren">)</span>
  <span class="keyword">library</span><span class="paren">(</span><span class="identifier">tibble</span><span class="paren">)</span>
  <span class="identifier">cor_r</span> <span class="operator">&lt;-</span> <span class="identifier">rownames_to_column</span><span class="paren">(</span><span class="identifier">as.data.frame</span><span class="paren">(</span><span class="identifier">cor_r</span><span class="paren">)</span>, <span class="identifier">var</span> <span class="operator">=</span> <span class="string">"row"</span><span class="paren">)</span>
  <span class="identifier">cor_r</span> <span class="operator">&lt;-</span> <span class="identifier">gather</span><span class="paren">(</span><span class="identifier">cor_r</span>, <span class="identifier">column</span>, <span class="identifier">cor</span>, <span class="operator">-</span><span class="number">1</span><span class="paren">)</span>
  <span class="identifier">cor_p</span> <span class="operator">&lt;-</span> <span class="identifier">rownames_to_column</span><span class="paren">(</span><span class="identifier">as.data.frame</span><span class="paren">(</span><span class="identifier">cor_p</span><span class="paren">)</span>, <span class="identifier">var</span> <span class="operator">=</span> <span class="string">"row"</span><span class="paren">)</span>
  <span class="identifier">cor_p</span> <span class="operator">&lt;-</span> <span class="identifier">gather</span><span class="paren">(</span><span class="identifier">cor_p</span>, <span class="identifier">column</span>, <span class="identifier">p</span>, <span class="operator">-</span><span class="number">1</span><span class="paren">)</span>
  <span class="identifier">cor_p_matrix</span> <span class="operator">&lt;-</span> <span class="identifier">left_join</span><span class="paren">(</span><span class="identifier">cor_r</span>, <span class="identifier">cor_p</span>, <span class="identifier">by</span> <span class="operator">=</span> <span class="identifier">c</span><span class="paren">(</span><span class="string">"row"</span>, <span class="string">"column"</span><span class="paren">)</span><span class="paren">)</span>
  <span class="identifier">cor_p_matrix</span>
<span class="paren">}</span>

<span class="identifier">cor_3</span> <span class="operator">&lt;-</span> <span class="identifier">rcorr</span><span class="paren">(</span><span class="identifier">as.matrix</span><span class="paren">(</span><span class="identifier">mtcars</span><span class="paren">[</span>, <span class="number">1</span><span class="operator">:</span><span class="number">7</span><span class="paren">]</span><span class="paren">)</span><span class="paren">)</span>

<span class="identifier">my_cor_matrix</span> <span class="operator">&lt;-</span> <span class="identifier">flat_cor_mat</span><span class="paren">(</span><span class="identifier">cor_3</span><span class="operator">$</span><span class="identifier">r</span>, <span class="identifier">cor_3</span><span class="operator">$</span><span class="identifier">P</span><span class="paren">)</span>
<span class="identifier">head</span><span class="paren">(</span><span class="identifier">my_cor_matrix</span><span class="paren">)</span></code></pre>
<pre><code>##    row column        cor            p
## 1  mpg    mpg  1.0000000           NA
## 2  cyl    mpg -0.8521619 6.112697e-10
## 3 disp    mpg -0.8475513 9.380354e-10
## 4   hp    mpg -0.7761683 1.787838e-07
## 5 drat    mpg  0.6811719 1.776241e-05
## 6   wt    mpg -0.8676594 1.293956e-10</code></pre>
</div>
<div id="visualization-of-a-correlation-matrix" class="section level2">
<h2>Visualization of a correlation matrix</h2>
<p>There are several different ways for visualizing a correlation matrix in R software:</p>
<ul>
<li><code>symnum()</code> function</li>
<li><code>corrplot()</code> function to plot a correlogram</li>
<li>scatter plots</li>
<li>heatmap</li>
</ul>
<p>We’ll run trough all of these, and then go a bit more into deatil with correlograms.</p>
<div id="use-symnum-function-symbolic-number-coding" class="section level3">
<h3>Use <code>symnum()</code> function: Symbolic number coding</h3>
<p>The R function <code>symnum()</code> is used to symbolically encode a given numeric or logical vector or array. It is particularly useful for visualization of structured matrices, e.g., correlation, sparse, or logical ones. In the case of a correlatino matrix it replaces correlation coefficients by symbols according to the level of the correlation.</p>
<p><em>Simplified format:</em></p>
<pre class="r"><code class="r"><span class="identifier">symnum</span><span class="paren">(</span><span class="identifier">x</span>, <span class="identifier">cutpoints</span> <span class="operator">=</span> <span class="identifier">c</span><span class="paren">(</span><span class="number">0.3</span>, <span class="number">0.6</span>, <span class="number">0.8</span>, <span class="number">0.9</span>, <span class="number">0.95</span><span class="paren">)</span>,
       <span class="identifier">symbols</span> <span class="operator">=</span> <span class="identifier">c</span><span class="paren">(</span><span class="string">" "</span>, <span class="string">"."</span>, <span class="string">","</span>, <span class="string">"+"</span>, <span class="string">"*"</span>, <span class="string">"B"</span><span class="paren">)</span>,
       <span class="identifier">abbr.colnames</span> <span class="operator">=</span> <span class="literal">TRUE</span><span class="paren">)</span></code></pre>
<p>Here:</p>
<ul>
<li><strong>x:</strong> the correlation matrix to visualize</li>
<li><strong>cutpoints:</strong> correlation coefficient cutpoints. The correlation coefficients between 0 and 0.3 are replaced by a space (&#8221; “); correlation coefficients between 0.3 and 0.6 are replaced by”.“; etc .</li>
<li><strong>symbols:</strong> the symbols to use.</li>
<li><strong>abbr.colnames:</strong> logical value. If TRUE, colnames are abbreviated.</li>
</ul>
<p><strong>Example:</strong></p>
<pre class="r"><code class="r"><span class="identifier">cor_4</span> <span class="operator">&lt;-</span> <span class="identifier">cor</span><span class="paren">(</span><span class="identifier">mtcars</span><span class="paren">[</span><span class="number">1</span><span class="operator">:</span><span class="number">6</span><span class="paren">]</span><span class="paren">)</span>
<span class="identifier">symnum</span><span class="paren">(</span><span class="identifier">cor_4</span>, <span class="identifier">abbr.colnames</span> <span class="operator">=</span> <span class="literal">FALSE</span><span class="paren">)</span></code></pre>
<pre><code>##      mpg cyl disp hp drat wt
## mpg  1                      
## cyl  +   1                  
## disp +   *   1              
## hp   ,   +   ,    1         
## drat ,   ,   ,    .  1      
## wt   +   ,   +    ,  ,    1 
## attr(,"legend")
## [1] 0 ' ' 0.3 '.' 0.6 ',' 0.8 '+' 0.9 '*' 0.95 'B' 1</code></pre>
<p>*As indicated in the legend, the correlation coefficients between 0 and 0.3 are replaced by a space (&#8221; “); correlation coefficients between 0.3 and 0.6 are replace by”.“; etc .*</p>
</div>
<div id="use-the-corrplot-function-draw-a-correlogram" class="section level3">
<h3>Use the <code>corrplot()</code> function: Draw a correlogram</h3>
<p>The function <code>corrplot()</code>, in the package of the same name, creates a graphical display of a correlation matrix, highlighting the most correlated variables in a data table.</p>
<p>In this plot, correlation coefficients are colored according to the value. Correlation matrix can be also reordered according to the degree of association between variables.</p>
<p><em>The simplified format of the function is:</em></p>
<pre class="r"><code class="r"><span class="identifier">corrplot</span><span class="paren">(</span><span class="identifier">corr</span>, <span class="identifier">method</span><span class="operator">=</span><span class="string">"circle"</span><span class="paren">)</span></code></pre>
<p>Here:</p>
<ul>
<li><strong>corr:</strong> the correlation matrix to be visualized</li>
<li><strong>method:</strong> The visualization method to be used, there are seven different options: “circle”, “square”, “ellipse”, “number”, “shade”, “color”, “pie”.</li>
</ul>
<p><strong>Example:</strong></p>
<pre class="r"><code class="r"><span class="identifier">M</span><span class="operator">&lt;-</span><span class="identifier">cor</span><span class="paren">(</span><span class="identifier">mtcars</span><span class="paren">)</span>
<span class="identifier">head</span><span class="paren">(</span><span class="identifier">round</span><span class="paren">(</span><span class="identifier">M</span>,<span class="number">2</span><span class="paren">)</span><span class="paren">)</span></code></pre>
<pre><code>##        mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
## mpg   1.00 -0.85 -0.85 -0.78  0.68 -0.87  0.42  0.66  0.60  0.48 -0.55
## cyl  -0.85  1.00  0.90  0.83 -0.70  0.78 -0.59 -0.81 -0.52 -0.49  0.53
## disp -0.85  0.90  1.00  0.79 -0.71  0.89 -0.43 -0.71 -0.59 -0.56  0.39
## hp   -0.78  0.83  0.79  1.00 -0.45  0.66 -0.71 -0.72 -0.24 -0.13  0.75
## drat  0.68 -0.70 -0.71 -0.45  1.00 -0.71  0.09  0.44  0.71  0.70 -0.09
## wt   -0.87  0.78  0.89  0.66 -0.71  1.00 -0.17 -0.55 -0.69 -0.58  0.43</code></pre>
<pre class="r"><code class="r"><span class="comment">#Visualize the correlation matrix</span>

<span class="comment"># method = "circle""</span>
<span class="identifier">corrplot</span><span class="paren">(</span><span class="identifier">M</span>, <span class="identifier">method</span> <span class="operator">=</span> <span class="string">"circle"</span><span class="paren">)</span></code></pre>
<p>&nbsp;</p>
<pre class="r"><code class="r"><span class="comment"># method = "ellipse""</span>
<span class="identifier">corrplot</span><span class="paren">(</span><span class="identifier">M</span>, <span class="identifier">method</span> <span class="operator">=</span> <span class="string">"ellipse"</span><span class="paren">)</span></code></pre>
<p>&nbsp;</p>
<pre class="r"><code class="r"><span class="comment"># method = "pie"</span>
<span class="identifier">corrplot</span><span class="paren">(</span><span class="identifier">M</span>, <span class="identifier">method</span> <span class="operator">=</span> <span class="string">"pie"</span><span class="paren">)</span></code></pre>
<p>&nbsp;</p>
<pre class="r"><code class="r"><span class="comment"># method = "color"</span>
<span class="identifier">corrplot</span><span class="paren">(</span><span class="identifier">M</span>, <span class="identifier">method</span> <span class="operator">=</span> <span class="string">"color"</span><span class="paren">)</span></code></pre>
<p>&nbsp;</p>
<p><em>Display the <strong>correlation coefficient</strong>:</em></p>
<pre class="r"><code class="r"><span class="identifier">corrplot</span><span class="paren">(</span><span class="identifier">M</span>, <span class="identifier">method</span> <span class="operator">=</span> <span class="string">"number"</span><span class="paren">)</span></code></pre>
<p>&nbsp;</p>
<div id="correlogram-layouts" class="section level4">
<h4>Correlogram layouts:</h4>
<p>There are three general types of a correlogram layout :</p>
<ul>
<li><strong>“full”</strong> (default) : display full correlation matrix</li>
<li><strong>“upper”</strong>: display upper triangular of the correlation matrix</li>
<li><strong>“lower”</strong>: display lower triangular of the correlation matrix</li>
</ul>
<p><strong>Examples:</strong></p>
<pre class="r"><code class="r"><span class="comment"># upper triangular</span>
<span class="identifier">corrplot</span><span class="paren">(</span><span class="identifier">M</span>, <span class="identifier">type</span> <span class="operator">=</span> <span class="string">"upper"</span><span class="paren">)</span></code></pre>
<p>&nbsp;</p>
<pre class="r"><code class="r"><span class="comment">#lower triangular</span>
<span class="identifier">corrplot</span><span class="paren">(</span><span class="identifier">M</span>, <span class="identifier">type</span> <span class="operator">=</span> <span class="string">"lower"</span><span class="paren">)</span></code></pre>
<p>&nbsp;</p>
</div>
<div id="reordering-the-correlation-matrix" class="section level4">
<h4>Reordering the correlation matrix</h4>
<p>The correlation matrix can be reordered according to the correlation coefficient. This is important to identify the hidden structure and pattern in the matrix. Use <code>order = "hclust"</code> argument for hierarchical clustering of correlation coefficients.</p>
<p><strong>Example:</strong></p>
<pre class="r"><code class="r"><span class="comment"># correlogram with hclust reordering</span>
<span class="identifier">corrplot</span><span class="paren">(</span><span class="identifier">M</span>, <span class="identifier">order</span> <span class="operator">=</span> <span class="string">"hclust"</span><span class="paren">)</span></code></pre>
<p>&nbsp;</p>
<pre class="r"><code class="r"><span class="comment"># or exploit the symetry of the correlation matrix </span>
<span class="comment"># correlogram with hclust reordering</span>
<span class="identifier">corrplot</span><span class="paren">(</span><span class="identifier">M</span>, <span class="identifier">type</span> <span class="operator">=</span> <span class="string">"upper"</span>, <span class="identifier">order</span> <span class="operator">=</span> <span class="string">"hclust"</span><span class="paren">)</span></code></pre>
<p>&nbsp;</p>
</div>
<div id="changing-the-color-and-direction-of-text-labels-in-the-correlogram" class="section level4">
<h4>Changing the color and direction of text labels in the correlogram</h4>
<p><strong>Examples:</strong></p>
<pre class="r"><code class="r"><span class="comment"># Change background color to lightgreen and color of the circles to darkorange and steel blue</span>
<span class="identifier">corrplot</span><span class="paren">(</span><span class="identifier">M</span>, <span class="identifier">type</span> <span class="operator">=</span> <span class="string">"upper"</span>, <span class="identifier">order</span> <span class="operator">=</span> <span class="string">"hclust"</span>, <span class="identifier">col</span> <span class="operator">=</span> <span class="identifier">c</span><span class="paren">(</span><span class="string">"darkorange"</span>, <span class="string">"steelblue"</span><span class="paren">)</span>,
         <span class="identifier">bg</span> <span class="operator">=</span> <span class="string">"lightgreen"</span><span class="paren">)</span></code></pre>
<p>&nbsp;</p>
<pre class="r"><code class="r"><span class="comment"># use "colorRampPallete" to obtain contionus color scales</span>
<span class="identifier">col</span> <span class="operator">&lt;-</span> <span class="identifier">colorRampPalette</span><span class="paren">(</span><span class="identifier">c</span><span class="paren">(</span><span class="string">"darkorange"</span>, <span class="string">"white"</span>, <span class="string">"steelblue"</span><span class="paren">)</span><span class="paren">)</span><span class="paren">(</span><span class="number">20</span><span class="paren">)</span>
<span class="identifier">corrplot</span><span class="paren">(</span><span class="identifier">M</span>, <span class="identifier">type</span> <span class="operator">=</span> <span class="string">"upper"</span>, <span class="identifier">order</span> <span class="operator">=</span> <span class="string">"hclust"</span>, <span class="identifier">col</span> <span class="operator">=</span> <span class="identifier">col</span><span class="paren">)</span></code></pre>
<p>&nbsp;</p>
<pre class="r"><code class="r"><span class="comment"># Or use "RColorBrewer" package</span>
<span class="keyword">library</span><span class="paren">(</span><span class="identifier">RColorBrewer</span><span class="paren">)</span>
<span class="identifier">corrplot</span><span class="paren">(</span><span class="identifier">M</span>, <span class="identifier">type</span> <span class="operator">=</span> <span class="string">"upper"</span>, <span class="identifier">order</span> <span class="operator">=</span> <span class="string">"hclust"</span>,
         <span class="identifier">col</span> <span class="operator">=</span> <span class="identifier">brewer.pal</span><span class="paren">(</span><span class="identifier">n</span> <span class="operator">=</span> <span class="number">9</span>, <span class="identifier">name</span> <span class="operator">=</span> <span class="string">"PuOr"</span><span class="paren">)</span>, <span class="identifier">bg</span> <span class="operator">=</span> <span class="string">"darkgreen"</span><span class="paren">)</span></code></pre>
<p>&nbsp;</p>
<p>Use the <strong><code>tl.col</code></strong> argument for defining the text label color and <strong><code>tl.srt</code></strong> for text label string rotation.</p>
<p><strong>Example:</strong></p>
<pre class="r"><code class="r"><span class="identifier">corrplot</span><span class="paren">(</span><span class="identifier">M</span>, <span class="identifier">type</span> <span class="operator">=</span> <span class="string">"upper"</span>, <span class="identifier">order</span> <span class="operator">=</span> <span class="string">"hclust"</span>, <span class="identifier">tl.col</span> <span class="operator">=</span> <span class="string">"darkblue"</span>, <span class="identifier">tl.srt</span> <span class="operator">=</span> <span class="number">45</span><span class="paren">)</span></code></pre>
<p>&nbsp;</p>
</div>
<div id="combining-correlogram-with-the-significance-test" class="section level4">
<h4>Combining correlogram with the significance test</h4>
<pre class="r"><code class="r"><span class="comment"># Mark the insignificant coefficients according to the specified p-value significance level</span>
<span class="identifier">cor_5</span> <span class="operator">&lt;-</span> <span class="identifier">rcorr</span><span class="paren">(</span><span class="identifier">as.matrix</span><span class="paren">(</span><span class="identifier">mtcars</span><span class="paren">)</span><span class="paren">)</span>
<span class="identifier">M</span> <span class="operator">&lt;-</span> <span class="identifier">cor_5</span><span class="operator">$</span><span class="identifier">r</span>
<span class="identifier">p_mat</span> <span class="operator">&lt;-</span> <span class="identifier">cor_5</span><span class="operator">$</span><span class="identifier">P</span>
<span class="identifier">corrplot</span><span class="paren">(</span><span class="identifier">M</span>, <span class="identifier">type</span> <span class="operator">=</span> <span class="string">"upper"</span>, <span class="identifier">order</span> <span class="operator">=</span> <span class="string">"hclust"</span>, 
         <span class="identifier">p.mat</span> <span class="operator">=</span> <span class="identifier">p_mat</span>, <span class="identifier">sig.level</span> <span class="operator">=</span> <span class="number">0.01</span><span class="paren">)</span></code></pre>
<p>&nbsp;</p>
<pre class="r"><code class="r"><span class="comment"># Leave blank on no significant coefficient</span>
<span class="identifier">corrplot</span><span class="paren">(</span><span class="identifier">M</span>, <span class="identifier">type</span> <span class="operator">=</span> <span class="string">"upper"</span>, <span class="identifier">order</span> <span class="operator">=</span> <span class="string">"hclust"</span>, 
         <span class="identifier">p.mat</span> <span class="operator">=</span> <span class="identifier">p_mat</span>, <span class="identifier">sig.level</span> <span class="operator">=</span> <span class="number">0.05</span>, <span class="identifier">insig</span> <span class="operator">=</span> <span class="string">"blank"</span><span class="paren">)</span></code></pre>
<p>&nbsp;</p>
</div>
<div id="fine-tuning-customization-of-the-correlogram" class="section level4">
<h4>Fine tuning customization of the correlogram</h4>
<pre class="r"><code class="r"><span class="identifier">col</span> <span class="operator">&lt;-</span> <span class="identifier">colorRampPalette</span><span class="paren">(</span><span class="identifier">c</span><span class="paren">(</span><span class="string">"#BB4444"</span>, <span class="string">"#EE9988"</span>, <span class="string">"#FFFFFF"</span>, <span class="string">"#77AADD"</span>, <span class="string">"#4477AA"</span><span class="paren">)</span><span class="paren">)</span>
<span class="identifier">corrplot</span><span class="paren">(</span><span class="identifier">M</span>, <span class="identifier">method</span> <span class="operator">=</span> <span class="string">"color"</span>, <span class="identifier">col</span> <span class="operator">=</span> <span class="identifier">col</span><span class="paren">(</span><span class="number">200</span><span class="paren">)</span>,  
         <span class="identifier">type</span> <span class="operator">=</span> <span class="string">"upper"</span>, <span class="identifier">order</span> <span class="operator">=</span> <span class="string">"hclust"</span>, 
         <span class="identifier">addCoef.col</span> <span class="operator">=</span> <span class="string">"black"</span>, <span class="comment"># Add coefficient of correlation</span>
         <span class="identifier">tl.col</span> <span class="operator">=</span> <span class="string">"darkblue"</span>, <span class="identifier">tl.srt</span> <span class="operator">=</span> <span class="number">45</span>, <span class="comment">#Text label color and rotation</span>
         <span class="comment"># Combine with significance level</span>
         <span class="identifier">p.mat</span> <span class="operator">=</span> <span class="identifier">p_mat</span>, <span class="identifier">sig.level</span> <span class="operator">=</span> <span class="number">0.01</span>,  
         <span class="comment"># hide correlation coefficient on the principal diagonal</span>
         <span class="identifier">diag</span> <span class="operator">=</span> <span class="literal">FALSE</span> 
         <span class="paren">)</span></code></pre>
<p>&nbsp;</p>
<p>I’d say this is more than enough for introductory exploration of correlograms. More information can be found in the, already mentioned, vignette for the <code>corrplot</code> R package &#8211; <a href="https://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html">An Introduction to corrplot Package</a></p>
</div>
</div>
<div id="use-chart.correlation-draw-scatter-plots" class="section level3">
<h3>Use <code>chart.Correlation()</code>: Draw scatter plots</h3>
<p>The function <code>chart.Correlation()</code> from the package “<code>PerformanceAnalytics</code>”, can be used to display a chart of a correlation matrix. This is a very convinient way of exploring multivariate correlations.</p>
<pre class="r"><code class="r"><span class="keyword">library</span><span class="paren">(</span><span class="string">"PerformanceAnalytics"</span><span class="paren">)</span></code></pre>
<pre><code>## Loading required package: xts</code></pre>
<pre><code>## Loading required package: zoo</code></pre>
<pre><code>## 
## Attaching package: 'zoo'</code></pre>
<pre><code>## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric</code></pre>
<pre><code>## 
## Attaching package: 'xts'</code></pre>
<pre><code>## The following objects are masked from 'package:dplyr':
## 
##     first, last</code></pre>
<pre><code>## 
## Attaching package: 'PerformanceAnalytics'</code></pre>
<pre><code>## The following object is masked from 'package:graphics':
## 
##     legend</code></pre>
<pre class="r"><code class="r"><span class="identifier">my_data</span> <span class="operator">&lt;-</span> <span class="identifier">mtcars</span><span class="paren">[</span>, <span class="identifier">c</span><span class="paren">(</span><span class="number">1</span>,<span class="number">3</span>,<span class="number">4</span>,<span class="number">5</span>,<span class="number">6</span>,<span class="number">7</span><span class="paren">)</span><span class="paren">]</span>
<span class="identifier">chart.Correlation</span><span class="paren">(</span><span class="identifier">my_data</span>, <span class="identifier">histogram</span> <span class="operator">=</span> <span class="literal">TRUE</span>, <span class="identifier">pch</span> <span class="operator">=</span> <span class="number">19</span><span class="paren">)</span></code></pre>
<p>&nbsp;</p>
<p>In the above plot:</p>
<ul>
<li>The distribution of each variable is shown on the diagonal.</li>
<li>On the bottom of the diagonal : the bivariate scatter plots with a fitted line are displayed</li>
<li>On the top of the diagonal : the value of the correlation plus the significance level as stars</li>
<li>Each significance level is associated to a symbol : p-values(0, 0.001, 0.01, 0.05, 0.1, 1) &lt;=&gt; symbols(“<em><strong>“,”</strong>“,”</em>”, “.”, &#8221; “)</li>
</ul>
</div>
<div id="use-heatmap" class="section level3">
<h3>Use <code>heatmap()</code></h3>
<p>I don’t really consider this method of correlation matrix visualization to be of practical value, but nevertheless here is a small example:</p>
<pre class="r"><code class="r"><span class="comment"># Get some colors</span>
<span class="identifier">col</span> <span class="operator">&lt;-</span> <span class="identifier">colorRampPalette</span><span class="paren">(</span><span class="identifier">c</span><span class="paren">(</span><span class="string">"darkblue"</span>, <span class="string">"white"</span>, <span class="string">"darkorange"</span><span class="paren">)</span><span class="paren">)</span><span class="paren">(</span><span class="number">20</span><span class="paren">)</span>
<span class="identifier">M</span> <span class="operator">&lt;-</span> <span class="identifier">cor</span><span class="paren">(</span><span class="identifier">mtcars</span><span class="paren">[</span><span class="number">1</span><span class="operator">:</span><span class="number">7</span><span class="paren">]</span><span class="paren">)</span>
<span class="identifier">heatmap</span><span class="paren">(</span><span class="identifier">x</span> <span class="operator">=</span> <span class="identifier">M</span>, <span class="identifier">col</span> <span class="operator">=</span> <span class="identifier">col</span>, <span class="identifier">symm</span> <span class="operator">=</span> <span class="literal">TRUE</span><span class="paren">)</span></code></pre>
<p>&nbsp;</p>
</div>
</div>
]]></content:encoded>
			<wfw:commentRss>https://www.biofacebook.com/?feed=rss2&#038;p=1373</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Correlation analysis (zhuantie)</title>
		<link>https://www.biofacebook.com/?p=1365</link>
		<comments>https://www.biofacebook.com/?p=1365#comments</comments>
		<pubDate>Thu, 04 Apr 2019 02:33:19 +0000</pubDate>
		<dc:creator><![CDATA[szypanther]]></dc:creator>
				<category><![CDATA[统计学习]]></category>

		<guid isPermaLink="false">http://www.biofacebook.com/?p=1365</guid>
		<description><![CDATA[ Defination Types of correlation coefficient Correlation formula Pearson correlation formula Spearman correlation formula Kendall correlation formula Calculate correlation coefficient Preleminary test to check the test assumptions data are normally distributed data are not normally distributed Interprete correlation coefficient Generate correlation matrix Compute correlation matrix Method one: use ggcorrplot() Method two: use rcorr() Method three: [...]]]></description>
				<content:encoded><![CDATA[<div id="toc">
<ol class="lorem ipsum">
<li class="dolor sit amet"><a href="http://chenyuan.date/2017/07/Correlation-analysis/#defination">Defination</a></li>
<li class="dolor sit amet"><a href="http://chenyuan.date/2017/07/Correlation-analysis/#types-of-correlation-coefficient">Types of correlation coefficient</a></li>
<li class="dolor sit amet"><a href="http://chenyuan.date/2017/07/Correlation-analysis/#correlation-formula">Correlation formula</a>
<ol class="lorem ipsum">
<li class="dolor sit amet"><a href="http://chenyuan.date/2017/07/Correlation-analysis/#pearson-correlation-formula">Pearson correlation formula</a></li>
<li class="dolor sit amet"><a href="http://chenyuan.date/2017/07/Correlation-analysis/#spearman-correlation-formula">Spearman correlation formula</a></li>
<li class="dolor sit amet"><a href="http://chenyuan.date/2017/07/Correlation-analysis/#kendall-correlation-formula">Kendall correlation formula</a></li>
</ol>
</li>
<li class="dolor sit amet"><a href="http://chenyuan.date/2017/07/Correlation-analysis/#calculate-correlation-coefficient">Calculate correlation coefficient</a></li>
<li class="dolor sit amet"><a href="http://chenyuan.date/2017/07/Correlation-analysis/#preleminary-test-to-check-the-test-assumptions">Preleminary test to check the test assumptions</a>
<ol class="lorem ipsum">
<li class="dolor sit amet"><a href="http://chenyuan.date/2017/07/Correlation-analysis/#data-are-normally-distributed">data are normally distributed</a></li>
<li class="dolor sit amet"><a href="http://chenyuan.date/2017/07/Correlation-analysis/#data-are-not-normally-distributed">data are not normally distributed</a></li>
</ol>
</li>
<li class="dolor sit amet"><a href="http://chenyuan.date/2017/07/Correlation-analysis/#interprete-correlation-coefficient">Interprete correlation coefficient</a></li>
<li class="dolor sit amet"><a href="http://chenyuan.date/2017/07/Correlation-analysis/#generate-correlation-matrix">Generate correlation matrix</a></li>
<li class="dolor sit amet"><a href="http://chenyuan.date/2017/07/Correlation-analysis/#compute-correlation-matrix">Compute correlation matrix</a>
<ol class="lorem ipsum">
<li class="dolor sit amet"><a href="http://chenyuan.date/2017/07/Correlation-analysis/#method-one-use-ggcorrplot">Method one: use ggcorrplot()</a></li>
<li class="dolor sit amet"><a href="http://chenyuan.date/2017/07/Correlation-analysis/#method-two-use-rcorr">Method two: use rcorr()</a></li>
<li class="dolor sit amet"><a href="http://chenyuan.date/2017/07/Correlation-analysis/#method-three-use-cor">Method three: use cor()</a></li>
</ol>
</li>
<li class="dolor sit amet"><a href="http://chenyuan.date/2017/07/Correlation-analysis/#computing-the-p-value-of-correlations">Computing the p-value of correlations</a>
<ol class="lorem ipsum">
<li class="dolor sit amet"><a href="http://chenyuan.date/2017/07/Correlation-analysis/#format-the-correlation-matrix-unnecessary">Format the correlation matrix (unnecessary)</a></li>
</ol>
</li>
<li class="dolor sit amet"><a href="http://chenyuan.date/2017/07/Correlation-analysis/#visualize-correlation-matrix">Visualize correlation matrix</a>
<ol class="lorem ipsum">
<li class="dolor sit amet"><a href="http://chenyuan.date/2017/07/Correlation-analysis/#corrplot-function-to-plot-a-correlogram">corrplot() function to plot a correlogram</a></li>
<li class="dolor sit amet"><a href="http://chenyuan.date/2017/07/Correlation-analysis/#chartcorrelation-draw-scatter-plots">chart.Correlation(): Draw scatter plots</a></li>
<li class="dolor sit amet"><a href="http://chenyuan.date/2017/07/Correlation-analysis/#use-heatmap">Use heatmap()</a></li>
<li class="dolor sit amet"><a href="http://chenyuan.date/2017/07/Correlation-analysis/#ggcorrplot-function-recommend">ggcorrplot function (recommend)</a></li>
</ol>
</li>
<li class="dolor sit amet"><a href="http://chenyuan.date/2017/07/Correlation-analysis/#partial-correlation">Partial correlation</a></li>
</ol>
<ul>
<li class="dolor sit amet"><a href="http://chenyuan.date/2017/07/Correlation-analysis/#references">References</a></li>
</ul>
</div>
<h2 id="defination" class="clickable-header">Defination</h2>
<p>Correlation coefficient is a quantity that measures the strength of the association (or dependence) between two or more variables.</p>
<h2 id="types-of-correlation-coefficient" class="clickable-header">Types of correlation coefficient</h2>
<ol>
<li>Pearson r: is a <code class="highlighter-rouge">parametric correlation test</code> as it depends on the distribution (normal distribution) of the data. It measures the linear dependence between two variables. The plot of y = f(x) is named the linear regression curve. (the mostly used method)</li>
<li>Kendall tau: rank-based correlation coefficient (non-parametric methods). Recommended if the data do not come from a bivariate normal distribution.</li>
<li>Spearman rho: rank-base correlation coefficient (non-parametric methods). Recommended if the data do not come from a bivariate normal distribution.</li>
</ol>
<h2 id="correlation-formula" class="clickable-header">Correlation formula</h2>
<p>In the formula below,<br />
$x$ and $y$ are two vectors of length $n$ $m_y$ and $m_y$ corresponds to the means of $x$ and $y$, respectively.</p>
<h3 id="pearson-correlation-formula" class="clickable-header">Pearson correlation formula</h3>
<p>The p-value (significance level) of the correlation can be determined :</p>
<ul>
<li>by using the correlation coefficient table for the degrees of freedom : $df=n-2$, where $n$ is the number of observation in x and y variables.</li>
<li>or by calculating the t value as follow: In the case 2) the corresponding p-value is determined using t distribution table for $df=n-2$</li>
</ul>
<blockquote><p>If the p-value is &lt; 5%, then the correlation between x and y is significant.</p></blockquote>
<h3 id="spearman-correlation-formula" class="clickable-header">Spearman correlation formula</h3>
<p>The Spearman correlation method computes the correlation between the rank of $x$ and the rank of $y$ variables.</p>
<p>Where $x’ = rank(x_)$ and $y’ = rank(y_)$.</p>
<h3 id="kendall-correlation-formula" class="clickable-header">Kendall correlation formula</h3>
<p>The Kendall correlation method measures the correspondence between the ranking of x and y variables. The total number of possible pairings of x with y observations is $n(n???1)/2$, where n is the size of x and y.</p>
<p>The procedure is as follow:</p>
<p>Begin by ordering the pairs by the x values. If x and y are correlated, then they would have the same relative rank orders.</p>
<p>Now, for each $y_i$, count the number of $y_j &gt; y_i$ (concordant pairs (c)) and the number of $y_j &lt; y_i$ (discordant pairs (d)).</p>
<p>Kendall correlation distance is defined as follow:</p>
<p>Where, $n_c$: total number of concordant pairs $n_d$: total number of discordant pairs $n$: size of x and y</p>
<h2 id="calculate-correlation-coefficient" class="clickable-header">Calculate correlation coefficient</h2>
<p>Correlation coefficient can be computed using the functions <code class="highlighter-rouge">cor()</code> or <code class="highlighter-rouge">cor.test()</code>:</p>
<div class="highlighter-rouge">
<div class="highlight">
<pre class="highlight"><code>cor(x, y, method = c("pearson", "kendall", "spearman"), use = "complete.obs")
cor.test(x, y, method=c("pearson", "kendall", "spearman"), use = "complete.obs")

x and y are two numeric vectors with the same length     
cor() computes the correlation coefficient
cor.test() test for association/correlation between paired samples. It returns both the correlation coefficient and the significance level(or p-value) of the correlation.
use = "complete.obs" handle missing values by case-wise deletion
</code></pre>
</div>
</div>
<h2 id="preleminary-test-to-check-the-test-assumptions" class="clickable-header">Preleminary test to check the test assumptions</h2>
<h3 id="data-are-normally-distributed" class="clickable-header">data are normally distributed</h3>
<ol>
<li>Is the covariation linear? Yes, form the plot, the relationship is linear. In the situation where the scatter plots show curved patterns, we are dealing with nonlinear association between the two variables.</li>
<li>Are the data from each of the 2 variables (x, y) follow a normal distribution?
<ul>
<li>Use Shapiro-Wilk normality test -&gt; R function: <code class="highlighter-rouge">shapiro.test()</code> and look at the normality plot -&gt; R function: <code class="highlighter-rouge">ggpubr::ggqqplot()</code></li>
<li>Shapiro-Wilk test can be performed as follow:</li>
</ul>
<ul>
<li>Null hypothesis: the data are normally distributed</li>
<li>Alternative hypothesis: the data are not normally distributed</li>
</ul>
</li>
</ol>
<div class="language-r highlighter-rouge">
<div class="highlight">
<pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="n">ggpubr</span><span class="p">)</span>
</code></pre>
</div>
</div>
<div class="highlighter-rouge">
<div class="highlight">
<pre class="highlight"><code>## Loading required package: ggplot2
</code></pre>
</div>
</div>
<div class="highlighter-rouge">
<div class="highlight">
<pre class="highlight"><code>## Loading required package: magrittr
</code></pre>
</div>
</div>
<div class="language-r highlighter-rouge">
<div class="highlight">
<pre class="highlight"><code><span class="n">ggscatter</span><span class="p">(</span><span class="n">mtcars</span><span class="p">,</span> <span class="n">x</span> <span class="o">=</span> <span class="s2">"mpg"</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="s2">"wt"</span><span class="p">,</span> 
          <span class="n">add</span> <span class="o">=</span> <span class="s2">"reg.line"</span><span class="p">,</span> 
          <span class="n">conf.int</span> <span class="o">=</span> <span class="kc">TRUE</span><span class="p">,</span> 
          <span class="n">cor.coef</span> <span class="o">=</span> <span class="kc">TRUE</span><span class="p">,</span> 
          <span class="n">cor.method</span> <span class="o">=</span> <span class="s2">"pearson"</span><span class="p">,</span>
          <span class="n">xlab</span> <span class="o">=</span> <span class="s2">"Miles/(US) gallon"</span><span class="p">,</span> 
          <span class="n">ylab</span> <span class="o">=</span> <span class="s2">"Weight (1000 lbs)"</span><span class="p">)</span>
</code></pre>
</div>
</div>
<p><img src="http://chenyuan.date/images/Correlation-analysis-unnamed-chunk-2-1.png" alt="" /></p>
<div class="language-r highlighter-rouge">
<div class="highlight">
<pre class="highlight"><code><span class="c1">#Shapiro-Wilk normality test for mpg and wt</span>
<span class="n">shapiro.test</span><span class="p">(</span><span class="n">mtcars</span><span class="o">$</span><span class="n">mpg</span><span class="p">)</span> 
</code></pre>
</div>
</div>
<div class="highlighter-rouge">
<div class="highlight">
<pre class="highlight"><code>## 
## 	Shapiro-Wilk normality test
## 
## data:  mtcars$mpg
## W = 0.94756, p-value = 0.1229
</code></pre>
</div>
</div>
<div class="language-r highlighter-rouge">
<div class="highlight">
<pre class="highlight"><code><span class="n">shapiro.test</span><span class="p">(</span><span class="n">mtcars</span><span class="o">$</span><span class="n">wt</span><span class="p">)</span> 
</code></pre>
</div>
</div>
<div class="highlighter-rouge">
<div class="highlight">
<pre class="highlight"><code>## 
## 	Shapiro-Wilk normality test
## 
## data:  mtcars$wt
## W = 0.94326, p-value = 0.09265
</code></pre>
</div>
</div>
<div class="language-r highlighter-rouge">
<div class="highlight">
<pre class="highlight"><code><span class="c1">#Visual inspection of the data normality using Q-Q plots (quantile-quantile plots)</span>
<span class="c1">#Q-Q plot draws the correlation between a given sample and the normal distribution.</span>
<span class="n">ggqqplot</span><span class="p">(</span><span class="n">mtcars</span><span class="o">$</span><span class="n">mpg</span><span class="p">,</span> <span class="n">ylab</span> <span class="o">=</span> <span class="s2">"MPG"</span><span class="p">)</span>
</code></pre>
</div>
</div>
<p><img src="http://chenyuan.date/images/Correlation-analysis-unnamed-chunk-2-2.png" alt="" /></p>
<div class="language-r highlighter-rouge">
<div class="highlight">
<pre class="highlight"><code><span class="n">ggqqplot</span><span class="p">(</span><span class="n">mtcars</span><span class="o">$</span><span class="n">wt</span><span class="p">,</span> <span class="n">ylab</span> <span class="o">=</span> <span class="s2">"WT"</span><span class="p">)</span>
</code></pre>
</div>
</div>
<p><img src="http://chenyuan.date/images/Correlation-analysis-unnamed-chunk-2-3.png" alt="" /></p>
<div class="language-r highlighter-rouge">
<div class="highlight">
<pre class="highlight"><code><span class="c1">#Pearson correlation test</span>
<span class="n">cor.test</span><span class="p">(</span><span class="n">mtcars</span><span class="o">$</span><span class="n">wt</span><span class="p">,</span> <span class="n">mtcars</span><span class="o">$</span><span class="n">mpg</span><span class="p">,</span> <span class="n">method</span> <span class="o">=</span> <span class="s2">"pearson"</span><span class="p">)</span>
</code></pre>
</div>
</div>
<div class="highlighter-rouge">
<div class="highlight">
<pre class="highlight"><code>## 
## 	Pearson's product-moment correlation
## 
## data:  mtcars$wt and mtcars$mpg
## t = -9.559, df = 30, p-value = 1.294e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.9338264 -0.7440872
## sample estimates:
##        cor 
## -0.8676594
</code></pre>
</div>
</div>
<h3 id="data-are-not-normally-distributed" class="clickable-header">data are not normally distributed</h3>
<p>If the data are not normally distributed, it’s recommended to use the non-parametric correlation, including Spearman and Kendall rank-based correlation tests.</p>
<div class="language-r highlighter-rouge">
<div class="highlight">
<pre class="highlight"><code><span class="c1">#Spearman rank correlation coefficient</span>
<span class="n">cor.test</span><span class="p">(</span><span class="n">mtcars</span><span class="o">$</span><span class="n">wt</span><span class="p">,</span> <span class="n">mtcars</span><span class="o">$</span><span class="n">mpg</span><span class="p">,</span>  <span class="n">method</span> <span class="o">=</span> <span class="s2">"spearman"</span><span class="p">)</span>
</code></pre>
</div>
</div>
<div class="highlighter-rouge">
<div class="highlight">
<pre class="highlight"><code>## Warning in cor.test.default(mtcars$wt, mtcars$mpg, method = "spearman"):
## Cannot compute exact p-value with ties
</code></pre>
</div>
</div>
<div class="highlighter-rouge">
<div class="highlight">
<pre class="highlight"><code>## 
## 	Spearman's rank correlation rho
## 
## data:  mtcars$wt and mtcars$mpg
## S = 10292, p-value = 1.488e-11
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## -0.886422
</code></pre>
</div>
</div>
<div class="language-r highlighter-rouge">
<div class="highlight">
<pre class="highlight"><code><span class="c1">#Kendall rank correlation test</span>
<span class="n">res</span> <span class="o">&lt;-</span> <span class="n">cor.test</span><span class="p">(</span><span class="n">mtcars</span><span class="o">$</span><span class="n">wt</span><span class="p">,</span> <span class="n">mtcars</span><span class="o">$</span><span class="n">mpg</span><span class="p">,</span>  <span class="n">method</span><span class="o">=</span><span class="s2">"kendall"</span><span class="p">)</span>
</code></pre>
</div>
</div>
<div class="highlighter-rouge">
<div class="highlight">
<pre class="highlight"><code>## Warning in cor.test.default(mtcars$wt, mtcars$mpg, method = "kendall"):
## Cannot compute exact p-value with ties
</code></pre>
</div>
</div>
<div class="language-r highlighter-rouge">
<div class="highlight">
<pre class="highlight"><code><span class="c1">#Extract the p.value and the correlation coefficient</span>
<span class="n">res</span><span class="o">$</span><span class="n">p.value</span>
</code></pre>
</div>
</div>
<div class="highlighter-rouge">
<div class="highlight">
<pre class="highlight"><code>## [1] 6.70577e-09
</code></pre>
</div>
</div>
<div class="language-r highlighter-rouge">
<div class="highlight">
<pre class="highlight"><code><span class="n">res</span><span class="o">$</span><span class="n">estimate</span>
</code></pre>
</div>
</div>
<div class="highlighter-rouge">
<div class="highlight">
<pre class="highlight"><code>##        tau 
## -0.7278321
</code></pre>
</div>
</div>
<h2 id="interprete-correlation-coefficient" class="clickable-header">Interprete correlation coefficient</h2>
<p>The value of correlation coefficient can be negative or positive, range [-1, 1]:</p>
<ul>
<li>-1: strong negative correlation</li>
<li>0: no relationship between the two variables (x and y)</li>
<li>1: strong positive correlation</li>
</ul>
<h2 id="generate-correlation-matrix" class="clickable-header">Generate correlation matrix</h2>
<p>Function <code class="highlighter-rouge">rcorr</code> from Hmisc package used to return correlation coefficients and the correlation p-values, function <code class="highlighter-rouge">ggcorrplot</code> from ggcorrplot used to visualize correlation matrix.</p>
<p>Method one: calculate matrix manually<br />
Method two: function <code class="highlighter-rouge">ggcorr()</code> in ggally package. However, can’t reorder matrix and display significance level.<br />
Method three: <code class="highlighter-rouge">corrplot()</code> function from <code class="highlighter-rouge">corrplot</code> package can be used<br />
Method four: <code class="highlighter-rouge">ggcorrplot()</code> function from ggcorrplot package. Functions: reordering, displays significance level, computing a matrix of correlation p-values.</p>
<div class="language-r highlighter-rouge">
<div class="highlight">
<pre class="highlight"><code><span class="c1">#install.packages("corrplot")</span>
<span class="n">library</span><span class="p">(</span><span class="n">corrplot</span><span class="p">)</span>

<span class="c1">#install.packages("Hmisc")</span>
<span class="n">library</span><span class="p">(</span><span class="n">Hmisc</span><span class="p">)</span>
</code></pre>
</div>
</div>
<div class="language-r highlighter-rouge">
<div class="highlight">
<pre class="highlight"><code><span class="c1">#install.packages("ggcorrplot")</span>
<span class="n">library</span><span class="p">(</span><span class="n">ggcorrplot</span><span class="p">)</span>
</code></pre>
</div>
</div>
<h2 id="compute-correlation-matrix" class="clickable-header">Compute correlation matrix</h2>
<h3 id="method-one-use-ggcorrplot" class="clickable-header">Method one: use ggcorrplot()</h3>
<p><code class="highlighter-rouge">cor()</code> return correlation matrix; <code class="highlighter-rouge">cor_pmat()</code> in ggcorrplot package computes a matrix of correlation p-values.</p>
<div class="language-r highlighter-rouge">
<div class="highlight">
<pre class="highlight"><code><span class="n">corr</span> <span class="o">&lt;-</span> <span class="nf">round</span><span class="p">(</span><span class="n">cor</span><span class="p">(</span><span class="n">mtcars</span><span class="p">),</span> <span class="m">2</span><span class="p">)</span> <span class="c1">#Compute a correlation matrix</span>
<span class="n">p.mat</span> <span class="o">&lt;-</span> <span class="n">cor_pmat</span><span class="p">(</span><span class="n">mtcars</span><span class="p">)</span> <span class="c1">#Compute a matrix of correlation p-values</span>
</code></pre>
</div>
</div>
<h3 id="method-two-use-rcorr" class="clickable-header">Method two: use rcorr()</h3>
<p>The function <code class="highlighter-rouge">rcorr()</code> [in Hmisc package] returns both the correlation coefficients and the correlation p-values for all possible pairs of <em>columns</em> in the data table.</p>
<div class="highlighter-rouge">
<div class="highlight">
<pre class="highlight"><code>rcorr(x, type = c("pearson","spearman"))

The output is a list containing the following elements: 
r: the correlation matrix 
n: the matrix of the number of observations used in analyzing each pair of variables 
P: the p-values corresponding to the significance levels of correlations.
</code></pre>
</div>
</div>
<div class="language-r highlighter-rouge">
<div class="highlight">
<pre class="highlight"><code><span class="c1">#corr &lt;- rcorr(as.matrix(mtcars))</span>
<span class="c1">#head(round(corr$r, 2)) #Extract the correlation coefficients</span>
<span class="c1">#head(round(corr$P, 3)) #Extract p-values</span>
</code></pre>
</div>
</div>
<h3 id="method-three-use-cor" class="clickable-header">Method three: use cor()</h3>
<p><code class="highlighter-rouge">cor()</code> can be used to compute a correlation matrix, but not correlation p-values</p>
<div class="highlighter-rouge">
<div class="highlight">
<pre class="highlight"><code>cor(x, method = c("pearson", "kendall", "spearman"), use = "complete.obs")
x: numeric matrix or a data frame.
method: indicates the correlation coefficient to be computed. 
  - pearson(default): measures the linear dependence between two variables. 
  - kendall: non-parametric rank-based correlation test.
  - spearman: non-parametric rank-based correlation test.
use = "complete.obs": case-wise deletion, which is useful for NA-containing matrix.     
</code></pre>
</div>
</div>
<h2 id="computing-the-p-value-of-correlations" class="clickable-header">Computing the p-value of correlations</h2>
<p>P value calculation principle</p>
<div class="highlighter-rouge">
<div class="highlight">
<pre class="highlight"><code># mat : is a matrix of data
# ... : further arguments to pass to the native R cor.test function
cor.mtest &lt;- function(mat, ...) {
    mat &lt;- as.matrix(mat)
    n &lt;- ncol(mat)
    p.mat&lt;- matrix(NA, n, n)
    diag(p.mat) &lt;- 0
    for (i in 1:(n - 1)) {
        for (j in (i + 1):n) {
            tmp &lt;- cor.test(mat[, i], mat[, j], ...)
            p.mat[i, j] &lt;- p.mat[j, i] &lt;- tmp$p.value
        }
    }
  colnames(p.mat) &lt;- rownames(p.mat) &lt;- colnames(mat)
  p.mat
}

#matrix of the p-value of the correlation
p.mat &lt;- cor.mtest(mtcars)
</code></pre>
</div>
</div>
<h3 id="format-the-correlation-matrix-unnecessary" class="clickable-header">Format the correlation matrix (unnecessary)</h3>
<div class="language-r highlighter-rouge">
<div class="highlight">
<pre class="highlight"><code><span class="c1">### flattenCorrMatrix function</span>
<span class="c1">#cormat: matrix of the correlation coefficients</span>
<span class="c1">#pmat: matrix of the correlation p-values</span>
<span class="n">flattenCorrMatrix</span> <span class="o">&lt;-</span> <span class="k">function</span><span class="p">(</span><span class="n">cormat</span><span class="p">,</span> <span class="n">pmat</span><span class="p">)</span> <span class="p">{</span>
  <span class="n">ut</span> <span class="o">&lt;-</span> <span class="n">upper.tri</span><span class="p">(</span><span class="n">cormat</span><span class="p">)</span>
  <span class="n">data.frame</span><span class="p">(</span>
    <span class="n">row</span> <span class="o">=</span> <span class="n">rownames</span><span class="p">(</span><span class="n">cormat</span><span class="p">)[</span><span class="n">row</span><span class="p">(</span><span class="n">cormat</span><span class="p">)[</span><span class="n">ut</span><span class="p">]],</span>
    <span class="n">column</span> <span class="o">=</span> <span class="n">rownames</span><span class="p">(</span><span class="n">cormat</span><span class="p">)[</span><span class="n">col</span><span class="p">(</span><span class="n">cormat</span><span class="p">)[</span><span class="n">ut</span><span class="p">]],</span>
    <span class="n">cor</span>  <span class="o">=</span><span class="p">(</span><span class="n">cormat</span><span class="p">)[</span><span class="n">ut</span><span class="p">],</span>
    <span class="n">p</span> <span class="o">=</span> <span class="n">pmat</span><span class="p">[</span><span class="n">ut</span><span class="p">]</span>
    <span class="p">)</span>
<span class="p">}</span>

<span class="c1">#head(flattenCorrMatrix(corr$r, corr$P))</span>
</code></pre>
</div>
</div>
<h2 id="visualize-correlation-matrix" class="clickable-header">Visualize correlation matrix</h2>
<p>Five different ways to visualize a correlation matrix:</p>
<ol>
<li>symnum() function (gave up, too lazy to introduce)</li>
<li>corrplot() function to plot a correlogram</li>
<li>scatter plots</li>
<li>heatmap</li>
<li>ggcorrplot function (recommend)</li>
</ol>
<h3 id="corrplot-function-to-plot-a-correlogram" class="clickable-header">corrplot() function to plot a correlogram</h3>
<p>R <code class="highlighter-rouge">corrplot</code> function is used to plot the graph of the correlation matrix.</p>
<div class="highlighter-rouge">
<div class="highlight">
<pre class="highlight"><code>corrplot(corr, method="circle", type="upper", order="hclust", col=col, bg="lightblue", tl.col="black", tl.srt=45, )

corr:	The correlation matrix to visualize. To visualize a general matrix, please use is.corr=FALSE.
method:	seven visualization method: "circle", "square"", "ellipse", number", "shade"", "color", "pie"
type: types of layout. 
  - "full" (default): display full correlation matrix
  - "upper": display upper triangular of the correlation matrix
  - "lower": display lower triangular of the correlation matrix
order: reorder the correlation matrix. The correlation matrix can be reordered according to the correlation coefficient. This is important to identify the hidden structure and pattern in the matrix.
  - "hclust": for hierarchical clustering order
tl.col: for text label color, used to change text colors
tl.srt: for text label string rotation. used to change label rotations.
</code></pre>
</div>
</div>
<div class="language-r highlighter-rouge">
<div class="highlight">
<pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="n">RColorBrewer</span><span class="p">)</span>
<span class="n">col</span><span class="o">&lt;-</span> <span class="n">colorRampPalette</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="s2">"red"</span><span class="p">,</span> <span class="s2">"white"</span><span class="p">,</span> <span class="s2">"blue"</span><span class="p">))(</span><span class="m">20</span><span class="p">)</span>

<span class="n">corrplot</span><span class="p">(</span><span class="n">corr</span><span class="p">,</span> 
         <span class="n">method</span><span class="o">=</span><span class="s2">"circle"</span><span class="p">,</span> <span class="c1">#visualization method</span>
         <span class="n">type</span><span class="o">=</span><span class="s2">"upper"</span><span class="p">,</span> <span class="c1">#types of layout</span>
         <span class="n">order</span><span class="o">=</span><span class="s2">"hclust"</span><span class="p">,</span> <span class="c1">#order: reorder the correlation matrix</span>
         <span class="n">col</span><span class="o">=</span><span class="n">col</span><span class="p">,</span> <span class="c1">#col: Using different color spectrum</span>
         <span class="c1">#bg="lightblue", #bg: Change background color to lightblue</span>
         <span class="c1">#col=brewer.pal(n=8, name="RdBu") #use RcolorBrewer palette of colors</span>
         <span class="c1">#col=brewer.pal(n=8, name="RdYlBu") #use RcolorBrewer palette of colors</span>
         <span class="c1">#col=brewer.pal(n=8, name="PuOr") #use RcolorBrewer palette of colors</span>
         <span class="n">tl.col</span><span class="o">=</span><span class="s2">"blue"</span><span class="p">,</span> <span class="c1">#change text colors</span>
         <span class="n">tl.srt</span><span class="o">=</span><span class="m">45</span><span class="p">,</span> <span class="c1">#change label rotations</span>
         
         <span class="c1">#Combine with significance</span>
         <span class="n">p.mat</span> <span class="o">=</span> <span class="n">p.mat</span><span class="p">,</span> <span class="c1">#add the matrix of p-value</span>
         <span class="n">sig.level</span> <span class="o">=</span> <span class="m">0.01</span><span class="p">,</span> <span class="c1">## Specialized the insignificant value according to the significant level</span>
         <span class="n">insig</span> <span class="o">=</span> <span class="s2">"blank"</span><span class="p">,</span> <span class="c1">#Leave blank on no significant coefficient</span>
         
         <span class="n">diag</span><span class="o">=</span><span class="kc">FALSE</span> <span class="c1">#hide correlation coefficient on the principal diagonal</span>
         <span class="p">)</span>
</code></pre>
</div>
</div>
<p><img src="http://chenyuan.date/images/Correlation-analysis-unnamed-chunk-8-1.png" alt="" /></p>
<h3 id="chartcorrelation-draw-scatter-plots" class="clickable-header">chart.Correlation(): Draw scatter plots</h3>
<p><code class="highlighter-rouge">chart.Correlation()</code> in package <em>PerformanceAnalytics</em> can be used to display a chart of a correlation matrix.</p>
<div class="language-r highlighter-rouge">
<div class="highlight">
<pre class="highlight"><code><span class="c1">#install.packages("PerformanceAnalytics")</span>
<span class="n">library</span><span class="p">(</span><span class="s2">"PerformanceAnalytics"</span><span class="p">)</span>
</code></pre>
</div>
</div>
<div class="language-r highlighter-rouge">
<div class="highlight">
<pre class="highlight"><code><span class="n">my_data</span> <span class="o">&lt;-</span> <span class="n">mtcars</span><span class="p">[,</span> <span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="m">3</span><span class="p">,</span><span class="m">4</span><span class="p">,</span><span class="m">5</span><span class="p">,</span><span class="m">6</span><span class="p">,</span><span class="m">7</span><span class="p">)]</span>
<span class="n">chart.Correlation</span><span class="p">(</span><span class="n">my_data</span><span class="p">,</span> <span class="n">histogram</span><span class="o">=</span><span class="kc">TRUE</span><span class="p">,</span> <span class="n">pch</span><span class="o">=</span><span class="m">19</span><span class="p">)</span>
</code></pre>
</div>
</div>
<p><img src="http://chenyuan.date/images/Correlation-analysis-unnamed-chunk-9-1.png" alt="" /></p>
<div class="language-r highlighter-rouge">
<div class="highlight">
<pre class="highlight"><code><span class="c1">#The distribution of each variable is shown on the diagonal.</span>
<span class="c1">#left bottom of the diagonal: the bivariate scatter plots with a fitted line are displayed</span>
<span class="c1">#right top of the diagonal: the value of the correlation plus the significance level as stars</span>
<span class="c1">#Each significance level is associated to a symbol: </span>
<span class="c1">#p-values(0, 0.001, 0.01, 0.05, 0.1, 1) &lt;=&gt; symbols(***, **, *, ., ,"")</span>
</code></pre>
</div>
</div>
<h3 id="use-heatmap" class="clickable-header">Use heatmap()</h3>
<div class="language-r highlighter-rouge">
<div class="highlight">
<pre class="highlight"><code><span class="n">col</span> <span class="o">&lt;-</span> <span class="n">colorRampPalette</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="s2">"blue"</span><span class="p">,</span> <span class="s2">"white"</span><span class="p">,</span> <span class="s2">"red"</span><span class="p">))(</span><span class="m">20</span><span class="p">)</span>
<span class="n">heatmap</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="n">corr</span><span class="p">,</span> <span class="n">col</span> <span class="o">=</span> <span class="n">col</span><span class="p">,</span> <span class="n">symm</span> <span class="o">=</span> <span class="kc">TRUE</span><span class="p">)</span>
</code></pre>
</div>
</div>
<p><img src="http://chenyuan.date/images/Correlation-analysis-unnamed-chunk-10-1.png" alt="" /></p>
<div class="language-r highlighter-rouge">
<div class="highlight">
<pre class="highlight"><code><span class="c1">#x: the correlation matrix to be plotted</span>
<span class="c1">#col: color palettes</span>
<span class="c1">#symm: logical indicating if x should be treated symmetrically; can only be true when x is a square matrix.</span>
</code></pre>
</div>
</div>
<h3 id="ggcorrplot-function-recommend" class="clickable-header">ggcorrplot function (recommend)</h3>
<div class="language-r highlighter-rouge">
<div class="highlight">
<pre class="highlight"><code><span class="n">ggcorrplot</span><span class="p">(</span><span class="n">corr</span><span class="p">,</span> 
           <span class="n">method</span> <span class="o">=</span> <span class="s2">"circle"</span><span class="p">,</span>  <span class="c1">#"square" (default), "circle" </span>
           <span class="n">type</span> <span class="o">=</span> <span class="s2">"lower"</span><span class="p">,</span> <span class="c1"># "full" (default), "lower" or "upper" display.</span>
           <span class="n">hc.order</span> <span class="o">=</span> <span class="kc">TRUE</span><span class="p">,</span> <span class="c1">#logical value. If TRUE, correlation matrix will be hc.ordered using hclust function.</span>
           <span class="n">outline.col</span> <span class="o">=</span> <span class="s2">"white"</span><span class="p">,</span> <span class="c1">#the outline color of square or circle. Default "gray"</span>
           <span class="n">ggtheme</span> <span class="o">=</span> <span class="n">ggplot2</span><span class="o">::</span><span class="n">theme_classic</span><span class="p">,</span> <span class="c1">#Change theme</span>
           <span class="n">colors</span> <span class="o">=</span> <span class="nf">c</span><span class="p">(</span><span class="s2">"red"</span><span class="p">,</span> <span class="s2">"white"</span><span class="p">,</span> <span class="s2">"green"</span><span class="p">),</span> <span class="c1">#Change colors</span>
           <span class="n">lab</span> <span class="o">=</span> <span class="kc">TRUE</span><span class="p">,</span> <span class="c1">#Add correlation coefficients</span>
           <span class="n">sig.level</span><span class="o">=</span><span class="m">0.05</span><span class="p">,</span> <span class="c1">#set significant level</span>
           <span class="n">p.mat</span> <span class="o">=</span> <span class="n">p.mat</span><span class="p">,</span> <span class="c1">#Add correlation significance level</span>
           <span class="n">insig</span> <span class="o">=</span> <span class="s2">"blank"</span> <span class="c1">#Leave blank on no significant coefficient</span>
           <span class="p">)</span>
</code></pre>
</div>
</div>
<p><img src="http://chenyuan.date/images/Correlation-analysis-unnamed-chunk-11-1.png" alt="" /></p>
<h2 id="partial-correlation" class="clickable-header">Partial correlation</h2>
<p>The R package ppcor provides users with four functions: pcor(), pcor.test(), spcor(), and spcor.test().<br />
<code class="highlighter-rouge">pcor()</code> calculates the partial correlations of all pairs of two random variables of a matrix or a data frame and provides the matrices of statistics and p-values of each pairwise partial correlation. <code class="highlighter-rouge">pcor.test()</code> computes the pairwise partial correlation coefficient of a pair of two random variables given one or more random variables.</p>
<div class="language-r highlighter-rouge">
<div class="highlight">
<pre class="highlight"><code><span class="c1"># install.packages("ppcor")</span>
<span class="n">library</span><span class="p">(</span><span class="n">ppcor</span><span class="p">)</span>
</code></pre>
</div>
</div>
<div class="language-r highlighter-rouge">
<div class="highlight">
<pre class="highlight"><code><span class="c1"># calculate the correlations between each pair with all other variables are adjusted </span>
<span class="n">partial.corr</span> <span class="o">&lt;-</span> <span class="n">pcor</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="n">mtcars</span><span class="p">,</span> <span class="n">method</span><span class="o">=</span><span class="s2">"spearman"</span><span class="p">)</span>
<span class="c1">#Results interpretation: ?pcor()</span>

<span class="c1"># calculate the correlations between each pair with specified variables are adjusted </span>
<span class="n">partial_correlation</span> <span class="o">&lt;-</span> <span class="k">function</span><span class="p">(</span><span class="n">df</span><span class="p">)</span> <span class="p">{</span>
  <span class="n">n</span> <span class="o">&lt;-</span> <span class="n">ncol</span><span class="p">(</span><span class="n">df</span><span class="p">)</span>
  <span class="n">results</span> <span class="o">&lt;-</span> <span class="nf">list</span><span class="p">()</span> <span class="c1"># define an empty list</span>
  <span class="n">results</span><span class="p">[[</span><span class="s2">"estimate"</span><span class="p">]]</span> <span class="o">&lt;-</span> <span class="n">sapply</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="p">(</span><span class="n">n</span><span class="m">-3</span><span class="p">),</span> <span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">sapply</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="p">(</span><span class="n">n</span><span class="m">-3</span><span class="p">),</span> <span class="k">function</span><span class="p">(</span><span class="n">y</span><span class="p">)</span> <span class="p">{</span>
      <span class="n">ifelse</span><span class="p">(</span><span class="n">x</span> <span class="o">==</span> <span class="n">y</span><span class="p">,</span> <span class="m">1</span><span class="p">,</span> <span class="n">pcor.test</span><span class="p">(</span><span class="n">df</span><span class="p">[,</span><span class="n">x</span><span class="p">],</span> <span class="n">df</span><span class="p">[,</span><span class="n">y</span><span class="p">],</span> <span class="n">df</span><span class="p">[,</span><span class="nf">c</span><span class="p">((</span><span class="n">n</span><span class="m">-2</span><span class="p">)</span><span class="o">:</span><span class="n">n</span><span class="p">)],</span> <span class="n">method</span><span class="o">=</span><span class="s2">"spearman"</span><span class="p">)</span><span class="o">$</span><span class="n">estimate</span><span class="p">)</span>
    <span class="p">})</span>
  <span class="p">})</span>
  <span class="n">results</span><span class="p">[[</span><span class="s2">"p.value"</span><span class="p">]]</span> <span class="o">&lt;-</span> <span class="n">sapply</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="p">(</span><span class="n">n</span><span class="m">-3</span><span class="p">),</span> <span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">sapply</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="p">(</span><span class="n">n</span><span class="m">-3</span><span class="p">),</span> <span class="k">function</span><span class="p">(</span><span class="n">y</span><span class="p">)</span> <span class="p">{</span>
      <span class="n">ifelse</span><span class="p">(</span><span class="n">x</span> <span class="o">==</span> <span class="n">y</span><span class="p">,</span> <span class="m">0</span><span class="p">,</span> <span class="n">pcor.test</span><span class="p">(</span><span class="n">df</span><span class="p">[,</span><span class="n">x</span><span class="p">],</span> <span class="n">df</span><span class="p">[,</span><span class="n">y</span><span class="p">],</span> <span class="n">df</span><span class="p">[,</span><span class="nf">c</span><span class="p">((</span><span class="n">n</span><span class="m">-2</span><span class="p">)</span><span class="o">:</span><span class="n">n</span><span class="p">)],</span> <span class="n">method</span><span class="o">=</span><span class="s2">"spearman"</span><span class="p">)</span><span class="o">$</span><span class="n">p.value</span><span class="p">)</span>
    <span class="p">})</span>
  <span class="p">})</span>
  <span class="n">colnames</span><span class="p">(</span><span class="n">results</span><span class="p">[[</span><span class="s2">"estimate"</span><span class="p">]])</span> <span class="o">&lt;-</span> <span class="n">rownames</span><span class="p">(</span><span class="n">results</span><span class="p">[[</span><span class="s2">"estimate"</span><span class="p">]])</span> <span class="o">&lt;-</span> <span class="n">colnames</span><span class="p">(</span><span class="n">df</span><span class="p">[,</span> <span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="p">(</span><span class="n">n</span><span class="m">-3</span><span class="p">))])</span>
  <span class="n">colnames</span><span class="p">(</span><span class="n">results</span><span class="p">[[</span><span class="s2">"p.value"</span><span class="p">]])</span> <span class="o">&lt;-</span> <span class="n">rownames</span><span class="p">(</span><span class="n">results</span><span class="p">[[</span><span class="s2">"p.value"</span><span class="p">]])</span> <span class="o">&lt;-</span> <span class="n">colnames</span><span class="p">(</span><span class="n">df</span><span class="p">[,</span> <span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="p">(</span><span class="n">n</span><span class="m">-3</span><span class="p">))])</span>
  <span class="nf">return</span><span class="p">(</span><span class="n">results</span><span class="p">)</span>
<span class="p">}</span>

<span class="n">pcor</span> <span class="o">&lt;-</span> <span class="n">partial_correlation</span><span class="p">(</span><span class="n">mtcars</span><span class="p">)</span>

<span class="n">library</span><span class="p">(</span><span class="n">ggcorrplot</span><span class="p">)</span>
<span class="n">ggcorrplot</span><span class="p">(</span><span class="n">pcor</span><span class="o">$</span><span class="n">estimate</span><span class="p">,</span>
           <span class="n">method</span> <span class="o">=</span> <span class="s2">"circle"</span><span class="p">,</span>  <span class="c1">#"square" (default), "circle"</span>
           <span class="n">type</span> <span class="o">=</span> <span class="s2">"lower"</span><span class="p">,</span> <span class="c1"># "full" (default), "lower" or "upper" display.</span>
           <span class="n">hc.order</span> <span class="o">=</span> <span class="kc">TRUE</span><span class="p">,</span> <span class="c1">#logical value. If TRUE, correlation matrix will be hc.ordered using hclust function.</span>
           <span class="n">outline.col</span> <span class="o">=</span> <span class="s2">"white"</span><span class="p">,</span> <span class="c1">#the outline color of square or circle. Default "gray"</span>
           <span class="n">ggtheme</span> <span class="o">=</span> <span class="n">ggplot2</span><span class="o">::</span><span class="n">theme_classic</span><span class="p">,</span> <span class="c1">#Change theme</span>
           <span class="n">colors</span> <span class="o">=</span> <span class="nf">c</span><span class="p">(</span><span class="s2">"red"</span><span class="p">,</span> <span class="s2">"white"</span><span class="p">,</span> <span class="s2">"green"</span><span class="p">),</span> <span class="c1">#Change colors</span>
           <span class="n">lab</span> <span class="o">=</span> <span class="kc">TRUE</span><span class="p">,</span> <span class="c1">#Add correlation coefficients</span>
           <span class="n">sig.level</span><span class="o">=</span><span class="m">0.05</span><span class="p">,</span> <span class="c1">#set significant level</span>
           <span class="n">p.mat</span> <span class="o">=</span> <span class="n">pcor</span><span class="o">$</span><span class="n">p.value</span><span class="p">,</span> <span class="c1">#Add correlation significance level</span>
           <span class="n">insig</span> <span class="o">=</span> <span class="s2">"blank"</span><span class="p">,</span> <span class="c1">#Leave blank on no significant coefficient</span>
           <span class="n">title</span> <span class="o">=</span> <span class="s2">"Partial correlations for mtcars"</span>
           <span class="p">)</span>
</code></pre>
</div>
</div>
<p><img src="http://chenyuan.date/images/Correlation-analysis-unnamed-chunk-12-1.png" alt="" /></p>
<h1 id="references" class="clickable-header top-level-header">References</h1>
<p><a href="http://www.sthda.com/english/wiki/correlation-test-between-two-variables-in-r">Correlation Test Between Two Variables in R</a><br />
<a href="http://www.sthda.com/english/wiki/correlation-formula">correlation formula</a><br />
<a href="http://www.sthda.com/english/wiki/correlation-coefficient">Correlation coefficient</a><br />
<a href="http://www.sthda.com/english/wiki/correlation-coefficient-calculator">Correlation coefficient calculator</a><br />
<a href="http://www.sthda.com/english/rsthda/correlation-matrix.php">Correlation matrix online software: Analysis and visualization</a><br />
<a href="http://www.sthda.com/english/wiki/elegant-correlation-table-using-xtable-r-package">Elegant correlation table using xtable R package</a><br />
<a href="http://www.sthda.com/english/wiki/correlation-matrix-an-r-function-to-do-all-you-need">Correlation matrix : An R function to do all you need</a><br />
<a href="http://www.sthda.com/english/wiki/correlation-matrix-a-quick-start-guide-to-analyze-format-and-visualize-a-correlation-matrix-using-r-software">Correlation matrix: analyze, format and visualize</a><br />
<a href="http://www.sthda.com/english/wiki/ggplot2-quick-correlation-matrix-heatmap-r-software-and-data-visualization">ggplot2: Quick correlation matrix heatmap</a><br />
<a href="http://www.sthda.com/english/wiki/visualize-correlation-matrix-using-correlogram">corrplot: Visualize correlation matrix</a><br />
<a href="http://www.sthda.com/english/wiki/ggcorrplot-visualization-of-a-correlation-matrix-using-ggplot2">ggcorrplot: Visualization of a correlation matrix using ggplot2</a><br />
<a href="https://cran.r-project.org/web/packages/ppcor/ppcor.pdf">ppcor</a><br />
<a href="https://stackoverflow.com/questions/30198968/pairwise-partial-correlation-of-a-matrix-controlling-by-one-variable">stackoverflow</a></p>
]]></content:encoded>
			<wfw:commentRss>https://www.biofacebook.com/?feed=rss2&#038;p=1365</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>How to Compare Regression Slopes</title>
		<link>https://www.biofacebook.com/?p=1234</link>
		<comments>https://www.biofacebook.com/?p=1234#comments</comments>
		<pubDate>Thu, 15 Dec 2016 05:52:02 +0000</pubDate>
		<dc:creator><![CDATA[szypanther]]></dc:creator>
				<category><![CDATA[统计学习]]></category>

		<guid isPermaLink="false">http://www.biofacebook.com/?p=1234</guid>
		<description><![CDATA[ How to Compare Regression Slopes Jim Frost 13 January, 2016 9 75 83 27 <p>If you perform linear regression analysis, you might need to compare different regression lines to see if their constants and slope coefficients are different. Imagine there is an established relationship between X and Y. Now, suppose you want to determine [...]]]></description>
				<content:encoded><![CDATA[<div class="postHeader">
<h1>How to Compare Regression Slopes</h1>
<div class="byline"><span class="author"><a href="http://blog.minitab.com/blog/adventures-in-statistics">Jim Frost</a></span> <span class="date">13 January, 2016</span></div>
<div class="social-share">
<div class="st_twitter_custom twitter social-item">
<div class="countbox"><i class="fa fa-lg fa-twitter"></i><span id="twitcount757b44c1-b42a-426e-a0d9-7a1349f32ed7" class="bubble">9</span></div>
</div>
<div class="st_facebook_custom facebook social-item">
<div class="countbox"><i class="fa fa-lg fa-facebook"></i><span id="fbcount757b44c1-b42a-426e-a0d9-7a1349f32ed7" class="bubble">75</span></div>
</div>
<div class="st_linkedin_custom linkedin social-item">
<div class="countbox"><i class="fa fa-lg fa-linkedin"></i><span id="incount757b44c1-b42a-426e-a0d9-7a1349f32ed7" class="bubble">83</span></div>
</div>
<div class="comment social-item">
<div class="countbox"><i class="fa fa-lg fa-comment"></i><a class="count-comments" href="http://blog.minitab.com/blog/adventures-in-statistics/how-to-compare-regression-lines-between-different-models" data-compendium-count="0">27</a></div>
</div>
</div>
</div>
<div class="post-body">
<div class="content">
<p>If you perform linear regression analysis, you might need to compare different regression lines to see if their constants and slope coefficients are different. Imagine there is an established relationship between X and Y. Now, suppose you want to determine whether that relationship has changed. Perhaps there is a new context, process, or some other qualitative change, and you want to determine whether that affects the relationship between X and Y.</p>
<p>For example, you might want to assess whether the relationship between the height and weight of football players is significantly different than the same relationship in the general population.</p>
<p>You can graph the regression lines to visually compare the slope coefficients and constants. However, you should also statistically test the differences. <a href="http://blog.minitab.com/blog/adventures-in-statistics/understanding-hypothesis-tests%3A-why-we-need-to-use-hypothesis-tests-in-statistics" target="_blank">Hypothesis testing</a> helps separate the true differences from the random differences caused by sampling error so you can have more confidence in your findings.</p>
<p>In this blog post, I’ll show you how to compare a relationship between different regression models and determine whether the differences are statistically significant. Fortunately, these tests are easy to do using <a href="http://www.minitab.com/en-us/products/minitab/" target="_blank">Minitab statistical software</a>.</p>
<p>In the example I’ll use throughout this post, there is an input variable and an output variable for a hypothetical process. We want to compare the relationship between these two variables under two different conditions. Here is the <a href="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/File/569a0e7d067944f6f9147434794efcd6/comparingregressionmodels.MPJ">Minitab project file</a> with the data.</p>
<h2>Comparing Constants in Regression Analysis</h2>
<p>When the <a href="http://blog.minitab.com/blog/adventures-in-statistics/regression-analysis-how-to-interpret-the-constant-y-intercept" target="_blank">constants</a> (or y intercepts) in two different regression equations are different, this indicates that the two regression lines are shifted up or down on the Y axis. In the scatterplot below, you can see that the Output from Condition B is consistently higher than Condition A for any given Input value. We want to determine whether this vertical shift is statistically significant.</p>
<p><img src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/2ed27f4204515bac9d9674c16fa0c0f7/scatter_constant_dift.png" alt="Scatterplot with two regression lines that have different constants." /></p>
<p>To test the difference between the constants, we just need to include a <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/introductory-concepts/data-concepts/cat-quan-variable/" target="_blank">categorical variable</a> that identifies the qualitative attribute of interest in the model. For our example, I have created a variable for the condition (A or B) associated with each observation.</p>
<p>To fit the model in Minitab, I’ll use: <strong>Stat &gt; Regression &gt; Regression &gt; Fit Regression Model</strong>. I’ll include <em>Output</em> as the <a href="http://support.minitab.com/en-us/minitab/17/topic-library/modeling-statistics/regression-and-correlation/regression-models/what-are-response-and-predictor-variables/" target="_blank">response variable</a>, <em>Input</em> as the continuous <a href="http://support.minitab.com/en-us/minitab/17/topic-library/modeling-statistics/regression-and-correlation/regression-models/what-are-response-and-predictor-variables/" target="_blank">predictor</a>, and <em>Condition</em> as the categorical predictor.</p>
<p>In the regression analysis output, we’ll first check the coefficients table.</p>
<p><img src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/23657868f2cf893d216d05d3400ab9e6/coeff_constant_dift.png" alt="Coefficients table that shows that the constants are different" /></p>
<p>This table shows us that the relationship between Input and Output is statistically significant because the p-value for Input is 0.000.</p>
<p>The <a href="http://blog.minitab.com/blog/adventures-in-statistics/how-to-interpret-regression-analysis-results-p-values-and-coefficients" target="_blank">coefficient</a> for Condition is 10 and its <a href="http://blog.minitab.com/blog/adventures-in-statistics/how-to-interpret-regression-analysis-results-p-values-and-coefficients" target="_blank">p-value</a> is significant (0.000). The coefficient tells us that the vertical distance between the two regression lines in the scatterplot is 10 units of Output. The p-value tells us that this difference is statistically significant—you can reject the null hypothesis that the distance between the two constants is zero. You can also see the difference between the two constants in the regression equation table below.</p>
<p><img src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/a879996e37ebb05a297721e695a71943/equ_constant_dift.png" alt="Regression equation table that shows constants that are different" /></p>
<h2>Comparing Coefficients in Regression Analysis</h2>
<p>When two slope coefficients are different, a one-unit change in a predictor is associated with different mean changes in the response. In the scatterplot below, it appears that a one-unit increase in Input is associated with a greater increase in Output in Condition B than in Condition A. We can <em>see</em> that the slopes look different, but we want to be sure this difference is statistically significant.</p>
<p><img src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/200c12087fdf7eecd9b773d9ce213020/scatter_slope_dift.png" alt="Scatterplot that shows two slopes that are different" /></p>
<p>How do you statistically test the difference between regression coefficients? It sounds like it might be complicated, but it is actually very simple. We can even use the same Condition variable that we did for testing the constants.</p>
<p>We need to determine whether the coefficient for Input depends on the Condition. In statistics, when we say that the effect of one variable depends on another variable, that’s an interaction effect. All we need to do is include the interaction term for Input*Condition!</p>
<p>In Minitab, you can specify interaction terms by clicking the <strong>Model</strong> button in the main regression dialog box. After I fit the regression model with the interaction term, we obtain the following coefficients table:</p>
<p><img src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/f06eff56f2266d0ff7e3919aa1292285/coeff_slope_dift.png" alt="Coefficients table that shows different slopes" /></p>
<p>The table shows us that the interaction term (Input*Condition) is statistically significant (p = 0.000). Consequently, we reject the null hypothesis and conclude that the difference between the two coefficients for Input (below, 1.5359 and 2.0050) does not equal zero. We also see that the main effect of Condition is not significant (p = 0.093), which indicates that difference between the two constants is not statistically significant.</p>
<p><img src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/d5e5142c0ff13645d1dacc3e2c0bee27/equ_coeff_dift.png" alt="Regression equation table that shows different slopes" /></p>
<p>It is easy to compare and test the differences between the constants and coefficients in regression models by including a categorical variable. These tests are useful when you can see differences between regression models and you want to defend your conclusions with p-values.</p>
<p>If you&#8217;re learning about regression, read my <a href="http://blog.minitab.com/blog/adventures-in-statistics/regression-analysis-tutorial-and-examples">regression tutorial</a>!</p>
</div>
</div>
]]></content:encoded>
			<wfw:commentRss>https://www.biofacebook.com/?feed=rss2&#038;p=1234</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
