小生这厢有礼了 (BioFaceBook Personal Blog) » Statistical Learning

Making a pairwise distance matrix in pandas

szypanther — Tue, 09 Jun 2020 03:31:54 +0000

This is a somewhat specialized problem that forms part of a lot of data science and clustering workflows. It starts with a relatively straightforward question: if we have a bunch of measurements for two different things, how do we come up with a single number that represents the difference between the two things?

An example will make the question clearer. Let’s load our Olympic medal dataset:

import pandas as pd
pd.options.display.max_rows = 10
pd.options.display.max_columns = 6
data = pd.read_csv("https://raw.githubusercontent.com/mojones/binders/master/olympics.csv", sep="\t")
data

	City	Year	Sport	…	Medal	Country	Int Olympic Committee code
0	Athens	1896	Aquatics	…	Gold	Hungary	HUN
1	Athens	1896	Aquatics	…	Silver	Austria	AUT
2	Athens	1896	Aquatics	…	Bronze	Greece	GRE
3	Athens	1896	Aquatics	…	Gold	Greece	GRE
4	Athens	1896	Aquatics	…	Silver	Greece	GRE
…	…	…	…	…	…	…	…
29211	Beijing	2008	Wrestling	…	Silver	Germany	GER
29212	Beijing	2008	Wrestling	…	Bronze	Lithuania	LTU
29213	Beijing	2008	Wrestling	…	Bronze	Armenia	ARM
29214	Beijing	2008	Wrestling	…	Gold	Cuba	CUB
29215	Beijing	2008	Wrestling	…	Silver	Russia	RUS

29216 rows × 12 columns

and measure, for each different country, the number of medals they’ve won in each different sport:

summary = data.groupby(['Country', 'Sport']).size().unstack().fillna(0)
summary

Sport	Aquatics	Archery	Athletics	…	Water Motorsports	Weightlifting	Wrestling
Country
Afghanistan	0.0	0.0	0.0	…	0.0	0.0	0.0
Algeria	0.0	0.0	6.0	…	0.0	0.0	0.0
Argentina	3.0	0.0	5.0	…	0.0	2.0	0.0
Armenia	0.0	0.0	0.0	…	0.0	4.0	4.0
Australasia	11.0	0.0	1.0	…	0.0	0.0	0.0
…	…	…	…	…	…	…	…
Virgin Islands*	0.0	0.0	0.0	…	0.0	0.0	0.0
West Germany	62.0	0.0	67.0	…	0.0	7.0	9.0
Yugoslavia	91.0	0.0	2.0	…	0.0	0.0	16.0
Zambia	0.0	0.0	1.0	…	0.0	0.0	0.0
Zimbabwe	7.0	0.0	0.0	…	0.0	0.0	0.0

137 rows × 42 columns

Now we’ll pick two countries:

summary.loc[['Germany', 'Italy']]

Sport	Aquatics	Archery	Athletics	…	Water Motorsports	Weightlifting	Wrestling
Country
Germany	175.0	6.0	99.0	…	0.0	20.0	24.0
Italy	113.0	12.0	71.0	…	0.0	14.0	20.0

2 rows × 42 columns

Each country has 42 columns giving the total number of medals won in each sport. Our job is to come up with a single number that summarizes how different those two lists of numbers are. Mathematicians have figured out lots of different ways of doing that, many of which are implemented in the scipy.spatial.distance module.

If we just import pdist from the module, and pass in our dataframe of two countries, we’ll get a measurement:

from scipy.spatial.distance import pdist
pdist(summary.loc[['Germany', 'Italy']])

array([342.3024978])

That’s the distance score using the default metric, which is called the Euclidean distance. Think of it as the straight-line distance between the two points in space defined by the two lists of 42 numbers.
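To make the "straight-line distance" idea concrete, here is a minimal pure-Python sketch of the same calculation. The function name euclidean_distance and the three-sport medal counts are made up for illustration; pdist does the equivalent work on whole dataframes.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length lists of numbers."""
    if len(a) != len(b):
        raise ValueError("both vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical medal counts for two countries across three sports
country_a = [3.0, 0.0, 4.0]
country_b = [0.0, 0.0, 0.0]
distance = euclidean_distance(country_a, country_b)
print(distance)
```

Because each difference is squared before summing, one sport with a very large gap in medal counts can dominate the result.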

Now, what happens if we pass in a dataframe with three countries?

pdist(summary.loc[['Germany', 'Italy', 'France']])

array([342.3024978 , 317.98584874, 144.82403116])

As we might expect, we have three measurements:

Germany and Italy
Germany and France
Italy and France

But it’s not easy to figure out which belongs to which. Happily, scipy also has a helper function that will take this list of numbers and turn it back into a square matrix:

from scipy.spatial.distance import squareform

squareform(pdist(summary.loc[['Germany', 'Italy', 'France']]))

array([[  0.        , 342.3024978 , 317.98584874],
       [342.3024978 ,   0.        , 144.82403116],
       [317.98584874, 144.82403116,   0.        ]])

In order to make sense of this, we need to re-attach the country names, which we can just do by turning it into a DataFrame:

pd.DataFrame(
    squareform(pdist(summary.loc[['Germany', 'Italy', 'France']])),
    columns = ['Germany', 'Italy', 'France'],
    index = ['Germany', 'Italy', 'France']
)

	Germany	Italy	France
Germany	0.000000	342.302498	317.985849
Italy	342.302498	0.000000	144.824031
France	317.985849	144.824031	0.000000

Hopefully this agrees with our intuition; the numbers on the diagonal are all zero, because each country is identical to itself, and the numbers above and below are mirror images, because the distance between Germany and France is the same as the distance between France and Germany (remember that we are talking about distance in terms of their medal totals, not geographical distance!)

Finally, to get pairwise measurements for the whole input dataframe, we just pass in the complete object and get the country names from the index:

pairwise = pd.DataFrame(
    squareform(pdist(summary)),
    columns = summary.index,
    index = summary.index
)

pairwise

Country	Afghanistan	Algeria	Argentina	…	Yugoslavia	Zambia	Zimbabwe
Country
Afghanistan	0.000000	8.774964	96.643675	…	171.947666	1.732051	17.492856
Algeria	8.774964	0.000000	95.199790	…	171.688672	7.348469	19.519221
Argentina	96.643675	95.199790	0.000000	…	148.128323	96.348326	89.810912
Armenia	5.830952	9.848858	96.477977	…	171.604196	5.744563	18.384776
Australasia	18.708287	20.024984	97.744565	…	166.991018	18.627936	22.360680
…	…	…	…	…	…	…	…
Virgin Islands*	1.414214	8.774964	96.457244	…	171.947666	1.732051	17.492856
West Germany	153.052279	150.306354	142.537714	…	184.945938	152.577849	144.045132
Yugoslavia	171.947666	171.688672	148.128323	…	0.000000	171.874955	169.103519
Zambia	1.732051	7.348469	96.348326	…	171.874955	0.000000	17.521415
Zimbabwe	17.492856	19.519221	89.810912	…	169.103519	17.521415	0.000000

137 rows × 137 columns

A nice way to visualize these is with a heatmap. 137 countries is a bit too much to show on a webpage, so let’s restrict it to just the countries that have won more than 500 medals in total:

import seaborn as sns
import matplotlib.pyplot as plt

# make summary table for just top countries
top_countries = (
    data
    .groupby('Country')
    .filter(lambda x : len(x) > 500)
    .groupby(['Country', 'Sport'])
    .size()
    .unstack()
    .fillna(0)
    )

# make pairwise distance matrix
pairwise_top = pd.DataFrame(
    squareform(pdist(top_countries)),
    columns = top_countries.index,
    index = top_countries.index
)

# plot it with seaborn
plt.figure(figsize=(10,10))
sns.heatmap(
    pairwise_top,
    cmap='OrRd',
    linewidth=1
)

Now that we have a plot to look at, we can see a problem with the distance metric we’re using. The US has won so many more medals than other countries that it distorts the measurement. And if we think about it, what we’re really interested in is not the exact number of medals in each category, but the relative number. In other words, we want two countries to be considered similar if they both have about twice as many medals in boxing as athletics, for example, regardless of the exact numbers.

Luckily for us, there is a distance measure already implemented in scipy that has that property – it’s called cosine distance. Think of it as a measurement that only looks at the relationships between the per-sport numbers for each country, not their magnitude. We can switch to cosine distance by specifying the metric keyword argument in pdist:

# make pairwise distance matrix
pairwise_top = pd.DataFrame(
    squareform(pdist(top_countries, metric='cosine')),
    columns = top_countries.index,
    index = top_countries.index
)

# plot it with seaborn
plt.figure(figsize=(10,10))
sns.heatmap(
    pairwise_top,
    cmap='OrRd',
    linewidth=1
)

And as you can see we spot some much more interesting patterns. Notice, for example, that Russia and Soviet Union have a very low distance (i.e. their medal distributions are very similar).
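The property that makes cosine distance useful here, namely that it ignores overall magnitude, can be checked with a small pure-Python sketch. The function cosine_distance and the three vectors are hypothetical; the formula (one minus the cosine of the angle between the vectors) is the standard definition.

```python
import math

def cosine_distance(a, b):
    """1 - cos(angle between a and b); assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

small = [2.0, 4.0, 1.0]        # a country with few medals
large = [20.0, 40.0, 10.0]     # ten times the medals, same proportions
different = [4.0, 2.0, 1.0]    # same total as `small`, different proportions

same_shape = cosine_distance(small, large)
other_shape = cosine_distance(small, different)
print(same_shape, other_shape)
```

The first distance is zero up to floating-point rounding, even though the totals differ tenfold, while the second is clearly positive even though the totals match.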

When looking at data like this, remember that the shade of each cell is not telling us anything about how many medals a country has won – simply how different or similar each country is to each other. Compare the above heatmap with this one which displays the proportion of medals in each sport per country:

plt.figure(figsize=(10,10))
sns.heatmap(
    top_countries.apply(lambda x : x / x.sum(), axis=1),
    cmap='BuPu',
    square=True,
    cbar_kws = {'fraction' : 0.02}
)

Finally, how might we find pairs of countries that have very similar medal distributions (i.e. very low numbers in the pairwise table)? By far the easiest way is to start off by reshaping the table into long form, so that each comparison is on a separate row:

# create our pairwise distance matrix
pairwise = pd.DataFrame(
    squareform(pdist(summary, metric='cosine')),
    columns = summary.index,
    index = summary.index
)

# move to long form
long_form = pairwise.unstack()

# rename columns and turn into a dataframe
long_form.index.rename(['Country A', 'Country B'], inplace=True)
long_form = long_form.to_frame('cosine distance').reset_index()

Now we can write our filter as normal, remembering to filter out the uninteresting rows that tell us a country’s distance from itself!

long_form[
    (long_form['cosine distance'] < 0.05) 
    & (long_form['Country A'] != long_form['Country B'])
]

	Country A	Country B	cosine distance
272	Algeria	Zambia	0.026671
1034	Azerbaijan	Mongolia	0.045618
1105	Bahamas	Barbados	0.021450
1111	Bahamas	British West Indies	0.021450
1113	Bahamas	Burundi	0.021450
…	…	…	…
17033	United Arab Emirates	Haiti	0.010051
17037	United Arab Emirates	Independent Olympic Participants	0.000000
17051	United Arab Emirates	Kuwait	0.000000
18164	Virgin Islands	Netherlands Antilles	0.000000
18496	Zambia	Algeria	0.026671

462 rows × 3 columns

https://www.drawingfromdata.com/making-a-pairwise-distance-matrix-with-pandas

dm = squareform(pdist(input_data, metric='braycurtis'))

Pandas and Sklearn

szypanther — Fri, 05 Jun 2020 03:02:14 +0000

Checking for missing data with the pandas isnull function

pandas isnull sum with column headers

for col in main_df:
    print(sum(pd.isnull(main_df[col])))

I get a list of the null count for each column:

0
1
100

What I’m trying to do is create a new dataframe which has the column header alongside the null count, e.g.

col1 | 0
col2 | 1
col3 | 100

#print every column using:

nulls = df.isnull().sum().to_frame()
for index, row in nulls.iterrows():
    print(index, row[0])

for col in df:
    print(df[col].unique())

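Using pandas.get_dummies (One-Hot Encoding)

get_dummies is the pandas way of doing one-hot encoding. See the official documentation for the full parameter list.

pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False)

Parameters:

data : array-like, Series, or DataFrame. The input data.
prefix : string, list of strings, or dict of strings, default None. Prefix for the column names produced by get_dummies.
columns : list-like, default None. Names of the columns to encode.
dummy_na : bool, default False. Add a column that marks missing values; if False, missing values are ignored.
drop_first : bool, default False. Produce k-1 dummy columns out of k category levels by dropping the first level.

Encoding discrete features falls into two cases:

1. The values have no natural order, e.g. color: [red, blue]. Use one-hot encoding.

2. The values have a natural order, e.g. size: [X, XL, XXL]. Use a numeric mapping such as {X: 1, XL: 2, XXL: 3}.

Example: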

import pandas as pd

df = pd.DataFrame([  
            ['green' , 'A'],   
            ['red'   , 'B'],   
            ['blue'  , 'A']])  

df.columns = ['color',  'class'] 
pd.get_dummies(df)

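Before get_dummies:

After get_dummies:

If you print df after running the code above, you still see the table from before get_dummies, because the result was never assigned back:

df = pd.get_dummies(df)

get_dummies can also be applied to a single column:

pd.get_dummies(df.color)

To encode a single column and merge the result back into the original data:

df = df.join(pd.get_dummies(df.color))

Reference: https://blog.csdn.net/maymay_/article/details/80198468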

 
>>> train_filter.info()

RangeIndex: 1482 entries, 0 to 1481
Columns: 182 entries, SampleID to BS120
dtypes: float64(177), int64(2), object(3)
memory usage: 2.1+ MB
>>> train_filter.dtypes
SampleID object
Streptococcus Infection float64
Duration_of_gestation object
Gestation_age float64
Gestation_age_G1 float64
Gestation_age_G2 float64
GDM_HDP float64
Age int64
Age_group int64
Blood_type float64
Medication_use float64
Progesterone_use float64
Pregnancy_mode float64
Native_place float64
Combined_disease float64
Infection float64
Scar_uterus float64
Risk_rating float64
Anamnesis float64
Thalassemia float64
Ovary_disease float64
Hepatopathy float64
Allergic_history float64
Thyroid_disease float64
Hysteromyoma float64
Breast_disease float64
Weight_at_delivery object
Weight_before_pregnancy float64
Height float64
BMI_before_pregnancy float64
 ... 
B_A/G float64
B_r_GT_G float64
B_r_GT float64
B_TBA_G float64
B_TBA float64
B_ALT_G float64
B_ALT float64
B_AST_G float64
B_AST float64
B_TBIL_G float64
B_TBIL float64
B_DBIL_G float64
B_DBIL float64
B_IBIL_G float64
B_IBIL float64
B_Crea_G float64
B_Crea float64
B_CysC_G float64
B_CysC float64
B_UA_G float64
B_UA float64
B_Urea_G float64
B_Urea float64
B_GLU_G float64
B_GLU float64
HbA1c_G float64
HbA1c float64
BS float64
BS60 float64
BS120 float64
Length: 182, dtype: object

scikit-learn is a machine learning toolkit for Python

http://www.scikitlearn.com.cn/

Microbial diversity research: differential analysis

szypanther — Fri, 27 Mar 2020 07:29:11 +0000

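1. Random forest model

Random forest is an efficient machine learning algorithm built on decision trees. It can be used to classify samples and for regression analysis.

It is a non-linear classifier, so it can capture complex non-linear dependencies between variables. Random forest analysis can identify the key OTUs that distinguish two groups of samples.

Feature Importance Scores table (from the random forest results)

Records how much each OTU contributes to the between-group difference.

Note: OTUs with a Mean_decrease_in_accuracy value above 0.05 are usually selected for further analysis; for samples with small between-group differences, the threshold may be lowered to 0.03.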

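2. Cross-validation analysis

Cross-validation is a practical statistical method that splits the data into smaller subsets. The analysis is first carried out on one subset, and the other subsets are then used to confirm and validate it. The initial subset is called the training set; the others are called the validation set or test set.

The most common form is k-fold cross-validation: the data are split into k subsets, each subset serves as the test set exactly once, and the remaining subsets form the training set. The procedure is repeated k times, and the mean classification accuracy over the k rounds is reported as the result.

Every sample is used for both training and testing, and every sample is validated exactly once.

The combinations of key OTUs selected by the random forest are enumerated in order to build an efficient classifier with the lowest error rate from the smallest number of OTUs.

Typically, the key OTUs selected by the random forest are evaluated in different combinations by 10-fold cross-validation to find the smallest OTU combination that most accurately separates the groups, which is then analysed further, for example by ROC analysis.

Note: the x-axis of the figure shows the number of OTUs in the combination and the y-axis the classification error rate for that number of OTUs. The combination with the fewest OTUs and the lowest error rate is taken as the minimal OTU set that distinguishes the groups.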
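The k-fold splitting described above can be sketched in a few lines of plain Python. The helper k_fold_indices is hypothetical and deliberately simple: it does not shuffle or stratify the samples, which a real analysis would normally do first.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k folds and yield (train, test) index lists."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # The first (n_samples % k) folds get one extra sample
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(10, 3))
for train, test in folds:
    print("train:", train, "test:", test)
```

Each sample index appears in exactly one test fold, which is the "every sample is validated exactly once" property mentioned above.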

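3. ROC curve

The receiver operating characteristic curve (ROC curve) is a tool for evaluating binary classifiers, i.e. problems with only two classes. It can be used to choose the best discriminating model and the best diagnostic cut-off value.

Using domain knowledge, the measurements of the disease group and the reference group are analysed to determine the upper and lower limits, the class interval and the cut-off points. A cumulative frequency table is drawn up at the chosen interval, and the sensitivity, specificity and false positive rate (1 - specificity) are computed at every cut-off point. Plotting sensitivity (the true positive rate) on the y-axis against 1 - specificity (the false positive rate) on the x-axis gives the ROC curve. The closer the curve lies to the top-left corner, the more accurate the diagnosis. Tests can also be compared by the area under their ROC curves (AUC): the test with the largest AUC has the best diagnostic value.

Note: the x-axis of the figure is the false positive rate (FPR), i.e. 1 - specificity, and the y-axis is the true positive rate (TPR), i.e. sensitivity. The point on the ROC curve closest to the top-left corner is the best threshold, with the smallest total of false positives and false negatives. The AUC normally lies between 0.5 and 1.0. For AUC > 0.5, the closer the AUC is to 1, the better the diagnostic performance: 0.5 to 0.7 indicates low accuracy, 0.7 to 0.9 moderate accuracy, and above 0.9 high accuracy. AUC = 0.5 means the method has no diagnostic value at all. AUC < 0.5 does not correspond to a realistic situation and rarely occurs in practice.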
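One way to see what the AUC measures is to compute it directly as the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one, which is equivalent to the area under the ROC curve. The sketch below assumes that interpretation; the function name roc_auc and the four example scores are illustrative only.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative.

    labels: 1 for positive (disease group), 0 for negative (reference group).
    A tie between a positive and a negative counts as half a win.
    """
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
print(auc)
```

The pairwise loop is quadratic in the number of samples, so it is meant for illustration rather than for large datasets.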

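4. Wilcoxon rank-sum test

The Wilcoxon rank-sum test, also known as the Mann–Whitney U test, is a non-parametric test for two independent samples. Its null hypothesis is that the two populations from which the samples were drawn have the same distribution; the test compares the mean ranks of the two samples to decide whether the distributions differ. It can be used to test for significant differences in taxa between two groups of samples, and the p-values can be adjusted to false discovery rate (FDR) q-values.

Note: mean is the mean relative abundance of the taxon in each of the two groups, and sd is the standard deviation of its relative abundance in each group. The p-value is the probability, under the null hypothesis, of a result at least as extreme as the one observed; p < 0.05 indicates a difference, p < 0.01 a significant difference, and q is the false discovery rate.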
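The rank-based statistic behind this test can be sketched as follows. This is a minimal illustration that computes only the U statistics (with mid-ranks for ties), not the p-value; the function name mann_whitney_u and the sample values are hypothetical.

```python
def mann_whitney_u(x, y):
    """Return (U_x, U_y) for two independent samples, using mid-ranks for ties."""
    pooled = sorted((value, group) for group, sample in ((0, x), (1, y)) for value in sample)
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        # Find the run of tied values starting at position i
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        mid_rank = (i + 1 + j) / 2.0  # average of the 1-based ranks i+1 .. j
        rank_sum_x += mid_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u_x = rank_sum_x - len(x) * (len(x) + 1) / 2.0
    u_y = len(x) * len(y) - u_x
    return u_x, u_y

separated = mann_whitney_u([1, 2, 3], [4, 5, 6, 7])
with_tie = mann_whitney_u([1, 2], [2, 3])
print(separated, with_tie)
```

When one sample lies entirely below the other, its U is 0 and the other sample's U equals the product of the two sample sizes, which is the most extreme result possible.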

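5. Heatmap of differential taxa

The generalization error is estimated by 10-fold cross-validation, with all other parameters left at their defaults. The modelling results also include the expected "baseline" error, i.e. the error rate obtained when every sample is assigned to the most common class. The importance of each OTU is measured by how much the model's prediction error increases when that OTU is removed; the higher the importance, the more the OTU contributes to the model's prediction accuracy.

The selected differential OTUs are clustered by their abundance in each sample and drawn as a heatmap, which makes it easy to see which taxa are enriched or depleted in which samples.

Note: in the figure, colours closer to blue indicate lower abundance and colours closer to orange-red indicate higher abundance. The clustering tree on the left is based on the Spearman correlation distance between taxa; the tree at the top is based on Bray-Curtis distance, the most commonly used between-sample distance.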

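6. Welch's t-test between two groups

Two groups with unequal variances can be compared with Welch's t-test. This analysis identifies the taxa (or gene abundances, for metagenomes) that differ significantly between the two groups.

Note: the figure shows the proportion of each gene (or taxon) in the two groups; the middle panel shows the difference in proportions with its 95% confidence interval, and the rightmost value is the p-value. p < 0.05 indicates a significant difference.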

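7. Box plot comparing Shannon diversity

The Shannon diversity indices of several sample groups (different categories or environments) are summarised by quartiles to compare the Shannon index between groups. A non-parametric Mann-Whitney test is used to assess the significance of the between-group differences.

Note: the x-axis shows the sample groups and the y-axis the corresponding alpha diversity value. The plot shows five statistics (minimum, first quartile, median, third quartile and maximum, i.e. the five lines from bottom to top). p < 0.05 indicates a significant difference; p < 0.01 indicates a highly significant difference.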

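8. Distance-based box plot

The distances for several sample groups (different categories or environments) are summarised by quartiles to compare the distribution of within-group and between-group distances. Multiple Student's two-sample t-tests are used to assess the significance of the between-group differences.

Uses of a box plot: identifying outliers; roughly estimating and judging the characteristics of the data; comparing the shape of several batches of data. When box plots for several batches are drawn side by side on the same axis, their medians, tail lengths, outliers and ranges can be seen at a glance.

A box plot, also called a box-and-whisker plot, describes data with five statistics: the minimum, first quartile, median, third quartile and maximum. It also gives a rough idea of whether the data are symmetric and how dispersed they are, and is especially useful for comparing several groups of samples. A simple box plot has five parts: the minimum, the median, the maximum and the two quartiles.

Note: the first quartile (Q1), or lower quartile, is the value at the 25% position when all values in the sample are sorted in ascending order. The second quartile (Q2), or median, is the value at the 50% position. The third quartile (Q3), or upper quartile, is the value at the 75% position.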
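The five statistics a box plot draws can be computed with Python's standard library. This is a small sketch, assuming Python 3.8 or later for statistics.quantiles; quartile conventions differ between tools, and the "inclusive" method used here is only one of them.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)

# Hypothetical measurements, e.g. one distance or diversity value per sample
summary = five_number_summary([1, 2, 3, 4, 5, 6, 7, 8, 9])
print(summary)
```

Plotting libraries may place the whiskers at 1.5 times the interquartile range rather than at the minimum and maximum, so a drawn box plot will not always match these five values exactly.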

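9. LEfSe analysis

LEfSe is a tool for discovering high-dimensional biomarkers and characterising genomic features, including genes, metabolic pathways and taxa, that distinguish two or more biological conditions (or classes). The algorithm emphasises both statistical significance and biological relevance, allowing researchers to identify differentially abundant features and the classes they are associated with.

LEfSe gains its discriminating power from tests of statistical difference, and then performs additional tests to assess whether those differences are consistent with the expected biological behaviour.

Specifically, it first uses the non-parametric factorial Kruskal-Wallis (KW) sum-rank test to detect features whose abundance differs significantly between classes. Finally, LEfSe uses linear discriminant analysis (LDA) to estimate the effect size of each component (taxon) on the difference.

Note: the left plot shows the LDA scores, obtained by linear discriminant analysis, of the microbial taxa that differ significantly between the two groups. The right plot is a cladogram; node size represents abundance, and by default the levels from phylum to genus are arranged from the centre outwards. Red and green areas mark the different groups: red nodes are taxa that play an important role in the red group, green nodes are taxa that play an important role in the green group, and yellow nodes are taxa that are not important in either group. The taxon names abbreviated by letters in the figure are listed in the legend on the right.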

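10. ANOSIM similarity analysis

Analysis of similarities (ANOSIM) is a non-parametric test of whether the differences between groups (two or more) are significantly larger than the differences within groups, and therefore whether the grouping is meaningful. First the Bray-Curtis distance between every pair of samples is computed and all distances are ranked from smallest to largest. The R value is computed with the formula below; the samples are then permuted and R* is recomputed each time. The P value is the probability that R* is greater than R.

$$R = \frac{\bar{r}_b - \bar{r}_w}{n(n-1)/4}$$

where

$\bar{r}_b$: mean rank of the between-group distances;

$\bar{r}_w$: mean rank of the within-group distances;

n: total number of samples.

Table. ANOSIM analysis

Note: in theory R ranges from -1 to +1; in practice it usually lies between 0 and 1. R close to 1 means the between-group differences are larger than the within-group differences; R close to 0 means there is no clear difference between and within groups. The P value reflects the statistical significance of the result: the smaller it is, the more significant the differences between the sample groups, with P < 0.05 considered significant. "Number of permutation" is the number of permutations.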

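11. Adonis multivariate analysis of variance

Adonis is also called permutational multivariate analysis of variance (permutational MANOVA) or non-parametric MANOVA. It partitions the total variance using a semi-metric (e.g. Bray-Curtis) or metric (e.g. Euclidean) distance matrix, analyses how much of the variation among samples is explained by each grouping factor, and uses a permutation test to assess the statistical significance of the partition.

Table. Permutational MANOVA analysis

Notes:

Group: the grouping;

Df: degrees of freedom;

SumsOfSqs: sum of squares, i.e. the sum of squared deviations;

MeanSqs: mean squares, i.e. SumsOfSqs/Df;

F.Model: the F statistic;

R2: how much of the variation among samples the grouping explains, i.e. the ratio of the group sum of squares to the total sum of squares; the larger R2 is, the more of the variation the grouping explains;

Pr(>F): the P value obtained by the permutation test; the smaller it is, the more significant the difference between groups.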

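Author: JarySun
Link: https://www.jianshu.com/p/87f24cceaa43
Source: Jianshu
Copyright belongs to the author. For commercial reuse, contact the author for permission; for non-commercial reuse, cite the source.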

ANOSIM, PERMANOVA/Adonis, MRPP (repost)

szypanther — Fri, 27 Mar 2020 07:13:55 +0000

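1. ANOSIM: analysis of similarities between groups

Analysis of similarities (ANOSIM) is a non-parametric test of whether the differences between groups (two or more) are significantly larger than the differences within groups, and therefore whether the grouping is meaningful. First the Bray-Curtis distance between every pair of samples is computed and all distances are ranked from smallest to largest. The R value is then computed; the samples are permuted and R* is recomputed each time. The P value is the probability that R* is greater than R.

Note: the figure contains N+1 boxes, where N is the number of groups. The "Between" box represents the differences between groups; the others represent the differences within each group. R ranges from -1 to +1; in practice it usually lies between 0 and 1. R close to 1 means the between-group differences are larger than the within-group differences; R close to 0 means there is no clear difference between and within groups. The reliability of the analysis is given by the P-value; P < 0.05 indicates statistical significance.

2. PERMANOVA/Adonis: permutational multivariate analysis of variance

PERMANOVA (permutational multivariate analysis of variance), also known as Adonis, partitions the total variance using a semi-metric (e.g. Bray-Curtis) or metric (e.g. Euclidean) distance matrix. It uses a linear model to analyse how much of the variation among samples is explained by grouping factors or environmental factors (e.g. clinical phenotype data, soil physicochemical measurements), and a permutation test to assess significance.

3. MRPP: multiple response permutation procedure

MRPP (Multiple Response Permutation Procedure) tests whether the differences between groups (two or more) are significantly larger than the differences within groups. Like ANOSIM, it can use a semi-metric (e.g. Bray-Curtis) or metric (e.g. Euclidean) distance matrix; it computes an A value that represents the between-group difference and uses a permutation test to assess the significance of the grouping.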

What exactly are the Adonis and ANOSIM tests? (repost)

szypanther — Fri, 27 Mar 2020 06:51:55 +0000

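When doing microbial 16S sequencing, the company's report often includes two tests, Adonis and ANOSIM. We have all heard of t-tests, Wilcoxon tests and ANOVA, but what are Adonis and ANOSIM?

Adonis: multivariate analysis of variance

Adonis is a multivariate analysis of variance, also called non-parametric multivariate analysis of variance. It uses a distance matrix (e.g. based on Bray-Curtis or Euclidean distance) to partition the total variance, analyses how much of the variation among samples is explained by each grouping factor, and uses a permutation test to assess statistical significance.

Adonis results usually look like this:

Index	Df	SumsOfSqs	MeanSqs	F.Model	R2	Pr(>F)
GroupFactor	4	1.0899	0.27248	1.4862	0.14883	0.011
Residuals	34	6.2335	0.18334		0.85117	
Total	38	7.3234			1.00000	

Here GroupFactor is the grouping used in the experiment
Df is the degrees of freedom
SumsOfSqs is the sum of squares, i.e. the sum of squared deviations
MeanSqs is the mean square (SumsOfSqs/Df)
F.Model is the F statistic
R2 is how much of the variation among samples the grouping explains; the larger R2 is, the more of the variation the grouping explains
Pr is the P value; a value below 0.05 means the result is significant.

So how is Adonis actually used?
In microbial analysis Adonis is usually combined with PCA. After running a PCA, we want to test whether the groups really differ and whether the difference is significant, and this is where the Adonis test comes in. In the figure below the three groups appear to be separated, but is the separation really significant? And how much of the variation among samples does the grouping explain? The Adonis test on the right gives a clear answer: the grouping is significant, with R2=0.11.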

Sylvain I A, Adams R I, Taylor J W. A different suite: The assemblage of distinct fungal communities in water-damaged units of a poorly-maintained public housing building[J]. PloS one, 2019, 14(3): e0213355.

In R, the adonis() or adonis2() function in the vegan package performs the Adonis test.

adonis2(formula, data, permutations = 999, method = "bray",
    sqrt.dist = FALSE, add = FALSE, by = "terms",
    parallel = getOption("mc.cores"), ...)

adonis(formula, data, permutations = 999, method = "bray",
    strata = NULL, contr.unordered = "contr.sum",
    contr.ordered = "contr.poly", parallel = getOption("mc.cores"), ...)

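ANOSIM: analysis of similarities

ANOSIM (analysis of similarities) is a non-parametric test for similarity among high-dimensional data. It compares the size of the between-group and within-group differences to judge whether a grouping is meaningful, and it works for two groups as well as for several.

The principle of ANOSIM, using the simplest case of two groups:
Suppose there are 6 samples, split by the experimental design into Group1 and Group2 with 3 samples each.

1. First, compute the within-group similarity from the distances between samples in the same group.

2. Then compute the between-group similarity from the distances between samples in different groups.

Combining the within-group and between-group distances gives:

Then compute R with the formula:

$$R = \frac{r_0 - r_w}{n(n-1)/4}$$

where
r_0 = mean rank of between group dissimilarities
r_w = mean rank of within group dissimilarities
n = the number of samples

From the formula, R lies in [-1, 1]:
When R approaches 1, the between-group differences are larger than the within-group differences.
When R = 0, there is no difference between groups, i.e. the grouping is ineffective.
When R approaches -1, the between-group differences are smaller than the within-group differences.

When R is greater than 0, we still need to test whether the difference is significant. ANOSIM uses a permutation test for this.

Permutation test:
1. Randomly reassign the original samples to groups (e.g. an experimental group and a control group)
2. Compute Ri for the random grouping and compare it with R
3. Repeat 1000 times
4. The P value is the percentage of permutations in which Ri is greater than R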
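The steps above (rank the pairwise distances, compute R, then permute the group labels) can be sketched in plain Python. This is an illustration under stated assumptions, not vegan's implementation: the function names are made up, the six one-dimensional samples are hypothetical, and the p-value counts permuted values greater than or equal to the observed R with the observed value included, which is a common convention for permutation tests.

```python
import random

def mid_ranks(values):
    """Rank values from 1..len(values), giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j < len(order) and values[order[j]] == values[order[i]]:
            j += 1
        for k in range(i, j):
            ranks[order[k]] = (i + 1 + j) / 2.0
        i = j
    return ranks

def anosim_r(dist, groups):
    """ANOSIM R from a square distance matrix and one group label per sample."""
    n = len(groups)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    ranks = mid_ranks([dist[i][j] for i, j in pairs])
    within = [r for r, (i, j) in zip(ranks, pairs) if groups[i] == groups[j]]
    between = [r for r, (i, j) in zip(ranks, pairs) if groups[i] != groups[j]]
    r_w = sum(within) / len(within)
    r_b = sum(between) / len(between)
    return (r_b - r_w) / (n * (n - 1) / 4.0)

def anosim_permutation_test(dist, groups, permutations=999, seed=0):
    """Return (R, p), where p is based on permuted R values >= the observed R."""
    rng = random.Random(seed)
    observed = anosim_r(dist, groups)
    shuffled = list(groups)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)  # reassign group labels at random
        if anosim_r(dist, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)

# Hypothetical 1-D samples: two tight, well-separated groups of three
points = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
labels = ["Group1"] * 3 + ["Group2"] * 3
dist = [[abs(a - b) for b in points] for a in points]
r_value, p_value = anosim_permutation_test(dist, labels)
print(r_value, p_value)
```

With only six samples there are very few distinct ways to split them into two groups of three, so the p-value cannot become small here even though R reaches its maximum of 1; larger sample sizes are needed for the permutation test to have any power.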

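After dimension-reduction analyses such as PCoA or NMDS we face the same question: the data look separated, but are the differences between groups really significant? ANOSIM can be used to test this as well.

The vegan package in R also provides an ANOSIM test. Here is a demonstration with R's built-in iris dataset: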

library(vegan)
library(ggplot2)
#Delete Species Infor
dat<-subset(iris,select = -Species)
#Calculate Distance
iris.dist<-vegdist(dat)
#MDS analysis
m<-monoMDS(iris.dist)
MDS<-as.data.frame(m$points)
#Gain group information
MDS$group<-iris$Species
#Plot
p<-ggplot(MDS,aes(MDS1,MDS2,col=group,shape=group))+
  geom_point()+
  theme_bw()+
  theme(legend.title=element_blank())

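The figure above shows directly that the between-group differences are larger than the within-group differences, and the three groups are clearly separated.
Next we use the ANOSIM test to check the conclusion drawn from the figure.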

#ANOSIM
anosim_result<-anosim(dat,iris$Species,permutations = 999)
summary(anosim_result)
plot(anosim_result, col = c('#FFD700','#FF7F00','#EE2C2C','#D02090'))


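The figure above shows that the between-group differences are larger than the within-group differences. R=0.858 is close to 1 and the P value is 0.001, below 0.05, so the differences between the groups are clear and the grouping is meaningful.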

Author: jlyq617
Link: https://www.jianshu.com/p/dfa689f7cafd
Source: Jianshu
Alpha diversity

szypanther — Tue, 18 Feb 2020 03:24:22 +0000

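Diversity indices in amplicon data analysis: alpha diversity

Diversity indices and their formulas are described on Wikipedia.

Alpha diversity is the analysis of species diversity within a single sample. It combines two factors: richness, the variety of species in the sample, and evenness, how evenly individuals are distributed among those species. Indices such as Richness, Chao1, Shannon, Simpson, Dominance and Equitability are commonly used to assess the species diversity of a sample.

Richness indices

Richness, Chao1 and Shannon are commonly used richness measures; the higher the value, the higher the species richness of the sample.

Richness index: the number of OTUs detected in the sample;
Chao1 index   : estimates the number of OTUs in the sample from the low-abundance OTUs;
Shannon index : takes into account both the OTUs in the sample and their relative abundances,
                using a logarithm (base 2 for shannon_2, natural log for shannon_e,
                base 10 for shannon_10) to estimate the taxonomic diversity of the sample.

Evenness indices

Simpson, Dominance and Equitability are commonly used evenness measures.

Simpson index     : the probability that two randomly chosen sequences belong to the same class (e.g. OTU), so it lies between 0 and 1;
                    the closer to 1, the less even the OTU abundance distribution;
Dominance index   : equal to 1-Simpson, the probability that two randomly chosen sequences belong to different classes (e.g. OTUs);
Equitability index: computed from the Shannon index; a value of 1 means the species abundances are perfectly even,
                    and the smaller the value, the more skewed the abundance distribution.

Summary table:

Index	Unit	Calculation
richness	OTUs	number of OTUs containing at least one sequence in the sample
chao1	OTUs	N + S^2 / (2D), where N is the number of OTUs, S the number of OTUs with abundance 1 and D the number of OTUs with abundance 2
shannon_2	bits	-sum(p*log(p,2)) over all OTUs, where p is the frequency of the OTU
shannon_e	nats	-sum(p*log(p,e)) over all OTUs, where p is the frequency of the OTU
shannon_10	dits	-sum(p*log(p,10)) over all OTUs, where p is the frequency of the OTU
simpson	Probability	sum(f^2), where f is the frequency of each OTU
dominance	Probability	1-simpson
equitability		shannon/log(N), where N is the number of OTUs (logs to base 2)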
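The formulas in the summary table can be tried out on a list of per-OTU read counts with a short sketch. The function alpha_diversity and its inputs are hypothetical; Chao1 is written in its classic form N + S^2/(2D), and the fallback used when there are no doubletons (the bias-corrected form) is an assumption of this sketch rather than something the table specifies.

```python
import math
from collections import Counter

def alpha_diversity(counts):
    """Compute a few alpha diversity indices from per-OTU read counts."""
    present = [c for c in counts if c > 0]
    total = sum(present)
    freqs = [c / total for c in present]
    richness = len(present)
    abundance_classes = Counter(present)
    singletons, doubletons = abundance_classes[1], abundance_classes[2]
    # Classic Chao1; bias-corrected form when there are no doubletons
    if doubletons > 0:
        chao1 = richness + singletons ** 2 / (2.0 * doubletons)
    else:
        chao1 = richness + singletons * (singletons - 1) / 2.0
    shannon_2 = -sum(p * math.log2(p) for p in freqs)
    simpson = sum(p * p for p in freqs)
    equitability = shannon_2 / math.log2(richness) if richness > 1 else float("nan")
    return {
        "richness": richness,
        "chao1": chao1,
        "shannon_2": shannon_2,
        "simpson": simpson,
        "dominance": 1.0 - simpson,
        "equitability": equitability,
    }

even_sample = alpha_diversity([5, 5, 5, 5, 0])
rare_sample = alpha_diversity([1, 1, 2, 10])
print(even_sample)
print(rare_sample)
```

In the perfectly even sample, equitability comes out as 1, and in the second sample the two singletons push the Chao1 estimate above the observed richness.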

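Examples:

USEARCH alpha_div

USEARCH provides the alpha_div command for computing these indices. The indices to compute can be selected with -metrics; the supported indices are: berger_parker, buzas_gibson, chao1, dominance, equitability, jost, jost1, reads, richness, robbins, simpson, shannon_e, shannon_2, shannon_10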

usearch -alpha_div otutable.txt -output alpha.txt
usearch -alpha_div otutable.txt -output gini.txt  -metrics gini_simpson
usearch -alpha_div otutable.txt -output alpha.txt -metrics chao1,

QIIME diversity alpha

The qiime2 analysis pipeline provides commands for alpha diversity analysis through the qiime diversity interface:

--i-table  : FeatureTable
--p-metric : enspie|michaelis_menten_fit|strong|lladser_pe|fisher_alpha
             |goods_coverage|doubles|simpson|margalef|observed_otus|osd
             |shannon|pielou_e|chao1|brillouin_d|menhinick|simpson_e
             |kempton_taylor_q|robbins|dominance|lladser_ci|heip_e
             |singles|chao1_ci|mcintosh_d|ace|mcintosh_e|gini_index
             |berger_parker_d|esty_ci
--o-alpha-diversity: the output alpha diversity artifact;
--output-dir: the output directory (if --o-alpha-diversity is not specified);

Run:

qiime diversity alpha          \
   --i-table  table.qza       \
   --p-metric  goods_coverage \
   --o-alpha-diversity  goods_coverage.qza

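Biodiversity is measured at three main spatial scales: α diversity, β diversity and γ diversity.

α diversity concerns the number of species within a local, homogeneous habitat, and is therefore also called within-habitat diversity.

β diversity is the dissimilarity in species composition between communities in different habitats along an environmental gradient, or the rate of species turnover along that gradient; it is also called between-habitat diversity. The main ecological factors controlling β diversity are soil, topography and disturbance.

The fewer species shared between different communities, or between different points along an environmental gradient, the greater the β diversity. Measuring β diversity accurately matters because: (1) it indicates how strongly habitats are partitioned among species; (2) β diversity values can be used to compare the habitat diversity of different areas; (3) together with α diversity, β diversity makes up the overall diversity, or biological heterogeneity, of an area.

γ diversity describes diversity at the regional or continental scale, i.e. the number of species at that scale; it is also called regional diversity. The ecological processes controlling γ diversity are mainly water and heat dynamics, climate, and the history of speciation and evolution. The main measure is the number of species (S). Along altitudinal gradients γ diversity shows two patterns: a hump-shaped distribution and a significant negative correlation.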


https://rdrr.io/cran/otuSummary/man/alphaDiversity.html

Invsimpson – mothur

The invsimpson calculator is the inverse of the classical Simpson diversity estimator. This parameter is preferred to other measures of alpha-diversity because it is an indication of the richness in a community with uniform evenness that would have the same level of diversity.

https://www.mothur.org/wiki/Invsimpson

Biological diversity - the great variety of life !

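Before exploring the Simpson index, we need to understand a few important concepts:

Biodiversity can be quantified in many ways; the two main factors are richness and evenness.

1. Richness

Richness is the number of species per sample: the more species in a sample, the "richer" the sample.

Conceptually, species richness takes no account of how many individuals each species has. It gives species with few individuals the same weight as species with many individuals. So in a given area, 1 daisy affects richness just as much as 1000 buttercups.

2. Evenness

Evenness is a measure of the relative abundance of the different species that make up the richness of an area.

[Translator's note] Three concepts are involved here: richness, evenness and abundance. For example, group A has 3 individuals of class 1, 5 of class 2 and 6 of class 3; group B has 4 of class 1, 4 of class 2 and 4 of class 3. Both groups contain 3 classes, so their richness is the same. The three classes in group A have different numbers of individuals, while those in group B have the same number, so the evenness of A and B differs. Group A has 3 individuals of class 1 and group B has 4, so for class 1 the abundance is higher in group B.

As an example, suppose we sample the wildflowers in two areas. The first area contains 300 daisies, 335 dandelions and 365 buttercups. The second contains 20 daisies, 49 dandelions and 931 buttercups (see the table below). Both samples have the same richness (3 species) and the same total number of individuals (1000). However, the first sample is more even than the second: in the first area the individuals are spread fairly evenly across the 3 species, while the second area is mostly buttercups with only a few daisies and dandelions. Sample 2 is therefore considered less diverse than sample 1.

A community dominated by one or two species is less diverse than a community made up of many species with similar abundances.

Diversity increases as species richness and evenness increase. The Simpson index takes both richness and evenness into account.

Simpson's diversity index actually refers to three closely related indices: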

Simpson’s Index (D)

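It measures the probability that two individuals drawn at random from the same sample belong to the same class. There are two versions of the formula for the Simpson index; they do not contradict each other and either is acceptable.

$$D = \sum \left(\frac{n}{N}\right)^2 \qquad D = \frac{\sum n(n-1)}{N(N-1)}$$

n = the total number of organisms of a particular species
N = the total number of organisms of all species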

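D lies between 0 and 1. 0 represents infinite diversity and 1 represents no diversity, so the larger D is, the lower the diversity. This is counter-intuitive, so D is usually subtracted from 1: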

Simpson’s Index of Diversity 1-D

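This value also lies between 0 and 1, but now a larger value means higher diversity, which is more intuitive. In this form the index is the probability that two individuals drawn at random from the same sample belong to different classes.

There is another way to deal with the counter-intuitive D, which is to take its reciprocal: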

Simpson’s Reciprocal Index 1 / D

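The minimum of 1/D is 1, which means the sample consists of a single species. The larger the value, the higher the diversity. The maximum is the number of species in the sample: for example, if a sample contains 5 species, the maximum of 1/D is 5.

[Translator's note] 1/D reaches its maximum of 5 when all 5 species have equal abundance. This can be shown by finding the extremum with partial derivatives; the proof is omitted because it is not the focus of this post.

Which of the three indices to use depends on the needs of the analysis, but a study must state which one is being reported as the Simpson index! [Translator's note: the original author stresses this point, so take note!]

# ====================== End of translation =======================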

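The example in this material is good, but unfortunately it only illustrates the relationship between the Simpson index and evenness. To make a single-factor comparison, the author set the richness of the two groups equal. What if the richness differs? And is the Simpson index, like the Shannon index, independent of absolute abundance? Here is another example (the groups are independent of each other, so I give no biological meaning and go straight to the numbers; see the separate post on the Shannon index [2] for details):

Group A: 2, 4, 6, 8

Group B: 20, 40, 60, 80

Group C: 5, 5, 5, 5

Group D: 5, 5, 5, 5, 5

Substituting into the formula 1-D (because the scikit-bio library used by QIIME, the classic microbial 16S rRNA pipeline, computes it this way [3]), we obtain:

Simpson index of group A: 1-((2/20)^2+(4/20)^2+(6/20)^2+(8/20)^2) = 0.7

Shannon index of group A: 1.846439 (for the formula see post [2]; the same applies below)

Simpson index of group B: 1-((20/200)^2+(40/200)^2+(60/200)^2+(80/200)^2) = 0.7

Shannon index of group B: 1.846439

Simpson index of group C: 1-((5/20)^2)*4 = 0.75

Shannon index of group C: 2.0

Simpson index of group D: 1-((5/25)^2)*5 = 0.8

Shannon index of group D: 2.321928

The calculations clearly show that A and B are equal, C and D are not equal, and A and C are not equal.

The identical results for A and B show that, at equal richness, the Simpson index does not depend on absolute abundance, only on relative abundance (evenness). The same holds for the Shannon index; ultimately this is because the variable in both formulas is the relative abundance pi.

The different results for C and D show that, at equal evenness, the Simpson index depends on richness: the greater the richness, the smaller D and therefore the larger 1-D (0.75 for C against 0.8 for D). The Shannon index behaves the same way (2.0 against 2.321928). The reason is that both formulas are sums over classes. With S equally abundant classes, pi = 1/S, so D = S*(1/S)² = 1/S, which shrinks as S grows; each term -x*log2(x) of the Shannon index is positive on (0, 1) (see the plot of y = -x*log2(x) in post [2]), and with equal abundances the sum is log2(S), which grows with S. In sampling terms: when every class has the same number of individuals, the more classes there are, the lower the probability that two individuals drawn at random from the sample belong to the same class.

A and C show that, at equal richness, the more even the sample, the larger the Simpson index 1-D (0.7 against 0.75), i.e. the more even the classes, the lower the probability that two randomly drawn individuals belong to the same class. See the analysis in post [2]. Like y = -x*log2(x) for the Shannon index, the Simpson term y = -x² is concave on the interval (0, 1]: it is a monotonically decreasing function whose slope keeps decreasing.

In summary, both the Simpson and Shannon indices are composite measures of evenness and richness.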
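The numbers quoted for groups A to D can be reproduced with a few lines of Python. This is a minimal check using the 1-D form of the Simpson index and the base-2 Shannon index; the function names are illustrative and are not the scikit-bio API.

```python
import math

def simpson_diversity(counts):
    """Simpson's index of diversity, 1 - D, with D = sum(p_i ** 2)."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def shannon_2(counts):
    """Shannon index with base-2 logarithm: -sum(p_i * log2(p_i))."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

groups = {
    "A": [2, 4, 6, 8],
    "B": [20, 40, 60, 80],
    "C": [5, 5, 5, 5],
    "D": [5, 5, 5, 5, 5],
}
results = {
    name: (round(simpson_diversity(counts), 6), round(shannon_2(counts), 6))
    for name, counts in groups.items()
}
for name, (simpson, shannon) in results.items():
    print(name, simpson, shannon)
```

A and B give identical values because only the proportions enter the formulas, while C and D differ only in the number of equally abundant classes.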

〔1〕 http://www.countrysideinfo.co.uk/simpsons.htm

〔2〕 http://blog.sciencenet.cn/blog-2970729-1069399.html

〔3〕 http://scikit-bio.org/docs/latest/generated/generated/skbio.diversity.alpha.simpson.html#skbio.diversity.alpha.simpson

This article is from Lu Rui's ScienceNet blog.
Link: http://blog.sciencenet.cn/blog-2970729-1069539.html

Multivariate analyses in R (PERMANOVA )

szypanther — Sat, 07 Dec 2019 06:08:55 +0000

https://rpubs.com/collnell/manova

Multivariate analyses in R

By C Nell

Types of questions

Do groups differ in composition?
Does community structure vary among regions or over time?
Do environmental variables explain community patterns?
Which species are responsible for differences among groups?

Multivariate analysis of ecological communities with vegan

install.packages('vegan')
library(vegan) ##Community ecology: ordination, diversity & dissimilarities

Dataset

Bird abundances from 32 different plots (rows), 12 of which have 1 tree species (DIVERSITY = M) and 20 with 4 tree species (DIVERSITY = P).
Tree composition: there are a total of 6 possible tree species (treecomp), each signified with a letter A to F. Bird abundances are totalled according to their feeding guild (columns).

Get data from internet:

birds<-read.csv('https://raw.githubusercontent.com/collnell/lab-demo/master/bird_by_fg.csv')
trees<-read.csv('https://raw.githubusercontent.com/collnell/lab-demo/master/tree_comp.csv')

Or from your computer:

setwd("/Users/colleennell/Dropbox/Projects/Mexico/R") #change to data folder
birds<-read.csv('bird_by_fg.csv')
head(birds)

##   DIVERSITY PLOT CA FR GR HE IN NE OM
## 1         M    3  0  0  0  0  2  0  0
## 2         M    9  0  0  2  0  6  0  4
## 3         M   12  0  0  0  0  2  0  2
## 4         M   17  0  0  0  0  7  0  4
## 5         M   20  0  0  0  0  1  0  4
## 6         M   21  0  0  3  0 14  0  7

trees<-read.csv('tree_comp.csv')
head(trees)

##   PLOT comp A B C D E F row col
## 1    3    D 0 0 0 1 0 0   3   1
## 2    9    A 1 0 0 0 0 0   2   2
## 3   12    E 0 0 0 0 1 0   5   2
## 4   17    F 0 0 0 0 0 1   3   3
## 5   20    A 1 0 0 0 0 0   6   3
## 6   21    B 0 1 0 0 0 0   7   3

Questions: Is C. pentandra (B) associated with variation in bird species composition? Does feeding guild composition differ between monoculture and polyculture plots?

MANOVA (Multivariate analysis of variance)

Parametric test for differences between independent groups for multiple continuous dependent variables. Like ANOVA for many response variables. Requires fewer response variables than samples.

Is C. pentandra (B) associated with variation in bird species composition? Or D & F (both Fabaceae)?

bird.matrix<-as.matrix(birds[,3:9])##response variables in a sample x species matrix
trees$B<-as.factor(trees$B)

bird.manova<-manova(bird.matrix~as.factor(B), data=trees) ##manova test
summary(bird.manova)

##              Df  Pillai approx F num Df den Df  Pr(>F)  
## as.factor(B)  1 0.39147   2.2056      7     24 0.07027 .
## Residuals    30                                         
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Show univariate results:

summary.aov(bird.manova)

Assumptions of MANOVA

Normal distribution
Linearity
Homogeneity of variances
Homogeneity of covariances

Problem: Most ecological data is overdispersed, has many 0’s or rare species, unequal sample sizes.
Solution: Dissimilarity coefficients, permutation tests

PERMANOVA: Permutational multivariate analysis of variance

Non-parametric, based on dissimilarities. Allows for partitioning of variability, similar to ANOVA, allowing for complex design (multiple factors, nested design, interactions, covariates). Uses permutation to test the significance of the F-statistic (pseudo-F).
Interactive app demonstrating permutation tests

Based on Legendre & Anderson (1999, Ecological Monographs) and Anderson (2001, Austral Ecology).

Null hypothesis: Groups do not differ in spread or position in multivariate space.
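The pseudo-F statistic can be computed directly from a distance matrix, following the sums-of-squares partition described by Anderson (2001). The Python sketch below covers only a one-way design and only the statistic itself; the function name pseudo_f and the toy data are hypothetical, and the permutation step (shuffling group labels and recomputing F) is left out.

```python
def pseudo_f(dist, groups):
    """PERMANOVA pseudo-F for a one-way design, from a square distance matrix."""
    n = len(groups)
    levels = sorted(set(groups))
    # Total sum of squares: squared distances over all pairs, divided by N
    ss_total = sum(dist[i][j] ** 2 for i in range(n) for j in range(i + 1, n)) / n
    # Within-group sum of squares: the same per group, divided by the group size
    ss_within = 0.0
    for level in levels:
        members = [i for i in range(n) if groups[i] == level]
        ss = sum(dist[i][j] ** 2 for a, i in enumerate(members) for j in members[a + 1:])
        ss_within += ss / len(members)
    ss_among = ss_total - ss_within
    return (ss_among / (len(levels) - 1)) / (ss_within / (n - len(levels)))

# Hypothetical 1-D data with Euclidean distance
points = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
labels = ["M", "M", "M", "P", "P", "P"]
dist = [[abs(a - b) for b in points] for a in points]
f_value = pseudo_f(dist, labels)
print(f_value)
```

With Euclidean distance on a single variable, the pseudo-F equals the ordinary one-way ANOVA F, which gives a convenient sanity check for the toy data.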

1. Transform or standardize data

Use square root or proportions to minimize influence of most abundant groups.

bird.mat<-sqrt(bird.matrix)#square root transform
#bird.prop<-decostand(bird.matrix, method="total")

2. Calculate ecological resemblance

Quantify pairwise compositional dissimilarity between sites based on species occurrences.
– Bray-Curtis dissimilarity (abundance weighted)
– Jaccard (presence/absence)
– Gower’s (non-continuous variables)

Dissimilarity: 0 = sites are identical, 1 = sites do not share any species

Create a dissimilarity matrix:

bird.dist<-vegdist(bird.mat, method='bray')
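For a single pair of sites, the Bray-Curtis dissimilarity is the sum of absolute abundance differences divided by the total abundance of both sites. The Python sketch below illustrates that formula only; the function name bray_curtis and the site vectors are made up.

```python
def bray_curtis(x, y):
    """Bray-Curtis dissimilarity between two non-negative abundance vectors."""
    denominator = sum(a + b for a, b in zip(x, y))
    if denominator == 0:
        raise ValueError("both samples are empty")
    return sum(abs(a - b) for a, b in zip(x, y)) / denominator

site_1 = [2, 0, 6, 0]
site_2 = [2, 0, 6, 0]   # identical composition to site_1
site_3 = [0, 5, 0, 3]   # shares no species with site_1
site_4 = [1, 0, 3, 0]   # same species as site_1 at half the abundance

identical = bray_curtis(site_1, site_2)
disjoint = bray_curtis(site_1, site_3)
half = bray_curtis(site_1, site_4)
print(identical, disjoint, half)
```

Unlike cosine distance, Bray-Curtis is sensitive to total abundance: site_4 has the same proportions as site_1 but is not at distance 0 from it.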

3. perMANOVA

Do monoculture and polyculture plots differ in feeding guild composition?

set.seed(36) #reproducible results

bird.div<-adonis2(bird.dist~DIVERSITY, data=birds, permutations = 999, method="bray", strata="PLOT")
bird.div

## Permutation test for adonis under reduced model
## Terms added sequentially (first to last)
## Permutation: free
## Number of permutations: 999
## 
## adonis2(formula = bird.dist ~ DIVERSITY, data = birds, permutations = 999, method = "bray", strata = "PLOT")
##           Df SumOfSqs      F Pr(>F)   
## DIVERSITY  1  0.32857 4.1585  0.008 **
## Residual  30  2.37033                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

strata = the groups within which observations are permuted (the ‘exchangeable units’). Important for nested designs.

4. Multivariate dispersion

The average distance to group centroid. Used as a measure of multivariate beta diversity.

dispersion<-betadisper(bird.dist, group=birds$DIVERSITY)
permutest(dispersion)

## 
## Permutation test for homogeneity of multivariate dispersions
## Permutation: free
## Number of permutations: 999
## 
## Response: Distances
##           Df  Sum Sq   Mean Sq      F N.Perm Pr(>F)
## Groups     1 0.00369 0.0036924 0.2231    999  0.638
## Residuals 30 0.49659 0.0165530

plot(dispersion, hull=FALSE, ellipse=TRUE) ##sd ellipse
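The "distance to centroid" idea is easiest to see with plain Euclidean coordinates. The sketch below uses made-up points and a hypothetical helper `dist_to_centroid`; betadisper() does the analogous computation in principal coordinate space so that it works for any dissimilarity (and, as far as I know, it uses the spatial median rather than the centroid by default).

```r
# Toy coordinates (made up): two groups of three points in 2-D
xy <- rbind(c(0, 0), c(2, 0), c(1, 0),
            c(0, 0), c(0, 4), c(0, 2))
grp <- rep(c("A", "B"), each = 3)

# Euclidean distance of every row to the column means of its group
dist_to_centroid <- function(coords) {
  centroid <- colMeans(coords)
  sqrt(rowSums(sweep(coords, 2, centroid)^2))
}

d_cent <- numeric(nrow(xy))
for (g in unique(grp)) {
  idx <- grp == g
  d_cent[idx] <- dist_to_centroid(xy[idx, , drop = FALSE])
}

# Mean distance to centroid per group: the dispersion being compared
tapply(d_cent, grp, mean)  # A = 2/3, B = 4/3
```

Group B is spread twice as far around its centroid as group A, which is the kind of difference permutest() checks for.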

NMDS

Non-metric multi-dimensional scaling. Unconstrained ordination. See (https://jonlefcheck.net/2012/10/24/nmds-tutorial-in-r/).
The goal of NMDS is to represent the original position of communities in multidimensional space as accurately as possible using a reduced number of dimensions that can be easily visualized. NMDS uses rank orders to preserve distances among objects and can thus accommodate a variety of data types.

Configure samples in 2-dimensional space:

birdMDS<-metaMDS(bird.mat, distance="bray", k=2, trymax=35, autotransform=TRUE) ##k is the number of dimensions
birdMDS ##metaMDS takes either a distance matrix or your community matrix (the latter requires a method for 'distance=')

stressplot(birdMDS)

Stress: mismatch between observed dissimilarities and ordination distances (lower is better). Stress < 0.15 indicates an acceptable fit.

install.packages('ggplot2') ##plotting package
library(ggplot2)

##pull points from MDS
NMDS1 <- birdMDS$points[,1] ##also found using scores(birdMDS)
NMDS2 <- birdMDS$points[,2]
bird.plot<-cbind(birds, NMDS1, NMDS2, trees)

#plot ordination
p<-ggplot(bird.plot, aes(NMDS1, NMDS2, color=DIVERSITY))+
  geom_point(position=position_jitter(.1), shape=3)+##separates overlapping points
  stat_ellipse(type='t',size =1)+ ##draws 95% confidence interval ellipses
  theme_minimal()
p

Add labels for tree species composition:

#plot ordination
p<-ggplot(bird.plot, aes(NMDS1, NMDS2, color=DIVERSITY))+
  stat_ellipse(type='t',size =1)+
  theme_minimal()+geom_text(data=bird.plot,aes(NMDS1, NMDS2, label=comp), position=position_jitter(.35))+
  annotate("text", x=min(NMDS1), y=min(NMDS2), label=paste('Stress =',round(birdMDS$stress,3))) #add stress to plot
p

Fit vectors to ordination

Which environmental variables are correlated with the ordination?

fit<-envfit(birdMDS, bird.mat)
arrow<-data.frame(fit$vectors$arrows,R = fit$vectors$r, P = fit$vectors$pvals)
arrow$FG <- rownames(arrow)
arrow.p<-dplyr::filter(arrow, P <= 0.05) ##dplyr's filter, not stats::filter

p<-ggplot(data=bird.plot, aes(NMDS1, NMDS2))+
  geom_point(data=bird.plot, aes(NMDS1, NMDS2, color=DIVERSITY),position=position_jitter(.1))+##separates overlapping points
  stat_ellipse(aes(fill=DIVERSITY), alpha=.2,type='t',size =1, geom="polygon")+ ##changes shading on ellipses
  theme_minimal()+
  geom_segment(data=arrow.p, aes(x=0, y=0, xend=NMDS1, yend=NMDS2, label=FG, lty=FG), arrow=arrow(length=unit(.2, "cm")*arrow.p$R)) ##add arrows (scaled by R-squared value)

p

Show gradient of insectivores:

ordisurf(birdMDS, bird.mat[,'IN'], bubble=TRUE)##bubble size reflects abundance of insectivores

## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## y ~ s(x1, x2, k = 10, bs = "tp", fx = FALSE)
## 
## Estimated degrees of freedom:
## 5.6  total = 6.6 
## 
## REML score: 35.72947

Resources:
GUSTA ME – Provides several ‘wizards’ in choosing the correct statistical test, walkthrough examples of multivariate analyses, and guide to the major types of analyses.

Correlation tests, correlation matrix, and corresponding visualization methods in R (forward)

szypanther — Wed, 26 Jun 2019 04:46:28 +0000

https://rstudio-pubs-static.s3.amazonaws.com/240657_5157ff98e8204c358b2118fa69162e18.html

The following content is mostly compiled (with some original additions on my side) from the material that can be found at http://www.sthda.com/, as well as in the vignette for the corrplot R package – An Introduction to corrplot Package. The sole purpose of this text is to put all the info into one document in an easy to search format. Since I’m a huge fan of Hadley Wickham’s work I’ll insist on solutions based in “tidyverse” whenever possible…

Install and load required R packages

We’ll use the ggpubr R package for easy ggplot2-based data visualization, corrplot to plot correlograms, Hmisc to calculate correlation matrices containing both correlation coefficients and p-values, and of course tidyverse for all the data wrangling, plotting and the like:

require(ggpubr)
require(tidyverse)
require(Hmisc)
require(corrplot)

Methods for correlation analyses

There are different methods to perform correlation analysis:

Pearson correlation (r), which measures the linear dependence between two variables (x and y). It’s also known as a parametric correlation test because it depends on the distribution of the data. It can be used only when x and y are normally distributed. The plot of y = f(x) is called the linear regression curve.
Kendall $\tau$ and Spearman $\rho$, which are rank-based correlation coefficients (non-parametric).
The most commonly used method is the Pearson correlation method.

Compute correlation in R

R functions

Correlation coefficients can be computed in R by using the functions cor() and cor.test():

cor() computes the correlation coefficient

cor.test() tests for association/correlation between paired samples. It returns both the correlation coefficient and the significance level (or p-value) of the correlation.

The simplified formats are:

cor(x, y, method = c("pearson", "kendall", "spearman"))
cor.test(x, y, method=c("pearson", "kendall", "spearman"))

where:

x, y: numeric vectors with the same length

method: correlation method

If the data contain missing values, the following R code can be used to handle missing values by case-wise deletion:

cor(x, y,  method = "pearson", use = "complete.obs")
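A tiny made-up example shows what the use argument changes: with the default settings a single NA makes the whole result NA, while case-wise deletion drops the incomplete pair first.

```r
# Toy vectors (made up); the last pair is incomplete
x <- c(1, 2, 3, 4, NA)
y <- c(2, 4, 5, 9, 10)

r_default  <- cor(x, y)                        # NA: the missing value propagates
r_complete <- cor(x, y, use = "complete.obs")  # uses only the first four pairs

r_default
r_complete
```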

Preliminary considerations

We’ll use the well known built-in mtcars R dataset.

head(mtcars)

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

We’d like to compute the correlation between mpg and wt variables.

First let’s visualise our data by means of a scatter plot. We’ll be using the ggpubr R package.

library(ggpubr)

my_data <- mtcars
my_data$cyl <- factor(my_data$cyl)
str(my_data)

## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

ggscatter(my_data, x = "wt", y = "mpg",
          add = "reg.line", conf.int = TRUE,
          cor.coef = TRUE, cor.method = "pearson",
          xlab = "Weight (1000 lbs)", ylab = "Miles/ (US) gallon")

Preliminary test to check the test assumptions

Is the relation between variables linear? Yes, from the plot above, the relationship can be modeled closely enough as linear. When a scatter plot shows a curved pattern, we are dealing with a nonlinear association between the two variables.
Are the data from each of the 2 variables (x, y) following a normal distribution?
- Use Shapiro-Wilk normality test $\to$ R function: shapiro.test()
- and look at the normality plot $\to$ R function: ggpubr::ggqqplot()

Shapiro-Wilk test can be performed as follows:
- Null hypothesis: the data are normally distributed
- Alternative hypothesis: the data are not normally distributed

# Shapiro-Wilk normality test for mpg
shapiro.test(my_data$mpg) # => p = 0.1229

## 
##  Shapiro-Wilk normality test
## 
## data:  my_data$mpg
## W = 0.94756, p-value = 0.1229

# Shapiro-Wilk normality test for wt
shapiro.test(my_data$wt) # => p = 0.09

## 
##  Shapiro-Wilk normality test
## 
## data:  my_data$wt
## W = 0.94326, p-value = 0.09265

As can be seen from the output, the two p-values are greater than the predetermined significance level of 0.05, implying that the distributions of the data are not significantly different from the normal distribution. In other words, we can assume normality.

One more option for checking the normality of the data distribution is visual inspection of the Q-Q plots (quantile-quantile plots). Q-Q plot draws the correlation between a given sample and the theoretical normal distribution.

Again, we’ll use the ggpubr R package to obtain “pretty”, i.e. publishing-ready, Q-Q plots.

library("ggpubr")
# Check for the normality of "mpg"
ggqqplot(my_data$mpg, ylab = "MPG")

# Check for the normality of "wt"
ggqqplot(my_data$wt, ylab = "WT")

From the Q-Q normality plots, we can assume that both samples may come from populations that, closely enough, follow normal distributions.

It is important to note that if the data does not follow the normal distribution, at least closely enough, it’s recommended to use the non-parametric correlation, including Spearman and Kendall rank-based correlation tests.

Pearson correlation test

Example:

res <- cor.test(my_data$wt, my_data$mpg, method = "pearson")
res

## 
##  Pearson's product-moment correlation
## 
## data:  my_data$wt and my_data$mpg
## t = -9.559, df = 30, p-value = 1.294e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.9338264 -0.7440872
## sample estimates:
##        cor 
## -0.8676594

So what’s happening here? First of all let’s clarify the meaning of this printout:

t is the t-test statistic value (t = -9.559),
df is the degrees of freedom (df= 30),
p-value is the significance level of the t-test (p-value = $1.294 \times 10^{-10}$).
conf.int is the confidence interval of the correlation coefficient at 95% (conf.int = [-0.9338, -0.7441]);
sample estimates is the correlation coefficient (Cor.coeff = -0.87).

Interpretation of the results: As can be seen from the results above, the p-value of the test is $1.294 \times 10^{-10}$, which is less than the significance level $\alpha = 0.05$. We can conclude that wt and mpg are significantly correlated with a correlation coefficient of -0.87 and p-value of $1.294 \times 10^{-10}$.
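If you want to see where these numbers come from, the following sketch recomputes the t statistic, the p-value and the confidence interval by hand from the correlation coefficient alone. It assumes the usual textbook formulas (t with n - 2 degrees of freedom, Fisher's z transformation for the interval); the variable names are mine.

```r
x <- mtcars$wt
y <- mtcars$mpg
n <- length(x)
r <- cor(x, y)

# t statistic with n - 2 degrees of freedom
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
# two-sided p-value from the t distribution
p_val <- 2 * pt(-abs(t_stat), df = n - 2)
# 95% confidence interval via Fisher's z transformation
ci <- tanh(atanh(r) + c(-1, 1) * qnorm(0.975) / sqrt(n - 3))

round(c(t = t_stat, lower = ci[1], upper = ci[2]), 4)
p_val
```

The results should agree with the cor.test() printout above (t = -9.559, interval from about -0.934 to -0.744).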

Access to the values returned by cor.test() function

The function cor.test() returns a list containing the following components:

str(res)

## List of 9
##  $ statistic  : Named num -9.56
##   ..- attr(*, "names")= chr "t"
##  $ parameter  : Named int 30
##   ..- attr(*, "names")= chr "df"
##  $ p.value    : num 1.29e-10
##  $ estimate   : Named num -0.868
##   ..- attr(*, "names")= chr "cor"
##  $ null.value : Named num 0
##   ..- attr(*, "names")= chr "correlation"
##  $ alternative: chr "two.sided"
##  $ method     : chr "Pearson's product-moment correlation"
##  $ data.name  : chr "my_data$wt and my_data$mpg"
##  $ conf.int   : atomic [1:2] -0.934 -0.744
##   ..- attr(*, "conf.level")= num 0.95
##  - attr(*, "class")= chr "htest"

Of these we are most interested in:

p.value: the p-value of the test
estimate: the correlation coefficient

# Extract the p.value
res$p.value

## [1] 1.293959e-10

# Extract the correlation coefficient
res$estimate

##        cor 
## -0.8676594

Kendall rank correlation test

The Kendall rank correlation coefficient or Kendall’s $\tau$ statistic is used to estimate a rank-based measure of association. This test may be used if the data do not necessarily come from a bivariate normal distribution.

res2 <- cor.test(my_data$mpg, my_data$wt, method = "kendall")

## Warning in cor.test.default(my_data$mpg, my_data$wt, method = "kendall"):
## Cannot compute exact p-value with ties

res2

## 
##  Kendall's rank correlation tau
## 
## data:  my_data$mpg and my_data$wt
## z = -5.7981, p-value = 6.706e-09
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
##        tau 
## -0.7278321

Here tau is the Kendall correlation coefficient, so the correlation coefficient between mpg and wt is -0.7278 and the p-value is $6.706 \times 10^{-9}$.

Spearman rank correlation coefficient

Spearman’s $\rho$ statistic is also used to estimate a rank-based measure of association. This test may be used if the data do not come from a bivariate normal distribution.

res3 <- cor.test(my_data$wt, my_data$mpg, method = "spearman")

## Warning in cor.test.default(my_data$wt, my_data$mpg, method = "spearman"):
## Cannot compute exact p-value with ties

res3

## 
##  Spearman's rank correlation rho
## 
## data:  my_data$wt and my_data$mpg
## S = 10292, p-value = 1.488e-11
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## -0.886422

Here, rho is the Spearman’s correlation coefficient, so the correlation coefficient between mpg and wt is -0.8864 and the p-value is $1.488 \times 10^{-11}$.

How to interpret correlation coefficient

Value of the correlation coefficient can vary between -1 and 1:

-1 indicates a perfect negative correlation: every time x increases, y decreases
0 means that there is no linear (or, for the rank-based coefficients, monotonic) association between the two variables (x and y)
1 indicates a perfect positive correlation: y increases with x

What is a correlation matrix?

Previously, we described how to perform correlation test between two variables. In the following sections we’ll see how a correlation matrix can be computed and visualized. The correlation matrix is used to investigate the dependence between multiple variables at the same time. The result is a table containing the correlation coefficients between each variable and the others.

Compute correlation matrix in R

We have already mentioned the cor() function in the introductory part of this document, dealing with the correlation test for the bivariate case. It can also be used to compute a correlation matrix. A simplified format of the function is:

cor(x, method = c("pearson", "kendall", "spearman"))

Here:

x is numeric matrix or a data frame.
method: indicates the correlation coefficient to be computed. The default is the “pearson” correlation coefficient, which measures the linear dependence between two variables. As already explained, the “kendall” and “spearman” correlation methods are non-parametric rank-based correlation tests.

If your data contain missing values, the following R code can be used to handle missing values by case-wise deletion:

cor(x, method = "pearson", use = "complete.obs")

Plain correlation matrix

Example:

library(dplyr)

my_data <- select(mtcars, mpg, disp, hp, drat, wt, qsec)
head(my_data)

##                    mpg disp  hp drat    wt  qsec
## Mazda RX4         21.0  160 110 3.90 2.620 16.46
## Mazda RX4 Wag     21.0  160 110 3.90 2.875 17.02
## Datsun 710        22.8  108  93 3.85 2.320 18.61
## Hornet 4 Drive    21.4  258 110 3.08 3.215 19.44
## Hornet Sportabout 18.7  360 175 3.15 3.440 17.02
## Valiant           18.1  225 105 2.76 3.460 20.22

#Let's compute the correlation matrix
cor_1 <- round(cor(my_data), 2)
cor_1

##        mpg  disp    hp  drat    wt  qsec
## mpg   1.00 -0.85 -0.78  0.68 -0.87  0.42
## disp -0.85  1.00  0.79 -0.71  0.89 -0.43
## hp   -0.78  0.79  1.00 -0.45  0.66 -0.71
## drat  0.68 -0.71 -0.45  1.00 -0.71  0.09
## wt   -0.87  0.89  0.66 -0.71  1.00 -0.17
## qsec  0.42 -0.43 -0.71  0.09 -0.17  1.00

Unfortunately, the function cor() returns only the correlation coefficients between variables. In the next section, we will use Hmisc R package to calculate the correlation p-values.

Correlation matrix with significance levels (p-value)

The function rcorr() (in Hmisc package) can be used to compute the significance levels for pearson and spearman correlations. It returns both the correlation coefficients and the p-value of the correlation for all possible pairs of columns in the data table.

Simplified format:

rcorr(x, type = c("pearson","spearman"))

x should be a matrix. The correlation type can be either pearson or spearman.

Example:

library("Hmisc")

cor_2 <- rcorr(as.matrix(my_data))
cor_2

##        mpg  disp    hp  drat    wt  qsec
## mpg   1.00 -0.85 -0.78  0.68 -0.87  0.42
## disp -0.85  1.00  0.79 -0.71  0.89 -0.43
## hp   -0.78  0.79  1.00 -0.45  0.66 -0.71
## drat  0.68 -0.71 -0.45  1.00 -0.71  0.09
## wt   -0.87  0.89  0.66 -0.71  1.00 -0.17
## qsec  0.42 -0.43 -0.71  0.09 -0.17  1.00
## 
## n= 32 
## 
## 
## P
##      mpg    disp   hp     drat   wt     qsec  
## mpg         0.0000 0.0000 0.0000 0.0000 0.0171
## disp 0.0000        0.0000 0.0000 0.0000 0.0131
## hp   0.0000 0.0000        0.0100 0.0000 0.0000
## drat 0.0000 0.0000 0.0100        0.0000 0.6196
## wt   0.0000 0.0000 0.0000 0.0000        0.3389
## qsec 0.0171 0.0131 0.0000 0.6196 0.3389

The output of the function rcorr() is a list containing the following elements :

r : the correlation matrix
n : the matrix of the number of observations used in analyzing each pair of variables
P : the p-values corresponding to the significance levels of correlations.

Extracting the p-values or the correlation coefficients from the output:

str(cor_2)

## List of 3
##  $ r: num [1:6, 1:6] 1 -0.848 -0.776 0.681 -0.868 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:6] "mpg" "disp" "hp" "drat" ...
##   .. ..$ : chr [1:6] "mpg" "disp" "hp" "drat" ...
##  $ n: int [1:6, 1:6] 32 32 32 32 32 32 32 32 32 32 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:6] "mpg" "disp" "hp" "drat" ...
##   .. ..$ : chr [1:6] "mpg" "disp" "hp" "drat" ...
##  $ P: num [1:6, 1:6] NA 9.38e-10 1.79e-07 1.78e-05 1.29e-10 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:6] "mpg" "disp" "hp" "drat" ...
##   .. ..$ : chr [1:6] "mpg" "disp" "hp" "drat" ...
##  - attr(*, "class")= chr "rcorr"

# As you can see "cor_2" is a list so extracting these values is quite simple...

# p-values
cor_2$P

##               mpg         disp           hp         drat           wt
## mpg            NA 9.380354e-10 1.787838e-07 1.776241e-05 1.293956e-10
## disp 9.380354e-10           NA 7.142686e-08 5.282028e-06 1.222311e-11
## hp   1.787838e-07 7.142686e-08           NA 9.988768e-03 4.145833e-05
## drat 1.776241e-05 5.282028e-06 9.988768e-03           NA 4.784268e-06
## wt   1.293956e-10 1.222311e-11 4.145833e-05 4.784268e-06           NA
## qsec 1.708199e-02 1.314403e-02 5.766250e-06 6.195823e-01 3.388682e-01
##              qsec
## mpg  1.708199e-02
## disp 1.314403e-02
## hp   5.766250e-06
## drat 6.195823e-01
## wt   3.388682e-01
## qsec           NA

# Correlation matrix
cor_2$r

##             mpg       disp         hp        drat         wt        qsec
## mpg   1.0000000 -0.8475513 -0.7761683  0.68117189 -0.8676594  0.41868404
## disp -0.8475513  1.0000000  0.7909486 -0.71021390  0.8879799 -0.43369791
## hp   -0.7761683  0.7909486  1.0000000 -0.44875914  0.6587479 -0.70822340
## drat  0.6811719 -0.7102139 -0.4487591  1.00000000 -0.7124406  0.09120482
## wt   -0.8676594  0.8879799  0.6587479 -0.71244061  1.0000000 -0.17471591
## qsec  0.4186840 -0.4336979 -0.7082234  0.09120482 -0.1747159  1.00000000

Custom function for convenient formatting of the correlation matrix

This section provides a simple function for formatting a correlation matrix into a table with 4 columns containing :

Column 1 : row names (variable 1 for the correlation test)
Column 2 : column names (variable 2 for the correlation test)
Column 3 : the correlation coefficients
Column 4 : the p-values of the correlations

flat_cor_mat <- function(cor_r, cor_p){
  #This function provides a simple formatting of a correlation matrix
  #into a table with 4 columns containing :
    # Column 1 : row names (variable 1 for the correlation test)
    # Column 2 : column names (variable 2 for the correlation test)
    # Column 3 : the correlation coefficients
    # Column 4 : the p-values of the correlations
  library(tidyr)
  library(tibble)
  library(dplyr) # needed for left_join()
  cor_r <- rownames_to_column(as.data.frame(cor_r), var = "row")
  cor_r <- gather(cor_r, column, cor, -1)
  cor_p <- rownames_to_column(as.data.frame(cor_p), var = "row")
  cor_p <- gather(cor_p, column, p, -1)
  cor_p_matrix <- left_join(cor_r, cor_p, by = c("row", "column"))
  cor_p_matrix
}

cor_3 <- rcorr(as.matrix(mtcars[, 1:7]))

my_cor_matrix <- flat_cor_mat(cor_3$r, cor_3$P)
head(my_cor_matrix)

##    row column        cor            p
## 1  mpg    mpg  1.0000000           NA
## 2  cyl    mpg -0.8521619 6.112697e-10
## 3 disp    mpg -0.8475513 9.380354e-10
## 4   hp    mpg -0.7761683 1.787838e-07
## 5 drat    mpg  0.6811719 1.776241e-05
## 6   wt    mpg -0.8676594 1.293956e-10
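Note that the flattened table above lists every pair twice (plus the diagonal). If you only need each pair once and would rather not depend on Hmisc or tidyr, something along these lines works in base R. This is just a sketch: the object names are mine, and it calls cor.test() once per pair, which is fine for a handful of variables but slow for hundreds.

```r
dat <- mtcars[, c("mpg", "cyl", "disp", "hp")]

# One row per unordered pair of variables
var_pairs <- t(combn(names(dat), 2))
pair_tab <- data.frame(row = var_pairs[, 1], column = var_pairs[, 2],
                       stringsAsFactors = FALSE)

# Run the Pearson test for every pair and keep estimate and p-value
tests <- lapply(seq_len(nrow(pair_tab)), function(i) {
  cor.test(dat[[pair_tab$row[i]]], dat[[pair_tab$column[i]]])
})
pair_tab$cor <- sapply(tests, function(tt) unname(tt$estimate))
pair_tab$p   <- sapply(tests, function(tt) tt$p.value)

pair_tab
```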

Visualization of a correlation matrix

There are several different ways for visualizing a correlation matrix in R software:

symnum() function
corrplot() function to plot a correlogram
scatter plots
heatmap

We’ll run through all of these, and then go a bit more into detail with correlograms.

Use `symnum()` function: Symbolic number coding

The R function symnum() is used to symbolically encode a given numeric or logical vector or array. It is particularly useful for visualization of structured matrices, e.g., correlation, sparse, or logical ones. In the case of a correlation matrix it replaces correlation coefficients by symbols according to the level of the correlation.

Simplified format:

symnum(x, cutpoints = c(0.3, 0.6, 0.8, 0.9, 0.95),
       symbols = c(" ", ".", ",", "+", "*", "B"),
       abbr.colnames = TRUE)

Here:

x: the correlation matrix to visualize
cutpoints: correlation coefficient cutpoints. The correlation coefficients between 0 and 0.3 are replaced by a space (“ ”); correlation coefficients between 0.3 and 0.6 are replaced by “.”; etc.
symbols: the symbols to use.
abbr.colnames: logical value. If TRUE, colnames are abbreviated.

Example:

cor_4 <- cor(mtcars[1:6])
symnum(cor_4, abbr.colnames = FALSE)

##      mpg cyl disp hp drat wt
## mpg  1                      
## cyl  +   1                  
## disp +   *   1              
## hp   ,   +   ,    1         
## drat ,   ,   ,    .  1      
## wt   +   ,   +    ,  ,    1 
## attr(,"legend")
## [1] 0 ' ' 0.3 '.' 0.6 ',' 0.8 '+' 0.9 '*' 0.95 'B' 1

*As indicated in the legend, the correlation coefficients between 0 and 0.3 are replaced by a space (“ ”); correlation coefficients between 0.3 and 0.6 are replaced by “.”; etc.*
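The mapping from coefficient to symbol can be mimicked with cut(), which makes the role of the cutpoints explicit. This is only an emulation on made-up values: I am assuming that symnum() works on absolute values for correlation input, which is consistent with the negative mtcars correlations getting symbols in the output above.

```r
# Made-up correlation coefficients, one per symbol class
r <- c(0.10, -0.45, 0.70, -0.85, 0.93, 0.99)

sym <- cut(abs(r),
           breaks = c(0, 0.3, 0.6, 0.8, 0.9, 0.95, 1),
           labels = c(" ", ".", ",", "+", "*", "B"),
           include.lowest = TRUE)

data.frame(r = r, symbol = as.character(sym))
```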

Use the `corrplot()` function: Draw a correlogram

The function corrplot(), in the package of the same name, creates a graphical display of a correlation matrix, highlighting the most correlated variables in a data table.

In this plot, correlation coefficients are colored according to the value. Correlation matrix can be also reordered according to the degree of association between variables.

The simplified format of the function is:

corrplot(corr, method="circle")

Here:

corr: the correlation matrix to be visualized
method: The visualization method to be used, there are seven different options: “circle”, “square”, “ellipse”, “number”, “shade”, “color”, “pie”.

Example:

M<-cor(mtcars)
head(round(M,2))

##        mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
## mpg   1.00 -0.85 -0.85 -0.78  0.68 -0.87  0.42  0.66  0.60  0.48 -0.55
## cyl  -0.85  1.00  0.90  0.83 -0.70  0.78 -0.59 -0.81 -0.52 -0.49  0.53
## disp -0.85  0.90  1.00  0.79 -0.71  0.89 -0.43 -0.71 -0.59 -0.56  0.39
## hp   -0.78  0.83  0.79  1.00 -0.45  0.66 -0.71 -0.72 -0.24 -0.13  0.75
## drat  0.68 -0.70 -0.71 -0.45  1.00 -0.71  0.09  0.44  0.71  0.70 -0.09
## wt   -0.87  0.78  0.89  0.66 -0.71  1.00 -0.17 -0.55 -0.69 -0.58  0.43

#Visualize the correlation matrix

# method = "circle"
corrplot(M, method = "circle")

# method = "ellipse"
corrplot(M, method = "ellipse")

# method = "pie"
corrplot(M, method = "pie")

# method = "color"
corrplot(M, method = "color")

Display the correlation coefficient:

corrplot(M, method = "number")

Correlogram layouts:

There are three general types of a correlogram layout :

“full” (default) : display full correlation matrix
“upper”: display upper triangular of the correlation matrix
“lower”: display lower triangular of the correlation matrix

Examples:

# upper triangular
corrplot(M, type = "upper")

#lower triangular
corrplot(M, type = "lower")

Reordering the correlation matrix

The correlation matrix can be reordered according to the correlation coefficient. This is important to identify the hidden structure and pattern in the matrix. Use order = "hclust" argument for hierarchical clustering of correlation coefficients.

Example:

# correlogram with hclust reordering
corrplot(M, order = "hclust")

# or exploit the symmetry of the correlation matrix
# correlogram with hclust reordering
corrplot(M, type = "upper", order = "hclust")
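For the curious, the reordering can be approximated without corrplot. The sketch below assumes that order = "hclust" clusters the variables on 1 - correlation with hclust()'s default complete linkage; that matches my understanding of the package, but treat it as an assumption rather than its documented behaviour.

```r
M <- cor(mtcars)

# Treat 1 - correlation as a distance: highly correlated variables are "close"
hc <- hclust(as.dist(1 - M))
ord <- hc$order

# Apply the same ordering to rows and columns
M_ordered <- M[ord, ord]
colnames(M_ordered)
```

Variables that end up next to each other in this ordering are the ones that form the coloured blocks in the reordered correlogram.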

Changing the color and direction of text labels in the correlogram

Examples:

# Change background color to lightgreen and color of the circles to darkorange and steel blue
corrplot(M, type = "upper", order = "hclust", col = c("darkorange", "steelblue"),
         bg = "lightgreen")

# use "colorRampPalette" to obtain continuous color scales
col <- colorRampPalette(c("darkorange", "white", "steelblue"))(20)
corrplot(M, type = "upper", order = "hclust", col = col)

# Or use "RColorBrewer" package
library(RColorBrewer)
corrplot(M, type = "upper", order = "hclust",
         col = brewer.pal(n = 9, name = "PuOr"), bg = "darkgreen")

Use the tl.col argument for defining the text label color and tl.srt for text label string rotation.

Example:

corrplot(M, type = "upper", order = "hclust", tl.col = "darkblue", tl.srt = 45)

Combining correlogram with the significance test

# Mark the insignificant coefficients according to the specified p-value significance level
cor_5 <- rcorr(as.matrix(mtcars))
M <- cor_5$r
p_mat <- cor_5$P
corrplot(M, type = "upper", order = "hclust", 
         p.mat = p_mat, sig.level = 0.01)

# Leave blank on no significant coefficient
corrplot(M, type = "upper", order = "hclust", 
         p.mat = p_mat, sig.level = 0.05, insig = "blank")

Fine tuning customization of the correlogram

col <- colorRampPalette(c("#BB4444", "#EE9988", "#FFFFFF", "#77AADD", "#4477AA"))
corrplot(M, method = "color", col = col(200),  
         type = "upper", order = "hclust", 
         addCoef.col = "black", # Add coefficient of correlation
         tl.col = "darkblue", tl.srt = 45, #Text label color and rotation
         # Combine with significance level
         p.mat = p_mat, sig.level = 0.01,  
         # hide correlation coefficient on the principal diagonal
         diag = FALSE 
         )

I’d say this is more than enough for introductory exploration of correlograms. More information can be found in the, already mentioned, vignette for the corrplot R package – An Introduction to corrplot Package

Use `chart.Correlation()`: Draw scatter plots

The function chart.Correlation() from the package “PerformanceAnalytics” can be used to display a chart of a correlation matrix. This is a very convenient way of exploring multivariate correlations.

library("PerformanceAnalytics")

## Loading required package: xts

## Loading required package: zoo

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

## 
## Attaching package: 'xts'

## The following objects are masked from 'package:dplyr':
## 
##     first, last

## 
## Attaching package: 'PerformanceAnalytics'

## The following object is masked from 'package:graphics':
## 
##     legend

my_data <- mtcars[, c(1,3,4,5,6,7)]
chart.Correlation(my_data, histogram = TRUE, pch = 19)

In the above plot:

The distribution of each variable is shown on the diagonal.
Below the diagonal: the bivariate scatter plots with a fitted line are displayed
Above the diagonal: the value of the correlation plus the significance level as stars
Each significance level is associated with a symbol: p-values(0, 0.001, 0.01, 0.05, 0.1, 1) <=> symbols(“***”, “**”, “*”, “.”, “ ”)

Use `heatmap()`

I don’t really consider this method of correlation matrix visualization to be of practical value, but nevertheless here is a small example:

# Get some colors
col <- colorRampPalette(c("darkblue", "white", "darkorange"))(20)
M <- cor(mtcars[1:7])
heatmap(x = M, col = col, symm = TRUE)

Correlation analysis (repost)

szypanther — Thu, 04 Apr 2019 02:33:19 +0000

Definition

Correlation coefficient is a quantity that measures the strength of the association (or dependence) between two or more variables.

Types of correlation coefficient

Pearson r: a parametric correlation test, as it depends on the distribution (normal distribution) of the data. It measures the linear dependence between two variables. The plot of y = f(x) is called the linear regression curve. (the most commonly used method)
Kendall tau: rank-based correlation coefficient (non-parametric method). Recommended if the data do not come from a bivariate normal distribution.
Spearman rho: rank-based correlation coefficient (non-parametric method). Recommended if the data do not come from a bivariate normal distribution.

Correlation formula

In the formulas below, $x$ and $y$ are two vectors of length $n$, and $m_x$ and $m_y$ are the means of $x$ and $y$, respectively.

Pearson correlation formula

$$r = \frac{\sum (x - m_x)(y - m_y)}{\sqrt{\sum (x - m_x)^2 \sum (y - m_y)^2}}$$

The p-value (significance level) of the correlation can be determined :

1) by using the correlation coefficient table for the degrees of freedom $df = n - 2$, where $n$ is the number of observations in the x and y variables;
2) or by calculating the t value as follows:

$$t = \frac{r}{\sqrt{1 - r^2}} \sqrt{n - 2}$$

In case 2), the corresponding p-value is determined using the t distribution table for $df = n - 2$.

If the p-value is < 5%, then the correlation between x and y is significant.

Spearman correlation formula

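The Spearman correlation method computes the correlation between the rank of $x$ and the rank of $y$ variables.

$$\rho = \frac{\sum (x' - m_{x'})(y' - m_{y'})}{\sqrt{\sum (x' - m_{x'})^2 \sum (y' - m_{y'})^2}}$$

where $x' = \mathrm{rank}(x)$ and $y' = \mathrm{rank}(y)$.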
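In other words, Spearman's coefficient is simply Pearson's coefficient applied to the ranks. A quick check on the mtcars variables used in the examples (the object names are mine):

```r
x <- mtcars$wt
y <- mtcars$mpg

rho_ranks   <- cor(rank(x), rank(y))           # Pearson on the (average) ranks
rho_builtin <- cor(x, y, method = "spearman")  # built-in Spearman

c(by_ranks = rho_ranks, builtin = rho_builtin)
```

Both values should be the same, about -0.886.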

Kendall correlation formula

The Kendall correlation method measures the correspondence between the ranking of x and y variables. The total number of possible pairings of x with y observations is $n(n-1)/2$, where n is the size of x and y.

The procedure is as follows:

Begin by ordering the pairs by the x values. If x and y are correlated, then they would have the same relative rank orders.

Now, for each $y_i$, count the number of $y_j > y_i$ (concordant pairs (c)) and the number of $y_j < y_i$ (discordant pairs (d)).

The Kendall correlation coefficient is defined as follows:

$$\tau = \frac{n_c - n_d}{\frac{1}{2} n (n - 1)}$$

where $n_c$ is the total number of concordant pairs, $n_d$ is the total number of discordant pairs, and $n$ is the size of x and y.
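Counting the pairs by brute force is a good way to convince yourself of what the coefficient measures. The helper `kendall_tau` below is a made-up illustration, not a library function. Its denominator only counts pairs that are not tied (the tau-b variant, which to my knowledge is what R's cor() computes); without ties it equals $n(n-1)/2$.

```r
# Brute-force Kendall tau: compare every pair of observations.
# `kendall_tau` is a hypothetical helper for illustration only.
kendall_tau <- function(x, y) {
  sx <- sign(outer(x, x, "-"))
  sy <- sign(outer(y, y, "-"))
  up <- upper.tri(sx)                 # each unordered pair once
  n_c <- sum(sx[up] * sy[up] > 0)     # concordant: same direction in x and y
  n_d <- sum(sx[up] * sy[up] < 0)     # discordant: opposite directions
  # pairs not tied in x times pairs not tied in y (tau-b denominator)
  denom <- sqrt(as.numeric(sum(sx[up] != 0)) * sum(sy[up] != 0))
  c(n_c = n_c, n_d = n_d, tau = (n_c - n_d) / denom)
}

# Small example without ties: 10 pairs, 8 concordant, 2 discordant
kendall_tau(1:5, c(2, 1, 4, 3, 5))  # tau = (8 - 2) / 10 = 0.6

# mtcars contains ties; compare with the built-in result
kendall_tau(mtcars$mpg, mtcars$wt)["tau"]
cor(mtcars$mpg, mtcars$wt, method = "kendall")
```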

Calculate correlation coefficient

Correlation coefficient can be computed using the functions cor() or cor.test():

cor(x, y, method = c("pearson", "kendall", "spearman"), use = "complete.obs")
cor.test(x, y, method = c("pearson", "kendall", "spearman"))

x and y are two numeric vectors with the same length
cor() computes the correlation coefficient
cor.test() tests for association/correlation between paired samples. It returns both the correlation coefficient and the significance level (or p-value) of the correlation.
use = "complete.obs" makes cor() handle missing values by case-wise deletion; cor.test() has no use argument and drops incomplete pairs itself

Preliminary tests to check the test assumptions

data are normally distributed

Is the covariation linear? Yes, from the scatter plot produced below, the relationship is linear. When a scatter plot shows a curved pattern, the association between the two variables is nonlinear.
Does each of the two variables (x, y) follow a normal distribution?
- Use the Shapiro-Wilk normality test (R function: shapiro.test()) and look at the normality plot (R function: ggpubr::ggqqplot()).
- The hypotheses of the Shapiro-Wilk test are:
- Null hypothesis: the data are normally distributed
- Alternative hypothesis: the data are not normally distributed
- A p-value above 0.05 means the test does not reject normality.

library(ggpubr)

## Loading required package: ggplot2

## Loading required package: magrittr

ggscatter(mtcars, x = "mpg", y = "wt", 
          add = "reg.line", 
          conf.int = TRUE, 
          cor.coef = TRUE, 
          cor.method = "pearson",
          xlab = "Miles/(US) gallon", 
          ylab = "Weight (1000 lbs)")

#Shapiro-Wilk normality test for mpg and wt
shapiro.test(mtcars$mpg) 

## 
## 	Shapiro-Wilk normality test
## 
## data:  mtcars$mpg
## W = 0.94756, p-value = 0.1229

shapiro.test(mtcars$wt) 

## 
## 	Shapiro-Wilk normality test
## 
## data:  mtcars$wt
## W = 0.94326, p-value = 0.09265

#Visual inspection of the data normality using Q-Q plots (quantile-quantile plots)
#Q-Q plot draws the correlation between a given sample and the normal distribution.
ggqqplot(mtcars$mpg, ylab = "MPG")

ggqqplot(mtcars$wt, ylab = "WT")

#Pearson correlation test
cor.test(mtcars$wt, mtcars$mpg, method = "pearson")

## 
## 	Pearson's product-moment correlation
## 
## data:  mtcars$wt and mtcars$mpg
## t = -9.559, df = 30, p-value = 1.294e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.9338264 -0.7440872
## sample estimates:
##        cor 
## -0.8676594

data are not normally distributed

If the data are not normally distributed, a non-parametric correlation is recommended, such as the Spearman or Kendall rank-based correlation tests.

#Spearman rank correlation coefficient
cor.test(mtcars$wt, mtcars$mpg,  method = "spearman")

## Warning in cor.test.default(mtcars$wt, mtcars$mpg, method = "spearman"):
## Cannot compute exact p-value with ties

## 
## 	Spearman's rank correlation rho
## 
## data:  mtcars$wt and mtcars$mpg
## S = 10292, p-value = 1.488e-11
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## -0.886422

#Kendall rank correlation test
res <- cor.test(mtcars$wt, mtcars$mpg,  method="kendall")

## Warning in cor.test.default(mtcars$wt, mtcars$mpg, method = "kendall"):
## Cannot compute exact p-value with ties

#Extract the p.value and the correlation coefficient
res$p.value

## [1] 6.70577e-09

res$estimate

##        tau 
## -0.7278321

Interpret the correlation coefficient

The correlation coefficient can be negative or positive and lies in the range [-1, 1]:

-1: perfect negative correlation (y decreases as x increases)
0: no linear (Pearson) or monotonic (rank-based) relationship between the two variables (x and y)
1: perfect positive correlation (y increases as x increases)

Generate correlation matrix

The rcorr() function from the Hmisc package returns the correlation coefficients together with their p-values, and the ggcorrplot() function from the ggcorrplot package visualizes a correlation matrix.

Method one: calculate the matrix manually
Method two: the ggcorr() function in the GGally package. However, it cannot reorder the matrix or display significance levels.
Method three: the corrplot() function from the corrplot package
Method four: the ggcorrplot() function from the ggcorrplot package. It supports reordering, displays significance levels, and can compute a matrix of correlation p-values.

#install.packages("corrplot")
library(corrplot)

#install.packages("Hmisc")
library(Hmisc)

#install.packages("ggcorrplot")
library(ggcorrplot)

Compute correlation matrix

Method one: use ggcorrplot()

cor() returns the correlation matrix; cor_pmat() in the ggcorrplot package computes a matrix of correlation p-values.

corr <- round(cor(mtcars), 2) #Compute a correlation matrix
p.mat <- cor_pmat(mtcars) #Compute a matrix of correlation p-values

Method two: use rcorr()

The function rcorr() [in Hmisc package] returns both the correlation coefficients and the correlation p-values for all possible pairs of columns in the data table.

rcorr(x, type = c("pearson","spearman"))

The output is a list containing the following elements: 
r: the correlation matrix 
n: the matrix of the number of observations used in analyzing each pair of variables 
P: the p-values corresponding to the significance levels of correlations.

rc <- rcorr(as.matrix(mtcars)) #stored as rc so the corr matrix computed above is not overwritten
head(round(rc$r, 2)) #Extract the correlation coefficients
head(round(rc$P, 3)) #Extract p-values

Method three: use cor()

cor() can be used to compute a correlation matrix, but not correlation p-values

cor(x, method = c("pearson", "kendall", "spearman"), use = "complete.obs")
x: numeric matrix or a data frame.
method: indicates the correlation coefficient to be computed. 
  - pearson(default): measures the linear dependence between two variables. 
  - kendall: non-parametric rank-based correlation test.
  - spearman: non-parametric rank-based correlation test.
use = "complete.obs": case-wise deletion, which is useful for NA-containing matrix.     
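
For two vectors, case-wise deletion simply means dropping every position where either value is missing before the coefficient is computed. A minimal Python sketch of that idea follows; the function name `complete_pairs` is hypothetical, and `None` stands in for R's NA.

```python
def complete_pairs(x, y):
    """Keep only the positions where both values are present (not None)."""
    kept = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    xs = [a for a, _ in kept]
    ys = [b for _, b in kept]
    return xs, ys

# Hypothetical vectors with missing values in different positions
x = [1, None, 3, 4]
y = [2, 5, None, 8]
xs, ys = complete_pairs(x, y)
print(xs, ys)
```

For a matrix with several columns, the same rule is applied to whole rows: a row is dropped if any of its columns is missing.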

Computing the p-value of correlations

P value calculation principle

# mat : is a matrix of data
# ... : further arguments to pass to the native R cor.test function
cor.mtest <- function(mat, ...) {
    mat <- as.matrix(mat)
    n <- ncol(mat)
    p.mat<- matrix(NA, n, n)
    diag(p.mat) <- 0
    for (i in 1:(n - 1)) {
        for (j in (i + 1):n) {
            tmp <- cor.test(mat[, i], mat[, j], ...)
            p.mat[i, j] <- p.mat[j, i] <- tmp$p.value
        }
    }
  colnames(p.mat) <- rownames(p.mat) <- colnames(mat)
  p.mat
}

#matrix of the p-value of the correlation
p.mat <- cor.mtest(mtcars)

Format the correlation matrix (optional)

### flattenCorrMatrix function
#cormat: matrix of the correlation coefficients
#pmat: matrix of the correlation p-values
flattenCorrMatrix <- function(cormat, pmat) {
  ut <- upper.tri(cormat)
  data.frame(
    row = rownames(cormat)[row(cormat)[ut]],
    column = rownames(cormat)[col(cormat)[ut]],
    cor  =(cormat)[ut],
    p = pmat[ut]
    )
}

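#Example: flatten the rcorr() result for mtcars (requires Hmisc, loaded above)
rc <- rcorr(as.matrix(mtcars))
head(flattenCorrMatrix(rc$r, rc$P))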

Visualize correlation matrix

Five different ways to visualize a correlation matrix:

symnum() function (not covered here)
corrplot() function to plot a correlogram
scatter plots
heatmap
ggcorrplot() function (recommended)

corrplot() function to plot a correlogram

The corrplot() function from the corrplot package draws a correlogram of the correlation matrix.

corrplot(corr, method="circle", type="upper", order="hclust", col=col, bg="lightblue", tl.col="black", tl.srt=45)

corr:	The correlation matrix to visualize. To visualize a general matrix, please use is.corr=FALSE.
method:	seven visualization methods: "circle", "square", "ellipse", "number", "shade", "color", "pie"
type: types of layout. 
  - "full" (default): display full correlation matrix
  - "upper": display upper triangular of the correlation matrix
  - "lower": display lower triangular of the correlation matrix
order: reorder the correlation matrix. The correlation matrix can be reordered according to the correlation coefficient. This is important to identify the hidden structure and pattern in the matrix.
  - "hclust": for hierarchical clustering order
tl.col: text label color
tl.srt: text label string rotation, in degrees

library(RColorBrewer)
col<- colorRampPalette(c("red", "white", "blue"))(20)

corrplot(corr, 
         method="circle", #visualization method
         type="upper", #types of layout
         order="hclust", #order: reorder the correlation matrix
         col=col, #col: Using different color spectrum
         #bg="lightblue", #bg: Change background color to lightblue
         #col=brewer.pal(n=8, name="RdBu") #use RcolorBrewer palette of colors
         #col=brewer.pal(n=8, name="RdYlBu") #use RcolorBrewer palette of colors
         #col=brewer.pal(n=8, name="PuOr") #use RcolorBrewer palette of colors
         tl.col="blue", #change text colors
         tl.srt=45, #change label rotations
         
         #Combine with significance
         p.mat = p.mat, #matrix of correlation p-values
         sig.level = 0.01, #coefficients with a p-value above this level are treated as insignificant
         insig = "blank", #leave insignificant coefficients blank
         
         diag=FALSE #hide the correlation coefficients on the principal diagonal
         )

chart.Correlation(): Draw scatter plots

chart.Correlation() in package PerformanceAnalytics can be used to display a chart of a correlation matrix.

#install.packages("PerformanceAnalytics")
library("PerformanceAnalytics")

my_data <- mtcars[, c(1,3,4,5,6,7)]
chart.Correlation(my_data, histogram=TRUE, pch=19)

#The distribution of each variable is shown on the diagonal.
#Below the diagonal: the bivariate scatter plots with a fitted line
#Above the diagonal: the value of the correlation plus the significance level as stars
#Each significance level is associated with a symbol:
#p-values (0, 0.001, 0.01, 0.05, 0.1, 1) <=> symbols ("***", "**", "*", ".", " ")

Use heatmap()

col <- colorRampPalette(c("blue", "white", "red"))(20)
heatmap(x = corr, col = col, symm = TRUE)

#x: the correlation matrix to be plotted
#col: color palettes
#symm: logical indicating if x should be treated symmetrically; can only be true when x is a square matrix.

Use ggcorrplot()

ggcorrplot(corr, 
           method = "circle",  #"square" (default), "circle" 
           type = "lower", # "full" (default), "lower" or "upper" display.
           hc.order = TRUE, #logical value. If TRUE, correlation matrix will be hc.ordered using hclust function.
           outline.col = "white", #the outline color of square or circle. Default "gray"
           ggtheme = ggplot2::theme_classic, #Change theme
           colors = c("red", "white", "green"), #Change colors
           lab = TRUE, #Add correlation coefficients
           sig.level=0.05, #set significance level
           p.mat = p.mat, #matrix of correlation p-values
           insig = "blank" #leave insignificant coefficients blank
           )

Partial correlation

The R package ppcor provides users with four functions: pcor(), pcor.test(), spcor(), and spcor.test().
pcor() calculates the partial correlations of all pairs of two random variables of a matrix or a data frame and provides the matrices of statistics and p-values of each pairwise partial correlation. pcor.test() computes the pairwise partial correlation coefficient of a pair of two random variables given one or more random variables.
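
To show what "adjusting for" another variable means, here is a small pure-Python sketch of the first-order partial correlation, $r_{xy \cdot z} = (r_{xy} - r_{xz} r_{yz}) / \sqrt{(1 - r_{xz}^2)(1 - r_{yz}^2)}$, for a single control variable z. It uses Pearson correlations and made-up data; the function names are hypothetical, and it is not how ppcor is implemented (the R code below also uses the Spearman method and several control variables).

```python
import math

def pearson_r(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    m_x = sum(x) / n
    m_y = sum(y) / n
    cov = sum((a - m_x) * (b - m_y) for a, b in zip(x, y))
    ss_x = sum((a - m_x) ** 2 for a in x)
    ss_y = sum((b - m_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)

def partial_corr(x, y, z):
    """Partial correlation of x and y given one control variable z."""
    r_xy = pearson_r(x, y)
    r_xz = pearson_r(x, z)
    r_yz = pearson_r(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Hypothetical data: x and y are each z plus independent noise,
# so they are correlated only through z.
z = [-1, -1, 1, 1]
x = [0, -2, 2, 0]
y = [0, -2, 0, 2]
print(pearson_r(x, y), partial_corr(x, y, z))
```

Here the plain correlation between x and y is 0.5, but it vanishes (up to floating-point error) once z is controlled for.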

# install.packages("ppcor")
library(ppcor)

# partial correlation of each pair of variables, adjusting for all other variables
partial.corr <- pcor(x=mtcars, method="spearman")
# for help interpreting the result, see ?pcor

# partial correlation of each pair of variables, adjusting only for specified variables
# (here the last three columns of the data frame are used as the control variables)
partial_correlation <- function(df) {
  n <- ncol(df)
  results <- list() # define an empty list
  results[["estimate"]] <- sapply(1:(n-3), function(x) {
    sapply(1:(n-3), function(y) {
      ifelse(x == y, 1, pcor.test(df[,x], df[,y], df[,c((n-2):n)], method="spearman")$estimate)
    })
  })
  results[["p.value"]] <- sapply(1:(n-3), function(x) {
    sapply(1:(n-3), function(y) {
      ifelse(x == y, 0, pcor.test(df[,x], df[,y], df[,c((n-2):n)], method="spearman")$p.value)
    })
  })
  colnames(results[["estimate"]]) <- rownames(results[["estimate"]]) <- colnames(df[, c(1:(n-3))])
  colnames(results[["p.value"]]) <- rownames(results[["p.value"]]) <- colnames(df[, c(1:(n-3))])
  return(results)
}

pcor_res <- partial_correlation(mtcars) #not named pcor, to avoid reusing the name of ppcor's pcor() function

library(ggcorrplot)
ggcorrplot(pcor_res$estimate,
           method = "circle",  #"square" (default), "circle"
           type = "lower", # "full" (default), "lower" or "upper" display.
           hc.order = TRUE, #logical value. If TRUE, correlation matrix will be hc.ordered using hclust function.
           outline.col = "white", #the outline color of square or circle. Default "gray"
           ggtheme = ggplot2::theme_classic, #Change theme
           colors = c("red", "white", "green"), #Change colors
           lab = TRUE, #Add correlation coefficients
           sig.level=0.05, #set significance level
           p.mat = pcor_res$p.value, #matrix of partial correlation p-values
           insig = "blank", #leave insignificant coefficients blank
           title = "Partial correlations for mtcars"
           )

How to Compare Regression Slopes

szypanther — Thu, 15 Dec 2016 05:52:02 +0000

Jim Frost 13 January, 2016

If you perform linear regression analysis, you might need to compare different regression lines to see if their constants and slope coefficients are different. Imagine there is an established relationship between X and Y. Now, suppose you want to determine whether that relationship has changed. Perhaps there is a new context, process, or some other qualitative change, and you want to determine whether that affects the relationship between X and Y.

For example, you might want to assess whether the relationship between the height and weight of football players is significantly different than the same relationship in the general population.

You can graph the regression lines to visually compare the slope coefficients and constants. However, you should also statistically test the differences. Hypothesis testing helps separate the true differences from the random differences caused by sampling error so you can have more confidence in your findings.

In this blog post, I’ll show you how to compare a relationship between different regression models and determine whether the differences are statistically significant. Fortunately, these tests are easy to do using Minitab statistical software.

In the example I’ll use throughout this post, there is an input variable and an output variable for a hypothetical process. We want to compare the relationship between these two variables under two different conditions. Here is the Minitab project file with the data.

Comparing Constants in Regression Analysis

When the constants (or y intercepts) in two different regression equations are different, this indicates that the two regression lines are shifted up or down on the Y axis. In the scatterplot below, you can see that the Output from Condition B is consistently higher than Condition A for any given Input value. We want to determine whether this vertical shift is statistically significant.

To test the difference between the constants, we just need to include a categorical variable that identifies the qualitative attribute of interest in the model. For our example, I have created a variable for the condition (A or B) associated with each observation.

To fit the model in Minitab, I’ll use: Stat > Regression > Regression > Fit Regression Model. I’ll include Output as the response variable, Input as the continuous predictor, and Condition as the categorical predictor.

In the regression analysis output, we’ll first check the coefficients table.

This table shows us that the relationship between Input and Output is statistically significant because the p-value for Input is 0.000.

The coefficient for Condition is 10 and its p-value is significant (0.000). The coefficient tells us that the vertical distance between the two regression lines in the scatterplot is 10 units of Output. The p-value tells us that this difference is statistically significant—you can reject the null hypothesis that the distance between the two constants is zero. You can also see the difference between the two constants in the regression equation table below.

Comparing Coefficients in Regression Analysis

When two slope coefficients are different, a one-unit change in a predictor is associated with different mean changes in the response. In the scatterplot below, it appears that a one-unit increase in Input is associated with a greater increase in Output in Condition B than in Condition A. We can see that the slopes look different, but we want to be sure this difference is statistically significant.

How do you statistically test the difference between regression coefficients? It sounds like it might be complicated, but it is actually very simple. We can even use the same Condition variable that we did for testing the constants.

We need to determine whether the coefficient for Input depends on the Condition. In statistics, when we say that the effect of one variable depends on another variable, that’s an interaction effect. All we need to do is include the interaction term for Input*Condition!
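
A small Python sketch can show what the Condition and Input*Condition coefficients represent. It assumes 0/1 dummy coding with Condition A as the baseline and uses made-up data (not the Minitab data from this post); `fit_line` is a hypothetical helper. Under that coding, fitting the full interaction model gives the same estimates as fitting each condition separately: the Condition coefficient is the difference between the two constants, and the interaction coefficient is the difference between the two slopes.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    m_x = sum(x) / n
    m_y = sum(y) / n
    sxy = sum((a - m_x) * (b - m_y) for a, b in zip(x, y))
    sxx = sum((a - m_x) ** 2 for a in x)
    slope = sxy / sxx
    return m_y - slope * m_x, slope

# Hypothetical data for the two conditions
input_a = [1, 2, 3, 4]
output_a = [3.0, 4.5, 6.0, 7.5]    # lies exactly on 1.5 + 1.5 * Input
input_b = [1, 2, 3, 4]
output_b = [4.0, 6.0, 8.0, 10.0]   # lies exactly on 2.0 + 2.0 * Input

intercept_a, slope_a = fit_line(input_a, output_a)
intercept_b, slope_b = fit_line(input_b, output_b)

# Coefficients of Output ~ Input + Condition + Input*Condition (A = 0, B = 1)
condition_coef = intercept_b - intercept_a    # difference between the constants
interaction_coef = slope_b - slope_a          # difference between the slopes
print(condition_coef, interaction_coef)
```

This only reproduces the coefficient estimates. The p-values that decide whether the differences are significant come from the standard errors of the combined model, which a statistics package such as Minitab reports.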

In Minitab, you can specify interaction terms by clicking the Model button in the main regression dialog box. After I fit the regression model with the interaction term, we obtain the following coefficients table:

The table shows us that the interaction term (Input*Condition) is statistically significant (p = 0.000). Consequently, we reject the null hypothesis and conclude that the difference between the two coefficients for Input (below, 1.5359 and 2.0050) does not equal zero. We also see that the main effect of Condition is not significant (p = 0.093), which indicates that the difference between the two constants is not statistically significant.

It is easy to compare and test the differences between the constants and coefficients in regression models by including a categorical variable. These tests are useful when you can see differences between regression models and you want to defend your conclusions with p-values.

If you’re learning about regression, read my regression tutorial!
