# Correlation analysis (zhuantie)

## Defination

Correlation coefficient is a quantity that measures the strength of the association (or dependence) between two or more variables.

## Types of correlation coefficient

1. Pearson r: is a `parametric correlation test` as it depends on the distribution (normal distribution) of the data. It measures the linear dependence between two variables. The plot of y = f(x) is named the linear regression curve. (the mostly used method)
2. Kendall tau: rank-based correlation coefficient (non-parametric methods). Recommended if the data do not come from a bivariate normal distribution.
3. Spearman rho: rank-base correlation coefficient (non-parametric methods). Recommended if the data do not come from a bivariate normal distribution.

## Correlation formula

In the formula below,
\$x\$ and \$y\$ are two vectors of length \$n\$ \$m_y\$ and \$m_y\$ corresponds to the means of \$x\$ and \$y\$, respectively.

### Pearson correlation formula

The p-value (significance level) of the correlation can be determined :

• by using the correlation coefficient table for the degrees of freedom : \$df=n-2\$, where \$n\$ is the number of observation in x and y variables.
• or by calculating the t value as follow: In the case 2) the corresponding p-value is determined using t distribution table for \$df=n-2\$

If the p-value is < 5%, then the correlation between x and y is significant.

### Spearman correlation formula

The Spearman correlation method computes the correlation between the rank of \$x\$ and the rank of \$y\$ variables.

Where \$x’ = rank(x_)\$ and \$y’ = rank(y_)\$.

### Kendall correlation formula

The Kendall correlation method measures the correspondence between the ranking of x and y variables. The total number of possible pairings of x with y observations is \$n(n???1)/2\$, where n is the size of x and y.

The procedure is as follow:

Begin by ordering the pairs by the x values. If x and y are correlated, then they would have the same relative rank orders.

Now, for each \$y_i\$, count the number of \$y_j > y_i\$ (concordant pairs (c)) and the number of \$y_j < y_i\$ (discordant pairs (d)).

Kendall correlation distance is defined as follow:

Where, \$n_c\$: total number of concordant pairs \$n_d\$: total number of discordant pairs \$n\$: size of x and y

## Calculate correlation coefficient

Correlation coefficient can be computed using the functions `cor()` or `cor.test()`:

``````cor(x, y, method = c("pearson", "kendall", "spearman"), use = "complete.obs")
cor.test(x, y, method=c("pearson", "kendall", "spearman"), use = "complete.obs")

x and y are two numeric vectors with the same length
cor() computes the correlation coefficient
cor.test() test for association/correlation between paired samples. It returns both the correlation coefficient and the significance level(or p-value) of the correlation.
use = "complete.obs" handle missing values by case-wise deletion
``````

## Preleminary test to check the test assumptions

### data are normally distributed

1. Is the covariation linear? Yes, form the plot, the relationship is linear. In the situation where the scatter plots show curved patterns, we are dealing with nonlinear association between the two variables.
2. Are the data from each of the 2 variables (x, y) follow a normal distribution?
• Use Shapiro-Wilk normality test -> R function: `shapiro.test()` and look at the normality plot -> R function: `ggpubr::ggqqplot()`
• Shapiro-Wilk test can be performed as follow:
• Null hypothesis: the data are normally distributed
• Alternative hypothesis: the data are not normally distributed
``````library(ggpubr)
``````
``````## Loading required package: ggplot2
``````
``````## Loading required package: magrittr
``````
``````ggscatter(mtcars, x = "mpg", y = "wt",
conf.int = TRUE,
cor.coef = TRUE,
cor.method = "pearson",
xlab = "Miles/(US) gallon",
ylab = "Weight (1000 lbs)")
`````` ``````#Shapiro-Wilk normality test for mpg and wt
shapiro.test(mtcars\$mpg)
``````
``````##
## 	Shapiro-Wilk normality test
##
## data:  mtcars\$mpg
## W = 0.94756, p-value = 0.1229
``````
``````shapiro.test(mtcars\$wt)
``````
``````##
## 	Shapiro-Wilk normality test
##
## data:  mtcars\$wt
## W = 0.94326, p-value = 0.09265
``````
``````#Visual inspection of the data normality using Q-Q plots (quantile-quantile plots)
#Q-Q plot draws the correlation between a given sample and the normal distribution.
ggqqplot(mtcars\$mpg, ylab = "MPG")
`````` ``````ggqqplot(mtcars\$wt, ylab = "WT")
`````` ``````#Pearson correlation test
cor.test(mtcars\$wt, mtcars\$mpg, method = "pearson")
``````
``````##
## 	Pearson's product-moment correlation
##
## data:  mtcars\$wt and mtcars\$mpg
## t = -9.559, df = 30, p-value = 1.294e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.9338264 -0.7440872
## sample estimates:
##        cor
## -0.8676594
``````

### data are not normally distributed

If the data are not normally distributed, it’s recommended to use the non-parametric correlation, including Spearman and Kendall rank-based correlation tests.

``````#Spearman rank correlation coefficient
cor.test(mtcars\$wt, mtcars\$mpg,  method = "spearman")
``````
``````## Warning in cor.test.default(mtcars\$wt, mtcars\$mpg, method = "spearman"):
## Cannot compute exact p-value with ties
``````
``````##
## 	Spearman's rank correlation rho
##
## data:  mtcars\$wt and mtcars\$mpg
## S = 10292, p-value = 1.488e-11
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho
## -0.886422
``````
``````#Kendall rank correlation test
res <- cor.test(mtcars\$wt, mtcars\$mpg,  method="kendall")
``````
``````## Warning in cor.test.default(mtcars\$wt, mtcars\$mpg, method = "kendall"):
## Cannot compute exact p-value with ties
``````
``````#Extract the p.value and the correlation coefficient
res\$p.value
``````
``````##  6.70577e-09
``````
``````res\$estimate
``````
``````##        tau
## -0.7278321
``````

## Interprete correlation coefficient

The value of correlation coefficient can be negative or positive, range [-1, 1]:

• -1: strong negative correlation
• 0: no relationship between the two variables (x and y)
• 1: strong positive correlation

## Generate correlation matrix

Function `rcorr` from Hmisc package used to return correlation coefficients and the correlation p-values, function `ggcorrplot` from ggcorrplot used to visualize correlation matrix.

Method one: calculate matrix manually
Method two: function `ggcorr()` in ggally package. However, can’t reorder matrix and display significance level.
Method three: `corrplot()` function from `corrplot` package can be used
Method four: `ggcorrplot()` function from ggcorrplot package. Functions: reordering, displays significance level, computing a matrix of correlation p-values.

``````#install.packages("corrplot")
library(corrplot)

#install.packages("Hmisc")
library(Hmisc)
``````
``````#install.packages("ggcorrplot")
library(ggcorrplot)
``````

## Compute correlation matrix

### Method one: use ggcorrplot()

`cor()` return correlation matrix; `cor_pmat()` in ggcorrplot package computes a matrix of correlation p-values.

``````corr <- round(cor(mtcars), 2) #Compute a correlation matrix
p.mat <- cor_pmat(mtcars) #Compute a matrix of correlation p-values
``````

### Method two: use rcorr()

The function `rcorr()` [in Hmisc package] returns both the correlation coefficients and the correlation p-values for all possible pairs of columns in the data table.

``````rcorr(x, type = c("pearson","spearman"))

The output is a list containing the following elements:
r: the correlation matrix
n: the matrix of the number of observations used in analyzing each pair of variables
P: the p-values corresponding to the significance levels of correlations.
``````
``````#corr <- rcorr(as.matrix(mtcars))
#head(round(corr\$r, 2)) #Extract the correlation coefficients
``````

### Method three: use cor()

`cor()` can be used to compute a correlation matrix, but not correlation p-values

``````cor(x, method = c("pearson", "kendall", "spearman"), use = "complete.obs")
x: numeric matrix or a data frame.
method: indicates the correlation coefficient to be computed.
- pearson(default): measures the linear dependence between two variables.
- kendall: non-parametric rank-based correlation test.
- spearman: non-parametric rank-based correlation test.
use = "complete.obs": case-wise deletion, which is useful for NA-containing matrix.
``````

## Computing the p-value of correlations

P value calculation principle

``````# mat : is a matrix of data
# ... : further arguments to pass to the native R cor.test function
cor.mtest <- function(mat, ...) {
mat <- as.matrix(mat)
n <- ncol(mat)
p.mat<- matrix(NA, n, n)
diag(p.mat) <- 0
for (i in 1:(n - 1)) {
for (j in (i + 1):n) {
tmp <- cor.test(mat[, i], mat[, j], ...)
p.mat[i, j] <- p.mat[j, i] <- tmp\$p.value
}
}
colnames(p.mat) <- rownames(p.mat) <- colnames(mat)
p.mat
}

#matrix of the p-value of the correlation
p.mat <- cor.mtest(mtcars)
``````

### Format the correlation matrix (unnecessary)

``````### flattenCorrMatrix function
#cormat: matrix of the correlation coefficients
#pmat: matrix of the correlation p-values
flattenCorrMatrix <- function(cormat, pmat) {
ut <- upper.tri(cormat)
data.frame(
row = rownames(cormat)[row(cormat)[ut]],
column = rownames(cormat)[col(cormat)[ut]],
cor  =(cormat)[ut],
p = pmat[ut]
)
}

``````

## Visualize correlation matrix

Five different ways to visualize a correlation matrix:

1. symnum() function (gave up, too lazy to introduce)
2. corrplot() function to plot a correlogram
3. scatter plots
4. heatmap
5. ggcorrplot function (recommend)

### corrplot() function to plot a correlogram

R `corrplot` function is used to plot the graph of the correlation matrix.

``````corrplot(corr, method="circle", type="upper", order="hclust", col=col, bg="lightblue", tl.col="black", tl.srt=45, )

corr:	The correlation matrix to visualize. To visualize a general matrix, please use is.corr=FALSE.
method:	seven visualization method: "circle", "square"", "ellipse", number", "shade"", "color", "pie"
type: types of layout.
- "full" (default): display full correlation matrix
- "upper": display upper triangular of the correlation matrix
- "lower": display lower triangular of the correlation matrix
order: reorder the correlation matrix. The correlation matrix can be reordered according to the correlation coefficient. This is important to identify the hidden structure and pattern in the matrix.
- "hclust": for hierarchical clustering order
tl.col: for text label color, used to change text colors
tl.srt: for text label string rotation. used to change label rotations.
``````
``````library(RColorBrewer)
col<- colorRampPalette(c("red", "white", "blue"))(20)

corrplot(corr,
method="circle", #visualization method
type="upper", #types of layout
order="hclust", #order: reorder the correlation matrix
col=col, #col: Using different color spectrum
#bg="lightblue", #bg: Change background color to lightblue
#col=brewer.pal(n=8, name="RdBu") #use RcolorBrewer palette of colors
#col=brewer.pal(n=8, name="RdYlBu") #use RcolorBrewer palette of colors
#col=brewer.pal(n=8, name="PuOr") #use RcolorBrewer palette of colors
tl.col="blue", #change text colors
tl.srt=45, #change label rotations

#Combine with significance
p.mat = p.mat, #add the matrix of p-value
sig.level = 0.01, ## Specialized the insignificant value according to the significant level
insig = "blank", #Leave blank on no significant coefficient

diag=FALSE #hide correlation coefficient on the principal diagonal
)
`````` ### chart.Correlation(): Draw scatter plots

`chart.Correlation()` in package PerformanceAnalytics can be used to display a chart of a correlation matrix.

``````#install.packages("PerformanceAnalytics")
library("PerformanceAnalytics")
``````
``````my_data <- mtcars[, c(1,3,4,5,6,7)]
chart.Correlation(my_data, histogram=TRUE, pch=19)
`````` ``````#The distribution of each variable is shown on the diagonal.
#left bottom of the diagonal: the bivariate scatter plots with a fitted line are displayed
#right top of the diagonal: the value of the correlation plus the significance level as stars
#Each significance level is associated to a symbol:
#p-values(0, 0.001, 0.01, 0.05, 0.1, 1) <=> symbols(***, **, *, ., ,"")
``````

### Use heatmap()

``````col <- colorRampPalette(c("blue", "white", "red"))(20)
heatmap(x = corr, col = col, symm = TRUE)
`````` ``````#x: the correlation matrix to be plotted
#col: color palettes
#symm: logical indicating if x should be treated symmetrically; can only be true when x is a square matrix.
``````

### ggcorrplot function (recommend)

``````ggcorrplot(corr,
method = "circle",  #"square" (default), "circle"
type = "lower", # "full" (default), "lower" or "upper" display.
hc.order = TRUE, #logical value. If TRUE, correlation matrix will be hc.ordered using hclust function.
outline.col = "white", #the outline color of square or circle. Default "gray"
ggtheme = ggplot2::theme_classic, #Change theme
colors = c("red", "white", "green"), #Change colors
lab = TRUE, #Add correlation coefficients
sig.level=0.05, #set significant level
p.mat = p.mat, #Add correlation significance level
insig = "blank" #Leave blank on no significant coefficient
)
`````` ## Partial correlation

The R package ppcor provides users with four functions: pcor(), pcor.test(), spcor(), and spcor.test().
`pcor()` calculates the partial correlations of all pairs of two random variables of a matrix or a data frame and provides the matrices of statistics and p-values of each pairwise partial correlation. `pcor.test()` computes the pairwise partial correlation coefficient of a pair of two random variables given one or more random variables.

``````# install.packages("ppcor")
library(ppcor)
``````
``````# calculate the correlations between each pair with all other variables are adjusted
partial.corr <- pcor(x=mtcars, method="spearman")
#Results interpretation: ?pcor()

# calculate the correlations between each pair with specified variables are adjusted
partial_correlation <- function(df) {
n <- ncol(df)
results <- list() # define an empty list
results[["estimate"]] <- sapply(1:(n-3), function(x) {
sapply(1:(n-3), function(y) {
ifelse(x == y, 1, pcor.test(df[,x], df[,y], df[,c((n-2):n)], method="spearman")\$estimate)
})
})
results[["p.value"]] <- sapply(1:(n-3), function(x) {
sapply(1:(n-3), function(y) {
ifelse(x == y, 0, pcor.test(df[,x], df[,y], df[,c((n-2):n)], method="spearman")\$p.value)
})
})
colnames(results[["estimate"]]) <- rownames(results[["estimate"]]) <- colnames(df[, c(1:(n-3))])
colnames(results[["p.value"]]) <- rownames(results[["p.value"]]) <- colnames(df[, c(1:(n-3))])
return(results)
}

pcor <- partial_correlation(mtcars)

library(ggcorrplot)
ggcorrplot(pcor\$estimate,
method = "circle",  #"square" (default), "circle"
type = "lower", # "full" (default), "lower" or "upper" display.
hc.order = TRUE, #logical value. If TRUE, correlation matrix will be hc.ordered using hclust function.
outline.col = "white", #the outline color of square or circle. Default "gray"
ggtheme = ggplot2::theme_classic, #Change theme
colors = c("red", "white", "green"), #Change colors
lab = TRUE, #Add correlation coefficients
sig.level=0.05, #set significant level
p.mat = pcor\$p.value, #Add correlation significance level
insig = "blank", #Leave blank on no significant coefficient
title = "Partial correlations for mtcars"
)
`````` 