小生这厢有礼了 (BioFaceBook Personal Blog) » R language

heatmap R

szypanther — Tue, 18 Feb 2020 04:55:37 +0000

> library(gplots)

Attaching package: ‘gplots’

The following object is masked from ‘package:stats’:

    lowess

>
> setwd("/home/zyshen/work/QM_nanjing")
> data2<-read.csv("combined_example.level_5.csv", header=T, sep=",")
> data2plot<-data.matrix(data2[2:3])
> row.names(data2plot)<-data2[,1]
> heatmap.2(data2plot,trace="none",cexCol = 2,col=greenred(50), margins = c(5, 40), sepwidth=c(0.05,0.05))

if (!require("gplots")) {
  install.packages("gplots", dependencies = TRUE)
  library(gplots)
}
if (!require("RColorBrewer")) {
  install.packages("RColorBrewer", dependencies = TRUE)
  library(RColorBrewer)
}
#########################################################
### B) Reading in data and transform it into matrix format
#########################################################
setwd("/home/zyshen/Downloads/酵母代谢")
data <- read.csv("W29-knockout-0.1-SD-10X-internal.val.filter.csv", comment.char="#", header=T)
rnames <- data[,1] # assign labels in column 1 to "rnames"
mat_data <- data.matrix(data[,2:ncol(data)]) # transform columns 2 to the last one into a matrix
rownames(mat_data) <- rnames # assign row names
my_palette <- colorRampPalette(c("red", "yellow", "green"))(n = 299)

# (optional) defines the color breaks manually for a “skewed” color transition
col_breaks = c(seq(-50,0,length=100), # for red
seq(0.1,20,length=100), # for yellow
seq(21,100,length=100)) # for green

heatmap.2(mat_data,
  cellnote = mat_data, # same data set for cell labels
  main = "Correlation", # heat map title
  notecol="black", # change font color of cell labels to black
  density.info="none", # turns off density plot inside color legend
  trace="none", # turns off trace lines inside the heat map
  margins =c(8,20), # widens margins around plot
  col=my_palette, # use the color palette defined earlier
  breaks=col_breaks, # enable color transition at specified limits
  dendrogram="row", # only draw a row dendrogram
  Colv=FALSE) # keep the original column order (no column clustering)
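A manual colour scale like the one above only works when there is exactly one more break than there are colours, and when the breaks are strictly increasing. Here is a quick base-R sanity check, written as a sketch that reuses the same palette and break values; the name breaks_ok is just for illustration:

```r
# Sanity check for a manual colour scale (base R only).
# Assumption: same palette and breaks as in the heatmap.2() call above.
my_palette <- colorRampPalette(c("red", "yellow", "green"))(n = 299)
col_breaks <- c(seq(-50, 0, length.out = 100),   # red range
                seq(0.1, 20, length.out = 100),  # yellow range
                seq(21, 100, length.out = 100))  # green range

# One more break than colours, and breaks strictly increasing
breaks_ok <- length(col_breaks) == length(my_palette) + 1 &&
  all(diff(col_breaks) > 0)
breaks_ok
```

If the check fails, adjust either n in the palette or the length of the seq() calls before plotting.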

Correlation tests, correlation matrix, and corresponding visualization methods in R (forward)

szypanther — Wed, 26 Jun 2019 04:46:28 +0000

https://rstudio-pubs-static.s3.amazonaws.com/240657_5157ff98e8204c358b2118fa69162e18.html

The following content is mostly compiled (with some original additions on my side) from the material that can be found at http://www.sthda.com/, as well as in the vignette for the corrplot R package – An Introduction to corrplot Package. The sole purpose of this text is to put all the info into one document in an easy-to-search format. Since I’m a huge fan of Hadley Wickham’s work I’ll insist on solutions based on the “tidyverse” whenever possible…

Install and load required R packages

We’ll use the ggpubr R package for easy ggplot2-based data visualization, Hmisc to calculate correlation matrices containing both correlation coefficients and p-values, corrplot to plot correlograms, and of course tidyverse for all the data wrangling, plotting and the like:

require(ggpubr)
require(tidyverse)
require(Hmisc)
require(corrplot)

Methods for correlation analyses

There are different methods to perform correlation analysis:

Pearson correlation (r), which measures the linear dependence between two variables (x and y). It’s also known as a parametric correlation test because it depends on the distribution of the data: the test assumes that x and y come from a normal distribution. The plot of y = f(x) is called the linear regression curve.
Kendall $\tau$ and Spearman $\rho$, which are rank-based (non-parametric) correlation coefficients.
The most commonly used method is the Pearson correlation method.
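To see the difference between the linear and the rank-based coefficients, here is a small sketch on synthetic data (an assumption for illustration, not part of the original material). The relationship is perfectly monotonic but not linear, so the rank-based coefficients reach 1 while Pearson's r stays below 1:

```r
# Synthetic data: y grows monotonically, but not linearly, with x
x <- 1:20
y <- exp(x / 4)

pearson  <- cor(x, y, method = "pearson")   # measures linear dependence
spearman <- cor(x, y, method = "spearman")  # rank-based
kendall  <- cor(x, y, method = "kendall")   # rank-based

c(pearson = pearson, spearman = spearman, kendall = kendall)
```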

Compute correlation in R

R functions

Correlation coefficients can be computed in R by using the functions cor() and cor.test():

cor() computes the correlation coefficient

cor.test() tests for association/correlation between paired samples. It returns both the correlation coefficient and the significance level (or p-value) of the correlation.

The simplified formats are:

cor(x, y, method = c("pearson", "kendall", "spearman"))
cor.test(x, y, method=c("pearson", "kendall", "spearman"))

where:

x, y: numeric vectors with the same length

method: correlation method

If the data contain missing values, the following R code can be used to handle missing values by case-wise deletion:

cor(x, y,  method = "pearson", use = "complete.obs")
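A minimal sketch of what case-wise deletion does, using two made-up vectors (the values are assumptions for illustration only). By default cor() returns NA as soon as one value is missing; with use = "complete.obs" only the pairs where both values are present are used:

```r
# Made-up vectors with one missing value each
x <- c(1, 2, 3, 4, NA, 6)
y <- c(2, 4, 5, 8, 10, NA)

r_default  <- cor(x, y)                         # NA with the default settings
r_complete <- cor(x, y, use = "complete.obs")   # case-wise deletion

# The same thing done by hand: keep only the complete pairs
keep     <- complete.cases(x, y)
r_manual <- cor(x[keep], y[keep])

c(default = r_default, complete = r_complete, manual = r_manual)
```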

Preliminary considerations

We’ll use the well known built-in mtcars R dataset.

head(mtcars)

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

We’d like to compute the correlation between mpg and wt variables.

First let’s visualise our data by means of a scatter plot. We’ll be using the ggpubr R package.

library(ggpubr)

my_data <- mtcars
my_data$cyl <- factor(my_data$cyl)
str(my_data)

## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

ggscatter(my_data, x = "wt", y = "mpg",
          add = "reg.line", conf.int = TRUE,
          cor.coef = TRUE, cor.method = "pearson",
          xlab = "Weight (1000 lbs)", ylab = "Miles/ (US) gallon")

Preliminary test to check the test assumptions

Is the relation between variables linear? Yes, from the plot above, the relationship can be, closely enough, modeled as linear. In the situation where the scatter plots show curved patterns, we are dealing with nonlinear association between the two variables.
Are the data from each of the 2 variables (x, y) following a normal distribution?
- Use Shapiro-Wilk normality test $\to$ R function: shapiro.test()
- and look at the normality plot $\to$ R function: ggpubr::ggqqplot()

Shapiro-Wilk test can be performed as follows:
- Null hypothesis: the data are normally distributed
- Alternative hypothesis: the data are not normally distributed

# Shapiro-Wilk normality test for mpg
shapiro.test(my_data$mpg) # => p = 0.1229

## 
##  Shapiro-Wilk normality test
## 
## data:  my_data$mpg
## W = 0.94756, p-value = 0.1229

# Shapiro-Wilk normality test for wt
shapiro.test(my_data$wt) # => p = 0.09

## 
##  Shapiro-Wilk normality test
## 
## data:  my_data$wt
## W = 0.94326, p-value = 0.09265

As can be seen from the output, the two p-values are greater than the predetermined significance level of 0.05, implying that the distributions of the data are not significantly different from the normal distribution. In other words, we can assume normality.

One more option for checking the normality of the data distribution is visual inspection of the Q-Q plots (quantile-quantile plots). A Q-Q plot draws the quantiles of a given sample against those of the theoretical normal distribution.

Again, we’ll use the ggpubr R package to obtain “pretty”, i.e. publishing-ready, Q-Q plots.

library("ggpubr")
# Check for the normality of "mpg"
ggqqplot(my_data$mpg, ylab = "MPG")

# Check for the normality of "wt"
ggqqplot(my_data$wt, ylab = "WT")

From the Q-Q normality plots, we can assume that both samples may come from populations that, closely enough, follow normal distributions.

It is important to note that if the data does not follow the normal distribution, at least closely enough, it’s recommended to use the non-parametric correlation, including Spearman and Kendall rank-based correlation tests.

Pearson correlation test

Example:

res <- cor.test(my_data$wt, my_data$mpg, method = "pearson")
res

## 
##  Pearson's product-moment correlation
## 
## data:  my_data$wt and my_data$mpg
## t = -9.559, df = 30, p-value = 1.294e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.9338264 -0.7440872
## sample estimates:
##        cor 
## -0.8676594

So what’s happening here? First of all let’s clarify the meaning of this printout:

t is the t-test statistic value (t = -9.559),
df is the degrees of freedom (df = 30),
p-value is the significance level of the t-test (p-value = $1.294 \times 10^{-10}$),
conf.int is the confidence interval of the correlation coefficient at 95% (conf.int = [-0.9338, -0.7441]),
sample estimates is the correlation coefficient (Cor.coeff = -0.87).

Interpretation of the results: As can be seen from the results above, the p-value of the test is $1.294 \times 10^{-10}$, which is less than the significance level $\alpha = 0.05$. We can conclude that wt and mpg are significantly correlated with a correlation coefficient of -0.87 and a p-value of $1.294 \times 10^{-10}$.
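The t statistic and the p-value in this printout follow directly from r and the sample size n. Here is a minimal sketch, assuming the usual relation t = r * sqrt(n - 2) / sqrt(1 - r^2) with n - 2 degrees of freedom and a two-sided test; the variable names are only illustrative:

```r
# Recompute the Pearson test statistic by hand and compare with cor.test()
r  <- cor(mtcars$wt, mtcars$mpg)
n  <- nrow(mtcars)
df <- n - 2

t_stat <- r * sqrt(df) / sqrt(1 - r^2)   # t statistic
p_val  <- 2 * pt(-abs(t_stat), df)       # two-sided p-value

res_check <- cor.test(mtcars$wt, mtcars$mpg, method = "pearson")
c(t_manual = t_stat, t_cor.test = unname(res_check$statistic))
c(p_manual = p_val, p_cor.test = res_check$p.value)
```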

Access to the values returned by cor.test() function

The function cor.test() returns a list containing the following components:

str(res)

## List of 9
##  $ statistic  : Named num -9.56
##   ..- attr(*, "names")= chr "t"
##  $ parameter  : Named int 30
##   ..- attr(*, "names")= chr "df"
##  $ p.value    : num 1.29e-10
##  $ estimate   : Named num -0.868
##   ..- attr(*, "names")= chr "cor"
##  $ null.value : Named num 0
##   ..- attr(*, "names")= chr "correlation"
##  $ alternative: chr "two.sided"
##  $ method     : chr "Pearson's product-moment correlation"
##  $ data.name  : chr "my_data$wt and my_data$mpg"
##  $ conf.int   : atomic [1:2] -0.934 -0.744
##   ..- attr(*, "conf.level")= num 0.95
##  - attr(*, "class")= chr "htest"

Of these we are most interested in:

p.value: the p-value of the test
estimate: the correlation coefficient

# Extract the p.value
res$p.value

## [1] 1.293959e-10

# Extract the correlation coefficient
res$estimate

##        cor 
## -0.8676594

Kendall rank correlation test

The Kendall rank correlation coefficient or Kendall’s $\tau$ statistic is used to estimate a rank-based measure of association. This test may be used if the data do not necessarily come from a bivariate normal distribution.

res2 <- cor.test(my_data$mpg, my_data$wt, method = "kendall")

## Warning in cor.test.default(my_data$mpg, my_data$wt, method = "kendall"):
## Cannot compute exact p-value with ties

res2

## 
##  Kendall's rank correlation tau
## 
## data:  my_data$mpg and my_data$wt
## z = -5.7981, p-value = 6.706e-09
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
##        tau 
## -0.7278321

Here tau is the Kendall correlation coefficient, so the correlation coefficient between mpg and wt is -0.7278 and the p-value is $6.706 \times 10^{-9}$.

Spearman rank correlation coefficient

Spearman’s $\rho$ statistic is also used to estimate a rank-based measure of association. This test may be used if the data do not come from a bivariate normal distribution.

res3 <- cor.test(my_data$wt, my_data$mpg, method = "spearman")

## Warning in cor.test.default(my_data$wt, my_data$mpg, method = "spearman"):
## Cannot compute exact p-value with ties

res3

## 
##  Spearman's rank correlation rho
## 
## data:  my_data$wt and my_data$mpg
## S = 10292, p-value = 1.488e-11
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## -0.886422

Here, rho is Spearman’s correlation coefficient, so the correlation coefficient between mpg and wt is -0.8864 and the p-value is $1.488 \times 10^{-11}$.
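What "rank-based" means in practice: Spearman's rho is simply the Pearson correlation computed on the ranks of the data (ties get average ranks). A short sketch to confirm this on the same two variables:

```r
# Spearman's rho equals Pearson's r computed on the ranks
rho_builtin <- cor(mtcars$wt, mtcars$mpg, method = "spearman")
rho_ranks   <- cor(rank(mtcars$wt), rank(mtcars$mpg), method = "pearson")

c(builtin = rho_builtin, via_ranks = rho_ranks)
```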

How to interpret correlation coefficient

The value of the correlation coefficient can vary between -1 and 1:

-1 indicates a perfect negative correlation: every time x increases, y decreases
0 means that there is no linear (or, for the rank-based coefficients, monotonic) association between the two variables (x and y)
1 indicates a perfect positive correlation: y increases with x

What is a correlation matrix?

Previously, we described how to perform correlation test between two variables. In the following sections we’ll see how a correlation matrix can be computed and visualized. The correlation matrix is used to investigate the dependence between multiple variables at the same time. The result is a table containing the correlation coefficients between each variable and the others.

Compute correlation matrix in R

We have already mentioned the cor() function in the introductory part of this document, which dealt with the correlation test for the bivariate case. It can also be used to compute a correlation matrix. A simplified format of the function is:

cor(x, method = c("pearson", "kendall", "spearman"))

Here:

x is numeric matrix or a data frame.
method: indicates the correlation coefficient to be computed. The default is the “pearson” correlation coefficient, which measures the linear dependence between two variables. As already explained, the “kendall” and “spearman” correlation methods are non-parametric and rank-based.

If your data contain missing values, the following R code can be used to handle missing values by case-wise deletion:

cor(x, method = "pearson", use = "complete.obs")

Plain correlation matrix

Example:

library(dplyr)

my_data <- select(mtcars, mpg, disp, hp, drat, wt, qsec)
head(my_data)

##                    mpg disp  hp drat    wt  qsec
## Mazda RX4         21.0  160 110 3.90 2.620 16.46
## Mazda RX4 Wag     21.0  160 110 3.90 2.875 17.02
## Datsun 710        22.8  108  93 3.85 2.320 18.61
## Hornet 4 Drive    21.4  258 110 3.08 3.215 19.44
## Hornet Sportabout 18.7  360 175 3.15 3.440 17.02
## Valiant           18.1  225 105 2.76 3.460 20.22

#Let's compute the correlation matrix
cor_1 <- round(cor(my_data), 2)
cor_1

##        mpg  disp    hp  drat    wt  qsec
## mpg   1.00 -0.85 -0.78  0.68 -0.87  0.42
## disp -0.85  1.00  0.79 -0.71  0.89 -0.43
## hp   -0.78  0.79  1.00 -0.45  0.66 -0.71
## drat  0.68 -0.71 -0.45  1.00 -0.71  0.09
## wt   -0.87  0.89  0.66 -0.71  1.00 -0.17
## qsec  0.42 -0.43 -0.71  0.09 -0.17  1.00

Unfortunately, the function cor() returns only the correlation coefficients between variables. In the next section, we will use Hmisc R package to calculate the correlation p-values.

Correlation matrix with significance levels (p-value)

The function rcorr() (in Hmisc package) can be used to compute the significance levels for pearson and spearman correlations. It returns both the correlation coefficients and the p-value of the correlation for all possible pairs of columns in the data table.

Simplified format:

rcorr(x, type = c("pearson","spearman"))

x should be a matrix. The correlation type can be either pearson or spearman.

Example:

library("Hmisc")

cor_2 <- rcorr(as.matrix(my_data))
cor_2

##        mpg  disp    hp  drat    wt  qsec
## mpg   1.00 -0.85 -0.78  0.68 -0.87  0.42
## disp -0.85  1.00  0.79 -0.71  0.89 -0.43
## hp   -0.78  0.79  1.00 -0.45  0.66 -0.71
## drat  0.68 -0.71 -0.45  1.00 -0.71  0.09
## wt   -0.87  0.89  0.66 -0.71  1.00 -0.17
## qsec  0.42 -0.43 -0.71  0.09 -0.17  1.00
## 
## n= 32 
## 
## 
## P
##      mpg    disp   hp     drat   wt     qsec  
## mpg         0.0000 0.0000 0.0000 0.0000 0.0171
## disp 0.0000        0.0000 0.0000 0.0000 0.0131
## hp   0.0000 0.0000        0.0100 0.0000 0.0000
## drat 0.0000 0.0000 0.0100        0.0000 0.6196
## wt   0.0000 0.0000 0.0000 0.0000        0.3389
## qsec 0.0171 0.0131 0.0000 0.6196 0.3389

The output of the function rcorr() is a list containing the following elements :

r : the correlation matrix
n : the matrix of the number of observations used in analyzing each pair of variables
P : the p-values corresponding to the significance levels of correlations.

Extracting the p-values or the correlation coefficients from the output:

str(cor_2)

## List of 3
##  $ r: num [1:6, 1:6] 1 -0.848 -0.776 0.681 -0.868 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:6] "mpg" "disp" "hp" "drat" ...
##   .. ..$ : chr [1:6] "mpg" "disp" "hp" "drat" ...
##  $ n: int [1:6, 1:6] 32 32 32 32 32 32 32 32 32 32 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:6] "mpg" "disp" "hp" "drat" ...
##   .. ..$ : chr [1:6] "mpg" "disp" "hp" "drat" ...
##  $ P: num [1:6, 1:6] NA 9.38e-10 1.79e-07 1.78e-05 1.29e-10 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:6] "mpg" "disp" "hp" "drat" ...
##   .. ..$ : chr [1:6] "mpg" "disp" "hp" "drat" ...
##  - attr(*, "class")= chr "rcorr"

# As you can see "cor_2" is a list so extracting these values is quite simple...

# p-values
cor_2$P

##               mpg         disp           hp         drat           wt
## mpg            NA 9.380354e-10 1.787838e-07 1.776241e-05 1.293956e-10
## disp 9.380354e-10           NA 7.142686e-08 5.282028e-06 1.222311e-11
## hp   1.787838e-07 7.142686e-08           NA 9.988768e-03 4.145833e-05
## drat 1.776241e-05 5.282028e-06 9.988768e-03           NA 4.784268e-06
## wt   1.293956e-10 1.222311e-11 4.145833e-05 4.784268e-06           NA
## qsec 1.708199e-02 1.314403e-02 5.766250e-06 6.195823e-01 3.388682e-01
##              qsec
## mpg  1.708199e-02
## disp 1.314403e-02
## hp   5.766250e-06
## drat 6.195823e-01
## wt   3.388682e-01
## qsec           NA

# Correlation matrix
cor_2$r

##             mpg       disp         hp        drat         wt        qsec
## mpg   1.0000000 -0.8475513 -0.7761683  0.68117189 -0.8676594  0.41868404
## disp -0.8475513  1.0000000  0.7909486 -0.71021390  0.8879799 -0.43369791
## hp   -0.7761683  0.7909486  1.0000000 -0.44875914  0.6587479 -0.70822340
## drat  0.6811719 -0.7102139 -0.4487591  1.00000000 -0.7124406  0.09120482
## wt   -0.8676594  0.8879799  0.6587479 -0.71244061  1.0000000 -0.17471591
## qsec  0.4186840 -0.4336979 -0.7082234  0.09120482 -0.1747159  1.00000000
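The P matrix can be cross-checked without Hmisc by running cor.test() on every pair of columns. This is only a base-R sketch for verification (the names vars, d and p_base are illustrative), not a replacement for rcorr(); tiny differences in the last digits are to be expected:

```r
# Pairwise Pearson p-values with base R only
vars <- c("mpg", "disp", "hp", "drat", "wt", "qsec")
d    <- mtcars[, vars]

p_base <- matrix(NA_real_, nrow = length(vars), ncol = length(vars),
                 dimnames = list(vars, vars))
for (i in seq_along(vars)) {
  for (j in seq_along(vars)) {
    # the diagonal stays NA, as in the rcorr() output
    if (i != j) p_base[i, j] <- cor.test(d[[i]], d[[j]])$p.value
  }
}

signif(p_base, 4)
```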

Custom function for convenient formatting of the correlation matrix

This section provides a simple function for formatting a correlation matrix into a table with 4 columns containing :

Column 1 : row names (variable 1 for the correlation test)
Column 2 : column names (variable 2 for the correlation test)
Column 3 : the correlation coefficients
Column 4 : the p-values of the correlations

flat_cor_mat <- function(cor_r, cor_p){
  #This function provides a simple formatting of a correlation matrix
  #into a table with 4 columns containing :
    # Column 1 : row names (variable 1 for the correlation test)
    # Column 2 : column names (variable 2 for the correlation test)
    # Column 3 : the correlation coefficients
    # Column 4 : the p-values of the correlations
  library(tidyr)
  library(tibble)
  library(dplyr)   # needed for left_join()
  cor_r <- rownames_to_column(as.data.frame(cor_r), var = "row")
  cor_r <- gather(cor_r, column, cor, -1)
  cor_p <- rownames_to_column(as.data.frame(cor_p), var = "row")
  cor_p <- gather(cor_p, column, p, -1)
  cor_p_matrix <- left_join(cor_r, cor_p, by = c("row", "column"))
  cor_p_matrix
}

cor_3 <- rcorr(as.matrix(mtcars[, 1:7]))

my_cor_matrix <- flat_cor_mat(cor_3$r, cor_3$P)
head(my_cor_matrix)

##    row column        cor            p
## 1  mpg    mpg  1.0000000           NA
## 2  cyl    mpg -0.8521619 6.112697e-10
## 3 disp    mpg -0.8475513 9.380354e-10
## 4   hp    mpg -0.7761683 1.787838e-07
## 5 drat    mpg  0.6811719 1.776241e-05
## 6   wt    mpg -0.8676594 1.293956e-10
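The table above lists every pair twice (and the diagonal as well). If each pair is wanted only once, the upper triangle can be flattened instead. The following is a base-R sketch of that idea; the function name flat_cor_upper is hypothetical, and the p-values are computed with cor.test() here only to keep the example free of extra packages:

```r
# Flatten only the upper triangle of a correlation matrix (base R)
flat_cor_upper <- function(cor_r, cor_p) {
  ut <- upper.tri(cor_r)                  # TRUE above the diagonal
  data.frame(
    row    = rownames(cor_r)[row(cor_r)[ut]],
    column = colnames(cor_r)[col(cor_r)[ut]],
    cor    = cor_r[ut],
    p      = cor_p[ut],
    stringsAsFactors = FALSE
  )
}

vars  <- names(mtcars)[1:7]
r_mat <- cor(mtcars[, vars])
# pairwise p-values; NA on the diagonal
p_mat <- sapply(vars, function(a) sapply(vars, function(b) {
  if (a == b) NA_real_ else cor.test(mtcars[[a]], mtcars[[b]])$p.value
}))

flat <- flat_cor_upper(r_mat, p_mat)
head(flat)
```

With 7 variables this gives choose(7, 2) = 21 rows, one per pair.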

Visualization of a correlation matrix

There are several different ways for visualizing a correlation matrix in R software:

symnum() function
corrplot() function to plot a correlogram
scatter plots
heatmap

We’ll run through all of these, and then go into a bit more detail with correlograms.

Use `symnum()` function: Symbolic number coding

The R function symnum() is used to symbolically encode a given numeric or logical vector or array. It is particularly useful for visualization of structured matrices, e.g., correlation, sparse, or logical ones. In the case of a correlation matrix it replaces correlation coefficients by symbols according to the level of the correlation.

Simplified format:

symnum(x, cutpoints = c(0.3, 0.6, 0.8, 0.9, 0.95),
       symbols = c(" ", ".", ",", "+", "*", "B"),
       abbr.colnames = TRUE)

Here:

x: the correlation matrix to visualize
cutpoints: correlation coefficient cutpoints. The correlation coefficients between 0 and 0.3 are replaced by a space (" "); correlation coefficients between 0.3 and 0.6 are replaced by "."; etc.
symbols: the symbols to use.
abbr.colnames: logical value. If TRUE, colnames are abbreviated.

Example:

cor_4 <- cor(mtcars[1:6])
symnum(cor_4, abbr.colnames = FALSE)

##      mpg cyl disp hp drat wt
## mpg  1                      
## cyl  +   1                  
## disp +   *   1              
## hp   ,   +   ,    1         
## drat ,   ,   ,    .  1      
## wt   +   ,   +    ,  ,    1 
## attr(,"legend")
## [1] 0 ' ' 0.3 '.' 0.6 ',' 0.8 '+' 0.9 '*' 0.95 'B' 1

*As indicated in the legend, the correlation coefficients between 0 and 0.3 are replaced by a space (" "); correlation coefficients between 0.3 and 0.6 are replaced by "."; etc.*
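The mapping from a coefficient to its symbol can be reproduced for a single value with cut(). This is only a sketch of the idea, assuming the default cutpoints and symbols listed above and working on the absolute value of the coefficient; it is not how symnum() is implemented internally:

```r
# Which symbol does the mpg/wt correlation get?
r <- cor(mtcars$mpg, mtcars$wt)   # about -0.87

sym <- cut(abs(r),
           breaks = c(0, 0.3, 0.6, 0.8, 0.9, 0.95, 1),
           labels = c(" ", ".", ",", "+", "*", "B"),
           include.lowest = TRUE)
as.character(sym)
```

An absolute value of about 0.87 falls between 0.8 and 0.9, which is the "+" shown for the wt/mpg cell in the symnum() output.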

Use the `corrplot()` function: Draw a correlogram

The function corrplot(), in the package of the same name, creates a graphical display of a correlation matrix, highlighting the most correlated variables in a data table.

In this plot, correlation coefficients are colored according to the value. Correlation matrix can be also reordered according to the degree of association between variables.

The simplified format of the function is:

corrplot(corr, method="circle")

Here:

corr: the correlation matrix to be visualized
method: The visualization method to be used, there are seven different options: “circle”, “square”, “ellipse”, “number”, “shade”, “color”, “pie”.

Example:

M<-cor(mtcars)
head(round(M,2))

##        mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
## mpg   1.00 -0.85 -0.85 -0.78  0.68 -0.87  0.42  0.66  0.60  0.48 -0.55
## cyl  -0.85  1.00  0.90  0.83 -0.70  0.78 -0.59 -0.81 -0.52 -0.49  0.53
## disp -0.85  0.90  1.00  0.79 -0.71  0.89 -0.43 -0.71 -0.59 -0.56  0.39
## hp   -0.78  0.83  0.79  1.00 -0.45  0.66 -0.71 -0.72 -0.24 -0.13  0.75
## drat  0.68 -0.70 -0.71 -0.45  1.00 -0.71  0.09  0.44  0.71  0.70 -0.09
## wt   -0.87  0.78  0.89  0.66 -0.71  1.00 -0.17 -0.55 -0.69 -0.58  0.43

#Visualize the correlation matrix

# method = "circle"
corrplot(M, method = "circle")

# method = "ellipse"
corrplot(M, method = "ellipse")

# method = "pie"
corrplot(M, method = "pie")

# method = "color"
corrplot(M, method = "color")

Display the correlation coefficient:

corrplot(M, method = "number")

Correlogram layouts:

There are three general types of a correlogram layout :

“full” (default) : display full correlation matrix
“upper”: display upper triangular of the correlation matrix
“lower”: display lower triangular of the correlation matrix

Examples:

# upper triangular
corrplot(M, type = "upper")

#lower triangular
corrplot(M, type = "lower")

Reordering the correlation matrix

The correlation matrix can be reordered according to the correlation coefficient. This is important to identify the hidden structure and pattern in the matrix. Use order = "hclust" argument for hierarchical clustering of correlation coefficients.

Example:

# correlogram with hclust reordering
corrplot(M, order = "hclust")

# or exploit the symmetry of the correlation matrix
# correlogram with hclust reordering
corrplot(M, type = "upper", order = "hclust")
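To get a feeling for what the hierarchical-clustering order does, the reordering can be approximated by hand with base R. This sketch assumes 1 - correlation as the distance and complete linkage (the hclust() default); corrplot's internal ordering may differ in details, so treat it as an illustration rather than a reimplementation:

```r
# Reorder a correlation matrix by hierarchical clustering (base R)
M   <- cor(mtcars)
hc  <- hclust(as.dist(1 - M))   # assumption: distance = 1 - correlation
ord <- hc$order                 # leaf order of the dendrogram

M_reordered <- M[ord, ord]      # same matrix, rows and columns permuted
colnames(M_reordered)
```

Strongly correlated variables end up next to each other, which is what makes the block structure visible in the plot.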

Changing the color and direction of text labels in the correlogram

Examples:

# Change background color to lightgreen and color of the circles to darkorange and steel blue
corrplot(M, type = "upper", order = "hclust", col = c("darkorange", "steelblue"),
         bg = "lightgreen")

# use "colorRampPalette" to obtain continuous color scales
col <- colorRampPalette(c("darkorange", "white", "steelblue"))(20)
corrplot(M, type = "upper", order = "hclust", col = col)

# Or use "RColorBrewer" package
library(RColorBrewer)
corrplot(M, type = "upper", order = "hclust",
         col = brewer.pal(n = 9, name = "PuOr"), bg = "darkgreen")

Use the tl.col argument for defining the text label color and tl.srt for text label string rotation.

Example:

corrplot(M, type = "upper", order = "hclust", tl.col = "darkblue", tl.srt = 45)

Combining correlogram with the significance test

# Mark the insignificant coefficients according to the specified p-value significance level
cor_5 <- rcorr(as.matrix(mtcars))
M <- cor_5$r
p_mat <- cor_5$P
corrplot(M, type = "upper", order = "hclust", 
         p.mat = p_mat, sig.level = 0.01)

# Leave the non-significant coefficients blank
corrplot(M, type = "upper", order = "hclust", 
         p.mat = p_mat, sig.level = 0.05, insig = "blank")

Fine tuning customization of the correlogram

col <- colorRampPalette(c("#BB4444", "#EE9988", "#FFFFFF", "#77AADD", "#4477AA"))
corrplot(M, method = "color", col = col(200),  
         type = "upper", order = "hclust", 
         addCoef.col = "black", # Add coefficient of correlation
         tl.col = "darkblue", tl.srt = 45, #Text label color and rotation
         # Combine with significance level
         p.mat = p_mat, sig.level = 0.01,  
         # hide correlation coefficient on the principal diagonal
         diag = FALSE 
         )

I’d say this is more than enough for an introductory exploration of correlograms. More information can be found in the already mentioned vignette for the corrplot R package – An Introduction to corrplot Package.

Use `chart.Correlation()`: Draw scatter plots

The function chart.Correlation() from the package “PerformanceAnalytics” can be used to display a chart of a correlation matrix. This is a very convenient way of exploring multivariate correlations.

library("PerformanceAnalytics")

## Loading required package: xts

## Loading required package: zoo

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

## 
## Attaching package: 'xts'

## The following objects are masked from 'package:dplyr':
## 
##     first, last

## 
## Attaching package: 'PerformanceAnalytics'

## The following object is masked from 'package:graphics':
## 
##     legend

my_data <- mtcars[, c(1,3,4,5,6,7)]
chart.Correlation(my_data, histogram = TRUE, pch = 19)

In the above plot:

The distribution of each variable is shown on the diagonal.
Below the diagonal: the bivariate scatter plots with a fitted line are displayed.
Above the diagonal: the value of the correlation plus the significance level as stars.
Each significance level is associated with a symbol: p-values(0, 0.001, 0.01, 0.05, 0.1, 1) <=> symbols("***", "**", "*", ".", " ")
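The p-value to symbol mapping (0 to 0.001 gives "***", 0.001 to 0.01 gives "**", 0.01 to 0.05 gives "*", 0.05 to 0.1 gives ".", anything above gives a blank) can be reproduced with symnum(). A small sketch, using three p-values copied from the rcorr() output earlier in this text; note that corr = FALSE is needed because these are p-values, not correlations:

```r
# Translate p-values into significance stars
p <- c(1.29e-10, 0.0171, 0.6196)   # mpg/wt, mpg/qsec, drat/qsec

stars <- symnum(p, corr = FALSE, na = FALSE,
                cutpoints = c(0, 0.001, 0.01, 0.05, 0.1, 1),
                symbols   = c("***", "**", "*", ".", " "))
as.character(stars)
```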

Use `heatmap()`

I don’t really consider this method of correlation matrix visualization to be of practical value, but nevertheless here is a small example:

# Get some colors
col <- colorRampPalette(c("darkblue", "white", "darkorange"))(20)
M <- cor(mtcars[1:7])
heatmap(x = M, col = col, symm = TRUE)

R drawing png with high resolution

szypanther — Fri, 31 May 2019 04:34:37 +0000

A reproducible example:

the_plot <- function()
{
  x <- seq(0, 1, length.out = 100)
  y <- pbeta(x, 1, 10)
  plot(
    x,
    y,
    xlab = "False Positive Rate",
    ylab = "Average true positive rate",
    type = "l"
  )
}

png(
  "test.png",
  width     = 3.25,
  height    = 3.25,
  units     = "in",
  res       = 1200,
  pointsize = 4
)
par(
  mar      = c(5, 5, 2, 2),
  xaxs     = "i",
  yaxs     = "i",
  cex.axis = 2,
  cex.lab  = 2
)
the_plot()
dev.off()
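It helps to work out what these settings mean in pixels. With units = "in", the pixel size of the file is simply inches times res, and pointsize is interpreted in big points (1/72 inch) at that resolution. A quick arithmetic sketch with the values from the call above (the variable names are illustrative):

```r
# Pixel dimensions implied by the png() call above
width_in  <- 3.25
height_in <- 3.25
res_dpi   <- 1200

px <- c(width = width_in * res_dpi, height = height_in * res_dpi)
px

# Default text size in pixels: pointsize / 72 inches at res_dpi pixels per inch
text_px <- 4 / 72 * res_dpi
text_px
```

So the file is 3900 x 3900 pixels, which is why the small pointsize is combined with larger cex values in par().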

Of course, a better solution is to drop base graphics and use a system that handles the resolution scaling for you. For example,

library(ggplot2)

ggplot_alternative <- function()
{
  # build x first so that y can be computed from it
  x <- seq(0, 1, length.out = 100)
  the_data <- data.frame(
    x = x,
    y = pbeta(x, 1, 10)
  )

  ggplot(the_data, aes(x, y)) +
    geom_line() +
    xlab("False Positive Rate") +
    ylab("Average true positive rate") +
    coord_cartesian(xlim = c(0, 1), ylim = c(0, 1))
}

ggsave(
  "ggtest.png",
  ggplot_alternative(),
  width = 3.25,
  height = 3.25,
  dpi = 1200
)

fasta2nexus by R script

szypanther — Fri, 12 Jan 2018 08:59:54 +0000

[Workspace loaded from ~/.RData]

> setwd("/home/shenzy/work/beast/51samples")
> library(seqinr)
> data=read.fasta("51strain_core_gene_alignment.aln")
> library(ape)

Attaching package: ‘ape’

The following objects are masked from ‘package:seqinr’:

    as.alignment, consensus

> write.nexus.data(data,file="51strain_core_gene_alignment.aln.nexus", format="DNA")

Remove grid and background from plot (ggplot2)

szypanther — Mon, 10 Apr 2017 06:16:20 +0000

2013-11-27 | category RStudy | tag ggplot2

Generate data

library(ggplot2)
a <- seq(1, 20)
b <- a^0.25
df <- as.data.frame(cbind(a, b))

basic plot

myplot = ggplot(df, aes(x = a, y = b)) + geom_point()
myplot

theme_bw() will get rid of the background

myplot + theme_bw()

remove grid (does not remove background colour and border lines)

myplot + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())

remove border lines (does not remove background colour and grid lines)

myplot + theme(panel.border = element_blank())

remove background (removes background colour and border lines, but does not remove grid lines)

myplot + theme(panel.background = element_blank())

add axis line

myplot + theme(axis.line = element_line(colour = "black"))

put all together – method 1

myplot + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
panel.background = element_blank(), axis.line = element_line(colour = "black"))

put all together – method 2

myplot + theme_bw() + theme(panel.border = element_blank(), panel.grid.major = element_blank(),
panel.grid.minor = element_blank(), axis.line = element_line(colour = "black"))

Size Matters: Metabolic Rate and Longevity (Regression analysis sample)

szypanther — Wed, 30 Nov 2016 04:21:29 +0000

John Tukey once said, “The best thing about being a statistician is that you get to play in everyone’s backyard.” I enthusiastically agree!

I frequently enjoy reading and watching science-related material. This invariably raises questions, involving other “backyards,” that I can better understand using statistics. For instance, see my post about the statistical analysis of dolphin sounds.

The latest topic that grabbed my attention was an apparent error in the BBC program Wonders of Life. In the episode “Size Matters,” Professor Brian Cox presents a graph with a linear regression line that illustrates the relationship between the size of mammals and their metabolic rate.

How Does the Size of Mammals Affect Their Lives?

Brian Cox, a theoretical physicist, is a really smart guy and one of my favorite science presenters. So, I was surprised when his interpretation of the linear regression model seemed incorrect. Below is a closer look at the graph he presents, and his claim.

Cox points out the straight line and states, “That implies, gram-for-gram, large animals use less energy than small animals . . . because the slope is less than one.”

For linear regression, the slope being less than 1 is irrelevant. Instead, the fact that it is a straight line indicates that the same relationship applies for both small and large mammals. If you increase mass by 1 unit for a small mammal and for a large mammal, metabolism increases by the same average amount for both sizes. In other words, gram-for-gram, size doesn’t seem to matter!

However, it’s unlikely that Cox would make such a fundamental mistake, so I conducted some research. It turns out that biologists use a log-log plot to model the relationship between the mass of mammals and their basal metabolic rate.

Perhaps Cox actually presented a log-log plot but glossed over the details?

If so, this dramatically changes the interpretation of this graph, because log-log plots transform both axes in order to model curvature while using linear regression. If the slope on a log-log plot of metabolic rate by mass is between 0 and 1, it indicates that the nonlinear effect of mass on metabolic rate lessens as mass increases.

This description fits Cox’s statements about the slope and how mass affects metabolic rate.

Let’s test Cox’s hypothesis ourselves! Thanks to the PanTHERIA database*, we can fit the same type of log-log plot using similar data.

Log-Log Plot of Mammal Mass and Basal Metabolic Rate

I’ll use the Fitted Line Plot in Minitab statistical software to fit a regression line to 572 mammals, ranging from the masked shrew (4.2 grams) to the common eland (562,000 grams). You can find the data here.

In Minitab, go to Stat > Regression > Fitted Line Plot. Metabolic rate is our response variable and adult mass is our predictor. Go to Options and check all four boxes under Transformations to produce the log-log plot.

The slope (0.7063) is significant (p < 0.001) and its value is consistent with recently published estimates. Because the slope is between 0 and 1, it confirms Cox’s interpretation. Gram-for-gram, larger animals use less energy than smaller animals. In order to function, a cell in a larger animal requires less energy than a cell in a smaller animal.
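The same kind of log-log fit can be sketched in R rather than Minitab. The example below is only an illustration: the data are simulated from an assumed power law with exponent 0.7, not taken from PanTHERIA, and the variable names (`mass_g`, `rate`) are made up for this sketch.

```r
# Synthetic illustration only: these are NOT the PanTHERIA measurements.
set.seed(1)
n <- 200
mass_g <- 10^runif(n, min = 0.6, max = 5.7)   # roughly 4 g to 500 kg

# Assumed power law: rate = a * mass^b with b = 0.7, plus multiplicative noise
rate <- 3 * mass_g^0.7 * 10^rnorm(n, sd = 0.1)

# Linear regression on log10-transformed axes (the "log-log plot")
fit <- lm(log10(rate) ~ log10(mass_g))
slope <- unname(coef(fit)[2])
r_squared <- summary(fit)$r.squared

# A slope between 0 and 1 means the rate per gram falls as mass grows,
# because rate / mass is proportional to mass^(slope - 1).
per_gram_exponent <- slope - 1
print(c(slope = slope, r_squared = r_squared,
        per_gram_exponent = per_gram_exponent))
```

On the natural scale the fitted model is a curve, rate = 10^intercept * mass^slope, which is why a straight line on the log-log plot does not imply a constant gram-for-gram effect.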

I’m quite amazed that the R-squared is 94.3%! Usually you only see R-squared values this high for a low-noise physical process. Instead, these data were collected by a variety of researchers in different settings and cover a huge range of mammals that live in completely different habitats.

So Cox presented the correct interpretation after all: for mammals, size matters. However, he presented an oversimplified version of the underlying analysis by not mentioning the double-log transformations. This is television, after all!

There are important implications based on the fact that this model is curved rather than linear. If the increase in metabolic rate remained constant as mass increased (the linear model), we’d have to eat 16,000 calories a day to sustain ourselves. Further, mammals wouldn’t be able to grow larger than a goat due to overheating!

Basal Metabolic Rate and Longevity

Cox also presented the idea that mammals with a slower metabolism live longer than those with a faster metabolism. However, he didn’t present data or graphs to support this contention. Fortunately, we can test this hypothesis as well.

In Minitab, I used Calc > Calculator to divide metabolic rate by grams. This division allows us to make the gram-for-gram comparison. Time for another fitted line plot with a double-log transformation!

The negative slope is significant (p < 0.001) and tells us that as metabolic rate per gram increases, longevity decreases. The R-squared is 45.8%, which is pretty good considering that we’re looking at just one of many factors that can impact maximum lifespan!

However, it is not a linear relationship because this is a log-log plot. Maximum longevity asymptotically approaches a minimum value around 13 months as metabolism increases. The graph below shows the curvilinear relationship using the natural scale.

A one-unit increase in the slower metabolic rates (left side of x-axis) produces much larger drops in longevity than a one-unit increase in the faster metabolic rates (right side of x-axis).

Once again, we’ve shown that size does matter! Larger mammals tend to have a slower metabolism. And animals that have a slower metabolism tend to live longer. That’s fortunate for us because without our slower metabolism, we’d only live about a year!

The Beauty of Data Analysis: How to Perform Regression Analysis

szypanther — Thu, 30 Jun 2016 04:21:27 +0000

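1. Determine whether the predictors are related to Y

Goal: show that at least one of the predictors X1, X2, ..., Xp is related to the response Y.

For any given value of n (the number of observations) and p (the number of predictors X), any statistical software package can be used to compute the p-value associated with the F-statistic using the F-distribution. Based on this p-value, we can determine whether or not to reject H0. In other words, the hypothesis is tested with the software-computed p-value associated with the F-statistic.

Example:

Is there a relationship between sales and advertising budget (TV, radio, and newspaper)?

The p-value corresponding to the F-statistic in Table 3.6 is very low, indicating clear evidence of a relationship between advertising and sales.

Background review:

The t-statistic (t-test) and the F-statistic

t-statistic = (estimate of the regression coefficient β − 0) / (standard error of that estimate), which measures the number of standard deviations that the estimate of β is away from 0. It is a statistic used to test a single hypothesis about one parameter of a model, for example in econometrics.

The t-statistic is generally used to test whether a regression coefficient is 0. For example, in the linear regression Y = β0 + β1X, to check whether X and Y are related:

H0: X and Y are unrelated, i.e. β1 = 0

H1: X and Y are related, i.e. β1 ≠ 0

Compute the t-statistic; if it is far from zero, X and Y are related. In practice, the p-value is used to decide whether X and Y are related.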
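As a small sketch of this test in R, the t-statistic and its p-value can be recomputed by hand from the coefficient table that `summary()` returns for an `lm` fit. The data here are simulated (a true slope of 0.5 is an assumption of the example), so the numbers carry no meaning beyond illustration.

```r
set.seed(42)
x <- rnorm(100)
y <- 2 + 0.5 * x + rnorm(100)   # synthetic data with a true slope of 0.5

fit <- lm(y ~ x)
# Coefficient table with columns: Estimate, Std. Error, t value, Pr(>|t|)
coefs <- summary(fit)$coefficients

# t = (estimate - 0) / standard error, for H0: beta1 = 0
t_manual <- coefs["x", "Estimate"] / coefs["x", "Std. Error"]

# Two-sided p-value from the t distribution with n - 2 degrees of freedom
p_manual <- 2 * pt(abs(t_manual), df = df.residual(fit), lower.tail = FALSE)

print(c(t = t_manual, p = p_manual))
```

The hand-computed values should agree with the `t value` and `Pr(>|t|)` columns of the same table.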

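1) p-values (probability, Pr)

1 Definition

Definition of the p-value: the probability, assuming the null hypothesis is true, of observing the current result or a more extreme one.

The p-value is a commonly used measure of statistical significance.

The p-value (P-Value, Probability, Pr) is a probability: it reflects how likely an event is. In significance testing, P < 0.05 is conventionally called significant and P < 0.01 highly significant. This means that, if the null hypothesis were true, a difference between samples at least as large as the one observed would arise from sampling error alone with probability below 0.05 or 0.01. A p-value does not measure how important a result is; it only describes how probable such data are under the null hypothesis.

Hypothesis testing is an important part of inferential statistics, and the p-value is one of the criteria on which the test decision is based.

A large p-value means there is not yet enough evidence to reject the null hypothesis.

2 Why p-values exist

The idea behind the p-value approach is to run an experiment and then check whether the result looks like what chance alone would produce. Researchers first state a "null hypothesis" that they want to refute, for example that two groups of data are uncorrelated or do not differ significantly. They then play devil's advocate: they assume the null hypothesis is true and compute the probability of obtaining a result at least as extreme as the one actually observed. That probability is the p-value. Fisher argued that the smaller the p-value, the stronger the evidence that the null hypothesis is false.

The reasoning is simple and rests on two principles:

1) A proposition can only be falsified, not proven true

2) A small-probability event is assumed not to occur in a single trial

The logic of the proof is: I want to show that a proposition is true -> show that its negation is false -> under the assumption that the negation holds, a small-probability event is observed -> done.

3 Demo

Throwing darts: suppose the board has ten rings, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1 (10 is the center). Define a qualified thrower as one whose true skill level lands darts in rings 10 to 3, regardless of how they perform on the day. Assume rings 10 to 3 cover 95% of the board's area.

H0: A is not a qualified thrower

H1: A is a qualified thrower

Applying the logic to this example: to show that A is a qualified thrower -> show that the proposition "A is not a qualified thrower" is false -> observe an event (say, A hits ring 10 ten times in a row) whose probability p under the assumption "A is not a qualified thrower" is below 0.05 -> a small-probability event has occurred, so the negation is rejected.

So the smaller p is -> the more improbable the event -> the stronger the case for rejecting the negation -> the more credible the original proposition.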
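To put a number on the dart demo, one extra assumption is needed that the text does not give: how often an unqualified thrower hits ring 10. The sketch below assumes, purely for illustration, a probability of 0.3 per throw with independent throws, and computes the p-value of "ten hits in ten throws" from the binomial distribution.

```r
# Assumption for illustration only: an unqualified thrower hits ring 10
# with probability 0.3 on any single throw, independently across throws.
p_hit_under_h0 <- 0.3
throws <- 10
hits <- 10

# p-value = P(at least `hits` bullseyes in `throws` throws | H0)
p_value <- pbinom(hits - 1, size = throws, prob = p_hit_under_h0,
                  lower.tail = FALSE)

# Reject "A is not a qualified thrower" at the 5% level if p_value < 0.05
reject_h0 <- p_value < 0.05
print(c(p_value = p_value, reject_h0 = reject_h0))
```

With a different assumed hit probability the p-value changes, but the logic stays the same: the event is judged against what the null hypothesis would make likely.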

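2) F-statistic

The t-test checks the significance of a single coefficient, i.e. whether one variable X is related to Y, such as whether TV advertising spend helps sales. The null hypothesis of the t-test is that the coefficient of that explanatory variable is 0.

The F-test assesses the joint effect of all predictors on the response and is used when the model contains several variables (X1, X2, X3, ...). The null hypothesis of the F-test is that all regression coefficients (excluding the intercept) are 0.

That is, the F-test is used to show that at least one of the variables X1, X2, X3, ... is related to Y.

Because the null hypothesis of the F-test is H0: all regression coefficients equal 0, rejecting it indicates that the model as a whole is meaningful. If the F-test fails to reject H0, there is no point in running the other tests: no coefficient differs significantly from 0, which is equivalent to having no model (i.e. none of the variables X1, X2, X3, ... is related to Y).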
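A minimal sketch of the overall F-test in R, on simulated data in which only `x1` truly affects `y` (an assumption of the example). `summary()` of an `lm` fit stores the F-statistic and its degrees of freedom; the p-value comes from `pf()`, and the same F can be rebuilt from the sums of squares.

```r
set.seed(7)
n <- 120
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y <- 1 + 0.8 * x1 + rnorm(n)      # synthetic: only x1 truly matters

fit <- lm(y ~ x1 + x2 + x3)
fs <- summary(fit)$fstatistic     # named vector: value, numdf, dendf

# p-value for H0: all slope coefficients are zero
p_overall <- unname(pf(fs["value"], fs["numdf"], fs["dendf"],
                       lower.tail = FALSE))

# Same F from sums of squares: ((TSS - RSS) / p) / (RSS / (n - p - 1))
rss <- sum(residuals(fit)^2)
tss <- sum((y - mean(y))^2)
p <- 3
f_manual <- ((tss - rss) / p) / (rss / (n - p - 1))

print(c(F = f_manual, p_value = p_overall))
```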

2. Identify the useful subset of predictors

Do all the predictors help to explain Y, or is only a subset of the predictors useful?

The first step in a multiple regression analysis is to compute the F-statistic and to examine the associated p-value. If we conclude on the basis of that p-value that at least one of the predictors is related to the response, then it is natural to wonder which are the guilty ones!

The task of determining which predictors are associated with the response, in order to fit a single model involving only those predictors, is referred to as variable selection.

There are three classical approaches for this task: forward selection, backward selection, and mixed selection.

1) Forward selection.

We begin with the null model, a model that contains an intercept but no predictors. We then fit p simple linear regressions and add to the null model the variable that results in the lowest RSS. We then add to that model the variable that results in the lowest RSS for the new two-variable model. This approach is continued until some stopping rule is satisfied.

2) Backward selection.

We start with all variables in the model, and remove the variable with the largest p-value, that is, the variable that is the least statistically significant. The new (p − 1)-variable model is fit, and the variable with the largest p-value is removed. This procedure continues until a stopping rule is reached. For instance, we may stop when all remaining variables have a p-value below some threshold.

3) Mixed selection.

This is a combination of forward and backward selection. We start with no variables in the model, and as with forward selection, we add the variable that provides the best fit. We continue to add variables one-by-one. Of course, as we noted with the Advertising example, the p-values for variables can become larger as new predictors are added to the model. Hence, if at any point the p-value for one of the variables in the model rises above a certain threshold, then we remove that variable from the model. We continue to perform these forward and backward steps until all variables in the model have a sufficiently low p-value, and all variables outside the model would have a large p-value if added to the model.

Compare:

Backward selection requires that the number of samples n is larger than the number of variables p (so that the full model can be fit). In contrast, forward stepwise can be used even when n < p, and so is the only viable subset method when p is very large.
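The forward and backward procedures can be sketched in a few lines of base R. The helper functions `forward_order()` and `backward_pvalue()` below are hypothetical names written for this example, the data are simulated so that only `x1` and `x3` matter, and the forward version simply records the order in which variables would be added instead of applying a stopping rule.

```r
set.seed(3)
n <- 150
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - 1.5 * dat$x3 + rnorm(n)   # synthetic: x1, x3 matter

# Forward selection: at each step add the predictor giving the lowest RSS
forward_order <- function(data, response) {
  remaining <- setdiff(names(data), response)
  chosen <- character(0)
  while (length(remaining) > 0) {
    rss <- sapply(remaining, function(v) {
      f <- reformulate(c(chosen, v), response = response)
      sum(residuals(lm(f, data = data))^2)
    })
    best <- names(which.min(rss))
    chosen <- c(chosen, best)
    remaining <- setdiff(remaining, best)
  }
  chosen
}

# Backward selection: drop the largest p-value until all are below threshold
backward_pvalue <- function(data, response, threshold = 0.05) {
  vars <- setdiff(names(data), response)
  while (length(vars) > 0) {
    fit <- lm(reformulate(vars, response = response), data = data)
    pvals <- summary(fit)$coefficients[vars, "Pr(>|t|)"]
    if (max(pvals) < threshold) break
    vars <- vars[-which.max(pvals)]
  }
  vars
}

order_added <- forward_order(dat, "y")
kept <- backward_pvalue(dat, "y")
print(order_added)
print(kept)
```

Base R also ships `step()`, which automates stepwise search but ranks models by AIC rather than by RSS or p-values, so its results can differ from this sketch.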

How do we select the best model among a collection of models with different numbers of predictors?

We wish to choose a model with a low test error, and the training error can be a poor estimate of the test error. Therefore, RSS and R2 are not suitable for selecting the best model among a collection of models with different numbers of predictors.

The following approaches can be used to select among a set of models with different numbers of variables:

Cp, Akaike information criterion (AIC), Bayesian information criterion (BIC), and adjusted R2

In the past, performing cross-validation was computationally prohibitive for many problems with large p and/or large n, and so AIC, BIC, Cp, and adjusted R2 were more attractive approaches for choosing among a set of models. However, nowadays with fast computers, the computations required to perform cross-validation are hardly ever an issue. Thus, cross-validation is a very attractive approach for selecting from among a number of models under consideration.
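As a rough sketch of how these criteria compare in practice, the example below fits three nested models to simulated data (only `x1` matters, by construction) and tabulates R2, adjusted R2, AIC, BIC and a 5-fold cross-validated mean squared error. The helper `cv_mse()` is a hypothetical function written for this example and assumes the response column is named `y`.

```r
set.seed(11)
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- 1 + 1.5 * dat$x1 + rnorm(n)   # synthetic: only x1 matters

candidates <- list(
  small  = y ~ x1,
  medium = y ~ x1 + x2,
  large  = y ~ x1 + x2 + x3
)
fits <- lapply(candidates, lm, data = dat)

criteria <- data.frame(
  r2     = sapply(fits, function(f) summary(f)$r.squared),
  adj_r2 = sapply(fits, function(f) summary(f)$adj.r.squared),
  aic    = sapply(fits, AIC),
  bic    = sapply(fits, BIC)
)

# k-fold cross-validation estimate of the test mean squared error
# (assumes the response column is called "y")
cv_mse <- function(formula, data, k = 5) {
  folds <- sample(rep(seq_len(k), length.out = nrow(data)))
  errs <- sapply(seq_len(k), function(i) {
    fit <- lm(formula, data = data[folds != i, ])
    mean((data$y[folds == i] - predict(fit, data[folds == i, ]))^2)
  })
  mean(errs)
}
criteria$cv_mse <- sapply(candidates, cv_mse, data = dat)
print(criteria)
```

Plain R2 can only go up as variables are added, which is why it cannot be used to choose between these models, whereas the other columns penalize or directly estimate the cost of the extra predictors.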

TODO chapter6

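3. Model error (RSE, R^2)

How well does the model fit the data?

An R2 value close to 1 indicates that the model explains a large portion of the variance in the response variable (the dependent variable Y).
It turns out that R2 will always increase when more variables are added to the model, even if those variables are only weakly associated with the response.

Example:

For the Advertising data, the RSE is 1,681 units while the mean value for the response is 14,022, indicating a percentage error of roughly 12% (RSE / mean value). In addition, the R2 statistic records the percentage of variability in the response that is explained by the predictors. The predictors explain almost 90% of the variance in sales.

Background:

RSE: the residual standard error, an estimate of the standard deviation of the error term

R2 statistic (R-squared): an important criterion for judging how well a model fits

R-squared lies between 0 and 1; the closer it is to 1, the better the regression fits the data.

R^2, the coefficient of determination, is a goodness-of-fit measure: it gives the proportion of the variation in the dependent variable Y that is explained by the independent variable X in the regression model. In other words, it expresses how much of y can be explained by x (R2 measures the proportion of variability in Y that can be explained using X). A value of 0.92 means that 92% of the variation in y can be explained by x.

When R^2 = 1, all observations lie exactly on the fitted line or curve; when R^2 = 0, the fitted line or curve explains none of the variation in the dependent variable.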
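Both quantities can be computed directly from the residuals, which makes their definitions concrete. The sketch below uses simulated data (not the Advertising data set) and checks the hand-computed RSE and R2 against what `summary()` reports.

```r
set.seed(5)
x <- runif(80, 0, 10)
y <- 3 + 2 * x + rnorm(80, sd = 2)    # synthetic data

fit <- lm(y ~ x)
rss <- sum(residuals(fit)^2)          # residual sum of squares
tss <- sum((y - mean(y))^2)           # total sum of squares

rse <- sqrt(rss / df.residual(fit))   # RSE = sqrt(RSS / (n - p - 1))
r2  <- 1 - rss / tss                  # R^2 = 1 - RSS / TSS
pct_error <- rse / mean(y)            # RSE relative to the mean response

print(c(rse = rse, r2 = r2, pct_error = pct_error))
```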

How can R-squared be used to judge whether a model is adequate?

However, it can still be challenging to determine what is a good R2 value, and in general, this will depend on the application. For instance, in certain problems in physics, we may know that the data truly comes from a linear model with a small residual error. In this case, we would expect to see an R2 value that is extremely close to 1, and a substantially smaller R2 value might indicate a serious problem with the experiment in which the data were generated. On the other hand, in typical applications in biology, psychology, marketing, and other domains, the linear model (3.5) is at best an extremely rough approximation to the data, and residual errors due to other unmeasured factors are often very large. In this setting, we would expect only a very small proportion of the variance in the response to be explained by the predictor, and an R2 value well below 0.1 might be more realistic!

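4. Applying the model: accuracy of Y (confidence level, confidence interval, prediction interval)

Given a set of predictor values, what response value should we predict, and how accurate is our prediction?

Once we have fit the multiple regression model, it is straightforward to apply in order to predict the response Y on the basis of a set of values for the predictors X1, X2, . . . , Xp.

We can compute a confidence interval in order to determine how close Y' (the value computed by the model) will be to f(X) (the true underlying value).

To predict an individual response, use a prediction interval; to predict the average response, use a confidence interval.

Confidence intervals and prediction intervals

1 Confidence interval

The range in which the mean response is likely to fall, given specified settings of the predictors.

A confidence interval goes together with a confidence level. Put simply, an interval computed from the data covers the true value with a certain probability; that probability is the confidence level, and the range is the corresponding confidence interval.

The true value usually cannot be known, so we can only estimate it. The estimate is reported as a pair of numbers, for example 1 and 1.5, with the statement that the interval from 1 to 1.5 contains the true value at the 95% confidence level (and there is a 5% chance that the true value lies outside it).

90% confidence interval (CI): when the 90% confidence interval of an estimate is given as [a, b], it can be read as saying we are 90% confident that the true (population) mean lies between a and b, with a 10% chance of being wrong.

2 Prediction interval

The range in which a single observation is likely to fall, given specified settings of the predictors.

A prediction interval (PI) is always wider than the corresponding confidence interval (CI), because predicting a single response involves more uncertainty than estimating the mean response.

The basic syntax is lm(y ~ x, data), where y is the response (the variable being predicted), x is the predictor (the explanatory factors, e.g. x1, x2), and data is the data set in which these two variables are kept.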
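A short sketch of both interval types, using `predict()` on an `lm` fit with its `interval` argument. The data are simulated and the three new `x` values are arbitrary choices for the example.

```r
set.seed(9)
dat <- data.frame(x = runif(60, 0, 10))
dat$y <- 1 + 0.5 * dat$x + rnorm(60)   # synthetic data

fit <- lm(y ~ x, data = dat)
new_x <- data.frame(x = c(2, 5, 8))    # arbitrary points to predict at

# Interval for the average response at each x
conf_int <- predict(fit, newdata = new_x, interval = "confidence", level = 0.95)

# Interval for a single new observation at each x
pred_int <- predict(fit, newdata = new_x, interval = "prediction", level = 0.95)

conf_width <- conf_int[, "upr"] - conf_int[, "lwr"]
pred_width <- pred_int[, "upr"] - pred_int[, "lwr"]
print(cbind(fit = conf_int[, "fit"], conf_width, pred_width))
```

Both calls return the same fitted values; only the width differs, with the prediction interval wider at every point.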

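5. Refining the model

1) How strongly each predictor X1, X2, ... affects the response Y

Which media contribute to sales?

To answer this question, we can examine the p-values associated with each predictor’s t-statistic. In the multiple linear regression displayed in Table 3.4, the p-values for TV and radio are low, but the p-value for newspaper is not. This suggests that only TV and radio are related to sales.

2) Dealing with collinearity

Multicollinearity means that the explanatory variables in a linear regression model are exactly or highly correlated with one another, which distorts the model estimates or makes them hard to estimate accurately. It typically arises when limitations of the available data (economic data, for example) lead to a poorly designed model, so that the explanatory variables in the design matrix are broadly correlated.

How can collinearity be handled?

Variance inflation factor (VIF): the reciprocal of the tolerance, where the tolerance of a predictor Xj is 1 − Rj^2 from regressing Xj on the other predictors. The larger the VIF, the more severe the collinearity. A common rule of thumb: VIF < 10 indicates no serious multicollinearity; 10 ≤ VIF < 100 indicates fairly strong multicollinearity; VIF ≥ 100 indicates severe multicollinearity.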
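The VIF can be computed with base R alone by regressing each predictor on the others. In the simulated example below, `x2` is built as a noisy copy of `x1` (an assumption made to force collinearity), so those two should show large VIFs while the unrelated `x3` stays close to 1.

```r
set.seed(21)
n <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)   # nearly a copy of x1: strongly collinear
x3 <- rnorm(n)                  # unrelated to the others
X <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing x_j on the other predictors
vif_manual <- sapply(names(X), function(v) {
  f <- reformulate(setdiff(names(X), v), response = v)
  r2 <- summary(lm(f, data = X))$r.squared
  1 / (1 - r2)
})
print(round(vif_manual, 2))
```

The `car` package provides a `vif()` function that works on a fitted model; the manual version is shown here only to keep the example free of extra dependencies.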

3) Interaction terms

An interaction term measures how one variable changes the effect that another variable has on the response.

Is there synergy among the advertising media?

Perhaps spending $50,000 on television advertising and $50,000 on radio advertising results in more sales than allocating $100,000 to either television or radio individually. In marketing, this is known as a synergy effect, while in statistics it is called an interaction effect.
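In R's formula syntax an interaction is written with `*` (main effects plus interaction) or `:` (interaction only). The sketch below simulates sales with a built-in synergy between `tv` and `radio`, which is an assumption of the example and not the Advertising data, and compares the additive model with the interaction model.

```r
set.seed(13)
n <- 200
tv <- runif(n, 0, 100)
radio <- runif(n, 0, 50)
# Synthetic sales with a built-in synergy (interaction) effect
sales <- 5 + 0.05 * tv + 0.1 * radio + 0.002 * tv * radio + rnorm(n)

fit_add <- lm(sales ~ tv + radio)
fit_int <- lm(sales ~ tv * radio)   # expands to tv + radio + tv:radio

# F-test comparing the two nested models: does the interaction term help?
comparison <- anova(fit_add, fit_int)
p_interaction <- comparison[2, "Pr(>F)"]

print(coef(fit_int))
print(p_interaction)
```

A small p-value in the comparison suggests that the effect of one medium depends on the level of the other, which is what the synergy question asks.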

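When is it appropriate to add an interaction term to the model?

4) Outlier detection

Residual plots can be used to identify outliers. Once outliers are detected, remove them from the data and refit to obtain a corrected model.
A residual is the difference between an observed value and the predicted (fitted) value, i.e. the actual observation minus the regression estimate. Residual analysis uses the information in the residuals to assess the reliability of the data, periodicity, or other disturbances. In linear regression, one important use of residuals is to flag unusual points by the size of their absolute values.

But in practice, it can be difficult to decide how large a residual needs to be before we consider the point to be an outlier. To address this problem, instead of plotting the residuals, we can plot the studentized residuals, computed by dividing each residual ei by its estimated standard error. Observations whose studentized residuals are greater than 3 in absolute value are possible outliers.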
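A minimal sketch of this check in R: `rstudent()` returns studentized residuals for an `lm` fit, and points beyond 3 in absolute value are flagged. The data are simulated and one outlier is planted on purpose at observation 25, so the example only shows the mechanics.

```r
set.seed(17)
dat <- data.frame(x = runif(50, 0, 10))
dat$y <- 2 + 1.5 * dat$x + rnorm(50)
dat$y[25] <- dat$y[25] + 15          # plant one artificial outlier

fit <- lm(y ~ x, data = dat)
stud <- rstudent(fit)                # studentized residuals
suspects <- unname(which(abs(stud) > 3))

# Refit without the flagged observations
keep <- setdiff(seq_len(nrow(dat)), suspects)
refit <- lm(y ~ x, data = dat[keep, ])

print(suspects)
print(rbind(original = coef(fit), refit = coef(refit)))
```

Whether a flagged point should really be removed is a judgment call; the threshold of 3 only marks candidates for a closer look.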

MSstats: an R package for statistical analysis of quantitative mass spectrometry-based proteomic experiments

szypanther — Thu, 08 May 2014 07:35:43 +0000

MSstats: an R package for statistical analysis of quantitative mass spectrometry-based proteomic experiments.

R PheWAS: data analysis and plotting tools for phenome-wide association studies in the R environment

szypanther — Thu, 08 May 2014 07:33:41 +0000

R PheWAS: data analysis and plotting tools for phenome-wide association studies in the R environment.

R pheatmap

szypanther — Wed, 12 Feb 2014 04:04:41 +0000

> library(caTools);
> library(bitops);
> library(grid);
> data=read.csv("/home/shenzy/Desktop/R/Bac.heatmap1.csv")
> data=read.csv("/home/shenzy/Desktop/R/Bac.heatmap1.2.csv")
> View(data)
> data=read.csv("/home/shenzy/Desktop/R/Bac.heatmap1.2.csv",sep="\t")
> View(data)
> row.names(data) <- data$X.OTU.ID;
> View(data)
> data_matrix<-data[,2:15]
> View(data_matrix)
> data_matrix<-data[,2:14]
> View(data_matrix)
> View(data)
> data_matrix<-data[,1:14]
> View(data_matrix)
> library(pheatmap)
> data_matrix[is.na(data_matrix)]<-1
> View(data_matrix)
> data_log10<-log10(data_matrix)
> View(data_log10)
> data_log2<-log2(data_matrix)
> View(data_log2)
> pheatmap(data_log2,fontsize=9, fontsize_row=6)
> pheatmap(data_log2, color = colorRampPalette(c("white", "yellow", "firebrick3"))(50), fontsize=9, fontsize_row=6)
> pheatmap(data_log2, color = colorRampPalette(c("white", "blue", "firebrick3"))(50), fontsize=9, fontsize_row=6)
> help(pheatmap)
> pheatmap(data_log2, color = colorRampPalette(c("white", "blue", "firebrick3"))(50), fontsize=9, fontsize_row=6,cluster_rows=FALSE, cluster_cols=FALSE)
> pheatmap(data_log2, color = colorRampPalette(c("white", "blue", "firebrick3"))(50), fontsize=9, fontsize_row=6)
> data2=read.csv(/home/shenzy/Desktop/R/Bac.heatmap1.3_GZ.csv)
Error: unexpected '/' in "data2=read.csv(/"
> data2=read.csv("/home/shenzy/Desktop/R/Bac.heatmap1.3_GZ.csv",sep="\t")
> View(data2)
> row.names(data2) <- data2$X.OTU.ID;
> data3=read.csv("/home/shenzy/Desktop/R/Bac.heatmap1.4_SH.csv",sep="\t")
> row.names(data3) <- data3$X.OTU.ID;
> View(data3)
> data2_matrix<-data2[,1:7]
> View(data2_matrix)
> data3_matrix<-data3[,1:7]
> View(data3_matrix)
> data2_matrix[is.na(data2_matrix)]<-1
> data3_matrix[is.na(data3_matrix)]<-1
> data2_log2<-log2(data2_matrix)
> data3_log2<-log2(data3_matrix)
> pheatmap(data2_log2, color = colorRampPalette(c("white", "blue", "firebrick3"))(50), fontsize=9, fontsize_row=6,cluster_rows=FALSE, cluster_cols=FALSE)
> pheatmap(data2_log2, color = colorRampPalette(c("white", "blue", "firebrick3"))(50), fontsize=9, fontsize_row=6)
> pheatmap(data3_log2, color = colorRampPalette(c("white", "blue", "firebrick3"))(50), fontsize=9, fontsize_row=6,cluster_rows=FALSE, cluster_cols=FALSE)
> pheatmap(data3_log2, color = colorRampPalette(c("white", "blue", "firebrick3"))(50), fontsize=9, fontsize_row=6)
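The session above boils down to three steps: read an OTU table, replace missing counts with 1 so that the log is defined, and plot the log2 values. The condensed sketch below shows those steps on a tiny made-up table (the sample names and counts are hypothetical stand-ins for the CSV files), and only draws the heatmap if the `pheatmap` package is installed.

```r
# Hypothetical OTU count table standing in for the Bac.heatmap CSV files
counts <- data.frame(
  X.OTU.ID = c("OTU_1", "OTU_2", "OTU_3"),
  S1 = c(10, NA, 250),
  S2 = c(4, 32, NA),
  S3 = c(1, 8, 64)
)
rownames(counts) <- counts$X.OTU.ID

mat <- as.matrix(counts[, -1])   # drop the ID column, keep the numeric samples
mat[is.na(mat)] <- 1             # missing counts become 1, so log2 gives 0
mat_log2 <- log2(mat)

# Draw the heatmap only when pheatmap is available
if (requireNamespace("pheatmap", quietly = TRUE)) {
  pheatmap::pheatmap(mat_log2,
                     color = colorRampPalette(c("white", "blue", "firebrick3"))(50),
                     fontsize = 9, fontsize_row = 6)
}
```

Dropping the ID column before the transform is a choice made in this sketch; the original session selects columns by position instead.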


