First, we load libraries (download and install if not available) & set the working direction.
library(tidyverse) # load library
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.0 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(pheatmap) # load another library (heatmap)
getwd() # show working directory
## [1] "C:/Users/weiga/Dropbox/Courses/bio-714-spring"
R as a calculator, press Control-Enter/Command-Enter to execute each command:
4.50 + 6.7
## [1] 11.2
59/89.6
## [1] 0.6584821
log10(2)
## [1] 0.30103
log2(4) # fold change (FC): 2-fold increase
## [1] 2
log2(0.5) # fold change (FC): 1-fold decrease
## [1] -1
1:10 # a vector
## [1] 1 2 3 4 5 6 7 8 9 10
All R commands are functions. A function consists of the function name, arguments (inputs), and outputs.
We Use the “<-” operator to save function output to a new object.
seq(from = 1, to = 10, by = 1) # same as above, with a function
## [1] 1 2 3 4 5 6 7 8 9 10
# save outputs to an object (a data container to store outputs of a function)
x <- seq(from = 1, to = 10, by = 3)
x * 2 # doulbe each value in the vector
## [1] 2 8 14 20
x ^ 2 # square each value
## [1] 1 16 49 100
# get help
?seq # opens the built-in help page
## starting httpd help server ... done
example(seq) # run the built-in example
##
## seq> seq(0, 1, length.out = 11)
## [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
##
## seq> seq(stats::rnorm(20)) # effectively 'along'
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
##
## seq> seq(1, 9, by = 2) # matches 'end'
## [1] 1 3 5 7 9
##
## seq> seq(1, 9, by = pi) # stays below 'end'
## [1] 1.000000 4.141593 7.283185
##
## seq> seq(1, 6, by = 3)
## [1] 1 4
##
## seq> seq(1.575, 5.125, by = 0.05)
## [1] 1.575 1.625 1.675 1.725 1.775 1.825 1.875 1.925 1.975 2.025 2.075 2.125
## [13] 2.175 2.225 2.275 2.325 2.375 2.425 2.475 2.525 2.575 2.625 2.675 2.725
## [25] 2.775 2.825 2.875 2.925 2.975 3.025 3.075 3.125 3.175 3.225 3.275 3.325
## [37] 3.375 3.425 3.475 3.525 3.575 3.625 3.675 3.725 3.775 3.825 3.875 3.925
## [49] 3.975 4.025 4.075 4.125 4.175 4.225 4.275 4.325 4.375 4.425 4.475 4.525
## [61] 4.575 4.625 4.675 4.725 4.775 4.825 4.875 4.925 4.975 5.025 5.075 5.125
##
## seq> seq(17) # same as 1:17, or even better seq_len(17)
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
The standard data type in R is a data frame.
Pay special attention to the data type of each column (numerical or categorical).
data(iris) # load the famous iris data set
# show summaries
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
tail(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 145 6.7 3.3 5.7 2.5 virginica
## 146 6.7 3.0 5.2 2.3 virginica
## 147 6.3 2.5 5.0 1.9 virginica
## 148 6.5 3.0 5.2 2.0 virginica
## 149 6.2 3.4 5.4 2.3 virginica
## 150 5.9 3.0 5.1 1.8 virginica
dim(iris)
## [1] 150 5
glimpse(iris)
## Rows: 150
## Columns: 5
## $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.…
## $ Sepal.Width <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.…
## $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.…
## $ Petal.Width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.…
## $ Species <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, s…
colnames(iris) # note all variables and names are CASE-SENSITIVE
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
Use select() to pick columns, filter() to pick rows, and mutate() to add columns.
We use the %>% or |> (“pipes”) to chain multiple commands.
# filter rows
x.setosa <- iris %>% # use the pipe operator "ctl-shift-M"
filter(Species == 'setosa') # choose rows; Note the double equal (means comparison)
# select columns
x.iris <- iris %>%
select(1:4) # choose columns 1-4
x.iris <- iris %>%
select(-1) # remove the species (last) column
Use group_by followed by summarise() to calculate statistics by groups:
iris %>%
group_by(Species) %>%
summarise(mean(Sepal.Length), sd(Sepal.Length))
## # A tibble: 3 × 3
## Species `mean(Sepal.Length)` `sd(Sepal.Length)`
## <fct> <dbl> <dbl>
## 1 setosa 5.01 0.352
## 2 versicolor 5.94 0.516
## 3 virginica 6.59 0.636
It is often necessary to show data distribution (e.g., to make sure they are normally distributed). We show 3 common plots using the ggplot2 visualization library.
In ggplot, we first maps variables to x, y axis using the aes() function. Then we add the geom_xxx() functions render into geometric shapes.
The highest peaks in a histogram represent the most frequently occurring value (not the highest value).
Key Question: Are data points normally distributed? If not, apply log transformation. Statistical test: QQ plots
iris %>%
ggplot(aes(x=Sepal.Length)) + # use the aes() function to map variables to axes
geom_histogram(bins=10) + # choose the graphic types
facet_wrap(~Species) # one species per panel
iris %>%
ggplot(aes(sample = Sepal.Length)) +
stat_qq() +
stat_qq_line() +
facet_wrap(~Species)
iris %>%
ggplot(aes(x=Sepal.Length)) + # use the aes() function to map variables to axes
geom_histogram(bins=10) + # choose the graphic types;
facet_wrap(~Species) + # one species per panel
scale_y_log10() # apply log10 normalization
## Warning in scale_y_log10(): log-10 transformation introduced infinite values.
## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_bar()`).
iris %>%
ggplot(aes(sample = log2(Sepal.Length))) + # log2 transformation
stat_qq() +
stat_qq_line() +
facet_wrap(~Species)
A boxplot shows the median and quartiles of a sample.
Key Question: Are the means different between the groups?
Statistical analysis: t-test (for two groups); ANOVA (for multiple groups)
iris %>%
ggplot(aes(x = Species, y = Sepal.Length)) + # categorical variable on x-axis, numerical on y-axis
geom_boxplot() # choose boxplot
iris %>%
ggplot(aes(x = Species, y = Sepal.Length)) + # categorical variable on x-axis, numerical on y-axis
geom_boxplot() + # choose boxplot
geom_jitter(shape = 1) # add data points to show sample sizes
t-test for two groups, with the null hypothesis of equal means (\(\mu_1=\mu_2\)) between the two groups:
two.species <- iris %>%
filter(Species != 'virginica')
two.species %>%
t.test(data = ., Sepal.Length ~ Species)
##
## Welch Two Sample t-test
##
## data: Sepal.Length by Species
## t = -10.521, df = 86.538, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group setosa and group versicolor is not equal to 0
## 95 percent confidence interval:
## -1.1057074 -0.7542926
## sample estimates:
## mean in group setosa mean in group versicolor
## 5.006 5.936
two.species %>%
t.test(data = ., Petal.Length ~ Species)
##
## Welch Two Sample t-test
##
## data: Petal.Length by Species
## t = -39.493, df = 62.14, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group setosa and group versicolor is not equal to 0
## 95 percent confidence interval:
## -2.939618 -2.656382
## sample estimates:
## mean in group setosa mean in group versicolor
## 1.462 4.260
ANOVA (Analysis of Variance) for multiple groups, with the null hypothesis of equal variability between and within groups (\(V_{within}=V_{between}\)):
# use the lm() function (linear model)
sepal.length.anova <- lm(data = iris, Sepal.Length ~ Species)
summary(sepal.length.anova)
##
## Call:
## lm(formula = Sepal.Length ~ Species, data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.6880 -0.3285 -0.0060 0.3120 1.3120
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.0060 0.0728 68.762 < 2e-16 ***
## Speciesversicolor 0.9300 0.1030 9.033 8.77e-16 ***
## Speciesvirginica 1.5820 0.1030 15.366 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5148 on 147 degrees of freedom
## Multiple R-squared: 0.6187, Adjusted R-squared: 0.6135
## F-statistic: 119.3 on 2 and 147 DF, p-value: < 2.2e-16
Violin plots show data density (by width). Also, the widest regions are the most frequent data points:
iris %>%
ggplot(aes(x = Species, y = Sepal.Length)) +
geom_violin() + # violin plot show side-way histograms
geom_jitter(shape = 1) # choose symbol
Key question: are the two variables correlated?
Statistical test: linear regression, with the null hypothesis of correction coefficient \(\rho=0\)
iris %>%
ggplot(aes(x = Sepal.Length, y = Sepal.Width)) +
geom_point()
# Map species (a categorical variable) to colors
iris %>%
ggplot(aes(x=Sepal.Length, y= Sepal.Width, color = Species)) +
geom_point()
# add regression lines
iris %>%
ggplot(aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
geom_point() +
geom_smooth(method = "lm") # add a linear regression line
## `geom_smooth()` using formula = 'y ~ x'
Calculate linear regression. Note the p-value and R-squared (variance explained by the regression line).
sepal.width.lm <- lm(data = iris, Sepal.Width ~ Sepal.Length)
summary(sepal.width.lm)
##
## Call:
## lm(formula = Sepal.Width ~ Sepal.Length, data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.1095 -0.2454 -0.0167 0.2763 1.3338
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.41895 0.25356 13.48 <2e-16 ***
## Sepal.Length -0.06188 0.04297 -1.44 0.152
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4343 on 148 degrees of freedom
## Multiple R-squared: 0.01382, Adjusted R-squared: 0.007159
## F-statistic: 2.074 on 1 and 148 DF, p-value: 0.1519
x.sepal <- iris %>% filter(Species == 'setosa')
sepal.lm <- lm(data = x.sepal, Sepal.Width ~ Sepal.Length)
summary(sepal.lm)
##
## Call:
## lm(formula = Sepal.Width ~ Sepal.Length, data = x.sepal)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.72394 -0.18273 -0.00306 0.15738 0.51709
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.5694 0.5217 -1.091 0.281
## Sepal.Length 0.7985 0.1040 7.681 6.71e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2565 on 48 degrees of freedom
## Multiple R-squared: 0.5514, Adjusted R-squared: 0.542
## F-statistic: 58.99 on 1 and 48 DF, p-value: 6.71e-10
A tidy table is ideal for co-visualization of multiiple variables (high-dimension) at once:
# transform to a long (tidy) table
iris.long <- iris %>%
pivot_longer(1:4, names_to = "var", values_to = "value")
head(iris.long)
## # A tibble: 6 × 3
## Species var value
## <fct> <chr> <dbl>
## 1 setosa Sepal.Length 5.1
## 2 setosa Sepal.Width 3.5
## 3 setosa Petal.Length 1.4
## 4 setosa Petal.Width 0.2
## 5 setosa Sepal.Length 4.9
## 6 setosa Sepal.Width 3
glimpse(iris.long)
## Rows: 600
## Columns: 3
## $ Species <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, setosa…
## $ var <chr> "Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", …
## $ value <dbl> 5.1, 3.5, 1.4, 0.2, 4.9, 3.0, 1.4, 0.2, 4.7, 3.2, 1.3, 0.2, 4.…
We can now visualize all four measurements together.
Question: Which measurements vary the most and the least among the species?
# panels by species
iris.long %>%
ggplot(aes(x = var, y = value)) +
geom_boxplot() +
geom_jitter(shape = 1) +
facet_wrap(~Species) +
coord_flip() +
theme_bw()
# panels by measurements
iris.long %>%
ggplot(aes(x = Species, y = value)) +
geom_boxplot() +
geom_jitter(shape = 1) +
facet_wrap(~var) +
coord_flip() +
theme_bw()
The goal is to find groups in samples (in rows) based on the four traits (in columns). The hypothesis is that the samples of the same species should be more similar to each other than to samples from a difference species.
In machine learning, this is a classification problem. The model trained on the 150 samples could be used to identify (predict) species identity of a new sample based on the four measurements.
Key questions: What do rows and columns represent? what do the cluster diagram represent?
Statistical analysis: distances represent Euclidean distances \(d=\sqrt{\sum_{k=1}^n{(x_{k,1}-x_{k,2})^2}}\), where \(x_k\) is a measurement (e.g., sepal length). Or correlation distances \(d=1-cor(s_1, s_2)\) between two samples (in rows) or two species (in columns).
pheatmap(iris[,1:4])
x.iris <- iris %>% select(-5) # remove the species column (retain only numerical variables)
rownames(x.iris) <- paste("sample", 1:150)
ann_row <- data.frame(species = factor(iris$Species)) # a data frame containing species for each sample
rownames(ann_row) <- paste("sample", 1:150)
pheatmap(x.iris) # basic heatmap: rows are samples, cols are variables, colors are values, scale to normalize among variables (mean =0); row-side dendrogram groups samples with similar values
# annotated heatmap: scale by each measurement: z-score, standard deviates (mean = 0, in the units of sd; -3 ~ 3 for a normally distributed variable)
pheatmap(x.iris,
scale = "column",
show_rownames = F,
annotation_row = ann_row) # add rowside color to show species
pheatmap(x.iris,
scale = "column",
show_rownames = F,
annotation_row = ann_row,
cellwidth = 10)
PCA projects this multi-dimensional data set (e.g., four measurements in the iris data set) into a 2D space. Alternative dimensional reduction techniques include MDS (multi-dimensional scaling), UMAP (non-linear combination), t-sne (for binary measurements)
A principal component is a linear combination of original variables
pca.iris <- prcomp(x.iris, scale. = T) # run PCA with scaling to z-scores
plot(pca.iris) # show variance explained by each principal component
biplot(pca.iris) # show sample groups
# Make a better plot
pca.coord <- as.data.frame(pca.iris[[5]]) # get coordinates from the PCA output for PC1 & PC2
df.pca <- tibble(pca.coord[1:2]) %>%
mutate(species = iris$Species) # extrac PC1 and PC2, add a species column
# plot 1st & 2nd principal components
df.pca %>%
ggplot(aes(x = PC1, y = PC2, color = species)) +
geom_point(size = 3, alpha = 0.5) +
theme_bw() +
theme(legend.position = "bottom") +
stat_ellipse() # shows clear separation of setosa samples from the other two species, and partial overlap between the vesicolor and virginica samples
We will replicate results from a study of transcriptome changes during the growth of Lyme disease pathogen (Borrelia burgdorferi).
The key research question is to identify genes up- and down-regulated during early- mid- and late-phases of growth.
The original paper is available at rom Arnold et al (2016), PLoS One.
We will use the Supplemental data set S2 to replicate some of the results.
Questions: How many growth phases? How many genes? How many replicons? How many replicates per growth phase?
# load libraries
library(openxlsx2)
library(broom)
# run the following two line once
# install.packages("BiocManager")
# BiocManager::install("EnhancedVolcano")
library(EnhancedVolcano)
## Loading required package: ggrepel
# read data from web
bb <- read_xlsx("https://doi.org/10.1371/journal.pone.0164165.s003", start_row = 2, skip_empty_cols = T, skip_empty_rows = T)
# rename column names
colnames(bb) <- c("replicon", "gene",
"early1", "early2", "early3",
"mid1", "mid2", "mid3",
"late1", "late2", "late3"
)
# Add locus and annotation columns
bb <- bb %>%
mutate(locus = str_remove(gene, "BB_") %>% str_remove("_.+")) %>%
mutate(anno = str_remove(gene, "BB_.*\\d+"))
# show data
glimpse(bb)
## Rows: 1,386
## Columns: 13
## $ replicon <chr> "Chromosome", "Chromosome", "Chromosome", "Chromosome", "Chro…
## $ gene <chr> "BB_0001_-_hypothetical_protein", "BB_0002_-_hypothetical_pro…
## $ early1 <dbl> 32.3891, 320.0120, 101.6160, 395.3540, 1689.9300, 644.4490, 1…
## $ early2 <dbl> 42.8668, 360.1180, 125.9710, 356.3700, 1630.8100, 551.4670, 1…
## $ early3 <dbl> 20.5760, 323.6450, 106.9470, 409.2720, 1651.6900, 645.2620, 1…
## $ mid1 <dbl> 12.1266, 356.2670, 122.3380, 298.4960, 2033.7500, 670.1690, 1…
## $ mid2 <dbl> 35.1927, 392.0060, 111.8710, 272.2370, 1763.4300, 679.7530, 1…
## $ mid3 <dbl> 40.0169, 336.3710, 127.3200, 313.9650, 1270.3700, 745.0010, 2…
## $ late1 <dbl> 65.8883, 450.0690, 104.5620, 486.9660, 2687.3200, 1635.8200, …
## $ late2 <dbl> 11.5959, 351.4560, 46.3174, 562.1400, 3081.1900, 1886.6500, 4…
## $ late3 <dbl> 22.7004, 390.3370, 80.5042, 333.1020, 1538.0000, 781.5110, 15…
## $ locus <chr> "0001", "0002", "0004", "0005", "0006", "0007", "0008", "0009…
## $ anno <chr> "_-_hypothetical_protein", "_-_hypothetical_protein", "_-_pho…
Note that log2+1 transformation is a technique to
# make a long table:
bb.long <- bb %>%
pivot_longer(3:11, names_to = "sample", values_to = "rna.counts")
# visualize distribution, check data normality
bb.long %>% ggplot(aes(sample = rna.counts)) +
stat_qq() +
stat_qq_line() +
facet_wrap(~sample)
# apply log2+1 transformation
bb.long <- bb.long %>%
mutate(log2.counts = log2(rna.counts+1))
# check normality
bb.long %>% ggplot(aes(sample = log2.counts)) +
stat_qq() +
stat_qq_line() +
facet_wrap(~sample)
We will run t-tests for each gene. This allows us to rank genes by p-values, so we could select the most significant genes that vary between the two phases.
# add a group factor
bb.long <- bb.long %>%
mutate(growth.phase = str_remove(sample, "\\d$"))
# remove mid-phase
bb.early.late <- bb.long %>%
filter(growth.phase != 'mid')
# t-test by gene, estimate is the fold change!
t.out <- bb.early.late %>%
group_by(locus) %>%
do(tidy(t.test(data = ., log2.counts ~ growth.phase)))
# show head
head(t.out)
## # A tibble: 6 × 11
## # Groups: locus [6]
## locus estimate estimate1 estimate2 statistic p.value parameter conf.low
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0001 0.221 4.98 4.76 0.289 0.793 2.70 -2.37
## 2 0002 -0.242 8.39 8.63 -2.08 0.129 3.02 -0.611
## 3 0004 0.596 6.81 6.21 1.69 0.218 2.29 -0.753
## 4 0005 -0.220 8.60 8.82 -0.946 0.433 2.28 -1.11
## 5 0006 -0.495 10.7 11.2 -1.61 0.248 2.01 -1.81
## 6 0007 -1.13 9.26 10.4 -2.82 0.0980 2.15 -2.75
## # ℹ 3 more variables: conf.high <dbl>, method <chr>, alternative <chr>
Explain:
EnhancedVolcano(toptable = t.out, lab = t.out$locus, x = "estimate", y = "p.value",
ylim = c(0, 4),
pCutoff = 0.05,
title = "Differentially expressed genes during Borrelia burgorferi growth",
subtitle = "Early vs Late growth",
caption = "Data source: Arnold et al (2016). PLoS One. DOI: 10.1371/journal.pone.0164165")
We verify the most significant genes by a boxplot:
# choose only significant genes:
sig.genes <- t.out %>%
filter(p.value < 0.01) %>%
pull(locus)
# filter data frame
bb.sig <- bb.long %>%
filter(locus %in% sig.genes)
# make a boxplot by genes
bb.sig %>%
ggplot(aes(x = growth.phase, y = log2.counts)) +
geom_boxplot() +
geom_jitter(color = "red") +
facet_wrap(~locus, scales = "free") +
theme_bw()
We conclude with a heatmap showing the topmost significant differentially expressed genes:
# choose only significant genes:
sig.genes <- t.out %>%
filter(p.value < 0.05) %>%
pull(locus)
bb.sig <- bb %>%
filter(locus %in% sig.genes)
# no scaling:
pheatmap(log2(bb.sig[,3:11]+1), labels_row = bb.sig$locus)
# scale by sample:
pheatmap(log2(bb.sig[,3:11]+1), scale = "row", labels_row = bb.sig$locus)
# scale by gene:
pheatmap(log2(bb.sig[,3:11]+1), scale = "row", labels_row = bb.sig$locus)
# scale by gene (most informative)
pheatmap(log2(bb.sig[,3:11]+1), scale = "row", labels_row = bb.sig$locus, cellheight = 10)
Questions: