Part 1. Getting started
Part 2: Data manipulation
Part 3. Data visualization & statistical analysis
Part 4. Advanced topic: cluster analysis of multi-dimensional data
- Method 1. hierarchical clustering with a heatmap
- Method 2. principal component analysis (PCA)
Part 5. Case study: Differential gene expressions in Lyme diease bacteria

Part 1. Getting started

First, we load libraries (download and install if not available) & set the working direction.

library(tidyverse) # load library

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(pheatmap) # load another library (heatmap)
getwd() # show working directory

## [1] "C:/Users/weiga/Dropbox/Courses/bio-714-spring"

R as a calculator, press Control-Enter/Command-Enter to execute each command:

4.50 + 6.7

## [1] 11.2

59/89.6

## [1] 0.6584821

log10(2)

## [1] 0.30103

log2(4) # fold change (FC): 2-fold increase

## [1] 2

log2(0.5) # fold change (FC): 1-fold decrease

## [1] -1

1:10 # a vector

##  [1]  1  2  3  4  5  6  7  8  9 10

All R commands are functions. A function consists of the function name, arguments (inputs), and outputs.

We Use the “<-” operator to save function output to a new object.

seq(from = 1, to = 10, by = 1) # same as above, with a function

##  [1]  1  2  3  4  5  6  7  8  9 10

# save outputs to an object (a data container to store outputs of a function)
x <- seq(from = 1, to = 10, by = 3)
x * 2 # doulbe each value in the vector

## [1]  2  8 14 20

x ^ 2 # square each value

## [1]   1  16  49 100

# get help
?seq # opens the built-in help page

## starting httpd help server ... done

example(seq) # run the built-in example

## 
## seq> seq(0, 1, length.out = 11)
##  [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
## 
## seq> seq(stats::rnorm(20)) # effectively 'along'
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
## 
## seq> seq(1, 9, by = 2)     # matches 'end'
## [1] 1 3 5 7 9
## 
## seq> seq(1, 9, by = pi)    # stays below 'end'
## [1] 1.000000 4.141593 7.283185
## 
## seq> seq(1, 6, by = 3)
## [1] 1 4
## 
## seq> seq(1.575, 5.125, by = 0.05)
##  [1] 1.575 1.625 1.675 1.725 1.775 1.825 1.875 1.925 1.975 2.025 2.075 2.125
## [13] 2.175 2.225 2.275 2.325 2.375 2.425 2.475 2.525 2.575 2.625 2.675 2.725
## [25] 2.775 2.825 2.875 2.925 2.975 3.025 3.075 3.125 3.175 3.225 3.275 3.325
## [37] 3.375 3.425 3.475 3.525 3.575 3.625 3.675 3.725 3.775 3.825 3.875 3.925
## [49] 3.975 4.025 4.075 4.125 4.175 4.225 4.275 4.325 4.375 4.425 4.475 4.525
## [61] 4.575 4.625 4.675 4.725 4.775 4.825 4.875 4.925 4.975 5.025 5.075 5.125
## 
## seq> seq(17) # same as 1:17, or even better seq_len(17)
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17

Part 2: Data manipulation

The standard data type in R is a data frame.

Pay special attention to the data type of each column (numerical or categorical).

data(iris) # load the famous iris data set

# show summaries
head(iris)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

tail(iris)

##     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
## 145          6.7         3.3          5.7         2.5 virginica
## 146          6.7         3.0          5.2         2.3 virginica
## 147          6.3         2.5          5.0         1.9 virginica
## 148          6.5         3.0          5.2         2.0 virginica
## 149          6.2         3.4          5.4         2.3 virginica
## 150          5.9         3.0          5.1         1.8 virginica

dim(iris)

## [1] 150   5

glimpse(iris)

## Rows: 150
## Columns: 5
## $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.…
## $ Sepal.Width  <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.…
## $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.…
## $ Petal.Width  <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.…
## $ Species      <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, s…

colnames(iris) # note all variables and names are CASE-SENSITIVE

## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"

Use select() to pick columns, filter() to pick rows, and mutate() to add columns.

We use the %>% or |> (“pipes”) to chain multiple commands.

# filter rows
x.setosa <- iris %>% # use the pipe operator "ctl-shift-M" 
  filter(Species == 'setosa') # choose rows; Note the double equal (means comparison)

# select columns
x.iris <- iris %>% 
  select(1:4)  # choose columns 1-4

x.iris <- iris %>% 
    select(-1) # remove the species (last) column

Use group_by followed by summarise() to calculate statistics by groups:

iris %>% 
  group_by(Species) %>% 
  summarise(mean(Sepal.Length), sd(Sepal.Length))

## # A tibble: 3 × 3
##   Species    `mean(Sepal.Length)` `sd(Sepal.Length)`
##   <fct>                     <dbl>              <dbl>
## 1 setosa                     5.01              0.352
## 2 versicolor                 5.94              0.516
## 3 virginica                  6.59              0.636

Part 3. Data visualization & statistical analysis

It is often necessary to show data distribution (e.g., to make sure they are normally distributed). We show 3 common plots using the ggplot2 visualization library.

In ggplot, we first maps variables to x, y axis using the aes() function. Then we add the geom_xxx() functions render into geometric shapes.

Plot histogram to show distribution of a numerical variable

The highest peaks in a histogram represent the most frequently occurring value (not the highest value).

Key Question: Are data points normally distributed? If not, apply log transformation. Statistical test: QQ plots

iris %>% 
  ggplot(aes(x=Sepal.Length)) + # use the aes() function to map variables to axes
  geom_histogram(bins=10) + # choose the graphic types
  facet_wrap(~Species)  # one species per panel

iris %>% 
  ggplot(aes(sample = Sepal.Length)) +
  stat_qq() + 
  stat_qq_line() +
  facet_wrap(~Species)

iris %>% 
  ggplot(aes(x=Sepal.Length)) + # use the aes() function to map variables to axes
  geom_histogram(bins=10) + # choose the graphic types; 
  facet_wrap(~Species) + # one species per panel 
  scale_y_log10() # apply log10 normalization

## Warning in scale_y_log10(): log-10 transformation introduced infinite values.

## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_bar()`).

iris %>% 
  ggplot(aes(sample = log2(Sepal.Length))) + # log2 transformation
  stat_qq() + 
  stat_qq_line() +
  facet_wrap(~Species)

Use boxplot to show distribution of a numerical variable, separated by a categorical variable

A boxplot shows the median and quartiles of a sample.

Key Question: Are the means different between the groups?

Statistical analysis: t-test (for two groups); ANOVA (for multiple groups)

iris %>% 
  ggplot(aes(x = Species, y = Sepal.Length)) + # categorical variable on x-axis, numerical on y-axis
  geom_boxplot()  # choose boxplot

iris %>% 
  ggplot(aes(x = Species, y = Sepal.Length)) + # categorical variable on x-axis, numerical on y-axis
  geom_boxplot() + # choose boxplot
  geom_jitter(shape = 1) # add data points to show sample sizes

t-test for two groups, with the null hypothesis of equal means (\(\mu_1=\mu_2\)) between the two groups:

two.species <- iris %>%  
  filter(Species != 'virginica')  

two.species %>% 
  t.test(data = ., Sepal.Length ~ Species)

## 
##  Welch Two Sample t-test
## 
## data:  Sepal.Length by Species
## t = -10.521, df = 86.538, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group setosa and group versicolor is not equal to 0
## 95 percent confidence interval:
##  -1.1057074 -0.7542926
## sample estimates:
##     mean in group setosa mean in group versicolor 
##                    5.006                    5.936

two.species %>% 
  t.test(data = ., Petal.Length ~ Species)

## 
##  Welch Two Sample t-test
## 
## data:  Petal.Length by Species
## t = -39.493, df = 62.14, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group setosa and group versicolor is not equal to 0
## 95 percent confidence interval:
##  -2.939618 -2.656382
## sample estimates:
##     mean in group setosa mean in group versicolor 
##                    1.462                    4.260

ANOVA (Analysis of Variance) for multiple groups, with the null hypothesis of equal variability between and within groups (\(V_{within}=V_{between}\)):

# use the lm() function (linear model)
sepal.length.anova <- lm(data = iris, Sepal.Length ~ Species)
summary(sepal.length.anova)

## 
## Call:
## lm(formula = Sepal.Length ~ Species, data = iris)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6880 -0.3285 -0.0060  0.3120  1.3120 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         5.0060     0.0728  68.762  < 2e-16 ***
## Speciesversicolor   0.9300     0.1030   9.033 8.77e-16 ***
## Speciesvirginica    1.5820     0.1030  15.366  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5148 on 147 degrees of freedom
## Multiple R-squared:  0.6187, Adjusted R-squared:  0.6135 
## F-statistic: 119.3 on 2 and 147 DF,  p-value: < 2.2e-16

Use violin plot

Violin plots show data density (by width). Also, the widest regions are the most frequent data points:

iris %>% 
  ggplot(aes(x = Species, y = Sepal.Length)) +
  geom_violin() + # violin plot show side-way histograms
  geom_jitter(shape = 1) # choose symbol

Use scatterplot to show relation between two numerical variables

Key question: are the two variables correlated?

Statistical test: linear regression, with the null hypothesis of correction coefficient \(\rho=0\)

iris %>% 
  ggplot(aes(x = Sepal.Length, y = Sepal.Width)) + 
  geom_point()

# Map species (a categorical variable) to colors
iris %>% 
  ggplot(aes(x=Sepal.Length, y= Sepal.Width, color = Species)) + 
  geom_point()

# add regression lines
iris %>% 
  ggplot(aes(x = Sepal.Length, y = Sepal.Width, color = Species)) + 
  geom_point() + 
  geom_smooth(method = "lm") # add a linear regression line

## `geom_smooth()` using formula = 'y ~ x'

Calculate linear regression. Note the p-value and R-squared (variance explained by the regression line).

sepal.width.lm <- lm(data = iris, Sepal.Width ~ Sepal.Length)
summary(sepal.width.lm)

## 
## Call:
## lm(formula = Sepal.Width ~ Sepal.Length, data = iris)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.1095 -0.2454 -0.0167  0.2763  1.3338 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.41895    0.25356   13.48   <2e-16 ***
## Sepal.Length -0.06188    0.04297   -1.44    0.152    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4343 on 148 degrees of freedom
## Multiple R-squared:  0.01382,    Adjusted R-squared:  0.007159 
## F-statistic: 2.074 on 1 and 148 DF,  p-value: 0.1519

x.sepal <- iris %>% filter(Species == 'setosa') 
sepal.lm <- lm(data = x.sepal, Sepal.Width ~ Sepal.Length)
summary(sepal.lm)

## 
## Call:
## lm(formula = Sepal.Width ~ Sepal.Length, data = x.sepal)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.72394 -0.18273 -0.00306  0.15738  0.51709 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -0.5694     0.5217  -1.091    0.281    
## Sepal.Length   0.7985     0.1040   7.681 6.71e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2565 on 48 degrees of freedom
## Multiple R-squared:  0.5514, Adjusted R-squared:  0.542 
## F-statistic: 58.99 on 1 and 48 DF,  p-value: 6.71e-10

Advanced technique: Comprehensive visualization with a long table

A tidy table is ideal for co-visualization of multiiple variables (high-dimension) at once:

# transform to a long (tidy) table
iris.long <- iris %>% 
  pivot_longer(1:4, names_to = "var", values_to = "value")

head(iris.long)

## # A tibble: 6 × 3
##   Species var          value
##   <fct>   <chr>        <dbl>
## 1 setosa  Sepal.Length   5.1
## 2 setosa  Sepal.Width    3.5
## 3 setosa  Petal.Length   1.4
## 4 setosa  Petal.Width    0.2
## 5 setosa  Sepal.Length   4.9
## 6 setosa  Sepal.Width    3

glimpse(iris.long)

## Rows: 600
## Columns: 3
## $ Species <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, setosa…
## $ var     <chr> "Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", …
## $ value   <dbl> 5.1, 3.5, 1.4, 0.2, 4.9, 3.0, 1.4, 0.2, 4.7, 3.2, 1.3, 0.2, 4.…

We can now visualize all four measurements together.

Question: Which measurements vary the most and the least among the species?

# panels by species
iris.long %>% 
  ggplot(aes(x = var, y = value)) +
  geom_boxplot() +
  geom_jitter(shape = 1) +
  facet_wrap(~Species) +
  coord_flip() +
  theme_bw()

# panels by measurements
iris.long %>% 
  ggplot(aes(x = Species, y = value)) +
  geom_boxplot() +
  geom_jitter(shape = 1) +
  facet_wrap(~var) +
  coord_flip() +
  theme_bw()

Part 4. Advanced topic: cluster analysis of multi-dimensional data

The goal is to find groups in samples (in rows) based on the four traits (in columns). The hypothesis is that the samples of the same species should be more similar to each other than to samples from a difference species.

In machine learning, this is a classification problem. The model trained on the 150 samples could be used to identify (predict) species identity of a new sample based on the four measurements.

Method 1. hierarchical clustering with a heatmap

Key questions: What do rows and columns represent? what do the cluster diagram represent?

Statistical analysis: distances represent Euclidean distances \(d=\sqrt{\sum_{k=1}^n{(x_{k,1}-x_{k,2})^2}}\), where \(x_k\) is a measurement (e.g., sepal length). Or correlation distances \(d=1-cor(s_1, s_2)\) between two samples (in rows) or two species (in columns).

pheatmap(iris[,1:4])

x.iris <- iris %>% select(-5) # remove the species column (retain only numerical variables)
rownames(x.iris) <- paste("sample", 1:150)

ann_row <- data.frame(species = factor(iris$Species)) # a data frame containing species for each sample
rownames(ann_row) <- paste("sample", 1:150)

pheatmap(x.iris) # basic heatmap: rows are samples, cols are variables, colors are values, scale to normalize among variables (mean =0); row-side dendrogram groups samples with similar values

# annotated heatmap: scale by each measurement: z-score, standard deviates (mean = 0, in the units of sd; -3 ~ 3 for a normally distributed variable)
pheatmap(x.iris, 
         scale = "column", 
         show_rownames = F, 
         annotation_row = ann_row) # add rowside color to show species

pheatmap(x.iris, 
         scale = "column", 
         show_rownames = F, 
         annotation_row = ann_row, 
         cellwidth = 10)

Method 2. principal component analysis (PCA)

PCA projects this multi-dimensional data set (e.g., four measurements in the iris data set) into a 2D space. Alternative dimensional reduction techniques include MDS (multi-dimensional scaling), UMAP (non-linear combination), t-sne (for binary measurements)

A principal component is a linear combination of original variables

pca.iris <- prcomp(x.iris, scale. = T) # run PCA with scaling to z-scores
plot(pca.iris) # show variance explained by each principal component

biplot(pca.iris) # show sample groups

# Make a better plot
pca.coord <- as.data.frame(pca.iris[[5]]) # get coordinates from the PCA output for PC1 & PC2

df.pca <- tibble(pca.coord[1:2]) %>% 
  mutate(species = iris$Species) # extrac PC1 and PC2, add a species column

# plot 1st & 2nd principal components
df.pca %>% 
  ggplot(aes(x = PC1, y = PC2, color = species)) + 
  geom_point(size = 3, alpha = 0.5) + 
  theme_bw() + 
  theme(legend.position = "bottom") + 
  stat_ellipse() # shows clear separation of setosa samples from the other two species, and partial overlap between the vesicolor and virginica samples

Part 5. Case study: Differential gene expressions in Lyme diease bacteria

We will replicate results from a study of transcriptome changes during the growth of Lyme disease pathogen (Borrelia burgdorferi).

The key research question is to identify genes up- and down-regulated during early- mid- and late-phases of growth.

The original paper is available at rom Arnold et al (2016), PLoS One.

We will use the Supplemental data set S2 to replicate some of the results.

Load libraries and pull date from the journal website

Questions: How many growth phases? How many genes? How many replicons? How many replicates per growth phase?

# load libraries
library(openxlsx2)
library(broom)

# run the following two line once
# install.packages("BiocManager")
# BiocManager::install("EnhancedVolcano")

library(EnhancedVolcano)

## Loading required package: ggrepel

# read data from web
bb <- read_xlsx("https://doi.org/10.1371/journal.pone.0164165.s003", start_row = 2, skip_empty_cols = T, skip_empty_rows = T)

# rename column names
colnames(bb) <- c("replicon", "gene", 
                  "early1", "early2", "early3",
                  "mid1", "mid2", "mid3",
                  "late1", "late2", "late3"
                  )

# Add locus and annotation columns
bb <- bb %>% 
  mutate(locus = str_remove(gene, "BB_") %>% str_remove("_.+")) %>% 
  mutate(anno = str_remove(gene, "BB_.*\\d+"))

# show data
glimpse(bb)

## Rows: 1,386
## Columns: 13
## $ replicon <chr> "Chromosome", "Chromosome", "Chromosome", "Chromosome", "Chro…
## $ gene     <chr> "BB_0001_-_hypothetical_protein", "BB_0002_-_hypothetical_pro…
## $ early1   <dbl> 32.3891, 320.0120, 101.6160, 395.3540, 1689.9300, 644.4490, 1…
## $ early2   <dbl> 42.8668, 360.1180, 125.9710, 356.3700, 1630.8100, 551.4670, 1…
## $ early3   <dbl> 20.5760, 323.6450, 106.9470, 409.2720, 1651.6900, 645.2620, 1…
## $ mid1     <dbl> 12.1266, 356.2670, 122.3380, 298.4960, 2033.7500, 670.1690, 1…
## $ mid2     <dbl> 35.1927, 392.0060, 111.8710, 272.2370, 1763.4300, 679.7530, 1…
## $ mid3     <dbl> 40.0169, 336.3710, 127.3200, 313.9650, 1270.3700, 745.0010, 2…
## $ late1    <dbl> 65.8883, 450.0690, 104.5620, 486.9660, 2687.3200, 1635.8200, …
## $ late2    <dbl> 11.5959, 351.4560, 46.3174, 562.1400, 3081.1900, 1886.6500, 4…
## $ late3    <dbl> 22.7004, 390.3370, 80.5042, 333.1020, 1538.0000, 781.5110, 15…
## $ locus    <chr> "0001", "0002", "0004", "0005", "0006", "0007", "0008", "0009…
## $ anno     <chr> "_-_hypothetical_protein", "_-_hypothetical_protein", "_-_pho…

Transform data to be normally distributed

Note that log2+1 transformation is a technique to

make data normally distributed
avoid zero counts, and
set any differences to be fold changes: \(fc=log_2(\frac{x_1}{x_2})\).

# make a long table:
bb.long <- bb %>% 
  pivot_longer(3:11, names_to = "sample", values_to = "rna.counts")

# visualize distribution, check data normality
bb.long %>% ggplot(aes(sample = rna.counts)) +
  stat_qq() +
  stat_qq_line() +
  facet_wrap(~sample)

# apply log2+1 transformation
bb.long <- bb.long %>% 
  mutate(log2.counts = log2(rna.counts+1))

# check normality
bb.long %>% ggplot(aes(sample = log2.counts)) +
  stat_qq() +
  stat_qq_line() +
  facet_wrap(~sample)

Identify genes differentially expressed between early and late growth phases

We will run t-tests for each gene. This allows us to rank genes by p-values, so we could select the most significant genes that vary between the two phases.

# add a group factor
bb.long <- bb.long %>% 
  mutate(growth.phase = str_remove(sample, "\\d$"))

# remove mid-phase
bb.early.late <- bb.long %>% 
  filter(growth.phase != 'mid')

# t-test by gene, estimate is the fold change!
t.out <- bb.early.late %>% 
  group_by(locus) %>% 
  do(tidy(t.test(data = ., log2.counts ~ growth.phase)))

# show head
head(t.out)

## # A tibble: 6 × 11
## # Groups:   locus [6]
##   locus estimate estimate1 estimate2 statistic p.value parameter conf.low
##   <chr>    <dbl>     <dbl>     <dbl>     <dbl>   <dbl>     <dbl>    <dbl>
## 1 0001     0.221      4.98      4.76     0.289  0.793       2.70   -2.37 
## 2 0002    -0.242      8.39      8.63    -2.08   0.129       3.02   -0.611
## 3 0004     0.596      6.81      6.21     1.69   0.218       2.29   -0.753
## 4 0005    -0.220      8.60      8.82    -0.946  0.433       2.28   -1.11 
## 5 0006    -0.495     10.7      11.2     -1.61   0.248       2.01   -1.81 
## 6 0007    -1.13       9.26     10.4     -2.82   0.0980      2.15   -2.75 
## # ℹ 3 more variables: conf.high <dbl>, method <chr>, alternative <chr>

We will visualize the t-test results by a volcano plot:

Explain:

fold change & p-value
What does each point represent?
Which genes are down-regulated in late phase?
Which genes are up-regulated in late phase?

EnhancedVolcano(toptable = t.out, lab = t.out$locus, x = "estimate", y = "p.value",
                ylim = c(0, 4),
                pCutoff = 0.05,
                title = "Differentially expressed genes during Borrelia burgorferi growth",
                subtitle = "Early vs Late growth",
                caption = "Data source: Arnold et al (2016). PLoS One. DOI: 10.1371/journal.pone.0164165")

We verify the most significant genes by a boxplot:

# choose only significant genes:
sig.genes <- t.out %>% 
  filter(p.value < 0.01) %>% 
  pull(locus)

# filter data frame
bb.sig <- bb.long %>% 
  filter(locus %in% sig.genes)

# make a boxplot by genes
bb.sig %>% 
  ggplot(aes(x = growth.phase, y = log2.counts)) +
  geom_boxplot() +
  geom_jitter(color = "red") +
  facet_wrap(~locus, scales = "free") +
  theme_bw()

We conclude with a heatmap showing the topmost significant differentially expressed genes:

# choose only significant genes:
sig.genes <- t.out %>% 
  filter(p.value < 0.05) %>% 
  pull(locus)

bb.sig <- bb %>% 
  filter(locus %in% sig.genes)

# no scaling:
pheatmap(log2(bb.sig[,3:11]+1), labels_row = bb.sig$locus)

# scale by sample:
pheatmap(log2(bb.sig[,3:11]+1), scale = "row", labels_row = bb.sig$locus)

# scale by gene:
pheatmap(log2(bb.sig[,3:11]+1), scale = "row", labels_row = bb.sig$locus)

# scale by gene (most informative)
pheatmap(log2(bb.sig[,3:11]+1), scale = "row", labels_row = bb.sig$locus, cellheight = 10)

Questions:

Do replicates group by phases?
Identify a gene that is up-regulated in late phase relative to the early phase
Identify a gene that is down-regulated in late phase relative to the early phase

BIOL714. An R/RStudio Tutorial

Weigang Qiu

Spring 2024