We will investigate whether the species can be classified based on four flower measurements and, if so, estimate the mis-classification error rates. As in any other statistical analysis, the errors include false positives and false negatives.
In machine-learning lingo, this is a classification problem, a form of supervised learning. The idea is to train a model on a set of training data (with known species identity). The model is then used to predict the species identity of a new sample (test data).
We will first load libraries and data, transform the data, and visualize data distributions.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.1 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(pheatmap)
data(iris)
head(iris)
glimpse(iris)
## Rows: 150
## Columns: 5
## $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.…
## $ Sepal.Width <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.…
## $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.…
## $ Petal.Width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.…
## $ Species <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, s…
# transform to a long (tidy) table
iris.long <- iris %>%
pivot_longer(1:4, names_to = "var", values_to = "value")
head(iris.long)
glimpse(iris.long)
## Rows: 600
## Columns: 3
## $ Species <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, setosa…
## $ var <chr> "Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", …
## $ value <dbl> 5.1, 3.5, 1.4, 0.2, 4.9, 3.0, 1.4, 0.2, 4.7, 3.2, 1.3, 0.2, 4.…
Visualize data distributions by species:
iris.long %>%
ggplot(aes(x = var, y = value)) +
geom_boxplot() +
geom_jitter(shape = 1) +
facet_wrap(~Species) +
coord_flip() +
theme_bw()
Visualize data distributions by measurements:
iris.long %>%
ggplot(aes(x = Species, y = value)) +
geom_boxplot() +
geom_jitter(shape = 1) +
facet_wrap(~var) +
coord_flip() +
theme_bw()
The goal is to find groups among the samples (rows) based on the four traits (columns). The hypothesis is that samples of the same species should be more similar to each other than to samples from a different species.
In machine learning, this is a classification problem: a model trained on the 150 samples could be used to identify (predict) the species of a new sample from its four measurements.
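To make the train/predict idea concrete, here is a minimal sketch (not part of the original analysis) using linear discriminant analysis via `MASS::lda()`, with a random 100/50 train/test split. The seed, split size, and choice of LDA are illustrative assumptions; any classifier would fit the same pattern.

```r
set.seed(42)                                   # illustrative seed for a reproducible split
train.idx <- sample(1:150, 100)                # 100 training samples, 50 held out for testing

# train on the training rows only (MASS:: avoids masking dplyr::select)
fit <- MASS::lda(Species ~ ., data = iris[train.idx, ])

# predict species for the held-out test rows
pred <- predict(fit, iris[-train.idx, ])$class

# confusion matrix: rows = true species, columns = predicted species
table(truth = iris$Species[-train.idx], predicted = pred)

# overall mis-classification rate on the test set
mean(pred != iris$Species[-train.idx])
```

The off-diagonal cells of the confusion matrix are exactly the false positives and false negatives mentioned at the start.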
# remove the species column (retain only numerical variables)
x.iris <- iris %>% select(-5)
pheatmap(x.iris, show_rownames = F)
We will improve the plot by (1) scaling the data by column, so that the measurements are normalized to \(z\)-scores and comparable across traits, and (2) adding a row-side color bar to annotate species.
The \(z\)-scores have a mean of \(\mu=0\) and a standard deviation of \(\sigma=1\), i.e., values are expressed in units of standard deviation.
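The effect of column scaling can be checked directly with base R's `scale()` (the same standardization `pheatmap(scale = "column")` applies internally); this small check is illustrative and not part of the pipeline:

```r
# scale() centers each column (mean 0) and divides by its standard deviation (sd 1)
z.iris <- scale(iris[, 1:4])

round(colMeans(z.iris), 10)   # column means: zero up to floating-point noise
apply(z.iris, 2, sd)          # column standard deviations: all 1
```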
# a data frame containing species for each sample
ann_row <- data.frame(species = factor(iris$Species))
# sync row names between the data matrix and the annotation
rownames(ann_row) <- paste("sample", 1:150)
rownames(x.iris) <- paste("sample", 1:150)
pheatmap(x.iris, scale = "column", show_rownames = F,
cellwidth = 20,
annotation_row = ann_row)
Principal component analysis (PCA) projects this 4-dimensional data set (four measurements) into a 2D space. Alternative dimensionality-reduction methods include MDS (multi-dimensional scaling, which works from pairwise distances) and the non-linear methods UMAP and t-SNE.
A principal component (PC) is a linear combination of the original variables.
# run PCA with scaling to z-scores
pca.iris <- prcomp(x.iris, scale. = T)
# show variance explained by each principal component
plot(pca.iris)
biplot(pca.iris)
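The "linear combination" claim can be verified by hand: each PC score is the \(z\)-scored data multiplied by the corresponding column of the rotation (loading) matrix. This check is illustrative; the `prcomp()` call is repeated so the chunk runs on its own.

```r
# same call as above, repeated so this chunk is self-contained
pca.iris <- prcomp(iris[, 1:4], scale. = TRUE)

# PC1 scores = z-scored data %*% first column of the rotation matrix
z <- scale(iris[, 1:4])
manual.pc1 <- as.numeric(z %*% pca.iris$rotation[, 1])

all.equal(manual.pc1, as.numeric(pca.iris$x[, 1]))   # TRUE
```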
Make a better plot by adding ellipses:
# get coordinates from the PCA output for PC1 & PC2
pca.coord <- as.data.frame(pca.iris[[5]])
# extrac PC1 and PC2, add a species column
df.pca <- tibble(pca.coord[1:2]) %>%
mutate(species = iris$Species)
# plot 1st & 2nd principal components
df.pca %>%
ggplot(aes(x = PC1, y = PC2, color = species)) +
geom_point(size = 3, alpha = 0.5) + # increase point size & make transparent
theme_bw() +
theme(legend.position = "bottom") +
stat_ellipse() # add a 95% confidence ellipse per species
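The PCA plot suggests the species form groups, which can be tested without using the labels at all. As a minimal unsupervised sketch (an extension beyond the original analysis), we can run k-means on the \(z\)-scored measurements, choosing \(k = 3\) only because we happen to know there are three species, and then compare the clusters to the true labels:

```r
set.seed(1)                                    # illustrative seed for reproducibility
# k-means on the z-scored measurements, asking for 3 clusters;
# nstart = 25 restarts guard against poor random initializations
km <- kmeans(scale(iris[, 1:4]), centers = 3, nstart = 25)

# cross-tabulate clusters against the known species labels
table(cluster = km$cluster, species = iris$Species)
```

Cells off the main pattern in this table mark samples whose cluster disagrees with their species, mirroring the overlap between *versicolor* and *virginica* seen in the PCA ellipses.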