We will investigate whether the species can be classified based on four flower measurements and, if so, estimate the mis-classification error rates. As in any other statistical analysis, the errors include false positives and false negatives.
In machine-learning lingo, this is a classification problem, a form of supervised learning. The idea is to train a model on a set of training data (with known species identity). The model is then used to predict the species identity of a new sample (test data).
We will first load libraries and data, transform the data, and visualize data distributions.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.1 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(pheatmap)
data(iris)
head(iris)
glimpse(iris)
## Rows: 150
## Columns: 5
## $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.…
## $ Sepal.Width <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.…
## $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.…
## $ Petal.Width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.…
## $ Species <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, s…
# transform to a long (tidy) table
iris.long <- iris %>%
pivot_longer(1:4, names_to = "var", values_to = "value")
head(iris.long)
glimpse(iris.long)
## Rows: 600
## Columns: 3
## $ Species <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, setosa…
## $ var <chr> "Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", …
## $ value <dbl> 5.1, 3.5, 1.4, 0.2, 4.9, 3.0, 1.4, 0.2, 4.7, 3.2, 1.3, 0.2, 4.…
Visualize data distributions by species:
iris.long %>%
ggplot(aes(x = var, y = value)) +
geom_boxplot() +
geom_jitter(shape = 1) +
facet_wrap(~Species) +
coord_flip() +
theme_bw()
Visualize data distributions by measurements:
iris.long %>%
ggplot(aes(x = Species, y = value)) +
geom_boxplot() +
geom_jitter(shape = 1) +
facet_wrap(~var) +
coord_flip() +
theme_bw()
The goal is to find groups among the samples (rows) based on the four traits (columns). The hypothesis is that samples of the same species should be more similar to each other than to samples from a different species.
In machine learning, this is a classification problem: a model trained on the 150 samples could be used to identify (predict) the species of a new sample from its four measurements.
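To make the train/predict idea concrete, here is a minimal sketch (not part of the original analysis) using linear discriminant analysis via `MASS::lda()`, with a random 100/50 train/test split. The seed, split size, and choice of LDA are illustrative assumptions; any classifier would fit the same pattern.

```r
set.seed(42)                                   # illustrative seed for a reproducible split
train.idx <- sample(1:150, 100)                # 100 training samples, 50 held out for testing

# train on the training rows only (MASS:: avoids masking dplyr::select)
fit <- MASS::lda(Species ~ ., data = iris[train.idx, ])

# predict species for the held-out test rows
pred <- predict(fit, iris[-train.idx, ])$class

# confusion matrix: rows = true species, columns = predicted species
table(truth = iris$Species[-train.idx], predicted = pred)

# overall mis-classification rate on the test set
mean(pred != iris$Species[-train.idx])
```

The off-diagonal cells of the confusion matrix are exactly the false positives and false negatives mentioned at the start.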
# remove the species column (retain only numerical variables)
x.iris <- iris %>% select(-5)
pheatmap(x.iris, show_rownames = F)
We will improve the plot by (1) scaling the data by column, so that the measurements are normalized to \(z\)-scores and comparable across traits, and (2) adding a row-side color bar to annotate species.
The \(z\)-scores have a mean of \(\mu=0\) and a standard deviation of \(\sigma=1\), i.e., values are expressed in units of standard deviation.
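The effect of column scaling can be checked directly with base R's `scale()` (the same standardization `pheatmap(scale = "column")` applies internally); this small check is illustrative and not part of the pipeline:

```r
# scale() centers each column (mean 0) and divides by its standard deviation (sd 1)
z.iris <- scale(iris[, 1:4])

round(colMeans(z.iris), 10)   # column means: zero up to floating-point noise
apply(z.iris, 2, sd)          # column standard deviations: all 1
```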
# a data frame containing species for each sample
ann_row <- data.frame(species = factor(iris$Species))
# sync row names between the data matrix and the annotation
rownames(ann_row) <- paste("sample", 1:150)
rownames(x.iris) <- paste("sample", 1:150)
pheatmap(x.iris, scale = "column", show_rownames = F,
cellwidth = 20,
annotation_row = ann_row)
Principal component analysis (PCA) projects this 4-dimensional data set (four measurements) into a 2D space. Alternative dimensionality-reduction methods include MDS (multi-dimensional scaling, which works from pairwise distances) and the non-linear methods UMAP and t-SNE.
A principal component (PC) is a linear combination of the original variables.
# run PCA with scaling to z-scores
pca.iris <- prcomp(x.iris, scale. = T)
# show variance explained by each principal component
plot(pca.iris)
biplot(pca.iris)
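The "linear combination" claim can be verified by hand: each PC score is the \(z\)-scored data multiplied by the corresponding column of the rotation (loading) matrix. This check is illustrative; the `prcomp()` call is repeated so the chunk runs on its own.

```r
# same call as above, repeated so this chunk is self-contained
pca.iris <- prcomp(iris[, 1:4], scale. = TRUE)

# PC1 scores = z-scored data %*% first column of the rotation matrix
z <- scale(iris[, 1:4])
manual.pc1 <- as.numeric(z %*% pca.iris$rotation[, 1])

all.equal(manual.pc1, as.numeric(pca.iris$x[, 1]))   # TRUE
```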
Make a better plot by adding ellipses:
# get coordinates from the PCA output for PC1 & PC2
pca.coord <- as.data.frame(pca.iris[[5]])
# extrac PC1 and PC2, add a species column
df.pca <- tibble(pca.coord[1:2]) %>%
mutate(species = iris$Species)
# plot 1st & 2nd principal components
df.pca %>%
ggplot(aes(x = PC1, y = PC2, color = species)) +
geom_point(size = 3, alpha = 0.5) + # increase point size & make transparent
theme_bw() +
theme(legend.position = "bottom") +
stat_ellipse() # add a 95% confidence ellipse per species
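The PCA plot suggests the species form groups, which can be tested without using the labels at all. As a minimal unsupervised sketch (an extension beyond the original analysis), we can run k-means on the \(z\)-scored measurements, choosing \(k = 3\) only because we happen to know there are three species, and then compare the clusters to the true labels:

```r
set.seed(1)                                    # illustrative seed for reproducibility
# k-means on the z-scored measurements, asking for 3 clusters;
# nstart = 25 restarts guard against poor random initializations
km <- kmeans(scale(iris[, 1:4]), centers = 3, nstart = 25)

# cross-tabulate clusters against the known species labels
table(cluster = km$cluster, species = iris$Species)
```

Cells off the main pattern in this table mark samples whose cluster disagrees with their species, mirroring the overlap between *versicolor* and *virginica* seen in the PCA ellipses.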