The t-test is used to compare the means of a numerical measurement between two groups of samples. The main assumptions of the t-test include:
We will use the ozone
dataset, with the biology
hypothesis that gardening reduces pollution, as reflected in a lower
ozone level of ozone down-wind of the garden locations.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.1 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
getwd()
## [1] "C:/Users/weiga/Dropbox/Courses/bio-47120-2024"
setwd("C:/Users/weiga/Dropbox/Courses/bio-47120-2024/")
ozone <- read_csv("../datasets-master/ozone.csv")
## Rows: 20 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Garden.location, Garden.ID
## dbl (1): Ozone
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# show head
head(ozone)
## # A tibble: 6 × 3
## Ozone Garden.location Garden.ID
## <dbl> <chr> <chr>
## 1 61.7 West G1
## 2 64 West G2
## 3 72.4 West G3
## 4 56.8 West G4
## 5 52.4 West G5
## 6 44.8 West G6
# show summary
glimpse(ozone)
## Rows: 20
## Columns: 3
## $ Ozone <dbl> 61.7, 64.0, 72.4, 56.8, 52.4, 44.8, 70.4, 67.6, 68.8, …
## $ Garden.location <chr> "West", "West", "West", "West", "West", "West", "West"…
## $ Garden.ID <chr> "G1", "G2", "G3", "G4", "G5", "G6", "G7", "G8", "G9", …
The natural plot to view data set for the t-test is a boxplot, which shows distribution of a numerical data (on y-axis) by a categorical data (on x-axis).
ozone %>%
ggplot(aes(x = Garden.location, y = Ozone)) +
geom_boxplot(outlier.shape = NA) +
geom_jitter() +
theme_bw()
Boxplot showing ozone levels (y-axis) in gardens (dots) located in two directions (x-axis) of the source of the pollution. Are the data points normally distributed?
ozone %>%
ggplot(aes(x = Garden.location, y = Ozone)) +
geom_violin() +
geom_jitter() +
theme_bw()
Violin plot, where the widths corresponding to sample density
It looks that the ozone levels are lower in the gardens located to the west than those located to the west of the pollution source. But is the difference significant? A t-test is warranted.
ozone %>%
t.test(data = ., Ozone ~ Garden.location)
##
## Welch Two Sample t-test
##
## data: Ozone by Garden.location
## t = 4.2363, df = 17.656, p-value = 0.0005159
## alternative hypothesis: true difference in means between group East and group West is not equal to 0
## 95 percent confidence interval:
## 8.094171 24.065829
## sample estimates:
## mean in group East mean in group West
## 77.34 61.26
The result shows \(p<0.05\), a significant departure from the null hypothesis, which can be rejected. Therefore, we could conclude that the ozone levels are significantly different in two garden locations. We conclude with a final annotated graph.
ozone %>%
ggplot(aes(x = Garden.location, y = Ozone)) +
geom_boxplot(outlier.shape = NA) +
geom_jitter(shape = 21, color = "red") +
xlab("Garden location") +
ylab("Ozone level") +
labs(title = "Ozone level analysis", subtitle = "Significant difference between two locations (with p=5.2e-4 by a t-test)") +
theme_bw()
Conclusion: gardens reduce ozone pollution
The contingency analysis is used to test the association between two categorical variables, e.g., between genotype and phenotype. It is commonly used in genome-wide association studies (GWAS).
For example, we tabulated the genotypes and phenotypes of the PTC taster lab from BIOL203 (mol genetics) lab. Students are genotypes (by PCR followed by restriction enzyme analysis) at a SNP site (single-nucleotide polymorphism) in an odor-receptor gene TAS2R38. This SNP has two alternative alleles C or T, leading to 3 possible genotypes (C/C, C/T, or T/T). Are you a homozygote or a heterozygote at this SNP?
Students also tested themselves with PTC paper strips to be either a “taster” or “non-taster”. This is a phenotype test.
library(knitr)
# construct a data matrix
ptc.lab <- matrix(c(53,20,39, 17, 9, 52), nrow = 2, byrow = T)
colnames(ptc.lab) <- c("C/C", "C/T", "T/T")
rownames(ptc.lab) <- c("Taster", "Non.Taster")
# display table
kable(ptc.lab)
C/C | C/T | T/T | |
---|---|---|---|
Taster | 53 | 20 | 39 |
Non.Taster | 17 | 9 | 52 |
The question is, is the genotype associated with the phenotype? If so, which genotype is over-represented in which phenotype? Does the allele has a dominant or recessive effect?
Define the following terms:
Contingency table could be viewed with a grouped barplot:
barplot(ptc.lab, beside = T)
legend("top", legend = c("taster", "non-taster"), fill = c("black", "grey"))
Grouped barplot
Questions:
Is there evidence for genotype-phenotype association? Is so, which genotype is over-represented in the tasters?
Is there evidence for allele-phenotype association? Is so, which allele is over-represented in the tasters?
Is the Hunter biology student population diverse? What is the Hardy-Weinberg expected frequency of C/T heterozygotes?
An alternative visualization of a contingency table is the mosaic plot:
library(vcd)
## Loading required package: grid
mosaic(ptc.lab, labeling_args = list(set_varnames = c(A = "Phenotype", B = "Genotype")), main = "Observed counts")
Mosaicplot shows the relative propoportions of genotypes (in columns) as well as of the phenotypes (in rows)
The null hypothesis of a \(X^2\) test is an absence of genotype-phenotype association (i.e., the phenotype and genotype are independent from each other).
We obtain the expected counts under the null hypothesis and visualize it with a mosaicplot, to contrast with the observed counts.
ptc.chi <- chisq.test(ptc.lab)
ptc.exp <- ptc.chi$expected
mosaic(ptc.exp, labeling_args = list(set_varnames = c(A = "Phenotype", B = "Genotype")), main = "Expected counts")
# an alternative form of the same test:
assocstats(ptc.lab)
## X^2 df P(> X^2)
## Likelihood Ratio 19.457 2 5.9574e-05
## Pearson 19.070 2 7.2266e-05
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.302
## Cramer's V : 0.317
# Still another alternative test of independence:
fisher.test(ptc.lab)
##
## Fisher's Exact Test for Count Data
##
## data: ptc.lab
## p-value = 6.413e-05
## alternative hypothesis: two.sided
library(vcd)
mosaic(ptc.lab, shade = T, labeling_args = list(set_varnames = c(A = "Phenotype", B = "Genotype")))
Significant genotype-phenotype association of the PTC lab, showing an excess (over-representation) of T/T genotype and a deficit (under-representation) of C/C genotypes in the non-tasters