Part 1. t-test (for two groups)

The t-test is used to compare the means of a numerical measurement between two groups of samples. The main assumptions of the t-test include:

We will use the ozone dataset, with the biology hypothesis that gardening reduces pollution, as reflected in a lower ozone level of ozone down-wind of the garden locations.

Load library, set working direction, and read data

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.1     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
getwd()
## [1] "C:/Users/weiga/Dropbox/Courses/bio-47120-2024"
setwd("C:/Users/weiga/Dropbox/Courses/bio-47120-2024/")
ozone <- read_csv("../datasets-master/ozone.csv")
## Rows: 20 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Garden.location, Garden.ID
## dbl (1): Ozone
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# show head
head(ozone)
## # A tibble: 6 × 3
##   Ozone Garden.location Garden.ID
##   <dbl> <chr>           <chr>    
## 1  61.7 West            G1       
## 2  64   West            G2       
## 3  72.4 West            G3       
## 4  56.8 West            G4       
## 5  52.4 West            G5       
## 6  44.8 West            G6
# show summary
glimpse(ozone)
## Rows: 20
## Columns: 3
## $ Ozone           <dbl> 61.7, 64.0, 72.4, 56.8, 52.4, 44.8, 70.4, 67.6, 68.8, …
## $ Garden.location <chr> "West", "West", "West", "West", "West", "West", "West"…
## $ Garden.ID       <chr> "G1", "G2", "G3", "G4", "G5", "G6", "G7", "G8", "G9", …

Prelimiary data visualization

The natural plot to view data set for the t-test is a boxplot, which shows distribution of a numerical data (on y-axis) by a categorical data (on x-axis).

ozone %>% 
  ggplot(aes(x = Garden.location, y = Ozone)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter() + 
  theme_bw()
Boxplot showing ozone levels (y-axis) in gardens (dots) located in two directions (x-axis) of the source of the pollution. Are the data points normally distributed?

Boxplot showing ozone levels (y-axis) in gardens (dots) located in two directions (x-axis) of the source of the pollution. Are the data points normally distributed?

ozone %>% 
  ggplot(aes(x = Garden.location, y = Ozone)) +
  geom_violin() +
  geom_jitter() + 
  theme_bw()
Violin plot, where the widths corresponding to sample density

Violin plot, where the widths corresponding to sample density

It looks that the ozone levels are lower in the gardens located to the west than those located to the west of the pollution source. But is the difference significant? A t-test is warranted.

Run t-test

ozone %>% 
  t.test(data = ., Ozone ~ Garden.location)
## 
##  Welch Two Sample t-test
## 
## data:  Ozone by Garden.location
## t = 4.2363, df = 17.656, p-value = 0.0005159
## alternative hypothesis: true difference in means between group East and group West is not equal to 0
## 95 percent confidence interval:
##   8.094171 24.065829
## sample estimates:
## mean in group East mean in group West 
##              77.34              61.26

The result shows \(p<0.05\), a significant departure from the null hypothesis, which can be rejected. Therefore, we could conclude that the ozone levels are significantly different in two garden locations. We conclude with a final annotated graph.

Conclude with a final figure

ozone %>% 
  ggplot(aes(x = Garden.location, y = Ozone)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(shape = 21, color = "red") +
  xlab("Garden location") +
  ylab("Ozone level") +
  labs(title = "Ozone level analysis", subtitle = "Significant difference between two locations (with p=5.2e-4 by a t-test)") +
  theme_bw()
Conclusion: gardens reduce ozone pollution

Conclusion: gardens reduce ozone pollution

Part 2. Association test

The contingency analysis is used to test the association between two categorical variables, e.g., between genotype and phenotype. It is commonly used in genome-wide association studies (GWAS).

For example, we tabulated the genotypes and phenotypes of the PTC taster lab from BIOL203 (mol genetics) lab. Students are genotypes (by PCR followed by restriction enzyme analysis) at a SNP site (single-nucleotide polymorphism) in an odor-receptor gene TAS2R38. This SNP has two alternative alleles C or T, leading to 3 possible genotypes (C/C, C/T, or T/T). Are you a homozygote or a heterozygote at this SNP?

Students also tested themselves with PTC paper strips to be either a “taster” or “non-taster”. This is a phenotype test.

library(knitr)
# construct a data matrix
ptc.lab <- matrix(c(53,20,39, 17, 9, 52), nrow = 2, byrow = T)
colnames(ptc.lab) <- c("C/C", "C/T", "T/T")
rownames(ptc.lab) <- c("Taster", "Non.Taster")

# display table
kable(ptc.lab)
C/C C/T T/T
Taster 53 20 39
Non.Taster 17 9 52

The question is, is the genotype associated with the phenotype? If so, which genotype is over-represented in which phenotype? Does the allele has a dominant or recessive effect?

Define the following terms:

Preliminary data visualization

Contingency table could be viewed with a grouped barplot:

barplot(ptc.lab, beside = T)
legend("top", legend = c("taster", "non-taster"), fill = c("black", "grey"))
Grouped barplot

Grouped barplot

Questions:

  • Is there evidence for genotype-phenotype association? Is so, which genotype is over-represented in the tasters?

  • Is there evidence for allele-phenotype association? Is so, which allele is over-represented in the tasters?

  • Is the Hunter biology student population diverse? What is the Hardy-Weinberg expected frequency of C/T heterozygotes?

An alternative visualization of a contingency table is the mosaic plot:

library(vcd)
## Loading required package: grid
mosaic(ptc.lab, labeling_args = list(set_varnames = c(A = "Phenotype", B = "Genotype")), main = "Observed counts")
Mosaicplot shows the relative propoportions of genotypes (in columns) as well as of the phenotypes (in rows)

Mosaicplot shows the relative propoportions of genotypes (in columns) as well as of the phenotypes (in rows)

Association test

The null hypothesis of a \(X^2\) test is an absence of genotype-phenotype association (i.e., the phenotype and genotype are independent from each other).

We obtain the expected counts under the null hypothesis and visualize it with a mosaicplot, to contrast with the observed counts.

ptc.chi <- chisq.test(ptc.lab)
ptc.exp <- ptc.chi$expected
mosaic(ptc.exp, labeling_args = list(set_varnames = c(A = "Phenotype", B = "Genotype")), main = "Expected counts")

# an alternative form of the same test:
assocstats(ptc.lab)
##                     X^2 df   P(> X^2)
## Likelihood Ratio 19.457  2 5.9574e-05
## Pearson          19.070  2 7.2266e-05
## 
## Phi-Coefficient   : NA 
## Contingency Coeff.: 0.302 
## Cramer's V        : 0.317
# Still another alternative test of independence:
fisher.test(ptc.lab)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  ptc.lab
## p-value = 6.413e-05
## alternative hypothesis: two.sided

Conclude with a final mosaic plot

library(vcd)
mosaic(ptc.lab, shade = T, labeling_args = list(set_varnames = c(A = "Phenotype", B = "Genotype")))
Significant genotype-phenotype association of the PTC lab, showing an excess (over-representation) of T/T genotype and a deficit (under-representation) of C/C genotypes in the non-tasters

Significant genotype-phenotype association of the PTC lab, showing an excess (over-representation) of T/T genotype and a deficit (under-representation) of C/C genotypes in the non-tasters