R Tutorials

Part 1. t-test (for two groups)
Part 2. Association test

Part 1. t-test (for two groups)

The t-test is used to compare the means of a numerical measurement between two groups of samples. The main assumptions of the t-test include:

The samples are randomly selected and independent from each other (e.g., not from the same individual)
The measurements are normally distributed (otherwise, transformation is necessary, e.g., log+1 transformation of RNA counts of a transcriptome or OTU counts of a microbiome dataset)
Null hypothesis (\(H_0\)): equal means between the two groups: \(\mu_1=\mu_2\)

We will use the ozone dataset, with the biology hypothesis that gardening reduces pollution, as reflected in a lower ozone level of ozone down-wind of the garden locations.

Load library, set working direction, and read data

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.1     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

getwd()

## [1] "C:/Users/weiga/Dropbox/Courses/bio-47120-2024"

setwd("C:/Users/weiga/Dropbox/Courses/bio-47120-2024/")
ozone <- read_csv("../datasets-master/ozone.csv")

## Rows: 20 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Garden.location, Garden.ID
## dbl (1): Ozone
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# show head
head(ozone)

## # A tibble: 6 × 3
##   Ozone Garden.location Garden.ID
##   <dbl> <chr>           <chr>    
## 1  61.7 West            G1       
## 2  64   West            G2       
## 3  72.4 West            G3       
## 4  56.8 West            G4       
## 5  52.4 West            G5       
## 6  44.8 West            G6

# show summary
glimpse(ozone)

## Rows: 20
## Columns: 3
## $ Ozone           <dbl> 61.7, 64.0, 72.4, 56.8, 52.4, 44.8, 70.4, 67.6, 68.8, …
## $ Garden.location <chr> "West", "West", "West", "West", "West", "West", "West"…
## $ Garden.ID       <chr> "G1", "G2", "G3", "G4", "G5", "G6", "G7", "G8", "G9", …

Prelimiary data visualization

The natural plot to view data set for the t-test is a boxplot, which shows distribution of a numerical data (on y-axis) by a categorical data (on x-axis).

ozone %>% 
  ggplot(aes(x = Garden.location, y = Ozone)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter() + 
  theme_bw()

Boxplot showing ozone levels (y-axis) in gardens (dots) located in two directions (x-axis) of the source of the pollution. Are the data points normally distributed?

ozone %>% 
  ggplot(aes(x = Garden.location, y = Ozone)) +
  geom_violin() +
  geom_jitter() + 
  theme_bw()

Violin plot, where the widths corresponding to sample density

It looks that the ozone levels are lower in the gardens located to the west than those located to the west of the pollution source. But is the difference significant? A t-test is warranted.

Run t-test

ozone %>% 
  t.test(data = ., Ozone ~ Garden.location)

## 
##  Welch Two Sample t-test
## 
## data:  Ozone by Garden.location
## t = 4.2363, df = 17.656, p-value = 0.0005159
## alternative hypothesis: true difference in means between group East and group West is not equal to 0
## 95 percent confidence interval:
##   8.094171 24.065829
## sample estimates:
## mean in group East mean in group West 
##              77.34              61.26

The result shows \(p<0.05\), a significant departure from the null hypothesis, which can be rejected. Therefore, we could conclude that the ozone levels are significantly different in two garden locations. We conclude with a final annotated graph.

Conclude with a final figure

ozone %>% 
  ggplot(aes(x = Garden.location, y = Ozone)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(shape = 21, color = "red") +
  xlab("Garden location") +
  ylab("Ozone level") +
  labs(title = "Ozone level analysis", subtitle = "Significant difference between two locations (with p=5.2e-4 by a t-test)") +
  theme_bw()

Conclusion: gardens reduce ozone pollution

Part 2. Association test

The contingency analysis is used to test the association between two categorical variables, e.g., between genotype and phenotype. It is commonly used in genome-wide association studies (GWAS).

For example, we tabulated the genotypes and phenotypes of the PTC taster lab from BIOL203 (mol genetics) lab. Students are genotypes (by PCR followed by restriction enzyme analysis) at a SNP site (single-nucleotide polymorphism) in an odor-receptor gene TAS2R38. This SNP has two alternative alleles C or T, leading to 3 possible genotypes (C/C, C/T, or T/T). Are you a homozygote or a heterozygote at this SNP?

Students also tested themselves with PTC paper strips to be either a “taster” or “non-taster”. This is a phenotype test.

library(knitr)
# construct a data matrix
ptc.lab <- matrix(c(53,20,39, 17, 9, 52), nrow = 2, byrow = T)
colnames(ptc.lab) <- c("C/C", "C/T", "T/T")
rownames(ptc.lab) <- c("Taster", "Non.Taster")

# display table
kable(ptc.lab)

	C/C	C/T	T/T
Taster	53	20	39
Non.Taster	17	9	52

The question is, is the genotype associated with the phenotype? If so, which genotype is over-represented in which phenotype? Does the allele has a dominant or recessive effect?

Define the following terms:

Allele
Genotype
Homozygotes
Heterozygotes
Genetic diversity: \(h=1-\sum(f_i^2)\), which measures the frequency of heterozygotes in a population

Preliminary data visualization

Contingency table could be viewed with a grouped barplot:

barplot(ptc.lab, beside = T)
legend("top", legend = c("taster", "non-taster"), fill = c("black", "grey"))

Grouped barplot

Questions:

Is there evidence for genotype-phenotype association? Is so, which genotype is over-represented in the tasters?
Is there evidence for allele-phenotype association? Is so, which allele is over-represented in the tasters?
Is the Hunter biology student population diverse? What is the Hardy-Weinberg expected frequency of C/T heterozygotes?

An alternative visualization of a contingency table is the mosaic plot:

library(vcd)

## Loading required package: grid

mosaic(ptc.lab, labeling_args = list(set_varnames = c(A = "Phenotype", B = "Genotype")), main = "Observed counts")

Mosaicplot shows the relative propoportions of genotypes (in columns) as well as of the phenotypes (in rows)

Association test

The null hypothesis of a \(X^2\) test is an absence of genotype-phenotype association (i.e., the phenotype and genotype are independent from each other).

We obtain the expected counts under the null hypothesis and visualize it with a mosaicplot, to contrast with the observed counts.

ptc.chi <- chisq.test(ptc.lab)
ptc.exp <- ptc.chi$expected
mosaic(ptc.exp, labeling_args = list(set_varnames = c(A = "Phenotype", B = "Genotype")), main = "Expected counts")

# an alternative form of the same test:
assocstats(ptc.lab)

##                     X^2 df   P(> X^2)
## Likelihood Ratio 19.457  2 5.9574e-05
## Pearson          19.070  2 7.2266e-05
## 
## Phi-Coefficient   : NA 
## Contingency Coeff.: 0.302 
## Cramer's V        : 0.317

# Still another alternative test of independence:
fisher.test(ptc.lab)

## 
##  Fisher's Exact Test for Count Data
## 
## data:  ptc.lab
## p-value = 6.413e-05
## alternative hypothesis: two.sided

Conclude with a final mosaic plot

library(vcd)
mosaic(ptc.lab, shade = T, labeling_args = list(set_varnames = c(A = "Phenotype", B = "Genotype")))

Significant genotype-phenotype association of the PTC lab, showing an excess (over-representation) of T/T genotype and a deficit (under-representation) of C/C genotypes in the non-tasters