This is an R Markdown report to visualize computational results of African Swine Fever Virus (ASFV) genomics. Goals of the project include:
Implement a novel bioinformatics workflow for reference-based whole-genome alignment. All viral genomes (n=316) were aligned to the high-quality BA71 genome (NC_044942, with 180,365 bps).
Reconstruct the whole-genome phylogeny & gene phylogenies (n=161 ORFs)
Identify recombinant viral strains & estimate recombination rate relative to the mutation rate
Quantify site-by-site nucleotide and amino-acid evolutionary rates for each gene, to identify genes and gene sites under positive natural selection. Such genes and sites are the most likely antigenic loci involved in host-virus interactions and, as such, candidate targets for diagnosis, therapeutics, and vaccines.
Compile a list of n=316 ASFV genomes (by Dr Charles Adeola, KIZ), including the accessions and meta data (e.g., collection dates and locations, genotypes, host species)
Align the whole genomes to the BA71 reference genome. Identify and extract 2-allele SNP sites (n=26,372)
Infer whole-genome tree (n=317 OTUs) with IQTree2. Consolidate OTUs by tree-walk, resulting in n=62 representative genomes
Slice the whole-genome alignment into n=161 gene trees and remove sequences containing internal stop codons (likely due to non-senses mutations or sequencing errors)
For more details on the project, clone the project repository at https://edu.gitee.com/huntercollege/repos/huntercollege/asfv-genomics/sources.
First, we set working directory, load libraries, and read data. We make a preliminary genome-tree plot using the “ape” package.
Make sure that you use your own working directory name, not mine as shown below.
Install the required libraries
Install ggtree from bioconductor: https://bioconductor.org/install/; https://bioconductor.org/packages/release/bioc/html/ggtree.html
setwd("~/Dropbox/Courses/KIZ-Fall-2024/afsv-genomics")
library(ape)
library(readxl)
library(ggtree)
## ggtree v3.14.0 Learn more at https://yulab-smu.top/contribution-tree-data/
##
## Please cite:
##
## Guangchuang Yu. Data Integration, Manipulation and Visualization of
## Phylogenetic Trees (1st edition). Chapman and Hall/CRC. 2022,
## doi:10.1201/9781003279242, ISBN: 9781032233574
##
## Attaching package: 'ggtree'
## The following object is masked from 'package:ape':
##
## rotate
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ tidyr::expand() masks ggtree::expand()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ dplyr::where() masks ape::where()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggrepel)
t <- read.tree("data/asfv-317-mid.dnd")
plot.phylo(t, show.tip.label = F, type = "fan", no.margin = T)
add.scale.bar()
meta <- read_xlsx("docs/metadata-from-Charles.xlsx", skip = 1, trim_ws = T)
meta <- meta |> mutate(Genotype_p72 = str_extract(Genotype_p72, "\\S+"))
meta <- meta |> mutate(Region = case_when(
Region == 'Africa' ~ '0_African',
Region == 'Asia' ~ '1_Asia',
Region == 'Europe' ~ '2_Europe',
Region == 'South America' ~ '3_S.America')
)
head(meta)
## # A tibble: 6 × 11
## Accession Strain Location Country Region PubmedURL PubSubDate Length
## <chr> <chr> <chr> <chr> <chr> <chr> <dttm> <dbl>
## 1 NC_044942 BA71 Spain Spain 2_Eur… 26618713 NA 180365
## 2 MW723489 Nu1995… Italy: … Italy 2_Eur… 34696424 2021-10-04 00:00:00 181753
## 3 MW800838 Or_1984 Italy: … Italy 2_Eur… 34696424 2021-10-04 00:00:00 181924
## 4 MW736609 6396 WB Italy: … Italy 2_Eur… 34696424 2021-10-04 00:00:00 181756
## 5 MW723492 4996 WB Italy: … Italy 2_Eur… 34696424 2021-10-04 00:00:00 181751
## 6 MW723482 Nu1986 Italy: … Italy 2_Eur… 34696424 2021-10-04 00:00:00 181788
## # ℹ 3 more variables: Host <chr>, Genotype_p72 <chr>, AddInfo <chr>
We removed the closely related OTUs (mostly genotypes I and II isolates) with tree-walk. A total of 62 genomes remain after tree trimming.
# BASH (not R) command:
biotree --walk "NC_044942" > walk-fromBA71.txt
walk <- read_tsv("data/walk-fromBA71.txt", col_names = c("otu", "dist", "order"))
## Rows: 316 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (1): otu
## dbl (2): dist, order
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
walk |>
ggplot(aes(x = order, y = dist)) +
geom_line() +
xlab("tree walk order") +
ylab("distance from BA71") +
theme_bw()
Discuss:
There are 3 flat steps in the tree-walk line plot. What do these OTUs represent?
tree <- treeio::read.tree("data/asfv-62.dnd")
# join tree with meta data
tips <- tibble(id=tree$tip.label)
tips <- tips %>% left_join(meta, c("id" = "Accession"))
# color the regions (four colors) and genotypes (rainbow colors)
col <- c("green", "red", "blue", "orange", rainbow(length(unique(meta$Genotype_p72))))
# plot genome tree
ggtree(tree) %<+% tips +
geom_treescale(width = 0.01) +
geom_tippoint(aes(color = Genotype_p72)) +
geom_tiplab(align = TRUE, size = 2, aes(color = Region)) +
scale_color_manual(values = col) +
theme(legend.title = element_blank())
Tasks and Questions:
Plot individual gene trees (rooted on mid-point, plot the tree scales as well)
Compare with the genome tree
Do you see any inconsistencies in tree topology? What do the tree inconsistencies indicate?
Plot a gene tree:
tree2 <- treeio::read.tree("data/test-gene-tree.dnd")
# join tree with meta data
tips <- tibble(id=tree2$tip.label)
tips <- tips %>% left_join(meta, c("id" = "Accession"))
# color the regions (four colors) and genotypes (rainbow colors)
col <- c("green", "red", "blue", "orange", rainbow(length(unique(meta$Genotype_p72))))
# plot genome tree
ggtree(tree2) %<+% tips +
geom_treescale(width = 0.1) +
geom_tippoint(aes(color = Genotype_p72)) +
geom_tiplab(align = TRUE, size = 2, aes(color = Region)) +
scale_color_manual(values = col) +
theme(legend.title = element_blank())
If tree topology between the genome and gene trees has major differences, we expect to see major differences in branch length. We first obtain the tree distances from OTUs to the reference genome with the following command:
biotree --dist-all test-gene-tree.dnd | grep "NC_" | sed "s/NC_044942//" | tr -s "\t" | sed "s/^\t//" > dist-gene-tree.tsv
biotree --dist-all asfv-62.dnd | grep "NC_" | sed "s/NC_044942//" | tr -s "\t" | sed "s/^\t//" > dist-genome-tree.tsv
d1 <- read_tsv("data/dist-gene-tree.tsv", col_names = c("otu", "dist"))
## Rows: 61 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (1): otu
## dbl (1): dist
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
d2 <- read_tsv("data/dist-genome-tree.tsv", col_names = c("otu", "dist"))
## Rows: 61 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (1): otu
## dbl (1): dist
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
d <- d1 |> left_join(d2, "otu")
colnames(d) <- c("otu", "dist.gene", "dist.genome")
d |>
ggplot(aes(x = dist.genome, y = dist.gene, label = otu)) +
geom_point(shape = 1) +
geom_text_repel() +
geom_smooth(method = "lm") +
theme_bw()
## `geom_smooth()` using formula = 'y ~ x'
## Warning: The following aesthetics were dropped during statistical transformation: label.
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
## Warning: ggrepel: 45 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
Discussion:
Which isolate shows the most inconsistency with the genome-based tree?
Obtained by running:
# BASH (not R) command:
iqtree2 -s test.cds --seqtype CODON --rate
nt.rate <- read_tsv("data/test-rates-nt.tsv", skip = 8)
## Rows: 170 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## dbl (4): Site, Rate, Cat, C_Rate
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
nt.rate <- nt.rate |>
mutate(codon.site = case_when(
Site %% 3 == 1 ~ "site-1",
Site %% 3 == 2 ~ "site-2",
Site %% 3 == 0 ~ "site-3"
))
nt.rate |>
ggplot(aes(x = Site, y = Rate)) +
geom_point(shape = 1) +
geom_line() +
facet_wrap(~codon.site, nrow = 3) +
theme_bw()
Obtained by running:
# BASH (not R) command:
iqtree2 -s test-aa.fas --rate
# Or, more robustly (thanks to Alex)
iqtree2 -s test-aa.fa --mlrate
aa.rate <- read_tsv("data/test-rates-aa.tsv", skip = 8)
## Rows: 170 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## dbl (4): Site, Rate, Cat, C_Rate
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
aa.seq <- scan("data/test-aa.txt", what = "c")
aas <- str_split_1(aa.seq, pattern = "")
aa.rate <- aa.rate |>
mutate(aa = aas)
aa.rate |>
ggplot(aes(x = Site, y = Rate, label = aa)) +
geom_segment(data = aa.rate, color = "gray", aes(xend = Site, y = 0, yend = Rate)) +
geom_text() +
theme_bw()
First, we plot log likelihood of the three models. The data was extracted from PAML output file “test.mlc” with the following commands:
# BASH (not R) command:
grep "^lnL" test.mlc | tr -s " " | cut -f2,3 -d" " | sed "s/np://" | tr -d "[):]" | tr ' ' '\t' > test-lnL.tsv
# Or most robustly (to deal with np < 100):
grep "^lnL" paml-runs/F8208_gp152.mlc | tr -s " " | cut -f3,4 -d":" | tr -d "[):]" | sed -E "s/^\s+//" | cut -f1,2 -d" "
like <- read_tsv("data/test-lnL.tsv",col_names = c("num.par", "log.likelihood"))
## Rows: 3 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## dbl (2): num.par, log.likelihood
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
like <- like |>
mutate(model = c("0-one-ratio", "1-nearly-neutral", "2-positive-selection"))
like |>
ggplot(aes(x = num.par, y = log.likelihood, label = model)) +
geom_line() +
geom_point() +
geom_text(nudge_y = 5) +
theme_minimal()
Questions:
Is the Model 1 a significant improvement over the one-ratio (same Ka/Ks for all sites, Model 0) model?
Does the Model 2 significantly improve over the nearly-neutral model (Model 1)?
Obtain the p-values using the “pchisq()” function
Second, we visualize the proportions of sites under negative, neutral, and positive selections, respectively. The output file is extracted with the following BASH commands:
# BASH (not R) command:
grep -w "[pw]:" test.mlc | tail -2 | tr -s " " | tr -d ":" | tr " " "\t" > test-prop.tsv
prop <- read.table("data/test-prop.tsv", row.names = 1, col.names = c("para", "neg.sel", "neutral", "pos.sel"))
pie(as.matrix(prop)[1,], col = c("blue", "green", "red"))
We validate the positive-selected sites by a co-visualization of gene tree and alignment and meta data with “Phyloview” at http://u19.borreliabase.org/. The input files are:
A gene tree (Newick format): e.g., “asfv-62.dnd”
An alignment (FASTA format): e.g., “test-aa.fas”
A meta-data table (in CSV format): e.g., “test-meta.csv2”
Tasks & Questions:
Extract the Ka & Ks tree lengths from Model 0 (one-ratio model) from each “foo.mlc” file. Make a scatterplot (\(K_S\) on the x-axis and \(K_A\) on the y-axis) for all genes. Which genes are under the strongest negative and positive selection, respectively? (Hint: add an \(K_A=K_S\) line using <code>geom_abline(slop = 1, intercept = 0, linetype = 2)</code>) (Note: if you can’t find the “tree length for dN” and “tree length for dS” near the end of the Model 0 outputs, you may be using an older version of PAML. The latest version is 4.10.7, June 2023).
What is the general pattern of the sites under positive natural selection?