shinyWGD
Source: https://github.com/li081766/shinyWGD/blob/main/vignettes/intro_to_shinywgd.Rmd
This vignette provides an introduction to WGD inference using the
shinyWGD
server or package.
shinyWGD
can prepare the input and
command lines for wgd
, ksrates
, i-ADHoRe
,
OrthoFinder
,
and Whale
and then study the whole genome duplication events (WGDs) of the
interesting species.
This page provides a brief introduction to this server.
if (!require("devtools")) {
install.packages("devtools")
}
devtools::install_github("li081766/shinyWGD", dependencies=TRUE, build_vignettes=TRUE)
library(shinyWGD)
runshinyWGD()
help(package="shinyWGD")
wgd
,
ksrates
, i-ADHoRe
and
OrthoFinder
Please select the proper number of species you want to study in Select Number of Species to Analyze: first.
If only one species is set, the core program will be set towgd
, otherwise,ksrates
will be chosen.
shinyWGD
provides two options to upload
the data in the Data Preparation
Page.
Users can prepare a Tab-Separated file in which each row includes the data of a species.
This option is suitable to study multiple species.
The detailed format of this file is below:
Tab-Separated file
latin name | cds file namea | gff file nameb (optional) |
---|---|---|
Elaeis guineensis | elaeis.fasta | elaeis.gff |
Oryza sativa | aryza.fasta | oryza.gff3 |
Asparagus officinalis | asparagus.fasta.gz |
cds
is the coding region
of a gene or the portion of a gene’s DNA or RNA that codes for protein.
This file can be fasta
or gzipped-fasta
.
gff
is the format
consists of one line per feature, each containing nine columns of data,
plus optional track definition lines, see more click
here. gff
, gff3
,
gzipped-gff
, or gzipped-gff3
. focal species
is the parameter in
ksrates
, which will be set in the configure file
of ksrates
.Do not contain the alternative isoforms of each gene.
Do not contain theHeader Row
when preparing the table.
To ensure the successful transformation of of annotation file, please review the type (the third column) and attributes (the ninth column) of the
Download the exampleGFF
file and make sure that when the type is equal tomRNA
, the attributes column starts withID=
.GFF
file ofArabidopsis thaliana
from PLAZA.
Now users need to upload the corresponding files in the
Upload CDS / Annotation Files:
field.
⚠️ Ensure that the uploaded files match the definitions in the
Tab-Separated file
.
After successfully uploading the
tab-separated
file and the
CDS / Annotation
files, the number of
species in the file will be calculated.
If only one species needed to be studied,
shinyWGD
will generate the command line
for wgd
by clicking the
Create wgd Codes in the right
Setting
panel.
Users can check the codes by clicking the Go to wgd Scripts button.
Then, users can click the
Run wgd button to operate wgd
program
in PSB computing cluster
in
Script Executing
panel. After the results
are generated, an email will be sent to the provided email
address.
⚠️ The Script Executing
panel only
works on the server version of shinyWGD
of
PSB
.
Then, users can click the
Download Analysis Data button to download the prepared files and
script for wgd
program.
If multiple species needed to be study,
shinyWGD
will generate the command line
for ksrates
.
Two more parameters should be added.
focal species
set the focal species for ksrates
.newick tree
upload a tree file in newick
format. The tree should contain all the species.Then, users can click the
Create ksrates Codes button to generate the
configure file
and command line for
ksrates
.
After that, users can click the
Go to ksrates Scripts button to review the
configure file
and command line for
ksrates
.
Then, users can click the rest buttons to generate the scripts
for i-ADHoRe
and OrthoFinder
.
Then, users can submit the computing to the
PSB computing cluster
. This also only
works on the server version of shinyWGD
of
PSB
.
Finally, users can click the
Download Analysis Data button to download the prepared files and
script for ksrates
, i-ADHoRe
, and
OrthoFinder
program.
This option provide users a more straightforward way to upload data.
After selecting the number of studied species, the corresponding uploading panels will be created.
Three items (latin name
, cds fasta
, and
annotation gff
) of each species should be set or uploaded
by users.
Once more than one species needed to be analyzed, the
focal species
and newick tree
should be
defined by users.
Then users can click the
Create wgd Codes or
Create ksrates Codes button to create the codes
for wgd
or ksrates
according to the species
number.
After reviewing and downloading the necessary files, users can
execute wgd
, ksrates
, i-ADHoRe
or
OrthoFinder
in their local environment or submit the
computing to the
PSB computing cluster
.
shinyWGD
will exclude genes with internal stop codons and genes whose lengths are not divisible by three. You can find details on the gene filtering process in the Sequence_processing.log file located in theAnalysis_Date
folder.
wgd
, krates
,
i-ADHoRe
or OrthoFinder
locallyAfter decompressing the downloaded compressed file
(Analysis_Date.tgz
), users will find the
bash
and configure
files
(run_wgd.sh
/ run_ksrates.sh
,
run_paralog_ks_rest_species.sh
,
run_diamond_iadhore.sh
,
run_orthofinder.sh
).
In the first several lines of each
Our cluster also uses Environment Modules to manage software. In the script files, please comment out therun_*.sh
file, we include the#SBATCH
directive to specify the memory and thread usage. This is essential because our cluster relies on SLURM for workload management. Users can modify these lines to align with their specific cluster configurations.module load
commands for packages if you are not using Environment Modules.
Users can execute them in their local environment.
wgd
to your local environment, please
read the wgd
document.ksrate
to your local environment, please
read the ksrates
document.i-ADHoRe
to your local environment,
please read the i-ADHoRe
document.OrthoFinder
to your local environment,
please read the OrthoFinder
document.When usingi-ADHoRe
to do synteny analysis, the alignment will be produced firstly. The alignment process will use thediamond
program. So please installdiamond
program first.
To compute the Ks value of each anchored gene pair and the Ks unit tree of studied species, theMAFFT
,trimAl
, andPAML
will be used for aligning the proteins, obtaining the nucleotide alignment and calculating the Ks value, respectively. So please installMAFFT
,trimAl
, andPAML
first.
In this page, users can use the output of OrthoFinder
to
prepare the input for Whale
, a sophisticated package for
the gene tree - species tree reconciliation analysis.
Users need to upload two inputs to operate Whale.jl
. One
is the species time tree in Newick
format. Another is the
folder which includes all ALE
files. If users use
OrthoFinder
in shinyWGD
,
users can find the folder in Orthofinder_wd
, named
OrthoFinderOutput_for_Whale.tar.gz. Users can also follow the
instruction
of Whale.jl
to prepare the ALE
files by using
the gene families built by other software.
To run therun_Whale_preparing.sh
to prepare theALE
inputs forWhale
, theMrBayes
andALE
will be operated to transform each gene family into its correspondingALE
file. How to install these two software, please refer to their websites.
Once the tree file uploaded, users can check the topology of the tree in the right panel. Then users can add the hypothetical WGD event to a certain branch by clicking the branch.
Users should use wgd as the prefix for
the name of hypothetical WGD event to satisfy the requirements of
Whale.jl
.
In the above tree figure, four hypothetical WGD events are tested, named as wgd1 and wgd2, respectively.
Users can review the added hypothetical WGD event in the Hypothetical WGDs to test: panel in the left side.
Users can define the Base Model in the Base Model for Whale panel and set the Chain rounds in the Set the Chain for Whale panel.
After setting all the parameters, users can create Whale
scripts by clicking the
Create Whale Codes button. Then, users can review
the scripts by clicking the
Go to Whale Scripts button
If users don’t ensure the evolutionary relationships among the
studied species, shinyWGD
provides an
alternative way to generate the tree through interacting with TimeTree.org database.
TimeTree
database does not include all the species. If the species are not included in theTimeTree
database, you should ensure the relationship in other ways.
Please click the
Tree Extraction block in the
Scripts
to extract a tree from TimeTree.org database.
A file containing species names in each row needs to be uploaded in
the Upload Species List File:
field.
Species names file
Vitis vinifera |
Elaeis guineensis |
Oryza sativa |
Asparagus officinalis |
After successfully obtaining the tree, users can review the tree in the right panel.
In the example, eight species were used to extract the tree from TimeTree.org. (A) The uploading
panel allows users to upload the species information and start the
extraction. (B) The tree display panel enables users to choose a proper
strategy to build the tree and download the tree in two formats
(Newick
and Nexus
format). When users choose
As timetree.org
, the module will create
the tree as TimeTree.org do. In
TimeTree.org, they use the
adjusted time to build the species divergence time tree. However, the
evolutionary close species with the same adjusted divergence time led to
more than two branches sharing the same ancestral node in the tree. When
users choose the median time strategy, the conflict will be corrected.
(C) The species information panel. The species with identical time
records will be assigned to the same group. In this example,
Apostasia fujianica, Apostasia shenzhenica, and
Neuwiedia singapureana share the same time record in TimeTree.org. Platanthera
guangdongensis and Platanthera zijinensis also have identical
records.
After successfully running the corresponding shell files, the
necessary outputs will be saved in the
Analysis_Date
folder. Then users can
choose the Ks_Data_for_Visualization.tar.gz
file in the
folder and upload it to the
KsDist page.
Users should choose the compressed file which is originated from the
folder created by shinyWGD
program to
avoid the failure of displaying the correct file names.
Once selecting one species from the drop-down menu of Paralogous Ks ▼, users can click the Confirm analysis to create the output panel in the right side of the page.
After clicking the Start , the paralogous Ks age distribution of the selected species will be generated.
shinyWGD
employs two methods to predict
the potential peaks of the distribution.
First, shinyWGD
fits normal mixture
models to the Ks data using maximum likelihood
with EMMIX. shinyWGD
will test one to ten
components with 500 random starts. Fitted models are selected by the
Bayesian Information Criterion (BIC) within
EMMIX
. Finally, the fitted models can be
drawn above the bar plot of each species. Users can choose one or
several peaks to draw the curve.
Second, shinyWGD
uses
SiZer
to identify significant peaks based
on the significant zero crossings of derivatives from kernel density
estimation. Then, a subplot to show the output of
SiZer
will be added under the bar
plot.
Please refer to the papers below to see details about
EMMIX
andSiZer
.
- McLachlan G, Peel D. (1999) The EMMIX algorithm for the fitting of normal and t-components. Journal of Statistical Software. 4:1–14.
- Barker, M.S., Kane, N.C., Matvienko, M., Kozik, A., Michelmore, R.W., Knapp, S.J. and Rieseberg, L.H., (2008) Multiple paleopolyploidizations during the evolution of the Compositae reveal parallel patterns of duplicate gene retention after millions of years. Molecular biology and evolution, 25(11), pp.2445-2455.
- Chaudhuri, P. and Marron, J.S., (1999) SiZer for exploration of structures in curves. Journal of the American Statistical Association, 94(447), pp.807-823.
On the right side of the plot, a table with GMM peaks information is displayed. Users can choose which peaks to be displayed by clicking the check box of each row.
The example output is below.
Users need to select the species from the Select Species to compare of the Orthologous Ks ▼ to compute the density of Ks age distribution among the selected species.
To obtain the potential peak of the plot,
shinyWGD
will fit the peak by using the
Epanechnikov kernel function. Then 95% confidence interval (CI) is
estimated by 1000 iterations.
On the right side of the density plot, an ultrametric tree is displayed with circles corresponding to the density plot. Users can remove comparisons of less interest by clicking the circle in the tree. A pop-up menu will appear, allowing users to delete the selected comparison.
The example output is below.
In the above figure, the dash line is the peak of a combination of the reference species and another species. The colored rectangle is the 95% CI. Users can retrieve the detail information about the density, the peaks, and the 95% CIs by hovering the mouse above the lines or rectangles.
When comparing the one-to-one orthologous Ks distributions between studied species and the reference species, the different Ks peaks for the same speciation event may exist, suggesting different synonymous substitution rates among the studied species, resulting in an over- or under-estimate for the divergence between the studied species and the reference species.
To quantify the difference in substitution rates,
shinyWGD
employs the relative rate tests.
First, the Newick tree, which contains the phylogenetic
relationship, will be used to ensure enough species. Then, users need to
define a reference species and an outgroup species to
determine a tree structure among the species which will be used in the
relative rate test. Then, the Ks distance between two
species is estimated by the mode of their orthologous
Ks distribution. Using the Ks
distance between the reference species and the studied species, the
Ks distance between the outgroup species and the
reference species, the Ks distance between the
outgroup species and the studied species, the distances to the studied
species and the reference species after their divergence are computed.
Finally, orthologous Ks between the studied species
and the reference species is corrected by the double of the
Ks distance to the reference species, assuming that
studied species has the same substitution rate as the reference
species.
The example output is below.
In the left figure, Kernel-density estimates (KDEs) of Ks distribution for one-to-one orthologues between A. officinalis and other three species, namely A. shenzhenica, D. catenatum and P. equestris. As the modes (peaks) of the KDE all represent the distances between A. officinalis and the compared species, the differences observed among the Ks values of the modes indicate substitution rate variations among the studied species. The segments with arrows above the KDE denote, from top to bottom, the Ks distances computed for P. equestris and D. catenatum using A. shenzhenica as the reference and A. officinalis as the outgroup. The dotted lines align with the observed modes in the KDE of the orthologous Ks distributions. The black dots on the segments represent the divergence between A. shenzhenica with D. catenatum and between A. shenzhenica with P. equestris; hence, the lengths of segments pointed to the right show the Ks accumulated in A. shenzhenica and D. catenatum / P. equestris, whereas the grey segments pointed to the left show the Ks distance between A. officinalis and the divergence of A. shenzhenica. The grey rectangles show the 95% confidence intervals of the Ks inferred from 200 bootstraps.
In the right figure, shinyWGD
uses
arrows with directions to indicate overestimation (to the right) or
underestimation (to the left) of divergence events. The arrow points to
the Ks values after correcting for different
substitution rates. The bar and line plots are the paralogous
Ks distributions of A. shenzhenica.
shinyWGD
uses the single-copy gene
families built by OrthoFinder to calculate the
Ks unit tree. Users can check the tree in the
Orthofinder_wd/ds_tree_wd folder, named
singleCopyGene.ds_tree.newick. Once uploading the tree file in
the Ks Tree File panel, the
Ks unit tree will displayed in the right side of the
page and a ultrametric tree will automatically added in the right right
side of the Ks unit tree. Users can click the branch
in the ultrametric tree to add the WGD events into the branch by
observing the left Ks unit tree.
Users can click the text of the species name to change the color of the text.
Users can upload a file which contains the Ks peaks of each studied species to place these peaks into the Ks unit tree. This file can be created by operating Paralogous Ks ▼ module.
Or, users can prepare a file blow.
Ks Peaks file separated by Tab
Species | Peak in | Peak | 95% Confidence Interval |
---|---|---|---|
Apostasia shenzhenica | Anchor pairs | 0.97 | 0.55-2.19 |
Dendrobium catenatum | Anchor pairs | 0.86 | 0.38-4.18 |
Phalaenopsis equestris | Anchor pairs | 0.96 | 0.58-1.92 |
The example output is below.
Users can choose the
Collinear_Data_for_Visualization.tar.gz
file in the
Analysis_Date
folder and upload it to the
Collinear Analysis page. Then, users can study the
intra-, inter-, and multiple- species
synteny, as well as the clustering analysis between two
species.
After selecting the studied species, users can click the Confirm analysis to create the setting and output panel in the right side of the page.
Users can set the threshold for the number of anchor pairs in each multiplicon to filter out the small multiplicons which contain less anchor pairs.
The dot plot and parallel link plot will be generated.
Users can also search a certain gene to check the location of the gene in the multiplicon in the Multiplicon-level Synteny panel and then plot the searched multiplicon.
In the multiple species alignment panel, users can define the order of species in the plot. Users can change the color of species by clicking the text of the species name. When the mouse hovers a link, the related links with this link among species can be highlighted using different colors.
shinyWGD
employs the Hierarchical
clustering method for constructing putative ancestral regions (PARs).
The segments are identified from the outputs of i-ADHoRe
,
and the homolog concentration scores are calculated
using -log(p)
, where p
is the probability of
the observed number of homolog pairs as modeled by a Poisson
distribution.
To read more about homolog concentration scores, please refer to Putnam NH, et al. (2008) The amphioxus genome and the evolution of the chordate karyotype. Nature 453:1064–1071.
For each segment, the array of scores against all segments in the
other genome forms a unique profile. The segments are then clustered
based on the similarity of these profiles (determined by Pearson
correlation coefficient r) using the average linkage method.
The default cutoff of r is 0.3. Users can change this cutoff through the
sliding block. The faint yellow rectangle will highlight the
significantly enriched regions (P < 1E-10
). Then, users
can zoom in each PAR through the Putative
Ancestral Regions block.
In the example, we used 0.5 as the cutoff for r and 10 to filter out the segment with fewer anchor points. In total, we identified 61 PARs.
A single PAR zooms in is below.
In this page, users can use Whale
to operate the gene
tree - species tree reconciliation analysis.
After Whale
is done, users can check the output in this
page.
The example output is below.
We can judge whether the hypothetical WGD
event is true or not based on the Bayes factor (K) to
compare the likelihood of q = 0(H0) to the likelihood of q
> 0(H1) using the Savage–Dickey density ratio.
Please Note: the log10 Bayes factor is calculated. Thus, a Bayes factor smaller than -2 could be considered as evidence in favor of the q ≠ 0 model compared to the q = 0 mode.
Besides, users should also check the ESS values. Generally, ESS values exceeding at least 100 is better for a good model, although short chains may be good for exploring and testing different models. If the ESS values below 100, please increase the Chain Rounds in the left setting panel.
Then, users can click the button to update the species tree to display the output of Whale.
The example output is below.
In the above plot, the WGD with the solid green bar is supported with retention rates significantly different from zero, while the hollow WGD bars are the ones with retention rates not different from zero.
Then users can download the plot by clicking the button.
In this page, users can visualize the time tree of
MCMCTree
, or a ultrametric tree.
Users can upload a nexus tree withe the divergence time information. The format of the tree is the output file of MCMCTree, named FigTree.tre. Be careful with the time unit, please use the 100 million yeas as the time scale. Users can click the text of the species name to change the color or add a symbol to the species. Users can click the branch to add a WGD into the branch.
Users can upload a file which contains the WGD events of each studied species to place these WGDs into the tree.
WGD Events file separated by Tab
Species | WGD eventsa | color |
---|---|---|
Elaeis guineensis | 0.75 | #ababab |
Oryza_sativa | 0.6308-0.6989,1.10-1.17 | #ababab |
Phalaenopsis_equestris | 0.4565-0.7838 | #ababab |
Zostera_marina | 0.64-0.72 | #ababab |
Spirodela polyrhiza | 0.9079,0.8964 | #ababab |
Asparagus officinalis | 0.3245-0.7013 | #ababab |
Ananas comosus | 1.10-1.17,1.17-1.33 | #ababab |
Vitis vinifera | 1.17-1.33 | #ababab |
a WGD events can be a single time
point or a time interval. Please use “,” to separate
multiple WGD events within a species.
The example output is below.
Users can also upload a ultrametric tree and then add the WGD events by clicking the branches manually.
The example output is below.
wgd
ksrates
i-ADHoRe
Whale
OrthoFinder
diamond
MAFFT
trimAl
PAML
MrBayes
ALE
{shiny}
{shinyjs}
{shinyFiles}
{shinyBS}
{shinyWidgets}
{shinyalert}
{bslib}
{bsplus}
{htmltools}
{stringi}
{tidyverse}
{vroom}
{english}
{data.table}
{argparse}
{dplyr}
{tools}
{seqinr}
{DT}
Zwaenepoel, A., and Van de Peer, Y., (2019) wgd - simple command line tools for the analysis of ancient whole genome duplications. Bioinformatics, bty915.
Sensalari, C., Maere, S., and Lohaus R., (2021) ksrates: positioning whole-genome duplications relative to speciation events in KS distributions. Bioinformatics, btab602.
Proost, S., Fostier, J., De Witte, D., Dhoedt, B., Demeester, P., Van de Peer, Y. and Vandepoele, K., (2012) i-ADHoRe 3.0—fast and sensitive detection of genomic homology in extremely large data sets. Nucleic acids research, 40(2), pp.e11-e11.
Zwaenepoel, A., and Van de Peer, Y., (2019) Inference of ancient whole-genome duplications and the evolution of gene duplication and loss rates. Molecular biology and evolution, 36(7), 1384-1404.
Emms, D.M., and Kelly, S., (2015) OrthoFinder: solving fundamental biases in whole genome comparisons dramatically improves orthogroup inference accuracy. Genome biology, 16(1), pp.1-14.
sessionInfo()
#> R version 4.2.1 (2022-06-23)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS Big Sur ... 10.16
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
#>
#> locale:
#> [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] fontawesome_0.5.2 shinyWGD_1.0.5
#>
#> loaded via a namespace (and not attached):
#> [1] nlme_3.1-157 fs_1.5.2 usethis_2.1.6
#> [4] lubridate_1.9.2 devtools_2.4.5 bit64_4.0.5
#> [7] httr_1.4.5 rprojroot_2.0.3 tools_4.2.1
#> [10] profvis_0.3.7 bslib_0.4.2 utf8_1.2.2
#> [13] R6_2.5.1 DT_0.27 KernSmooth_2.23-20
#> [16] ade4_1.7-22 urlchecker_1.0.1 withr_2.5.0
#> [19] tidyselect_1.2.0 prettyunits_1.1.1 processx_3.8.0
#> [22] bit_4.0.4 compiler_4.2.1 english_1.2-6
#> [25] cli_3.6.0 argparse_2.2.2 xml2_1.3.3
#> [28] shinyjs_2.1.0 desc_1.4.2 sass_0.4.1
#> [31] mvtnorm_1.1-3 callr_3.7.3 commonmark_1.8.0
#> [34] bsplus_0.1.4 stringr_1.5.0 digest_0.6.29
#> [37] shinyBS_0.61.1 rmarkdown_2.21.1 pkgconfig_2.0.3
#> [40] htmltools_0.5.4 sessioninfo_1.2.2 fastmap_1.1.1
#> [43] htmlwidgets_1.6.1 rlang_1.0.6 rstudioapi_0.14
#> [46] shiny_1.7.5.1 jquerylib_0.1.4 generics_0.1.3
#> [49] jsonlite_1.8.4 mclust_6.0.0 vroom_1.6.1
#> [52] shinyalert_3.0.0 dplyr_1.1.0 magrittr_2.0.3
#> [55] Matrix_1.5-3 Rcpp_1.0.10 fansi_1.0.3
#> [58] ape_5.7-1 lifecycle_1.0.3 yaml_2.3.5
#> [61] stringi_1.7.6 brio_1.1.3 MASS_7.3-57
#> [64] pkgbuild_1.4.0 grid_4.2.1 shinyFiles_0.9.3
#> [67] parallel_4.2.1 promises_1.2.0.1 crayon_1.5.1
#> [70] miniUI_0.1.1.1 lattice_0.20-45 knitr_1.42
#> [73] ps_1.7.2 pillar_1.8.1 igraph_1.4.1
#> [76] seqinr_4.2-30 markdown_1.5 pkgload_1.3.2
#> [79] glue_1.6.2 evaluate_0.20 data.table_1.14.8
#> [82] remotes_2.4.2 vctrs_0.5.2 tzdb_0.3.0
#> [85] httpuv_1.6.5 testthat_3.1.6 purrr_1.0.1
#> [88] tidyr_1.3.0 ks_1.14.0 cachem_1.0.6
#> [91] xfun_0.37 mime_0.12 xtable_1.8-4
#> [94] tidyverse_2.0.0 pracma_2.4.2 roxygen2_7.2.3
#> [97] later_1.3.0 tibble_3.1.8 memoise_2.0.1
#> [100] shinyWidgets_0.7.6 timechange_0.2.0 ellipsis_0.3.2