- Unikseq: unique region identification in genome sequences using a k-mer approach, to empower environmental DNA assay designs and comparative genomics studies. Rene Warren, Michael J Allison, M. Louie Lopez, Neha Acharya-Patel, Lauren Coombe, Cecilia L. Yang, Caren C Helbing and Inanc Birol
- SequenceFlow - interactive web application for visualizing Partial Order Alignments. Krzysztof Zdąbłasz, Anna Lisiecka and Norbert Dojer
- Adversarial and variational autoencoders improve metagenomic binning. Pau Piera Lindez, Joachim Johansen, Arnor Ingi Sigurdsson, Jakob Nybo Nissen and Simon Rasmussen
- Efficient Analysis of Annotation Colocalization Accounting for Genomic Contexts. Askar Gafurov, Tomas Vinar, Paul Medvedev and Broňa Brejová
- REvolutionH-tl: Reconstruction of Evolutionary Histories tool. José Antonio Ramírez-Rafael, Alitzel López-Sánchez, Katia Aviña-Padilla, Andrea Arlette España-Tinajero and Maribel Hernández-Rosales
- Prior Density Learning in Variational Bayesian Phylogenetic Parameters Inference. Amine Remita, Golrokh Kiani and Abdoulaye Baniré Diallo
Unikseq: unique region identification in genome sequences using a k-mer approach, to empower environmental DNA assay designs and comparative genomics studies
Rene Warren, Michael J Allison, M. Louie Lopez, Neha Acharya-Patel, Lauren Coombe, Cecilia L. Yang, Caren C Helbing and Inanc Birol
The genomics revolution of the past two decades has resulted in an exponential growth of genome sequences hosted in public repositories, providing a wealth of genetic information for thousands of Earth’s species. This is opening up new research fields, such as species monitoring and conservation using environmental DNA (eDNA), in ways not thought possible only a few years ago. Conversely, the sudden access to massive amounts of DNA sequences is creating computational bottlenecks for comprehensive comparative genomics studies.
We present unikseq, a utility that uses substrings of length k (k-mers) to identify unique regions in a reference sequence relative to tolerated (ingroup) and not-tolerated (outgroup) sequence sets, quickly and with low memory. Applications of unikseq include qPCR assay design, where the identification of unique regions likely to yield specific assays helps to narrow the primer-probe search space considerably. For instance, with unikseq, thousands of complete animal mitochondrial genomes (MtG) can be compared simultaneously within minutes (e.g. ~2 min and 9 GB RAM running unikseq on 4,000 MtG). This whole-genome analysis substantially increases the sequence “real estate” compared to that of earlier eDNA assay designs based on one or a few mitochondrial or marker genes, such as the metazoan barcoding sequence resource, which includes portions of the mitochondrial cytochrome c oxidase I-encoding gene (Mt-COI).
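The core idea of k-mer based uniqueness can be illustrated with a minimal sketch. This is not the unikseq implementation: the function names `kmers` and `unique_regions` and the `min_len` parameter are hypothetical. For simplicity, the sketch treats ingroup sequences as fully tolerated (they are simply not consulted), requires every k-mer in a region to be absent from the outgroup, and ignores reverse complements.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def unique_regions(reference, outgroup, k, min_len):
    """Return (start, end) spans of the reference whose k-mers are all
    absent from every outgroup sequence. End positions are exclusive."""
    forbidden = set()
    for seq in outgroup:
        forbidden |= kmers(seq, k)

    regions = []
    start = None  # start of the current run of outgroup-free k-mers
    n_kmers = len(reference) - k + 1
    for i in range(n_kmers):
        is_unique = reference[i:i + k] not in forbidden
        if is_unique and start is None:
            start = i
        elif not is_unique and start is not None:
            end = i - 1 + k  # the last unique k-mer started at i - 1
            if end - start >= min_len:
                regions.append((start, end))
            start = None
    if start is not None:  # close a run that reaches the end of the reference
        end = n_kmers - 1 + k
        if end - start >= min_len:
            regions.append((start, end))
    return regions


if __name__ == "__main__":
    reference = "AAAACCCCGGGG"
    for start, end in unique_regions(reference, ["AAAACC"], k=3, min_len=4):
        print(start, end, reference[start:end])
```

A practical tool would typically also tolerate a small fraction of shared k-mers within a region and store the outgroup k-mers in a more compact structure than a Python set.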
We show how unikseq enabled the design of eDNA assays for species monitoring in challenging environments, including those with many closely related sympatric species, like that of S. maliger (rockfish), by identifying unique regions within a reference mitogenome.
With unikseq-Bloom, we further demonstrate the scalability of unikseq in identifying unique regions in white and interior (vs. Sitka and Engelmann) spruce using succinct Bloom filter data structures built from 3 x 20 Gbp genomes. Additionally, we present a scalable application to paleogenomics, where unique and conserved regions within whole Neanderthal and Denisovan vs. 5 modern human genomes are identified by unikseq-Bloom in ~2h using <42 GB RAM, illustrating how the utility can also be used to identify incomplete sequences in large genome assembly projects.
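To illustrate why Bloom filters help at this scale, the following is a minimal, generic Bloom filter for k-mer membership queries. It is a sketch under our own assumptions (the class name, the hashing scheme, and the parameters are not taken from unikseq-Bloom): k-mers are stored as set bits rather than as strings, so memory use is fixed in advance, at the cost of occasional false positives and with no false negatives.

```python
import hashlib


class BloomFilter:
    """A minimal Bloom filter: no false negatives, tunable false-positive rate."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting a single hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode(), salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


if __name__ == "__main__":
    bf = BloomFilter(num_bits=1024, num_hashes=3)
    for kmer in ("ACGTA", "CGTAC", "GTACG"):
        bf.add(kmer)
    print("ACGTA" in bf)  # an inserted k-mer is always reported as present
```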
Unikseq makes full use of public genomic resources for high-quality eDNA assay design, without the need to manually parse through entire multiple sequence alignments or compromise between eDNA assay sensitivity and selectivity. We anticipate the broad application of unikseq for eDNA assay design and genome analysis across the tree of life. Unikseq and unikseq-Bloom are available on GitHub (https://github.com/bcgsc/unikseq).
SequenceFlow - interactive web application for visualizing Partial Order Alignments
Krzysztof Zdąbłasz, Anna Lisiecka and Norbert Dojer
A Partial Order Alignment (POA) is a directed acyclic graph whose vertices are residues of the aligned sequences and whose edges connect successive residues in these sequences. In addition, the set of vertices is divided into clusters, which are equivalent to the columns in the standard alignment model, with identical symbols from aligned sequences in each cluster combined into a single vertex. As a result, the aligned sequences are represented by paths in the graph, and the common vertices of these paths are their identical residues. While the POA model has proved useful in several applications (e.g. sequencing read assembly and pangenome structure exploration), we lack efficient visualization tools that could highlight its advantages.
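The graph structure described above can be sketched in a few lines. The example below is an illustration only, not code from SequenceFlow, and the function name `build_poa` is hypothetical. It assumes a gapped alignment with rows of equal length is already available, uses `(column, symbol)` pairs as vertices so that identical symbols in a column merge into one vertex, and records each sequence as a path.

```python
from collections import defaultdict


def build_poa(aligned):
    """Build a POA-style graph from gapped, equal-length aligned sequences.

    Vertices are (column, symbol) pairs; the column index plays the role of
    the cluster. Returns the successor sets and one vertex path per sequence.
    """
    edges = defaultdict(set)  # vertex -> set of successor vertices
    paths = {}                # sequence index -> list of vertices
    for idx, row in enumerate(aligned):
        path = [(col, sym) for col, sym in enumerate(row) if sym != "-"]
        for u, v in zip(path, path[1:]):
            edges[u].add(v)
        paths[idx] = path
    return dict(edges), paths


if __name__ == "__main__":
    edges, paths = build_poa(["ACGT", "A-GT", "ATGT"])
    for vertex, successors in sorted(edges.items()):
        print(vertex, "->", sorted(successors))
```

Because every edge points from a lower column to a higher one, the resulting graph is acyclic by construction.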
Adversarial and variational autoencoders improve metagenomic binning
Pau Piera Lindez, Joachim Johansen, Arnor Ingi Sigurdsson, Jakob Nybo Nissen and Simon Rasmussen
Assembly of reads from metagenomic samples is a hard problem, often resulting in highly fragmented genome assemblies. Metagenomic binning allows us to reconstruct genomes by re-grouping the sequences by their organism of origin, thus representing a crucial processing step when exploring the biological diversity of metagenomic samples. Here we present Adversarial Autoencoders for Metagenomics Binning (AAMB), an ensemble deep learning approach that integrates sequence co-abundances and tetranucleotide frequencies into a common denoised space that enables precise clustering of sequences into microbial genomes. When benchmarked, AAMB presented similar or better results compared with the state-of-the-art reference-free binner VAMB, reconstructing ~7% more near-complete (NC) genomes across simulated and real data. In addition, genomes reconstructed using AAMB had higher completeness and greater taxonomic diversity compared with VAMB. Finally, we implemented a pipeline integrating VAMB and AAMB that enabled improved binning, recovering 20% and 29% more simulated and real NC genomes, respectively, compared to VAMB with moderate additional runtime. AAMB is freely available at https://github.com/RasmussenLab/VAMB.
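As a small illustration of one of the two input signals, the sketch below computes tetranucleotide frequencies for a single contig. It is not taken from AAMB or VAMB; the function name is hypothetical, and the sketch keeps all 256 tetranucleotides separate rather than merging reverse complements or applying any further projection.

```python
from collections import Counter
from itertools import product


def tetranucleotide_frequencies(contig):
    """Return the relative frequency of each of the 256 tetranucleotides."""
    contig = contig.upper()
    all_kmers = ["".join(p) for p in product("ACGT", repeat=4)]
    counts = Counter(contig[i:i + 4] for i in range(len(contig) - 3))
    # Summing over all_kmers skips 4-mers with ambiguous bases such as N.
    total = sum(counts[k] for k in all_kmers)
    if total == 0:
        return {k: 0.0 for k in all_kmers}
    return {k: counts[k] / total for k in all_kmers}


if __name__ == "__main__":
    tnf = tetranucleotide_frequencies("ACGTACGT")
    print(tnf["ACGT"], tnf["CGTA"])
```

A binner would compute such a vector for every contig and combine it with per-sample abundance estimates before clustering.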
Efficient Analysis of Annotation Colocalization Accounting for Genomic Contexts
Askar Gafurov, Tomas Vinar, Paul Medvedev and Broňa Brejová
An annotation is a set of genomic intervals sharing a particular function or property. Examples include genes, conserved elements, and epigenetic modifications. A common task is to compare two annotations in order to determine if one is enriched or depleted in the regions covered by another. We study the problem of assigning statistical significance to such a comparison based on a null model representing two random unrelated annotations. Previous approaches to this problem remain too slow or inaccurate in the face of the growing number of annotations available for comparison.
We propose a new null model based on a Markov chain which differentiates among several genomic contexts. These contexts can capture various confounding factors, such as sequencing gaps in the reference assembly, which would otherwise lead to biased results. We then develop a new algorithm for estimating p-values by computing the exact expectation and variance of the test statistics and then estimating the p-value using a normal approximation. Compared to the previous algorithm by Gafurov et al., the new algorithm provides three advances: (1) the running time is improved from quadratic to linear or quasi-linear, (2) the algorithm can handle two different test statistics, and (3) the algorithm can handle both simple and context-dependent Markov chain null models. We demonstrate the efficiency and accuracy of our algorithm on synthetic and real data sets, including the recent human telomere-to-telomere assembly. In particular, our algorithm computed p-values for 450 pairs of human genome annotations using 24 threads in under three hours. The use of genomic contexts to correct for GC-bias also resulted in the reversing of some previous findings.
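The final step, turning an expectation and a variance into a p-value through a normal approximation, can be sketched as follows. This is a generic illustration, not code from mcdp2: the function name and the `tail` parameter are our own, and the expectation and variance are assumed to have been computed already under the chosen null model.

```python
import math


def normal_p_value(observed, expectation, variance, tail="greater"):
    """Approximate a one-sided p-value for a test statistic, assuming the
    statistic is roughly normal with the given expectation and variance."""
    if variance <= 0:
        raise ValueError("variance must be positive")
    z = (observed - expectation) / math.sqrt(variance)
    if tail == "greater":  # enrichment: P(X >= observed)
        return 0.5 * math.erfc(z / math.sqrt(2))
    if tail == "less":     # depletion: P(X <= observed)
        return 0.5 * math.erfc(-z / math.sqrt(2))
    raise ValueError("tail must be 'greater' or 'less'")


if __name__ == "__main__":
    # Hypothetical numbers: an observed overlap two standard deviations
    # above the expectation under the null model.
    print(normal_p_value(observed=14.0, expectation=10.0, variance=4.0))
```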
Our work addresses the important problem of assessing the statistical significance of overlaps between comparative genomics features (such as conserved regions or copy number variants) and other genomic features.
Availability: The software is freely available at https://github.com/fmfi-compbio/mcdp2 under the MIT licence.
REvolutionH-tl: Reconstruction of Evolutionary Histories tool
José Antonio Ramírez-Rafael, Alitzel López-Sánchez, Katia Aviña-Padilla, Andrea Arlette España-Tinajero and Maribel Hernández-Rosales
Reconciliation of gene trees with species trees and orthology relationships are the main components of evolutionary histories, which are fundamental to understanding the presence, distribution, and properties of genes across species. Herein we present REvolutionH-tl (gitlab.com/jarr.tecn/revolutionh-tl), a bioinformatics tool for evolution reconstruction that improves reconciliation and orthology prediction methods while accommodating duplications and gene losses along species trees. Our procedure is based on the construction of best match graphs, a theoretical characterization of not necessarily reciprocal best hits, which underlie orthology relations among genes. We directly use such graphs for the reconstruction of event-labeled gene trees. Moreover, by reconciling the inferred gene trees against a species tree, we bound the ancestry of duplications and gene losses. In addition, we spot gene trees whose topology contradicts the species tree, which helps us to propose alternative hypotheses, such as the existence of a horizontal gene transfer or the loss of a set of genes after duplication giving rise to pseudo-orthologs, among others.
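The notion of best matches that need not be reciprocal can be made concrete with a short sketch. This is not REvolutionH-tl code: the function names, the nested-dictionary distance format, and the use of plain pairwise distances as a stand-in for evolutionary relatedness are all assumptions made for illustration.

```python
def best_matches(genes, species_of, distance):
    """Return directed arcs (x, y) where y is a closest gene to x among all
    genes of y's species, considering only species other than x's own."""
    arcs = set()
    for x in genes:
        closest = {}  # species -> smallest distance from x to that species
        for y in genes:
            s = species_of[y]
            if s != species_of[x]:
                d = distance[x][y]
                closest[s] = min(d, closest.get(s, d))
        for y in genes:
            s = species_of[y]
            if s != species_of[x] and distance[x][y] == closest[s]:
                arcs.add((x, y))
    return arcs


def reciprocal_best_matches(arcs):
    """Keep only the pairs that are best matches in both directions."""
    return {(x, y) for (x, y) in arcs if (y, x) in arcs and x < y}


if __name__ == "__main__":
    species_of = {"a1": "A", "b1": "B", "b2": "B"}
    distance = {
        "a1": {"b1": 1.0, "b2": 3.0},
        "b1": {"a1": 1.0, "b2": 2.0},
        "b2": {"a1": 3.0, "b1": 2.0},
    }
    arcs = best_matches(["a1", "b1", "b2"], species_of, distance)
    print(sorted(arcs))
    print(sorted(reciprocal_best_matches(arcs)))
```

In this toy input, `b2` has `a1` as its best match in species A, but `a1` prefers `b1`, so the arc from `b2` to `a1` is a best match that is not reciprocal.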
We have compared the performance of REvolutionH-tl against OrthoFinder and Proteinortho on both synthetic and real data, comparing three different aspects of evolutionary histories: i) gene family inference, ii) orthology prediction, and iii) gene tree topology. REvolutionH-tl has the best performance in these tests while decreasing computational requirements. This is particularly favorable considering that the output is composed of high-confidence reconciled trees and orthology predictions. REvolutionH-tl opens the possibility for the inference of whole-genome evolution for large sets of species, representing a boost for studies around pangenomics and comparative biology.
Prior Density Learning in Variational Bayesian Phylogenetic Parameters Inference
Amine Remita, Golrokh Kiani and Abdoulaye Baniré Diallo
The Bayesian phylogenetic community is exploring faster and more scalable alternatives to the Markov chain Monte Carlo (MCMC) approach to approximate the high-dimensional Bayesian posterior. The search for substitutes is motivated by falling computational costs, the increasing challenges of large-scale data analysis, advances in inference algorithms, and the implementation of efficient computational frameworks. Some alternatives are adaptive MCMC, Hamiltonian Monte Carlo, sequential Monte Carlo and variational inference (VI). Until recently, few studies applied classical variational approaches to probabilistic phylogenetic models. However, VI has started to gain traction in the phylogenetic community, taking advantage of advances that made it more scalable, generic and accurate, such as stochastic and black box VI algorithms, latent-variable reparametrization, and probabilistic programming. These advances have allowed the design of powerful and fast variational algorithms to infer complex phylogenetic models and analyze large-scale phylodynamic data.
Bayesian methods incorporate the practitioner's prior knowledge about the likelihood parameters through the prior distributions. Defining an appropriate and realistic prior is difficult, especially for small data sets, similar sequences, or parameters with complex correlations. Notably, the variational phylogenetic methods assign fixed prior distributions with default hyperparameters to the likelihood parameters, a practice shared with MCMC methods. However, such a choice could bias the posterior approximation and induce high posterior probabilities in cases where the data are weak or the actual parameter values do not fall within the range specified by the priors.
Here, we show that variational phylogenetic inference can also suffer from misspecified priors on branch lengths and, less severely, on sequence evolutionary parameters. Further, we propose an approach and an implementation framework (nnTreeVB) to relax the rigidity of the prior densities by learning their parameters using a gradient-based method and a neural network-based parameterization. We applied this approach to estimate branch lengths and evolutionary parameters under several Markov chain substitution models. Simulation results show that the approach is powerful in estimating branch lengths and evolutionary model parameters. They also show that a flexible prior model provides better results than a predefined prior model. Finally, the results highlight that using neural networks could improve the initialization of the optimization of the prior density parameters.
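One generic way to write the objective behind this idea is sketched below. It is an assumption made for illustration, not necessarily the exact formulation used in nnTreeVB. The notation is ours: D denotes the sequence data, θ the likelihood parameters (such as branch lengths and substitution model parameters), φ the variational parameters, and ψ the learnable prior parameters.

```latex
\mathcal{L}(\phi, \psi)
  = \mathbb{E}_{q_\phi(\theta)}\bigl[\log p(D \mid \theta)\bigr]
  - \mathrm{KL}\bigl(q_\phi(\theta) \,\|\, p_\psi(\theta)\bigr)
```

With a fixed prior, ψ is held constant and gradient ascent updates only φ. Under the learned-prior assumption, gradient steps are taken on both φ and ψ, and ψ may itself be the output of a small neural network.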
Reference: Remita A.M., Kiani G. and Diallo A.B. (2023) Prior Density Learning in Variational Bayesian Phylogenetic Parameters Inference. https://arxiv.org/abs/2302.02522