Abstract: Metagenomic sequencing allows us to study the structure, diversity and ecology in microbial communities without the necessity of obtaining pure cultures. In many metagenomics studies, the reads obtained from metagenomics sequencing are first assembled into longer contigs and these contigs are then binned into clusters of contigs where contigs in a cluster are expected to come from the same species. As different species may share common sequences in their genomes, one assembled contig may belong to multiple species. However, existing tools for binning contigs only support non-overlapped binning, i.e., each contig is assigned to at most one bin (species). In this paper, we introduce GraphBin2 which refines the binning results obtained from existing tools and, more importantly, is able to assign contigs to multiple bins. GraphBin2 uses the connectivity and coverage information from assembly graphs to adjust existing binning results on contigs and to infer contigs shared by multiple species. Experimental results on both simulated and real datasets demonstrate that GraphBin2 not only improves binning results of existing tools but also supports assigning contigs to multiple bins. GraphBin2 incorporates the coverage information into the assembly graph to refine the binning results obtained from existing binning tools. GraphBin2 also enables the detection of contigs that may belong to multiple species. We show that GraphBin2 outperforms its predecessor GraphBin on both simulated and real datasets.
Availability: The source code is freely available at: https://github.com/Vini2/GraphBin2.
Abstract: A bipartite network is a graph structure where nodes are from two distinct domains and only inter-domain interactions exist as edges. A large number of network embedding methods exist to learn vectorial node representations from general graphs with both homogeneous and heterogeneous node and edge types, including some that can specifically model the distinct properties of bipartite networks. However, these methods are inadequate to model multiplex bipartite networks (e.g., in e-commerce), which have multiple types of interactions (e.g., click, inquiry, and buy) and node attributes. Most real-world multiplex bipartite networks are also sparse and have imbalanced node distributions that are challenging to model. In this paper, we develop an unsupervised Dual HyperGraph Convolutional Network (DualHGCN) model that scalably transforms the multiplex bipartite network into two sets of homogeneous hypergraphs and uses spectral hypergraph convolutional operators, along with intra- and inter-message passing strategies to promote information exchange within and across domains, to learn effective node embeddings. We benchmark DualHGCN using four real-world datasets on link prediction and node classification tasks. Our extensive experiments demonstrate that DualHGCN significantly outperforms state-of-the-art methods, and is robust to varying sparsity levels and imbalanced node distributions.
Availability: The source code is freely available at: https://github.com/xuehansheng/DualHGCN.
Abstract: Plasmids are extra-chromosomal genetic materials with important markers that affect the function and behaviour of microorganisms, supporting their environmental adaptation. Hence the identification and recovery of such plasmidic sequences from assemblies is a crucial task in metagenomics analysis. In the past, machine learning approaches have been developed to separate chromosomes and plasmids. However, there is always a compromise between precision and recall in the existing classification approaches. The similarity of compositions between chromosomes and their plasmids makes it difficult to separate plasmids and chromosomes with high accuracy. High-confidence classifications are accurate but come at a significant cost in recall, and vice versa. Hence, more sophisticated approaches are required to separate plasmids and chromosomes accurately while retaining an acceptable trade-off between precision and recall. We present GraphPlas, a novel approach for plasmid recovery using coverage, composition and assembly graph topology. We evaluated GraphPlas on simulated and real short read assemblies with varying compositions of plasmids and chromosomes. Our experiments show that GraphPlas is able to significantly improve accuracy in detecting plasmidic and chromosomal contigs on top of popular state-of-the-art plasmid detection tools.
Availability: The source code is freely available at: https://github.com/anuradhawick/GraphPlas.
Abstract: Recent advances in RNA-seq technology have made identification of expressed genes affordable, thus boosting the rapid development of transcriptomic studies. Transcriptome assembly, reconstructing all expressed transcripts from RNA-seq reads, is an essential step to understand gene, protein, and cell functions. Transcriptome assembly remains a challenging problem due to complications in splicing variants, expression level, uneven coverage and sequencing errors. Here, we formulate the transcriptome assembly problem as path extraction on a splicing graph (or assembly graph), and propose a novel algorithm, MultiTrans, for path extraction using mixed integer linear programming. MultiTrans is able to take into consideration coverage constraints on vertices and edges, the number of paths and the paired-end information simultaneously. We benchmarked MultiTrans against two state-of-the-art transcriptome assemblers, TransLiG and rnaSPAdes. Experimental results show that MultiTrans generates more accurate transcripts compared to TransLiG (using the same splicing graph) and rnaSPAdes (using the same assembly graph).
Availability: The source code is freely available at: https://github.com/jzbio/MultiTrans.
Abstract: SNP calling is a fundamental problem of genetic analysis and has many applications, such as gene-disease diagnosis, drug design, and ancestry inference. Prior approaches either require a high-quality reference genome, or suffer from low recall/precision or high runtime. We develop a reference-free algorithm, Kmer2SNP, to call SNPs directly from raw reads, an approach that models SNP calling as a maximum weight matching problem. We benchmark Kmer2SNP against reference-free methods including hybrid (assembly-based) and assembly-free methods on both simulated and real datasets. Experimental results show that Kmer2SNP achieves better SNP calling quality while being an order of magnitude faster than the state-of-the-art methods. Kmer2SNP shows the potential of calling SNPs only using k-mers from raw reads without assembly.
Availability: The source code is freely available at: https://github.com/yanboANU/Kmer2SNP.
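The maximum weight matching formulation mentioned in the Kmer2SNP abstract can be illustrated with a minimal sketch. The abstract does not say how the graph or its weights are built, so the node names (`k1` to `k4`), the weights and the exhaustive search below are all hypothetical; they only show what a maximum weight matching is on a tiny graph, not how Kmer2SNP computes one.

```python
def max_weight_matching(edges):
    """Exact maximum weight matching by exhaustive search (tiny graphs only).

    edges: list of (u, v, weight) tuples. Returns (total_weight, matched_pairs).
    """
    if not edges:
        return 0.0, []
    (u, v, w), rest = edges[0], edges[1:]
    # Option 1: leave this edge out of the matching.
    skip_weight, skip_pairs = max_weight_matching(rest)
    # Option 2: take this edge and drop every other edge touching u or v.
    compatible = [e for e in rest if u not in e[:2] and v not in e[:2]]
    take_weight, take_pairs = max_weight_matching(compatible)
    take_weight += w
    if take_weight > skip_weight:
        return take_weight, [(u, v)] + take_pairs
    return skip_weight, skip_pairs


# Hypothetical candidate pairs: nodes stand for k-mers, weights for pair support.
candidate_pairs = [("k1", "k2", 5), ("k2", "k3", 6), ("k3", "k4", 5)]
total, pairs = max_weight_matching(candidate_pairs)
print(total, pairs)
```

In this toy graph the heaviest single edge (`k2`, `k3`) is not part of the best matching, because choosing it blocks both of its neighbours. The search is exponential in the number of edges, so it is only suitable for illustration.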
Abstract: Metagenomic sequencing allows us to study structure, diversity and ecology in microbial communities without the necessity of obtaining pure cultures. In many metagenomics studies, the reads obtained from metagenomics sequencing are first assembled into longer contigs and these contigs are then binned into clusters of contigs where contigs in a cluster are expected to come from the same species. As different species may share common sequences in their genomes, one assembled contig may belong to multiple species. However, existing tools for contig binning only support non-overlapped binning, i.e., each contig is assigned to at most one bin (species). In this paper, we introduce GraphBin2 which refines the binning results obtained from existing tools and, more importantly, is able to assign contigs to multiple bins. GraphBin2 uses the connectivity and coverage information from assembly graphs to adjust existing binning results on contigs and to infer contigs shared by multiple species. Experimental results on both simulated and real datasets demonstrate that GraphBin2 not only improves binning results of existing tools but also supports assigning contigs to multiple bins.
Availability: The source code is freely available at: https://github.com/Vini2/GraphBin2.
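The idea of using coverage to infer contigs shared by multiple species can be sketched as follows. This is an illustration under stated assumptions, not GraphBin2's actual procedure: it assumes that a shared contig's coverage is roughly the sum of the average coverages of the bins it belongs to, and the function name, the `tolerance` parameter and the bin names are hypothetical.

```python
from itertools import combinations


def infer_shared_bins(contig_cov, bin_covs, tolerance=0.1):
    """Return the combination of two or more bins whose summed average
    coverage is closest to contig_cov, or None if no combination falls
    within the relative tolerance. Assumes contig_cov > 0.
    """
    best, best_err = None, tolerance
    names = sorted(bin_covs)
    for size in range(2, len(names) + 1):
        for combo in combinations(names, size):
            total = sum(bin_covs[b] for b in combo)
            err = abs(total - contig_cov) / contig_cov
            if err < best_err:
                best, best_err = combo, err
    return best


# Hypothetical average coverages of bins adjacent to the contig in the graph.
neighbour_bins = {"bin1": 100.0, "bin2": 50.0, "bin3": 20.0}
print(infer_shared_bins(150.0, neighbour_bins))
print(infer_shared_bins(100.0, neighbour_bins))
```

A contig with coverage 150 is consistent with belonging to both `bin1` and `bin2`, whereas a contig with coverage 100 matches no combination of two or more bins and would stay in a single bin.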
Abstract: Network embedding aims to learn low-dimensional representations of nodes while capturing structure information of networks. It has achieved great success on many tasks of network analysis such as link prediction and node classification. Most existing network embedding algorithms focus on how to learn static homogeneous networks effectively. However, networks in the real world are more complex, e.g., networks may consist of several types of nodes and edges (called heterogeneous information) and may vary over time in terms of dynamic nodes and edges (called evolutionary patterns). Limited work has been done for network embedding of dynamic heterogeneous networks as it is challenging to learn both evolutionary and heterogeneous information simultaneously. In this paper, we propose a novel dynamic heterogeneous network embedding method, termed DyHATR, which uses hierarchical attention to learn heterogeneous information and incorporates recurrent neural networks with temporal attention to capture evolutionary patterns. We benchmark our method on four real-world datasets for the task of link prediction. Experimental results show that DyHATR significantly outperforms several state-of-the-art baselines.
Availability: The source code is freely available at: https://github.com/skx300/DyHATR.
Motivation: Metagenomics studies have provided key insights into the composition and structure of microbial communities found in different environments. Among the techniques used to analyze metagenomic data, binning is considered a crucial step to characterise the different species of microorganisms present. The use of short-read data in most binning tools poses several limitations, such as insufficient species-specific signal, and the emergence of long-read sequencing technologies offers us opportunities to surmount them. However, most current metagenomic binning tools have been developed for short reads. The few tools that can process long reads either do not scale with increasing input size or require a database with reference genomes that are often unknown. In this paper, we present MetaBCC-LR, a scalable reference-free binning method which clusters long reads directly based on their k-mer coverage histograms and oligonucleotide composition.
Results: We evaluate MetaBCC-LR on multiple simulated and real metagenomic long-read datasets with varying coverages and error rates. Our experiments demonstrate that MetaBCC-LR substantially outperforms state-of-the-art reference-free binning tools, achieving ∼13% improvement in F1-score and ∼30% improvement in ARI compared to the best previous tools. Moreover, we show that using MetaBCC-LR before long read assembly helps to enhance the assembly quality while significantly reducing the assembly cost in terms of time and memory usage. The efficiency and accuracy of MetaBCC-LR pave the way for more effective long-read based metagenomics analyses to support a wide range of applications.
Availability: The source code is freely available at: https://github.com/anuradhawick/MetaBCC-LR.
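The oligonucleotide composition feature mentioned in the MetaBCC-LR abstract can be illustrated with a small sketch that turns a read into a normalised k-mer frequency vector. The function name, the choice of k = 3 and the handling of non-ACGT characters are assumptions made for this example; in particular, it does not merge reverse-complement k-mers, which a real tool may well do.

```python
from collections import Counter
from itertools import product


def composition_vector(read, k=3):
    """Normalised k-mer frequency vector of a read over the ACGT alphabet.

    k-mers containing other characters (e.g. N) are ignored.
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(read[i:i + k] for i in range(len(read) - k + 1))
    total = sum(counts[km] for km in kmers)
    return [counts[km] / total if total else 0.0 for km in kmers]


vec = composition_vector("ACGTACGTAC")
print(len(vec), round(sum(vec), 6))
```

Each read becomes a vector of 4^k entries that sums to one, so reads of different lengths can be compared or clustered on the same scale.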
Motivation: The field of metagenomics has provided valuable insights into the structure, diversity and ecology within microbial communities. One key step in metagenomics analysis is to assemble reads into longer contigs which are then binned into groups of contigs that belong to different species present in the metagenomic sample. Binning of contigs plays an important role in metagenomics and most available binning algorithms bin contigs using genomic features such as oligonucleotide/k-mer composition and contig coverage. As metagenomic contigs are derived from the assembly process, they are output from the underlying assembly graph which contains valuable connectivity information between contigs that can be used for binning.
Results: We propose GraphBin, a new binning method that makes use of the assembly graph and applies a label propagation algorithm to refine the binning result of existing tools. We show that GraphBin can make use of the assembly graphs constructed from both the de Bruijn graph and the overlap-layout-consensus approach. Moreover, we demonstrate improved experimental results from GraphBin in terms of identifying mis-binned contigs and binning of contigs discarded by existing binning tools. To the best of our knowledge, this is the first time that the information from the assembly graph has been used in a tool for the binning of metagenomic contigs.
Availability: The source code of GraphBin is available at https://github.com/Vini2/GraphBin.
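The label propagation idea described in the GraphBin results can be sketched in a few lines. This is a generic majority-vote propagation over an undirected graph, given as an illustration only: GraphBin's actual algorithm may differ, and the function name, the tie handling and the contig and bin names are assumptions.

```python
from collections import Counter


def propagate_labels(edges, labels, max_iters=100):
    """Spread bin labels to unlabelled contigs along assembly-graph edges.

    edges: iterable of (contig, contig) pairs; labels: dict contig -> bin
    for the contigs already binned. A contig takes the strict majority
    label of its labelled neighbours; ties are left unlabelled.
    """
    neighbours = {}
    for u, v in edges:
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)
    labels = dict(labels)  # do not modify the caller's dictionary
    for _ in range(max_iters):
        updates = {}
        for node in sorted(neighbours):
            if node in labels:
                continue
            votes = Counter(labels[n] for n in neighbours[node] if n in labels)
            top = votes.most_common(2)
            if top and (len(top) == 1 or top[0][1] > top[1][1]):
                updates[node] = top[0][0]
        if not updates:
            break
        labels.update(updates)
    return labels


# Hypothetical graph: c1 and c5 were binned by an existing tool.
edges = [("c1", "c2"), ("c2", "c3"), ("c4", "c5")]
print(propagate_labels(edges, {"c1": "A", "c5": "B"}))
```

Contigs that an initial binning tool discarded (here `c2`, `c3` and `c4`) pick up a bin from their graph neighbours, which mirrors the abstract's point about binning discarded contigs.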
Abstract: The development of DNA sequencing technologies makes it possible to obtain reads originating from both copies of a chromosome (two parental chromosomes, or haplotypes) of a single individual. Reconstruction of both haplotypes (i.e. haplotype phasing) plays a crucial role in genetic analysis and provides relationship information between genetic variation and disease susceptibility. With the emerging third-generation sequencing technologies, most existing approaches for haplotype phasing suffer from performance issues when handling long and error-prone reads. We develop a divide-and-conquer algorithm, DCHap, to phase haplotypes using third-generation reads. We benchmark DCHap against three state-of-the-art phasing tools on both PacBio SMRT data and ONT Nanopore data. The experimental results show that DCHap generates more accurate or comparable results (measured by the switch errors) while being scalable for higher coverage and longer reads. DCHap is a fast and accurate algorithm for haplotype phasing using third-generation sequencing data. As third-generation sequencing platforms continue to improve their throughput and read lengths, accurate and scalable tools like DCHap are important for translating the advances of sequencing technologies into better haplotype phasing.
Availability: The source code is freely available at https://github.com/yanboANU/Haplotype-phasing.
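The switch error measure named in the DCHap abstract can be sketched as below. The abstract does not give the exact evaluation procedure, so this follows the common definition on a single phased block: a switch is counted wherever the predicted haplotype changes from agreeing with the true haplotype to disagreeing, or back. The function name and the 0/1 string encoding are assumptions.

```python
def count_switch_errors(predicted, truth):
    """Count phase switches between a predicted and a true haplotype.

    Both arguments are equal-length strings of 0/1 alleles at the same
    heterozygous sites within one phased block.
    """
    if len(predicted) != len(truth):
        raise ValueError("haplotypes must cover the same sites")
    # True where the predicted allele matches the truth at that site.
    agree = [p == t for p, t in zip(predicted, truth)]
    # A switch occurs wherever agreement flips between adjacent sites.
    return sum(1 for a, b in zip(agree, agree[1:]) if a != b)


print(count_switch_errors("000111", "000000"))
print(count_switch_errors("1111", "0000"))
```

A prediction that is the exact complement of the truth has no switch errors, because it describes the same pair of haplotypes with the two copies swapped.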