Browse, search and filter the latest quantitative biology research papers from arXiv
Recent advances in applying deep learning in genomics include DNA-language and single-cell foundation models. However, these models take only one data type as input. We introduce dynamic token adaptation and demonstrate how it combines these models to predict gene regulation at the single-cell level in different genetic contexts. Although the method is generalisable, we focus on an illustrative example by training an adapter from DNA-sequence embeddings to a single-cell foundation model's token embedding space. As a qualitative evaluation, we assess the impact of DNA sequence changes on the model's learned gene regulatory networks by mutating the transcriptional start site of the transcription factor GATA4 in silico, observing predicted expression changes in its target genes in fetal cardiomyocytes.
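As a rough illustration of the adapter idea, the sketch below maps a DNA-language-model embedding into a single-cell model's token-embedding space with a small MLP trained to match existing gene tokens. All dimensions, layer sizes, and the MSE objective are assumptions, not the paper's specification.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions; the abstract does not specify them.
DNA_EMB_DIM = 768      # output dim of a DNA-language model
TOKEN_EMB_DIM = 512    # token-embedding dim of a single-cell foundation model

class TokenAdapter(nn.Module):
    """Maps DNA-sequence embeddings into the single-cell model's
    token-embedding space, so a gene's input token can be swapped
    for one derived from its (possibly mutated) DNA sequence."""
    def __init__(self, in_dim=DNA_EMB_DIM, out_dim=TOKEN_EMB_DIM, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, dna_embedding):
        return self.net(dna_embedding)

# Assumed training target: the single-cell model's learned gene tokens.
adapter = TokenAdapter()
dna_emb = torch.randn(32, DNA_EMB_DIM)        # batch of gene-sequence embeddings
gene_tokens = torch.randn(32, TOKEN_EMB_DIM)  # matching learned gene tokens
loss = nn.functional.mse_loss(adapter(dna_emb), gene_tokens)
loss.backward()
```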
Background: Spatial transcriptomics has emerged as a powerful tool in biomedical research because of its ability to capture both the spatial context and the abundance of the complete RNA transcript profile in organs of interest. However, limitations of the technology, such as its relatively low resolution and comparatively shallow sequencing depth, make it difficult to reliably extract real biological signals from these data. To alleviate this challenge, we propose a novel transfer learning framework, referred to as TransST, that adaptively leverages cell-labeled information from external sources to infer cell-level heterogeneity in a target spatial transcriptomics dataset. Results: Applications in several real studies, as well as a number of simulation settings, show that our approach significantly outperforms existing techniques. For example, in the breast cancer study, TransST successfully identifies five biologically meaningful cell clusters, including the two subgroups of cancer in situ and invasive cancer; in addition, TransST is the only method among those studied that is able to separate the adipose tissues from the connective tissues. Conclusions: In summary, the proposed method TransST is both effective and robust in identifying cell subclusters and detecting the corresponding driving biomarkers in spatial transcriptomics data.
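As a loose conceptual analogue (not TransST itself), the sketch below fits a classifier on labeled external single-cell data and transfers it to unlabeled spatial spots through a shared PCA space; the data, dimensions, and alignment step are all invented for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Stand-ins for real data: external labeled single-cell profiles (source)
# and unlabeled spatial transcriptomics spots (target), same gene panel.
X_source = rng.poisson(2.0, size=(500, 200)).astype(float)
y_source = rng.integers(0, 5, size=500)          # 5 hypothetical cell types
X_target = rng.poisson(2.0, size=(300, 200)).astype(float)

# Project both domains into a shared low-dimensional space fit on the
# pooled data, a crude surrogate for an adaptive transfer step.
pca = PCA(n_components=20).fit(np.vstack([X_source, X_target]))
clf = LogisticRegression(max_iter=1000).fit(pca.transform(X_source), y_source)
spot_labels = clf.predict(pca.transform(X_target))  # cell-level heterogeneity
```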
Predicting phenotype from genotype is a central challenge in genetics. Traditional approaches in quantitative genetics typically analyze this problem using methods based on linear regression. These methods generally assume that the genetic architecture of complex traits can be parameterized in terms of an additive model, where the effects of loci are independent, plus (in some cases) pairwise epistatic interactions between loci. However, these models struggle with more complex patterns of epistasis or subtle gene-environment interactions. Recent advances in machine learning, particularly attention-based models, offer a promising alternative. Initially developed for natural language processing, attention-based models excel at capturing context-dependent interactions and have shown exceptional performance in predicting protein structure and function. Here, we apply attention-based models to quantitative genetics. We analyze the performance of this attention-based approach in predicting phenotype from genotype using simulated data across a range of models with increasing epistatic complexity, as well as experimental data from a recent quantitative trait locus mapping study in budding yeast. We find that our model demonstrates superior out-of-sample predictions in epistatic regimes compared to standard methods. We also explore a more general multi-environment attention-based model to jointly analyze genotype-phenotype maps across multiple environments, and show that such architectures can be used for "transfer learning": predicting phenotypes in novel environments with limited training data.
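A minimal sketch of the general idea, treating each locus as a token so self-attention can capture epistatic context; the architecture, sizes, and pooling here are assumptions rather than the paper's model.

```python
import torch
import torch.nn as nn

class GenotypePhenotypeAttention(nn.Module):
    """Treats each locus as a token so self-attention can model
    context-dependent (epistatic) interactions between loci."""
    def __init__(self, n_loci=100, d_model=64, n_heads=4, n_layers=2):
        super().__init__()
        self.allele_emb = nn.Embedding(3, d_model)      # genotypes 0/1/2
        self.locus_emb = nn.Embedding(n_loci, d_model)  # locus identity
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, 1)               # scalar phenotype

    def forward(self, genotypes):                       # (batch, n_loci)
        pos = torch.arange(genotypes.size(1), device=genotypes.device)
        h = self.allele_emb(genotypes) + self.locus_emb(pos)
        return self.head(self.encoder(h).mean(dim=1)).squeeze(-1)

model = GenotypePhenotypeAttention()
g = torch.randint(0, 3, (8, 100))   # toy biallelic genotypes
pred = model(g)                     # predicted phenotypes, shape (8,)
```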
Thalassemia, a blood disorder and one of the most prevalent hereditary genetic disorders worldwide, is often caused by copy number variations (CNVs) in the hemoglobin genes. The disorder is remarkably diverse, with a large number of distinct profiles corresponding to alterations of different regions in the genes. Correctly classifying an individual's profile is critical, as it impacts treatment, prognosis, and genetic counseling. However, genetic classification is challenging due to the large number of profiles worldwide, and often requires a long sequence of tests. Targeted next-generation sequencing (NGS), which characterizes segments of an individual's genome, has the potential to dramatically reduce the cost of testing and increase accuracy. In this work, we introduce a probabilistic state space model for profiling thalassemia from targeted NGS data, which naturally characterizes the spatial ordering of the genes along the chromosome. We then use decision theory to choose the best profile among the different options. Because we use Bayesian methodology, we are also able to detect low-quality samples to be excluded from consideration, an important component of clinical screening. We evaluate our model on a dataset of 57 individuals, including both controls and cases with a variety of thalassemia profiles. Our model has a sensitivity of 0.99 and a specificity of 0.93 for thalassemia detection, and an accuracy of 91.5% for characterizing subtypes. Furthermore, the specificity and accuracy rise to 0.96 and 93.9%, respectively, when low-quality samples are excluded using our automated quality control method. This approach outperforms alternative methods, particularly in specificity, and is broadly applicable to other disorders.
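To make the state space idea concrete, here is a toy forward pass of a hidden Markov model over ordered gene segments, with copy-number states and a Poisson read-depth emission model; all parameters and numbers are illustrative, not the paper's.

```python
import numpy as np
from scipy.stats import poisson

# Hidden copy-number states per ordered segment along the chromosome.
states = [0, 1, 2, 3]
log_init = np.log(np.full(4, 0.25))
# Transitions favor keeping the same copy number between adjacent segments.
A = np.full((4, 4), 0.02); np.fill_diagonal(A, 0.94)
log_A = np.log(A)

depth = np.array([31, 28, 14, 13, 30])   # toy NGS read depths per segment
base = 15.0                              # assumed expected depth per copy

# log-emissions: depth ~ Poisson(copy_number * base), floored near 0 copies
log_B = np.array([[poisson.logpmf(d, max(s, 0.05) * base) for s in states]
                  for d in depth])

# Forward recursion in log space.
log_alpha = log_init + log_B[0]
for t in range(1, len(depth)):
    log_alpha = log_B[t] + np.array([
        np.logaddexp.reduce(log_alpha + log_A[:, j]) for j in range(4)])
marginal_log_lik = np.logaddexp.reduce(log_alpha)  # evidence, usable for QC
```

A low marginal likelihood under every candidate profile is one natural way to flag a low-quality sample, in the spirit of the automated quality control step described above.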
Deep learning techniques have driven significant progress in various analytical tasks within 3D genomics in computational biology. However, a holistic understanding of 3D genomics knowledge remains underexplored. Here, we propose MIX-HIC, the first multimodal foundation model of the 3D genome, which integrates 3D genome structure with epigenomic tracks to obtain unified and comprehensive semantics. For accurate heterogeneous semantic fusion, we design cross-modal interaction and mapping blocks that yield a robust unified representation and accurate aggregation of 3D genome knowledge. In addition, we introduce the first large-scale dataset comprising over 1 million pairwise samples of Hi-C contact maps and epigenomic tracks for high-quality pre-training, enabling the exploration of functional implications in 3D genomics. Extensive experiments show that MIX-HIC significantly surpasses existing state-of-the-art methods in diverse downstream tasks. This work provides a valuable resource for advancing 3D genomics research.
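A minimal sketch of what a cross-modal interaction block could look like, with Hi-C and epigenomic token streams attending to each other; the dimensions and the residual/LayerNorm layout are assumptions, not MIX-HIC's actual design.

```python
import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    """Illustrative cross-attention fusion: Hi-C patch embeddings attend to
    epigenomic-track embeddings and vice versa; sizes are made up."""
    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        self.hic_to_epi = nn.MultiheadAttention(d_model, n_heads,
                                                batch_first=True)
        self.epi_to_hic = nn.MultiheadAttention(d_model, n_heads,
                                                batch_first=True)
        self.norm_h = nn.LayerNorm(d_model)
        self.norm_e = nn.LayerNorm(d_model)

    def forward(self, hic, epi):            # (batch, tokens, d_model) each
        h, _ = self.hic_to_epi(hic, epi, epi)
        e, _ = self.epi_to_hic(epi, hic, hic)
        return self.norm_h(hic + h), self.norm_e(epi + e)

block = CrossModalBlock()
hic = torch.randn(2, 64, 256)   # tokenized Hi-C contact-map patches
epi = torch.randn(2, 128, 256)  # tokenized epigenomic track windows
fused_hic, fused_epi = block(hic, epi)
```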
Accurate diagnosis of Mendelian diseases is crucial for precision therapy and for assisting preimplantation genetic diagnosis. However, existing methods often fall short of clinical standards or depend on extensive datasets to build pretrained machine learning models. To address this, we introduce an innovative LLM-driven multi-agent debate system (MD2GPS) that provides natural language explanations of its diagnostic results. It uses a language model to translate the outputs of a data-driven agent and a knowledge-driven agent into natural language, and then fosters a debate between these two specialized agents. The system has been tested on 1,185 samples across four independent datasets, improving average Top-1 accuracy from 42.9% to 66%. Additionally, in a challenging cohort of 72 cases, MD2GPS identified potential pathogenic genes in 12 patients and reduced the diagnostic time by 90%. The methods within each module of the system are also replaceable, facilitating its adaptation for diagnosing and researching other complex diseases.
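A conceptual sketch of a two-agent debate loop; `ask_llm`, the prompts, and the round structure are all placeholders rather than MD2GPS's actual implementation.

```python
# Conceptual sketch only: `ask_llm` stands in for any chat-completion call,
# and the prompts below are invented for illustration.
def ask_llm(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM client of choice")

def debate(case_summary: str, data_ranking: str, knowledge_ranking: str,
           rounds: int = 3) -> str:
    data_arg = f"Defend this data-driven gene ranking:\n{data_ranking}"
    know_arg = f"Defend this knowledge-driven gene ranking:\n{knowledge_ranking}"
    for _ in range(rounds):
        data_arg = ask_llm(f"Patient: {case_summary}\n{data_arg}\n"
                           f"Rebut the opposing view:\n{know_arg}")
        know_arg = ask_llm(f"Patient: {case_summary}\n{know_arg}\n"
                           f"Rebut the opposing view:\n{data_arg}")
    return ask_llm("Act as judge. Merge both arguments into a final ranked "
                   f"list of candidate genes, with reasons:\n{data_arg}\n{know_arg}")
```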
Advancements in biomedical research have driven continuous innovations in sensing and diagnostic technologies. Among these, nanopore-based single-molecule sensing and sequencing is rapidly emerging as a powerful and versatile methodology. Advances in nanopore-based approaches require concomitant improvements in the electronic readout methods employed, in terms of noise, bandwidth, and form factor. This article focuses on current-sensing circuits designed and employed for ultra-low-noise nanopore signal readout, addressing the fundamental limitations of traditional off-chip transimpedance amplifiers (TIAs), which suffer from high input parasitic capacitance, bandwidth constraints, and increased noise at high frequencies. The review surveys the latest design schemes and circuit structures, classified into on-chip and off-chip TIA designs, highlighting their implementation, performance, and respective challenges, and explores the interplay between noise performance, capacitance, and bandwidth across diverse TIA configurations. Emphasis is placed on characterizing the noise response under varying parasitic capacitance and operating frequencies, a systematic evaluation not extensively addressed in prior literature, while also considering limits on the allowable input-current compliance range. The review also compares the widely used Axopatch 200B system to the designs reported in the literature. The findings offer valuable insights into optimizing TIA designs for enhanced signal integrity in high-speed, high-sensitivity applications, focusing on noise reduction, impedance matching, DC blocking, and offset cancellation techniques.
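To ground the noise-capacitance-bandwidth interplay, the sketch below evaluates the textbook input-referred noise terms of a resistive-feedback TIA: feedback-resistor thermal noise plus the amplifier voltage noise acting across the input capacitance, which grows with frequency. Component values are illustrative, not taken from any reviewed design.

```python
import numpy as np

# Textbook TIA noise terms (not any specific reviewed design's model).
k_B, T = 1.380649e-23, 300.0
R_f = 500e6        # feedback resistor (ohm), typical for pA-scale currents
C_in = 5e-12       # total input parasitic capacitance (F), assumed
e_n = 3e-9         # amplifier input voltage noise (V/sqrt(Hz)), assumed

f = np.logspace(2, 6, 200)                   # 100 Hz .. 1 MHz
i_thermal = np.full_like(f, np.sqrt(4 * k_B * T / R_f))
i_cap = 2 * np.pi * f * C_in * e_n           # rises with f: the e_n*C limit
i_total = np.sqrt(i_thermal**2 + i_cap**2)   # A/sqrt(Hz), input-referred
print(f"noise at 10 kHz: {i_total[np.argmin(abs(f - 1e4))]:.2e} A/rtHz")
```

The `i_cap` term is why larger parasitic capacitance degrades high-frequency noise even when the feedback resistor is unchanged, the trade-off the review examines across TIA configurations.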
As genome sequencing finds utility in a wide variety of domains beyond the confines of traditional medical settings, its computational pipeline faces two significant challenges. First, the creation of up to 0.5 GB of data per minute imposes substantial communication and storage overheads. Second, the sequencing pipeline is bottlenecked at the basecalling step, which consumes >40% of genome analysis time. A range of proposals have attempted to address these challenges, with limited success. We propose to address these challenges with a Compute-in-Memory Basecalling Accelerator (CiMBA), the first embedded (~25 mm²) accelerator capable of real-time, on-device basecalling, coupled with AnaLog (AL)-Dorado, a new family of analog-focused basecalling DNNs. Our resulting hardware/software co-design greatly reduces data communication overhead and is capable of a throughput of 4.77 million bases per second, 24x that required for real-time operation, achieving 17x/27x power/area efficiency over the best prior embedded basecalling accelerator while maintaining accuracy comparable to state-of-the-art software basecallers.
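A quick arithmetic check of the stated figures, deriving the real-time basecalling rate implied by the 24x claim:

```python
# Back-of-the-envelope check of the throughput numbers quoted above.
cimba_bases_per_s = 4.77e6
speedup_over_realtime = 24
realtime_requirement = cimba_bases_per_s / speedup_over_realtime
print(f"implied real-time basecalling rate: {realtime_requirement:,.0f} bases/s")
# -> roughly 2.0e5 bases/s needed to keep pace with the sequencer
```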
The ability to quickly and accurately identify microbial species in a sample, known as metagenomic profiling, is critical across various fields, from healthcare to environmental science. This paper introduces a novel method to profile the signals coming from sequencing devices while simultaneously determining their nucleotide sequences (a process known as basecalling), via a multi-objective deep neural network for simultaneous basecalling and multi-class genome classification. We introduce a new loss strategy in which the basecalling and classification losses are back-propagated separately, with model weights combined for the shared layers, and a pre-configured ranking strategy that enables top-K species accuracy, giving users the flexibility to choose between higher accuracy and higher speed at identifying the species. We achieve state-of-the-art basecalling accuracies, while classification accuracies meet and exceed the results of state-of-the-art binary classifiers, attaining an average of 92.5%/98.9% accuracy at identifying the top-1/3 species among a total of 17 genomes in the Wick bacterial dataset. The work presented here has implications for future studies in metagenomic profiling by accelerating the bottleneck step of matching the DNA sequence to the correct genome.
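A minimal sketch of the separate-backpropagation strategy: one shared trunk, a basecalling head, and a classification head, with each loss backpropagated on its own so gradients accumulate in the shared layers. Shapes, layer choices, and label formats are invented for illustration.

```python
import torch
import torch.nn as nn

# Toy shared trunk plus two task heads; sizes are illustrative only.
trunk = nn.Sequential(nn.Conv1d(1, 32, 9, padding=4), nn.ReLU(),
                      nn.AdaptiveAvgPool1d(64), nn.Flatten())
base_head = nn.Linear(32 * 64, 5 * 200)   # 200 positions x 5 symbols (toy)
cls_head = nn.Linear(32 * 64, 17)         # 17 genomes, as in the Wick dataset

opt = torch.optim.Adam([*trunk.parameters(), *base_head.parameters(),
                        *cls_head.parameters()], lr=1e-3)
signal = torch.randn(8, 1, 4000)          # raw nanopore-like signal
bases = torch.randint(0, 5, (8, 200))     # toy base labels
species = torch.randint(0, 17, (8,))      # genome labels

feats = trunk(signal)
loss_base = nn.functional.cross_entropy(
    base_head(feats).view(8, 5, 200), bases)
loss_cls = nn.functional.cross_entropy(cls_head(feats), species)
loss_base.backward(retain_graph=True)  # separate backward passes; grads
loss_cls.backward()                    # simply accumulate in shared layers
opt.step()

# Top-K species call, trading per-species precision for earlier answers:
with torch.no_grad():
    topk = cls_head(trunk(signal)).topk(3, dim=-1).indices
```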
Motivation: Viruses represent the most abundant biological entities on the planet and play vital roles in diverse ecosystems. Cataloging viruses across various environments is essential for understanding their properties and functions. Metagenomic sequencing has emerged as the most comprehensive method for virus discovery, enabling the sequencing of all genetic material, including viruses, from host or environmental samples. However, distinguishing viral sequences from the vast background of reads derived from cellular organisms in metagenomic data remains a significant challenge. While several learning-based tools, such as VirSorter2 and geNomad, have shown promise in identifying viral contigs, they exhibit varying false-positive rates due to noise in sequencing and assembly, genes shared between viruses and their hosts, and the formation of proviruses within host genomes. This highlights the urgent need for an accurate and efficient method to evaluate the quality of viral contigs. Results: To address these challenges, we introduce ViralQC, a tool designed to assess the quality of reported viral contigs or bins. ViralQC identifies contamination regions within putative viral sequences using foundation models trained on viral and cellular genomes, and estimates viral completeness through protein organization alignment. We evaluate ViralQC on multiple datasets and compare its performance against CheckV, the state of the art in virus quality assessment. Notably, ViralQC correctly identifies 38% more contamination than CheckV while maintaining a median absolute error of only 3%. In addition, ViralQC delivers more accurate results for medium- to high-quality (>50% completeness) contigs, demonstrating its superior performance in completeness estimation.
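As an illustration of contamination-region reporting only (not ViralQC's actual scoring), the sketch below thresholds per-window viral probabilities along a contig and extracts runs of host-like windows:

```python
import numpy as np

# Simulated per-window "viral" probabilities along one contig; in practice
# these would come from a trained model, not a random generator.
rng = np.random.default_rng(1)
viral_prob = np.concatenate([rng.uniform(0.7, 1.0, 40),   # viral segment
                             rng.uniform(0.0, 0.3, 15),   # host-like insert
                             rng.uniform(0.7, 1.0, 45)])

host_like = viral_prob < 0.5
regions, start = [], None
for i, flag in enumerate(host_like):
    if flag and start is None:
        start = i
    elif not flag and start is not None:
        regions.append((start, i)); start = None
if start is not None:
    regions.append((start, len(host_like)))

window = 500  # bp per window, a hypothetical resolution
print([(s * window, e * window) for s, e in regions])  # contamination spans
```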
Explainability is necessary for many tasks in biomedical research. Recent explainability methods have focused on attention, gradients, and Shapley values. These approaches do not handle data with strong associated prior knowledge, and they fail to constrain explainability results using known relationships between predictive features. We propose GraphPINE, a graph neural network (GNN) architecture that leverages domain-specific prior knowledge to initialize node importance scores, which are then optimized during training, for drug response prediction. Typically, a manual post-prediction step examines the literature (i.e., prior knowledge) to understand the returned predictive features. While node importance can be obtained from gradient- and attention-based methods after prediction, such importance scores lack complementary prior knowledge; GraphPINE seeks to overcome this limitation. GraphPINE differs from other GNN gating methods by utilizing an LSTM-like sequential format. We introduce an importance propagation layer that 1) jointly updates the feature matrix and node importance and 2) propagates feature values through the graph via the GNN. This initialization and updating mechanism allows for informed feature learning and improved graph representation. We apply GraphPINE to cancer drug response prediction using drug screening and gene data collected for over 5,000 gene nodes in a gene-gene graph, with a drug-target interaction (DTI) graph providing the initial importance. The gene-gene graph and DTIs were obtained from curated sources and weighted by the number of articles discussing relationships between drugs and genes. GraphPINE achieves a PR-AUC of 0.894 and a ROC-AUC of 0.796 across 952 drugs. Code is available at https://anonymous.4open.science/r/GraphPINE-40DE.
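A minimal sketch of an importance-propagation-style layer, following the description above rather than the released code: node importance gates the messages and is itself propagated alongside the features.

```python
import torch
import torch.nn as nn

class ImportancePropagation(nn.Module):
    """Sketch of an LSTM-like gated update that jointly refreshes node
    features and node-importance scores via neighborhood aggregation."""
    def __init__(self, d):
        super().__init__()
        self.msg = nn.Linear(d, d)
        self.gate = nn.Linear(2 * d, 1)   # decides how much update flows

    def forward(self, x, imp, adj):
        # x: (n, d) features; imp: (n, 1) importance; adj: (n, n) normalized
        agg = adj @ self.msg(x * imp)              # importance-weighted messages
        g = torch.sigmoid(self.gate(torch.cat([x, agg], dim=-1)))
        x_new = (1 - g) * x + g * torch.tanh(agg)  # gated feature update
        imp_new = (1 - g) * imp + g * (adj @ imp)  # propagate importance too
        return x_new, imp_new

n, d = 6, 16
adj = torch.softmax(torch.randn(n, n), dim=-1)  # toy normalized graph
x = torch.randn(n, d)
imp = torch.rand(n, 1)                          # init from DTI evidence in the paper
for layer in [ImportancePropagation(d) for _ in range(2)]:
    x, imp = layer(x, imp, adj)
```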
Long-range dependencies are critical for understanding genomic structure and function, yet most conventional methods struggle with them. Widely adopted transformer-based models, while excelling at short-context tasks, are limited by the quadratic computational complexity of the attention module and by an inability to extrapolate to sequences longer than those seen in training. In this work, we explore State Space Models (SSMs) as a promising alternative by benchmarking two SSM-inspired architectures, Caduceus and Hawk, on long-range genomics modeling tasks under conditions matched to a 50M-parameter transformer baseline. We find that SSMs match transformer performance and exhibit impressive zero-shot extrapolation across multiple tasks, handling contexts 10 to 100 times longer than those seen during training, indicating more generalizable representations better suited to modeling the long and complex human genome. Moreover, we demonstrate that these models can efficiently process sequences of 1M tokens on a single GPU, allowing entire genomic regions to be modeled at once, even in labs with limited compute. Our findings establish SSMs as efficient and scalable models for long-context genomic analysis.
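For intuition, here is a minimal diagonal linear state-space recurrence, the core primitive that architectures like Hawk build gating and further structure on top of; its per-step cost is constant, so total cost grows linearly with sequence length.

```python
import torch

# Minimal diagonal linear SSM scan; a sketch of the primitive, not any
# published architecture. Per-channel decay a in (0,1) keeps it stable.
def ssm_scan(u, log_a, b, c):
    a = torch.sigmoid(log_a)             # (d_state,) state decay rates
    h = torch.zeros(log_a.shape[0])
    ys = []
    for u_t in u:                        # recurrent form; O(T) time
        h = a * h + b @ u_t              # state update
        ys.append(c @ h)                 # readout
    return torch.stack(ys)

T, d_in, d_state = 1024, 8, 32
u = torch.randn(T, d_in)                 # e.g., embedded DNA tokens
log_a = torch.randn(d_state)
b = torch.randn(d_state, d_in) / d_in**0.5
c = torch.randn(4, d_state) / d_state**0.5
y = ssm_scan(u, log_a, b, c)             # (T, 4) outputs
```

Because inference carries only the fixed-size state `h` forward, the same recurrence runs unchanged on sequences far longer than those seen in training, which is what makes zero-shot length extrapolation plausible.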
OLAF (Open Life Science Analysis Framework) is an open-source platform that enables researchers to perform bioinformatics analyses using natural language. By combining large language models (LLMs) with a modular agent-pipe-router architecture, OLAF generates and executes bioinformatics code on real scientific data, including formats like .h5ad. The system includes an Angular front end and a Python/Firebase backend, allowing users to run analyses such as single-cell RNA-seq workflows, gene annotation, and data visualization through a simple web interface. Unlike general-purpose AI tools, OLAF integrates code execution, data handling, and scientific libraries in a reproducible, user-friendly environment. It is designed to lower the barrier to computational biology for non-programmers and support transparent, AI-powered life science research.
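A toy illustration of an agent-pipe-router layout; the pipe names, the keyword router, and the file path are all invented and do not mirror OLAF's actual modules.

```python
# Conceptual sketch only: a router dispatching natural-language requests
# to analysis "pipes". All names here are hypothetical.
from typing import Callable

PIPES: dict[str, Callable[[str], str]] = {
    "qc": lambda path: f"ran QC workflow on {path}",
    "annotation": lambda path: f"annotated clusters in {path}",
    "visualization": lambda path: f"rendered UMAP for {path}",
}

def route(user_request: str) -> str:
    """A real router would ask the LLM to pick a pipe and emit code;
    a keyword lookup stands in for that call here."""
    for name, pipe in PIPES.items():
        if name in user_request.lower():
            return pipe("data/sample.h5ad")
    return "no matching pipeline; falling back to code generation"

print(route("Please run cluster annotation on my dataset"))
```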
Essential life processes take place across multiple space and time scales in living organisms, but understanding their mechanistic interactions remains an ongoing challenge. Advanced multiscale modeling techniques are providing new opportunities and insights into these complex processes. In cells, meters of chromatin are folded into a nucleus with a diameter on the order of microns. The three-dimensional chromatin structure, coupled with biochemical processes that turn genes on or off, specifies a given cell type through a complicated set of interactions collectively referred to as epigenetics. Important epigenetic processes include the differential accessibility of genomic loci to transcription factors and chemical modifications to DNA and to DNA-binding molecules such as histones. The dynamics of these epigenetic processes span timescales from milliseconds to years. How do chemical modifications consisting of a handful of atoms cooperate to modulate genome folding at the scale of the nucleus and impact organism outcomes? In this review, we highlight the inherently multiscale nature of chromatin organization, with a focus on computational modeling to bridge the gaps in our understanding of biochemical processes across scales. We review relevant chromatin biology, including the major types of epigenetic modifications as well as higher-order chromatin structures, to present a multiscale view of chromatin. We also review relevant computational methods for simulating chromatin structure, function, and dynamics, as well as the experimental techniques that inform and validate such models. Finally, we argue that multiscale modeling provides a path forward towards understanding emergent behavior in this inherently multiscale system.
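As a taste of the coarse-grained modeling the review discusses, here is a minimal bead-spring polymer sketch of a chromatin fiber evolved with overdamped Langevin dynamics; parameters are in reduced units and purely illustrative of the approach.

```python
import numpy as np

# Minimal bead-spring (Rouse-like) chromatin fiber: harmonic bonds between
# neighboring beads plus thermal noise, integrated with Euler-Maruyama.
rng = np.random.default_rng(42)
n_beads, k_spring, dt, kT, n_steps = 100, 3.0, 1e-3, 1.0, 5000
pos = np.cumsum(rng.normal(scale=0.5, size=(n_beads, 3)), axis=0)

for _ in range(n_steps):
    bond = np.zeros_like(pos)
    bond[:-1] += k_spring * (pos[1:] - pos[:-1])   # pull toward next bead
    bond[1:] += k_spring * (pos[:-1] - pos[1:])    # pull toward previous
    noise = rng.normal(scale=np.sqrt(2 * kT * dt), size=pos.shape)
    pos += dt * bond + noise                        # overdamped step

r_ee = np.linalg.norm(pos[-1] - pos[0])             # end-to-end distance
print(f"end-to-end distance after relaxation: {r_ee:.2f} (reduced units)")
```

Real chromatin models layer epigenetic-state-dependent interactions, confinement, and loop extrusion on top of this backbone, which is where the multiscale coupling discussed above enters.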
Viral mutations pose significant threats to public health by increasing infectivity, strengthening vaccine resistance, and altering disease severity. To track these evolving patterns, agencies such as the CDC annually evaluate thousands of virus strains, underscoring the urgent need to understand viral mutagenesis and evolution in depth. In this study, we integrate genomic analysis, clustering, and three leading dimensionality reduction approaches, namely principal component analysis (PCA), t-distributed stochastic neighbor embedding (t-SNE), and uniform manifold approximation and projection (UMAP), to investigate the effects of COVID-19 on influenza virus propagation. By applying these methods to extensive pre- and post-pandemic influenza datasets, we reveal how selective pressures during the pandemic have shaped influenza genetic diversity. Our findings indicate that combining robust dimensionality reduction with clustering yields critical insights into the complex dynamics of viral mutation, informing both future research directions and strategies for public health intervention.
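A minimal sketch of the analysis pattern described above, using scikit-learn for PCA/t-SNE and k-means on stand-in sequence features (UMAP would slot in via the umap-learn package):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Stand-in for k-mer or mutation-count features of influenza sequences.
rng = np.random.default_rng(7)
X = np.vstack([rng.normal(m, 1.0, size=(150, 64)) for m in (0, 3, 6)])

X_pca = PCA(n_components=2).fit_transform(X)
X_tsne = TSNE(n_components=2, perplexity=30).fit_transform(X)
# UMAP works analogously via umap-learn, if installed:
#   import umap; X_umap = umap.UMAP(n_components=2).fit_transform(X)

clusters = KMeans(n_clusters=3, n_init=10).fit_predict(X_pca)
# Comparing cluster memberships across pre-/post-pandemic subsets then
# reduces to contingency tables over `clusters`.
```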
In primates, loci associated with adaptive trait variation often fall in non-coding regions. Understanding the mechanisms linking these regulatory variants to fitness-relevant phenotypes remains challenging, but can be addressed using functional genomic data. However, such data are rarely generated at scale in non-human primates. When they are, only select tissues, cell types, developmental stages, and cellular environments are typically considered, despite appreciation that adaptive variants often exhibit context-dependent effects. In this review, we 1) discuss why context-dependent regulatory loci might be especially evolutionarily relevant in primates, 2) explore challenges and emerging solutions for mapping such context-dependent variation, and 3) discuss the scientific questions these data could address. We argue that filling this gap will provide critical insights into evolutionary processes, human disease, and regulatory adaptation.
Background: Esophageal cancer poses a significant global health challenge, with the incidence of esophageal adenocarcinoma (EAC), a predominant subtype, increasing notably in Western countries. Cathepsins, a family of lysosomal proteolytic enzymes, have been implicated in the progression of various tumors. However, the causal relationship between the cathepsin family and EAC remains unresolved. Methods: To evaluate these potential causal associations, we conducted integrative analyses combining Mendelian randomization (MR), transcriptome-wide association study (TWAS), single-cell RNA sequencing (scRNA-seq), and single-cell expression quantitative trait locus (sc-eQTL) analyses. Results: Univariable and multivariable MR analyses demonstrated that elevated levels of cathepsin B (CTSB) were associated with a reduced risk of EAC. The TWAS analysis identified a negative association between CTSB expression in esophageal tissue and EAC, consistent with experimental validation using immunohistochemistry. Analysis of the scRNA-seq data indicated that CTSB expression was predominantly localized to macrophages infiltrating EAC. Colocalization analysis incorporating macrophage-specific sc-eQTL data confirmed a shared causal variant between CTSB and macrophages. Additionally, MR analysis of CTSB and macrophage scavenger receptor (MSR) types I and II established their interrelationship, suggesting that CTSB may influence the proinflammatory phenotype of macrophages and thereby affect EAC risk. Conclusions: This integrative analysis, utilizing MR, TWAS, scRNA-seq, and sc-eQTL data, identified a significant causal association between CTSB and EAC, potentially mediated through macrophage MSR regulation. These findings suggest that targeting cathepsin B could represent a novel strategy for the diagnosis and treatment of EAC.
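For readers unfamiliar with MR, the sketch below computes the standard inverse-variance-weighted (IVW) two-sample estimate from per-SNP summary statistics; the numbers are simulated and unrelated to the study's data.

```python
import numpy as np

# IVW two-sample MR: fixed-effect combination of per-SNP Wald ratios.
rng = np.random.default_rng(3)
n_snps = 20
beta_x = rng.normal(0.1, 0.02, n_snps)      # SNP -> exposure (e.g., CTSB levels)
true_effect = -0.4                          # toy exposure -> outcome effect
beta_y = true_effect * beta_x + rng.normal(0, 0.01, n_snps)  # SNP -> outcome
se_y = np.full(n_snps, 0.01)                # outcome-association standard errors

w = beta_x**2 / se_y**2                     # inverse-variance weights
beta_ivw = np.sum(w * (beta_y / beta_x)) / np.sum(w)
se_ivw = 1.0 / np.sqrt(np.sum(w))
print(f"IVW estimate: {beta_ivw:.3f} +/- {1.96 * se_ivw:.3f}")
```

A negative estimate here would correspond to the protective direction reported above; multivariable MR and colocalization then probe whether that signal is confounded or shared with other traits.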
Epigenetics encompasses mechanisms that can alter the expression of genes without changing the underlying genetic sequence. The epigenetic regulation of gene expression is initiated and sustained by several mechanisms, such as DNA methylation, histone modifications, chromatin conformation, and non-coding RNA. Changes in gene regulation and expression can manifest as various diseases and disorders, such as cancer and congenital deformities. Over the last few decades, high-throughput experimental approaches have been used to identify and understand epigenetic changes, but these laboratory approaches and biochemical processes are time-consuming and expensive. To overcome these challenges, machine learning and artificial intelligence (AI) approaches have been extensively used to map epigenetic modifications to their phenotypic manifestations. In this paper, we provide a narrative review of published research on AI models trained on epigenomic data to address a variety of problems, such as the prediction of disease markers, gene expression, enhancer-promoter interactions, and chromatin states. The purpose of this review is twofold, as it is addressed to both AI experts and epigeneticists. For AI researchers, we provide a taxonomy of epigenetics research problems that can benefit from an AI-based approach. For epigeneticists, we provide, for each of the above problems, a list of candidate AI solutions from the literature. We have also identified several gaps in the literature, along with research challenges and recommendations to address them.