DNA helicases undergo conformational changes; however, their structural dynamics are poorly understood. Here, we study single molecules of the superfamily 1A DNA helicase Rep, which undergo conformational transitions during bacterial DNA replication, repair and recombination. We use time-correlated single-photon counting (TCSPC), fluorescence correlation spectroscopy (FCS), rapid single-molecule Förster resonance energy transfer (smFRET), Anti-Brownian ELectrokinetic (ABEL) trapping and molecular dynamics simulations (MDS) to provide unparalleled temporal and spatial resolution of Rep's domain movements. We detect four states, revealing two hitherto hidden intermediates (S2, S3) between the open (S1) and closed (S4) structures, whose stability is salt dependent. Rep's open-to-closed switch involves multiple changes to all four subdomains 1A, 1B, 2A and 2B along the S1 to S2 to S3 to S4 transitional pathway, comprising an initial truncated swing of 2B, which then rolls across the 1B surface, followed by combined rotations of 1B, 2A and 2B. High forward and reverse rates for S1 to S2 suggest that 1B may act to frustrate 2B movement to prevent premature Rep closure in the absence of DNA. These observations support a more general binding model for accessory DNA helicases that utilises conformational plasticity to explore a multiplicity of structures whose landscape can be tuned by salt prior to locking in upon DNA binding.
The remarkable progress of artificial intelligence (AI) has revealed the enormous energy demands of modern digital architectures, raising deep concerns about sustainability. In stark contrast, the human brain operates efficiently on only ~20 watts, and individual cells process gigabit-scale genetic information using power on the order of trillionths of a watt. Under the same power budget, a general-purpose digital processor can perform only a few simple operations per second. This striking disparity suggests that biological systems follow algorithms fundamentally distinct from conventional computation. The framework of information thermodynamics, especially Maxwell's demon and the Szilard engine, offers a theoretical clue by setting the lower bound of energy required for information processing. However, digital processors exceed this limit by about six orders of magnitude. Recent single-molecule studies have revealed that biological molecular motors convert Brownian motion into mechanical work, realizing a "demon-like" operational principle. These findings suggest that living systems have already implemented an ultra-efficient information-energy conversion mechanism that transcends digital computation. Here, we experimentally establish a quantitative correspondence between positional information (bits) and mechanical work, demonstrating that molecular machines selectively exploit rare but functional fluctuations arising from Brownian motion to achieve ATP-level energy efficiency. This integration of information, energy, and timescale indicates that life realizes a Maxwell's demon-like mechanism for energy-efficient information processing.
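For context on the lower bound the abstract above invokes: the Szilard/Landauer limit for processing one bit, evaluated at roughly room temperature (the choice of $T = 300\,\mathrm{K}$ is our illustration, not a number from the paper), is

$$ W_{\min} = k_B T \ln 2 \approx (1.38\times10^{-23}\,\mathrm{J\,K^{-1}})(300\,\mathrm{K})(0.693) \approx 2.9\times10^{-21}\,\mathrm{J\ per\ bit}. $$

Switching energies in digital logic are typically on the femtojoule scale per bit operation, which is roughly $10^{6}$ times this bound and consistent with the "about six orders of magnitude" figure quoted in the abstract.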
How proteins fold remains a central unsolved problem in biology. Although the idea of a folding code embedded in the amino acid sequence was introduced more than six decades ago, this code remains undefined. While we now have powerful tools to predict the final native structure of proteins, we still lack a predictive framework for how sequences dictate folding pathways. Two main conceptual models dominate as explanations of the folding mechanism: the funnel model, in which folding proceeds through many alternative routes on a rugged, hyperdimensional energy landscape; and the foldon model, which proposes a hierarchical sequence of discrete intermediates. Recent advances on two fronts are now enabling folding studies in unprecedented ways. Powerful experimental approaches, in particular single-molecule force spectroscopy and hydrogen-deuterium exchange assays, allow time-resolved tracking of the folding process at high resolution. At the same time, computational breakthroughs culminating in algorithms such as AlphaFold have revolutionized static structure prediction, opening opportunities to extend machine learning toward dynamics. Together, these developments mark a turning point: for the first time, we are positioned to resolve how proteins fold, why they misfold, and how this knowledge can be harnessed for biology and medicine.
Aggregated Markov models provide a flexible framework for stochastic dynamics that develop on multiple timescales. For example, Markov models for ion channels often consist of multiple open and closed states to account for "slow" and "fast" openings and closings of the channel. The approach is a popular tool in the construction of mechanistic models of ion channels: instead of viewing model states as generators of sojourn times of a certain characteristic length, each individual model state is interpreted as a representation of a distinct biophysical state. We will review the properties of aggregated Markov models and discuss the implications for mechanistic modelling. First, we show how the aggregated Markov models with a given number of states can be enumerated using Pólya enumeration. However, models with $n_O$ open and $n_C$ closed states whose number of parameters exceeds the maximum of $2 n_O n_C$ are non-identifiable. We will present two derivations of this classical result and investigate non-identifiability further via a detailed analysis of the non-identifiable fully connected three-state model. Finally, we will discuss the implications of non-identifiability for mechanistic modelling of ion channels. We will argue that instead of designing models based on assumed transitions between distinct biophysical states which are modulated by ligand binding, it is preferable to build models based on additional sources of data that give more direct insight into the dynamics of conformational changes.
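A concrete instance of the parameter-counting bound quoted above: the fully connected three-state model has one state in one aggregate and two in the other, so

$$ \text{number of rate constants} = n(n-1) = 3\cdot 2 = 6, \qquad \text{identifiable parameters} \le 2\,n_O n_C = 2\cdot 1\cdot 2 = 4, $$

so at least two of the six rates cannot be recovered from idealized open/closed records alone, which is why this model serves as the canonical non-identifiable example in the abstract.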
Designing enzyme backbones with substrate-specific functionality is a critical challenge in computational protein engineering. Current generative models excel in protein design but face limitations in binding data, substrate-specific control, and flexibility for de novo enzyme backbone generation. To address this, we introduce EnzyBind, a dataset with 11,100 experimentally validated enzyme-substrate pairs specifically curated from PDBbind. Building on this, we propose EnzyControl, a method that enables functional and substrate-specific control in enzyme backbone generation. Our approach generates enzyme backbones conditioned on MSA-annotated catalytic sites and their corresponding substrates, which are automatically extracted from curated enzyme-substrate data. At the core of EnzyControl is EnzyAdapter, a lightweight, modular component integrated into a pretrained motif-scaffolding model, allowing it to become substrate-aware. A two-stage training paradigm further refines the model's ability to generate accurate and functional enzyme structures. Experiments show that our EnzyControl achieves the best performance across structural and functional metrics on EnzyBind and EnzyBench benchmarks, with particularly notable improvements of 13% in designability and 13% in catalytic efficiency compared to the baseline models. The code is released at https://github.com/Vecteur-libre/EnzyControl.
Hydrogen peroxide oxidises cysteine residues to control protein function, yet bulk rate constants predict hours for changes that occur in cells in seconds. Here, this work shows that local electromagnetic fields (EMFs), ubiquitous in proteins, membranes and nanodomains, can lawfully modulate the Eyring barrier and orientate reactants, accelerating cysteine oxidation without changing the underlying chemistry. Embedding a field term into the Eyring expression demonstrates that plausible local EMFs, coupled to realistic dipole changes, accelerate rate constants by orders of magnitude. This local acceleration reconciles the discrepancy between predicted and observed rates of H2O2-mediated cysteine oxidation. The framework generates falsifiable predictions, for example that vibrational Stark readouts in thiolate-peroxide complexes should fall within predicted ranges, and reframes rate constants as mutable, field-conditioned parameters. Cysteine redox sensing is fast not because the chemistry is exotic, but because the physics is local.
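The abstract above does not restate the modified expression; a minimal sketch of the kind of field term it describes, assuming a linear coupling between a local field $\vec{E}$ and the reactant-to-transition-state dipole change $\Delta\vec{\mu}$ (both the linear form and the numbers below are illustrative assumptions, not taken from the paper), is

$$ k(E) = \frac{k_B T}{h}\exp\!\left(-\frac{\Delta G^{\ddagger} - \Delta\vec{\mu}\cdot\vec{E}}{k_B T}\right) = k(0)\,\exp\!\left(\frac{\Delta\vec{\mu}\cdot\vec{E}}{k_B T}\right), $$

so a field-dipole coupling of only a few $k_B T$ (e.g. $\Delta\vec{\mu}\cdot\vec{E}\approx 5\,k_B T$, giving $e^{5}\approx 150$) already produces the orders-of-magnitude rate accelerations the abstract refers to, without altering the chemistry encoded in $\Delta G^{\ddagger}$.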
We investigate a relatively underexplored class of hybrid neurosymbolic models integrating symbolic learning with neural reasoning to construct data generators meeting formal correctness criteria. In \textit{Symbolic Neural Generators} (SNGs), symbolic learners examine logical specifications of feasible data from a small set of instances -- sometimes just one. Each specification in turn constrains the conditional information supplied to a neural-based generator, which rejects any instance violating the symbolic specification. Like other neurosymbolic approaches, SNG exploits the complementary strengths of symbolic and neural methods. The outcome of an SNG is a triple $(H, X, W)$, where $H$ is a symbolic description of feasible instances constructed from data, $X$ a set of generated new instances that satisfy the description, and $W$ an associated weight. We introduce a semantics for such systems, based on the construction of appropriate \textit{base} and \textit{fibre} partially-ordered sets combined into an overall partial order, and outline a probabilistic extension relevant to practical applications. In this extension, SNGs result from searching over a weighted partial ordering. We implement an SNG combining a restricted form of Inductive Logic Programming (ILP) with a large language model (LLM) and evaluate it on early-stage drug design. Our main interest is the description and the set of potential inhibitor molecules generated by the SNG. On benchmark problems -- where drug targets are well understood -- SNG performance is statistically comparable to state-of-the-art methods. On exploratory problems with poorly understood targets, generated molecules exhibit binding affinities on par with leading clinical candidates. Experts further find the symbolic specifications useful as preliminary filters, with several generated molecules identified as viable for synthesis and wet-lab testing.
Biomolecular interactions underpin almost all biological processes, and their rational design is central to programming new biological functions. Generative AI models have emerged as powerful tools for molecular design, yet most remain specialized for individual molecular types and lack fine-grained control over interaction details. Here we present ODesign, an all-atom generative world model for all-to-all biomolecular interaction design. ODesign allows scientists to specify epitopes on arbitrary targets and generate diverse classes of binding partners with fine-grained control. Across entity-, token-, and atom-level benchmarks in the protein modality, ODesign demonstrates superior controllability and performance to modality-specific baselines. Extending beyond proteins, it generalizes to nucleic acid and small-molecule design, enabling interaction types such as protein-binding RNA/DNA and RNA/DNA-binding ligands that were previously inaccessible. By unifying multimodal biomolecular interactions within a single generative framework, ODesign moves toward a general-purpose molecular world model capable of programmable design. ODesign is available at https://odesign.lglab.ac.cn.
Autophagy and migrasome formation constitute critical cellular mechanisms for maintaining cellular homeostasis; however, their potential compensatory interplay remains poorly understood. In this study, we identify VPS39, a core component of the HOPS complex, as a molecular switch coordinating these processes. Genetic ablation of VPS39 not only impairs autophagic flux but also triggers cell migration through upregulation of RhoA/Rac1 GTPases, consequently facilitating migrasome formation. Using super-resolution microscopy, we further demonstrate that migrasomes serve as an alternative disposal route for damaged mitochondria during the autophagy impairment caused by VPS39 loss, revealing a novel stress adaptation mechanism. Our work establishes a previously unrecognized autophagy-migrasome axis and provides direct visual evidence of organelle quality control via migrasomal extrusion. These findings position VPS39-regulated pathway switching as a potential therapeutic strategy for neurodegenerative diseases characterized by autophagy dysfunction.
Designing RNA sequences that reliably adopt specified three-dimensional structures while maintaining thermodynamic stability remains challenging for synthetic biology and therapeutics. Current inverse folding approaches optimize for sequence recovery or single structural metrics, failing to simultaneously ensure global geometry, local accuracy, and ensemble stability: three interdependent requirements for functional RNA design. This gap becomes critical when designed sequences encounter dynamic biological environments. We introduce RiboPO, a Ribonucleic acid Preference Optimization framework that addresses this multi-objective challenge through reinforcement learning from physical feedback (RLPF). RiboPO fine-tunes gRNAde by constructing preference pairs from composite physical criteria that couple global 3D fidelity and thermodynamic stability. Preferences are formed using structural gates, pLDDT geometry assessments, and thermostability proxies with variability-aware margins, and the policy is updated with Direct Preference Optimization (DPO). On RNA inverse folding benchmarks, RiboPO demonstrates a superior balance of structural accuracy and stability. Compared to the best non-overlapping baselines, our multi-round model improves Minimum Free Energy (MFE) by 12.3% and increases secondary-structure self-consistency (EternaFold scMCC) by 20%, while maintaining competitive 3D quality and high sequence diversity. In sampling efficiency, RiboPO achieves 11% higher pass@64 than the gRNAde base model under the conjunction of multiple requirements. A multi-round variant with preference-pair reconstruction delivers additional gains on unseen RNA structures. These results establish RLPF as an effective paradigm for structure-accurate and ensemble-robust RNA design, providing a foundation for extending to complex biological objectives.
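For reference, the Direct Preference Optimization step mentioned above uses the standard DPO objective (the abstract does not restate it; the formula below is the generic form, not paper-specific notation). For a target structure $x$ with preferred/dispreferred sequences $y_w, y_l$, reference policy $\pi_{\mathrm{ref}}$ and temperature $\beta$:

$$ \mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} - \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right]. $$

In RiboPO the $(y_w, y_l)$ pairs come from the composite physical criteria described above (structural gates, pLDDT geometry assessments, thermostability proxies with variability-aware margins) rather than from human preference labels.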
The rapid progress of graph generation has raised new security concerns, particularly regarding backdoor vulnerabilities. While prior work has explored backdoor attacks in image diffusion and unconditional graph generation, conditional generation, especially text-guided graph generation, remains largely unexamined. This paper proposes BadGraph, a backdoor attack method targeting latent diffusion models for text-guided graph generation. BadGraph leverages textual triggers to poison training data, covertly implanting backdoors that induce attacker-specified subgraphs during inference when triggers appear, while preserving normal performance on clean inputs. Extensive experiments on four benchmark datasets (PubChem, ChEBI-20, PCDes, MoMu) demonstrate the effectiveness and stealth of the attack: a poisoning rate of less than 10% achieves a 50% attack success rate, while 24% suffices for a success rate of over 80%, with negligible performance degradation on benign samples. Ablation studies further reveal that the backdoor is implanted during VAE and diffusion training rather than pretraining. These findings reveal security vulnerabilities in latent diffusion models for text-guided graph generation, highlight serious risks in applications such as drug discovery, and underscore the need for robust defenses against backdoor attacks in such diffusion models.
Understanding the flexibility of protein-nucleic acid complexes, often characterized by atomic B-factors, is essential for elucidating their structure, dynamics, and functions, such as reactivity and allosteric pathways. Traditional models such as Gaussian Network Models (GNM) and Elastic Network Models (ENM) often fall short in capturing multiscale interactions, especially in large or complex biomolecular systems. In this work, we apply the Persistent Sheaf Laplacian (PSL) framework for the B-factor prediction of protein-nucleic acid complexes. The PSL model integrates multiscale analysis, algebraic topology, combinatorial Laplacians, and sheaf theory for data representation. It reveals topological invariants in its harmonic spectra and captures the homotopic shape evolution of data with its non-harmonic spectra. Its localization enables accurate B-factor predictions. We benchmark our method on three diverse datasets, including protein-RNA and nucleic-acid-only structures, and demonstrate that PSL consistently outperforms existing models such as GNM and multiscale FRI (mFRI), achieving up to a 21% improvement in Pearson correlation coefficient for B-factor prediction. These results highlight the robustness and adaptability of PSL in modeling complex biomolecular interactions and suggest its potential utility in broader applications such as mutation impact analysis and drug design.
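The headline metric above is the Pearson correlation between predicted and experimental B-factors; a minimal way to compute it is sketched below (illustrative only, not the authors' code; the function name is ours).

```python
import numpy as np

def pearson_correlation(predicted: np.ndarray, experimental: np.ndarray) -> float:
    """Pearson correlation coefficient between predicted and experimental B-factors."""
    p = predicted - predicted.mean()
    e = experimental - experimental.mean()
    return float((p @ e) / (np.linalg.norm(p) * np.linalg.norm(e)))
```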
The extracellular matrix of biofilms presents a dense and intricate architecture. Numerous biophysical properties of the matrix surrounding microbial cells contribute to the heterogeneity of biofilms and their functions at the microscale. Previous mathematical models assume the matrix to be homogeneous, often overlooking the need for a detailed mechanistic understanding of the extracellular space. In this theoretical study, we introduce a novel cell-capsule approach to investigate geometric patterns in biofilm morphology and predict their role in oxygen transport. The thickness of the capsule and the arrangement of cell-capsule patterns can influence matrix heterogeneity, providing a clear picture of biofilm structure. By incorporating the bacterial capsule as a distinct, low-diffusivity phase, our novel cell-capsule model reveals that this architecture acts as a significant 'resistance-in-series' barrier. We found that a thick capsule/dense matrix arrangement can reduce local oxygen transfer by approximately 70%, a substantial drop that may drive further research into oxygen limitations during early-stage biofilm development.
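The 'resistance-in-series' picture invoked above has a standard steady-state diffusion form; a sketch for a two-layer (capsule plus surrounding matrix) barrier with layer thicknesses $L_i$ and diffusivities $D_i$, written for planar geometry and ignoring partitioning (both simplifications are ours, not the paper's model), is

$$ J = \frac{\Delta C}{R_{\mathrm{tot}}}, \qquad R_{\mathrm{tot}} = \frac{L_{\mathrm{capsule}}}{D_{\mathrm{capsule}}} + \frac{L_{\mathrm{matrix}}}{D_{\mathrm{matrix}}}, $$

so a thick, low-diffusivity capsule dominates $R_{\mathrm{tot}}$ and can cut the local oxygen flux $J$ by the large fractions reported.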
Machine olfaction is rapidly emerging as a transformative capability, with applications spanning non-invasive medical diagnostics, industrial monitoring, agriculture, and security and defense. Recent advances in stabilizing mammalian olfactory receptors and integrating them into biophotonic and bioelectronic systems have enabled detection at near single-molecule resolution, placing machines on par with trained detection dogs. As this technology converges with multimodal AI and distributed sensor networks imbued with embedded AI, it introduces a new, biochemical layer to a sensing ecosystem currently dominated by machine vision and audition. This review and industry roadmap surveys the scientific foundations, technological frontiers, and strategic applications of machine olfaction, making the case that we are currently witnessing the rise of a new industry that brings with it a global chemosensory infrastructure. We cover exemplary industrial, military and consumer applications and address some of the ethical and legal concerns that arise. We find that machine olfaction is poised to bring forth a planet-wide molecular-awareness technology layer with the potential to spawn vast emerging markets in health, security, and environmental sensing via scent.
Molecular large language models have garnered widespread attention due to their promising potential in molecular applications. However, current molecular large language models face significant limitations in understanding molecules due to inadequate textual descriptions and suboptimal molecular representation strategies during pretraining. To address these challenges, we introduce KnowMol-100K, a large-scale dataset with 100K fine-grained molecular annotations across multiple levels, bridging the gap between molecules and textual descriptions. Additionally, we propose a chemically informative molecular representation, effectively addressing limitations in existing molecular representation strategies. Building upon these innovations, we develop KnowMol, a state-of-the-art multi-modal molecular large language model. Extensive experiments demonstrate that KnowMol achieves superior performance across molecular understanding and generation tasks. GitHub: https://github.com/yzf-code/KnowMol Huggingface: https://hf.co/datasets/yzf1102/KnowMol-100K
The rapid evolution of molecular dynamics (MD) methods, including machine-learned dynamics, has outpaced the development of standardized tools for method validation. Objective comparison between simulation approaches is often hindered by inconsistent evaluation metrics, insufficient sampling of rare conformational states, and the absence of reproducible benchmarks. To address these challenges, we introduce a modular benchmarking framework that systematically evaluates protein MD methods using enhanced sampling analysis. Our approach uses weighted ensemble (WE) sampling via the Weighted Ensemble Simulation Toolkit with Parallelization and Analysis (WESTPA), based on progress coordinates derived from Time-lagged Independent Component Analysis (TICA), enabling fast and efficient exploration of protein conformational space. The framework includes a flexible, lightweight propagator interface that supports arbitrary simulation engines, allowing both classical force fields and machine learning-based models. Additionally, the framework offers a comprehensive evaluation suite capable of computing more than 19 different metrics and visualizations across a variety of domains. We further contribute a dataset of nine diverse proteins, ranging from 10 to 224 residues, that span a variety of folding complexities and topologies. Each protein has been extensively simulated at 300 K for one million MD steps per starting point (4 ns). To demonstrate the utility of our framework, we perform validation tests using classical MD simulations with implicit solvent and compare protein conformational sampling using a fully trained versus an under-trained CGSchNet model. By standardizing evaluation protocols and enabling direct, reproducible comparisons across MD approaches, our open-source platform lays the groundwork for consistent, rigorous benchmarking across the molecular simulation community.
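The abstract above describes a "flexible, lightweight propagator interface that supports arbitrary simulation engines"; the sketch below shows what such a plug-in point might look like. The class and method names are illustrative assumptions, not the framework's actual API, and the stand-in engine is pure noise used only to demonstrate the contract.

```python
from abc import ABC, abstractmethod
import numpy as np

class Propagator(ABC):
    """Minimal engine-agnostic propagator: advance a walker's coordinates by n_steps."""

    @abstractmethod
    def propagate(self, coords: np.ndarray, n_steps: int) -> np.ndarray:
        """Return new coordinates after n_steps of dynamics."""

class RandomWalkPropagator(Propagator):
    """Stand-in 'engine' (random walk). A real implementation would wrap a
    classical force field or an ML potential such as CGSchNet behind the same call."""

    def __init__(self, step_size: float = 0.01, seed: int = 0):
        self.step_size = step_size
        self.rng = np.random.default_rng(seed)

    def propagate(self, coords: np.ndarray, n_steps: int) -> np.ndarray:
        for _ in range(n_steps):
            coords = coords + self.step_size * self.rng.standard_normal(coords.shape)
        return coords

# The weighted-ensemble driver only ever calls propagate(), so swapping simulation
# engines amounts to swapping the Propagator subclass.
walker = np.zeros((10, 3))  # 10 pseudo-atoms, 3D coordinates
walker = RandomWalkPropagator().propagate(walker, n_steps=100)
```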
Biological machine learning is often bottlenecked by a lack of scaled data. One promising route to relieving data bottlenecks is through high throughput screens, which can experimentally test the activity of $10^6-10^{12}$ protein sequences in parallel. In this article, we introduce algorithms to optimize high throughput screens for data creation and model training. We focus on the large scale regime, where dataset sizes are limited by the cost of measurement and sequencing. We show that when active sequences are rare, we maximize information gain if we only collect positive examples of active sequences, i.e. $x$ with $y>0$. We can correct for the missing negative examples using a generative model of the library, producing a consistent and efficient estimate of the true $p(y | x)$. We demonstrate this approach in simulation and on a large scale screen of antibodies. Overall, co-design of experiments and inference lets us accelerate learning dramatically.
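One way to read the correction for missing negatives described above (our reading of the abstract, not the authors' stated derivation): if only active sequences ($y>0$) are sequenced, Bayes' rule recovers the activity probability from the positives-only density together with the library distribution $p(x)$ given by the generative model,

$$ p(y>0 \mid x) \;=\; \frac{p(x \mid y>0)\,p(y>0)}{p(x)}, $$

where $p(x\mid y>0)$ is fit to the sequenced positives, $p(x)$ comes from the generative model of the library, and the prevalence $p(y>0)$ can be estimated from the screen's overall hit rate.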
Applications of machine learning in chemistry are often limited by the scarcity and expense of labeled data, restricting traditional supervised methods. In this work, we introduce a framework for molecular reasoning using general-purpose Large Language Models (LLMs) that operates without requiring labeled training data. Our method anchors chain-of-thought reasoning to the molecular structure by using unique atomic identifiers. First, the LLM performs a one-shot task to identify relevant fragments and their associated chemical labels or transformation classes. In an optional second step, this position-aware information is used in a few-shot task with provided class examples to predict the chemical transformation. We apply our framework to single-step retrosynthesis, a task where LLMs have previously underperformed. Across academic benchmarks and expert-validated drug discovery molecules, our work enables LLMs to achieve high success rates in identifying chemically plausible reaction sites ($\geq90\%$), named reaction classes ($\geq40\%$), and final reactants ($\geq74\%$). Beyond solving complex chemical tasks, our work also provides a method to generate theoretically grounded synthetic datasets by mapping chemical knowledge onto the molecular structure and thereby addressing data scarcity.
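The "unique atomic identifiers" used above to anchor chain-of-thought reasoning are not specified further in the abstract; one common way to produce such identifiers is atom-map numbering of a SMILES string, sketched below with RDKit (the choice of RDKit and the function name are our assumptions for illustration).

```python
from rdkit import Chem

def with_atom_map_numbers(smiles: str) -> str:
    """Assign a unique map number to every atom so an LLM prompt can refer to
    specific positions (e.g. 'the carbonyl carbon [C:3]')."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles}")
    for idx, atom in enumerate(mol.GetAtoms(), start=1):
        atom.SetAtomMapNum(idx)
    return Chem.MolToSmiles(mol)

# Example: aspirin -> every atom carries an explicit identifier the prompt can cite.
print(with_atom_map_numbers("CC(=O)Oc1ccccc1C(=O)O"))
```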