Most research on query optimization has centered on binary join algorithms like hash join and sort-merge join. However, recent years have seen growing interest in theoretically optimal algorithms, notably Yannakakis' algorithm. These algorithms rely on join trees, which differ from the operator trees for binary joins and require new optimization techniques. We propose three approaches to constructing join trees for acyclic queries. First, we give an algorithm to enumerate all join trees of an alpha-acyclic query by edits with amortized constant delay, which forms the basis of a cost-based optimizer for acyclic joins. Second, we show that the Maximum Cardinality Search algorithm by Tarjan and Yannakakis constructs a unique shallowest join tree, rooted at any relation, for a Berge-acyclic query; this tree enables parallel execution of large join queries. Finally, we prove that any connected left-deep linear plan for a gamma-acyclic query can be converted into a join tree by a simple algorithm, allowing reuse of optimization infrastructure developed for binary joins.
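As background for the second contribution, the following is a minimal sketch of Maximum Cardinality Search over the primal graph of a query hypergraph, written in Python for illustration. It only produces the MCS ordering starting from a chosen relation's attributes; the paper's construction of the shallowest join tree from such an ordering is not reproduced here, and the function and variable names are illustrative.

```python
# Minimal Maximum Cardinality Search (MCS) sketch on the primal graph of a
# query hypergraph. Names (hyperedges, start) are illustrative placeholders.
from itertools import combinations

def mcs_order(hyperedges, start):
    """Return an MCS ordering of attributes, beginning at `start`."""
    # Primal graph: two attributes are adjacent if they co-occur in some
    # hyperedge (relation schema).
    adj = {}
    for edge in hyperedges:
        for u, v in combinations(edge, 2):
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
        for a in edge:
            adj.setdefault(a, set())

    visited = [start]
    remaining = set(adj) - {start}
    while remaining:
        # Pick the attribute with the most already-visited neighbours.
        nxt = max(remaining, key=lambda a: len(adj[a] & set(visited)))
        visited.append(nxt)
        remaining.remove(nxt)
    return visited

# Example: the path query R(a,b), S(b,c), T(c,d)
print(mcs_order([{"a", "b"}, {"b", "c"}, {"c", "d"}], "a"))
```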
Modern scientific discovery increasingly relies on workflows that process data across the Edge, Cloud, and High Performance Computing (HPC) continuum. Comprehensive and in-depth analyses of these data are critical for hypothesis validation, anomaly detection, reproducibility, and impactful findings. Although workflow provenance techniques support such analyses, at large scale, the provenance data become complex and difficult to analyze. Existing systems depend on custom scripts, structured queries, or static dashboards, limiting data interaction. In this work, we introduce an evaluation methodology, reference architecture, and open-source implementation that leverages interactive Large Language Model (LLM) agents for runtime data analysis. Our approach uses a lightweight, metadata-driven design that translates natural language into structured provenance queries. Evaluations across LLaMA, GPT, Gemini, and Claude, covering diverse query classes and a real-world chemistry workflow, show that modular design, prompt tuning, and Retrieval-Augmented Generation (RAG) enable accurate and insightful LLM agent responses beyond recorded provenance.
The increasing volume and complexity of X-ray absorption spectroscopy (XAS) data generated at synchrotron facilities worldwide require robust infrastructure for data management, sharing, and analysis. This paper introduces the XAS Database (XASDB), a comprehensive web-based platform developed and hosted by the Canadian Light Source (CLS). The database houses more than 1000 reference spectra spanning 40 elements and 324 chemical compounds. The platform employs a Node.js/MongoDB architecture designed to handle diverse data formats from multiple beamlines and synchrotron facilities. A key innovation is the XASproc JavaScript library, which enables browser-based XAS data processing including normalization, background subtraction, extended X-ray absorption fine structure (EXAFS) extraction, and preliminary analysis traditionally limited to desktop applications. The integrated XASVue spectral viewer provides installation-free data visualization and analysis with broad accessibility across devices and operating systems. By offering standardized data output, comprehensive metadata, and integrated analytical capabilities, XASDB facilitates collaborative research and promotes FAIR (Findable, Accessible, Interoperable, and Reusable) data principles. The platform serves as a valuable resource for linear combination fitting (LCF) analysis, machine learning applications, and educational purposes. This initiative demonstrates the potential for web-centric approaches in XAS data analysis, accelerating advances in materials science, environmental research, chemistry, and biology.
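For illustration, the sketch below shows the kind of edge-step normalization such a library performs, written here in Python/numpy rather than the XASproc JavaScript API. The region boundaries, polynomial orders, and E0 handling are simplified assumptions, not XASproc's actual parameters.

```python
# Simplified XAS edge-step normalization sketch (pre-edge line, post-edge
# polynomial, scaling to unit edge step). Parameters are illustrative only.
import numpy as np

def normalize_xas(energy, mu, e0, pre_range=(-150, -50), post_range=(100, 400)):
    """Return mu normalized to a unit edge step using pre/post-edge fits."""
    # Pre-edge: fit a straight line well below the absorption edge.
    pre_mask = (energy >= e0 + pre_range[0]) & (energy <= e0 + pre_range[1])
    pre_fit = np.polyfit(energy[pre_mask], mu[pre_mask], 1)
    pre_line = np.polyval(pre_fit, energy)

    # Post-edge: fit a low-order polynomial above the edge.
    post_mask = (energy >= e0 + post_range[0]) & (energy <= e0 + post_range[1])
    post_fit = np.polyfit(energy[post_mask], mu[post_mask], 2)

    # Edge step = difference between the two fits evaluated at E0.
    step = np.polyval(post_fit, e0) - np.polyval(pre_fit, e0)

    # Background-subtracted absorption, scaled to a unit edge step.
    return (mu - pre_line) / step
```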
In recent years, the Shapley value has emerged as a general game-theoretic measure for assessing the contribution of a tuple to the result of a database query. We study the complexity of calculating the Shapley value of a tuple for an aggregate conjunctive query, which applies an aggregation function to the result of a conjunctive query (CQ) based on a value function that assigns a number to each query answer. Prior work by Livshits, Bertossi, Kimelfeld, and Sebag (2020) established that this task is #P-hard for every nontrivial aggregation function when the query is non-hierarchical with respect to its existential variables, assuming the absence of self-joins. They further showed that this condition precisely characterizes the class of intractable CQs when the aggregate function is sum or count. In addition, they posed as open problems the complexity of other common aggregate functions such as min, max, count-distinct, average, and quantile (including median). Towards the resolution of these problems, we identify for each aggregate function a class of hierarchical CQs where the Shapley value is tractable with every value function, as long as it is local (i.e., determined by the tuples of one relation). We further show that each such class is maximal: for every CQ outside of this class, there is a local (easy-to-compute) value function that makes the Shapley value #P-hard. Interestingly, our results reveal that each aggregate function corresponds to a different generalization of the class of hierarchical CQs from Boolean to non-Boolean queries. In particular, max, min, and count-distinct match the class of CQs that are all-hierarchical (i.e., hierarchical with respect to all variables), and average and quantile match the narrower class of q-hierarchical CQs introduced by Berkholz, Keppeler, and Schweikardt (2017) in the context of the fine-grained complexity of query answering.
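For reference, the standard Shapley value used in this line of work can be stated as follows, with $A$ the set of endogenous facts and $v$ the wealth function induced by the aggregate query over a coalition together with the exogenous facts. This is the textbook formula, not a result specific to the paper above.

```latex
\[
  \mathrm{Shapley}(A, v, f) \;=\;
  \sum_{S \subseteq A \setminus \{f\}}
    \frac{|S|!\,\bigl(|A| - |S| - 1\bigr)!}{|A|!}
    \,\bigl( v(S \cup \{f\}) - v(S) \bigr).
\]
```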
The NIAID Data Ecosystem Discovery Portal (https://data.niaid.nih.gov) provides a unified search interface for over 4 million datasets relevant to infectious and immune-mediated disease (IID) research. Integrating metadata from domain-specific and generalist repositories, the Portal enables researchers to identify and access datasets using user-friendly filters or advanced queries, without requiring technical expertise. The Portal supports discovery of a wide range of resources, including epidemiological, clinical, and multi-omic datasets, and is designed to accommodate exploratory browsing and precise searches. The Portal provides filters, prebuilt queries, and dataset collections to simplify the discovery process for users. The Portal additionally provides documentation and an API for programmatic access to harmonized metadata. By easing access barriers to important biomedical datasets, the NIAID Data Ecosystem Discovery Portal serves as an entry point for researchers working to understand, diagnose, or treat IID. Valuable datasets are often overlooked because they are difficult to locate. The NIAID Data Ecosystem Discovery Portal fills this gap by providing a centralized, searchable interface that empowers users with varying levels of technical expertise to find and reuse data. By standardizing key metadata fields and harmonizing heterogeneous formats, the Portal improves data findability, accessibility, and reusability. This resource supports hypothesis generation, comparative analysis, and secondary use of public data by the IID research community, including those funded by NIAID. The Portal supports data sharing by standardizing metadata and linking to source repositories, and maximizes the impact of public investment in research data by supporting scientific advancement via secondary use.
Predicates are foundational components in data analysis systems. However, modern workloads increasingly involve unstructured documents, which demand semantic understanding beyond traditional value-based predicates. Given enormous document collections and ad-hoc queries, Large Language Models (LLMs) offer powerful zero-shot capabilities, but their high inference cost leads to unacceptable overhead. Therefore, we introduce \textsc{ScaleDoc}, a novel system that addresses this by decoupling predicate execution into an offline representation phase and an optimized online filtering phase. In the offline phase, \textsc{ScaleDoc} leverages an LLM to generate semantic representations for each document. Online, for each query, it trains a lightweight proxy model on these representations to filter the majority of documents, forwarding only the ambiguous cases to the LLM for the final decision. Furthermore, \textsc{ScaleDoc} introduces two core innovations to achieve significant efficiency: (1) a contrastive-learning-based framework that trains the proxy model to produce reliable predicate decision scores; (2) an adaptive cascade mechanism that determines an effective filtering policy while meeting specific accuracy targets. Our evaluations across three datasets demonstrate that \textsc{ScaleDoc} achieves over a 2$\times$ end-to-end speedup and reduces expensive LLM invocations by up to 85\%, making large-scale semantic analysis practical and efficient.
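The following is a schematic sketch of the online cascade idea described above: confident proxy scores decide directly, and only ambiguous documents reach the LLM. All names (proxy_scores, llm_predicate, the thresholds) are illustrative stand-ins, not the \textsc{ScaleDoc} API.

```python
# Cascade filtering sketch: proxy scores handle the bulk of documents, the LLM
# only sees the ambiguous middle band. Thresholds are illustrative.
def cascade_filter(docs, proxy_scores, llm_predicate,
                   tau_accept=0.9, tau_reject=0.1):
    """Return indices of documents judged to satisfy the semantic predicate."""
    accepted = []
    for i, score in enumerate(proxy_scores):
        if score >= tau_accept:        # confident positive: keep without the LLM
            accepted.append(i)
        elif score <= tau_reject:      # confident negative: drop without the LLM
            continue
        elif llm_predicate(docs[i]):   # ambiguous: fall back to the LLM
            accepted.append(i)
    return accepted
```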
Vector databases have rapidly grown in popularity, enabling efficient similarity search over data such as text, images, and video. They now play a central role in modern AI workflows, aiding large language models by grounding model outputs in external literature through retrieval-augmented generation. Despite their importance, little is known about the performance characteristics of vector databases in high-performance computing (HPC) systems that drive large-scale science. This work presents an empirical study of distributed vector database performance on the Polaris supercomputer in the Argonne Leadership Computing Facility. We construct a realistic biological-text workload from BV-BRC and generate embeddings from the peS2o corpus using Qwen3-Embedding-4B. We select Qdrant to evaluate insertion, index construction, and query latency with up to 32 workers. Informed by practical lessons from our experience, this work takes a first step toward characterizing vector database performance on HPC platforms to guide future research and optimization.
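As a concrete illustration of the kind of insertion and query operations benchmarked above, here is a minimal sketch using the Python qdrant-client. The endpoint URL, collection name, vector dimension, and random vectors are placeholders, not the paper's actual configuration or embedding dimensionality.

```python
# Minimal Qdrant insertion + search sketch; all configuration values below are
# placeholders for illustration only.
import random
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

DIM = 2560  # placeholder: set to the embedding model's output dimension
client = QdrantClient(url="http://localhost:6333")

client.recreate_collection(
    collection_name="pes2o_demo",
    vectors_config=VectorParams(size=DIM, distance=Distance.COSINE),
)

# Insert a small batch of random vectors standing in for document embeddings.
points = [
    PointStruct(id=i, vector=[random.random() for _ in range(DIM)],
                payload={"doc_id": f"doc-{i}"})
    for i in range(100)
]
client.upsert(collection_name="pes2o_demo", points=points)

# Query latency is measured around calls like this one.
hits = client.search(
    collection_name="pes2o_demo",
    query_vector=[random.random() for _ in range(DIM)],
    limit=10,
)
print([h.id for h in hits])
```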
In this technical report, we present a formalisation of the MongoDB aggregation framework. Our aim is to identify a fragment that could serve as the starting point for an industry-wide standard for querying JSON document databases. We provide a syntax and formal semantics for a set of selected operators. We show how this fragment relates to known relational query languages. We explain how our semantics differs from the current implementation of MongoDB, and justify our choices. We provide a set of algebraic transformations that can be used for query optimisation.
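For readers unfamiliar with the framework, the snippet below shows an aggregation pipeline built from operators of the kind such a fragment covers ($match, $unwind, $group, $project), expressed via pymongo. The collection and field names are invented for the example; the report's exact operator selection may differ.

```python
# Example MongoDB aggregation pipeline: selection, array unnesting, grouping
# with aggregation, and projection. Collection and field names are made up.
from pymongo import MongoClient

coll = MongoClient("mongodb://localhost:27017")["shop"]["orders"]

pipeline = [
    {"$match": {"status": "shipped"}},                         # selection
    {"$unwind": "$items"},                                     # flatten nested arrays
    {"$group": {"_id": "$items.sku",                           # grouping + sum
                "total_qty": {"$sum": "$items.qty"}}},
    {"$project": {"sku": "$_id", "total_qty": 1, "_id": 0}},   # projection / renaming
]
for row in coll.aggregate(pipeline):
    print(row)
```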
Approximate Nearest Neighbor Search (ANNS) plays a critical role in applications such as search engines, recommender systems, and RAG for LLMs. Vector quantization (VQ), a crucial technique for ANNS, is commonly used to reduce space overhead and accelerate distance computations. However, despite significant research advances, state-of-the-art VQ methods still face challenges in balancing encoding efficiency and quantization accuracy. To address these limitations, we propose a novel VQ method called SAQ. To improve accuracy, SAQ employs a new dimension segmentation technique to strategically partition PCA-projected vectors into segments along their dimensions. By prioritizing leading dimension segments with larger magnitudes, SAQ allocates more bits to high-impact segments, optimizing the use of the available space quota. An efficient dynamic programming algorithm is developed to optimize dimension segmentation and bit allocation, ensuring minimal quantization error. To speed up vector encoding, SAQ devises a code adjustment technique to first quantize each dimension independently and then progressively refine quantized vectors using a coordinate-descent-like approach to avoid exhaustive enumeration. Extensive experiments demonstrate SAQ's superiority over classical methods (e.g., PQ, PCA) and recent state-of-the-art approaches (e.g., LVQ, Extended RabitQ). SAQ achieves up to 80% reduction in quantization error and accelerates encoding speed by over 80x compared to Extended RabitQ.
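The sketch below illustrates, in a much-simplified form, the general idea of prioritizing leading PCA dimensions: project the vectors, split the projected dimensions into segments, and spend more bits per dimension on the leading (higher-variance) segments. The paper's dynamic-programming bit allocation and code-adjustment refinement are not reproduced; segment boundaries and bit counts here are arbitrary placeholders.

```python
# Simplified segmented scalar quantization after a PCA projection; more bits go
# to leading segments. This is an illustration, not the SAQ algorithm.
import numpy as np

def segmented_scalar_quantize(X, segments=((0, 16, 8), (16, 64, 4), (64, None, 2))):
    """Quantize PCA-projected rows of X; `segments` = (start, end, bits)."""
    # PCA projection via SVD of the centered data; leading columns carry the
    # largest variance.
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Z = Xc @ Vt.T

    codes = []
    for start, end, bits in segments:
        seg = Z[:, start:end]
        lo, hi = seg.min(axis=0), seg.max(axis=0)
        levels = (1 << bits) - 1
        # Uniform scalar quantization of each dimension in the segment.
        q = np.round((seg - lo) / np.maximum(hi - lo, 1e-12) * levels)
        codes.append(q.astype(np.uint8))
    return codes

codes = segmented_scalar_quantize(np.random.randn(1000, 128))
print([c.shape for c in codes])
```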
When query evaluation produces too many tuples, a new approach in query answering is to retrieve a diverse subset of them. The standard approach for measuring the diversity of a set of tuples is to use a distance function between tuples, which measures the dissimilarity between them, to then aggregate the pairwise distances of the set into a score (e.g., by using sum or min aggregation). However, as we will point out in this work, the resulting diversity measures may display some unintuitive behavior. Moreover, even in very simple settings, finding a maximally diverse subset of the answers of fixed size is, in general, intractable and little is known about approximations apart from some hand-picked distance-aggregator pairs. In this work, we introduce a novel approach for computing the diversity of tuples based on volume instead of distance. We present a framework for defining volume-based diversity functions and provide several examples of these measures applied to relational data. Although query answering of conjunctive queries (CQ) under this setting is intractable in general, we show that one can always compute a (1-1/e)-approximation for any volume-based diversity function. Furthermore, in terms of combined complexity, we connect the evaluation of CQs under volume-based diversity functions with the ranked enumeration of solutions, finding general conditions under which a (1-1/e)-approximation can be computed in polynomial time.
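For context, the classical greedy algorithm for monotone submodular maximization under a cardinality constraint is the standard route to (1-1/e) guarantees of the kind stated above; a generic sketch follows. The diversity function is passed in as an arbitrary callable, and the paper's specific volume-based measures and enumeration-based evaluation are not reproduced here.

```python
# Greedy selection for monotone submodular set functions: at each step, add the
# answer with the largest marginal gain in diversity. Illustrative sketch only.
def greedy_diverse_subset(answers, diversity, k):
    """Pick k answers, greedily maximizing the marginal gain of `diversity`."""
    chosen = []
    remaining = list(answers)
    for _ in range(min(k, len(remaining))):
        best = max(remaining,
                   key=lambda a: diversity(chosen + [a]) - diversity(chosen))
        chosen.append(best)
        remaining.remove(best)
    return chosen
```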
While extensive research on query evaluation has achieved consistent improvements in the time complexity of algorithms, the space complexity of query evaluation has been largely ignored. This is a particular challenge in settings with strict pre-defined space constraints. In this paper, we examine the combined space-time complexity of conjunctive queries (CQs) and, more generally, of sum-product queries (SPQs). We propose several classes of space-efficient algorithms for evaluating SPQs, and we show that the optimal time complexity is almost always achievable with asymptotically lower space complexity than traditional approaches.
We present ORQ, a system that enables collaborative analysis of large private datasets using cryptographically secure multi-party computation (MPC). ORQ protects data against semi-honest or malicious parties and can efficiently evaluate relational queries with multi-way joins and aggregations that have been considered notoriously expensive under MPC. To do so, ORQ eliminates the quadratic cost of secure joins by leveraging the fact that, in practice, the structure of many real queries allows us to join records and apply the aggregations "on the fly" while keeping the result size bounded. On the system side, ORQ contributes generic oblivious operators, a data-parallel vectorized query engine, a communication layer that amortizes MPC network costs, and a dataflow API for expressing relational analytics -- all built from the ground up. We evaluate ORQ in LAN and WAN deployments on a diverse set of workloads, including complex queries with multiple joins and custom aggregations. When compared to state-of-the-art solutions, ORQ significantly reduces MPC execution times and can process one order of magnitude larger datasets. For our most challenging workload, the full TPC-H benchmark, we report results entirely under MPC with Scale Factor 10 -- a scale that had previously been achieved only with information leakage or the use of trusted third parties.
High read and write performance is important for generic key/value stores, which are fundamental to modern applications and databases. Yet, achieving high performance for both reads and writes is challenging due to traditionally limited memory and the pick-any-two-out-of-three tradeoff between memory use, read performance, and write performance. Existing state-of-the-art approaches limit memory usage and choose a primary dimension (reads or writes) for which to optimize their on-disk structures. They recover performance in the remaining dimension through other mechanisms. This approach limits databases' maximum performance in the remaining dimension and their dynamic (online) tunability in response to changing workloads. We explore a different approach that dynamically trades memory for read or write performance as needed. We present TurtleKV, which includes a novel unbiased data structure for on-disk storage and a knob that dynamically adjusts the memory reserved for improving read or write performance. When evaluated on YCSB, TurtleKV achieves up to 8x the write throughput of the industry-leading RocksDB and up to 5x the read throughput, while incurring similar space amplification. Compared to the state-of-the-art system SplinterDB, TurtleKV performs up to 40% better on point queries and up to 6x better on range scans, achieves similar write performance, and incurs 50% less space amplification.
We consider conjunctive queries with arithmetic comparisons (CQAC) and investigate the computational complexity of the problem: Given two CQAC queries, $Q$ and $Q'$, is $Q'$ contained in $Q$? We know that, for CQAC queries, the problem of testing containment is $\Pi_2^p$-complete. However, there are broad classes of queries with semi-interval arithmetic comparisons in the containing query that render the problem solvable in NP. In all cases examined, the contained query is allowed to be any CQAC. Interestingly, we also prove that there are simple cases where the problem remains $\Pi_2^p$-complete. We also investigate the complexity of computing certain answers in the framework of answering CQAC queries with semi-interval comparisons using any CQAC views. We prove that maximally contained rewritings in the language of unions of CQACs always compute exactly all certain answers. We find cases where we can compute certain answers in polynomial time using maximally contained rewritings.
Given a conjunctive query and a database instance, we aim to develop an index that can efficiently answer spatial queries on the results of a conjunctive query. We are interested in some commonly used spatial queries, such as range emptiness, range count, and nearest neighbor queries. These queries have essential applications in data analytics, such as filtering relational data based on attribute ranges and temporal graph analysis for counting graph structures like stars, paths, and cliques. Furthermore, this line of research can accelerate relational algorithms that incorporate spatial queries in their workflow, such as relational clustering. Known approaches either have to spend $\tilde{O}(N)$ query time or use space as large as the number of query results, both of which are inefficient or unrealistic to employ in practice. Hence, we aim to construct an index that answers spatial conjunctive queries in both time- and space-efficient ways. In this paper, we establish lower bounds on the tradeoff between answering time and space usage. For $k$-star (resp. $k$-path) queries, we show that any index for range emptiness, range counting or nearest neighbor queries with $T$ answering time requires $\Omega\left(N+\frac{N^k}{T^k}\right)$ (resp. $\Omega\left(N+\frac{N^2}{T^{2/(k-1)}}\right)$) space. Then, we construct optimal indexes for answering range emptiness and range counting problems over $k$-star and $k$-path queries. Extending this result, we build an index for hierarchical queries. By resorting to the generalized hypertree decomposition, we can extend our index to arbitrary conjunctive queries for supporting spatial conjunctive queries. Finally, we show how our new indexes can be used to improve the running time of known algorithms in the relational setting.
DBOS (DataBase Operating System) is a novel capability that integrates web services, operating system functions, and database features to significantly reduce web-deployment effort while increasing resilience. Integration of high performance network sensing enables DBOS web services to collaboratively create a shared awareness of their network environments to enhance their collective resilience and security. Network sensing is added to DBOS using GraphBLAS hypersparse traffic matrices via two approaches: (1) Python-GraphBLAS and (2) OneSparse PostgreSQL. These capabilities are demonstrated using the workflow and analytics from the IEEE/MIT/Amazon Anonymized Network Sensing Graph Challenge. The system was parallelized using pPython and benchmarked using 64 compute nodes on the MIT SuperCloud. The web request rate sustained by a single DBOS instance was ${>}10^5$, well above the required maximum, indicating that network sensing can be added to DBOS with negligible overhead. For collaborative awareness, many DBOS instances were connected to a single DBOS aggregator. The Python-GraphBLAS and OneSparse PostgreSQL implementations scaled linearly up to 64 and 32 nodes respectively. These results suggest that DBOS collaborative network awareness can be achieved with a negligible increase in computing resources.
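To illustrate the traffic-matrix idea described above, the sketch below builds a sparse source-by-destination packet-count matrix. It uses scipy.sparse as a stand-in for the paper's GraphBLAS hypersparse matrices, and the indices, counts, and matrix shape are toy placeholders.

```python
# Toy traffic matrix: rows index anonymized source addresses, columns index
# destinations, values count observed packets. scipy.sparse stands in for the
# GraphBLAS hypersparse matrices used in the paper.
import numpy as np
from scipy.sparse import coo_matrix

# (source_index, destination_index) pairs observed in one capture window.
sources = np.array([3, 3, 7, 42, 42, 42])
dests   = np.array([9, 9, 1, 5, 5, 9])
counts  = np.ones_like(sources)

# Duplicate coordinates are summed on conversion, giving per-(src, dst) counts.
traffic = coo_matrix((counts, (sources, dests)), shape=(2**17, 2**17)).tocsr()
print(traffic.nnz, "non-empty (source, destination) pairs")
```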
Setchain has been proposed to increase blockchain scalability by relaxing the strict total order requirement among transactions. Setchain organizes elements into a sequence of sets, referred to as epochs, so that elements within each epoch are unordered. In this paper, we propose and evaluate three distinct Setchain algorithms that leverage an underlying block-based ledger. Vanilla is a basic implementation that serves as a reference point. Compresschain aggregates elements into batches and compresses these batches before appending them as epochs in the ledger. Hashchain converts batches into fixed-length hashes, which are appended as epochs in the ledger. This requires Hashchain to use a distributed service to obtain a batch's contents from its hash. To allow light clients to safely interact with only one server, the proposed algorithms maintain, as part of the Setchain, proofs for the epochs. An epoch-proof is the hash of the epoch, cryptographically signed by a server. A client can verify the correctness of an epoch with $f+1$ epoch-proofs (where $f$ is the maximum number of Byzantine servers assumed). All three Setchain algorithms are implemented on top of the CometBFT blockchain application platform. We conducted performance evaluations across various configurations, using clusters of four, seven, and ten servers. Our results show that the Setchain algorithms reach orders of magnitude higher throughput than the underlying blockchain, and achieve finality with latency below 4 seconds.
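A schematic sketch of the light-client check described above follows: an epoch-proof is a server's signature over the epoch hash, and the client accepts once it holds valid proofs from $f+1$ distinct servers. The canonical hashing and the `verify_signature` callable are illustrative placeholders, not the paper's wire format.

```python
# Light-client epoch verification sketch: accept an epoch once f+1 distinct
# servers have signed its hash. Hashing scheme and signature check are
# illustrative placeholders.
import hashlib
import json

def epoch_hash(epoch_elements):
    """Canonical hash of an (unordered) epoch: sort elements before hashing."""
    payload = json.dumps(sorted(epoch_elements)).encode()
    return hashlib.sha256(payload).hexdigest()

def verify_epoch(epoch_elements, proofs, verify_signature, f):
    """proofs: iterable of (server_id, signature); True iff >= f+1 are valid."""
    digest = epoch_hash(epoch_elements)
    valid_servers = {
        server_id
        for server_id, signature in proofs
        if verify_signature(server_id, digest, signature)
    }
    return len(valid_servers) >= f + 1
```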
Reliable data quality is crucial for downstream analysis of tabular datasets, yet rule-based validation often suffers from inefficiency, heavy reliance on human intervention, and high computational cost. We present a three-stage framework that combines statistical inlier detection with LLM-driven rule and code generation. After filtering data samples through traditional clustering, we iteratively prompt LLMs to produce semantically valid quality rules and synthesize their executable validators through code-generating LLMs. To generate reliable quality rules, we aid the LLMs with retrieval-augmented generation (RAG) by leveraging external knowledge sources and domain-specific few-shot examples. Robust guardrails ensure the accuracy and consistency of both rules and code snippets. Extensive evaluations on benchmark datasets confirm the effectiveness of our approach.
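The skeleton below outlines the three-stage flow at a high level. Every stage is passed in as a callable because the paper's concrete clustering, prompting, code-generation, and guardrail components are not reproduced here; the parameter names are hypothetical.

```python
# High-level skeleton of the three-stage data-quality pipeline; all stage
# functions are supplied by the caller and are hypothetical placeholders.
def validate_table(table, cluster_inliers, generate_rules, generate_validator,
                   passes_guardrails):
    """Return accepted (rule, validator) pairs for the given table."""
    samples = cluster_inliers(table)           # Stage 1: statistical filtering
    accepted = []
    for rule in generate_rules(samples):       # Stage 2: RAG-assisted LLM rules
        validator = generate_validator(rule)   # Stage 3: code-generating LLM
        if passes_guardrails(rule, validator, samples):
            accepted.append((rule, validator))
    return accepted
```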