Large Language Models (LLMs) can enhance analytics systems with powerful data summarization, cleaning, and semantic transformation capabilities. However, deploying LLMs at scale -- processing millions to billions of rows -- remains prohibitively expensive in computation and memory. We present IOLM-DB, a novel system that makes LLM-enhanced database queries practical through query-specific model optimization. Instead of using general-purpose LLMs, IOLM-DB generates lightweight, specialized models tailored to each query's specific needs using representative data samples. IOLM-DB reduces model footprints by up to 76% and increases throughput by up to 3.31$\times$ while maintaining accuracy through aggressive compression techniques, including quantization, sparsification, and structural pruning. We further show how our approach enables higher parallelism on existing hardware and seamlessly supports caching and batching strategies to reduce overheads. Our prototype demonstrates that leveraging LLM queries inside analytics systems is feasible at scale, opening new possibilities for future OLAP applications.
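As a rough illustration of two of the compression steps named in this abstract (magnitude sparsification and int8 quantization; structural pruning is omitted), the following sketch applies them to a toy weight matrix. The helper functions and parameters are illustrative assumptions, not IOLM-DB's actual pipeline.

```python
# Illustrative sketch (not IOLM-DB's pipeline): magnitude sparsification and
# symmetric int8 quantization applied to a toy weight matrix.
import numpy as np

def sparsify(weights, keep_ratio=0.5):
    """Zero out the smallest-magnitude weights (unstructured pruning)."""
    k = max(1, int(weights.size * keep_ratio))
    threshold = np.sort(np.abs(weights), axis=None)[-k]
    return np.where(np.abs(weights) >= threshold, weights, 0.0)

def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization; returns integer weights and a scale."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

w = np.random.default_rng(0).normal(size=(4, 4)).astype(np.float32)
w_sparse = sparsify(w, keep_ratio=0.5)
q, scale = quantize_int8(w_sparse)
print("nonzeros:", int(np.count_nonzero(w_sparse)),
      "max dequantization error:", float(np.abs(w_sparse - q * scale).max()))
```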
The detection of sequential patterns in data is a basic functionality of modern data processing systems for complex event processing (CEP), OLAP, and retrieval-augmented generation (RAG). In practice, pattern matching is challenging, since common applications rely on a large set of patterns that must be evaluated under tight latency bounds. At the same time, matching needs to maintain state, i.e., intermediate results, which grows exponentially in the input size. Hence, systems turn to best-effort processing, striving for maximal recall under a latency bound. Existing techniques, however, consider each pattern in isolation, neglecting the optimization potential induced by state sharing in pattern matching. In this paper, we present SHARP, a library that employs state reduction to achieve efficient best-effort pattern matching. To this end, SHARP incorporates state sharing between patterns through a new abstraction, coined the pattern-sharing degree (PSD). At runtime, this abstraction facilitates the categorization and indexing of partial pattern matches. Based thereon, once a latency bound is exceeded, SHARP realizes best-effort processing by selecting a subset of partial matches for further processing in constant time. In experiments with real-world data, SHARP achieves a recall of 97%, 96%, and 73% for pattern matching in CEP, OLAP, and RAG applications, respectively, under a bound of 50% of the average processing latency.
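A toy sketch of the selection idea in this abstract, assuming partial matches can be bucketed by how many patterns share them (a stand-in for the pattern-sharing degree). The data and the budget-based selection below are illustrative, not SHARP's implementation.

```python
# Partial matches are indexed by a toy "pattern-sharing degree"; when a latency
# budget is hit, the most widely shared partial matches are kept first.
from collections import defaultdict

partial_matches = [
    {"id": "m1", "shared_by": {"P1", "P2", "P3"}},
    {"id": "m2", "shared_by": {"P1"}},
    {"id": "m3", "shared_by": {"P2", "P3"}},
]

# Index: sharing degree -> partial matches with that degree.
by_degree = defaultdict(list)
for m in partial_matches:
    by_degree[len(m["shared_by"])].append(m)

def select_under_budget(budget):
    """Keep the highest-degree partial matches until the budget is exhausted."""
    kept = []
    for degree in sorted(by_degree, reverse=True):
        for m in by_degree[degree]:
            if len(kept) == budget:
                return kept
            kept.append(m)
    return kept

print([m["id"] for m in select_under_budget(budget=2)])  # ['m1', 'm3']
```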
How can we generate a large, realistic set of tables, along with joinability relationships, to stress-test dataset discovery methods? Dataset discovery methods aim to automatically identify related data assets in a data lake. The development and evaluation of such solutions for customers from a wide range of business domains rely on diverse, high-quality, and domain-specific tabular benchmarks. Large language models (LLMs) are trained on a wide variety of text data, which can provide a strong foundation of general and domain-specific knowledge. In this paper, we ask the question -- \textit{can we leverage LLMs to generate a tabular benchmark adequate for evaluating dataset discovery solutions?} In particular, we focus on the task of finding joinable tables, which is the cornerstone of virtually every dataset discovery method. Current corpora for evaluating dataset discovery methods are mainly based on subsets of open data, and they suffer from three important issues: $i)$ they focus on very common and generic data types (e.g., address, id, name, etc.); $ii)$ they do not contain human-annotated column pairs; instead, practitioners synthesize ground truth using table splits (e.g., horizontal splits for table union search and vertical ones for joinability); and $iii)$ they do not focus on semantic column relationships.
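For concreteness, one common way to operationalize the joinability relationships such a benchmark must encode is a containment (overlap) score between column value sets; the sketch below is an illustrative assumption, not the paper's generation procedure.

```python
# Two columns are considered joinable when one column's values are largely
# contained in the other's. Column contents are made up for illustration.
def containment(col_a, col_b):
    a, b = set(col_a), set(col_b)
    return len(a & b) / len(a) if a else 0.0

orders_customer_id = ["C1", "C2", "C3", "C4"]
customers_id = ["C1", "C2", "C3", "C4", "C5", "C9"]
print(containment(orders_customer_id, customers_id))  # 1.0 -> joinable
```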
Increasingly massive volumes of multi-modal data are being accumulated in many real-world settings, including health care and e-commerce. This development calls for effective general-purpose data management solutions for multi-modal data. Such a solution must facilitate user-friendly and accurate retrieval of any multi-modal data according to diverse application requirements. Further, such a solution must be capable of efficient and scalable retrieval. To address this need, we present OneDB, a distributed multi-metric data similarity retrieval system. This system exploits the fact that data of diverse modalities, such as text, images, and video, can be represented as metric data. The system thus affords each data modality its own metric space with its own distance function and then uses a multi-metric model to unify multi-modal data. The system features several innovations: (i) an extended Spark SQL query interface; (ii) lightweight means of learning appropriate weights for different modalities when retrieving multi-modal data to enable accurate retrieval; (iii) smart search-space pruning strategies that improve efficiency; (iv) two-layered indexing of data to ensure load balancing during distributed processing; and (v) end-to-end system parameter autotuning. Experiments on three real-life datasets and two synthetic datasets offer evidence that the system is capable of state-of-the-art performance: (i) efficient and effective weight learning; (ii) retrieval accuracy improvements of 12.63\%--30.75\% over the state-of-the-art vector similarity search system at comparable efficiency; (iii) search accelerated by 2.5--5.75x over state-of-the-art single- and multi-metric solutions; (iv) high scalability; and (v) parameter tuning that enables performance improvements of over 15%.
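A minimal sketch of the multi-metric idea described here: each modality keeps its own metric, and a weighted combination scores multi-modal objects. The stand-in metrics (Jaccard distance for text, Euclidean distance for toy image features) and the fixed weights are illustrative assumptions, not OneDB's learned configuration.

```python
# Illustrative multi-metric scoring (not OneDB's implementation).
import math

def text_distance(a, b):
    # Jaccard distance over word sets as a stand-in text metric.
    ta, tb = set(a.split()), set(b.split())
    if not (ta | tb):
        return 0.0
    return 1.0 - len(ta & tb) / len(ta | tb)

def image_distance(a, b):
    # Euclidean distance between toy feature vectors as a stand-in image metric.
    return math.dist(a, b)

def multi_metric_distance(x, y, weights):
    # Each modality keeps its own metric; a weighted sum unifies them.
    return (weights["text"] * text_distance(x["text"], y["text"])
            + weights["image"] * image_distance(x["image"], y["image"]))

query = {"text": "red running shoe", "image": [0.1, 0.9]}
item = {"text": "red shoe for running", "image": [0.2, 0.8]}
print(multi_metric_distance(query, item, {"text": 0.6, "image": 0.4}))
```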
Cache systems fundamentally limit modern computing performance due to their inability to precisely capture data relationships. While achieving 85-92% hit rates, traditional systems rely on statistical heuristics that cannot guarantee relationship discovery, leading to suboptimal prefetching and resource waste. We present PFCS (Prime Factorization Cache System), which leverages the mathematical uniqueness of prime factorization to achieve deterministic relationship discovery with zero false positives. PFCS assigns unique primes to data elements and represents relationships as composite numbers, enabling perfect recovery of relationships through factorization. A comprehensive evaluation across database, ML, and HPC workloads demonstrates an average performance improvement of 6.2x, 98.9% hit rates, and a 38% power reduction compared to state-of-the-art systems. The mathematical foundation provides formal guarantees impossible with approximation-based approaches, establishing a new paradigm for cache system design.
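The core encoding described in this abstract can be illustrated in a few lines: each element receives a unique prime, a relationship is stored as the product of its members' primes, and factorization recovers exactly the members. This is a conceptual sketch, not the PFCS implementation.

```python
# Conceptual sketch of prime-based relationship encoding.
from itertools import count

def primes():
    """Yield primes by trial division (sufficient for a small sketch)."""
    found = []
    for n in count(2):
        if all(n % p for p in found):
            found.append(n)
            yield n

class PrimeRelationshipIndex:
    def __init__(self):
        self._gen = primes()
        self.prime_of = {}    # element -> prime
        self.element_of = {}  # prime -> element

    def register(self, element):
        if element not in self.prime_of:
            p = next(self._gen)
            self.prime_of[element] = p
            self.element_of[p] = element
        return self.prime_of[element]

    def encode(self, elements):
        """Represent a relationship as the product of its members' primes."""
        composite = 1
        for e in elements:
            composite *= self.register(e)
        return composite

    def decode(self, composite):
        """Recover exactly the participating elements by trial division."""
        members = []
        for p, e in self.element_of.items():
            while composite % p == 0:
                members.append(e)
                composite //= p
        return members

idx = PrimeRelationshipIndex()
rel = idx.encode(["page:42", "page:17", "page:99"])
print(idx.decode(rel))  # ['page:42', 'page:17', 'page:99']
```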
Property graphs are widely used in domains such as healthcare, finance, and social networks, but they often contain errors due to inconsistencies, missing data, or schema violations. Traditional rule-based and heuristic-driven graph repair methods are limited in their adaptability as they need to be tailored for each dataset. On the other hand, interactive human-in-the-loop approaches may become infeasible when dealing with large graphs, as the cost--both in terms of time and effort--of involving users becomes too high. Recent advancements in Large Language Models (LLMs) present new opportunities for automated graph repair by leveraging contextual reasoning and their access to real-world knowledge. We evaluate the effectiveness of six open-source LLMs in repairing property graphs. We assess repair quality, computational cost, and model-specific performance. Our experiments show that LLMs have the potential to detect and correct errors, with varying degrees of accuracy and efficiency. We discuss the strengths, limitations, and challenges of LLM-driven graph repair and outline future research directions for improving scalability and interpretability.
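As an illustration of how such an evaluation might frame a single repair request, the sketch below serializes a node, its neighbourhood, and a violated constraint into a prompt. The prompt wording, constraint, and node are hypothetical, and no specific LLM API is assumed.

```python
# Hypothetical prompt construction for LLM-driven repair of one graph node.
import json

def build_repair_prompt(node, violation, neighbours):
    return (
        "You are repairing a property graph.\n"
        f"Node: {json.dumps(node)}\n"
        f"Neighbours: {json.dumps(neighbours)}\n"
        f"Violated constraint: {violation}\n"
        "Return the corrected node as JSON, changing as few properties as possible."
    )

node = {"id": 7, "labels": ["Patient"], "props": {"birth_year": 2091}}
violation = "birth_year must not be in the future"
print(build_repair_prompt(node, violation, neighbours=[{"id": 3, "labels": ["Visit"]}]))
```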
Query optimization is essential for efficient SQL query execution in a DBMS, and it remains important as data volumes grow and hardware advances. Traditional optimizers struggle with the cumbersome hand-tuning required for complex workloads, while learning-based methods face limitations in ensuring generalization. Given the great success of Large Language Models (LLMs) across diverse downstream tasks, this paper explores how LLMs can be incorporated to enhance the generalization of learned optimizers. Though promising, such an incorporation still presents challenges, mainly high model inference latency, substantial fine-tuning cost, and suboptimal performance caused by the inherent discrepancy between LLM token sequences and structured SQL execution plans with rich numerical features. In this paper, we focus on recurring queries in offline optimization to alleviate the issue of high inference latency, and propose \textbf{LLM4Hint}, which leverages moderate-sized backbone LLMs to recommend query optimization hints. LLM4Hint achieves these goals through: (i) integrating a lightweight model to produce a soft prompt that captures the data distribution in the DBMS and the SQL predicates, providing sufficient optimization features while reducing the context length fed to the LLM; (ii) devising a query rewriting strategy using a larger commercial LLM to simplify SQL semantics for the backbone LLM and reduce fine-tuning costs; and (iii) introducing an explicit matching prompt to facilitate alignment between the LLM and the lightweight model, which accelerates convergence of the combined model. Experiments show that LLM4Hint, by leveraging the LLM's stronger capability to understand the query statement, outperforms state-of-the-art learned optimizers in both effectiveness and generalization.
Electronic payment platforms are estimated to process billions of transactions daily, with the cumulative value of these transactions potentially reaching into the trillions. Even a minor error within this high-volume environment could precipitate substantial financial losses. To mitigate this risk, manually constructed verification rules, developed by domain experts, are typically employed to identify and scrutinize transactions in production environments. However, due to the absence of a systematic approach to ensure the robustness of these verification rules against vulnerabilities, they remain susceptible to exploitation. To ensure data security, database maintainers usually compose complex verification rules to check whether a query/update request is valid. However, the rules written by experts are usually imperfect, and malicious requests may bypass these rules. As a result, the demand for systematically identifying defects in these rules emerges.
The lack of standardized tabular formats for tenancy schedules across real estate firms creates significant inefficiencies in data integration. Existing automated integration methods, such as Full Disjunction (FD)-based models like ALITE, prioritize completeness but result in schema bloat, sparse attributes, and limited business usability. We propose a novel hybrid, template-based schema matcher that aligns multi-layout tenancy schedules to a predefined target schema. The matcher combines schema-based metrics (Jaccard, Levenshtein) and instance-based metrics (data types, distributions) with globally optimal assignments determined via the Hungarian Algorithm. Evaluation against a manually labeled ground truth demonstrates substantial improvements, with grid-search optimization yielding a peak F1-score of 0.881 and an overall null percentage of 45.7%. On a separate ground truth of 20 semantically similar column sets, ALITE achieves an F1-score of 0.712 with 75.6% nulls. These results suggest that combining structured business knowledge with hybrid matching can yield more usable and business-aligned schema mappings. The approach assumes cleanly extracted tabular input; future work could explore extending the matcher to support complex, composite tables.
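A hedged sketch of the hybrid matching recipe named above: name similarity from Jaccard over tokens plus normalized Levenshtein, with the global column-to-template assignment solved by the Hungarian algorithm (via SciPy's `linear_sum_assignment`). The column names, equal weighting, and omission of instance-based metrics are assumptions for illustration, not the paper's exact configuration.

```python
# Name-based similarity plus globally optimal assignment (Hungarian algorithm).
import numpy as np
from scipy.optimize import linear_sum_assignment

def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def name_similarity(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    jaccard = len(ta & tb) / len(ta | tb) if (ta | tb) else 0.0
    lev = 1.0 - levenshtein(a.lower(), b.lower()) / max(len(a), len(b), 1)
    return 0.5 * jaccard + 0.5 * lev  # assumed equal weighting

source_cols = ["Tenant Name", "Lease Start", "Annual Rent"]
target_cols = ["tenant", "lease start date", "rent per year", "unit area"]

# Cost matrix uses negative similarity so the assignment maximizes total similarity.
cost = np.array([[-name_similarity(s, t) for t in target_cols] for s in source_cols])
rows, cols = linear_sum_assignment(cost)
for r, c in zip(rows, cols):
    print(f"{source_cols[r]!r} -> {target_cols[c]!r} (score {-cost[r, c]:.2f})")
```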
PathDB is a Java-based graph database designed for in-memory data loading and querying. Using Regular Path Queries (RPQs) and a closed path algebra, PathDB processes paths through its three main components: the parser, the logical plan, and the physical plan. This modular design allows for targeted optimizations and modifications without impacting overall functionality. Benchmark experiments illustrate PathDB's execution times and flexibility in handling dynamic and complex path queries, compared to baseline methods such as Depth-First Search (DFS) and Breadth-First Search (BFS) guided by an automaton, highlighting the optimizations that contribute to its performance. PathDB was also evaluated against leading commercial graph systems, including Neo4j, Memgraph, and Kùzu. These experiments demonstrated PathDB's competitive execution times and its ability to support a wide range of path query types.
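To make the automaton-guided search baseline mentioned above concrete, the sketch below evaluates a small regular path query (`knows/worksAt*`) by a BFS over the product of a toy graph and a hand-built automaton; the data and query are illustrative, not PathDB's engine.

```python
# Automaton-guided BFS for a regular path query over a toy labeled graph.
from collections import deque

edges = {  # node -> list of (label, target)
    "alice": [("knows", "bob")],
    "bob": [("worksAt", "acme")],
    "acme": [("worksAt", "acme_hq")],
}
# Automaton for knows/worksAt*: state 0 --knows--> 1, state 1 --worksAt--> 1; state 1 accepts.
delta = {(0, "knows"): 1, (1, "worksAt"): 1}
accepting = {1}

def rpq(start):
    seen = {(start, 0)}
    queue = deque([(start, 0)])
    results = set()
    while queue:
        node, state = queue.popleft()
        if state in accepting:
            results.add(node)
        for label, target in edges.get(node, []):
            nxt = delta.get((state, label))
            if nxt is not None and (target, nxt) not in seen:
                seen.add((target, nxt))
                queue.append((target, nxt))
    return results

print(rpq("alice"))  # {'bob', 'acme', 'acme_hq'}
```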
Group recommendation over social media streams has attracted significant attention due to its wide applications in domains such as e-commerce, entertainment, and online news broadcasting. By leveraging social connections and group behaviours, group recommendation (GR) aims to provide more accurate and engaging content to a set of users rather than individuals. Recently, influence-aware GR has emerged as a promising direction, as it considers the impact of social influence on group decision-making. In earlier work, we proposed Influence-aware Group Recommendation (IGR) to solve this task. However, this task remains challenging due to three key factors: the large and ever-growing scale of social graphs, the inherently dynamic nature of influence propagation within user groups, and the high computational overhead of real-time group-item matching. To tackle these issues, we propose an Enhanced Influence-aware Group Recommendation (EIGR) framework. First, we introduce a Graph Extraction-based Sampling (GES) strategy to minimise redundancy across multiple temporal social graphs and effectively capture the evolving dynamics of both groups and items. Second, we design a novel DYnamic Independent Cascade (DYIC) model to predict how influence propagates over time across social items and user groups. Finally, we develop a two-level hash-based User Group Index (UG-Index) to efficiently organise user groups and enable real-time recommendation generation. Extensive experiments on real-world datasets demonstrate that our proposed framework, EIGR, consistently outperforms state-of-the-art baselines in both effectiveness and efficiency.
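For reference, the classic independent-cascade model that the DYIC component builds on can be simulated in a few lines; the toy graph, activation probability, and seed set below are illustrative assumptions, not EIGR's temporal extension.

```python
# Toy independent-cascade simulation over a small social graph.
import random

def independent_cascade(graph, seeds, p=0.2, seed=0):
    """graph: dict node -> list of neighbours; returns the set of activated nodes."""
    rng = random.Random(seed)
    active, frontier = set(seeds), list(seeds)
    while frontier:
        nxt = []
        for u in frontier:
            for v in graph.get(u, []):
                if v not in active and rng.random() < p:
                    active.add(v)
                    nxt.append(v)
        frontier = nxt
    return active

graph = {"u1": ["u2", "u3"], "u2": ["u4"], "u3": ["u4", "u5"], "u4": [], "u5": []}
print(independent_cascade(graph, seeds=["u1"], p=0.5))
```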
Traditional Data+AI systems utilize data-driven techniques to optimize performance, but they rely heavily on human experts to orchestrate system pipelines, enabling them to adapt to changes in data, queries, tasks, and environments. For instance, while there are numerous data science tools available, developing a pipeline planning system to coordinate these tools remains challenging. This difficulty arises because existing Data+AI systems have limited capabilities in semantic understanding, reasoning, and planning. Fortunately, we have witnessed the success of large language models (LLMs) in enhancing semantic understanding, reasoning, and planning abilities. It is crucial to incorporate LLM techniques to revolutionize data systems for orchestrating Data+AI applications effectively. To achieve this, we propose the concept of a 'Data Agent' - a comprehensive architecture designed to orchestrate Data+AI ecosystems, which focuses on tackling data-related tasks by integrating knowledge comprehension, reasoning, and planning capabilities. We delve into the challenges involved in designing data agents, such as understanding data/queries/environments/tools, orchestrating pipelines/workflows, optimizing and executing pipelines, and fostering pipeline self-reflection. Furthermore, we present examples of data agent systems, including a data science agent, data analytics agents (such as unstructured data analytics agent, semantic structured data analytics agent, data lake analytics agent, and multi-modal data analytics agent), and a database administrator (DBA) agent. We also outline several open challenges associated with designing data agent systems.
This paper aims to comprehensively grasp the research status and development trends of soil microplastics (MPs). It collects studies from the Web of Science Core Collection covering the period from 2013 to 2024. Employing CiteSpace and VOSviewer, the paper conducts in-depth analyses of literature regarding the environmental impacts of microplastics. These analyses involve keyword co-occurrence, clustering, burst term identification, as well as co-occurrence analysis of authors and institutions. Microplastics can accumulate in soil, transfer through food chains, and ultimately affect human health, making the research on them essential for effective pollution control. Focusing on the international research on the impacts of microplastics on soil and ecosystems, the study reveals a steadily increasing trend in the number of publications each year, reaching a peak of 956 articles in 2024. A small number of highly productive authors contribute significantly to the overall research output. The keyword clustering analysis results in ten major clusters, including topics such as plastic pollution and microbial communities. The research on soil microplastics has evolved through three distinct stages: the preliminary exploration phase from 2013 to 2016, the expansion phase from 2017 to 2020, and the integration phase from 2021 to 2024. For future research, multi-level assessments of the impacts of microplastics on soil ecosystems and organisms should be emphasized, in order to fully uncover the associated hazards and develop practical solutions.
In Complex Event Processing, handling out-of-order, late, and duplicate events is critical for real-time analytics, especially on resource-constrained devices that process heterogeneous data from multiple sources. We present LimeCEP, a hybrid CEP approach that combines lazy evaluation, buffering, and speculative processing to efficiently handle data inconsistencies while supporting multi-pattern detection under relaxed semantics. LimeCEP integrates Kafka for efficient message ordering, retention, and duplicate elimination, and offers configurable strategies to trade off between accuracy, latency, and resource consumption. Compared to state-of-the-art systems like SASE and FlinkCEP, LimeCEP achieves up to six orders of magnitude lower latency, with up to 10 times lower memory usage and 6 times lower CPU utilization, while maintaining near-perfect precision and recall under high-disorder input streams, making it well-suited for non-cloud deployments.
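A small sketch of the buffering step that handles out-of-order and duplicate events, under the assumption that events carry a timestamp and an id and are released once they fall behind a lateness bound. This is not LimeCEP's actual mechanism, which additionally relies on Kafka and speculative processing.

```python
# Toy reorder buffer: deduplicate by event id, release events in timestamp
# order once they fall behind a lateness bound.
import heapq

class ReorderBuffer:
    def __init__(self, max_lateness):
        self.max_lateness = max_lateness
        self.heap = []     # (timestamp, event_id, payload)
        self.seen = set()  # duplicate elimination by event id

    def insert(self, ts, event_id, payload):
        if event_id in self.seen:
            return []      # drop duplicate
        self.seen.add(event_id)
        heapq.heappush(self.heap, (ts, event_id, payload))
        return self._release(watermark=ts - self.max_lateness)

    def _release(self, watermark):
        out = []
        while self.heap and self.heap[0][0] <= watermark:
            out.append(heapq.heappop(self.heap))
        return out

buf = ReorderBuffer(max_lateness=5)
for ev in [(10, "a", "x"), (7, "b", "y"), (7, "b", "y"), (20, "c", "z")]:
    for released in buf.insert(*ev):
        print("emit", released)
```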
The Shapley value is widely used for data valuation in data markets. However, explaining the Shapley value of an owner in a data coalition is an unexplored and challenging task. To tackle this, we formulate the problem of finding the counterfactual explanation of Shapley value in data coalitions. Essentially, given two data owners $A$ and $B$ such that $A$ has a higher Shapley value than $B$, a counterfactual explanation is a smallest subset of data entries in $A$ such that transferring the subset from $A$ to $B$ makes the Shapley value of $A$ less than that of $B$. We show that counterfactual explanations always exist, but finding an exact counterfactual explanation is NP-hard. Using Monte Carlo estimation to approximate counterfactual explanations directly according to the definition is still very costly, since we have to estimate the Shapley values of owners $A$ and $B$ after each possible subset shift. We develop a series of heuristic techniques to speed up computation by estimating differential Shapley values, computing the power of singular data entries, and shifting subsets greedily, culminating in the SV-Exp algorithm. Our experimental results on real datasets clearly demonstrate the efficiency of our method and the effectiveness of counterfactuals in interpreting the Shapley value of an owner.
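For context, the Monte Carlo estimation that the abstract says is too costly to apply naively works by permutation sampling, as in the sketch below; the owners and the toy utility function are illustrative assumptions, not the paper's data-valuation setting.

```python
# Monte Carlo permutation sampling for owner-level Shapley values.
import random

def shapley_monte_carlo(owners, utility, samples=2000, seed=0):
    rng = random.Random(seed)
    phi = {o: 0.0 for o in owners}
    for _ in range(samples):
        perm = owners[:]
        rng.shuffle(perm)
        coalition, prev = [], utility([])
        for o in perm:
            coalition.append(o)
            cur = utility(coalition)
            phi[o] += cur - prev  # marginal contribution of o in this permutation
            prev = cur
    return {o: v / samples for o, v in phi.items()}

# Toy utility: value grows sublinearly with the number of contributed rows.
rows = {"A": 120, "B": 80, "C": 40}
utility = lambda coalition: sum(rows[o] for o in coalition) ** 0.5
print(shapley_monte_carlo(list(rows), utility))
```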
Recent progress in large language models (LLMs) has enabled the development of autonomous web agents capable of navigating and interacting with real websites. However, evaluating such agents remains challenging due to the instability and inconsistency of existing benchmarks, which often rely on dynamic content or oversimplified simulations. In this work, we introduce WebArXiv, a static and time-invariant benchmark comprising 275 web-based tasks grounded in the arXiv platform. WebArXiv ensures reproducible and reliable evaluation by anchoring tasks in fixed web snapshots with deterministic ground truths and standardized action trajectories. Through behavioral analysis, we identify a common failure mode, Rigid History Reflection, where agents over-rely on fixed interaction histories. To address this, we propose a lightweight dynamic reflection mechanism that allows agents to selectively retrieve relevant past steps during decision-making. We evaluate ten state-of-the-art web agents on WebArXiv. Results demonstrate clear performance differences across agents and validate the effectiveness of our proposed reflection strategy.
Retrieval-Augmented Generation (RAG) has proven effective on server infrastructures, but its application on mobile devices is still underexplored due to limited memory and power resources. Existing vector search and RAG solutions largely assume abundant computation resources, making them impractical for on-device scenarios. In this paper, we propose MobileRAG, a fully on-device pipeline that overcomes these limitations by combining a mobile-friendly vector search algorithm, \textit{EcoVector}, with a lightweight \textit{Selective Content Reduction} (SCR) method. By partitioning and partially loading index data, EcoVector drastically reduces both memory footprint and CPU usage, while the SCR method filters out irrelevant text to diminish Language Model (LM) input size without degrading accuracy. Extensive experiments demonstrated that MobileRAG significantly outperforms conventional vector search and RAG methods in terms of latency, memory usage, and power consumption, while maintaining accuracy and enabling offline operation to safeguard privacy in resource-constrained environments.
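An IVF-style sketch of the "partition and partially load" idea: vectors are grouped around a few centroids, and a query probes only the closest partitions, bounding memory and compute. The random data, centroid selection, and probe count are illustrative assumptions, not EcoVector's algorithm.

```python
# Partitioned vector search: assign vectors to nearest centroids at build time,
# probe only the closest partitions at query time.
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 16)).astype(np.float32)

# Build: pick a handful of centroids and assign each vector to its nearest one.
centroids = vectors[rng.choice(len(vectors), size=8, replace=False)]
assign = np.argmin(np.linalg.norm(vectors[:, None] - centroids[None], axis=2), axis=1)
partitions = {c: np.where(assign == c)[0] for c in range(len(centroids))}

def search(query, top_partitions=2, k=5):
    # Probe only the closest partitions instead of the whole index.
    order = np.argsort(np.linalg.norm(centroids - query, axis=1))[:top_partitions]
    candidates = np.concatenate([partitions[int(c)] for c in order])
    dists = np.linalg.norm(vectors[candidates] - query, axis=1)
    return candidates[np.argsort(dists)[:k]]

print(search(rng.normal(size=16).astype(np.float32)))
```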
Dynamic graph storage systems are essential for real-time applications such as social networks and recommendation, where graph data continuously evolves. However, they face significant challenges in efficiently handling concurrent read and write operations. We find that existing methods suffer from write queries interfering with read efficiency, substantial time and space overhead due to per-edge versioning, and an inability to balance performance across operations, e.g., slow searches under concurrent workloads. To address these issues, we propose RapidStore, a holistic approach for efficient in-memory dynamic graph storage designed for read-intensive workloads. Our key idea is to exploit the characteristics of graph queries through a decoupled system design that separates the management of read and write queries and decouples version data from graph data. In particular, we design an efficient dynamic graph store that cooperates with the graph concurrency control mechanism. Experimental results demonstrate that RapidStore enables fast and scalable concurrent graph queries, effectively balances the performance of inserts, searches, and scans, and significantly improves efficiency in dynamic graph storage systems.