Loading...
Loading...
Browse, search and filter the latest cybersecurity research papers from arXiv
Approximate Nearest Neighbor Search (ANNS) plays a critical role in applications such as search engines, recommender systems, and RAG for LLMs. Vector quantization (VQ), a crucial technique for ANNS, is commonly used to reduce space overhead and accelerate distance computations. However, despite significant research advances, state-of-the-art VQ methods still face challenges in balancing encoding efficiency and quantization accuracy. To address these limitations, we propose a novel VQ method called SAQ. To improve accuracy, SAQ employs a new dimension segmentation technique to strategically partition PCA-projected vectors into segments along their dimensions. By prioritizing leading dimension segments with larger magnitudes, SAQ allocates more bits to high-impact segments, optimizing the use of the available space quota. An efficient dynamic programming algorithm is developed to optimize dimension segmentation and bit allocation, ensuring minimal quantization error. To speed up vector encoding, SAQ devises a code adjustment technique to first quantize each dimension independently and then progressively refine quantized vectors using a coordinate-descent-like approach to avoid exhaustive enumeration. Extensive experiments demonstrate SAQ's superiority over classical methods (e.g., PQ, PCA) and recent state-of-the-art approaches (e.g., LVQ, Extended RabitQ). SAQ achieves up to 80% reduction in quantization error and accelerates encoding speed by over 80x compared to Extended RabitQ.
Feature selection has emerged as a crucial technique in refining recommender systems. Recent advancements leveraging Automated Machine Learning (AutoML) has drawn significant attention, particularly in two main categories: early feature selection and late feature selection, differentiated by whether the selection occurs before or after the embedding layer. The early feature selection selects a fixed subset of features and retrains the model, while the late feature selection, known as adaptive feature selection, dynamically adjusts feature choices for each data instance, recognizing the variability in feature significance. Although adaptive feature selection has shown remarkable improvements in performance, its main drawback lies in its post-embedding layer feature selection. This process often becomes cumbersome and inefficient in large-scale recommender systems with billions of ID-type features, leading to a highly sparse and parameter-heavy embedding layer. To overcome this, we introduce Adaptive Early Feature Selection (AEFS), a very simple method that not only adaptively selects informative features for each instance, but also significantly reduces the activated parameters of the embedding layer. AEFS employs a dual-model architecture, encompassing an auxiliary model dedicated to feature selection and a main model responsible for prediction. To ensure effective alignment between these two models, we incorporate two collaborative training loss constraints. Our extensive experiments on three benchmark datasets validate the efficiency and effectiveness of our approach. Notably, AEFS matches the performance of current state-of-theart Adaptive Late Feature Selection methods while achieving a significant reduction of 37. 5% in the activated parameters of the embedding layer. AEFS is open-source at https://github. com/fly-dragon211/AEFS .
This report presents the results of the 14th Video Browser Showdown, held at the 2025 International Conference on Multimedia Modeling on the 8th of January 2025 in Nara, Japan.
Online AI platforms for creating music from text prompts (AI music), such as Suno and Udio, are now being used by hundreds of thousands of users. Some AI music is appearing in advertising, and even charting, in multiple countries. How are these platforms being used? What subjects are inspiring their users? This article answers these questions for Suno and Udio using a large collection of songs generated by users of these platforms from May to October 2024. Using a combination of state-of-the-art text embedding models, dimensionality reduction and clustering methods, we analyze the prompts, tags and lyrics, and automatically annotate and display the processed data in interactive plots. Our results reveal prominent themes in lyrics, language preference, prompting strategies, as well as peculiar attempts at steering models through the use of metatags. To promote the musicological study of the developing cultural practice of AI-generated music we share our code and resources.
Fine-tuning large language models (LLMs) for recommendation in a generative manner has delivered promising results, but encounters significant inference overhead due to autoregressive decoding in the language space. This work explores bypassing language-space decoding by directly matching candidate items with the LLM's internal thought representations in the latent space, eliminating the time-consuming autoregressive process to reduce computational costs. Towards this, we introduce Light Latent-space Decoding (L2D), an effective and efficient latent-space decoding method. L2D represents user-preferred items by using the hidden states of test sequences reflecting the LLM's internal thought, and obtains candidate item representations from the hidden states of training sequences labeled with the corresponding candidate items. It then matches the two types of representations to decode items, achieving latent-space decoding. In this way, it enables efficient decoding without altering the LLM's generative tuning paradigm, thereby preserving performance. Extensive empirical results demonstrate that L2D is more than 10x faster than language-space decoding while maintaining or enhancing performance.
Electronic Dance Music (EDM) classification typically relies on industry-defined taxonomies with numerous subgenres, yet the acoustic basis for these distinctions remains unclear. Current approaches use supervised learning with prescribed genre labels, assuming their validity without systematic evaluation. In this paper, we propose an unsupervised approach to discover the natural acoustic structure of EDM independent of commercial labels. Our method combines novel tempogram-based features capturing EDM's layered rhythmic patterns with multi-criteria feature selection. To validate that our findings reflect genuine acoustic structure rather than methodological artifacts, we compare our results against state-of-the-art pre-trained audio embeddings (MERT and CLAP). Both our feature space and embedding representations converge to 19-23 natural acoustic families compared to the prescribed 35, providing consistent evidence of significant overspecification in current EDM taxonomy by approximately one-third.
Large language models (LLMs) are increasingly deployed in information systems, including being used as second-stage rerankers in information retrieval pipelines, yet their susceptibility to recency bias has received little attention. We investigate whether LLMs implicitly favour newer documents by prepending artificial publication dates to passages in the TREC Deep Learning passage retrieval collections in 2021 (DL21) and 2022 (DL22). Across seven models, GPT-3.5-turbo, GPT-4o, GPT-4, LLaMA-3 8B/70B, and Qwen-2.5 7B/72B, "fresh" passages are consistently promoted, shifting the Top-10's mean publication year forward by up to 4.78 years and moving individual items by as many as 95 ranks in our listwise reranking experiments. Although larger models attenuate the effect, none eliminate it. We also observe that the preference of LLMs between two passages with an identical relevance level can be reversed by up to 25% on average after date injection in our pairwise preference experiments. These findings provide quantitative evidence of a pervasive recency bias in LLMs and highlight the importance of effective bias-mitigation strategies.
Decentralized data-feed systems enable blockchain-based smart contracts to access off-chain information by aggregating values from multiple oracles. To improve accuracy, these systems typically use an aggregation function, such as majority voting, to consolidate the inputs they receive from oracles and make a decision. Depending on the final decision and the values reported by the oracles, the participating oracles are compensated through shared rewards. However, such incentive mechanisms are vulnerable to mirroring attacks, where a single user controls multiple oracles to bias the decision of the aggregation function and maximize rewards. This paper analyzes the impact of mirroring attacks on the reliability and dependability of majority voting-based data-feed systems. We demonstrate how existing incentive mechanisms can unintentionally encourage rational users to implement such attacks. To address this, we propose a new incentive mechanism that discourages Sybil behavior. We prove that the proposed mechanism leads to a Nash Equilibrium in which each user operates only one oracle. Finally, we discuss the practical implementation of the proposed incentive mechanism and provide numerical examples to demonstrate its effectiveness.
We introduce random adversarial training (RAT), a novel framework successfully applied to biomedical information extraction (BioIE) tasks. Building on PubMedBERT as the foundational architecture, our study first validates the effectiveness of conventional adversarial training in enhancing pre-trained language models' performance on BioIE tasks. While adversarial training yields significant improvements across various performance metrics, it also introduces considerable computational overhead. To address this limitation, we propose RAT as an efficiency solution for biomedical information extraction. This framework strategically integrates random sampling mechanisms with adversarial training principles, achieving dual objectives: enhanced model generalization and robustness while significantly reducing computational costs. Through comprehensive evaluations, RAT demonstrates superior performance compared to baseline models in BioIE tasks. The results highlight RAT's potential as a transformative framework for biomedical natural language processing, offering a balanced solution to the model performance and computational efficiency.
Personalized news recommendation systems inadvertently create information cocoons--homogeneous information bubbles that reinforce user biases and amplify societal polarization. To address the lack of comprehensive assessment frameworks in prior research, we propose a multidimensional analysis that evaluates cocoons through dual perspectives: (1) Individual homogenization via topic diversity (including the number of topic categories and category information entropy) and click repetition; (2) Group polarization via network density and community openness. Through multi-round experiments on real-world datasets, we benchmark seven algorithms and reveal critical insights. Furthermore, we design five lightweight mitigation strategies. This work establishes the first unified metric framework for information cocoons and delivers deployable solutions for ethical recommendation systems.
Knowledge Graphs (KGs) enhance recommender systems but face challenges from inherent noise, sparsity, and Euclidean geometry's inadequacy for complex relational structures, critically impairing representation learning, especially for long-tail entities. Existing methods also often lack adaptive multi-source signal fusion tailored to item popularity. This paper introduces SPARK, a novel multi-stage framework systematically tackling these issues. SPARK first employs Tucker low-rank decomposition to denoise KGs and generate robust entity representations. Subsequently, an SVD-initialized hybrid geometric GNN concurrently learns representations in Euclidean and Hyperbolic spaces; the latter is strategically leveraged for its aptitude in modeling hierarchical structures, effectively capturing semantic features of sparse, long-tail items. A core contribution is an item popularity-aware adaptive fusion strategy that dynamically weights signals from collaborative filtering, refined KG embeddings, and diverse geometric spaces for precise modeling of both mainstream and long-tail items. Finally, contrastive learning aligns these multi-source representations. Extensive experiments demonstrate SPARK's significant superiority over state-of-the-art methods, particularly in improving long-tail item recommendation, offering a robust, principled approach to knowledge-enhanced recommendation. Implementation code is available at https://github.com/Applied-Machine-Learning-Lab/SPARK.
Recommender systems (RecSys) have been widely applied to various applications, including E-commerce, finance, healthcare, social media and have become increasingly influential in shaping user behavior and decision-making, highlighting their growing impact in various domains. However, recent studies have shown that RecSys are vulnerable to membership inference attacks (MIAs), which aim to infer whether user interaction record was used to train a target model or not. MIAs on RecSys models can directly lead to a privacy breach. For example, via identifying the fact that a purchase record that has been used to train a RecSys associated with a specific user, an attacker can infer that user's special quirks. In recent years, MIAs have been shown to be effective on other ML tasks, e.g., classification models and natural language processing. However, traditional MIAs are ill-suited for RecSys due to the unseen posterior probability. Although MIAs on RecSys form a newly emerging and rapidly growing research area, there has been no systematic survey on this topic yet. In this article, we conduct the first comprehensive survey on RecSys MIAs. This survey offers a comprehensive review of the latest advancements in RecSys MIAs, exploring the design principles, challenges, attack and defense associated with this emerging field. We provide a unified taxonomy that categorizes different RecSys MIAs based on their characterizations and discuss their pros and cons. Based on the limitations and gaps identified in this survey, we point out several promising future research directions to inspire the researchers who wish to follow this area. This survey not only serves as a reference for the research community but also provides a clear description for researchers outside this research domain.
Grounded Multimodal Named Entity Recognition (GMNER) extends traditional NER by jointly detecting textual mentions and grounding them to visual regions. While existing supervised methods achieve strong performance, they rely on costly multimodal annotations and often underperform in low-resource domains. Multimodal Large Language Models (MLLMs) show strong generalization but suffer from Domain Knowledge Conflict, producing redundant or incorrect mentions for domain-specific entities. To address these challenges, we propose ReFineG, a three-stage collaborative framework that integrates small supervised models with frozen MLLMs for low-resource GMNER. In the Training Stage, a domain-aware NER data synthesis strategy transfers LLM knowledge to small models with supervised training while avoiding domain knowledge conflicts. In the Refinement Stage, an uncertainty-based mechanism retains confident predictions from supervised models and delegates uncertain ones to the MLLM. In the Grounding Stage, a multimodal context selection algorithm enhances visual grounding through analogical reasoning. In the CCKS2025 GMNER Shared Task, ReFineG ranked second with an F1 score of 0.6461 on the online leaderboard, demonstrating its effectiveness with limited annotations.
Scientific progress increasingly depends on synthesizing knowledge across vast literature, yet most experimental data remains trapped in semi-structured formats that resist systematic extraction and analysis. Here, we present MatSKRAFT, a computational framework that automatically extracts and integrates materials science knowledge from tabular data at unprecedented scale. Our approach transforms tables into graph-based representations processed by constraint-driven GNNs that encode scientific principles directly into model architecture. MatSKRAFT significantly outperforms state-of-the-art large language models, achieving F1 scores of 88.68 for property extraction and 71.35 for composition extraction, while processing data $19$-$496\times$ faster than them (compared to the slowest and the fastest models, respectively) with modest hardware requirements. Applied to nearly 69,000 tables from more than 47,000 research publications, we construct a comprehensive database containing over 535,000 entries, including 104,000 compositions that expand coverage beyond major existing databases, pending manual validation. This systematic approach reveals previously overlooked materials with distinct property combinations and enables data-driven discovery of composition-property relationships forming the cornerstone of materials and scientific discovery.
We present RecoWorld, a blueprint for building simulated environments tailored to agentic recommender systems. Such environments give agents a proper training space where they can learn from errors without impacting real users. RecoWorld distinguishes itself with a dual-view architecture: a simulated user and an agentic recommender engage in multi-turn interactions aimed at maximizing user retention. The user simulator reviews recommended items, updates its mindset, and when sensing potential user disengagement, generates reflective instructions. The agentic recommender adapts its recommendations by incorporating these user instructions and reasoning traces, creating a dynamic feedback loop that actively engages users. This process leverages the exceptional reasoning capabilities of modern LLMs. We explore diverse content representations within the simulator, including text-based, multimodal, and semantic ID modeling, and discuss how multi-turn RL enables the recommender to refine its strategies through iterative interactions. RecoWorld also supports multi-agent simulations, allowing creators to simulate the responses of targeted user populations. It marks an important first step toward recommender systems where users and agents collaboratively shape personalized information streams. We envision new interaction paradigms where "user instructs, recommender responds," jointly optimizing user retention and engagement.
While optimizing recommendation systems for user engagement is a well-established practice, effectively diversifying recommendations without negatively impacting core business metrics remains a significant industry challenge. In line with our initiative to broaden our audience's cultural practices, this study investigates using personalized Determinantal Point Processes (DPPs) to sample diverse and relevant recommendations. We rely on a well-known quality-diversity decomposition of the similarity kernel to give more weight to user preferences. In this paper, we present our implementations of the personalized DPP sampling, evaluate the trade-offs between relevance and diversity through both offline and online metrics, and give insights for practitioners on their use in a production environment. For the sake of reproducibility, we release the full code for our platform and experiments on GitHub.
Recommender systems often benefit from complex feature embeddings and deep learning algorithms, which deliver sophisticated recommendations that enhance user experience, engagement, and revenue. However, these methods frequently reduce the interpretability and transparency of the system. In this research, we develop a systematic application, adaptation, and evaluation of deletion diagnostics in the recommender setting. The method compares the performance of a model to that of a similar model trained without a specific user or item, allowing us to quantify how that observation influences the recommender, either positively or negatively. To demonstrate its model-agnostic nature, the proposal is applied to both Neural Collaborative Filtering (NCF), a widely used deep learning-based recommender, and Singular Value Decomposition (SVD), a classical collaborative filtering technique. Experiments on the MovieLens and Amazon Reviews datasets provide insights into model behavior and highlight the generality of the approach across different recommendation paradigms.
We regularly encounter information on novel, emerging topics for which the body of knowledge is still evolving, which can be linked, for instance, to current events. A primary way to learn more about such topics is through web search. However, information on emerging topics is sparse and evolves dynamically as knowledge grows, making it uncertain and variable in quality and trustworthiness and prone to deliberate or accidental manipulation, misinformation, and bias. In this paper, we outline a research vision towards search systems and interfaces that support effective knowledge acquisition, awareness of the dynamic nature of topics, and responsible opinion formation among people searching the web for information on emerging topics. To realize this vision, we propose three overarching research questions, aimed at understanding the status quo, determining requirements of systems aligned with our vision, and building these systems. For each of the three questions, we highlight relevant literature, including pointers on how they could be addressed. Lastly, we discuss the challenges that will potentially arise in pursuing the proposed vision.