CyberSec Research

Twitter/X GitHub

Loading...

Made by 0x1622

CyberSec Research

Browse, search and filter the latest cybersecurity research papers from arXiv

Filters

Cryptography and Security1245

Computers and Society654

Networking and Internet Architecture876

Distributed Computing432

Software Engineering789

Artificial Intelligence1532

Machine Learning921

Hardware Security342

Software Security578

Network Security456

AI Security324

ML Security428

Cloud Security219

IoT Security187

Malware Analysis296

Cryptography413

Privacy329

Authentication247

Vulnerability Analysis385

Publication Year

Results (20303)

Inferring trust in recommendation systems from brain, behavioural, and physiological data

Oct 31, 2025

Vincent K. M. Cheung, Pei-Cheng Shih, Ma...

As people nowadays increasingly rely on artificial intelligence (AI) to curate information and make decisions, assigning the appropriate amount of trust in automated intelligent systems has become ever more important. However, current measurements of trust in automation still largely rely on self-reports that are subjective and disruptive to the user. Here, we take music recommendation as a model to investigate the neural and cognitive processes underlying trust in automation. We observed that system accuracy was directly related to users' trust and modulated the influence of recommendation cues on music preference. Modelling users' reward encoding process with a reinforcement learning model further revealed that system accuracy, expected reward, and prediction error were related to oscillatory neural activity recorded via EEG and changes in pupil diameter. Our results provide a neurally grounded account of calibrating trust in automation and highlight the promises of a multimodal approach towards developing trustable AI systems.

+2

Read Article PDF

Reference Microphone Selection for Guided Source Separation based on the Normalized L-p Norm

Oct 31, 2025

Anselm Lohmann, Tomohiro Nakatani, Rinta...

Guided Source Separation (GSS) is a popular front-end for distant automatic speech recognition (ASR) systems using spatially distributed microphones. When considering spatially distributed microphones, the choice of reference microphone may have a large influence on the quality of the output signal and the downstream ASR performance. In GSS-based speech enhancement, reference microphone selection is typically performed using the signal-to-noise ratio (SNR), which is optimal for noise reduction but may neglect differences in early-to-late-reverberant ratio (ELR) across microphones. In this paper, we propose two reference microphone selection methods for GSS-based speech enhancement that are based on the normalized $\ell_p$-norm, either using only the normalized $\ell_p$-norm or combining the normalized $\ell_p$-norm and the SNR to account for both differences in SNR and ELR across microphones. Experimental evaluation using a CHiME-8 distant ASR system shows that the proposed $\ell_p$-norm-based methods outperform the baseline method, reducing the macro-average word error rate.

+1

Read Article PDF

Beamforming in the Reproducing Kernel Domain Based on Spatial Differentiation

Oct 31, 2025

Takahiro Iwami, Naohisa Inoue, Akira Omo...

This paper proposes a novel beamforming framework in the reproducing kernel domain, derived from a unified interpretation of directional response as spatial differentiation of the sound field. By representing directional response using polynomial differential operators, the proposed method enables the formulation of arbitrary beam patterns including non-axisymmetric. The derivation of the reproducing kernel associated with the interior fields is mathematically supported by Hobson's theorem, which allows concise analytical expressions. Furthermore, the proposed framework generalizes conventional spherical harmonic domain beamformers by reinterpreting them as spatial differential operators, thereby clarifying their theoretical structure and extensibility. Three numerical simulations conducted in two-dimensional space confirm the validity of the method.

+2

Read Article PDF

Expressive Range Characterization of Open Text-to-Audio Models

Oct 31, 2025

Jonathan Morse, Azadeh Naderi, Swen Gaud...

Text-to-audio models are a type of generative model that produces audio output in response to a given textual prompt. Although level generators and the properties of the functional content that they create (e.g., playability) dominate most discourse in procedurally generated content (PCG), games that emotionally resonate with players tend to weave together a range of creative and multimodal content (e.g., music, sounds, visuals, narrative tone), and multimodal models have begun seeing at least experimental use for this purpose. However, it remains unclear what exactly such models generate, and with what degree of variability and fidelity: audio is an extremely broad class of output for a generative system to target. Within the PCG community, expressive range analysis (ERA) has been used as a quantitative way to characterize generators' output space, especially for level generators. This paper adapts ERA to text-to-audio models, making the analysis tractable by looking at the expressive range of outputs for specific, fixed prompts. Experiments are conducted by prompting the models with several standardized prompts derived from the Environmental Sound Classification (ESC-50) dataset. The resulting audio is analyzed along key acoustic dimensions (e.g., pitch, loudness, and timbre). More broadly, this paper offers a framework for ERA-based exploratory evaluation of generative audio models.

+1

+2

Read Article PDF

Modeling strategies for speech enhancement in the latent space of a neural audio codec

Oct 30, 2025

Sofiene Kammoun, Xavier Alameda-Pineda, ...

Neural audio codecs (NACs) provide compact latent speech representations in the form of sequences of continuous vectors or discrete tokens. In this work, we investigate how these two types of speech representations compare when used as training targets for supervised speech enhancement. We consider both autoregressive and non-autoregressive speech enhancement models based on the Conformer architecture, as well as a simple baseline where the NAC encoder is simply fine-tuned for speech enhancement. Our experiments reveal three key findings: predicting continuous latent representations consistently outperforms discrete token prediction; autoregressive models achieve higher quality but at the expense of intelligibility and efficiency, making non-autoregressive models more attractive in practice; and encoder fine-tuning yields the strongest enhancement metrics overall, though at the cost of degraded codec reconstruction. The code and audio samples are available online.

Software Security

+2

+1

Read Article PDF

SP-MCQA: Evaluating Intelligibility of TTS Beyond the Word Level

Oct 30, 2025

Hitomi Jin Ling Tee, Chaoren Wang, Zijie...

The evaluation of intelligibility for TTS has reached a bottleneck, as existing assessments heavily rely on word-by-word accuracy metrics such as WER, which fail to capture the complexity of real-world speech or reflect human comprehension needs. To address this, we propose Spoken-Passage Multiple-Choice Question Answering, a novel subjective approach evaluating the accuracy of key information in synthesized speech, and release SP-MCQA-Eval, an 8.76-hour news-style benchmark dataset for SP-MCQA evaluation. Our experiments reveal that low WER does not necessarily guarantee high key-information accuracy, exposing a gap between traditional metrics and practical intelligibility. SP-MCQA shows that even state-of-the-art (SOTA) models still lack robust text normalization and phonetic accuracy. This work underscores the urgent need for high-level, more life-like evaluation criteria now that many systems already excel at WER yet may fall short on real-world intelligibility.

Network Security

+1

+2

Read Article PDF

Multi-Representation Attention Framework for Underwater Bioacoustic Denoising and Recognition

Oct 29, 2025

Amine Razig, Youssef Soulaymani, Loubna ...

Automated monitoring of marine mammals in the St. Lawrence Estuary faces extreme challenges: calls span low-frequency moans to ultrasonic clicks, often overlap, and are embedded in variable anthropogenic and environmental noise. We introduce a multi-step, attention-guided framework that first segments spectrograms to generate soft masks of biologically relevant energy and then fuses these masks with the raw inputs for multi-band, denoised classification. Image and mask embeddings are integrated via mid-level fusion, enabling the model to focus on salient spectrogram regions while preserving global context. Using real-world recordings from the Saguenay St. Lawrence Marine Park Research Station in Canada, we demonstrate that segmentation-driven attention and mid-level fusion improve signal discrimination, reduce false positive detections, and produce reliable representations for operational marine mammal monitoring across diverse environmental conditions and signal-to-noise ratios. Beyond in-distribution evaluation, we further assess the generalization of Mask-Guided Classification (MGC) under distributional shifts by testing on spectrograms generated with alternative acoustic transformations. While high-capacity baseline models lose accuracy in this Out-of-distribution (OOD) setting, MGC maintains stable performance, with even simple fusion mechanisms (gated, concat) achieving comparable results across distributions. This robustness highlights the capacity of MGC to learn transferable representations rather than overfitting to a specific transformation, thereby reinforcing its suitability for large-scale, real-world biodiversity monitoring. We show that in all experimental settings, the MGC framework consistently outperforms baseline architectures, yielding substantial gains in accuracy on both in-distribution and OOD data.

+1

+4

Read Article PDF

SPEAR: A Unified SSL Framework for Learning Speech and Audio Representations

Oct 29, 2025

Xiaoyu Yang, Yifan Yang, Zengrui Jin, Zi...

Self-Supervised Learning (SSL) excels at learning generic representations of acoustic signals, yet prevailing methods remain domain-specific, tailored to either speech or general audio, hindering the development of a unified representation model with a comprehensive capability over both domains. To address this, we present SPEAR (SPEech and Audio Representations), the first SSL framework to successfully learn unified speech and audio representations from a mixture of speech and audio data. SPEAR proposes a unified pre-training objective based on masked prediction of fine-grained discrete tokens for both speech and general audio. These tokens are derived from continuous speech and audio representations using a Multi-codebook Vector Quantisation (MVQ) method, retaining rich acoustic detail essential for modelling both speech and complex audio events. SPEAR is applied to pre-train both single-domain and unified speech-and-audio SSL models. Our speech-domain model establishes a new state-of-the-art on the SUPERB benchmark, a speech processing benchmark for SSL models, matching or surpassing the highly competitive WavLM Large on 12 out of 15 tasks with the same pre-training corpora and a similar model size. Crucially, our unified model learns complementary features and demonstrates comprehensive capabilities across two major benchmarks, SUPERB and HEAR, for evaluating audio representations. By further scaling up the model size and pre-training data, we present a unified model with 600M parameters that excels in both domains, establishing it as one of the most powerful and versatile open-source SSL models for auditory understanding. The inference code and pre-trained models will be made publicly available.

Software Security

+1

Read Article PDF

Lost in Phonation: Voice Quality Variation as an Evaluation Dimension for Speech Foundation Models

Oct 29, 2025

Harm Lameris, Shree Harsha Bokkahalli Sa...

Recent advances in speech foundation models (SFMs) have enabled the direct processing of spoken language from raw audio, bypassing intermediate textual representations. This capability allows SFMs to be exposed to, and potentially respond to, rich paralinguistic variations embedded in the input speech signal. One under-explored dimension of paralinguistic variation is voice quality, encompassing phonation types such as creaky and breathy voice. These phonation types are known to influence how listeners infer affective state, stance and social meaning in speech. Existing benchmarks for speech understanding largely rely on multiple-choice question answering (MCQA) formats, which are prone to failure and therefore unreliable in capturing the nuanced ways paralinguistic features influence model behaviour. In this paper, we probe SFMs through open-ended generation tasks and speech emotion recognition, evaluating whether model behaviours are consistent across different phonation inputs. We introduce a new parallel dataset featuring synthesized modifications to voice quality, designed to evaluate SFM responses to creaky and breathy voice. Our work provides the first examination of SFM sensitivity to these particular non-lexical aspects of speech perception.

Network Security

+1

+2

Read Article PDF

PitchFlower: A flow-based neural audio codec with pitch controllability

Oct 29, 2025

Diego Torres, Axel Roebel, Nicolas Obin

We present PitchFlower, a flow-based neural audio codec with explicit pitch controllability. Our approach enforces disentanglement through a simple perturbation: during training, F0 contours are flattened and randomly shifted, while the true F0 is provided as conditioning. A vector-quantization bottleneck prevents pitch recovery, and a flow-based decoder generates high quality audio. Experiments show that PitchFlower achieves more accurate pitch control than WORLD at much higher audio quality, and outperforms SiFiGAN in controllability while maintaining comparable quality. Beyond pitch, this framework provides a simple and extensible path toward disentangling other speech attributes.

Software Security

+1

+1

Read Article PDF

Controlling Contrastive Self-Supervised Learning with Knowledge-Driven Multiple Hypothesis: Application to Beat Tracking

Oct 29, 2025

Antonin Gagnere, Slim Essid, Geoffroy Pe...

Ambiguities in data and problem constraints can lead to diverse, equally plausible outcomes for a machine learning task. In beat and downbeat tracking, for instance, different listeners may adopt various rhythmic interpretations, none of which would necessarily be incorrect. To address this, we propose a contrastive self-supervised pre-training approach that leverages multiple hypotheses about possible positive samples in the data. Our model is trained to learn representations compatible with different such hypotheses, which are selected with a knowledge-based scoring function to retain the most plausible ones. When fine-tuned on labeled data, our model outperforms existing methods on standard benchmarks, showcasing the advantages of integrating domain knowledge with multi-hypothesis selection in music representation learning in particular.

Network Security

+2

+1

Read Article PDF

Separating peripheral and higher-level effects on speech intelligibility using a hearing loss simulator and an objective intelligibility measure

Oct 29, 2025

Toshio Irino, Ayako Yamamoto, Fuki Miyaz...

This paper presents a new method for separating the effects of peripheral hearing loss (HL) and higher-level processes on speech intelligibility (SI). In a previous study, we conducted an SI experiment with 14 older adult (OA) listeners, using speech-in-noise sounds that were either processed with an ideal ratio mask (IRM) enhancement technique or left unprocessed. The current study involved an SI experiment with 15 young, normal-hearing (YNH) listeners. This experiment used simulated HL sounds processed with the WHIS simulator that reflected the hearing level of a specific OA from the previous study. The results showed that the target OA's SI scores were higher than the average YNH scores. This implies that the target OA's higher-level processes may be more effective than those of the average YNH. To understand the characteristics of other OAs, we used the GESI objective intelligibility measure to predict SI. First, we confirmed that GESI could fairly accurately predict the SI scores for both the YNH and OA listeners. Next, we predicted the SI scores of the 14 OA listeners using the parameters estimated in the YNH experiment. The results showed that some OAs had higher SI scores than the average YNH, while one OA had lower scores. These differences in SI scores may reflect variations in the efficiency of higher-level processes.These results imply that WHIS and GESI could facilitate contrastive experiments between YNH and OA listeners, regardless of hearing level. This would allow us to study the effects of higher-level processes in OA listeners individually.

Network Security

+1

Read Article PDF

Retaining Mixture Representations for Domain Generalized Anomalous Sound Detection

Oct 29, 2025

Phurich Saengthong, Tomoya Nishida, Kota...

Anomalous sound detection (ASD) in the wild requires robustness to distribution shifts such as unseen low-SNR input mixtures of machine and noise types. State-of-the-art systems extract embeddings from an adapted audio encoder and detect anomalies via nearest-neighbor search, but fine tuning on noisy machine sounds often acts like a denoising objective, suppressing noise and reducing generalization under mismatched mixtures or inconsistent labeling. Training-free systems with frozen self-supervised learning (SSL) encoders avoid this issue and show strong first-shot generalization, yet their performance drops when mixture embeddings deviate from clean-source embeddings. We propose to improve SSL backbones with a retain-not-denoise strategy that better preserves information from mixed sound sources. The approach combines a multi-label audio tagging loss with a mixture alignment loss that aligns student mixture embeddings to convex teacher embeddings of clean and noise inputs. Controlled experiments on stationary, non-stationary, and mismatched noise subsets demonstrate improved robustness under distribution shifts, narrowing the gap toward oracle mixture representations.

Software Security

+1

+1

Read Article PDF

Audio-Visual Speech Enhancement In Complex Scenarios With Separation And Dereverberation Joint Modeling

Oct 29, 2025

Jiarong Du, Zhan Jin, Peijun Yang, Juan ...

Audio-visual speech enhancement (AVSE) is a task that uses visual auxiliary information to extract a target speaker's speech from mixed audio. In real-world scenarios, there often exist complex acoustic environments, accompanied by various interfering sounds and reverberation. Most previous methods struggle to cope with such complex conditions, resulting in poor perceptual quality of the extracted speech. In this paper, we propose an effective AVSE system that performs well in complex acoustic environments. Specifically, we design a "separation before dereverberation" pipeline that can be extended to other AVSE networks. The 4th COGMHEAR Audio-Visual Speech Enhancement Challenge (AVSEC) aims to explore new approaches to speech processing in multimodal complex environments. We validated the performance of our system in AVSEC-4: we achieved excellent results in the three objective metrics on the competition leaderboard, and ultimately secured first place in the human subjective listening test.

Network Security

+3

Read Article PDF

Joint Analysis of Acoustic Scenes and Sound Events Based on Semi-Supervised Training of Sound Events With Partial Labels

Oct 29, 2025

Keisuke Imoto

Annotating time boundaries of sound events is labor-intensive, limiting the scalability of strongly supervised learning in audio detection. To reduce annotation costs, weakly-supervised learning with only clip-level labels has been widely adopted. As an alternative, partial label learning offers a cost-effective approach, where a set of possible labels is provided instead of exact weak annotations. However, partial label learning for audio analysis remains largely unexplored. Motivated by the observation that acoustic scenes provide contextual information for constructing a set of possible sound events, we utilize acoustic scene information to construct partial labels of sound events. On the basis of this idea, in this paper, we propose a multitask learning framework that jointly performs acoustic scene classification and sound event detection with partial labels of sound events. While reducing annotation costs, weakly-supervised and partial label learning often suffer from decreased detection performance due to lacking the precise event set and their temporal annotations. To better balance between annotation cost and detection performance, we also explore a semi-supervised framework that leverages both strong and partial labels. Moreover, to refine partial labels and achieve better model training, we propose a label refinement method based on self-distillation for the proposed approach with partial labels.

Network Security

+1

+1

Read Article PDF

Evaluating Emotion Recognition in Spoken Language Models on Emotionally Incongruent Speech

Oct 29, 2025

Pedro Corrêa, João Lima, Victor Moreno, ...

Advancements in spoken language processing have driven the development of spoken language models (SLMs), designed to achieve universal audio understanding by jointly learning text and audio representations for a wide range of tasks. Although promising results have been achieved, there is growing discussion regarding these models' generalization capabilities and the extent to which they truly integrate audio and text modalities in their internal representations. In this work, we evaluate four SLMs on the task of speech emotion recognition using a dataset of emotionally incongruent speech samples, a condition under which the semantic content of the spoken utterance conveys one emotion while speech expressiveness conveys another. Our results indicate that SLMs rely predominantly on textual semantics rather than speech emotion to perform the task, indicating that text-related representations largely dominate over acoustic representations. We release both the code and the Emotionally Incongruent Synthetic Speech dataset (EMIS) to the community.

Software Security

+1

Read Article PDF

EasyEyes: Online hearing research using speakers calibrated by phones

Oct 29, 2025

Ivan Vican, Hugo De Moraes, Chongjun Lia...

Hearing research requires a calibrated sound source, traditionally as lab equipment. Online research is quicker and more inclusive, but most participants lack calibration equipment and their sound sources are uncalibrated and diverse. This article explains how the open-source EasyEyes.app calibrates loudspeakers online. A library of smartphone-microphone profiles allows EasyEyes to use the participant's phone to calibrate their computer's loudspeaker in three minutes. Participants select their phone model, which is verified by screen size. Calibration employs the Novak et al. nonsynchronous maximum-length-sequence (MLS) algorithm. The computer's loudspeaker is corrected by convolving its input with the inverse of its impulse response. Researchers can contribute to the open-access library by calibrating phones with a measurement microphone. In the library, each profile is linked back to the profile used to produce it, back to the manufacturer profile of a measurement microphone. Correction accuracy is such that playing the flat-spectrum MLS through the corrected loudspeaker produces a nearly flat spectrum, with standard deviation less than 3 dB. A survey shows that a library of 94 phone models from major brands will support most participants in the USA (87%) and UK (80%). This method facilitates efficient and inclusive online hearing research.

Network Security

+1

Read Article PDF

POWSM: A Phonetic Open Whisper-Style Speech Foundation Model

Oct 28, 2025

Chin-Jou Li, Kalvin Chang, Shikhar Bhara...

Recent advances in spoken language processing have led to substantial progress in phonetic tasks such as automatic speech recognition (ASR), phone recognition (PR), grapheme-to-phoneme conversion (G2P), and phoneme-to-grapheme conversion (P2G). Despite their conceptual similarity, these tasks have largely been studied in isolation, each relying on task-specific architectures and datasets. In this paper, we introduce POWSM (Phonetic Open Whisper-style Speech Model), the first unified framework capable of jointly performing multiple phone-related tasks. POWSM enables seamless conversion between audio, text (graphemes), and phones, opening up new possibilities for universal and low-resource speech processing. Our model outperforms or matches specialized PR models of similar size (Wav2Vec2Phoneme and ZIPA) while jointly supporting G2P, P2G, and ASR. Our training data, code and models are released to foster open science.

Software Security

Network Security

+2

+1

Read Article PDF