Generative modeling has recently achieved remarkable success across text, image, and audio domains, demonstrating powerful capabilities for unified representation learning. However, audio generation models still face challenges in audio quality and generalization across tasks, and are typically developed separately for each task. This fragmentation results in redundant development efforts, inconsistent performance, and limited extensibility. To address these issues, we propose UniTok-Audio, a scalable and extensible framework for unified audio generation tasks. Specifically, 1) UniTok-Audio extracts continuous features of the conditioning inputs and generates discrete tokens of the target audio in an autoregressive manner; 2) a special task identifier token unifies the different learning patterns of multiple tasks in a single framework; 3) a dual-stream audio codec comprising acoustic and semantic branches is developed for high-fidelity waveform reconstruction. Experimental results demonstrate that UniTok-Audio achieves competitive performance compared with state-of-the-art task-specific or multi-task systems across five time-aligned tasks: speech restoration, target speaker extraction, speech separation, voice conversion, and language-queried audio source separation. To foster future research, we will open-source our codebase. The demo page of our work can be found here: https://alibaba.github.io/unified-audio.
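To make the task-identifier mechanism concrete, the following is a minimal sketch of autoregressive decoding in which a task-ID embedding and projected continuous condition features form the prefix for discrete-token generation. All module names, dimensions, and the greedy decoding loop are illustrative assumptions, not the UniTok-Audio implementation.

```python
# Illustrative sketch of task-conditioned autoregressive token decoding.
# All module names, vocabulary sizes, and dimensions are assumptions,
# not the UniTok-Audio implementation.
import torch
import torch.nn as nn

class TinyARDecoder(nn.Module):
    def __init__(self, vocab=1024, n_tasks=5, d=256, cond_dim=80):
        super().__init__()
        self.task_emb = nn.Embedding(n_tasks, d)       # one id per task (e.g. VC, TSE, ...)
        self.cond_proj = nn.Linear(cond_dim, d)        # continuous condition features
        self.tok_emb = nn.Embedding(vocab + 1, d)      # +1 for a BOS token
        layer = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d, vocab)

    def forward(self, task_id, cond, tokens):
        # Prefix: [task id] + [condition frames] + [previously generated tokens]
        prefix = torch.cat([self.task_emb(task_id).unsqueeze(1),
                            self.cond_proj(cond)], dim=1)
        x = torch.cat([prefix, self.tok_emb(tokens)], dim=1)
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        h = self.backbone(x, mask=mask)
        return self.head(h[:, -1])                     # logits for the next token

model = TinyARDecoder()
task_id = torch.tensor([2])                            # e.g. "speech separation"
cond = torch.randn(1, 50, 80)                          # continuous condition features
tokens = torch.full((1, 1), 1024)                      # start from BOS
for _ in range(5):                                     # greedy decoding of a few tokens
    nxt = model(task_id, cond, tokens).argmax(-1, keepdim=True)
    tokens = torch.cat([tokens, nxt], dim=1)
```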
Neural audio codecs (NACs) provide compact latent speech representations in the form of sequences of continuous vectors or discrete tokens. In this work, we investigate how these two types of speech representations compare when used as training targets for supervised speech enhancement. We consider both autoregressive and non-autoregressive speech enhancement models based on the Conformer architecture, as well as a simple baseline where the NAC encoder is directly fine-tuned for speech enhancement. Our experiments reveal three key findings: predicting continuous latent representations consistently outperforms discrete token prediction; autoregressive models achieve higher quality but at the expense of intelligibility and efficiency, making non-autoregressive models more attractive in practice; and encoder fine-tuning yields the strongest enhancement metrics overall, though at the cost of degraded codec reconstruction. The code and audio samples are available online.
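A minimal sketch of the two training targets being compared, assuming a generic codec interface: regressing the codec's continuous latents versus classifying its discrete token ids (one codebook shown). The shapes and stand-in tensors are assumptions, not the paper's setup.

```python
# Sketch of the two training targets compared in the paper: regressing the
# codec's continuous latents vs. classifying its discrete tokens. The
# "codec" outputs here are random stand-ins, not a specific NAC.
import torch
import torch.nn.functional as F

B, T, D, CODEBOOK = 4, 100, 128, 1024
enhancer_out = torch.randn(B, T, D, requires_grad=True)   # model output per frame

# 1) Continuous target: the NAC encoder latents of the clean speech.
clean_latents = torch.randn(B, T, D)
loss_continuous = F.mse_loss(enhancer_out, clean_latents)

# 2) Discrete target: the NAC token ids of the clean speech (one codebook shown).
token_logits = torch.randn(B, T, CODEBOOK, requires_grad=True)
clean_tokens = torch.randint(0, CODEBOOK, (B, T))
loss_discrete = F.cross_entropy(token_logits.transpose(1, 2), clean_tokens)

print(loss_continuous.item(), loss_discrete.item())
```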
The evaluation of intelligibility for TTS has reached a bottleneck, as existing assessments heavily rely on word-by-word accuracy metrics such as WER, which fail to capture the complexity of real-world speech or reflect human comprehension needs. To address this, we propose Spoken-Passage Multiple-Choice Question Answering, a novel subjective approach evaluating the accuracy of key information in synthesized speech, and release SP-MCQA-Eval, an 8.76-hour news-style benchmark dataset for SP-MCQA evaluation. Our experiments reveal that low WER does not necessarily guarantee high key-information accuracy, exposing a gap between traditional metrics and practical intelligibility. SP-MCQA shows that even state-of-the-art (SOTA) models still lack robust text normalization and phonetic accuracy. This work underscores the urgent need for high-level, more life-like evaluation criteria now that many systems already excel at WER yet may fall short on real-world intelligibility.
Recent advances in Audio-Language Models (ALMs) have significantly improved multimodal understanding capabilities. However, the introduction of the audio modality also brings new and unique vulnerability vectors. Previous studies have proposed jailbreak attacks that specifically target ALMs, revealing that defenses directly transferred from traditional audio adversarial attacks or text-based Large Language Model (LLM) jailbreaks are largely ineffective against these ALM-specific threats. To address this issue, we propose ALMGuard, the first defense framework tailored to ALMs. Based on the assumption that safety-aligned shortcuts naturally exist in ALMs, we design a method to identify universal Shortcut Activation Perturbations (SAPs) that serve as triggers that activate the safety shortcuts to safeguard ALMs at inference time. To better sift out effective triggers while preserving the model's utility on benign tasks, we further propose Mel-Gradient Sparse Mask (M-GSM), which restricts perturbations to Mel-frequency bins that are sensitive to jailbreaks but insensitive to speech understanding. Both theoretical analyses and empirical results demonstrate the robustness of our method against both seen and unseen attacks. Overall, ALMGuard reduces the average success rate of advanced ALM-specific jailbreak attacks to 4.6% across four models, while maintaining comparable utility on benign benchmarks, establishing it as the new state of the art. Our code and data are available at https://github.com/WeifeiJin/ALMGuard.
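As a rough illustration of restricting a universal perturbation to selected Mel-frequency bins, the sketch below filters a placeholder perturbation through a binary mel-derived mask in the STFT domain before adding it to the input. The mask, the perturbation, and the torchaudio-based filterbank are illustrative assumptions, not the learned SAPs or the M-GSM procedure.

```python
# Sketch: apply a universal waveform perturbation only in selected
# mel-frequency bins. The mask and perturbation are random placeholders.
import torch
import torchaudio

n_fft, n_mels, sr = 512, 64, 16000
wave = torch.randn(1, sr)                       # one second of input audio
delta = 0.01 * torch.randn(1, sr)               # universal perturbation (placeholder)

mel_mask = (torch.rand(n_mels) > 0.5).float()   # bins deemed jailbreak-sensitive (random here)
fb = torchaudio.functional.melscale_fbanks(n_fft // 2 + 1, 0.0, sr / 2, n_mels, sr)
freq_mask = (fb @ mel_mask > 0).float()         # lift the mel mask to STFT bins

window = torch.hann_window(n_fft)
spec = torch.stft(delta, n_fft, window=window, return_complex=True)
spec = spec * freq_mask[None, :, None]          # zero out non-selected frequency bins
delta_masked = torch.istft(spec, n_fft, window=window, length=wave.shape[-1])

protected_input = wave + delta_masked           # perturbed input fed to the ALM
```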
State-of-the-art vocal separation models like Mel-Band-Roformer rely on full temporal self-attention mechanisms, where each temporal frame interacts with every other frame. This incurs heavy computational costs that scale quadratically with input audio length, motivating chunking and windowing approaches. Through analysis of a pre-trained vocal separation model, we discovered that temporal attention patterns are highly localized. Building on this insight, we replaced full attention with windowed sink attention (WSA), which uses a small temporal attention window and attention sinks. We show empirically that fine-tuning from the original checkpoint recovers 92% of the original SDR performance while reducing FLOPs by 44.5x. We release our code and checkpoints under MIT license at https://github.com/smulelabs/windowed-roformer.
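A minimal sketch of a windowed attention mask with sink positions, assuming an additive mask passed to PyTorch's scaled_dot_product_attention; the window size and number of sinks are illustrative, not the paper's settings.

```python
# Sketch of a windowed attention mask with "sink" frames: every query may
# attend to a few global sink positions plus a small local temporal window.
import torch

def windowed_sink_mask(seq_len, window=8, n_sinks=2):
    idx = torch.arange(seq_len)
    local = (idx[None, :] - idx[:, None]).abs() <= window // 2   # banded local window
    allowed = local.clone()
    allowed[:, :n_sinks] = True                                  # first frames act as sinks
    mask = torch.zeros(seq_len, seq_len)
    mask[~allowed] = float("-inf")                               # additive attention mask
    return mask

q = k = v = torch.randn(1, 4, 64, 32)                            # (batch, heads, time, dim)
mask = windowed_sink_mask(64)
out = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=mask)
print(out.shape)                                                 # (1, 4, 64, 32)
```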
We present Binaspect, an open-source Python library for binaural audio analysis, visualization, and feature generation. Binaspect generates interpretable "azimuth maps" by calculating modified interaural time and level difference spectrograms, and clustering those time-frequency (TF) bins into stable time-azimuth histogram representations. This allows multiple active sources to appear as distinct azimuthal clusters, while degradations manifest as broadened, diffused, or shifted distributions. Crucially, Binaspect operates blindly on audio, requiring no prior knowledge of head models. These visualizations enable researchers and engineers to observe how binaural cues are degraded by codec and renderer design choices, among other downstream processes. We demonstrate the tool on bitrate ladders, ambisonic rendering, and VBAP source positioning, where degradations are clearly revealed. In addition to their diagnostic value, the proposed representations can be exported as structured features suitable for training machine learning models in quality prediction, spatial audio classification, and other binaural tasks. Binaspect is released under an open-source license with full reproducibility scripts at https://github.com/QxLabIreland/Binaspect.
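The following sketch illustrates the underlying idea of interaural cue maps on a toy stereo signal: per time-frequency level and phase differences, followed by a histogram over the cue values. It is not the Binaspect API; the STFT helper and the toy delayed/attenuated "source" are assumptions.

```python
# Illustrative sketch of interaural cue maps from a stereo signal: ILD and
# IPD per time-frequency bin, then a histogram over the cue values.
import numpy as np

fs = 16000
t = np.arange(fs) / fs
left = np.sin(2 * np.pi * 440 * t)
right = 0.5 * np.roll(left, 8)                 # toy "source": level + delay difference

def stft(x, n_fft=512, hop=256):
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win for i in range(0, len(x) - n_fft, hop)]
    return np.fft.rfft(np.array(frames), axis=1)

L, R = stft(left), stft(right)
eps = 1e-12
ild = 20 * np.log10((np.abs(L) + eps) / (np.abs(R) + eps))   # level difference (dB)
ipd = np.angle(L * np.conj(R))                               # phase difference (rad)

hist, edges = np.histogram(ild.ravel(), bins=60, range=(-30, 30))
print(edges[hist.argmax()])                                  # dominant ILD value
```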
Ambiguities in data and problem constraints can lead to diverse, equally plausible outcomes for a machine learning task. In beat and downbeat tracking, for instance, different listeners may adopt various rhythmic interpretations, none of which would necessarily be incorrect. To address this, we propose a contrastive self-supervised pre-training approach that leverages multiple hypotheses about possible positive samples in the data. Our model is trained to learn representations compatible with different such hypotheses, which are selected with a knowledge-based scoring function to retain the most plausible ones. When fine-tuned on labeled data, our model outperforms existing methods on standard benchmarks, showcasing the advantages of integrating domain knowledge with multi-hypothesis selection in music representation learning in particular.
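A hedged sketch of a contrastive objective with several hypothesised positives per anchor, where a 0/1 mask from a scoring function selects which candidates count as positives; the random mask below stands in for the knowledge-based scorer and is not the paper's selection rule.

```python
# Sketch of a contrastive objective with several hypothesised positives per
# anchor: a scoring function keeps the most plausible candidates and the
# loss averages over them.
import torch
import torch.nn.functional as F

def multi_positive_nt_xent(anchor, candidates, keep_mask, temperature=0.1):
    # anchor: (B, D); candidates: (B, K, D); keep_mask: (B, K) with 0/1 entries
    a = F.normalize(anchor, dim=-1)
    c = F.normalize(candidates, dim=-1)
    sim = torch.einsum("bd,bkd->bk", a, c) / temperature          # similarities
    log_prob = sim - torch.logsumexp(sim, dim=-1, keepdim=True)   # softmax over candidates
    per_anchor = -(log_prob * keep_mask).sum(-1) / keep_mask.sum(-1).clamp(min=1)
    return per_anchor.mean()

B, K, D = 8, 6, 128
anchor = torch.randn(B, D)
candidates = torch.randn(B, K, D)
keep_mask = (torch.rand(B, K) > 0.5).float()   # stand-in for a knowledge-based scorer
print(multi_positive_nt_xent(anchor, candidates, keep_mask).item())
```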
This paper presents a new method for separating the effects of peripheral hearing loss (HL) and higher-level processes on speech intelligibility (SI). In a previous study, we conducted an SI experiment with 14 older adult (OA) listeners, using speech-in-noise sounds that were either processed with an ideal ratio mask (IRM) enhancement technique or left unprocessed. The current study involved an SI experiment with 15 young, normal-hearing (YNH) listeners. This experiment used simulated HL sounds processed with the WHIS simulator that reflected the hearing level of a specific OA from the previous study. The results showed that the target OA's SI scores were higher than the average YNH scores. This implies that the target OA's higher-level processes may be more effective than those of the average YNH. To understand the characteristics of other OAs, we used the GESI objective intelligibility measure to predict SI. First, we confirmed that GESI could fairly accurately predict the SI scores for both the YNH and OA listeners. Next, we predicted the SI scores of the 14 OA listeners using the parameters estimated in the YNH experiment. The results showed that some OAs had higher SI scores than the average YNH, while one OA had lower scores. These differences in SI scores may reflect variations in the efficiency of higher-level processes. These results imply that WHIS and GESI could facilitate contrastive experiments between YNH and OA listeners, regardless of hearing level. This would allow us to study the effects of higher-level processes in OA listeners individually.
This paper explores the integration of AI technologies into the artistic workflow through the creation of Studies for, a generative sound installation developed in collaboration with sound artist Evala (https://www.ntticc.or.jp/en/archive/works/studies-for/). The installation employs SpecMaskGIT, a lightweight yet high-quality sound generation AI model, to generate and play back eight-channel sound in real-time, creating an immersive auditory experience over the course of a three-month exhibition. The work is grounded in the concept of a "new form of archive," which aims to preserve the artistic style of an artist while expanding beyond the artist's past artworks by continued generation of new sound elements. This speculative approach to archival preservation is facilitated by training the AI model on a dataset consisting of over 200 hours of Evala's past sound artworks. By addressing key requirements in the co-creation of art using AI, this study highlights the value of the following aspects: (1) the necessity of integrating artist feedback, (2) datasets derived from an artist's past works, and (3) ensuring the inclusion of unexpected, novel outputs. In Studies for, the model was designed to reflect the artist's artistic identity while generating new, previously unheard sounds, making it a fitting realization of the concept of "a new form of archive." We propose a Human-AI co-creation framework for effectively incorporating sound generation AI models into the sound art creation process and suggest new possibilities for creating and archiving sound art that extend an artist's work beyond their physical existence. Demo page: https://sony.github.io/studies-for/
Accurate direction-of-arrival (DOA) estimation for sound sources is challenging due to the continuous changes in acoustic characteristics across time and frequency. In such scenarios, accurate localization relies on the ability to aggregate relevant features and model temporal dependencies effectively. In time series modeling, achieving a balance between model performance and computational efficiency remains a significant challenge. To address this, we propose FA-Stateformer, a state space and self-attention collaborative network with feature aggregation. The proposed network first employs a feature aggregation module to enhance informative features across both temporal and spectral dimensions. This is followed by a lightweight Conformer architecture inspired by the squeeze-and-excitation mechanism, where the feedforward layers are compressed to reduce redundancy and parameter overhead. Additionally, a temporal shift mechanism is incorporated to expand the receptive field of convolutional layers while maintaining a compact kernel size. To further enhance sequence modeling capabilities, a bidirectional Mamba module is introduced, enabling efficient state-space-based representation of temporal dependencies in both forward and backward directions. The remaining self-attention layers are combined with the Mamba blocks, forming a collaborative modeling framework that achieves a balance between representation capacity and computational efficiency. Extensive experiments demonstrate that FA-Stateformer achieves superior performance and efficiency compared to conventional architectures.
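A minimal sketch of the temporal shift idea: shifting slices of channels one frame forward and backward so that a subsequent small-kernel convolution sees neighbouring frames without enlarging its kernel. The split ratio is an illustrative assumption, not the FA-Stateformer configuration.

```python
# Sketch of a temporal shift: part of the channels is shifted one frame
# forward and part one frame backward, so a following small-kernel
# convolution sees neighbouring frames "for free".
import torch

def temporal_shift(x, shift_div=8):
    # x: (batch, time, channels)
    b, t, c = x.shape
    n = c // shift_div
    out = torch.zeros_like(x)
    out[:, 1:, :n] = x[:, :-1, :n]               # shift a slice of channels forward in time
    out[:, :-1, n:2 * n] = x[:, 1:, n:2 * n]     # shift another slice backward in time
    out[:, :, 2 * n:] = x[:, :, 2 * n:]          # leave the remaining channels untouched
    return out

x = torch.randn(2, 100, 64)
print(temporal_shift(x).shape)                   # (2, 100, 64)
```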
Anomalous sound detection (ASD) in the wild requires robustness to distribution shifts such as unseen low-SNR input mixtures of machine and noise types. State-of-the-art systems extract embeddings from an adapted audio encoder and detect anomalies via nearest-neighbor search, but fine-tuning on noisy machine sounds often acts like a denoising objective, suppressing noise and reducing generalization under mismatched mixtures or inconsistent labeling. Training-free systems with frozen self-supervised learning (SSL) encoders avoid this issue and show strong first-shot generalization, yet their performance drops when mixture embeddings deviate from clean-source embeddings. We propose to improve SSL backbones with a retain-not-denoise strategy that better preserves information from mixed sound sources. The approach combines a multi-label audio tagging loss with a mixture alignment loss that aligns student mixture embeddings to convex combinations of the teacher's embeddings of clean and noise inputs. Controlled experiments on stationary, non-stationary, and mismatched noise subsets demonstrate improved robustness under distribution shifts, narrowing the gap toward oracle mixture representations.
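A minimal sketch of the mixture alignment idea, assuming stand-in encoders: the student's embedding of the mixture is pulled towards a convex combination of the frozen teacher's embeddings of the clean and noise inputs. The linear "encoders", the mixing weight, and the cosine loss are assumptions, not the paper's exact objective.

```python
# Sketch of the "retain-not-denoise" idea: align the student's mixture
# embedding with a convex combination of frozen teacher embeddings of the
# clean machine sound and of the noise. Encoders are placeholders.
import torch
import torch.nn.functional as F

D = 256
student = torch.nn.Linear(16000, D)          # stand-in for the trainable SSL student
teacher = torch.nn.Linear(16000, D)          # stand-in for the frozen teacher
for p in teacher.parameters():
    p.requires_grad_(False)

clean = torch.randn(4, 16000)
noise = torch.randn(4, 16000)
snr_weight = torch.rand(4, 1)                # convex weight, e.g. derived from the SNR
mixture = clean + noise

target = snr_weight * teacher(clean) + (1 - snr_weight) * teacher(noise)
alignment_loss = 1 - F.cosine_similarity(student(mixture), target, dim=-1).mean()
print(alignment_loss.item())
```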
Annotating time boundaries of sound events is labor-intensive, limiting the scalability of strongly supervised learning in audio detection. To reduce annotation costs, weakly-supervised learning with only clip-level labels has been widely adopted. As an alternative, partial label learning offers a cost-effective approach, where a set of possible labels is provided instead of exact weak annotations. However, partial label learning for audio analysis remains largely unexplored. Motivated by the observation that acoustic scenes provide contextual information for constructing a set of possible sound events, we utilize acoustic scene information to construct partial labels of sound events. On the basis of this idea, in this paper, we propose a multitask learning framework that jointly performs acoustic scene classification and sound event detection with partial labels of sound events. While reducing annotation costs, weakly-supervised and partial label learning often suffer from decreased detection performance due to the lack of a precise event set and its temporal annotations. To better balance annotation cost and detection performance, we also explore a semi-supervised framework that leverages both strong and partial labels. Moreover, to refine partial labels and achieve better model training, we propose a label refinement method based on self-distillation for the proposed approach with partial labels.
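A toy sketch of scene-conditioned partial labels: the acoustic scene restricts the candidate set of sound events, and the loss penalises events outside that set while encouraging at least one candidate to be active. The scene-to-event table and the loss form are made-up illustrations, not the proposed framework.

```python
# Toy sketch of scene-conditioned partial labels for sound event detection.
import torch
import torch.nn.functional as F

EVENTS = ["speech", "dog", "car", "birds", "dishes"]
SCENE_TO_EVENTS = {"home": {"speech", "dog", "dishes"},
                   "street": {"speech", "car", "dog"},
                   "park": {"speech", "birds", "dog"}}

def partial_label_loss(logits, scene):
    # logits: (num_events,) clip-level scores
    candidate = torch.tensor([e in SCENE_TO_EVENTS[scene] for e in EVENTS],
                             dtype=torch.float)
    probs = torch.sigmoid(logits)
    # Push non-candidate events towards zero...
    neg = F.binary_cross_entropy(probs, torch.zeros_like(probs), reduction="none")
    loss_outside = (neg * (1 - candidate)).sum()
    # ...and ask that at least one candidate event is active.
    loss_inside = -torch.log((probs * candidate).max() + 1e-8)
    return loss_outside + loss_inside

logits = torch.randn(len(EVENTS), requires_grad=True)
print(partial_label_loss(logits, "street").item())
```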
Recent synthetic speech detection models typically adapt a pre-trained SSL model via fine-tuning, which is computationally demanding. Parameter-Efficient Fine-Tuning (PEFT) offers an alternative. However, existing methods lack the specific inductive biases required to model the multi-scale temporal artifacts characteristic of spoofed audio. This paper introduces the Multi-Scale Convolutional Adapter (MultiConvAdapter), a parameter-efficient architecture designed to address this limitation. MultiConvAdapter integrates parallel convolutional modules within the SSL encoder, facilitating the simultaneous learning of discriminative features across multiple temporal resolutions, capturing both short-term artifacts and long-term distortions. With only 3.17M trainable parameters (1% of the SSL backbone), MultiConvAdapter substantially reduces the computational burden of adaptation. Evaluations on five public datasets demonstrate that MultiConvAdapter achieves superior performance compared to full fine-tuning and established PEFT methods.
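A hedged sketch of a bottleneck adapter with parallel depthwise 1-D convolutions at several kernel sizes, summed and added back residually; the dimensions, kernel sizes, and activation are illustrative assumptions rather than the MultiConvAdapter specification.

```python
# Sketch of a multi-scale convolutional adapter: a bottleneck with parallel
# 1-D convolutions of different kernel sizes whose outputs are summed and
# added back residually.
import torch
import torch.nn as nn

class MultiScaleConvAdapterSketch(nn.Module):
    def __init__(self, d_model=768, bottleneck=64, kernels=(3, 7, 15)):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.convs = nn.ModuleList(
            nn.Conv1d(bottleneck, bottleneck, k, padding=k // 2, groups=bottleneck)
            for k in kernels)                       # depthwise convs at several temporal scales
        self.up = nn.Linear(bottleneck, d_model)

    def forward(self, x):                           # x: (batch, time, d_model)
        h = self.down(x).transpose(1, 2)            # (batch, bottleneck, time)
        h = sum(torch.relu(conv(h)) for conv in self.convs)
        return x + self.up(h.transpose(1, 2))       # residual connection back to the encoder

adapter = MultiScaleConvAdapterSketch()
frames = torch.randn(2, 200, 768)                   # e.g. SSL encoder hidden states
print(adapter(frames).shape)                        # (2, 200, 768)
```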
Despite rapid progress in Multi-modal Large Language Models and Large Audio-Language Models, existing audio benchmarks largely test semantics that can be recovered from text captions, masking deficits in fine-grained perceptual reasoning. We formalize audio 4D intelligence, defined as reasoning over sound dynamics in time and 3D space, and introduce STAR-Bench to measure it. STAR-Bench combines a Foundational Acoustic Perception setting (six attributes under absolute and relative regimes) with a Holistic Spatio-Temporal Reasoning setting that includes segment reordering for continuous and discrete processes and spatial tasks spanning static localization, multi-source relations, and dynamic trajectories. Our data curation pipeline uses two methods to ensure high-quality samples. For foundational tasks, we use procedurally synthesized and physics-simulated audio. For holistic data, we follow a four-stage process that includes human annotation and final selection based on human performance. Unlike prior benchmarks, where caption-only answering reduces accuracy only slightly, STAR-Bench induces far larger drops (-31.5% temporal, -35.2% spatial), evidencing its focus on cues that are hard to describe linguistically. Evaluating 19 models reveals substantial gaps compared with humans and a capability hierarchy: closed-source models are bottlenecked by fine-grained perception, while open-source models lag across perception, knowledge, and reasoning. STAR-Bench provides critical insights and a clear path forward for developing future models with a more robust understanding of the physical world.
Extracting features from speech is the most critical process in speech signal processing. Mel Frequency Cepstral Coefficients (MFCC) are the most widely used features in the majority of speaker and speech recognition applications, as the filtering in this feature resembles the filtering that takes place in the human ear. The main drawback of this feature, however, is that it provides only the frequency content of the signal and no information about when each frequency is present. The wavelet transform, with its flexible time-frequency window, provides both time and frequency information and is an appropriate tool for analysing non-stationary signals such as speech. On the other hand, because of its uniform frequency scaling, a typical wavelet transform may be less effective for speech analysis, offering poorer frequency resolution at low frequencies and aligning less well with human auditory perception. Hence, it is desirable to develop a feature that combines the merits of both MFCC and the wavelet transform, and many studies have attempted such a combination. Existing wavelet-transform-based Mel-scaled feature extraction methods require more computation because the wavelet transform is applied on top of Mel-scale filtering, adding extra processing steps. Here we propose a method to extract Mel-scale features in the time domain by incorporating the concept of the wavelet transform, thereby reducing the computational burden of time-frequency conversion and the complexity of wavelet extraction. Combining our proposed Time-domain Mel Frequency Wavelet Coefficient (TMFWC) technique with the reservoir computing methodology has significantly improved the efficiency of audio signal processing.
Fixed beamforming is widely used in practice since it does not depend on the estimation of noise statistics and provides relatively stable performance. However, a single beamformer cannot adapt to varying acoustic conditions, which limits its interference suppression capability. To address this, adaptive convex combination (ACC) algorithms have been introduced, where the outputs of multiple fixed beamformers are linearly combined to improve robustness. Nevertheless, ACC often fails in highly non-stationary scenarios, such as rapidly moving interference, since its adaptive updates cannot reliably track rapid changes. To overcome this limitation, we propose a frame-online neural fusion framework for multiple distortionless differential beamformers, which estimates the combination weights through a neural network. Compared with conventional ACC, the proposed method adapts more effectively to dynamic acoustic environments, achieving stronger interference suppression while maintaining the distortionless constraint.
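A minimal sketch of the fusion idea, assuming precomputed STFT-domain outputs of several fixed beamformers: a small network maps simple per-frame, per-beam features to softmax weights, and the beam outputs are combined convexly frame by frame. The feature choice and the network are placeholders, not the proposed model.

```python
# Sketch of frame-online fusion of several fixed beamformer outputs via
# network-estimated convex weights. Beam outputs and features are placeholders.
import torch
import torch.nn as nn

n_beams, n_frames, n_freq = 4, 50, 257
beam_outputs = torch.randn(n_beams, n_frames, n_freq, dtype=torch.cfloat)  # STFT domain

# Per-frame features: log power of each beam (a simple illustrative choice).
feats = torch.log(beam_outputs.abs().pow(2).mean(-1) + 1e-8).T             # (frames, beams)

weight_net = nn.Sequential(nn.Linear(n_beams, 32), nn.ReLU(), nn.Linear(32, n_beams))
weights = torch.softmax(weight_net(feats), dim=-1)                         # convex weights

fused = (weights.T.unsqueeze(-1) * beam_outputs).sum(0)                    # (frames, freq)
print(fused.shape)
```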
Though playing an essential role in smart home systems, smart speakers are vulnerable to voice spoofing attacks. Passive liveness detection, which utilizes only the collected audio rather than deployed sensors to distinguish between live-human and replayed voices, has drawn increasing attention. However, it faces performance degradation under varying environmental factors as well as the strict requirement of fixed user gestures. In this study, we propose a novel liveness feature, the array fingerprint, which utilizes the microphone array inherently adopted by the smart speaker to determine the identity of the collected audio. Our theoretical analysis demonstrates that, by leveraging the circular layout of the microphones, the array fingerprint achieves more robust performance under environmental changes and user movement than existing schemes. To leverage this fingerprint, we then propose ARRAYID, a lightweight passive detection scheme, and design a series of features that work together with the array fingerprint. Our evaluation on a dataset containing 32,780 audio samples and 14 spoofing devices shows that ARRAYID achieves an accuracy of 99.84%, which is superior to existing passive liveness detection schemes.
Codec-based text-to-speech (TTS) models have recently gained traction for their efficiency and strong performance in voice cloning. However, codec-based TTS faces limitations due to the challenges of pretraining robust speech codecs and the quality degradation introduced by quantization errors. Emerging evidence suggests that continuous-valued generative models can alleviate these issues and serve as a promising alternative. Yet, effectively modelling diverse speech patterns and developing reliable sampling strategies for continuous-valued autoregressive (AR) TTS remains underexplored. In this work, we propose BELLE, Bayesian evidential learning with language modelling for TTS, a novel continuous-valued AR framework that directly predicts mel-spectrograms from textual input. BELLE treats each mel-spectrogram frame as a Gaussian distribution sampled from a learned hyper distribution, enabling principled uncertainty estimation, particularly in scenarios with parallel data (i.e., one text-audio prompt paired with multiple speech samples). To obtain such data, diverse speech samples are synthesized using multiple pre-trained TTS models given the same text-audio prompts, which are distilled into BELLE via Bayesian evidential learning. Experimental results indicate that BELLE demonstrates highly competitive performance compared with the current best open-source TTS models, even though BELLE is trained on a large amount of synthetic data and uses only approximately one-tenth of their training data. Audio samples generated by BELLE are available at https://belletts.github.io/Belle/. The code, checkpoints, and synthetic data will be released after the paper is accepted.
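As a simplified illustration of treating each mel-spectrogram frame as a Gaussian, the sketch below predicts a per-frame mean and variance and trains with the Gaussian negative log-likelihood; BELLE's full Bayesian evidential formulation, which places a learned hyper distribution over these parameters, is not reproduced here, and all dimensions are assumptions.

```python
# Sketch of the core idea of modelling each mel-spectrogram frame as a
# Gaussian with predicted mean and variance, trained with the Gaussian NLL.
# This is a simplification of BELLE's Bayesian evidential formulation.
import torch
import torch.nn as nn

n_mels, d_model = 80, 256
head = nn.Linear(d_model, 2 * n_mels)           # predicts mean and log-variance per frame

hidden = torch.randn(4, 120, d_model)           # decoder states (batch, frames, dim)
target_mel = torch.randn(4, 120, n_mels)        # ground-truth mel frames

mean, log_var = head(hidden).chunk(2, dim=-1)
nll = nn.GaussianNLLLoss()(mean, target_mel, log_var.exp())
print(nll.item())

# At inference, a frame can be sampled from the predicted distribution:
sample = mean + log_var.mul(0.5).exp() * torch.randn_like(mean)
```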