EEND-SAA: Enrollment-Less Main Speaker Voice Activity Detection Using Self-Attention Attractors
Abstract
Voice activity detection (VAD) is essential in speech-based systems, but traditional methods detect only the presence of speech without identifying who is speaking. Target-speaker VAD (TS-VAD) extends this by detecting the speech of a known speaker from a short enrollment utterance, but the enrollment requirement breaks down in open-domain scenarios such as meetings or customer service calls, where the main speaker is not known in advance. We propose EEND-SAA, an enrollment-less, streaming-compatible framework for main-speaker VAD that identifies the primary speaker without prior knowledge of that speaker. Unlike TS-VAD, our method defines the main speaker as the one who talks more steadily and clearly, based on speech continuity and volume. We build our model on end-to-end neural diarization (EEND), using two self-attention attractors in a Transformer and applying causal masking for real-time use. Experiments on multi-speaker LibriSpeech mixtures show that EEND-SAA reduces the main-speaker diarization error rate (DER) from 6.63% to 3.61% and improves F1 from 0.9667 to 0.9818 over the SA-EEND baseline, achieving state-of-the-art performance under conditions involving speaker overlap and noise.
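To illustrate the causal masking mentioned above, the following is a minimal sketch of single-head scaled dot-product self-attention in which each frame attends only to itself and to earlier frames. It is not the EEND-SAA implementation: the function names `causal_mask` and `causal_self_attention` are hypothetical, and for brevity the sketch assumes that queries, keys, and values are the raw input frames, with no learned projections and no attractors.

```python
import math


def causal_mask(num_frames):
    """mask[t][s] is True when frame t may attend to frame s (s <= t)."""
    return [[s <= t for s in range(num_frames)] for t in range(num_frames)]


def causal_self_attention(frames):
    """Single-head scaled dot-product self-attention with a causal mask.

    Assumption for brevity: queries, keys, and values are the raw frames
    (a real model would apply learned projections first).
    """
    dim = len(frames[0])
    mask = causal_mask(len(frames))
    outputs = []
    for t, query in enumerate(frames):
        # Score only the visible frames (current and past); future frames
        # are excluded, which is equivalent to masking them with -inf.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for s, key in enumerate(frames)
            if mask[t][s]
        ]
        # Numerically stable softmax over the visible frames.
        peak = max(scores)
        exps = [math.exp(v - peak) for v in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Visible frames are exactly indices 0..t, so weights[s] pairs
        # with frames[s].
        outputs.append([
            sum(w * frames[s][i] for s, w in enumerate(weights))
            for i in range(dim)
        ])
    return outputs


frames = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
outputs = causal_self_attention(frames)
```

Because the output at frame t depends only on frames up to t, replacing a later frame leaves all earlier outputs unchanged, which is the property that makes frame-by-frame streaming inference possible.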