Frequency Dynamic Convolutions for Sound Event Detection
Abstract
Recent research in deep learning-based Sound Event Detection (SED) has primarily focused on Convolutional Recurrent Neural Networks (CRNNs) and Transformer models. However, conventional 2D convolution-based models assume shift invariance along both the temporal and frequency axes, leadin to inconsistencies when dealing with frequency-dependent characteristics of acoustic signals. To address this issue, this study proposes Frequency Dynamic Convolution (FDY conv), which dynamically adjusts convolutional kernels based on the frequency composition of the input signal to enhance SED performance. FDY conv constructs an optimal frequency response by adaptively weighting multiple basis kernels based on frequency-specific attention weights. Experimental results show that applying FDY conv to CRNNs improves performance on the DESED dataset by 7.56% compared to the baseline CRNN. However, FDY conv has limitations in that it combines basis kernels of the same shape across all frequencies, restricting its ability to capture diverse frequency-specific characteristics. Additionally, the $3\times3$ basis kernel size is insufficient to capture a broader frequency range. To overcome these limitations, this study introduces an extended family of FDY conv models. Dilated FDY conv (DFD conv) applies convolutional kernels with various dilation rates to expand the receptive field along the frequency axis and enhance frequency-specific feature representation. Experimental results show that DFD conv improves performance by 9.27% over the baseline. Partial FDY conv (PFD conv) addresses the high computational cost of FDY conv, which results from performing all convolution operations with dynamic kernels. Since FDY conv may introduce unnecessary adaptivity for quasi-stationary sound events, PFD conv integrates standard 2D convolutions with frequency-adaptive kernels to reduce computational complexity while maintaining performance. Experimental results demonstrate that PFD conv improves performance by 7.80% over the baseline while reducing the number of parameters by 54.4% compared to FDY conv. Multi-Dilated FDY conv (MDFD conv) extends DFD conv by addressing its structural limitation of applying the same dilation across all frequencies. By utilizing multiple convolutional kernels with different dilation rates, MDFD conv effectively captures diverse frequency-dependent patterns. Experimental results indicate that MDFD conv achieves the highest performance, improving the baseline CRNN performance by 10.98%. Furthermore, standard FDY conv employs Temporal Average Pooling, which assigns equal weight to all frames along the time axis, limiting its ability to effectively capture transient events. To overcome this, this study proposes TAP-FDY conv (TFD conv), which integrates Temporal Attention Pooling (TA) that focuses on salient features, Velocity Attention Pooling (VA) that emphasizes transient characteristics, and Average Pooling (AP) that captures stationary properties. TAP-FDY conv achieves the same performance as MDFD conv but reduces the number of parameters by approximately 30.01% (12.703M vs. 18.157M), achieving equivalent accuracy with lower computational complexity. Class-wise performance analysis reveals that FDY conv improves detection of non-stationary events, DFD conv is particularly effective for events with broad spectral features, and PFD conv enhances the detection of quasi-stationary events. Additionally, TFD conv (TFD-CRNN) demonstrates strong performance in detecting transient events. In the case studies, PFD conv effectively captures stable signal patterns in tank powertrain fault recognition, DFD conv recognizes wide harmonic spectral patterns on speed-varying motor fault recognition, while TFD conv outperforms other models in detecting transient signals in offshore arc detection. These results suggest that frequency-adaptive convolutions and their extended variants provide a robust alternative to conventional 2D convolutions in deep learning-based audio processing.