Let's Simply Count: Quantifying Distributional Similarity Between Activities in Event Data
Abstract
To obtain insights from event data, advanced process mining methods assess the similarity of activities to incorporate their semantic relations into the analysis. Here, distributional similarity, which is derived from activity co-occurrences, is commonly employed. However, existing work on distributional similarity in process mining adopts neural network-based approaches developed for natural language processing, e.g., word2vec and autoencoders. While these approaches have been shown to be effective, their downsides are high computational costs and limited interpretability of the learned representations. In this work, we argue for simplicity in the modeling of distributional similarity of activities. We introduce count-based embeddings that avoid a complex training process and offer a directly interpretable representation. To underpin our call for simple embeddings, we contribute a comprehensive benchmarking framework, which includes means to assess the intrinsic quality of embeddings, their performance in downstream applications, and their computational efficiency. In experiments that compare against the state of the art, we demonstrate that count-based embeddings provide a highly effective and efficient basis for distributional similarity between activities in event data.
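To illustrate the idea of count-based embeddings, the following is a minimal sketch, not the paper's exact method: each activity is represented by the counts of other activities observed within a symmetric context window over the traces of an event log, and similarity is then computed as cosine similarity of these count vectors. The function names and the `window` parameter are hypothetical choices for this example.

```python
from collections import defaultdict
from math import sqrt

def count_embeddings(traces, window=2):
    """Build count-based activity embeddings from co-occurrences.

    traces: list of traces, each a list of activity labels.
    window: symmetric context window size (hypothetical parameter).
    Returns a dict mapping each activity to its sparse co-occurrence count vector.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for trace in traces:
        for i, act in enumerate(trace):
            lo = max(0, i - window)
            hi = min(len(trace), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[act][trace[j]] += 1
    return counts

def cosine(vec_a, vec_b):
    """Cosine similarity between two sparse count vectors (dicts)."""
    shared = set(vec_a) & set(vec_b)
    dot = sum(vec_a[k] * vec_b[k] for k in shared)
    norm_a = sqrt(sum(v * v for v in vec_a.values()))
    norm_b = sqrt(sum(v * v for v in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy event log with three traces over activities a, b, c, d
traces = [["a", "b", "c"], ["a", "b", "d"], ["a", "c", "d"]]
emb = count_embeddings(traces, window=1)
# c and d share the neighbouring activity "b", so their similarity is positive
print(cosine(emb["c"], emb["d"]))
```

The representation is directly interpretable: each dimension of an activity's vector is the observed count of a specific co-occurring activity, and no training procedure is required.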