The Curious Case of High-Dimensional Indexing as a File Structure: A Case Study of eCP-FS
Abstract
Modern analytical pipelines routinely deploy multiple deep learning and retrieval models that rely on approximate nearest-neighbor (ANN) indexes to support efficient similarity-based search. While many state-of-the-art ANN-indexes are memory-based (e.g., HNSW and IVF), using multiple ANN indexes creates a competition for limited GPU/CPU memory resources, which in turn necessitates disk-based index structures (e.g., DiskANN or eCP). In typical index implementations, the main component is a complex data structure that is serialized to disk and is read either fully at startup time, for memory-based indexes, or incrementally at query time, for disk-based indexes. To visualize the index structure, or analyze its quality, complex coding is needed that is either embedded in the index implementation or replicates the code that reads the data structure. In this paper, we consider an alternative approach that maps the data structure to a file structure, using a file library, making the index easily readable for any programming language and even human-readable. The disadvantage is that the serialized index is verbose, leading to overhead of searching through the index. The question addressed in this paper is how severe this performance penalty is. To that end, this paper presents eCP-FS, a file-based implementation of eCP, a well-known disk-based ANN index. A comparison with state-of-the-art indexes shows that while eCP-FS is slower, the implementation is nevertheless somewhat competitive even when memory is not constrained. In a memory-constrained scenario, eCP-FS offers a minimal memory footprint, making it ideal for resource-constrained or multi-index environments.