A DNA Methylation Classification Model Predicts Organ and Disease Site
Abstract
Cell-free DNA (cfDNA) analysis is a powerful, minimally invasive tool for early disease detection and for monitoring disease progression and treatment response. A major challenge, however, is accurately determining the tissue of origin of cfDNA, especially in complex or heterogeneous disease contexts. To address this, we developed a machine learning framework that leverages tissue-specific DNA methylation signatures to classify both tissue and disease origin from cfDNA data. Our model integrates methylation datasets from diverse epigenomic platforms, including whole-genome bisulfite sequencing (WGBS), Illumina Infinium bead arrays, and Enzymatic Methyl-seq (EM-seq). To account for platform variability and data sparsity, we imputed missing values and harmonized CpG features to enable cross-platform learning. Dimensionality reduction revealed clear tissue-specific clustering of methylation profiles. A random forest classifier trained on these features achieved consistent classification performance (accuracy of 0.75 to 0.80 across test sets and platforms). Notably, the model distinguished clinically relevant tissues, such as inflamed synovium and peripheral blood mononuclear cells (PBMCs) from arthritis patients, and deconvoluted synthetic cfDNA mixtures designed to mimic real-world liquid biopsy samples. In these mixtures, the predicted tissue proportions closely matched the true values, demonstrating the model's potential for both classification and quantitative inference. These results support the feasibility of combining cross-platform methylation data with machine learning for scalable, generalizable cfDNA diagnostics, and they lay the groundwork for integrating disease-specific epigenetic features to guide clinical decision-making in precision medicine.
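The cross-platform workflow summarized above (harmonize CpG features, impute missing values, train a random forest) can be illustrated with a minimal sketch. This is not the authors' implementation: the data are simulated, the platform names, CpG identifiers, tissue labels, noise levels, and the choice of median imputation are all assumptions made for illustration, and scikit-learn is assumed as the modeling library. Accuracy on such toy data says nothing about performance on real cfDNA.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Hypothetical CpG identifiers and tissue labels (illustrative only).
all_cpgs = [f"cg{i:04d}" for i in range(200)]
tissues = ["blood", "synovium", "liver"]

# Assumed tissue-specific mean methylation (beta value) per CpG.
tissue_means = {
    t: pd.Series(rng.uniform(0.1, 0.9, len(all_cpgs)), index=all_cpgs)
    for t in tissues
}


def simulate_platform(cpg_ids, n_per_tissue, missing_rate):
    """Simulate a samples-by-CpGs table of beta values for one platform."""
    rows, labels = [], []
    for tissue, means in tissue_means.items():
        for _ in range(n_per_tissue):
            noise = rng.normal(0.0, 0.05, len(cpg_ids))
            beta = np.clip(means[cpg_ids].to_numpy() + noise, 0.0, 1.0)
            # Mask a random fraction of sites to mimic sparse coverage.
            beta[rng.random(len(cpg_ids)) < missing_rate] = np.nan
            rows.append(beta)
            labels.append(tissue)
    return pd.DataFrame(rows, columns=cpg_ids), np.array(labels)


# Two simulated platforms that cover different, overlapping CpG sets.
X_array, y_array = simulate_platform(all_cpgs[:150], 30, missing_rate=0.02)
X_seq, y_seq = simulate_platform(all_cpgs[50:], 30, missing_rate=0.30)

# Harmonize: keep only CpGs measured on both platforms.
shared = sorted(set(X_array.columns) & set(X_seq.columns))
X = pd.concat([X_array[shared], X_seq[shared]], ignore_index=True)
y = np.concatenate([y_array, y_seq])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Median imputation is fitted on the training split only, then the
# random forest is trained on the imputed features.
model = make_pipeline(
    SimpleImputer(strategy="median"),
    RandomForestClassifier(n_estimators=200, random_state=0),
)
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(f"Shared CpGs: {len(shared)}, held-out accuracy: {accuracy:.2f}")
```

Wrapping the imputer and classifier in one pipeline keeps the imputation statistics from leaking out of the test split, which matters when sparsity differs between platforms.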