Identifying progression subphenotypes of Alzheimer’s disease from large-scale electronic health records with machine learning

Year: 2025

Abstract

Objective

Identification of clinically meaningful subphenotypes of disease progression can enhance the understanding of disease heterogeneity and underlying pathophysiology. In this study, we propose a machine learning framework to identify subphenotypes of Alzheimer’s disease progression based on longitudinal real-world patient records.

Methods

The framework, dynaPhenoM, extracts coherent clinical topics across patient visits and employs a time-aware latent class analysis to characterize subphenotypes. We validated dynaPhenoM using three patient databases with a total of 3952 AD patients across the United States, demonstrating its effectiveness in revealing mild cognitive impairment (MCI) progression to AD.

Results

Our study identified five subphenotypes associated with distinct organ systems for disease progression from MCI to AD, including common subtypes across cohorts—respiratory, musculoskeletal, cardiovascular, and endocrine/metabolic—as well as a cohort-specific digestive subtype.

Conclusion

Our study unravels the complexity and heterogeneity of the progression from MCI to AD. These findings can inform both diagnostic and therapeutic strategies, thereby advancing precision medicine for Alzheimer’s disease.

Introduction

Alzheimer's disease (AD) is the most prevalent neurodegenerative disorder worldwide [1], with its prevalence expected to double in the next 20 years [1], [2]. The disease mechanism of AD is highly complex, and effective treatments are yet to be found [3]. The complex and heterogeneous nature of human diseases, such as AD, often results in patients exhibiting diverse clinical manifestations [4]. Identifying clinically meaningful subphenotypes, or subgroups of patients with coherent clinical characteristics, is crucial for enhancing our understanding of underlying disease mechanisms and informing precision medicine [5], [6]. With the increasing adoption of health information systems such as electronic health records (EHR), comprehensive patient information, including demographics, diagnoses, medications, and lab tests, has become readily available [7].
To effectively explore clinical information in patient records and identify comprehensive disease subphenotypes, several challenges must be addressed: (i) information heterogeneity: patient data encompass various types of information; (ii) irregular timing of visits: the time intervals between successive patient visits are typically irregular; (iii) missing values: substantial information may be missing in patient records, and a missing entry does not necessarily indicate the absence of the disease; (iv) high dimensionality and sparsity: clinical events are represented as systematic codes with large vocabularies [12], with each patient visit containing only a few codes [13]; and (v) interpretability: analysis results must be interpretable and easily understood by clinicians.
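Challenges (ii) and (iv) can be made concrete with a small sketch. The records below are invented toy data, the ICD-10-CM codes are used only as illustrative labels, and the helper names `visit_gaps` and `sparsity` are hypothetical; none of this comes from the study's cohorts or code.

```python
from datetime import date

# Toy, made-up visits: (patient_id, visit_date, codes recorded at that visit).
visits = [
    ("p1", date(2020, 1, 10), ["G31.84", "I10"]),
    ("p1", date(2020, 3, 2), ["E11.9"]),
    ("p1", date(2021, 1, 15), ["G30.9", "I10"]),
    ("p2", date(2019, 6, 1), ["J44.9"]),
    ("p2", date(2019, 6, 20), ["G31.84", "M19.90"]),
]


def visit_gaps(visits):
    """Days between successive visits, per patient (challenge ii)."""
    by_patient = {}
    for pid, day, _ in sorted(visits, key=lambda v: (v[0], v[1])):
        by_patient.setdefault(pid, []).append(day)
    return {
        pid: [(b - a).days for a, b in zip(days, days[1:])]
        for pid, days in by_patient.items()
    }


def sparsity(visits):
    """Fraction of empty cells in the visit-by-code indicator matrix (challenge iv)."""
    vocab = {code for _, _, codes in visits for code in codes}
    filled = sum(len(set(codes)) for _, _, codes in visits)
    return 1.0 - filled / (len(visits) * len(vocab))


gaps = visit_gaps(visits)
print(gaps)
print(round(sparsity(visits), 3))
```

Even in this tiny example the gaps between visits differ by an order of magnitude and most cells of the visit-by-code matrix are empty; with real code vocabularies of many thousands of entries, the matrix is far sparser.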
While existing studies have developed data-driven approaches to identify and predict disease subphenotypes [8], [9], [10], [11], they often focus on selected clinical events without considering their temporal evolution. For instance, traditional machine learning models such as random forests and XGBoost have been used to classify Alzheimer’s subtypes [10], but these models did not consider disease trajectories over time to model AD progression. In addition, methods such as LSTM-based models and hidden Markov models have been developed to model temporal dynamics [11] using patients’ longitudinal records, but they usually suffer from limited interpretability and scalability. To address these issues, in this study we developed a machine learning framework, termed dynaPhenoM, which captures temporal patterns of AD progression using longitudinal patient records to identify subphenotypes while ensuring their clinical interpretability. dynaPhenoM comprises two main modules: the dynamic multimodal topic model (DMTM), which derives interpretable compressed representations of multimodal clinical events, and the time-aware latent class analysis (TLCA), which identifies subphenotypes while accommodating irregular visit times. DMTM builds on latent topic modeling (LTM) [14], commonly used in text-mining tasks, treating clinical events as words and each visit as a document. Unlike existing methods that focus on single visits, DMTM learns representations from longitudinal patient information [12], [15], [16], [17]. Although latent class analysis (LCA) [17] is a widely used subphenotyping method in clinical studies [18], [19], [20], traditional LCA does not account for irregular time intervals between visits [21], which motivated the development of TLCA. We further developed and applied machine learning models to predict subphenotype assignment based on clinical features prior to MCI onset.
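The "clinical events as words, visits as documents" view can be illustrated with a minimal sketch of the input representation only. It does not implement DMTM or TLCA; the patient record, the two modalities, and the helper names `build_vocab` and `visit_to_counts` are hypothetical and chosen purely for illustration.

```python
from collections import Counter

# Hypothetical toy record for one patient: each visit lists events by modality.
patient_visits = [
    {"diagnosis": ["G31.84", "I10"], "medication": ["lisinopril"]},
    {"diagnosis": ["I10"], "medication": ["lisinopril", "donepezil"]},
    {"diagnosis": ["G30.9"], "medication": ["donepezil"]},
]


def build_vocab(visits, modality):
    """Map each distinct event of one modality to a column index (sorted order)."""
    events = sorted({e for v in visits for e in v.get(modality, [])})
    return {event: i for i, event in enumerate(events)}


def visit_to_counts(visit, vocab, modality):
    """Bag-of-words count vector for one visit, i.e. one 'document'."""
    counts = Counter(visit.get(modality, []))
    # Dict iteration follows insertion order, which matches the column indices.
    return [counts.get(event, 0) for event in vocab]


modalities = ["diagnosis", "medication"]
vocabs = {m: build_vocab(patient_visits, m) for m in modalities}

# One visit-by-event count matrix per modality, rows kept in visit order.
doc_term = {
    m: [visit_to_counts(v, vocabs[m], m) for v in patient_visits]
    for m in modalities
}

for m in modalities:
    print(m, list(vocabs[m]), doc_term[m])
```

Keeping one vocabulary per modality preserves the multimodal structure, and keeping rows in visit order preserves the longitudinal sequence that a dynamic topic model would consume; how DMTM actually combines the modalities is not specified in this section.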
Authors
Manqi Zhou, Alice S. Tang, Hao Zhang, Zhenxing Xu, Alison M.C. Ke, Chang Su, Yu Huang, William G. Mantyh, Michael S. Jaffee, Katherine P. Rankin, Steven T. DeKosky, Jiayu Zhou, Yi Guo, Jiang Bian, Marina Sirota, Fei Wang