Donna Tjandra's Personal Website

Feel free to reach out at dotjandr@umich.edu. I am also on GitHub at detjandra.


Overview and Motivation

My first project in graduate school was to train a model to predict the onset of Alzheimer’s disease (AD) using electronic health record (EHR) data. This work was done in collaboration with two clinicians specializing in AD and related diseases, Raymond Migrino and Bruno Giordani, along with my advisor Jenna Wiens, and it resulted in the publication “Cohort discovery and risk stratification for Alzheimer’s disease: an electronic health record-based approach” (AD-TRCI, 2020). AD is one of the leading causes of death among individuals 65 years and older, and there is currently no way to prevent, stop, or slow its progression. Accurately identifying which individuals are at high risk of developing AD years before symptom onset, using routinely collected variables from everyday clinical encounters, could be crucial to advancing AD research.

Our study consisted of two parts. In the first part, we developed an automated cohort discovery tool that labeled when patients in the EHR were diagnosed with AD. Of the several cohort discovery tools we tested, labeling patients based on ICD (International Classification of Diseases) billing codes performed best, as measured by the F1 score. In the second part, we trained a machine learning model to predict whether patients between the ages of 68 and 72 would experience AD onset within 10 years, where AD onset was labeled using our best-performing cohort discovery tool. Here, we only considered patients who either experienced AD onset within 10 years or had at least 10 years of follow-up without AD onset. Patients who were lost to follow-up (e.g., five years of follow-up without AD onset) were excluded. Overall, we were able to predict AD with modest performance, achieving an area under the receiver operating characteristic curve (AUROC) of ~0.7, and we found that the important features identified by the model aligned with the literature. For more details, please see the paper.
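As a rough illustration, the case/control/exclusion logic described above might look like the following sketch. This is not our actual pipeline; the field names and data structure are hypothetical and simplified.

```python
# A minimal sketch of the labeling and inclusion criteria described above.
# Field names are hypothetical; ages are in years.
from dataclasses import dataclass
from typing import Optional

HORIZON_YEARS = 10  # prediction horizon used in the study

@dataclass
class Patient:
    index_age: float            # age at prediction time (68-72 in the study)
    onset_age: Optional[float]  # age at first AD diagnosis code, None if never coded
    last_followup_age: float    # age at last recorded encounter

def label(p: Patient) -> Optional[int]:
    """Return 1 (case), 0 (control), or None (excluded due to censoring)."""
    if p.onset_age is not None and p.onset_age - p.index_age <= HORIZON_YEARS:
        return 1   # AD onset within the 10-year horizon
    if p.last_followup_age - p.index_age >= HORIZON_YEARS:
        return 0   # at least 10 years of follow-up without onset
    return None    # lost to follow-up before the horizon: excluded
```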

From a practical standpoint, our study showed the potential of EHR data for AD risk prediction. Patients identified during a routine clinical visit as being at higher risk of developing AD can then be recruited to studies like ADNI (the Alzheimer’s Disease Neuroimaging Initiative) or to early intervention trials for more thorough evaluation (e.g., measuring cerebrospinal fluid composition). This is a more cost-effective alternative to conducting a thorough evaluation of all patients above a certain age threshold, as these tests are invasive and expensive. In addition, although we cannot conclude causal relationships between AD onset and the features identified as important by the model, these features can stimulate hypothesis generation for further investigation.

On the other hand, our study also highlighted some of the challenges of working with EHR data. For example, we were unable to assign a label to every patient in our population of interest, and it is likely that some of our labels were incorrect. Although we focused on predicting AD onset, these challenges apply more broadly in settings where 1) some individuals have incomplete follow-up (i.e., are censored) or 2) obtaining accurate labels requires labor-intensive chart review. Individuals can have incomplete follow-up when they move away or change healthcare providers, among other reasons, and thus do not always have enough follow-up data to assign a complete label. For example, a patient could move away after three years without having been diagnosed with AD, resulting in an incomplete label, since we do not know if or when onset occurred after the move. Additionally, many conditions, like AD onset, are difficult to label without manual chart review. As a result, researchers often rely on automated cohort discovery tools to obtain outcome labels, since performing chart review on thousands of patient records is infeasible given the clinician labor required. Though convenient, such tools are less accurate than clinician chart review, introducing label noise (i.e., incorrect labels). These limitations have motivated much of my subsequent research, some of which is described below.

Highlights

1. Hierarchical Survival Analysis

In past work, censored labels are most commonly considered in the context of survival analysis, where we aim to predict time-to-event outcomes (i.e., when an event of interest occurs). Survival analysis uses both censored and uncensored individuals to learn to predict the survival curve. A survival curve plots an individual’s probability of survival (i.e., of not yet experiencing the event) at various time points in the future, relative to when the prediction was made. In “A Hierarchical Approach to Multi-Event Survival Analysis” (AAAI, 2021), we built on previous work in survival analysis to improve prediction accuracy by leveraging the relatedness between time points. Previous work modeled survival prediction as a multiclass classification problem, with each time point as its own class. Doing so treats the time points as independent of each other, when in most cases they are not (e.g., time point 1 is more similar to time point 2 than to time point 10).
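As a rough illustration, in such a discrete-time formulation the survival curve can be read directly off a model’s predicted event-time distribution. The snippet below is a simplified sketch in the spirit of these multiclass models, not code from the paper.

```python
# A minimal sketch: deriving a discrete-time survival curve from a
# multiclass model's predicted event-time distribution.
import numpy as np

def survival_curve(event_probs: np.ndarray) -> np.ndarray:
    """event_probs[k] = predicted P(event occurs in time bin k).

    Returns S[t] = P(no event through bin t) = 1 - cumulative event
    probability. Any leftover mass (1 - event_probs.sum()) corresponds
    to surviving past the final bin.
    """
    return 1.0 - np.cumsum(event_probs)

# Example: five time bins, with most of the predicted risk early on.
probs = np.array([0.10, 0.08, 0.05, 0.03, 0.02])
print(survival_curve(probs))  # ~[0.90 0.82 0.77 0.74 0.72]
```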

We approached this challenge by breaking the original task (i.e., predicting the probability of event occurrence at each of several time points into the future over some horizon) into a series of coarser-grained subtasks that eventually build up to the original task. Each subtask is a simpler version of the main task and learns to predict the probability of event occurrence at a coarser time granularity (e.g., months versus days). By first solving a simpler task, we can build up to the main task by using the predictions from earlier tasks to guide predictions at subsequent tasks, thus taking advantage of the relatedness between finer-grained time points (see the sketch below). We empirically show that our approach is more effective than existing baselines on a variety of tasks. For more details, please see the paper. From a practical standpoint, obtaining accurate survival curves for events such as AD onset and mortality can aid clinicians in deciding on the best treatment plan for a patient.
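The coarse-to-fine idea can be illustrated with a toy example. This is not the paper’s implementation; it only shows how a distribution over wide time bins can constrain a finer-grained distribution via the chain rule.

```python
# A toy sketch of coarse-to-fine refinement: coarse predictions over wide
# time bins are refined by conditional distributions over each bin's
# sub-bins, so fine-grained predictions inherit coarse-grained structure.
import numpy as np

def refine(coarse_probs: np.ndarray, conditional: np.ndarray) -> np.ndarray:
    """coarse_probs: (K,)   P(event in coarse bin k).
    conditional:     (K, M) P(event in sub-bin m | event in coarse bin k),
                     each row summing to 1.
    Returns the (K*M,) fine-grained event distribution.
    """
    fine = coarse_probs[:, None] * conditional  # chain rule over the hierarchy
    return fine.ravel()

# Example: two yearly bins, each split into four quarters.
coarse = np.array([0.6, 0.4])
cond = np.array([[0.40, 0.30, 0.20, 0.10],
                 [0.25, 0.25, 0.25, 0.25]])
print(refine(coarse, cond))  # first quarter: 0.6 * 0.4 = 0.24, etc.
```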

This project was done in collaboration with my advisor Jenna Wiens and a (then) undergraduate student I mentored, Yifei He.

2. Learning with Instance-Dependent Label Noise

Past work addressing label noise mainly focuses on classification. Earlier works assume that the noise is independent of an instance’s features. However, this is not always true, and when the noise does depend on instance features, it can lead to different noise rates within subsets of the data. This, in turn, can lead to biased model performance in favor of the subsets with less noise. Although instance-dependent label noise has recently received more attention, the effects of these approaches on model bias are relatively understudied. In “Leveraging an Alignment Set in Tackling Instance-Dependent Label Noise” (CHIL, 2023), we propose an approach that addresses instance-dependent label noise while also explicitly accounting for differences in noise rates among subsets of the data.

Our approach addresses this issue by learning the underlying pattern of noise from a small subset of the data for which we know both the noisy and ground-truth labels. Such a setting can occur in practice when, for example, a clinician chart-reviews a small subset of the data to validate the performance of an automated cohort discovery tool. We use the underlying label noise pattern to re-weight the objective function during subsequent training, where we consider the entire dataset and upweight subsets with a higher estimated noise rate. Empirically, we show that, as we vary the overall noise rate and the difference in noise rates among subsets of the data, our approach remains robust compared to state-of-the-art baselines. For more details, please see the paper. From a practical standpoint, by accounting for different noise rates within subsets of the data, our approach can improve predictions for the subsets of the population where accurate labels are harder to obtain.
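As a simplified sketch of this idea (not our exact method), one could estimate per-subset noise rates on the alignment set and convert them into training weights. The `group` variable below is a hypothetical subset identifier (e.g., a demographic group).

```python
# A simplified sketch: estimate per-subset noise rates from an alignment
# set (with both noisy and chart-reviewed labels), then upweight noisier
# subsets during training. Illustrative only, not the paper's exact method.
import numpy as np

def estimate_noise_rates(noisy, clean, group):
    """Fraction of flipped labels within each subset of the alignment set."""
    return {g: np.mean(noisy[group == g] != clean[group == g])
            for g in np.unique(group)}

def sample_weights(group, rates):
    """Upweight examples from subsets with higher estimated noise."""
    w = np.array([1.0 + rates[g] for g in group])
    return w / w.mean()  # normalize so weights average to 1

# Example alignment set: subset "b" is noisier than subset "a".
noisy = np.array([1, 0, 1, 1, 0, 0])
clean = np.array([1, 0, 1, 0, 1, 0])
group = np.array(["a", "a", "a", "b", "b", "b"])
rates = estimate_noise_rates(noisy, clean, group)  # {"a": 0.0, "b": ~0.67}
print(sample_weights(group, rates))
```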

This work was done in collaboration with my advisor Jenna Wiens.

Current Research

I am currently studying label noise in the context of survival analysis. Past work on label noise in survival analysis considers the single-labeler setting, where the time-to-event recorded in the dataset may not match the ground-truth time-to-event that occurred in reality. Label noise in survival analysis with multiple labelers remains understudied, yet it is relevant in settings such as healthcare, where there are often several automated labeling tools for a given condition, each with its own strengths and weaknesses.