Published in Vol 9, No 1 (2017):

Identification of Sufferers of Rare Diseases Using Medical Claims Data

Authors of this article:

Jieshi Chen; Artur Dubrawski

ISDS Annual Conference Proceedings 2017. This is an Open Access article distributed under the terms of the Creative Commons Attribution-Noncommercial 3.0 Unported License, permitting all non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

ISDS 2016 Conference Abstracts

Jieshi Chen and Artur Dubrawski
Auton Lab, Carnegie Mellon University, Pittsburgh, PA, USA

Objective

To identify sufferers of rare and hard-to-diagnose diseases by detecting sequential patterns in historical medical claims.

Introduction

Patients who suffer from rare diseases can be hard to diagnose for prolonged periods of time. In the process, they are often subjected to tentative treatments for ailments they do not have, risking an escalation of their actual condition and side effects from therapies they do not need. Early and accurate detection of these cases would enable follow-ups for precise diagnoses, mitigating the costs of unnecessary care and improving patients’ outcomes.

Methods

A sequential rule learning algorithm [1] was applied to a medical claims dataset of about 1,700 patients, pre-selected for medical histories indicative of Gaucher Disease (GD); only 25 of these patients were confirmed positives. About 168,000 medical claims and 142,000 pharmaceutical claims were featurized into sequences of asynchronous events and regularly sampled time series as inputs for the model, such that an occurrence of a certain diagnosis code in a medical claim was counted as one event along the timeline of the patient’s medical history. A similar method was applied to other key attributes of claims data, including procedure codes, National Drug Codes, Diagnosis-Related Groups, etc. These types of events, as well as their temporal statistics, e.g.
moving frequencies, peaks, change points, etc., formed the input feature space for the algorithm, which was trained to adjudicate each test case and estimate its likelihood of having GD. A random forest algorithm was also applied to the same feature set to comparatively evaluate the utility of the sequential aspects of the data. The models were evaluated with 10-fold cross-validation.

Results

Figure 1 shows the Receiver Operating Characteristic (ROC) curves. The temporal rule model attains an Area Under the Curve score exceeding 81% and significantly outperforms the random forest and default models. Considering the practical costs of performing follow-up genetic tests, we prefer a model achieving high positive recall at low risk of false detection. Our model correctly identifies more than 25% of known positive cases at a false positive rate well below 0.1%, while the performance of a more popular alternative is indistinguishable from random. This demonstrates the utility of the sequential structure of medical claims in identifying patients who suffer from rare diseases.

Our algorithm infers from data highly interpretable rules that it uses in case adjudication. Figure 2 illustrates one of them. The root node of the case adjudication tree (Event.7969) reflects the ICD-9 diagnosis code of “Other nonspecific abnormal findings”. Among the 14 patients who have this particular ICD-9 code present in their claim history, 36% are confirmed GD sufferers. Compared with the default prevalence of 1.47% in our pre-selected data set, this rule lifts the estimated likelihood of GD about 25 times. The rule further develops into two child nodes. The left child node adds the condition of having any outpatient claim observed within 43 claims of the occurrence of the root node event. It isolates 5 patients, all of whom are GD-positive. The right child shows that 3 patients without Event.7969 in their claim history but prescribed NDC 62756-0137-02 (Gabapentin by Sun Pharmaceutical Industries Ltd.) are all GD-positive.
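The rule described above can be written down as a few lines of code. The following is a minimal sketch, not the authors' implementation: the `Claim` record and its field names are hypothetical, "within 43 claims" is assumed to mean a window of 43 claims before or after the root event in chronological order, and the root claim itself is assumed not to count as the nearby outpatient claim. The `lift` helper reproduces the arithmetic behind the reported lift, assuming that 36% of 14 patients corresponds to 5 patients and that the 1.47% prevalence corresponds to 25 of 1,700.

```python
from dataclasses import dataclass, field
from typing import List, Optional

ROOT_DX = "7969"             # Event.7969, the root node of the rule
RIGHT_NDC = "62756-0137-02"  # NDC named in the right child node
WINDOW = 43                  # claim-count window named in the left child node


@dataclass
class Claim:
    """Hypothetical claim record; field names are illustrative only."""
    dx_codes: List[str] = field(default_factory=list)
    ndc: Optional[str] = None
    outpatient: bool = False


def flag_patient(claims: List[Claim]) -> bool:
    """Return True if a chronologically ordered claim history matches the rule."""
    roots = [i for i, c in enumerate(claims) if ROOT_DX in c.dx_codes]
    if roots:
        # Left child: any other outpatient claim within WINDOW claims of a root event.
        for i in roots:
            lo = max(0, i - WINDOW)
            hi = min(len(claims), i + WINDOW + 1)
            if any(claims[j].outpatient for j in range(lo, hi) if j != i):
                return True
        return False
    # Right child: no root event, but the named NDC was prescribed.
    return any(c.ndc == RIGHT_NDC for c in claims)


def lift(rule_pos: int, rule_total: int, all_pos: int, all_total: int) -> float:
    """Ratio of the positive rate under a rule to the baseline prevalence."""
    return (rule_pos / rule_total) / (all_pos / all_total)
```

Under the stated assumptions, `lift(5, 14, 25, 1700)` evaluates to about 24.3, in line with the roughly 25-fold lift reported for the root node.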
This is just one example of a simple and easy-to-implement business rule that is capable of identifying previously undiagnosed sufferers of rare diseases.

Conclusions

Our model successfully utilizes sequential relationships among events recorded in medical claims data and reveals interpretable patterns that can identify sufferers of rare diseases with high confidence. The algorithm scales well to large volumes of medical claims data, and it remains sensitive despite a very low prevalence of target cases in the data.

Figure 1: ROC diagrams of models trained to identify GD patients, shown with a decimal logarithmic scale on the false positive rate axis.
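The operating point emphasized in the Results, recall at a very low false positive rate, can be read off any set of scored test cases. The sketch below is a generic illustration and not the authors' evaluation code; the function name `recall_at_fpr` and its inputs are hypothetical. For scale, with about 1,700 patients and 25 positives there are roughly 1,675 negatives, so a 0.1% false positive rate corresponds to fewer than two false positives if the rate is computed over the pooled cross-validation predictions (an assumption).

```python
from typing import Sequence


def recall_at_fpr(scores: Sequence[float], labels: Sequence[int], max_fpr: float) -> float:
    """Highest recall reachable by a score threshold whose FPR stays <= max_fpr.

    scores: higher means more likely positive; labels: 1 = positive, 0 = negative.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative case")

    # Walk down the ranking from the highest score, lowering the threshold.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    best = 0.0
    i = 0
    while i < len(order):
        # Cases with tied scores cross the threshold together.
        j = i
        while j < len(order) and scores[order[j]] == scores[order[i]]:
            if labels[order[j]]:
                tp += 1
            else:
                fp += 1
            j += 1
        if fp / n_neg > max_fpr:
            break  # FPR only grows from here on
        best = tp / n_pos
        i = j
    return best
```

Plotting recall against the false positive rate on a logarithmic axis, as in the ROC diagrams above, makes this low-FPR region visible; on a linear axis it would be compressed into the left edge of the plot.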