Abstract
Objective
The research objective was to develop and validate an automated system to extract and classify patient alcohol use based on unstructured (i.e., free) text in primary care electronic medical records (EMRs).

Introduction
EMRs are a potentially valuable source of information about a patient’s history of health risk behaviors, such as excessive alcohol consumption or smoking. This information is often found in the unstructured (i.e., free) text of physician notes. It may be difficult to classify and analyze health risk behaviors because there are no standardized formats for this type of information [1]. As well, the completeness of the data may vary across clinics and physicians. The application of automated classification tools for this type of information could be useful for describing patterns within the population and developing disease risk prediction models.

Natural Language Processing (NLP) tools are currently used to process EMR free text in an automated and systematic way. However, these tools have primarily been applied to classify information about the presence or absence of disease diagnoses [2]. The application of NLP tools to health risk behaviors, particularly alcohol use information from primary care EMRs, has thus far received limited attention.

Methods
Study data were from the Manitoba regional network of the Canadian Primary Care Sentinel Surveillance Network (CPCSSN) for the period from 1998 to 2016. CPCSSN is a national primary care surveillance network for chronic diseases comprised of 11 regional networks with publicly funded healthcare systems. Currently, a total of 53 clinics and more than 260 physicians provide data to CPCSSN in Manitoba. We classified each record based on unstructured text from physician notes into the following mutually exclusive categories: current drinker, not a current drinker, and unknown [1].
A standardized de-identification process was applied to each record prior to applying an NLP tool to the data.

Text classification used a support vector machine (SVM) applied to both unigrams (i.e., single words) and mixed grams (i.e., unigrams, and pairs of words known as bigrams) from a bag-of-words model in which each record is quantified by the relative frequency of occurrence of each word in the record [3]. The training dataset for the SVM was composed of 2000 records classified by two primary care physicians. These physicians were initially trained using an independent sample of 200 EMR text strings containing specific reference to alcohol use. Cohen’s kappa statistic, a chance-adjusted measure, was used to estimate agreement. Internal validation of the SVM was conducted using 10-fold cross-validation. Model performance was assessed using recall and precision statistics, as well as the F-measure statistic, which is the harmonic mean of precision and recall. All analyses were conducted using the open-source R software environment.

Results
A total of 57,663 unique records were included in the study. The estimate of the kappa statistic for the physician training phase was 0.98, indicating excellent agreement. Subsequent classification of the training dataset by the physicians resulted in 1.7% of records assigned as not a current drinker, 16.8% as current drinker, and 81.5% as unknown. Average estimates of recall for the 10 validation folds using unigrams were 0.62 for not current drinkers, 0.86 for current drinkers, and 0.98 for the unknown category. Average estimates of recall using mixed grams were 0.48, 0.84, and 0.97, respectively. Estimates of precision were higher with mixed grams than unigrams for the not currently drinking category, but the opposite was true for the current drinker category. There was no difference in precision between the two methods for the unknown category.
The F-measure statistic was higher for classification of current drinkers using unigrams (0.89) than mixed grams (0.83), although the differences for the unknown category were negligible (0.98 versus 0.97). Application of the SVM with unigrams to the entire dataset resulted in 15.3% of records classified as current drinkers, 2.0% classified as not current drinkers, and 82.7% as unknown.

Conclusions
This study developed an automated system to classify unstructured text about alcohol consumption into mutually exclusive alcohol use categories. However, we found that only a small percentage of primary care records contained documentation about alcohol consumption, which limits the utility of the automated tool and the data source for disease risk prediction or alcohol use prevalence estimation [1]. While our automated approach is useful for processing existing EMR data, systematic documentation of alcohol consumption will benefit from standardized entry fields and terms to produce clinically meaningful information that will improve the understanding of health risk behaviors in primary care populations.
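
The evaluation statistics named in the Methods (Cohen's kappa for rater agreement, and per-category precision, recall, and F-measure) can be illustrated with a minimal sketch. The study's analyses were run in R; Python is used here purely for illustration. The function names `cohen_kappa` and `per_class_scores` are hypothetical, and the ten toy labels are invented for the example and are not study data.

```python
from collections import Counter

# The three mutually exclusive categories described in the Methods.
LABELS = ["current drinker", "not a current drinker", "unknown"]


def cohen_kappa(rater_a, rater_b):
    """Chance-adjusted agreement between two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement by chance, from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def per_class_scores(truth, predicted, labels=LABELS):
    """Return {label: (precision, recall, f_measure)}, one-vs-rest per category."""
    scores = {}
    pairs = list(zip(truth, predicted))
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        # F-measure: harmonic mean of precision and recall.
        if precision + recall:
            f_measure = 2 * precision * recall / (precision + recall)
        else:
            f_measure = 0.0
        scores[label] = (precision, recall, f_measure)
    return scores


# Toy example (invented labels, NOT study data).
truth = (["current drinker"] * 4
         + ["not a current drinker"] * 2
         + ["unknown"] * 4)
predicted = ["current drinker", "current drinker", "current drinker", "unknown",
             "not a current drinker", "current drinker",
             "unknown", "unknown", "unknown", "unknown"]

toy_scores = per_class_scores(truth, predicted)
toy_kappa = cohen_kappa(truth, predicted)

for label, (precision, recall, f_measure) in toy_scores.items():
    print(f"{label}: precision={precision:.2f} recall={recall:.2f} F={f_measure:.2f}")
print(f"kappa={toy_kappa:.2f}")
```

In a 10-fold cross-validation such as the one described above, scores like these would be computed on each held-out fold and then averaged. Reporting them per category matters here because the unknown category dominates the data, so a single overall accuracy figure would mask weak recall for the rare not a current drinker category.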