Abstract
Objective
To evaluate prediction of laboratory diagnosis of acute respiratory infection (ARI) from participatory data using machine learning models.

Introduction
ARIs have epidemic and pandemic potential. In existing studies, prediction of the presence of ARIs from individual signs and symptoms has been based on clinically sourced data [1]. Clinical data generally represent the most severe cases and those from locations with access to healthcare institutions. Thus, the viral information that comes from clinical sampling is insufficient to capture either disease incidence in general populations or its predictability from symptoms. Participatory data (information that individuals today can produce on their own), enabled by the ubiquity of digital tools, can help fill this gap by providing self-reported data from the community. Internet-based participatory efforts such as Flu Near You [2] have augmented existing ARI surveillance through early and widespread detection of outbreaks and public health trends.

Methods
The GoViral platform [3] was established to obtain self-reported symptoms and diagnostic specimens from the community (Table 1 summarizes participation detail). Participants from the states with the most data (MA, NY, CT, NH, and CA) were included. Age, gender, zip code, and vaccination status were requested from each participant. Participants submitted saliva and nasal swab specimens and reported symptoms from the following list: fever, cough, sore throat, shortness of breath, chills, fatigue, body aches, headache, nausea, and diarrhea. Pathogens were confirmed via RT-PCR on a GenMark respiratory panel assay (full virus list reported previously [3]).
Observations with missing, invalid, or equivocal lab tests were removed. Table 2 summarizes the binary features. Age categories were ≤20, >20 and <40, and ≥40, representing young, middle-aged, and old.
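The binary coding and age bins described so far can be sketched as follows. This is a minimal illustration only: the actual coding in Table 2 is not reproduced here, so the feature names (`age_young`, etc.) and the helper functions `age_category` and `encode_report` are assumptions rather than the study's implementation.

```python
# Symptom list as given in the Methods text.
SYMPTOMS = [
    "fever", "cough", "sore throat", "shortness of breath", "chills",
    "fatigue", "body aches", "headache", "nausea", "diarrhea",
]

AGE_BINS = ("young", "middle-aged", "old")


def age_category(age):
    """Map an age in years to the three bins: <=20, >20 and <40, >=40."""
    if age <= 20:
        return "young"
    if age < 40:
        return "middle-aged"
    return "old"


def encode_report(reported_symptoms, age):
    """Return a dict of 0/1 features for one participant report.

    Feature names are hypothetical; Table 2 may code them differently.
    """
    reported = {s.strip().lower() for s in reported_symptoms}
    features = {s: int(s in reported) for s in SYMPTOMS}
    category = age_category(age)
    for name in AGE_BINS:
        features["age_" + name] = int(name == category)
    return features
```

For example, `encode_report(["Fever", "cough"], 35)` sets `fever`, `cough`, and `age_middle-aged` to 1 and every other feature to 0.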
Missing age and gender values were imputed based on overall distributions.
Three machine learning algorithms were considered: Support Vector Machines (SVMs) [4], Random Forests (RFs) [5], and Logistic Regression (LR). Both individual features and their combinations were assessed. The outcome was the presence (1) or absence (0) of a laboratory diagnosis of ARI.

Results
Ten-fold cross-validation was repeated ten times. The evaluation metrics used were positive predictive value (PPV), negative predictive value (NPV), sensitivity, and specificity [6]. LR and SVMs yielded the best PPV of 0.64 (standard deviation: ±0.08) with cough and fever as predictors. The best sensitivity of 0.59 (±0.14) was from LR using cough, fever, and sore throat. RFs had the best NPV and specificity, 0.62 (±0.15) and 0.83 (±0.10) respectively, with the CDC ILI symptom profile of fever and (cough or sore throat). Adding demographics and vaccination status did not improve the performance of the classifiers. Results are consistent with studies using clinically sourced data: cough and fever together were found to be the best predictors of flu-like illness [1]. Because our data include mildly infectious and asymptomatic cases, the classifier sensitivity and PPV are low compared to results from clinical data.

Conclusions
Fever and cough together are good predictors of ARI in the community, but clinical data may overestimate their predictive value due to sampling bias. Integration of participatory data can not only improve population health by actively engaging the general public [2] but also improve the scope of studies based solely on clinically sourced surveillance data.

Table 1. Details of included participants.
Table 2. Coding of binary features
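The four evaluation metrics named in the Results can be computed from binary predictions as in the sketch below. It is illustrative only: the CDC ILI profile, fever and (cough or sore throat), is applied here as a stand-alone rule, whereas the study used it as a predictor for trained classifiers. The toy reports and lab outcomes are made-up values, not study data, and all function names are assumptions.

```python
def ili_profile(fever, cough, sore_throat):
    """CDC ILI symptom profile: fever and (cough or sore throat)."""
    return int(bool(fever) and (bool(cough) or bool(sore_throat)))


def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def safe_div(numerator, denominator):
    """Return NaN instead of raising when a denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def evaluate(y_true, y_pred):
    """Return PPV, NPV, sensitivity, and specificity as a dict."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return {
        "ppv": safe_div(tp, tp + fp),          # precision of positive calls
        "npv": safe_div(tn, tn + fn),          # precision of negative calls
        "sensitivity": safe_div(tp, tp + fn),  # true positive rate
        "specificity": safe_div(tn, tn + fp),  # true negative rate
    }


# Made-up toy data: (fever, cough, sore_throat) per report, and lab outcome.
toy_reports = [(1, 1, 0), (1, 0, 0), (0, 1, 1), (1, 0, 1), (0, 0, 0), (0, 1, 0)]
toy_labs = [1, 0, 1, 0, 0, 0]
toy_preds = [ili_profile(*report) for report in toy_reports]
toy_metrics = evaluate(toy_labs, toy_preds)
```

In a repeated ten-fold cross-validation, `evaluate` would be called once per held-out fold, and the means and standard deviations of each metric taken across the 100 resulting folds.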