Abstract
Objective
The objective of this analysis is to leverage recent advances in natural language processing (NLP) to develop new methods and system capabilities for processing social media (Twitter messages) for situational awareness (SA), syndromic surveillance (SS), and event-based surveillance (EBS). Specifically, we evaluated the use of human-in-the-loop semantic analysis to assist public health (PH) SA stakeholders in SS and EBS using massive amounts of publicly available social media data.

Introduction
Social media messages are often short, informal, and ungrammatical. They frequently involve text, images, audio, or video, which makes the identification of useful information difficult. This complexity reduces the efficacy of standard information extraction techniques¹. However, recent advances in NLP, especially methods tailored to social media², have shown promise in improving real-time PH surveillance and emergency response³. Surveillance data derived from semantic analysis, combined with traditional surveillance processes, has the potential to improve event detection and characterization. The CDC Office of Public Health Preparedness and Response (OPHPR), Division of Emergency Operations (DEO), and the Georgia Tech Research Institute have collaborated on the advancement of PH SA through the development of new approaches to using semantic analysis for social media.

Methods
To understand how computational methods may benefit SS and EBS, we studied an iterative refinement process in which the data user actively cultivated text-based topics (“semantic culling”) in a semi-automated SS process. This ‘human-in-the-loop’ process was critical for creating accurate and efficient extraction functions in large, dynamic volumes of data. The general process involved identifying a set of expert-supplied keywords, which were used to collect an initial set of social media messages. For purposes of this analysis, researchers applied topic modeling to categorize related messages into clusters.
Topic modeling uses statistical techniques to cluster messages semantically and automatically determine salient aggregations. A user then semantically culled messages according to their PH relevance.

In June 2016, researchers collected 7,489 worldwide English-language Twitter messages (tweets) and compared three sampling methods: a baseline random sample (C1, n=2700), a keyword-based sample (C2, n=2689), and one gathered after semantically culling C2 topics of irrelevant messages (C3, n=2100). Researchers utilized a software tool, Luminoso Compass⁴, to sample and perform topic modeling using its real-time modeling and Twitter integration features. For C2 and C3, researchers sampled tweets that the Luminoso service matched to both clinical and layman definitions of Rash, Gastro-Intestinal syndromes⁵, and Zika-like symptoms. Layman terms were derived from clinical definitions using plain-language medical thesauri. ANOVA statistics were calculated using SPSS software, version. Post-hoc pairwise comparisons were completed using Tukey’s honest significant difference (HSD) test.

Results
An ANOVA was conducted, finding the following mean relevance values: 3% (+/- 0.01%), 24% (+/- 6.6%), and 27% (+/- 9.4%) for C1, C2, and C3, respectively. Post-hoc pairwise comparison tests showed that the percentages of discovered messages related to the event were significantly higher with the C2 and C3 methods than with the C1 method (random sampling) (p<0.05). This indicates that the human-in-the-loop approach provides benefits in filtering social media data for SS and EBS. Notably, this increase is based on a single iteration of semantic culling; subsequent iterations could be expected to increase the benefits.

Conclusions
This work demonstrates the benefits of incorporating non-traditional data sources into SS and EBS. It was shown that an NLP-based extraction method in combination with human-in-the-loop semantic analysis may enhance the potential value of social media (Twitter) for SS and EBS.
It also supports the claim that advanced analytical tools for processing non-traditional SA, SS, and EBS sources, including social media, have the potential to enhance disease detection, risk assessment, and decision support by reducing the time it takes to identify public health events.
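
As an illustration of the sampling comparison described in Methods, the following is a minimal Python sketch of keyword sampling, one pass of semantic culling, and the relevance rate used to compare samples. It is not the Luminoso Compass workflow used in the study. The `Message` fields, the function names, the topic labels, and the five toy messages are all hypothetical and chosen only to make the steps concrete; in practice the topic labels would come from a topic model and the relevance flags from a human reviewer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    text: str
    topic: str      # hypothetical topic label, e.g. assigned by a topic model
    relevant: bool  # hypothetical human judgment of PH relevance


def keyword_sample(messages, keywords):
    """Keep messages containing at least one keyword (case-insensitive)."""
    lowered = [k.lower() for k in keywords]
    return [m for m in messages if any(k in m.text.lower() for k in lowered)]


def cull_topics(messages, irrelevant_topics):
    """Drop every message whose topic a reviewer marked as irrelevant."""
    return [m for m in messages if m.topic not in irrelevant_topics]


def relevance_rate(messages):
    """Fraction of messages judged relevant; 0.0 for an empty sample."""
    if not messages:
        return 0.0
    return sum(m.relevant for m in messages) / len(messages)


# Toy data for illustration only; these are not tweets from the study.
messages = [
    Message("itchy rash on my arms since yesterday", "symptoms", True),
    Message("rash decision to buy those shoes", "shopping", False),
    Message("stomach bug going around the office", "symptoms", True),
    Message("great game last night", "sports", False),
    Message("this traffic makes me sick to my stomach", "commute", False),
]

c1 = messages                                   # stand-in for a random baseline
c2 = keyword_sample(messages, ["rash", "stomach"])  # keyword-based sample
c3 = cull_topics(c2, {"shopping"})              # after one culling pass

for name, sample in (("C1", c1), ("C2", c2), ("C3", c3)):
    print(name, len(sample), round(relevance_rate(sample), 2))
```

The sketch removes whole topics rather than individual messages, which mirrors the idea of culling at the topic level. Repeating `cull_topics` with an updated set of irrelevant topics would correspond to further iterations of the human-in-the-loop process.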