Abstract
Objective
We present the support vector subset scan (SVSS), a new method for detecting localized and irregularly shaped patterns in spatial data. SVSS integrates the penalized fast subset scan [3] with a kernel support vector machine classifier to accurately detect disease clusters that are compact and irregular in shape.

Introduction
Neill’s fast subset scan [2] detects significant spatial patterns of disease by efficiently maximizing a log-likelihood ratio statistic over subsets of locations, but may result in patterns that are not spatially compact. The penalized fast subset scan (PFSS) [3] provides a flexible framework for adding soft constraints to the fast subset scan, rewarding or penalizing inclusion of individual points into a cluster with additive point-specific penalty terms. We propose the support vector subset scan (SVSS), a novel method that iteratively assigns penalties according to distance from the separating hyperplane learned by a kernel support vector machine (SVM). SVSS efficiently detects disease clusters that are geometrically compact and irregular.

Methods
Speakman [3] observes that for a fixed value of relative risk q, the log-likelihood ratio for the exponential family of expectation-based scan statistics can be written as an additive set function over all data elements. This property enables addition of element-specific penalty terms to the log-likelihood ratio, interpreted as the prior log-odds of including a data point in the cluster. We propose an iterative method for setting the penalty terms which leads to spatially compact clusters, alternately running PFSS to obtain an optimal subset and training a kernel SVM to maximize the margin between points within and outside of the subset. On each iteration of PFSS, penalties are assigned based on distance to the SVM decision boundary.
We apply random restarts across the penalty space to approach a global optimum in the non-convex SVSS objective function.

Results
We demonstrate detection of disease clusters in mosquito pools tested for West Nile Virus (WNV), using data made publicly available by the Chicago Department of Public Health through the City of Chicago Data Portal. In comparison to the circular scan [1], which detects circular patterns with elevated WNV, SVSS has improved power to detect disease clusters that are elongated or irregular in shape. For example, the top WNV cluster detected by SVSS roughly conforms to sections of two major rivers in North Chicago, overlapping significant portions of the forest preserves adjacent to these rivers. The unconstrained fast subset scan [2] has high detection power for subtle and irregular disease clusters, but finds patterns that are spatially sparse and intermingled with non-anomalous points. SVSS rewards patterns with spatial coherence, detecting clusters that are compact and separated from non-anomalous points while maintaining power to detect slight but significant increases in detected rates of WNV.

Conclusions
SVSS introduces soft spatial constraints to the fast subset scan [2] in the form of penalties to the log-likelihood ratio statistic, learned iteratively based on distance to a high-dimensional SVM decision boundary. These constraints give SVSS greater power to detect spatially compact and irregular patterns of disease.

Figure: Clusters of West Nile Virus detected by three scanning algorithms.
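
The additive property described under Methods can be illustrated with a minimal sketch. It assumes the expectation-based Poisson statistic as one member of the exponential family, for which each location with count c_i and baseline b_i contributes c_i log q + b_i (1 - q) to the log-likelihood ratio at fixed q. The function names, the toy data, and the coarse grid over q are hypothetical simplifications for illustration, not the authors' implementation. In SVSS the penalty vector would be derived from distances to the SVM decision boundary; here it is simply passed in by hand.

```python
import math


def penalized_subset_for_q(counts, baselines, penalties, q):
    """Return the score-maximizing subset for a fixed relative risk q.

    Because the penalized score is a sum of per-element terms, the optimal
    subset is exactly the set of elements whose term is positive.
    Assumes the expectation-based Poisson statistic (an illustrative choice).
    """
    subset, score = [], 0.0
    for i, (c, b, delta) in enumerate(zip(counts, baselines, penalties)):
        # Per-element log-likelihood ratio contribution plus its penalty.
        term = c * math.log(q) + b * (1.0 - q) + delta
        if term > 0:
            subset.append(i)
            score += term
    return subset, score


def penalized_scan(counts, baselines, penalties, q_grid):
    """Search a grid of q values (a simplification) for the best subset."""
    best_subset, best_score, best_q = [], 0.0, None
    for q in q_grid:
        subset, score = penalized_subset_for_q(counts, baselines, penalties, q)
        if score > best_score:
            best_subset, best_score, best_q = subset, score, q
    return best_subset, best_score, best_q


# Hypothetical toy data: observed counts and expected baselines at 4 locations.
counts = [12, 3, 9, 2]
baselines = [5.0, 4.0, 5.0, 3.0]
q_grid = [1.2, 1.5, 2.0]

# No penalties: the plain fast subset scan over the grid.
unpenalized = penalized_scan(counts, baselines, [0.0] * 4, q_grid)

# A strong negative penalty on location 2 (e.g. far on the wrong side of a
# decision boundary) removes it from the detected subset.
penalized = penalized_scan(counts, baselines, [0.0, 0.0, -5.0, 0.0], q_grid)
```

Under these assumptions, the per-element decision rule is what makes soft constraints cheap: changing a penalty affects only that element's inclusion test, so no search over subsets is needed at a given q.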