Mostly Issues Related to Conflict Mortality Estimation
8:40- Taxi Pick up from Marriott City Center (please have buffet breakfast at the hotel and charge to room)
9:00-9:45- Kristian Lum, Adjusting expectations for the F1 score on imbalanced data: an application to model assessment for record-linkage
10:00-10:45- Patrick Ball, Deduplicating databases of deaths in war: advances in adaptive blocking, pairwise classification, and clustering
10:45-11:00- Coffee/Tea in hallway
11:00-11:45- Anshumali Shrivastava, Probabilistic Fingerprints for Scalable and Secure Record Linkage (Deduplication)
1:00-1:45- Rebecca Steorts, Understanding the Syrian Conflict: Not Just Another Enumeration
2:00-2:45- James Johndrow, A new approach to heterogeneous population estimation
2:45-3:00- Coffee/Tea in Hallway
3:00-3:45- Robin Kirk, Mapping Human Rights: using geographical data to map the human rights past
4:00-5:00- Panel Discussion (all presenters)
6:30- Dinner at the Piedmont Restaurant (walking distance from hotel) (reservation under David Banks's name)
8:40- Taxi Pick up from Marriott City Center (Please have buffet breakfast at hotel and charge to your room).
9:00-9:45- Jay Aronson, Developing a long-term, collaborative research agenda (and funding model) for statistics and human rights
10:00-10:45- Chris McNaboe
10:45-11:00- Coffee/Tea in hallway
11:00-11:45- Megan Price
1:00-1:45- Robin Mejia, Case studies on data issues in human rights investigation
2:00-2:45- Duncan Thomas, Costs and benefits of following the hard to follow: Evidence on following migrants and the displaced in longitudinal surveys
2:45-3:00- Coffee/Tea in hallway
3:00-3:45- David Banks, Cost-Benefit Analysis of War
4:00-4:45- Daniel Manrique-Vallier, Multiple-Recapture Estimation of Casualties in Armed Conflicts Using Dirichlet Process Mixtures.
4:45-5:30- Panel Discussion (all presenters)
Adjusting expectations for the F1 score on imbalanced data: an application to model assessment for record-linkage
Imbalanced data (binary data in which one value predominates) is routinely encountered in many modern applications. When training a classifier on imbalanced data, it is common to create balanced training sets containing roughly equal numbers of positives and negatives. Performance is often assessed by computing the F1 score on an imbalanced validation set representative of the data at large. Motivated by an application to record linkage, we demonstrate mathematically and through numerical examples that the F1 score will always degrade as the validation data become more imbalanced, even if the classifier has identical predictive performance on the training and validation data. In this scenario a reduced F1 score on the validation set may be interpreted by practitioners as model over-fitting, when in fact it is a mathematical property of the metric. We then propose an alternative approach to measuring the performance of binary classifiers for record-linkage, and apply this approach to link several datasets containing the names of people killed in Syria.
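To make the effect concrete, the following minimal sketch (an illustration, not the authors' code) holds a classifier's sensitivity and false positive rate fixed and recomputes the expected F1 score as the validation data grow more imbalanced; the operating point of 90% sensitivity and a 5% false positive rate is assumed purely for illustration.

```python
# Expected F1 for a classifier with fixed TPR/FPR, as a function of class prevalence.
def f1_at_prevalence(tpr, fpr, prevalence, n=100_000):
    pos = n * prevalence
    neg = n - pos
    tp = tpr * pos          # true positives
    fp = fpr * neg          # false positives grow with the number of negatives
    fn = (1 - tpr) * pos    # false negatives
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# The classifier never changes (90% sensitivity, 5% false positive rate),
# yet F1 degrades as the validation set becomes more imbalanced.
for prevalence in [0.5, 0.1, 0.01, 0.001]:
    print(f"prevalence {prevalence:>6}: F1 = {f1_at_prevalence(0.9, 0.05, prevalence):.3f}")
```

With these assumed numbers the F1 score falls from roughly 0.92 at balance to under 0.04 at a prevalence of 0.1%, which is exactly the degradation the abstract attributes to the metric rather than to over-fitting.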
Deduplicating databases of deaths in war: advances in adaptive blocking, pairwise classification, and clustering
Violent inter-state and civil wars are documented with lists of casualties. There are often several lists, with duplicate entries within and among the lists, requiring record linkage to deduplicate them. This talk will explore how we do record linkage, including an adaptive blocking approach; pairwise classification with string, date, and integer features and several classifiers; and clustering. Assessment metrics will be proposed for each stage, with real-world results from deduplicating more than 350,000 records of Syrian people killed since 2011.
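As a rough illustration of the pipeline's shape (a sketch on toy data, not HRDAG's actual implementation), the following blocks records on a cheap hand-written key and then scores candidate pairs within each block with a single string-similarity feature standing in for the full string/date/integer feature set; a clustering step would follow.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Toy records; real inputs would carry many more fields.
records = [
    {"id": 1, "name": "ahmad al-hassan", "date": "2013-05-02"},
    {"id": 2, "name": "ahmed al hassan", "date": "2013-05-02"},
    {"id": 3, "name": "mariam khalil",   "date": "2012-11-19"},
]

def block_key(rec):
    # Hypothetical blocking rule: first three letters of the name plus year of death.
    # An adaptive blocking scheme would learn such rules rather than fix them by hand.
    return (rec["name"][:3], rec["date"][:4])

blocks = defaultdict(list)
for rec in records:
    blocks[block_key(rec)].append(rec)

def match_score(a, b):
    # A single name-similarity feature stands in for the full feature vector
    # of string, date, and integer comparisons fed to a trained classifier.
    return SequenceMatcher(None, a["name"], b["name"]).ratio()

candidate_pairs = [
    (a["id"], b["id"], match_score(a, b))
    for block in blocks.values()
    for a, b in combinations(block, 2)
]
matches = [(i, j) for i, j, s in candidate_pairs if s > 0.8]
print(matches)  # pairs judged to refer to the same person; clustering merges them
```

Blocking keeps the number of pairwise comparisons manageable, which is what makes the classification and clustering stages feasible at the scale of hundreds of thousands of records.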
Understanding the Syrian Conflict: Not Just Another Enumeration
While the Syrian conflict is extremely well documented, providing a reliable enumeration of its victims is challenging. Victims are often reported in multiple data sets, frequently with missing vital identifiers, while other victims are not reported at all. The key insight is that this multiple reporting can be exploited: in order to produce a reliable enumeration, record linkage (otherwise known as de-duplication or entity resolution) is used to merge the noisy data sets and remove the duplicated information.
Record linkage itself is a multi-step process. The crucial first step involves a data reduction component, blocking, which divides the space of records into partitions of similar records. On any moderately sized data set it is essential to avoid all-to-all record comparisons, thus emphasizing the use of computationally scalable blocking algorithms. Once blocked partitions have been established, any record linkage method can be applied within these partitions to remove the duplicated victim records. After the duplicated victims are removed by record linkage, an enumeration and a standard error of this enumeration can be computed (a textbook version of this step is sketched below). We present three methods for providing such an enumeration, assess our results, and provide insights for future directions.
Joint work with Abbas Zaidi, Anshumali Shrivastava, Megan Price, Rebecca C. Steorts
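As a point of reference for the enumeration step, once duplicates are resolved a two-list capture-recapture estimate with a standard error can be computed; the sketch below uses the textbook Chapman bias-corrected estimator on made-up counts and is not one of the three methods presented in the talk.

```python
import math

# Hypothetical post-deduplication counts: two lists and their overlap.
n1, n2, m = 4000, 3500, 1200

# Chapman's bias-corrected Lincoln-Petersen estimator and its estimated variance.
N_hat = (n1 + 1) * (n2 + 1) / (m + 1) - 1
var = ((n1 + 1) * (n2 + 1) * (n1 - m) * (n2 - m)) / ((m + 1) ** 2 * (m + 2))

print(f"estimated total: {N_hat:,.0f}  (SE {math.sqrt(var):,.0f})")
```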
A new approach to heterogeneous population estimation
Capture-recapture methods aim to estimate the size of a closed population on the basis of multiple incomplete enumerations of individuals. In many applications, the individual probability of being recorded is heterogeneous in the population. Previous studies have suggested that it is not possible to reliably estimate the total population size when capture heterogeneity exists. Here we approach population estimation in the presence of capture heterogeneity as a specialized nonparametric density estimation problem. We show mathematically that in this setting it is generally impossible to estimate the density on the entire real line in finite samples, and that estimators of the density will converge to the true density at a logarithmic rate or worse in the sample size. As an alternative, we propose estimating the population of individuals with capture probability exceeding some threshold. We provide methods for selecting an appropriate threshold, and show that this approach results in estimators with substantially lower risk than estimators of the total population size, with correspondingly smaller uncertainty. The alternative paradigm is demonstrated in extensive simulation studies and an application to snowshoe hare multiple recapture data.
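The simulation below (a minimal sketch, not the authors' code) shows the basic difficulty: when capture probabilities vary across individuals, a naive two-list estimator is pulled well below the true population size, because highly visible individuals dominate both lists. The Beta(0.5, 2) heterogeneity model is assumed only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10_000                                  # true population size
p = rng.beta(0.5, 2.0, size=N)              # heterogeneous capture probabilities

list1 = rng.random(N) < p                   # two independent captures given p
list2 = rng.random(N) < p
n1, n2, m = list1.sum(), list2.sum(), (list1 & list2).sum()

N_hat = n1 * n2 / m                         # naive Lincoln-Petersen estimate
print(f"true N = {N:,}, estimated N = {N_hat:,.0f}")  # typically far below 10,000
```

Restricting attention to individuals whose capture probability exceeds a threshold, as the talk proposes, targets a quantity that can be estimated from such data with far lower risk.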
Probabilistic Fingerprints for Scalable and Secure Record Linkage (Deduplication)
The dramatic growth in data volumes has made conventional statistical methodologies nearly infeasible, due to their prohibitive computational requirements. We are fast approaching an era in which, for most applications, super-linear (in the size of the data) runtime is practically infeasible. In this talk, I will introduce probabilistic hashing (or fingerprinting) techniques as a solution. The proposed techniques trade a small amount of certainty, insignificant for most practical purposes, for huge, often exponential, gains in computational complexity. I will demonstrate a linear-time algorithm for record linkage with results on a real deduplication task over Syrian Death Records. I will then show how to modify these fingerprints to restrict information leakage, leading to privacy-preserving fingerprinting.
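A minimal MinHash-style sketch, assuming one common form of probabilistic fingerprinting rather than the speaker's specific construction: each name is fingerprinted by the minimum hash of its character 3-grams, and the fraction of agreeing fingerprint positions estimates the Jaccard similarity of the names.

```python
import hashlib

def shingles(name, k=3):
    # Character k-grams of a normalized name.
    name = name.lower().replace(" ", "")
    return {name[i:i + k] for i in range(len(name) - k + 1)}

def minhash(shingle_set, num_hashes=128):
    # One minimum per seeded hash function; the signature is the fingerprint.
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{s}".encode(), digest_size=8).digest(), "big")
            for s in shingle_set
        ))
    return sig

a, b = shingles("ahmad al-hassan"), shingles("ahmed al hassan")
sig_a, sig_b = minhash(a), minhash(b)

est = sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
exact = len(a & b) / len(a | b)
print(f"estimated Jaccard {est:.2f} vs exact {exact:.2f}")
```

In a full system the signatures would be banded into hash tables (locality-sensitive hashing) so that likely duplicates collide, which is what yields near-linear-time candidate generation; limiting what the fingerprints reveal about the underlying records is the privacy-preserving extension.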
Multiple-Recapture Estimation of Casualties in Armed Conflicts Using Dirichlet Process Mixtures
Beginning with the pioneering work of Patrick Ball in Guatemala in 1999, Multiple-Recapture (MR) techniques have become a popular tool for estimating total numbers of casualties in armed conflicts from multiple incomplete lists. A major challenge in these applications is to correctly account for individual heterogeneity of capture and dependence between lists. Classic log-linear modeling is often a simple and reasonable approach; however, difficult problems such as model selection and low tolerance to sparsity when dealing with large numbers of lists limit its broader applicability and often require ad-hoc solutions. In this talk I present a fully Bayesian method based on Dirichlet process mixtures. This method offers a principled way of accounting for complex patterns of heterogeneity of capture, obviates the need for a separate model selection process, and is computationally efficient. Additionally, it has a high tolerance for sparsity. I illustrate it using historical data from conflicts in Kosovo and Colombia.
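As a toy illustration only (a two-class finite mixture standing in for the Dirichlet process mixture, with made-up capture probabilities), the following simulates capture histories across four lists; the latent classes induce both heterogeneity and apparent dependence between lists, and the unobserved all-zero histories are precisely what an MR model must recover.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(1)
N, K = 5_000, 4                                   # population size, number of lists
classes = rng.choice(2, size=N, p=[0.7, 0.3])     # latent "visibility" classes
p = np.array([[0.05, 0.10, 0.08, 0.04],           # capture probs, low-visibility class
              [0.60, 0.45, 0.50, 0.40]])          # capture probs, high-visibility class

histories = rng.random((N, K)) < p[classes]       # one capture history per person
observed = histories.any(axis=1)

counts = Counter(tuple(int(x) for x in h) for h in histories[observed])
print(f"observed {observed.sum():,} of {N:,}; unseen (all-zero) cell = {N - observed.sum():,}")
print("most common observed histories:", counts.most_common(3))
```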
Case studies on data issues in human rights investigation
In this talk, I’ll discuss the importance of careful assessment of the data sources used in quantitative human rights investigations. The talk will focus on two case studies: investigations into child abductions that occurred during El Salvador’s civil war, and deaths in Syria. El Salvador suffered a brutal civil war between 1979 and 1992. Military abductions of children were a feature of the conflict, but the extent of the practice has not been quantified. I assess data provided by the Salvadoran NGO La Asociación Pro-Búsqueda por Niñas y Niños Desaparecidos (Pro-Búsqueda), which investigates these cases, and present a characterization of known cases. I then explore issues that arise when attempting to parse the data into multiple lists (cases opened by parents and cases opened by children) for use in establishing a capture-recapture estimate of the total number of abductions. In addition, if time permits, I will present initial results characterizing datasets provided by Syrian observer organizations to the Human Rights Data Analysis Group (HRDAG) for use in estimating the number of observed deaths, and also the number of total deaths, in that conflict. Several groups have provided HRDAG with multiple snapshots of their datasets, for example providing data in May 2013 and again in June 2014. We see that updates not only include new incidents that have occurred since May 2013, but also include revisions to the data from the beginning of the war.
Cost-Benefit Analysis of War
Surprisingly little seems to have been written about the tradeoffs made when a country decides to go to war. This talk looks at the four discretionary wars fought by the United States between 1950 and 2000. The main goal is to find sensible methods, but some calculations are possible.