Examining the Role of Training Data for Supervised Methods of Automated Record Linkage

Jonas Helgertz, University of Minnesota/Lund University
Joseph P. Price, Brigham Young University
James Feigenbaum, Boston University

During the past decade, a vast amount of research based on linked, digitized, individual-level historical data has emerged, with conclusions that shape and change our understanding of the past. This revolution is in no small part attributable to important methodological advances that have made record linkage easily accessible to researchers. Many scholars link records with supervised methods that require training data: examples of matched and non-matched cases used to train automated linking algorithms. This paper addresses the process of generating training data, an aspect of record linkage that has largely been overlooked to date. The current literature often takes for granted that training data is of sufficient quality, and with it that the optimized linking algorithm will perform well and that fit statistics (e.g., precision and recall) are informative. We investigate different procedures for creating training data, evaluate how they differ in quality and in the resources required, and analyze how much training data quality matters for supervised methods of automated record linkage. We also provide detailed instructions for building training data that should be useful to scholars working with any budget.
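As a minimal illustration of the fit statistics mentioned above, the sketch below computes precision and recall for a set of predicted links against a hand-labeled reference set. It is not the authors' procedure; the function name `precision_recall` and the record identifiers are hypothetical. It also shows why these statistics depend on the labeled set: both are measured only against the links that the training data treats as true.

```python
def precision_recall(predicted_links, true_links):
    """Return (precision, recall) of predicted links against labeled true links.

    Each link is a pair (id_in_source_a, id_in_source_b).
    """
    predicted = set(predicted_links)
    truth = set(true_links)
    true_positives = len(predicted & truth)
    # Precision: share of predicted links that the labeled data confirms.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of labeled true links that the algorithm recovered.
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical record identifiers, for illustration only.
true_links = [("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4")]
predicted_links = [("a1", "b1"), ("a2", "b2"), ("a3", "b9")]

precision, recall = precision_recall(predicted_links, true_links)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

If the labeled pairs themselves contain errors, both numbers are computed against a flawed reference, which is the sense in which fit statistics are only as informative as the training data behind them.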

No extended abstract or paper available

 Presented in Session 197. Dealing with Data