Feature Engineering for Entity Resolution with Arabic Names: Improving Estimates of Observed Casualties in the Syrian Civil War
Entity resolution, or record linkage, is the task of identifying records that refer to the same entity across multiple data sources. In the absence of a unique identifier, entities must be resolved on the basis of possibly noisy and incomplete quasi-identifiers, such as names, ages, and addresses or geographic locations. Our goal is to improve estimates of the total observed casualty count in the ongoing Syrian civil war. Estimating the total victim toll of a conflict is important for understanding its extent and magnitude, for guiding intervention policies, and for helping to bring perpetrators and mass murderers to justice. Our data comprise multiple lists of casualties, compiled by the Human Rights Data Analysis Group. To arrive at an estimate of the number of unique casualties, we first need to detect duplicate entries within and across lists. By focusing on Arabic names and their structure, we develop new features for comparing records and demonstrate meaningful improvements over existing classifiers (which have already seen significant engineering), empirically supporting the importance of language-specific analysis. We expect that these features will be useful in other contexts where it is necessary to measure the similarity between Arabic names.