Abstract
Probabilistic record linkage is a well-established topic in the literature. Fellegi–Sunter probabilistic record linkage and its enhanced versions are commonly used methods that calculate match and non-match weights for each pair of records. Bayesian network classifiers, in particular the naive Bayes classifier and tree augmented naive Bayes (TAN), have also been applied successfully to this task. Recently, an extended version of TAN (called ETAN) has been developed and shown to be superior in classification accuracy to conventional TAN. However, no previous work has applied ETAN to record linkage or investigated the benefits of using the hierarchical feature level information and parsed fields that exist naturally in the datasets. In this work, we extend the naive Bayes classifier with such hierarchical feature level information. Finally, we illustrate the benefits of our method over previously proposed methods on four datasets in terms of linkage performance (\(F_1\) score). We also show that the results can be further improved by additionally parsing the fields of these datasets.
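As a rough illustration of the two quantities named in the abstract, the sketch below computes Fellegi–Sunter style log-weights for binary field agreement and the \(F_1\) score of a set of link decisions. It is a minimal sketch under stated assumptions, not the method proposed in the article: the function names (`fs_weight`, `pair_score`, `f1_score`) and the m and u probabilities are hypothetical and chosen only for illustration.

```python
import math


def fs_weight(agree: bool, m: float, u: float) -> float:
    """Log2 weight for one field.

    m = P(field agrees | records match), u = P(field agrees | non-match).
    Both values are illustrative assumptions, not estimates from the paper.
    """
    if agree:
        return math.log2(m / u)                 # agreement (match) weight
    return math.log2((1.0 - m) / (1.0 - u))     # disagreement (non-match) weight


def pair_score(agreements, m_probs, u_probs) -> float:
    """Sum the per-field weights for one candidate record pair."""
    return sum(fs_weight(a, m, u)
               for a, m, u in zip(agreements, m_probs, u_probs))


def f1_score(predicted, actual) -> float:
    """F1 = harmonic mean of precision and recall over link decisions."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical pair: name agrees, address disagrees.
    print(pair_score([True, False], [0.9, 0.8], [0.1, 0.2]))
    print(f1_score([True, True, False, False], [True, False, True, False]))
```

A pair would typically be declared a link when its summed score exceeds a chosen threshold; how that threshold is set is outside the scope of this sketch.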
Notes
Note that in the conventional PRL-FS method [8], two fields are either matched or unmatched. Thus, the k of \(m_{k,i}\) can be omitted in this case.
These datasets can be found at http://yzhou.github.io/.
Because the phone number is unique for each restaurant, it can be used on its own to identify duplicates without resorting to probabilistic record linkage techniques. Thus, this field is not used in our experiments.
In each dataset, we introduce only one hierarchical restriction, between the name and address fields.
References
Ananthakrishna, R., Chaudhuri, S., Ganti, V.: Eliminating fuzzy duplicates in data warehouses. In: Proceedings of the 28th international conference on Very Large Data Bases, VLDB Endowment, pp. 586–597 (2002)
de Campos, C.P., Cuccu, M., Corani, G., Zaffalon, M.: Extended tree augmented naive classifier. In: van der Gaag, L.C., Feelders, A.J. (eds.) Probabilistic Graphical Models, pp. 176–189. Springer, Berlin (2014)
de Campos, C.P., Corani, G., Scanagatta, M., Cuccu, M., Zaffalon, M.: Learning extended tree augmented naive structures. Int. J. Approx. Reason. 68, 153–163 (2016)
Christen, P., Belacic, D.: Automated probabilistic address standardisation and verification. In: Australasian Data Mining Conference (AusDM05), pp. 53–67 (2005)
Churches, T., Christen, P., Lim, K., Zhu, J.X.: Preparation of name and address data for record linkage using hidden Markov models. BMC Med. Inf. Decis. Making 2(1), 1 (2002)
Dunn, H.L.: Record linkage*. Am. J. Public Health Nations Health 36(12), 1412–1416 (1946)
Elmagarmid, A.K., Ipeirotis, P.G., Verykios, V.S.: Duplicate record detection: a survey. IEEE Trans. Knowl. Data Eng. 19(1), 1–16 (2007)
Fellegi, I.P., Sunter, A.B.: A theory for record linkage. J. Am. Stat. Assoc. 64(328), 1183–1210 (1969)
Friedman, N., Geiger, D., Goldszmidt, M.: Bayesian network classifiers. Mach. Learn. 29(2–3), 131–163 (1997)
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA data mining software: an update. ACM SIGKDD Explor. Newsl. 11(1), 10–18 (2009)
Heckerman, D., Geiger, D., Chickering, D.M.: Learning Bayesian networks: the combination of knowledge and statistical data. Mach. Learn. 20(3), 197–243 (1995)
Jaro, M.A.: Advances in record-linkage methodology as applied to matching the 1985 Census of Tampa, Florida. J. Am. Stat. Assoc. 84(406), 414–420 (1989)
Köpcke, H., Rahm, E.: Frameworks for entity matching: a comparison. Data Knowl. Eng. 69(2), 197–210 (2010)
Leitão, L., Calado, P., Weis, M.: Structure-based inference of XML similarity for fuzzy duplicate detection. In: Proceedings of the sixteenth ACM conference on information and knowledge management, ACM, New York, NY, USA, CIKM ’07, pp. 293–302 (2007)
Leitão, L., Calado, P., Herschel, M.: Efficient and effective duplicate detection in hierarchical data. IEEE Trans. Knowl. Data Eng. 25(5), 1028–1041 (2013)
Li, X., Guttmann, A., Cipiere, S., Maigne, L., Demongeot, J., Boire, J.Y., Ouchchane, L.: Implementation of an extended Fellegi–Sunter probabilistic record linkage method using the Jaro–Winkler string comparator. In: 2014 IEEE-EMBS international conference on biomedical and health informatics (BHI), IEEE, pp. 375–379 (2014)
Rahm, E., Do, H.H.: Data cleaning: problems and current approaches. IEEE Data Eng. Bull. 23(4), 3–13 (2000)
Ravikumar, P., Cohen, W.W.: A hierarchical graphical model for record linkage. In: Proceedings of the 20th conference on uncertainty in artificial intelligence, AUAI Press, pp. 454–461 (2004)
Tromp, M., Ravelli, A.C., Bonsel, G.J., Hasman, A., Reitsma, J.B.: Results from simulated data sets: probabilistic record linkage outperforms deterministic record linkage. J. Clin. Epidemiol. 64(5), 565–572 (2011)
Winkler, W.E.: String comparator metrics and enhanced decision rules in the Fellegi–Sunter model of record linkage. In: Proceedings of the section on survey research, pp. 354–359 (1990)
Winkler, W.E.: The state of record linkage and current research problems. In: Statistical research division, US Census Bureau, Citeseer (1999)
Zhou, Y., Fenton, N., Neil, M.: Bayesian network approach to multinomial parameter learning using data and expert judgments. Int. J. Approx. Reason. 55(5), 1252–1268 (2014)
Zhou, Y., Fenton, N., Hospedales, T., Neil, M.: Probabilistic graphical models parameter learning with transferred prior and constraints. In: Proceedings of the 31st conference on uncertainty in artificial intelligence, AUAI Press, pp. 972–981 (2015a)
Zhou, Y., Howroyd, J., Danicic, S., Bishop, J.: Extending naive bayes classifier with hierarchy feature level information for record linkage. In: Suzuki, J., Ueno, M. (eds.) Advanced Methodologies for Bayesian Networks, Lecture Notes in Computer Science, vol. 9505, pp. 93–104. Springer, Berlin. doi:10.1007/978-3-319-28379-1_7 (2015b)
Additional information
The authors would like to thank the Tungsten Network for their financial support.
About this article
Cite this article
Zhou, Y., Wang, M., Haberland, V. et al. Improving Record Linkage Accuracy with Hierarchical Feature Level Information and Parsed Data. New Gener. Comput. 35, 87–104 (2017). https://doi.org/10.1007/s00354-016-0008-5
DOI: https://doi.org/10.1007/s00354-016-0008-5