skip to main content
10.1145/3314221.3322485acmconferencesArticle/Chapter ViewAbstractPublication PagespldiConference Proceedingsconference-collections
research-article

Synthesis and machine learning for heterogeneous extraction

Published: 08 June 2019 Publication History

Abstract

We present a way to combine techniques from the program synthesis and machine learning communities to extract structured information from heterogeneous data. Such problems arise in several situations such as extracting attributes from web pages, machine-generated emails, or from data obtained from multiple sources. Our goal is to extract a set of structured attributes from such data.
We use machine learning models ("ML models") such as conditional random fields to get an initial labeling of potential attribute values. However, such models are typically not interpretable, and the noise produced by such models is hard to manage or debug. We use (noisy) labels produced by such ML models as inputs to program synthesis, and generate interpretable programs that cover the input space. We also employ type specifications (called "field constraints") to certify well-formedness of extracted values. Using synthesized programs and field constraints, we re-train the ML models with improved confidence on the labels. We then use these improved labels to re-synthesize a better set of programs. We iterate the process of re-synthesizing the programs and re-training the ML models, and find that such an iterative process improves the quality of the extraction process. This iterative approach, called HDEF, is novel, not only the in way it combines the ML models with program synthesis, but also in the way it adapts program synthesis to deal with noise and heterogeneity.
More broadly, our approach points to ways by which machine learning and programming language techniques can be combined to get the best of both worlds --- handling noise, transferring signals from one context to another using ML, producing interpretable programs using PL, and minimizing user intervention.

Supplementary Material

WEBM File (p301-iyer.webm)

References

[1]
2015. Microsoft Program Synthesis using Examples SDK. https://microsoft.github.io/prose/.
[2]
Rajeev Alur, Pavol Cerny, and Arjun Radhakrishna. 2015. Synthesis Through Unification. In Computer Aided Verification - 27th International Conference, CAV 2015, San Francisco, CA, USA, July 18-24, 2015, Proceedings, Part II. 163-179.
[3]
Rajeev Alur, Arjun Radhakrishna, and Abhishek Udupa. 2017. Scaling Enumerative Program Synthesis via Divide and Conquer. In Tools and Algorithms for the Construction and Analysis of Systems - 23rd International Conference, TACAS 2017, Held as Part of the European Joint Conferences on Theory and Practice of Software, ETAPS 2017, Uppsala, Sweden, April 22-29, 2017, Proceedings, Part I. 319-336.
[4]
Nilesh N. Dalvi, Ravi Kumar, and Mohamed A. Soliman. 2011. Automatic Wrappers for Large Scale Web Extraction. PVLDB 4, 4 (2011), 219-230.
[5]
Jacob Devlin, Jonathan Uesato, Surya Bhupatiraju, Rishabh Singh, Abdel-rahman Mohamed, and Pushmeet Kohli. 2017. RobustFill: Neural Program Learning under Noisy I/O. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017 (Proceedings of Machine Learning Research), Doina Precup and Yee Whye Teh (Eds.), Vol. 70. PMLR, 990-998. http://proceedings.mlr.press/v70/devlin17a.html
[6]
Alexander Gammerman, Volodya Vovk, and Vladimir Vapnik. 1998. Learning by Transduction. In UAI '98: Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, University of Wisconsin Business School, Madison, Wisconsin, USA, July 24-26, 1998, Gregory F. Cooper and Serafín Moral (Eds.). Morgan Kaufmann, 148-155. https://dslpitt.org/uai/displayArticleDetails.jsp?mmnu=1&smnu=2&article_id=243&proceeding_id=14
[7]
Sumit Gulwani. 2011. Automating string processing in spreadsheets using input-output examples. In Proceedings of the 38th ACM SIGPLANSIGACT Symposium on Principles of Programming Languages, POPL 2011, Austin, TX, USA, January 26-28, 2011, Thomas Ball and Mooly Sagiv (Eds.). ACM, 317-330.
[8]
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation 9, 8 (1997), 1735-1780.
[9]
Chun-Nan Hsu and Ming-Tzung Dung. 1998. Generating Finite-State Transducers for Semi-Structured Data Extraction from the Web. Inf. Syst. 23, 8 (1998), 521-538.
[10]
Nicholas Kushmerick. 2000. Wrapper induction: Efficiency and expressiveness. Artif. Intell. 118, 1-2 (2000), 15-68.
[11]
John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML 2001), Williams College, Williamstown, MA, USA, June 28 - July 1, 2001. 282-289.
[12]
Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural Architectures for Named Entity Recognition. CoRR abs/1603.01360 (2016). arXiv:1603.01360 http://arxiv.org/abs/1603.01360
[13]
Vu Le and Sumit Gulwani. 2014. FlashExtract: a framework for data extraction by examples. In ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '14, Edinburgh, United Kingdom - June 09 - 11, 2014, Michael F. P. O'Boyle and Keshav Pingali (Eds.). ACM, 542-553.
[14]
Chong Long, Xiubo Geng, Chang Xu, and Sathiya Keerthi. 2012. A simple approach to the design of site-level extractors using domaincentric principles. In 21st ACM International Conference on Information and Knowledge Management, CIKM'12, Maui, HI, USA, October 29 - November 02, 2012, Xue-wen Chen, Guy Lebanon, Haixun Wang, and Mohammed J. Zaki (Eds.). ACM, 1517-1521.
[15]
Hiroki Nakayama. 2018. Bidirectional LSTM-CRF and ELMo for Named-Entity Recognition. https://github.com/Hironsan/anago.
[16]
Saswat Padhi, Prateek Jain, Daniel Perelman, Oleksandr Polozov, Sumit Gulwani, and Todd D. Millstein. 2018. FlashProfile: a framework for synthesizing data profiles. PACMPL 2, OOPSLA (2018), 150:1-150:28.
[17]
F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R.Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825-2830.
[18]
Oleksandr Polozov and Sumit Gulwani. 2015. FlashMeta: a framework for inductive program synthesis. In Proceedings of the 2015 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications, OOPSLA 2015, part of SPLASH 2015, Pittsburgh, PA, USA, October 25-30, 2015, Jonathan Aldrich and Patrick Eugster (Eds.). ACM, 107-126.
[19]
Veselin Raychev, Pavol Bielik, Martin T. Vechev, and Andreas Krause. 2016. Learning programs from noisy data. In Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 2016, St. Petersburg, FL, USA, January 20 - 22, 2016, Rastislav Bodík and Rupak Majumdar (Eds.). ACM, 761-774.
[20]
Mohammad Raza and Sumit Gulwani. 2018. Disjunctive Program Synthesis: A Robust Approach to Programming by Example. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, Sheila A. McIlraith and Kilian Q. Weinberger (Eds.). AAAI Press, 1403-1412. https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/17055
[21]
Andrew Reynolds, Morgan Deters, Viktor Kuncak, Cesare Tinelli, and Clark W. Barrett. 2015. Counterexample-Guided Quantifier Instantiation for Synthesis in SMT. In Computer Aided Verification - 27th International Conference, CAV 2015, San Francisco, CA, USA, July 18-24, 2015, Proceedings, Part II. 198-216.
[22]
Shambwaditya Saha, Pranav Garg, and P. Madhusudan. 2015. Alchemist: Learning Guarded Affine Functions. In Computer Aided Verification - 27th International Conference, CAV 2015, San Francisco, CA, USA, July 18-24, 2015, Proceedings, Part I. 440-446.
[23]
Sunita Sarawagi. 2008. Information Extraction. Found. Trends databases 1, 3 (March 2008), 261-377.
[24]
Sandeepkumar Satpal, Sahely Bhadra, Sundararajan Sellamanickam, Rajeev Rastogi, and Prithviraj Sen. 2011. Web information extraction using markov logic networks. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, USA, August 21-24, 2011, Chid Apté, Joydeep Ghosh, and Padhraic Smyth (Eds.). ACM, 1406-1414.
[25]
Abhinav Verma, Vijayaraghavan Murali, Rishabh Singh, Pushmeet Kohli, and Swarat Chaudhuri. 2018. Programmatically Interpretable Reinforcement Learning. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018 (JMLR Workshop and Conference Proceedings), Jennifer G. Dy and Andreas Krause (Eds.), Vol. 80. JMLR.org, 5052-5061. http://proceedings.mlr.press/v80/verma18a.html
[26]
Weinan Zhang, Amr Ahmed, Jie Yang, Vanja Josifovski, and Alexander J. Smola. 2015. Annotating Needles in the Haystack without Looking: Product Information Extraction from Emails. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, NSW, Australia, August 10-13, 2015, Longbing Cao, Chengqi Zhang, Thorsten Joachims, Geoffrey I. Webb, Dragos D. Margineantu, and Graham Williams (Eds.). ACM, 2257-2266.

Cited By

View all
  • (2024)VRDSynth: Synthesizing Programs for Multilingual Visually Rich Document Information ExtractionProceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis10.1145/3650212.3680314(704-716)Online publication date: 11-Sep-2024
  • (2022)BlueprintProceedings of the VLDB Endowment10.14778/3554821.355483615:12(3459-3471)Online publication date: 29-Sep-2022
  • (2022)Large-Scale Entity Extraction from Enterprise DataProceedings of the Second International Conference on AI-ML Systems10.1145/3564121.3564818(1-2)Online publication date: 12-Oct-2022
  • Show More Cited By

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences
PLDI 2019: Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation
June 2019
1162 pages
ISBN:9781450367127
DOI:10.1145/3314221
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 08 June 2019

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. Data extraction
  2. Heterogeneous data
  3. Machine Learning
  4. Program synthesis

Qualifiers

  • Research-article

Conference

PLDI '19
Sponsor:

Acceptance Rates

Overall Acceptance Rate 406 of 2,067 submissions, 20%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)42
  • Downloads (Last 6 weeks)2
Reflects downloads up to 14 Sep 2024

Other Metrics

Citations

Cited By

View all
  • (2024)VRDSynth: Synthesizing Programs for Multilingual Visually Rich Document Information ExtractionProceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis10.1145/3650212.3680314(704-716)Online publication date: 11-Sep-2024
  • (2022)BlueprintProceedings of the VLDB Endowment10.14778/3554821.355483615:12(3459-3471)Online publication date: 29-Sep-2022
  • (2022)Large-Scale Entity Extraction from Enterprise DataProceedings of the Second International Conference on AI-ML Systems10.1145/3564121.3564818(1-2)Online publication date: 12-Oct-2022
  • (2022)Large-Scale Information Extraction under Privacy-Aware ConstraintsProceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining10.1145/3534678.3547352(4792-4793)Online publication date: 14-Aug-2022
  • (2022)Spine: Scaling up Programming-by-Negative-Example for String Filtering and TransformationProceedings of the 2022 International Conference on Management of Data10.1145/3514221.3517908(521-530)Online publication date: 10-Jun-2022
  • (2022)Program Synthesis—A SurveyComputational Intelligence in Machine Learning10.1007/978-981-16-8484-5_39(409-421)Online publication date: 3-Mar-2022
  • (2021)Large-Scale Information Extraction under Privacy-Aware ConstraintsProceedings of the 30th ACM International Conference on Information & Knowledge Management10.1145/3459637.3482027(4845-4848)Online publication date: 26-Oct-2021
  • (2021)Web question answering with neurosymbolic program synthesisProceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation10.1145/3453483.3454047(328-343)Online publication date: 19-Jun-2021
  • (2021)An Integrated Approach of Deep Learning and Symbolic Analysis for Digital PDF Table Extraction2020 25th International Conference on Pattern Recognition (ICPR)10.1109/ICPR48806.2021.9413069(4062-4069)Online publication date: 10-Jan-2021
  • (2021)Learning patterns in configurationProceedings of the 36th IEEE/ACM International Conference on Automated Software Engineering10.1109/ASE51524.2021.9678525(817-828)Online publication date: 15-Nov-2021
  • Show More Cited By

View Options

Get Access

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media