research-article

Synthesis and machine learning for heterogeneous extraction

Authors:

Manohar Jonnalagedda,

Suresh Parthasarathy,

Arjun Radhakrishna,

Sriram K. RajamaniAuthors Info & Claims

PLDI 2019: Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation

Pages 301 - 315

https://doi.org/10.1145/3314221.3322485

Published: 08 June 2019 Publication History

Abstract

We present a way to combine techniques from the program synthesis and machine learning communities to extract structured information from heterogeneous data. Such problems arise in several situations such as extracting attributes from web pages, machine-generated emails, or from data obtained from multiple sources. Our goal is to extract a set of structured attributes from such data.

We use machine learning models ("ML models") such as conditional random fields to get an initial labeling of potential attribute values. However, such models are typically not interpretable, and the noise produced by such models is hard to manage or debug. We use (noisy) labels produced by such ML models as inputs to program synthesis, and generate interpretable programs that cover the input space. We also employ type specifications (called "field constraints") to certify well-formedness of extracted values. Using synthesized programs and field constraints, we re-train the ML models with improved confidence on the labels. We then use these improved labels to re-synthesize a better set of programs. We iterate the process of re-synthesizing the programs and re-training the ML models, and find that such an iterative process improves the quality of the extraction process. This iterative approach, called HDEF, is novel, not only the in way it combines the ML models with program synthesis, but also in the way it adapts program synthesis to deal with noise and heterogeneity.

More broadly, our approach points to ways by which machine learning and programming language techniques can be combined to get the best of both worlds --- handling noise, transferring signals from one context to another using ML, producing interpretable programs using PL, and minimizing user intervention.

Supplementary Material

WEBM File (p301-iyer.webm)

Download
95.33 MB

References

[1]

2015. Microsoft Program Synthesis using Examples SDK. https://microsoft.github.io/prose/.

[2]

Rajeev Alur, Pavol Cerny, and Arjun Radhakrishna. 2015. Synthesis Through Unification. In Computer Aided Verification - 27th International Conference, CAV 2015, San Francisco, CA, USA, July 18-24, 2015, Proceedings, Part II. 163-179.

[3]

Rajeev Alur, Arjun Radhakrishna, and Abhishek Udupa. 2017. Scaling Enumerative Program Synthesis via Divide and Conquer. In Tools and Algorithms for the Construction and Analysis of Systems - 23rd International Conference, TACAS 2017, Held as Part of the European Joint Conferences on Theory and Practice of Software, ETAPS 2017, Uppsala, Sweden, April 22-29, 2017, Proceedings, Part I. 319-336.

[4]

Nilesh N. Dalvi, Ravi Kumar, and Mohamed A. Soliman. 2011. Automatic Wrappers for Large Scale Web Extraction. PVLDB 4, 4 (2011), 219-230.

Digital Library

[5]

Jacob Devlin, Jonathan Uesato, Surya Bhupatiraju, Rishabh Singh, Abdel-rahman Mohamed, and Pushmeet Kohli. 2017. RobustFill: Neural Program Learning under Noisy I/O. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017 (Proceedings of Machine Learning Research), Doina Precup and Yee Whye Teh (Eds.), Vol. 70. PMLR, 990-998. http://proceedings.mlr.press/v70/devlin17a.html

Digital Library

[6]

Alexander Gammerman, Volodya Vovk, and Vladimir Vapnik. 1998. Learning by Transduction. In UAI '98: Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, University of Wisconsin Business School, Madison, Wisconsin, USA, July 24-26, 1998, Gregory F. Cooper and Serafín Moral (Eds.). Morgan Kaufmann, 148-155. https://dslpitt.org/uai/displayArticleDetails.jsp?mmnu=1&smnu=2&article_id=243&proceeding_id=14

Digital Library

[7]

Sumit Gulwani. 2011. Automating string processing in spreadsheets using input-output examples. In Proceedings of the 38th ACM SIGPLANSIGACT Symposium on Principles of Programming Languages, POPL 2011, Austin, TX, USA, January 26-28, 2011, Thomas Ball and Mooly Sagiv (Eds.). ACM, 317-330.

Digital Library

[8]

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation 9, 8 (1997), 1735-1780.

Digital Library

[9]

Chun-Nan Hsu and Ming-Tzung Dung. 1998. Generating Finite-State Transducers for Semi-Structured Data Extraction from the Web. Inf. Syst. 23, 8 (1998), 521-538.

[10]

Nicholas Kushmerick. 2000. Wrapper induction: Efficiency and expressiveness. Artif. Intell. 118, 1-2 (2000), 15-68.

Digital Library

[11]

John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML 2001), Williams College, Williamstown, MA, USA, June 28 - July 1, 2001. 282-289.

Digital Library

[12]

Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural Architectures for Named Entity Recognition. CoRR abs/1603.01360 (2016). arXiv:1603.01360 http://arxiv.org/abs/1603.01360

[13]

Vu Le and Sumit Gulwani. 2014. FlashExtract: a framework for data extraction by examples. In ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '14, Edinburgh, United Kingdom - June 09 - 11, 2014, Michael F. P. O'Boyle and Keshav Pingali (Eds.). ACM, 542-553.

Digital Library

[14]

Chong Long, Xiubo Geng, Chang Xu, and Sathiya Keerthi. 2012. A simple approach to the design of site-level extractors using domaincentric principles. In 21st ACM International Conference on Information and Knowledge Management, CIKM'12, Maui, HI, USA, October 29 - November 02, 2012, Xue-wen Chen, Guy Lebanon, Haixun Wang, and Mohammed J. Zaki (Eds.). ACM, 1517-1521.

Digital Library

[15]

Hiroki Nakayama. 2018. Bidirectional LSTM-CRF and ELMo for Named-Entity Recognition. https://github.com/Hironsan/anago.

[16]

Saswat Padhi, Prateek Jain, Daniel Perelman, Oleksandr Polozov, Sumit Gulwani, and Todd D. Millstein. 2018. FlashProfile: a framework for synthesizing data profiles. PACMPL 2, OOPSLA (2018), 150:1-150:28.

Digital Library

[17]

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R.Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825-2830.

Digital Library

[18]

Oleksandr Polozov and Sumit Gulwani. 2015. FlashMeta: a framework for inductive program synthesis. In Proceedings of the 2015 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications, OOPSLA 2015, part of SPLASH 2015, Pittsburgh, PA, USA, October 25-30, 2015, Jonathan Aldrich and Patrick Eugster (Eds.). ACM, 107-126.

Digital Library

[19]

Veselin Raychev, Pavol Bielik, Martin T. Vechev, and Andreas Krause. 2016. Learning programs from noisy data. In Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 2016, St. Petersburg, FL, USA, January 20 - 22, 2016, Rastislav Bodík and Rupak Majumdar (Eds.). ACM, 761-774.

Digital Library

[20]

Mohammad Raza and Sumit Gulwani. 2018. Disjunctive Program Synthesis: A Robust Approach to Programming by Example. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, Sheila A. McIlraith and Kilian Q. Weinberger (Eds.). AAAI Press, 1403-1412. https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/17055

[21]

Andrew Reynolds, Morgan Deters, Viktor Kuncak, Cesare Tinelli, and Clark W. Barrett. 2015. Counterexample-Guided Quantifier Instantiation for Synthesis in SMT. In Computer Aided Verification - 27th International Conference, CAV 2015, San Francisco, CA, USA, July 18-24, 2015, Proceedings, Part II. 198-216.

[22]

Shambwaditya Saha, Pranav Garg, and P. Madhusudan. 2015. Alchemist: Learning Guarded Affine Functions. In Computer Aided Verification - 27th International Conference, CAV 2015, San Francisco, CA, USA, July 18-24, 2015, Proceedings, Part I. 440-446.

[23]

Sunita Sarawagi. 2008. Information Extraction. Found. Trends databases 1, 3 (March 2008), 261-377.

Digital Library

[24]

Sandeepkumar Satpal, Sahely Bhadra, Sundararajan Sellamanickam, Rajeev Rastogi, and Prithviraj Sen. 2011. Web information extraction using markov logic networks. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, USA, August 21-24, 2011, Chid Apté, Joydeep Ghosh, and Padhraic Smyth (Eds.). ACM, 1406-1414.

Digital Library

[25]

Abhinav Verma, Vijayaraghavan Murali, Rishabh Singh, Pushmeet Kohli, and Swarat Chaudhuri. 2018. Programmatically Interpretable Reinforcement Learning. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018 (JMLR Workshop and Conference Proceedings), Jennifer G. Dy and Andreas Krause (Eds.), Vol. 80. JMLR.org, 5052-5061. http://proceedings.mlr.press/v80/verma18a.html

[26]

Weinan Zhang, Amr Ahmed, Jie Yang, Vanja Josifovski, and Alexander J. Smola. 2015. Annotating Needles in the Haystack without Looking: Product Information Extraction from Emails. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, NSW, Australia, August 10-13, 2015, Longbing Cao, Chengqi Zhang, Thorsten Joachims, Geoffrey I. Webb, Dragos D. Margineantu, and Graham Williams (Eds.). ACM, 2257-2266.

Digital Library

Cited By

Nguyen TDo-Viet TNguyen-Duy HLuu TLe HLe BThongtanunam PChristakis MPradel M(2024)VRDSynth: Synthesizing Programs for Multilingual Visually Rich Document Information ExtractionProceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis10.1145/3650212.3680314(704-716)Online publication date: 11-Sep-2024
https://dl.acm.org/doi/10.1145/3650212.3680314
Mishchenko ADanco DJindal ABlue A(2022)BlueprintProceedings of the VLDB Endowment10.14778/3554821.355483615:12(3459-3471)Online publication date: 29-Sep-2022
https://dl.acm.org/doi/10.14778/3554821.3554836
Gupta RKondapally R(2022)Large-Scale Entity Extraction from Enterprise DataProceedings of the Second International Conference on AI-ML Systems10.1145/3564121.3564818(1-2)Online publication date: 12-Oct-2022
https://dl.acm.org/doi/10.1145/3564121.3564818
Show More Cited By

Index Terms

Synthesis and machine learning for heterogeneous extraction
1. Computing methodologies
  1. Machine learning
2. Software and its engineering
  1. Software creation and management
    1. Software development techniques
      1. Automatic programming

Recommendations

Machine Learning: The State of the Art

The two fundamental problems in machine learning (ML) are statistical analysis and algorithm design. The former tells us the principles of the mathematical models that we establish from the observation data. The latter defines the conditions on which ...
Can reactive synthesis and syntax-guided synthesis be friends?
SPLASH Companion 2021: Companion Proceedings of the 2021 ACM SIGPLAN International Conference on Systems, Programming, Languages, and Applications: Software for Humanity

While reactive synthesis and syntax-guided synthesis (SyGuS) have seen enormous progress in recent years, combining the two approaches has remained a challenge. In this work, we present the synthesis of reactive programs from Temporal Stream Logic ...
Synthesizing correct code for machine learning programs
ESEC/FSE 2020: Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering

Success using machine learning (ML) in numerous fields has created a new class of users, who are not experts in the data science domain but want to use ML as a means to solve their inference problems. Various automatic machine learning (AutoML) ...

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences

PLDI 2019: Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation

June 2019

1162 pages

ISBN:9781450367127

DOI:10.1145/3314221

General Chair:
Kathryn S. McKinley
Google, USA
,
Program Chair:
Kathleen Fisher
Tufts University, USA

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGPLAN: ACM Special Interest Group on Programming Languages

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 08 June 2019

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article

Conference

PLDI '19

Sponsor:

SIGPLAN

PLDI '19: 40th ACM SIGPLAN Conference on Programming Language Design and Implementation

June 22 - 26, 2019

AZ, Phoenix, USA

Acceptance Rates

Overall Acceptance Rate 406 of 2,067 submissions, 20%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

11
Total Citations
View Citations
394
Total Downloads

Downloads (Last 12 months)42
Downloads (Last 6 weeks)2

Reflects downloads up to 14 Sep 2024

Other Metrics

View Author Metrics

Citations

Cited By

Nguyen TDo-Viet TNguyen-Duy HLuu TLe HLe BThongtanunam PChristakis MPradel M(2024)VRDSynth: Synthesizing Programs for Multilingual Visually Rich Document Information ExtractionProceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis10.1145/3650212.3680314(704-716)Online publication date: 11-Sep-2024
https://dl.acm.org/doi/10.1145/3650212.3680314
Mishchenko ADanco DJindal ABlue A(2022)BlueprintProceedings of the VLDB Endowment10.14778/3554821.355483615:12(3459-3471)Online publication date: 29-Sep-2022
https://dl.acm.org/doi/10.14778/3554821.3554836
Gupta RKondapally R(2022)Large-Scale Entity Extraction from Enterprise DataProceedings of the Second International Conference on AI-ML Systems10.1145/3564121.3564818(1-2)Online publication date: 12-Oct-2022
https://dl.acm.org/doi/10.1145/3564121.3564818
Gupta RKondapally RZhang ARangwala H(2022)Large-Scale Information Extraction under Privacy-Aware ConstraintsProceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining10.1145/3534678.3547352(4792-4793)Online publication date: 14-Aug-2022
https://dl.acm.org/doi/10.1145/3534678.3547352
Zuo CAssadi SDeng DIves ZBonifati AEl Abbadi A(2022)Spine: Scaling up Programming-by-Negative-Example for String Filtering and TransformationProceedings of the 2022 International Conference on Management of Data10.1145/3514221.3517908(521-530)Online publication date: 10-Jun-2022
https://dl.acm.org/doi/10.1145/3514221.3517908
Jacindha SAbishek GVasuki P(2022)Program Synthesis—A SurveyComputational Intelligence in Machine Learning10.1007/978-981-16-8484-5_39(409-421)Online publication date: 3-Mar-2022
https://doi.org/10.1007/978-981-16-8484-5_39
Gupta RKondapally RDemartini GZuccon GCulpepper JHuang ZTong H(2021)Large-Scale Information Extraction under Privacy-Aware ConstraintsProceedings of the 30th ACM International Conference on Information & Knowledge Management10.1145/3459637.3482027(4845-4848)Online publication date: 26-Oct-2021
https://dl.acm.org/doi/10.1145/3459637.3482027
Chen QLamoreaux AWang XDurrett GBastani ODillig IFreund SYahav E(2021)Web question answering with neurosymbolic program synthesisProceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation10.1145/3453483.3454047(328-343)Online publication date: 19-Jun-2021
https://dl.acm.org/doi/10.1145/3453483.3454047
Zhang MPerelman DLe VGulwani S(2021)An Integrated Approach of Deep Learning and Symbolic Analysis for Digital PDF Table Extraction2020 25th International Conference on Pattern Recognition (ICPR)10.1109/ICPR48806.2021.9413069(4062-4069)Online publication date: 10-Jan-2021
https://doi.org/10.1109/ICPR48806.2021.9413069
Bhagwan RMehta SRadhakrishna AGarg SGrundy J(2021)Learning patterns in configurationProceedings of the 36th IEEE/ACM International Conference on Automated Software Engineering10.1109/ASE51524.2021.9678525(817-828)Online publication date: 15-Nov-2021
https://dl.acm.org/doi/10.1109/ASE51524.2021.9678525
Show More Cited By

View Options

Get Access

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Table of Contents