DOI: 10.1145/3570991.3571037
short-paper

BUDDI Table Factory: A toolbox for generating synthetic documents with annotated tables and cells

Published: 04 January 2023

Abstract

Tables are the most convenient way to represent structured information in a document. Understanding a table's structure is critical to understanding its contents. Several deep learning-based approaches from the literature have shown promising results in understanding table structures, but they require large amounts of annotated data. However, annotated datasets for training these methods are expensive and laborious to create, and very few are available. Moreover, human-annotated data suffers from inconsistencies in table and cell annotations. We propose BUDDI Table Factory (BTF), a toolbox for synthetically generating annotated documents with a wide range of variations in table structures. BTF uses a heuristics-based method to generate a variety of table structures, from which it generates synthetic documents using LaTeX. It then applies a computer vision-based approach to localize table and cell regions and automatically generate annotations in the PASCAL VOC challenge format. We empirically show that adding synthetic BTF documents to a limited set of original documents during model training can significantly improve the TEDS and IoU performance of table structure recognition on public and real-world healthcare datasets.
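
To make the annotation format and the IoU metric named in the abstract concrete, the following is a minimal sketch, not the BTF implementation. It assumes table and cell regions are already available as pixel bounding boxes. The function names `voc_annotation` and `iou`, the class labels `table` and `cell`, and the fixed image depth of 3 are illustrative assumptions.

```python
import xml.etree.ElementTree as ET


def voc_annotation(filename, width, height, boxes):
    """Build a PASCAL VOC style annotation as an XML string.

    boxes: list of (label, xmin, ymin, xmax, ymax) in pixel coordinates.
    The labels are whatever the caller passes in (hypothetical here).
    """
    root = ET.Element("annotation")
    ET.SubElement(root, "filename").text = filename

    size = ET.SubElement(root, "size")
    ET.SubElement(size, "width").text = str(width)
    ET.SubElement(size, "height").text = str(height)
    ET.SubElement(size, "depth").text = "3"  # assumption: RGB page image

    for label, xmin, ymin, xmax, ymax in boxes:
        obj = ET.SubElement(root, "object")
        ET.SubElement(obj, "name").text = label
        bnd = ET.SubElement(obj, "bndbox")
        for tag, value in (("xmin", xmin), ("ymin", ymin),
                           ("xmax", xmax), ("ymax", ymax)):
            ET.SubElement(bnd, tag).text = str(value)

    return ET.tostring(root, encoding="unicode")


def iou(a, b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    # Overlap width and height, clamped at zero for disjoint boxes.
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0


if __name__ == "__main__":
    # Hypothetical page with one table and one of its cells.
    print(voc_annotation("page_0001.png", 1240, 1754,
                         [("table", 100, 200, 1100, 800),
                          ("cell", 100, 200, 400, 260)]))
    # Predicted vs. ground-truth table box.
    print(iou((100, 200, 1100, 800), (120, 210, 1100, 800)))
```

An IoU of 1.0 means the predicted and ground-truth boxes coincide, and 0.0 means they do not overlap. A full VOC file usually carries further fields (for example `folder`, `difficult`, `truncated`) that this sketch omits.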


Published In

CODS-COMAD '23: Proceedings of the 6th Joint International Conference on Data Science & Management of Data (10th ACM IKDD CODS and 28th COMAD)
January 2023
357 pages
ISBN:9781450397971
DOI:10.1145/3570991

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Automated Annotation Extraction
  2. Cell Detection
  3. Computer Vision
  4. Deep Learning
  5. Synthetic Document Generation
  6. Table Detection
  7. Table Structure Recognition

Qualifiers

  • Short-paper
  • Research
  • Refereed limited

Conference

CODS-COMAD 2023

Acceptance Rates

Overall Acceptance Rate 197 of 680 submissions, 29%

Bibliometrics

Article Metrics

  • Total Citations: 0
  • Total Downloads: 83
  • Downloads (Last 12 months): 39
  • Downloads (Last 6 weeks): 1

Reflects downloads up to 15 Sep 2024
