short-paper

BUDDI Table Factory: A toolbox for generating synthetic documents with annotated tables and cells

Authors:

Bharath Sripathy,

Harinath Krishnamoorthy,

Sudarsun Santhiappan

CODS-COMAD '23: Proceedings of the 6th Joint International Conference on Data Science & Management of Data (10th ACM IKDD CODS and 28th COMAD)

Pages 218 - 222

https://doi.org/10.1145/3570991.3571037

Published: 04 January 2023

Abstract

Tables are the most convenient way to represent structured information in a document, and understanding a table's structure is critical to understanding its contents. Several deep learning-based approaches from the literature have shown promising results in understanding table structures, but they require large amounts of annotated data. Annotated datasets for training these methods, however, are expensive and laborious to create, and their availability is very limited. Moreover, human-annotated data suffers from inconsistencies in table and cell annotations. We propose BUDDI Table Factory (BTF) for synthetically generating annotated documents with a wide range of variations in table structures. BTF uses a heuristics-based method to generate a variety of table structures, from which it renders synthetic documents using LaTeX. It then applies a computer vision-based approach to localize table and cell regions and automatically generate annotations in the PASCAL VOC challenge format. We empirically show that adding synthetic BTF documents to a limited set of original documents during model training can significantly improve TEDS and IoU performance on table structure recognition tasks for public and real-world healthcare datasets.
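To make the annotation format and the IoU metric mentioned in the abstract concrete, the following is a minimal sketch and not the BTF implementation. It assumes axis-aligned boxes given as (xmin, ymin, xmax, ymax) in pixels, writes only a subset of the PASCAL VOC XML fields (filename, size, and one object entry per box), and uses the hypothetical labels "table" and "cell". The function names `iou` and `voc_annotation` are illustrative and do not come from the paper.

```python
import xml.etree.ElementTree as ET


def iou(box_a, box_b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    # Corners of the overlapping rectangle, if any.
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    # Clamp to zero so disjoint boxes contribute no intersection area.
    inter = max(0, ix_max - ix_min) * max(0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def voc_annotation(filename, width, height, objects):
    """Return a PASCAL VOC-style XML string (subset of fields only).

    objects is a list of (label, (xmin, ymin, xmax, ymax)) tuples.
    """
    root = ET.Element("annotation")
    ET.SubElement(root, "filename").text = filename
    size = ET.SubElement(root, "size")
    # depth=3 assumes an RGB page image (an assumption of this sketch).
    for tag, value in (("width", width), ("height", height), ("depth", 3)):
        ET.SubElement(size, tag).text = str(value)
    for label, box in objects:
        obj = ET.SubElement(root, "object")
        ET.SubElement(obj, "name").text = label
        bndbox = ET.SubElement(obj, "bndbox")
        for tag, value in zip(("xmin", "ymin", "xmax", "ymax"), box):
            ET.SubElement(bndbox, tag).text = str(value)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical page with one table containing two cells.
    boxes = [
        ("table", (50, 100, 550, 300)),
        ("cell", (50, 100, 300, 200)),
        ("cell", (300, 100, 550, 200)),
    ]
    print(voc_annotation("page_0001.png", 600, 800, boxes))
    # Compare a predicted table box against the ground-truth table box.
    print(iou((50, 100, 550, 300), (60, 110, 560, 310)))
```

A generator of this kind would emit one such XML file per rendered page, and the same box convention can then be reused when scoring predicted table or cell regions against the generated ground truth.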

Index Terms

1. Applied computing
  1. Document management and text processing
    1. Document preparation
      1. Annotation
      2. Document scripting languages
2. Computing methodologies

Published In

CODS-COMAD '23: Proceedings of the 6th Joint International Conference on Data Science & Management of Data (10th ACM IKDD CODS and 28th COMAD)

January 2023

357 pages

ISBN: 9781450397971

DOI: 10.1145/3570991

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Short-paper
Research
Refereed limited

Conference

CODS-COMAD 2023

CODS-COMAD 2023: 6th Joint International Conference on Data Science & Management of Data (10th ACM IKDD CODS and 28th COMAD)

January 4 - 7, 2023

Mumbai, India

Acceptance Rates

Overall Acceptance Rate 197 of 680 submissions, 29%
