FS-DAG (Few-Shot: Domain Adapting Graphs)

Information extraction from documents is finding applications across various industries as part of the push for digitization and automation of processes. Document understanding tasks are an active field of research in both industry and academia. Our paper, FS-DAG: Few Shot Domain Adapting Graph Networks for Visually Rich Document Understanding, proposes multiple novel techniques to solve these problems more effectively and efficiently.

To encourage further research in this domain, we open-source two categories of datasets with industry-specific document types for benchmarking different models on similar tasks in a few-shot learning setting.

Installation

$ git clone https://github.com/oracle-samples/fs-dag.git
$ cd fs-dag
$ tree -d

Documentation

For the Visually Rich Document Understanding (VRDU) task of Key Information Extraction (KIE), many datasets are publicly available, such as SROIE, CORD, and WildReceipt, which consist of typical receipts. Other datasets like FUNSD contain various types of forms with only high-level key-value pairs; they are well suited for academic research but do not capture the industry requirement of data extraction with fine-grained classes. The publicly available datasets concentrate on receipts, invoices, and simple forms, and add little diversity in industry domains and use cases. They rarely cover forms that are filled character-by-character within boxes/placeholders, even though such forms have wide-scale relevance in the financial, medical, and logistics industries. To further fuel research on the VRDU task, we open-source two categories of datasets as follows:

Dataset Descriptions

Description.
a. Category 1.
- Folder Structure.
- File Description.
b. Category 2.
- Folder Structure.
- File Description.

Dataset Categories

Category 1

It contains 5 document types, described below:

Ecommerce Invoice: Real invoices downloaded for items bought at a particular e-commerce site.
Adverse Health Reaction Form: Contains filled VAERS forms. The forms are filled by human annotators using fictitious (faker) data that cannot be used to identify any real person, so they contain no PII.
Medical Invoice: Contains synthetically generated medical invoices for various treatments.
University Admission Form: Contains filled application forms for the University of Michigan. The forms are filled by human annotators using fictitious (faker) data that cannot be used to identify any real person, so they contain no PII.
Visa Form: Contains the Non-Immigrant Visa Application form issued by the U.S. Department of State. The forms are filled by human annotators using fictitious (faker) data that cannot be used to identify any real person, so they contain no PII.

Folder Structure

.
|-- cat_1
|   |-- adverse_health_reaction_form
|   |-- ecommerce_invoice
|   |-- medical_invoice
|   |-- university_admission_form
|   `-- visa_form

Every dataset folder has the following structure:

.<dataset>
|-- class_list.txt
|-- images
|-- train_05.txt
|-- test.txt
`-- ocr_test.txt
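
As a quick sanity check after cloning, the layout above can be verified with a short script. This is a minimal sketch, not part of the repository: the helper names `missing_entries` and `check_category` are hypothetical, and the only facts taken from this README are the file and folder names.

```python
from pathlib import Path

# Entry names copied from the Category 1 layout in this README.
EXPECTED_ENTRIES = ["class_list.txt", "images", "train_05.txt", "test.txt", "ocr_test.txt"]


def missing_entries(dataset_dir, expected=EXPECTED_ENTRIES):
    """Return the expected entries that are absent from one dataset folder."""
    root = Path(dataset_dir)
    return [name for name in expected if not (root / name).exists()]


def check_category(category_dir, expected=EXPECTED_ENTRIES):
    """Map each dataset folder under category_dir to its list of missing entries."""
    root = Path(category_dir)
    if not root.is_dir():
        return {}
    return {
        d.name: missing_entries(d, expected)
        for d in sorted(root.iterdir())
        if d.is_dir()
    }


if __name__ == "__main__":
    # Assumes the script is run from the root of the cloned repository.
    for name, missing in check_category("cat_1").items():
        print(name, "OK" if not missing else f"missing: {missing}")
```

For Category 2 folders, pass a longer `expected` list that also names the extra entries listed in the Category 2 layout.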

Files Description

class_list.txt: Contains the list of fine-grained key-values (classes) that need to be extracted from the document.
images: Folder containing all the document images released as a part of the dataset.
train_05.txt: Contains the train split used for training various models, with ground-truth (GT) annotations for the files.
test.txt: Contains the test split used for testing various models with GT annotations for the files.
ocr_test.txt: Contains the test split used for testing the robustness of various models with OCR errors introduced using a probability of 0.1 in the GT annotations.
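
To illustrate what "OCR errors introduced using a probability of 0.1" could look like, here is a minimal sketch. It is an assumption, not the procedure used to build ocr_test.txt: this README states only the probability, so the choice of per-character substitution, the replacement alphabet, and the function name `add_ocr_noise` are all hypothetical.

```python
import random
import string


def add_ocr_noise(text, p=0.1, rng=None):
    """Replace each non-space character with a random alphanumeric with probability p.

    Assumption: errors are independent per-character substitutions. The real
    ocr_test.txt may have been generated differently.
    """
    rng = rng or random.Random()
    alphabet = string.ascii_letters + string.digits
    out = []
    for ch in text:
        if not ch.isspace() and rng.random() < p:
            out.append(rng.choice(alphabet))  # simulated misrecognised character
        else:
            out.append(ch)  # whitespace and unaffected characters are kept
    return "".join(out)


if __name__ == "__main__":
    # A seeded generator makes the corruption reproducible between runs.
    print(add_ocr_noise("Invoice Number 12345", p=0.1, rng=random.Random(0)))
```

Passing a seeded `random.Random` keeps the noisy copy stable, which matters when comparing models on the same corrupted text.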

Category 2

It contains 7 document types, all of which are forms with fields filled character-by-character. These types of forms are used across various industry verticals and have been programmatically filled. The document types are described as follows:

Medical Authorization: Contains an authorization form for the medicine prescriber's use only.
Personal Bank Account: Contains forms that need to be filled to open a new personal bank account by physically visiting the bank office.
Equity Mortgage: Contains forms used to pledge equity stocks as collateral/mortgage.
Corporate Bank Account: Contains forms that need to be filled to open a new corporate bank account by physically visiting the bank office.
Online Banking Application: Contains forms that need to be filled to open a new corporate bank account by using online banking applications.
Medical Tax Returns: Contains the form by the Department of Revenue Administration for filing Medicaid Tax Returns.
Medical Insurance Enrollment: Contains the Group Enrollment Form for Medical Insurance.

Folder Structure

.
|-- cat_2
|    |-- corporate_bank_account
|    |-- equity_mortgage
|    |-- medical_authorization
|    |-- medical_insurance_enrollment
|    |-- medical_tax_returns
|    |-- online_banking_application
|    `-- personal_bank_account

Every dataset folder has the following structure:

.<dataset>
|-- class_list.txt
|-- images
|-- train_05.txt
|-- test.txt
|-- ocr_test.txt
|-- annotations
|-- statistics_entity.xlsx
|-- statistics_word.xlsx
`-- visualise

Files Description

class_list.txt: Contains the list of fine-grained key-values (classes) that need to be extracted from the document.
images: Folder containing all the document images released as a part of the dataset.
train_05.txt: Contains the train split used for training various models, with ground-truth (GT) annotations for the files.
test.txt: Contains the test split used for testing various models with GT annotations for the files.
ocr_test.txt: Contains the test split used for testing the robustness of various models with OCR errors introduced using a probability of 0.1 in the GT annotations.
annotations: Folder containing annotations of all the 50 images at word level and entity level, which can help in creating new tasks or solving more problems. Files ending with _word.json contain word-level annotations, while files ending with _entity.json contain entity-level annotations.
statistics_entity.xlsx: Contains the distribution of key-value fields/class of the dataset at entity level.
statistics_word.xlsx: Contains the distribution of key-value fields/class of the dataset at word level.
visualise: Folder containing visualisations of the bounding boxes. Images ending with _word.png show the bounding box of each word in blue. Images ending with _entity.png show word-level bounding boxes in red and entity-level bounding boxes in blue.
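
The word-level and entity-level annotation files can be matched up by file name. The sketch below is illustrative only: the helper `pair_annotations` is hypothetical, and it assumes that the two files for one document share the same name up to the _word.json / _entity.json suffix. The JSON schema is not described in this README, so the sketch only groups paths and does not parse the contents.

```python
from pathlib import Path


def pair_annotations(annotations_dir):
    """Group *_word.json and *_entity.json files by their shared document stem."""
    pairs = {}
    for path in sorted(Path(annotations_dir).glob("*.json")):
        for level in ("word", "entity"):
            suffix = f"_{level}.json"
            if path.name.endswith(suffix):
                stem = path.name[: -len(suffix)]  # file name without the level suffix
                pairs.setdefault(stem, {})[level] = path
    return pairs


if __name__ == "__main__":
    # Example path; assumes the script is run from the repository root.
    found = pair_annotations("cat_2/equity_mortgage/annotations")
    for stem, files in found.items():
        print(stem, sorted(files))
```

Documents that have only one of the two levels show up with a single key, which makes gaps easy to spot before building an entity linking benchmark.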

Example Tasks

The dataset can be used for various Document Understanding tasks and models, such as:

Key Information Extraction (KIE): Extracting key-value pairs from documents.
Entity Linking (EL): The entity-level annotations in the Category 2 dataset can be used to benchmark entity linking models.
Optical Character Recognition (OCR): OCR models can be evaluated on complex real-world documents across both categories of the dataset.

Help

Reach out to:

Amit Agarwal (Senior Applied Scientist, OCI)
Srikant Panda (Senior Applied Scientist, OCI)
Kulbhushan Panda (Senior Director, OCI)

Contributing

This project welcomes contributions from the community. Before submitting a pull request, please review our contribution guide.

Security

Please consult the security guide for our responsible security vulnerability disclosure process.

License

Copyright (c) 2023 Oracle and/or its affiliates.
Released under the CC0 1.0 Universal license, as shown in the license guide.
