Optical Character Recognition (OCR) and table extraction

A tool that uses OCR to convert PDFs into hocr files and then uses table-extract to classify areas in the hocr files as tables or other kinds of text.

Requirements

conda (Anaconda or Miniconda)

Install

1. Anaconda environment

Create a new environment, e.g. ocr-pdf, and install all dependencies:

$ conda env create --name ocr-pdf --file environment.yaml

Activate the conda environment. Either use the Anaconda Navigator or run this command in your terminal:

$ conda activate ocr-pdf

or

$ source activate ocr-pdf

2. Set up input and output directories

The Snakefile uses three variables to specify the directories:

INPUTDIR: a directory containing all PDFs to process. Default: pdfs
OUTDIR: the directory in which the output of the processed PDFs is stored. Default: output
TMPDIR: a directory for temporary output files (in this pipeline: the pngs generated from the pdfs). Its value is read from $TMPDIR in the bash environment. If $TMPDIR is not set, TMPDIR defaults to OUTDIR.
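
The TMPDIR fallback described above can be sketched in plain Python, the language Snakefiles are based on. This is only an illustration under stated assumptions, not the project's actual Snakefile: the helper name resolve_dirs and its env parameter are hypothetical, and only the variable names and defaults are taken from the description above.

```python
import os

def resolve_dirs(env=None, inputdir="pdfs", outdir="output"):
    """Return the three directories, applying the documented defaults.

    Hypothetical helper for illustration; the real Snakefile may differ.
    """
    env = os.environ if env is None else env
    # TMPDIR comes from the environment; fall back to OUTDIR when it is unset.
    tmpdir = env.get("TMPDIR", outdir)
    return {"INPUTDIR": inputdir, "OUTDIR": outdir, "TMPDIR": tmpdir}

print(resolve_dirs(env={}))                      # TMPDIR falls back to OUTDIR
print(resolve_dirs(env={"TMPDIR": "/scratch"}))  # TMPDIR taken from the environment
```

Passing env explicitly keeps the sketch easy to try out without changing the real shell environment.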

Output format

Example: suppose you want to process the pdf foo_bar_1999.pdf contained in INPUTDIR, and OUTDIR is set to output.

After running all rules, output/foo_bar_1999 contains the following folders and files:

orig.pdf: the input pdf
pdftotext.txt: a txt file containing the output from pdftotext
ocr_text.txt: a text file from the OCR output
hocr/: the directory containing the hocr files for each page of the pdf: page_1.hocr ... page_n.hocr (with n being the last page of the pdf)
ocr-txt/: generated from the OCR output. Same as hocr/, but with one plain text file per page.
hocr-ts: this directory contains all pages page_1.hocr, page_2.hocr, ... of the pdf after table_extract has been executed. table_extract adds two attributes to the hocr markup, for example:

ts:table-score="1" ts:type="caption"

ts:table-score rates how table-like this area is. Higher scores suggest that the area is a table.
ts:type is a classification for this area. This attribute can have the following values: text block, table, line, caption, decoration, other.
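
As a rough illustration of how these two attributes might be consumed, the sketch below collects every element that carries ts:type and filters by classification, using only Python's standard library. Several things here are assumptions and not taken from table_extract: the names TableAreaParser and find_areas are hypothetical, the sample markup is made up, the attributes are assumed to sit directly on an element's start tag, and the score is assumed to be numeric.

```python
from html.parser import HTMLParser

class TableAreaParser(HTMLParser):
    """Collect elements that carry the ts:type attribute (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.areas = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "ts:type" in attrs:
            self.areas.append({
                "id": attrs.get("id"),
                "type": attrs["ts:type"],
                # Assumption: the score is numeric; missing scores count as 0.
                "score": float(attrs.get("ts:table-score", 0)),
            })

def find_areas(hocr_text, wanted_type="table"):
    """Return all annotated areas whose ts:type equals wanted_type."""
    parser = TableAreaParser()
    parser.feed(hocr_text)
    return [a for a in parser.areas if a["type"] == wanted_type]

# Made-up hocr fragment, for illustration only.
sample = """
<div class="ocr_carea" id="block_1_1" ts:table-score="1" ts:type="caption">Table 1</div>
<div class="ocr_carea" id="block_1_2" ts:table-score="7" ts:type="table">a b c</div>
"""
print(find_areas(sample))
```

To apply this to real output, read one of the page_n.hocr files from hocr-ts and pass its text to find_areas.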

Execute jobs with Snakemake

The configuration is stored in Snakefile. Pass -j <num_cores> in your snakemake calls to run jobs on multiple cores in parallel.

1. Single file processing

The examples below assume the input file pdfs/foo_bar_1999.pdf and OUTDIR='output'.

To run pdftotext on this pdf, we execute this on the terminal:

$ snakemake output/foo_bar_1999/pdftotext.txt

It is also possible to OCR the whole paper by running:

$ snakemake output/foo_bar_1999/ocr_text.txt

To create an archive from the input pngs, execute:

$ snakemake output/foo_bar_1999/pngs.tar.gz

Finally, to run table-extract, call snakemake with the following target:

$ snakemake output/foo_bar_1999/table_extract.done
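
The commands above all follow the same pattern: OUTDIR, then the pdf's file name without its extension, then the target file. A minimal sketch of that mapping follows; the helper targets_for and the TARGETS tuple are hypothetical names introduced for illustration, and only the target file names come from the commands above.

```python
from pathlib import Path

# Target file names taken from the single-file commands above.
TARGETS = ("pdftotext.txt", "ocr_text.txt", "pngs.tar.gz", "table_extract.done")

def targets_for(pdf_path, outdir="output"):
    """Map an input pdf to the Snakemake targets in its output folder."""
    stem = Path(pdf_path).stem  # "pdfs/foo_bar_1999.pdf" -> "foo_bar_1999"
    return [(Path(outdir) / stem / name).as_posix() for name in TARGETS]

for target in targets_for("pdfs/foo_bar_1999.pdf"):
    print("snakemake", target)
```

Each printed line corresponds to one of the snakemake calls shown in this section.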

2. Batch processing of pdfs

Snakemake lets you define batch processes that apply other Snakemake rules to many files, e.g. all pdfs in INPUTDIR. This workflow uses that to generate the outputs for all pdfs in INPUTDIR.

To only run pdftotext on all pdfs, call:

$ snakemake pdftotext_all

In order to generate all hocr files from the pdfs, run snakemake as follows:

$ snakemake all

or simply

$ snakemake

To extract and classify hocr files with table_extract on the whole corpus of pdfs, run this:

$ snakemake table_extract_all

All *_all rules implicitly archive the pdf's pngs as pngs.tar.gz in the corresponding folder.
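
When batch runs are started from a script, the call can be assembled programmatically. The sketch below only builds the command line from the rule names listed above and the -j option; the helper batch_command and the BATCH_RULES tuple are hypothetical, and the commented subprocess call is an assumption about how you might run the command.

```python
import shlex

# Batch rule names taken from the commands above.
BATCH_RULES = ("pdftotext_all", "all", "table_extract_all")

def batch_command(rule="all", cores=1):
    """Build the snakemake call for one of the batch rules."""
    if rule not in BATCH_RULES:
        raise ValueError(f"unknown batch rule: {rule}")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return ["snakemake", "-j", str(cores), rule]

cmd = batch_command("table_extract_all", cores=4)
print(shlex.join(cmd))
# To actually run it (requires the activated conda environment):
# import subprocess; subprocess.run(cmd, check=True)
```

Validating the rule name up front avoids starting Snakemake with a misspelled target.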
