A tool that uses OCR to convert PDFs to hOCR files and then uses table-extract to classify areas in those files as tables or other kinds of text.
- conda (Anaconda or Miniconda)
- Create a new environment, e.g., ocr-pdf, and install all dependencies.
$ conda env create --name ocr-pdf --file environment.yaml
- Activate the conda environment. Either use the Anaconda Navigator or run this command in your terminal:
$ conda activate ocr-pdf
or
$ source activate ocr-pdf
The Snakefile uses three variables to specify:
- INPUTDIR: a directory containing all PDFs to process. Default: pdfs
- OUTDIR: the directory in which to store the output of the processed PDFs. Default: output
- TMPDIR: a directory for temporary output files (in this pipeline: the PNGs rendered from the PDFs). Reads $TMPDIR from the bash environment. If it is not set, TMPDIR defaults to OUTDIR.
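The fallback for TMPDIR can be sketched in plain Python, which is also the language a Snakefile is written in. This is only an illustration of the behavior described above, not the project's actual Snakefile; the helper name resolve_tmpdir is hypothetical.

```python
import os

# Defaults as described in the list above.
INPUTDIR = "pdfs"
OUTDIR = "output"


def resolve_tmpdir(environ, outdir):
    """Return $TMPDIR if it is set in `environ`, otherwise fall back to `outdir`.

    Hypothetical helper: the real Snakefile may implement this differently.
    """
    return environ.get("TMPDIR", outdir)


# Resolve against the real process environment.
TMPDIR = resolve_tmpdir(os.environ, OUTDIR)
```

Passing the environment in as an argument keeps the fallback easy to check without modifying the real shell environment.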
Example: Suppose you want to process the PDF foo_bar_1999.pdf contained in INPUTDIR, and OUTDIR is set to output.
After running all rules, output/foo_bar_1999 contains the following folders and files:
- orig.pdf: the input PDF
- pdftotext.txt: a txt file containing the output from pdftotext
- ocr_text.txt: a text file from the OCR output
- hocr/: the directory containing the hocr files for each page of the PDF: page_1.hocr ... page_n.hocr (with n being the last page of the PDF)
- ocr-txt/: generated from OCR. Same as hocr/, but plain text files per page.
- hocr-ts/: this directory contains all pages page_1.hocr, page_2.hocr, ... from the PDF after executing table_extract.
table_extract enriches the hocr with two attributes, for example:
ts:table-score="1" ts:type="caption"
- ts:table-score is a score of how table-like this area is. A higher score indicates that the area more closely resembles a table.
- ts:type is a classification for this area. This attribute can have the following values: text block, table, line, caption, decoration, other.
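To consume these attributes downstream, the enriched hocr pages can be scanned with Python's standard html.parser. The sketch below is a minimal illustration under two assumptions: that ts:table-score holds a numeric value (as in the example above), and that any element may carry the attributes. The sample markup, including the div element and its ocr_carea class, is made up for the demonstration, and the names TableScoreParser and classified_areas are hypothetical.

```python
from html.parser import HTMLParser


class TableScoreParser(HTMLParser):
    """Collect (ts:type, ts:table-score) pairs from hocr markup."""

    def __init__(self):
        super().__init__()
        self.areas = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if "ts:type" in attr_map:
            raw_score = attr_map.get("ts:table-score")
            # Assumption: the score is numeric; keep None if it is absent.
            score = float(raw_score) if raw_score is not None else None
            self.areas.append((attr_map["ts:type"], score))


def classified_areas(hocr_text):
    """Return all classified areas found in one hocr page."""
    parser = TableScoreParser()
    parser.feed(hocr_text)
    parser.close()
    return parser.areas


# Made-up sample element carrying the two attributes.
sample = '<div class="ocr_carea" ts:table-score="1" ts:type="caption">Table 1</div>'
print(classified_areas(sample))
```

Filtering the returned list for the type table would then give the candidate table areas of a page.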
The configuration is stored in the Snakefile. Pass -j <num_cores> to your snakemake calls to run jobs on multiple cores in parallel.
The following examples assume the file pdfs/foo_bar_1999.pdf and OUTDIR='output'.
- To run pdftotext on this PDF, execute this in the terminal:
$ snakemake output/foo_bar_1999/pdftotext.txt
- It is also possible to OCR the whole paper by running:
$ snakemake output/foo_bar_1999/ocr_text.txt
- To create an archive from the input PNGs, execute:
$ snakemake output/foo_bar_1999/pngs.tar.gz
- Last but not least, to run table-extract, run snakemake with the following arguments:
$ snakemake output/foo_bar_1999/table_extract.done
Snakemake can define batch processes by applying other rules to many files, e.g. all PDFs in INPUTDIR. This workflow provides such rules to generate outputs for all PDFs in INPUTDIR.
- To run only pdftotext on all PDFs, call:
$ snakemake pdftotext_all
- To generate all hocr files from the PDFs, run snakemake as follows:
$ snakemake all
or simply
$ snakemake
- To extract and classify hocr files with table_extract on the whole corpus of PDFs, run this:
$ snakemake table_extract_all
All *_all rules implicitly archive each PDF's PNGs as pngs.tar.gz in the corresponding folder.
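The single-file targets used in this document follow the pattern OUTDIR/<pdf name without extension>/<output file>. As a rough sketch, that mapping can be written in Python as below. The helper name target_for is hypothetical, and the actual *_all rules in the Snakefile may be implemented differently.

```python
from pathlib import PurePosixPath


def target_for(pdf_name, outdir="output", artifact="table_extract.done"):
    """Build the snakemake target path for one PDF.

    Hypothetical helper: mirrors the OUTDIR/<stem>/<artifact> layout
    used in the examples, not code taken from the Snakefile.
    """
    stem = PurePosixPath(pdf_name).stem  # "pdfs/foo_bar_1999.pdf" -> "foo_bar_1999"
    return str(PurePosixPath(outdir) / stem / artifact)


# Targets for a small, made-up selection of PDFs.
selection = ["pdfs/foo_bar_1999.pdf", "pdfs/baz_2001.pdf"]
targets = [target_for(name, artifact="pdftotext.txt") for name in selection]
print(" ".join(targets))
```

The printed paths can be passed to a single snakemake call to process only a chosen subset of PDFs instead of the whole corpus.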