Scope: image processing, training and evaluation of the CLIP model, input file/directory processing, output of the top-N predicted classes 🪧 (categories), summarizing predictions into a tabular format, HF 😊 hub 1 🔗 support for the model, and multiplatform (Windows/Linux) data preparation scripts for PDF-to-PNG conversion
- Versions 🏁
- Model description 📇
- How to install 🔧
- How to run prediction 🪄 modes
- Results 📊
- Data preparation 📦
- For developers 🪛
- Contacts 📧
- Acknowledgements 🙏
- Appendix 🤓
There are currently 4 versions of the model available for download; all of them share the same set of categories but differ in the underlying base model. The latest approved version, v1.1, is the default and can be found in the main branch of the HF 😊 hub 1 🔗
Version | Base code | Pages | PDFs | Description |
---|---|---|---|---|
v1.1 | ViT-B/16 | 15855 | 5730 | smallest (default) |
v1.2 | ViT-B/32 | 15855 | 5730 | small with higher granularity |
v2.1 | ViT-L/14 | 15855 | 5730 | large |
v2.2 | ViT-L/14@336 | 15855 | 5730 | large with highest resolution |
Base model - size 👀
Base model | Disk space |
---|---|
openai/clip-vit-base-patch16 | 992 Mb |
openai/clip-vit-base-patch32 | 1008 Mb |
openai/clip-vit-large-patch14 | 1.5 Gb |
openai/clip-vit-large-patch14-336 | 1.5 Gb |
🔲 Fine-tuned model repository: UFAL's clip-historical-page 1 🔗
🔳 Base model repository: OpenAI's clip-vit-base-patch16, clip-vit-base-patch32, clip-vit-large-patch14, clip-vit-large-patch14-336 2 3 4 5 🔗
The model was trained on a manually ✍️ annotated dataset of historical documents, specifically images of pages from archival paper documents scanned into digital form.
The images contain various combinations of texts ️📄, tables 📏, drawings 📈, and photos 🌄 - the categories 🪧 described below were formed based on those archival documents. Page examples can be found in the category_samples 📁 directory.
The key use case of the provided model and data processing pipeline is to classify an input PNG image from a PDF scanned paper source into one of the categories, each of which determines the content-specific processing pipeline applied next.
In other words, when several APIs for different OCR subtasks are at your disposal, run this classifier first to mark the input data as machine-typed (old-style fonts), handwritten ✏️, or printed plain ️📄 text, or text structured in a tabular 📏 format, and to mark the presence of printed 🌄 or drawn 📈 graphic materials yet to be extracted from the page images.
The dataset is provided under a Public Domain license and consists of 15855 PNG images of pages from the archival documents. The source image files and their annotation can be found in the LINDAT repository 6 🔗.
Training 💪 set of the model: 14267 images
90% of all - proportion in categories 🪧 tabulated below
Evaluation 🏆 set: 1583 images
10% of all - same proportion in categories 🪧 as below and demonstrated in model_EVAL.csv 📎
Manual ✍️ annotation was performed beforehand and took some time ⌛; the categories 🪧 were formed from different sources of archival documents originating in the 1920-2020 span.
Note
The disproportion of the categories 🪧 in both the training data and the provided evaluation category_samples 📁 is NOT intentional, but rather a result of the nature of the source data.
In total, several thousand separate PDF files were selected and split into PNG pages: ~4k of the scanned documents were one page long, covering around a third of all data, and ~2k of them were much longer (dozens to hundreds of pages), covering the rest (more than 60% of all annotated data).
The specific content and language of the source data is irrelevant given the model's vision resolution; however, all data samples came from archaeological reports, which may affect drawing detection preferences, since the commonly depicted objects are ceramic pieces, arrowheads, and rocks, formerly drawn by hand and later illustrated with digital tools (examples can be found in category_samples/DRAW 📁)
Label️ | Description |
---|---|
DRAW | 📈 - drawings, maps, paintings, schematics, or graphics, potentially containing some text labels or captions |
DRAW_L | 📈📏 - drawings, etc., but presented within a table-like layout or including a legend formatted as a table |
LINE_HW | ✏️📏 - handwritten text organized in a tabular or form-like structure |
LINE_P | 📏 - printed text organized in a tabular or form-like structure |
LINE_T | 📏 - machine-typed text organized in a tabular or form-like structure |
PHOTO | 🌄 - photographs or photographic cutouts, potentially with text captions |
PHOTO_L | 🌄📏 - photos presented within a table-like layout or accompanied by tabular annotations |
TEXT | 📰 - mixtures of printed, handwritten, and/or typed text, potentially with minor graphical elements |
TEXT_HW | ✏️📄 - only handwritten text in paragraph or block form (non-tabular) |
TEXT_P | 📄 - only printed text in paragraph or block form (non-tabular) |
TEXT_T | 📄 - only machine-typed text in paragraph or block form (non-tabular) |
The categories were chosen to sort the pages by the following criteria:
- presence of graphical elements (drawings 📈 OR photos 🌄)
- type of text 📄 (handwritten ✏️️ OR printed OR typed OR mixed 📰)
- presence of tabular layout / forms 📏
The reason for this distinction is that different types of pages require different processing pipelines, applied after classification as mentioned above.
Examples of pages sorted by category 🪧 can be found in the category_samples 📁 directory which is also available as a testing subset of the training data.
Step-by-step installation instructions are provided here. The easiest way to obtain the model is via the HF 😊 hub repository 1 🔗, which can be easily accessed through this project.
Hardware requirements 👀
Minimal machine 🖥️ requirements for slow prediction run (and very slow training / evaluation):
- CPU with a decent (above-average) amount of operational memory (RAM)
Ideal machine 🖥️ requirements for fast prediction (and relatively fast training / evaluation):
- any CPU and a reasonable amount of RAM
- GPU (for real CUDA 7 support - only NVIDIA cards qualify)
Warning
Make sure you have Python 3.10+ installed on your machine 💻 and that it meets the hardware requirements listed above. Then create a separate virtual environment for this project
How to 👀
Clone this project to your local machine 🖥️️ via:
cd /local/folder/for/this/project
git init
git clone https://github.com/ufal/atrium-page-classification.git
Then switch to the branch with the CLIP models (`clip`) or with the ViT and EffNet models (`vit`):
cd atrium-page-classification
git checkout vit
OR, to update an already cloned project that has local changes, go to the folder containing the (hidden) `.git` subdirectory and commit your local changes, which will later be merged with the incoming files:
cd /local/folder/for/this/project/atrium-page-classification
git add <changed_file>
git commit -m 'local changes'
And then for updating the project with the latest changes from the remote repository, run:
git pull -X theirs
Alternatively, if you are interested in a specific branch (`clip` or `vit`), you can update it via:
git fetch origin
git checkout vit
git pull --ff-only origin vit
Alternatively, if you do NOT care about local changes OR you just want the latest project files, simply remove those files (all `.py`, `.txt` and `README` files) and pull the latest version from the repository:
cd /local/folder/for/this/project/atrium-page-classification
And then for a total clean up and update, run:
rm *.py
rm *.txt
rm README*
git pull
Alternatively, for a specific branch (`clip` or `vit`):
git reset --hard HEAD
git clean -fd
git fetch origin
git checkout vit
git pull origin vit
Overall, a force update to the remote repository branch (`clip` or `vit`) looks like this:
git fetch origin
git checkout vit
git reset --hard origin/vit
The next step is to create the virtual environment. Follow the Unix / Windows-specific instructions in the venv docs 8 👀🔗 if you don't know how.
After creating the venv folder, activate the environment via:
source <your_venv_dir>/bin/activate
and then inside your virtual environment, you should install Python libraries (takes time ⌛)
Caution
Up to 1 GB of space is needed for the model files and checkpoints, and up to 7 GB for the Python libraries (PyTorch and its dependencies, etc.)
Installation of Python dependencies can be done via:
pip install -r requirements.txt
Note
CUDA 7 support for Python's PyTorch library is supposed to be installed automatically at this point - the presence of a GPU on your machine 🖥️ is checked here for the first time, and it is checked again before every model initialization (for training, evaluation, or prediction runs).
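The kind of check involved is essentially the following (a minimal sketch, not the project's exact code):

```python
# A minimal sketch of the device check performed before model initialization
# (illustration only, not the project's exact code).
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Running on: {device}")
```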
After the dependencies installation is finished successfully, in the same virtual environment, you can run the Python program.
To test that everything works okay and to see the flag descriptions, call for `--help` ❓:
python3 run.py -h
You should see a (hopefully) helpful message about all available command line flags. Your next step would be to pull the model from the HF 😊 hub repository 1 🔗 via:
python3 run.py --hf
OR, for a specific model version (e.g. `main`, `vX.1` or `vX.2`), use the `--revision` flag:
python3 run.py --hf -rev v1.1
OR, for a specific base model version (e.g. `ViT-B/16`, `ViT-B/32`, `ViT-L/14` or `ViT-L/14@336px`), use the `--base` flag (only when the trained model version demands such a base model, as described above):
python3 run.py --hf -rev v2.2 -m 'ViT-L/14@336'
Important
If you already have the model files in the `model/model_<revision>` directory next to this file, you do NOT have to use the `--hf` flag to download the model files from the HF 😊 repo 1 🔗 (use it only to update the model version).
You should see a message about loading the model from the hub and then saving it locally on your machine 🖥️.
Only after you have obtained the trained model files (this takes less time ⌛ than installing the dependencies) can you use any of the commands provided below.
After the model is downloaded, you should see a similar file structure:
Initial project tree 🌳 files structure 👀
/local/folder/for/this/project/atrium-page-classification
├── models
└── <base_code>_rev_<revision>
├── config.json
├── model.safetensors
└── preprocessor_config.json
├── model_checkpoints
├── model_<categ_limit>_<base_code>_<lr>.pt
├── model_<categ_limit>_<base_code>_<lr>_cp.pt
└── ...
├── data_scripts
├── windows
├── move_single.bat
├── pdf2png.bat
└── sort.bat
└── unix
├── move_single.sh
├── pdf2png.sh
└── sort.sh
├── result
├── plots
├── conf_mat_Nn_Cc_<base>_date-time.png
└── ...
└── tables
├── result_date-time_<base>_Nn_Cc.csv
├── EVAL_table_Nn_Cc_<base>_date-time.csv
├── date-time_<base>_RAW.csv
└── ...
├── category_samples
├── DRAW
├── CTX193200994-24.png
└── ...
├── DRAW_L
└── ...
├── run.py
├── classifier.py
├── utils.py
├── requirements.txt
├── config.txt
├── README.md
└── ...
Some of the folders may be missing, such as the later-mentioned `model_output`, which is created automatically only after launching the model.
There are two main ways to run the program:
- Single PNG file classification 📄
- Directory with PNG files classification 📁
To begin with, open config.txt ⚙ and change the folder path in the `[INPUT]` section, then optionally change `top_N` and `batch` in the `[SETUP]` section.
Note
️ Top-3 is enough to cover most of the images, setting Top-5 will help with a small number of difficult to classify samples.
The `batch` variable value depends on your machine's 🖥️ memory size
Rough estimations of memory usage per batch size 👀
Batch size | CPU / GPU memory usage |
---|---|
4 | 2 Gb |
8 | 3 Gb |
16 | 5 Gb |
32 | 9 Gb |
64 | 17 Gb |
It is safe to use a batch size below 12 on a regular office desktop computer, and to lower it to 4 on an old device.
For training on a High Performance Computing cluster, you may use values above 20 for the `batch` variable in the `[SETUP]` section.
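If you prefer to inspect these values programmatically, a minimal sketch using the standard library is shown below (the option names `top_N` and `batch` are taken from the text above; the exact names in config.txt may differ):

```python
# A minimal sketch for checking the [SETUP] values mentioned above with the
# standard library; option names are assumed to match the text (top_N, batch).
from configparser import ConfigParser

cfg = ConfigParser()
cfg.read("config.txt")
print("top_N:", cfg.getint("SETUP", "top_N", fallback=3))
print("batch:", cfg.getint("SETUP", "batch", fallback=8))
```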
Caution
Do NOT try to change `base_model` and other section contents unless you know what you are doing
Rough estimates of disk space needed for the trained model, per base model 👀

Base model | Disk space |
---|---|
openai/clip-vit-base-patch16 | 992 Mb |
openai/clip-vit-base-patch32 | 1008 Mb |
openai/clip-vit-large-patch14 | 1.5 Gb |
openai/clip-vit-large-patch14-336 | 1.5 Gb |
Make sure the virtual environment with all the installed libraries is activated and that you are in the project directory with the Python files; only then proceed.
How to 👀
cd /local/folder/for/this/project/
source <your_venv_dir>/bin/activate
cd atrium-page-classification
Important
All the commands listed below for running the Python scripts are adapted for Unix consoles, while Windows users must use `python` instead of the `python3` syntax
The following prediction should be run using the `-f` or `--file` flag with the path argument. Optionally, you can use the `-tn` or `--topn` flag with the number of guesses you want to get, and also the `-m` or `--model` flag with the path to the model folder as an argument.
How to 👀
Run the program from its starting point run.py 📎 with optional flags:
python3 run.py -tn 3 -f '/full/path/to/file.png' --model_path '/full/path/to/model/folder' -m '<base_code>'
for exactly TOP-3 guesses with a console output.
OR if you are sure about default variables set in the config.txt ⚙:
python3 run.py -f '/full/path/to/file.png'
to run a single PNG file classification - the output will be in the console.
Note
The console output and all result tables contain normalized scores for the N highest-scoring classes 🪧
The following prediction type does NOT require an explicit directory path set with `-d` or `--directory`, since its default value is set in the config.txt ⚙ file and is used when the `--dir` flag is given. The same flags for the number of guesses and the model folder path as for single-page processing can be used. In addition, a directory-specific flag, `--raw`, is available.
Caution
You must either explicitly set the `-d` flag's argument or use the `--dir` flag (which falls back to the default input directory preset in the `[INPUT]` section) to process PNG files at the directory level; otherwise, nothing will happen
It is worth mentioning that directory-level 📁 processing is performed in batches, so refer to the hardware memory requirements for different batch sizes tabulated above.
How to 👀
python3 run.py -tn 3 -d '/full/path/to/directory' --model_path '/full/path/to/model/folder' -m '<base_code>'
for exactly TOP-3 guesses in tabular format from all images found in the given directory.
OR if you are really sure about default variables set in the config.txt ⚙:
python3 run.py --dir
python3 run.py -rev v2.2 -m 'ViT-L/14@336' --dir
The classification results for the PNG pages collected from the directory will be saved 💾 to the corresponding result 📁 folders defined in the `[OUTPUT]` section of the config.txt ⚙ file.
Tip
To additionally get raw class 🪧 probabilities from the model along with the TOP-N results, use the `--raw` flag when processing the directory (NOT available for single-file processing)
Tip
To process all PNG files in the directory AND its subdirectories, use the `--inner` flag when processing the directory, or switch its default value to `True` in the `[SETUP]` section
Naturally, processing a large number of PNG pages takes time ⌛; the progress is reported in the console via messages like `Processed <B×N> images`, where `B` is the batch size set in the `[SETUP]` section of the config.txt ⚙ file and `N` is the iteration of the current dataloader loop. Only after all images from the input directory have been processed is the output table saved 💾 in the `result/tables` folder.
Accuracy measurements and confusion matrix plots are provided for the evaluation dataset (10% of the data provided in the `[TRAIN]` folder). Both the plots and the result tables can be found in the result 📁 folder.
v1.1
Evaluation set's accuracy (Top-1): 100.00% 🏆
v1.2
Evaluation set's accuracy (Top-1): 100.00% 🏆
v2.1
Evaluation set's accuracy (Top-1): 99.94% 🏆
v2.2
Evaluation set's accuracy (Top-1): 99.87% 🏆
The confusion matrices provided above show matching gold and predicted categories 🪧 on the diagonal, while the off-diagonal elements show inter-class errors. From these plots you can judge what types of mistakes to expect from your model.
By running tests on the evaluation dataset after training, you can generate the following output files:
- EVAL_table_Nn_Cc_<base>_date-time.csv - (by default) results on the evaluation dataset with TOP-N guesses
- conf_mat_Nn_Cc_<base>_date-time.png - (by default) confusion matrix plot for the evaluation dataset, also with TOP-N guesses
- date-time_<base>_RAW.csv - (with the `--raw` flag) raw probabilities for all classes of the processed directory
- result_date-time_<base>_Nn_Cc.csv - (by default) results of the processed directory with TOP-N guesses
Note
Generated tables will be sorted by FILE and PAGE number columns in ascending order.
Additionally, results of prediction inference runs at the directory level, without manually checked labels, are included.
General result tables 👀
Demo files `v1.1`:
- Manually ✍️ checked evaluation dataset (TOP-1): model_TOP-1_EVAL.csv 📎
- Unchecked with TRUE values (small): model_TOP-1.csv 📎

Demo files `v1.2`:
- Manually ✍️ checked evaluation dataset (TOP-1): model_TOP-1_EVAL.csv 📎
- Unchecked with TRUE values (small): model_TOP-1.csv 📎

Demo files `v2.1`:
- Manually ✍️ checked evaluation dataset (TOP-1): model_TOP-1_EVAL.csv 📎
- Unchecked with TRUE values (small): model_TOP-1.csv 📎

Demo files `v2.2`:
- Manually ✍️ checked evaluation dataset (TOP-1): model_TOP-1_EVAL.csv 📎
- Manually ✍️ checked evaluation dataset (TOP-5): model_TOP-5_EVAL.csv 📎
- Unchecked with TRUE values (small): model_TOP-1.csv 📎
With the following columns 📋:
- FILE - name of the file
- PAGE - number of the page
- CLASS-N - label of the category 🪧, guess TOP-N
- SCORE-N - score of the category 🪧, guess TOP-N
and optionally
- TRUE - actual label of the category 🪧
Raw result tables 👀
Demo files `v1.1`:
- Unchecked with TRUE values (small) RAW: model_RAW.csv 📎

Demo files `v1.2`:
- Unchecked with TRUE values (small) RAW: model_RAW.csv 📎

Demo files `v2.1`:
- Unchecked with TRUE values (small) RAW: model_RAW.csv 📎

Demo files `v2.2`:
- Unchecked with TRUE values (small) RAW: model_RAW.csv 📎
With the following columns 📋:
- FILE - name of the file
- PAGE - number of the page
- <CATEGORY_LABEL> - separate columns for each of the defined classes 🪧
The reason to use the `--raw` flag is convenience when reviewing the results: the rows will be roughly sorted by category, and the most ambiguous pages will show more small non-zero probabilities (instead of zeros) than the most obvious (for the model) categories 🪧.
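For example, the RAW table can be post-processed into TOP-N guesses with a few lines of pandas (a hedged sketch; the file name is a placeholder and the column names follow the list above):

```python
# A sketch (not project code) of post-processing the RAW table with pandas to
# recover TOP-N guesses per page; the file name is an example path.
import pandas as pd

raw = pd.read_csv("result/tables/model_RAW.csv")
label_cols = [c for c in raw.columns if c not in ("FILE", "PAGE")]

top_n = 3
for _, row in raw.iterrows():
    guesses = row[label_cols].astype(float).sort_values(ascending=False)[:top_n]
    print(row["FILE"], row["PAGE"], list(zip(guesses.index, guesses.round(3))))
```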
You can use this section as a guide for creating your own dataset of pages, which will be suitable for further model processing.
There are useful multiplatform scripts in the data_scripts 📁 folder for the whole process of data preparation.
Note
The `.sh` scripts are adapted for Unix OS and the `.bat` scripts for Windows OS, yet their functionality remains the same
On Windows you must also install the following software before converting PDF documents to PNG images:
- ImageMagick 9 🔗 - download and install the latest version
- Ghostscript 10 🔗 - download and install the latest AGPL version (32- or 64-bit)
The source set of PDF documents must be converted to page-specific PNG images before processing. The following steps describe the procedure of converting PDF documents to PNG images suitable for training, evaluation, or prediction inference.
Firstly, copy the PDF-to-PNG converter script to the directory with PDF documents.
How to 👀
Windows:
move \local\folder\for\this\project\data_scripts\pdf2png.bat \full\path\to\your\folder\with\pdf\files
Unix:
cp /local/folder/for/this/project/data_scripts/pdf2png.sh /full/path/to/your/folder/with/pdf/files
Now check the content and comments in pdf2png.sh 📎 or pdf2png.bat 📎 script, and run it.
Important
You can optionally comment out the removal of processed PDF files from the script, yet it's NOT recommended in case you are going to launch the program several times from the same location.
How to 👀
Windows:
cd \full\path\to\your\folder\with\pdf\files
pdf2png.bat
Unix:
cd /full/path/to/your/folder/with/pdf/files
pdf2png.sh
After the program is done, you will have a directory full of document-specific subdirectories containing page-specific images with a similar structure:
Unix folder tree 🌳 structure 👀
/full/path/to/your/folder/with/pdf/files
├── PdfFile1Name
├── PdfFile1Name-001.png
├── PdfFile1Name-002.png
└── ...
├── PdfFile2Name
├── PdfFile2Name-01.png
├── PDFFile2Name-02.png
└── ...
├── PdfFile3Name
└── PdfFile3Name-1.png
├── PdfFile4Name
└── ...
Note
The page numbers are padded with zeros (on the left) to match the length of the last page number in each PDF file; this is done automatically by the pdftoppm command used on Unix. ImageMagick's 9 🔗 convert command used on Windows does NOT pad the page numbers.
Windows folder tree 🌳 structure 👀
\full\path\to\your\folder\with\pdf\files
├── PdfFile1Name
├── PdfFile1Name-1.png
├── PdfFile1Name-2.png
└── ...
├── PdfFile2Name
├── PdfFile2Name-1.png
├── PDFFile2Name-2.png
└── ...
├── PdfFile3Name
└── PdfFile3Name-1.png
├── PdfFile4Name
└── ...
Optionally you can use the move_single.sh 📎 or move_single.bat 📎 script to move all PNG files from directories with a single PNG file inside to the common directory of one-pagers.
By default, the scripts assume that `onepagers` is the fallback directory for PDF documents that do not have a corresponding separate directory of PNG pages in the PDF files directory (which has already been converted into subdirectories of pages).
How to 👀
Windows:
move \local\folder\for\this\project\atrium-page-classification\data_scripts\move_single.bat \full\path\to\your\folder\with\pdf\files
cd \full\path\to\your\folder\with\pdf\files
move_single.bat
Unix:
cp /local/folder/for/this/project/atrium-page-classification/data_scripts/move_single.sh /full/path/to/your/folder/with/pdf/files
cd /full/path/to/your/folder/with/pdf/files
move_single.sh
The reason for this move is simply convenience during the annotation process described below. These changes are also accounted for in the sort.sh 📎 and sort.bat 📎 scripts.
The generated PNG images of document pages are used to form the annotated gold data.
Note
It takes a lot of time ⌛ to collect at least several hundred examples per category.
Prepare a CSV table with exactly 3 columns:
- FILE - name of the PDF document which was the source of this page
- PAGE - number of the page (NOT padded with 0s)
- CLASS - label of the category 🪧
Tip
Prepare equally sized categories 🪧 if possible, so that the model will not be biased towards over-represented labels 🪧
For Windows users, MS Excel is NOT recommended for writing CSV tables; a free alternative is Apache OpenOffice 11 🔗. For Unix users, the default LibreOffice Calc should be enough to write a correctly comma-separated CSV table.
Table in .csv format example 👀
FILE,PAGE,CLASS
PdfFile1Name,1,Label1
PdfFile2Name,9,Label1
PdfFile1Name,11,Label3
...
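Before sorting, a quick sanity check of the annotation table can be done with pandas (a small sketch; the file name is an example):

```python
# A small sketch (with an assumed file name) for sanity-checking the annotation
# table before sorting: verifies the column names and shows per-category counts.
import pandas as pd

ann = pd.read_csv("annotations.csv")
assert list(ann.columns) == ["FILE", "PAGE", "CLASS"], "unexpected columns"
print(ann["CLASS"].value_counts())  # spot over- or under-represented labels
```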
Cluster the annotated data into separate folders using the sort.sh 📎 or sort.bat 📎 script to copy data from the source folder to the training folder where each category 🪧 has its own subdirectory. This division of PNG images will be used as gold data in training and evaluation.
Warning
It does NOT matter from which directory you launch the sorting script, but you must check the top of the script for (1) the path to the previously described CSV table with annotations, (2) the path to the previously described directory containing document-specific subdirectories of page-specific PNG pages, and (3) the path to the directory where you want to store the training data of label-specific directories with annotated page images.
How to 👀
Windows:
sort.bat
Unix:
sort.sh
After the program is done, you will have a directory full of label-specific subdirectories containing document-specific pages with a similar structure:
Unix folder tree 🌳 structure 👀
/full/path/to/your/folder/with/train/pages
├── Label1
├── PdfFileAName-00N.png
├── PdfFileBName-0M.png
└── ...
├── Label2
├── Label3
├── Label4
└── ...
Windows folder tree 🌳 structure 👀
\full\path\to\your\folder\with\train\pages
├── Label1
├── PdfFileAName-N.png
├── PdfFileBName-M.png
└── ...
├── Label2
├── Label3
├── Label4
└── ...
The sorting script can also help you moderate mislabeled samples before training. Accurate data annotation directly affects model performance.
Before running the training, check the `[TRAIN]` section variables in the config.txt ⚙️ file, where you should set the path to the data folder. Make sure the label directory names do NOT contain special characters such as spaces, tabs, or line breaks (a quick check is sketched below).
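A short sketch (the path is a placeholder) that lists the label subdirectories, which will become category names, and flags any containing whitespace:

```python
# Placeholder path; lists label subdirectories (future category names) and flags
# any whose names contain whitespace, which the text above warns against.
from pathlib import Path

train_dir = Path("/full/path/to/your/folder/with/train/pages")
for d in sorted(p for p in train_dir.iterdir() if p.is_dir()):
    warn = "  <-- contains whitespace" if any(ch.isspace() for ch in d.name) else ""
    print(d.name + warn)
```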
Tip
In the config.txt ⚙️ file, tweak the `max_categ` parameter - the maximum number of samples per category 🪧 - in case you have over-represented labels significantly dominating in size. Set `max_categ` higher than the number of samples in the largest category 🪧 to use all data samples. Similarly, the `max_categ_e` parameter sets the maximum number of samples per category 🪧 for the evaluation dataset, and should be increased to a very large number if you want to cover all samples from all categories 🪧.
From this point, you can start model training or evaluation process.
You can use this project code as a base for your own image classification tasks. The detailed guide on the key phases of the whole process (settings, training, evaluation) is provided here.
Project files description 📋👀
File Name | Description |
---|---|
classifier.py | Model-specific classes and related functions, including predefined values for training arguments |
utils.py | Task-related algorithms |
run.py | Starting point of the program with its main function - can be edited to extend flags and function arguments |
config.txt | Changeable variables for the program - should be edited |
Most of the changeable variables are in the config.txt ⚙ file, specifically in the `[TRAIN]`, `[HF]`, and `[SETUP]` sections.
In the dev sections of the configuration ⚙ file, you will find many boolean variables that can be changed from the default `False` to `True`; however, it is recommended to enable them solely through the specific command line flags implemented for each of these boolean variables.
For more detailed training process adjustments refer to the related functions in classifier.py 📎 file, where you will find some predefined values not used in the run.py 📎 file.
Important
For both training and evaluation, make sure that the training pages directory is set correctly in the config.txt ⚙ and that it contains category 🪧 subdirectories with images inside. The names of the category 🪧 subdirectories are sorted in alphabetical order and become the actual label names, replacing the default category 🪧 list
Device 🖥️ requirements for training / evaluation:
- any CPU and a reasonable amount of RAM
- GPU (for real CUDA 7 support - preferably one of NVIDIA's cards)
Note that efficient training is possible only with a CUDA-compatible GPU.
Rough estimations of memory usage 👀
Batch size | CPU / GPU memory usage |
---|---|
4 | 2 Gb |
8 | 3 Gb |
16 | 5 Gb |
32 | 9 Gb |
64 | 17 Gb |
For test launches on a CPU-only device 🖥️, set the batch size lower than 4; even then, above-average CPU memory capacity is a must to avoid a total system crash.
To train the model run:
python3 run.py --train
The training process automatically logs progress to the console and should take approximately 5-12 hours, depending on your machine's 🖥️ CPU / GPU memory size and the prepared dataset size.
Tip
Run the training with the default hyperparameters if you have at least ~10,000 and fewer than 50,000 page samples very similar to the initial source data - meaning no further changes are required to fine-tune the model for the same task on an expanded (or new) dataset of document pages; even the number of categories 🪧 does NOT matter as long as it stays under 20
Training hyperparameters 👀
- eval_strategy "epoch"
- save_strategy "epoch"
- learning_rate 5e-5
- per_device_train_batch_size 8
- per_device_eval_batch_size 8
- num_train_epochs 3
- warmup_ratio 0.1
- logging_steps 10
- load_best_model_at_end True
- metric_for_best_model "accuracy"
Above are the default hyperparameters, or TrainingArguments 12, used in the training process; only some of them (`epoch` and `log_step`) can be changed in the `[TRAIN]` section, plus `batch` in the `[SETUP]` section of the config.txt ⚙ file (a sketch of how these defaults map onto TrainingArguments follows below). Importantly, the `avg` option - averaging over all category text descriptions - can also be used.
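For orientation, a minimal sketch (assuming a recent transformers version; the output directory is a placeholder) of how the defaults listed above map onto TrainingArguments 12:

```python
# A minimal sketch, assuming a recent transformers version, of how the default
# hyperparameters listed above map onto TrainingArguments (not the exact project code).
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="model_checkpoints",      # assumed output location
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    warmup_ratio=0.1,
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
)
```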
Important
CLIP models accept not only images but also text inputs - in our case, the descriptions.tsv 📎 file, which summarizes the category 🪧 descriptions in the category_samples 📁 folder. Optionally, you can run the models with only a single table of category 🪧 descriptions (via the `categories_file` variable), or use the `--avg` flag to average all of the category 🪧 descriptions in the `description_folder` starting with the `categories_prefix` value. If your descriptions table contains more than one text per category 🪧, the `--avg` flag is set to `True` automatically.
descriptions_comparison_graph.png 📎 is a graph comparing separate and averaged results for all category 🪧 descriptions. Using averaged text embeddings of all label descriptions appears to be the most effective way to classify our images.
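Conceptually, the averaging works roughly as in the sketch below (a simplified illustration rather than the project's classifier.py code; the description strings are placeholders and the checkpoint is just one of the supported base models):

```python
# Simplified illustration of averaging several CLIP text embeddings per category
# and scoring an image against them; descriptions are placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

descriptions = {
    "DRAW": ["a hand-drawn illustration", "a page with a map or schematic"],
    "PHOTO": ["a photograph on a page", "a page with photographic material"],
}

class_embs = []
for texts in descriptions.values():
    tok = processor(text=texts, return_tensors="pt", padding=True)
    emb = model.get_text_features(**tok)
    emb = emb / emb.norm(dim=-1, keepdim=True)
    class_embs.append(emb.mean(dim=0))          # average the per-description embeddings
class_embs = torch.stack(class_embs)

image = Image.open("category_samples/DRAW/CTX193200994-24.png")
img_emb = model.get_image_features(**processor(images=image, return_tensors="pt"))
img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)

scores = (img_emb @ class_embs.T).softmax(dim=-1)
print(dict(zip(descriptions.keys(), scores.squeeze().tolist())))
```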
You are free to play with the learning rate right in the training function arguments called in the run.py 📎 file, yet warmup ratio and other hyperparameters are accessible only through the classifier.py 📎 file.
Tweaking the training hyperparameters is recommended only if the training 💪 loss (error rate) descends too slowly to reach values around 0.001 by the end of the 3rd (by default, last) epoch.
If the evaluation 🏆 loss starts steadily going up after its previous descent, you have reached the limit of useful epochs; next time, set `epochs` to the number of the epoch that successfully ended before you noticed the evaluation loss growing.
During training, image transformations 13 are applied sequentially, each with a 50% chance.
Note
No rotation, reshaping, or flipping was applied to the images; mainly color manipulations were used. The reasons are pages containing specific form types, the general text orientation on the pages, and the default reshaping of the model input to square 224x224 (or 336x336) resolution images.
Image preprocessing steps 👀
- transforms.ColorJitter(brightness=0.5)
- transforms.ColorJitter(contrast=0.5)
- transforms.ColorJitter(saturation=0.5)
- transforms.ColorJitter(hue=0.5)
- transforms.Lambda(lambda img: ImageEnhance.Sharpness(img).enhance(random.uniform(0.5, 1.5)))
- transforms.Lambda(lambda img: img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 2))))
More about selecting image transformations and the available options can be found in the PyTorch torchvision docs 13.
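A compact sketch (an illustration, not the exact training code) of how the listed augmentations can be composed so that each fires with a 50% probability:

```python
# Illustration only: composing the listed augmentations so each is applied
# independently with a 50% chance, roughly matching the description above.
import random
from PIL import ImageEnhance, ImageFilter
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomApply([transforms.ColorJitter(brightness=0.5)], p=0.5),
    transforms.RandomApply([transforms.ColorJitter(contrast=0.5)], p=0.5),
    transforms.RandomApply([transforms.ColorJitter(saturation=0.5)], p=0.5),
    transforms.RandomApply([transforms.ColorJitter(hue=0.5)], p=0.5),
    transforms.RandomApply([transforms.Lambda(
        lambda img: ImageEnhance.Sharpness(img).enhance(random.uniform(0.5, 1.5)))], p=0.5),
    transforms.RandomApply([transforms.Lambda(
        lambda img: img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 2))))], p=0.5),
])
```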
After training is complete, the model is saved 💾 to its own subdirectory in the `model` directory. By default, the model folder name corresponds to the length of its training batch dataloader and the number of epochs - for example `model_<S/B>_E`, where `E` is the number of epochs, `B` is the batch size, and `S` is the size of your training dataset (by default, 90% of the data provided in the `[TRAIN]` folder).
Full project tree 🌳 files structure 👀
/local/folder/for/this/project/atrium-page-classification
├── models
├── <base_code>_rev_v<HFrevision1>
├── config.json
├── model.safetensors
└── preprocessor_config.json
├── <base_code>_rev_v<HFrevision2>
└── ...
├── hf_hub_checkpoints
├── models--openai--clip-vit-base-patch16
├── blobs
├── snapshots
└── refs
└── .locks
└── models--openai--clip-vit-large-patch14
├── model_checkpoints
├── model_<categ_limit>_<base_code>_<lr>.pt
├── model_<categ_limit>_<base_code>_<lr>_cp.pt
└── ...
├── data_scripts
├── windows
└── unix
├── result
├── plots
└── tables
├── category_samples
├── DRAW
├── DRAW_L
└── ...
├── run.py
├── classifier.py
├── utils.py
└── ...
Important
The `model_<revision>` folder naming is generated from the HF 😊 repo 1 🔗 `revision` value and does NOT affect the trained model naming; other training parameters do.
The length of the dataloader depends not only on the size of the dataset but also on the preset batch size and the test subset ratio. You can slightly change the `test_size` and / or the `batch` variable value in the config.txt ⚙ file to train a differently named model on the same dataset. Alternatively, adjust the model name generation in the training function of classifier.py 📎.
After the fine-tuned model is saved 💾, you can explicitly call for evaluation of the model to get a table of TOP-N classes for the randomly composed subset (10% in size by default) of the training page folder.
There is an option to set `test_size` to 0.8 and use all the category-sorted pages provided in the `[TRAIN]` folder for evaluation, but do NOT run it on the very training data you actually used to train the evaluated model.
To do this with the unchanged configuration ⚙, automatically create a confusion matrix plot 📊, and additionally get a raw class probabilities table, run:
python3 run.py --eval
OR, when you don't remember the specific `[SETUP]` and `[TRAIN]` variables' values for the trained model, you can use:
python3 run.py --eval -model_path 'model_<categ_limit>_<base>_<lr>.pt'
To show that the initial models without fine-tuning perform poorly, you can use the `--zero_shot` flag during evaluation:
python3 run.py --eval --zero_shot -m '<base_code>'
Finally, when your model is trained and you are happy with its test performance, you can uncomment a code line in the run.py 📎 file for the HF 😊 hub model push. This functionality has already been implemented and can be accessed through the `--hf` flag, using the values set in the `[HF]` section for the `token` and `repo_name` variables.
In this case, you must rename the trained model folder according to the `revision` value (dots in the name are dropped, e.g. revision `v1.9.22` turns into the `model_v1922` model folder), and only then run the repo push.
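The renaming rule from the example above can be expressed as a one-liner:

```python
# Folder name derived from the revision value, dots dropped (per the example above).
revision = "v1.9.22"
print("model_" + revision.replace(".", ""))   # -> model_v1922
```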
Caution
Set `repo_name` to your own empty repo on the HF 😊 hub, then, in the Settings of your HF 😊 account, find the Access Tokens section and generate a new token - copy and paste its value into the `token` variable. Before committing those config.txt ⚙ file changes via git, replace the full `token` value with a shortened version for security reasons.
For support, write to: [email protected], responsible for this GitHub repository 14 🔗
Information about the authors of this project, including their names and ORCIDs, can be found in the CITATION.cff 📎 file.
- Developed by UFAL 15 👥
- Funded by ATRIUM 16 💰
- Shared by ATRIUM 16 & UFAL 15 🔗
- Model type: fine-tuned CLIP ViT with a 224x224 2 3 4 🔗 or 336x336 5 🔗 input resolution
©️ 2022 UFAL & ATRIUM
README emoji codes 👀
- 🖥 - your computer
- 🪧 - label/category/class
- 📄 - page/file
- 📁 - folder/directory
- 📊 - generated diagrams or plots
- 🌳 - tree of file structure
- ⌛ - time-consuming process
- ✍️ - manual action
- 🏆 - performance measurement
- 😊 - Hugging Face (HF)
- 📧 - contacts
- 👀 - click to see
- ⚙️ - configuration/settings
- 📎 - link to the internal file
- 🔗 - link to the external website
Content specific emoji codes 👀
- 📏 - table content
- 📈 - drawings/paintings/diagrams
- 🌄 - photos
- ✏️ - handwritten content
- 📄 - text content
- 📰 - mixed types of text content, maybe with graphics
Decorative emojis 👀
- 📇📜🔧▶🪄🪛️📦🔎📚🙏👥📬🤓 - decorative purpose only
Tip
An alternative version of this README file is available as the README.html 📎 webpage