Understanding the Modular Multimodal Architecture for Document Classification
一、Text Extraction
- Main approach: We use the open-source Tesseract OCR engine to extract text from all images in the RVL-CDIP dataset.
- We use the combined legacy/LSTM engine (oem 3) and the standard page segmentation mode (psm 3) parameters for this extraction.
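As a minimal sketch of this step, assuming the `tesseract` command-line tool (version 4 or later) is installed and on the PATH, the two settings above map to the `--oem` and `--psm` flags. The function names and the example path are hypothetical and are not taken from the paper.

```python
import subprocess
from typing import List


def build_tesseract_command(image_path: str, oem: int = 3, psm: int = 3) -> List[str]:
    """Build a Tesseract CLI call that prints the recognized text to stdout."""
    return [
        "tesseract",
        image_path,
        "stdout",            # write the text to standard output instead of a file
        "--oem", str(oem),   # OCR engine mode (3 = Tesseract's default selection)
        "--psm", str(psm),   # page segmentation mode (3 = fully automatic, no OSD)
    ]


def extract_text(image_path: str) -> str:
    """Run Tesseract on one image and return the extracted text."""
    result = subprocess.run(
        build_tesseract_command(image_path),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if Tesseract exits with an error
    )
    return result.stdout


# Hypothetical file name, used only to show the resulting command.
command = build_tesseract_command("page_0001.tif")
```

Calling `extract_text` once per image and storing the result keeps OCR separate from model training, so the text only has to be extracted once.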
二、 Abstract Model Architecture
- We define the abstract structure of the model as having three components: an image classifier, a text classifier, and a meta-classifier.
- Structure diagram
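The three-component structure can be sketched as plain Python. This is only an illustration under one assumption that these notes do not spell out: the meta-classifier receives the class-probability outputs of the image and text classifiers, concatenated into one feature vector. All names and the toy models are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Probabilities = List[float]


@dataclass
class MultimodalClassifier:
    """Three independent parts: image model, text model, meta-classifier."""
    image_classifier: Callable[[object], Probabilities]
    text_classifier: Callable[[str], Probabilities]
    meta_classifier: Callable[[Sequence[float]], int]

    def predict(self, image: object, text: str) -> int:
        # Assumption: both probability vectors are concatenated and handed
        # to the meta-classifier as a single feature vector.
        features = self.image_classifier(image) + self.text_classifier(text)
        return self.meta_classifier(features)


# Toy stand-ins (hypothetical) that only show how the parts connect.
def toy_image_model(image: object) -> Probabilities:
    return [0.6, 0.4]


def toy_text_model(text: str) -> Probabilities:
    return [0.1, 0.9]


def toy_meta_model(features: Sequence[float]) -> int:
    # Add the two per-class scores and pick the class with the larger total.
    half = len(features) // 2
    totals = [features[i] + features[half + i] for i in range(half)]
    return totals.index(max(totals))


model = MultimodalClassifier(toy_image_model, toy_text_model, toy_meta_model)
predicted_class = model.predict(image=None, text="invoice total due")
```

Because the parts only meet at `predict`, each one can be trained or swapped out on its own, which is the point of the modular design.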
三、 Image Model Architectures
- We use two standard CNN architectures for the image models: the first is AlexNet (Fig. 2a) with added batch normalization, and the second is VGG16 (Fig. 2b).
四、Text Model Architectures
- The raw text is first preprocessed into one-hot vectors; that is, each document is represented by a binary vector whose components indicate the presence of the word corresponding to that index.
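A small pure-Python sketch of this binary encoding follows. The whitespace tokenization and lowercasing are simplifying assumptions, and the helper names are hypothetical. scikit-learn's `CountVectorizer(binary=True)` produces a comparable presence/absence encoding, though these notes do not say which module the authors used.

```python
from typing import Dict, List


def build_vocabulary(documents: List[str]) -> Dict[str, int]:
    """Give each distinct lowercase token an index, in order of first appearance."""
    vocabulary: Dict[str, int] = {}
    for document in documents:
        for token in document.lower().split():
            if token not in vocabulary:
                vocabulary[token] = len(vocabulary)
    return vocabulary


def to_binary_vector(document: str, vocabulary: Dict[str, int]) -> List[int]:
    """Return a 0/1 vector marking which vocabulary words occur in the document."""
    vector = [0] * len(vocabulary)
    for token in document.lower().split():
        index = vocabulary.get(token)
        if index is not None:  # ignore words outside the vocabulary
            vector[index] = 1
    return vector


documents = ["invoice total due", "total amount due due"]
vocabulary = build_vocabulary(documents)
vectors = [to_binary_vector(document, vocabulary) for document in documents]
```

The repeated word "due" in the second document still yields a single 1, since the vector records presence rather than counts.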
五、 Meta-classifier
- The meta-classifier in all experiments is an XGBoost model.
六、Experiments and Results
- Implementation
- All network models are built using Keras with a TensorFlow backend. We also use a number of modules from scikit-learn to preprocess the text.
- We consistently surpass the current state of the art.
- Augmentation
- Slight shear augmentations (θ ∈ [−10°, 10°]) during training provide the best generalization performance.
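To make the shear range concrete, here is a hedged sketch that samples an angle uniformly from [−10°, 10°] and applies a horizontal shear to point coordinates. The convention x' = x + tan(θ)·y is an assumption; image libraries differ in how they parameterize shear, and these notes do not state which one the authors used. The function names are hypothetical.

```python
import math
import random
from typing import List, Optional, Tuple

Point = Tuple[float, float]


def sample_shear_angle(max_degrees: float = 10.0,
                       rng: Optional[random.Random] = None) -> float:
    """Draw a shear angle uniformly from [-max_degrees, max_degrees]."""
    rng = rng or random.Random()
    return rng.uniform(-max_degrees, max_degrees)


def shear_points(points: List[Point], angle_degrees: float) -> List[Point]:
    """Apply a horizontal shear: x' = x + tan(angle) * y, y' = y."""
    factor = math.tan(math.radians(angle_degrees))
    return [(x + factor * y, y) for x, y in points]


# Corners of a unit square standing in for an image's corner coordinates.
corners = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
angle = sample_shear_angle(rng=random.Random(0))  # seeded for repeatability
sheared = shear_points(corners, angle)
```

At 10° the top edge shifts by tan(10°) ≈ 0.18 of the image height, so the distortion stays slight, as the note describes.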
- Optimization
- We use SGD with warm restarts; however, we adjust the learning rate over batches rather than epochs, which essentially reduces to a discontinuous one-cycle cosine annealing learning rate schedule.
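The per-batch schedule can be sketched with the standard cosine annealing formula from SGD with warm restarts, lr = lr_min + ½(lr_max − lr_min)(1 + cos(π · progress)). This sketch assumes a single cycle spanning all training batches; the values of `lr_max` and `lr_min` and the function name are placeholders, not the authors' settings.

```python
import math


def cosine_annealed_lr(batch_index: int, total_batches: int,
                       lr_max: float = 0.1, lr_min: float = 0.0) -> float:
    """Learning rate for one cosine cycle, updated at every batch."""
    if total_batches < 2:
        return lr_max
    # progress runs from 0.0 at the first batch to 1.0 at the last batch
    progress = batch_index / (total_batches - 1)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


# Learning rate at every batch of a hypothetical 101-batch training run.
schedule = [cosine_annealed_lr(i, 101) for i in range(101)]
```

Updating per batch rather than per epoch gives the optimizer a series of small steps along the cosine curve instead of a few large jumps.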
- Results
- Because each classifier is trained independently of the others, we can see the results of each experiment and the remarkable boost that comes from combining different classifiers.
七、Conclusion
- It is clear from the results that including extracted text in the development of document classification models improves the quality and accuracy of predictions.
- The work here only takes advantage of a bag-of-words approach to the text classification component. A further avenue for research could include extending the more recent embedding approaches to account for transcription errors.
八、Appendix
- The open RVL-CDIP dataset suffers from some data quality issues, namely duplicated images across sets (training, testing, and validation) and classes; that is, the same image can occur across classes and sets.
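One simple way to look for such duplicates is to hash each image's raw bytes and report hashes that show up in more than one set or under more than one class. This is a sketch only: it catches byte-identical files, not re-scanned or re-encoded copies, and it is not necessarily how the authors found the duplicates. The record layout and function name are hypothetical.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# (split, class label, raw image bytes); this field layout is hypothetical.
Record = Tuple[str, str, bytes]


def find_cross_duplicates(records: List[Record]) -> Dict[str, Set[Tuple[str, str]]]:
    """Map each content hash to its (split, label) pairs, keeping only hashes
    that appear in more than one split or under more than one label."""
    seen: Dict[str, Set[Tuple[str, str]]] = defaultdict(set)
    for split, label, image_bytes in records:
        digest = hashlib.sha256(image_bytes).hexdigest()
        seen[digest].add((split, label))
    return {digest: places for digest, places in seen.items() if len(places) > 1}


# Toy byte strings standing in for image file contents.
records = [
    ("train", "letter", b"A"),
    ("test", "letter", b"A"),    # same image in two sets
    ("train", "memo", b"B"),
    ("train", "memo", b"B"),     # repeat within one set and class: not reported
    ("val", "invoice", b"C"),
    ("val", "form", b"C"),       # same image under two classes
]
duplicates = find_cross_duplicates(records)
```

Images flagged this way could then be removed from the test and validation sets so that evaluation does not include examples seen during training.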