Understanding the Modular Multimodal Architecture for Document Classification

一、Text Extraction

Main approach: we use the open-source Tesseract OCR engine to extract text from all images in the RVL-CDIP dataset.
We use the combined legacy/LSTM engine (oem 3) and the standard page segmentation mode (psm 3) parameters for this extraction.
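
The extraction step could be scripted roughly as follows. This is a minimal sketch that assumes the `tesseract` command-line tool is installed and on the PATH; the helper names `build_tesseract_command` and `extract_text` are hypothetical and not taken from the paper.

```python
import subprocess


def build_tesseract_command(image_path, oem=3, psm=3):
    """Build the argument list for one Tesseract CLI call (hypothetical helper)."""
    return [
        "tesseract",
        str(image_path),
        "stdout",            # write the recognized text to standard output
        "--oem", str(oem),   # OCR engine mode, as quoted in the notes above
        "--psm", str(psm),   # page segmentation mode, as quoted in the notes above
    ]


def extract_text(image_path, oem=3, psm=3):
    """Run Tesseract on a single image and return the recognized text."""
    result = subprocess.run(
        build_tesseract_command(image_path, oem, psm),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if Tesseract fails
    )
    return result.stdout
```

Calling `extract_text` once per image in the dataset would yield the raw text that the text classifier later consumes.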

二、 Abstract Model Architecture

We define the abstract structure of the model as having three components: an image classifier, a text classifier, and a meta-classifier.
Structure diagram
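
The three-component structure can be sketched as a simple composition of functions. These notes do not say how the meta-classifier consumes the other two outputs, so concatenating the two class-probability vectors is an assumption here, and `classify_document` and `average_meta_classifier` are hypothetical names used only for illustration.

```python
def classify_document(image, text, image_classifier, text_classifier, meta_classifier):
    """Combine an image model and a text model through a meta-classifier."""
    image_probs = image_classifier(image)  # class probabilities from the image model
    text_probs = text_classifier(text)     # class probabilities from the text model
    # Assumption: the meta-classifier sees both probability vectors concatenated.
    meta_features = list(image_probs) + list(text_probs)
    return meta_classifier(meta_features)


def average_meta_classifier(features):
    """Toy stand-in (not the paper's meta-classifier): average both halves, take argmax."""
    half = len(features) // 2
    averaged = [(a + b) / 2 for a, b in zip(features[:half], features[half:])]
    return max(range(half), key=averaged.__getitem__)


# Usage with toy two-class models that return fixed probabilities.
predicted = classify_document(
    image=None,
    text="",
    image_classifier=lambda _image: [0.6, 0.4],
    text_classifier=lambda _text: [0.1, 0.9],
    meta_classifier=average_meta_classifier,
)
```

Because the three parts only communicate through their outputs, each can be trained or swapped independently, which is the modularity the title refers to.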

三、 Image Model Architectures

We use two standard CNN architectures for the image models: the first is AlexNet (Fig. 2a) with added batch normalization, and the second is VGG16 (Fig. 2b).

四、Text Model Architectures

The raw text is first preprocessed into one-hot vectors; that is, each document is represented by a binary vector whose components indicate the presence of the word corresponding to that index.
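
The binary representation described above can be illustrated with a small pure-Python sketch. The whitespace tokenization and lowercasing are assumptions, and the helper names are hypothetical; scikit-learn's `CountVectorizer(binary=True)` produces a comparable presence/absence encoding.

```python
def build_vocabulary(documents):
    """Map each distinct word to an index, in order of first appearance."""
    vocabulary = {}
    for document in documents:
        for word in document.lower().split():
            if word not in vocabulary:
                vocabulary[word] = len(vocabulary)
    return vocabulary


def binary_vector(document, vocabulary):
    """Return a 0/1 vector marking which vocabulary words occur in the document."""
    vector = [0] * len(vocabulary)
    for word in document.lower().split():
        index = vocabulary.get(word)
        if index is not None:  # words outside the vocabulary are ignored
            vector[index] = 1
    return vector


corpus = ["invoice total due", "memo to staff"]
vocabulary = build_vocabulary(corpus)
encoded = binary_vector("total total invoice", vocabulary)
```

Note that repeated words still produce a single 1, so word counts and word order are both discarded.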

五、 Meta-classifier

六、Experiments and Results

Implementation
- All network models are built using Keras with the TensorFlow backend. We also use a number of modules from scikit-learn to preprocess the text.
- We consistently surpass the current state of the art.
Augmentation
- Slight shear augmentations (θ ∈ [−10°, 10°]) during training provide the best generalization performance.
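
As an illustration of the shear augmentation, the sketch below samples an angle from the stated range and applies a horizontal shear to a single coordinate. The exact shear parameterization used in the original experiments is not given in these notes, so the mapping x' = x + tan(θ)·y is an assumption, and both function names are hypothetical.

```python
import math
import random


def sample_shear_angle(rng, max_degrees=10.0):
    """Draw a shear angle uniformly from [-max_degrees, max_degrees]."""
    return rng.uniform(-max_degrees, max_degrees)


def shear_point(x, y, angle_degrees):
    """Apply a horizontal shear (assumed form): x' = x + tan(angle) * y, y' = y."""
    return x + math.tan(math.radians(angle_degrees)) * y, y


rng = random.Random(0)  # seeded so a training run is repeatable
angle = sample_shear_angle(rng)
sheared = shear_point(0.0, 1.0, angle)
```

In practice the same transform would be applied to every pixel coordinate of a training image, with a fresh angle drawn for each sample.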
Optimization
- We use SGD with warm restarts; however, we adjust the learning rate over batches rather than epochs, which essentially reduces to a discontinuous one-cycle cosine-annealing learning rate schedule.
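
A per-batch cosine-annealing schedule with restarts can be sketched as below. It uses the standard SGDR form, lr = lr_min + 0.5 · (lr_max − lr_min) · (1 + cos(π · t / T)), with t counted in batches. The learning-rate bounds and cycle length are placeholder values, not figures from the paper, and `cosine_restart_lr` is a hypothetical name.

```python
import math


def cosine_restart_lr(batch_index, batches_per_cycle, lr_max=0.1, lr_min=0.001):
    """Learning rate for a given batch under cosine annealing with warm restarts."""
    # Position inside the current cycle; wrapping to 0 is the (discontinuous) restart.
    position = batch_index % batches_per_cycle
    cosine = math.cos(math.pi * position / batches_per_cycle)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + cosine)


# Example: the first two cycles of a schedule with 100 batches per cycle.
schedule = [cosine_restart_lr(i, batches_per_cycle=100) for i in range(200)]
```

The rate decays smoothly within a cycle and jumps back to `lr_max` at each cycle boundary, which is the discontinuity mentioned above.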
Results
- Because each classifier is trained independently of the others, we can see the results of each experiment and the remarkable boost that comes from combining different classifiers.

七、Conclusion

It is clear from the results that including extracted text in the development of document classification models improves the quality and accuracy of predictions.
The work here uses only a bag-of-words approach for the text classification component; a further avenue for research could be extending more recent embedding approaches to account for transcription errors.

八、Appendix

The open RVL-CDIP dataset suffers from some data quality issues, namely duplicated images across sets (training, testing, and validation) and classes; that is, the same image can occur across classes and sets.
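
One way to surface such duplicates is to hash each image's bytes and flag any hash that appears in more than one set or class. The sketch below is an assumption about how this could be done, not the method used by the authors; it only catches byte-identical files, and `find_cross_duplicates` is a hypothetical name.

```python
import hashlib
from collections import defaultdict


def find_cross_duplicates(records):
    """Group byte-identical images and keep groups spanning several sets or classes.

    `records` is an iterable of (name, split, label, image_bytes) tuples.
    """
    groups = defaultdict(list)
    for name, split, label, image_bytes in records:
        digest = hashlib.sha256(image_bytes).hexdigest()
        groups[digest].append((name, split, label))

    flagged = []
    for members in groups.values():
        splits = {split for _, split, _ in members}
        labels = {label for _, _, label in members}
        if len(members) > 1 and (len(splits) > 1 or len(labels) > 1):
            flagged.append(members)
    return flagged


# Toy records with in-memory bytes standing in for image files.
records = [
    ("a.tif", "train", "letter", b"AAA"),
    ("b.tif", "test", "letter", b"AAA"),
    ("c.tif", "train", "memo", b"BBB"),
]
duplicates = find_cross_duplicates(records)
```

Near-duplicates, such as rescanned or re-encoded pages, would need a perceptual comparison rather than an exact hash.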
