OCR system for recognizing modern Japanese magazines

DucAnh 7ff9246af5 Merge branch 'master' of https://github.com/ducanh841988/Kindai-OCR		2020-07-08 16:40:48 +09:00
basenet	initial commit	2020-07-08 11:31:26 +09:00
data	initial commit	2020-07-08 11:31:26 +09:00
images	add images	2020-07-08 11:34:55 +09:00
pretrain	initial commit	2020-07-08 11:31:26 +09:00
.gitignore	initial commit	2020-07-08 11:31:26 +09:00
README.md	Update README.md	2020-07-08 14:34:16 +09:00
coordinates.py	initial commit	2020-07-08 11:31:26 +09:00
craft.py	initial commit	2020-07-08 11:31:26 +09:00
craft_utils.py	initial commit	2020-07-08 11:31:26 +09:00
data_loader.py	initial commit	2020-07-08 11:31:26 +09:00
decoder.py	initial commit	2020-07-08 11:31:26 +09:00
encoder.py	initial commit	2020-07-08 11:31:26 +09:00
encoder_decoder.py	initial commit	2020-07-08 11:31:26 +09:00
evaluation.py	initial commit	2020-07-08 11:31:26 +09:00
file_utils.py	initial commit	2020-07-08 11:31:26 +09:00
gaussian.py	initial commit	2020-07-08 11:31:26 +09:00
imgproc.py	initial commit	2020-07-08 11:31:26 +09:00
mep.py	initial commit	2020-07-08 11:31:26 +09:00
mseloss.py	initial commit	2020-07-08 11:31:26 +09:00
requirements.txt	update code	2020-07-08 16:40:13 +09:00
test.py	initial commit	2020-07-08 11:31:26 +09:00
torchutil.py	initial commit	2020-07-08 11:31:26 +09:00
translate_line.py	initial commit	2020-07-08 11:31:26 +09:00
utils.py	update code	2020-07-08 16:40:13 +09:00
watershed.py	initial commit	2020-07-08 11:31:26 +09:00

README.md

Kindai-OCR

OCR system for recognizing modern Japanese magazines

About

This repo contains an OCR system for converting images of modern Japanese documents to text. It is a result of the N2I project for the digitization of modern Japanese documents.

The system has two main modules: text line extraction and text line recognition. The overall architecture is shown in the figures below.
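The two-stage flow described above can be sketched as follows. This is a minimal illustration, not the repository's code: `run_pipeline`, `crop`, the `(left, top, right, bottom)` box format, and the toy extractor and recognizer are all hypothetical stand-ins for the real modules.

```python
from typing import Callable, List, Sequence, Tuple

# Assumed formats, for illustration only.
Box = Tuple[int, int, int, int]      # (left, top, right, bottom), right/bottom exclusive
Image = Sequence[Sequence[int]]      # 2-D grid of pixel values standing in for a real image


def crop(image: Image, box: Box) -> List[List[int]]:
    """Cut the rectangle described by `box` out of `image`."""
    left, top, right, bottom = box
    return [list(row[left:right]) for row in image[top:bottom]]


def run_pipeline(
    image: Image,
    extract_lines: Callable[[Image], List[Box]],
    recognize_line: Callable[[Image], str],
) -> List[Tuple[Box, str]]:
    """Stage 1 finds text line boxes; stage 2 transcribes each cropped line."""
    return [(box, recognize_line(crop(image, box))) for box in extract_lines(image)]


def toy_extractor(image: Image) -> List[Box]:
    """Toy stand-in: every row containing a non-zero pixel is one 'text line'."""
    width = len(image[0])
    return [(0, y, width, y + 1) for y, row in enumerate(image) if any(row)]


def toy_recognizer(line_image: Image) -> str:
    """Toy stand-in: 'transcribes' a line as its count of non-zero pixels."""
    return str(sum(1 for row in line_image for pixel in row if pixel))


if __name__ == "__main__":
    page = [[0, 0, 0], [1, 1, 0], [0, 0, 0], [0, 1, 0]]
    for box, text in run_pipeline(page, toy_extractor, toy_recognizer):
        print(box, text)
```

Passing the two stages in as callables keeps the sketch independent of any particular detector or recognizer.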

For text line extraction, we retrain CRAFT (Character Region Awareness for Text Detection) on 1,000 annotated images provided by the Center for Research and Development of Higher Education, The University of Tokyo.
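CRAFT-style detectors predict a per-pixel score map, and text regions are usually recovered by thresholding that map and grouping connected pixels. The sketch below illustrates only that grouping idea in pure Python. The function name `boxes_from_score_map`, the 0.5 threshold, and the axis-aligned box format are assumptions for illustration; the repository's own post-processing (see craft_utils.py) may differ.

```python
from collections import deque


def boxes_from_score_map(scores, threshold=0.5):
    """Return (left, top, right, bottom) boxes, right/bottom exclusive, one per
    4-connected region whose scores are all >= threshold (assumed value)."""
    height = len(scores)
    width = len(scores[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if seen[y][x] or scores[y][x] < threshold:
                continue
            # Breadth-first flood fill over this region, tracking its extent.
            queue = deque([(x, y)])
            seen[y][x] = True
            left, top, right, bottom = x, y, x, y
            while queue:
                cx, cy = queue.popleft()
                left, right = min(left, cx), max(right, cx)
                top, bottom = min(top, cy), max(bottom, cy)
                for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and not seen[ny][nx] and scores[ny][nx] >= threshold):
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            boxes.append((left, top, right + 1, bottom + 1))
    return boxes


if __name__ == "__main__":
    score_map = [
        [0.9, 0.8, 0.1, 0.0],
        [0.1, 0.7, 0.0, 0.6],
        [0.0, 0.0, 0.0, 0.9],
    ]
    print(boxes_from_score_map(score_map))
```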

For text line recognition, we employ the attention-based encoder-decoder from our previous publication. We train the text line recognition model on 1,000 annotated images and 1,600 unannotated images provided by the Center for Research and Development of Higher Education and the National Institute for Japanese Language and Linguistics, respectively.
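To make "attention-based" concrete, here is a generic single attention step: the decoder state scores each encoder feature, the scores are normalized with a softmax, and the context vector is the weighted sum of the features. Dot-product scoring is an assumption chosen for brevity; the scoring function actually used in decoder.py is not described in this README and may differ. The names `softmax` and `attend` are illustrative.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state, encoder_features):
    """One attention step with dot-product scoring (an assumed choice).

    Returns the attention weights (one per encoder feature) and the context
    vector, i.e. the weighted sum of the encoder features."""
    scores = [
        sum(s * f for s, f in zip(decoder_state, feature))
        for feature in encoder_features
    ]
    weights = softmax(scores)
    size = len(encoder_features[0])
    context = [
        sum(w * feature[i] for w, feature in zip(weights, encoder_features))
        for i in range(size)
    ]
    return weights, context


if __name__ == "__main__":
    features = [[1.0, 0.0], [0.0, 1.0]]
    weights, context = attend([2.0, 0.0], features)
    print(weights, context)
```

At each output character the decoder would repeat this step with its updated state, so the weights shift along the text line as decoding proceeds.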

Installing Kindai OCR

python==3.7.4
torch==1.4.0
torchvision==0.2.1
opencv-python==3.4.2.17
scikit-image==0.14.2
scipy==1.1.0
Polygon3
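A quick way to compare an environment against the pinned versions above is sketched below. The helper names (`parse_pins`, `installed_version`, `check`) are hypothetical and not part of this repo. The sketch assumes the names above are the distribution names pip knows them by, and it treats `python` as the interpreter version rather than a package.

```python
import sys

PINS = (
    "python==3.7.4 torch==1.4.0 torchvision==0.2.1 opencv-python==3.4.2.17 "
    "scikit-image==0.14.2 scipy==1.1.0 Polygon3"
)


def parse_pins(text):
    """Map each requirement name to its pinned version (None if unpinned)."""
    pins = {}
    for token in text.split():
        name, _, version = token.partition("==")
        pins[name] = version or None
    return pins


def installed_version(name):
    """Return the installed version string, or None if it is not installed."""
    if name == "python":
        return ".".join(str(part) for part in sys.version_info[:3])
    try:
        from importlib.metadata import PackageNotFoundError, version  # Python 3.8+
    except ImportError:
        import pkg_resources  # fallback for Python 3.7 (ships with setuptools)
        try:
            return pkg_resources.get_distribution(name).version
        except pkg_resources.DistributionNotFound:
            return None
    try:
        return version(name)
    except PackageNotFoundError:
        return None


def check(pins):
    """Return {name: (wanted, found, status)}; status is ok, mismatch or missing."""
    report = {}
    for name, wanted in pins.items():
        found = installed_version(name)
        if found is None:
            status = "missing"
        elif wanted is None or found == wanted:
            status = "ok"
        else:
            status = "mismatch"
        report[name] = (wanted, found, status)
    return report


if __name__ == "__main__":
    for name, (wanted, found, status) in check(parse_pins(PINS)).items():
        print(f"{name}: wanted={wanted} found={found} -> {status}")
```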

Running Kindai OCR

First, download the pretrained models (VGG model, CRAFT model, OCR model) and put them into the ./pretrain/ folder.
Copy your images into the ./data/test/ folder.
Run the following script to recognize the images:
python test.py
The recognized text transcription is written to ./data/result.xml, and the result images are saved in ./data/result/.
You may have to check the path to the Japanese font in test.py to get correct visualization results:
fontPIL = '/usr/share/fonts/truetype/fonts-japanese-gothic.ttf' # japanese font
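Before running test.py, the prerequisites from the steps above can be checked with a small script along these lines. It is a sketch under stated assumptions: `preflight` is a hypothetical helper, the list of image suffixes is a guess (the README does not say which formats are supported), and the check on ./pretrain/ only verifies that the folder contains at least one file, not that the right models are present.

```python
from pathlib import Path

# Assumed image extensions; the README does not list the supported formats.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".bmp", ".tif", ".tiff"}


def preflight(repo_root, font_path):
    """Return a list of human-readable problems; an empty list means ready."""
    root = Path(repo_root)
    problems = []

    pretrain = root / "pretrain"
    if not pretrain.is_dir() or not any(p.is_file() for p in pretrain.iterdir()):
        problems.append("no model files found in ./pretrain/")

    test_dir = root / "data" / "test"
    if not test_dir.is_dir() or not any(
        p.suffix.lower() in IMAGE_SUFFIXES for p in test_dir.iterdir()
    ):
        problems.append("no images found in ./data/test/")

    if not (root / "test.py").is_file():
        problems.append("test.py not found in the repository root")

    if not Path(font_path).is_file():
        problems.append(f"Japanese font not found: {font_path}")

    return problems


if __name__ == "__main__":
    font = "/usr/share/fonts/truetype/fonts-japanese-gothic.ttf"
    for problem in preflight(".", font):
        print("WARNING:", problem)
```

If the font warning appears, point `fontPIL` in test.py at a Japanese font that exists on your system.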
An example result from our OCR system