# Kindai-OCR
OCR system for recognizing modern Japanese magazines

# Updates:
Kindai V2.0 uses a Transformer OCR model for text recognition. The model was trained on the NDL and CODH datasets.

## About

This repo contains an OCR system for converting images of modern Japanese text to text. The software was developed by Dr. Anh Duc Le while he was working at the <a href="http://codh.rois.ac.jp/">ROIS-DS Center for Open Data in the Humanities</a>.

The system has two main modules: text line extraction and text line recognition. The overall architecture is shown in the figures below.
![alt text](https://github.com/ducanh841988/Kindai-OCR/blob/master/images/TextlineExtraction.jpg "text line extraction")

For text line extraction, we retrained CRAFT (Character Region Awareness for Text Detection) on 1,000 annotated images provided by the Center for Research and Development of Higher Education, The University of Tokyo.
![alt text](https://github.com/ducanh841988/Kindai-OCR/blob/master/images/kindai_v1.jpg "text line recognition with attention model")
Text line recognition with attention model in Kindai V1.0

![alt text](https://github.com/ducanh841988/Kindai-OCR/blob/master/images/kindai_v2.png "text line recognition with ")
Text line recognition with Transformer in Kindai V2.0
For text line recognition, Kindai V1.0 employs the attention-based encoder-decoder from our previous publication. We trained it on 1,000 annotated images and 1,600 unannotated images provided by the Center for Research and Development of Higher Education, The University of Tokyo, and the National Institute for Japanese Language and Linguistics, respectively.  
For Kindai V2.0, we trained a Transformer with more data from the National Diet Library (NDL) and the Center for Open Data in the Humanities (CODH).
The [NDL dataset](https://github.com/ndl-lab/pdmocrdataset-part2) contains 3,997 pages and 103,256 lines, and the [CODH dataset](http://codh.rois.ac.jp/modern-magazine/dataset/) contains 1,985 pages and 59,465 lines.
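
The two-module design described above can be pictured as a small pipeline: extract text line regions from a page, then recognize each region. The sketch below is only an illustration of that flow. The names `run_pipeline`, `detect_lines`, `recognize_line` and the `Box` tuple are hypothetical and are not taken from this repository's code; the stand-in functions replace the real CRAFT and OCR models so that the example runs without any weights.

```python
from typing import Callable, List, Sequence, Tuple

# A text line region as (x, y, width, height). This representation is an
# assumption made for the sketch, not the repository's data structure.
Box = Tuple[int, int, int, int]


def run_pipeline(
    page: object,
    detect_lines: Callable[[object], Sequence[Box]],
    recognize_line: Callable[[object, Box], str],
) -> List[Tuple[Box, str]]:
    """Run text line extraction, then recognize each extracted line."""
    results = []
    for box in detect_lines(page):        # module 1: text line extraction
        text = recognize_line(page, box)  # module 2: text line recognition
        results.append((box, text))
    return results


if __name__ == "__main__":
    # Stand-in modules so the sketch runs without model weights.
    def fake_detect(page):
        return [(0, 0, 40, 300), (50, 0, 40, 300)]

    def fake_recognize(page, box):
        return "line at x={}".format(box[0])

    for box, text in run_pipeline("page.jpg", fake_detect, fake_recognize):
        print(box, text)
```

Passing the two modules in as callables keeps the sketch independent of which recognizer (V1.0 or V2.0) is used.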


## Installing Kindai OCR

Python==3.7.11         
torch==1.7.0     
torchvision==0.8.1     
opencv-python==3.4.2.17     
scikit-image==0.14.2     
scipy==1.1.0     
Polygon3     
pillow==4.3.0     
pytorch-lightning==1.3.5     
einops==0.3.0     
editdistance==0.5.3     
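
Because the versions above are pinned exactly, it can help to compare them with what is installed before running anything. The following is a minimal sketch, not part of this repository: `PINS`, `installed_version` and `find_mismatches` are names introduced here for illustration, and on Python 3.7 the script assumes the `importlib-metadata` backport is installed.

```python
try:
    from importlib.metadata import PackageNotFoundError, version  # Python 3.8+
except ImportError:
    # Python 3.7 (the version listed above) needs the importlib-metadata backport.
    from importlib_metadata import PackageNotFoundError, version

# Pins copied from the list above; None means no version is pinned.
PINS = {
    "torch": "1.7.0",
    "torchvision": "0.8.1",
    "opencv-python": "3.4.2.17",
    "scikit-image": "0.14.2",
    "scipy": "1.1.0",
    "Polygon3": None,
    "pillow": "4.3.0",
    "pytorch-lightning": "1.3.5",
    "einops": "0.3.0",
    "editdistance": "0.5.3",
}


def installed_version(name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return version(name)
    except PackageNotFoundError:
        return None


def find_mismatches(pins, lookup=installed_version):
    """List (name, wanted, found) for packages that are missing or differ."""
    problems = []
    for name, wanted in pins.items():
        found = lookup(name)
        if found is None or (wanted is not None and found != wanted):
            problems.append((name, wanted, found))
    return problems


if __name__ == "__main__":
    for name, wanted, found in find_mismatches(PINS):
        print("{}: wanted {}, found {}".format(
            name, wanted or "any", found or "not installed"))
```

An empty result means every listed package is installed at the pinned version.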


## Running Kindai OCR
- First download the pretrained models and put them into the ./pretrain/ folder:
[VGG model](https://drive.google.com/file/d/1_A1dEFKxyiz4Eu1HOCDbjt1OPoEh90qr/view?usp=sharing), [CRAFT model](https://drive.google.com/file/d/1-9xt_jjs4btMrz5wzrU1-kyp2c6etFab/view?usp=sharing), [OCR V1.0 model](https://drive.google.com/file/d/1mibg7D2D5rvPhhenLeXNilSLMBloiexl/view?usp=sharing), [OCR V2.0 model](https://drive.google.com/file/d/1cq4PwPS2mXXRjOApst2i7n4G3mBSVqpI/view?usp=drive_link)
- Copy your images into the ./data/test/ folder.  
- Run one of the following scripts to recognize the images (Kindai V1.0 or V2.0):  
`python test_kindai_1.0.py`   
`python test_kindai_2.0.py`   
- The recognized text is written to ./data/result.xml and the result images to ./data/result/  
- You may have to check the path to the Japanese font in test.py so that the visualization results render correctly.  
    `fontPIL = '/usr/share/fonts/truetype/fonts-japanese-gothic.ttf' # japanese font`   
- Set --cuda to True to run on a GPU device and to False to run on a CPU device.  
- Use --canvas_size to set the image size for text line detection.  
 - An example result from our OCR system
 <img src="https://github.com/ducanh841988/Kindai-OCR/blob/master/data/result/res_k188701_021_39.jpg" width="700">
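
Most failed runs of the steps above come from a missing model, an empty input folder, or a wrong font path. The sketch below checks those three things before the test scripts are launched. It is not part of this repository: `preflight` and `IMAGE_SUFFIXES` are hypothetical names, the list of image extensions is an assumption, and the default font path is simply the one shown above.

```python
from pathlib import Path

# Assumption: these extensions cover the input images; adjust as needed.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}

# Font path copied from the example above; it differs between systems.
DEFAULT_FONT = "/usr/share/fonts/truetype/fonts-japanese-gothic.ttf"


def preflight(root=".", font_path=DEFAULT_FONT):
    """Return a list of human-readable problems; empty means ready to run."""
    root = Path(root)
    problems = []

    # The pretrained models are expected somewhere inside ./pretrain/.
    pretrain = root / "pretrain"
    if not pretrain.is_dir() or not any(p.is_file() for p in pretrain.iterdir()):
        problems.append("no model files found in {}".format(pretrain))

    # Input images are expected inside ./data/test/.
    test_dir = root / "data" / "test"
    images = []
    if test_dir.is_dir():
        images = [p for p in test_dir.iterdir()
                  if p.suffix.lower() in IMAGE_SUFFIXES]
    if not images:
        problems.append("no images found in {}".format(test_dir))

    # The font is only needed for drawing the result images.
    if not Path(font_path).is_file():
        problems.append("Japanese font not found at {}".format(font_path))

    return problems


if __name__ == "__main__":
    for problem in preflight():
        print("WARNING:", problem)
```

The check only confirms that files exist. It does not verify that the downloaded models are the right ones or that the font contains the required glyphs.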

 ## Citation
 If you find Kindai OCR useful in your research, please consider citing:   
 Anh Duc Le, Daichi Mochihashi, Katsuya Masuda, Hideki Mima, and Nam Tuan Ly. 2019. Recognition of Japanese historical text lines by an attention-based encoder-decoder and text line generation. In Proceedings of the 5th International Workshop on Historical Document Imaging and Processing (HIP ’19). Association for Computing Machinery, New York, NY, USA, 37–41. DOI:https://doi.org/10.1145/3352631.3352641   


 ## Acknowledgment

We thank the Center for Research and Development of Higher Education, The University of Tokyo, and the National Institute for Japanese Language and Linguistics for providing the Kindai datasets.

## Contact
Dr. Anh Duc Le, email: leducanh841988@gmail.com or anh@ism.ac.jp