This article examines the challenges faced during OCR research, including limited datasets, transcription difficulties, non-standard spacing, multi-column extraction issues, and the inability to recognize mathematical equations. It also outlines future directions, such as dataset expansion, post-processing for spacing alignment, improved handling of multi-column pages, and more accurate equation recognition.This article examines the challenges faced during OCR research, including limited datasets, transcription difficulties, non-standard spacing, multi-column extraction issues, and the inability to recognize mathematical equations. It also outlines future directions, such as dataset expansion, post-processing for spacing alignment, improved handling of multi-column pages, and more accurate equation recognition.

Key Challenges in OCR Research and Future Directions

9 min read

Abstract and 1. Introduction

1.1 Printing Press in Iraq and Iraqi Kurdistan

1.2 Challenges in Historical Documents

1.3 Kurdish Language

  1. Related work and 2.1 Arabic/Persian

    2.2 Chinese/Japanese and 2.3 Coptic

    2.4 Greek

    2.5 Latin

    2.6 Tamizhi

  2. Method and 3.1 Data Collection

    3.2 Data Preparation and 3.3 Preprocessing

    3.4 Environment Setup, 3.5 Dataset Preparation, and 3.6 Evaluation

  3. Experiments, Results, and Discussion and 4.1 Processed Data

    4.2 Dataset and 4.3 Experiments

    4.4 Results and Evaluation

    4.5 Discussion

  4. Conclusion

    5.1 Challenges and Limitations

    Online Resources, Acknowledgments, and References

5.1 Challenges and Limitations

Following is the list of main challenges and limitations we faced during this research:

\ • The limited availability of resources posed significant challenges during our data collection

\ Figure 21: Result of Ocreval tool

\ process. Converting the collected data into a digital format proved an additional obstacle. Manual transcription of the documents was difficult due to unclear text, non-standard spacing, and unique vocabulary influenced by Arabic letters and terminologies. We attempted to create the dataset synthetically, crafting a small tool that assembled letters from a given collection of character images. Regrettably, the outcomes were unsatisfactory, and given our time constraints, we discontinued this approach.

\ • The non-standard spacing between the words and characters was challenging for transcribing the documents and needed to be more apparent for the model. The model interpreted the excessive gaps between characters or words as space characters. In contrast, in other cases where there should have been a space character, the minimal spacing went unnoticed by the model.

\ • Extracting text from multi-column pages was another limitation of the model.

\ • Recognizing mathematical equations was another limitation of our model.

\ Considering the challenges and limitations and based on the discussion on the results, we are interested in exploring several areas in the future as follows:

\ • Expanding the dataset is an aspect that requires further attention and effort.

\ • An observed issue pertained to the misalignment of spaces between words and characters. To address this, a post-processing phase is suggested for rectifying the misaligned space characters. • Ocr’ing the multi-column pages property is another area requiring more effort.

\ • Extracting mathematical equations accurately.

\

Online Resources

The dataset is partially publicly available for non-commercial use under the CC BY-NC-SA 4.0 license at https://github.com/KurdishBLARK/OCR4OldTextsInSorani.

Acknowledgments

We would like to extend our gratitude to the Zheen Center for Documentation and Research in Sulaymaniyah, Kurdistan Region, Iraq for their generous support in providing us with digital copies of certain historical publications.

References

Ahmadi, S., Hassani, H., and Jaff, D. Q. (2022). Leveraging multilingual news websites for building a Kurdish parallel corpus. Transactions on Asian and Low-Resource Language Information Processing, 21(5):1–11.

\ Antonacopoulos, A., Karatzas, D., Krawczyk, H., and Wiszniewski, B. (2004). The lifecycle of a digital historical document: structure and content. In Proceedings of the 2004 ACM Symposium on Document Engineering, pages 147–154.

\ Ataer, E. and Duygulu, P. (2007). Matching Ottoman words: an image retrieval approach to historical document indexing. In Proceedings of the 6th ACM International Conference on Image and Video Retrieval, pages 341–347.

\ Aula, L. (2021). Improvement of optical character recognition on scanned historical documents using image processing.

\ Bukhari, S. S., Kadi, A., Jouneh, M. A., Mir, F. M., and Dengel, A. (2017). anyOCR: An opensource OCR system for historical archives. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), volume 1, pages 305–310. IEEE.

\ Bulert, K., Miyagawa, S., and B¨”uchler, M. (2017). Optical character recognition with a neural network model for printed coptic texts. In Digital Humanities 2017 Conference Abstracts, pages 657–9.

\ Do˘gru, M. (2016). Ottoman-Turkish Optical Character Recognition and Latin Transcription. Ph.D. thesis, Ankara Yıldırım Beyazıt ¨”Universitesi Fen Bilimleri Enstit¨”us¨”u.

\ Dolek, I. and Kurt, A. (2021). Ottoman OCR: Printed Naskh font. In 2021 International Conference on INnovations in Intelligent SysTems and Applications (INISTA), pages 1–5. IEEE.

\ Feng, J., Peng, L., and Lebourgeois, F. (2015). Gaussian process style transfer mapping for historical chinese character recognition. In Document Recognition and Retrieval XXII, volume 9402, pages 104– 115. SPIE.

\ Gilbey, J. D. and Sch¨”onlieb, C.-B. (2021). An end-to-end optical character recognition approach for ultra-low-resolution printed text images. arXiv preprint arXiv:2105.04515.

\ Google. (2023a). How to train lstm/neural net tesseract. Accessed on 30-04-2023.

\ Google. (2023b). Improving the quality of the output. Accessed on 15-04-2023.

\ Hassani, H., Medjedovic, D., et al. (2016). Automatic Kurdish dialects identification. Computer Science & Information Technology, 6(2):61–78.

\ Hassanpour, A. (1992). Nationalism and language in Kurdistan, 1918-1985. San Francisco: Mellen Research University Press.

\ Idrees, S. and Hassani, H. (2021). Exploiting Script Similarities to Compensate for the Large Amount of Data in Training Tesseract LSTM: Towards Kurdish OCR. Applied Sciences, 11(20):9752.

\ Idrees, S. (2020). Improving Document Processing in the Public Organizations of the Kurdistan Region of Iraq (KRI) Using an Optical Character Recognition (OCR) System. Master Thesis.

\ Kilic, N., Gorgel, P., Ucan, O. N., and Kala, A. (2008). Multifont ottoman character recognition using support vector machine. In 2008 3rd international symposium on communications, control and signal processing, pages 328–333. IEEE.

\ Koistinen, M., Kettunen, K., and Kervinen, J. (2017). How to improve optical character recognition of historical finnish newspapers using open source tesseract ocr engine. Proc. of LTC, pages 279–283.

\ K¨u¸c¨uk¸sahin, N. (2019). Design of an Offline Ottoman Character Recognition System for Translating Printed Documents to Modern Turkish. Ph.D. thesis, Izmir Institute of Technology (Turkey).

\ Li, B., Peng, L., and Ji, J. (2014). Historical chinese character recognition method based on style transfer mapping. In 2014 11th IAPR International Workshop on Document Analysis Systems, pages 96–100. IEEE.

\ Ly, N. T., Nguyen, C. T., and Nakagawa, M. (2020). An attention-based row-column encoder-decoder model for text recognition in japanese historical documents. Pattern Recognition Letters, 136:134–141.

\ Munivel, M. and Enigo, V. (2022). Optical character recognition for printed tamizhi documents using deep neural networks. DESIDOC Journal of Library & Information Technology, 42(4).

\ Nashwan, F. M., Rashwan, M. A., Al-Barhamtoshy, H. M., Abdou, S. M., and Moussa, A. M. (2017). A holistic technique for an arabic ocr system. Journal of Imaging, 4(1):6.

\ Nguyen, H. T., Ly, N. T., Nguyen, K. C., Nguyen, C. T., and Nakagawa, M. (2017). Attempts to recognize anomalously deformed kana in japanese historical documents. In Proceedings of the 4th International Workshop on Historical Document Imaging and Processing, pages 31–36.

\ Nunamaker, B., Bukhari, S. S., Borth, D., and Dengel, A. (2016). A tesseract-based ocr framework for historical documents lacking ground-truth text. In 2016 IEEE International Conference on Image Processing (ICIP), pages 3269–3273. IEEE.

\ Ozturk, A., Gunes, S., and Ozbay, Y. (2000). Multifont ottoman character recognition. In ICECS 2000. 7th IEEE International Conference on Electronics, Circuits and Systems (Cat. No. 00EX445), volume 2, pages 945–949. IEEE.

\ Poncelas, A., Aboomar, M., Buts, J., Hadley, J., and Way, A. (2020). A tool for facilitating ocr postediting in historical documents. arXiv preprint arXiv:2004.11471.

\ Poulos, M., Kokkonas, Y., Papavlasopoulos, S., and Bokos, G. (2010). An ocr system for greek printed early books based on computational geometry algorithms.

\ Qania, M. (2012). Le Barey Ragayandinewe. Chwarchra.

\ Reul, C., Springmann, U., Wick, C., and Puppe, F. (2018). State of the art optical character recognition of 19th century fraktur scripts using open source engines. arXiv preprint arXiv:1810.03436.

\ Romanello, M., Najem-Meyer, S., and Robertson, B. (2021). Optical character recognition of 19th century classical commentaries: the current state of affairs. In The 6th International Workshop on Historical Document Imaging and Processing, pages 1–6.

\ Shafii, M. (2014). Optical character recognition of printed persian/arabic documents.

\ Sihang, W., Jiapeng, W., Weihong, M., and Lianwen, J. (2020). Precise detection of chinese characters in historical documents with deep reinforcement learning. Pattern Recognition, 107:107503.

\ Simistira, F., Ul-Hassan, A., Papavassiliou, V., Gatos, B., Katsouros, V., and Liwicki, M. (2015). Recognition of historical greek polytonic scripts using lstm networks. In 2015 13th International Conference on Document Analysis and Recognition (ICDAR), pages 766–770. IEEE.

\ Skelbye, M. B. and Dann´ells, D. (2021). Ocr processing of swedish historical newspapers using deep hybrid cnn–lstm networks. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021), pages 190–198.

\ Springmann, U., Fink, F., and Schulz, K. U. (2016). Automatic quality evaluation and (semi-) automatic improvement of ocr models for historical printings. arXiv preprint arXiv:1606.05157.

\ Springmann, U., Reul, C., Dipper, S., and Baiter, J. (2018). Ground truth for training ocr engines on historical documents in german fraktur and early modern latin. arXiv preprint arXiv:1809.05501.

\ Stahlberg, F. and Vogel, S. (2016). Qatip–an optical character recognition system for arabic heritage collections in libraries. In 2016 12th IAPR Workshop on Document Analysis Systems (DAS), pages 168–173. IEEE.

\ Sulaiman, A., Omar, K., and Nasrudin, M. F. (2019). Degraded historical document binarization: A review on issues, challenges, techniques, and future directions. Journal of Imaging, 5(4):48.

\ Vamvakas, G., Gatos, B., Stamatopoulos, N., and Perantonis, S. J. (2008). A complete optical character recognition methodology for historical documents. In 2008 The Eighth IAPR International Workshop on Document Analysis Systems, pages 525–532. IEEE.

\ Yang, H., Jin, L., and Sun, J. (2018). Recognition of chinese text in historical documents with pagelevel annotations. In 2018 16th International Conference on Frontiers in Handwriting Recognition (ICFHR), pages 199–204. IEEE.

\ Yousefi, M. R., Soheili, M. R., Breuel, T. M., Kabir, E., and Stricker, D. (2015). Binarization-free OCR for historical documents using lstm networks. In 2015 13th International Conference on Document Analysis and Recognition (ICDAR), pages 1121–1125. IEEE.

\

:::info Authors:

(1) Blnd Yaseen, University of Kurdistan Howler, Kurdistan Region - Iraq (blnd.yaseen@ukh.edu.krd);

(2) Hossein Hassani University of Kurdistan Howler Kurdistan Region - Iraq (hosseinh@ukh.edu.krd).

:::


:::info This paper is available on arxiv under ATTRIBUTION-NONCOMMERCIAL-NODERIVS 4.0 INTERNATIONAL license.

:::

\

Market Opportunity
Moonveil Logo
Moonveil Price(MORE)
$0.0007286
$0.0007286$0.0007286
-3.26%
USD
Moonveil (MORE) Live Price Chart
Disclaimer: The articles reposted on this site are sourced from public platforms and are provided for informational purposes only. They do not necessarily reflect the views of MEXC. All rights remain with the original authors. If you believe any content infringes on third-party rights, please contact service@support.mexc.com for removal. MEXC makes no guarantees regarding the accuracy, completeness, or timeliness of the content and is not responsible for any actions taken based on the information provided. The content does not constitute financial, legal, or other professional advice, nor should it be considered a recommendation or endorsement by MEXC.

You May Also Like

Crypto News: Donald Trump-Aligned Fed Governor To Speed Up Fed Rate Cuts?

Crypto News: Donald Trump-Aligned Fed Governor To Speed Up Fed Rate Cuts?

The post Crypto News: Donald Trump-Aligned Fed Governor To Speed Up Fed Rate Cuts? appeared on BitcoinEthereumNews.com. In recent crypto news, Stephen Miran swore in as the latest Federal Reserve governor on September 16, 2025, slipping into the board’s last open spot right before the Federal Open Market Committee kicks off its two-day rate discussion. Traders are betting heavily on a 25-basis-point trim, which would bring the federal funds rate down to 4.00%-4.25%, based on CME FedWatch Tool figures from September 15, 2025. Miran, who’s been Trump’s top economic advisor and a supporter of his trade ideas, joins a seven-member board where just three governors come from Democratic picks, according to the Fed’s records updated that same day. Crypto News: Miran’s Background and Quick Path to Confirmation The Senate greenlit Miran on September 15, 2025, with a tight 48-47 vote, following his nomination on September 2, 2025, as per a recent crypto news update. His stint runs only until January 31, 2026, stepping in for Adriana D. Kugler, who stepped down in August 2025 for reasons not made public. Miran earned his economics Ph.D. from Harvard and worked at the Treasury back in Trump’s first go-around. Afterward, he moved to Hudson Bay Capital Management as an economist, then looped back to the White House in December 2024 to head the Council of Economic Advisers. There, he helped craft Trump’s “reciprocal tariffs” approach, aimed at fixing trade gaps with China and the EU. He wouldn’t quit his White House gig, which irked Senator Elizabeth Warren at the September 7, 2025, confirmation hearings. That limited time frame means Miran gets to cast a vote straight away at the FOMC session starting September 16, 2025. The full board now features Chair Jerome H. Powell (Trump pick, term ends 2026), Vice Chair Philip N. Jefferson (Biden, to 2036), and folks like Lisa D. Cook (Biden, to 2028) and Michael S. Barr…
Share
BitcoinEthereumNews2025/09/18 03:14
XRPL Validator Reveals Why He Just Vetoed New Amendment

XRPL Validator Reveals Why He Just Vetoed New Amendment

Vet has explained that he has decided to veto the Token Escrow amendment to prevent breaking things
Share
Coinstats2025/09/18 00:28
US Senate Democrats plan to restart discussions on a cryptocurrency market structure bill later today.

US Senate Democrats plan to restart discussions on a cryptocurrency market structure bill later today.

PANews reported on February 4th that, according to Crypto In America, US Senate Democrats plan to reconvene on the afternoon of February 4th to discuss legislation
Share
PANews2026/02/04 23:12