Update 2: OCR & CAT of Classical Arabic works

Posted on Fri 10 November 2023 in Language

بِسْمِ ٱللَّٰهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ

In my previous articles here, here and here, I discussed various aspects related to translating and digitizing classical Arabic books.

This article is an update that presents a couple of new and evolving ideas I have had since writing those articles.

Tools update

As tooling has been central to my previous discussions, it is worth mentioning some new updates on that front.

Although my view of the tool Weblate has wavered in the past, I have found it adequate for translation work. It offers Git-like features and can back up to Git repositories.

In the OCR field, a tool called eScriptorium is currently, in my subjective opinion, the best at scanning Arabic texts. It is an Open Source project built on top of another Open Source project called Kraken.

The models to be used with eScriptorium are available here:

  • Printed Urdu Base Model Trained on the OpenITI Corpus

  • Printed Persian Base Model Trained on the OpenITI Corpus

  • Printed Ottoman Base Model Trained on the OpenITI Corpus

  • Printed Arabic Base Model Trained on the OpenITI Corpus

  • Printed Arabic-Script Base Model Trained on the OpenITI Corpus (here is a quote regarding the difference between this model and the Arabic model above: "The former has been trained only on Arabic language prints, the latter is trained on multiple languages that all use the Arabic script (Arabic, Persian, Urdu, Ottoman).")
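As a sketch, the Kraken engine underneath eScriptorium can also be driven from the command line; the helper below merely assembles a plausible invocation (baseline segmentation followed by recognition with one of the models above). The flags and the model filename here are my assumptions — check the Kraken documentation for your installed version.

```python
# Sketch: assemble a Kraken CLI call for baseline segmentation + OCR.
# The flags and the model filename are assumptions; verify them against
# the documentation of your installed Kraken version.
def kraken_ocr_command(image_path, output_path, model_path):
    return [
        "kraken",
        "-i", image_path, output_path,  # input image / output text pair
        "segment", "-bl",               # baseline segmentation
        "ocr", "-m", model_path,        # recognition with a trained model
    ]
```

The resulting list can be passed to `subprocess.run` on a machine where Kraken is installed.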

The power of large language models cannot be overstated. Tools like GPT (especially GPT-4) provide very accurate translations, even of niche subjects like Usul ul-Fiqh or Uloom ul-Hadith. GPT-based translation tools already exist, such as ebook-GPT-translator.
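To make this concrete, here is a minimal sketch of a translation request in the OpenAI chat format. The model name and the system prompt are illustrative assumptions of mine, not taken from ebook-GPT-translator itself.

```python
# Sketch: build a chat-format request asking a GPT model for a near-literal
# translation of a classical Arabic passage. The model name and the prompt
# wording are assumptions for illustration.
def build_translation_request(arabic_text, model="gpt-4"):
    return {
        "model": model,
        "messages": [
            {
                "role": "system",
                "content": (
                    "Translate the following classical Arabic text into "
                    "English as literally as possible while keeping the "
                    "English natural."
                ),
            },
            {"role": "user", "content": arabic_text},
        ],
    }
```

The returned dictionary is what would be sent to a chat-completions endpoint by a client library.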

It should be noted that, for true accuracy, machine-translated text has to be proofread by an expert.

Two other questions arise as well:

  1. How literal should one be in the translations?

  2. And are translations even necessary?

The philosophical side to translations

To answer the latter question: yes, translations are necessary. As an Arabic learner, however, I believe that translations of introductory texts should be the primary focus (by 'introductory' I do not mean primers; primers offer very little textual substance, and translations of most primers already exist). Beyond that, I believe that translations should be as close to literal as possible while retaining natural language. The reasons for this are:

  • It preserves the original Arabic meaning

  • It saves on dictionary lookup time

  • It turns every translated book into a dual-purpose book (for learning Arabic and for learning the underlying science of the book itself)

Partial translations

As students progress towards intermediate books, partial translations may be more effective for translators, given the sheer volume of texts that could then be covered. The key flaw in this idea is its subjective nature: it is difficult for a translator to determine which words in a given book are easy and which are hard, and to gauge the learning levels of the audience.

If this option is chosen, translators can focus more on the meanings and morphology of certain words. Arabic dictionary definitions can also be included, so that students read word definitions as the Arabs themselves understood them.

More effective than translating?

While there is some research on how many diacritics to use (full or partial), it is undeniable that adding diacritics to any religious text enables better reading and clears up ambiguities in the text.

In Arabic natural language processing (NLP), diacritization is not a well-explored area. Objectively, I can say that diacritization is challenging. Subjectively, I would say that because the bulk of Arabic NLP researchers are Arabs, they may view diacritics as an unnecessary crutch for reading Arabic.

The current state-of-the-art (SoTA) neural network model for adding diacritics to Arabic texts (trained on the Tashkeela corpus) is the CBHG model.

The GPT models (specifically GPT-4) show enormous promise regarding diacritics. When I ran a rudimentary test using Shakkelha and GPT-4, I found that both models were equivalent in their predictions and accuracy. The only error I picked up in both models was a failure to predict passive verbs.

I will soon be publishing an article in which I use old data from Maktaba Shamela (7,000 books) to analyze the percentage of diacritics within each book. The books were already split into categories; I filtered them further by extracting only the books with greater than 50% diacritics.
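The measurement itself is straightforward. The sketch below counts Arabic diacritical marks (the harakat block, U+064B–U+0652) against Arabic letters; the exact ratio definition and the 50% threshold are my own choices for illustration, not necessarily those of the forthcoming article.

```python
# Sketch: estimate how heavily diacritized an Arabic text is.
# Harakat (fathatan .. sukun) occupy U+064B-U+0652; Arabic letters
# occupy U+0621-U+064A. The ratio here is marks per letter.
HARAKAT = {chr(c) for c in range(0x064B, 0x0653)}

def diacritic_ratio(text):
    letters = sum(1 for c in text if "\u0621" <= c <= "\u064A")
    marks = sum(1 for c in text if c in HARAKAT)
    return marks / letters if letters else 0.0

def is_heavily_diacritized(text, threshold=0.5):
    # Assumed cutoff: more than half the letters carry a mark.
    return diacritic_ratio(text) > threshold
```

For example, the fully vowelled word كِتَابٌ has three marks on four letters, giving a ratio of 0.75.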

The AI (LLM) field

A lot of activity is going on in the LLM (popularly known as 'AI') research field. If Open Source LLM research continues to advance at its current pace, it may become possible to fully diacritise and translate many classical Arabic texts with a moderate GPU at home. The only thing left would then be for annotators to check for errors.

Waiting on the promise of others' research is not sensible, though. Researchers (even amateurs like myself) should continue exploring alternative paths to achieve their goals and advance their fields of knowledge.

Manuscript Research/Verification

In my previous article here, I discussed the issue of circular digital book verification. I was unaware that concerns like mine about digitization had already been expressed by past scholars regarding publishers and the books they published. I have since discovered that this field is called Tahqeeq.

Mufti Husain Kadodia wrote an excellent article about this field in recent times.

The article introduces several keywords of this field along with their definitions.

The historicity of book publication is perhaps as important as the contents of the books themselves. Greed is not a new phenomenon; it existed in the past too. The astute reader will also pick up on the cynical irony of publishers of Islamic books acting shady (is nothing sacred?).

I am left in a state of mild bewilderment because, although a book may be mutawaatir in print now, this is not how books were transmitted previously. Although the printing press is not a new technology (it is over 600 years old), it only reached the Muslim world roughly within the past 300 years. It was the job of scribes (usually trained scholars) to copy Islamic books verbatim from manuscript to manuscript.

End

While the last section may leave the reader feeling bleak, I believe the Mufti was only expressing the bad side of the industry. The Mufti does sell Islamic books himself (though his Telegram channel no longer seems accessible). It may be possible that he can provide a full picture of this field from its early history until now.

Communities to join:

I have yet to find an Islamic studies community that is moderate/tolerant, focused on research, and neither sectarian nor polemical.


If you don't know how to use RSS and want email updates on my new content, consider joining my Newsletter.

The original content of this blog is a Waqf solely for the Pleasure of Allah. You are hereby granted full permission to copy, download, distribute, publish and share this content without modification under condition that full attribution is given to this author by creating a link either above or below the content that links back to the original source of the content. For any questions or ambiguity, you are requested to contact me via email for clarification.