Online-KHATT: An Open-Vocabulary Database for Arabic Online-Text Processing
Sabri A. Mahmoud1, *, Hamzah Luqman1, Baligh M. Al-Helali1, Galal BinMakhashen1, Mohammad Tanvir Parvez2
1 King Fahd University of Petroleum & Minerals, Dhahran 31261, Saudi Arabia
2 Qassim University, Qassim 51477, Saudi Arabia
An Arabic online text database called Online-KHATT is presented, which addresses the lack of a free benchmarking database of natural Arabic online text. The database consists of natural Arabic online text written without any constraints using a digital pen.
The main objective of this work is to build a comprehensive benchmarking database of online Arabic text. Part of this objective is the development of tools, techniques and procedures for online text collection, verification and transliteration. Additionally, we built a dataset of segmented online Arabic characters and ligatures with ground-truth labeling, and we present classification results for online Arabic characters using a DBN-based HMM.
The source text of Online-KHATT is the same as that of the unique paragraphs of the KHATT database, supplemented with additional resources to increase the coverage of the database. A 3-level verification procedure aligns the online text with its ground truth. The verified ground-truth database contains metadata that describes the online Arabic text at the line level in text, InkML and XML formats.
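As an illustration of how line-level InkML data of this kind might be read, the following minimal Python sketch extracts pen strokes from `<trace>` elements as defined by the W3C InkML standard. The sample document, the function name `read_traces`, and the assumption that each sample holds plain absolute X and Y values are hypothetical; the actual Online-KHATT files may use additional channels or annotations that this sketch does not handle.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the W3C InkML recommendation.
INKML_NS = "{http://www.w3.org/2003/InkML}"


def read_traces(inkml_text):
    """Return a list of strokes, each a list of (x, y) points.

    Assumption: every sample in a <trace> is written as plain absolute
    values, with X and Y as the first two channels.
    """
    root = ET.fromstring(inkml_text)
    strokes = []
    for trace in root.iter(INKML_NS + "trace"):
        points = []
        # Samples are comma-separated; channel values are space-separated.
        for sample in (trace.text or "").strip().split(","):
            values = sample.split()
            if len(values) >= 2:
                points.append((float(values[0]), float(values[1])))
        strokes.append(points)
    return strokes


# Hypothetical two-stroke document, used only for illustration.
sample = """<ink xmlns="http://www.w3.org/2003/InkML">
  <trace>10 20, 11 22, 13 25</trace>
  <trace>30 5, 31 6</trace>
</ink>"""

strokes = read_traces(sample)
print(len(strokes), "strokes;", len(strokes[0]), "points in the first stroke")
```

Each stroke corresponds to one pen-down to pen-up movement, which is the unit that online recognition and segmentation methods typically operate on.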
The database consists of 10,040 lines of Arabic text written by 623 writers using Android- and Windows-based devices. The text lines of the Online-KHATT database are randomly distributed into training, testing, and verification sets that contain 70%, 15% and 15% of the text lines of the database, respectively. We have segmented part of the collected data into characters along with their ground truths. We have developed tools for data collection (for devices with an electronic pen), verification and correction of ground truths, transliteration, and semi-automated segmentation of characters. In addition, we present experimental results of Arabic online character recognition using the Online-KHATT database.
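A random 70%/15%/15% partition of the 10,040 text lines can be sketched as follows. This is a minimal illustration only: the function name `split_lines`, the integer line identifiers, and the fixed seed are assumptions, and the sketch does not reproduce the actual assignment of lines to sets in the released database.

```python
import random


def split_lines(line_ids, percents=(70, 15, 15), seed=0):
    """Shuffle line identifiers and cut them into three disjoint sets."""
    ids = list(line_ids)
    random.Random(seed).shuffle(ids)  # seeded for a reproducible shuffle
    n = len(ids)
    n_train = n * percents[0] // 100
    n_test = n * percents[1] // 100
    train = ids[:n_train]
    test = ids[n_train:n_train + n_test]
    verification = ids[n_train + n_test:]  # remainder of the lines
    return train, test, verification


# Hypothetical identifiers 0..10039 standing in for the 10,040 text lines.
train, test, verification = split_lines(range(10040))
print(len(train), len(test), len(verification))  # 7028 1506 1506
```

With 10,040 lines, these proportions correspond to 7,028 training lines and 1,506 lines in each of the testing and verification sets.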
The Online-KHATT database can be used for Arabic online text recognition, writer identification and verification, pre-processing, and segmentation, among other tasks. In addition, researchers may use the segmented characters to test their segmentation algorithms for use in online text recognition or to train online text classifiers. The database will be made freely available to interested researchers at (http://onlinekhatt.ideas2serve.net/).
Keywords: Arabic online text database, Arabic online text recognition, Segmentation, Handwriting recognition, Online character recognition, HMM.
open-access license: This is an open access article distributed under the terms of the Creative Commons Attribution 4.0 International Public License (CC-BY 4.0), a copy of which is available at: (https://creativecommons.org/licenses/by/4.0/legalcode). This license permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
* Address correspondence to this author at the King Fahd University of Petroleum & Minerals, Sabri A. Mahmoud, Dhahran 31261, Saudi Arabia; Tel: 966554430980; E-mails: firstname.lastname@example.org , email@example.com