Arabic text classification using deep learning models

Abstract

Text classification, or categorization, is the process of automatically tagging a textual document with its most relevant labels or categories. When the number of labels is restricted to one, the task is single-label text categorization; the multi-label version is more challenging. For Arabic, both tasks (especially the latter) are made harder by the absence of large, freely available Arabic datasets that are rich and rational. We therefore introduce new rich and unbiased datasets for both single-label (SANAD) and multi-label (NADiA) Arabic text categorization. Both corpora are made freely available to the Arabic computational linguistics research community. Further, we present an extensive comparison of several deep learning (DL) models for Arabic text categorization in order to evaluate their effectiveness on SANAD and NADiA. A distinguishing characteristic of our work, compared with existing approaches, is that it does not require a pre-processing phase and is fully based on deep learning models. In addition, we study the impact of using word2vec embedding models to improve the performance of the classification tasks. Our experimental results show solid performance for all models on the SANAD corpus, with a minimum accuracy of 91.18%, achieved by convolutional-GRU, and a top accuracy of 96.94%, achieved by attention-GRU. On NADiA, attention-GRU achieved the highest overall accuracy, 88.68%, for a maximum subset of 10 categories on the “Masrawy” dataset.

Keywords: Arabic text classification/categorization, Single-label text categorization, Multi-label text categorization, Word embedding, Deep learning, SANAD, NADiA

Article history: Received 4 April 2019, Revised 5 September 2019, Accepted 6 September 2019, Available online 18 September 2019, Version of Record 18 September 2019.

Official paper URL: https://doi.org/10.1016/j.ipm.2019.102121