A Guide to Arabic Natural Language Processing
Arabic natural language processing (NLP) is an advanced application of AI and machine learning used to understand Arabic dialects. Our focus in this article is on applying NLP in Arabic to extract semantic insights from all sources of data, be it text, audio, or video. We talk about how, at Repustate, we use NLP for sentiment analysis. We also showcase a real-world example of how we successfully provided a highly customized Arabic sentiment analysis solution for a client in Saudi Arabia.
Challenges in Arabic natural language processing
Arabic is a very complex language, with many dialects. Arabic morphology has two fundamentals: derivational morphology, (word formation) and inflectional morphology (how words interact with the syntax). Morphological features have integrated dependencies on several linguistic factors like affixes, root-based structures, and vowels. Additionally, Arabic’s high ability for new word formations with rich semantic meanings - all pose quite a challenge for Arabic natural language processing.
To explain this complexity, let’s look at an example. In Arabic, derivational morphology is the principal factor behind how a word is transformed from its root. For instance, from the root word ”اقرأ“ (read), different phrases such as “هو يقرأ” (he is reading), or “هي تقرأ” (she is reading) can be formed. Semantically related words generated from this root can also be diverse, such as ” مكتبة” (library) or “جريدة” (newspaper). This richness in vocabulary and nuance is what makes Arabic a category IV language in parameters of complexity and difficulty to learn. This is also why a regular sentiment analysis model that uses translations cannot give accurate results for sentiment analysis of Arabic tweets, reviews, or news.
How is Arabic NLP used for sentiment analysis?
For accurate sentiment analysis of Arabic text, the sentiment mining tool needs to read the data directly in Arabic. For this, it needs to have an algorithm that has been trained on Arabic datasets (to hone the machine learning model) using Arabic NLP. If the tool resorts to translating text to English or any other language first, it will give wrong sentiment scores, translating to a company investing in business and operational strategies based on incorrect insights.
Repustate’s Arabic sentiment analysis solution reads and understands the Arabic language in its native format. It supports standard Arabic and can be trained on three Arabic dialects - Gulf Peninsular, Levantine, and Egyptian. Our platform has its own Arabic lemmatizer, part-of-speech tagger, and sentiment models. This is how it conducts Arabic NLP tasks to understand the intent behind texts and return a corresponding classification and sentiment score.
Being highly customizable and scalable, our sentiment analysis API can be trained with a machine learning dataset based on your particular industry, business, or region. This is what enables it to conduct sentiment analysis in Arabic tweets, social media videos, patient voice data, consumer reviews, or any other source that is connected to you.
Here is an example of NLP in Arabic for customer reviews that our Arabic Sentiment analysis model processed for a client in the Middle East.
Figure 1: Arabic texts analyzed by Repustate’s Arabic sentiment analysis model.
Arabic machine learning datasets for sentiment analysis in Arabic tweets
Arabic training datasets are a crucial part of machine learning and Arabic natural language processing. These datasets are used to train machine-learning algorithms and carry out NLP in Arabic such as text analytics, text classification, or product categorizations. Repustate’s Arabic sentiment analysis tool is based on a rich and varied corpus and training datasets from multiple disciplines and sources. This is what makes it so high in precision and accuracy.
Arabic training datasets are widely available in audio, text, and image formats. Some of them include Arabic Learner Corpus (ALC), Quranic Arabic Corpus, Arabic Handwritten Characters Dataset, French-Arabic Newspapers and more. Let’s get to know these sources better.
Arabic Text Datasets
Arabic 100K Reviews : This annotated dataset for Arabic NLP is painfully collected and created from 99,999 hotels, books, movies, and product reviews. It is categorised in three classes - Positive (4 - 5/5), Negative (1 - 2/5), and Mixed (3/5).
Arabic Sentiment Twitter Corpus : An excellent resource, this dataset is apt for deep learning in Arabic sentiment analysis. It contains more than 58,000 Arabic tweets, already annotated in positive and negative labels.
Jamalon Arabic Book Dataset : This is a corpus of more than 8000 books in Arabic and English that can be used to train models for Arabic natural language processing. Arabizi Text : This dataset has mixed English and Arabizi text that can be used to train a model for automatic detection of texts in mixed English and Arabizi languages.
Quranic Arabic Corpus : Here is an amazing collection that contains an annotated corpus comprising Arabic syntax, grammar and word structure for every word in the Quran.
Arabic Poetry Dataset : This is a training Arabic NLP dataset that contains more than 58,000 poems including metadata such as the poet, topic, and genre.
Corpus of Contemporary Arabic (CCA) : The CCA contains 1 Million annotated Arabic words and is apt for sentiment models meant for linguists, Arabic language teachers, and foreign learners of Arabic.
Arabic Learner Corpus (ALC) : Another great dataset for Arabic natural language processing, the Arabic Learner ALC has written and verbal materials produced by 900+ students in Saudi Arabia.
Arabic BERT corpus : This is a database collected from Arabic Wikipedia and 2000 Arabic books based on the Nvidia BERT methodology. It is perfect for training a multilingual or monolingual Arabic BERT language model.
French-Arabic Newspapers : Meant especially for a multilingual Arabic and French language Arabic NLP model, this dataset contains a collection of 10,000 Arabic words with French translations.
OntoNotes : This dataset is a collection of texts in English, Chinese, and Arabic from a variety of sources such as Arabic news, customer care, telephonic conversations, popular talk shows, etc.
Arabic Audio Datasets
Arabic Natural Audio Dataset : This source contains audio from online Arabic talk-shows. All the audio is automatically divided into 1-sec speech units.
RATS language identification : This dataset contains more than 5K hours of Arabic, Urdu, Farsi, Pashto, and Dari conversational audio.
Arabic Broadcast News Transcripts : Here you will find transcriptions of more than 35 hours of Arabic broadcast news speech that can be used to train models on Arabic natural language processing tasks.
Arabic Handwritten Character Datasets for Optical Character Recognition (OCR)
Arabic Handwritten Digits : This is a corpus handwritten by 700 writers and has 60,000 training images and 10,000 test images for Arabic NLP, with each writer having written every digit ten times.
Arabic Handwritten Characters Dataset : This dataset contains 16,800 characters handwritten by 60 people, aged between 19 to 40 years for NLP in Arabic. Only 10% of the writers were left handed. The images have a resolution of 300 dpi.
Yarmouk Arabic OCR : This Arabic dataset comprises 8,994 images from 4,587 articles written in the Arabic language on Wikipedia.
Examples of Non-Arabic Datasets:
You can find similar freely available datasets for English and non-English regions too. Amazon Reviews, Wikipedia Links Data, Blogger Corpus, and Google Books Ngrams are great for reviews and social sentiment mining. While, Kaggle and the UCI Machine Learning Repository can give you niche datasets like Netflix movies and shows for prediction datasets, the UCI Repository can give you user-contributed datasets from a wider variety. You can also get training datasets for computer vision applications like COIL100, Labelme, Stanford Dogs Dataset, ImageNet, VisualQA, and ofcourse, Google’s own Google’s Open Images.
What is the process of doing sentiment analysis in Arabic tweets?
First, a massive corpus of Arabic data is collected. The more varied this collection of text, the better. This corpus is cleaned, processed, and then used to train the sentiment analysis model. The resulting data from Arabic NLP is tested, and the model trained again until it gives the highest accuracy scores possible. These insights are then presented in a sentiment analysis dashboard for ease of understanding. As time progresses and the model processes more and more data, it becomes smarter and more intuitive, giving even better results.
Below we detail the steps of how sentiment analysis is conducted for Arabic text:
Create a part-of-speech tagger: Each Arabic word is classified at a grammatical level so it can identify conjunctions, subordinate clauses, prepositional, and noun phrases. This helps the model understand the text’s true meaning.
Lemmatization: This step is very important, as in lemmatization, the rules of conjugating nouns and verbs based on gender, tense, etc. are applied in the model. This helps the tool determine the root of a word. For example, “reading” and “reader”, are based on the root word “read”.
Build prior polarity: This determines the positive and negative context of the word and calculates the intensity of the polarity. Such as excellent (+1), good (+0.5), average(0), and poor(-0.5).
Determine grammatical constructs: In this step, nuanced grammatical constructs like negations and amplifiers are determined so the model can understand the sentiment score. It is essential that a sentiment analysis model reads Arabic natively, because in some languages negations and amplifiers come after a phrase, while in some they come afterwards.
Feed sentiment scores: When all these steps using Arabic NLP come together, the sentiment scores are fed to machine learning models. It can get as granular as you want based on document, topic or aspect-based sentiment analysis.
Through a combination of machine learning and natural language processing, the sentiment mining tool thus conducts sentiment analysis in Arabic tweets, voice of the customer data, employee emotions, news, etc.
Discover More: Sentiment Analysis in Arabic Tweets
Real world example of Arabic NLP for business
One of our clients in the Middle East is Jeddah-based specialized healthcare consultancy firm, Health-Links. They were looking for an Arabic sentiment analysis solution that could analyze their continually growing patient voice data, accumulating from 12 million surveys conducted annually. There was no Arabic natural language processing solution in the market that could fit their needs as granulary and as precisely as they needed. This had lead them to perform manual analysis of the data. However, this was giving them inaccurate results due to human error and bias. Not to mention, how time-consuming and costly the task was.
The client partners with the Saudi Ministry of Health, policy makers and stake-holders in the healthcare field in the Gulf region to make lasting improvements in the overall quality of patient care. It helps partners optimize their processes by implementing validated strategies and operation practices, and improve individual and organizational performance. Hence, it was of utmost necessity for the client to have the right Arabic sentiment analysis tool with the highest accuracy possible.
Repustate’s Arabic sentiment analysis model helped bring insights from the patient voice data to the forefront. Repustate data engineers worked with Health-Links to classify hospital patient feedback by sentiment as well as into multiple semantic categories. The model was specially customized to fit the requirements of the Middle Eastern healthcare industry, specifically the Saudi Complaints Taxonomy.
The model presently analyses all the Arabic text natively using Arabic natural language processing. It presents patterns and trends from historical data to Health-Links for a more holistic view. It helps the client enable healthcare organizations in Saudi Arabia and the Gulf region forge new paths and affect the change needed for better healthcare.