Enhancement of a multi-dialectal sentiment analysis system by the detection of the implied sarcastic features

Abstract

Sentiment analysis is an NLP task that has attracted many researchers across various languages and, more recently, in Arabic. The task poses several challenges, among them sarcasm detection. In this article, we exploit sarcastic characteristics to improve the accuracy of a sentiment analysis system. Sarcasm is difficult to detect because it is implicit and is characterized by positive words appearing in a negative context. We therefore extracted a variety of features that capture context incongruity and the opposition between objective and subjective sentences. Offensive language and hate speech are expressions that hurt others. Detecting offensive language relies on identifying offensive terms, which are strongly negative and therefore help to detect negative expressions. Accordingly, we constructed sentiment, offensive and sarcastic lexicons both manually and automatically, and collected existing ones. Likewise, we collected several corpora, both ironic (sarcastic, offensive) and sentimental (positive, negative). Because sarcasm is a major challenge for sentiment analysis, we built a balanced corpus containing positive and negative (sarcastic, offensive) tweets. Since the analyzed corpus is multi-dialectal, we used a cross-dialect lexicon whose entries retain their meaning from one dialect to another. Beyond the characteristics common to Arabic dialects, the classification was further enhanced by detecting the specificities of some dialects that use negation clitics as well as negation words to negate a term. The experiments show that enhancing the sentiment analysis system with sarcastic features improved the results by 8%, reaching an accuracy of 84.17% with a classical machine learning approach and 80.36% with a deep learning approach.
The classical machine learning approach was subsequently improved by expanding the BOW lexicon and reducing the feature vector, reaching an accuracy of 89.24%. The method extends to other languages because the built model is language independent: having the corresponding resources is enough to apply the system to another language.