RIST

Revue d'Information Scientifique et Technique

Transformers and Ensemble methods: A solution for Hate Speech Detection in Arabic language

This paper describes our participation in the shared task of hate speech detection, which is one of the subtasks of the CERIST NLP Challenge 2022. Our experiments evaluate the performance of six transformer models and their combination using two ensemble approaches. The best results on the training set, in a five-fold cross-validation scenario, were obtained with the ensemble approach based on majority voting. Evaluating this approach on the test set resulted in an F1-score of 0.60 and an accuracy of 0.86.

Authors: Angel Felipe Magnossão de Paula, Imene Bensalem, Paolo Rosso, Wajdi Zaghouani

Download: PDF
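
As an illustration of the majority-vote ensemble mentioned in the abstract above, the following is a minimal sketch, not the authors' implementation. The function name `majority_vote`, the input layout (one list of predicted labels per model), and the tie-breaking rule (the earliest model's vote among the tied labels wins) are all assumptions made for this example.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model label lists into one label per sample.

    `predictions` holds one list of labels per model; all lists must have
    the same length. Ties are broken in favour of the earliest model whose
    vote is among the tied labels (an assumption of this sketch).
    """
    if not predictions:
        raise ValueError("need predictions from at least one model")
    n_samples = len(predictions[0])
    if any(len(p) != n_samples for p in predictions):
        raise ValueError("every model must predict the same number of samples")

    combined = []
    # zip(*predictions) yields, for each sample, the tuple of all model votes.
    for sample_votes in zip(*predictions):
        counts = Counter(sample_votes)
        top = max(counts.values())
        winner = next(label for label in sample_votes if counts[label] == top)
        combined.append(winner)
    return combined


# Three hypothetical models voting on three tweets (1 = hateful, 0 = not).
votes = [
    [1, 0, 1],
    [1, 1, 0],
    [0, 1, 1],
]
print(majority_vote(votes))
```

With an odd number of binary classifiers no ties can occur; with an even number, such as six models, some explicit tie-breaking rule like the one assumed here is needed.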

Classifying Arabic covid-19 related tweets for fake news detection and sentiment analysis with BERT-based models

This paper describes the participation of our team “techno” in the CERIST Natural Language Processing Challenge. We used an available dataset for task 1.c: Arabic sentiment analysis and fake news detection within COVID-19. It comprises 4128 tweets for the sentiment analysis task and 8661 tweets for the fake news detection task. We combined natural language processing tools with the well-known pre-trained language model BERT (Bidirectional Encoder Representations from Transformers). The results show the efficacy of pre-trained language models, as we attained an accuracy of 0.93 for the sentiment analysis task and 0.90 for the fake news detection task.

Authors: Rabia Bounaama, Mohammed El Amine Abderrahim

Download: PDF

Arabic Hate speech and social networks offensive language detection

The containment measures imposed during the coronavirus pandemic have stimulated the use of social networks as a means of exchanging information, communicating, and countering the effects of social distancing. This paper presents our participation in the NLP Challenge 2022 competition initiated by the Research Centre for Scientific and Technical Information (CERIST). The competition focuses on the task of detecting Arabic hate speech and offensive language on social networks, specifically analyzing Twitter messages related to the COVID-19 pandemic and classifying users’ sentiments as either hateful or not. In the present work, we propose a model based on recurrent neural networks, more precisely a bidirectional long short-term memory (Bi-LSTM) network. We trained the model using a dataset constructed by the authors of this challenge. As a result, we achieve an accuracy of 96.35%.

Authors: Hakim Bouchal, Ahror BELAID

Download: PDF

GigaBERT-based Approach for Hate Speech Detection in Arabic Twitter

Natural Language Processing has recently become one of the most active research areas in Artificial Intelligence, especially in social media-related tasks. This paper describes our participation in the “Hate Speech Detection on Arabic Twitter” task at the CERIST NLP-Challenge 2022 competition. The proposed solution aims to classify the tweets collected in the Arabic ARACOVID19-MFH multi-label and multi-dialect dataset into “Hateful” and “Not Hateful” categories. Based on a pre-trained transformer model known as GigaBERT-v4, our solution outperformed the most common transformer models supporting the Arabic language. Experiments on this dataset show that the GigaBERT-v4 model is more effective than the other models, obtaining a 99.46% accuracy and a 98.68% macro F1-score.

Authors: Bachir Said, Mohammed E. Barmati

Download: PDF
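
Several abstracts in this listing report both accuracy and a (macro) F1-score. The sketch below shows how the two metrics can diverge on an imbalanced dataset. It is an illustrative pure-Python computation, not the evaluation code used in the challenge; the function names and the toy labels are assumptions.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = set(y_true) | set(y_pred)
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data: one hateful tweet (1) among four, and a model that always says 0.
y_true = [1, 0, 0, 0]
y_pred = [0, 0, 0, 0]
print(accuracy(y_true, y_pred), macro_f1(y_true, y_pred))
```

Because macro F1 weights every class equally, a model that ignores the minority class is penalized heavily even when its accuracy looks high.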

XLM-T for Multilingual Sentiment Analysis in Twitter using oversampling technique

With the emergence of pre-trained language models (PLMs) and their success at large scale, the field of natural language processing (NLP) has advanced tremendously, including in sentiment analysis (SA), one of its fastest-growing research tasks. This paper describes the system that our team submitted to the CERIST NLP Challenge for task 1.b. The purpose of this task is to identify the sentiment polarity of English and Arabic comments collected from Twitter. Our approach is based on a PLM called XLM-T and uses oversampling to address multilingual sentiment analysis on Twitter. Experimental results confirm that this state-of-the-art model is robust, achieving an accuracy of 85%.

Authors: Mohammed E. Barmati, Bachir Said

Download: PDF
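
The abstract above mentions oversampling without specifying the variant. As one possible reading, the sketch below shows plain random oversampling, which duplicates randomly chosen minority-class examples until every class matches the size of the largest one. The function name `random_oversample`, the fixed seed, and the toy data are assumptions for illustration only.

```python
import random
from collections import Counter


def random_oversample(texts, labels, seed=0):
    """Balance classes by resampling minority-class examples with replacement."""
    rng = random.Random(seed)

    # Group the examples by their label.
    by_label = {}
    for text, label in zip(texts, labels):
        by_label.setdefault(label, []).append(text)

    # Every class is grown to the size of the largest class.
    target = max(len(items) for items in by_label.values())

    out_texts, out_labels = [], []
    for label, items in by_label.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for text in items + extra:
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels


texts = ["great", "nice", "love it", "awful"]
labels = ["pos", "pos", "pos", "neg"]
balanced_texts, balanced_labels = random_oversample(texts, labels)
print(Counter(balanced_labels))
```

Oversampling should be applied to the training split only, so that duplicated examples do not leak into validation or test data.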

Hate speech detection model based on BERT for the Arabic dialects

Hateful speech spread through social media has the potential to cause personal harm and suffering as well as social tension. Social media platforms, on the other hand, are unable to regulate all of the content that users post. As a result, there is a demand for automatic detection of hate speech. This demand increases when the posts are written in complex languages such as Arabic. The present study contributes to hate speech and offensive language detection for Arabic dialects and describes my participation in the CERIST Natural Language Processing Challenge 2022. We propose an approach based on deep learning and a pre-trained BERT model, built by adding GRU and LSTM layers to the BERT outputs. Additionally, to deal with the class imbalance in the dataset, two methods are proposed: the first is data augmentation, oversampling the minority class using translation and back-translation, and the second uses focal loss for training. The best results reached with focal loss training are 98.03% accuracy and 98.02% F1-score, and with data augmentation, 99.14% for both accuracy and F1-score.

Authors: Nourelhouda Chiker

Download: PDF
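
The focal loss mentioned in the abstract above down-weights easy examples so that training concentrates on hard, often minority-class, ones. The sketch below is a minimal pure-Python version of the standard binary focal loss, FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t). It is not the author's training code, and the default values `gamma=2.0` and `alpha=0.25` are common choices assumed here, not values taken from the paper.

```python
import math


def binary_focal_loss(probs, labels, gamma=2.0, alpha=0.25, eps=1e-7):
    """Mean binary focal loss.

    probs  -- predicted probabilities of the positive class
    labels -- true labels (1 = positive, 0 = negative)
    gamma  -- focusing parameter; gamma = 0 gives alpha-weighted cross-entropy
    alpha  -- weight of the positive class (assumed default, not from the paper)
    """
    if not probs or len(probs) != len(labels):
        raise ValueError("probs and labels must be non-empty and equally long")

    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        p_t = p if y == 1 else 1.0 - p  # probability assigned to the true class
        alpha_t = alpha if y == 1 else 1.0 - alpha
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(probs)


# A confident correct prediction contributes far less than a confident wrong one.
print(binary_focal_loss([0.9], [1]), binary_focal_loss([0.1], [1]))
```

Setting `gamma` to 0 recovers ordinary weighted cross-entropy, which makes the effect of the focusing term easy to compare.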

Modeling Fake News Detection Using Machine Learning Algorithms for Arabic covid-19 Tweets

Fake news detection has become a major issue in the digital age, with social media playing a major role in its spread. This paper outlines the dataset and methodology used to model Arabic fake news detection as part of our participation in the CERIST Natural Language Processing Challenge. We used the dataset provided for Task 1.c: Arabic sentiment analysis and fake news detection within COVID-19. The model used for this task is a simple transformer-based fake news model built on the Arabic pre-trained language model CAMeL-BERT. This model was used in two variants: a fine-tuned model and a bidirectional long short-term memory model. In our experiments, CAMeL-BERT gave the best result, achieving an F1-score of 0.959 and outperforming all other model variants in detecting fake news.

Authors: Mohammed Aldawsari, Omer Salih Dawood Omer, Yousra F.G. Elhakeem, Safa Eltayeb

Download: PDF

Arabic Sentiment Analysis within COVID-19

In this paper, we present a brief study that analyzes Arabic tweets posted during the COVID-19 period and classifies them as positive, negative, or neutral. This paper describes our participation in the CERIST Natural Language Processing Challenge. We worked on a dataset of 4800 tuples, to which we applied three different approaches: Naive Bayes, a neural network, and stochastic gradient descent (SGD). The last gave the best result, with an accuracy of 91%.

Authors: Slimane Arbaoui, Alaa Eddine Belfedhal

Download: PDF

Exploring Chinese innovation through patent information: hegemony or knowledge manipulation?

In this article, we analyze China's innovative power. We ask whether this country, which in a few years has become the world's leading patent applicant, represents a genuine reservoir of invention or a strategy of knowledge manipulation on a global scale. In other words, has China, once called the world's factory, become a true engine of global R&D? The aim of this article is to understand how patent information is exploited by researchers and to determine what proportion of value-added innovations lies within the explosion in the number of Chinese patents.

Authors: Nour-Eddine Aissaoui

Download: PDF

Introduction to BIG DATA: Concepts and Technologies

In recent years, the term Big Data has become widespread, and the largest companies and data providers in the world have already adopted it.
This phenomenon, which has changed the world, emerged from the explosion of digital data and the inability of traditional systems to manage such enormous quantities of data. In fact, Google, Yahoo, and other web companies were the first to face the problem of scaling their systems, which motivated the development of the first Big Data projects. Several further projects were subsequently developed to meet the demands of ever more massive data. This article is an introduction to Big Data and its recent technologies.

Authors: Faiza Deghmani

Download: PDF