In this paper, we propose a topic segmentation approach for Arabic texts and use it to study the effect of two different stemming techniques: root-based stemming and light stemming. The proposed approach is global, distributional, and non-linear. It is global because it compares all text segments with one another, not only neighboring segments. It is non-linear in the sense that it can place segments located at different positions in the text into the same group (subtopic). The approach computes the lexical cohesion between segments from a combination of lexical and semantic repetition criteria. Terms are weighted with the Okapi BM25 measure after stemming, using either root-based or light stemming. Semantic repetitions of terms are identified using the Arabic WordNet lexical database. A similarity matrix is then built in which rows and columns correspond to text segments and each element is the cosine score between a pair of segments. Finally, subtopics are formed using a strict clustering technique in order to eliminate redundancy in the segment groups. We evaluated the system on a collection of economic and web news articles using Recall, Precision, F-measure, and WindowDiff. The results obtained are very promising.
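The weighting and similarity steps described above can be illustrated with a minimal Python sketch. It assumes each segment has already been stemmed and tokenized into a list of terms, and it omits both the Arabic WordNet semantic repetitions and the clustering step. The function names, the parameter values `k1=1.2` and `b=0.75`, and the smoothed IDF variant are assumptions made for illustration, not details taken from the paper.

```python
import math
from collections import Counter


def bm25_vectors(segments, k1=1.2, b=0.75):
    """Return one {term: weight} dict per segment, weighted with Okapi BM25.

    `segments` is a list of token lists (already stemmed). The values of
    k1 and b are common defaults, assumed here for illustration.
    """
    n = len(segments)
    if n == 0:
        return []
    # Average segment length; fall back to 1.0 if every segment is empty.
    avg_len = (sum(len(s) for s in segments) / n) or 1.0
    # Document frequency: number of segments in which each term occurs.
    df = Counter(term for s in segments for term in set(s))
    vectors = []
    for s in segments:
        tf = Counter(s)
        vec = {}
        for term, f in tf.items():
            # Smoothed IDF (the "+ 1.0" keeps it positive); an assumed variant.
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = f + k1 * (1 - b + b * len(s) / avg_len)
            vec[term] = idf * f * (k1 + 1) / norm
        vectors.append(vec)
    return vectors


def cosine(u, v):
    """Cosine score between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def similarity_matrix(segments):
    """Square matrix whose (i, j) entry is the cosine score of segments i and j."""
    vecs = bm25_vectors(segments)
    return [[cosine(a, b) for b in vecs] for a in vecs]
```

Because every pair of segments is scored, not only adjacent ones, the resulting matrix reflects the global character of the approach: a clustering step can then group segments that are far apart in the text.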