Authors
Eiman Al-Shammari, Jessica Lin
Publication date
2008/7/24
Book
Proceedings of the second workshop on Analytics for noisy unstructured text data
Pages
113-118
Description
Tokenization is a fundamental step in processing textual data that precedes the tasks of information retrieval, text mining, and natural language processing. Tokenization is language-dependent and typically comprises normalization, stop-word removal, lemmatization, and stemming.
Both stemming and lemmatization share the goal of reducing a word to its base form. However, lemmatization is more robust than stemming, as it often draws on vocabulary and morphological analysis rather than simply removing the word's suffix. In this work, we introduce a novel lemmatization algorithm for the Arabic language.
The new lemmatizer proposed here is part of a comprehensive Arabic tokenization system with a stop-word list exceeding 2,200 Arabic words. Currently, there are two leading Arabic stemmers: the root-based stemmer and the light stemmer. We hypothesize that lemmatization would be more …
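To make the stemming/lemmatization distinction concrete, the following toy Python sketch (not the paper's algorithm) contrasts a light, affix-stripping stemmer with a lexicon-based lemmatizer; the affix lists and lexicon entries are hypothetical illustrations.

# Toy sketch (not the paper's algorithm): a light stemmer strips known
# affixes, while a lemmatizer first consults a vocabulary. Affix lists
# and lexicon entries here are hypothetical illustrations.

PREFIXES = ("ال", "و", "ب")      # e.g. definite article, conjunction, preposition
SUFFIXES = ("ات", "ون", "ين")    # common plural suffixes

def light_stem(word: str) -> str:
    """Strip at most one known prefix and one known suffix."""
    for p in PREFIXES:
        if word.startswith(p) and len(word) - len(p) >= 3:
            word = word[len(p):]
            break
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= 3:
            word = word[:-len(s)]
            break
    return word

# A lemmatizer maps surface forms to dictionary entries, so it can handle
# words (e.g. Arabic broken plurals) that carry no strippable affix.
LEXICON = {"كتب": "كتاب"}  # hypothetical entry: "books" -> "book"

def lemmatize(word: str) -> str:
    return LEXICON.get(word, light_stem(word))

print(light_stem("المعلمون"))  # -> "معلم": affix stripping succeeds
print(light_stem("كتب"))       # -> "كتب": no affix to strip, stem unchanged
print(lemmatize("كتب"))        # -> "كتاب": the lexicon recovers the lemma

The broken-plural case is exactly where a lemmatizer's vocabulary lookup is more robust than suffix removal, which motivates the comparison the abstract describes.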
Total citations
[Citations-per-year chart, 2008–2024; yearly counts not recoverable from the flattened text]
Scholar articles
E Al-Shammari, J Lin - Proceedings of the second workshop on Analytics for …, 2008