Authors
Eiman Al-Shammari, Jessica Lin
Publication date
2008/7/24
Book
Proceedings of the second workshop on Analytics for noisy unstructured text data
Pages
113-118
Description
Tokenization is a fundamental step in processing textual data that precedes the tasks of information retrieval, text mining, and natural language processing. Tokenization is language-dependent and typically comprises normalization, stop-word removal, lemmatization, and stemming.
Both stemming and lemmatization share the goal of reducing a word to its base form. However, lemmatization is more robust than stemming, as it often draws on vocabulary and morphological analysis rather than simply removing the word's suffix. In this work, we introduce a novel lemmatization algorithm for the Arabic language.
The new lemmatizer proposed here is part of a comprehensive Arabic tokenization system with a stop-word list exceeding 2,200 Arabic words. Currently, there are two leading Arabic stemmers: the root-based stemmer and the light stemmer. We hypothesize that lemmatization would be more …
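To make the stemming/lemmatization distinction concrete, the following toy Python sketch (not the paper's algorithm) contrasts a light, affix-stripping stemmer with a lexicon-based lemmatizer; the affix lists and lexicon entries are hypothetical illustrations.

# Toy sketch (not the paper's algorithm): a light stemmer strips known
# affixes, while a lemmatizer first consults a vocabulary. Affix lists
# and lexicon entries here are hypothetical illustrations.

PREFIXES = ("ال", "و", "ب")      # e.g. definite article, conjunction, preposition
SUFFIXES = ("ات", "ون", "ين")    # common plural suffixes

def light_stem(word: str) -> str:
    """Strip at most one known prefix and one known suffix."""
    for p in PREFIXES:
        if word.startswith(p) and len(word) - len(p) >= 3:
            word = word[len(p):]
            break
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= 3:
            word = word[:-len(s)]
            break
    return word

# A lemmatizer maps surface forms to dictionary entries, so it can handle
# words (e.g. Arabic broken plurals) that carry no strippable affix.
LEXICON = {"كتب": "كتاب"}  # hypothetical entry: "books" -> "book"

def lemmatize(word: str) -> str:
    return LEXICON.get(word, light_stem(word))

print(light_stem("المعلمون"))  # -> "معلم": affix stripping succeeds
print(light_stem("كتب"))       # -> "كتب": no affix to strip, stem unchanged
print(lemmatize("كتب"))        # -> "كتاب": the lexicon recovers the lemma

The broken-plural case is exactly where a lemmatizer's vocabulary lookup is more robust than suffix removal, which motivates the comparison the abstract describes.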
Total citations
[Citations-per-year chart, 2008–2024; yearly counts not recoverable from the flattened text]
Scholar articles
E Al-Shammari, J Lin - Proceedings of the second workshop on Analytics for …, 2008