Authors
Mathias Johan Philip Creutz, Krista Hannele Lagus
Publication date
2004
Conference
7th Meeting of the ACL Special Interest Group in Computational Phonology (SIGPHON)
Pages
43-51
Description
This paper presents an algorithm for the unsupervised learning of a simple morphology of a natural language from raw text. A generative probabilistic model is applied to segment word forms into morphs. The morphs are assumed to be generated by one of three categories, namely prefix, suffix, or stem, and we make use of some observed asymmetries between these categories. The model learns a word structure, where words are allowed to consist of lengthy sequences of alternating stems and affixes, which makes the model suitable for highly-inflecting languages. The ability of the algorithm to find real morpheme boundaries is evaluated against a gold standard for both Finnish and English. In comparison with a state-of-the-art algorithm, the new algorithm performs best on the Finnish data and at a roughly equal level on the English data.
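The two ideas in the description, a word structure built from stems and affixes and an evaluation of predicted morpheme boundaries against a gold standard, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the word grammar (one or more groups of optional prefixes, a stem, and optional suffixes), the one-letter category codes, the function names, and the convention that precision or recall is 1.0 when the corresponding boundary set is empty.

```python
import re

# Hypothetical one-letter codes for the three morph categories named in the abstract.
CODES = {"prefix": "P", "stem": "S", "suffix": "X"}

# Assumed word grammar (not confirmed by the abstract): one or more groups,
# each made of optional prefixes, exactly one stem, and optional suffixes.
WORD_PATTERN = re.compile(r"(?:P*SX*)+")


def is_valid_structure(categories):
    """Return True if the category sequence fits the assumed word grammar."""
    encoded = "".join(CODES[c] for c in categories)
    return WORD_PATTERN.fullmatch(encoded) is not None


def boundaries(morphs):
    """Character offsets of the internal boundaries of a segmented word."""
    positions, offset = set(), 0
    for morph in morphs[:-1]:
        offset += len(morph)
        positions.add(offset)
    return positions


def boundary_precision_recall(predicted, gold):
    """Compare predicted boundaries with gold-standard ones for a single word.

    Assumed convention: an empty boundary set yields a score of 1.0.
    """
    pred_b, gold_b = boundaries(predicted), boundaries(gold)
    hits = len(pred_b & gold_b)
    precision = hits / len(pred_b) if pred_b else 1.0
    recall = hits / len(gold_b) if gold_b else 1.0
    return precision, recall
```

For example, under these assumptions `is_valid_structure(["stem", "suffix", "stem", "suffix"])` accepts a compound-like sequence, while a sequence that starts with a suffix is rejected. Splitting a word as `["un", "believ", "able"]` against a gold segmentation `["un", "believable"]` finds the one gold boundary but also proposes an extra one, so recall is perfect and precision is one half.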
Total citations
[Citations-per-year chart, 2004 to 2024; per-year counts not recoverable from the extracted text]