Authors
Maike ERDMANN, Gen HATTORI, Kazunori MATSUMOTO, Yasuhiro TAKISHIMA
Description
† KDDI R&D Laboratories, Inc., 2-1-15 Ohara, Fujimino, Saitama, 356-8502 Japan
E-mail: †(ma-erdmann, gen, matsu, takisima)@kddilabs.jp
Abstract
Social media platforms such as Twitter are an invaluable source of information. However, one problem that arises when analyzing Twitter messages is the ambiguity of many named entities. Named entity disambiguation is usually performed by comparing the text surrounding an occurrence of the ambiguous term with the text in a knowledge base such as Wikipedia. Texts published via social media, however, are usually very short and written informally, so the overlap of terms is too small for accurate entity matching. In addition, the way a term is used in social media can differ greatly from the entities represented in the knowledge base. We therefore propose an unsupervised, domain-independent tweet clustering method based on co-occurring terms in the tweets. Our method extracts characteristic keywords for a named entity from the tweets and adds terms proposed by Google Autocomplete. It then clusters all keywords according to the entity they represent and assigns one of the keyword categories to each tweet. In an experiment with ambiguous company names, car names and TV show titles, our proposed method achieved both higher precision and higher recall than two named entity disambiguation methods that match named entities in the tweets to corresponding Wikipedia entities.
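The three steps named in the abstract (keyword extraction, keyword clustering, tweet assignment) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the keyword scoring or the clustering algorithm, so the sketch assumes simple frequency thresholds and connected components of a keyword co-occurrence graph. All function names, the stopword list, and the thresholds are hypothetical, and the Google Autocomplete terms are assumed to be supplied by the caller as a plain list (`extra_terms`) rather than fetched from any API.

```python
import re
from collections import Counter, defaultdict
from itertools import combinations

# Tiny illustrative stopword list (an assumption, not from the paper).
STOPWORDS = {"a", "an", "the", "is", "in", "on", "of", "to", "and",
             "my", "i", "for", "at", "with"}


def tokenize(text):
    """Lowercase the text and return alphanumeric tokens minus stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower())
            if t not in STOPWORDS]


def extract_keywords(tweets, entity, extra_terms=(), min_count=2):
    """Collect terms that co-occur with `entity` in at least `min_count` tweets.

    `extra_terms` stands in for terms proposed by an autocomplete service.
    """
    counts = Counter()
    for tweet in tweets:
        tokens = set(tokenize(tweet))
        if entity in tokens:
            counts.update(tokens - {entity})
    keywords = {t for t, c in counts.items() if c >= min_count}
    return keywords | {t for t in extra_terms if t != entity}


def cluster_keywords(tweets, keywords, min_cooccurrence=2):
    """Group keywords that appear together in at least `min_cooccurrence` tweets.

    Clusters are the connected components of the co-occurrence graph,
    computed with a small union-find structure.
    """
    pair_counts = Counter()
    for tweet in tweets:
        present = sorted(set(tokenize(tweet)) & keywords)
        pair_counts.update(combinations(present, 2))

    parent = {k: k for k in keywords}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), count in pair_counts.items():
        if count >= min_cooccurrence:
            parent[find(a)] = find(b)

    groups = defaultdict(list)
    for k in keywords:
        groups[find(k)].append(k)
    # Sort for a deterministic result.
    return sorted(sorted(g) for g in groups.values())


def assign_tweets(tweets, clusters):
    """Label each tweet with the index of the best-overlapping cluster, or None."""
    labels = []
    for tweet in tweets:
        tokens = set(tokenize(tweet))
        overlaps = [len(tokens & set(c)) for c in clusters]
        best = max(range(len(clusters)), key=overlaps.__getitem__, default=None)
        labels.append(best if best is not None and overlaps[best] > 0 else None)
    return labels


if __name__ == "__main__":
    tweets = [
        "The new jaguar engine sounds great on the road",
        "Test drive of the jaguar: engine and road handling",
        "Saw a jaguar at the zoo, amazing cat",
        "The jaguar is a big cat, the zoo has two",
        "Love my jaguar",
    ]
    keywords = extract_keywords(tweets, "jaguar")
    clusters = cluster_keywords(tweets, keywords)
    print(clusters)
    print(assign_tweets(tweets, clusters))
```

In this toy example, tweets that contain no clustered keyword receive the label `None`. That mirrors the short-text problem the abstract describes: a tweet with too little context cannot be matched to any entity.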