Identifying topic models for user-generated content like hotel reviews turns out to be difficult with the standard approach of LDA (Latent Dirichlet Allocation; Blei et al., 2003). Hotel review texts usually don't differ as much in the topics they cover as is typical of other genres such as Wikipedia or newsgroup articles, where commonly only a very small set of topics is present in each document.

For this reason, we developed our own approach to topic modeling that is especially tailored to non-edited texts like hotel reviews. The approach can be divided into three major steps. First, using the concept of second-order co-occurrences, we define a contextual similarity score that enables us to identify words that are similar with respect to certain topics. This score allows us to build up a topic network where nodes are words and edges represent the contextual similarity between the words. With the help of algorithms from graph theory, like the Infomap algorithm (Rosvall and Bergstrom, 2008), we are able to detect clusters of highly connected words that can be identified as topics in our review texts. In a further step, we use these clusters and the respective words to get a topic similarity score for each word in the network. In other words, we transform a hard clustering of words into topics into a probability score of how likely it is that a given word belongs to a given topic cluster.

The presentation is structured as follows:

References:

David M. Blei, Andrew Y. Ng, Michael I. Jordan: Latent Dirichlet allocation. In: Journal of Machine Learning Research, vol. 3 (2003), pp. 993–1022, ISSN 1532-4435.

M. Rosvall and C. T. Bergstrom: Maps of information flow reveal community structure in complex networks. PNAS 105, 1118 (2008), http://dx.doi.org/10.1073/pnas.0706851105, http://arxiv.org/abs/0707.0609
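The three steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and several choices are assumptions made only for illustration: co-occurrence is counted per review, contextual similarity is the cosine between co-occurrence vectors, the similarity threshold of 0.3 is arbitrary, and plain connected components stand in for the Infomap algorithm (which needs an external library). The toy reviews and all function names are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations
from math import sqrt


def cooccurrence_vectors(docs):
    """First-order co-occurrence: how often two words share a review."""
    vectors = defaultdict(Counter)
    for tokens in docs:
        for a, b in combinations(sorted(set(tokens)), 2):
            vectors[a][b] += 1
            vectors[b][a] += 1
    return vectors


def cosine(u, v):
    """Second-order similarity: two words are close if their contexts overlap."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def similarity_graph(vectors, threshold):
    """Topic network: nodes are words, weighted edges are similarities."""
    graph = {w: {} for w in vectors}
    for a, b in combinations(sorted(vectors), 2):
        s = cosine(vectors[a], vectors[b])
        if s >= threshold:
            graph[a][b] = s
            graph[b][a] = s
    return graph


def connected_components(graph):
    """Stand-in for Infomap (assumption): each component is one hard cluster."""
    seen, clusters = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            word = stack.pop()
            if word in component:
                continue
            component.add(word)
            stack.extend(graph[word].keys() - component)
        seen |= component
        clusters.append(component)
    return clusters


def topic_scores(word, clusters, vectors):
    """Soft membership: mean similarity to each cluster, normalised to sum to 1."""
    raw = []
    for cluster in clusters:
        others = [w for w in cluster if w != word]
        if others:
            raw.append(sum(cosine(vectors[word], vectors[w]) for w in others) / len(others))
        else:
            raw.append(0.0)
    total = sum(raw)
    return [r / total if total else 0.0 for r in raw]


# Hypothetical toy reviews, already tokenised.
docs = [
    ["room", "bed", "clean"],
    ["room", "bed", "quiet"],
    ["breakfast", "coffee", "buffet"],
    ["breakfast", "coffee", "fresh"],
]
vectors = cooccurrence_vectors(docs)
graph = similarity_graph(vectors, threshold=0.3)
clusters = connected_components(graph)
scores = topic_scores("clean", clusters, vectors)
```

In this toy data, "clean" and "quiet" never appear in the same review, yet they end up in the same cluster because both co-occur with "room" and "bed"; this is the effect second-order co-occurrence is meant to capture. Replacing `connected_components` with a real community detection method such as Infomap would matter on realistic data, where the similarity graph is rarely split into clean components.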