June 1, 2007
Generating topic detection training corpora from social bookmarking sites

Fall 2006 with Chris Harman. Advisor: Rich Wicentowski

This project focused on generating training/testing documents (corpora, for those in the know) for automated tagging using the social bookmarking site del.icio.us.

Social bookmarking sites provide a wealth of heavily cross-referenced tagging data. However, only a very particular subset of the Web gets added to these sites. Our motivation was to build an engine that leverages this human-verified data (namely, tagged sites) to tag novel text (the rest of the internet). We generated a sizable corpus of about 19,000 documents covering about 1,000 different tags and trained a topic detection algorithm that uses latent semantic analysis.
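
To give a rough idea of what "generating a corpus" from bookmarks can look like, here is a minimal sketch in Python. It is not the code from the project: the record layout, the `BOOKMARKS` sample data, and the `is_test` and `build_corpus` helpers are all hypothetical, and the hash-based train/test split is just one simple choice. It groups bookmarked pages by tag and drops tags that have too few documents to train on.

```python
import hashlib
from collections import defaultdict

# Hypothetical bookmark records: (url, tags, page text). Illustrative only.
BOOKMARKS = [
    ("http://example.com/a", ["python", "nlp"], "parsing text with python"),
    ("http://example.com/b", ["nlp"], "latent semantic analysis notes"),
    ("http://example.com/c", ["cooking"], "a recipe for bread"),
    ("http://example.com/d", ["python"], "python packaging guide"),
]


def is_test(url, test_fraction=0.25):
    """Assign a URL to the test split deterministically from its hash."""
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < test_fraction * 100


def build_corpus(bookmarks, min_docs_per_tag=1):
    """Group documents by tag, skipping tags with too few documents."""
    # First pass: count how many documents carry each tag.
    counts = defaultdict(int)
    for _url, tags, _text in bookmarks:
        for tag in tags:
            counts[tag] += 1

    # Second pass: place each document in one split, under each kept tag.
    corpus = {"train": defaultdict(list), "test": defaultdict(list)}
    for url, tags, text in bookmarks:
        split = "test" if is_test(url) else "train"
        for tag in tags:
            if counts[tag] >= min_docs_per_tag:
                corpus[split][tag].append((url, text))
    return corpus


if __name__ == "__main__":
    corpus = build_corpus(BOOKMARKS, min_docs_per_tag=2)
    for split, by_tag in corpus.items():
        for tag, docs in sorted(by_tag.items()):
            print(split, tag, len(docs))
```

Splitting on a hash of the URL keeps a page in the same split across runs, so a document tagged with several topics never ends up in both training and testing data.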

Download: paper (PDF)

8:00pm  |   URL: http://tumblr.com/Zn_4by9Tph0
Filed under: nlp academic