Generating topic detection training corpora from social bookmarking sites
Fall 2006 with Chris Harman. Advisor: Rich Wicentowski
This project focused on generating training/testing documents (corpora, for those in the know) for automated tagging using the social bookmarking site del.icio.us.
Social bookmarking sites provide a wealth of heavily cross-referenced tagging data. However, only a very particular subset of the Web gets added to these sites. Our motivation was to build an engine that leverages this human-verified data (namely, tagged sites) to tag novel text (the rest of the Internet). We generated a sizable corpus of about 19,000 documents covering about 1,000 different tags, and trained a topic detection algorithm that uses latent semantic analysis.
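To make the corpus idea concrete, here is a minimal sketch of how tagged bookmarks could be grouped into per-tag training and testing documents. It is an illustration, not the project's actual pipeline: the `BOOKMARKS` records, the `build_corpus` and `split_corpus` helpers, and the every-other-document holdout are all assumptions made for this example, and no real del.icio.us API is called.

```python
from collections import defaultdict

# Hypothetical bookmark records: (url, tags, page_text). These stand in for
# data gathered from a bookmarking site; the URLs and texts are made up.
BOOKMARKS = [
    ("http://example.com/a", ["python", "programming"], "intro to python lists"),
    ("http://example.com/b", ["python"], "python decorators explained"),
    ("http://example.com/c", ["cooking"], "how to bake sourdough bread"),
    ("http://example.com/d", ["cooking", "programming"], "recipes as algorithms"),
]


def build_corpus(bookmarks):
    """Group documents by tag: each tag maps to the pages labeled with it."""
    corpus = defaultdict(list)
    for url, tags, text in bookmarks:
        for tag in tags:
            # A page with several tags appears once under each of them.
            corpus[tag].append((url, text))
    return dict(corpus)


def split_corpus(corpus, test_every=2):
    """Hold out every `test_every`-th document of each tag for testing."""
    train, test = {}, {}
    for tag, docs in corpus.items():
        train[tag] = [d for i, d in enumerate(docs) if (i + 1) % test_every != 0]
        test[tag] = [d for i, d in enumerate(docs) if (i + 1) % test_every == 0]
    return train, test


corpus = build_corpus(BOOKMARKS)
train, test = split_corpus(corpus)
```

The training half would then feed whatever topic model is used (latent semantic analysis in our case), while the held-out half provides tagged pages for checking its predictions.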
Download: paper (PDF)