Generating topic detection training corpora from social bookmarking sites

Fall 2006 with Chris Harman. Advisor: Rich Wicentowski

This project focused on generating training/testing documents (corpora, for those in the know) for automated tagging using the social bookmarking site del.icio.us.

Social bookmarking sites provide a wealth of heavily cross-referenced tagging data. However, only a very particular subset of the Web gets added to these sites. Our motivation was to build an engine that leverages this human-verified data (namely, tagged sites) to tag novel text (the rest of the Internet). We generated a sizable corpus of about 19,000 documents covering about 1,000 different tags and trained a topic detection algorithm that uses latent semantic analysis.

Download: paper (PDF)

Parallel interpolation of elevation grids

Fall 2006 with Scott Blaha. Advisor: Andy Danner

Real-life geographic elevation data comes as three-dimensional point clouds, meaning the data is neither aligned to a grid nor uniformly distributed. Geographic Information Systems take elevation grids as input for most processing tasks (viewshed computation, watershed computation, flow routing, etc.). The natural way of getting grids from point clouds is interpolation. However, interpolation is extremely computationally intensive, and GIS data sets are getting bigger by the second.

Thankfully, interpolation is RP (ridiculously parallelizable): each grid cell can be computed independently of the others, which can theoretically give us an n-fold speedup with n-way parallelization. We implemented a few interpolation algorithms, parallelized them, and observed the improvement in performance. Programmed in C using LAM/MPI.
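
To give a feel for why the work splits so cleanly, here is a minimal sketch in Python. It is not the project's C/MPI code. Inverse distance weighting is only an assumed stand-in for the interpolation algorithms, and the sample points, grid size, and function names (idw, interpolate_rows, split_rows) are hypothetical. The grid rows are cut into blocks, each block is interpolated without looking at any other block, and the stitched result matches the one-pass result exactly.

```python
def idw(x, y, points, power=2.0):
    """Inverse-distance-weighted elevation at (x, y) from (px, py, pz) samples."""
    num = 0.0
    den = 0.0
    for px, py, pz in points:
        d2 = (x - px) ** 2 + (y - py) ** 2
        if d2 == 0.0:
            return pz  # the query location coincides with a sample point
        w = 1.0 / d2 ** (power / 2.0)  # weight = 1 / distance**power
        num += w * pz
        den += w
    return num / den


def interpolate_rows(row_range, ncols, points):
    """Interpolate the grid rows in row_range; uses no data from other rows."""
    return [[idw(float(c), float(r), points) for c in range(ncols)]
            for r in row_range]


def split_rows(nrows, nworkers):
    """Split range(nrows) into nworkers contiguous, near-equal blocks."""
    base, extra = divmod(nrows, nworkers)
    blocks, start = [], 0
    for w in range(nworkers):
        size = base + (1 if w < extra else 0)
        blocks.append(range(start, start + size))
        start += size
    return blocks


# Hypothetical scattered samples (x, y, elevation); not data from the project.
points = [(0.0, 0.0, 10.0), (3.0, 1.0, 20.0), (1.5, 3.0, 5.0), (2.2, 2.7, 12.0)]
nrows, ncols = 4, 4

# One pass over the whole grid.
serial = interpolate_rows(range(nrows), ncols, points)

# Block-by-block: each block could go to a separate worker (an MPI rank, say).
# Here the blocks simply run one after another to keep the sketch
# dependency-free, so this shows the decomposition, not an actual speedup.
blockwise = []
for block in split_rows(nrows, 3):
    blockwise.extend(interpolate_rows(block, ncols, points))
```

Because no block needs anything from another block, the only communication a real parallel version would need is handing out the input points and collecting the finished rows.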

Download: paper (PDF)

Audit logs for computer security monitoring

Summer ’05 with Ben Kuperman

During an undergraduate research fellowship at Swarthmore College, I worked with Professor Kuperman on expanding a preliminary system written for Sun Solaris. We first ported it to Debian Linux. Afterwards, I redesigned and reimplemented the log recording mechanism with easily synchronizable log entry commit semantics, and wrote a highly modular log reader/processor from scratch.

I wrote and presented a poster on this project for Sigma Xi in September ’05.

Download (PDFs): abstract and summary or poster