Adam Oest's Web Portal
 

Welcome to Adam's MS Project Portal

Me in Norway, 2009

I have decided to focus my master's project on emergent semantics through the use of data mining, data preparation, and information retrieval techniques. The title of my project is Feedback-Driven Clustering for Automated Linking of Web Pages. As of June 17th, 2013, I have completed the project implementation and report.

Problem

Internet advertising has for years been used to drive revenue for publishers while increasing the traffic and brand awareness of advertisers. The same principles can be applied by publishers to route more traffic to their own content. Even though multimedia is now a large part of the Internet, text content such as articles, blog posts, and forum messages continues to dominate almost every type of site. Web research has shown that text links within a web document are very effective. In fact, they are dozens of times more likely to receive clicks than prominent banner ads, due to their natural appeal as part of the "organic" content of a document. Therefore, it is beneficial for web publishers to include appropriate links within their text content if they wish to promote additional pages or simply increase internal site traffic.

While it is relatively straightforward to create a self-maintaining linking framework for navigation bars or breadcrumbs, interspersing paragraphs with inline links is a much more tedious task that generally has to be done manually for each individual piece of content. This is because links must be relevant to the content at hand (and ultimately of interest to the end user) in order to be successful. Unfortunately, on large web sites it is unrealistic for editors to add such links manually, especially if a large portion of the site's content is user-submitted. Luckily, the interactive nature of today's web can be used as a vehicle for semantic analysis of web content, which could eventually lead to an automated yet effective inline linking system, with the ultimate goal of retaining more users and increasing the number of pages that each user browses per visit.

Abstract

In this work I propose a system that indexes a large collection of HTML documents (e.g., an entire web site) and automatically generates inline text links between pairs of relevant documents (i.e., web pages). The goal of the system is threefold: to increase user interaction with the site being browsed, to detect relevant keywords for each document, and to effectively cluster the documents into relevant groupings. The quality of the links is improved over time through passive user feedback collection. A variety of data preparation, natural language processing, and data mining techniques are used in the implementation of the system. The system can be deployed as a web service and has been tested on offline datasets as well as a live web site.
Keywords: natural language processing, information retrieval, data mining, data cleaning, automatic linking, inline links, web page clustering, HTML
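
To make the idea of keyword detection and inline linking more concrete, here is a minimal sketch in Python. It is not the project's actual implementation: it assumes plain TF-IDF scoring for keyword detection, a tiny hand-written stopword list, and plain-text (not HTML) page bodies, and it leaves out clustering and user feedback entirely. The function names, URLs, and sample pages are all hypothetical.

```python
import math
import re
from collections import Counter

# Assumption: a tiny illustrative stopword list, not a real linguistic resource.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on", "with"}


def tokenize(text):
    """Lowercase the text and return alphabetic tokens minus stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def top_keywords(docs, k=1):
    """Return {url: [top-k terms by TF-IDF]} for a {url: text} mapping."""
    tokens = {url: tokenize(text) for url, text in docs.items()}
    n = len(docs)
    df = Counter()  # number of documents each term appears in
    for toks in tokens.values():
        df.update(set(toks))
    result = {}
    for url, toks in tokens.items():
        if not toks:
            result[url] = []
            continue
        tf = Counter(toks)
        scores = {t: (c / len(toks)) * math.log(n / df[t]) for t, c in tf.items()}
        # Highest score first; ties are broken alphabetically.
        result[url] = sorted(scores, key=lambda t: (-scores[t], t))[:k]
    return result


def add_inline_links(text, own_url, keywords):
    """Link the first occurrence of each *other* page's keyword in text."""
    term_to_url = {t: url for url, terms in keywords.items()
                   if url != own_url for t in terms}
    if not term_to_url:
        return text
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in sorted(term_to_url)) + r")\b",
        re.IGNORECASE)
    seen = set()

    def repl(match):
        term = match.group(1).lower()
        if term in seen:  # only the first occurrence of a term gets a link
            return match.group(0)
        seen.add(term)
        return '<a href="%s">%s</a>' % (term_to_url[term], match.group(0))

    # A single pass, so text inside newly inserted anchors is never re-matched.
    return pattern.sub(repl, text)


# Hypothetical three-page site.
docs = {
    "/fjords": "Fjords cut deep into Norway. Fjords draw hikers. Fjords freeze rarely.",
    "/clustering": "Clustering groups similar pages. Clustering needs features. "
                   "Clustering scales poorly.",
    "/trip": "On my trip I saw fjords and read about clustering.",
}
keywords = top_keywords(docs)
linked = add_inline_links(docs["/trip"], "/trip", keywords)
print(linked)
```

For the sample pages above, the sketch picks "fjords" and "clustering" as the keywords of the first two pages and links them from the third page, printing `On my trip I saw <a href="/fjords">fjords</a> and read about <a href="/clustering">clustering</a>.` A feedback-driven system as described in the abstract would go further, for example by adjusting which links are shown based on observed clicks.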

To learn more about my project, please download the proposal. You must be on the RIT network to download the project report and code.