Welcome to Adam's MS Project Portal
Me in Norway, 2009
I have decided to focus my master's project on emergent semantics through the use of data mining, data preparation, and information retrieval techniques.
The title of my project is Feedback-Driven Clustering for Automated Linking of Web Pages. As of June 17th, 2013, I have completed the project implementation and report.
Internet advertising has for years been used to drive revenue for publishers while
increasing traffic and brand awareness for advertisers. The same principles can be
applied by publishers to route more traffic to their own content. Even though multimedia
is now a large part of the Internet, text content such as articles, blog posts,
and forum messages continues to dominate almost every type of site. Web research
has shown that within a web document, text links are very effective. In fact, they
are dozens of times more likely to receive clicks than prominent banner ads due to
their natural appeal as part of the "organic" content of a document. Therefore,
it is beneficial for web publishers to include appropriate links within their text content
if they wish to promote additional pages or simply increase internal site traffic.
While it is relatively straightforward to create a self-maintaining linking framework
for navigation bars or breadcrumbs, interspersing paragraphs with inline links proves to be a much more tedious task that generally has to be done manually for each individual piece of content.
This is because links must be relevant to the
content at hand (and ultimately of interest to the end user) in order to be successful.
Unfortunately, on large web sites it is unrealistic for editors to manually add such
links, especially if a large portion of the site's content is user-submitted. Luckily,
the interactive nature of today's web can be used as a vehicle for semantic analysis of
web content, which could eventually lead to an automated yet effective inline linking
system with the ultimate goal of retaining more users and increasing the number of
pages that each user browses per visit.
In this work I propose a system that indexes a large collection of HTML
documents (i.e. an entire web site) and automatically generates inline text
links between pairs of relevant documents (i.e. web pages). The goal of the
system is threefold: to increase user interaction with the site he or she is browsing, to detect relevant keywords for each document, and to effectively cluster
the documents into relevant groupings. The quality of the links is improved over time through passive user feedback collection. A variety of data preparation, natural
language processing, and data mining techniques are used in the implementation
of the system. The system can be deployed as a web service and has been
tested on offline datasets as well as a live web site.
Keywords: natural language processing, information retrieval,
data mining, data cleaning, automatic linking, inline links, web page
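To make the idea in the abstract more concrete, here is a minimal sketch of how relevance-based inline linking could work. It is not the project's actual implementation: the TF-IDF weighting, the cosine similarity measure, the tiny stopword list, the click-through boost, and all function and parameter names (`suggest_links`, `feedback`, `min_score`) are assumptions chosen for illustration. Clustering and HTML parsing are left out entirely.

```python
import math
import re
from collections import Counter

# Illustrative stopword list (an assumption; a real system would use a fuller one).
STOPWORDS = {"the", "and", "with", "for", "that", "this"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens of 3+ letters that are not stopwords."""
    return [t for t in re.findall(r"[a-z]{3,}", text.lower()) if t not in STOPWORDS]


def tfidf_vectors(docs):
    """Map each document id to a {term: tf-idf weight} dictionary."""
    counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for c in counts.values() for term in c)
    vectors = {}
    for doc_id, c in counts.items():
        total = sum(c.values())
        vectors[doc_id] = {
            t: (n / total) * math.log(n_docs / doc_freq[t]) for t, n in c.items()
        }
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(
        sum(w * w for w in b.values())
    )
    return dot / norm if norm else 0.0


def suggest_links(docs, feedback=None, min_score=0.05):
    """For each page, pick the most similar other page and an anchor term.

    feedback is a hypothetical {(source, target): (clicks, impressions)} map;
    the observed click-through rate scales the similarity score upward.
    Returns {source: (target, anchor_term, score)}.
    """
    feedback = feedback or {}
    vectors = tfidf_vectors(docs)
    links = {}
    for src, src_vec in vectors.items():
        best = None
        for tgt, tgt_vec in vectors.items():
            if tgt == src:
                continue
            score = cosine(src_vec, tgt_vec)
            if score <= 0:
                continue  # no shared informative terms, so nothing to link on
            clicks, impressions = feedback.get((src, tgt), (0, 0))
            if impressions:
                score *= 1.0 + clicks / impressions
            if score >= min_score and (best is None or score > best[2]):
                # Anchor on the shared term that contributes most to the similarity.
                shared = sorted(set(src_vec) & set(tgt_vec))
                anchor = max(shared, key=lambda t: src_vec[t] * tgt_vec[t])
                best = (tgt, anchor, score)
        if best:
            links[src] = best
    return links


docs = {
    "pasta": "Boil the pasta and stir in tomato sauce with fresh basil.",
    "sauce": "A tomato sauce needs ripe tomato, garlic, and basil.",
    "football": "The football team won the match after a late goal.",
}
for source, (target, anchor, score) in suggest_links(docs).items():
    print(f"{source}: link '{anchor}' -> {target} (score {score:.2f})")
```

In this toy corpus the two cooking pages link to each other and the football page gets no link, because it shares no informative terms with the others. Passing something like `feedback={("pasta", "sauce"): (5, 10)}` raises that pair's score, which is one simple way passive click data could reorder candidate links over time.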
To learn more about my project, please download the proposal. You must be on the RIT network to download the project report and code.