Master's Project

Document and Word Retrieval in Structured Peer-to-Peer Networks




 
Document and word co-clustering is widely used in data mining applications. Typical uses include recommending items to users searching the web and retrieving the documents relevant to a given word, or the words relevant to a given document. Documents and words are clustered using a word-document matrix whose rows correspond to the words collected from the documents and whose columns correspond to the documents themselves. Clustering is then performed on this matrix to group related words and documents.
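As an illustration of the word-document matrix described above, the following minimal sketch builds one from raw text. The function name `build_word_document_matrix`, the whitespace tokenization, and the use of raw term counts as matrix entries are assumptions made for this example; the project may weight or tokenize terms differently.

```python
from collections import Counter


def build_word_document_matrix(documents):
    """Return (vocabulary, matrix); matrix[i][j] counts word i in document j.

    Rows are words and columns are documents, as in the text above.
    Tokenization by lowercase whitespace split is an assumption.
    """
    # One Counter of word frequencies per document.
    counts = [Counter(text.lower().split()) for text in documents]
    # The vocabulary is the sorted union of all words seen in any document.
    vocabulary = sorted(set().union(*counts))
    # A Counter returns 0 for words absent from a document.
    matrix = [[c[word] for c in counts] for word in vocabulary]
    return vocabulary, matrix


docs = ["chord routes keys", "pastry routes messages", "chord chord ring"]
vocab, matrix = build_word_document_matrix(docs)
for word, row in zip(vocab, matrix):
    print(word, row)
```

A co-clustering algorithm would then group rows and columns of this matrix simultaneously; that step is outside the scope of this sketch.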
 
      Document and word clustering is typically performed on a central server that holds the entire list of documents and their corresponding words, and a large word-document matrix is built during the clustering process. A centralized server not only suffers from scalability and performance problems, but is also a single point of failure. The work of a central server can be decentralized using either structured or unstructured peer-to-peer networks. Unstructured peer-to-peer networks look up data by flooding the network with queries, which can itself lead to scalability and performance problems. Structured peer-to-peer networks avoid flooding by using distributed hash tables (DHTs) for lookups. Examples of DHT-based peer-to-peer lookup protocols include Chord, CAN and Pastry.
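To make the DHT lookup idea concrete, the sketch below shows the key-to-node mapping rule that Chord uses: identifiers live on a ring, and a key is assigned to its successor, the first node whose identifier is equal to or follows the key. This is a simplified illustration with a global view of all nodes; real Chord resolves the successor through finger tables in a logarithmic number of hops. The 16-bit identifier space and the names `ring_id` and `successor` are assumptions chosen for readability.

```python
import hashlib
from bisect import bisect_left

ID_BITS = 16  # assumption: small ring for illustration


def ring_id(name, bits=ID_BITS):
    """Hash a node name or key onto the identifier ring [0, 2**bits)."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** bits)


def successor(sorted_node_ids, key_id):
    """Return the first node id >= key_id, wrapping around the ring."""
    i = bisect_left(sorted_node_ids, key_id)
    return sorted_node_ids[i % len(sorted_node_ids)]


# Hypothetical node identifiers, already sorted.
nodes = [10, 200, 4000]
print(successor(nodes, 5))     # key before the first node
print(successor(nodes, 201))   # key between two nodes
print(successor(nodes, 5000))  # key past the last node wraps to the first
```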
 
      The project aims to develop a distributed system in which the co-clustering of documents and words is performed in a structured peer-to-peer network using Chord. Instead of storing the entire word-document matrix on a central server, each peer in the network stores a part of the matrix. The design is intended to be scalable, robust and transparent, and to overcome the disadvantages of a centralized server. In addition, the project balances the load on overloaded nodes that store frequently accessed keywords by integrating a distributed binomial lookup tree into the Chord network.
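One plausible way to split the matrix across peers is to treat each word as a key and store its row on the peer responsible for that key on the ring. The sketch below illustrates this under stated assumptions: the row-wise partitioning, the peer names, the 16-bit ring, and the helper names `ring_id` and `partition_rows` are all hypothetical and are not taken from the project itself. The binomial lookup tree used for load balancing is not shown.

```python
import hashlib
from bisect import bisect_left


def ring_id(name, bits=16):
    """Hash a name onto the identifier ring [0, 2**bits)."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** bits)


def partition_rows(word_rows, peer_names, bits=16):
    """Assign each word's matrix row to the peer that succeeds the word's id."""
    # Sort peers by their position on the ring.
    ring = sorted((ring_id(p, bits), p) for p in peer_names)
    ids = [peer_id for peer_id, _ in ring]
    shards = {p: {} for p in peer_names}
    for word, row in word_rows.items():
        # First peer whose id is >= the word's id, wrapping around the ring.
        idx = bisect_left(ids, ring_id(word, bits)) % len(ring)
        shards[ring[idx][1]][word] = row
    return shards


# Rows of a small word-document matrix (word -> counts per document).
word_rows = {"chord": [1, 0, 2], "routes": [1, 1, 0], "ring": [0, 0, 1]}
shards = partition_rows(word_rows, ["peer-a", "peer-b", "peer-c"])
for peer, rows in shards.items():
    print(peer, rows)
```

With this layout, a query for a word is routed to a single peer, which is also why a very popular word can overload the node that stores it.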