Clustering Evolving Data

Eye-tracking based Image Retrieval

Web Service Community Discovery

Semi-supervised Data Clustering with Non-negative Matrix Factorization

Web Image Clustering with integration of Visual and Textual features

Isoperimetric Co-clustering Algorithm (ICA) for pairwise data co-clustering

Co-clustering image features and semantic concepts

Hierarchical image clustering from multi-user feedback


Clustering Evolving Data

Traditional clustering techniques are inapplicable to problems where the relationship between data points evolve over time. Not only is it important for the clustering algorithm to adapt to the recent changes in the evolving data, but it also needs to take the historical relationship between the data points into consideration. We address evolutionary clustering large-scale data by the amalgamation of low-rank kernel matrix factorization and matrix factorization based clustering. Since the low-rank approximation provides a compact representation of the original matrix, and especially, the near-optimal low-rank approximation can preserve the sparsity of the original data, we obtain computational efficiency when dealing with large evolutionary datasets. Moreover, matrix factorization based methods have been shown to effectively cluster high dimensional data in text mining and multimedia data analysis.

We use information from the previous timestep (t - 1) to maintain cluster membership into the present time (t) in order to provide a smoother transition. Here, we can seen that in order to maintain that membership, despite the evenly weighted cut in the present time, the past comes into effect to make cut2 the better choice at time-step t.

Related Publications:

 

TOP


Eye-tracking based Image Retrieval

Within the field of content-based image retrieval (CBIR), a wide variety of solutions have been proposed to perform efficient image retrieval. While many of the initial solutions focused on performing retrieval based on low-level feature similarity of images, it was soon realized that the performance of these systems was limited due to the semantic gap as they were unable to infer the interest of the user. To overcome this, most content-based image retrieval systems typically utilize mouse-clicks and other traditional forms of input to identify the regions or objects of interest. In this work, we explore the utilization of eye-tracking as a control mechanism for content-based image retrieval.

Related Publications:

 

TOP


Web Service Community Discovery

Efficient and accurate discovery of user desired Web services is a key component for achieving the full potential of service computing. However, service discovery is a non-trivial task considering the large and fast growing service space. Meanwhile, Web services are typically autonomous and a priori unknown. This further complicates the service discovery problem. We propose a service community learning algorithm that can generate homogeneous communities from the heterogeneous service space. This can greatly facilitate the service discovery process as the users only need to search within their desired service communities. A key ingredient of the community learning algorithm is a co-clustering scheme that leverages the duality relationship between services and operations. Experimental results on both synthetic and real Web services demonstrate the effectiveness of the proposed service community learning algorithm.

Related Publications:

 

TOP


Semi-supervised Data Clustering with Non-negative Matrix Factorization

Traditional clustering algorithms are inapplicable to many real-world problems where limited knowledge from domain experts is available. Incorporating the domain knowledge can guide a clustering algorithm, consequently improving the quality of clustering. In this paper, we propose SS-NMF: a Semi-Supervised Non-negative Matrix Factorization framework for data clustering. In SS-NMF, users are able to provide supervision for clustering in terms of pairwise constraints on a few data objects specifying whether they “must" or “cannot" be clustered together. Through an iterative algorithm, we perform symmetric tri-factorization of the data similarity matrix to infer the clusters. Theoretically, we show the correctness and convergence of SS-NMF. Moveover, we show that SS-NMF provides a general framework for semi-supervised clustering. Existing approaches can be considered as special cases of it. Through extensive experiments conducted on publicly available datasets, we demonstrate the superior performance of SS-NMF for clustering.

Related Publications:

 

TOP

 


Web Image Clustering with integration of Visual and Textual features

With the explosive growth of Web and the recent development in digital media technology, the number of images on the Web has grown tremendously. Consequently, Web image clustering has emerged as an important application. Some of the initial efforts along this direction revolved around clustering Web images based on the visual features of images or textual features by making use of the text surrounding the images. However, not much work has been done in using multimodal information for clustering Web images. In this paper, we propose a graph theoretical framework for simultaneously integrating visual and textual features for efficient Web image clustering. Specifically, we model visual features, images and words from surrounding text using a tripartite graph. Partitioning this graph leads to clustering of the Web images. Although, graph partitioning approach has been adopted before, the main contribution of this work lies in a new algorithm that we propose - Consistent Isoperimetric High-order Co-clustering (CIHC), for partitioning the tripartite graph. Computationally, CIHC is very quick as it requires a simple solution to a sparse system of linear equations. Our theoretical analysis and extensive experiments performed on real Web images demonstrate the performance of CIHC in terms of the quality, efficiency and scalability in partitioning the visual feature-image-word tripartite graph.

Related Publications:

TOP


Isoperimetric Co-clustering Algorithm (ICA) for pairwise data co-clustering

Data co-clustering refers to the problem of simultaneous clustering of two data types. Typically, the data is stored in a contingency or co-occurrence matrix C where rows and columns of the matrix represent the data types to be co-clustered. An entry Cij of the matrix signifies the relation between the data type represented by row i and column j. Co-clustering is the problem of deriving sub-matrices from the larger data matrix by simultaneously clustering rows and columns of the data matrix. In this paper, we present a novel graph theoretic approach to data co-clustering. The two data types are modeled as the two sets of vertices of a weighted bipartite graph. We then propose Isoperimetric Co-clustering Algorithm (ICA) - a new method for partitioning the bipartite graph. ICA requires a simple solution to a sparse system of linear equations instead of the eigenvalue or SVD problem in the popular spectral co-clustering approach. Our theoretical analysis and extensive experiments performed on publicly available datasets demonstrate the advantages of ICA over spectral approach in terms of the quality, efficiency and stability in partitioning the bipartite graph.

Some results:

In the figure above, (a) shows a typical word-document matrix before co-clustering. Co-clustering leads to rearranging of the documents and words such that the most co-related get grouped together simultaneously. (b) shows a dataset having 2 clusters in the ground truth, bipartitioned using ICA. (c) Similarly, dataset having 6 clusters in the ground truth has been k-partitioned by ICA.

The above figure shows the performance of ICA and Spectral-SVD in the presence of additive and multiplicative Gaussian noise on some datasets. X-axis has the variance of the noise while the isoperimetric ratio of the partitioning is along the Y-axis. Additive noise had zero mean with variance increasing from 1 to the maximum value in the original data. Multiplicative noise had mean of 1 with its variance going from 1 to a maximum of 5. First two plots (top row) are bipartitioning in the presence of additive noise on Interest-Trade dataset and multiplicative noise on ArachidonicAcids-Hematocrit dataset. Similarly, next two plots are for k-partitioning with additive noise on wap and multiplicative on re0 datasets. From these results, we can see that inspite of the varying amounts and kinds of noise in the data, ICA is able to perform optimal partitioning indicated by its low isoperimetric ratio. Second noticeable fact is in regards to stability. Rising ratios as the variance increases indicates that the performance of algorithm is gradually decreasing. However, fluctuating ratios indicates instability and inconsistency to partition optimally.

We plotted the computational speed of ICA and Spectral-SVD in the above figure. The time required to compute the indicator vector is along the Y-axis and the number of vertices in the bipartite graph are shown along the X-axis. Time for ICA gradually increases as the number of vertices increase. However, Spectral-SVD time increases more rapidly comparatively and hence is clearly outperformed by ICA.

Related Publications:

·         Manjeet Rege, Ming Dong and Farshad Fotouhi, “Bipartite Isoperimetric Graph Partitioning for Data Co-clustering”, Data Mining and Knowledge Discovery Journal, Vol. 16, No. 3, 2008.

·         Manjeet Rege, Ming Dong and Farshad Fotouhi, “Co-clustering documents and words using Bipartite Isoperimetric Graph Partitioning”, in proceedings of IEEE International Conference on Data Mining, 2006. (regular paper - acceptance rate 10%, oral presentation)

TOP


  Co-clustering image features and semantic concepts

We propose a bipartite graph-based algorithm for co-clustering image features and semantic concepts. The basic idea is to incorporate semantic information embedded in the accumulated user feedback logs into low-level feature-based clustering, and thus achieve better grouping of images on a semantic level. Compared with prior works, the major advantages of the proposed work can be summarized as follows:

1.      We perform co-clustering of image features and feedback logs (semantic concepts). A cluster of logs naturally represents a semantic concept. Subsequently, image groupings on a semantic level are inferred through clusters of logs.

2.      By performing co-clustering of feedback logs and image features, we can identify features relevant to a semantic concept and hence narrow the semantic gap.

3.      Based on the feedback logs and features co-clustered in our system, we provide a novel approach to utilize a user’s feedback in CBIR. Specifically, a feedback is first classified to a log-feature cluster. Then, images are retrieved using corresponding feature subset to achieve higher precision.

Some results:

Image features used in our experiments were the weights for principal components. After co-clustering, every cluster contains relevance feedback logs and image features grouped together. The above figure shows the principal component images (features) corresponding to a cluster in each relevance feedback logs for Playing Cards were clustered. We can see that the 26 images representing the feature components are indeed Playing Cards like. That is, the features corresponding to this cluster are able to represent the underlying semantic concept.

We used the principal components assigned to a cluster along the V dimension to reconstruct the original image. In the above figure, the top row displays original images in V space and the lower row shows the corresponding reconstruction achieved. It is clear that the corresponding principal components can reconstruct an image resembling the original one. This indicates that the features assigned to a cluster are able to successfully capture the respective semantic concepts.

Related Publication:

·         Manjeet Rege, Ming Dong and Farshad Fotouhi, “Co-clustering image features and semantic concepts”, in proceedings of IEEE International Conference on Image Processing, 2006. (oral presentation)

TOP


  Hierarchical image clustering from multi-user feedback

We present a new paradigm to establish the semantics, i.e. to find meaningful categories and their relations, in an image database based on multi-user relevance feedback. Relevance feedback mechanism is one way to incorporate the users’ perception during image retrieval. By treating each user as an independent weak classifier, we show that combining multi-users’ feedback is equivalent to the combinations of weak independent classifiers. The latter classification system has been shown to have good generalization performance both theoretically and empirically. By including the users in the loop, we build a user-centered category hierarchy that reflects the interests of most current users of our system. Our experimental results based on a city-landscape image database show that the proposed framework supports effective and efficient search and
browsing in image databases.

Some results:

We compare the semantic image hierarchy built by the proposed framework with the other two hierarchies built by clustering algorithms (top-down and bottom-up) in terms of query precision and recall. The mean and variance of query precision (top row) and recall (bottom row) for two image categories as a function of k (the number of retrieved images). It is obvious that query precision and recall based on our semantic hierarchy constantly has greater mean and less variance than that with clustering approaches. We get similar results for all other categories.
 

Related Publications:

·         Manjeet Rege, Ming Dong and Farshad Fotouhi, “Building a User-Centered Semantic Hierarchy in Image Databases”, ACM Multimedia Systems Journal, Vol. 12, No. 2, 2006.

·         Manjeet Rege, Ming Dong and Farshad Fotouhi, “Finding a semantic structure interactively in image databases”, in proceedings of IEEE International Conference on Multimedia and Expo, 2006. (oral presentation) 

TOP