Clustering is one of the most basic mental activities used by humans to handle the huge amount of information they receive every day. As such, clustering has been extensively studied in different disciplines including: statistics, pattern recognition, machine learning and data mining. Nevertheless, the body of knowledge concerning clustering has focused on objects represented as feature vectors stored in a single dataset. Clustering in this setting aims at grouping objects of a single type in a single table into clusters using the feature vectors. On the other hand, modern real-world applications are composed of multiple, large interrelated datasets comprising distinct attribute sets and containing objects from many domains; typically such data is stored in an information network. The types of patterns and knowledge desired in these applications goes far beyond grouping similar homogenous objects, but rather involves unveiling dependency structures in the data in addition to pinpointing hidden associations across objects in multiple datasets and domains. For example consider an information network that contains the domains of authors, papers and conferences. Two authors a1 and a2 may work in the same research field but never publish in the same conference. Hence clustering only the domains of authors and conference would fail to place a1 and a2 in the same cluster; however considering the entire information network would reveal a hidden link via the papers domain, placing a1 and a2 in the same cluster. Notice, that knowledge discovery in the preceding example was derived by clustering objects based on their relation to other objects, as opposed to grouping objects based on their attributes. This form of relational clustering is essential for knowledge discovery in several applications: bioinformatics: exploring the clusters among the domains of genes, diseases, drugs, and patients; social networking: segmenting customers based on friendship relations, social groups, and demographics; and recommender systems: leveraging user ratings, product ratings, product functionality, and blog entries to cluster customers and products simultaneously.
Information-network clustering advances knowledge discovery in two manners. First, hidden associations amongst objects from differing domains are unveiled, leading to a better understanding of the hidden structure of the entire network. Second, local clusters of the objects within a domain are sharpened and put into greater context, leading to more accurate local clustering. In order to extract knowledge of this form, information network clustering algorithms must consider (1) overlapping clusters and (2) a clustering structure that relates the clusters. In this dissertation we develop a framework and several algorithms for information network clustering by leveraging the above two fundamental aspects that facilitate knowledge discovery.
Current state-of-the-art information network clustering algorithms have had relative success in addressing the computational challenge of high dimensional data; however, the majority of these approaches have not addressed the fundamental aspects of overlapping clusters and clustering structure. In this dissertation, we address the information-network clustering problem from a fresh perspective and introduce a novel framework based on Formal Concept Analysis (FCA). Based on mathematical order theory, in particular, the theory of complete lattices, FCA provides a rich theoretical basis for investigating and structuring overlapping relational clustering in a single dataset. Shortcomings of previous methods were overcome by extending FCA to information networks, yielding effective and efficient information network clustering algorithms. Several empirical evaluations performed on a large variety of real-world information networks reveal that the FCA-based algorithms work more effectively and efficiently than the current state of the art.
Additionally, the dissertation addresses two drawbacks of FCA in single-edge information network (bi-clustering). One drawback is that the set of bi-clusters tends to be quite large, which makes reasoning about the bi-clusters quite difficult. We address this problem by introducing the idea of significant distinguishing sets, and an algorithm to efficiently enumerate these sets. The second shortcoming of the FCA framework is the strict definition of a bi-cluster. FCA specifies that a bi-cluster is a maximal sub-matrix of 1s in the dataset, however, in many knowledge discovery tasks the bi-clusters of interest are those subspaces of objects and attributes that exhibit a banded structure. We show, theoretically, the correspondence between bandedness and FCA based clustering. Moreover, we present an algorithm to uncover banded structures via intelligent search of the bi-cluster lattice.