Querying and mining graph databases
by He, Huahai, Ph.D., UNIVERSITY OF CALIFORNIA, SANTA BARBARA, 2007, 209 pages; 3283657

Abstract:

Graphs have become popular for modeling complex data in a variety of domains: chemical compounds, protein structures, protein interaction networks, database schemas, social networks, the Web, XML, multimedia, etc. As a result, graph querying and mining have become important for information retrieval and analysis. However, processing graph queries and mining tasks efficiently raises many challenges. How to address the subgraph isomorphism problem, which is NP-hard? How to measure the similarity between graphs? How to index a large collection of graphs for fast retrieval? How to optimize query processing over large-scale graphs? How to define and discover significant graph patterns in a graph database?

In this dissertation, I demonstrate that the above questions can be addressed with both theoretical soundness and practical efficiency. I first consider queries over a large collection of small graphs. For subgraph queries, I develop an approximation algorithm for the subgraph isomorphism problem. For similarity queries, I measure graph similarity through edit distance using heuristic graph mapping methods. The index structure, called Closure-tree, organizes graphs hierarchically: each node summarizes its descendants by a generalized graph called a graph closure. I then propose GraphQL, a graph query language in which graphs are the basic units of information and each query manipulates collections of graphs. The core of GraphQL is a graph algebra extended from the relational algebra, in which the selection operator is generalized to graph pattern matching and a composition operator is introduced for rewriting matched graphs. Finally, I present efficient techniques for graph pattern matching over large graphs.
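To make the link between graph mappings and edit distance concrete, the following is a minimal sketch and not the dissertation's implementation. It assumes unit edit costs, simple undirected graphs with labeled vertices and unlabeled edges, and a one-to-one vertex mapping supplied by the caller; the graph encoding and the name `mapping_cost` are illustrative assumptions. The cost of any single mapping is an upper bound on the edit distance, which is the minimum over all mappings, so a heuristic that picks a good mapping yields an approximate distance.

```python
from typing import Dict, FrozenSet, Hashable, Set, Tuple

Vertex = Hashable
# Assumed encoding: (vertex -> label, set of undirected edges as 2-element frozensets).
Graph = Tuple[Dict[Vertex, str], Set[FrozenSet[Vertex]]]


def mapping_cost(g1: Graph, g2: Graph, mapping: Dict[Vertex, Vertex]) -> int:
    """Unit-cost edit cost induced by a one-to-one vertex mapping from g1 to g2.

    Unmapped vertices count as deletions (in g1) or insertions (in g2).
    The result is an upper bound on the unit-cost edit distance.
    """
    labels1, edges1 = g1
    labels2, edges2 = g2
    if len(set(mapping.values())) != len(mapping):
        raise ValueError("mapping must be one-to-one")

    # Relabel every mapped vertex pair whose labels differ.
    cost = sum(1 for u, v in mapping.items() if labels1[u] != labels2[v])
    # Delete unmapped vertices of g1, insert unmapped vertices of g2.
    cost += (len(labels1) - len(mapping)) + (len(labels2) - len(mapping))

    # Images in g2 of the g1 edges whose endpoints are both mapped.
    images = {
        frozenset(mapping[u] for u in edge)
        for edge in edges1
        if all(u in mapping for u in edge)
    }
    # Delete g1 edges with no counterpart, insert g2 edges that are not images.
    cost += len(edges1) - len(images & edges2)
    cost += len(edges2 - images)
    return cost


# A labeled triangle versus a labeled path on three vertices.
g1 = ({"a": "C", "b": "C", "c": "O"},
      {frozenset("ab"), frozenset("bc"), frozenset("ac")})
g2 = ({"x": "C", "y": "C", "z": "N"},
      {frozenset("xy"), frozenset("yz")})
mapping = {"a": "x", "b": "y", "c": "z"}
print(mapping_cost(g1, g2, mapping))  # 2: one relabel (O -> N), one edge deletion (a-c)
```

Under this mapping the two graphs differ by one vertex relabeling and one edge deletion; a poorer mapping would give a larger cost, which is why the quality of the mapping heuristic matters.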

In graph mining, I focus on finding significant graph patterns. For the case of a large collection of graphs, I present GraphRank, a technique that evaluates and ranks frequent subgraphs by their statistical significance. I also address feature vector mining, which generalizes frequent itemset mining. In the case of mining a large weighted graph, I consider local maximal substructures around a given node. A scalable algorithm is developed for the k-MST problem (minimum spanning tree over k vertices) with approximation guarantees. All the presented techniques have been validated through extensive experiments on real and synthetic graphs.
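As an illustration of the k-MST problem statement only, the sketch below finds the minimum-weight tree spanning exactly k vertices by brute force: it tries every k-vertex subset and runs Kruskal's algorithm on the induced subgraph. This is exponential in k and is not the scalable approximation algorithm described above; the edge-list encoding and the function names are assumptions made for the example.

```python
from itertools import combinations


def induced_mst_weight(vertices, edges):
    """Kruskal's algorithm on the subgraph induced by `vertices`.

    `edges` is a list of (u, v, weight) triples for an undirected graph.
    Returns the spanning tree weight, or None if the subgraph is disconnected.
    """
    chosen = set(vertices)
    parent = {v: v for v in chosen}

    def find(v):
        # Union-find lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    total, used = 0, 0
    inside = sorted((w, u, v) for u, v, w in edges if u in chosen and v in chosen)
    for w, u, v in inside:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v
            total += w
            used += 1
    return total if used == len(chosen) - 1 else None


def brute_force_k_mst(nodes, edges, k):
    """Return (weight, vertex subset) of a minimum tree on exactly k vertices, or None."""
    best = None
    for subset in combinations(nodes, k):
        weight = induced_mst_weight(subset, edges)
        if weight is not None and (best is None or weight < best[0]):
            best = (weight, subset)
    return best


nodes = ["a", "b", "c", "d"]
edges = [("a", "b", 1), ("b", "c", 2), ("c", "d", 1), ("a", "d", 5), ("a", "c", 4)]
print(brute_force_k_mst(nodes, edges, 3))  # (3, ('a', 'b', 'c'))
```

The number of subsets grows combinatorially with the graph size, which is the reason approximation algorithms are of interest for large weighted graphs.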

 
Adviser: Ambuj K. Singh
School: UNIVERSITY OF CALIFORNIA, SANTA BARBARA
Source: DAI/B 68-10, p. , Jan 2008
Source Type: Dissertation
Subjects: Computer science
Publication Number: 3283657

  Use the link below to access a full citation record of this graduate work:
  http://gateway.proquest.com/openurl%3furl_ver=Z39.88-2004%26res_dat=xri:pqdiss%26rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation%26rft_dat=xri:pqdiss:3283657