Distance-based indexing and its applications in bioinformatics
by Mao, Rui, Ph.D., THE UNIVERSITY OF TEXAS AT AUSTIN, 2007, 137 pages; 3297062

Abstract:

Distance-based indexing is a general solution to the problem of searching based on similarity of complex data types. Distances between pairs of data objects are specified by a metric function, which is positive definite, symmetric and satisfies the triangle inequality. Distance-based indexing is exploited in many data intensive applications, such as multimedia and biological databases. This dissertation concerns distance-based indexing in the context of the MoBIoS (Molecular Biology Information System) project, a next generation DBMS aimed at bioinformatics applications.

The three primary contributions of this dissertation include evaluation and improvement of existing distance-based index structures, a pivot space model for the pivot selection problem of distance-based indexing and the MoBIoS Index software package.

For the first part, we designed a bi-directional construction algorithm for Metric Tree (M-tree). It increased the construction speed and improved the query performance. Then, we conducted a comparative analysis of the performance of M-tree, General Hyper-plane Tree (GHT) and Vantage Point Tree (VPT) on biological work loads, and VPT turned out to be the optimal choice. Further, we designed better heuristics for pivot selection and partition of VPT.

Superior to multi-dimensional indexing in its generality, distance-based indexing is difficult by virtue that no coordinate structure is available and thus many mathematical tools in Rn cannot be applied. A common method used is to map the metric space into Rn and then apply geometric methods. Summarizing this method, we propose the pivot space model, which demonstrates the validity of tackling distance-based problems in Rn. Further, we study distance-based indexing from the perspective of dimension reduction. We prove that the only form of dimension reduction leading to performance gain for distance-based indexing is pivot selection, which does not create new dimensions. To adapt general dimension reduction techniques to distance-based problems, we propose a heuristic that approximates the dimensions created by dimension reduction. As a demonstration, a dimension reduction algorithm and a method to estimate the intrinsic dimension of data are proposed based on Principal Component Analysis (PCA). These ideas are evaluated on a test suite consisting of synthetic and real workloads.

The MoBIoS Index was released in 2006. It supports similarity queries for user defined data types and distance functions. It has been used to build several biological application.

 
AdviserDaniel P. Miranker
SchoolTHE UNIVERSITY OF TEXAS AT AUSTIN
SourceDAI/B 69-02, p. , Apr 2008
Source TypeDissertation
SubjectsBioinformatics; Computer science
Publication Number3297062
Adobe PDF Access the complete dissertation:
 

» Find an electronic copy at your library.
  Use the link below to access a full citation record of this graduate work:
  http://gateway.proquest.com/openurl%3furl_ver=Z39.88-2004%26res_dat=xri:pqdiss%26rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation%26rft_dat=xri:pqdiss:3297062
  If your library subscribes to the ProQuest Dissertations & Theses (PQDT) database, you may be entitled to a free electronic version of this graduate work. If not, you will have the option to purchase one, and access a 24 page preview for free (if available).

About ProQuest Dissertations & Theses
With over 2.3 million records, the ProQuest Dissertations & Theses (PQDT) database is the most comprehensive collection of dissertations and theses in the world. It is the database of record for graduate research.

The database includes citations of graduate works ranging from the first U.S. dissertation, accepted in 1861, to those accepted as recently as last semester. Of the 2.3 million graduate works included in the database, ProQuest offers more than 1.9 million in full text formats. Of those, over 860,000 are available in PDF format. More than 60,000 dissertations and theses are added to the database each year.

If you have questions, please feel free to visit the ProQuest Web site - http://www.proquest.com - or call ProQuest Hotline Customer Support at 1-800-521-3042.