Mining, indexing, and search approaches to entity and graph information retrieval for chemoinformatics
by Sun, Bingjun, Ph.D., THE PENNSYLVANIA STATE UNIVERSITY, 2008, 154 pages; 3346376

Abstract:

In this work, we show how to build a domain specific search engine that enables both entity and 2D graph searches for chemical molecules. First of all, documents are collected from the Web, and then preprocessed using document classification and segmentation. We apply Support Vector Machines for classification and propose a novel method of text segmentation. Then chemical entities in the documents are tagged and indexed to provide fast searches. Simultaneously, chemical structure information are collected, processed, and indexed for fast graph searches.

We show how chemical entity searches can improve the relevance of returned documents by avoiding those ambiguous terms. Our search engine first extracts chemical entities from text, performs novel indexing suitable for chemical names and formulae, and supports different query models that a scientist may require. We propose a model of hierarchical conditional random fields for entity tagging that considers long-term dependencies at the sentence level. Then to support efficient and effective entity searches, we propose two feature selection methods for entity index building. One is to select frequent and discriminative subsequences from all the candidate features for chemical formula indexing. The other is to first discover subterms of chemical names with corresponding probabilities using a proposed independent frequent subsequence mining algorithm, and then segment a chemical name hierarchically into a tree structure based on discovered independent frequent subsequences. A unsupervised hierarchical text segmentation (HTS) method is proposed for this. Then subterms on the HTS tree can be indexed. Finally, query models with corresponding ranking functions are introduced for chemical entity searches. Experiments show that our approaches to chemical entity tagging, indexing, and search perform well.

In addition to text searches, massive amounts of structured data in Cheminformatics and Bioinformatics raise an issue of effective and efficient structure searches. Graphs are used to model structures for a long time. A typical type of graph search is a subgraph query that retrieves all graphs containing the query graph. To support efficient search, graph features such as subgraphs are extracted from graphs for graph indexing. One of the key issues is, given the set of all the possible subgraphs of the graph set, which subset of features is the optimal to achieve the highest precision for a given query set? To answer this question, first, we introduce a graph search process for subgraph queries. Then, we propose several novel feature selection criteria, Max-Precision, Max-Irredundant-Information, and Max-Information-Min-Redundancy, based on pointwise mutual information. We also propose a greedy feature search algorithm to find a near optimal feature set based on Max-Information-Min-Redundancy. Finally we show empirically that our proposed methods achieve a higher precision than previous methods.

Besides subgraph queries, users often execute similarity graph queries to search for graphs similar to the query graph. Previous methods usually use the maximum common edge subgraph (MCEG) to measure the similarity of two graphs. However, MCEG isomorphism is NP-hard and prohibitively slow to scan all graphs in the data set, so that it is not feasible for online searches to use MCEG in the fly to measure the similarity of retrieved graphs. We propose a novel approach to this issue from a different view by ranking search results using a weighted linear graph kernel that can avoid real time MCEG isomorphism tests, but achieve a reasonably high quality of search results. First, subgraphs features are extracted from graphs and indexed. Second, for a given graph query, graphs containing its subgraph features are retrieved and similarity scores are computed based on the indexed subgraph features. Finally, graphs are sorted and returned based on similarity scores. Subgraph weights used by the graph kernel are learned offline from a training set generated using MCEG isomorphism. We show empirically that the proposed methods of learning to rank for similarity graph queries can achieve a reasonably high normalized discounted cumulative gain in comparison with the "gold standard" method based on MCEG isomorphism. (Abstract shortened by UMI.)

 
AdvisersC. Lee Giles; Prasenjit Mitra
SchoolTHE PENNSYLVANIA STATE UNIVERSITY
SourceDAI/B 70-02, p. , Apr 2009
Source TypeDissertation
SubjectsInformation science; Computer science
Publication Number3346376
Adobe PDF Access the complete dissertation:
 

» Find an electronic copy at your library.
  Use the link below to access a full citation record of this graduate work:
  http://gateway.proquest.com/openurl%3furl_ver=Z39.88-2004%26res_dat=xri:pqdiss%26rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation%26rft_dat=xri:pqdiss:3346376
  If your library subscribes to the ProQuest Dissertations & Theses (PQDT) database, you may be entitled to a free electronic version of this graduate work. If not, you will have the option to purchase one, and access a 24 page preview for free (if available).

About ProQuest Dissertations & Theses
With over 2.3 million records, the ProQuest Dissertations & Theses (PQDT) database is the most comprehensive collection of dissertations and theses in the world. It is the database of record for graduate research.

The database includes citations of graduate works ranging from the first U.S. dissertation, accepted in 1861, to those accepted as recently as last semester. Of the 2.3 million graduate works included in the database, ProQuest offers more than 1.9 million in full text formats. Of those, over 860,000 are available in PDF format. More than 60,000 dissertations and theses are added to the database each year.

If you have questions, please feel free to visit the ProQuest Web site - http://www.proquest.com - or call ProQuest Hotline Customer Support at 1-800-521-3042.