In this work, we show how to build a domain-specific search engine that enables both entity and 2D graph searches for chemical molecules. First, documents are collected from the Web and then preprocessed using document classification and segmentation. We apply Support Vector Machines for classification and propose a novel method of text segmentation. Chemical entities in the documents are then tagged and indexed to provide fast searches. In parallel, chemical structure information is collected, processed, and indexed for fast graph searches.
We show how chemical entity searches can improve the relevance of returned documents by avoiding ambiguous terms. Our search engine first extracts chemical entities from text, performs novel indexing suited to chemical names and formulae, and supports the different query models a scientist may require. We propose a hierarchical conditional random field model for entity tagging that captures long-distance dependencies at the sentence level. To support efficient and effective entity searches, we then propose two feature selection methods for building entity indexes. The first selects frequent and discriminative subsequences from all candidate features for chemical formula indexing. The second first discovers subterms of chemical names, with corresponding probabilities, using a proposed independent frequent subsequence mining algorithm, and then segments a chemical name hierarchically into a tree structure based on the discovered independent frequent subsequences; an unsupervised hierarchical text segmentation (HTS) method is proposed for this purpose, and the subterms on the HTS tree are then indexed. Finally, query models with corresponding ranking functions are introduced for chemical entity searches. Experiments show that our approaches to chemical entity tagging, indexing, and search perform well.
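As a rough illustration of how discovered subterms can drive name segmentation and indexing, the sketch below greedily tokenizes a chemical name against a toy subterm dictionary (longest match first). It is a minimal stand-in under stated assumptions: the dictionary contents are invented for illustration, and the actual HTS method mines independent frequent subsequences with probabilities and builds a full segmentation tree, which is not reproduced here.

```python
# Toy subterm dictionary; in practice these would be independent
# frequent subsequences mined from a large corpus of chemical names.
SUBTERMS = {"methyl", "ethyl", "ketone", "prop", "ol"}

def tokenize(name):
    """Greedy longest-match segmentation of a chemical name into known
    subterms; spans matching no subterm fall back to single characters."""
    tokens, i = [], 0
    while i < len(name):
        match = None
        # Try the longest candidate substring starting at i first.
        for j in range(len(name), i, -1):
            if name[i:j] in SUBTERMS:
                match = name[i:j]
                break
        if match:
            tokens.append(match)
            i += len(match)
        else:
            tokens.append(name[i])
            i += 1
    return tokens

print(tokenize("methylethylketone"))  # ['methyl', 'ethyl', 'ketone']
```

The resulting subterms would then serve as index terms, so that a query for "ethyl ketone" can match documents mentioning the longer name.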
In addition to text searches, the massive amounts of structured data in Cheminformatics and Bioinformatics raise the issue of effective and efficient structure searches. Graphs have long been used to model structures. A typical type of graph search is the subgraph query, which retrieves all graphs containing the query graph. To support efficient search, graph features such as subgraphs are extracted from graphs for graph indexing. A key issue is: given the set of all possible subgraphs of the graph set, which subset of features is optimal for achieving the highest precision on a given query set? To answer this question, we first introduce a graph search process for subgraph queries. We then propose several novel feature selection criteria based on pointwise mutual information: Max-Precision, Max-Irredundant-Information, and Max-Information-Min-Redundancy. We also propose a greedy feature search algorithm that finds a near-optimal feature set under the Max-Information-Min-Redundancy criterion. Finally, we show empirically that our proposed methods achieve higher precision than previous methods.
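The greedy selection idea can be sketched as follows. This is an illustrative stand-in, not the dissertation's exact formulation: each candidate feature is represented by the set of database graphs it occurs in, its relevance is approximated by self-information (rarer features prune more candidates), and its redundancy by the average pointwise mutual information with already-selected features.

```python
import math

def score(cand, selected, n_graphs):
    """Relevance-minus-redundancy score for a candidate feature,
    where `cand` and members of `selected` are sets of graph ids."""
    p = len(cand) / n_graphs
    info = -math.log(p) if p > 0 else 0.0  # self-information of the feature
    red = 0.0
    for s in selected:
        inter = len(cand & s) / n_graphs
        if inter > 0:
            # Pointwise mutual information between cand and s.
            red += math.log(inter / (p * (len(s) / n_graphs)))
    return info - (red / len(selected) if selected else 0.0)

def greedy_select(candidates, k, n_graphs):
    """Greedily pick k features, each maximizing the score given
    the features chosen so far."""
    selected, pool = [], list(candidates)
    for _ in range(k):
        best = max(pool, key=lambda c: score(c, selected, n_graphs))
        selected.append(best)
        pool.remove(best)
    return selected
```

For example, with candidates occurring in graph sets {0,1,2,3}, {0}, and {1,2} out of four graphs, the rare feature {0} is chosen first, and {1,2} is preferred next because it overlaps less with what is already selected.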
Besides subgraph queries, users often issue similarity graph queries to search for graphs similar to a query graph. Previous methods usually use the maximum common edge subgraph (MCEG) to measure the similarity of two graphs. However, MCEG isomorphism is NP-hard and prohibitively slow when scanning all graphs in the data set, so it is not feasible for online searches to compute the MCEG on the fly to measure the similarity of retrieved graphs. We approach this issue from a different angle, ranking search results using a weighted linear graph kernel that avoids real-time MCEG isomorphism tests while achieving reasonably high-quality search results. First, subgraph features are extracted from graphs and indexed. Second, for a given graph query, graphs containing its subgraph features are retrieved, and similarity scores are computed from the indexed subgraph features. Finally, graphs are sorted by similarity score and returned. The subgraph weights used by the graph kernel are learned offline from a training set generated using MCEG isomorphism. We show empirically that the proposed learning-to-rank methods for similarity graph queries achieve a reasonably high normalized discounted cumulative gain in comparison with the "gold standard" method based on MCEG isomorphism. (Abstract shortened by UMI.)
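The scoring and ranking steps above can be sketched as a weighted linear kernel over indexed subgraph features. The feature names and weights below are invented toy values; in the actual system the weights are learned offline from MCEG-based training data, and no MCEG computation happens at query time.

```python
def kernel(q_feats, g_feats, weights):
    """Weighted linear kernel: sum of learned weights over the subgraph
    features shared by the query and a candidate graph."""
    return sum(weights.get(f, 0.0) for f in q_feats & g_feats)

def rank(query_feats, graphs, weights):
    """Score every candidate graph against the query and return graph
    ids sorted by descending kernel score."""
    scored = [(kernel(query_feats, g, weights), gid)
              for gid, g in graphs.items()]
    return [gid for s, gid in sorted(scored, reverse=True)]

# Toy feature vocabulary and learned weights (illustrative only).
weights = {"C-C": 1.0, "C=O": 2.0, "ring6": 3.0}
query = {"C-C", "C=O", "ring6"}
graphs = {"g1": {"C-C"}, "g2": {"C-C", "ring6"}, "g3": {"C=O"}}
print(rank(query, graphs, weights))  # ['g2', 'g3', 'g1']
```

Because scoring touches only the inverted index entries for the query's features, the whole ranking step stays linear in the number of matching feature postings rather than requiring any NP-hard subgraph computation online.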