Using automated methods to identify connections across biomedical terminologies
by Patel, Chintan, Ph.D., COLUMBIA UNIVERSITY, 2009, 127 pages; 3393592

Abstract:

Terminologies play important roles in the annotation, organization, sharing and retrieval of biomedical data. Several biomedical informatics applications, such as information retrieval, decision support and knowledge discovery, depend on the identification of mappings, translations, or links (collectively referred to as connections) between different terminologies. Consider, for example, the connection between Hb (SNOMED CT) and Hemoglobin (LOINC) for a terminology mapping application, or between Hb (SNOMED CT) and Anemia (MeSH) for an information retrieval application. Identifying such cross-terminology connections for a given application or task is a challenging problem. Existing research methods to identify connections are based on either manual selection or problem-specific algorithms, which are labor intensive, difficult to maintain, or unscalable beyond their original domains.

Integrated Biomedical Terminology Resources (IBTRs) such as the Unified Medical Language System and Open Biomedical Ontologies provide rich knowledge sources that can potentially be used for connection identification. In this research, we propose a novel machine learning-based approach that uses the semantic and structural features of an IBTR to identify connections across biomedical terminologies. First, we model the semantic features in the IBTR based on the hierarchical organization of concepts and the set of associative relationships asserted between indirect, transitively related concept pairs. Second, we analyze the structural topology of the IBTR using network theoretic methods such as the scale-free property, clustering coefficient and topological overlap to identify connections. Finally, the semantic and structural properties of the IBTR are combined into an integrated model for connection identification. A web-based tool, TermLink, was developed based on the proposed methods to perform a cost-benefit analysis and disseminate the research to the wider community.
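As a rough illustration of two of the structural measures named above, the following Python sketch computes the local clustering coefficient and a pairwise topological overlap on a tiny graph. Everything in it is an assumption made for illustration: the toy edge list, the function names, and the choice of the commonly used overlap definition (shared neighbors plus the direct link, divided by the smaller degree plus one minus the direct link). The dissertation may define or compute these measures differently, and real IBTR graphs are far larger.

```python
from itertools import combinations

# Hypothetical toy graph: nodes are concept labels, edges are asserted
# relationships. These edges are invented for illustration only.
EDGES = [
    ("Hb", "Hemoglobin"),
    ("Hb", "Blood"),
    ("Hemoglobin", "Blood"),
    ("Hb", "Anemia"),
    ("Anemia", "Disease"),
]


def build_adjacency(edges):
    """Return an undirected adjacency map: node -> set of neighbors."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def clustering_coefficient(adj, node):
    """Fraction of a node's neighbor pairs that are themselves linked."""
    neighbors = adj[node]
    k = len(neighbors)
    if k < 2:
        return 0.0
    links = sum(1 for u, v in combinations(neighbors, 2) if v in adj[u])
    return 2.0 * links / (k * (k - 1))


def topological_overlap(adj, a, b):
    """Overlap of two nodes' neighborhoods (assumed definition, see lead-in)."""
    direct = 1 if b in adj[a] else 0
    shared = len(adj[a] & adj[b])
    smaller_degree = min(len(adj[a]), len(adj[b]))
    return (shared + direct) / (smaller_degree + 1 - direct)


if __name__ == "__main__":
    adj = build_adjacency(EDGES)
    print(clustering_coefficient(adj, "Hb"))
    print(topological_overlap(adj, "Hb", "Disease"))
```

In a sketch like this, a pair of concepts with no direct edge but a high overlap score could be treated as a candidate connection, and scores of this kind could serve as input features to a classifier. How the dissertation actually turns such measures into features is not stated in the abstract.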

We evaluated the proposed methods on eight training datasets corresponding to connection identification tasks such as terminology mapping and information retrieval. The results indicate that the semantic properties provide high classification accuracy (80–90%) across all the training datasets. The structural analysis of the IBTR revealed new classes of concepts, such as informational and noisy hubs, that provide strong cues for connection identification. The integrated semantic and structural approach was more effective than either individual approach in terms of classification accuracy and computational time. The cost-benefit analysis revealed that the TermLink tool could potentially reduce the time required to identify connections manually.

In conclusion, the existing knowledge in the IBTRs can be modeled using semantic and structural features to identify connections for different applications. The proposed approach provides a robust, automated and problem-independent method that can be combined with manual methods to reduce the time and cost of identifying connections across biomedical terminologies.

 
Adviser: James Cimino
School: COLUMBIA UNIVERSITY
Source: DAI/B 71-02, p. , Apr 2010
Source Type: Dissertation
Subjects: Bioinformatics; Computer science
Publication Number: 3393592
Access the complete dissertation:

Find an electronic copy at your library.
  Use the link below to access a full citation record of this graduate work:
  http://gateway.proquest.com/openurl%3furl_ver=Z39.88-2004%26res_dat=xri:pqdiss%26rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation%26rft_dat=xri:pqdiss:3393592
  If your library subscribes to the ProQuest Dissertations & Theses (PQDT) database, you may be entitled to a free electronic version of this graduate work. If not, you will have the option to purchase one, and access a 24 page preview for free (if available).
