Automatic wrapper generation for the extraction of search result records from search engines
by Zhao, Hongkun, Ph.D., STATE UNIVERSITY OF NEW YORK AT BINGHAMTON, 2007, 162 pages; 3289098

Abstract:

The deep web, which is estimated to be about 500 times larger than the surface web, is extremely under-utilized. Researchers have been working on various issues related to building large-scale deep web applications, which aim to unleash the real power of the deep web. One of the key issues facing large-scale deep web applications is the extraction and understanding of the data returned by deep web sites. To utilize the data in deep web sites, we need to extract the data (search result records) from the search result pages returned by those sites, which contain both the data of interest and other unrelated content. Data extraction from web pages is generally a very hard problem. The performance of existing approaches in the literature is far from satisfactory.

This dissertation studies the problem of extracting search result records from pages returned by search engines, in both deep web sites and surface web sites. A method that combines the visual content features and the HTML tag structures of the result pages is proposed to generate wrappers for the extraction of search result records. This novel technique achieves significantly better performance than state-of-the-art approaches.
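As a rough illustration of what a tag-structure-based wrapper can look like, the sketch below groups page text by HTML tag path and treats the shallowest path that repeats several times as the record container. It is not the dissertation's method: it ignores visual content features entirely, and the class `TagPathCollector`, the function `extract_records`, the `min_repeats` threshold, and the sample page are all assumptions made for this example.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Elements that never have a closing tag; they are skipped so the stack stays balanced.
VOID = {"br", "img", "hr", "input", "meta", "link"}


class TagPathCollector(HTMLParser):
    """Collects, for every element, the text nested under it, keyed by tag path."""

    def __init__(self):
        super().__init__()
        self.stack = []                    # currently open tags
        self.records = defaultdict(list)   # tag path -> one text list per element

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        self.stack.append(tag)
        self.records["/".join(self.stack)].append([])

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        # Credit the text to every open ancestor element.
        for depth in range(1, len(self.stack) + 1):
            self.records["/".join(self.stack[:depth])][-1].append(text)


def extract_records(html, min_repeats=3):
    """Return the text of elements under the shallowest tag path that repeats."""
    collector = TagPathCollector()
    collector.feed(html)
    candidates = {}
    for path, elements in collector.records.items():
        texts = [" ".join(chunks) for chunks in elements if chunks]
        if len(texts) >= min_repeats:
            candidates[path] = texts
    if not candidates:
        return []
    best = min(candidates, key=lambda p: p.count("/"))
    return candidates[best]


PAGE = """<html><body><h1>Results</h1><ul>
<li><a href="/1">First</a><p>alpha</p></li>
<li><a href="/2">Second</a><p>beta</p></li>
<li><a href="/3">Third</a><p>gamma</p></li>
</ul><p>footer</p></body></html>"""

print(extract_records(PAGE))
```

A heuristic this simple is easily confused by navigation bars or advertisement blocks that also repeat, which is one reason a tag-only approach may fall short on real result pages.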

Extracting search result records from categorized result pages requires maintaining the section-record relationships. Major issues such as section boundaries and optional sections make achieving good performance difficult. We introduce a novel method based on the content properties of search result records and the dynamic properties of sections.

A search result record usually consists of multiple data units. The semi-structured nature of search result records makes data unit extraction a hard problem. Mismatches between the HTML tag structures and the data structure of search result records, as well as optional and disjunctive data units, further limit performance. We introduce a novel directed acyclic graph representation of search result record templates, which can be used to extract data units from search result records. An effective algorithm based on machine learning and statistics that extracts templates from search result records is also presented.
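To make the idea of a directed acyclic graph template concrete, here is a minimal sketch, assuming a template can be modeled as a graph whose nodes are data unit labels and whose edges say which label may follow which. An optional unit is an edge that skips over it, and disjunctive units are alternative branches. The labels, the `TEMPLATE` graph, and the `matches` function are hypothetical and are not taken from the dissertation.

```python
from typing import Dict, List, Set

# Hypothetical record template. START and END are artificial nodes.
TEMPLATE: Dict[str, Set[str]] = {
    "START": {"title"},
    "title": {"snippet", "price", "rating"},  # snippet is optional: it can be skipped
    "snippet": {"price", "rating"},
    "price": {"url"},                         # price and rating are disjunctive
    "rating": {"url"},
    "url": {"END"},
}


def matches(template: Dict[str, Set[str]], labels: List[str]) -> bool:
    """Return True if the label sequence traces a START-to-END path in the graph."""
    path = ["START", *labels, "END"]
    return all(nxt in template.get(cur, set()) for cur, nxt in zip(path, path[1:]))


print(matches(TEMPLATE, ["title", "snippet", "price", "url"]))  # all units present
print(matches(TEMPLATE, ["title", "rating", "url"]))            # optional snippet skipped
print(matches(TEMPLATE, ["title", "price", "rating", "url"]))   # both alternatives: rejected
```

This sketch assumes the data units of a record have already been labeled; learning the template from unlabeled records, which the abstract describes, is a separate and much harder step that is not shown here.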

 
Adviser: Weiyi Meng
School: STATE UNIVERSITY OF NEW YORK AT BINGHAMTON
Source: DAI/B 68-11, p. , Apr 2008
Source Type: Dissertation
Subjects: Computer science
Publication Number: 3289098

  Use the link below to access a full citation record of this graduate work:
  http://gateway.proquest.com/openurl%3furl_ver=Z39.88-2004%26res_dat=xri:pqdiss%26rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation%26rft_dat=xri:pqdiss:3289098