Faceted searching and browsing over large collections of textual and text-annotated objects
by Dakka, Wisam, Ph.D., COLUMBIA UNIVERSITY, 2008, 193 pages; 3343500

Abstract:

The vast majority of Internet users rely on search functionality to navigate the text and text-annotated collections of a variety of web sites. Users of sites such as The New York Times archive and YouTube often face long lists of results for their queries because the collections are so large. Processing numerous items is also a hurdle for "exploratory" users who have no specific query in mind, such as a new shopper in an online store or a researcher accessing a news archive. In this thesis, we address this problem. We investigate faceted searching and browsing to provide users with access methods that are useful for discovering the content and the structure of long search results or large collections.

Hierarchies that organize items based on their topics are common for browsing a large set of items. For example, Yahoo! uses a topic-based hierarchy to guide users to their web pages of interest. Google News and Newsblaster enable news readers to quickly navigate the daily news based on a hierarchy of topics and related events. We first present a technique for summarization-aware topic faceted searching and browsing, which integrates clustering and summarization so that users can browse a list of summarized clusters in the query results instead of individual documents. We have built a fully functional summarization-aware search system for daily news.

In addition to the topic facet, time can be used as an alternative facet for browsing search results. We explore time as an important dimension and suggest a general framework for time-based language models to incorporate time into the retrieval task. Many facets other than topic and time can also be useful for faceted searching and browsing. We therefore propose supervised and unsupervised methods to identify and extract multiple relevant facets from collections. Yet incorporating such facets in searching or browsing is not an easy task.
A typical approach to using facets in searching and browsing is to build an individual hierarchy for each facet. Unfortunately, these hierarchies are currently constructed and populated manually or semi-manually, which prevents deploying them for large collections because of the cost of manually annotating each item. To solve this problem, we propose a system to automate the construction of hierarchies for the extracted facets, and show its effectiveness through appropriate user studies. We apply the faceted hierarchies to a range of large data sets, including collections of annotated images, television programming schedules, and web pages.
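
As a rough illustration of what faceted browsing means in practice, the following minimal Python sketch counts facet values over a small annotated collection and narrows the collection when a user selects a value. It is not the system described in the thesis: the toy items, the flat "topic" and "year" facets, and the function names facet_counts and refine are all assumptions made for illustration, and no automatic facet extraction or hierarchy construction is attempted.

```python
from collections import Counter

# Hypothetical toy collection: each item has a title plus facet annotations.
ITEMS = [
    {"title": "Election results announced", "facets": {"topic": "politics", "year": "2004"}},
    {"title": "Senate debates budget", "facets": {"topic": "politics", "year": "2008"}},
    {"title": "Olympic games open", "facets": {"topic": "sports", "year": "2008"}},
    {"title": "New particle collider starts", "facets": {"topic": "science", "year": "2008"}},
]


def facet_counts(items, facet):
    """Count how many items carry each value of the given facet."""
    return Counter(
        item["facets"][facet] for item in items if facet in item["facets"]
    )


def refine(items, facet, value):
    """Keep only the items whose facet annotation equals the selected value."""
    return [item for item in items if item["facets"].get(facet) == value]


if __name__ == "__main__":
    # Show the counts a browsing interface might display next to each facet value.
    print(facet_counts(ITEMS, "topic"))
    # Narrow to one year, then recompute the topic counts for the remaining items.
    selected = refine(ITEMS, "year", "2008")
    print(facet_counts(selected, "topic"))
```

Selecting a value in one facet and recomputing the counts for the others is the basic interaction loop; a hierarchical version would replace each flat value with a path through a facet hierarchy.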

 
Adviser: Luis Gravano
School: COLUMBIA UNIVERSITY
Source: DAI/B 70-01, p. , Apr 2009
Source Type: Dissertation
Subjects: Computer science
Publication Number: 3343500
Access the complete dissertation:
  Use the link below to access a full citation record of this graduate work:
  http://gateway.proquest.com/openurl%3furl_ver=Z39.88-2004%26res_dat=xri:pqdiss%26rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation%26rft_dat=xri:pqdiss:3343500
  If your library subscribes to the ProQuest Dissertations & Theses (PQDT) database, you may be entitled to a free electronic version of this graduate work. If not, you will have the option to purchase one and to access a 24-page preview for free (if available).
