Preserving the decision boundary through data selection for support vector machines
by Sun, Chaofan, Ph.D., UNIVERSITY OF HOUSTON, 2007, 138 pages; 3279598

Abstract:

As a state-of-the-art learning approach, support vector machines (SVMs) have been shown to offer advantages over other learning approaches. Because of its high time complexity, however, conventional SVM training becomes intolerably slow, and sometimes impractical, for large data sets. In many cases the support vectors (SVs) form only a small subset of the data set, and this subset alone determines the decision boundary. The goal of this study is to develop data pre-processing procedures that efficiently reduce large data sets while preserving the decision boundary, so that SVM performance is not degraded.

This study consists of three levels of data selection for SVMs. In the first level, data selection is carried out using the closest pairs (CPs) and nearest neighbors of the opposite class (NNOs) approaches. These approaches select only boundary region vectors (BRVs), which preserve the decision boundary, so that an SVM trained on the BRVs performs comparably to one trained on the full data set. Investigations show that BRV-based data selection works well for small data sets. In the second level, spatial approximation sample hierarchy (SASH) trees are used to speed up BRV-based data selection for large data sets. Investigations show that SASHs approximate the exact BRVs with 90% or higher accuracy, and that the overall time saved at this level can be 60% or more for data sets larger than 30k vectors. In the third level, limited-size SASHs are used to further reduce the time spent on data selection for over-sized data sets. Analysis and experiments demonstrate that data selection can then be done in linear time. Throughout this study, we demonstrate that the proposed data selection approaches efficiently select BRVs that preserve the decision boundary well. The same idea is also applied to active learning.
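
To make the first-level idea concrete, the following is a minimal sketch of NNO-style selection. It is an illustration only, not the dissertation's procedure: the function name `select_nno`, the parameter `k`, and the toy data are assumptions made for this example. It uses a brute-force search, which is quadratic in the number of vectors; that cost is the kind of bottleneck the SASH-based levels are described as addressing.

```python
from math import dist

def select_nno(points, labels, k=1):
    """Return sorted indices of points that are among the k nearest
    opposite-class neighbors of at least one point (hypothetical helper)."""
    selected = set()
    for p, y in zip(points, labels):
        # Candidates are all points carrying a different label.
        opposite = [j for j, yj in enumerate(labels) if yj != y]
        # Order opposite-class points by Euclidean distance to p.
        opposite.sort(key=lambda j: dist(p, points[j]))
        # Keep the k closest as boundary-region candidates.
        selected.update(opposite[:k])
    return sorted(selected)

# Toy 1-D data (assumed): two classes separated around x = 0.
points = [(-5.0,), (-3.0,), (-1.0,), (1.0,), (3.0,), (5.0,)]
labels = [-1, -1, -1, 1, 1, 1]
boundary = select_nno(points, labels, k=1)
print(boundary)  # indices of the two vectors facing each other across the gap
```

On this toy set only the two vectors nearest the class gap are kept, and an SVM would then be trained on that subset instead of all six vectors. Larger values of `k` keep a thicker band around the boundary at the cost of a larger training set.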

 
Advisor
School: UNIVERSITY OF HOUSTON
Source: DAI/B 68-09, p. , Dec 2007
Source Type: Dissertation
Subjects: Computer science
Publication Number: 3279598
