Robust and efficient feature selection for high-dimensional datasets
by Mo, Dengyao, Ph.D., UNIVERSITY OF CINCINNATI, 2011, 131 pages; 3458192

Abstract:

Feature selection is an active research topic in the machine learning and knowledge discovery in databases (KDD) communities. It makes data mining models more comprehensible to domain experts, improves their prediction performance and robustness, and reduces model training time. This dissertation aims to provide solutions to three issues that many current feature selection researchers overlook: feature interaction, data imbalance, and multiple subsets of features.

Most extant filter feature selection methods are pairwise comparison methods: they test one predictor variable against the response variable at a time and provide a correlation measure for each feature. Such methods cannot take feature interactions into account.
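
The limitation can be illustrated with a small sketch. It is not taken from the dissertation: the XOR toy data, the function names abs_corr and subset_accuracy, and the majority-vote lookup table are all assumptions chosen for illustration. Each feature alone has zero correlation with the class, yet the two features together determine it completely.

```python
import numpy as np

# Hypothetical toy data: a balanced 2x2 design repeated 500 times.
x1 = np.tile([0, 0, 1, 1], 500)
x2 = np.tile([0, 1, 0, 1], 500)
y = x1 ^ x2  # the class depends only on the interaction of x1 and x2


def abs_corr(feature, target):
    """Absolute Pearson correlation, the kind of score a pairwise filter assigns."""
    return abs(np.corrcoef(feature, target)[0, 1])


def subset_accuracy(X, target):
    """Resubstitution accuracy of a majority-vote lookup table built over
    the joint values of the columns of X (an illustrative stand-in for
    evaluating a feature subset as a whole)."""
    _, cell = np.unique(X, axis=0, return_inverse=True)
    cell = np.ravel(cell)  # one cell index per row
    correct = 0
    for c in np.unique(cell):
        labels = target[cell == c]
        # Predict the majority class inside each cell.
        correct += np.bincount(labels, minlength=2).max()
    return correct / len(target)


X = np.column_stack([x1, x2])
pairwise_scores = (abs_corr(x1, y), abs_corr(x2, y))
acc_x1_alone = subset_accuracy(X[:, [0]], y)
acc_both = subset_accuracy(X, y)
print(pairwise_scores, acc_x1_alone, acc_both)
```

A pairwise filter would discard both features here, while a subset-level evaluation separates the classes perfectly.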

Data imbalance is another issue in feature selection. Without considering data imbalance, the features selected will be biased towards the majority class.

In high-dimensional datasets with sparse data samples, many different feature sets will be highly correlated with the output. Domain experts usually expect multiple candidate feature sets so that they can evaluate them against their domain knowledge.

This dissertation aims to solve these three issues using a criterion called minimum expected cost of misclassification (MECM). MECM is a model-independent evaluation measure. It evaluates the classification power of the tested feature subset as a whole, and it has adjustable weights to deal with imbalanced datasets. A number of case studies showed that MECM had some favorable properties for searching for a compact subset of interacting features. In addition, an algorithm and corresponding data structure were developed to produce multiple feature subsets.
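
The abstract does not give the MECM formula, so the following is only a sketch of the general idea of a cost-weighted misclassification measure, using the textbook empirical expected cost for two classes. The function name expected_cost, the parameters cost_fn and cost_fp, and the 5-versus-95 toy labels are assumptions, not the author's definitions. It shows how raising the weight on minority-class errors changes which predictor looks better on imbalanced data.

```python
import numpy as np


def expected_cost(y_true, y_pred, cost_fn=1.0, cost_fp=1.0):
    """Empirical expected cost of misclassification for two classes.

    cost_fn weights a missed minority case (true 1, predicted 0);
    cost_fp weights a false alarm (true 0, predicted 1).
    Both weights are illustrative assumptions.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    fn = np.sum((y_true == 1) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    return (cost_fn * fn + cost_fp * fp) / len(y_true)


# Hypothetical imbalanced labels: 5 minority cases, 95 majority cases.
y_true = np.array([1] * 5 + [0] * 95)

# Predictor A always outputs the majority class.
always_majority = np.zeros(100, dtype=int)
# Predictor B finds all 5 minority cases at the price of 10 false alarms.
catches_minority = np.array([1] * 5 + [1] * 10 + [0] * 85)

equal = (expected_cost(y_true, always_majority),
         expected_cost(y_true, catches_minority))
weighted = (expected_cost(y_true, always_majority, cost_fn=19.0),
            expected_cost(y_true, catches_minority, cost_fn=19.0))
print(equal, weighted)
```

With equal weights the always-majority predictor has the lower cost; with a missed minority case weighted 19 times as heavily (the class ratio in this toy example), the ranking reverses.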

The success of this research will have broad applications in engineering, business, and bioinformatics, such as credit card fraud detection, email filter setting for spam classification, and gene selection for disease diagnosis.

 
Adviser: Samuel H. Huang
School: UNIVERSITY OF CINCINNATI
Source: DAI/A 72-09, p. , Jul 2011
Source Type: Dissertation
Subjects: Statistics; Industrial engineering; Information science
Publication Number: 3458192
Access the complete dissertation:

This is an open access dissertation. The full text PDF of this graduate work is available at:
  http://gradworks.umi.com/3458192.pdf
