Functional data analysis for environmental and biomedical problems
by Temiyasathit, Chivalai, Ph.D., THE UNIVERSITY OF TEXAS AT ARLINGTON, 2008, 109 pages; 3339222

Abstract:

Data mining methods are useful tools for efficiently extracting implicit patterns from datasets, whether large and complicated or small and scarce. Despite their great potential for complicated data, appropriate methods remain immature and insufficient. The major aim of this dissertation is to present several data mining methods, illustrated with real data, as tools for analyzing the complex behavior of functional data.

In the first part, this dissertation presents a data mining application to (1) identify an efficient way to characterize the spatial variations of PM2.5 concentrations based solely upon their temporal patterns, and (2) analyze the temporal and seasonal patterns of PM2.5 concentrations in spatially homogeneous regions. This study used 24-hour average PM2.5 concentrations measured every third day between 2001 and 2005 at 522 monitoring sites in the continental United States. A k-means clustering algorithm using the correlation distance was employed to investigate the similarity between the temporal profiles observed at the monitoring sites. The clustering analysis produced six clusters of sites with distinct temporal patterns, which identified and characterized spatially homogeneous regions of the United States. The study also presents a rotated principal component analysis (RPCA), which has been used for characterizing spatial patterns of air pollution, and discusses the differences between the clustering algorithm and RPCA.
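As a rough illustration of k-means with a correlation distance, the sketch below clusters synthetic time series rather than the PM2.5 data. It is not the dissertation's implementation: the function names, the farthest-first initialization, and the re-standardized centroids are all assumptions made for this example. Each profile is centered and scaled to unit norm, so the dot product of two rows equals their Pearson correlation r, and the distance is 1 - r.

```python
import numpy as np

def standardize_rows(X):
    """Center each row and scale it to unit norm, so row dot products are Pearson correlations."""
    Xc = X - X.mean(axis=1, keepdims=True)
    return Xc / np.linalg.norm(Xc, axis=1, keepdims=True)

def correlation_kmeans(X, k, n_iter=100):
    """k-means on temporal profiles (rows of X) using the distance 1 - r.

    Hypothetical helper: farthest-first initialization is an assumption,
    chosen only to keep the example deterministic.
    """
    Z = standardize_rows(np.asarray(X, dtype=float))
    idx = [0]
    while len(idx) < k:
        # Pick the profile least correlated with the centroids chosen so far.
        max_corr = (Z @ Z[idx].T).max(axis=1)
        idx.append(int(np.argmin(max_corr)))
    centroids = Z[idx]
    labels = np.full(len(Z), -1)
    for _ in range(n_iter):
        corr = Z @ centroids.T                 # correlation of each profile with each centroid
        new_labels = np.argmax(corr, axis=1)   # largest r is the smallest 1 - r
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        for j in range(k):
            members = Z[labels == j]
            if len(members):
                # Average the members, then re-standardize the centroid.
                centroids[j] = standardize_rows(members.mean(axis=0, keepdims=True))[0]
    return labels, centroids

# Synthetic stand-in data: two groups of noisy profiles with different shapes.
rng = np.random.default_rng(0)
t = np.linspace(0, 4 * np.pi, 120)
group_a = np.sin(t) + 0.1 * rng.standard_normal((5, 120))
group_b = np.cos(t) + 0.1 * rng.standard_normal((5, 120))
X = np.vstack([group_a, group_b])

labels, centroids = correlation_kmeans(X, k=2)
print(labels)
```

Because the distance depends only on the shape of each profile, two sites with different mean concentration levels but the same seasonal pattern fall into the same cluster.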

A data mining application for investigating the behavior of ozone concentrations is presented in the following chapter. Ozone is known to affect human health. Ozone data are generally collected over long periods at locations of interest. However, constructing ozone monitoring sites may not be possible or cost-effective because of limitations such as hazardous environments or inaccessible areas. The objectives of this study are (1) to interpolate ozone concentrations as a functional response at an unsampled location, and (2) to reduce model complexity by constructing a data compression and reduction model that retains as much accuracy as possible. This study used daily maximum 8-hour ozone concentrations between 2003 and 2006 at 14 monitoring sites in the Dallas-Fort Worth area. Wavelet decomposition broke the data down into multiple scales. Regression analysis was used as a data compression method, and kriging was applied for spatial interpolation. In addition, a model refining step helped tune the estimates for ozone concentrations with different variability. The model achieved a mean absolute error (MAE) as low as 6.99 ppb and a mean absolute error for high-ozone days (MAE75) as low as 9.76 ppb.
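The two error measures can be sketched as follows. The abstract does not define MAE75 precisely; this example assumes it is the MAE restricted to days whose observed concentration is at least 75 ppb, and that reading is an assumption of this sketch, not a statement from the dissertation. The function names and the sample values are likewise hypothetical.

```python
import numpy as np

def mae(observed, predicted):
    """Mean absolute error between observed and predicted concentrations (ppb)."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(observed - predicted)))

def mae_high_days(observed, predicted, threshold_ppb=75.0):
    """MAE over high-ozone days only.

    Assumption: a "high ozone day" is one whose observed value is at or above
    threshold_ppb. Returns NaN when no day meets the threshold.
    """
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    mask = observed >= threshold_ppb
    if not mask.any():
        return float("nan")
    return mae(observed[mask], predicted[mask])

# Made-up values for illustration only.
obs = np.array([40.0, 62.0, 78.0, 90.0])
pred = np.array([45.0, 60.0, 70.0, 80.0])

overall = mae(obs, pred)           # (5 + 2 + 8 + 10) / 4 = 6.25
high = mae_high_days(obs, pred)    # (8 + 10) / 2 = 9.0
print(overall, high)
```

Reporting the high-day error separately matters because a model can have a small overall MAE while still underestimating the peaks.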

Finally, an efficient strategy for classifying prostate cancer from near-infrared spectra is illustrated. Prostate cancer is the most common male cancer and the second leading cause of cancer death in the United States. The main purpose of this study is to develop an efficient tool that classifies near-infrared (NIR) spectroscopic data taken from ex vivo human prostate glands as normal or cancerous. Our proposed procedure consists of several steps. First, to ensure comparability between spectra, each spectrum was normalized by dividing each spectral point by the area of the total intensity of the spectrum. Second, clustering analysis was performed on the normalized spectra to separate the spectra that represent the normal pattern from a mixed group that contains both normal and tumor spectra. Third, we conducted a two-stage classification: the first stage constructs a classification model with the labels obtained from the preceding clustering analysis, and the second stage focuses on the mixed group identified by the first classification model. To increase accuracy, the second classification model was constructed from selected features that capture important characteristics of the spectral data. Our proposed procedure was evaluated by its classification ability on testing samples using leave-one-out cross-validation, yielding an accuracy of 90%. (Abstract shortened by UMI.)
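Two of the steps above, area normalization and leave-one-out cross-validation, can be sketched on synthetic spectra. The abstract does not name the classifier, so a nearest-centroid rule stands in here purely as an assumption. The function names, wavelength grid, and peak shapes are also made up for the example.

```python
import numpy as np

def normalize_by_area(spectra, wavelengths):
    """Divide each spectrum (row) by its total area, computed with the trapezoid rule."""
    spectra = np.asarray(spectra, dtype=float)
    widths = np.diff(np.asarray(wavelengths, dtype=float))
    areas = np.sum(0.5 * (spectra[:, 1:] + spectra[:, :-1]) * widths, axis=1)
    return spectra / areas[:, None]

def loocv_accuracy(X, y):
    """Leave-one-out cross-validation accuracy of a nearest-centroid classifier.

    Assumption: nearest-centroid is a placeholder for whatever classifier is used.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    correct = 0
    for i in range(len(X)):
        train = np.arange(len(X)) != i          # hold out sample i
        classes = np.unique(y[train])
        centroids = np.array([X[train][y[train] == c].mean(axis=0) for c in classes])
        dists = np.linalg.norm(centroids - X[i], axis=1)
        correct += int(classes[np.argmin(dists)] == y[i])
    return correct / len(X)

# Synthetic spectra: two peak positions, each recorded at three intensity scales.
wavelengths = np.linspace(700.0, 900.0, 50)
peak_normal = np.exp(-0.5 * ((wavelengths - 760.0) / 20.0) ** 2)
peak_tumor = np.exp(-0.5 * ((wavelengths - 840.0) / 20.0) ** 2)
scales = [1.0, 2.0, 3.0]
spectra = np.array([s * peak_normal for s in scales] + [s * peak_tumor for s in scales])
y = np.array([0, 0, 0, 1, 1, 1])

normalized = normalize_by_area(spectra, wavelengths)
accuracy = loocv_accuracy(normalized, y)
print(accuracy)
```

After normalization, spectra that differ only in overall intensity become nearly identical, which is the comparability the first step is meant to ensure.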

 
Adviser: Seoung Bum Kim
School: THE UNIVERSITY OF TEXAS AT ARLINGTON
Source: DAI/B 69-12, p. , Feb 2009
Source Type: Dissertation
Subjects: Statistics; Operations research; Computer science
Publication Number: 3339222
 

  Use the link below to access a full citation record of this graduate work:
  http://gateway.proquest.com/openurl%3furl_ver=Z39.88-2004%26res_dat=xri:pqdiss%26rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation%26rft_dat=xri:pqdiss:3339222
