Ensemble methods in Large Vocabulary Continuous Speech Recognition
by Chen, Xin, M.S., UNIVERSITY OF MISSOURI - COLUMBIA, 2008, 69 pages; 1471989

Abstract:

Combining a group of classifiers to improve overall classification performance is a young and promising direction in Large Vocabulary Continuous Speech Recognition (LVCSR). Previous work on acoustic modeling of speech signals, such as Random Forests (RFs) of Phonetic Decision Trees (PDTs), has produced significant improvements in word recognition accuracy. In this thesis, several new ensemble approaches are proposed for LVCSR, and experimental evaluations show absolute accuracy gains of up to 2.3% over conventional PDT-based acoustic models on our telehealth conversational speech recognition task.

Most ASR systems, including the recent RF-based PDTs, use implicit PDT-based state tying. The author considers that explicit PDT (EPDT) tying, which allows Phoneme data Sharing (PS), may be superior in capturing pronunciation variations. The author adopted the idea of combining multiple acoustic models and applied it to the EPDT models. A combination of EPDT and implicit PDT models was investigated to reduce the phone confusions that the EPDT model may introduce. A 1.3% absolute gain in word accuracy is observed in this experiment on the telehealth task.

Data sampling is one of the primary ways to generate different classifiers for an ensemble classifier. In this thesis, Cross Validation (CV) based data sampling is proposed, and random sampling without replacement is used as a reference for comparison. The different datasets produced by data sampling yield different PDTs and therefore different Gaussian mixture models, and the diversity of these models helps improve recognition accuracy. When 10-fold CV is used, a 2.3% absolute gain in word recognition accuracy is obtained. Several experimental parameter settings and combining methods have been investigated, and the findings are discussed in this thesis.
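The two sampling schemes named above can be illustrated with a minimal sketch. The abstract does not say how the CV folds are formed, so this sketch assumes the common construction in which the data is split into K folds and each of the K training subsets leaves out exactly one fold. The function names `cv_subsets` and `random_subsets`, the `fraction` parameter, and the utterance IDs are hypothetical and are not taken from the thesis.

```python
import random


def cv_subsets(utterances, k, seed=0):
    """Assumed CV-style sampling: subset i is all data except fold i."""
    items = list(utterances)
    random.Random(seed).shuffle(items)
    # Deal the shuffled items into k folds of near-equal size.
    folds = [items[i::k] for i in range(k)]
    return [
        [u for j, fold in enumerate(folds) if j != i for u in fold]
        for i in range(k)
    ]


def random_subsets(utterances, k, fraction, seed=0):
    """Reference scheme: k subsets, each drawn without replacement."""
    items = list(utterances)
    rng = random.Random(seed)
    size = round(len(items) * fraction)
    # random.sample draws distinct elements, i.e. without replacement.
    return [rng.sample(items, size) for _ in range(k)]


if __name__ == "__main__":
    data = [f"utt{i:03d}" for i in range(100)]  # hypothetical utterance IDs
    cv = cv_subsets(data, k=10)
    rnd = random_subsets(data, k=10, fraction=0.9)
    print(len(cv), len(cv[0]))
    print(len(rnd), len(rnd[0]))
```

Under this assumption, every utterance appears in exactly K-1 of the CV subsets, so the subsets overlap in a controlled way. With independent random sampling, by contrast, how often a given utterance is used varies by chance. Each subset would then be used to train its own PDTs and Gaussian mixture models.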

The word accuracy improvement achieved in this thesis work is significant, and the techniques have been integrated into the telemedicine automatic captioning system developed by the SLIPL group of the University of Missouri – Columbia.

 
Adviser: Yunxin Zhao
School: UNIVERSITY OF MISSOURI - COLUMBIA
Source: MAI/ 48-02, p. , Dec 2009
Source Type: Thesis
Subjects: Artificial intelligence; Computer science
Publication Number: 1471989

  Use the link below to access a full citation record of this graduate work:
  http://gateway.proquest.com/openurl%3furl_ver=Z39.88-2004%26res_dat=xri:pqdiss%26rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation%26rft_dat=xri:pqdiss:1471989