Parameterizing Phrase Based Statistical Machine Translation Models: An Analytic Study
by Cer, Daniel, Ph.D., UNIVERSITY OF COLORADO AT BOULDER, 2011, 130 pages; 3453692

Abstract:

The goal of this dissertation is to determine the best way to train a statistical machine translation system. I first develop a state-of-the-art machine translation system called Phrasal and then use it to examine a wide variety of potential learning algorithms and optimization criteria and arrive at two very surprising results. First, despite the strong intuitive appeal of more recent evaluation metrics, training to these metrics is no better than the older traditional approach of training to BLEU. Second, the most widely used learning algorithm for training machine translation systems, called minimum error rate training (MERT), works no better than standard machine learning algorithms such as log-linear models. This result demonstrates that machine translation does not require using a special purpose learning algorithm, but rather can be approached in a manner similar to other natural language processing and machine learning tasks. These results have a number of important implications. Contrary to existing beliefs, work on improving machine translation evaluation metrics and then training to the improved metrics will not in itself result in improved translation systems. Even more significantly, the widespread usage of MERT has limited the sort of models that can be used for machine translation, as it does not scale well to large numbers of features. If it is not necessary to use MERT to train competitive systems, machine translation can be treated similarly to any other natural language processing task with models that include arbitrarily large feature sets.

 
AdvisersJames H. Martin; Daniel Jurafsky
SchoolUNIVERSITY OF COLORADO AT BOULDER
SourceDAI/B 72-07, p. , Jul 2011
Source TypeDissertation
SubjectsLinguistics; Computer science
Publication Number3453692
Adobe PDF Access the complete dissertation:
 

» Find an electronic copy at your library.
  Use the link below to access a full citation record of this graduate work:
  http://gateway.proquest.com/openurl%3furl_ver=Z39.88-2004%26res_dat=xri:pqdiss%26rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation%26rft_dat=xri:pqdiss:3453692
  If your library subscribes to the ProQuest Dissertations & Theses (PQDT) database, you may be entitled to a free electronic version of this graduate work. If not, you will have the option to purchase one, and access a 24 page preview for free (if available).

About ProQuest Dissertations & Theses
With over 2.3 million records, the ProQuest Dissertations & Theses (PQDT) database is the most comprehensive collection of dissertations and theses in the world. It is the database of record for graduate research.

The database includes citations of graduate works ranging from the first U.S. dissertation, accepted in 1861, to those accepted as recently as last semester. Of the 2.3 million graduate works included in the database, ProQuest offers more than 1.9 million in full text formats. Of those, over 860,000 are available in PDF format. More than 60,000 dissertations and theses are added to the database each year.

If you have questions, please feel free to visit the ProQuest Web site - http://www.proquest.com - or call ProQuest Hotline Customer Support at 1-800-521-3042.