AlphaRank: A new smoothing algorithm based on combination of link analysis techniques and frequency based methods
by Mukhtar, Omar, M.S., STATE UNIVERSITY OF NEW YORK AT BUFFALO, 2009, 80 pages; 1469108

Abstract:

Smoothing a probability distribution so that it generalizes well is a hard machine learning problem. It is particularly challenging when building a statistical language model from insufficient training data. We have developed a new smoothing algorithm, called AlphaRank, that addresses the data sparseness problem by viewing language as a large graph in which each word is a vertex and the probability of moving from one word to another is determined by the weight of the edge connecting the two vertices. Thus, instead of the frequency-based rules used in prior work, we propose a graph-based method for smoothing a statistical language model. Our method combines features of context-dependent probability estimators, such as n-grams, with features of context-independent probability estimators, such as the steady-state distribution of a discrete-time Markov chain. We tested the method on a large collection of Arabic newswire articles, compared it with previous approaches using the perplexity measure, and found our method to be superior.
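The general idea in the abstract can be illustrated with a minimal sketch. This is not the thesis's algorithm, whose details the abstract does not give. The sketch assumes a bigram word graph, a PageRank-style teleportation term (damping) so that the Markov chain has a well-defined steady state, and a simple linear interpolation with weight alpha. All function names and parameter values are hypothetical.

```python
from collections import Counter, defaultdict


def bigram_counts(tokens):
    """Build the word graph: counts[prev][nxt] is the edge weight prev -> nxt."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def stationary_distribution(counts, vocab, damping=0.85, iters=100):
    """Approximate the steady-state distribution by power iteration.

    Assumption: uniform teleportation (as in PageRank) is added so the
    chain is irreducible; words with no outgoing edges spread their mass
    uniformly over the vocabulary.
    """
    n = len(vocab)
    rank = {w: 1.0 / n for w in vocab}
    for _ in range(iters):
        new = {w: (1.0 - damping) / n for w in vocab}
        for w in vocab:
            out = counts.get(w)
            total = sum(out.values()) if out else 0
            if total == 0:
                share = damping * rank[w] / n
                for v in vocab:
                    new[v] += share
            else:
                for v, c in out.items():
                    new[v] += damping * rank[w] * c / total
        rank = new
    return rank


def smoothed_prob(word, prev, counts, rank, alpha=0.7):
    """Interpolate the context-dependent bigram estimate with the
    context-independent steady-state probability of `word`."""
    out = counts.get(prev)
    total = sum(out.values()) if out else 0
    if total == 0:
        # No evidence for this context: fall back to the steady state.
        return rank[word]
    return alpha * out[word] / total + (1.0 - alpha) * rank[word]


# Toy usage on a tiny corpus (illustrative data only).
tokens = "the cat sat on the mat the cat ran".split()
vocab = sorted(set(tokens))
counts = bigram_counts(tokens)
rank = stationary_distribution(counts, vocab)
print(smoothed_prob("mat", "cat", counts, rank))  # unseen bigram, still nonzero
```

Because the steady-state term is positive for every word in the vocabulary, a bigram never seen in training still receives nonzero probability, which is the property a smoothing method needs in order to keep perplexity finite.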

 
Adviser: Venu Govindaraju
School: STATE UNIVERSITY OF NEW YORK AT BUFFALO
Source: MAI/ 48-01, p. , Oct 2009
Source Type: Thesis
Subjects: Linguistics; Artificial intelligence; Computer science
Publication Number: 1469108
  Use the link below to access a full citation record of this graduate work:
  http://gateway.proquest.com/openurl%3furl_ver=Z39.88-2004%26res_dat=xri:pqdiss%26rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation%26rft_dat=xri:pqdiss:1469108