A non-parametric model for the discovery of inflectional paradigms from plain text using graphical models over strings
by Dreyer, Markus, Ph.D., THE JOHNS HOPKINS UNIVERSITY, 2011, 279 pages; 3463374

Abstract:

The field of statistical natural language processing has been turning toward morphologically rich languages. These languages have vocabularies that are often orders of magnitude larger than that of English, since words may be inflected in many different ways. This leads to problems with data sparseness and calls for models that can deal with this abundance of related words: models that can learn, analyze, reduce and generate morphological inflections. Surprisingly, statistical approaches to morphology are still rare, in contrast to the many recent advances of sophisticated models in parsing, grammar induction, translation and many other areas of natural language processing.

This thesis presents a novel, unified statistical approach to inflectional morphology, an approach that can decode and encode the inflectional system of a language. At the center of this approach stands the notion of inflectional paradigms. These paradigms cluster the large vocabulary of a language into structured chunks; inflections of the same word, like break, broke, breaks, breaking, …, all belong in the same paradigm. Moreover, each of these inflections has an exact place within a paradigm, since each paradigm has a designated slot for each possible inflection; for verbs, there is a slot for the first person singular indicative present, one for the third person plural subjunctive past, and slots for all other possible forms. The main goal of this thesis is to build probability models over inflectional paradigms, and thereby to sort the large vocabulary of a morphologically rich language into structured clusters. These models can be learned with minimal supervision for any language that has inflectional morphology. As training data, some sample paradigms and a raw, unannotated text corpus can be used.
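The idea of a paradigm with designated slots can be illustrated with a small data structure. The following is only a sketch: the slot inventory `VERB_SLOTS`, the class `Paradigm` and the helper `index_by_form` are hypothetical names chosen for this example, and the four English slots are a simplification, not the slot inventory used in the thesis.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# A slot is identified by a tuple of morphological features (hypothetical labels).
Slot = Tuple[str, ...]

# Simplified English verb slots, assumed for illustration only.
VERB_SLOTS: List[Slot] = [
    ("present", "3sg"),
    ("present", "other"),
    ("past",),
    ("present-participle",),
]


@dataclass
class Paradigm:
    """One lexeme with exactly one place for each possible inflection."""
    lexeme: str
    forms: Dict[Slot, Optional[str]] = field(
        default_factory=lambda: {slot: None for slot in VERB_SLOTS}
    )

    def fill(self, slot: Slot, form: str) -> None:
        # Reject slots that the paradigm does not define.
        if slot not in self.forms:
            raise KeyError(f"unknown slot: {slot}")
        self.forms[slot] = form

    def missing(self) -> List[Slot]:
        # Slots whose surface form has not been observed yet.
        return [slot for slot, form in self.forms.items() if form is None]


def index_by_form(paradigms: List[Paradigm]) -> Dict[str, List[Tuple[str, Slot]]]:
    """Map each surface form to the (lexeme, slot) pairs it can realize."""
    index: Dict[str, List[Tuple[str, Slot]]] = {}
    for paradigm in paradigms:
        for slot, form in paradigm.forms.items():
            if form is not None:
                index.setdefault(form, []).append((paradigm.lexeme, slot))
    return index


brk = Paradigm("BREAK")
brk.fill(("present", "other"), "break")
brk.fill(("present", "3sg"), "breaks")
brk.fill(("past",), "broke")
print(brk.missing())           # the participle slot is still unfilled
print(index_by_form([brk]))
```

A probability model over paradigms would assign scores to candidate fillers for the unfilled slots; this sketch only shows the bookkeeping that such a model sits on top of.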

The models over morphological paradigms are developed in three main chapters that start with smaller components and build up to larger ones.

The first of these chapters (Chapter 2) presents novel probability models over strings and string pairs. These models apply to lemmatization, to relating a past tense form to its associated present tense form, and to similar morphological tasks. They turn out to be general enough to handle the popular task of transliteration very well, as well as other string-to-string tasks.
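To make the notion of a conditional model over string pairs concrete, here is a deliberately simple stand-in: a model that learns suffix-rewrite rules from (inflected form, lemma) pairs and scores candidate outputs by relative rule frequency. This is not the model developed in Chapter 2; the names `suffix_rule` and `SuffixRuleModel` are hypothetical, and the toy training pairs are invented for the example.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def suffix_rule(src: str, tgt: str) -> Tuple[str, str]:
    """Strip the longest common prefix and return (source suffix, target suffix)."""
    i = 0
    while i < min(len(src), len(tgt)) and src[i] == tgt[i]:
        i += 1
    return src[i:], tgt[i:]


class SuffixRuleModel:
    """Toy conditional distribution p(target | source) over suffix rewrites."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def train(self, pairs: Iterable[Tuple[str, str]]) -> None:
        for src, tgt in pairs:
            self.counts[suffix_rule(src, tgt)] += 1

    def distribution(self, src: str) -> Dict[str, float]:
        # Apply every learned rule whose source suffix matches, then normalize.
        scores: Dict[str, float] = {}
        for (old, new), count in self.counts.items():
            if src.endswith(old):
                candidate = src[: len(src) - len(old)] + new
                scores[candidate] = scores.get(candidate, 0.0) + count
        total = sum(scores.values())
        return {cand: s / total for cand, s in scores.items()} if total else {}


model = SuffixRuleModel()
model.train([("walked", "walk"), ("talked", "talk"), ("liked", "like")])
dist = model.distribution("jumped")
print(max(dist, key=dist.get), dist)
```

The sketch only rewrites suffixes, so it cannot capture stem changes such as broke and break.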

The second (Chapter 3) introduces the notion of a probability model over multiple strings, which is a novel variant of Markov Random Fields. These are used to relate the many inflections in an inflectional paradigm to one another, and they use the probability models from Chapter 2 as components. A novel version of belief propagation is presented, which propagates distributions over strings through a network of connected finite-state transducers, to perform inference in morphological paradigms (or other string fields).
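A very small example may help picture a Markov Random Field whose variables are strings. The sketch below is not the belief propagation algorithm of Chapter 3: it restricts each slot to a short, hand-picked candidate list and computes exact marginals by brute-force enumeration, which is only feasible for tiny fields. The compatibility function `pair_factor` (based on shared prefix length) is an assumption made for this example, standing in for the string-pair models of Chapter 2.

```python
import math
from itertools import product
from typing import Dict, List, Tuple


def common_prefix_len(a: str, b: str) -> int:
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n


def pair_factor(a: str, b: str) -> float:
    # Hypothetical compatibility: forms sharing a longer prefix fit better.
    return math.exp(common_prefix_len(a, b))


def marginals(
    domains: Dict[str, List[str]], edges: List[Tuple[str, str]]
) -> Dict[str, Dict[str, float]]:
    """Exact marginals of a pairwise field over strings, by enumeration."""
    names = list(domains)
    totals = {n: {v: 0.0 for v in domains[n]} for n in names}
    z = 0.0
    for values in product(*(domains[n] for n in names)):
        assignment = dict(zip(names, values))
        weight = 1.0
        for u, v in edges:
            weight *= pair_factor(assignment[u], assignment[v])
        z += weight
        for n in names:
            totals[n][assignment[n]] += weight
    return {n: {v: t / z for v, t in totals[n].items()} for n in names}


domains = {
    "present": ["break"],                 # observed form
    "past": ["broke", "walked"],          # candidate fillers
    "participle": ["broken", "talking"],  # candidate fillers
}
edges = [("present", "past"), ("past", "participle")]
for slot, dist in marginals(domains, edges).items():
    print(slot, dist)
```

Enumeration grows exponentially with the number of slots and cannot handle unbounded sets of strings, which is the gap that message passing over finite-state machines is meant to close.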

Finally (Chapter 4), a non-parametric joint probability model over an unannotated text corpus and the morphological paradigms from Chapter 3 is presented. This model is based on a generative story for inflectional morphology that naturally incorporates common linguistic notions, such as lexemes, paradigms and inflections. Sampling algorithms are presented that perform inference over large text corpora and their implicit, hidden morphological paradigms. We show that they are able to discover the morphological paradigms that are implicit in the corpora. The model is based on finite-state operations and seamlessly handles concatenative and nonconcatenative morphology.
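The abstract does not spell out which non-parametric prior the generative story uses. As an illustration of what "non-parametric" means in this setting, the sketch below samples cluster assignments from a Chinese restaurant process, one standard prior in which the number of clusters (here, one could think of lexemes) is not fixed in advance but grows with the data. The function name `crp_assignments` and the concentration value are assumptions for the example, not details taken from the thesis.

```python
import random
from typing import List, Tuple


def crp_assignments(n: int, alpha: float, seed: int = 0) -> Tuple[List[int], List[int]]:
    """Sample cluster labels for n items from a Chinese restaurant process.

    Item i joins an existing cluster with probability proportional to its
    current size, or opens a new cluster with probability proportional to alpha.
    """
    rng = random.Random(seed)
    sizes: List[int] = []        # sizes[k] = number of items in cluster k
    assignments: List[int] = []  # assignments[i] = cluster label of item i
    for _ in range(n):
        weights = sizes + [alpha]  # last entry stands for "open a new cluster"
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(sizes):
            sizes.append(1)
        else:
            sizes[k] += 1
        assignments.append(k)
    return assignments, sizes


labels, sizes = crp_assignments(n=20, alpha=1.0, seed=1)
print(labels)
print(sizes)
```

In a full model, each cluster would additionally carry a paradigm, and a sampler would reassign corpus tokens to clusters and slots given the observed spellings; that part is beyond this sketch.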

 
Adviser: Jason Eisner
School: THE JOHNS HOPKINS UNIVERSITY
Source: DAI/B 72-10, p. , Aug 2011
Source Type: Dissertation
Subjects: Linguistics; Computer science
Publication Number: 3463374
  Use the link below to access a full citation record of this graduate work:
  http://gateway.proquest.com/openurl%3furl_ver=Z39.88-2004%26res_dat=xri:pqdiss%26rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation%26rft_dat=xri:pqdiss:3463374