Similarity-based generalization in language
by Yarlett, Daniel G., Ph.D., STANFORD UNIVERSITY, 2008, 228 pages; 3332956

Abstract:

This thesis is about language acquisition considered from a statistical point of view. It takes as its starting point empirical evidence that people are sensitive to statistical properties of language, such as the frequency and probability of linguistic events, and asks how it is that people can learn this sort of information based on exposure to the language used around them. We review arguments against the idea that statistical information about language can be reliably learned from the environment and conclude that the central objection to this idea is the problem of data sparsity.

The problem of data sparsity, in its most acute form, is simply stated: how can you learn or know anything about the statistical association between two events if you have never seen them occur together? Data sparsity is particularly challenging in linguistic domains because of the highly productive nature of language: we are continually being exposed to novel combinations of words, and if we take a simplistic view of learning - for example, a view of associative learning in which we can only learn about the relationship between A and B if they have been directly observed to occur together in experience - then it appears impossible to learn about the statistical relationship between such events.
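
To make the sparsity problem concrete, the following minimal sketch estimates bigram probabilities by relative frequency on a tiny corpus. The corpus and the helper name `mle_bigram_prob` are hypothetical, chosen only for illustration; they are not taken from the thesis. Under this simple estimator, any word pair that never occurred together receives probability zero, however plausible it may be.

```python
from collections import Counter

# Toy corpus (hypothetical, for illustration only).
corpus = "the dog chased the cat . the cat chased the mouse .".split()

# Count each word in a position where it has a successor, and each adjacent pair.
prev_counts = Counter(corpus[:-1])
bigram_counts = Counter(zip(corpus, corpus[1:]))

def mle_bigram_prob(prev, nxt):
    """Relative-frequency estimate of P(nxt | prev)."""
    if prev_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, nxt)] / prev_counts[prev]

seen = mle_bigram_prob("the", "cat")      # pair observed in the corpus
unseen = mle_bigram_prob("dog", "mouse")  # pair never observed together
```

Here `unseen` is zero even though "dog" and "mouse" both occur in the corpus: direct co-occurrence counts alone carry no information about pairs that have not been observed.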

In this thesis we explore the degree to which similarity-based generalization can help overcome the problem of data sparsity. We first explore how distributional models can be used to derive representations for words based on the contexts in which they have been observed to occur. These models are attractive because they learn based on locally available information using simple associative mechanisms, and have been shown to capture various aspects of our knowledge about words. We explore a variety of distributional models to see how the large number of ways in which they can be parameterized affects the information they encode, and report an experiment examining the way in which people attend to the local linguistic context when learning about the meaning of words.
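
As a rough illustration of the general idea behind distributional models, the sketch below represents each word by counts of the words that occur within a fixed window around it, and compares words by cosine similarity. The window size, the use of raw counts, and the cosine measure are all assumptions made for this example; the thesis explores many parameterizations, and this is not a reconstruction of any one of them.

```python
import math
from collections import Counter, defaultdict

# Toy corpus (hypothetical, for illustration only).
corpus = "the dog chased the cat . the cat chased the mouse .".split()

def context_vectors(tokens, window=1):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                vectors[word][tokens[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

vectors = context_vectors(corpus)
sim_dog_cat = cosine(vectors["dog"], vectors["cat"])
sim_dog_chased = cosine(vectors["dog"], vectors["chased"])
```

On this toy corpus "dog" comes out more similar to "cat" than to "chased", because the two nouns share more of their local contexts. Only locally available co-occurrence information is used.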

Then, building on the work of Dagan, Lee and Pereira (1999), we propose a similarity-based bigram prediction model and show that this model is highly competitive with existing engineering methods in bigram prediction tasks (predicting the next word in a linguistic sample based on the previously observed word). The model we propose is particularly successful when data sparsity is most severe: when the bigram in question has either never been seen before, or has been seen only a handful of times. We interpret this as evidence that similarity-based generalization can be a useful strategy for combating data sparsity.
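
The following sketch shows one simple way a similarity-based bigram estimate can work, in the general spirit of similarity-weighted averaging: the probability of a next word after `prev` is estimated as a weighted average of the relative-frequency estimates for other words, weighted by how similar those words are to `prev`. It is not the model proposed in the thesis. The cosine weighting over successor counts, the toy corpus, and the function names are assumptions introduced for illustration.

```python
import math
from collections import Counter, defaultdict

# Toy corpus (hypothetical, for illustration only).
corpus = "the dog chased the cat . the cat chased the mouse .".split()

bigram_counts = Counter(zip(corpus, corpus[1:]))
prev_counts = Counter(corpus[:-1])

# Represent each word by the counts of the words that follow it.
successors = defaultdict(Counter)
for (prev, nxt), count in bigram_counts.items():
    successors[prev][nxt] = count

def mle(prev, nxt):
    """Relative-frequency estimate of P(nxt | prev)."""
    return bigram_counts[(prev, nxt)] / prev_counts[prev] if prev_counts[prev] else 0.0

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def similarity_based_prob(prev, nxt):
    """Average the estimates of words similar to `prev`, weighted by similarity."""
    weights = {
        other: cosine(successors[prev], successors[other])
        for other in prev_counts
        if other != prev
    }
    total = sum(weights.values())
    if total == 0:
        # No similar words found: fall back to the direct estimate.
        return mle(prev, nxt)
    return sum(w * mle(other, nxt) for other, w in weights.items()) / total

direct = mle("dog", ".")                      # "dog ." never occurs in the corpus
smoothed = similarity_based_prob("dog", ".")  # borrows evidence from similar words
```

In this corpus "dog" is never followed by ".", so the direct estimate is zero. Because "dog" and "cat" are both followed by "chased", the similarity-based estimate borrows from "cat", which is followed by "." half of the time, and assigns the unseen pair a nonzero probability.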

Finally, we consider the prospects for applying similarity-based generalization to higher-order language modeling (i.e., cases where the previous n > 1 words are used to predict the next word in a language sample). We show that distributional models can be generalized to derive representations of multi-word sequences, instead of just individual words, and explore how the resulting similarity metrics can be used as the basis for making predictions about language in these contexts. We conclude by considering the potential for extending these techniques in the future, and the consequences of these results for our understanding of language acquisition. While the problem of data sparsity remains far from solved, we argue that similarity-based generalization can be a helpful strategy for reducing the severity of this pervasive problem.

 
Advisor:
School: STANFORD UNIVERSITY
Source: DAI/B 69-10, p. , Dec 2008
Source Type: Dissertation
Subjects: Developmental psychology; Cognitive psychology
Publication Number: 3332956
  Use the link below to access a full citation record of this graduate work:
  http://gateway.proquest.com/openurl%3furl_ver=Z39.88-2004%26res_dat=xri:pqdiss%26rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation%26rft_dat=xri:pqdiss:3332956