Processing highly variant language using incremental model selection
by Rodrigues, Paul, Ph.D., INDIANA UNIVERSITY, 2012, 211 pages; 3498944

Abstract:

This dissertation demonstrates a framework for incremental model selection and processing of highly variant speech transcripts and user-generated text. The system reduces natural language processing (NLP) ambiguity by segmenting text by domain, allowing for domain-specific downstream processes to analyze each segment independently.

A tokenized text input stream is received by the system. At every word, an Indicator Function calculates a quantitative feature signal, which we call an Indicator Value Signal, that runs in parallel to the input stream. This feature signal is monitored for domain changes by an event controller, which segments the stream into feature chunks. The event controller can activate slowly over large spans of text, or rapidly and intrasententially. As the event controller indicates each domain change with an event signal, pipeline processes assigned to specific indicator function values are executed to process the segment and add further feature signals to the feature signal stack. At the end of the pipeline, feature signals are unified to produce a single annotated output stream.
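The flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the dissertation's implementation: the names `Segment`, `segment_stream`, `run_pipelines`, and `toy_indicator` are hypothetical, the indicator is a toy word-list lookup, and the event controller here fires on every change of indicator value rather than monitoring the signal over longer spans.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Segment:
    """A run of consecutive tokens that share one indicator value."""
    value: str
    tokens: List[str]


def segment_stream(tokens: List[str],
                   indicator: Callable[[str], str]) -> List[Segment]:
    """Compute the indicator value per token and cut on every change.

    Assumption: the simplest possible event controller, which emits an
    event whenever the indicator value differs from the previous token's.
    """
    segments: List[Segment] = []
    for token in tokens:
        value = indicator(token)
        if segments and segments[-1].value == value:
            segments[-1].tokens.append(token)
        else:
            segments.append(Segment(value, [token]))
    return segments


def run_pipelines(segments: List[Segment],
                  pipelines: Dict[str, Callable[[List[str]], List[Tuple[str, str]]]]
                  ) -> List[Tuple[str, str]]:
    """Send each segment to the pipeline assigned to its indicator value
    and concatenate the results into one annotated output stream."""
    annotated: List[Tuple[str, str]] = []
    for seg in segments:
        annotated.extend(pipelines[seg.value](seg.tokens))
    return annotated


# Hypothetical toy indicator: a tiny Spanish word list, everything else "en".
SPANISH_WORDS = {"pero", "no", "quiero", "ir"}


def toy_indicator(token: str) -> str:
    return "es" if token.lower() in SPANISH_WORDS else "en"


tokens = "I said pero no quiero ir to the store".split()
segments = segment_stream(tokens, toy_indicator)

# Placeholder pipelines that only label each token with the segment's domain.
pipelines = {
    "en": lambda toks: [(t, "en") for t in toks],
    "es": lambda toks: [(t, "es") for t in toks],
}
annotated = run_pipelines(segments, pipelines)
```

In a fuller version, each pipeline would add its own feature signals (for example, tags from a language-specific tagger) instead of a bare domain label.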

To exemplify the framework, this dissertation makes three additional contributions. The first is a novel short-string language identification system that calculates our Indicator Value Signal. The second is a machine transliteration system to convert the Arabizi chat alphabet into Arabic script. The third is a modular part-of-speech tagger for multilingual code-mixing.

The short-string language identification system extracts an n-gram, and selects the closest language out of 373 reference languages by using a Support Vector Machine (SVM) classifier trained on a matrix of language model measurements. This classifier learns patterns of similarity and divergence of a language's tokens across all reference languages, leading to high accuracy on in-domain n-grams from a legal corpus as well as out-of-domain tokens from an English-Egyptian Arabic code-mixing microblog corpus.
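As a rough illustration of the "matrix of language model measurements" idea, the sketch below scores a short string against one character-bigram model per reference language and returns the scores as a feature vector. Everything here is an assumption made for illustration: the abstract does not say which language models or measurements are used, the two toy training texts stand in for the 373 reference languages, and the helper names are hypothetical. The SVM step is left out; in the setup described above, vectors like this one would form the rows of the classifier's training matrix.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

Model = Tuple[Counter, Counter, int]


def train_char_bigram_model(text: str) -> Model:
    """Count character bigrams in space-padded, lowercased text."""
    padded = f" {text.lower()} "
    bigrams = Counter(zip(padded, padded[1:]))
    contexts = Counter(padded[:-1])   # how often each character starts a bigram
    vocab_size = len(set(padded))     # character inventory of the training text
    return bigrams, contexts, vocab_size


def avg_log_score(model: Model, s: str) -> float:
    """Average add-one-smoothed log score of the string's bigrams.

    This is a rough similarity score, not a normalized probability for
    characters the model has never seen.
    """
    bigrams, contexts, vocab_size = model
    padded = f" {s.lower()} "
    pairs = list(zip(padded, padded[1:]))
    total = sum(
        math.log((bigrams[p] + 1) / (contexts[p[0]] + vocab_size))
        for p in pairs
    )
    return total / len(pairs)


def feature_vector(models: Dict[str, Model], s: str) -> List[float]:
    """One score per reference language, in sorted language order."""
    return [avg_log_score(models[lang], s) for lang in sorted(models)]


# Toy reference "corpora" (illustrative only).
models = {
    "en": train_char_bigram_model(
        "the quick brown fox jumps over the lazy dog and then the cat"),
    "es": train_char_bigram_model(
        "el rapido zorro marron salta sobre el perro perezoso y luego el gato"),
}

vec = feature_vector(models, "the")   # order: ["en", "es"]
```

The point of feeding the whole vector to a classifier, rather than taking the single best-scoring language, is that the pattern of scores across all reference languages carries information too.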

The machine transliteration system converts Arabizi, a Latinized Arabic chat alphabet, into Arabic script in order to utilize existing NLP tools on Arabic chat text. A parallel, word-aligned corpus of the chat alphabet was collected from a dozen Arabic speakers. From the corpus we induced a probabilistic mapping of cross-dialect Arabizi characters to Arabic script and used this to train a highly accurate transducer.
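A probabilistic character mapping of the kind mentioned above can be estimated by counting aligned character pairs and normalizing. The sketch below assumes the character-level alignments are already available; the abstract only states that the corpus is word-aligned, so that step is not shown. The function names and the toy pairs are hypothetical, and the greedy decoder is a stand-in for illustration, not the transducer trained in the dissertation.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def induce_mapping(aligned_pairs: Iterable[Tuple[str, str]]
                   ) -> Dict[str, Dict[str, float]]:
    """Estimate P(arabic | arabizi) by relative frequency."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for src, tgt in aligned_pairs:
        counts[src][tgt] += 1
    return {
        src: {tgt: n / sum(c.values()) for tgt, n in c.items()}
        for src, c in counts.items()
    }


def greedy_transliterate(word: str,
                         mapping: Dict[str, Dict[str, float]]) -> str:
    """Replace each character with its most probable target.

    Characters missing from the mapping are passed through unchanged.
    """
    out = []
    for ch in word:
        options = mapping.get(ch)
        out.append(max(options, key=options.get) if options else ch)
    return "".join(out)


# Toy character alignments (illustrative only). An empty target models a
# Latin vowel that is left unwritten in Arabic script.
toy_pairs = [
    ("3", "ع"), ("3", "ع"),
    ("7", "ح"),
    ("a", "ا"), ("a", "ا"), ("a", "ا"), ("a", ""),
]

mapping = induce_mapping(toy_pairs)
```

A real system would need context (neighboring characters, word position) to choose among competing targets, which is where a trained transducer improves on per-character lookup.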

The multilingual part-of-speech tagger demonstrates the modularity of our framework. We find that segmenting language before tagging, and then applying single-language homogeneous language models, is competitive with multilingual heterogeneous tagging models. We compare the two approaches on a speech transcript of English-Spanish code-mixing.

In addition to language identification, we consider a range of alternative indicator functions, such as genre identification, entropy, and gender identification, which could add a language adaptation ability on top of existing NLP systems and provide a boost in accuracy and performance on variational processing.
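To make the idea of an alternative indicator function concrete, here is a minimal sketch of an entropy-based signal: the Shannon entropy of the characters in a sliding window of tokens, emitted once per token so that it runs in parallel to the input stream. The window size, the choice of character-level entropy, and the function names are assumptions for illustration; the abstract does not specify how entropy would be computed.

```python
import math
from collections import Counter
from typing import List


def char_entropy(text: str) -> float:
    """Shannon entropy (in bits) of the character distribution of text."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def entropy_signal(tokens: List[str], window: int = 3) -> List[float]:
    """One entropy value per token, computed over the last `window` tokens."""
    signal = []
    for i in range(len(tokens)):
        start = max(0, i - window + 1)
        signal.append(char_entropy("".join(tokens[start:i + 1])))
    return signal


signal = entropy_signal("aaaa aaaa ab abcd".split(), window=1)
```

An event controller could then watch this signal for sustained shifts, in the same way it would watch a language identification signal.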

To summarize, this dissertation provides an architecture for NLP that allows for better handling of complicated language variation. To demonstrate the model, we introduce a short-string language identification system with state-of-the-art accuracy, the first research on machine transliteration for a chat alphabet, and a modular part-of-speech tagger for multilingual code-mixing.

 
Adviser: Sandra Kuebler
School: INDIANA UNIVERSITY
Source: DAI/A 73-07(E), p. , Mar 2012
Source Type: Dissertation
Subjects: Linguistics; Sociolinguistics; Computer science
Publication Number: 3498944
