This dissertation presents a framework for incremental model selection and the processing of highly variable speech transcripts and user-generated text. The system reduces natural language processing (NLP) ambiguity by segmenting text by domain, allowing domain-specific downstream processes to analyze each segment independently.
The system receives a tokenized input text stream. At every word, an Indicator Function calculates a quantitative feature signal, which we call the Indicator Value Signal, that runs in parallel to the input stream. An event controller monitors this feature signal for domain changes and segments the stream into feature chunks. The event controller can activate slowly over large spans of text, or rapidly and intra-sententially. As the event controller marks each domain change with an event signal, pipeline processes assigned to specific Indicator Function values are executed to process the segment and add further feature signals to the feature-signal stack. At the end of the pipeline, the feature signals are unified to produce a single annotated output stream.
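To make the data flow concrete, the following minimal sketch implements the segment-and-dispatch loop in Python. All names here (run_pipeline, the word-list indicator, the processor registry) are illustrative placeholders rather than the dissertation's actual implementation, and the indicator is a toy lookup standing in for a real Indicator Function.

```python
# Minimal sketch of the streaming architecture: an indicator function is
# evaluated at every word, an event controller fires on value changes, and
# each segment is dispatched to the processor registered for that value.
# All names are illustrative, not the dissertation's implementation.

def run_pipeline(tokens, indicator, processors):
    """Segment a token stream on indicator-value changes and dispatch
    each segment to the processor registered for that value."""
    annotated = []
    segment, current = [], None
    for token in tokens:
        value = indicator(token)          # Indicator Value Signal, per word
        if current is not None and value != current:
            # Event controller fires: a domain change was detected.
            annotated.extend(processors[current](segment))
            segment = []
        segment.append(token)
        current = value
    if segment:
        annotated.extend(processors[current](segment))
    return annotated                      # unified annotated output stream

# Hypothetical usage: a two-language stream with per-language annotators.
tokens = ["the", "cat", "le", "chat"]
indicator = lambda t: "fr" if t in {"le", "chat"} else "en"
processors = {
    "en": lambda seg: [(t, "en") for t in seg],
    "fr": lambda seg: [(t, "fr") for t in seg],
}
print(run_pipeline(tokens, indicator, processors))
```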
To exemplify the framework, this dissertation makes three additional contributions. The first is a novel short-string language identification system that calculates our Indicator Value Signal. The second is a machine transliteration system that converts the Arabizi chat alphabet into Arabic script. The third is a modular part-of-speech tagger for multilingual code-mixing.
The short-string language identification system takes an n-gram as input and selects the closest of 373 reference languages using a Support Vector Machine (SVM) classifier trained on a matrix of language-model measurements. The classifier learns patterns of similarity and divergence in a language's tokens across all reference languages, yielding high accuracy on in-domain n-grams from a legal corpus as well as on out-of-domain tokens from an English-Egyptian Arabic code-mixed microblog corpus.
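A minimal sketch of this classifier setup appears below, assuming scikit-learn. The feature construction, one add-one-smoothed character n-gram log-probability per reference language, is our approximation of the "matrix of language model measurements"; the corpora, the two languages, and the smoothing are toy stand-ins for the 373-language setup.

```python
# Sketch: per-language character LM scores feed an SVM classifier.
# Feature design approximates the "matrix of language model measurements";
# it is not the dissertation's exact feature set.
import math
from collections import Counter
from sklearn.svm import LinearSVC

def char_ngrams(text, n=3):
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def train_lm(samples, n=3):
    """Character n-gram counts for one reference language."""
    counts = Counter(g for s in samples for g in char_ngrams(s, n))
    return counts, sum(counts.values())

def lm_score(lm, text, n=3):
    """Add-one-smoothed log-probability of a short string under one LM."""
    counts, total = lm
    return sum(math.log((counts[g] + 1) / (total + len(counts) + 1))
               for g in char_ngrams(text, n))

# One LM per reference language (373 in the dissertation; two here).
lms = {"en": train_lm(["the cat sat", "hello world"]),
       "es": train_lm(["el gato", "hola mundo"])}

def features(text):
    # One LM measurement per reference language -> one feature column.
    return [lm_score(lms[lang], text) for lang in sorted(lms)]

X = [features(s) for s in ["the dog", "el perro", "hello", "hola"]]
y = ["en", "es", "en", "es"]
clf = LinearSVC().fit(X, y)
print(clf.predict([features("the house")]))  # expected: ['en'] on this toy data
```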
The machine transliteration system converts Arabizi, a Latinized Arabic chat alphabet, into Arabic script so that existing NLP tools can be applied to Arabic chat text. A parallel, word-aligned corpus of the chat alphabet was collected from a dozen Arabic speakers. From this corpus we induced a probabilistic mapping of cross-dialect Arabizi characters to Arabic script and used the mapping to train a highly accurate transducer.
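The sketch below illustrates decoding with a probabilistic character mapping. The mapping entries and probabilities are invented for demonstration; the dissertation's transducer is induced from the word-aligned corpus rather than hand-written, and a real decoder would search over alternative segmentations rather than decode greedily.

```python
# Illustrative greedy decoder over a probabilistic Arabizi-to-Arabic
# character mapping. All probabilities below are invented toy weights.

# P(arabic | arabizi substring); digraphs are matched before single chars.
MAPPING = {
    "sh": [("ش", 0.9), ("سه", 0.1)],
    "s":  [("س", 0.7), ("ص", 0.3)],
    "3":  [("ع", 1.0)],   # numerals stand in for Arabic letters in Arabizi
    "7":  [("ح", 1.0)],
    "a":  [("ا", 0.6), ("ى", 0.2), ("", 0.2)],  # short vowels often dropped
    "l":  [("ل", 1.0)],
    "m":  [("م", 1.0)],
    "r":  [("ر", 1.0)],
    "b":  [("ب", 1.0)],
}

def transliterate(word):
    """Greedy longest-match decoding: at each position, consume the longest
    known Arabizi substring and emit its most probable Arabic mapping."""
    out, i = [], 0
    while i < len(word):
        for span in (2, 1):  # prefer digraphs like "sh" over single chars
            chunk = word[i:i + span]
            if chunk in MAPPING:
                arabic, _p = max(MAPPING[chunk], key=lambda x: x[1])
                out.append(arabic)
                i += span
                break
        else:
            out.append(word[i])  # pass through unknown characters
            i += 1
    return "".join(out)

print(transliterate("mar7aba"))  # an Arabic-script rendering under toy weights
```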
The multilingual part-of-speech tagger demonstrates the modularity of our framework. We find that segmenting a stream by language before tagging, and then applying homogeneous single-language models, is competitive with heterogeneous multilingual tagging models. We compare the two approaches on an English-Spanish code-mixed speech transcript.
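The comparison can be pictured as follows: a single heterogeneous model sees the mixed stream whole, while the modular approach routes each language-identified segment to a homogeneous tagger. Both taggers in this sketch are hypothetical stubs, not the models evaluated in the dissertation.

```python
# Contrast of the two tagging strategies on a code-mixed token stream.
# Both taggers are hypothetical stubs standing in for trained models.

def heterogeneous_tag(tokens):
    """One multilingual model tags the whole mixed stream."""
    return [(t, "MIXED-MODEL-TAG") for t in tokens]

def modular_tag(segments, taggers):
    """Segment-then-tag: each language-identified segment goes to a
    homogeneous, single-language model."""
    return [pair for lang, toks in segments for pair in taggers[lang](toks)]

taggers = {"en": lambda toks: [(t, "EN-TAG") for t in toks],
           "es": lambda toks: [(t, "ES-TAG") for t in toks]}

stream = ["I", "want", "una", "manzana"]
segments = [("en", ["I", "want"]), ("es", ["una", "manzana"])]
print(heterogeneous_tag(stream))
print(modular_tag(segments, taggers))
```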
In addition to language identification, we consider a range of alternative indicator functions, such as genre identification, entropy, and gender identification, which could layer a language-adaptation capability on top of existing NLP systems and improve accuracy and performance in variational processing.
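As one example, an entropy-based indicator could be computed over a sliding window of the token stream. The windowed Shannon-entropy formulation below, including the window size, is illustrative rather than a formulation fixed by the dissertation.

```python
# Sketch of one alternative indicator function: a sliding-window character
# entropy signal computed at every word. Window size is an assumption.
import math
from collections import Counter, deque

def entropy_indicator(tokens, window=20):
    """Yield one value per token: the Shannon entropy (in bits) of the
    character distribution over the last `window` tokens."""
    buf = deque(maxlen=window)
    for token in tokens:
        buf.append(token)
        counts = Counter(c for t in buf for c in t)
        total = sum(counts.values())
        yield -sum((n / total) * math.log2(n / total)
                   for n in counts.values())

tokens = ["the", "cat", "sat", "el", "gato"]
for token, h in zip(tokens, entropy_indicator(tokens)):
    print(f"{token}\t{h:.2f}")
```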
To summarize, this dissertation provides an NLP architecture that better handles complicated language variation. To demonstrate the model, we introduce a short-string language identification system with state-of-the-art accuracy, the first research on machine transliteration for a chat alphabet, and a modular part-of-speech tagger for multilingual code-mixing.