News Based Forecasting and Modeling
by Zhang, Wenbin, Ph.D., STATE UNIVERSITY OF NEW YORK AT STONY BROOK, 2011, 205 pages; 3481927

Abstract:

This thesis focuses on forecasting and modeling problems based on quantitative news data. The news data in my experiments are produced by Lydia, our news analysis system, which is capable of analyzing spatial, temporal, and linguistic statistics of named entity occurrences in text corpora across different news sources. Specifically, the problems I studied fall into two categories: (1) how news data can help people analyze and predict societal variables, especially financial variables such as movie grosses and stock prices; and (2) what process generates news data, and how we can predict the distribution of future news generation.

On the one hand, traditional financial analysis emphasizes how price data incorporate other relevant financial indicators. Since the 1990s, linguistic sources such as news have repeatedly been shown to carry additional, meaningful information beyond traditional quantitative financial data, and thus they can be used as predictive indicators in finance. In this thesis, we conduct a comprehensive study of large-scale news data modeling and of how news data support financial analysis, examining two important financial markets. First, we show how news data help build models to analyze and predict financial markets with coarse time granularity, such as the movie market. Next, we show how financial markets with finer time granularity, such as stock markets, can also be factored and analyzed with news. Our analysis provides concrete evidence that news data are highly informative and have significant predictive power for financial analysis, a claim that has been made in the literature but had not previously been demonstrated by large-scale analysis of real data.

On the other hand, the thesis also studies the statistical patterns of news, builds models to generate news time series, and attempts to forecast future news fluctuations. Our statistical analysis shows that log-normal and power-law distributions can generally describe many aspects of news behavior. Based on the principles we discovered, we propose two models to describe news: a Log-Normal (LN) model and an innovative Layered Hidden Markov Model (LHMM). Our careful studies show that the LHMM is overall a favorable model for simulating news data and forecasting future news pulses. Most importantly, we study and forecast the future of news entities in a group context. Based on our analysis, we can answer some interesting news forecasting questions. For example, what is the probability that an entity becomes the most famous one in a group? And what is the likelihood that a trivial entity becomes highly important within a given future time period? Our analysis shows that these questions can be addressed by fitting power-law tails, and we validated the model on several interesting news groups in different domains. Our study provides useful insights for the analysis of issues in finance, political science, and social science.
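
As a rough illustration of what fitting a power-law tail involves, the sketch below estimates a tail exponent by maximum likelihood and uses it to compute the chance of exceeding a given level. It is a minimal sketch and not the method used in the thesis: the function names, the threshold xmin, and the synthetic data are all assumptions made for illustration, and no Lydia data are involved.

```python
import math
import random


def fit_power_law_tail(values, xmin):
    """Maximum-likelihood exponent alpha for a density p(x) ~ x^(-alpha), x >= xmin.

    xmin is a hypothetical, user-chosen tail threshold.
    """
    tail = [v for v in values if v >= xmin]
    if not tail:
        raise ValueError("no observations at or above xmin")
    log_sum = sum(math.log(v / xmin) for v in tail)
    if log_sum == 0.0:
        raise ValueError("all tail observations equal xmin; exponent is undefined")
    return 1.0 + len(tail) / log_sum


def tail_probability(x, xmin, alpha):
    """P(X > x | X >= xmin) under the fitted continuous power law."""
    if x < xmin:
        return 1.0
    return (x / xmin) ** (-(alpha - 1.0))


def sample_power_law(n, xmin, alpha, rng):
    """Draw n synthetic values by inverse-transform sampling (illustration only)."""
    return [xmin * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic stand-in for per-entity news volumes; not real news data.
    volumes = sample_power_law(20000, xmin=1.0, alpha=2.5, rng=rng)
    alpha_hat = fit_power_law_tail(volumes, xmin=1.0)
    print("estimated exponent:", round(alpha_hat, 3))
    print("P(volume > 10):", tail_probability(10.0, 1.0, alpha_hat))
```

With a fitted exponent in hand, a question such as "how likely is a minor entity to exceed a given news volume" reduces to evaluating tail_probability at that volume. Choosing xmin well matters in practice, and this sketch leaves that choice to the reader.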

 
Adviser: Steven Skiena
School: STATE UNIVERSITY OF NEW YORK AT STONY BROOK
Source: DAI/B 73-03, p. , Dec 2011
Source Type: Dissertation
Subjects: Finance; Computer science
Publication Number: 3481927
  Use the link below to access a full citation record of this graduate work:
  http://gateway.proquest.com/openurl%3furl_ver=Z39.88-2004%26res_dat=xri:pqdiss%26rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation%26rft_dat=xri:pqdiss:3481927