Efficient methods to store and query network data

by Giura, Paul, Ph.D., POLYTECHNIC INSTITUTE OF NEW YORK UNIVERSITY, 2010, 141 pages; 3457995


Network data crosses network boundaries in both directions, and many organizations record traces of network connections for monitoring and investigation purposes. With the growth in network traffic and the increasing sophistication of attacks, there is a need for efficient methods to store and query these data. In this dissertation we propose new efficient methods for storing and querying network payload and flow data that can be used to enhance the performance of monitoring and forensic analysis.

We first address the efficiency of various methods used for payload attribution. Given a history of packet transmissions and an excerpt of a possible packet payload, a Payload Attribution System (PAS) makes it feasible to identify the sources, destinations and times of appearance on a network of all the packets that contained the specified payload excerpt. A PAS, as one of the core components of a network forensics system, enables the investigation of cybercrimes on the Internet by, for example, tracing the spread of worms and viruses, identifying who has received a phishing email in an enterprise, or discovering which insider allowed an unauthorized disclosure of sensitive information. Given the increasing volume of traffic in today's networks, it is infeasible to store and query all the actual packets for extended periods of time for investigations. In this dissertation we focus on extremely compressed digests of payload data. We analyze the existing approaches and propose several new methods for payload attribution which utilize Rabin fingerprinting, shingling, and winnowing. Our best methods allow building payload attribution systems which provide data reduction ratios greater than 100:1 while supporting efficient queries with very low false positive rates. We demonstrate the properties of the proposed methods and specifically analyze their performance and practicality when used as modules of a network forensics system.
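
To make the idea of querying a payload digest concrete, the following is a minimal sketch in Python, not the dissertation's implementation. It assumes a simplified scheme: every k-byte substring is hashed (BLAKE2b stands in for Rabin fingerprinting), winnowing keeps the minimum hash of each window of w consecutive hashes, and an exact dictionary stands in for the compressed digest. The names PayloadDigest, add_packet and query, and the parameters k and w, are hypothetical. Because winnowing selects fingerprints locally, every fingerprint selected from an excerpt of at least w + k - 1 bytes is also selected from any payload containing that excerpt, so the query has no false negatives; false positives remain possible.

```python
from __future__ import annotations

import hashlib


def kgram_hashes(data: bytes, k: int) -> list[int]:
    """Hash every k-byte substring (a stand-in for Rabin fingerprints)."""
    return [
        int.from_bytes(hashlib.blake2b(data[i:i + k], digest_size=8).digest(), "big")
        for i in range(len(data) - k + 1)
    ]


def winnow(data: bytes, k: int = 4, w: int = 8) -> set[int]:
    """Keep the minimum hash from every window of w consecutive k-gram hashes."""
    hashes = kgram_hashes(data, k)
    return {min(hashes[i:i + w]) for i in range(len(hashes) - w + 1)}


class PayloadDigest:
    """Hypothetical digest: maps winnowed fingerprints to packet metadata."""

    def __init__(self, k: int = 4, w: int = 8) -> None:
        self.k, self.w = k, w
        self.index: dict[int, set[tuple[str, str, int]]] = {}

    def add_packet(self, src: str, dst: str, ts: int, payload: bytes) -> None:
        """Record only the selected fingerprints, never the payload itself."""
        for fp in winnow(payload, self.k, self.w):
            self.index.setdefault(fp, set()).add((src, dst, ts))

    def query(self, excerpt: bytes) -> set[tuple[str, str, int]]:
        """Return packets whose digest contains every fingerprint of the excerpt."""
        fps = winnow(excerpt, self.k, self.w)
        if not fps:
            raise ValueError("excerpt must be at least w + k - 1 bytes long")
        return set.intersection(*(self.index.get(fp, set()) for fp in fps))


digest = PayloadDigest()
digest.add_packet("10.0.0.1", "10.0.0.2", 100,
                  b"GET /index.html HTTP/1.1 malicious-worm-signature-XYZ")
digest.add_packet("10.0.0.3", "10.0.0.4", 200,
                  b"completely unrelated benign traffic here")
matches = digest.query(b"malicious-worm-signature")
```

A system aiming at the data reduction ratios discussed above would presumably replace the exact dictionary with a compact probabilistic structure; that substitution is outside the scope of this sketch.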

Next, we propose a column-oriented storage infrastructure for storing historical network flow data. Transactional row-oriented databases provide satisfactory query performance for network flow data collected only over a period of several hours. In many cases, such as the detection of sophisticated coordinated attacks, it is crucial to rapidly query days, weeks or even months' worth of disk-resident historical data. For such monitoring and forensics queries, row-oriented databases become I/O bound due to long disk access times. Furthermore, their data insertion rate degrades as the number of indexes grows, and query processing time increases when unused attributes must be loaded along with the used ones. To overcome these problems, in this dissertation we propose a new column-oriented storage infrastructure for network flow records and present the performance evaluation of a prototype storage system implementation called NetStore. The system is aware of network data semantics and access patterns, and benefits from the simple column-oriented layout without the need to meet the requirements of general-purpose databases. We show that NetStore can potentially achieve more than a tenfold query speedup and ninety times lower storage requirements compared to traditional row-stores, while performing better than existing open source column-stores for network flow data.
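
The advantage of a column-oriented layout for flow queries can be illustrated with a small Python sketch. This is an assumption-laden toy, not NetStore's design: the class FlowColumnStore, its column names and its sum_where method are all hypothetical. Each attribute is kept in its own typed array, so a query that filters on one attribute and aggregates another reads only those two columns and never loads the rest of the record.

```python
from array import array
from ipaddress import IPv4Address


class FlowColumnStore:
    """Hypothetical in-memory column store for network flow records."""

    # Column name -> array typecode (unsigned short / long / long long).
    COLUMNS = {
        "start_time": "Q",
        "src_ip": "L",
        "dst_ip": "L",
        "dst_port": "H",
        "bytes": "Q",
    }

    def __init__(self) -> None:
        # One contiguous typed array per attribute instead of one row per flow.
        self.columns = {name: array(code) for name, code in self.COLUMNS.items()}

    def append(self, **record: int) -> None:
        """Append one flow record; every column must be supplied."""
        if record.keys() != self.COLUMNS.keys():
            raise ValueError("record must contain exactly the declared columns")
        for name, value in record.items():
            self.columns[name].append(value)

    def sum_where(self, filter_col: str, value: int, sum_col: str) -> int:
        """Sum sum_col over flows where filter_col == value, touching two columns only."""
        filt, vals = self.columns[filter_col], self.columns[sum_col]
        return sum(v for f, v in zip(filt, vals) if f == value)


def ip(text: str) -> int:
    """Encode a dotted-quad IPv4 address as an integer."""
    return int(IPv4Address(text))


store = FlowColumnStore()
store.append(start_time=1000, src_ip=ip("10.0.0.1"), dst_ip=ip("10.0.0.9"),
             dst_port=443, bytes=1500)
store.append(start_time=1001, src_ip=ip("10.0.0.2"), dst_ip=ip("10.0.0.9"),
             dst_port=80, bytes=400)
store.append(start_time=1002, src_ip=ip("10.0.0.3"), dst_ip=ip("10.0.0.9"),
             dst_port=443, bytes=2500)
https_bytes = store.sum_where("dst_port", 443, "bytes")
```

In a row-oriented layout the same query would have to read every attribute of every record, which is the source of the extra I/O described above.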

Finally, we propose an efficient querying framework to represent, implement and execute forensics and monitoring queries faster on historical network flow data. Using efficient filtering methods, the query processing algorithms can improve query runtime performance by up to an order of magnitude for simple filtering and aggregation queries, and by up to a factor of six for complex batch queries, when compared to naive approaches. Additionally, we propose a simple SQL extension that implements a subset of standard SQL commands and operators and a small set of features useful for network monitoring and forensics. The presented query processing engine, together with a column storage infrastructure, forms a complete system for storing and querying network flow data efficiently when used for monitoring and forensic analysis.

Adviser: Nasir Memon
Source Type: Dissertation
Subjects: Computer science
Publication Number: 3457995
