Mitigating spam using network-level features
by Ramachandran, Anirudh V., Ph.D., GEORGIA INSTITUTE OF TECHNOLOGY, 2011, 231 pages; 3484132

Abstract:

Spam is an increasing menace for all forms of online messaging including email, instant messaging, social media, blogs, and Web forums. Many past and current approaches to tackling spam rely too heavily on content-based approaches, where filters use the content of spam messages to distinguish them from legitimate messages. This approach, however, aims at a moving target: spammers are free to evolve the content of their messages in a variety of ways in response to filtering rules, leaving content-based filters to play “catch-up”. Content-based filters also incur more overhead, because they need to accept, store, and process the content of an email before making a decision; with 90% of email—over 50 billion messages a day—being spam, content-based filters are expensive both to maintain and to scale.

In this dissertation, we introduce email spam filtering using network-level features. Network-level features are based on lightweight measurements that can be made in the network, often without processing or storing a message. Beyond the IP address of a traffic source, network-level features also include the Autonomous System (AS) number of the source, flow sizes, packet header information, data collected from structured application-level traffic streams such as DNS or HTTP, and aggregates of these features (e.g., the historical behavior of an IP address). Unlike content-based features, network-level features also afford the opportunity to observe the coordinated behavior of spammers. Network-level attributes of traffic stay relevant for longer periods and are harder for criminals to alter at will (e.g., a bot cannot act independently of other bots in the botnet).
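As a rough illustration of what an aggregate of network-level features might look like, the following Python sketch summarizes connection history per source AS without touching message content. The record fields, the function name, and the choice of per-AS statistics are all assumptions made for this example; they are not taken from the dissertation.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ConnectionRecord:
    """One observed SMTP connection. All field names are hypothetical."""
    source_ip: str
    source_asn: int
    bytes_sent: int
    flagged_spam: bool  # label from some earlier decision, assumed available


def aggregate_by_asn(records):
    """Summarize per-AS history: connection count, mean flow size, spam ratio."""
    totals = defaultdict(lambda: {"connections": 0, "bytes": 0, "spam": 0})
    for r in records:
        t = totals[r.source_asn]
        t["connections"] += 1
        t["bytes"] += r.bytes_sent
        t["spam"] += int(r.flagged_spam)
    return {
        asn: {
            "connections": t["connections"],
            "mean_bytes": t["bytes"] / t["connections"],
            "spam_ratio": t["spam"] / t["connections"],
        }
        for asn, t in totals.items()
    }
```

A filter built along these lines could consult the per-AS summary when a new connection arrives, before accepting or storing the message body.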

This dissertation makes the following contributions. (1) We perform a detailed characterization of the network-level behavior of spam, including its origins, volumetric and temporal behavior, and its relation to botnets and hijacked BGP routes. We further perform a longitudinal analysis of these features over a six-year period to examine the robustness of network-level features for email classification. We find that IP-based reputation systems such as IP blacklists may not be able to keep up with the threat of spam from previously unseen IP addresses, and from new and stealthy attacks. (2) We present three unsupervised algorithms that detect correlated behavior of spammers using network-level features. First, we introduce the stealthy spammer behavior of reconnoitering IP blacklists, and present techniques to detect such queries using temporal and spatial features. Second, we present SpamTracker, a system that distinguishes spammers from legitimate senders by clustering on the set of domains to which email is sent. Third, we introduce vote-gaming attacks in large Web-based email systems that pollute user feedback on spam emails, and present an efficient clustering-based method to mitigate such attacks.
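The idea of grouping senders by the set of domains they send to can be sketched in a few lines of Python. This is only a generic illustration using Jaccard similarity and single-link grouping; it is not SpamTracker's actual algorithm, which this abstract does not specify. The function names, the threshold value, and the input layout are assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_senders(domains_by_sender, threshold=0.5):
    """Group senders whose recipient-domain sets overlap (single-link).

    domains_by_sender maps a sender identifier (e.g. an IP address) to the
    set of recipient domains it was seen sending to. The threshold is an
    arbitrary illustrative value.
    """
    senders = sorted(domains_by_sender)
    parent = {s: s for s in senders}

    def find(s):
        # Union-find lookup with path halving.
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    # Merge every pair of senders that is similar enough.
    for i, a in enumerate(senders):
        for b in senders[i + 1:]:
            if jaccard(domains_by_sender[a], domains_by_sender[b]) >= threshold:
                parent[find(b)] = find(a)

    clusters = {}
    for s in senders:
        clusters.setdefault(find(s), set()).add(s)
    return sorted(clusters.values(), key=lambda c: sorted(c))
```

Senders that land in a large cluster with known spammers would be candidates for closer scrutiny, even if their own IP addresses have no history. The pairwise comparison is quadratic in the number of senders, so this form suits small examples only.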

We have evaluated our algorithms on real-world datasets, and our work has also resulted in practical tools and applications: our vote-gaming attack detection system has been put to use by Yahoo! Mail to detect compromised bot-controlled accounts. We have also designed a system to detect spam from potentially hijacked BGP prefixes and integrated it with our real-time dynamic blacklisting system, SpamSpotter.

 
Adviser: Nick Feamster
School: GEORGIA INSTITUTE OF TECHNOLOGY
Source: DAI/B 73-02, p. , Nov 2011
Source Type: Dissertation
Subjects: Computer science
Publication Number: 3484132
  Use the link below to access a full citation record of this graduate work:
  http://gateway.proquest.com/openurl%3furl_ver=Z39.88-2004%26res_dat=xri:pqdiss%26rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation%26rft_dat=xri:pqdiss:3484132