Spam is an increasing menace for all forms of online messaging, including email, instant messaging, social media, blogs, and Web forums. Many past and current anti-spam techniques rely too heavily on content-based filtering, in which filters use the content of messages to distinguish spam from legitimate mail. This approach, however, aims at a moving target: spammers are free to evolve the content of their messages in a variety of ways in response to filtering rules, leaving content-based filters to play “catch-up”. Content-based filters also incur more overhead, because they need to accept, store, and process the content of an email before making a decision; with 90% of email (over 50 billion messages a day) being spam, content-based filters are expensive both to maintain and to scale.
In this dissertation, we introduce email spam filtering using network-level features. Network-level features are based on lightweight measurements that can be made in the network, often without processing or storing a message. Beyond just the IP address of a traffic source, network-level features also include the Autonomous System (AS) number of the source, flow sizes, packet header information, data that can be collected from structured application-level traffic streams such as DNS or HTTP, and aggregates of these features (e.g., the historical behavior of an IP address). Unlike content-based features, network-level features also afford the opportunity to observe the coordinated behavior of spammers. Network-level attributes of traffic stay relevant for longer periods and are harder for criminals to alter at will (e.g., a bot cannot act independently of other bots in the botnet).
This dissertation makes the following contributions. (1) We perform a detailed characterization of the network-level behavior of spam, including its origins, its volumetric and temporal behavior, and its relation to botnets and hijacked BGP routes. We further perform a longitudinal analysis of these features over a six-year period to examine the robustness of network-level features for email classification. We find that IP-based reputation systems such as IP blacklists may not be able to keep up with the threat of spam from previously unseen IP addresses, and from new and stealthy attacks. (2) We present three unsupervised algorithms that detect correlated behavior of spammers using network-level features. First, we introduce the stealthy spammer behavior of reconnoitering IP blacklists, and present techniques to detect such reconnaissance queries using temporal and spatial features. Second, we present SpamTracker, a system that distinguishes spammers from legitimate senders by applying clustering on the set of domains to which email is sent. Third, we introduce vote-gaming attacks in large Web-based email systems, which pollute user feedback on spam emails, and present an efficient clustering-based method to mitigate such attacks.
We have evaluated our algorithms on real-world datasets, and our work has also resulted in practical tools and applications: Our vote-gaming attack detection system has been put to use by Yahoo! Mail to detect compromised bot-controlled accounts. We have also designed a system to detect spam from potentially hijacked BGP prefixes and integrated it with our real-time dynamic blacklisting system, SpamSpotter.
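As an illustration of the idea behind clustering senders by the set of domains to which they send email, the following minimal sketch groups sender IP addresses whose recipient-domain sets overlap strongly. It is not SpamTracker's actual algorithm: the greedy threshold clustering, the Jaccard similarity measure, the 0.5 threshold, the function names, and the sample data (documentation-range IP addresses and `.example` domains) are all assumptions made for illustration.

```python
from typing import Dict, List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_senders(targets: Dict[str, Set[str]],
                    threshold: float = 0.5) -> List[List[str]]:
    """Greedy single-pass clustering (an assumed stand-in technique).

    Each sender joins the first cluster whose representative (its first
    member) has a recipient-domain set at least `threshold` similar;
    otherwise the sender starts a new cluster.
    """
    clusters: List[List[str]] = []
    for ip in sorted(targets):  # sorted for deterministic output
        for cluster in clusters:
            if jaccard(targets[ip], targets[cluster[0]]) >= threshold:
                cluster.append(ip)
                break
        else:
            clusters.append([ip])
    return clusters


# Hypothetical observations: sender IP -> recipient domains seen in a window.
observed = {
    "192.0.2.1": {"a.example", "b.example", "c.example"},
    "192.0.2.2": {"a.example", "b.example", "c.example", "d.example"},
    "198.51.100.7": {"x.example"},
}
clusters = cluster_senders(observed)
```

With this sample data, the first two senders share three of four distinct domains and fall into one cluster, while the third sender forms its own. A real system would presumably need a more robust clustering method and a way to label clusters as spam or legitimate, neither of which this sketch attempts.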