Feature


The emerging role of visual pattern recognition in spam filtering: challenge and opportunity for IAPR researchers

 

By Fabio Roli, Giorgio Fumera, Ignazio Pillai

 

Department of Electrical and Electronic Engineering – University of Cagliari

Piazza d’Armi – I-09123 Cagliari (Italy)

{roli, fumera, pillai}@diee.unica.it

Although anti-spam filters have recently adopted text categorisation techniques based on pattern recognition approaches for e-mail semantic content analysis (e.g., a module of the popular SpamAssassin filter is based on a Bayesian classifier; see wiki.apache.org/spamassassin/BayesInSpamAssassin), spam filtering is not in the mainstream of IAPR research. The topic is rarely highlighted in the calls for papers of the main IAPR conferences, and only one paper out of 1168 dealt with spam filtering at the last ICPR conference in Hong Kong. But things could, or maybe should, change in the near future.
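To make the idea of Bayesian text categorisation concrete, here is a minimal sketch of a multinomial naive Bayes classifier with add-one smoothing. It is only an illustration under our own assumptions: the class name NaiveBayesFilter, the whitespace tokenizer, and the toy training messages are all hypothetical, and the code is not the SpamAssassin implementation.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List, Tuple

LABELS = ("spam", "ham")


def tokenize(text: str) -> List[str]:
    # Assumption: lower-cased whitespace tokens are enough for a toy example.
    return text.lower().split()


class NaiveBayesFilter:
    """Hypothetical multinomial naive Bayes filter with add-one smoothing."""

    def __init__(self) -> None:
        self.word_counts: Dict[str, Counter] = {label: Counter() for label in LABELS}
        self.doc_counts: Counter = Counter()

    def train(self, examples: Iterable[Tuple[str, str]]) -> None:
        # Each example is a (text, label) pair with label in LABELS.
        for text, label in examples:
            self.doc_counts[label] += 1
            self.word_counts[label].update(tokenize(text))

    def log_score(self, text: str, label: str) -> float:
        # Requires at least one training message for every label.
        vocabulary = set()
        for counts in self.word_counts.values():
            vocabulary.update(counts)
        total_words = sum(self.word_counts[label].values())
        total_docs = sum(self.doc_counts.values())
        # Log prior plus the sum of smoothed log likelihoods of the tokens.
        score = math.log(self.doc_counts[label] / total_docs)
        for token in tokenize(text):
            count = self.word_counts[label][token]
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        return score

    def is_spam(self, text: str) -> bool:
        return self.log_score(text, "spam") > self.log_score(text, "ham")


spam_filter = NaiveBayesFilter()
spam_filter.train([
    ("cheap pills buy now", "spam"),
    ("win cheap prize now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project report attached", "ham"),
])
```

With this toy training set, spam_filter.is_spam("cheap pills") is true and spam_filter.is_spam("monday meeting") is false. A classifier of this kind sees only digital text, so it has nothing to work with when the message itself is carried by an image.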

 

Very recently, spammers introduced a new trick: embedding the spam message in attached images, which can defeat all current techniques based on analysing the digital text in the subject and body fields of e-mails (see Figure 1 for an example). Spam e-mails like the one in Figure 1 offered a first challenge and opportunity for IAPR researchers, as they made spam filtering a matter of visual pattern recognition, which is squarely in the mainstream of IAPR research. In particular, as a recent paper showed [1], this kind of spam made it apparent that the full arsenal of OCR and document analysis methods, mainly developed within the IAPR community, could be exploited in the spam “war”.

 

But that’s not all. In just the last few months, further challenges and opportunities for IAPR researchers have emerged from the spam arena. Spammers are applying content-obscuring techniques to images (see Figure 2) to make OCR systems ineffective without compromising human readability, and they are also starting to use methods similar to those used to create CAPTCHAs (Figure 3). (If you are not familiar with CAPTCHAs, we suggest browsing the web site of Prof. H. Baird, www.cse.lehigh.edu/~baird/research_hips.html, and the site of the Captcha Project, www.captcha.net/.) Ironically enough, spammers are using CAPTCHAs (which were invented to defend against robot spamming) to evade anti-spam robots.

 

This kind of spam has been growing so quickly (approximately 30% of all spam is now image based [2]) that a name has been coined for it on the Internet: it is now referred to as “image-based” spam (or simply “image spam”). Many commercial products have been advertised through image spam. In addition to the usual products promising weight loss or improved sexual performance, a cut-price edition of Windows Vista has recently been offered using image spam [2].

 

Although this is bad news for our inboxes, it could be good news for IAPR researchers, as the roles of visual pattern recognition, image processing, and, in general, computer vision, could become strategic in the future spam war. Because all the traditional modules of current anti-spam filters are ineffective against image spam, visual pattern recognition methods become crucial for new detection modules. Recently, two OCR-based plug-in modules of the SpamAssassin filter were delivered that are capable of analysing text embedded into images (wiki.apache.org/spamassassin/CustomPlugins).

 

But image spam could be more than a new and stimulating application for IAPR researchers. There is an opportunity for convergence and synergy with the field of CAPTCHAs. One common issue is the trade-off between the “hardness of evasion” of a content-obscuring technique applied to a text image and the users’ tolerance for reading such a cluttered image. The other side of the coin of text recognition for CAPTCHAs used for authentication (see the seminal paper by Mori and Malik [3]) is text recognition in image spam. Advanced OCR methods are required to analyse text embedded into images that spammers have obscured in a hostile way.

 

Unfortunately, spammers could also exploit the similarities with CAPTCHAs and related research fields. To quote an article in the Technology Guardian newspaper [4]: “One worrying thought: if we ever devise computers smart enough to read images, and so block those image spams, the spammers will, equally, have access to programs that can defeat CAPTCHAs”.

 

So, the approach to image spam filtering based on the analysis of text embedded into images might have both intrinsic limits (OCR of an adversarially obscured text image is a challenging task) and side effects (spammers could use similar techniques to break CAPTCHAs). On the other hand, we know that humans sometimes identify a potential intruder because their attention is drawn to the subject’s suspicious camouflage. Analogously, one approach to image spam filtering could aim at detecting the “noise”, that is, the adversarial clutter contained in the image, instead of the “signal” (the spam text message). This is precisely the alternative approach to image spam filtering that the authors are currently developing [5].
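As a purely illustrative sketch of what "detecting the noise" might look like, the code below measures one simple clutter cue in a binarised image: the fraction of ink pixels that have no neighbouring ink pixel, as produced by speckle noise. This is not the method of [5]; the feature, the function names speckle_ratio and looks_obscured, and the 0.2 threshold are hypothetical choices made only for this example.

```python
from typing import List

# Binary image as rows of pixels: 1 = ink (foreground), 0 = background.
Image = List[List[int]]


def speckle_ratio(image: Image) -> float:
    """Fraction of ink pixels with no ink pixel among their 8 neighbours."""
    height = len(image)
    width = len(image[0]) if height else 0
    ink = 0
    isolated = 0
    for y in range(height):
        for x in range(width):
            if not image[y][x]:
                continue
            ink += 1
            has_neighbour = False
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    if dy == 0 and dx == 0:
                        continue  # skip the pixel itself
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < height and 0 <= nx < width and image[ny][nx]:
                        has_neighbour = True
            if not has_neighbour:
                isolated += 1
    # An image without ink has no speckle by definition.
    return isolated / ink if ink else 0.0


def looks_obscured(image: Image, threshold: float = 0.2) -> bool:
    # Assumption: the threshold is arbitrary and would need tuning on real data.
    return speckle_ratio(image) > threshold


clean = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
noisy = [
    [1, 0, 0, 0, 1],
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0],
    [1, 0, 0, 0, 1],
]
```

Here looks_obscured(clean) is false, while looks_obscured(noisy) is true because four of the six ink pixels are isolated. A practical detector would presumably combine several such cues and learn the decision rule from labelled images rather than rely on a single fixed threshold.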

 

As P.K. Chan and R.P. Lippmann recently pointed out [6], the existence of image spam, just like other computer security threats, presents a new, important direction for pattern recognition research: the development of approaches that provide sustainably good performance in hostile environments where an adversary takes actions to evade a classifier.

 

 

[1] G. Fumera, I. Pillai, F. Roli, Spam filtering based on the analysis of text information embedded into images, Journal of Machine Learning Research, Vol. 7, pp. 2699-2720, 2006.

[2] E. Dallaway, Spammers use Windows Vista as bait, Infosecurity, Vol. 4, Issue 1, p. 6, Jan./Feb. 2007.

[3] G. Mori, J. Malik, Recognizing objects in adversarial clutter: breaking a visual CAPTCHA, IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2003), Vol. 1, pp. 134-141, 2003.

[4] W.M. Grossman, What have image spam and Captchas got in common?, Technology Guardian, January 11, 2007.

[5] B. Biggio, G. Fumera, I. Pillai, F. Roli, Image spam filtering using visual information, submitted to the 2007 International Conference on Image Analysis and Processing.

[6] P.K. Chan, R.P. Lippmann, Machine learning for computer security, Journal of Machine Learning Research, Vol. 7, pp. 2669-2672, 2006.


Fig. 1: Example of spam e-mail in which the text of the spam message is embedded into an attached image. The subject and body fields contain only bogus text.

Fig. 2: Example of a recent spam e-mail designed to make OCR extraction difficult without compromising human readability.

Fig. 3: Example of an image attached to a spam e-mail received by the authors in January 2007, which seems to exploit methods similar to the ones used in CAPTCHAs.