Get to Know

I started my research career in engineering document analysis, more specifically engineering drawing recognition. Supervised by Prof. Dov Dori, IAPR Fellow, my PhD study was quite productive, yielding three major contributions. The first was the sparse pixel tracking algorithm for engineering drawing vectorization, published in IEEE T-PAMI (1999). The second was the generic graphics recognition algorithm and its application to a variety of graphic objects, including lines of various shapes (e.g., straight, circular, and polyline) and styles (e.g., solid, dashed, dash-dotted, dash-dot-dotted), text areas, arrowheads, leaders, dimension sets, and hatched areas (published in CVIU 1998, IEEE T-PAMI 1998, and IEEE T-SMC 1999). The third was the performance evaluation protocol for graphics recognition algorithms (published in MVA 1997), which has been used successfully in the arc segmentation contest series (2001, 2003, 2005, and 2009). All of this work was implemented in the Machine Drawing Understanding System (MDUS), which won first place in the Dashed Line Detection Contest at GREC 1995. An updated version of MDUS has also been released as an open source project at code.google.com/p/vrliu/.

After graduation, I extended my research area to online graphics recognition and web document analysis. For online graphics recognition, we developed a system named QuickDiagram (previously SmartSketchpad) for quick circuit diagram input and understanding, which involves a number of methods for stroke processing, symbol recognition, and syntax and semantic recognition. While a user is sketching a (complete or partial) symbol, or a wire connecting two symbols, the system recognizes and beautifies it immediately. Once the entire circuit diagram is complete, it can be analyzed and understood via nodal analysis, and PSpice code can be generated. The QuickDiagram system has also been released as an open source project at code.google.com/p/quickdiagram/.
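For readers unfamiliar with nodal analysis, the following is a minimal textbook-style sketch in Python, not QuickDiagram's actual code. It handles only resistors and independent current sources, and the function names `nodal_analysis` and `solve_linear` as well as the tuple formats are assumptions made for this illustration.

```python
from fractions import Fraction


def solve_linear(A, b):
    """Solve A x = b by Gauss-Jordan elimination with exact fractions."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if M[r][col] != 0), None)
        if pivot is None:
            raise ValueError("singular system (floating node?)")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col and M[r][col] != 0:
                f = M[r][col] / M[col][col]
                M[r] = [x - f * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def nodal_analysis(num_nodes, resistors, current_sources):
    """Return the voltages of nodes 1..num_nodes (node 0 is ground).

    resistors:       list of (node_a, node_b, ohms)
    current_sources: list of (from_node, to_node, amps); the source
                     pushes current out of from_node and into to_node.
    """
    G = [[Fraction(0)] * num_nodes for _ in range(num_nodes)]
    I = [Fraction(0)] * num_nodes
    for a, b, ohms in resistors:
        g = 1 / Fraction(ohms)  # conductance of this resistor
        for node in (a, b):
            if node:  # ground has no row in the matrix
                G[node - 1][node - 1] += g
        if a and b:
            G[a - 1][b - 1] -= g
            G[b - 1][a - 1] -= g
    for src, dst, amps in current_sources:
        if src:
            I[src - 1] -= Fraction(amps)
        if dst:
            I[dst - 1] += Fraction(amps)
    return solve_linear(G, I)


# 1 A injected into node 1; 2 ohms between nodes 1 and 2; 3 ohms from node 2 to ground.
voltages = nodal_analysis(2, [(1, 2, 2), (2, 0, 3)], [(0, 1, 1)])
print([str(v) for v in voltages])
```

A real circuit simulator also has to handle voltage sources and nonlinear devices, typically through modified nodal analysis, which this sketch leaves out.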

One day in 2004, I read in a newspaper that certain banks’ websites had been mimicked and that some victims had lost private information and money to phishers. The news inspired me to explore the threat of phishing and the problem of anti-phishing. I immediately thought that we could apply the document analysis approach to anti-phishing. In particular, if a suspicious webpage is very similar to a legitimate webpage (what we call the phishing target), we can be more confident that it is a fraudulent, fake webpage. Since then, we have developed quite a few practical solutions for anti-phishing.

The first approach we invented is a so-called active and visual strategy for anti-phishing, which was published at the International World Wide Web Conference (2005) and in IEEE Internet Computing Magazine (2006). Compared with blacklist-based solutions that wait for phishing attacks to arrive at the end user, our active and visual strategy actively goes out to search for and detect possible phishing attacks that look similar to the protected, true websites. In this method, suspicious URLs are collected not only at email servers and browsers but also at DNS servers, and possible suspicious domain names are enumerated from the protected true domain names and their variations. Webpages at these suspicious URLs then undergo visual comparison with the protected true webpages to test whether they are phishing pages. The more similar a webpage looks to a protected page, the more likely it is to be a phishing page. An image-based webpage similarity assessment method was published in IEEE Transactions on Dependable and Secure Computing (2006).
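As a rough illustration of enumerating domain name variations, here is a minimal Python sketch. The four variation rules, the `LOOKALIKES` table, and the function name `domain_variations` are assumptions chosen for this example; the text above does not specify which variations the published method generates.

```python
# Hypothetical look-alike substitutions (letter -> similar-looking digit).
LOOKALIKES = {"o": "0", "l": "1", "i": "1", "e": "3", "s": "5"}


def domain_variations(domain):
    """Enumerate simple typo-style variations of a protected domain name.

    Only the label before the first dot is varied; the rest is kept as-is.
    """
    name, dot, suffix = domain.partition(".")
    variants = set()
    for i in range(len(name)):
        # Omission: drop one character.
        variants.add(name[:i] + name[i + 1:])
        # Repetition: double one character.
        variants.add(name[:i] + name[i] * 2 + name[i + 1:])
        # Substitution: replace a character with a look-alike.
        if name[i] in LOOKALIKES:
            variants.add(name[:i] + LOOKALIKES[name[i]] + name[i + 1:])
    for i in range(len(name) - 1):
        # Transposition: swap two adjacent characters.
        variants.add(name[:i] + name[i + 1] + name[i] + name[i + 2:])
    variants.discard(name)  # swapping identical letters reproduces the original
    variants.discard("")    # a one-letter name minus its letter
    return sorted(v + dot + suffix for v in variants)


candidates = domain_variations("example.com")
print(len(candidates), candidates[:5])
```

Each candidate would then be resolved and, if a page exists, passed on to the visual comparison step described above.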

Following the active and visual strategy, we also foresaw the problem of Unicode-based phishing and provided a set of countermeasures, including a coloring-based method. Unicode has become a useful tool for information internationalization, particularly in web links, webpages, and emails. However, many Unicode glyphs look so similar that this feature can be exploited maliciously to trick people’s eyes. We proposed Unicode string coloring as a promising countermeasure to this emerging threat. This solution assigns colors to a set of required languages/scripts such that each language/script is displayed in its own color, while the color difference among different languages is maximized. Fixed and adaptive coloring schemes are used to render Unicode strings in web links and documents so as to distinguish mixed Unicode characters from different language/script groups and to make potential homograph obfuscation vividly apparent, both to end users and for possible forensic use. A paper on this work was published at the Asia-Pacific Web Conference (2008).
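The idea of coloring a string by script can be sketched in a few lines of Python. This is an illustration only and not the published scheme: it guesses the script from the first word of each character's Unicode name (a heuristic), and it spaces hues evenly around the color wheel as a simple stand-in for maximizing color difference. The function names `script_of`, `assign_colors`, and `color_spans` are assumptions.

```python
import colorsys
import unicodedata


def script_of(ch):
    """Heuristic script label: first word of the Unicode character name."""
    if not ch.isalpha():
        return None  # digits and punctuation are left uncolored
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else None


def assign_colors(scripts):
    """Give each script a hex color, with hues spread evenly over the wheel."""
    ordered = sorted(scripts)
    colors = {}
    for i, script in enumerate(ordered):
        r, g, b = colorsys.hsv_to_rgb(i / len(ordered), 1.0, 0.8)
        colors[script] = "#{:02x}{:02x}{:02x}".format(
            round(r * 255), round(g * 255), round(b * 255)
        )
    return colors


def color_spans(text):
    """Split text into (substring, color) runs and flag mixed scripts."""
    scripts = {s for s in map(script_of, text) if s}
    colors = assign_colors(scripts)
    spans = []
    for ch in text:
        color = colors.get(script_of(ch))
        if spans and spans[-1][1] == color:
            spans[-1] = (spans[-1][0] + ch, color)
        else:
            spans.append((ch, color))
    return spans, len(scripts) > 1


# "ex\u0430mple.com" contains CYRILLIC SMALL LETTER A in place of Latin "a".
spans, mixed = color_spans("ex\u0430mple.com")
print(mixed)
for fragment, color in spans:
    print(repr(fragment), color)
```

Because the colors here depend on which scripts appear in the string, this corresponds loosely to an adaptive scheme; a fixed scheme would instead use one predefined color table for all strings.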

Most existing anti-phishing solutions (including our previous ones) need to know the phishing target in order to determine whether a suspicious webpage is a phishing page. Why not try to find phishing targets automatically? We therefore posed phishing target discovery as an important task for anti-phishing, and we have since proposed quite a few solutions to it. Given a suspicious URL, we can determine whether it is a phishing webpage and, if so, which true webpage (the phishing target) it is attacking. One of these solutions was published in the journal Future Generation Computer Systems (2010), and some others are under review. Finding phishing targets has many advantages. On the one hand, if we correctly find the phishing target of a suspicious webpage, we can inform the target’s owner so that they can immediately take the necessary countermeasures. On the other hand, if we can find the target, we can also confirm that the suspicious webpage is a phishing webpage and prevent the end user’s personal information from being leaked.

All of these anti-phishing solutions have been implemented and commercialized in practical systems. So far, our commercial anti-phishing product, SiteWatcher, comes in three versions: SiteWatcher Client for end users, which pops up a warning and colors suspicious URLs; SiteWatcher Enterprise, which companies use to find websites that are attacking their protected true websites; and SiteWatcher Service, which determines whether a user-queried URL is a phishing URL and, if so, identifies its phishing target. SiteWatcher Client has been downloaded more than 50,000 times from a few freeware websites, including download.com. Two licenses of SiteWatcher Enterprise have been sold, and a prestigious bank in Hong Kong is actively using it every day. SiteWatcher Service is deployed at www.SiteWatcher.cn and SiteWatcher.cs.cityu.edu.hk and has received more than 20,000 queries. Over the past five years, this anti-phishing project has attracted media attention; more than 20 media outlets (in Chinese, English, and Japanese, across TV, newspapers, and magazines) have reported on it.

I am very happy that my research work is having an impact not only on academia but also on industry, and even on end users’ daily lives. Like all researchers, I am always looking for problems in my daily life, and I have been excited to identify and solve new problems in pattern recognition. I am especially proud of bringing new vitality to document analysis.

Getting to know…

Wenyin Liu, IAPR Fellow