Using Network Tracing to Address Internet Phishing

Stefan Saroiu (faculty) and Troy Ronda (graduate student)
Math & Computational Sciences
University of Toronto at Mississauga


1 Background, Purpose, Objectives

The goal of our research is to address Internet security problems using Internet traffic passive measurement. In this document, we focus on a recent security problem: Web phishing. Web phishing is a security attack in which Web users disclose their credentials and personal information to malicious third parties on the Internet. Our project is centered on answering three simple questions about Web phishing: (1) who are the attackers; (2) who are the victims; and (3) how can we detect, prevent, and eradicate these attacks? We plan to answer these questions by analyzing the Internet traffic exchanged by a large population of users, such as all Internet users at the University of Toronto at Mississauga.

We will investigate these three questions without revealing the identity of the victims or the attackers. For example, we will examine whether Web phishing victims are frequent Internet users, whether they receive a large amount of spam e-mail, and whether they download executable code that can be exploited by a malicious third party. Similarly, we will not determine the attackers’ identities. Instead, we will examine whether these attacks come from parts of the Internet with which no other traffic is exchanged (i.e., network address ranges that send no traffic other than Web phishing traffic). Answering such questions could provide initial insight into how to build simpler and more effective Web phishing detection and prevention solutions. We emphasize again that we neither inspect nor record the identities of any of the parties whose traffic we are monitoring.

This document’s goal is to address the ethical issues related to monitoring Internet traffic at the University of Toronto at Mississauga. We would like to mention that this proposal’s PI operated a similar network measurement platform at the University of Washington [6]. In this earlier project, the PI addressed similar ethical issues and concerns. The measurement platform at the University of Washington led to a fruitful research agenda: it was used in four research studies over seven years leading to four Ph.D. dissertations (including the PI’s) and many research publications appearing in top-tier journals and conferences.

In the remainder of this section, we start by describing Web phishing and showing that it is a timely and important Internet security research topic. Next, we present our Internet traffic passive measurement project focusing on why our measurement will reveal necessary insight into how to address the Web phishing threat. Finally, we briefly summarize the ethical issues related to this project and how we plan to address them.

What is Web phishing?

Web phishing is a security attack in which malicious entities attempt to obtain users’ personal information and credentials by impersonating legitimate Internet Websites. A typical Web phishing attack occurs as follows. Malicious parties contact tens of thousands of unsuspecting Internet users by e-mail that appears to originate from legitimate companies. These e-mail messages contain fabricated reasons inviting their recipients to log in to a Website whose address is included in the e-mail’s body. Once an unsuspecting user logs in, their credentials are sent to the malicious party’s Website. Figure 1 illustrates a real Web phishing e-mail in which users are invited to log in to a bank’s fabricated Website.




Figure 1: A Real Phishing Attack. On the left, the figure shows the e-mail received by many unsuspecting Internet users. On the right, the figure shows the Website asking users to divulge their credentials. If the attack is successful, victims will disclose their bank account passwords. This attack occurred on June 16, 2006.

Many Internet users do not fall prey to these “luring” e-mails. Nevertheless, even if a very small fraction of recipients follow these e-mails and disclose their credentials, this small fraction can amount to thousands of credentials being disclosed to the attacker, leading to serious damage.

The Web phishing security threat is serious. A recent study [3] estimates that Web phishing cost the American economy about one billion dollars last year. Web phishing is one of the fastest growing Internet attacks; it affected over one million Internet users in the U.S. alone in 2005. Addressing Web phishing is important because successful phishing attacks erode customers’ confidence in the Web as a reliable and secure e-commerce platform. Even the local banking industry has named Web phishing as the most important Internet security issue it faces [9].

Because Web phishing is recent, it has received little attention from the research community. Most academic work has focused on individual user behavior. These studies range from building e-mail filters [2] to participant studies of mock attacks [1, 10] to mounting a “real” attack [4]. Solutions to Web phishing that involve secure hardware (e.g., a smart cell-phone) have also been proposed [7].

What is Internet Traffic Passive Measurement?

Our project plans to monitor the Internet traffic exchanged by the University of Toronto at Mississauga (UTM) with the rest of the Internet. We are developing a measurement platform that will be integrated with the campus network at UTM. Our platform is passive: it does not interfere with the existing traffic whatsoever. Instead, the campus network provides a copy of all data on the network to our platform. In this way, no failure or error of our platform can affect the campus network or the users’ traffic.

By passively monitoring the Internet traffic of a large user population, we can monitor and analyze the manner in which Web phishing attacks occur in practice. We are developing heuristics to identify mass e-mails carrying phishing attacks. Our heuristics are similar to the ones used in state-of-the-art spam filters. To be effective, these heuristics must be applied on a collection of e-mails received by a large user population. Our Internet traffic passive measurement platform will allow us to determine the heuristics’ effectiveness.

Next, we plan to monitor whether users fall prey to these attacks (i.e., whether the attacks are successful) and to identify the circumstances that lead to their success. We plan to develop techniques that can warn users when they enter their credentials into a suspicious Website. We plan to measure where these Websites are located and whether firewalls can automatically prevent home users from sending their credentials out.

We work in close contact with the campus network engineers and operators at UTM. In the previous project at the University of Washington, we found that an Internet traffic passive measurement platform is helpful for debugging network configuration problems and identifying new kinds of malicious traffic. Such a tool allows the engineers to perform tests and queries “on-the-spot” when a new, previously unknown campus traffic problem arises.

Our Goals and Guarantees

Our goal is to address Web phishing using passive Internet traffic measurement. Only by monitoring a large sample of Internet users can we understand who the attackers and the victims are and what solutions are effective in practice. The Internet traffic passive measurement platform allows us to investigate the aggregate behavior of a large user population, such as all Internet users at the University of Toronto at Mississauga.

Internet traffic passive measurement raises important privacy and anonymity concerns. As a result, we have made privacy and anonymity the primary requirement of our measurement platform. At a high level, we use five principles to guarantee user privacy and anonymity:

  1. We do not store any identifiable information on disk. We do not record Internet users’ identities, their traffic, or their e-mail. All personal information collected is anonymized before being stored on disk for later analysis. This information includes IP addresses, Web destinations, Web and e-mail content, and most other gathered information. We leave very little data un-anonymized. For example, we do not anonymize the length of the visited Web pages or the time when the Web page visit occurred.

    Another consequence of this principle is that the mere act of disconnecting the power from our measurement platform will leave no un-anonymized identifiable information on the platform whatsoever. If we chose to store the data first and anonymize it later, the users’ privacy could be compromised in certain scenarios. For example, a subpoena could force us to let others inspect the data stored on the disk, revealing personal information. Note that erasing personal data in the face of a potentially compromising scenario is insufficient: there are well-known ways to recover data from a disk even after it has been erased. To deal with such scenarios, we chose never to store any personal data unless it is anonymized.

  2. Our platform cannot be remotely accessed from the Internet. Although it receives Internet traffic, our machine does not have an output interface to forward any data out. Being accessible from the Internet would leave open the possibility of an intruder accessing the measurement platform and compromising the users’ privacy. While many intrusion prevention mechanisms exist, none of them works 100% of the time. Instead, we chose to make our entire platform unreachable from the Internet.
  3. Our platform is physically safeguarded. We will place our measurement infrastructure in the campus network machine room that has strict access control mechanisms. Access to our machines is limited to two members of our research group (this document’s authors) and the campus network engineers who have regular access to the room. We restrict login access to our platform only to our research group’s members.
  4. We will make all our software open-source. We believe that the only way in which privacy and anonymity guarantees are convincing is by making all our measurement software publicly available. In this way, anybody can audit our software and check for privacy guarantees. We are hopeful that software transparency will lead to others reporting bugs and problems with our software.
  5. We will permanently destroy the collected data after our results are published in a peer-reviewed journal or after three years, whichever occurs earlier.

In the remainder of this document, we describe the methodology of our study and we present the details of how we implement the principles that guarantee the users’ privacy and anonymity.

2 Research Methodology

To gain insight into the Web phishing problem, we capture, anonymize, and analyze Web (HTTP) and e-mail traffic between the University of Toronto at Mississauga (UTM) and the Internet. To start a Web phishing attack, malicious parties send e-mails containing a link to a Website that collects the victims’ personal information. Detecting these attacks requires correlating the Website’s URL with the link contained inside the e-mail. For this reason, we must capture and correlate portions of both e-mail and Web traffic.

The network capture hardware connects to UTM’s border router (the device that connects UTM to the rest of the Internet). All traffic traversing this router is copied to our system. Figure 2 illustrates a high-level view of our network measurement infrastructure. Our platform is based on a software system developed by the Intel Research Labs at Cambridge, UK, called CoMo [5]. We extend the CoMo software to capture Web phishing attacks and we add anonymization support. (We have received an in-kind donation from Intel for the measurement hardware.)




Figure 2: Campus Network Layout at UTM. Our platform receives copies of traffic exchanged between UTM and the rest of the Internet, including the other U. of T. campuses. We anonymize all traffic before performing any analysis.

Data Collection

We do not store any identifiable information on disk. We do not record Internet users’ identities, their traffic, or their e-mail. All data is inspected in volatile memory. All information is anonymized before being stored on disk.

At a high level, we will inspect two types of information. First, we examine information about received e-mail. We run an e-mail spam detector online to label e-mails as spam on the fly. We need to detect whether an e-mail is spam before storing it: any information from the e-mail is anonymized before it is stored, and once the data is anonymized we can no longer classify it as spam. We will add our own phishing detection heuristics to the spam detector.

Second, we inspect information about the Web pages visited by UTM Internet users. We record the server addresses, the objects downloaded, their sizes and types. Once anonymized, the data is stored on our storage nodes.

After a trace is collected, we perform off-line analysis. In particular, we investigate whether UTM users receiving spam e-mails click on the links present in the body of these e-mails. Even if data is anonymized, we can still check whether a visited Web page was received by a UTM user via an e-mail labeled as spam. If the user sends any information to the suspicious Web site whose link was received in a spam e-mail, this strongly suggests that the Web phishing attack was successful.
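The off-line correlation described above can be illustrated with a minimal sketch. It assumes that links are hashed with HMAC-MD5 as the keyed hash, that each stored record keeps only a spam label and hashed values, and that an HTTP POST stands in for "the user sends information"; the record fields and function names are hypothetical and are not taken from the CoMo software.

```python
import hashlib
import hmac


def anonymize(value: str, key: bytes) -> str:
    """Keyed 128-bit digest of a value (HMAC-MD5 is an assumption here)."""
    return hmac.new(key, value.encode("utf-8"), hashlib.md5).hexdigest()


def email_record(links, is_spam: bool, key: bytes) -> dict:
    """Build the stored form of an e-mail.

    The spam label is computed on the raw message before this call;
    only the label and the hashed links are kept.
    """
    return {"spam": is_spam, "links": {anonymize(u, key) for u in links}}


def web_record(client_ip: str, method: str, url: str, key: bytes) -> dict:
    """Build the stored form of an HTTP request (hypothetical fields)."""
    return {
        "client": anonymize(client_ip, key),
        "method": method,
        "url": anonymize(url, key),
    }


def suspected_successes(emails, requests):
    """Return requests that POST to a URL seen in a spam-labeled e-mail.

    Equal inputs hash to equal digests under the same key, so the join
    works on anonymized data without recovering any URL.
    """
    spam_links = set()
    for e in emails:
        if e["spam"]:
            spam_links |= e["links"]
    return [
        r for r in requests
        if r["method"] == "POST" and r["url"] in spam_links
    ]
```

Because both traces are hashed with the same key, the match is possible only within a single tracing experiment; once the key is destroyed, digests from different experiments cannot be linked.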

Initially, we will repeatedly take traces in increasingly longer time intervals to test both load and correctness. We will start with a 10 minute interval and examine the load on the capture hardware and the packets flagged by the analysis module as having errors or unknown header fields. If we are overloading the capture hardware, we will optimize the software bottlenecks to reduce load. We will similarly investigate packets flagged with errors or unknown header fields, and update the packet analysis fields appropriately (we don’t expect there to be many of these if we have done appropriate testing in the lab). Lastly, we will run some post-processing tools on the generated traces to verify internal trace consistency.

After iterating on short time intervals and convincing ourselves that our capture hardware and software are adequate to sustain load and analyze the campus traffic, we will trace for an hour and again examine the load and error logs, and repeat if necessary.

Assuming we solve any problems revealed using short time intervals, we will perform an overnight tracing. On the next morning we will examine the state of the system. Unless load or error logs indicate that a problem needs to be fixed and the trace restarted, we will continue to let it trace and use it as the first trace run. We will continue to monitor the trace twice daily until a week has elapsed.

We believe that successful Web phishing attacks are not very common. We anticipate we need to monitor the Internet traffic at UTM for several months before we have recorded a representative sample of successful Web phishing attacks.

Software Testing And Transparency

We implement, test, and debug all our software in our lab before deployment. We will provide an external Website describing the methodology of our project as well as providing a link to the software source we use. We encourage the campus network engineers and anyone else to inspect and audit our measurement methodology and software. If privacy or anonymity problems are discovered (e.g., bugs), fixing them will become our first priority.

Professor Kumar Murty from the Mathematics Department and Joe Lim, who is the Chief Information Officer at the University of Toronto at Mississauga (UTM), will conduct independent security audits of our methodology to verify that the privacy and anonymity requirements described by this document are enforced. Professor Murty’s main area of research is cryptography. In his role as Chief Information Officer, Joe Lim’s responsibilities include overseeing the computing and networking infrastructure at UTM. Neither Professor Murty nor Joe Lim is associated with our project (they are at an “arm’s length” from our project).

3 Participants

We estimate that we will monitor the Internet traffic of about 15,000 faculty, students, and staff at UTM. We do not monitor any traffic internal to UTM; we only monitor traffic exchanged between UTM and the rest of the Internet (including other U. of T. campuses). Answering our research questions requires monitoring the Internet traffic of a large user population. We are unsure whether the sample is large enough to provide a statistically significant measurement of Web phishing. Unfortunately, we do not know how to determine an adequate sample size for our study in advance.

To accommodate those who do not wish to take part in our research study, we will provide an opt-out procedure. We will create a Web page where anyone can opt out of our research study. To exclude a user from our study, we will require the IP address of their machine on campus. The criterion for participation in our project is all Internet users at UTM who do not opt out of our study. We will provide a link to our opt-out Web page from the UTM Research Website. We will also send an announcement on a public forum of the student community at UTM (Erincomm). Erincomm is a mailing list including everyone with a mail account at the University of Toronto at Mississauga. By sending an announcement on Erincomm, we will send an e-mail to all Internet users at UTM providing a description of our project and a link to the opt-out Web page. Finally, we will post fliers on the bulletin boards at UTM advertising our study and the available opt-out procedure.
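As an illustration of how an IP-based opt-out list could be applied, the following sketch drops a packet before any inspection if either endpoint has opted out. The list format (one address per line, with "#" comments) and the function names are assumptions made for this example only.

```python
import ipaddress


def load_opt_outs(lines):
    """Parse one IP address per line, skipping blank lines and comments."""
    opted_out = set()
    for line in lines:
        line = line.strip()
        if line and not line.startswith("#"):
            opted_out.add(ipaddress.ip_address(line))
    return opted_out


def is_monitored(src: str, dst: str, opted_out) -> bool:
    """Return False if either endpoint of the packet has opted out."""
    return (
        ipaddress.ip_address(src) not in opted_out
        and ipaddress.ip_address(dst) not in opted_out
    )
```

Checking both endpoints matters because an opted-out campus machine appears as the source of outgoing packets and as the destination of incoming ones.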

4 Recruitment

No recruitment is required. Participants will not be directly involved in this study. In fact, no active participation is required; participants will merely be conducting their usual activities.

5 Risks and Benefits

We explained in the introduction why Web phishing is an important problem for the information technology community to address. For the research community, our Internet traffic measurement will provide necessary insight into how Web phishing attacks occur in practice. We believe that our project will help the University of Toronto take the initiative in this research area.

For the campus network engineers and operators (i.e., CNS) our platform will provide a way to investigate traffic problems and solutions. This group will enhance their ability to troubleshoot and debug network issues on-the-fly.

The University community will receive an educational benefit from this study. We hope that investigating Web phishing will raise awareness about Internet e-mail and Web safety.

This project poses minimal risk for its participants. Our measurements are passive; that is, they do not influence the measured traffic in any way. Furthermore, all sensitive information and measurements are anonymized so that no individual will be at risk.

6 Privacy

Our primary requirement is to ensure the privacy and anonymity of the Internet traffic of all participants in our study. All data collected will be anonymized before we use it in our experiments. As a rule of thumb, whenever we are in doubt whether any data collected is potentially sensitive, we will anonymize it (i.e., we will “err” on the safe side).

In the remainder of this section, we describe the technical details of our anonymization scheme.

Anonymization

All server and object names are anonymized using the keyed MD5 hash digest algorithm. This algorithm maps an input, combined with a secret key, to a sequence of 128 bits. The map is one-way: without the key, it is practically impossible to reconstruct the input from a keyed MD5 digest.
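A minimal sketch of keyed hashing for names follows. The text does not say which keyed MD5 construction is used; this example assumes HMAC-MD5 from Python's standard library, and the function name and sample key are hypothetical.

```python
import hashlib
import hmac


def anonymize_name(name: str, key: bytes) -> str:
    """Return the 128-bit keyed digest of a name as 32 hex characters.

    HMAC-MD5 is assumed as the keyed construction for illustration.
    """
    return hmac.new(key, name.encode("utf-8"), hashlib.md5).hexdigest()


if __name__ == "__main__":
    # Hypothetical key; in the scheme described here the key is entered
    # by hand at the console and never written to disk.
    key = b"example-key-entered-at-console"
    print(anonymize_name("www.example.com", key))
```

The same name always maps to the same digest under a given key, which is what allows anonymized records to be compared with one another, while a different key yields unrelated digests.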

We use a slightly different anonymization scheme for client and server addresses than for names. We would like to be able to associate addresses from the same local network even after they have been anonymized, without compromising the anonymization scheme. We use an MD5-based anonymization algorithm that preserves the network prefix of addresses: if two addresses share a network prefix, the anonymized versions of the addresses will share an anonymized network prefix.
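One well-known way to obtain this prefix-preserving property is to flip each address bit according to a keyed pseudo-random function of the bits that precede it. The sketch below follows that idea for IPv4 using HMAC-MD5; it is an illustration under those assumptions, not necessarily the algorithm this project uses, and all function names are hypothetical.

```python
import hashlib
import hmac
import ipaddress


def _prf_bit(key: bytes, prefix_bits: str) -> int:
    """One pseudo-random bit derived from the key and an address prefix."""
    digest = hmac.new(key, prefix_bits.encode("ascii"), hashlib.md5).digest()
    return digest[0] & 1


def anonymize_ipv4(addr: str, key: bytes) -> str:
    """Anonymize an IPv4 address while preserving shared prefixes."""
    bits = format(int(ipaddress.IPv4Address(addr)), "032b")
    out = []
    for i, bit in enumerate(bits):
        # The flip applied to bit i depends only on the i bits before it,
        # so addresses that agree on their first k bits also agree on
        # the first k bits of their anonymized forms.
        out.append(str(int(bit) ^ _prf_bit(key, bits[:i])))
    return str(ipaddress.IPv4Address(int("".join(out), 2)))


def common_prefix_len(a: str, b: str) -> int:
    """Number of leading bits two IPv4 addresses have in common."""
    x = int(ipaddress.IPv4Address(a)) ^ int(ipaddress.IPv4Address(b))
    return 32 - x.bit_length()
```

With this construction, two addresses that share exactly k leading bits are mapped to anonymized addresses that also share exactly k leading bits, so hosts on the same local network remain recognizable as neighbors without their real addresses being stored.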

Generating and Maintaining Keys

The key for the MD5 hash (and the basis for the seeds for the address randomization) will be entered into the capture system by hand at the console at the start of each experiment. The key will be generated using the PGP key generator [8] (seeded by sampling keystroke arrival times) and written on paper; it will not be recorded electronically on any non-volatile storage medium, and it is the responsibility of our group. After a tracing experiment is finished, the key will be wiped from system memory and the paper with the key written on it will be destroyed. Once the paper holding the key is destroyed, no request to reveal the key can be satisfied, since the key will no longer exist.

Assuming that the people involved in doing the tracing act responsibly, the period of vulnerability of this scheme is the tracing period itself; before the tracing period the key does not exist, and after the tracing period the key is destroyed. During the tracing period, the key is in the volatile RAM of the capture machine. Anyone with access to the machine during the tracing period could potentially dump the key from RAM (assuming its address could be determined). Note, however, that performing the tracing does not provide any fundamental additional opportunities for someone to use legal means to gain access to campus network data. If legal means can be used to gain access to the capture hardware in the campus network machine room while a trace is being taken, then legal means can be used to gain access to all of the other hardware in the machine room, too.

Physical Security

The capture hardware will connect to the border router of the University of Toronto at Mississauga. The border router will use the “port mirroring” mechanism to copy all of its traffic into our system. The port mirroring mechanism protects the router from our capture hardware. To protect the capture hardware from external tampering, the capture hardware will reside in a locked computing and network services (CNS) machine room. The capture hardware will be disconnected from all networks except the connections to the router, and all management will be done via the machine console. Anonymized measurement data will be transferred onto a storage system via a cable between the machines. Once this data is transferred, the storage system is moved to our lab for analysis. As described in more detail above, only anonymized data will be stored on the storage system, and therefore only anonymized data will be taken out of CNS.

Data Maintenance

Once data is anonymized, the data is stored on a set of storage nodes for later analysis. The storage nodes are located in our lab. Only the members of our research group (Troy and Stefan) will have access to these storage nodes.

7 Compensation

There will be no compensation provided for participation in the study.

8 Conflicts of Interest

There are no conflicts of interest. Our interest is from a research perspective only.

9 Informed Consent Process

There is no explicit informed consent process in this project. We provide an opt-out procedure described earlier. It is impractical to obtain consent from 15,000 Internet users at UTM. We have made their privacy and anonymity our project’s primary requirement. This project represents a minimal risk to participants; it is unlikely to adversely affect them in any way. We will make the entire process as transparent as possible.

10 Scholarly Review

We intend to publish the results of our study in journals and conferences. These results will be subject to peer-review. We also intend to contribute our source code back to Intel Cambridge’s traffic measurement tool, CoMo. We anticipate that Intel’s researchers will review our code before adding it to CoMo.

11 Additional Ethics Reviews

N/A

12 Contracts

N/A

13 Clinical Trials

N/A

References

[1]   R. Dhamija, J. D. Tygar, and M. Hearst. Why Phishing Works. In Proceedings of Conference on Human Factors in Computing Systems (CHI), April 2006.

[2]   I. Fette, N. Sadeh, and A. Tomasic. Learning to Detect Phishing E-Mails. Technical Report CMU-ISRI-06-112, Carnegie Mellon University, June 2006.

[3]   Gartner. Gartner Survey Shows Frequent Data Security Lapses and Increased Cyber Attacks Damage Consumer Trust in Online Commerce, June 2005. http://www.gartner.com/press_releases/asset_129754_11.html.

[4]   M. Jakobsson and J. Ratkiewicz. Designing Ethical Phishing Experiments: A Study of (ROT13) rOnl Query Features. In Proceedings of the World Wide Web Conference (WWW), 2006.

[5]   Intel Research Cambridge Lab. The CoMo Project. http://como.intel-research.net/.

[6]   University of Washington. Internet Systems Measurement and Analysis Research at UW. http://www.cs.washington.edu/research/networking/websys/.

[7]   B. Parno, C. Kuo, and A. Perrig. Phoolproof Phishing Prevention. In Proceedings of Financial Cryptography and Data Security (FC), 2006.

[8]   Wikipedia. Pretty Good Privacy. http://en.wikipedia.org/wiki/Pretty_Good_Privacy.

[9]   G. Wolfond. Personal Communication. Greg is the CEO of Blue Sky, a company consulting for Canadian banks.

[10]   M. Wu, R. Miller, and S. Garfinkel. Do Security Toolbars Actually Prevent Phishing Attacks? In Proceedings of Conference on Human Factors in Computing Systems (CHI), April 2006.