Let me give you some examples of recent work in data mining.

In the early 1990s, I worked on learning good diagnostic rules for Alzheimer's disease. The data were records of several hundred patients, each of whom had been diagnosed by a panel of five clinicians and had answered a large number of interview questions ("What day of the week is it?", "What is your address?", "Repeat the words 'apple, table, penny' in reverse order."). The goal was to construct accurate diagnostic rules based on patients' responses to those questions (and the answers of a close relative or caregiver). We came up with surprisingly simple rules that could be used for screening, as a way of determining which patients should be referred to a clinician for a more thorough evaluation.

A colleague of mine has worked on automatic identification of electronic junk mail. By applying data mining algorithms, he found a very simple rule that works quite well: the number of exclamation points in a message did a surprisingly good job of distinguishing between ordinary and junk mail messages.

Identifying credit card fraud is another widely deployed application of data mining techniques. These techniques often look at small windows of transactions and search for patterns that correlate highly with fraud. For example, one known pattern is a small gasoline purchase (to see if the card is still valid) followed by a large purchase (that allows the fraudulent user to profit from a valid card).

All of these examples are quite simple, but they give you a flavor of the types of models produced. They also share a common characteristic: each data instance (e.g., patient, mail message, or transaction window) can be treated as if it were independent of any other. Knowing something about one instance is assumed to tell you nothing about any other instance. In most cases, we know this assumption is false. For example, Alzheimer's disease has a genetic component, so familial relationships among patients can matter. Similarly, a single email address will probably not send you both junk mail and legitimate mail, so relationships among messages can matter. However, these types of relationships are relatively rare, so assuming independence among the instances is still reasonable.
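
To make the credit card example concrete, here is a minimal sketch of how a small window of transactions might be scanned for the "small gasoline purchase followed by a large purchase" pattern. It is an illustration only, not the method used by any real fraud system: the `Transaction` type, the category labels, the dollar thresholds (`small_max`, `large_min`), and the window size are all hypothetical values chosen for the example. Note that each window is scored on its own, which is exactly the independence assumption discussed above.

```python
from dataclasses import dataclass


@dataclass
class Transaction:
    category: str  # hypothetical label, e.g. "gasoline" or "electronics"
    amount: float  # purchase amount in dollars


def window_matches_probe_pattern(window, small_max=5.0, large_min=500.0):
    """Return True if a small gasoline purchase is followed, later in the
    same window, by a large purchase. Thresholds are illustrative guesses."""
    for i, t in enumerate(window):
        if t.category == "gasoline" and t.amount <= small_max:
            if any(later.amount >= large_min for later in window[i + 1:]):
                return True
    return False


def flag_windows(transactions, window_size=3):
    """Slide a fixed-size window over the transaction history and return the
    start index of every window that matches the pattern. Each window is
    examined independently of all the others."""
    flagged = []
    # A history shorter than window_size is treated as a single window.
    for start in range(max(len(transactions) - window_size + 1, 1)):
        window = transactions[start:start + window_size]
        if window_matches_probe_pattern(window):
            flagged.append(start)
    return flagged


history = [
    Transaction("grocery", 42.0),
    Transaction("gasoline", 2.0),
    Transaction("electronics", 900.0),
    Transaction("grocery", 15.0),
]
suspicious_starts = flag_windows(history)
```

A learned model would discover patterns like this from labeled data rather than having them written by hand; the sketch only shows what applying such a rule to transaction windows looks like.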