Data Mining in Networks

One issue that deserves continuing attention is the problem of false positives. The concern is that models produced by a data mining system will flag an entirely innocent person as a terrorist or classify a legitimate group as a terrorist cell. This is a particular concern when statistical models are used to screen large numbers of cases and true positive cases are very rare. In these situations, the vast majority of cases flagged as positive will be false positives, even when the model itself is quite accurate. A nearly identical concern has been raised about the use of polygraphs, rules for detecting money laundering, and AIDS tests.
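The arithmetic behind this base-rate effect can be sketched in a few lines. The example below is only an illustration using Bayes' rule. The function name and all of the numbers (a prevalence of 1 in 100,000, a 99% detection rate, a 1% false positive rate) are hypothetical values chosen for the example and do not come from any real screening system.

```python
def positive_predictive_value(prevalence, sensitivity, false_positive_rate):
    """Fraction of flagged cases that are true positives (Bayes' rule).

    prevalence          -- proportion of true positives in the screened population
    sensitivity         -- probability that a true positive is flagged
    false_positive_rate -- probability that a negative case is flagged
    """
    true_pos = prevalence * sensitivity
    false_pos = (1.0 - prevalence) * false_positive_rate
    if true_pos + false_pos == 0.0:
        raise ValueError("no cases would be flagged under these rates")
    return true_pos / (true_pos + false_pos)


# Hypothetical rare-event screening: 1 true case per 100,000 screened.
rare = positive_predictive_value(1e-5, 0.99, 0.01)

# Same hypothetical model applied to a data set where 10% of cases are true positives.
enriched = positive_predictive_value(0.10, 0.99, 0.01)

print(f"rare population:     {rare:.4%} of flagged cases are true positives")
print(f"enriched population: {enriched:.1%} of flagged cases are true positives")
```

With these assumed rates, fewer than one flagged case in a thousand is a true positive in the rare-event setting, while the same model applied to the enriched data set is right about most of the cases it flags. The model has not changed; only the proportion of true positives in the data has.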

The types of inference enabled by relational data may help reduce this problem. The flagging of a single individual as a terrorist could eventually be recognized as an error if that individual remains unconnected to any other inferred terrorist activity. In contrast, true positives are likely to be connected to other positive cases as investigation proceeds.
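One simple way to picture this kind of relational check is to look for flagged entities that have no direct link to any other flagged entity. The sketch below is a deliberately minimal illustration, not a description of any particular system. The function name, the entity labels, and the links are all hypothetical, and a real analysis would likely weigh link types, indirect paths, and evidence that accumulates over time.

```python
from collections import defaultdict


def isolated_flags(flagged, links):
    """Return flagged entities with no direct link to another flagged entity.

    flagged -- iterable of entity identifiers flagged by a model
    links   -- iterable of (a, b) pairs, treated as undirected relations
    """
    neighbors = defaultdict(set)
    for a, b in links:
        neighbors[a].add(b)
        neighbors[b].add(a)

    flagged = set(flagged)
    # An entity is "isolated" if none of its neighbors (other than itself) is flagged.
    return {p for p in flagged if not ((neighbors[p] & flagged) - {p})}


# Hypothetical data: A and B are flagged and linked to each other;
# C is flagged but linked only to unflagged entities.
flags = {"A", "B", "C"}
relations = [("A", "B"), ("C", "D"), ("D", "E")]

candidates_for_review = isolated_flags(flags, relations)
print(candidates_for_review)
```

In this toy data only C is returned. Isolation of this kind does not establish that a flag is wrong; it only marks the case as a candidate for review, consistent with the point that such errors may be recognized as investigation proceeds.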

That said, the problem of false positives emphasizes the need for overall control, oversight, and auditing by expert human analysts. It also reinforces the need to start with primary data sets that contain a higher proportion of true positives than the secondary data sets that may provide supplementary evidence.