Data Mining in Networks

This type of probability estimation tree may appear relatively simple, but such techniques can work surprisingly well. Indeed, a long history of work in statistics, machine learning, and data mining, shows that relatively simple approaches to statistical modeling often perform surprisingly well.

However, substantial sophistication is often needed in selecting and applying these methods. For example, Paul Cohen and I did work several years ago on multiple comparison procedures (MCPs) — a fundamental procedure underlying many data mining algorithms. We showed that MCPs introduce statistical biases into the results of these algorithms, and that these biases are often ignored. Similarly, work on simple Bayesian classifiers (SBCs) was, for many years, marred by an error in handling rarely occurring values. This caused the performance of SBC to be systematically underestimated until the error was corrected. Finally, multicollinearity and heteroskedasticity can cause errors in the estimation of simple linear models. Many of these issues only became apparent after years of research and use of the relevant techniques. The development of new techniques should be viewed as an ongoing process of research, and widespread testing and critique of new methods should be encouraged.