Are Outliers a Problem?

Here’s another question from the Q&A session held after December’s “Data Mining: Failure to Launch,” a free webinar that TMA offers every single month. One caller wanted to know what practices he should follow, or what precautions he should take, if he found outliers in the data.

Here’s the short answer. You really shouldn’t be afraid of outliers.

The longer answer is, of course, a little more complex. As usual, much depends on what you’re trying to do.

Having said that, outliers aren’t necessarily a problem. In fact, sometimes they’re the most interesting data points, the ones that really matter to your project.

For example, fraud detection is all about identifying and using outliers. It’s the exception, and not the rule, that is useful in this case. After all, the “rule” is the behavior that is not fraudulent.

And sometimes, understanding what drives the outlier is the problem itself. Why did sales spike for three days in August last year, and what can we do to repeat that increase on a more consistent basis? That sort of problem is all about the outlier.

In fact, it would be fair to say that most of the problems an organization would need or want to solve with data analytics have more to do with the outliers than with the typical, well-behaved bulk of the data.

Of course, you always need to understand exactly what you are looking at in order to treat the outliers correctly, particularly at the data preparation phase. That’s why it’s a very good idea to speak to the people who deal, on a daily basis, with the problem that you are trying to solve. They can help you understand the outlier better, and they can help you decide the proper way to deal with it.
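To make the data preparation point concrete, here is a minimal Python sketch of one way to flag outliers for review rather than delete them. It is an illustration only, not a method recommended in the webinar: the function name `flag_outliers`, the made-up `daily_sales` figures, and the choice of Tukey's 1.5 × IQR rule are all assumptions, and the right threshold depends on your data.

```python
from statistics import quantiles

def flag_outliers(values, k=1.5):
    """Pair each value with True/False using Tukey's IQR fences.

    Nothing is removed: the flags exist so that someone who knows
    the business can decide what each unusual point means.
    """
    q1, _, q3 = quantiles(values, n=4)  # quartiles of the data
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [(v, v < low or v > high) for v in values]

# Hypothetical daily sales for one month, with a three-day spike.
daily_sales = [
    102, 98, 105, 99, 101, 97, 100, 103, 96, 104,
    99, 101, 240, 251, 238, 100, 98, 102, 103, 97,
    101, 99, 100, 104, 96, 102, 98, 100, 101, 99,
]

flagged = flag_outliers(daily_sales)

# The points to take to the people who live with the problem every day.
for_review = [value for value, is_outlier in flagged if is_outlier]
print(for_review)
```

Keeping every row and adding a flag means the spike days stay available for the "why did this happen?" conversation instead of quietly disappearing during cleanup.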

But at the end of the day, many outliers are nothing more than opportunities waiting to be embraced.

