Even once your data obeys all of your technical rules, it'll still have outliers. The post gives my favorite definition yet of what exactly an outlier is:
[An outlier is an] observation which deviates so much from other observations as to arouse suspicion it was generated by a different mechanism.
The reason I like this definition is that it focuses on the data creation process, not the data point itself. Outliers are not data points that you don't like—that disconfirm your hypothesis—they're data points whose variance suggests they were captured erroneously. We see this frequently in web analytics: because the data collection environment is so heterogenous, some outliers show a time on site of over a decade.
If you're going to provide high quality analytics to your end consumers on an ongoing basis, outlier detection and correction needs to be a part of your pipeline.Read more...