3 Comments
Mar 28, 2023 · Liked by Devansh

Hi, Devansh! Great article, as always.

When we're talking about data distribution, though, are we mainly interested in the distribution of the target, or of each feature too? I understand how identifying outliers in the feature set can be useful if, for example, we notice that the data was incorrectly processed or gathered. But for statistical tests, which quantity is our main focus?

You also raised an interesting and very valid point: removing outliers from the data will probably improve the performance metrics but may be detrimental to the model's practical use. A couple of days ago, I was doing my data science bootcamp homework (yes, I'm almost 40 and still do homework), and the dataset I was working with was California housing prices. It has this huge, anti-linear-regression set of points for the highest-priced houses. I could've removed it and probably gotten my model to perform 20% better, but at what, pun intended, price? :)

Sorry for the ramblings, but your articles are very thought-provoking!
