Hi, Devansh! Great article, as always.
When we're talking about data distribution, though, are we mainly interested in the target or in each feature, too? I understand how identifying outliers in the feature set can be useful if we, for example, notice that data was incorrectly processed or gathered. But for statistical tests, what is our main figure of interest?
You also raised an interesting and very valid point: removing outliers from the data will probably improve performance metrics but may be detrimental to the model's practical use. A couple of days ago, I was doing my data science bootcamp homework (yes, I'm almost 40 and still do the homework), and the dataset I was working with was California housing prices. It has this huge anti-linear-regression set of points for the highest-priced houses. I could've removed those points and probably gotten my model to perform 20% better, but at what, pun intended, price? :)
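To put a toy number on that trade-off, here's a minimal sketch. It uses synthetic data standing in for the capped-price effect (not the actual California housing set): the target is clipped at a ceiling, which produces that flat "anti-linear" cloud for the priciest houses.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic stand-in for the capped-price effect (NOT the real dataset):
# the target is clipped at a ceiling, creating a flat cloud of points
# for the most expensive houses.
rng = np.random.default_rng(0)
n = 2000
X = rng.uniform(0, 10, size=(n, 1))                    # one "house size"-like feature
y = np.minimum(X[:, 0] + rng.normal(0, 0.5, n), 5.0)   # prices capped at 5.0

capped = y >= 5.0                                      # the flat, "anti-linear" cluster

# Fit once on everything, once with the capped rows dropped.
r2_all = r2_score(y, LinearRegression().fit(X, y).predict(X))
mask = ~capped
r2_trim = r2_score(
    y[mask], LinearRegression().fit(X[mask], y[mask]).predict(X[mask])
)

print(f"R^2, all rows:            {r2_all:.2f}")
print(f"R^2, capped rows dropped: {r2_trim:.2f}")
# The trimmed fit scores noticeably better, but that model has never
# seen the very price range where the expensive houses live.
```

The metric improves, but only because the model was excused from the part of the distribution it would actually be asked about in production.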
Sorry for the ramblings, but your articles are very thought-provoking!
I love interacting with my audience and seeing what you think. Your ramblings tell me what stands out and what's important, so please keep them coming.
1) Depends on your task. Analyzing your target set is a useful practice for catching data drift (since you can detect things like changing tastes/outcomes). Since your target set is smaller than your feature set, it can be a good first place to look.
In multivariate analysis, you can also try combining multiple target variables. This can reduce the cardinality while keeping the meaning intact.
There is also something called the Ostrich principle, wherein you sometimes choose to ignore a problem entirely (when the ROI of solving it isn't worth it). Outlier analysis and general statistical tests can be a godsend for this. There are many other reasons, which I'll get into in future writing.
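To make the drift check from point 1 concrete, here's a minimal sketch with made-up target samples from two time windows. It uses a two-sample Kolmogorov-Smirnov test, which is one possible way to compare an old and a new target distribution; the window names and numbers are hypothetical.

```python
import numpy as np
from scipy.stats import ks_2samp

# Hypothetical target samples from two time windows (made-up numbers).
rng = np.random.default_rng(1)
targets_last_quarter = rng.normal(100, 15, size=500)
targets_this_quarter = rng.normal(110, 15, size=500)  # mean shifted: outcomes changed

# Two-sample Kolmogorov-Smirnov test: did the target distribution move?
stat, p_value = ks_2samp(targets_last_quarter, targets_this_quarter)
drifted = p_value < 0.01
print(f"KS statistic={stat:.2f}, p={p_value:.3g}, drift suspected: {drifted}")
```

Because the target set is one column rather than the whole feature matrix, a cheap check like this can run on every new batch before any deeper investigation.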
2) Good on you for doing homework. I respect it.
Thank you once again for your kind words. I've put a lot of effort into mastering this craft, and it hasn't had great returns so far. Kind words like yours have been the only reason I have maintained this as long as I have.
I totally understand what you mean. I've been thinking about maintaining a free Wix site for myself where I'd write about math, statistics, and other data science under-the-hood stuff, mostly to understand it better myself. But for now it's still an idea floating in the air, especially since we're at the deep learning part of my bootcamp - so many things are happening at the same time! Your articles here and at your other Substack are very grounding and help with keeping track of what's currently hot and interesting without being overwhelming.
I wish you all the best, and that you keep enjoying what you do!
I'll keep rambling at you, then. :)