Why you should analyze the distribution of your Data [Math Mondays]
One of the most underrated steps when building intelligent systems
Hey, it’s your favorite cult leader here 🐱👤
Mondays are dedicated to theoretical concepts. We’ll cover ideas in Computer Science💻💻, math, software engineering, and much more. Use these days to strengthen your grip on the fundamentals and 10x your skills 🚀🚀.
To get access to all the articles and support my crippling chocolate milk addiction, consider subscribing if you haven’t already!
p.s. you can learn more about the paid plan here.
Recently, something very interesting happened to me: LinkedIn invited me to contribute to their advice pieces geared at helping developers and other tech people get better.
This is obviously a huge W for this cult, and I’m excited to see the opportunities it presents. As I looked through some articles they had asked me to contribute to, one stood out: How do you apply data transformations to improve model performance in EDA? (Go engage with the article to support me.) However, the character limit for contributions was too small, so I thought I’d do a more fleshed-out piece in this newsletter.
In this email/post, we will discuss why analyzing the distribution of your data is important, the common techniques to do so, and some common mistakes people make. This post will be longer than usual since we will be highlighting many important ideas. I will elaborate upon the points mentioned here in further articles.
Why is analyzing the distribution of your data important?
Analyzing the distribution of your data is important because it can have a significant impact on the results of your analysis. Understanding the distribution of your data can help you choose the right statistical test, identify outliers, check for normality, and visualize the data. By understanding the distribution of your data, you can ensure that your results are accurate, reliable, and valid.
It can also serve another function, one that is often overlooked by data scientists: confirming that your data collection systems work. If you have ground truth about what your distribution is supposed to look like, checking the distribution of your collected data can help you identify both data drift and problems in your collection systems. Both of these problems can slip under the radar and mess up everything downstream. Analyzing the distribution of the collected data and comparing it to ground truth can be a great way to confirm that systems are working as they should.
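Here is a minimal sketch of what such a check could look like, assuming you have a reference sample that represents your ground truth. It uses the two-sample Kolmogorov-Smirnov test from SciPy. The function name check_against_reference, the alpha threshold, and the synthetic data are all hypothetical choices for illustration, not a prescribed method.

```python
import numpy as np
from scipy import stats


def check_against_reference(collected, reference, alpha=0.01):
    """Compare collected data to a reference sample with a two-sample KS test.

    alpha is an illustrative threshold (an assumption), not a recommendation.
    """
    result = stats.ks_2samp(collected, reference)
    return {
        "statistic": float(result.statistic),    # largest gap between the two empirical CDFs
        "p_value": float(result.pvalue),
        "differs": bool(result.pvalue < alpha),  # True means the samples look different
    }


# Synthetic stand-ins for real data (assumptions for the demo).
rng = np.random.default_rng(0)
reference = rng.normal(loc=50.0, scale=5.0, size=2000)  # what the data should look like
healthy = rng.normal(loc=50.0, scale=5.0, size=2000)    # same process, new batch
drifted = rng.normal(loc=53.0, scale=5.0, size=2000)    # same process with a shifted mean

healthy_report = check_against_reference(healthy, reference)
drifted_report = check_against_reference(drifted, reference)
print("healthy batch:", healthy_report)
print("drifted batch:", drifted_report)
```

A flagged batch does not tell you whether the world changed (drift) or your pipeline broke. It only tells you that it is time to go look.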
Common techniques to analyze the distribution of your data
Histograms: Histograms are graphical representations of the distribution of numerical data. They display the frequency or proportion of data points in different ranges. Histograms can give you a quick visual idea of the shape of your distribution.
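Boxplots: Boxplots are graphical representations of the distribution of numerical data through their quartiles. They display the median, the upper and lower quartiles, and whiskers that reach either the minimum and maximum values or, in many plotting libraries, the most extreme points within 1.5 times the interquartile range, with anything beyond that drawn as individual points. Boxplots can show you how the data is distributed, whether it is skewed, and whether there are any outliers.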
Quantile-Quantile (Q-Q) Plots: The quantile-quantile (Q-Q) plot is a graphical technique for determining if two data sets come from populations with a common distribution. It is a scatterplot created by plotting two sets of quantiles against one another. If both sets of quantiles come from the same distribution, we should see the points forming a line that’s roughly straight (linear). This is very useful if you want to test whether your data follows a specific theoretical distribution, such as the normal distribution.
Descriptive statistics: Basic statistics, such as mean, median, mode, skewness, and kurtosis, can provide a summary of the distribution of your data. However, there is a problem with relying exclusively on these. Do you remember what it was? We discussed it in an earlier piece. The answer comes later in this email, in the mistakes section.
Statistical tests: Statistical tests, such as the Shapiro-Wilk test or the Kolmogorov-Smirnov test, can formally test the normality of your data or compare it to a known distribution.
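To make the techniques above concrete, here is a rough sketch that computes the numbers behind each of them with NumPy and SciPy, without drawing any plots. The helper name summarize_distribution, the bin count, the 1.5 * IQR fences, and the synthetic samples are assumptions made for the example.

```python
import numpy as np
from scipy import stats


def summarize_distribution(values):
    """Numeric counterparts of the histogram, boxplot, Q-Q plot, and tests above."""
    x = np.asarray(values, dtype=float)

    # Histogram: counts per bin (10 bins is an arbitrary choice).
    counts, _bin_edges = np.histogram(x, bins=10)

    # Boxplot ingredients: quartiles plus the common 1.5 * IQR fences.
    q1, median, q3 = np.percentile(x, [25, 50, 75])
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    flagged = (x < low_fence) | (x > high_fence)  # flagged for review, not dropped

    # Q-Q plot against a normal distribution: r close to 1 means the
    # points fall close to a straight line.
    (_theoretical, _ordered), (_slope, _intercept, qq_r) = stats.probplot(x, dist="norm")

    # Shapiro-Wilk: a small p-value is evidence against normality.
    _shapiro_stat, shapiro_p = stats.shapiro(x)

    return {
        "hist_counts": counts.tolist(),
        "mean": float(x.mean()),
        "median": float(median),
        "skewness": float(stats.skew(x)),
        "kurtosis": float(stats.kurtosis(x)),  # excess kurtosis, 0 for a normal
        "n_flagged": int(flagged.sum()),
        "qq_r": float(qq_r),
        "shapiro_p": float(shapiro_p),
    }


# Synthetic samples (assumptions for the demo): one normal, one right-skewed.
rng = np.random.default_rng(42)
normal_summary = summarize_distribution(rng.normal(size=500))
skewed_summary = summarize_distribution(rng.lognormal(mean=0.0, sigma=1.0, size=500))

for name, summary in [("normal", normal_summary), ("lognormal", skewed_summary)]:
    print(name, summary)
```

Note that the fences only flag points for a closer look. What you do with them afterwards is a separate decision.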
Common mistakes people make when analyzing the distribution of their data
Ignoring outliers: Outliers can distort the analysis and affect the overall results. It's important to identify and handle outliers appropriately. There are times when you might judge that modeling these outliers is not worth the effort (the ROI is too low). In that case, you can ignore them. But this should be a deliberate choice.
Dropping outliers: This is a cardinal sin (one of the few things I generally tell people to never do). Yes, this can improve your model performance. But what good is that performance if you can’t model real-life data? Generally, keeping outliers will help you improve generalization and robustness, which is worth a worse raw performance. Once again, this is different if you decide that the ROI of working with these is not worth it.
Using the wrong statistical test: Different statistical tests have different assumptions about the distribution of the data. If you don't understand the distribution of your data, you may end up using the wrong statistical test, which can lead to incorrect conclusions.
Not visualizing the data: Visualizing the distribution of the data can help you identify patterns and trends that might not be apparent from just looking at the raw data. This can help you generate hypotheses, test assumptions, and identify potential problems with the data. This was the mistake I mentioned earlier, related to relying just on descriptive statistics. Sometimes, very different distributions can have identical descriptive stats. We discussed this phenomenon in the post When Statistics Lie: Anscombe's Quartet [Math Mondays].
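If you want to see that last point for yourself, here is a small sketch using the published values of Anscombe's quartet. The helper name describe and the choice of which statistics to print are mine, picked for illustration.

```python
import numpy as np

# Anscombe's quartet: four small datasets with near-identical summary stats.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def describe(x, y):
    """Summary stats that look the same across the quartet, plus one that does not."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return {
        "mean_x": float(x.mean()),
        "mean_y": float(y.mean()),
        "var_y": float(y.var(ddof=1)),            # sample variance
        "corr": float(np.corrcoef(x, y)[0, 1]),   # Pearson correlation
        "distinct_x": int(np.unique(x).size),     # a hint that the shapes differ
    }


summaries = {name: describe(x, y) for name, (x, y) in quartet.items()}
for name, summary in summaries.items():
    print(name, {key: round(value, 2) for key, value in summary.items()})
```

The means, variances, and correlations line up almost exactly, yet dataset IV has only two distinct x values. A scatterplot of each dataset makes the differences obvious in a way the summary table never will.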
In conclusion, analyzing the distribution of your data is crucial before conducting any data analysis. It can help ensure that your methods are appropriate, your results are accurate, and your conclusions are valid. By using common techniques and avoiding common mistakes, you can improve the quality of your data analysis and make more informed decisions.
That is it for this piece. I appreciate your time. As always, if you’re interested in reaching out to me or checking out my other work, links will be at the end of this email/post. If you like my writing, I would really appreciate an anonymous testimonial. You can drop it here. And if you found value in this write-up, I would appreciate you sharing it with more people.
Upgrade your tech career with a premium subscription to ‘Tech Made Simple’! Stay ahead of the curve in AI, software engineering, and the tech industry with expert insights, tips, and resources. 20% off for new subscribers by clicking this link. Subscribe now and simplify your tech journey!
Using this discount will drop the prices:
800 INR (10 USD) → 533 INR (8 USD) per Month
8000 INR (100 USD) → 6400 INR (80 USD) per year
Reach out to me
Use the links below to check out my other content, learn more about tutoring, reach out to me about projects, or just to say hi.
Small Snippets about Tech, AI and Machine Learning over here
To help me understand you, fill out this survey (anonymous).
Check out my other articles on Medium: https://rb.gy/zn1aiu
My YouTube: https://rb.gy/88iwdd
Reach out to me on LinkedIn. Let’s connect: https://rb.gy/m5ok2y
My Instagram: https://rb.gy/gmvuy9
My Twitter: https://twitter.com/Machine01776819
Hi, Devansh! Great article, as always.
When we're talking about data distribution, though, are we mainly interested in the target or in each feature, too? I understand how identifying outliers in the feature set can be useful if we, for example, notice that data was incorrectly processed or gathered. But for statistical tests, what is our main figure of interest?
You also raised an interesting and very valid point: removing outliers from the data will probably result in better performance metrics-wise but may be detrimental for practical use of the model. A couple days ago, I was doing my data science bootcamp homework (yes, I'm almost 40 and still do the homework), and the dataset I was working with was California housing prices. It has this huge anti-linear regression set of points for the highest priced houses. I could've removed it and, probably, get my model to perform 20% better but at which, pun intended, price? :)
Sorry for the ramblings, but your articles are very thought-provoking!