A neat way to catch unrelaible data [Technique Tuesdays]
Brought to you by one of the best Machine Learning Engineering and MLOps resources online.
Hey, it’s your favorite cult leader here 🐱👤
On Tuesdays, I will cover problem-solving techniques that show up in software engineering, computer science, and Leetcode Style Questions📚📚📝📝.
To get access to all my articles and support my crippling chocolate milk addiction, consider subscribing if you haven’t already!
p.s. you can learn more about the paid plan here.
As you level up in your software engineering career, the scope of your work increases. You become responsible for larger and larger scale decisions. In order to make the correct decisions- you will likely be expected to use some degree of data-driven decision-making. But there’s a problem with that.
How do you know when your data becomes stale? Data Drift is the phenomenon that occurs when the distribution of your data changes over time (more details through this 1 min video). This is a completely natural process, as the world is likely to change as you collect data over time. Dealing with this is a huge must since you don’t want to take decisions on outdated data. Remember that at the scale of senior engineers- mistakes can be worth millions of dollars.
In today’s post/article, we will be covering a very interesting technique to catch a particularly nasty form of Data Drift- Feature Drift. This technique was recommended by
one of the best resources to learn about Machine Learning, System Design, and MLOps online. His newsletter is exceptional (I even recommend it on AI Made Simple, the AI-focused sister newsletter to TMS). Would highly recommend checking it out, and following Damien’s Twitter (that’s where I found the piece for today’s article).How to Catch Feature Drift
What is feature drift- Feature Drift is a subset of Data Drift. Say you wanted to see how the use of different phrases in your sales pitch affected your sales conversion rate. You would track the inclusion of each of those phrases and the end result. In our case, each phrase is a feature, and whether the sale was converted or not is the target variable. This should make the meaning of feature drift clear- feature drift occurs when the distribution of the features you’re tracking changes.
Why Feature Drift can be tricky- If you have a lot of variables (features) you’re tracking, then detecting the drift of one feature might be hard. This can be a problem since this drift would build up over time, leading to worse results. If you only have macro-level data-drift detection, then you will have to make much larger changes to the system (which increases the chances for errors).
How to catch feature drift with simple models: Part 1- First divvy your feature data into 2 sets- X_new—> repping your current data and X__old—> for the old data. It is wise to have multiple time splits- 1 day, week, month, etc for the best results. Now create an artificial target Y_now where all the values are 1 and another Y_old where all the values are 0. Y_old would be paired with the X_old samples and Y_now with the X_now ones. These fake targets are a neat way to segregate between old and new samples. Then we simply prepare our datasets-
X = [X_now, X_old] Y = [Y_now, Y_old]
This preps up for the next stage.
How to catch feature drift with simple models: Part 2- Now take that data (X,Y) and train a Supervised Learning algorithm on it. Damien recommends RF because of the built-in Feature Importance measurement process (or another model with built-in importance). I would imagine that any model would be fine since you can apply permutation-based importance in that case (try it out and tell me if it works). Either way, the features with a high feature importance to predict that artificial target, have likely been drifting over time.
Why I like this- I like it for a simple reason: it’s a fairly cheap and reliable way to check for things. Since you don’t need a powerful model to catch the drift, it’s going to be fairly efficient. It also allows you to get a feature-level view of the drift- which can be very insightful.
What techniques do you like to use to detect data drift? I usually rely on building distributions of the new and old data and comparing divergence. This is an interesting technique that I would have to experiment on. Would love to hear from your experiences.
That is it for this piece. I appreciate your time. As always, if you’re interested in working with me or checking out my other work, my links will be at the end of this email/post. If you like my writing, I would really appreciate an anonymous testimonial. You can drop it here. And if you found value in this write-up, I would appreciate you sharing it with more people. It is word-of-mouth referrals like yours that help me grow.
Save the time, energy, and money you would burn by going through all those videos, courses, products, and ‘coaches’ and easily find all your needs met in one place at ‘Tech Made Simple’! Stay ahead of the curve in AI, software engineering, and the tech industry with expert insights, tips, and resources. 20% off for new subscribers by clicking this link. Subscribe now and simplify your tech journey!
Using this discount will drop the prices-
800 INR (10 USD) → 640 INR (8 USD) per Month
8000 INR (100 USD) → 6400INR (80 USD) per year (533 INR /month)
Reach out to me
Use the links below to check out my other content, learn more about tutoring, reach out to me about projects, or just to say hi.
Small Snippets about Tech, AI and Machine Learning over here
AI Newsletter- https://artificialintelligencemadesimple.substack.com/
My grandma’s favorite Tech Newsletter- https://codinginterviewsmadesimple.substack.com/
Check out my other articles on Medium. : https://rb.gy/zn1aiu
My YouTube: https://rb.gy/88iwdd
Reach out to me on LinkedIn. Let’s connect: https://rb.gy/m5ok2y
My Instagram: https://rb.gy/gmvuy9
My Twitter: https://twitter.com/Machine01776819