A quick introduction to Reinforcement Learning [Math Mondays]
Some observations on one of Machine Learning's more controversial fields.
Hey, it’s your favorite cult leader here 🐱👤
Mondays are dedicated to theoretical concepts. We’ll cover ideas in Computer Science💻💻, math, software engineering, and much more. Use these days to strengthen your grip on the fundamentals and 10x your skills 🚀🚀.
To get access to all the articles and support my crippling chocolate milk addiction, consider subscribing if you haven’t already!
p.s. you can learn more about the paid plan here.
Reinforcement learning is one of the more contentious fields in AI.
Once hyped as the field that would replace human experts everywhere, it has since fallen short of expectations. AI legend Yann LeCun had this to say about the impact of RL:
That being said, I believe that RL has some very interesting applications when it comes to platform testing and security. I will break down how in an upcoming piece on our sister publication AI Made Simple, combining old Pokémon games, RL, and glitch hunting. Consider this a primer for that.
Understanding Reinforcement Learning
What is Reinforcement Learning- RL is one of the big three paradigms of ML research (alongside supervised and unsupervised learning). It is built around teaching an agent, through trial and error, to take actions that maximize the reward it collects.
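To make "maximize a reward" slightly more precise (this is the standard textbook objective, not anything specific to this article): the agent learns a policy that maximizes the expected discounted return,

$$J(\pi) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r_t\right], \qquad 0 \le \gamma < 1,$$

where r_t is the reward received at step t and the discount factor γ controls how much the agent values future rewards relative to immediate ones.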
Setting up RL Agents- Usually I would talk about why it's useful first, but in this case, talking about how we set up RL agents will give us a strong indication of where it excels. RL problems require a few different components (a minimal sketch of how they fit together follows the list)-
Agent: The agent is the entity that is learning to behave in the environment. It acts, observes the resulting rewards and penalties, and tweaks future actions based on the feedback it receives.
Environment: The environment is the world in which the agent operates. It provides the agent (and programmers) with states and rewards.
Policy: The policy is a function that maps states to actions. The agent uses the policy to decide which action to take in any given state.
Reward: The reward is a signal that the environment provides to the agent after it takes an action. The reward indicates how well the agent is doing.
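Here is a minimal sketch of how these four pieces fit together. The GridWorld environment and RandomAgent policy below are toy stand-ins I made up for illustration; the reset/step interface mirrors the common Gym-style convention but is not tied to any real library.

```python
# Toy sketch of the agent/environment/policy/reward loop described above.
import random

class GridWorld:
    """Tiny 1-D world: start at position 0, reach position 4 to get a reward."""
    def reset(self):
        self.pos = 0
        return self.pos                          # initial state

    def step(self, action):                      # action: -1 (left) or +1 (right)
        self.pos = max(0, min(4, self.pos + action))
        reward = 1.0 if self.pos == 4 else 0.0   # reward signal from the environment
        done = self.pos == 4                     # episode ends at the goal
        return self.pos, reward, done

class RandomAgent:
    """Placeholder policy: maps any state to a random action."""
    def act(self, state):
        return random.choice([-1, +1])

env, agent = GridWorld(), RandomAgent()
state, done, total_reward = env.reset(), False, 0.0
while not done:                                  # one episode of interaction
    action = agent.act(state)                    # policy: state -> action
    state, reward, done = env.step(action)       # environment answers with feedback
    total_reward += reward                       # the quantity the agent should learn to maximize
print("return:", total_reward)
```

A real agent would use that feedback to update its policy (e.g., via Q-learning or policy gradients) instead of acting randomly, but the loop itself looks the same.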
When it makes sense to use Reinforcement Learning- Generally, we see Reinforcement Learning used to teach AI models to play games, develop investment strategies, and pick up other skills (generating better text for ChatGPT, making coffee, driving, etc.). Based on my analysis of RL and its uses, I have a checklist of two items that combine to give us an RL-friendly setup:
You can't boil the process down into a data point: RL-friendly processes are inherently hyper-relational (they rely on information from previous states) and have a continuous element to them. This makes meaningfully labelling them for supervised learning a giant pain (though it is possible). You can label every little step/jump Mario makes and build a full state tree, or you can let an AI agent take control of that psychotic little turtle-hating mushroom addict and let it fuck around and find out. One requires a lot of manpower. The other lets you run the code in the background while you go hard-sparring with your bros and still call it work. The choice is yours.
You know what you want- Generally speaking, RL researchers spend most of their time tweaking the reward function to account for the many ways the AI invariably misbehaves. As long as you know what you want the AI to do (explore the map, grab coins, don't die, etc.), you can generally make those changes relatively quickly (the process is simple, but not always easy). In the same way that supervised learning cut a lot of work by letting engineers build AI without explicitly specifying the relationships between features and targets, RL saves dev time by not forcing devs/researchers to handhold their agent through every step. A toy sketch of what such a reward function looks like follows below.
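To make the reward-tweaking point concrete, here is a toy sketch of such a reward function for a platformer-style game. The state fields (x, coins, dead) and the weights are made up for illustration, not taken from any real game wrapper; in practice, most of the iteration time goes into adjusting exactly these numbers after watching the agent misbehave.

```python
# Toy reward shaping for a platformer-style game (illustrative fields and weights).
def reward(prev_state, state):
    r = 0.0
    r += 0.1 * (state["x"] - prev_state["x"])           # explore: reward forward progress
    r += 1.0 * (state["coins"] - prev_state["coins"])   # take coin: reward pickups
    if state["dead"]:
        r -= 10.0                                        # don't die: heavy penalty
    return r

# Example: moved 5 units right, grabbed a coin, survived -> 0.5 + 1.0 = 1.5
print(reward({"x": 0, "coins": 0, "dead": False},
             {"x": 5, "coins": 1, "dead": False}))
```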
Why RL is useful- RL is valuable not because it behaves like a human, but precisely because it will not. It will faithfully show you all the ways your system can be broken. Watch any video of a trained RL agent and you will see examples of agents finding completely unexpected loopholes. In my upcoming Pokémon article, you will see that the agent figured out it was losing a battle. Instead of continuing and taking the L, it refused to do anything; by doing nothing, it was technically never defeated. Other agents discovered RNG exploits, completely unprompted. All of this has big implications for security research on increasingly complex tech stacks: by behaving in ways normal humans would not, RL agents can add a lot to security testing.
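To see why "do nothing" can beat "fight and lose", here is a tiny illustration with made-up numbers (not from the actual Pokémon agent): if losing carries a big penalty and wasted turns cost nothing, stalling forever yields a strictly higher return, so a reward-maximizing agent will happily find that loophole.

```python
# Toy illustration of reward hacking under a naively specified reward.
def episode_return(rewards, gamma=0.99):
    """Discounted sum of a sequence of per-step rewards."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

fight_and_lose = [0, 0, 0, -10]        # engage, eventually lose the battle
stall_forever  = [0] * 200             # press no useful buttons, never "lose"

print(episode_return(fight_and_lose))  # roughly -9.7
print(episode_return(stall_forever))   # 0.0 -- the loophole the agent exploits
```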
There's been growing investment in RL recently, and as LLMs run into the limits of scale and cost, we're going to see calls to look at alternative approaches to AI. RL will definitely make a comeback, and you should know about it. To end on a fun note, here is a video where DeepMind (one of Google's AI research groups) taught an AI to walk:
That is it for this piece. I appreciate your time. As always, if you’re interested in working with me or checking out my other work, my links will be at the end of this email/post. If you like my writing, I would really appreciate an anonymous testimonial. You can drop it here. And if you found value in this write-up, I would appreciate you sharing it with more people. It is word-of-mouth referrals like yours that help me grow.
Save the time, energy, and money you would burn by going through all those videos, courses, products, and ‘coaches’ and easily find all your needs met in one place at ‘Tech Made Simple’! Stay ahead of the curve in AI, software engineering, and the tech industry with expert insights, tips, and resources. 20% off for new subscribers by clicking this link. Subscribe now and simplify your tech journey!
Using this discount will drop the prices-
800 INR (10 USD) → 640 INR (8 USD) per Month
8000 INR (100 USD) → 6400 INR (80 USD) per year (≈533 INR/month)
Reach out to me
Use the links below to check out my other content, learn more about tutoring, reach out to me about projects, or just to say hi.
Small Snippets about Tech, AI and Machine Learning over here
AI Newsletter- https://artificialintelligencemadesimple.substack.com/
My grandma’s favorite Tech Newsletter- https://codinginterviewsmadesimple.substack.com/
Check out my other articles on Medium: https://rb.gy/zn1aiu
My YouTube: https://rb.gy/88iwdd
Reach out to me on LinkedIn. Let’s connect: https://rb.gy/m5ok2y
My Instagram: https://rb.gy/gmvuy9
My Twitter: https://twitter.com/Machine01776819
What do you think about this?
Is reinforcement learning a dead end?
Provocative question! Having spent at least 25 years studying RL, ever since my first real job at IBM Research, where from 1990-93 I explored the use of methods like Q-learning to teach robots new tasks, I've watched the field through its various phases. In the early 1990s, when I got involved, it was restricted to a small handful of aficionados. I organized the first National Science Foundation workshop on RL (in 1995), to which about 50-60 senior researchers were invited.
Gradually, through the early 2000s, the field gained popularity but never seemed to become a mainstream research topic within ML. Then, wham! DeepMind did its thing with the combination of deep learning and RL, applied to the visually appealing domain of Atari video games, and (deep) RL's popularity went through the roof. Now it seems all the rage, and certainly many employers are hiring (in the Bay Area, it's a skill sought after by some of the labs doing autonomous driving). Google paid half a billion Euros for DeepMind (supposedly!) on the basis of its deep RL Atari demo. So this looked like a real turning point, and RL came to life!
So, getting back to the question: is RL a "dead end"? In answering this provocative question, one has to clarify one's point of view. Certainly, from the standpoint of the work going on at DeepMind and other places on using deep RL to play games like Go or chess, or to train a self-driving car given an accurate simulator of the world, RL is poised to become well-established technology, and its popularity is only going to increase. RL sessions at major AI and ML conferences are very well attended, and RL submissions are definitely increasing. In all these dimensions, RL is very much not at a "dead end"; in fact, its popularity is only increasing.
But, but, …. you knew there was a but coming there!
When you impose on RL the goal of "online learning in real time from the real world", rather than running millions of simulation steps where agents can be killed thousands of times with no penalty, I fear RL is very much at a dead end. It is not clear to me that any extension of the au courant deep RL methods is going to lead to successes in the real world, in terms of a physical agent that can learn in real time from a small number of examples.
That is, if your goal is to build a model of how humans learn complex skills, such as driving, then RL is to me a very poor explanation of how such skills are acquired. One has only to look at the comparative results reported in the AAAI 2017 paper by Tsividis et al., comparing random humans on Amazon Mechanical Turk with the best deep RL programs at Atari video games, to see where deep RL simply flounders. Humans learn Atari games, like Frostbite, about 1000x faster than the fastest deep RL methods.
A typical human learns Frostbite in about a minute, with a few hundred examples at most. DQN and other deep RL programs take days, with millions of examples. It's not even close; it's like another galaxy in terms of the difference in learning speed. So, looking at this paper, I don't see any way to capture such large differences with any incremental tweaking of deep RL methods, of the kind reported annually in ICML or NIPS papers (of which I review a bunch each year, hoping against hope to see a new idea emerge, only to be disappointed!).
So, what's to be done to "rescue RL"? I'm not sure there's really a solution out there. I, for one, have stopped believing that we learn complex skills like driving by something that resembles "pure RL" (that is, from rewards alone). Humans learn to drive because they in fact "know" how to drive before they ever try it once. They've seen their parents, friends, lovers, Uber drivers, etc. drive many, many times, and they've seen driving behavior in movies for thousands of hours. So, when they finally get behind the wheel, they instinctively "know" what driving means, even though, of course, they have never actually controlled a physical car before. So there is that all-important "last mile" of actual driving that needs to be learned.
But since the driving program is largely already in place, built in by many thousands of hours of observation, not to mention active instruction by a driving teacher or an anxious parent, what needs to be "learned" are a few control parameters that tell the human brain how much to turn the wheel or press the brake, and, more importantly, where to look on the road, etc. This is of course not trivial, which is why humans take a few weeks to get comfortable behind the wheel. But if you look at real hours of practice, humans learn to drive in a few hundred hours (and for those paying for driving instruction, this is expensive, since you are charged by the hour).
It is also important to remember that when you impose the condition of learning in the real world, there can be "no cheating"! That is, unlike the ridiculous 2D world of Atari video games, like Enduro, where one is given a highly simplified 2D visual world and actions are limited to a few discrete choices, humans must drive in the full 3D real world. They have the huge task of controlling both legs, both hands, the neck, the body, etc. (many hundreds of continuous degrees of freedom), and they have to cope with an immense sensory space of stereo vision and binaural hearing as well.
The only way humans ever learn to drive in a few hundred hours is the simple fact that we already almost know how to drive, and we obviously have a fully working vision system, so we can read signs and recognize cars and pedestrians, and our hearing system recognizes sirens, alerts, horns, etc. So, if you look at the immensity of the whole driving task, I would claim more than 95% of the driving knowledge is already in place, and only the small remaining part has to be acquired from practice. This is the only explanation for how humans learn a skill as complex as driving in a few hundred hours. There is NO magic here.
So, in that sense, pure (deep) RL seems like a dead end. The pure (deep) RL problem formulation really does not hold much interest for me anymore. What is needed in its place is a more complex model of how learning happens: combining observation, transfer learning, and many other forms of behavior cloning from observed demonstrations, and finally taking this knowledge and improving it with some actual trial-and-error RL.
One can generalize this to other modes of learning as well. The late Richard Feynman, arguably the most influential physicist after the Second World War, taught a classic introductory course at Caltech, which led to probably the best-selling college textbook of all time, the Feynman Lectures on Physics (still being sold almost 60 years later, in its nth edition). When he looked at how students handled his problem sets, Feynman was ultimately disappointed. He realized that even the extremely bright students at Caltech could not "learn" physics simply by sitting in his class and absorbing his lectures. So he ended the preface to the textbook on a disappointed note, quoting Gibbon (a line I memorized long ago):
“The power of instruction is seldom of much efficacy, except in those happy dispositions where it is almost superfluous”.
I realized the wisdom of this saying after spending two decades or more teaching machine learning to graduate students at several institutions. It seems almost paradoxical, but what Gibbon is saying, and what Feynman and I both discovered, is that teaching only works when the learner "almost already knows" the subject.
But this is precisely what the various theoretical formulations of ML predict must be the case: there is no "free lunch" in terms of being able to learn. DeepMind's DQN network takes millions and millions of steps to learn an apparently trivial (to humans) task like Frostbite because initially DQN knows nothing. Humans, in contrast, learn Frostbite in under a minute because they have spent many, many hours building the background needed to learn it so quickly (e.g., vision, hand-eye coordination, general game-playing strategies).
Unfortunately, the prevailing currents in the field, at venues like NeurIPS (formerly NIPS), ICML, and AAAI, tend to "glorify" knowledge-free learning, so you end up with hundreds, if not thousands, of (deep) RL papers where agents take millions of time steps to learn apparently simple tasks. To me, this approach is ultimately a "dead end" if your goal is to develop a computational model of how humans learn.