
What do you think about this?

Is reinforcement learning a dead end?

Provocative question! Having spent at least 25 years studying RL, ever since my first real job at IBM Research, where from 1990–93 I explored the use of methods like Q-learning to teach robots new tasks, I’ve watched the field through its various phases. In the early 1990s, when I got involved, it was restricted to a small handful of aficionados. In 1995, I organized the first National Science Foundation workshop on RL, to which about 50–60 senior researchers were invited.
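For readers who have never seen it, here is a minimal tabular Q-learning sketch in Python: the classic update rule applied to a toy five-state corridor. The environment and hyperparameters are illustrative inventions for this comment, not the robot tasks from the IBM work:

```python
import random
from collections import defaultdict

class ChainEnv:
    """Toy 5-state corridor: reward 1 for reaching the right end,
    a small step cost otherwise. Actions: 0 = left, 1 = right."""
    n_actions = 2
    def reset(self):
        self.s = 0
        return self.s
    def step(self, a):
        self.s = min(self.s + 1, 4) if a == 1 else max(self.s - 1, 0)
        done = self.s == 4
        return self.s, (1.0 if done else -0.01), done

def q_learning(env, episodes=200, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)                  # (state, action) -> value estimate
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection
            if random.random() < epsilon:
                a = random.randrange(env.n_actions)
            else:
                a = max(range(env.n_actions), key=lambda a: Q[(s, a)])
            s2, r, done = env.step(a)
            # the core Q-learning update (Watkins, 1989)
            best_next = max(Q[(s2, b)] for b in range(env.n_actions))
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s2
    return Q

Q = q_learning(ChainEnv())
print([max(range(2), key=lambda a: Q[(s, a)]) for s in range(4)])  # -> [1, 1, 1, 1]
```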

Gradually, through the early 2000s, the field gained popularity, but it never seemed to become a mainstream research topic within ML. Then, wham! DeepMind did its thing, combining deep learning with RL and applying it to the visually appealing domain of Atari video games, and (deep) RL’s popularity went through the roof. Now it seems all the rage, and certainly many employers are hiring (in the Bay Area, it’s an area sought after by some of the labs doing autonomous driving). Google supposedly paid around half a billion dollars for DeepMind, largely on the basis of their deep RL Atari demo. So this looked like a real turning point, and RL came to life!

So, getting back to the question: is RL a “dead end”? In answering this provocative question, one has to clarify one’s point of view. Certainly, from the standpoint of the work going on at DeepMind and other places on using deep RL to play games like Go or Chess, or to train a self-driving car given an accurate simulator of the world, RL is poised to become a well-established technology, and its popularity is only going to increase. RL sessions at major AI and ML conferences are very well attended, and RL submissions are definitely increasing. In all these dimensions, RL is very much not at a “dead end”; in fact, its popularity is only growing.

But, but… you knew there was a but coming!

When you impose on RL the goal of “online learning in real time from the real world”, rather than running millions of simulation steps where agents can be killed thousands of times with no penalty, I fear RL is very much at a dead end. It is not clear to me that any extension of the au courant deep RL methods will lead to success in the real world, in the sense of a physical agent that can learn in real time from a small number of examples.

That is, if your goal is to build a model of how humans learn complex skills, such as driving, then RL seems to me a very poor explanation of how such skills are acquired. One only has to look at the comparative results in the AAAI 2017 paper by Tsividis et al., which pits ordinary humans recruited on Amazon Mechanical Turk against the best deep RL programs at Atari video games, to see where deep RL simply flounders. Humans learn Atari games, like Frostbite, about 1000x faster than the fastest deep RL methods.

A typical human learned Frostbite in about a minute, from a few hundred examples at most. DQN and other deep RL programs take days, with millions of examples. It’s not even close; it’s another galaxy in terms of learning speed. So, looking at this paper, I don’t see any way to capture such large differences with any incremental tweaking of deep RL methods, such as those reported annually in ICML or NIPS papers (of which I review a bunch each year, hoping against hope to see a new idea emerge, only to be disappointed!).
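To make the gap concrete, here is a back-of-the-envelope calculation in Python. The specific counts are rough assumptions drawn from the figures quoted above, not measurements from the Tsividis et al. paper itself; the exact multiplier depends on whether you count frames or wall-clock time, but any reasonable choice gives a gap of three or more orders of magnitude:

```python
# Back-of-the-envelope: how large is the human-vs-DQN gap on Frostbite?
# All figures below are rough assumptions based on the numbers quoted
# in the text above, not measurements from the paper itself.
human_frames = 300             # "a few hundred examples at most"
human_minutes = 1              # "learned Frostbite in about a minute"

dqn_frames = 10_000_000        # "millions of examples", order of magnitude
dqn_minutes = 3 * 24 * 60      # "take days": assume roughly 3 days

print(f"experience gap: ~{dqn_frames / human_frames:,.0f}x")    # ~33,333x
print(f"wall-clock gap: ~{dqn_minutes / human_minutes:,.0f}x")  # ~4,320x
```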

So, what’s to be done to “rescue RL”? I’m not sure there’s really a solution out there. I, for one, have stopped believing that we learn complex skills like driving by anything resembling “pure RL” (that is, from rewards alone). Humans learn to drive because they in fact “know” how to drive before they ever try to drive once. They’ve seen their parents, friends, lovers, Uber drivers, etc. drive many, many times, and they’ve watched driving behavior in movies for thousands of hours. So, when they finally get behind the wheel, they instinctively “know” what driving means, even though they have never actually controlled a physical car before. There remains that all-important “last mile” of actual driving that needs to be learned.

But since the driving program is largely already in place, built up by many thousands of hours of observation, not to mention active instruction from a driving teacher or an anxious parent, what needs to be “learned” are a few control parameters that tell the human brain how much to turn the wheel or press the brake, and more importantly, where to look on the road. This is of course not trivial, which is why humans take a few weeks to get comfortable behind the wheel. But if you count actual hours of practice, humans learn to drive in a few hundred hours; for those paying for driving instruction, this is expensive, since you are charged by the hour.

It is also important to remember that when you impose the condition of learning in the real world, there can be “no cheating”! Unlike the ridiculous 2D world of Atari video games, like Enduro, where one is given a highly simplified 2D visual world and actions are limited to a few discrete choices, humans must drive in the full 3D real world. They face the huge task of controlling both legs, both hands, neck, and body, many hundreds of continuous degrees of freedom, and they must cope with an immense sensory space of stereo vision and binaural hearing.

The only way humans ever learn to drive in a few hundred hours is the simple fact that we already almost know how to drive, and we obviously have a fully working vision system, so we can read signs and recognize cars and pedestrians, while our hearing system recognizes sirens, alerts, horns, etc. Looking at the immensity of the whole driving task, I would claim more than 95% of the driving knowledge is already in place, and only the small remaining part has to be acquired from practice. This is the only explanation for how humans learn a skill as complex as driving in a few hundred hours. There is NO magic here.

So, in that sense, pure (deep) RL seems like a dead end. The pure (deep) RL problem formulation no longer holds much interest for me. What is needed in its place is a richer model of how learning happens: combining observation, transfer learning, and behavior cloning from observed demonstrations, and finally taking that knowledge and improving it with some actual trial-and-error RL. A minimal sketch of what such a pipeline might look like follows below.
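Here is a small, runnable toy of that pipeline in Python: a policy is first fit by behavior cloning on demonstrations, then fine-tuned with a handful of trial-and-error updates. The 1-D “steering” task, the linear policy, and all the constants are illustrative inventions for this comment, not anyone’s actual method:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stage 1: behavior cloning. Toy 1-D "steering" task: the state x is the
# lane offset, and a (noisy) expert steers with action a = -2x. We fit a
# single linear gain k by least squares on observed (state, action) pairs.
states = rng.normal(size=1000)
actions = -2.0 * states + 0.1 * rng.normal(size=1000)
k = np.sum(states * actions) / np.sum(states * states)
k_cloned = k

# Stage 2: the "last mile". Dynamics x' = x + 0.2*a, reward -x'^2 per step.
# A crude finite-difference policy-gradient step stands in for real RL.
def episode_return(gain, steps=20):
    x, total = 1.0, 0.0
    for _ in range(steps):
        x = x + 0.2 * gain * x          # apply the policy a = gain * x
        total -= x ** 2
    return total

eps, lr = 0.05, 0.01
for _ in range(50):                     # only a handful of real trials
    grad = (episode_return(k + eps) - episode_return(k - eps)) / (2 * eps)
    k += lr * grad                      # nudge the cloned gain uphill

# Fine-tuning pushes the cloned gain toward the optimum (gain = -5 would
# return the car to the lane center in a single step).
print(f"cloned gain: {k_cloned:.2f}, fine-tuned gain: {k:.2f}")
```

The point of the sketch: almost all of the policy comes from the cheap supervised stage, and the expensive trial-and-error stage only has to adjust a few control parameters, which mirrors the driving story above.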

One can generalize this to other modes of learning as well. The late Richard Feynman, arguably the most influential physicist after the Second World War, taught a classic introductory course at Caltech, which led to probably the best-selling college textbook of all time, The Feynman Lectures on Physics (still being sold almost 60 years later, in its nth edition). When he looked at how students handled his problem sets, Feynman was ultimately disappointed. He realized that even the extremely bright students at Caltech could not “learn” physics simply by sitting in his class and absorbing his lectures. So he ended his preface to the textbook on a disappointed note, quoting Gibbon (a line I long ago memorized):

“The power of instruction is seldom of much efficacy, except in those happy dispositions where it is almost superfluous”.

I realized the wisdom of this saying after spending two decades or more teaching machine learning to graduate students at several institutions. It seems almost paradoxical, but what Gibbon is saying, and what Feynman and I both discovered, is that instruction only works when the learner “almost already knows” the subject.

But this is precisely what the various theoretical formulations of ML predict must be the case: there is no “free lunch” in learning. DeepMind’s DQN takes millions and millions of steps to learn an apparently trivial (to humans) task like Frostbite because, initially, DQN knows nothing. Humans, in contrast, learn Frostbite in under a minute because they have spent many, many hours building the background needed to learn it so quickly (e.g., vision, hand-eye coordination, general game-playing strategies).

Unfortunately, the prevailing currents in the field, at venues like NeurIPS (formerly NIPS), ICML, and AAAI, tend to glorify knowledge-free learning, so you end up with hundreds, if not thousands, of (deep) RL papers where agents take millions of time steps to learn apparently simple tasks. To me, this approach is ultimately a “dead end”, if your goal is to develop a computational model of how humans learn.

Author’s reply:

Not a bad argument.

“To me, this approach is ultimately a ‘dead end’, if your goal is to develop a computational model of how humans learn.” Agreed. But not really relevant to the article.
