Introduction
This paper critically examines the assumptions commonly accepted in modeling reinforcement learning problems, suggesting that these assumptions may impede progress in the field. The authors refer to these assumptions as “dogmas.”
Dogma: A fixed, especially religious, belief or set of beliefs that people are expected to accept without any doubts. (From the Cambridge Advanced Learner’s Dictionary & Thesaurus)
The paper introduces three central dogmas:
- The Environment Spotlight
- Learning as Finding a Solution
- The Reward Hypothesis (although not exactly a dogma)
The authors argue that the true landscape of reinforcement learning is much broader than the way we typically model it. In the words of Rich Sutton:
RL can be viewed as a microcosm of the whole AI problem.
However, today’s RL landscape is overly simplified, and these three dogmas are responsible for narrowing the potential of RL. The authors propose that we consider moving beyond these dogmas in order to reclaim the true landscape of RL.
Background
The authors reference Thomas Kuhn’s book, “The Structure of Scientific Revolutions”, in which Kuhn distinguishes between two phases of scientific activity:
- Normal Science: Resembling puzzle-solving.
- Revolutionary Phase: Involving a fundamental rethinking of the values, methods, and commitments of science (the set of commitments that Kuhn calls a “paradigm”).
The authors draw an analogy to previous paradigm shifts in science, and explore the paradigm shift they argue is needed in RL.
Dogma One: The Environment Spotlight
The first dogma we call the environment spotlight, which refers to our collective focus on modeling environments and environment-centric concepts rather than agents.
What do we mean when we say that we focus on environments? We suggest that it is easy to answer only one of the following two questions:
What is at least one canonical mathematical model of an environment in RL?
- The MDP and its variants! We define everything in terms of it. By embracing the MDP, we can import a variety of fundamental results and algorithms that define much of our primary research objectives and pathways. For example, we know every MDP has at least one deterministic, optimal, stationary policy, and that dynamic programming can be used to identify this policy (see the sketch after these questions).
What is at least one canonical mathematical model of an agent in RL?
- In contrast, this question has no clear answer!
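To make the first answer concrete, here is a minimal sketch of value iteration, the dynamic-programming procedure mentioned above, recovering a deterministic, stationary, optimal policy. The two-state MDP and all of its numbers are hypothetical, chosen purely for illustration; they are not taken from the paper.

```python
# Illustrative only: a hypothetical two-state, two-action MDP, solved with value iteration
# to show that dynamic programming yields a deterministic, stationary, optimal policy.
import numpy as np

gamma = 0.9

# P[s, a, s'] = transition probability, R[s, a] = expected reward (made-up numbers).
P = np.array([
    [[0.8, 0.2], [0.1, 0.9]],   # transitions from state 0 under actions 0 and 1
    [[0.5, 0.5], [0.0, 1.0]],   # transitions from state 1 under actions 0 and 1
])
R = np.array([
    [1.0, 0.0],                 # expected rewards in state 0 for actions 0 and 1
    [0.0, 2.0],                 # expected rewards in state 1 for actions 0 and 1
])

V = np.zeros(2)
for _ in range(10_000):                 # Bellman optimality updates until convergence
    Q = R + gamma * (P @ V)             # Q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] * V[s']
    V_new = Q.max(axis=1)
    if np.abs(V_new - V).max() < 1e-10:
        V = V_new
        break
    V = V_new

policy = Q.argmax(axis=1)               # a deterministic, stationary policy
print("optimal state values:", V)
print("greedy (optimal) policy:", policy)
```

Nothing comparable can be written down for the second question: there is no agreed-upon mathematical object that plays this role for agents.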
The authors suggest it is important to define, model, and analyse agents in addition to environments. We should build toward a canonical mathematical model of an agent that can open us up to the possibility of discovering general laws governing agents (if they exist).
Dogma Two: Learning as Finding a Solution
The second dogma is embedded in the way we treat the concept of learning. We tend to view learning as a finite process involving the search for, and eventual discovery of, a solution to a given task.
We tend to implicitly assume that the learning agents we design will eventually find a solution to the task at hand, at which point learning can cease. Such agents can be understood as searching through a space of representable functions that captures the possible action-selection strategies available to an agent, similar to the Problem Space Hypothesis. Critically, this space contains at least one function, such as the optimal policy of an MDP, that is of sufficient quality to consider the task of interest solved. Often, we are then interested in designing learning agents that are guaranteed to converge to such an endpoint, at which point the agent can stop its search (and thus, stop its learning).
The authors suggest embracing the view that learning can also be treated as adaptation. As a consequence, our focus drifts away from optimality and toward a version of the RL problem in which agents continually improve, rather than focusing on agents that are trying to solve a specific problem (a small illustration of this contrast follows the questions below).
When we move away from optimality, several questions arise:
- How do we think about evaluation?
- How, precisely, can we define this form of learning, and differentiate it from others?
- What are the basic algorithmic building blocks that carry out this form of learning, and how are they different from the algorithms we use today?
- Do our standard analysis tools such as regret and sample complexity still apply?
These questions are important, and require reorienting around this alternate view of learning.
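To give a feel for this distinction, here is a small, purely illustrative sketch (our own construction, not something from the paper): two simple bandit learners face a hypothetical world whose reward structure changes partway through. A learner whose step size decays to zero effectively “finds its solution” and stops learning, while a constant-step-size learner keeps adapting.

```python
# Illustrative contrast between "learning as finding a solution" and "learning as
# adaptation" on a hypothetical nonstationary two-armed bandit (all numbers made up).
import random

random.seed(0)
T = 20_000
q_converge = [0.0, 0.0]   # estimates for the decaying-step-size ("solve then stop") learner
q_adapt = [0.0, 0.0]      # estimates for the constant-step-size (continual) learner

for t in range(1, T + 1):
    means = [1.0, 0.0] if t <= T // 2 else [0.0, 1.0]       # the world changes halfway through
    for q, alpha in ((q_converge, 1.0 / t), (q_adapt, 0.1)):
        # epsilon-greedy action selection with a fixed 10% exploration rate
        arm = q.index(max(q)) if random.random() > 0.1 else random.randrange(2)
        reward = random.gauss(means[arm], 1.0)
        q[arm] += alpha * (reward - q[arm])                  # incremental value update

print("decaying step size :", [round(v, 2) for v in q_converge])
print("constant step size :", [round(v, 2) for v in q_adapt])
# Typically only the constant-step-size learner's estimates reflect the post-change world;
# the decaying-step-size learner stays anchored to the world it originally "solved".
```

Standard notions such as convergence to an optimal policy describe the first learner well, but say little about how well the second one keeps up with a changing world, which is part of why the evaluation questions above remain open.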
The authors introduce the book “Finite and Infinite Games”, whose central concept is summarized in the following quote:
There are at least two kinds of games. One could be called finite; the other infinite. A finite game is played for the purpose of winning, an infinite game for the purpose of continuing the play.
The authors argue that alignment is an infinite game.
Dogma Three: The Reward Hypothesis
The third dogma is the reward hypothesis, which states “All of what we mean by goals and purposes can be well thought of as maximization of the expected value of the cumulative sum of a received scalar signal (reward).”
The authors argue that the reward hypothesis is not truly a dogma. Nevertheless, it is crucial to understand its nuances as we continue to design intelligent agents.
A recent analysis by [2] fully characterizes the implicit conditions required for the hypothesis to be true. These conditions come in two forms. First, [2] provide a pair of interpretative assumptions that clarify what it would mean for the reward hypothesis to be true or false; roughly, these amount to saying two things:
- First, that “goals and purposes” can be understood in terms of a preference relation on possible outcomes.
- Second, that a reward function captures these preferences if the ordering over agents induced by value functions matches the ordering induced by the preference relation over agent outcomes (a rough formalization appears below).
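In symbols, the second assumption can be read roughly as follows; the notation here is an illustrative paraphrase rather than the exact formalism of [2]. A reward function, together with a discount factor $\gamma$, captures the preference relation $\succeq$ whenever, for any two agents $\pi_1$ and $\pi_2$ (with $r_t$ denoting the reward received at time $t$),

$$
\pi_1 \succeq \pi_2 \quad \Longleftrightarrow \quad \mathbb{E}_{\pi_1}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_t\right] \;\geq\; \mathbb{E}_{\pi_2}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_t\right].
$$

That is, preferring one agent to another must coincide with that agent obtaining at least as much expected discounted reward.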
This leads to the following conjecture: under this interpretation, a Markov reward function exists to capture a preference relation if and only if the preference relation satisfies the four von Neumann-Morgenstern axioms and a fifth axiom that Bowling et al. call Temporal $\gamma$-Indifference.
Axiom 1: Completeness > You have a preference between every outcome pair.
- You can always compare any two choices.
Axiom 2: Transitivity > No preference cycles.
- If you like chocolate more than vanilla, and vanilla more than strawberry, you must like chocolate more than strawberry (a small check of why this matters appears after the axiom list).
Axiom 3: Independence > Independent alternatives can’t change your preference.
- If you like pizza more than salad, and you have to choose between a lottery of pizza or ice cream and a lottery of salad or ice cream (with the same probabilities), you should still prefer the pizza lottery over the salad lottery.
Axiom 4: Continuity > There is always a break-even chance.
- Imagine you like a 100 dollar bill more than a 50 dollar bill, and a 50 dollar bill more than a 1 dollar bill. There should be some probability at which a lottery between the 100 dollar bill and the 1 dollar bill is exactly as good as getting the 50 dollar bill for sure.
These four axioms are called the von Neumann-Morgenstern axioms.
Axiom 5: Temporal $\gamma$-Indifference > Discounting is consistent throughout time.
- Temporal $\gamma$-indifference says that if you are indifferent between receiving a reward at time $t$ and receiving the same reward at time $t+1$, then your preference should not change if we move both time points by the same amount. For instance, if you don’t care whether you get a candy today or tomorrow, then you should also not care whether you get the candy next week or the week after.
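To see why an axiom like transitivity is needed (as noted under Axiom 2), here is a tiny self-contained check, our own illustration rather than anything from the paper: a cyclic preference such as chocolate over vanilla, vanilla over strawberry, and strawberry over chocolate cannot be reproduced by any assignment of scalar scores, because real numbers cannot form a strict cycle.

```python
# Our illustration: no scalar scoring of three outcomes can reproduce a preference cycle,
# which is why transitivity is required for any scalar reward/utility representation.
from itertools import permutations

items = ("chocolate", "vanilla", "strawberry")
cycle = [("chocolate", "vanilla"), ("vanilla", "strawberry"), ("strawberry", "chocolate")]

def respects(scores, prefs):
    """True if the scores reproduce every strict preference a > b in prefs."""
    return all(scores[a] > scores[b] for a, b in prefs)

# Every assignment of distinct scores is order-equivalent to some permutation of (3, 2, 1),
# and none of them respects the cycle (ties fail trivially, since the preferences are strict).
found = any(respects(dict(zip(items, s)), cycle) for s in permutations((3, 2, 1)))
print("some scalar scoring reproduces the cycle:", found)   # -> False
```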
Taking these axioms into account, the reward conjecture becomes the reward theorem: a preference relation can be captured by a Markov reward function if and only if it satisfies all five axioms.
It is essential to keep in mind that people do not always conform to these axioms, and that human preferences can vary.
It is important that we are aware of the implicit restrictions we are placing on the viable goals and purposes under consideration when we represent a goal or purpose through a reward signal. We should become familiar with the requirements imposed by the five axioms, and be aware of what specifically we might be giving up when we choose to write down a reward function.
See Also
- David Abel Presentation @ ICML 2023
- David Abel Personal Website
- Mark Ho Personal Website
- Anna Harutyunyan Personal Website