202509281156
Status: #idea
Tags: #reinforcement_learning #ai

# Core reinforcement learning terms

- **Reinforcement Learning** - the task of learning through trial and error. In this type of task, no human labels data, and no human collects or explicitly designs the collection of data.
    - [[The core mental model of reinforcement learning]]
    - [[Generalized policy iteration (GPI)]]
- **Deep Reinforcement Learning** - the use of multi-layered, non-linear function approximation (i.e. deep neural networks) in a reinforcement learning problem.
    - [[The types of feedback in deep reinforcement learning]]
    - [[Function approximation is necessary in RL problems with high-dimensional or continuous state or action spaces]]
    - [[The key decision points when creating a deep RL approach]]
    - [[Large neural networks can help solve the non-stationarity problem]]
    - Specific Approaches/Models:
        - [[Deep Q-Network (DQN)]]
        - [[Double DQN (DDQN)]]
        - [[Prioritized Experience Replay (PER)]]
- **Agent** - the decision maker in a reinforcement learning problem.
- **Environment** - everything outside of the agent in a reinforcement learning problem.
    - [[The environment in RL is everything not contained in the agent]]
- **State Space** - the set of variables that describe the environment, along with all possible values those variables can take.
- **State** - a specific set of values that the state-space variables take at a given time.
- **Observation** - the part of the state that the agent can actually observe. In many cases, agents don't have access to the full state of the environment.
- **Observation Space** - the set of all possible values an agent's observation can take (that is, the subset of the state space the agent can actually observe).
- **Actions** - decisions the agent can take to influence the environment.
- **Action Space** - the set of all possible actions in all states.
- **Transition Function** - the function that maps a state-action pair to a new state. That is, it tells us: when the agent takes action $a$ in state $s$, what state is it likely to end up in?
- **Reward** - a signal provided by the environment indicating how "good" a specific action taken by the agent was.
    - [[There is a trade off between the density of the reward signal and the bias we inject into a model]]
- **Reward Function** - a function mapping state pairs to a reward signal. That is, it tells us how much reward $r$ the agent receives for transitioning from state $s$ to state $s'$.
- **Model of the Environment** - the set of transition and reward functions describing the environment.
    - [[Planning problems vs learning problems]]
- **Time Step** - one cycle of the typical agent-environment interaction: the agent interacts with the environment, evaluates its behavior, and improves its responses (see the sketch below).
- **Experience** - the tuple of state, action, reward, and new state produced by a single time step.
- **Episodic Tasks** - tasks that have a natural ending (e.g. a chess game).
- **Continuing Tasks** - tasks that have no natural end state (e.g. learning forward motion).
- **Episode** - the sequence of time steps from the beginning to the end of an episodic task.
- **Return** - the sum of rewards collected in a single episode.
- **Sequential Feedback** - the feedback received by the RL agent is not independent and identically distributed. Instead, subsequent actions and their associated rewards depend on the actions taken earlier in the episode.
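A minimal sketch of the interaction loop these terms describe, assuming a hypothetical environment object with `reset() -> state` and `step(action) -> (next_state, reward, done)` and a `policy` callable mapping observations to actions (the names are illustrative, not from the source note):

```python
from collections import namedtuple

# One time step's worth of data: state, action, reward, new state (plus a done flag).
Experience = namedtuple("Experience", ["state", "action", "reward", "next_state", "done"])

def run_episode(env, policy):
    """Roll out one episode and return the collected experiences and the return."""
    state = env.reset()
    experiences, episode_return = [], 0.0
    done = False
    while not done:                      # each iteration is one time step
        action = policy(state)           # the agent acts on the environment
        next_state, reward, done = env.step(action)
        experiences.append(Experience(state, action, reward, next_state, done))
        episode_return += reward         # return = sum of rewards in the episode
        state = next_state
    return experiences, episode_return
```

Each `Experience` here is exactly the (state, action, reward, new state) record defined above, and summing the rewards over the episode gives the (undiscounted) return.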
- **Temporal Credit Assignment Problem** - because feedback is sequential and rewards may be sparse, only manifesting after several time steps, it is hard to determine which state and/or action is responsible for a reward.
- **Evaluative Feedback** - rewards typically do not indicate whether an action was correct (as labels do in supervised learning); they only give a measure of goodness. That is, we evaluate the action or episode and determine its value. Critically, this doesn't tell us whether the action was the best one available.
- **Exploration versus Exploitation Trade-off** - since feedback is evaluative, the RL agent must explore many parts of the state space to identify the "best" actions to take. However, in continuous state and action spaces it is impossible to exhaustively explore all combinations, so the agent must balance exploring new actions and states with exploiting already-identified good actions.
- **Sampled Feedback** - agents frequently do not have access to a full model of the environment (i.e. the transition function and the reward function). Hence, the agent must learn from a sample of the environment and use that sample to generalize.
- **Policy** - a function mapping an observation to the agent's next action.
- **Greedy Policy** - a policy that always selects the action that maximizes the expected return according to the value function.
- **Epsilon-greedy Policy** - a policy that is greedy with probability $1 - \epsilon$ and selects a random action with probability $\epsilon$, for $0 < \epsilon < 1$ (sketched below).
- **Optimal Policy** - a policy that always selects the actions actually yielding the highest expected return from each and every state. This policy is greedy with respect to the optimal value function.
- **Model** - a function mapping observations to new observations and/or rewards.
- **Value Function** - a function mapping observations to reward-to-go estimates.
    - [[Learning state-value functions]]
    - [[The value function is limited without knowledge of the MDP]]
- **Action Value Function** - a function mapping (observation, action) pairs to reward-to-go estimates.
    - [[Learning action-value function]]
- **Policy-based Agents** - agents designed to approximate a policy.
    - [[Policy-Gradient Methods Overview]]
    - Specific Approaches/Models:
        - [[REINFORCE]]
        - [[Vanilla Policy Gradient (VPG)]]
- **Value-based Agents** - agents designed to approximate value functions.
    - [[The main weakness of value-based deep reinforcement learning algorithms is their tendency to diverge]]
    - [[Samples are not independent in value-based deep reinforcement learning]]
    - [[Samples are not identically-distributed in value-based deep reinforcement learning]]
    - [[The target moves in value-based deep reinforcement learning]]
- **Model-based Agents** - agents designed to approximate models of the environment.
- **Actor-Critic Agents** - agents designed to approximate both policies and value functions.
    - [[Actor-Critic Methods Overview]]
    - Specific Approaches/Models:
        - [[Asynchronous Advantage Actor-Critic (A3C)]]
        - [[Generalized Advantage Estimation (GAE)]]
        - [[Advantage Actor-Critic (A2C)]]
- **Bandit Environment** - an environment with a single non-terminal state.
- **Horizon** - the length of time the agent is planning for. Horizons can be finite (in episodic tasks) or infinite (in continuing tasks).
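A minimal sketch of greedy and epsilon-greedy action selection, assuming the action-value estimates live in a NumPy array `q_values` indexed by action (a tabular stand-in for an action-value function; the names are illustrative, not from the source note):

```python
import numpy as np

def greedy_action(q_values: np.ndarray) -> int:
    """Exploit: choose the action with the highest estimated value."""
    return int(np.argmax(q_values))

def epsilon_greedy_action(q_values: np.ndarray, epsilon: float,
                          rng: np.random.Generator) -> int:
    """Explore with probability epsilon, otherwise act greedily."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))  # explore: uniform random action
    return greedy_action(q_values)               # exploit: current best action

# Example: four actions with estimated values, epsilon = 0.1
rng = np.random.default_rng(0)
q_values = np.array([0.2, 1.5, -0.3, 0.7])
action = epsilon_greedy_action(q_values, epsilon=0.1, rng=rng)
```

The epsilon parameter directly controls the exploration-exploitation trade-off: larger values explore more, smaller values exploit the current value estimates more.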
- **Discount Factor** - a factor that adjusts the importance of rewards over time. We typically use a positive real number less than one to exponentially discount the value of future rewards.
    - [[Why we use a discount factor in reinforcement learning]]
- **Markov Decision Process** - the mathematical framework used to represent decision-making processes in RL. It is composed of a state space, a set of per-state actions, a transition function, a reward signal, a horizon, a discount factor, and an initial state distribution. States describe the configuration of the environment. Actions allow agents to interact with the environment. The transition function tells us how the environment evolves and reacts to the agent's actions. The reward signal encodes the goal to be achieved by the agent. The horizon and discount factor add a notion of time to the interactions.
- **Prediction Problem** - the problem of evaluating policies, that is, estimating the value function for a given policy.
- **Control Problem** - the problem of finding optimal policies, solved by following the pattern of generalized policy iteration (GPI).
    - Non-deep RL methods to solve the control problem:
        - [[Monte Carlo Control]]
        - [[SARSA]]
        - [[Q-Learning]]
        - [[Double Q-Learning]]
- **Policy Evaluation** - refers to algorithms that solve the prediction problem.
- **Policy Improvement** - refers to algorithms that create a new policy that improves on an original policy by making it greedier with respect to that original policy's value function.
- **On-policy Method** - a policy evaluation method in which the policy used to generate data is the same policy being evaluated.
- **Off-policy Method** - a policy evaluation method in which the policy used to generate data is different from the policy being evaluated.

---
# References

[[Grokking Deep Reinforcement Learning]]