CS 221 - Artificial Intelligence

States-based models with search optimization and MDP

By Afshine Amidi and Shervine Amidi


Search optimization

In this section, we assume that by accomplishing action $a$ from state $s$, we deterministically arrive in state $\textrm{Succ}(s,a)$. The goal here is to determine a sequence of actions $(a_1,a_2,a_3,a_4,...)$ that starts from an initial state and leads to an end state. In order to solve this kind of problem, our objective will be to find the minimum cost path by using states-based models.

Tree search

This category of states-based algorithms explores all possible states and actions. It is quite memory efficient, and is suitable for huge state spaces but the runtime can become exponential in the worst cases.

Tree

Search problem A search problem is defined with: a starting state $s_\textrm{start}$, the possible actions $\textrm{Actions}(s)$ from state $s$, the action cost $\textrm{Cost}(s,a)$ of taking action $a$ in state $s$, the successor $\textrm{Succ}(s,a)$ of state $s$ after action $a$, and whether an end state was reached, $\textrm{IsEnd}(s)$.

Search problem

The objective is to find a path that minimizes the cost.


Backtracking search Backtracking search is a naive recursive algorithm that tries all possibilities to find the minimum cost path. Here, action costs can be either positive or negative.
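
As a rough illustration, here is a minimal Python sketch of backtracking search. The `actions`, `cost`, `succ` callables and the `is_end` predicate are assumed to implement the search problem interface defined above, and every action sequence is assumed to eventually reach an end state.

```python
def backtracking_search(s, actions, cost, succ, is_end):
    """Return (min_cost, path) from state s to an end state by trying all action sequences."""
    if is_end(s):
        return 0.0, []
    best_cost, best_path = float('inf'), None
    for a in actions(s):
        sub_cost, sub_path = backtracking_search(succ(s, a), actions, cost, succ, is_end)
        total = cost(s, a) + sub_cost
        if total < best_cost:
            best_cost, best_path = total, [a] + sub_path
    return best_cost, best_path
```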


Breadth-first search (BFS) Breadth-first search is a graph search algorithm that does a level-by-level traversal. We can implement it iteratively with the help of a queue that stores at each step future nodes to be visited. For this algorithm, we can assume action costs to be equal to a constant $c\geqslant0$.

BFS
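
As a sketch, the queue-based implementation could look as follows, assuming the same hypothetical problem interface (`actions`, `succ`, `is_end`) and hashable states; since all action costs equal a constant, the path with the fewest actions is also a minimum cost path.

```python
from collections import deque

def breadth_first_search(s_start, actions, succ, is_end):
    """Return a shortest list of actions (fewest steps) from s_start to an end state, or None."""
    frontier = deque([(s_start, [])])
    seen = {s_start}
    while frontier:
        s, path = frontier.popleft()
        if is_end(s):
            return path
        for a in actions(s):
            s_next = succ(s, a)
            if s_next not in seen:
                seen.add(s_next)
                frontier.append((s_next, path + [a]))
    return None
```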

Depth-first search (DFS) Depth-first search is a search algorithm that traverses a graph by following each path as deep as it can. We can implement it recursively, or iteratively with the help of a stack that stores at each step future nodes to be visited. For this algorithm, action costs are assumed to be equal to 0.

DFS

Iterative deepening The iterative deepening trick is a modification of the depth-first search algorithm so that it stops after reaching a certain depth, which guarantees optimality when all action costs are equal. Here, we assume that action costs are equal to a constant $c\geqslant0$.
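
A minimal sketch of the trick, again under the hypothetical `actions`/`succ`/`is_end` interface: depth-limited DFS is rerun with an increasing depth limit, so the first solution found uses the fewest steps.

```python
def depth_limited_search(s, limit, actions, succ, is_end):
    """Return a list of actions reaching an end state in at most `limit` steps, or None."""
    if is_end(s):
        return []
    if limit == 0:
        return None
    for a in actions(s):
        sub = depth_limited_search(succ(s, a), limit - 1, actions, succ, is_end)
        if sub is not None:
            return [a] + sub
    return None

def iterative_deepening(s_start, max_depth, actions, succ, is_end):
    """Run depth-limited DFS with increasing depth limits up to max_depth."""
    for limit in range(max_depth + 1):
        path = depth_limited_search(s_start, limit, actions, succ, is_end)
        if path is not None:
            return path
    return None
```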


Tree search algorithms summary By noting $b$ the number of actions per state, $d$ the solution depth, and $D$ the maximum depth, we have:

Algorithm | Action costs | Space | Time
Backtracking search | any | $\mathcal{O}(D)$ | $\mathcal{O}(b^D)$
Breadth-first search | $c\geqslant0$ | $\mathcal{O}(b^d)$ | $\mathcal{O}(b^d)$
Depth-first search | 0 | $\mathcal{O}(D)$ | $\mathcal{O}(b^D)$
DFS-Iterative deepening | $c\geqslant0$ | $\mathcal{O}(d)$ | $\mathcal{O}(b^d)$

Graph search

This category of states-based algorithms aims at constructing optimal paths, enabling exponential savings. In this section, we will focus on dynamic programming and uniform cost search.

Graph A graph is comprised of a set of vertices $V$ (also called nodes) as well as a set of edges $E$ (also called links).

Graph

Remark: a graph is said to be acyclic when there is no cycle.


State A state is a summary of all past actions sufficient to choose future actions optimally.


Dynamic programming Dynamic programming (DP) is a backtracking search algorithm with memoization (i.e. partial results are saved) whose goal is to find a minimum cost path from state $s$ to an end state $s_\textrm{end}$. It can potentially have exponential savings compared to traditional graph search algorithms, but it only works for acyclic graphs. For any given state $s$, the future cost is computed as follows:

$$\boxed{\textrm{FutureCost}(s)=\left\{\begin{array}{lc}0 & \textrm{if IsEnd}(s)\\\underset{a\in\textrm{Actions}(s)}{\textrm{min}}\big[\textrm{Cost}(s,a)+\textrm{FutureCost}(\textrm{Succ}(s,a))\big] & \textrm{otherwise}\end{array}\right.}$$
Dynamic Programming

Remark: the figure above illustrates a bottom-up approach, whereas the formula expresses the top-down recursion.
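
A minimal sketch of DP with memoization, assuming the hypothetical `actions`/`cost`/`succ`/`is_end` interface, hashable states, and an acyclic graph:

```python
from functools import lru_cache

def make_future_cost(actions, cost, succ, is_end):
    """Return a memoized FutureCost function for an acyclic search problem."""
    @lru_cache(maxsize=None)
    def future_cost(s):
        if is_end(s):
            return 0.0
        # Memoization: each state is expanded at most once, hence the exponential savings.
        return min(cost(s, a) + future_cost(succ(s, a)) for a in actions(s))
    return future_cost
```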


Types of states The table below presents the terminology when it comes to states in the context of uniform cost search:

State | Explanation
Explored $\mathcal{E}$ | States for which the optimal path has already been found
Frontier $\mathcal{F}$ | States seen for which we are still figuring out how to get there with the cheapest cost
Unexplored $\mathcal{U}$ | States not seen yet

Uniform cost search Uniform cost search (UCS) is a search algorithm that aims at finding the shortest path from a state $s_\textrm{start}$ to an end state $s_\textrm{end}$. It explores states $s$ in increasing order of $\textrm{PastCost}(s)$ and relies on the fact that all action costs are non-negative.

Uniform Cost Search

Remark 1: the UCS algorithm is logically equivalent to Dijkstra's algorithm.

Remark 2: the algorithm would not work for a problem with negative action costs, and adding a positive constant to make them non-negative would not solve the problem since this would end up being a different problem.
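
A minimal heap-based sketch of UCS under the same hypothetical problem interface; it returns the past cost and action path of a minimum cost path, or infinity if no end state is reachable.

```python
import heapq
import itertools

def uniform_cost_search(s_start, actions, cost, succ, is_end):
    """Return (min past cost, action path) to an end state; all action costs must be >= 0."""
    counter = itertools.count()                  # tie-breaker so states are never compared directly
    frontier = [(0.0, next(counter), s_start, [])]
    explored = set()
    while frontier:
        past_cost, _, s, path = heapq.heappop(frontier)
        if s in explored:
            continue
        explored.add(s)                          # at this point the priority equals PastCost(s)
        if is_end(s):
            return past_cost, path
        for a in actions(s):
            s_next = succ(s, a)
            if s_next not in explored:
                heapq.heappush(frontier, (past_cost + cost(s, a), next(counter), s_next, path + [a]))
    return float('inf'), None
```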


Correctness theorem When a state $s$ is popped from the frontier $\mathcal{F}$ and moved to the explored set $\mathcal{E}$, its priority is equal to $\textrm{PastCost}(s)$, the cost of the minimum cost path from $s_\textrm{start}$ to $s$.


Graph search algorithms summary By noting $N$ the total number of states, $n$ of which are explored before the end state $s_\textrm{end}$, we have:

Algorithm | Acyclicity required | Costs | Time/space
Dynamic programming | yes | any | $\mathcal{O}(N)$
Uniform cost search | no | $c\geqslant0$ | $\mathcal{O}(n\log(n))$

Remark: the complexity count assumes that the number of possible actions per state is constant.


Learning costs

Suppose we are not given the values of $\textrm{Cost}(s,a)$; we want to estimate these quantities from a training set of minimum-cost-path action sequences $(a_1, a_2, ..., a_k)$.

Structured perceptron The structured perceptron is an algorithm aiming at iteratively learning the cost of each state-action pair. At each step, it decreases the estimated cost of each state-action pair along the true minimum cost path given by the training data, and increases the estimated cost of each state-action pair along the path predicted with the current costs.

Remark: there are several versions of the algorithm, one of which simplifies the problem to only learning the cost of each action $a$, and another which parametrizes $\textrm{Cost}(s,a)$ with a feature vector of learnable weights.


$A^{\star}$ search

Heuristic function A heuristic is a function $h$ over states $s$, where each $h(s)$ aims at estimating $\textrm{FutureCost}(s)$, the cost of the path from $s$ to $s_\textrm{end}$.

A star

Algorithm $A^{\star}$ is a search algorithm that aims at finding the shortest path from a state $s$ to an end state $s_\textrm{end}$. It explores states $s$ in increasing order of $\textrm{PastCost}(s) + h(s)$. It is equivalent to a uniform cost search with edge costs $\textrm{Cost}'(s,a)$ given by:

$$\boxed{\textrm{Cost}'(s,a)=\textrm{Cost}(s,a)+h(\textrm{Succ}(s,a))-h(s)}$$

Remark: this algorithm can be seen as a biased version of UCS exploring states estimated to be closer to the end state.
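
Under the same assumptions, $A^\star$ can be sketched as a thin wrapper around the `uniform_cost_search` sketch above, run on the modified costs; the heuristic `h` is assumed consistent, so the modified costs are non-negative and $h(s_\textrm{end})=0$.

```python
def a_star_search(s_start, actions, cost, succ, is_end, h):
    """A* as uniform cost search on Cost'(s,a) = Cost(s,a) + h(Succ(s,a)) - h(s)."""
    def modified_cost(s, a):
        return cost(s, a) + h(succ(s, a)) - h(s)
    modified_total, path = uniform_cost_search(s_start, actions, modified_cost, succ, is_end)
    if path is None:
        return float('inf'), None
    # The modified costs telescope: their sum equals the true cost + h(s_end) - h(s_start),
    # and h(s_end) = 0 for a consistent heuristic, so adding back h(s_start) recovers the true cost.
    return modified_total + h(s_start), path
```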


Consistency A heuristic $h$ is said to be consistent if it satisfies the two following properties: for all states $s$ and actions $a$, $h(s)\leqslant\textrm{Cost}(s,a)+h(\textrm{Succ}(s,a))$, and $h(s_\textrm{end})=0$ for any end state $s_\textrm{end}$.


Correctness If $h$ is consistent, then $A^\star$ returns the minimum cost path.


Admissibility A heuristic $h$ is said to be admissible if we have:

$$\boxed{h(s)\leqslant\textrm{FutureCost}(s)}$$

Theorem Let $h(s)$ be a given heuristic. We have:

$$\boxed{h(s)\textrm{ consistent}\Longrightarrow h(s)\textrm{ admissible}}$$

Efficiency $A^\star$ explores all states $s$ satisfying the following equation:

$$\boxed{\textrm{PastCost}(s)\leqslant\textrm{PastCost}(s_{\textrm{end}})-h(s)}$$
Efficiency

Remark: larger values of $h(s)$ are better, as the inequality above shows that they restrict the set of states $s$ that will be explored.


Relaxation

It is a framework for producing consistent heuristics. The idea is to find closed-form reduced costs by removing constraints and use them as heuristics.


Relaxed search problem The relaxation of a search problem $P$ with costs $\textrm{Cost}$ is denoted $P_{\textrm{rel}}$ with costs $\textrm{Cost}_{\textrm{rel}}$, and satisfies the identity:

$$\boxed{\textrm{Cost}_{\textrm{rel}}(s,a)\leqslant\textrm{Cost}(s,a)}$$

Relaxed heuristic Given a relaxed search problem $P_{\textrm{rel}}$, we define the relaxed heuristic $h(s)=\textrm{FutureCost}_{\textrm{rel}}(s)$ as the cost of a minimum cost path from $s$ to an end state in the graph of costs $\textrm{Cost}_{\textrm{rel}}(s,a)$.


Consistency of relaxed heuristics Let $P_{\textrm{rel}}$ be a given relaxed problem. By theorem, we have:

$$\boxed{h(s)=\textrm{FutureCost}_{\textrm{rel}}(s)\Longrightarrow h(s)\textrm{ consistent}}$$

Tradeoff when choosing heuristic We have to balance two aspects in choosing a heuristic: computational efficiency ($h(s)=\textrm{FutureCost}_{\textrm{rel}}(s)$ must be easy to compute, e.g. have a closed form or lead to an easier, independent search) and quality of approximation ($h(s)$ should be close to $\textrm{FutureCost}(s)$, which means not removing too many constraints).


Max heuristic Let $h_1(s)$, $h_2(s)$ be two heuristics. We have the following property:

$$\boxed{h_1(s),\textrm{ }h_2(s)\textrm{ consistent}\Longrightarrow h(s)=\max\{h_1(s),\textrm{ }h_2(s)\}\textrm{ consistent}}$$

Markov decision processes

In this section, we assume that performing action $a$ from state $s$ can lead to several states $s_1',s_2',...$ in a probabilistic manner. In order to find our way between an initial state and an end state, our objective will be to find the maximum value policy by using Markov decision processes that help us cope with randomness and uncertainty.

Notations

Definition The objective of a Markov decision process is to maximize rewards. It is defined with: a starting state $s_\textrm{start}$, possible actions $\textrm{Actions}(s)$ from state $s$, transition probabilities $T(s,a,s')$ from $s$ to $s'$ with action $a$, rewards $\textrm{Reward}(s,a,s')$ from $s$ to $s'$ with action $a$, whether an end state was reached $\textrm{IsEnd}(s)$, and a discount factor $0\leqslant\gamma\leqslant1$.

Markov Decision Process

Transition probabilities The transition probability $T(s,a,s')$ specifies the probability of going to state $s'$ after action $a$ is taken in state $s$. Each $s' \mapsto T(s,a,s')$ is a probability distribution, which means that:

$$\forall s,a,\quad\boxed{\displaystyle\sum_{s'\in\textrm{ States}}T(s,a,s')=1}$$

Policy A policy $\pi$ is a function that maps each state $s$ to an action $a$, i.e.

$$\boxed{\pi : s \mapsto a}$$

Utility The utility of a path $(s_0, ..., s_k)$ is the discounted sum of the rewards on that path. In other words,

$$\boxed{u(s_0,...,s_k)=\sum_{i=1}^{k}r_i\gamma^{i-1}}$$
Utility

The figure above is an illustration of the case $k=4$.


Q-value The $Q$-value of a policy $\pi$ at state $s$ with action $a$, also denoted $Q_{\pi}(s,a)$, is the expected utility from state $s$ after taking action $a$ and then following policy $\pi$. It is defined as follows:

$$\boxed{Q_{\pi}(s,a)=\sum_{s'\in\textrm{ States}}T(s,a,s')\left[\textrm{Reward}(s,a,s')+\gamma V_\pi(s')\right]}$$

Value of a policy The value of a policy $\pi$ from state $s$, also denoted $V_{\pi}(s)$, is the expected utility obtained by following policy $\pi$ from state $s$ over random paths. It is defined as follows:

$$\boxed{V_\pi(s)=Q_\pi(s,\pi(s))}$$

Remark: $V_\pi(s)$ is equal to 0 if $s$ is an end state.


Applications

Policy evaluation Given a policy $\pi$, policy evaluation is an iterative algorithm that aims at estimating $V_\pi$. It is done as follows: we initialize the value of each state with $V_\pi^{(0)}(s)\longleftarrow0$, then at each iteration $t$ and for each state $s$, we update $V_\pi^{(t)}(s)\longleftarrow Q_\pi^{(t-1)}(s,\pi(s))$, where $Q_\pi^{(t-1)}$ is computed from $V_\pi^{(t-1)}$ with the formula above.

Remark: by noting $S$ the number of states, $A$ the number of actions per state, $S'$ the number of successors and $T_{\textrm{PE}}$ the number of iterations, the time complexity is $\mathcal{O}(T_{\textrm{PE}}SS')$.
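
A minimal sketch of iterative policy evaluation under a hypothetical MDP interface: `transitions(s, a)` is assumed to return a list of `(s_next, probability)` pairs, and `reward(s, a, s_next)` the associated reward.

```python
def policy_evaluation(states, pi, transitions, reward, gamma, is_end, num_iters):
    """Iterative policy evaluation: V(s) <- Q_pi(s, pi(s)) repeated num_iters times."""
    V = {s: 0.0 for s in states}
    for _ in range(num_iters):
        new_V = {}
        for s in states:
            if is_end(s):
                new_V[s] = 0.0
                continue
            a = pi(s)
            new_V[s] = sum(p * (reward(s, a, s_next) + gamma * V[s_next])
                           for s_next, p in transitions(s, a))
        V = new_V
    return V
```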


Optimal Q-value The optimal $Q$-value $Q_{\textrm{opt}}(s,a)$ of state $s$ with action $a$ is defined to be the maximum $Q$-value attained by any policy. It is computed as follows:

$$\boxed{Q_{\textrm{opt}}(s,a)=\sum_{s'\in\textrm{ States}}T(s,a,s')\left[\textrm{Reward}(s,a,s')+\gamma V_\textrm{opt}(s')\right]}$$

Optimal value The optimal value $V_{\textrm{opt}}(s)$ of state $s$ is defined as being the maximum value attained by any policy. It is computed as follows:

$$\boxed{V_{\textrm{opt}}(s)=\underset{a\in\textrm{ Actions}(s)}{\textrm{max}}Q_\textrm{opt}(s,a)}$$

Optimal policy The optimal policy $\pi_{\textrm{opt}}$ is defined as being the policy that leads to the optimal values. It is defined by:

$$\forall s,\quad\boxed{\pi_{\textrm{opt}}(s)=\underset{a\in\textrm{ Actions}(s)}{\textrm{argmax}}Q_\textrm{opt}(s,a)}$$

Value iteration Value iteration is an algorithm that finds the optimal value $V_{\textrm{opt}}$ as well as the optimal policy $\pi_{\textrm{opt}}$. It is done as follows: we initialize the value of each state with $V_\textrm{opt}^{(0)}(s)\longleftarrow0$, then at each iteration $t$ and for each state $s$, we update $V_\textrm{opt}^{(t)}(s)\longleftarrow\underset{a\in\textrm{ Actions}(s)}{\textrm{max}}Q_\textrm{opt}^{(t-1)}(s,a)$, where $Q_\textrm{opt}^{(t-1)}$ is computed from $V_\textrm{opt}^{(t-1)}$ with the formula above.

Remark: if either $\gamma < 1$ or the MDP graph is acyclic, then the value iteration algorithm is guaranteed to converge to the correct answer.
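
A minimal sketch of value iteration under the same hypothetical MDP interface as the policy evaluation sketch above; it returns both the value estimates and the greedy policy extracted from them.

```python
def value_iteration(states, actions, transitions, reward, gamma, is_end, num_iters):
    """Value iteration: V_opt(s) <- max_a Q_opt(s, a), repeated num_iters times."""
    V = {s: 0.0 for s in states}

    def Q(s, a):
        return sum(p * (reward(s, a, s_next) + gamma * V[s_next])
                   for s_next, p in transitions(s, a))

    for _ in range(num_iters):
        # The comprehension reads the previous V, then rebinds it once fully evaluated.
        V = {s: (0.0 if is_end(s) else max(Q(s, a) for a in actions(s))) for s in states}

    pi_opt = {s: (None if is_end(s) else max(actions(s), key=lambda a: Q(s, a))) for s in states}
    return V, pi_opt
```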


When transitions and rewards are unknown

Now, let's assume that the transition probabilities and the rewards are unknown.

Model-based Monte Carlo The model-based Monte Carlo method aims at estimating $T(s,a,s')$ and $\textrm{Reward}(s,a,s')$ using Monte Carlo simulation with:

$$\boxed{\widehat{T}(s,a,s')=\frac{\#\textrm{ times }(s,a,s')\textrm{ occurs}}{\#\textrm{ times }(s,a)\textrm{ occurs}}}$$
and
$$\boxed{\widehat{\textrm{Reward}}(s,a,s')=r\textrm{ in }(s,a,r,s')}$$
These estimations will then be used to deduce $Q$-values, including $Q_\pi$ and $Q_\textrm{opt}$.

Remark: model-based Monte Carlo is said to be off-policy, because the estimation does not depend on the exact policy.
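
A minimal sketch of these counting estimates, assuming each episode is given as a list of `(s, a, r, s_next)` tuples collected while interacting with the environment and that rewards are deterministic given `(s, a, s_next)`:

```python
from collections import defaultdict

def model_based_monte_carlo(episodes):
    """Estimate transition probabilities and rewards from observed (s, a, r, s_next) tuples."""
    counts_sa = defaultdict(int)       # number of times (s, a) occurs
    counts_sas = defaultdict(int)      # number of times (s, a, s_next) occurs
    rewards = {}                       # observed reward for (s, a, s_next)
    for episode in episodes:
        for s, a, r, s_next in episode:
            counts_sa[(s, a)] += 1
            counts_sas[(s, a, s_next)] += 1
            rewards[(s, a, s_next)] = r
    T_hat = {key: counts_sas[key] / counts_sa[(key[0], key[1])] for key in counts_sas}
    return T_hat, rewards
```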


Model-free Monte Carlo The model-free Monte Carlo method aims at directly estimating $Q_{\pi}$, as follows:

$$\boxed{\widehat{Q}_\pi(s,a)=\textrm{average of }u_t\textrm{ where }s_{t-1}=s, a_t=a}$$
where $u_t$ denotes the utility starting at step $t$ of a given episode.

Remark: model-free Monte Carlo is said to be on-policy, because the estimated value is dependent on the policy π\pi used to generate the data.


Equivalent formulation By introducing the constant $\eta=\frac{1}{1+(\#\textrm{ updates to }(s,a))}$ and for each $(s,a,u)$ of the training set, the update rule of model-free Monte Carlo has a convex combination formulation:

$$\boxed{\widehat{Q}_\pi(s,a)\leftarrow(1-\eta)\widehat{Q}_\pi(s,a)+\eta u}$$
as well as a stochastic gradient formulation:
$$\boxed{\widehat{Q}_\pi(s,a)\leftarrow\widehat{Q}_\pi(s,a) - \eta (\widehat{Q}_\pi(s,a) - u)}$$


SARSA State-action-reward-state-action (SARSA) is a bootstrapping method estimating $Q_\pi$ by using both raw data and estimates as part of the update rule. For each $(s,a,r,s',a')$, we have:

$$\boxed{\widehat{Q}_\pi(s,a)\longleftarrow(1-\eta)\widehat{Q}_\pi(s,a)+\eta\Big[r+\gamma\widehat{Q}_\pi(s',a')\Big]}$$

Remark: the SARSA estimate is updated on the fly as opposed to the model-free Monte Carlo one where the estimate can only be updated at the end of the episode.
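
A minimal sketch of one SARSA update, assuming the estimate is kept in a plain dictionary keyed by `(state, action)` with missing entries treated as 0:

```python
def sarsa_update(Q_hat, s, a, r, s_next, a_next, gamma, eta):
    """Apply one SARSA update to Q_hat in place, bootstrapping on Q_hat(s', a')."""
    target = r + gamma * Q_hat.get((s_next, a_next), 0.0)
    Q_hat[(s, a)] = (1 - eta) * Q_hat.get((s, a), 0.0) + eta * target
```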


Q-learning $Q$-learning is an off-policy algorithm that produces an estimate for $Q_\textrm{opt}$. On each $(s,a,r,s',a')$, we have:

$$\boxed{\widehat{Q}_{\textrm{opt}}(s,a)\leftarrow(1-\eta)\widehat{Q}_{\textrm{opt}}(s,a)+\eta\Big[r+\gamma\underset{a'\in\textrm{ Actions}(s')}{\textrm{max}}\widehat{Q}_{\textrm{opt}}(s',a')\Big]}$$
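
As a sketch, the corresponding update with a dictionary-based estimate could look as follows; `next_actions` stands for $\textrm{Actions}(s')$, with an empty list treated as an end state:

```python
def q_learning_update(Q_hat, s, a, r, s_next, next_actions, gamma, eta):
    """Apply one Q-learning update to Q_hat in place, bootstrapping on max_a' Q_hat(s', a')."""
    V_next = max((Q_hat.get((s_next, a_next), 0.0) for a_next in next_actions), default=0.0)
    Q_hat[(s, a)] = (1 - eta) * Q_hat.get((s, a), 0.0) + eta * (r + gamma * V_next)
```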

Epsilon-greedy The epsilon-greedy policy is an algorithm that balances exploration with probability $\epsilon$ and exploitation with probability $1-\epsilon$. For a given state $s$, the policy $\pi_{\textrm{act}}$ is computed as follows:

$$\boxed{\pi_\textrm{act}(s)=\left\{\begin{array}{ll}\underset{a\in\textrm{ Actions}(s)}{\textrm{argmax }}\widehat{Q}_\textrm{opt}(s,a) & \textrm{with proba }1-\epsilon\\\textrm{random from Actions}(s) & \textrm{with proba }\epsilon\end{array}\right.}$$
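
A minimal sketch of this policy, assuming `actions(s)` returns a non-empty list and the estimate `Q_hat` is a dictionary keyed by `(state, action)`:

```python
import random

def epsilon_greedy(s, actions, Q_hat, epsilon):
    """Explore with probability epsilon, otherwise act greedily under the current estimate."""
    if random.random() < epsilon:
        return random.choice(actions(s))            # exploration
    return max(actions(s), key=lambda a: Q_hat.get((s, a), 0.0))   # exploitation
```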

Game playing

In games (e.g. chess, backgammon, Go), other agents are present and need to be taken into account when constructing our policy.

Game tree A game tree is a tree that describes the possibilities of a game. In particular, each node is a decision point for a player and each root-to-leaf path is a possible outcome of the game.


Two-player zero-sum game It is a game where each state is fully observed and such that players take turns. It is defined with: a starting state $s_\textrm{start}$, possible actions $\textrm{Actions}(s)$ from state $s$, successors $\textrm{Succ}(s,a)$ of state $s$ with action $a$, whether an end state was reached $\textrm{IsEnd}(s)$, the agent's utility $\textrm{Utility}(s)$ at end state $s$, and the player $\textrm{Player}(s)$ who controls state $s$.

Remark: we will assume that the opponent's utility is the opposite of the agent's.


Types of policies There are two types of policies: deterministic policies, noted $\pi_p(s)$, which give the action that player $p$ takes in state $s$, and stochastic policies, noted $\pi_p(s,a)\in[0,1]$, which give the probability that player $p$ takes action $a$ in state $s$.


Expectimax For a given state $s$, the expectimax value $V_{\textrm{exptmax}}(s)$ is the maximum expected utility of any agent policy when playing with respect to a fixed and known opponent policy $\pi_{\textrm{opp}}$. It is computed as follows:

$$\boxed{V_{\textrm{exptmax}}(s)=\left\{\begin{array}{ll}\textrm{Utility}(s) & \textrm{IsEnd}(s)\\\underset{a\in\textrm{Actions}(s)}{\textrm{max}}V_{\textrm{exptmax}}(\textrm{Succ}(s,a)) & \textrm{Player}(s)=\textrm{agent}\\\displaystyle\sum_{a\in\textrm{Actions}(s)}\pi_{\textrm{opp}}(s,a)V_{\textrm{exptmax}}(\textrm{Succ}(s,a)) & \textrm{Player}(s)=\textrm{opp}\end{array}\right.}$$

Remark: expectimax is the analog of value iteration for MDPs.

Expectimax
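
A minimal sketch of the expectimax recursion over a hypothetical `game` object exposing `actions`, `succ`, `is_end`, `utility` and `player` (with the convention that `player(s)` returns the string `'agent'` or `'opp'`), against a stochastic opponent policy `pi_opp(s, a)`:

```python
def expectimax_value(s, game, pi_opp):
    """Expectimax value of state s against a fixed, known opponent policy pi_opp(s, a)."""
    if game.is_end(s):
        return game.utility(s)
    values = [expectimax_value(game.succ(s, a), game, pi_opp) for a in game.actions(s)]
    if game.player(s) == 'agent':
        return max(values)
    # Opponent node: take the expectation under the opponent's stochastic policy.
    return sum(pi_opp(s, a) * v for a, v in zip(game.actions(s), values))
```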

Minimax The goal of minimax policies is to find an optimal policy against an adversary by assuming the worst case, i.e. that the opponent is doing everything to minimize the agent's utility. It is done as follows:

$$\boxed{V_{\textrm{minimax}}(s)=\left\{\begin{array}{ll}\textrm{Utility}(s) & \textrm{IsEnd}(s)\\\underset{a\in\textrm{Actions}(s)}{\textrm{max}}V_{\textrm{minimax}}(\textrm{Succ}(s,a)) & \textrm{Player}(s)=\textrm{agent}\\\underset{a\in\textrm{Actions}(s)}{\textrm{min}}V_{\textrm{minimax}}(\textrm{Succ}(s,a)) & \textrm{Player}(s)=\textrm{opp}\end{array}\right.}$$

Remark: we can extract $\pi_{\textrm{max}}$ and $\pi_{\textrm{min}}$ from the minimax value $V_{\textrm{minimax}}$.

Minimax

Minimax properties By noting $V$ the value function, there are 3 properties around minimax to have in mind: (1) against the minimax opponent $\pi_\textrm{min}$, the agent cannot do better than $\pi_\textrm{max}$, i.e. $\forall\pi,\ V(\pi,\pi_\textrm{min})\leqslant V(\pi_\textrm{max},\pi_\textrm{min})$; (2) against any opponent $\pi$, the minimax agent does at least as well as against $\pi_\textrm{min}$, i.e. $\forall\pi,\ V(\pi_\textrm{max},\pi_\textrm{min})\leqslant V(\pi_\textrm{max},\pi)$; (3) if the opponent's policy $\pi$ is known, expectimax against it does at least as well as minimax, i.e. $V(\pi_\textrm{max},\pi)\leqslant V(\pi_\textrm{exptmax},\pi)$.

In the end, we have the following relationship:
$$\boxed{V(\pi_{\textrm{exptmax}},\pi_{\textrm{min}})\leqslant V(\pi_{\textrm{max}},\pi_{\textrm{min}})\leqslant V(\pi_{\textrm{max}},\pi)\leqslant V(\pi_{\textrm{exptmax}},\pi)}$$


Speeding up minimax

Evaluation function An evaluation function is a domain-specific and approximate estimate of the value $V_{\textrm{minimax}}(s)$. It is denoted $\textrm{Eval}(s)$.

Remark: $\textrm{FutureCost}(s)$ is the analogous quantity for search problems.


Alpha-beta pruning Alpha-beta pruning is a domain-general exact method optimizing the minimax algorithm by avoiding the unnecessary exploration of parts of the game tree. To do so, each player keeps track of the best value they can hope for (stored in $\alpha$ for the maximizing player and in $\beta$ for the minimizing player). At a given step, the condition $\beta < \alpha$ means that the optimal path is not going to be in the current branch, as the earlier player had a better option at their disposal.

Alpha beta pruning
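
A minimal sketch of minimax with alpha-beta pruning, reusing the same hypothetical `game` object as in the expectimax sketch above:

```python
def minimax_alphabeta(s, game, alpha=float('-inf'), beta=float('inf')):
    """Minimax value of state s with alpha-beta pruning.
    alpha: best value guaranteed so far for the maximizing agent; beta: same for the minimizer."""
    if game.is_end(s):
        return game.utility(s)
    if game.player(s) == 'agent':                    # maximizing player
        value = float('-inf')
        for a in game.actions(s):
            value = max(value, minimax_alphabeta(game.succ(s, a), game, alpha, beta))
            alpha = max(alpha, value)
            if beta <= alpha:                        # the other player already has a better option: prune
                break
        return value
    else:                                            # minimizing opponent
        value = float('inf')
        for a in game.actions(s):
            value = min(value, minimax_alphabeta(game.succ(s, a), game, alpha, beta))
            beta = min(beta, value)
            if beta <= alpha:
                break
        return value
```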

TD learning Temporal difference (TD) learning is used when we don't know the transitions/rewards. The value is based on an exploration policy. To be able to use it, we need to know the rules of the game, i.e. $\textrm{Succ}(s,a)$. For each $(s,a,r,s')$, the update is done as follows:

$$\boxed{w\longleftarrow w-\eta\big[V(s,w)-(r+\gamma V(s',w))\big]\nabla_wV(s,w)}$$
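
As a sketch, here is one TD update for a linear value function $V(s,w)=w\cdot\phi(s)$, where the feature map `features(s)` is a hypothetical function returning a list of feature values; with a linear model, $\nabla_w V(s,w)$ is simply $\phi(s)$.

```python
def td_update(w, features, s, r, s_next, gamma, eta):
    """Return the updated weight list after one TD learning step on (s, r, s_next)."""
    phi, phi_next = features(s), features(s_next)
    v = sum(wi * fi for wi, fi in zip(w, phi))             # V(s, w)
    v_next = sum(wi * fi for wi, fi in zip(w, phi_next))   # V(s', w)
    td_error = v - (r + gamma * v_next)
    return [wi - eta * td_error * fi for wi, fi in zip(w, phi)]
```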

Simultaneous games

Unlike turn-based games, simultaneous games have no ordering on the players' moves.


Single-move simultaneous game Let there be two players $A$ and $B$, with given possible actions. We note $V(a,b)$ to be $A$'s utility if $A$ chooses action $a$ and $B$ chooses action $b$. $V$ is called the payoff matrix.


Strategies There are two main types of strategies: pure strategies, where a single action is played, and mixed strategies, which are probability distributions over actions.


Game evaluation The value of the game $V(\pi_A,\pi_B)$ when player $A$ follows $\pi_A$ and player $B$ follows $\pi_B$ is such that:

$$\boxed{V(\pi_A,\pi_B)=\sum_{a,b}\pi_A(a)\pi_B(b)V(a,b)}$$
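
A minimal sketch of this expected payoff, assuming the mixed strategies are given as dictionaries mapping actions to probabilities and the payoff matrix as a dictionary keyed by action pairs:

```python
def game_value(pi_A, pi_B, V):
    """Expected payoff for A when A plays mixed strategy pi_A and B plays pi_B."""
    return sum(pi_A[a] * pi_B[b] * V[(a, b)] for a in pi_A for b in pi_B)
```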

Minimax theorem With $\pi_A,\pi_B$ ranging over mixed strategies, for every simultaneous two-player zero-sum game with a finite number of actions, we have:

$$\boxed{\max_{\pi_A}\min_{\pi_B}V(\pi_A,\pi_B)=\min_{\pi_B}\max_{\pi_A}V(\pi_A,\pi_B)}$$

Non-zero-sum games

Payoff matrix We define $V_p(\pi_A,\pi_B)$ to be the utility for player $p$.


Nash equilibrium A Nash equilibrium is a pair $(\pi_A^*,\pi_B^*)$ such that no player has an incentive to change their strategy. We have:

$$\boxed{\forall \pi_A, V_A(\pi_A^*,\pi_B^*)\geqslant V_A(\pi_A,\pi_B^*)}\quad\textrm{and}\quad\boxed{\forall \pi_B, V_B(\pi_A^*,\pi_B^*)\geqslant V_B(\pi_A^*,\pi_B)}$$

Remark: in any finite-player game with a finite number of actions, there exists at least one Nash equilibrium.