## Motivation

The study of games in AI is useful because games represent adversarial environments: the world is not controlled solely by the agent, and can change due to the actions of other agents. Those agents may have different objectives, leading to competition or cooperation.

## Games

There are many types of games:

• Simultaneous, sequential, iterated, etc.
• Single or multiplayer
• Stochastic properties
• Complete vs Incomplete Information (partially observable)

Games are also applicable in economic contexts such as the pricing of goods, auctions, and contract negotiations.

In this course (and in the text), we focus on sequential games – tic-tac-toe, checkers, chess, Go, etc. – but there exist other interesting games, such as simultaneous games, which branch into game theory.

### Simultaneous Games

Multiple agents are allowed to act at the same time, each choosing from a discrete action space. Some examples include the prisoner’s dilemma, rock paper scissors, and a game of chicken.

Prisoners' Dilemma

In this example of the prisoners’ dilemma, we have two agents that are each making a choice simultaneously.

|  | Prisoner B stays silent | Prisoner B betrays |
| --- | --- | --- |
| **Prisoner A stays silent** | Both A and B serve 1 year | B goes free, A serves 5 years |
| **Prisoner A betrays** | A goes free, B serves 5 years | Both A and B serve 3 years |

In a simultaneous game, each agent can unilaterally decide what to do by finding strategies that lead to states in Nash equilibrium – states from which no player can improve their payoff by changing strategy alone.
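As a concrete sketch of this idea (the code, including the encoding of utilities as negated prison years, is my own illustration, not from the notes), we can find the pure-strategy Nash equilibria of the table above by brute force – checking, for each cell, whether either player could do better by unilaterally deviating:

```python
SILENT, BETRAY = 0, 1
# payoff[(a, b)] = (utility to A, utility to B); utilities are
# negated prison years, so higher is better.
payoff = {
    (SILENT, SILENT): (-1, -1),
    (SILENT, BETRAY): (-5, 0),
    (BETRAY, SILENT): (0, -5),
    (BETRAY, BETRAY): (-3, -3),
}

def is_nash(a, b):
    ua, ub = payoff[(a, b)]
    # A must not gain by switching its action; likewise B.
    no_better_a = all(payoff[(a2, b)][0] <= ua for a2 in (SILENT, BETRAY))
    no_better_b = all(payoff[(a, b2)][1] <= ub for b2 in (SILENT, BETRAY))
    return no_better_a and no_better_b

equilibria = [(a, b) for a in (SILENT, BETRAY)
              for b in (SILENT, BETRAY) if is_nash(a, b)]
# The only pure-strategy equilibrium is mutual betrayal, even though
# mutual silence gives both players a better outcome.
```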

### Sequential Games

In sequential games, players take turns performing actions, and each player is assigned a utility function.

Utility Function

A utility function assigns a score to a given state. The score should reward the agent for achieving its goals and penalize the cost of actions or resources used.

Some examples would be money in poker; in tic-tac-toe, we use +1 for a win, -1 for a loss, and 0 otherwise.

Zero-sum Game

Zero-sum games are games in which the utilities of all players sum to zero in every state, i.e.

$$\sum_i u_i(s) = 0.$$

In the simplest case, a 2-player zero-sum game,

$$u_1(s) = -u_2(s).$$

In a 2-player zero-sum game such as tic-tac-toe, how can we decide what move to make?

One method would be writing a set of rules that encode a strategy.

Another would be using a systematic search, with a look-ahead of each possible action and the opponent’s reaction.

We can assume that the opponent’s utility is the opposite of ours; hence, we can anticipate the moves the opponent will make. The opponent will influence the game in the manner that is best for them – which, for us, is the worst.

We must apply this thinking recursively: to simulate the opponent’s reasoning, we must consider the opponent’s response to our move, our response to theirs, and so on.

Minimax Search

In minimax search, we label each layer of the search tree, in alternating order, min or max; starting with max at the root.

Then, we define the $\operatorname{minimax}$ value for each state $s$ as follows,

$$\text{minimax}(s) = \begin{cases} u_i(s) & \text{if s is terminal},\\ \displaystyle\max_{s' \in \text{Suc}(s)} \text{minimax}(s') & \text{if s is a max node},\\ \displaystyle\min_{s' \in \text{Suc}(s)} \text{minimax}(s') & \text{if s is a min node}, \end{cases}$$

where $\text{Suc}(s)$ represents the set of succeeding states of $s.$

At the root node, we perform $\argmax_{s' \in \text{Suc}(s)} \text{minimax}(s');$ this yields the action that leads to the successor with the highest minimax value, i.e. the highest payoff under optimal play.

Minimax Search

```
function MinimaxSearch(game, state) -> Action {
    let player := game.current_player(state)
    let value, move := MaxValue(game, state)

    return move
}

function MaxValue(game, state) -> (Utility, Action) {
    if game.is_terminal(state)
        return (game.utility(state), None)

    let v, move := -Infinity, None
    for let action in game.actions(state) {
        let v2, a2 := MinValue(game, game.result(state, action)) // Call opposite function

        if v2 > v
            v, move := v2, action
    }

    return v, move
}

function MinValue(game, state) -> (Utility, Action) {
    if game.is_terminal(state)
        return (game.utility(state), None)

    let v, move := Infinity, None
    for let action in game.actions(state) {
        let v2, a2 := MaxValue(game, game.result(state, action)) // Call opposite function

        if v2 < v
            v, move := v2, action
    }

    return v, move
}
```


Note that minimax only determines the current player’s next move. We cannot expect the opponent to always make the rational or optimal move, so we must recompute the game tree from the state reached after the opponent actually moves. Minimax does not determine the entire sequence of play, as we cannot force the other player’s choices – we can only assume that the opponent will always make the move that is optimal for them.
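To make the recursion concrete, here is a minimal runnable Python version on a hand-built two-ply game tree (the tree shape and leaf utilities are invented for illustration):

```python
# Internal nodes map to lists of successors; leaves map to utilities.
tree = {
    "root": ["a", "b"],          # max node
    "a": ["a1", "a2"],           # min nodes
    "b": ["b1", "b2"],
}
utility = {"a1": 3, "a2": 5, "b1": 2, "b2": 9}

def minimax(state, maximizing=True):
    if state in utility:         # terminal state: return its utility
        return utility[state]
    values = [minimax(s, not maximizing) for s in tree[state]]
    return max(values) if maximizing else min(values)

def best_move(state):
    # argmax over successors of the root (a max node)
    return max(tree[state], key=lambda s: minimax(s, maximizing=False))

# minimax("root") == 3: the min under "a" is 3, the min under "b" is 2,
# so the max player prefers "a".
```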

### Minimax Analysis

The issue with minimax is that it is impractical to apply to most games, as the search space is too large.

Take chess: an average game is on the order of 70 moves, and the average branching factor is around 35, so the game tree is on the order of $35^{70}$ total states. As such, it is impossible to keep track of all the descendant states in order to propagate the minimax results up the tree.
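A quick back-of-the-envelope check of that magnitude:

```python
import math

# With an average branching factor of about 35 and games of about
# 70 moves, the game tree has roughly 35^70 nodes.
digits = 70 * math.log10(35)
# digits comes out near 108, i.e. 35^70 is on the order of 10^108 —
# far beyond anything an exhaustive search could enumerate.
```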

One solution is to use intelligent pruning to reduce the search space, as sometimes we may use inference to avoid parts of the search space.

## $\alpha/\beta$-pruning

To improve upon minimax, we can – at each node – keep track of two values, $\alpha$ and $\beta,$ in addition to the minimax value.

$\alpha/\beta$-pruning

$\alpha$ and $\beta$ represent the lower and upper bounds on the value that $\text{minimax}(s)$ could eventually take.

At each node, we first initialize $[\alpha, \beta] = [-\infty, +\infty].$ As each child $s' \in \text{Suc}(s)$ is processed,

• at max nodes, $\alpha \leftarrow \max(\alpha, \text{minimax}(s'))$
• at min nodes, $\beta \leftarrow \min(\beta, \text{minimax}(s'))$

When the $[\alpha, \beta]$ interval of a node no longer overlaps with its parent’s, we prune that node and its descendants from the search.

```
function AlphaBetaSearch(game, state) -> Action {
    let player := game.current_player(state)
    let value, move := MaxValue(game, state, -Infinity, Infinity)

    return move
}

function MaxValue(game, state, alpha, beta) -> (Utility, Action) {
    if game.is_terminal(state)
        return (game.utility(state), None)

    let v, move := -Infinity, None
    for let action in game.actions(state) {
        let v2, a2 := MinValue(game, game.result(state, action), alpha, beta)

        if v2 > v {
            v, move := v2, action
            alpha := max(alpha, v)
        }

        // Prune if score reaches or exceeds the upper bound
        if v >= beta
            return v, move
    }

    return v, move
}

function MinValue(game, state, alpha, beta) -> (Utility, Action) {
    if game.is_terminal(state)
        return (game.utility(state), None)

    let v, move := Infinity, None
    for let action in game.actions(state) {
        let v2, a2 := MaxValue(game, game.result(state, action), alpha, beta)

        if v2 < v {
            v, move := v2, action
            beta := min(beta, v)
        }

        // Prune if score reaches or falls below the lower bound
        if v <= alpha
            return v, move
    }

    return v, move
}
```
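A runnable Python sketch of the same pruning on an explicit game tree (the tree and leaf values are illustrative); a visit log shows that pruning skips nodes plain minimax would have to evaluate:

```python
tree = {
    "root": ["a", "b", "c"],     # max node over three min nodes
    "a": ["a1", "a2"], "b": ["b1", "b2"], "c": ["c1", "c2"],
}
utility = {"a1": 3, "a2": 12, "b1": 2, "b2": 8, "c1": 14, "c2": 1}
visited = []

def alphabeta(state, alpha, beta, maximizing):
    visited.append(state)
    if state in utility:
        return utility[state]
    if maximizing:
        v = float("-inf")
        for s in tree[state]:
            v = max(v, alphabeta(s, alpha, beta, False))
            alpha = max(alpha, v)
            if v >= beta:        # the min parent will never allow this
                break
        return v
    else:
        v = float("inf")
        for s in tree[state]:
            v = min(v, alphabeta(s, alpha, beta, True))
            beta = min(beta, v)
            if v <= alpha:       # the max parent already has better
                break
        return v

value = alphabeta("root", float("-inf"), float("inf"), True)
# value == 3; node "b2" is never visited, because after seeing b1 = 2
# the min node "b" can yield at most 2 < alpha = 3.
```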


Another solution is to use a depth limit in searching the game tree, à la iterative deepening.

This requires a board-evaluation function to score internal nodes.

Board evaluation function

The board-evaluation function assigns a score to nonterminal states, which is an estimate of the probability of winning or a heuristic of the expected payoff from each state.

We can perform minimax with $\alpha/\beta$-pruning, using the board-evaluation function instead of the utility function, down to a specified depth.

The depth limit depends on the time available; using this approach, a depth of 2-6 corresponds to an amateur level in chess.
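A minimal sketch of the cutoff idea (plain minimax for brevity; the toy tree, terminal utilities, and heuristic values are all invented): when the depth limit is reached before a terminal state, a board-evaluation function stands in for the true utility.

```python
tree = {"root": ["a", "b"], "a": ["a1", "a2"], "b": ["b1", "b2"]}
utility = {"a1": 1, "a2": -1, "b1": 0, "b2": 1}   # terminal payoffs
heuristic = {"a": 0.5, "b": -0.2}                 # estimates at cutoff

def dl_minimax(state, depth, maximizing):
    if state in utility:          # terminal: exact utility
        return utility[state]
    if depth == 0:                # cutoff: heuristic estimate
        return heuristic[state]
    values = [dl_minimax(s, depth - 1, not maximizing)
              for s in tree[state]]
    return max(values) if maximizing else min(values)

# With depth 1, the internal nodes "a" and "b" are scored by the
# heuristic, so the root sees max(0.5, -0.2) = 0.5. With depth 2, the
# search reaches the leaves and the true value max(min(1,-1), min(0,1))
# = 0 — the heuristic was misleading, which is why deeper is better.
```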

### Issues of Board Evaluation

Quiescent States

A quiescent state is one whose evaluation has stabilized – its value is unlikely to swing wildly over the next few moves.

A board-evaluation function should only be applied to quiescent states; if there are still large fluctuations in value, we must extend the search until the position has quiesced. One way to accomplish this is to use a nonuniform depth limit rather than a strict cutoff.

Horizon Effect

An opponent may perform a number of moves that forestall a bad outcome so that it occurs just outside of the depth limit – akin to being hidden behind the horizon.

Some trivial examples include moving a bishop back and forth to delay a capture, or repeatedly checking the opponent’s king.

Generally, it is difficult to detect moves that lead to the horizon effect and to mitigate it.

### Deep Blue

One of the first computers to reach a grandmaster rating was IBM’s Deep Blue, which defeated Garry Kasparov in 1997.

It was a supercomputer built using ASICs designed specifically for minimax search, capable of exploring over 100 million moves per second, down to a (nonuniform) depth of 10 to 12 moves into the future. It comprised 30 nodes of IBM’s RS/6000 SP computers; each node had 16 of these custom ASICs for generating moves and computing the board-evaluation function.

For states near the endgame, it used a lookup table of precomputed optimal moves.

Interestingly, if we look at the history of computers playing chess, their performance as measured by Elo rating has steadily increased with computing power.

### Connect4

In Connect4, players stack pieces in vertical columns; a player wins if they connect a line of 4 of their own pieces.

This was a food-for-thought/bonus assignment.

The search space of Connect4 is much larger than that of tic-tac-toe, so we must implement an iterative-deepening approach to minimax along with a board-evaluation function.

## Expectiminimax

In stochastic games, we do not have an exact measure of utility or value of a state.

Expectiminimax

If we interleave the min and max layers with a level of chance nodes, we can then perform expectiminimax.

At chance nodes, the score is the sum of the children’s values weighted by their probabilities – i.e. the expected outcome.

$$\text{expectiminimax}(s) = \begin{cases} u_i(s) & \text{if s is terminal,}\\ \displaystyle\max_{s' \in \text{Suc}(s)} \text{expectiminimax}(s') & \text{if s is a max node},\\ \displaystyle\min_{s' \in \text{Suc}(s)} \text{expectiminimax}(s') & \text{if s is a min node},\\ \displaystyle\sum_{s' \in \text{Suc}(s)} P(s')\,\text{expectiminimax}(s') & \text{if s is a chance node} \end{cases}$$
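A runnable sketch of this recursion (the tiny game, probabilities, and payoffs are invented for illustration): a max node chooses between two chance nodes, each mixing two terminal outcomes.

```python
# Each internal node carries its kind; chance children are (state, p).
tree = {
    "root": ("max", ["c1", "c2"]),
    "c1": ("chance", [("w1", 0.5), ("w2", 0.5)]),
    "c2": ("chance", [("w3", 0.9), ("w4", 0.1)]),
}
utility = {"w1": 10, "w2": -10, "w3": 2, "w4": 2}

def expectiminimax(state):
    if state in utility:
        return utility[state]
    kind, children = tree[state]
    if kind == "max":
        return max(expectiminimax(s) for s in children)
    if kind == "min":
        return min(expectiminimax(s) for s in children)
    # chance node: probability-weighted sum of child values
    return sum(p * expectiminimax(s) for s, p in children)

# c1 averages to 0.5*10 + 0.5*(-10) = 0, while c2 averages to 2,
# so the max player prefers c2, the gamble with the better expectation.
```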

## Monte Carlo Tree Search (MCTS)

Instead of exhaustively exploring a search tree, we can take random walks (rollouts) to terminal states.

The value of a state is taken as the statistical average outcome of the random walks that pass through it (“back-propagation” of outcomes).

We also keep track of the number of random walks $n$ passing through each node, and the variance $\sigma^2$ of their outcomes, to assess certainty.

```
function MonteCarloTreeSearch(state) -> Action {
    let tree := Node(state)

    while TimeRemaining() {
        let leaf := Select(tree)
        let child := Expand(leaf)
        let result := Simulate(child)

        BackPropagate(result, child)
    }

    return move in Actions(state) with highest number of playouts
}
```


There are many choices for how to make moves during a simulation – for instance, how do we select which states to start from?

• Do we want to explore or expand the tree?
• Do we want to refine the estimate of the value at good nodes, or have more certainty of bad nodes?
• Do we want to allow for the occasional suboptimal choice?
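One standard answer to the explore-vs-exploit questions above – not spelled out in these notes, but used by the common UCT variant of MCTS – is the UCB1 selection rule: pick the child maximizing its average value plus an exploration bonus that shrinks as the child accumulates visits.

```python
import math

def ucb1(total_value, visits, parent_visits, c=math.sqrt(2)):
    if visits == 0:
        return float("inf")            # always try unvisited children first
    exploit = total_value / visits     # average outcome through this child
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore

# A less-visited child with a slightly better average outranks a
# well-explored one, which is what drives exploration. The counts
# below are made up for illustration.
children = [(10.0, 20), (3.0, 4)]      # (total value, visit count)
parent_visits = 24
best = max(range(2), key=lambda i: ucb1(*children[i], parent_visits))
```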

There is also the choice of playout policy: making purely random moves is not realistic, so we can use an approximate strategy that selects or prefers moves a reasonable player would make.

### AlphaGo

Go is a game played using black and white stones on a 19 x 19 board; its branching factor – up to 361 – dwarfs that of chess (recall that chess averages around 35).

Google DeepMind created AlphaGo using a deep neural network consisting of 14 convolutional layers to represent the board’s valuation; they also used reinforcement learning and MCTS.

## Games with Imperfect Information

In tic-tac-toe or chess, the entire state is revealed to both players. In other games, some information is not observable to others such as poker or blackjack.

Note: in card games, the relevant uncertainty is not which card will be drawn next but what hands the other players hold.

In these games, the optimal move depends on information that is not accessible to us – the environment is partially observable.

The simplest view is to look at the payoff of actions averaged over all possible opponent hands (a probability distribution).
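A sketch of this averaging (the hands, probabilities, and payoffs are made up for illustration): the expected payoff of an action is its payoff against each possible opponent hand, weighted by that hand's probability.

```python
hands = {"strong": 0.2, "medium": 0.5, "weak": 0.3}
payoff = {
    ("call", "strong"): -4, ("call", "medium"): 1, ("call", "weak"): 3,
    ("fold", "strong"): -1, ("fold", "medium"): -1, ("fold", "weak"): -1,
}

def expected_payoff(action):
    # weight each outcome by the probability of the opponent's hand
    return sum(p * payoff[(action, h)] for h, p in hands.items())

# E[call] = 0.2*(-4) + 0.5*1 + 0.3*3 = 0.6 while E[fold] = -1, so
# calling is the better action on average against this distribution.
```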

We can also use inference from opponents’ actions to determine their hand.

More sophisticated methods exist in AI, such as partially observable Markov decision processes (POMDPs), to estimate and reason about “belief states”.

In some cases, there is value in taking actions whose primary purpose is to gain information – even at a cost; there is value in information for environments that are partially observable.