The average strategy is the average of strategies followed in each round,
for all $I \in \mathcal{I}, a \in A(I)$

$$\textcolor{cyan}{\bar{\sigma}^T_i(I)(a)} =
\frac{\sum_{t=1}^T \pi_i^{\sigma^t}(I)\textcolor{lightgreen}{\sigma^t(I)(a)}}{\sum_{t=1}^T \pi_i^{\sigma^t}(I)}$$

That is the mean regret of not playing with the optimal strategy.
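
To make the averaging concrete, here is a toy computation of $\textcolor{cyan}{\bar{\sigma}^T_i(I)(a)}$ for a single information set; the reach probabilities and per-iteration strategies below are made up purely for illustration.

# Made-up reach probabilities $\pi_i^{\sigma^t}(I)$ and strategies $\sigma^t(I)(a)$ for two iterations
reach = [0.8, 0.4]
strategies = [{'bet': 0.25, 'pass': 0.75},
              {'bet': 0.60, 'pass': 0.40}]
# Weighted average over iterations: numerator and denominator of the formula above
total_reach = sum(reach)
average_strategy = {a: sum(p * s[a] for p, s in zip(reach, strategies)) / total_reach
                    for a in ('bet', 'pass')}
# average_strategy == {'bet': 0.3666..., 'pass': 0.6333...}
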

### Counterfactual regret

**Counterfactual value** $\textcolor{pink}{v_i(\sigma, I)}$ is the expected utility for player $i$
if player $i$ tried to reach $I$ (took the actions leading to $I$ with a probability of $1$).

$$\textcolor{pink}{v_i(\sigma, I)} = \sum_{z \in Z_I} \pi^\sigma_{-i}(z[I]) \pi^\sigma(z[I], z) u_i(z)$$

where $Z_I$ is the set of terminal histories reachable from $I$,
and $z[I]$ is the prefix of $z$ up to $I$.
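
To make the definition concrete, here is a minimal sketch of computing $\textcolor{pink}{v_i(\sigma, I)}$ by enumerating terminal histories; `terminal_histories_from`, `prefix_up_to`, `opponent_reach`, `play_prob` and `utility` are hypothetical helpers, not part of the implementation below.

def counterfactual_value(sigma, I, i):
    # $v_i(\sigma, I) = \sum_{z \in Z_I} \pi^\sigma_{-i}(z[I]) \pi^\sigma(z[I], z) u_i(z)$
    # All helper functions used here are assumed, for illustration only.
    value = 0.
    for z in terminal_histories_from(I):            # $z \in Z_I$
        prefix = prefix_up_to(z, I)                 # $z[I]$
        value += (opponent_reach(sigma, prefix, i)  # $\pi^\sigma_{-i}(z[I])$
                  * play_prob(sigma, prefix, z)     # $\pi^\sigma(z[I], z)$
                  * utility(z, i))                  # $u_i(z)$
    return value
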

The **immediate counterfactual regret** is

$$R^T_{i,imm}(I) = \frac{1}{T} \max_{a \in A(I)} \sum_{t=1}^T
\Big(
\textcolor{pink}{v_i(\sigma^t |_{I \rightarrow a}, I)} - \textcolor{pink}{v_i(\sigma^t, I)}
\Big)$$
where $\sigma |_{I \rightarrow a}$ is the strategy profile $\sigma$ with the modification
of always taking action $a$ at information set $I$.
The strategy is calculated using regret matching.

The regret for each information set and action pair $\textcolor{orange}{R^T_i(I, a)}$ is maintained,

\begin{align}
\textcolor{coral}{r^t_i(I, a)} &=
\textcolor{pink}{v_i(\sigma^t |_{I \rightarrow a}, I)} - \textcolor{pink}{v_i(\sigma^t, I)}
\\
\textcolor{orange}{R^T_i(I, a)} &=
\frac{1}{T} \sum_{t=1}^T \textcolor{coral}{r^t_i(I, a)}
\end{align}

and the strategy is calculated with regret matching,

\begin{align}
\textcolor{lightgreen}{\sigma_i^{T+1}(I)(a)} =
\begin{cases}
\frac{\textcolor{orange}{R^{T,+}_i(I, a)}}{\sum_{a'\in A(I)}\textcolor{orange}{R^{T,+}_i(I, a')}},
& \text{if } \sum_{a'\in A(I)}\textcolor{orange}{R^{T,+}_i(I, a')} \gt 0 \\
\frac{1}{\lvert A(I) \rvert},
& \text{otherwise}
\end{cases}
\end{align}

where $\textcolor{orange}{R^{T,+}_i(I, a)} = \max \Big(\textcolor{orange}{R^T_i(I, a)}, 0 \Big)$
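
For example, one regret-matching step on made-up cumulative regrets (the actual implementation is in `calculate_strategy` below):

# Made-up cumulative regrets $R^T_i(I, a)$ for one information set
R = {'bet': 3.0, 'check': -1.0, 'fold': 1.0}
# Clip negative regrets at zero: $R^{T,+}_i(I, a)$
R_plus = {a: max(r, 0.) for a, r in R.items()}
denom = sum(R_plus.values())
if denom > 0:
    strategy = {a: r / denom for a, r in R_plus.items()}
else:
    # No positive regret: fall back to the uniform strategy
    strategy = {a: 1 / len(R_plus) for a in R_plus}
# strategy == {'bet': 0.75, 'check': 0.0, 'fold': 0.25}
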
The paper

### Monte Carlo CFR (MCCFR)

Computing $\textcolor{coral}{r^t_i(I, a)}$ requires expanding the full game tree
on each iteration.

The paper

Then we get the **sampled counterfactual value** for block $j$,

$$\textcolor{pink}{\tilde{v}(\sigma, I|j)} =
\sum_{z \in Q_j} \frac{1}{q(z)}
\pi^\sigma_{-i}(z[I]) \pi^\sigma(z[I], z) u_i(z)$$

The paper shows that

$$\mathbb{E}_{j \sim q_j} \Big[ \textcolor{pink}{\tilde{v}(\sigma, I|j)} \Big]
= \textcolor{pink}{v_i(\sigma, I)}$$

with a simple proof.
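
A sketch of that argument, assuming the blocks $Q_j$ together cover the terminal histories reachable from $I$, and writing $q(z) = \sum_{j : z \in Q_j} q_j$ for the probability that terminal history $z$ gets sampled:

\begin{align}
\mathbb{E}_{j \sim q_j} \Big[ \textcolor{pink}{\tilde{v}(\sigma, I|j)} \Big]
&= \sum_j q_j \sum_{z \in Q_j} \frac{1}{q(z)} \pi^\sigma_{-i}(z[I]) \pi^\sigma(z[I], z) u_i(z) \\
&= \sum_{z \in Z_I} \frac{\sum_{j: z \in Q_j} q_j}{q(z)} \pi^\sigma_{-i}(z[I]) \pi^\sigma(z[I], z) u_i(z) \\
&= \sum_{z \in Z_I} \pi^\sigma_{-i}(z[I]) \pi^\sigma(z[I], z) u_i(z)
= \textcolor{pink}{v_i(\sigma, I)}
\end{align}

Swapping the order of summation collects, for each $z$, exactly the blocks that contain it, and the $q(z)$ factors cancel.
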
Therefore we can sample a part of the game tree and calculate the regrets.
We calculate an estimate of regrets

$$
\textcolor{coral}{\tilde{r}^t_i(I, a)} =
\textcolor{pink}{\tilde{v}_i(\sigma^t |_{I \rightarrow a}, I)} - \textcolor{pink}{\tilde{v}_i(\sigma^t, I)}
$$

We use these estimates to update $\textcolor{orange}{R^T_i(I, a)}$ and calculate
the strategy $\textcolor{lightgreen}{\sigma_i^{T+1}(I)(a)}$ on each iteration.
Finally, we calculate the overall average strategy $\textcolor{cyan}{\bar{\sigma}^T_i(I)(a)}$.
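
Schematically, an iteration of the solver then looks roughly like the sketch below. The names mirror the implementation that follows, but this outline is illustrative rather than the actual code.

def solve(cfr, n_players, T):
    # Illustrative outline only, assuming a `cfr` object like the one implemented below
    for t in range(T):
        for i in range(n_players):
            # Walk part of the game tree, accumulating sampled regrets
            # $\tilde{r}^t_i(I, a)$ and reach-weighted strategies, and applying
            # regret matching at every visited information set
            cfr.walk_tree(cfr.create_new_history(), i, 1., 1.)
    # The answer is the average strategy $\bar{\sigma}^T_i(I)(a)$
    return {key: info_set.get_average_strategy() for key, info_set in cfr.info_sets.items()}
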

Here is a [Kuhn Poker](kuhn/index.html) implementation to try CFR on.

# Total regret of not taking each action $a \in A(I_i)$,
#
# \begin{align}
# \textcolor{coral}{\tilde{r}^t_i(I, a)} &=
# \textcolor{pink}{\tilde{v}_i(\sigma^t |_{I \rightarrow a}, I)} -
# \textcolor{pink}{\tilde{v}_i(\sigma^t, I)}
# \\
# \textcolor{orange}{R^T_i(I, a)} &=
# \frac{1}{T} \sum_{t=1}^T \textcolor{coral}{\tilde{r}^t_i(I, a)}
# \end{align}
#
# We maintain $T \textcolor{orange}{R^T_i(I, a)}$ instead of $\textcolor{orange}{R^T_i(I, a)}$
# since the $\frac{1}{T}$ term cancels out anyway when computing the strategy
# $\textcolor{lightgreen}{\sigma_i^{T+1}(I)(a)}$
regret: Dict[Action, float]
# We maintain the cumulative strategy
# $$\sum_{t=1}^T \pi_i^{\sigma^t}(I)\textcolor{lightgreen}{\sigma^t(I)(a)}$$
# to compute the overall average strategy
#
# $$\textcolor{cyan}{\bar{\sigma}^T_i(I)(a)} =
# \frac{\sum_{t=1}^T \pi_i^{\sigma^t}(I)\textcolor{lightgreen}{\sigma^t(I)(a)}}{\sum_{t=1}^T \pi_i^{\sigma^t}(I)}$$
cumulative_strategy: Dict[Action, float]

def __init__(self, key: str):
    # ...

def calculate_strategy(self):
    """
    Calculate current strategy using [regret matching](#RegretMatching).

    \begin{align}
    \textcolor{lightgreen}{\sigma_i^{T+1}(I)(a)} =
    \begin{cases}
    \frac{\textcolor{orange}{R^{T,+}_i(I, a)}}{\sum_{a'\in A(I)}\textcolor{orange}{R^{T,+}_i(I, a')}},
    & \text{if } \sum_{a'\in A(I)}\textcolor{orange}{R^{T,+}_i(I, a')} \gt 0 \\
    \frac{1}{\lvert A(I) \rvert},
    & \text{otherwise}
    \end{cases}
    \end{align}

    where $\textcolor{orange}{R^{T,+}_i(I, a)} = \max \Big(\textcolor{orange}{R^T_i(I, a)}, 0 \Big)$
    """
    # $$\textcolor{orange}{R^{T,+}_i(I, a)} = \max \Big(\textcolor{orange}{R^T_i(I, a)}, 0 \Big)$$
    regret = {a: max(r, 0) for a, r in self.regret.items()}
    # $$\sum_{a'\in A(I)}\textcolor{orange}{R^{T,+}_i(I, a')}$$
    regret_sum = sum(regret.values())
    # If $\sum_{a'\in A(I)}\textcolor{orange}{R^{T,+}_i(I, a')} \gt 0$,
    if regret_sum > 0:
        # $$\textcolor{lightgreen}{\sigma_i^{T+1}(I)(a)} =
        # \frac{\textcolor{orange}{R^{T,+}_i(I, a)}}{\sum_{a'\in A(I)}\textcolor{orange}{R^{T,+}_i(I, a')}}$$
        self.strategy = {a: r / regret_sum for a, r in regret.items()}
    # Otherwise,
    else:
        # $\lvert A(I) \rvert$
        count = len(list(a for a in self.regret))
        # $$\textcolor{lightgreen}{\sigma_i^{T+1}(I)(a)} =
        # \frac{1}{\lvert A(I) \rvert}$$
        self.strategy = {a: 1 / count for a, r in regret.items()}

def get_average_strategy(self):
    """
    ## Get average strategy

    $$\textcolor{cyan}{\bar{\sigma}^T_i(I)(a)} =
    \frac{\sum_{t=1}^T \pi_i^{\sigma^t}(I)\textcolor{lightgreen}{\sigma^t(I)(a)}}
    {\sum_{t=1}^T \pi_i^{\sigma^t}(I)}$$
    """
    # $$\sum_{t=1}^T \pi_i^{\sigma^t}(I) \textcolor{lightgreen}{\sigma^t(I)(a)}$$
    cum_strategy = {a: self.cumulative_strategy.get(a, 0.) for a in self.actions()}
    # $$\sum_{t=1}^T \pi_i^{\sigma^t}(I) =
    # \sum_{a \in A(I)} \sum_{t=1}^T
    # \pi_i^{\sigma^t}(I)\textcolor{lightgreen}{\sigma^t(I)(a)}$$
    strategy_sum = sum(cum_strategy.values())
    # If $\sum_{t=1}^T \pi_i^{\sigma^t}(I) > 0$,
    if strategy_sum > 0:
        # $$\textcolor{cyan}{\bar{\sigma}^T_i(I)(a)} =
        # \frac{\sum_{t=1}^T \pi_i^{\sigma^t}(I)\textcolor{lightgreen}{\sigma^t(I)(a)}}
        # {\sum_{t=1}^T \pi_i^{\sigma^t}(I)}$$
        return {a: s / strategy_sum for a, s in cum_strategy.items()}
    # Otherwise,
    else:
        # $\lvert A(I) \rvert$
        count = len(list(a for a in cum_strategy))
        # $$\textcolor{cyan}{\bar{\sigma}^T_i(I)(a)} =
        # \frac{1}{\lvert A(I) \rvert}$$
        return {a: 1 / count for a, r in cum_strategy.items()}

def walk_tree(self, h: History, i: Player, pi_i: float, pi_neg_i: float) -> float:
    """
    $$\sum_{z \in Z_h} \pi^\sigma(h, z) u_i(z)$$

    where $Z_h$ is the set of terminal histories with prefix $h$.

    While walking the tree it updates the total regrets $\textcolor{orange}{R^T_i(I, a)}$.
    """

    # If it's a terminal history $h \in Z$ return the terminal utility $u_i(h)$.

    # ...

    # Update the cumulative strategies and total regrets
    if h.player() == i:
        # Update cumulative strategies
        # $$\sum_{t=1}^T \pi_i^{\sigma^t}(I)\textcolor{lightgreen}{\sigma^t(I)(a)}
        # = \sum_{t=1}^T \Big[ \sum_{h \in I} \pi_i^{\sigma^t}(h)
        # \textcolor{lightgreen}{\sigma^t(I)(a)} \Big]$$
        for a in I.actions():
            I.cumulative_strategy[a] = I.cumulative_strategy[a] + pi_i * I.strategy[a]
        # \begin{align}
        # \textcolor{coral}{\tilde{r}^t_i(I, a)} &=
        # \textcolor{pink}{\tilde{v}_i(\sigma^t |_{I \rightarrow a}, I)} -
        # \textcolor{pink}{\tilde{v}_i(\sigma^t, I)} \\
        # &=
        # \pi^{\sigma^t}_{-i} (h) \Big(
        # \sum_{z \in Z_h} \pi^{\sigma^t |_{I \rightarrow a}}(h, z) u_i(z) -
        # \sum_{z \in Z_h} \pi^\sigma(h, z) u_i(z)
        # \Big) \\
        # T \textcolor{orange}{R^T_i(I, a)} &=
        # \sum_{t=1}^T \textcolor{coral}{\tilde{r}^t_i(I, a)}
        # \end{align}
        for a in I.actions():
            I.regret[a] += pi_neg_i * (va[a] - v)

        # Update the strategy $\textcolor{lightgreen}{\sigma^t(I)(a)}$
        I.calculate_strategy()

    # Return the expected utility for player $i$,
    # ...

def iterate(self):
    """
    ### Iteratively update $\textcolor{lightgreen}{\sigma^t(I)(a)}$

    This updates the strategies for $T$ iterations.
    """