Commit 7f74eeb

\textcolor
1 parent 9c9276d commit 7f74eeb

19 files changed (+288 -286 lines)

labml_nn/cfr/__init__.py

Lines changed: 60 additions & 60 deletions
@@ -153,8 +153,8 @@
 The average strategy is the average of strategies followed in each round,
 for all $I \in \mathcal{I}, a \in A(I)$

-$${\color{cyan}\bar{\sigma}^T_i(I)(a)} =
-\frac{\sum_{t=1}^T \pi_i^{\sigma^t}(I){\color{lightgreen}\sigma^t(I)(a)}}{\sum_{t=1}^T \pi_i^{\sigma^t}(I)}$$
+$$\textcolor{cyan}{\bar{\sigma}^T_i(I)(a)} =
+\frac{\sum_{t=1}^T \pi_i^{\sigma^t}(I)\textcolor{lightgreen}{\sigma^t(I)(a)}}{\sum_{t=1}^T \pi_i^{\sigma^t}(I)}$$

 That is the mean regret of not playing with the optimal strategy.
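The formula above is just a reach-probability-weighted average of the per-iteration strategies. A minimal Python sketch of that computation (illustrative only, not part of this file; the names `average_strategy` and `history` are made up):

from typing import Dict, List, Tuple

def average_strategy(history: List[Tuple[float, Dict[str, float]]]) -> Dict[str, float]:
    # `history` holds one (reach probability pi_i^{sigma^t}(I), strategy sigma^t(I)) pair per iteration t
    actions = history[0][1].keys()
    # Numerator: sum_t pi_i^{sigma^t}(I) * sigma^t(I)(a)
    weighted = {a: sum(pi * sigma[a] for pi, sigma in history) for a in actions}
    # Denominator: sum_t pi_i^{sigma^t}(I)
    total = sum(pi for pi, _ in history)
    return {a: w / total for a, w in weighted.items()}

# Two iterations at the same information set:
print(average_strategy([(1.0, {'bet': 0.5, 'pass': 0.5}), (0.5, {'bet': 1.0, 'pass': 0.0})]))
# {'bet': 0.666..., 'pass': 0.333...}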
@@ -210,10 +210,10 @@

 ### Counterfactual regret

-**Counterfactual value** $\color{pink}{v_i(\sigma, I)}$ is the expected utility for player $i$ if
+**Counterfactual value** $\textcolor{pink}{v_i(\sigma, I)}$ is the expected utility for player $i$ if
 player $i$ tried to reach $I$ (took the actions leading to $I$ with a probability of $1$).

-$$\color{pink}{v_i(\sigma, I)} = \sum_{z \in Z_I} \pi^\sigma_{-i}(z[I]) \pi^\sigma(z[I], z) u_i(z)$$
+$$\textcolor{pink}{v_i(\sigma, I)} = \sum_{z \in Z_I} \pi^\sigma_{-i}(z[I]) \pi^\sigma(z[I], z) u_i(z)$$

 where $Z_I$ is the set of terminal histories reachable from $I$,
 and $z[I]$ is the prefix of $z$ up to $I$.
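Read literally, the counterfactual value sums, over every terminal history $z$ reachable from $I$, the opponents' (and chance's) probability of reaching the prefix $z[I]$, times the probability of then playing from $z[I]$ to $z$, times the terminal utility. A small sketch under that reading (the `Terminal` record and its fields are assumptions, not code from this repository):

from typing import List, NamedTuple

class Terminal(NamedTuple):
    pi_neg_i_to_prefix: float  # pi^sigma_{-i}(z[I])
    pi_from_prefix: float      # pi^sigma(z[I], z)
    utility: float             # u_i(z)

def counterfactual_value(terminals: List[Terminal]) -> float:
    # v_i(sigma, I) = sum over z in Z_I of pi^sigma_{-i}(z[I]) * pi^sigma(z[I], z) * u_i(z)
    return sum(t.pi_neg_i_to_prefix * t.pi_from_prefix * t.utility for t in terminals)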
@@ -227,7 +227,7 @@

 $$R^T_{i,imm}(I) = \frac{1}{T} \sum_{t=1}^T
 \Big(
-\color{pink}{v_i(\sigma^t |_{I \rightarrow a}, I)} - \color{pink}{v_i(\sigma^t, I)}
+\textcolor{pink}{v_i(\sigma^t |_{I \rightarrow a}, I)} - \textcolor{pink}{v_i(\sigma^t, I)}
 \Big)$$

 where $\sigma |_{I \rightarrow a}$ is the strategy profile $\sigma$ with the modification
@@ -244,29 +244,29 @@

 The strategy is calculated using regret matching.

-The regret for each information set and action pair $\color{orange}{R^T_i(I, a)}$ is maintained,
+The regret for each information set and action pair $\textcolor{orange}{R^T_i(I, a)}$ is maintained,

 \begin{align}
-\color{coral}{r^t_i(I, a)} &=
-\color{pink}{v_i(\sigma^t |_{I \rightarrow a}, I)} - \color{pink}{v_i(\sigma^t, I)}
+\textcolor{coral}{r^t_i(I, a)} &=
+\textcolor{pink}{v_i(\sigma^t |_{I \rightarrow a}, I)} - \textcolor{pink}{v_i(\sigma^t, I)}
 \\
-\color{orange}{R^T_i(I, a)} &=
-\frac{1}{T} \sum_{t=1}^T \color{coral}{r^t_i(I, a)}
+\textcolor{orange}{R^T_i(I, a)} &=
+\frac{1}{T} \sum_{t=1}^T \textcolor{coral}{r^t_i(I, a)}
 \end{align}

 and the strategy is calculated with regret matching,

 \begin{align}
-\color{lightgreen}{\sigma_i^{T+1}(I)(a)} =
+\textcolor{lightgreen}{\sigma_i^{T+1}(I)(a)} =
 \begin{cases}
-\frac{\color{orange}{R^{T,+}_i(I, a)}}{\sum_{a'\in A(I)}\color{orange}{R^{T,+}_i(I, a')}},
-& \text{if} \sum_{a'\in A(I)}\color{orange}{R^{T,+}_i(I, a')} \gt 0 \\
+\frac{\textcolor{orange}{R^{T,+}_i(I, a)}}{\sum_{a'\in A(I)}\textcolor{orange}{R^{T,+}_i(I, a')}},
+& \text{if} \sum_{a'\in A(I)}\textcolor{orange}{R^{T,+}_i(I, a')} \gt 0 \\
 \frac{1}{\lvert A(I) \rvert},
 & \text{otherwise}
 \end{cases}
 \end{align}

-where $\color{orange}{R^{T,+}_i(I, a)} = \max \Big(\color{orange}{R^T_i(I, a)}, 0 \Big)$
+where $\textcolor{orange}{R^{T,+}_i(I, a)} = \max \Big(\textcolor{orange}{R^T_i(I, a)}, 0 \Big)$

 The paper
 The paper
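The case split above is plain regret matching: play each action in proportion to its positive regret, and fall back to a uniform strategy when no action has positive regret. A minimal standalone sketch (illustrative only; the real implementation is `InfoSet.calculate_strategy` further down in this diff):

from typing import Dict

def regret_matching(regret: Dict[str, float]) -> Dict[str, float]:
    # R^{T,+}_i(I, a): clip regrets at zero
    positive = {a: max(r, 0.0) for a, r in regret.items()}
    total = sum(positive.values())
    if total > 0:
        # Proportional to positive regret
        return {a: r / total for a, r in positive.items()}
    # Otherwise uniform over the actions
    return {a: 1.0 / len(regret) for a in regret}

print(regret_matching({'bet': 2.0, 'pass': -1.0}))  # {'bet': 1.0, 'pass': 0.0}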
@@ -279,7 +279,7 @@

 ### Monte Carlo CFR (MCCFR)

-Computing $\color{coral}{r^t_i(I, a)}$ requires expanding the full game tree
+Computing $\textcolor{coral}{r^t_i(I, a)}$ requires expanding the full game tree
 on each iteration.

 The paper
@@ -296,28 +296,28 @@

 Then we get **sampled counterfactual value** for block $j$,

-$$\color{pink}{\tilde{v}(\sigma, I|j)} =
+$$\textcolor{pink}{\tilde{v}(\sigma, I|j)} =
 \sum_{z \in Q_j} \frac{1}{q(z)}
 \pi^\sigma_{-i}(z[I]) \pi^\sigma(z[I], z) u_i(z)$$

 The paper shows that

-$$\mathbb{E}_{j \sim q_j} \Big[ \color{pink}{\tilde{v}(\sigma, I|j)} \Big]
-= \color{pink}{v_i(\sigma, I)}$$
+$$\mathbb{E}_{j \sim q_j} \Big[ \textcolor{pink}{\tilde{v}(\sigma, I|j)} \Big]
+= \textcolor{pink}{v_i(\sigma, I)}$$

 with a simple proof.

 Therefore we can sample a part of the game tree and calculate the regrets.
 We calculate an estimate of regrets

 $$
-\color{coral}{\tilde{r}^t_i(I, a)} =
-\color{pink}{\tilde{v}_i(\sigma^t |_{I \rightarrow a}, I)} - \color{pink}{\tilde{v}_i(\sigma^t, I)}
+\textcolor{coral}{\tilde{r}^t_i(I, a)} =
+\textcolor{pink}{\tilde{v}_i(\sigma^t |_{I \rightarrow a}, I)} - \textcolor{pink}{\tilde{v}_i(\sigma^t, I)}
 $$

-And use that to update $\color{orange}{R^T_i(I, a)}$ and calculate
-the strategy $\color{lightgreen}{\sigma_i^{T+1}(I)(a)}$ on each iteration.
-Finally, we calculate the overall average strategy $\color{cyan}{\bar{\sigma}^T_i(I)(a)}$.
+And use that to update $\textcolor{orange}{R^T_i(I, a)}$ and calculate
+the strategy $\textcolor{lightgreen}{\sigma_i^{T+1}(I)(a)}$ on each iteration.
+Finally, we calculate the overall average strategy $\textcolor{cyan}{\bar{\sigma}^T_i(I)(a)}$.

 Here is a [Kuhn Poker](kuhn/index.html) implementation to try CFR on Kuhn Poker.
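The sampling step can be read as ordinary importance weighting: pick one block $Q_j$ with probability $q_j$ and divide each sampled terminal's contribution by $q(z)$, so the estimate matches $v_i(\sigma, I)$ in expectation. A rough sketch for the case where the blocks partition the terminal histories (names and structure are assumptions, not this repository's API):

import random
from typing import List

def sampled_counterfactual_value(blocks: List[List[float]], q: List[float]) -> float:
    # blocks[j] holds the contributions pi^sigma_{-i}(z[I]) * pi^sigma(z[I], z) * u_i(z)
    # for every terminal history z in block Q_j; q[j] is the probability of sampling Q_j.
    j = random.choices(range(len(blocks)), weights=q, k=1)[0]
    # Importance-weight by 1/q(z); with a partition, q(z) = q[j] for every z in Q_j.
    return sum(contribution / q[j] for contribution in blocks[j])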
@@ -422,24 +422,24 @@ class InfoSet:
     # Total regret of not taking each action $A(I_i)$,
     #
     # \begin{align}
-    # \color{coral}{\tilde{r}^t_i(I, a)} &=
-    # \color{pink}{\tilde{v}_i(\sigma^t |_{I \rightarrow a}, I)} -
-    # \color{pink}{\tilde{v}_i(\sigma^t, I)}
+    # \textcolor{coral}{\tilde{r}^t_i(I, a)} &=
+    # \textcolor{pink}{\tilde{v}_i(\sigma^t |_{I \rightarrow a}, I)} -
+    # \textcolor{pink}{\tilde{v}_i(\sigma^t, I)}
     # \\
-    # \color{orange}{R^T_i(I, a)} &=
-    # \frac{1}{T} \sum_{t=1}^T \color{coral}{\tilde{r}^t_i(I, a)}
+    # \textcolor{orange}{R^T_i(I, a)} &=
+    # \frac{1}{T} \sum_{t=1}^T \textcolor{coral}{\tilde{r}^t_i(I, a)}
     # \end{align}
     #
-    # We maintain $T \color{orange}{R^T_i(I, a)}$ instead of $\color{orange}{R^T_i(I, a)}$
+    # We maintain $T \textcolor{orange}{R^T_i(I, a)}$ instead of $\textcolor{orange}{R^T_i(I, a)}$
     # since $\frac{1}{T}$ term cancels out anyway when computing strategy
-    # $\color{lightgreen}{\sigma_i^{T+1}(I)(a)}$
+    # $\textcolor{lightgreen}{\sigma_i^{T+1}(I)(a)}$
     regret: Dict[Action, float]
     # We maintain the cumulative strategy
-    # $$\sum_{t=1}^T \pi_i^{\sigma^t}(I)\color{lightgreen}{\sigma^t(I)(a)}$$
+    # $$\sum_{t=1}^T \pi_i^{\sigma^t}(I)\textcolor{lightgreen}{\sigma^t(I)(a)}$$
     # to compute overall average strategy
     #
-    # $$\color{cyan}{\bar{\sigma}^T_i(I)(a)} =
-    # \frac{\sum_{t=1}^T \pi_i^{\sigma^t}(I)\color{lightgreen}{\sigma^t(I)(a)}}{\sum_{t=1}^T \pi_i^{\sigma^t}(I)}$$
+    # $$\textcolor{cyan}{\bar{\sigma}^T_i(I)(a)} =
+    # \frac{\sum_{t=1}^T \pi_i^{\sigma^t}(I)\textcolor{lightgreen}{\sigma^t(I)(a)}}{\sum_{t=1}^T \pi_i^{\sigma^t}(I)}$$
     cumulative_strategy: Dict[Action, float]

     def __init__(self, key: str):
@@ -489,59 +489,59 @@ def calculate_strategy(self):
         Calculate current strategy using [regret matching](#RegretMatching).

         \begin{align}
-        \color{lightgreen}{\sigma_i^{T+1}(I)(a)} =
+        \textcolor{lightgreen}{\sigma_i^{T+1}(I)(a)} =
         \begin{cases}
-        \frac{\color{orange}{R^{T,+}_i(I, a)}}{\sum_{a'\in A(I)}\color{orange}{R^{T,+}_i(I, a')}},
-        & \text{if} \sum_{a'\in A(I)}\color{orange}{R^{T,+}_i(I, a')} \gt 0 \\
+        \frac{\textcolor{orange}{R^{T,+}_i(I, a)}}{\sum_{a'\in A(I)}\textcolor{orange}{R^{T,+}_i(I, a')}},
+        & \text{if} \sum_{a'\in A(I)}\textcolor{orange}{R^{T,+}_i(I, a')} \gt 0 \\
         \frac{1}{\lvert A(I) \rvert},
         & \text{otherwise}
         \end{cases}
         \end{align}

-        where $\color{orange}{R^{T,+}_i(I, a)} = \max \Big(\color{orange}{R^T_i(I, a)}, 0 \Big)$
+        where $\textcolor{orange}{R^{T,+}_i(I, a)} = \max \Big(\textcolor{orange}{R^T_i(I, a)}, 0 \Big)$
         """
-        # $$\color{orange}{R^{T,+}_i(I, a)} = \max \Big(\color{orange}{R^T_i(I, a)}, 0 \Big)$$
+        # $$\textcolor{orange}{R^{T,+}_i(I, a)} = \max \Big(\textcolor{orange}{R^T_i(I, a)}, 0 \Big)$$
         regret = {a: max(r, 0) for a, r in self.regret.items()}
-        # $$\sum_{a'\in A(I)}\color{orange}{R^{T,+}_i(I, a')}$$
+        # $$\sum_{a'\in A(I)}\textcolor{orange}{R^{T,+}_i(I, a')}$$
         regret_sum = sum(regret.values())
-        # if $\sum_{a'\in A(I)}\color{orange}{R^{T,+}_i(I, a')} \gt 0$,
+        # if $\sum_{a'\in A(I)}\textcolor{orange}{R^{T,+}_i(I, a')} \gt 0$,
         if regret_sum > 0:
-            # $$\color{lightgreen}{\sigma_i^{T+1}(I)(a)} =
-            # \frac{\color{orange}{R^{T,+}_i(I, a)}}{\sum_{a'\in A(I)}\color{orange}{R^{T,+}_i(I, a')}}$$
+            # $$\textcolor{lightgreen}{\sigma_i^{T+1}(I)(a)} =
+            # \frac{\textcolor{orange}{R^{T,+}_i(I, a)}}{\sum_{a'\in A(I)}\textcolor{orange}{R^{T,+}_i(I, a')}}$$
             self.strategy = {a: r / regret_sum for a, r in regret.items()}
         # Otherwise,
         else:
            # $\lvert A(I) \rvert$
            count = len(list(a for a in self.regret))
-            # $$\color{lightgreen}{\sigma_i^{T+1}(I)(a)} =
+            # $$\textcolor{lightgreen}{\sigma_i^{T+1}(I)(a)} =
            # \frac{1}{\lvert A(I) \rvert}$$
            self.strategy = {a: 1 / count for a, r in regret.items()}

     def get_average_strategy(self):
         """
         ## Get average strategy

-        $$\color{cyan}{\bar{\sigma}^T_i(I)(a)} =
-        \frac{\sum_{t=1}^T \pi_i^{\sigma^t}(I)\color{lightgreen}{\sigma^t(I)(a)}}
+        $$\textcolor{cyan}{\bar{\sigma}^T_i(I)(a)} =
+        \frac{\sum_{t=1}^T \pi_i^{\sigma^t}(I)\textcolor{lightgreen}{\sigma^t(I)(a)}}
         {\sum_{t=1}^T \pi_i^{\sigma^t}(I)}$$
         """
-        # $$\sum_{t=1}^T \pi_i^{\sigma^t}(I) \color{lightgreen}{\sigma^t(I)(a)}$$
+        # $$\sum_{t=1}^T \pi_i^{\sigma^t}(I) \textcolor{lightgreen}{\sigma^t(I)(a)}$$
         cum_strategy = {a: self.cumulative_strategy.get(a, 0.) for a in self.actions()}
         # $$\sum_{t=1}^T \pi_i^{\sigma^t}(I) =
         # \sum_{a \in A(I)} \sum_{t=1}^T
-        # \pi_i^{\sigma^t}(I)\color{lightgreen}{\sigma^t(I)(a)}$$
+        # \pi_i^{\sigma^t}(I)\textcolor{lightgreen}{\sigma^t(I)(a)}$$
         strategy_sum = sum(cum_strategy.values())
         # If $\sum_{t=1}^T \pi_i^{\sigma^t}(I) > 0$,
         if strategy_sum > 0:
-            # $$\color{cyan}{\bar{\sigma}^T_i(I)(a)} =
-            # \frac{\sum_{t=1}^T \pi_i^{\sigma^t}(I)\color{lightgreen}{\sigma^t(I)(a)}}
+            # $$\textcolor{cyan}{\bar{\sigma}^T_i(I)(a)} =
+            # \frac{\sum_{t=1}^T \pi_i^{\sigma^t}(I)\textcolor{lightgreen}{\sigma^t(I)(a)}}
             # {\sum_{t=1}^T \pi_i^{\sigma^t}(I)}$$
             return {a: s / strategy_sum for a, s in cum_strategy.items()}
         # Otherwise,
         else:
             # $\lvert A(I) \rvert$
             count = len(list(a for a in cum_strategy))
-            # $$\color{cyan}{\bar{\sigma}^T_i(I)(a)} =
+            # $$\textcolor{cyan}{\bar{\sigma}^T_i(I)(a)} =
             # \frac{1}{\lvert A(I) \rvert}$$
             return {a: 1 / count for a, r in cum_strategy.items()}

@@ -610,7 +610,7 @@ def walk_tree(self, h: History, i: Player, pi_i: float, pi_neg_i: float) -> floa
         $$\sum_{z \in Z_h} \pi^\sigma(h, z) u_i(z)$$
         where $Z_h$ is the set of terminal histories with prefix $h$

-        While walking the tree it updates the total regrets $\color{orange}{R^T_i(I, a)}$.
+        While walking the tree it updates the total regrets $\textcolor{orange}{R^T_i(I, a)}$.
         """

         # If it's a terminal history $h \in Z$ return the terminal utility $u_i(h)$.
@@ -656,27 +656,27 @@ def walk_tree(self, h: History, i: Player, pi_i: float, pi_neg_i: float) -> floa
         # update the cumulative strategies and total regrets
         if h.player() == i:
             # Update cumulative strategies
-            # $$\sum_{t=1}^T \pi_i^{\sigma^t}(I)\color{lightgreen}{\sigma^t(I)(a)}
+            # $$\sum_{t=1}^T \pi_i^{\sigma^t}(I)\textcolor{lightgreen}{\sigma^t(I)(a)}
             # = \sum_{t=1}^T \Big[ \sum_{h \in I} \pi_i^{\sigma^t}(h)
-            # \color{lightgreen}{\sigma^t(I)(a)} \Big]$$
+            # \textcolor{lightgreen}{\sigma^t(I)(a)} \Big]$$
             for a in I.actions():
                 I.cumulative_strategy[a] = I.cumulative_strategy[a] + pi_i * I.strategy[a]
             # \begin{align}
-            # \color{coral}{\tilde{r}^t_i(I, a)} &=
-            # \color{pink}{\tilde{v}_i(\sigma^t |_{I \rightarrow a}, I)} -
-            # \color{pink}{\tilde{v}_i(\sigma^t, I)} \\
+            # \textcolor{coral}{\tilde{r}^t_i(I, a)} &=
+            # \textcolor{pink}{\tilde{v}_i(\sigma^t |_{I \rightarrow a}, I)} -
+            # \textcolor{pink}{\tilde{v}_i(\sigma^t, I)} \\
             # &=
             # \pi^{\sigma^t}_{-i} (h) \Big(
             # \sum_{z \in Z_h} \pi^{\sigma^t |_{I \rightarrow a}}(h, z) u_i(z) -
             # \sum_{z \in Z_h} \pi^\sigma(h, z) u_i(z)
             # \Big) \\
-            # T \color{orange}{R^T_i(I, a)} &=
-            # \sum_{t=1}^T \color{coral}{\tilde{r}^t_i(I, a)}
+            # T \textcolor{orange}{R^T_i(I, a)} &=
+            # \sum_{t=1}^T \textcolor{coral}{\tilde{r}^t_i(I, a)}
             # \end{align}
             for a in I.actions():
                 I.regret[a] += pi_neg_i * (va[a] - v)

-            # Update the strategy $\color{lightgreen}{\sigma^t(I)(a)}$
+            # Update the strategy $\textcolor{lightgreen}{\sigma^t(I)(a)}$
             I.calculate_strategy()

         # Return the expected utility for player $i$,
@@ -685,7 +685,7 @@ def walk_tree(self, h: History, i: Player, pi_i: float, pi_neg_i: float) -> floa

     def iterate(self):
         """
-        ### Iteratively update $\color{lightgreen}{\sigma^t(I)(a)}$
+        ### Iteratively update $\textcolor{lightgreen}{\sigma^t(I)(a)}$

         This updates the strategies for $T$ iterations.
         """
