The number of allocations is proportional to the product of:
- `BATCH` ($\approx 512$ or $2048$): the number of trees explored in parallel,
- `episodes` ($\approx 800$): the number of rollouts (each results in at least $1$ new node for active trees),
- `num_permitted_edges_range.end()` ($\approx E$): the number of edges which may be modified during a search.
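As a rough worked example (the edge count $E = 100$ here is only an illustrative assumption, not a measured value): with `BATCH` $= 2048$ and `episodes` $= 800$, this comes to roughly $2048 \cdot 800 \cdot 100 \approx 1.6 \times 10^8$ allocations over a single search phase.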
## Improving the space complexity
This is because each node $s$ stores a value $g_T(s, a)$ for each action $a\in\mathcal{A}(s)$.
We can address this a few ways:
- [ ] ?break epochs into sub-epochs
  - i.e., rather than $2048$ simultaneous tree searches, we instead have $256$ simultaneous searches which terminate with node selection
  - not needed right now
  - after all $8$ mini-epochs, we load all $2048$ state vectors and action vectors into buffers to update the model with
- [ ] ?restrict the number of permitted edges explicitly
  - this does not seem to perform well for searches like $R(3, 3, 3) > 16$
  - actually, the problem is more the search policy: MuZero still uses the PUCT rule for exploration; it just also uses soft labeling, a dynamics model, and some mid-search normalization so that the terms of the upper-estimate formula are at the correct scale
  - we should still model the agent's policy for the sake of defining the upper estimate
- [ ] use `SamplePattern` to truncate to a subset of stored actions
  - this is reasonable since `episodes` heavily restricts the number of possible actions we visit from each node
  - something like `SamplePattern { head: 19, body: 7, tail: 4 }` should be more than sufficient
  - since we are reintroducing the policy to the model as a direct sum with the prediction function via `dfdx::nn::modules::SplitInto` and using Dirichlet noise, there is no need to build a complicated action sampling policy
  - instead of `SamplePattern { head, body, tail }`, use `ActionBudget { g_budget, p_budget }`
- [ ] ?reduce the size of `ActionData` (currently $24$ bytes); see the sketch after this list
  - `a: usize` (e.g., `u16`)
  - `next_position: Option<NonZeroUsize>` (e.g., `Option<NonZeroU16>`)
  - `g_sa: Option<f32>` (e.g., `f32` if we bytepack with `next_position`)
- [ ] decrease the memory footprint of `positions: BTreeMap<P, NonZeroUsize>`
  - eliminated entirely with a dynamic tree search
  - this eliminates $O(k\ell)$ space for each tree and removes an $O(\ell \log k)$ search through the `BTreeMap`
  - in exchange, the search algorithm operates on a path which occupies $O(\ell)$ space and runs in $O(e)$ time, which may even be faster (both are eclipsed by the `memcpy`s and the massive tensor computations)
  - here $k$, $\ell$, and $e$ are the search tree's node count, height, and arc count, respectively
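A minimal Rust sketch of the size reduction proposed for `ActionData` above; the field set is taken from the checklist, but the exact way `g_sa` is packed alongside `next_position` is an assumption, not the crate's actual layout:

```rust
use std::num::{NonZeroU16, NonZeroUsize};

// Current layout: 8 + 8 + 8 = 24 bytes per action.
#[allow(dead_code)]
struct ActionData {
    a: usize,                            // action index
    next_position: Option<NonZeroUsize>, // position of the child node a·s, if allocated
    g_sa: Option<f32>,                   // cached g_T(s, a), if evaluated
}

// Proposed layout: narrow the integer fields and drop the Option around g_sa,
// assuming its presence can be inferred from next_position (the "bytepack" idea).
#[allow(dead_code)]
struct PackedActionData {
    a: u16,
    next_position: Option<NonZeroU16>,
    g_sa: f32, // read only when the corresponding position/flag says it is set
}

fn main() {
    // 24 bytes vs. 8 bytes per stored action.
    println!("ActionData:       {} bytes", std::mem::size_of::<ActionData>());
    println!("PackedActionData: {} bytes", std::mem::size_of::<PackedActionData>());
}
```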
## Improving the algorithm
As outlined here, we are moving closer towards the battle-tested tricks from the $\mu$-zero paper. The reason for this is the failure of the current algorithm to perform on the $R(3, 3, 3) > 16$ problem. For comparison, my model-free MCTS solved this problem in $8$ seconds and my $\alpha$-zero approach took about a minute, IIRC.
In summary:
- reintroduce $p_\theta(s, \cdot)$ to the model output alongside the softly labeled $h_\theta(s, \cdot)$ values
- add Dirichlet noise to the predicted distribution $p_\theta(s, \cdot)$ for each root node $s$
  - in Chess (Go), the DeepMind team used $\text{Dir}(\alpha)$ with $\alpha = 0.3$ ($0.03$)
  - Go boards begin with $361$ legal moves and retain this order of magnitude until the endgame, where boards have tens of moves
  - Chess boards begin with $20$ legal moves, and peak move counts stay within a small factor of this value
  - for this reason, we will first try $\alpha = 10\cdot |\mathcal{A}(s)|^{-1}$
- nix `SamplePattern` in favor of `ActionBudget`, which selects a budgeted number of maximizers of $g_\theta(s, \cdot)$ and of $p_\theta(s, \cdot)$ to retain for exploration
- select actions greedily w.r.t. $u(s, a) = \hat{g}_T(s, a) + p(s, a)\cdot x(s, a)\cdot y(s, a)$, where (see the first sketch after this list):
  - $x(s, a) = \dfrac{\sqrt{n'}}{1 + n_T(a\cdot s)}$
  - $y(s, a) = c_1 + \log\left(\dfrac{n' + c_2 + 1}{c_2}\right)$
  - $n' = \sum_b n_T(b\cdot s)$
  - $\hat{g}_T(s, a) = \dfrac{r(s, a) + g_T^\ast(a\cdot s)}{c(s)}$
  - trying $c_1 = 1.25$ and $c_2 = 19652$
- weight the cross-entropy loss corresponding to $(s, a)$, where $s$ is a root, by a function, e.g., $\sqrt{\cdot}$, of $n_T(a\cdot s)$
  - consider using an importance factor w.r.t. a priority function
  - when weighting, we can use `dfdx`'s built-in cross-entropy function with target distributions scaled by their weights
  - this is valid by inspection of the code, but later we can write a weighted cross-entropy function for completeness, using the same off-tape precision and performance optimizations; this can save one call to a higher-rank tensor product by delaying the weighting
- mask invalid actions when calculating each $p_\theta(s, \cdot)$ (see the second sketch after this list)
  - that is, we require `valid_actions`, a `BATCH x ACTION`-shaped tensor of `bool` values
  - we use `valid_actions.choose(action_logits, action_mask)`, where `action_mask` is a broadcast of a `Rank0` tensor consisting of `f32::MIN`
  - we also need to scale the logits by $|\mathcal{A}(s)|^{-1/2}$, which can be calculated with `valid_actions.choose(one, zero).sum().sqrt().recip()`
  - test a masked cross entropy
  - separate mask-and-rescale into its own method
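For concreteness, here is a framework-free Rust sketch of the greedy selection rule above; the `ChildStats` struct and function names are illustrative assumptions, and the normalizer $c(s)$ is passed in precomputed:

```rust
/// Per-action statistics for a child a·s of the node s
/// (illustrative field names, not the crate's actual types).
struct ChildStats {
    r: f32,      // r(s, a): immediate reward for taking a from s
    g_star: f32, // g*_T(a·s): value estimate recorded at the child
    n: u32,      // n_T(a·s): visit count of the child
    p: f32,      // p(s, a): predicted prior (with Dirichlet noise at roots)
}

/// u(s, a) = ĝ_T(s, a) + p(s, a) · x(s, a) · y(s, a), where
///   ĝ_T(s, a) = (r(s, a) + g*_T(a·s)) / c(s),
///   x(s, a)   = sqrt(n') / (1 + n_T(a·s)),
///   y(s, a)   = c1 + ln((n' + c2 + 1) / c2),
///   n'        = Σ_b n_T(b·s).
fn selection_score(child: &ChildStats, n_prime: f32, c_s: f32, c1: f32, c2: f32) -> f32 {
    let g_hat = (child.r + child.g_star) / c_s;
    let x = n_prime.sqrt() / (1.0 + child.n as f32);
    let y = c1 + ((n_prime + c2 + 1.0) / c2).ln();
    g_hat + child.p * x * y
}

/// Greedily select the retained action maximizing u(s, ·).
fn select_action(children: &[ChildStats], c_s: f32) -> usize {
    let (c1, c2) = (1.25_f32, 19652.0_f32); // the constants we are trying first
    let n_prime: f32 = children.iter().map(|c| c.n as f32).sum();
    children
        .iter()
        .enumerate()
        .map(|(i, c)| (i, selection_score(c, n_prime, c_s, c1, c2)))
        .max_by(|a, b| a.1.partial_cmp(&b.1).unwrap())
        .map(|(i, _)| i)
        .expect("at least one retained action")
}
```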
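Similarly, a per-state, framework-free sketch of the mask-and-rescale step; working on one `ACTION`-length row at a time is an assumption made to keep the example free of tensor machinery:

```rust
/// Mask invalid actions with f32::MIN (so softmax sends them to ~0) and scale
/// the remaining logits by |A(s)|^(-1/2). Operates on one row of the
/// BATCH x ACTION logits; the tensor version broadcasts over the batch.
fn mask_and_rescale(logits: &mut [f32], valid_actions: &[bool]) {
    let num_valid = valid_actions.iter().filter(|&&v| v).count() as f32;
    let scale = num_valid.sqrt().recip(); // |A(s)|^(-1/2)
    for (logit, &valid) in logits.iter_mut().zip(valid_actions) {
        *logit = if valid { *logit * scale } else { f32::MIN };
    }
}
```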