Scaling limitations for Ramsey problems #71

Open · 7 of 11 tasks · ariasanovsky opened this issue Jan 11, 2024 · 1 comment

ariasanovsky (Owner) commented Jan 11, 2024

The number of allocations is proportional to the product of:

  • BATCH ($\approx 512$ or $2048$): the number of trees explored in parallel,
  • episodes ($\approx 800$): the number of rollouts (each results in at least $1$ new node for active trees),
  • num_permitted_edges_range.end() ($\approx E$): the number of edges which may be modified during a search.
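
To make the scale concrete, a back-of-envelope sketch (assumptions: the $2048$-tree batch, $800$ episodes, and $E = 120$, i.e., all edges of $K_{16}$ as in the $R(3, 3, 3) > 16$ search; $24$ bytes is the current ActionData size discussed below):

```rust
// Rough upper bound on stored action data, as the product of the factors above.
const BATCH: usize = 2048; // trees explored in parallel
const EPISODES: usize = 800; // rollouts per epoch
const E: usize = 16 * 15 / 2; // = 120; assumed: all edges of K_16
const ACTION_DATA_BYTES: usize = 24; // current size of ActionData

fn main() {
    let bytes = BATCH * EPISODES * E * ACTION_DATA_BYTES;
    // ≈ 4.39 GiB for these parameters
    println!("{:.2} GiB", bytes as f64 / (1u64 << 30) as f64);
}
```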

Improving the space complexity

This is because each tree stores a value $g_T(s, a)$ for each node $s$ and each action $a\in\mathcal{A}(s)$.
We can address this a few ways:

  • [ ] ?break epochs into sub-epochs
    • i.e., rather than $2048$ simultaneous tree searches, we instead have $256$ simultaneous searches which terminate with node selection
    • not needed right now
    • after all $8$ mini-epochs, we load all $2048$ state vectors and action vectors into buffers and update the model
  • [ ] ?restrict the number of permitted edges explicitly
    • this does not seem to perform well for searches like $R(3, 3, 3) > 16$
    • actually, the problem is more the search policy: $\mu$-zero still uses the PUCT rule for exploration; it also adds soft labeling, a dynamics model, and some mid-search normalization so that the terms of the upper-estimate formula are at the correct scale
    • we should still model the agent's policy for the sake of defining the upper estimate
  • [ ] use SamplePattern to truncate to a subset of stored actions
    • this is reasonable since episodes heavily restricts the number of possible actions we visit from each node
    • something like SamplePattern { head: 19, body: 7, tail: 4 } should be more than sufficient
    • since we are reintroducing the policy to the model as a direct sum with the prediction function via
      dfdx::nn::modules::SplitInto and using Dirichlet noise, there is no need to build a complicated action sample policy
  • instead of SamplePattern { head, body, tail }, use ActionBudget { g_budget, p_budget }
  • ?reduce the size of ActionData (currently $24$ bytes; see the size sketch after this list)
    • a: usize (e.g., u16)
    • next_position: Option<NonZeroUsize> (e.g., Option<NonZeroU16>)
    • g_sa: Option<f32> (e.g., a bare f32 if we pack the None case into next_position)
  • decrease memory footprint of positions: BTreeMap<P, NonZeroUsize>
    • eliminated with a dynamic tree search
    • this eliminates $O(k\ell)$ space for each tree and removes a $O(\ell\cdot \log k)$ search through the BTreeMap
    • in exchange, the search algorithm operates on a path which occupies $O(\ell)$ space and runs in $O(e)$ time, which may even be faster (both are eclipsed by the memcpys and massive tensor computations)
      • $k, \ell, e$ are the search tree's node count, height, and arc count, respectively
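
A minimal sketch of the ActionData shrink proposed above (the compact layout is hypothetical; it stores g_sa bare and signals "unevaluated" through next_position or a sentinel):

```rust
use core::mem::size_of;
use core::num::{NonZeroU16, NonZeroUsize};

/// Current layout: 24 bytes on 64-bit targets.
struct ActionData {
    a: usize,                            // 8 bytes
    next_position: Option<NonZeroUsize>, // 8 bytes (niche-optimized)
    g_sa: Option<f32>,                   // 8 bytes (4 + tag + padding)
}

/// Hypothetical compact layout: 8 bytes.
struct ActionDataCompact {
    a: u16,
    next_position: Option<NonZeroU16>, // 2 bytes (niche-optimized)
    g_sa: f32, // bare; None is packed into next_position or a NaN sentinel
}

fn main() {
    assert_eq!(size_of::<ActionData>(), 24);
    assert_eq!(size_of::<ActionDataCompact>(), 8);
}
```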

Improving the algorithm

As outlined here, we are moving closer to the battle-tested tricks from the $\mu$-zero paper. The reason is the current algorithm's failure to perform on the $R(3, 3, 3) > 16$ problem. For comparison, my model-free MCTS solved this problem in $8$ seconds and my $\alpha$-zero approach took about a minute, IIRC.

In summary:

  • reintroduce $p_\theta(s, \cdot)$ to the model output alongside the softly labeled $h_\theta(s, \cdot)$ values
  • add Dirichlet noise to the predicted distribution $p_\theta(s, \cdot)$ for each root node $s$
    • In Chess (Go), the DeepMind team used $\text{Dir}(\alpha)$ where $\alpha = 0.3$ ($0.03$)
    • Go boards begin with $361$ moves and retain this order of magnitude until the endgame where boards have tens of moves
    • Chess boards begin with $20$ moves, and peak move counts stay well within a small factor of this value
    • For this reason, we will first try $\alpha = 10\cdot |\mathcal{A}(s)|^{-1}$
  • nix SamplePattern in favor of ActionBudget which selects a budgeted number of maximizers of $g_\theta(s, \cdot)$ and $p_\theta(s, \cdot)$ to retain for exploration
  • select actions greedily w.r.t. $u(s, a) = \hat{g}_T(s, a) + p(s, a)\cdot x(s, a)\cdot y(s, a)$ (sketched in code after this list) where:
    • $x(s, a) = \dfrac{\sqrt{n'}}{1 + n_T(a\cdot s)}$
    • $y(s, a) = c_1 + \log\left(\dfrac{n' + c_2 + 1}{c_2}\right)$
    • $n' = \sum_b n_T(b\cdot s)$
    • $\hat{g}_T(s, a) = \dfrac{r(s, a) + g_T^\ast(a\cdot s)}{c(s)}$
    • trying $c_1 = 1.25$ and $c_2 = 19652$
  • weight the cross-entropy loss corresponding to $(s, a)$, where $s$ is a root, by a function, e.g., $\sqrt{\cdot}$, of $n_T(a\cdot s)$
    • consider using an importance factor w.r.t. a priority function
    • when weighting, we can use dfdx's built-in cross-entropy function with target distributions scaled by their weights
    • this is valid by inspection of the code; later we can write a weighted cross-entropy function for completeness, using the same off-tape precision and perf optimizations, which can save one call to a higher-rank tensor product by delaying the weighting
  • mask invalid actions when calculating each $p_\theta(s, \cdot)$
    • that is, we require valid_actions, a BATCH x ACTION-shaped tensor of bool values
    • we use valid_actions.choose(action_logits, action_mask) where action_mask is a broadcast of a Rank0 tensor consisting of f32::MIN
    • we also need to scale the logits by $|\mathcal{A}(s)|^{-1/2}$ which can be calculated with valid_actions.choose(one, zero).sum().sqrt().recip()
      • test a masked cross entropy
      • separate mask-and-rescale into its own method
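
A minimal sketch of the selection rule above (plain Rust; the function name and scalar signature are hypothetical, with $c_1 = 1.25$ and $c_2 = 19652$ as proposed):

```rust
/// u(s, a) = ĝ_T(s, a) + p(s, a) · x(s, a) · y(s, a), per the formulas above.
fn upper_estimate(
    g_hat_sa: f32, // ĝ_T(s, a) = (r(s, a) + g*_T(a·s)) / c(s), precomputed
    p_sa: f32,     // prior p(s, a), after any root Dirichlet noise
    n_sa: f32,     // visit count n_T(a·s)
    n_prime: f32,  // n' = Σ_b n_T(b·s)
) -> f32 {
    const C_1: f32 = 1.25;
    const C_2: f32 = 19652.0;
    let x = n_prime.sqrt() / (1.0 + n_sa);
    let y = C_1 + ((n_prime + C_2 + 1.0) / C_2).ln();
    g_hat_sa + p_sa * x * y
}
```

Actions are then chosen as the argmax of upper_estimate over $\mathcal{A}(s)$.
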
ariasanovsky self-assigned this Jan 11, 2024

ariasanovsky (Owner) commented:

$\mu$-zero refactor

Theory vs. Implementation

  • $f_\theta(s) = h_\theta(s)\oplus p_\theta(s)$
    • use dfdx::nn::modules::SplitInto
    • here, each $h_\theta(s): \mathcal{A}(s)\to\mathbb{R}$ maps each $a\in\mathcal{A}(s)$ to some $h_\theta(s, a)\in\mathbb{R}$
      • $h_\theta(s)$ is implemented with a tensor h[s] of shape A x L
        • for each $0\leq a < |\mathcal{A}|$, h[s][a] is a Rank1<L> of logits
      • here, we need a partition $P = [h_0 < h_1 < \cdots < h_{L-1}]$ of $H = [h_0, h_{L-1}]$
      • we form soft labels for each $h\in H$ (see the sketch after this list):
        • we write $h = u_i\cdot h_i + v_i\cdot h_{i+1}$ with $u_i + v_i = 1$ and $u_i, v_i\geq 0$
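
A minimal sketch of this two-hot soft labeling (plain Rust; assumes $L \geq 2$ and h_labels sorted, with $h$ clamped into $[h_0, h_{L-1}]$):

```rust
/// Returns u with |supp(u)| <= 2 such that h = Σ_i u_i * h_labels[i].
fn soft_label(h: f32, h_labels: &[f32]) -> Vec<f32> {
    let l = h_labels.len();
    let h = h.clamp(h_labels[0], h_labels[l - 1]);
    // find i with h_labels[i] <= h <= h_labels[i + 1]
    let i = match h_labels.iter().rposition(|&hi| hi <= h) {
        Some(i) if i + 1 < l => i,
        _ => l - 2, // h equals the top label
    };
    let v = (h - h_labels[i]) / (h_labels[i + 1] - h_labels[i]);
    let mut u = vec![0.0f32; l];
    u[i] = 1.0 - v; // u_i
    u[i + 1] = v;   // v_i
    u
}
```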

Action Values

| math | tensor | rank |
| --- | --- | --- |
| $h_\theta(s, a)\in\mathbb{R}$ | h[s][a] | Rank0 |
| $h_\theta(s, \cdot): \mathcal{A}(s)\to\mathbb{R}$ | h[s] | Rank1<A> |
| $\mathcal{H} = [h_0,\dots, h_{L-1}]$ | h_labels | Rank1<L> |
| $u = [u_i]_i$, $\vert\text{supp}(u)\vert \leq 2$ | h_probs[s][a] | Rank1<L> |
| $h = \sum_i u_i h_i$ | h_unlabel | Rank1<L> -> Rank0 |
| $\hat{h} = \sum_i u_i e_i$ | h_label | Rank0 -> Rank1<L> |
| $u = \text{softmax}\left(\dfrac{\text{value logits}}{\sqrt{L}}\right)$ | h_logits[s][a] | Rank1<L> |
| $\ell_h = H(\text{obs}_h(T) \mid \text{pred}_h(\theta))$ | cross_entropy_with_logits_loss | (Rank3<B, A, L>, Rank3<B, A, L>) -> Rank0 |

Transition Probabilities

| math | tensor | rank |
| --- | --- | --- |
| $p_\theta(s, a)\in [0, 1]$ | p[s][a] | Rank0 |
| $p_\theta(s, \cdot): \mathcal{A}(s)\to [0, 1]$ | p[s] | Rank1<A> |
| $p_\theta = \text{softmax}\left(\dfrac{\text{action logits}}{\sqrt{A}}\right)$ | p_logits[s] | Rank1<A> |
| $\ell_p = H(\text{obs}_p(T) \mid \text{pred}_p(\theta))$ | cross_entropy_with_logits_loss | (Rank2<B, A>, Rank2<B, A>) -> Rank0 |
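
A minimal sketch of the mask-and-rescale step from the first comment (plain Rust rather than dfdx ops, so the tensor plumbing is elided; valid logits are scaled by $|\mathcal{A}(s)|^{-1/2}$ and invalid ones sent to f32::MIN before the softmax):

```rust
/// Mask invalid actions and rescale logits for one state s.
fn masked_scaled_logits(action_logits: &[f32], valid_actions: &[bool]) -> Vec<f32> {
    let n_valid = valid_actions.iter().filter(|&&v| v).count() as f32;
    let scale = n_valid.sqrt().recip(); // |A(s)|^{-1/2}
    action_logits
        .iter()
        .zip(valid_actions)
        .map(|(&l, &v)| if v { l * scale } else { f32::MIN })
        .collect()
}
```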

$\text{Dir}(\alpha)$

For Dirichlet noise given root state $s$, use:

  • $\alpha \approx 10 |\mathcal{A}(s)|^{-1}$
  • $d(s)\sim \text{Dir}(\alpha)$
  • $p = p_\theta(s)\cdot \dfrac{3}{4} + d(s)\cdot \dfrac{1}{4}$

Use $p_\theta(s)$ for non-root states.
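
A minimal sketch of this root noising (assumes the rand and rand_distr crates; the Dirichlet sample is built via the standard Gamma-normalization construction):

```rust
use rand::Rng;
use rand_distr::{Distribution, Gamma};

/// d(s) ~ Dir(alpha, ..., alpha) over k = |A(s)| actions.
fn dirichlet_symmetric<R: Rng>(rng: &mut R, alpha: f32, k: usize) -> Vec<f32> {
    let gamma = Gamma::new(alpha, 1.0).expect("alpha > 0");
    let mut d: Vec<f32> = (0..k).map(|_| gamma.sample(rng)).collect();
    let sum: f32 = d.iter().sum();
    d.iter_mut().for_each(|x| *x /= sum);
    d
}

/// p = p_theta(s) · 3/4 + d(s) · 1/4, with alpha ≈ 10 |A(s)|^{-1}.
fn noisy_root_prior<R: Rng>(rng: &mut R, p_theta: &[f32]) -> Vec<f32> {
    let k = p_theta.len();
    let d = dirichlet_symmetric(rng, 10.0 / k as f32, k);
    p_theta.iter().zip(&d).map(|(p, n)| 0.75 * p + 0.25 * n).collect()
}
```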

Storing Predictions

Select a bounded number of maximizers of $g_T^\ast(s, \cdot)$ and of $p_\theta(s, \cdot)$.
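
A minimal sketch of this budgeted selection (the ActionBudget shape follows the earlier item; retain keeps the union of the top g_budget actions by $g_T^\ast(s, \cdot)$ and the top p_budget actions by $p_\theta(s, \cdot)$):

```rust
/// Hypothetical budget over stored actions.
struct ActionBudget {
    g_budget: usize,
    p_budget: usize,
}

impl ActionBudget {
    /// Indices of the retained actions, given per-action scores g and p.
    fn retain(&self, g: &[f32], p: &[f32]) -> Vec<usize> {
        let top = |scores: &[f32], k: usize| -> Vec<usize> {
            let mut idx: Vec<usize> = (0..scores.len()).collect();
            idx.sort_by(|&i, &j| scores[j].total_cmp(&scores[i])); // descending
            idx.truncate(k);
            idx
        };
        let mut keep = top(g, self.g_budget);
        for i in top(p, self.p_budget) {
            if !keep.contains(&i) {
                keep.push(i);
            }
        }
        keep
    }
}
```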
