The number of allocations is proportional to the product of:
- `BATCH` ($\approx 512$ or $2048$): the number of trees explored in parallel,
- `episodes` ($\approx 800$): the number of rollouts (each results in at least $1$ new node for active trees),
- `num_permitted_edges_range.end()` ($\approx E$): the number of edges which may be modified during a search.
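As a rough worked example (the edge count $E = 100$ here is only an illustrative assumption, not a measured value): with `BATCH` $= 2048$ and `episodes` $= 800$, this comes to roughly $2048 \cdot 800 \cdot 100 \approx 1.6 \times 10^8$ allocations over a single search phase.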
## Improving the space complexity
This is because each node $s$ stores a value $g_T(s, a)$ for each action $a\in\mathcal{A}(s)$.
We can address this a few ways:
- [ ] ?break epochs into sub-epochs
  - i.e., rather than $2048$ simultaneous tree searches, we instead have $256$ simultaneous searches which terminate with node selection
  - not needed right now
  - after all $8$ mini-epochs, we load all $2048$ state vectors and action vectors into buffers to update the model with
- [ ] ?restrict the number of permitted edges explicitly
  - this does not seem to perform well for searches like $R(3, 3, 3) > 16$
  - actually, the problem is more the search policy: MuZero still uses the PUCT rule for exploration; it just also uses soft labeling, a dynamics model, and some mid-search normalization so that the terms of the upper-estimate formula are at the correct scale
  - we should still model the agent's policy for the sake of defining the upper estimate
- [ ] use `SamplePattern` to truncate to a subset of stored actions
  - this is reasonable since `episodes` heavily restricts the number of possible actions we visit from each node
  - something like `SamplePattern { head: 19, body: 7, tail: 4 }` should be more than sufficient
  - since we are reintroducing the policy to the model as a direct sum with the prediction function via `dfdx::nn::modules::SplitInto` and using Dirichlet noise, there is no need to build a complicated action sampling policy
  - instead of `SamplePattern { head, body, tail }`, use `ActionBudget { g_budget, p_budget }`
- [ ] ?reduce the size of `ActionData` (currently $24$ bytes); see the sketch after this list
  - `a: usize` (e.g., `u16`)
  - `next_position: Option<NonZeroUsize>` (e.g., `Option<NonZeroU16>`)
  - `g_sa: Option<f32>` (e.g., `f32` if we bytepack with `next_position`)
- [ ] decrease the memory footprint of `positions: BTreeMap<P, NonZeroUsize>`
  - eliminated entirely with a dynamic tree search
  - this eliminates $O(k\ell)$ space for each tree and removes an $O(\ell \log k)$ search through the `BTreeMap`
  - in exchange, the search algorithm operates on a path which occupies $O(\ell)$ space and runs in $O(e)$ time, which may even be faster (both are eclipsed by the `memcpy`s and the massive tensor computations)
  - here $k$, $\ell$, and $e$ are the search tree's node count, height, and arc count, respectively
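A minimal Rust sketch of the size reduction proposed for `ActionData` above; the field set is taken from the checklist, but the exact way `g_sa` is packed alongside `next_position` is an assumption, not the crate's actual layout:

```rust
use std::num::{NonZeroU16, NonZeroUsize};

// Current layout: 8 + 8 + 8 = 24 bytes per action.
#[allow(dead_code)]
struct ActionData {
    a: usize,                            // action index
    next_position: Option<NonZeroUsize>, // position of the child node a·s, if allocated
    g_sa: Option<f32>,                   // cached g_T(s, a), if evaluated
}

// Proposed layout: narrow the integer fields and drop the Option around g_sa,
// assuming its presence can be inferred from next_position (the "bytepack" idea).
#[allow(dead_code)]
struct PackedActionData {
    a: u16,
    next_position: Option<NonZeroU16>,
    g_sa: f32, // read only when the corresponding position/flag says it is set
}

fn main() {
    // 24 bytes vs. 8 bytes per stored action.
    println!("ActionData:       {} bytes", std::mem::size_of::<ActionData>());
    println!("PackedActionData: {} bytes", std::mem::size_of::<PackedActionData>());
}
```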
## Improving the algorithm
As outlined here, we are moving closer towards the battle-tested tricks from the $\mu$-zero paper. The reason for this is the failure of the current algorithm to perform on the $R(3, 3, 3) > 16$ problem. For comparison, my model-free MCTS solved this problem in $8$ seconds and my $\alpha$-zero approach took about a minute, IIRC.
In summary:
- reintroduce $p_\theta(s, \cdot)$ to the model output alongside the softly labeled $h_\theta(s, \cdot)$ values
- add Dirichlet noise to the predicted distribution $p_\theta(s, \cdot)$ for each root node $s$
  - in Chess (Go), the DeepMind team used $\text{Dir}(\alpha)$ with $\alpha = 0.3$ ($0.03$)
  - Go boards begin with $361$ legal moves and retain this order of magnitude until the endgame, where boards have tens of moves
  - Chess boards begin with $20$ legal moves, and peak move counts stay within a small factor of this value
  - for this reason, we will first try $\alpha = 10\cdot |\mathcal{A}(s)|^{-1}$
- nix `SamplePattern` in favor of `ActionBudget`, which selects a budgeted number of maximizers of $g_\theta(s, \cdot)$ and of $p_\theta(s, \cdot)$ to retain for exploration
- select actions greedily w.r.t. $u(s, a) = \hat{g}_T(s, a) + p(s, a)\cdot x(s, a)\cdot y(s, a)$, where (see the first sketch after this list):
  - $x(s, a) = \dfrac{\sqrt{n'}}{1 + n_T(a\cdot s)}$
  - $y(s, a) = c_1 + \log\left(\dfrac{n' + c_2 + 1}{c_2}\right)$
  - $n' = \sum_b n_T(b\cdot s)$
  - $\hat{g}_T(s, a) = \dfrac{r(s, a) + g_T^\ast(a\cdot s)}{c(s)}$
  - trying $c_1 = 1.25$ and $c_2 = 19652$
- weight the cross-entropy loss corresponding to $(s, a)$, where $s$ is a root, by a function, e.g., $\sqrt{\cdot}$, of $n_T(a\cdot s)$
  - consider using an importance factor w.r.t. a priority function
  - when weighting, we can use `dfdx`'s built-in cross-entropy function with target distributions scaled by their weights
  - this is valid by inspection of the code, but later we can write a weighted cross-entropy function for completeness, using the same off-tape precision and performance optimizations; this can save one call to a higher-rank tensor product by delaying the weighting
- mask invalid actions when calculating each $p_\theta(s, \cdot)$ (see the second sketch after this list)
  - that is, we require `valid_actions`, a `BATCH x ACTION`-shaped tensor of `bool` values
  - we use `valid_actions.choose(action_logits, action_mask)`, where `action_mask` is a broadcast of a `Rank0` tensor consisting of `f32::MIN`
  - we also need to scale the logits by $|\mathcal{A}(s)|^{-1/2}$, which can be calculated with `valid_actions.choose(one, zero).sum().sqrt().recip()`
  - test a masked cross entropy
  - separate mask-and-rescale into its own method
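For concreteness, here is a framework-free Rust sketch of the greedy selection rule above; the `ChildStats` struct and function names are illustrative assumptions, and the normalizer $c(s)$ is passed in precomputed:

```rust
/// Per-action statistics for a child a·s of the node s
/// (illustrative field names, not the crate's actual types).
struct ChildStats {
    r: f32,      // r(s, a): immediate reward for taking a from s
    g_star: f32, // g*_T(a·s): value estimate recorded at the child
    n: u32,      // n_T(a·s): visit count of the child
    p: f32,      // p(s, a): predicted prior (with Dirichlet noise at roots)
}

/// u(s, a) = ĝ_T(s, a) + p(s, a) · x(s, a) · y(s, a), where
///   ĝ_T(s, a) = (r(s, a) + g*_T(a·s)) / c(s),
///   x(s, a)   = sqrt(n') / (1 + n_T(a·s)),
///   y(s, a)   = c1 + ln((n' + c2 + 1) / c2),
///   n'        = Σ_b n_T(b·s).
fn selection_score(child: &ChildStats, n_prime: f32, c_s: f32, c1: f32, c2: f32) -> f32 {
    let g_hat = (child.r + child.g_star) / c_s;
    let x = n_prime.sqrt() / (1.0 + child.n as f32);
    let y = c1 + ((n_prime + c2 + 1.0) / c2).ln();
    g_hat + child.p * x * y
}

/// Greedily select the retained action maximizing u(s, ·).
fn select_action(children: &[ChildStats], c_s: f32) -> usize {
    let (c1, c2) = (1.25_f32, 19652.0_f32); // the constants we are trying first
    let n_prime: f32 = children.iter().map(|c| c.n as f32).sum();
    children
        .iter()
        .enumerate()
        .map(|(i, c)| (i, selection_score(c, n_prime, c_s, c1, c2)))
        .max_by(|a, b| a.1.partial_cmp(&b.1).unwrap())
        .map(|(i, _)| i)
        .expect("at least one retained action")
}
```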
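Similarly, a per-state, framework-free sketch of the mask-and-rescale step; working on one `ACTION`-length row at a time is an assumption made to keep the example free of tensor machinery:

```rust
/// Mask invalid actions with f32::MIN (so softmax sends them to ~0) and scale
/// the remaining logits by |A(s)|^(-1/2). Operates on one row of the
/// BATCH x ACTION logits; the tensor version broadcasts over the batch.
fn mask_and_rescale(logits: &mut [f32], valid_actions: &[bool]) {
    let num_valid = valid_actions.iter().filter(|&&v| v).count() as f32;
    let scale = num_valid.sqrt().recip(); // |A(s)|^(-1/2)
    for (logit, &valid) in logits.iter_mut().zip(valid_actions) {
        *logit = if valid { *logit * scale } else { f32::MIN };
    }
}
```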