[TEMP] A Very Draft Proposal

A Draft Design of Distributed Reinforcement Learning in Julia

I've been thinking for a while about how to design a distributed reinforcement learning package in Julia. Recently I read through the source code of some packages again, including:

and some other resources included here by Joel. Although I still don't have a very clear design, I would like to write down my thoughts here in case they are useful for someone else.

The abstractions for reinforcement learning in rllib are quite straightforward. You may refer to RLlib: Abstractions for Distributed Reinforcement Learning for the concepts I'll be discussing next.

  1. BaseEnv
  2. Policy Graph
  3. Policy Evaluation
  4. Policy Optimizer
  5. Agent

It has been demonstrated that most of the popular reinforcement learning algorithms can be implemented in rllib using the concepts above. However, it's not that easy to port those concepts directly into Julia. One of the most important reasons is that we don't have an existing foundational package like Ray, and the infrastructure for parallel programming in Julia is quite different. In the next section, I will try to adapt those concepts to Julia and describe, at a very high level, how to implement some typical algorithms with the actor model.

Notice that here we don't even have a scheduler or an object store! I just want to picture what it would look like with only the actor model (as a negative example 😿 ).

Actors Actors Actors

Environment

Let's start with the environment part. Environments in RL are relatively independent. By treating all environments asynchronously, rllib shows that it becomes very convenient to introduce new environments. So here we also treat environments as actors running asynchronously.

First, we introduce the concept of AbstractEnv.

abstract type AbstractEnv end

function interact!(env, actions...) end  # apply the given action(s) to the environment
function observe(env, role) end          # get the observation visible to `role`
function reset!(env) end                 # restore the environment to its initial state
# ...
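
For concreteness, a toy environment might implement this interface as follows. `CoinFlipEnv` and its dynamics are invented purely for illustration:

# A hypothetical one-step coin-flip environment, only to exercise the interface above.
mutable struct CoinFlipEnv <: AbstractEnv
    reward::Float64
    done::Bool
end

CoinFlipEnv() = CoinFlipEnv(0.0, false)

function interact!(env::CoinFlipEnv, action)
    env.reward = (action == :heads) == (rand() < 0.5) ? 1.0 : 0.0  # reward for guessing the flip
    env.done = true
    env
end

observe(env::CoinFlipEnv, role = :default) = (reward = env.reward, done = env.done, state = nothing)

function reset!(env::CoinFlipEnv)
    env.reward = 0.0
    env.done = false
    env
end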

Then we can wrap it into an actor

env_actor = @actor begin
    env = ExampleEnv(init_configs)
    while true
        sender, msg = receive()
        @match msg begin
            (:interact!, actions) => interact!(env, actions...)
            (:observe, role) => tell(sender, observe(env, role))
            (:reset!,) => reset!(env)
            # do something else
            (:ping,) => tell(sender, :pong)
        end
    end
end

# The code above can be further simplified by introducing an `@wrap_environment_actor` macro
env_actor = @wrap_environment_actor ExampleEnv(init_configs)
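
Note that `@actor`, `receive`, `tell`, and the wrapper macros here are pseudocode (`@match` stands in for a pattern-matching macro such as the one provided by MLStyle.jl). As a very rough sketch, and only under my own naming assumptions, such a mailbox could be emulated with nothing but Base tasks and Channels:

# A minimal mailbox-style actor built on Base `Channel` and `@async` (illustrative only).
struct Actor
    mailbox::Channel{Any}
    task::Task
end

function spawn_actor(behavior; buffer = 32)
    mailbox = Channel{Any}(buffer)
    task = @async behavior(mailbox)        # the behavior owns its own message loop
    Actor(mailbox, task)
end

tell(actor::Actor, msg) = put!(actor.mailbox, msg)   # fire-and-forget send
receive(mailbox::Channel) = take!(mailbox)           # block until a message arrives

# Example: an echo actor that replies on a channel carried inside the message.
echo = spawn_actor() do mailbox
    while true
        sender, msg = receive(mailbox)
        put!(sender, msg)
    end
end

reply = Channel{Any}(1)
tell(echo, (reply, :ping))
take!(reply)   # => :ping

The env_actor above would then just be spawn_actor wrapping the @match loop; the only point is that nothing beyond tasks and channels is strictly required.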

Policy

Next, we can have a PolicyGraph object like the one in rllib:

abstract type AbstractPolicy end

function act(pg, obs) end              # choose an action given an observation
function learn(pg, batch) end          # update the policy from a batch of experience
function set_weights(pg, weights) end  # overwrite the policy parameters
function get_weights(pg) end           # read the current policy parameters
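
Just to exercise this interface, a trivial random policy (again, purely an illustration) might look like:

# A hypothetical uniformly random policy over a fixed action set.
mutable struct RandomPolicy <: AbstractPolicy
    actions::Vector{Symbol}
    weights::Vector{Float64}   # unused when acting; kept only to exercise get/set_weights
end

RandomPolicy(actions) = RandomPolicy(collect(actions), Float64[])

act(p::RandomPolicy, obs) = rand(p.actions)    # ignore the observation entirely
learn(p::RandomPolicy, batch) = p              # nothing to learn
set_weights(p::RandomPolicy, weights) = (p.weights = weights; p)
get_weights(p::RandomPolicy) = p.weights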

Evaluator

An evaluator combines a Policy and an Environment.

abstract type AbstractEvaluator end

struct ExampleEvaluator <: AbstractEvaluator
    env
    policy
    #...
end

function sample(ev::AbstractEvaluator) end  # collect a batch of experience
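
Ignoring actors for a moment, a purely synchronous `sample` for the example evaluator could be sketched as below; it assumes that `observe` returns something with a `done` field (as in the toy environment above), which is my own convention rather than part of the proposal:

# A synchronous rollout sketch, reusing the hypothetical interfaces above.
function sample(ev::ExampleEvaluator; n_steps = 100)
    trajectory = Any[]
    reset!(ev.env)
    for _ in 1:n_steps
        obs = observe(ev.env, :default)    # ask the environment for an observation
        action = act(ev.policy, obs)       # let the policy pick an action
        interact!(ev.env, action)          # advance the environment
        push!(trajectory, (obs, action))
        obs.done && reset!(ev.env)         # assumes the observation exposes `done`
    end
    trajectory
end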

Again, we can wrap it into an actor.

ev_actor = @wrap_evaluator_actor ExampleEvaluator(env, policy, params...)

When the ev_actor is spawned, a corresponding environment actor will also be spawned as its child (in the same process by default).
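
Building on the hypothetical `spawn_actor`/`tell`/`receive` and `CoinFlipEnv` sketches above, the parent/child relationship could look roughly like this:

# Purely illustrative: the evaluator actor spawns its environment actor on startup,
# so both live in the same process and the child's lifetime is tied to its parent.
ev_actor = spawn_actor() do mailbox
    env_child = spawn_actor() do env_mailbox
        env = CoinFlipEnv()
        while true
            sender, msg = receive(env_mailbox)
            msg == :observe && put!(sender, observe(env, :default))
        end
    end

    reply = Channel{Any}(1)
    while true
        sender, msg = receive(mailbox)
        if msg == :observe
            tell(env_child, (reply, :observe))  # forward the request to the child env...
            put!(sender, take!(reply))          # ...and relay its answer to the caller
        end
    end
end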

Optimizer

An optimizer interacts with evaluators and handles things like parameter updates and distributed sampling. It can have several (remote) Policy Evaluator actors as its children.
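
Ignoring remoteness for a moment, a synchronous optimizer step over the interfaces above might be sketched as follows (all names here are illustrative assumptions):

# A hypothetical synchronous optimizer driving a set of evaluators.
struct SyncOptimizer
    local_evaluator      # holds the authoritative policy
    remote_evaluators    # would be (remote) evaluator actors in the real design
end

function step!(opt::SyncOptimizer)
    # gather experience from every evaluator (send/receive in the actor version)
    batches = [sample(ev) for ev in opt.remote_evaluators]

    # learn on the local policy, then broadcast the fresh weights back out
    for batch in batches
        learn(opt.local_evaluator.policy, batch)
    end
    weights = get_weights(opt.local_evaluator.policy)
    for ev in opt.remote_evaluators
        set_weights(ev.policy, weights)
    end
    weights
end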

Demo

Putting all the components together, we have the following graph showing how each component works in the Ape-X algorithm.

TODO: Add figure

And the pseudocode is:

# 1. create environments
env_creator = configs -> CartPoleEnv(configs)

# 2. create policies
policy_initializer = configs -> DQNPolicy(configs)

# 3. define evaluators
mutable struct ApeXEvaluator
    env_actor
    policy
    batch_size
    n_samples
    replay_buffer
    ApeXEvaluator(params...) = new()  # pseudocode: fields are filled in elsewhere
end

function sample(ev::ApeXEvaluator)
    while true
        if ev.n_samples >= ev.batch_size
            return sample(ev.replay_buffer, ev.batch_size)
        else
            r, d, s = @await observe(ev.env_actor)  # it will be translated into send/receive
            a = act(ev.policy, s)                   # send/receive if the policy lives in another actor
            # update replay_buffer
            # calc loss
            # update grad
        end
    end
end

# 4. optimizer
mutable struct ApeXOptimizer
    local_ev
    remote_evs
end

function step(optimizer::ApeXOptimizer)
    samples = @await get_high_priority_samples(optimizer.remote_evs)
    # evaluate local_ev
    # broadcast local weights
    # update priorities in the replay buffer
end

@schedule @wrap_optimizer_actor(ApeXOptimizer(configs))

Conclusion

  • I haven't even explained the implementation details of actors here.
  • No Scheduler (I find that most calculations can be finished in the local process, so maybe we don't need a very general computation graph model?)
  • Not sure about the performance
  • Not that easy to debug
  • Very intuitive. Actors, children of actors and grandchildren...

Update

To apply the actor model, one of the most critical issues is to guarantee the immutability of messages. If we just pass states/gradients (of type Array) between different actors, then immutability is broken. So how does ray/rllib handle this problem? By calling the .remote function, the results are written to the Object Store first and an object ID is returned. Once the write operation finishes, the data is sealed and becomes immutable, and the messages passed between different actors are just object IDs. The benefit is that we only need to write once and can then read across different processes (on one node) without copying data.

Obviously there is a drawback. If the data will only be consumed once, in the same process, by another actor (task), then we don't need to serialize/deserialize it at all. And this is a very common case in most reinforcement learning algorithms: first we collect an observation from an environment, then an action is generated by a neural network. In this process, the observation is consumed only once. (However, for Ape-X, the experience may be reused later.)
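
To see why this matters in Julia specifically: within one process a Channel passes references, not copies, so nothing prevents the sender from mutating data the receiver is still holding.

# Passing an Array through a Channel shares memory inside one process; no copy is made.
ch = Channel{Vector{Float64}}(1)
state = [1.0, 2.0, 3.0]

put!(ch, state)      # the "message" is just a reference to `state`
state[1] = 42.0      # the sender keeps mutating its buffer

received = take!(ch)
received[1]          # 42.0, not 1.0 — the receiver observes the mutation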

It seems that if we want to adopt the actor model, then we may end up creating another ray/rllib.

Or we can look for some other computation patterns. (Joel mentioned DAG last time.)

Hope this can inspire you guys to move on and figure out a more practical approach!

TODO:

  • Add a simple scheduler and some test cases based on Distributed; see the rough sketch below. (2019-03-03)
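
As a starting point for that TODO, farming out sampling to worker processes with Distributed and a RemoteChannel could look roughly like the sketch below (`rollout` is a stand-in for `sample(evaluator)`; nothing here is the final design):

using Distributed
addprocs(2)

# stand-in for `sample(evaluator)`: each worker produces a vector of fake experience
@everywhere rollout(n) = [rand() for _ in 1:n]

# a queue the "optimizer" drains while the worker processes keep producing into it
results = RemoteChannel(() -> Channel{Vector{Float64}}(32))

# one producer loop per worker process
for pid in workers()
    @spawnat pid begin
        for _ in 1:5
            put!(results, rollout(10))
        end
    end
end

# the "optimizer" consumes batches as they arrive
for _ in 1:(5 * nworkers())
    batch = take!(results)
    @show length(batch)
end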

Cheers!