I am not sure why you you did the random prediction this way action = np.random.choice(self.action_space, p=prediction) also you are picking the random outcome as the chosen action during testing and training. I can understand why during training but why during testing as well ?