
Evolving a Recurrent Spiking Neural Network on a Partially Observable Variant of CartPole

GitHub

Abstract

This project explores the use of recurrent spiking neural networks (RSNNs) to solve a partially observable version of the classic CartPole control problem. The agent's observations are limited to the cart's position and the pole's angle; both the cart's velocity and the pole's angular velocity are omitted, so the agent must infer these hidden variables implicitly within its recurrent neurons. The internal parameters of each neuron, as well as the topology and weights of a leaky integrate-and-fire RSNN, are evolved using a genome-based evolutionary algorithm. Performance is compared against several baseline policies, including random action selection and a naive reactive policy for the partially observable state. Results show that the evolved RSNN reliably balances the pole by implicitly estimating the missing velocities, outperforming all baseline policies.

Evolved RSNN (Generation 272) (avg: 1002.50)

CartPole

CartPole is a classic benchmark in reinforcement learning and control theory. The environment consists of a cart that can move left or right along a track, with a pole hinged to its top. The goal is to keep the pole balanced upright by applying left or right forces to the cart, while also keeping the cart on the screen. In the standard version, the agent receives four observations at each timestep: the cart's position, the cart's velocity, the pole's angle, and the pole's angular velocity.

In this project, the environment is made more challenging by restricting the agent's observations to only the cart's position and the pole's angle. All velocity information is hidden, making the problem partially observable. This means the agent must infer the missing velocities from the sequence of past observations, rather than having direct access to them. This restricted version is far closer to how humans approach this problem - the only things we can see are the position of the cart and pole, and we infer the velocities within our brains based on our memory of the previous frames.
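As a concrete sketch of this setup (not necessarily the project's exact code), a Gymnasium observation wrapper can mask the two velocity components; the `MaskVelocities` name and the use of `CartPole-v1` are illustrative:

```python
import gymnasium as gym
import numpy as np


class MaskVelocities(gym.ObservationWrapper):
    """Expose only cart position and pole angle from CartPole's 4-D observation."""

    # Full observation: [cart position, cart velocity, pole angle, pole angular velocity]
    KEEP = [0, 2]

    def __init__(self, env):
        super().__init__(env)
        full = env.observation_space
        self.observation_space = gym.spaces.Box(
            low=full.low[self.KEEP], high=full.high[self.KEEP], dtype=np.float32
        )

    def observation(self, obs):
        # Drop both velocity components, leaving a partially observable state.
        return obs[self.KEEP].astype(np.float32)


env = MaskVelocities(gym.make("CartPole-v1"))
obs, _ = env.reset(seed=0)
print(obs.shape)  # (2,) -- position and angle only
```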

Simple Policy Baselines

Simple, non-neural-network policies provide reference points (average scores in parentheses; a sketch of the naive policy follows the list):

Go Left (avg: 9.36)
Go Right (avg: 9.36)
Random (avg: 22.25)
Naive (avg: 42.12)

These baselines highlight the limitations of memoryless, reactive strategies in a partially observable setting.
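The write-up does not spell out the naive policy's rule; a plausible reading, sketched below, is a memoryless policy that pushes the cart toward whichever side the pole is leaning (reusing the hypothetical `MaskVelocities` wrapper from the sketch above):

```python
import gymnasium as gym


def naive_policy(obs):
    """Memoryless rule: push toward the side the pole leans (0 = left, 1 = right)."""
    _cart_position, pole_angle = obs
    return 1 if pole_angle > 0 else 0


env = MaskVelocities(gym.make("CartPole-v1"))
obs, _ = env.reset(seed=0)
total, done = 0.0, False
while not done:
    obs, reward, terminated, truncated, _ = env.step(naive_policy(obs))
    total += reward
    done = terminated or truncated
print(total)
```

Because such a rule reacts only to the current angle, it cannot account for the pole's momentum and tends to overshoot, which is one way to read the short episode lengths above.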

Recurrent Spiking Neural Networks

Recurrent Spiking Neural Networks (RSNNs) are a class of artificial neural networks that combine two key features: recurrence and spiking dynamics. Recurrence means that neurons feed their outputs back into the network, giving it an internal state that persists across timesteps. Spiking means that neurons communicate through discrete, all-or-nothing spikes: each neuron integrates incoming signals into a membrane potential and fires only when that potential crosses a threshold, as in the leaky integrate-and-fire (LIF) model used here.

The combination of recurrence and spiking enables RSNNs to process sequences of inputs and to infer hidden variables—such as the unobserved velocities in CartPole—by integrating information over time. However, the non-differentiable nature of spiking makes these networks challenging to train with standard gradient-based methods.

RSNN architecture diagram
Rostami, A., Vogginger, B., Yan, Y., & Mayr, C. (2022). E-prop on SpiNNaker 2: Exploring online learning in spiking RNNs on neuromorphic hardware. Frontiers in Neuroscience, 16. doi:10.3389/fnins.2022.1018006
Leaky Integrate-and-Fire (LIF) neuron model diagram
Kosta, A., & Roy, K. (2022). Adaptive-SpikeNet: Event-based optical flow estimation using spiking neural networks with learnable neuronal dynamics. doi:10.48550/arXiv.2209.11741
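To make these dynamics concrete, here is a minimal discrete-time LIF update for a recurrent layer; the variable names, the multiplicative leak, and the reset-to-zero rule are common conventions used for illustration, not necessarily those of the evolved network:

```python
import numpy as np


def lif_step(v, prev_spikes, x, w_in, w_rec, leak, threshold):
    """One discrete-time step of a leaky integrate-and-fire recurrent layer.

    v           -- membrane potentials, shape (n,)
    prev_spikes -- 0/1 spike vector from the previous step, shape (n,)
    x           -- current external input (here: cart position and pole angle)
    leak        -- per-neuron decay factor in (0, 1); an evolvable parameter
    threshold   -- per-neuron firing threshold; also evolvable
    """
    # Leak the potential, then integrate external input and recurrent spikes.
    v = leak * v + w_in @ x + w_rec @ prev_spikes
    spikes = (v >= threshold).astype(v.dtype)  # all-or-nothing firing
    v = v * (1.0 - spikes)                     # reset neurons that just fired
    return v, spikes
```

Because `w_rec @ prev_spikes` feeds each step's activity back into the next, the membrane potentials can accumulate information across frames, which is what lets the network estimate the hidden velocities.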

Genetic Evolution

To train the RSNN for the CartPole task, a genetic (evolutionary) algorithm is used. Evolutionary algorithms are well suited to this problem because the spiking dynamics are non-differentiable, which makes standard gradient-based training impractical.

Each individual agent in the population encodes a complete RSNN "genome," which specifies:

The network's topology: which neurons exist and how they are connected
The weight of every synaptic connection
The internal parameters of each neuron, such as its membrane leak and firing threshold

At the start, all of these parameters are initialized randomly. During each generation, the agents are evaluated on their ability to balance the pole. The best-performing agents are selected to reproduce: their genomes are duplicated, randomly mutated, and placed into the next generation. Over many generations, these random mutations allow evolution to discover effective combinations of structure and parameters that perform well on the task (a sketch of this loop appears below).
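A minimal sketch of one generation of this loop, assuming a genome stored as a dictionary of parameter arrays (the names would match the LIF sketch above; topology mutations that add or remove neurons and connections are omitted here):

```python
import numpy as np

rng = np.random.default_rng(0)


def mutate(genome, sigma=0.05, rate=0.1):
    """Copy a genome and add Gaussian noise to a random subset of its genes."""
    child = {name: params.copy() for name, params in genome.items()}
    for params in child.values():
        mask = rng.random(params.shape) < rate   # which genes mutate
        params += mask * rng.normal(0.0, sigma, params.shape)
    return child


def next_generation(population, fitness, n_elite):
    """Keep the best agents and refill the population with their mutated copies."""
    ranked = sorted(population, key=fitness, reverse=True)
    elite = ranked[:n_elite]
    children = [
        mutate(elite[rng.integers(len(elite))])
        for _ in range(len(population) - n_elite)
    ]
    return elite + children
```

Here `fitness` would run an agent's RSNN in the masked CartPole environment and return its average episode score.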

This approach allows the discovery of both the architecture and the detailed dynamics of the spiking network, starting from pure randomness and without any hand-designed solutions.

Results

The evolved RSNN consistently achieves stable control of the CartPole system using only partial observations, far surpassing all baseline policies. The network effectively reconstructs the missing state variables over time through its recurrent loops, forming a crude short-term memory. Future work includes training the RSNN on more complex environments and, eventually, building a mechanism for long-term memory by allowing synapses to change over the course of a single agent's lifetime.