Charlie Hepburn
Summary
PhD title: State-constrained offline deep reinforcement learning
I am a final-year PhD student at the Mathematics for Real-World Systems (MathSys) CDT under the supervision of Prof. Giovanni Montana. My research involves developing new methods in offline reinforcement learning (RL).
Offline RL aims to apply RL methods to fixed datasets. RL usually occurs in an online fashion, where the decision-making policy is trained by interacting with an environment: through trial and error, the agent seeks the best action to take in each situation it encounters. Applying online methods to offline datasets results in a distributional shift between the behaviour policy that generated the dataset and the current policy of the RL agent. This shift causes the values of actions unseen in the dataset to be over-estimated, resulting in poor performance. As a result, offline RL methods are usually constrained to stay close to the actions in the dataset, an approach initially labelled batch-constrained.
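To make the batch-constrained idea concrete, the following is a minimal illustrative sketch (not taken from any of the publications below) of a policy update in the style of TD3+BC: the policy maximises a learned Q-value while a behaviour-cloning penalty keeps its actions close to those in the offline batch. The network sizes, weighting and the random batch are placeholders.

import torch
import torch.nn as nn

state_dim, action_dim, batch_size = 8, 2, 64

policy = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                       nn.Linear(64, action_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                       nn.Linear(64, 1))
optimiser = torch.optim.Adam(policy.parameters(), lr=3e-4)

# Placeholder batch standing in for samples from the fixed offline dataset.
states = torch.randn(batch_size, state_dim)
dataset_actions = torch.rand(batch_size, action_dim) * 2 - 1

policy_actions = policy(states)
q_values = critic(torch.cat([states, policy_actions], dim=-1))

# Maximise Q, but penalise deviation from the dataset actions (the batch constraint).
bc_weight = 2.5 / q_values.abs().mean().detach()
loss = -(bc_weight * q_values).mean() + ((policy_actions - dataset_actions) ** 2).mean()

optimiser.zero_grad()
loss.backward()
optimiser.step()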
Constraining exclusively to the state-action pairs found in the dataset can be too restrictive. My research aims to relax this restriction by constraining the agent only to the states found in the dataset. This leads to higher-quality policies, as out-of-distribution actions can now be taken as long as they lead to in-distribution states, avoiding distributional shift while increasing learning potential.

Through my research, I have developed two broad approaches to state-constrained offline RL. The first is a data augmentation method called model-based trajectory stitching (MBTS). MBTS alters the sub-optimal trajectories in the dataset by introducing unseen actions between high-value reachable states [1]. This produces a new dataset whose underlying behaviour policy (the policy that created the dataset) is provably of higher quality. Improving the dataset via MBTS can also be used to improve batch-constrained offline RL methods [2]. The second is a state-constrained offline RL algorithm called state-constrained deep Q-learning (StaCQ) [3]. StaCQ is built upon theoretical findings showing that state-constrained offline RL methods use the data more efficiently and always lead to higher-quality policies than their batch-constrained counterparts. StaCQ constrains the policy to the action that leads to the highest-value next state. Both one-step and multi-step variants were introduced, showing state-of-the-art results on common benchmark datasets.
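The sketch below is purely illustrative and is not the actual MBTS or StaCQ implementation; the dynamics model, value estimates and distance threshold are all placeholders. It shows the core state-constrained idea: among candidate actions (possibly unseen in the dataset), keep only those whose predicted next state lies close to a dataset state, and prefer the one reaching the highest-value in-distribution state.

import numpy as np

rng = np.random.default_rng(0)
dataset_states = rng.normal(size=(500, 4))   # states observed in the offline dataset
state_values = rng.normal(size=500)          # placeholder value estimates V(s') for those states

def predicted_next_state(state, action):
    # Stand-in for a learned dynamics model.
    return state + 0.1 * np.concatenate([action, action])

def state_constrained_action(state, candidate_actions, threshold=0.5):
    best_action, best_value = None, -np.inf
    for action in candidate_actions:
        next_state = predicted_next_state(state, action)
        dists = np.linalg.norm(dataset_states - next_state, axis=1)
        nearest = dists.argmin()
        # The state constraint: the predicted next state must be in-distribution.
        if dists[nearest] < threshold and state_values[nearest] > best_value:
            best_action, best_value = action, state_values[nearest]
    return best_action

state = dataset_states[0]
candidates = rng.uniform(-1, 1, size=(32, 2))  # actions that may be unseen in the dataset
print(state_constrained_action(state, candidates))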
Publications
[1] Hepburn, C.A. and Montana, G. Model-based Trajectory Stitching for Improved Offline Reinforcement Learning. 3rd Offline RL Workshop at Neural Information Processing Systems (2022). (Publication) (arXiv)
[2] Hepburn, C.A. and Montana, G. Model-based trajectory stitching for improved behavioural cloning and its applications. Machine Learning (2023). (Publication) (arXiv)
[3] Hepburn, C.A., Jin, Y. and Montana, G. State-constrained offline reinforcement learning. Preprint (2024). (arXiv)
Education
2021 - Present: PhD | University of Warwick
Supervised by Prof. Giovanni Montana.
2020 - 2021: MSc in Mathematics of Systems (Distinction) | University of Warwick
Individual Project: A Critical Analysis of Selected Offline Reinforcement Learning Algorithms. Supervised by: Prof. Giovanni Montana.
Group Project: Modelling substantia-nigra neurons to quantify the effects of alpha-synuclein in Parkinson’s disease.
2016 - 2020: MMath Mathematics (First Class Hons.) | University of Edinburgh
Dissertation: Randomized Iterative Methods for Linear Systems. Supervised by Dr. Aretha Teckentrup.
Group Project: Approximate Bayesian Computation.