A Statistical Analysis of Polyak-Ruppert Averaged Q-learning

This repository contains the codes for the paper

A Statistical Analysis of Polyak-Ruppert Averaged Q-learning

If you find this code useful in your research, please consider citing:

@inproceedings{li2023statistical,
title={A statistical analysis of {P}olyak-{R}uppert averaged {Q}-learning},
author={Li, Xiang and Yang, Wenhao and Liang, Jiadong and Zhang, Zhihua and Jordan, Michael I},
booktitle={International Conference on Artificial Intelligence and Statistics},
pages={2207--2261},
year={2023},
organization={PMLR}
}

File name	Description
env.py	Two used environments, namely RandomMDP and SimpleMDP
main.py	Execution code in RandomMDP
main_simple.py	Execution code in SimpleMDP
algo.py	Functions for considered algorithms and the statistical inference procedure.
vi.py	Apply value iteration to estimate the optimal value function
plot_online.py	Code for producing Figure 1
plot_convergence.py	Code for producing Figure 2

Some notes

1. Acceleration by Ray:

Due to the need of repeated simulations, we use the distributed execution engine Ray to accelerate the fitting and save the experiment time. As a result, you may find the following code at the beginning of these main files:

import ray
ray.init(num_cpus=NUM_CPU, num_gpus=NUM_GPU)
res = ray.get(FUNCTION_TO_BE_PARALLELED)

2. Detail about the SimpleMDP:

The SimpleMDP is borrowed from [Wainwright 2019]. See Section 3.4 and Figure 1 therein.

3. Online statistical procedure for Q-Learning:

Once we establish the functional central limit theorem for the partial-sum process (Theorem 3.1 in our paper), we can apply Algorithm 1 in [Lee et al., 2021] to compute the pivotal statistic.

4. Considered algorithms:

We consider two algorithms. One is the averaged Q learning (named as "aql" in the code), while the other is the entropy regularized variant (named as "entropy" in the code).

About

Codes for the AISTATS 2023 paper, A Statistical Analysis of Polyak-Ruppert Averaged Q-learning.

Languages

Language:Python 100.0%

lx10077 / AveQLearning

A Statistical Analysis of Polyak-Ruppert Averaged Q-learning

Table of Contents

Some notes

About

Languages