This course (IFT6760C) is intended for advanced graduate students with a good background in machine learning, mathematics, operations research, or statistics. You can register for IFT6760C on Synchro if you are affiliated with UdeM, or via CREPUQ if you are from another institution. Due to the research-oriented nature of this class, you need to be comfortable with a teaching format involving open-ended questions and assignments. You will be required to think critically and to keep an open mind. My goal with this course is for every participant to build their own understanding of reinforcement learning in relation to their primary research area while sharing their unique perspective and insights with the entire class.
Origin: from the Latin verb excurrere, meaning "to run out." This is also the intended meaning behind the title of this course: I want us to deviate from the usual paths and explore the rich connections between reinforcement learning and other disciplines, in particular optimization, control theory, and simulation. And of course, I'm also hoping that this will be a fun activity for everyone.
Twice a week: Tuesdays from 9:30 to 11:30 AM and Fridays from 1:30 to 3:40 PM. The course is taught at Mila, in the Agora at 6650 Saint-Urbain. You don't need badge access to enter the classroom. Here's a video showing how to access the classroom from Saint-Zotique Ouest.
The following evaluation structure is subject to change depending on the class size.
There is no mandatory textbook. I will, however, be referencing content from:
First class. Markov Decision Processes, the induced process.
Examples of MDPs; constrained MLE as sequential allocation; optimality criteria: finite horizon, infinite horizon, average reward, and the random-horizon interpretation of the infinite-horizon discounted setting. Bellman optimality.
@article{Shwartz2001,
doi = {10.1109/9.917668},
url = {https://doi.org/10.1109/9.917668},
year = {2001},
month = apr,
publisher = {Institute of Electrical and Electronics Engineers ({IEEE})},
volume = {46},
number = {4},
pages = {644--647},
author = {A. Shwartz},
title = {Death and discounting},
journal = {{IEEE} Transactions on Automatic Control}
}
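To make the Bellman optimality criterion concrete, here is a minimal NumPy sketch of value iteration on a randomly generated toy MDP; the transition tensor P, reward matrix R, and discount factor are placeholder data, not anything from the lectures.

```python
import numpy as np

# Toy MDP with 2 actions and 3 states (placeholder data).
# P[a, s, t] = probability of moving from state s to state t under action a.
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(3), size=(2, 3))
R = rng.standard_normal((3, 2))  # R[s, a]: expected immediate reward
gamma = 0.9

# Value iteration: repeated application of the Bellman optimality operator
# (Tv)(s) = max_a [ R(s, a) + gamma * sum_t P(t | s, a) v(t) ].
v = np.zeros(3)
for _ in range(1000):
    q = R + gamma * np.einsum("ast,t->sa", P, v)
    v_new = q.max(axis=1)
    if np.max(np.abs(v_new - v)) < 1e-10:  # T is a gamma-contraction
        break
    v = v_new

pi = q.argmax(axis=1)  # a greedy (hence optimal) deterministic policy
```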
Vector notation: Section 5.6 in Puterman (1994)
Neumann series and policy evaluation: Theorem 6.1.1 and Corollary C.4 in Puterman (1994)
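As a quick illustration of that result, a sketch comparing the direct solve of (I - gamma P_pi) v = r_pi with a truncated Neumann series; P_pi and r_pi are placeholder data.

```python
import numpy as np

rng = np.random.default_rng(1)
P_pi = rng.dirichlet(np.ones(4), size=4)  # transition matrix of the induced chain
r_pi = rng.standard_normal(4)             # expected one-step reward under the policy
gamma = 0.95

# Direct policy evaluation: solve (I - gamma * P_pi) v = r_pi.
v_direct = np.linalg.solve(np.eye(4) - gamma * P_pi, r_pi)

# Neumann series: (I - gamma P_pi)^{-1} = sum_{k>=0} (gamma P_pi)^k,
# which converges since the spectral radius of gamma * P_pi is gamma < 1.
v_series, term = np.zeros(4), r_pi.copy()
for _ in range(500):
    v_series += term
    term = gamma * P_pi @ term

assert np.allclose(v_direct, v_series)
```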
Matrix splitting methods:
@book{varga1962,
title = {Matrix iterative analysis},
author = {Richard S. Varga},
lccn = {62021277},
year = {1962},
publisher = {Prentice-Hall},
address = {Englewood Cliffs}
}
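To connect this to policy evaluation, a minimal sketch of one classical splitting, Gauss-Seidel: write A = I - gamma P_pi as A = M - N with M the lower-triangular part and iterate v <- M^{-1}(N v + r_pi). The data are placeholders.

```python
import numpy as np

rng = np.random.default_rng(2)
P_pi = rng.dirichlet(np.ones(4), size=4)
r_pi = rng.standard_normal(4)
gamma = 0.95

# Gauss-Seidel splitting of the policy evaluation system A v = r_pi.
A = np.eye(4) - gamma * P_pi
M = np.tril(A)      # lower-triangular part, including the diagonal
N = M - A           # so that A = M - N

# The iteration converges because A is strictly diagonally dominant:
# each diagonal entry exceeds its off-diagonal row sum by 1 - gamma.
v = np.zeros(4)
for _ in range(500):
    v = np.linalg.solve(M, N @ v + r_pi)

assert np.allclose(A @ v, r_pi)
```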
Primal and Dual LP formulations: section 6.9.1 in Puterman (1994)
@book{Derman1970,
author = {Derman, Cyrus},
title = {Finite State Markovian Decision Processes},
year = {1970},
isbn = {0122092503},
publisher = {Academic Press, Inc.},
address = {Orlando, FL, USA},
}
@book{altman1999,
author = {Altman, E.},
publisher = {Chapman and Hall},
title = {Constrained Markov Decision Processes},
year = {1999},
pages = {256}
}
Occupation measures, induced randomized stationary policy: theorem 6.9.1 in Puterman (1994)
Characterization of basic feasible solutions as Markov Deterministic decision rules: proposition 6.9.3 in Puterman (1994)
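To see the primal formulation of section 6.9.1 in action, here is a sketch that assembles the LP "minimize sum_s v(s) subject to v(s) >= r(s,a) + gamma sum_t p(t|s,a) v(t) for all (s,a)" and hands it to scipy.optimize.linprog; the dual variables of these constraints are exactly the occupation measures of theorem 6.9.1. The toy data are placeholders.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(3)
nS, nA, gamma = 3, 2, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))  # P[s, a, t]
R = rng.standard_normal((nS, nA))

# One inequality per state-action pair, written as A_ub @ v <= b_ub:
#   gamma * sum_t P(t|s,a) v(t) - v(s) <= -R(s, a).
A_ub = np.zeros((nS * nA, nS))
b_ub = np.zeros(nS * nA)
for s in range(nS):
    for a in range(nA):
        row = gamma * P[s, a].copy()
        row[s] -= 1.0
        A_ub[s * nA + a] = row
        b_ub[s * nA + a] = -R[s, a]

# Minimize sum_s v(s); v is free, so override linprog's default bounds.
res = linprog(c=np.ones(nS), A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * nS)
v_star = res.x  # the optimal value function of the toy MDP
```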
Bias of the MRP in the average reward formulation: section 8.2.1 in Puterman (1994)
@book{KantorovichAkilov1982,
title={Functional Analysis},
author={Kantorovich, L.V. and Akilov, G.P.},
isbn={9780080230368},
lccn={80041734},
year={1982},
publisher={Pergamon Press}
}
See On the Convergence of Policy Iteration in Stationary Dynamic Programming by Puterman and Brumelle (1979)
@article{PutermanBrumelle1979,
author = {Martin L. Puterman and Shelby L. Brumelle},
title = {On the Convergence of Policy Iteration in Stationary Dynamic Programming},
journal = {Mathematics of Operations Research},
volume = {4},
number = {1},
pages = {60-69},
year = {1979}
}
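Puterman and Brumelle interpret policy iteration as a Newton-type method, which explains its fast convergence. For reference, a minimal NumPy sketch of the exact algorithm itself, on placeholder data:

```python
import numpy as np

rng = np.random.default_rng(4)
nS, nA, gamma = 4, 3, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))  # P[s, a, t]
R = rng.standard_normal((nS, nA))

pi = np.zeros(nS, dtype=int)  # arbitrary initial deterministic policy
while True:
    # Policy evaluation: solve (I - gamma * P_pi) v = r_pi exactly.
    P_pi = P[np.arange(nS), pi]
    r_pi = R[np.arange(nS), pi]
    v = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)

    # Policy improvement: act greedily with respect to v.
    q = R + gamma * np.einsum("sat,t->sa", P, v)
    pi_new = q.argmax(axis=1)
    if np.array_equal(pi_new, pi):
        break  # the greedy policy is stable, hence optimal
    pi = pi_new
```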
See Numerical dynamic programming in economics by John Rust (1996)
@incollection{Rust1996,
doi = {10.1016/s1574-0021(96)01016-7},
url = {https://doi.org/10.1016/s1574-0021(96)01016-7},
year = {1996},
publisher = {Elsevier},
pages = {619--729},
author = {John Rust},
chapter = {14},
title = {Numerical dynamic programming in economics},
booktitle = {Handbook of Computational Economics}
}
Modified policy iteration: section 6.5 in Puterman (1994)
@techreport{vannunen1976,
author = {van Nunen, J. A. E. E. and Wessels, J.},
title = {Stopping times and Markov programming},
institution = {Technische Hogeschool Eindhoven},
address = {Eindhoven},
number = {76-22},
year = {1976}
}
@techreport{vannunentechreport1976,
author = {van Nunen, J. A. E. E. and Wessels, J.},
year = {1976},
title = {The generation of successive approximation methods for Markov decision processes by using stopping times},
institution = {Technische Hogeschool Eindhoven},
address = {Eindhoven},
number = {Memorandum COSOR; Vol. 7622}
}
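A sketch of the resulting scheme: each iteration performs one greedy backup followed by m partial evaluation sweeps under the greedy policy, so m = 0 recovers value iteration and m -> infinity recovers policy iteration. The data are placeholders.

```python
import numpy as np

rng = np.random.default_rng(5)
nS, nA, gamma, m = 4, 3, 0.9, 5
P = rng.dirichlet(np.ones(nS), size=(nS, nA))  # P[s, a, t]
R = rng.standard_normal((nS, nA))

v = np.zeros(nS)
for _ in range(1000):
    # Greedy backup.
    q = R + gamma * np.einsum("sat,t->sa", P, v)
    pi = q.argmax(axis=1)
    v_new = q.max(axis=1)

    # m sweeps of partial evaluation under the greedy policy pi.
    P_pi = P[np.arange(nS), pi]
    r_pi = R[np.arange(nS), pi]
    for _ in range(m):
        v_new = r_pi + gamma * P_pi @ v_new

    if np.max(np.abs(v_new - v)) < 1e-10:
        break
    v = v_new
```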
@article{Rust1988,
doi = {10.1137/0326056},
url = {https://doi.org/10.1137/0326056},
year = {1988},
month = sep,
publisher = {Society for Industrial {\&} Applied Mathematics ({SIAM})},
volume = {26},
number = {5},
pages = {1006--1024},
author = {John Rust},
title = {Maximum Likelihood Estimation of Discrete Control Processes},
journal = {{SIAM} Journal on Control and Optimization}
}
Section 8.1 of Nocedal and Wright (2006)
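That section covers finite-difference derivative approximations; a minimal sketch of the forward and central difference formulas, useful as a sanity check for the gradient estimators discussed next:

```python
import numpy as np

def fd_gradient(f, x, h=1e-6, central=True):
    """Approximate the gradient of f at x by finite differences."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        if central:
            g[i] = (f(x + e) - f(x - e)) / (2 * h)  # O(h^2) truncation error
        else:
            g[i] = (f(x + e) - f(x)) / h            # O(h) truncation error
    return g

# Example: the gradient of f(x) = ||x||^2 is 2x.
x = np.array([1.0, -2.0, 3.0])
assert np.allclose(fd_gradient(lambda z: z @ z, x), 2 * x, atol=1e-4)
```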
Intro to the likelihood ratio method
@article{LEcuyer1990,
doi = {10.1287/mnsc.36.11.1364},
url = {https://doi.org/10.1287/mnsc.36.11.1364},
year = {1990},
month = nov,
publisher = {Institute for Operations Research and the Management Sciences ({INFORMS})},
volume = {36},
number = {11},
pages = {1364--1383},
author = {Pierre L{\textquotesingle}Ecuyer},
title = {A Unified View of the {IPA}, {SF}, and {LR} Gradient Estimation Techniques},
journal = {Management Science}
}
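The likelihood ratio (score function) identity reads d/dtheta E_{x ~ p_theta}[f(x)] = E[f(x) d/dtheta log p_theta(x)]. A minimal sketch for a Gaussian location parameter, where the ground truth is available in closed form:

```python
import numpy as np

# For x ~ N(theta, 1), grad_theta log p_theta(x) = x - theta, so
#   d/dtheta E[f(x)] = E[f(x) * (x - theta)].
rng = np.random.default_rng(6)
theta, n = 1.5, 200_000
x = rng.normal(theta, 1.0, size=n)

f = lambda z: z ** 2                      # test function with a known answer
grad_lr = np.mean(f(x) * (x - theta))     # likelihood ratio estimate

# Ground truth: E[x^2] = theta^2 + 1, hence d/dtheta E[x^2] = 2 * theta.
print(grad_lr, 2 * theta)
```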
Spring break
Formulation of inverse RL and meta-RL as bilevel optimization.
Methods (contd.): KKT "trick", forward, reverse, implicit, competitive. Case studies
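As a toy illustration of the implicit route (not one of the case studies), a sketch that differentiates through the inner optimality condition grad_w g(w, lam) = 0 via the implicit function theorem, on a quadratic bilevel problem where everything is available in closed form:

```python
import numpy as np

# Outer: min_lam F(lam) = f(w*(lam)) with f(w) = 0.5 * ||w - target||^2.
# Inner: w*(lam) = argmin_w g(w, lam), g(w, lam) = 0.5*||w - lam||^2 + 0.5*||w||^2,
# so grad_w g = 2w - lam and w*(lam) = lam / 2.
target = np.array([1.0, -1.0])

def hypergradient(lam):
    w = lam / 2.0                 # inner solution (closed form for this toy g)
    grad_f = w - target           # gradient of the outer objective in w
    H = 2.0 * np.eye(2)           # inner Hessian      d^2 g / dw dw
    J = -np.eye(2)                # mixed derivative   d^2 g / dw dlam
    # Implicit function theorem: dw*/dlam = -H^{-1} J, hence
    # dF/dlam = (dw*/dlam)^T grad_f = -J^T H^{-1} grad_f.
    return -J.T @ np.linalg.solve(H, grad_f)

# Gradient descent on the outer objective; lam converges to 2 * target,
# so that the inner solution w*(lam) matches the target.
lam = np.zeros(2)
for _ in range(500):
    lam -= 0.5 * hypergradient(lam)
```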
Challenges and opportunities
Final project presentations
Academic life can sometimes be overwhelming. Don't hesitate to reach out for support: