Excursions in Reinforcement Learning

This course (IFT6760C) is intended for advanced graduate students with a good background in machine learning, mathematics, operations research or statistics. You can register to IFT6760C on Synchro if your affiliation is with UdeM, or via the CREPUQ if you are from another institution. Due to the research-oriented nature of this class, you need to be comfortable with a teaching format involving open-ended questions and assignments. You will be required to think critically and adopt an open mindset. My teaching goal with this course is for all the participants to build their own understanding of reinforcement learning in relation to their primary research area while sharing their unique perspective and insights with the entire class.

excursion (noun)

• A short journey or trip, especially one taken as a leisure activity.
• A deviation from a regular activity or course.

Origin: from the Latin verb excurrere which means to run out. This is also the intended meaning behind the title of this course. I want us to deviate from the usual paths and explore the rich connections between reinforcement learning and other disciplines, in particular: optimization, control theory and simulation. And of course, I'm also hoping that this will be a fun activity for everyone.

Time and Location

Twice a week: Tuesdays from 9:30 to 11:30 AM and Fridays from 1:30 to 3:40 PM. The course is taught at Mila, in the Agora at 6650 Saint-Urbain. You don't need badge access to enter the classroom. Here's a video showing you how to access the classroom from Saint-Zotique Ouest.

Evaluation

The following evaluation structure is subject to change depending on the class size.

• Class participation: 5%
• Assignments: 3 x 15% = 45%
• Final project: 50%

There is no mandatory textbook. I will, however, reference content from the sources cited in the schedule below, in particular Puterman (1994) and Bertsekas.

Schedule

Tuesday January 7

First class. Markov Decision Processes, induced process

Friday January 10

Examples of MDPs, constrained MLE as sequential allocation, criteria: finite horizon, infinite horizon, average reward, random horizon interpretation of infinite discounted setting. Bellman optimality
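The random-horizon interpretation of discounting mentioned above is easy to check numerically: if the process terminates with probability 1 − γ after each step (a geometric horizon), the expected undiscounted total reward equals the discounted value v = (I − γP)⁻¹r. A minimal numpy sketch on a made-up two-state Markov reward process:

```python
import numpy as np

# Tiny Markov reward process (numbers are made up for illustration).
P = np.array([[0.9, 0.1], [0.2, 0.8]])  # transition matrix
r = np.array([1.0, 0.0])                # reward per state
gamma = 0.9

# Closed form: v = (I - gamma P)^{-1} r ; here v_exact ~ [7.568, 4.865]
v_exact = np.linalg.solve(np.eye(2) - gamma * P, r)

# Random-horizon view: run the *undiscounted* process but stop after each
# step with probability 1 - gamma. The expected total reward matches v.
def mc_value(start, n_episodes=20000, seed=0):
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_episodes):
        s = start
        while True:
            total += r[s]
            if rng.random() > gamma:   # episode "dies" w.p. 1 - gamma
                break
            s = rng.choice(2, p=P[s])
    return total / n_episodes
```

With enough episodes, `mc_value(0)` agrees with `v_exact[0]` up to Monte Carlo error.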

Tuesday January 14

• Absorbing states and discounting.
• Section 5.3, proposition 5.3.1 and page 127 in Puterman (1994)
• Death and discounting by Adam Shwartz
• @article{Shwartz2001,
doi = {10.1109/9.917668},
url = {https://doi.org/10.1109/9.917668},
year = {2001},
month = apr,
publisher = {Institute of Electrical and Electronics Engineers ({IEEE})},
volume = {46},
number = {4},
pages = {644--647},
author = {A. Shwartz},
title = {Death and discounting},
journal = {{IEEE} Transactions on Automatic Control}
}
• Vector notation: Section 5.6 in Puterman (1994)

• Neumann series and policy evaluation: Theorem 6.1.1 and corollary C.4 in Puterman (1994)

• Matrix splitting methods:

• Iterative Methods by Martin H. Gutknecht
• Matrix iterative analysis by Richard Varga
• @book{varga1962,
title       = {Matrix iterative analysis},
author  = {Richard S. Varga},
lccn        = {62021277},
year        = {1962},
publisher   = {Prentice-Hall},
}
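The Neumann series and matrix splitting ideas above can be sketched in a few lines of numpy (the two-state chain is a made-up example): partial sums of Σₖ (γP)ᵏ r converge to the solution of (I − γP)v = r, and Gauss-Seidel comes from splitting I − γP into its lower-triangular part and the remainder:

```python
import numpy as np

P = np.array([[0.9, 0.1], [0.2, 0.8]])  # made-up transition matrix
r = np.array([1.0, 0.0])
gamma = 0.9

v_exact = np.linalg.solve(np.eye(2) - gamma * P, r)

# Neumann series: (I - gamma P)^{-1} = sum_k (gamma P)^k, so the partial
# sums of sum_k (gamma P)^k r converge to v at rate gamma^k.
def neumann(n_terms):
    v, term = np.zeros(2), r.copy()
    for _ in range(n_terms):
        v = v + term
        term = gamma * (P @ term)
    return v

# Matrix splitting: write A = I - gamma P as M - N and iterate
# M v_{k+1} = N v_k + r. Taking M = I gives successive approximation;
# Gauss-Seidel takes M = the lower-triangular part of A (Varga, 1962).
def gauss_seidel(n_sweeps):
    A = np.eye(2) - gamma * P
    M = np.tril(A)
    N = M - A
    v = np.zeros(2)
    for _ in range(n_sweeps):
        v = np.linalg.solve(M, r + N @ v)
    return v
```

Both iterations converge to `v_exact`; Gauss-Seidel typically needs fewer sweeps.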

Friday January 17

• Primal and Dual LP formulations: section 6.9.1 in Puterman (1994)

• Finite State Markovian Decision Processes by Cyrus Derman
• @book{Derman1970,
author = {Derman, Cyrus},
title = {Finite State Markovian Decision Processes},
year = {1970},
isbn = {0122092503},
} 
• Markov Decision Processes by Lodewijk Kallenberg
• Constrained Markov Decision Processes by Eitan Altman
• @book{altman1999,
author    = {Altman, E.},
publisher = {Chapman and Hall},
title     = {Constrained Markov Decision Processes},
year      = 1999,
pages     = 256
}
• Occupation measures, induced randomized stationary policy: theorem 6.9.1 in Puterman (1994)

• Characterization of basic feasible solutions as Markov Deterministic decision rules: proposition 6.9.3 in Puterman (1994)

• Bias of the MRP in the average reward formulation: section 8.2.1 in Puterman (1994)

• Laurent series expansion in the average reward formulation: section 8.2.2 in Puterman (1994)

• Representation of the total expected discounted reward in terms of the interest rate

• Corollary 8.2.4: decomposition of the total expected discounted reward of a policy into the gain and bias terms

• Corollary 8.2.5: gain as the limit of discount factor going to 1 in the total expected discounted reward

• Bellman optimality equations in the total expected discounted reward criterion

• Maximizing over deterministic Markovian decision rules attains the same optimal value as maximizing over randomized Markovian decision rules: proposition 6.2.1 in Puterman (1994)

• Contraction mappings: section 6.2.3 in Puterman (1994)

• Banach fixed-point theorem: theorem 6.2.3 in Puterman (1994)
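The occupation-measure view of theorem 6.9.1 can be verified numerically: for a fixed policy, the discounted occupation measure xᵀ = μᵀ(I − γP_π)⁻¹ (the feasible points of the dual LP) satisfies ⟨x, r⟩ = μᵀv and sums to 1/(1 − γ). A sketch with made-up numbers:

```python
import numpy as np

P = np.array([[0.8, 0.2], [0.3, 0.7]])  # P_pi: transitions under a fixed policy
r = np.array([1.0, 0.0])                # rewards under that policy (made up)
gamma = 0.9
mu = np.array([0.5, 0.5])               # initial state distribution

# Discounted occupation measure: x(s) = sum_t gamma^t Pr(s_t = s),
# i.e. x^T = mu^T (I - gamma P)^{-1}.
x = np.linalg.solve((np.eye(2) - gamma * P).T, mu)

# Policy evaluation gives the same scalar through the primal side:
# <x, r> = mu^T (I - gamma P)^{-1} r = mu^T v, the dual LP objective.
v = np.linalg.solve(np.eye(2) - gamma * P, r)
```

Since P is row-stochastic, x always sums to Σₜ γᵗ = 1/(1 − γ), which is the normalization constraint of the dual LP.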

Tuesday January 21

• Contraction mapping theorem: theorem 6.2.3 of Puterman (1994)
• The Bellman optimality operator is a contraction: proposition 6.2.4
• Conserving decision rules: section 6.2.4
• Existence of stationary policies: theorem 6.2.7
• Value iteration: section 6.3.2
• Termination criterion: theorem 6.3.1 c) and d)
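The items above fit in a short value iteration routine; the stopping rule below is the one from theorem 6.3.1, which guarantees an ε-optimal value. The two-state, two-action MDP is hypothetical:

```python
import numpy as np

P = np.array([[[0.8, 0.2], [0.3, 0.7]],    # P[a, s, s'] (made-up MDP)
              [[0.5, 0.5], [0.9, 0.1]]])
R = np.array([[1.0, 0.0], [0.5, 2.0]])     # R[a, s]
gamma, eps = 0.9, 1e-6

def value_iteration():
    v = np.zeros(2)
    while True:
        # Bellman optimality operator: (T v)(s) = max_a [R(s,a) + gamma P_a v]
        v_new = np.max(R + gamma * P @ v, axis=0)
        # Theorem 6.3.1 c)-d): stopping when the sup-norm change is below
        # eps (1 - gamma) / (2 gamma) makes the greedy policy eps-optimal.
        if np.max(np.abs(v_new - v)) < eps * (1 - gamma) / (2 * gamma):
            return v_new
        v = v_new

v_star = value_iteration()
```

By the Banach fixed-point theorem the iterates converge geometrically at rate γ from any initialization.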

Friday January 24

• Policy Iteration: section 6.4

• Monotonicity of the value iterates: proposition 6.4.1

• Convergence of policy iteration: theorem 6.4.2

• PI as a constructive proof for the existence of a solution to the Bellman optimality equations

• Newton-Kantorovich Iterations: section 6.4.3

• Gateaux derivatives

• Policy iteration as Newton-Kantorovich: proposition 6.4.4

• See Functional Analysis 2nd edition by Kantorovich and Akilov (1982)

•   @book{KantorovichAkilov1982,
title={Functional Analysis},
author={Kantorovich, L.V. and Akilov, G.P.},
isbn={9780080230368},
lccn={80041734},
year={1982},
publisher={Pergamon Press}
}
• See On the Convergence of Policy Iteration in Stationary Dynamic Programming by Puterman and Brumelle (1979)

• @article{PutermanBrumelle1979,
author = {Martin L. Puterman and Shelby L. Brumelle},
title = {On the Convergence of Policy Iteration in Stationary Dynamic Programming},
journal = {Mathematics of Operations Research},
volume = {4},
number = {1},
pages = {60-69},
year = {1979}
}
• See Numerical dynamic programming in economics by John Rust (1996)

• @incollection{Rust1996,
doi = {10.1016/s1574-0021(96)01016-7},
url = {https://doi.org/10.1016/s1574-0021(96)01016-7},
year = {1996},
publisher = {Elsevier},
pages = {619--729},
author = {John Rust},
title = {Numerical dynamic programming in economics},
booktitle = {Handbook of Computational Economics}
}
• Modified policy iteration: section 6.5

• Generalized Bellman optimality operator
• See Stopping times and Markov programming by Van Nunen 1976
• @TechReport{vannunen1976,
author  = "J.A.E.E. van Nunen and J. Wessels",
title       = "Stopping times and Markov programming",
institution = "Technische Hogeschool Eindhoven",
number  = "76-22",
year        = "1976"
}
• See The generation of successive approximation methods for Markov decision processes by using stopping times by Van Nunen and Wessels (1976)
• @techreport{vannunentechreport1976,
author = {van Nunen, J. A. E. E. and Wessels, J.},
year = {1976},
title = {The generation of successive approximation methods for Markov decision processes by using stopping times},
institution = {Technische Hogeschool Eindhoven},
number = {Memorandum COSOR; Vol. 7622}
}
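Policy iteration from section 6.4 can be sketched on the same kind of toy MDP (numbers made up): exact evaluation by a linear solve, then a greedy improvement step, which proposition 6.4.4 identifies with a Newton-Kantorovich step on the Bellman optimality equations.

```python
import numpy as np

P = np.array([[[0.8, 0.2], [0.3, 0.7]],    # P[a, s, s'] (made-up MDP)
              [[0.5, 0.5], [0.9, 0.1]]])
R = np.array([[1.0, 0.0], [0.5, 2.0]])     # R[a, s]
gamma, nS = 0.9, 2

def policy_iteration():
    pi = np.zeros(nS, dtype=int)           # deterministic Markov policy
    while True:
        # Policy evaluation: solve (I - gamma P_pi) v = r_pi exactly.
        P_pi = P[pi, np.arange(nS)]        # row s is P[pi[s], s, :]
        r_pi = R[pi, np.arange(nS)]
        v = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)
        # Policy improvement (greedy step). Proposition 6.4.4: this is a
        # Newton-Kantorovich iteration on the Bellman optimality equations.
        q = R + gamma * P @ v              # q[a, s]
        pi_new = np.argmax(q, axis=0)
        if np.array_equal(pi_new, pi):
            return pi, v                   # policy stable => optimal
        pi = pi_new

pi_star, v_star = policy_iteration()
```

Termination at a stable policy also gives a constructive proof that the Bellman optimality equations have a solution: the returned value satisfies them exactly. Modified policy iteration replaces the exact solve by a few applications of the policy's Bellman operator.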

Tuesday January 28

• Derivation of the smooth Bellman optimality equations from the point of view of a noisy reward function where the noise process comes from a Gumbel distribution
• Social surplus function
• Williams-Daly-Zachary theorem
• See McFadden's Econometric Models of Probabilistic Choice (1981)
• See Rust (1996) and:
•   @article{Rust1988,
doi = {10.1137/0326056},
url = {https://doi.org/10.1137/0326056},
year = {1988},
month = sep,
publisher = {Society for Industrial {\&} Applied Mathematics ({SIAM})},
volume = {26},
number = {5},
pages = {1006--1024},
author = {John Rust},
title = {Maximum Likelihood Estimation of Discrete Control Processes},
journal = {{SIAM} Journal on Control and Optimization}
}
• Introduction to the projected Bellman equations, Galerkin method, variational formulation
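The Gumbel derivation above can be checked by simulation: with i.i.d. standard Gumbel noise, E[maxₐ(qₐ + εₐ)] = logsumexp(q) plus the Euler-Mascheroni constant (the social surplus, up to the constant), and, as in the Williams-Daly-Zachary theorem, the argmax frequencies equal the softmax probabilities, i.e. the gradient of the surplus. A sketch with arbitrary q values:

```python
import numpy as np

q = np.array([1.0, 0.5, -0.2])   # arbitrary action values

def logsumexp(x):
    m = x.max()
    return m + np.log(np.sum(np.exp(x - m)))

rng = np.random.default_rng(0)
noise = rng.gumbel(size=(200000, q.size))      # standard Gumbel(0, 1) noise

# Monte Carlo estimate of E[max_a (q_a + eps_a)] vs the smooth backup.
mc = np.mean(np.max(q + noise, axis=1))
smooth = logsumexp(q) + np.euler_gamma         # Gumbel mean = Euler's constant

# Williams-Daly-Zachary: choice probabilities = gradient of the surplus,
# which for logsumexp is exactly the softmax of q.
freq = np.bincount(np.argmax(q + noise, axis=1), minlength=q.size) / len(noise)
soft = np.exp(q - logsumexp(q))
```

Dropping the constant gives the smooth ("soft") Bellman optimality operator, with logsumexp replacing the hard max.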

Friday January 31

• Lemma 6.3.1 in Bertsekas on the property of the weighted Euclidean norm taken under a stationary distribution
• Proposition 6.3.1: using lemma 6.3.1 to show that the projected policy evaluation operator is a $$\gamma$$-contraction, and a bound on the error between the fixed-point solution and the true value function
• Use of the Pythagorean theorem (footnote on page 425, section 6.3.1)
• Matrix form of the projected equations (section 6.3.2)
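The matrix form of the projected equations can be written out directly: with features Φ and the diagonal matrix D of the stationary distribution, the fixed point Φw of the projected operator solves Cw = d with C = ΦᵀD(I − γP)Φ and d = ΦᵀDr. A sketch on a made-up three-state chain with two features:

```python
import numpy as np

P = np.array([[0.8, 0.2, 0.0],   # made-up irreducible, aperiodic chain
              [0.1, 0.7, 0.2],
              [0.2, 0.1, 0.7]])
r = np.array([1.0, 0.0, 0.5])
gamma = 0.9
Phi = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])   # feature matrix

# Stationary distribution xi (power iteration), weighting the norm.
xi = np.ones(3) / 3
for _ in range(2000):
    xi = xi @ P
D = np.diag(xi)

# Matrix form of the projected equations: C w = d.
C = Phi.T @ D @ (np.eye(3) - gamma * P) @ Phi
d = Phi.T @ D @ r
w = np.linalg.solve(C, d)

# Projection matrix for the xi-weighted Euclidean norm.
Pi = Phi @ np.linalg.solve(Phi.T @ D @ Phi, Phi.T @ D)
v = Phi @ w    # fixed point of the projected operator Pi T
```

Solving Cw = d directly (rather than iterating) is the "direct" route to the projected equations.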

Tuesday February 4

• Projection by Monte Carlo simulation, section 6.1.4 of Bertsekas
• Derivation of the closed-form solution of the projection under the weighted Euclidean norm
• Monte Carlo estimates of the closed-form solution (see page 406 of Bertsekas)
• Simulation-based version of the direct solution to the projected policy evaluation equations: LSTD(0)
• Section 6.3.3 of Bertsekas
• Section 6.3.4 of Bertsekas
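LSTD(0) replaces the matrices of the projected equations by sample averages along a single trajectory: Ĉ averages φ(s)(φ(s) − γφ(s′))ᵀ and d̂ averages φ(s)r(s), then solves Ĉw = d̂. A sketch on the same kind of made-up chain:

```python
import numpy as np

P = np.array([[0.8, 0.2, 0.0],   # made-up chain
              [0.1, 0.7, 0.2],
              [0.2, 0.1, 0.7]])
r = np.array([1.0, 0.0, 0.5])
gamma = 0.9
Phi = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])

# LSTD(0): accumulate C ~ E[phi(s)(phi(s) - gamma phi(s'))^T] and
# d ~ E[phi(s) r(s)] from one long trajectory, then solve C w = d.
# (The common scale factor 1/n cancels in the solve.)
def lstd(n_steps=100000, seed=0):
    rng = np.random.default_rng(seed)
    C = np.zeros((2, 2))
    d = np.zeros(2)
    s = 0
    for _ in range(n_steps):
        s_next = rng.choice(3, p=P[s])
        C += np.outer(Phi[s], Phi[s] - gamma * Phi[s_next])
        d += Phi[s] * r[s]
        s = s_next
    return np.linalg.solve(C, d)

w_hat = lstd()
```

Because the trajectory visits states with stationary frequencies, the sample averages converge to the exact ΦᵀD(I − γP)Φ and ΦᵀDr.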

Friday February 7

• Assignment 1 released
• Perron-Frobenius theorem and numerical methods for computing the stationary distribution
• Projected Value Iteration algorithm, see section 6.3.2 of Bertsekas
• Simulation-based counterpart to PVI: LSPE(0), see section 6.3.4 of Bertsekas
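The Perron-Frobenius theorem is what licenses the simplest numerical method here: for an irreducible aperiodic stochastic matrix, the stationary distribution is unique and power iteration μₖ₊₁ = μₖP converges to it from any initial distribution. A minimal sketch (made-up chain):

```python
import numpy as np

P = np.array([[0.8, 0.2, 0.0],   # irreducible, aperiodic stochastic matrix
              [0.1, 0.7, 0.2],
              [0.2, 0.1, 0.7]])

# Power iteration: by Perron-Frobenius, mu P = mu has a unique probability
# solution, and left-multiplying any distribution by P converges to it.
def stationary(P, tol=1e-12):
    mu = np.ones(P.shape[0]) / P.shape[0]
    while True:
        mu_new = mu @ P
        if np.max(np.abs(mu_new - mu)) < tol:
            return mu_new
        mu = mu_new
```

The stationary distribution is what weights the projection norm in the projected equations, so estimating it (explicitly, or implicitly through sampling) underlies both PVI and LSPE(0).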

Tuesday February 11

• Oblique Projections
• Projected Bellman equations with matrix splitting methods
• Simulation-based counterpart: LSTD($$\lambda$$)
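LSTD(λ) only changes the sample averages of LSTD(0): an eligibility trace zₜ = γλzₜ₋₁ + φ(sₜ) replaces φ(sₜ) in both accumulators, so the solved system corresponds to the multistep (λ-weighted) projected equations. A sketch, same made-up chain as before:

```python
import numpy as np

P = np.array([[0.8, 0.2, 0.0],   # made-up chain
              [0.1, 0.7, 0.2],
              [0.2, 0.1, 0.7]])
r = np.array([1.0, 0.0, 0.5])
gamma, lam = 0.9, 0.5
Phi = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])

def lstd_lambda(n_steps=100000, seed=0):
    rng = np.random.default_rng(seed)
    C = np.zeros((2, 2))
    d = np.zeros(2)
    z = np.zeros(2)                        # eligibility trace
    s = 0
    for _ in range(n_steps):
        s_next = rng.choice(3, p=P[s])
        z = gamma * lam * z + Phi[s]       # geometric trace of past features
        C += np.outer(z, Phi[s] - gamma * Phi[s_next])
        d += z * r[s]
        s = s_next
    return np.linalg.solve(C, d)

w_lam = lstd_lambda()
```

In expectation the accumulators converge to ΦᵀD(I − γλP)⁻¹(I − γP)Φ and ΦᵀD(I − γλP)⁻¹r; λ = 0 recovers LSTD(0).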

Friday February 14

• Valentine's day
• See [Wikipedia](https://en.wikipedia.org/wiki/Valentine%27s_Day)
• ODE Method and Stochastic Approximation
• Fitted value methods:
• Non-linear case
• Counterexample
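The ODE method can be illustrated with tabular TD(0), a Robbins-Monro stochastic approximation whose iterates track the ODE dv/dt = r + γPv − v, with the policy-evaluation fixed point as its equilibrium. Sketch on the made-up two-state chain (the step-size schedule is an arbitrary Robbins-Monro choice):

```python
import numpy as np

P = np.array([[0.9, 0.1], [0.2, 0.8]])   # made-up chain
r = np.array([1.0, 0.0])
gamma = 0.9

def td0(n_steps=200000, seed=0):
    rng = np.random.default_rng(seed)
    v = np.zeros(2)
    s = 0
    for t in range(n_steps):
        s_next = rng.choice(2, p=P[s])
        alpha = 10.0 / (t + 100)          # Robbins-Monro: sum = inf, sum^2 < inf
        # Stochastic approximation step toward the ODE equilibrium.
        v[s] += alpha * (r[s] + gamma * v[s_next] - v[s])
        s = s_next
    return v
```

With function approximation the same update need not be a stochastic gradient of any objective, which is where the divergence counterexamples come from.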

Week of February 17

Policy gradients: occupation measures, discounted objective, implicit differentiation and derivation in the infinite horizon case
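The occupation-measure form of the discounted policy gradient can be computed exactly on a small MDP and verified against finite differences: ∇J = Σₛ d(s) Σₐ ∇π(a|s) q(s,a), with d the discounted occupation measure, which for a softmax policy reduces to d(s)π(a|s)(q(s,a) − v(s)). The MDP and the tabular softmax parameterization below are made up for illustration:

```python
import numpy as np

P = np.array([[[0.8, 0.2], [0.3, 0.7]],    # P[a, s, s'] (made-up MDP)
              [[0.5, 0.5], [0.9, 0.1]]])
R = np.array([[1.0, 0.0], [0.5, 2.0]])     # R[a, s]
gamma = 0.9
mu = np.array([0.5, 0.5])                  # initial distribution

def pg(theta):
    """Return J(theta) and its exact gradient via the occupation measure."""
    pi = np.exp(theta) / np.exp(theta).sum(axis=1, keepdims=True)  # pi[s, a]
    P_pi = np.einsum('sa,ast->st', pi, P)
    r_pi = np.einsum('sa,as->s', pi, R)
    v = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
    q = R.T + gamma * np.einsum('ast,t->sa', P, v)                 # q[s, a]
    # Discounted occupation measure: d^T = mu^T (I - gamma P_pi)^{-1}.
    d = np.linalg.solve((np.eye(2) - gamma * P_pi).T, mu)
    # Softmax policy: grad J[s, a] = d(s) pi(a|s) (q(s, a) - v(s)).
    grad = d[:, None] * pi * (q - v[:, None])
    return mu @ v, grad
```

The same gradient can also be derived by implicitly differentiating the linear system (I − γP_π)v = r_π with respect to θ.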

Week of February 24

Policy gradients: derivative estimation, likelihood ratio methods (REINFORCE), reparametrization (IPA), baselines (control variates), actor-critic systems
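The likelihood-ratio estimator and the variance effect of a baseline are easiest to see on a one-step problem. On a made-up two-armed Gaussian bandit with a softmax policy, ∇J = E[∇log π(a)(r − b)] for any action-independent baseline b, and a well-chosen b shrinks the variance without biasing the estimate:

```python
import numpy as np

means = np.array([1.0, 2.0])   # made-up arm means, unit-variance rewards

def grad_estimate(theta, baseline, n=100000, seed=0):
    """REINFORCE estimate of grad J, with per-sample std for comparison."""
    rng = np.random.default_rng(seed)
    pi = np.exp(theta) / np.exp(theta).sum()
    a = rng.choice(2, p=pi, size=n)
    rew = rng.normal(means[a], 1.0)
    glogp = np.eye(2)[a] - pi              # grad log pi(a) for softmax
    g = glogp * (rew - baseline)[:, None]  # score x (return - baseline)
    return g.mean(axis=0), g.std(axis=0)

# Exact gradient for this bandit: grad J = pi * (means - J), J = <pi, means>.
theta = np.zeros(2)
pi = np.exp(theta) / np.exp(theta).sum()
J = pi @ means
g_exact = pi * (means - J)
```

Using b = J (the value itself) as baseline is the control-variate choice an actor-critic system approximates with a learned critic.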

Spring break

Week of March 9

Policy gradients: application for learning temporal abstractions, the option-critic architecture, hierarchical and goal-conditioned RL

Week of March 23

Automatic differentiation as discrete-time optimal control
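The connection can be sketched concretely: for a chain xₜ₊₁ = f(xₜ) with terminal loss L(x_T), reverse-mode AD is exactly the backward costate (adjoint) recursion λₜ = (∂f/∂x)(xₜ)ᵀ λₜ₊₁ of discrete-time optimal control. A scalar example with f = sin and L = x_T² (chosen arbitrarily for illustration):

```python
import numpy as np

def forward(x0, T=10):
    """Forward pass: roll out the dynamics x_{t+1} = sin(x_t)."""
    xs = [x0]
    for _ in range(T):
        xs.append(np.sin(xs[-1]))
    return xs

def grad_x0(x0, T=10):
    """Backward pass: costate recursion, identical to reverse-mode AD."""
    xs = forward(x0, T)
    lam = 2.0 * xs[-1]                 # terminal condition: dL/dx_T for L = x_T^2
    for t in reversed(range(T)):
        lam = np.cos(xs[t]) * lam      # lam_t = f'(x_t) * lam_{t+1}
    return lam
```

Storing the forward trajectory and sweeping the adjoint backward is precisely what backpropagation does through a deep network's layers.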

Week of March 30

Formulation of inverse RL and meta-RL as bilevel optimization.

Week of April 6

Methods (cont'd): the KKT "trick"; forward, reverse, implicit, and competitive approaches. Case studies

Week of April 13

Challenges and opportunities

Week of April 20

Final project presentations

Wellbeing

Academic life can sometimes be overwhelming. Don't hesitate to find support: