This course (IFT6760C) is intended for advanced graduate students with a good background in machine learning, mathematics, operations research, or statistics. You can register for IFT6760C on Synchro if you are affiliated with UdeM, or via the CREPUQ if you are from another institution. Due to the research-oriented nature of this class, you need to be comfortable with a teaching format involving *open-ended* questions and assignments. You will be required to think critically and to adopt an open mindset. My teaching goal with this course is for all participants to build their own understanding of reinforcement learning in relation to their primary research area while sharing their unique perspective and insights with the entire class.

*A short journey or trip, especially one taken as a leisure activity.* *A deviation from a regular activity or course.*

Origin: from the Latin verb *excurrere* which means *to run out*. This is also the intended meaning behind the title of this course. I want us to deviate from the usual paths and explore the rich connections between reinforcement learning and other disciplines, in particular: optimization, control theory and simulation. And of course, I'm also hoping that this will be a fun activity for everyone.

Twice a week: Tuesdays from 9:30 to 11:30 AM and Fridays from 1:30 to 3:40 PM. The course is taught at Mila, in the Agora of 6650 Saint-Urbain. You don't need badge access to enter the classroom. Here's a video showing how to access the classroom from Saint-Zotique Ouest.

The following evaluation structure is subject to change depending on the class size.

- Class participation: 5%
- Assignments: 3 x 15% = 45%
- Final project: 50%

There is no mandatory textbook. I will, however, be referencing content from:

- Markov Decision Processes: Discrete Stochastic Dynamic Programming by Martin Puterman
- Reinforcement Learning and Optimal Control by Dimitri Bertsekas
- Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto
- Nonlinear Programming: 3rd Edition by Dimitri Bertsekas
- Practical Methods for Optimal Control and Estimation Using Nonlinear Programming by John T. Betts
- Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation, Second Edition by Andreas Griewank and Andrea Walther

First class: Markov Decision Processes and the induced stochastic process.

Examples of MDPs; constrained MLE as sequential allocation; optimality criteria: finite horizon, infinite horizon, average reward; the random-horizon interpretation of the infinite-horizon discounted setting; Bellman optimality.

- Absorbing states and discounting.
- Section 5.3, proposition 5.3.1 and page 127 in Puterman (1994)
- Death and discounting by Adam Shwartz
`@article{Shwartz2001, doi = {10.1109/9.917668}, url = {https://doi.org/10.1109/9.917668}, year = {2001}, month = apr, publisher = {Institute of Electrical and Electronics Engineers ({IEEE})}, volume = {46}, number = {4}, pages = {644--647}, author = {A. Shwartz}, title = {Death and discounting}, journal = {{IEEE} Transactions on Automatic Control} }`
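The random-horizon interpretation can be checked numerically. A minimal sketch on a made-up two-state Markov reward process: the discounted value equals the expected *undiscounted* return of an episode that terminates with probability \(1-\gamma\) after each step.

```python
import numpy as np

gamma = 0.9
P = np.array([[0.7, 0.3],    # made-up two-state Markov reward process
              [0.4, 0.6]])
r = np.array([1.0, 2.0])

# Exact discounted values: v = (I - gamma P)^{-1} r.
v = np.linalg.solve(np.eye(2) - gamma * P, r)

# Random-horizon interpretation: run the undiscounted chain and stop
# with probability 1 - gamma after each reward (a geometric horizon).
rng = np.random.default_rng(0)
returns = np.zeros(20_000)
for i in range(returns.size):
    s, total = 0, 0.0
    while True:
        total += r[s]
        if rng.random() > gamma:      # terminate with probability 1 - gamma
            break
        s = rng.choice(2, p=P[s])
    returns[i] = total

print(v[0], returns.mean())   # agree up to Monte Carlo error
```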

Vector notation: Section 5.6 in Puterman (1994)

Neumann series and policy evaluation: Theorem 6.1.1 and corollary C.4 in Puterman (1994)
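Concretely, the Neumann series \(\sum_k (\gamma P)^k r\) converges to the policy evaluation solution \((I - \gamma P)^{-1} r\) because the spectral radius of \(\gamma P\) is at most \(\gamma < 1\). A quick numerical check on a made-up chain:

```python
import numpy as np

gamma = 0.95
P = np.array([[0.2, 0.8],    # made-up transition matrix
              [0.5, 0.5]])
r = np.array([0.0, 1.0])

v_direct = np.linalg.solve(np.eye(2) - gamma * P, r)

# Partial sums of the Neumann series sum_k gamma^k P^k r.
v_series = np.zeros(2)
term = r.copy()
for _ in range(1000):
    v_series += term
    term = gamma * (P @ term)

print(np.max(np.abs(v_series - v_direct)))   # truncation error is tiny
```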

Matrix splitting methods:

- Iterative Methods by Martin H. Gutknecht
- Matrix iterative analysis by Richard Varga
`@book{varga1962, title = {Matrix iterative analysis}, author = {Richard S. Varga}, lccn = {62021277}, year = {1962}, publisher = {Prentice-Hall}, address = {Englewood Cliffs} }`
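As a small illustration of matrix splitting for policy evaluation (all numbers below are made up), the sketch solves \((I - \gamma P)v = r\) with a Gauss-Seidel splitting \(A = M - N\), where \(M\) is the lower triangle of \(A\); since \(A\) is strictly diagonally dominant here, the iteration converges.

```python
import numpy as np

gamma = 0.9
P = np.array([[0.6, 0.4],
              [0.1, 0.9]])
r = np.array([1.0, 0.0])
A = np.eye(2) - gamma * P            # policy-evaluation system A v = r

# Gauss-Seidel splitting A = M - N with M the lower triangle of A.
M = np.tril(A)
N = M - A

v = np.zeros(2)
for _ in range(200):
    v = np.linalg.solve(M, r + N @ v)   # v_{k+1} = M^{-1} (r + N v_k)

print(np.max(np.abs(A @ v - r)))        # residual is at machine precision
```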

Primal and Dual LP formulations: section 6.9.1 in Puterman (1994)

- Finite State Markovian Decision Processes by Cyrus Derman
`@book{Derman1970, author = {Derman, Cyrus}, title = {Finite State Markovian Decision Processes}, year = {1970}, isbn = {0122092503}, publisher = {Academic Press, Inc.}, address = {Orlando, FL, USA}, }`

- Markov Decision Processes by Lodewijk Kallenberg
- Constrained Markov Decision Processes by Eitan Altman
`@book{altman1999, author = {Altman, E.}, publisher = {Chapman and Hall}, title = {Constrained Markov Decision Processes}, year = 1999, pages = 256 }`

Occupation measures, induced randomized stationary policy: theorem 6.9.1 in Puterman (1994)
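A numerical sanity check of the identity behind this theorem (with made-up numbers): for a fixed policy \(\pi\), the discounted occupation measure \(\nu^\top = \alpha^\top (I - \gamma P_\pi)^{-1}\) satisfies \(\nu^\top r_\pi = \alpha^\top v_\pi\).

```python
import numpy as np

gamma = 0.9
P_pi = np.array([[0.3, 0.7],   # made-up transition matrix under pi
                 [0.8, 0.2]])
r_pi = np.array([1.0, -0.5])
alpha = np.array([0.6, 0.4])   # initial-state distribution

# Value function and (unnormalized) discounted occupation measure.
v_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
nu = np.linalg.solve((np.eye(2) - gamma * P_pi).T, alpha)

print(nu @ r_pi, alpha @ v_pi)   # the two sides coincide
```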

Characterization of basic feasible solutions as Markov Deterministic decision rules: proposition 6.9.3 in Puterman (1994)
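The primal LP can be solved directly with an off-the-shelf solver. The sketch below assumes SciPy is available and uses a made-up 2-state, 2-action MDP: minimize \(\alpha^\top v\) subject to \(v \ge r_a + \gamma P_a v\) for every action, then cross-check against value iteration.

```python
import numpy as np
from scipy.optimize import linprog

gamma = 0.9
P = np.array([[[0.8, 0.2], [0.1, 0.9]],    # P[a, s, s'], made up
              [[0.5, 0.5], [0.6, 0.4]]])
r = np.array([[1.0, 0.0],                  # r[a, s], made up
              [0.5, 0.8]])
alpha = np.array([0.5, 0.5])               # initial-state distribution

# Primal LP: min alpha^T v  s.t.  (gamma P_a - I) v <= -r_a for all a.
A_ub = np.vstack([gamma * P[a] - np.eye(2) for a in range(2)])
b_ub = np.concatenate([-r[a] for a in range(2)])
res = linprog(c=alpha, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * 2)
v_lp = res.x

# Cross-check with value iteration.
v = np.zeros(2)
for _ in range(2000):
    v = np.max(r + gamma * np.einsum('ast,t->as', P, v), axis=0)

print(v_lp, v)   # both give the optimal value function
```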

Bias of the MRP in the average reward formulation: section 8.2.1 in Puterman (1994)

Laurent series expansion in the average reward formulation: section 8.2.2 in Puterman (1994)

Representation of the total expected discounted reward in terms of the interest rate

Corollary 8.2.4: decomposition of the total expected discounted reward of a policy into the gain and bias terms

Corollary 8.2.5: gain as the limit of discount factor going to 1 in the total expected discounted reward
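In symbols (following Puterman's notation for a fixed policy \(\pi\)), the two corollaries combine into the truncated Laurent expansion

\[
v_\gamma^\pi = \frac{g^\pi}{1-\gamma} + h^\pi + f(\gamma),
\qquad
g^\pi = \lim_{\gamma \uparrow 1} (1-\gamma)\, v_\gamma^\pi,
\]

where \(g^\pi\) is the gain, \(h^\pi\) the bias, and \(f(\gamma) \to 0\) as \(\gamma \uparrow 1\).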

Bellman optimality equations in the total expected discounted reward criterion

Equivalence between maximizing over the set of deterministic Markovian decision rules and over the randomized Markovian decision rules: proposition 6.2.1 in Puterman (1994)

Contraction mappings: section 6.2.3 in Puterman (1994)

Banach fixed-point theorem: theorem 6.2.3 in Puterman (1994)

- Contraction mapping theorem: theorem 6.2.3 of Puterman (1994)
- The Bellman optimality operator is a contraction: proposition 6.2.4
- Conserving decision rules: section 6.2.4
- Existence of stationary policies: theorem 6.2.7
- Value iteration: section 6.3.2
- Termination criterion: theorem 6.3.1 c) and d)
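The items above fit in a few lines. A minimal sketch on a made-up 2-state, 2-action MDP, using the stopping rule \(\lVert v_{k+1}-v_k\rVert < \varepsilon(1-\gamma)/2\gamma\) from theorem 6.3.1, which guarantees the greedy policy is \(\varepsilon\)-optimal:

```python
import numpy as np

gamma, eps = 0.9, 1e-6
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[a, s, s'], made up
              [[0.4, 0.6], [0.7, 0.3]]])
r = np.array([[0.1, 1.0],                 # r[a, s], made up
              [0.6, 0.2]])

v = np.zeros(2)
while True:
    Tv = np.max(r + gamma * np.einsum('ast,t->as', P, v), axis=0)
    if np.max(np.abs(Tv - v)) < eps * (1 - gamma) / (2 * gamma):
        v = Tv
        break                             # theorem 6.3.1 stopping rule
    v = Tv

greedy = np.argmax(r + gamma * np.einsum('ast,t->as', P, v), axis=0)
print(v, greedy)   # eps-optimal values and the greedy (conserving) policy
```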

Policy Iteration: section 6.4

Monotonicity of the value iterates: proposition 6.4.1

Convergence of policy iteration: theorem 6.4.2

PI as a constructive proof for the existence of a solution to the Bellman optimality equations
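A minimal sketch of policy iteration on a made-up 2-state, 2-action MDP: exact evaluation by a linear solve, greedy improvement, and termination when the policy stops changing. At convergence, the evaluated value function solves the Bellman optimality equations, which is the constructive argument.

```python
import numpy as np

gamma = 0.95
P = np.array([[[0.75, 0.25], [0.3, 0.7]],    # P[a, s, s'], made up
              [[0.2, 0.8], [0.95, 0.05]]])
r = np.array([[1.0, 0.5],                    # r[a, s], made up
              [0.0, 2.0]])
idx = np.arange(2)

pi = np.zeros(2, dtype=int)                  # initial deterministic policy
while True:
    # Exact policy evaluation: v = (I - gamma P_pi)^{-1} r_pi.
    v = np.linalg.solve(np.eye(2) - gamma * P[pi, idx], r[pi, idx])
    # Greedy improvement.
    q = r + gamma * np.einsum('ast,t->as', P, v)
    pi_new = np.argmax(q, axis=0)
    if np.array_equal(pi_new, pi):
        break                                # policy is stable: optimal
    pi = pi_new

print(pi, v)
```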

Newton-Kantorovich Iterations: section 6.4.3

Gâteaux derivatives

Policy iteration as Newton-Kantorovich: proposition 6.4.4

See Functional Analysis 2nd edition by Kantorovich and Akilov (1982)

`@book{KantorovichAkilov1982, title={Functional Analysis}, author={Kantorovich, L.V. and Akilov, G.P.}, isbn={9780080230368}, lccn={80041734}, year={1982}, publisher={Pergamon Press} }`

See On the Convergence of Policy Iteration in Stationary Dynamic Programming by Puterman and Brumelle (1979)

`@article{PutermanBrumelle1979, author = {Martin L. Puterman and Shelby L. Brumelle}, title = {On the Convergence of Policy Iteration in Stationary Dynamic Programming}, journal = {Mathematics of Operations Research}, volume = {4}, number = {1}, pages = {60-69}, year = {1979} }`

See Numerical dynamic programming in economics by John Rust (1996)

`@incollection{Rust1996, doi = {10.1016/s1574-0021(96)01016-7}, url = {https://doi.org/10.1016/s1574-0021(96)01016-7}, year = {1996}, publisher = {Elsevier}, pages = {619--729}, author = {John Rust}, title = {Chapter 14 Numerical dynamic programming in economics}, booktitle = {Handbook of Computational Economics} }`

Modified policy iteration: section 6.5

- Generalized Bellman optimality operator
- See Stopping times and Markov programming by Van Nunen 1976
`@techreport{vannunen1976, author = "J.A.E.E. van Nunen and J. Wessels", title = "Stopping times and Markov programming", institution = "Technische Hogeschool Eindhoven", address = "Eindhoven", number = "76-22", year = "1976" }`

- See The generation of successive approximation methods for Markov decision processes by using stopping times by Van Nunen and Wessels (1976)
`@techreport{vannunentechreport1976, author = {van Nunen, J. A. E. E. and Wessels, J.}, year = {1976}, title = {The generation of successive approximation methods for Markov decision processes by using stopping times}, institution = {Technische Hogeschool Eindhoven}, address = {Eindhoven}, number = {Memorandum COSOR; Vol. 7622} }`
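A sketch of modified policy iteration on a made-up MDP: m applications of the policy's Bellman operator replace the exact evaluation solve; m = 1 recovers value iteration, and large m approaches policy iteration.

```python
import numpy as np

gamma, m = 0.9, 5
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[a, s, s'], made up
              [[0.4, 0.6], [0.7, 0.3]]])
r = np.array([[0.1, 1.0],                 # r[a, s], made up
              [0.6, 0.2]])
idx = np.arange(2)

v = np.zeros(2)                           # here v0 <= T v0, so MPI converges
for _ in range(500):
    q = r + gamma * np.einsum('ast,t->as', P, v)
    pi = np.argmax(q, axis=0)             # greedy improvement
    for _ in range(m):                    # m partial evaluation sweeps
        v = r[pi, idx] + gamma * P[pi, idx] @ v

print(v, pi)
```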

- Derivation of the smooth Bellman optimality equations from the point of view of a noisy reward function where the noise process comes from a Gumbel distribution
- Social surplus function
- Williams-Daly-Zachary theorem
- See McFadden's Econometric Models of Probabilistic Choice (1981)
- See Rust (1996) and:
`@article{Rust1988, doi = {10.1137/0326056}, url = {https://doi.org/10.1137/0326056}, year = {1988}, month = sep, publisher = {Society for Industrial {\&} Applied Mathematics ({SIAM})}, volume = {26}, number = {5}, pages = {1006--1024}, author = {John Rust}, title = {Maximum Likelihood Estimation of Discrete Control Processes}, journal = {{SIAM} Journal on Control and Optimization} }`
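The Gumbel derivation rests on the identity \(\mathbb{E}[\max_a (q_a + \epsilon_a)] = \log \sum_a e^{q_a} + \gamma_E\) for i.i.d. standard Gumbel noise \(\epsilon_a\) (with Euler's constant \(\gamma_E\)); the right-hand side is the social surplus function of the logit model. A quick Monte Carlo check with made-up utilities:

```python
import numpy as np

rng = np.random.default_rng(0)
q = np.array([1.0, 0.5, -0.2])     # made-up choice-specific values

# Monte Carlo estimate of E[max_a (q_a + eps_a)], eps_a ~ Gumbel(0, 1).
eps = rng.gumbel(size=(1_000_000, 3))
mc = np.max(q + eps, axis=1).mean()

# Social surplus: logsumexp(q) shifted by the Euler-Mascheroni constant.
surplus = np.log(np.sum(np.exp(q))) + np.euler_gamma

print(mc, surplus)   # agree up to Monte Carlo error
```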

- Introduction to the projected Bellman equations, the Galerkin method, and the variational formulation

- Lemma 6.3.1 on the property of the weighted Euclidean norm taken under a stationary distribution
- Proposition 6.3.1 using Lemma 6.3.1 to show that the projected policy evaluation operator is a \(\gamma\)-contraction and proof of the bound on the error between the fixed-point solution and the true value function
- Use of the Pythagorean theorem (footnote on page 425, section 6.3.1)
- Matrix form of the projected equations (section 6.3.2)

- Projection by Monte Carlo simulation, section 6.1.4 of Bertsekas
- Derivation of the closed-form solution of the projection under the weighted Euclidean norm
- Monte Carlo estimates of the closed-form solution (see page 406 of Bertsekas)
- Simulation-based version of the direct solution to the projected policy evaluation equations: LSTD(0)
- Section 6.3.3 of Bertsekas
- Section 6.3.4 of Bertsekas
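A simulation-based LSTD(0) sketch on a made-up 3-state chain with two features: accumulate \(\hat{A} = \sum_t \phi(s_t)(\phi(s_t) - \gamma\phi(s_{t+1}))^\top\) and \(\hat{b} = \sum_t \phi(s_t) r(s_t)\) along a single trajectory, then solve \(\hat{A}\theta = \hat{b}\).

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9
P = np.array([[0.5, 0.5, 0.0],     # made-up Markov reward process
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])
r = np.array([0.0, 1.0, 2.0])
Phi = np.array([[1.0, 0.0],        # feature matrix, one row per state
                [1.0, 1.0],
                [1.0, 2.0]])

# Sample one long trajectory via inverse-CDF sampling.
n = 200_000
cdf = np.cumsum(P, axis=1)
cdf[:, -1] = 1.0                   # guard against floating-point round-off
u = rng.random(n)
states = np.empty(n + 1, dtype=int)
states[0] = 0
for t in range(n):
    states[t + 1] = np.searchsorted(cdf[states[t]], u[t])

# LSTD(0): solve A_hat theta = b_hat from the sampled transitions.
S, S2 = states[:-1], states[1:]
A_hat = Phi[S].T @ (Phi[S] - gamma * Phi[S2])
b_hat = Phi[S].T @ r[S]
theta = np.linalg.solve(A_hat, b_hat)

print(Phi @ theta)   # linear approximation of the value function
```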

- Assignment 1 released
- Perron-Frobenius theorem and numerical methods for computing the stationary distribution
- Projected Value Iteration algorithm, see section 6.3.2 of Bertsekas
- Simulation-based counterpart to PVI: LSPE(0), see section 6.3.4 of Bertsekas
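One of the numerical methods mentioned above is simple power iteration: by Perron-Frobenius, an irreducible aperiodic stochastic matrix has a unique stationary distribution, and repeatedly multiplying any initial distribution by \(P\) converges to it. A sketch on a made-up chain:

```python
import numpy as np

P = np.array([[0.5, 0.25, 0.25],   # made-up irreducible, aperiodic chain
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])

mu = np.array([1.0, 0.0, 0.0])     # any initial distribution works
for _ in range(1000):
    mu = mu @ P                    # power iteration on the left eigenvector

print(mu)   # stationary: mu P = mu, entries positive, sums to one
```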

- Oblique Projections
- Projected Bellman equations with matrix splitting methods
- Simulation-based counterpart: LSTD(\(\lambda\))

- Valentine's Day
- See [Wikipedia](https://en.wikipedia.org/wiki/Valentine%27s_Day)
- ODE Method and Stochastic Approximation
- Fitted value methods:
- Non-linear case
- Counterexample
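The ODE method links stochastic approximation iterates \(x_{k+1} = x_k + a_k (h(x_k) + \text{noise})\) to the flow of \(\dot{x} = h(x)\). A minimal Robbins-Monro sketch with \(h(x) = \mu - x\), so that both the ODE and the iterates converge to the mean \(\mu\) of the noisy observations:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 3.0                    # unknown mean to be estimated
x = 0.0
for k in range(100_000):
    sample = mu + rng.standard_normal()   # noisy observation of the target
    x += (sample - x) / (k + 1)           # Robbins-Monro step, a_k = 1/(k+1)

print(x)   # with these step sizes, x is exactly the running sample mean
```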

Policy gradients: occupation measures, discounted objective, implicit differentiation and derivation in the infinite horizon case

Policy gradients: derivative estimation, likelihood ratio methods (REINFORCE), reparametrization (IPA), baselines (control variates), actor-critic systems
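The likelihood-ratio estimator and the control-variate role of a baseline can be illustrated on a one-step problem (a made-up 3-arm bandit with a softmax policy): \(\mathbb{E}[(r(A) - b)\,\nabla_\theta \log \pi_\theta(A)]\) equals the exact gradient of the expected reward for any constant baseline \(b\).

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.2, -0.1, 0.4])   # made-up policy parameters
r = np.array([1.0, 2.0, 0.5])        # made-up per-arm rewards

pi = np.exp(theta) / np.sum(np.exp(theta))
grad_true = pi * (r - pi @ r)        # exact gradient of sum_a pi(a) r(a)

# REINFORCE estimate with a constant baseline (the mean reward).
n = 500_000
a = rng.choice(3, p=pi, size=n)
score = np.eye(3)[a] - pi            # grad log pi(a) for a softmax policy
baseline = pi @ r
grad_mc = np.mean((r[a] - baseline)[:, None] * score, axis=0)

print(grad_true, grad_mc)   # agree up to Monte Carlo error
```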

Spring break

Policy gradients: application for learning temporal abstractions, the option-critic architecture, hierarchical and goal-conditioned RL

Policy gradients: Linear-Quadratic Regulator, Lagrangian formulation, MPC, Monte Carlo Tree Search
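The LQR piece can be sketched with a backward Riccati recursion (system matrices below are made up): minimize \(\sum_t x_t^\top Q x_t + u_t^\top R u_t\) plus a terminal cost for the linear dynamics \(x_{t+1} = A x_t + B u_t\). The rollout cost of the resulting feedback policy matches \(x_0^\top S_0 x_0\), the dynamic-programming value.

```python
import numpy as np

A = np.array([[1.0, 0.1],    # made-up discretized double integrator
              [0.0, 1.0]])
B = np.array([[0.0],
              [0.1]])
Q, R, T = np.eye(2), np.array([[0.1]]), 50

# Backward Riccati recursion for the finite-horizon gains.
S = Q                                   # terminal cost-to-go matrix
gains = []
for _ in range(T):
    K = np.linalg.solve(R + B.T @ S @ B, B.T @ S @ A)   # u_t = -K_t x_t
    S = Q + A.T @ S @ (A - B @ K)
    gains.append(K)
gains.reverse()                         # gains[t] is the gain at time t

# Closed-loop rollout; its cost equals x0^T S_0 x0.
x0 = np.array([1.0, 0.0])
x, cost = x0.copy(), 0.0
for t in range(T):
    u = -gains[t] @ x
    cost += x @ Q @ x + u @ R @ u
    x = A @ x + B @ u
cost += x @ Q @ x                       # terminal cost

print(cost, x0 @ S @ x0)   # identical up to floating-point error
```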

Automatic differentiation as discrete-time optimal control

Formulation of inverse RL and meta-RL as bilevel optimization.

Methods (contd.): KKT "trick", forward, reverse, implicit, competitive. Case studies

Challenges and opportunities

Final project presentations

Academic life can sometimes be overwhelming. Don't hesitate to find support:

- UdeM: Centre de santé et de consultation psychologique
- McGill: Student wellness hub
- Polytechnique: Soutien à la réussite
- HEC: Psychological support and resources